import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#Titanic dataset provided by Udacity
FileName = 'titanic_data.csv'
fileNameTitanic = '/Users/Preston/Programing/Data Science/Udacity Nano Degree/project 2/' + FileName
titanicStats = pd.read_csv(fileNameTitanic)
The original data set (a subset of which is shown below) includes information on 891 passengers. The factors that I will focus on are gender, passenger class and age.
titanicStats.tail()
The following code shows that, while gender and passenger class were recorded for every passenger in the data set, age was not available for 177 passengers.
# Find all of the null fields in the data set.
titanicStats.isnull().sum()
I have created several functions to simplify my analysis. The first takes the original data set and groups it by one of the selected factors (column headings). The other two calculate chi squared and Cramer's V.
def new_dataframe(old_frame, grouping_column):
    '''Function to create a dataframe for the relevant subcategory.
    @param old_frame the primary dataframe.
    @param grouping_column the column name (or list of column names) to group by.
    @return a dataframe with columns Survived, Perished, Total and Survival_Rate.
    '''
    # Group the frame by the selected column and pull only the survival numbers.
    newFrame = old_frame.groupby(grouping_column)['Survived'].agg(['sum', 'count'])
    # Change the column names.
    newFrame.columns = ['Survived', 'Total']
    # Add columns for the number perished and the survival rate.
    newFrame['Perished'] = newFrame['Total'] - newFrame['Survived']
    newFrame['Survival_Rate'] = newFrame['Survived'] / newFrame['Total']
    # Reorder the columns.
    newFrame = newFrame[['Survived', 'Perished', 'Total', 'Survival_Rate']]
    return newFrame
def chi_squared_survival(observed_dict):
    '''Chi squared function
    @param observed_dict dataframe of observed values, as returned by new_dataframe:
    number of survivors in the first column and number perished in the second column.
    The index should hold the categories of the grouping variable.
    @return chi squared value of the survival rate for the dataframe.
    Note: relies on the global total_survival_odds, which must be defined before calling.
    '''
    # The running total for the chi squared statistic.
    chiSquared = 0
    for i in range(len(observed_dict)):
        obs_survived = observed_dict.iloc[i, 0]
        obs_perished = observed_dict.iloc[i, 1]
        totalPeople = obs_survived + obs_perished
        # The numbers expected to survive and perish at the overall survival rate.
        expect_survived = totalPeople * total_survival_odds
        expect_perished = totalPeople * (1 - total_survival_odds)
        chiSquared += ((obs_survived - expect_survived)**2) / expect_survived
        chiSquared += ((obs_perished - expect_perished)**2) / expect_perished
    return chiSquared
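import math
def cramerV(chiScore, totalObserved, k):
    '''Cramer's V function
    @param chiScore the chi squared score.
    @param totalObserved total samples observed.
    @param k the smaller of the number of rows and columns in the contingency table
    (2 for every table in this analysis, since the outcome is survived or perished).
    @return Cramer's V score
    '''
    return math.sqrt(chiScore / (totalObserved * (k - 1)))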
The overall survival rate for the 891 passengers in the data set was 38.38%. This average survival rate will be compared with the rates for several subgroups of passengers.
total_survival_odds = titanicStats.Survived.sum() / float(titanicStats.Survived.count())
total_survival_odds
The following code evaluates the survival rates for each gender.
Based on the 55.3 percentage point difference in survival rates between men (18.9%) and women (74.2%), it appears that gender was a large factor in determining the likelihood of survival.
# Create a dataframe for the gender survival statistics.
genderStats = new_dataframe(titanicStats, 'Sex')
genderStats
# Difference in survival rates.
genderStats.Survival_Rate['female'] - genderStats.Survival_Rate['male']
# Survival by gender bar chart.
genderStats.Survival_Rate.plot(kind='bar', title='Survival Rate')
To determine the statistical significance of the gender based survival rate, I utilized a Chi Squared test.
Null hypothesis: gender was not a significant factor in determining survival rates
Alternative hypothesis: gender was a statistically significant factor in determining survival rates
Alpha level of .05.
Based on the chi squared score of 263.05 and degrees of freedom of one, the p value is less than 0.0001 (per graphpad.com). As such the null hypothesis is rejected, and gender is considered a statistically significant factor in determining survival rate.
Additionally, based on the Cramer's V value of .54 (calculated below), gender can be considered to have a strong effect on survival rates.
# Calculate the Chi Squared score for survival rates based on gender.
chi_squared_gender = chi_squared_survival(genderStats)
chi_squared_gender
# Calculate Cramer's V for the survival rates based on gender.
cramerV(chi_squared_gender, genderStats['Total'].sum(), 2)
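As a rough sanity check on the two statistics, the sketch below recomputes them in plain Python without pandas. The function names `chi_squared_from_counts` and `cramers_v` are illustrative and not part of the notebook, and the counts used (233 of 314 women and 109 of 577 men surviving) are my assumption about the contents of the 891-row file; substitute the values from the genderStats table if they differ.

```python
import math

def chi_squared_from_counts(groups):
    """groups: list of (survived, perished) pairs, one pair per category."""
    total_survived = sum(s for s, _ in groups)
    total_people = sum(s + p for s, p in groups)
    overall_rate = total_survived / float(total_people)
    chi = 0.0
    for survived, perished in groups:
        n = survived + perished
        # Expected counts if this group had survived at the overall rate.
        expected_survived = n * overall_rate
        expected_perished = n * (1 - overall_rate)
        chi += (survived - expected_survived) ** 2 / expected_survived
        chi += (perished - expected_perished) ** 2 / expected_perished
    return chi

def cramers_v(chi, total_observed, k):
    """k is the smaller of the number of rows and columns in the table."""
    return math.sqrt(chi / (total_observed * (k - 1)))

# Assumed counts, NOT read from the CSV here: (survived, perished) for female, male.
gender_counts = [(233, 81), (109, 468)]
chi_gender = chi_squared_from_counts(gender_counts)
v_gender = cramers_v(chi_gender, 891, 2)
print(round(chi_gender, 2), round(v_gender, 2))
```

If the assumed counts match the data, the two printed values should agree with the chi squared score and Cramer's V quoted above.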
The following analysis compares the survival rate for each of the three ticket classes on board.
# Analyse the survival rates by passenger class.
classStats = new_dataframe(titanicStats, 'Pclass')
classStats
As the following two charts illustrate, there were more third class passengers than first and second class combined. However, the survival rate declined from first to second class and then further to third class.
# plot of the total number of passengers by passenger class.
classStats.Total.plot(kind='pie', title='Passenger Count by Class')
# Plot of survival odds by passenger class.
classStats.Survival_Rate.plot(kind='bar', title='Survival Rate')
To determine the statistical significance of the survival rates of each passenger class, I utilized a Chi Squared test.
Null hypothesis: passenger class was not a significant factor in determining survival rates
Alternative hypothesis: passenger class was a significant factor in determining survival rates
Alpha level of .05.
Based on the chi squared score of 102.89 and degrees of freedom of two, the p value is less than 0.0001 (per graphpad.com). As such the null hypothesis is rejected and passenger class is considered a statistically significant factor in determining survival.
Based on the Cramer's V value of .34 (calculated below), passenger class can be considered to have a medium effect on survival rates.
# Calculate Chi Squared for the survival rates based on passenger class.
chi_by_class = chi_squared_survival(classStats)
chi_by_class
# Calculate Cramer's V for survival rates by class.
cramerV(chi_by_class, classStats.Total.sum(), 2)
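The p values above were looked up on graphpad.com. For one and two degrees of freedom the upper tail of the chi squared distribution has a simple closed form, so the lookup can also be approximated with the standard library. This is a minimal sketch under that restriction; `chi2_p_value` is a hypothetical helper name, and it deliberately does not handle other degrees of freedom.

```python
import math

def chi2_p_value(chi, dof):
    """Upper-tail p value of the chi squared distribution, for dof 1 or 2 only."""
    if dof == 1:
        # With one degree of freedom the statistic is a squared standard normal.
        return math.erfc(math.sqrt(chi / 2.0))
    if dof == 2:
        # With two degrees of freedom the distribution is exponential with mean 2.
        return math.exp(-chi / 2.0)
    raise ValueError('this sketch only handles 1 or 2 degrees of freedom')

# Scores quoted in the text: gender (1 degree of freedom), class (2 degrees of freedom).
p_gender = chi2_p_value(263.05, 1)
p_class = chi2_p_value(102.89, 2)
print(p_gender < 0.0001, p_class < 0.0001)
```

For tables with more degrees of freedom, a statistics library would be the more sensible choice than extending this helper by hand.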
The following analyzes the survival rates by age group of the passengers.
For the 714 passengers with available age data, the mean age is 29.7 years, with a median of 28, a minimum of 0.42 (5 months) and a maximum of 80.
# General description of the age data and the median value.
titanicStats.Age.describe(), titanicStats.Age.median()
To analyze the survival rates based on age, I separated the passengers into categories consisting of ages by decade.
The function below returns the decade of the passenger's age; this is used to add the respective decade category to the data frame. I then created a new data frame slice and removed passengers with null age values.
def ageDecade(age):
    '''Function to determine the age of the passengers by decade.
    @param age of passenger
    @return decade of passenger's age as a string.
    '''
    # Missing ages (NaN) fail every comparison and fall through to '70+';
    # those rows are removed afterwards with dropna.
    if age < 10:
        return '0-9'
    elif age < 20:
        return '10-19'
    elif age < 30:
        return '20-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'
# Add the ageDecade column to the dataframe.
titanicStats['ageDecade'] = titanicStats['Age'].map(ageDecade)
# Create a slice of the dataframe and remove anything with a null value in the age column.
ageDf = titanicStats.dropna(subset=['Age'])
ageDf.tail()
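The decade labelling can also be written arithmetically, which makes the treatment of missing ages explicit instead of relying on the later dropna step. The sketch below is an alternative, not the function used in this notebook; `age_decade_label` is a hypothetical name, and returning None for a missing age is my own choice.

```python
import math

def age_decade_label(age):
    """Return a decade label such as '20-29', '70+' for 70 and over, None if missing."""
    if age is None or math.isnan(age):
        return None
    # Floor the age to the start of its decade, e.g. 28.5 -> 20.
    lower = int(age // 10) * 10
    if lower >= 70:
        return '70+'
    return '{}-{}'.format(lower, lower + 9)

labels = [age_decade_label(a) for a in (0.42, 9.5, 15, 69, 80, float('nan'))]
print(labels)
```

Mapped over the Age column, this would leave missing ages unlabelled rather than placing them in the '70+' group.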
# Check the length of the new data frame to make sure that the 177 rows with null age fields were removed.
print('Original table length = {}'.format(len(titanicStats)))
print('New table length = {}'.format(len(ageDf)))
print('Difference = {}'.format(len(titanicStats) - len(ageDf)))
The majority of the age groups have survival rates that are close to the overall survival rate (38.4%), with the exceptions of the 0-9 age group and the 70+ age group, which are respectively higher and lower than the overall rate. However, there are too few samples in the 70+ group to draw any conclusions from (only 7 passengers). What remains is the appearance of a 'privilege of youth' as it relates to survival.
# Create the new dataframe grouped by the decade of the passengers' ages.
statsByDecade = new_dataframe(ageDf, 'ageDecade')
statsByDecade
# Create a bar chart of the survival rates grouped by decade of age.
statsByDecade.Survival_Rate.plot(kind='bar', title='Survival Rate')
To determine whether the seeming privilege of youth was significant, I first sought to find the age at which this privilege expired. To do this I created a new data frame with only the passengers who were under 20 years old and compared the survival rates in the table below with the overall survival rate (38.4%).
The privilege of youth appears to expire after the age of 15, as the survival rate drops from 80% at age 15 to 35% at 16 and stays relatively close to the 38% overall average from that point on.
youthDf = new_dataframe(ageDf[(ageDf.Age <= 19)], 'Age')
youthDf.tail(10)
# Create a bar graph of the survival rates based on age for passengers under 20.
youthDf.Survival_Rate.plot(kind='bar', title='Youth Survival Rates')
I then broke the passengers into two groups, those 15 and younger and those 16 and older.
def under_16(age):
    '''Function to test if age is below 16.
    @param age
    @return boolean.
    '''
    return age < 16
# Create a new field in the data frame 'Under_16'.
titanicStats['Under_16'] = titanicStats['Age'].map(under_16)
# Remove any passengers with null age data.
youthDf = titanicStats.dropna(subset=['Age'])
# Create a new data frame based on the Under_16 categories.
youth_vs_Df = new_dataframe(youthDf, 'Under_16')
youth_vs_Df
To determine the statistical significance of the difference in survival rates between passengers under 16 and those 16 and over, I utilized a Chi Squared test.
Null hypothesis: passenger age was not a significant factor in determining survival rates
Alternative hypothesis: passenger age was a statistically significant factor in determining survival rates
Alpha level of .05.
Based on the chi squared score of 14.98 and degrees of freedom of one, the p value is approximately 0.0001 (per graphpad.com), well below the alpha level of .05. As such the null hypothesis is rejected and passenger age is considered a statistically significant factor in determining survival.
Based on the Cramer's V value of .14 (calculated below), passenger age can be considered as having a small effect on survival rates.
# Calculate the Chi Squared value for age group survival rate, under 16 vs. 16 and up.
chi_youth_vs = chi_squared_survival(youth_vs_Df)
chi_youth_vs
# Calculate Cramer's V for survival by age group.
cramerV(chi_youth_vs, youth_vs_Df.Total.sum(), 2)
Combining the passenger variables reveals interesting interactions between the variables.
While it does appear that passengers under 16 in the first and second classes were favored, with only 1 out of 25 perishing, 25 passengers are too few to draw conclusions from.
In the third class, youth was a notable advantage for males, with a 32.1% survival rate for those under 16 compared to 12.9% for those older, and a smaller advantage for females, with a 53.3% survival rate for third class females under 16 compared to 43.1% for older females in third class.
For men 16 and older, the only way to have had a reasonable chance at survival was to have been in first class. Men of that age in first class had a 37.8% survival rate compared to 6.7% and 12.9% for those in second and third class, respectively.
For women 16 and older, class was a major factor in survival. Although women 16 and older in third class had a survival rate of 43.1%, which was well above the men's overall survival rate (18.9%), it was still less than half of the survival rate for women 16 and older in first and second class (97.6% and 90.6%, respectively).
'''Remove the null age data and create a data frame table with the three factors combined;
age under/over 16 years, gender, and passenger class.'''
ageOnlyDf = titanicStats.dropna(subset=['Age'])
combined_factors = new_dataframe(ageOnlyDf, ['Under_16', 'Sex', 'Pclass'])
combined_factors
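Grouping by several columns at once amounts to counting survivors and totals for each distinct combination of values. The plain Python sketch below illustrates that idea only; the records are hypothetical and are not taken from titanic_data.csv, and `survival_by_group` is an illustrative name rather than part of this notebook.

```python
from collections import defaultdict

# Hypothetical records, NOT real passenger data: (under_16, sex, pclass, survived).
records = [
    (True, 'male', 3, 1),
    (True, 'male', 3, 0),
    (False, 'female', 1, 1),
    (False, 'female', 1, 1),
    (False, 'male', 2, 0),
]

def survival_by_group(rows):
    """Map each (under_16, sex, pclass) key to (survived, perished, total, rate)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [survived, total]
    for under_16, sex, pclass, survived in rows:
        key = (under_16, sex, pclass)
        counts[key][0] += survived
        counts[key][1] += 1
    return {key: (s, t - s, t, s / float(t)) for key, (s, t) in counts.items()}

table = survival_by_group(records)
for key in sorted(table):
    print(key, table[key])
```

Each key plays the role of one row of the multi-level index in the combined table, which is why sparsely populated combinations produce rates based on very few passengers.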
For comparison with the above table, the previously calculated survival rates are displayed here:
Overall = 38.4%
Male = 18.9%
Female = 74.2%
1st class = 63.0%
2nd class = 47.3%
3rd class = 24.3%
Based on the analysis above, gender, age and passenger class all influenced the survival rate for passengers on the Titanic.
To have the best chance of survival, you would have wanted to be female and in first (or second) class. If you were male, you did not want to be in second or third class especially if you were over 15 years old.