Titanic Survival Analysis¶

The following analysis attempts to answer the question: which passenger characteristics influenced the likelihood of surviving the sinking of the Titanic?

The dataset was provided by Udacity.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Titanic dataset provided by Udacity 
FileName = 'titanic_data.csv'
fileNameTitanic = '/Users/Preston/Programing/Data Science/Udacity Nano Degree/project 2/' + FileName
titanicStats = pd.read_csv(fileNameTitanic)

Data Set¶

The original data set (a subset of which is shown below) includes information on 891 passengers. The factors that I will focus on are gender, passenger class and age.

titanicStats.tail()

Missing Values¶

The following code shows that, while data for the gender and passenger class were collected for each passenger in the data set, passenger age was not available for 177 passengers in the data set.

# Find all of the null fields in the data set. 
titanicStats.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
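As a minimal sketch of what the null count above does, the following uses a small made-up data frame (the values are illustrative and not taken from the Titanic data) to count the missing entries per column and to express them as a share of all rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 35.0, np.nan],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of rows missing per column
print(null_counts)
print(null_share)
```

Applied to the Titanic data, the same share for Age is 177 / 891, or roughly 20% of the passengers.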

Formulas for analysis¶

I have created several functions to simplify my analysis. The first takes the original data set and groups it by one of the selected factors (column headings). The other two calculate chi squared and Cramer's V.

'''Function to create a dataframe for the relevant subcategory.  
    @param old_frame the primary dataframe.
    @param grouping_column the column name (or list of column names) to group by. 
    @return a dataframe with the columns Survived, Perished, Total and Survival_Rate. 
'''
def new_dataframe(old_frame, grouping_column):
    # group the frame by the selected column and pull only the survival numbers. 
    newFrame = old_frame.groupby(grouping_column)['Survived'].agg(['sum', 'count'])
    # change the column names. 
    newFrame.columns = ['Survived', 'Total']
    # add columns for the number perished and the survival rate. 
    newFrame['Perished'] = newFrame['Total'] - newFrame['Survived']
    newFrame['Survival_Rate'] = newFrame['Survived'] / newFrame['Total']
    # reorder the columns. 
    newFrame = newFrame[['Survived', 'Perished', 'Total', 'Survival_Rate']] 
    
    return newFrame
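To illustrate the grouping step, here is a small sketch that applies the same groupby and aggregation to a made-up data frame. The five rows are hypothetical and only serve to show the shape of the result.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0],
})

# 'sum' counts the survivors (Survived is 0 or 1); 'count' counts all passengers.
grouped = toy.groupby('Sex')['Survived'].agg(['sum', 'count'])
grouped.columns = ['Survived', 'Total']
grouped['Perished'] = grouped['Total'] - grouped['Survived']
grouped['Survival_Rate'] = grouped['Survived'] / grouped['Total'].astype(float)
print(grouped)
```

Because Survived is coded as 0 or 1, summing it gives the number of survivors and counting it gives the group size, which is all the function needs.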

'''Chi squared function

    @param observed_dict data frame of observed values (as returned by new_dataframe),
    with the number of survivors in the first column and the number perished in the
    second column. The index should hold the categories of the grouping variable. 
    Expected counts are based on the global total_survival_odds, which must be
    defined before this function is called. 
    @return chi squared value of the survival rate for the data frame. 
'''
def chi_squared_survival(observed_dict):
    # the running total for the chi squared statistic
    chiSquared = 0
    
    for i in range(len(observed_dict)):
        obs_survived = observed_dict.iloc[i, 0]
        obs_perished = observed_dict.iloc[i, 1]
        totalPeople = obs_survived + obs_perished
        # the expected to survive
        expect_survived = totalPeople * total_survival_odds
        expect_perished = totalPeople * (1 - total_survival_odds)
        
        chiSquared += ((obs_survived - expect_survived)**2) / expect_survived
        chiSquared += ((obs_perished - expect_perished)**2) / expect_perished
        
    return chiSquared

'''Cramer's V function

    @param chiScore the chi squared score.
    @param totalObserved total samples observed.
    @param k the smaller of the number of rows and columns in the contingency table.
    @return Cramer's V score
'''
import math

def cramerV(chiScore, totalObserved, k):
    return math.sqrt(chiScore / (totalObserved * (k - 1)))
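Written out, the two statistics computed by the functions above are as follows. The symbols are my own shorthand rather than names used in the code: $S_i$ and $D_i$ are the observed numbers of passengers who survived and perished in group $i$, $n_i = S_i + D_i$ is the group size, $p$ is the overall survival rate (`total_survival_odds`), $N$ is `totalObserved` and $k$ is the third argument of `cramerV`.

$$
\chi^2 = \sum_{i} \left[ \frac{(S_i - n_i p)^2}{n_i p} + \frac{\left(D_i - n_i (1 - p)\right)^2}{n_i (1 - p)} \right],
\qquad
V = \sqrt{\frac{\chi^2}{N (k - 1)}}
$$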

Total Survival Odds¶

The overall survival rate for the 891 passengers in the data set was 38.38% (stored below as total_survival_odds). This overall rate will be compared with the survival rates of several subgroups of passengers.

total_survival_odds = titanicStats.Survived.sum() / float(titanicStats.Survived.count())
total_survival_odds

0.3838383838383838
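Strictly speaking, the value above is a survival rate (a probability) rather than odds. As a small sketch of the difference, using the survivor count of 342 implied by the gender table in this analysis (233 women plus 109 men):

```python
survived = 233 + 109          # survivors among women and men
total = 891                   # passengers in the data set

# Rate: survivors as a share of everyone.
rate = survived / float(total)
# Odds: survivors relative to those who perished.
odds = survived / float(total - survived)
print(rate, odds)
```

The rest of this analysis uses the rate, so the variable name total_survival_odds should be read as "overall survival rate".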

Survival By Gender¶

The following code evaluates the survival rates for each gender.

Based on the 55.3 percentage point difference in survival rates between men (18.9%) and women (74.2%), it appears that gender was a large factor in determining the likelihood of survival.

# Create a dataframe for the gender survival statistics. 
genderStats = new_dataframe(titanicStats, 'Sex')
genderStats

# Difference in survival rates. 
genderStats.Survival_Rate['female'] - genderStats.Survival_Rate['male']

0.55313007097992029

# Survival by gender bar chart.
genderStats.Survival_Rate.plot(kind='bar', title='Survival Rate')

Chi Squared by Gender¶

To determine the statistical significance of the gender based survival rate, I utilized a Chi Squared test.

Null hypothesis: gender was not a significant factor in determining survival rates
Alternative hypothesis: gender was a statistically significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 263.05 and degrees of freedom of one, the p value is less than 0.0001 (per graphpad.com). As such the null hypothesis is rejected, and gender is considered a statistically significant factor in determining survival rate.

Additionally, based on the Cramer's V value of .54 (calculated below), gender can be considered to have a strong effect on survival rates.

# Calculate the Chi Squared score for survival rates based on gender.  
chi_squared_gender = chi_squared_survival(genderStats)
chi_squared_gender

263.05057407065567

# Calculate Cramer's V for the survival rates based on gender. 
cramerV(chi_squared_gender, genderStats['Total'].sum(), 2)

0.5433513806577551
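The p value quoted above can also be checked without an online calculator. The following is a minimal sketch that relies on a standard identity: for a chi squared distribution with one degree of freedom, the upper tail probability equals erfc(sqrt(x / 2)). The variable names are illustrative.

```python
import math

chi_squared_gender = 263.05057407065567   # score calculated above

# For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_squared_gender / 2.0))
print(p_value)
print(p_value < 0.05)   # compare with the alpha level
```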

Survival Rates By Passenger Class¶

The following analysis compares the survival rate for each of the three ticket classes on board.

# Analyse the survival rates by passenger class.
classStats = new_dataframe(titanicStats, 'Pclass')
classStats

As the following two charts illustrate, there were more third class passengers than first and second class passengers combined. However, the survival rate declined from first to second class, then further to third class.

# plot of the total number of passengers by passenger class. 
classStats.Total.plot(kind='pie', title='Passenger Count by Class')

# Plot of survival odds by passenger class. 
classStats.Survival_Rate.plot(kind='bar', title='Survival Rate')

Chi Squared for Passenger Class¶

To determine the statistical significance of the survival rates of each passenger class, I utilized a Chi Squared test.

Null hypothesis: passenger class was not a significant factor in determining survival rates
Alternative hypothesis: passenger class was a significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 102.89 and degrees of freedom of two, the p value is less than 0.0001 (per graphpad.com). As such the null hypothesis is rejected and passenger class is considered a statistically significant factor in determining survival.

Based on the Cramer's V value of .34 (calculated below), passenger class can be considered to have a medium effect on survival rates.

# Calculate Chi Squared for the survival rates based on passenger class. 
chi_by_class = chi_squared_survival(classStats)
chi_by_class

102.88898875696057

# Calculate Cramer's V for survival rates by class.
cramerV(chi_by_class, classStats.Total.sum(), 2)

0.33981738800531175
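Two details of the class calculation are worth spelling out. The table has three rows (classes) and two columns (survived, perished), so the degrees of freedom are (3 - 1) * (2 - 1) = 2, while Cramer's V uses the smaller of the two dimensions, which is why 2 is passed as the last argument. The sketch below, with illustrative variable names, reproduces both and uses the identity that for two degrees of freedom the upper tail probability is exp(-x / 2).

```python
import math

chi_by_class = 102.88898875696057   # score calculated above
total_observed = 891
rows, columns = 3, 2                # three classes, two outcomes

degrees_of_freedom = (rows - 1) * (columns - 1)

# For two degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-chi_by_class / 2.0)

# Cramer's V divides by min(rows, columns) - 1, not by the degrees of freedom.
k = min(rows, columns)
cramers_v = math.sqrt(chi_by_class / (total_observed * (k - 1)))
print(degrees_of_freedom, p_value, cramers_v)
```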

Survival Rate by Age¶

The following analyzes the survival rates by age group of the passengers.

For the 714 passengers with available age data, the mean age is 29.7 years, with a median of 28, a minimum of 0.42 (about 5 months) and a maximum of 80.

# General description of the age data and the median value. 
titanicStats.Age.describe(), titanicStats.Age.median()

(count    714.000000
 mean      29.699118
 std       14.526497
 min        0.420000
 25%       20.125000
 50%       28.000000
 75%       38.000000
 max       80.000000
 Name: Age, dtype: float64, 28.0)

Age by Decade¶

To analyze the survival rates based on age, I separated the passengers into categories by decade of age.

The function below returns the decade of a passenger's age, which is used to add the respective decade category to the data frame. I then created a new data frame slice with the passengers with null age values removed.

'''Function to determine the age of the passengers by decade.  
    @param age of passenger
    @return decade of the passenger's age as a string, or NaN if the age is missing. 
'''

def ageDecade(age):
    # A missing age fails every comparison below and would otherwise be labeled '70+'.
    if pd.isnull(age):
        return np.nan
    elif age < 10:
        return '0-9'
    elif age < 20:
        return '10-19'
    elif age < 30: 
        return '20-29'
    elif age < 40: 
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age < 70:
        return '60-69'
    else:
        return '70+'

# Add the ageDecade column to the dataframe.
titanicStats['ageDecade'] = titanicStats['Age'].map(ageDecade)

# Create a slice of the dataframe and remove anything with a null value in the age column. 
ageDf = titanicStats.dropna(subset=['Age'])
ageDf.tail()
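An alternative to a hand-written binning function is pandas' `pd.cut`. The following is a sketch on a few made-up ages (not rows from the data set) that uses the same decade labels; with `pd.cut`, a missing age stays missing rather than landing in a bin.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only; NaN stands for a missing age.
ages = pd.Series([0.42, 9.5, 19.0, 80.0, np.nan])

bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

# right=False makes each bin include its lower edge and exclude its upper edge,
# e.g. [10, 20), so an age of exactly 10 falls in '10-19'.
decades = pd.cut(ages, bins=bin_edges, labels=labels, right=False)
print(decades)
```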

# Check the length of the new data frame to make sure that the 177 rows with null age fields were removed. 
print('Original table length = {}'.format(len(titanicStats)))
print('New table length = {}'.format(len(ageDf)))
print('Difference = {}'.format(len(titanicStats) - len(ageDf)))

Original table length = 891
New table length = 714
Difference = 177

Survival Rates by Age in Decades¶

The majority of the age groups have survival rates that are close to the overall survival rate (38.4%), with the exceptions of the 0-9 age group and the 70+ age group, which are respectively higher and lower than the overall rate. However, there are too few samples in the 70+ group to draw any conclusions from (only 7 passengers aged 70 or older). What remains is the appearance of a 'privilege of youth' as it relates to survival.

#Create the new dataframe grouped by the age by decades of the passengers.
statsByDecade = new_dataframe(ageDf, 'ageDecade')
statsByDecade

# Create a bar chart of the survival rates grouped by decade of age. 
statsByDecade.Survival_Rate.plot(kind='bar', title='Survival Rate')

Expiration Date of the Privilege of Youth¶

To determine whether the apparent privilege of youth was significant, I first sought to find the age at which this privilege expired. To do this I created a new data frame with only the passengers that were under 20 years old and compared the survival rates in the table below to the overall survival rate (38.4%).

The privilege of youth appears to expire after the age of 15, as the survival rate drops from 80% at age 15 to 35% at 16 and stays relatively close to the 38% overall average from that point on.

youthDf = new_dataframe(ageDf[(ageDf.Age <= 19)], 'Age')
youthDf.tail(10)

# Create a bar graph of the survival rates based on age for passengers under 20. 
youthDf.Survival_Rate.plot(kind='bar', title='Youth Survival Rates')

I then broke the passengers into two groups, those 15 and younger and those 16 and older.

'''Function to test if age is below 16.
    @param age
    @return boolean.
'''
def under_16(age):
    return age < 16

# Create a new field in the data frame 'Under_16'.
titanicStats['Under_16'] = titanicStats['Age'].map(under_16)

# Remove any passengers with null age data.  
youthDf = titanicStats.dropna(subset=['Age'])

# Create a new data frame based on the Under_16 categories.  
youth_vs_Df = new_dataframe(youthDf, 'Under_16')
youth_vs_Df

Chi Squared for Passenger Age Group¶

To determine the statistical significance of the difference in survival rates between passengers under 16 and passengers 16 and over, I utilized a Chi Squared test.

Null hypothesis: passenger age was not a significant factor in determining survival rates
Alternative hypothesis: passenger age was a statistically significant factor in determining survival rates
Alpha level of .05.

Based on the chi squared score of 14.98 and degrees of freedom of one, the p value is approximately 0.0001 (per graphpad.com), well below the alpha level of .05. As such the null hypothesis is rejected and passenger age is considered a statistically significant factor in determining survival.

Based on the Cramer's V value of .14 (calculated below), passenger age can be considered as having a small effect on survival rates.

# Calculate the Chi Squared value for age group survival rate, under 16 vs. 16 and up. 
chi_youth_vs = chi_squared_survival(youth_vs_Df)
chi_youth_vs

14.977970720971808

# Calculate Cramer's V for survival by age group.
cramerV(chi_youth_vs, youth_vs_Df.Total.sum(), 2)

0.1448362869911138
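One caveat about the age test: the chi squared helper measures each group against the overall survival rate of all 891 passengers, while this comparison only includes the 714 passengers with a known age. The following sketch recomputes the score against both baselines. The counts are summed from the combined age, gender and class table in this analysis, and the function names are illustrative rather than part of the original code.

```python
import math

# Survived / perished counts among the 714 passengers with a known age,
# summed from the combined age, gender and class table.
observed = {'under_16': (49, 34), '16_and_over': (241, 390)}

def chi_squared(observed, survival_rate):
    """Sum of (observed - expected)^2 / expected over every group and outcome."""
    total = 0.0
    for survived, perished in observed.values():
        group_size = survived + perished
        expect_survived = group_size * survival_rate
        expect_perished = group_size * (1 - survival_rate)
        total += (survived - expect_survived) ** 2 / expect_survived
        total += (perished - expect_perished) ** 2 / expect_perished
    return total

def p_value_one_dof(chi_sq):
    """Upper tail probability for a chi squared score with one degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))

# Baseline 1: all 891 passengers, as used by the helper in this analysis.
rate_all = 342 / 891.0
# Baseline 2: only the passengers with a known age.
all_survived = sum(s for s, _ in observed.values())
all_people = sum(s + p for s, p in observed.values())
rate_age_known = all_survived / float(all_people)

for label, rate in [('all 891', rate_all), ('714 with age', rate_age_known)]:
    score = chi_squared(observed, rate)
    print(label, round(score, 2), p_value_one_dof(score))
```

With either baseline the p value stays far below the alpha level of .05, so the conclusion does not depend on this choice.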

Combined Variables¶

Combining the passenger variables reveals interesting interactions between the variables.

Privilege of Youth vs Class vs Gender¶

Under 16¶

While it does appear that passengers under 16 in the first and second classes were favored, with only 1 out of 25 perishing, 25 passengers in those categories are too few to draw conclusions from.

In third class, youth was a notable advantage for males, with a 32.1% survival rate for those under 16 compared to 12.9% for those 16 and older, and a smaller advantage for females, with a 53.3% survival rate for those under 16 compared to 43.1% for those 16 and older.

16 and Over¶

For men 16 and older, the only way to have had a reasonable chance at survival was to have been in first class. Men in first class had a 37.8% survival rate compared to 6.7% and 12.9% for men 16 and older in second and third class, respectively.

For women 16 and older, class was a major factor in survival. Although women 16 and older in third class had a survival rate of 43.1%, which was well above the men's overall survival rate (18.9%), it was still less than half of the survival rate for women 16 and older in first and second class (97.6% and 90.6%, respectively).

'''Remove the null age data and create a data frame table with the three factors combined;
    age under/over 16 years, gender, and passenger class.'''

ageOnlyDf = titanicStats.dropna(subset=['Age'])
combined_factors = new_dataframe(ageOnlyDf, ['Under_16', 'Sex', 'Pclass'])

combined_factors
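The combined table is easier to compare across classes when each passenger class gets its own column. The following sketch rebuilds the table from the counts shown in this analysis and pivots it with `unstack`. The `Age_Group` column and its string labels are my own stand-ins for the boolean `Under_16` field.

```python
import pandas as pd

# Survived / perished counts copied from the combined table in this analysis.
rows = [
    ('16 and over', 'female', 1, 80, 2),
    ('16 and over', 'female', 2, 58, 6),
    ('16 and over', 'female', 3, 31, 41),
    ('16 and over', 'male', 1, 37, 61),
    ('16 and over', 'male', 2, 6, 84),
    ('16 and over', 'male', 3, 29, 196),
    ('under 16', 'female', 1, 2, 1),
    ('under 16', 'female', 2, 10, 0),
    ('under 16', 'female', 3, 16, 14),
    ('under 16', 'male', 1, 3, 0),
    ('under 16', 'male', 2, 9, 0),
    ('under 16', 'male', 3, 9, 19),
]
counts = pd.DataFrame(
    rows, columns=['Age_Group', 'Sex', 'Pclass', 'Survived', 'Perished'])
group_totals = (counts['Survived'] + counts['Perished']).astype(float)
counts['Survival_Rate'] = counts['Survived'] / group_totals

# One row per age group and gender, one column per passenger class.
indexed = counts.set_index(['Age_Group', 'Sex', 'Pclass'])
wide = indexed['Survival_Rate'].unstack('Pclass')
print(wide.round(3))
```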

Survival Rates:¶

For comparison with the above table, the previously calculated survival rates are displayed here:

Overall = 38.4%
Male = 18.9%
Female = 74.2%
1st class = 63.0%
2nd class = 47.3%
3rd class = 24.3%

Conclusion¶

Based on the analysis above, gender, age and passenger class all influenced the survival rate for passengers on the Titanic.

To have the best chance of survival, you would have wanted to be female and in first (or second) class. If you were male, you did not want to be in second or third class, especially if you were over 15 years old.

	Survived	Perished	Total	Survival_Rate
Pclass
1	136	80	216	0.629630
2	87	97	184	0.472826
3	119	372	491	0.242363

	Survived	Perished	Total	Survival_Rate
Age
11.0	1	3	4	0.250000
12.0	1	0	1	1.000000
13.0	2	0	2	1.000000
14.0	3	3	6	0.500000
14.5	0	1	1	0.000000
15.0	4	1	5	0.800000
16.0	6	11	17	0.352941
17.0	6	7	13	0.461538
18.0	9	17	26	0.346154
19.0	9	16	25	0.360000

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

	PassengerId	Survived	Pclass	Name	Sex	Age	Parch	Ticket	Fare	Cabin	Embarked	ageDecade
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	5	382652	29.125	NaN	Q	30-39
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	211536	13.000	NaN	S	20-29
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	112053	30.000	B42	S	10-19
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	111369	30.000	C148	C	20-29
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	370376	7.750	NaN	Q	30-39

	Survived	Perished	Total	Survival_Rate
ageDecade
0-9	38	24	62	0.612903
10-19	41	61	102	0.401961
20-29	77	143	220	0.350000
30-39	73	94	167	0.437126
40-49	34	55	89	0.382022
50-59	20	28	48	0.416667
60-69	6	13	19	0.315789
70+	1	6	7	0.142857

			Survived	Perished	Total	Survival_Rate
Under_16	Sex	Pclass
False	female	1	80	2	82	0.975610
		2	58	6	64	0.906250
		3	31	41	72	0.430556
	male	1	37	61	98	0.377551
		2	6	84	90	0.066667
		3	29	196	225	0.128889
True	female	1	2	1	3	0.666667
		2	10	0	10	1.000000
		3	16	14	30	0.533333
	male	1	3	0	3	1.000000
		2	9	0	9	1.000000
		3	9	19	28	0.321429

	Survived	Perished	Total	Survival_Rate
Sex
female	233	81	314	0.742038
male	109	468	577	0.188908