Predicting Titanic Survivors with Machine Learning (Detailed End-to-End Example)
March 23, 2018
On 15 April 1912, the Titanic collided with an iceberg and sank. The ship was carrying 2,224 people, and the tragic accident cost the lives of 1,502 passengers.
In this tutorial we will predict which passengers survived the accident and which did not, using features like age, sex, class, etc. This problem is hosted by Kaggle as a challenge and the data can be found here: Titanic: Machine Learning from Disaster. You will have to create a Kaggle account and accept the rules to download the data.
Project Template on Google Colaboratory
Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset linked below to Google Colab after saving the notebook to your own system.
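If you're working in Colab, one way to get train.csv and test.csv into the notebook environment is Colab's file-upload helper. This is a minimal sketch (you could also mount Google Drive instead):
# Run this in a Colab cell and select train.csv and test.csv from your machine;
# the uploaded files land in the notebook's working directory.
from google.colab import files
uploaded = files.upload()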
Data Description
We will start by understanding what our data looks like and getting an overview of it. For that we will use pandas.
import pandas as pd
import numpy as np
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data.head())
A few samples from our dataset
The features included in the dataset for this problem are described below:
- Survived: Whether or not the passenger survived (0 = No, 1 = Yes)
- Pclass: A proxy for socio-economic status. (1 = Upper class. 2 = Middle class. 3 = Lower class.)
- Sex: Sex of the passenger (male / female)
- Age: Age (in years)
- SibSp: Number of siblings or spouses on board (mistresses and fiancés were ignored). The dataset defines these relations as follows:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife
- Parch: Number of parents or children on board (some children travelled only with a nanny; Parch = 0 for them). The dataset defines these family relations as follows:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Let's check whether any of the features have missing values and get some general information about our data: the total number of entries and the data type of each feature.
print(train_data.info())
# produces the following output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
We can see from the output above that the Age and Cabin features have a lot of missing values. We can get more information about our data using the describe function in pandas.
print(train_data.describe())
Data Description for integer-valued features.
Everything above looks normal. Note that 38.3% of the passengers survived; the rest did not. The above approach doesn't provide a good description of categorical data, so we'll use the following method to summarize those features:
train_data.describe(include=['O'])
Data Description for categorical features.
As we mentioned earlier, Cabin is highly incomplete (only 204 values available out of 891), so we will drop that feature. Similarly, Ticket has too many unique values (681 unique values out of 891), so we will drop that feature too. In addition, we can safely assume that Name (and PassengerId) are not helpful features, so we can drop those as well. We do, however, keep the test set's PassengerId values aside, since we will need them later for the Kaggle submission file.
test_passenger_ids = test_data['PassengerId']  # kept aside for the submission file later
train_data = train_data.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
test_data = test_data.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
train_data.head() # check everything looks okay
Training data after dropping columns
Data Exploration and Data Visualization
Let's start by visualizing the number of people who survived and the number who did not.
# plotting libraries used for the figures below
import matplotlib.pyplot as plt
import seaborn as sns

train_data.Survived.value_counts().plot(kind='bar')
plt.title("Distribution of Survival, (1 = Survived)")
Barplot of Did Not Survive vs Survived
From the figure above, it is clear that fewer people survived than died: more than 500 people did not survive.
Next, let’s see how age and sex affected survival.
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Age', bins=20)
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Sex', bins=3)
Histograms of Age by Survived = {0, 1}
Histograms of Sex by Survived = {0, 1}
From the figures above, we can see that survival and age are related. Many people in the 25 to 30 age group died, while the death rate for people between ages 50 and 65 was lower than their survival rate. We can also see that many more male passengers died in comparison to female passengers.
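To back up the visual impression with numbers, a quick check (a small sketch, not part of the original walkthrough; Sex is still the raw string column at this point) is the survival rate by sex:
# Fraction of passengers who survived, per sex
print(train_data.groupby('Sex')['Survived'].mean())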
Next, let's look at how class affected the survival rates of passengers.
train_data.Pclass.value_counts().plot(kind="barh")
plt.title("Class Distribution")
We can clearly see that there were more people from class 3, so it is likely that more people from class 3 died in absolute terms. But is the death rate proportionally distributed across classes?
pclass_survived = train_data[train_data['Survived']==1]['Pclass'].value_counts()
pclass_dead = train_data[train_data['Survived']==0]['Pclass'].value_counts()
df = pd.DataFrame([pclass_survived,pclass_dead])
df.index = ['Survived','Dead']
df.plot(kind='bar', stacked=True, figsize=(10,8))
Stacked plot of Pclass vs Survived
From the figure above, we can infer that the death rate was not proportionally distributed among the classes: people from first and second class had a higher survival rate than those from third class.
Let’s do an analogous analysis based on the port from where people embarked.
train_data.Embarked.value_counts().plot(kind='bar')
plt.title("Passengers per boarding location")
survived = train_data[train_data['Survived']==1]['Embarked'].value_counts()
dead = train_data[train_data['Survived']==0]['Embarked'].value_counts()
df = pd.DataFrame([survived,dead])
df.index = ['Survived','Dead']
df.plot(kind='bar', stacked=False, figsize=(8,6))
plt.title("Survival and Death in Different ports")
We can see that survival was also affected by the port of embarkation.
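To quantify this, a quick sketch (Embarked is still the raw categorical column at this point) is the survival rate per port:
# Fraction of passengers who survived, per port of embarkation
print(train_data.groupby('Embarked')['Survived'].mean())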
Lastly, let's visualize the relation between the fare a passenger paid and their chance of survival. We plot the mean fare for the two classes of the Survived feature.
survived_0 = train_data[train_data['Survived'] == 0]["Fare"].mean()
survived_1 = train_data[train_data['Survived'] == 1]["Fare"].mean()
xs = [survived_0, survived_1]
ys = ['Dead','Survived']
plt.bar(ys, xs, 0.6, align='center',color = 'green')
plt.xlabel('Outcome')
plt.ylabel('Mean Fare')
plt.show()
We can observe that passengers who paid a higher fare had a better chance of survival.
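Since fares are heavily skewed by a few very expensive tickets, comparing medians as well is a useful sanity check (a small sketch, not part of the original analysis):
# Median fare for passengers who died (0) and survived (1)
print(train_data.groupby('Survived')['Fare'].median())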
Data Wrangling: Convert categorical variables to integers
Before moving on to correlation analysis, we'll convert the categorical features in our dataset to integer variables. This will make calculating correlations easier, and will also put the data into a format that is easier for machine learning algorithms to process.
We first convert the values of the Sex feature, mapping female to 1 and male to 0. We also need to convert the categories of Embarked into numerical data. We will split the Embarked column into three columns - Embarked_C, Embarked_Q, Embarked_S - whose values will be 1 or 0 depending on whether the passenger embarked at that port. Note that 2 values of Embarked are missing (889 values available while it should be 891); those rows simply get 0 in all three columns, which is acceptable for just 2 out of 891.
def wrangle(dataset):
    # Sex: {male, female} -> {0, 1}
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
    # Embarked: {S, C, Q} -> three binary indicator columns
    embarked_separate_port = pd.get_dummies(dataset['Embarked'], prefix='Embarked')
    dataset = pd.concat([dataset, embarked_separate_port], axis=1)
    return dataset.drop('Embarked', axis=1)
train_data = wrangle(train_data)
test_data = wrangle(test_data)
train_data.head()
Data after converting categorical features to integers
Data Exploration: Correlation Analysis
Now, we’re ready to do some data exploration. Let’s start by calculating correlations between every pair of features (and the survived variable).
corr = train_data.corr()
print(corr)
Correlations between features and the target variable
Data visualization helps us see the relations between our features clearly and make better decisions. For visualization we use two Python packages, seaborn and matplotlib, which we imported earlier.
We will visualize the correlations found above using a heatmap. In the heatmap we use the absolute values of the correlations; this makes variables with a correlation close to 0 appear dark, while anything strongly correlated (or anti-correlated) appears bright. The shade of the color gives the relative strength of the correlation.
sns.heatmap(np.abs(corr),  # use absolute values
            xticklabels=corr.columns,
            yticklabels=corr.columns)
Visualization of the correlations
Observe that Pclass, Age and Fare have noticeable correlation (or anti-correlation) with Survived. However, Pclass and Fare are quite correlated with each other, so the information content of the two variables may overlap. Also notice that Embarked_C and Embarked_S are important variables.
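A compact way to see this numerically (a quick sketch) is to rank the features by the absolute value of their correlation with Survived:
# Features ordered by the strength of their (anti-)correlation with the target
print(corr['Survived'].drop('Survived').abs().sort_values(ascending=False))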
Data Wrangling
Data wrangling is the process of cleaning and unifying messy, complex data sets for easier access and analysis. We've done some cleaning already.
Filling missing values
In this section, we'll fill in the missing values of Age. We will do this by computing the median age for each combination of sex and class. As we have two sexes (1, 0) and three classes (1, 2, 3), there are 6 combinations, and we compute a median age for each.
guess_ages = np.zeros((2, 3))
for i in range(0, 2):      # Sex: 0 = male, 1 = female
    for j in range(0, 3):  # Pclass: j + 1 = 1, 2, 3
        guess_data = train_data[(train_data['Sex'] == i) &
                                (train_data['Pclass'] == j + 1)]['Age'].dropna()
        age_guess = guess_data.median()
        # Round the median age to the nearest 0.5
        guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
def wrangle_age(dataset):
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) &
                        (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)
    return dataset
train_data = wrangle_age(train_data)
test_data = wrangle_age(test_data)
Creating New Features
The two features SibSp and Parch both refer to family members, so we will create a new feature called FamilySize by adding SibSp and Parch. We also add 1 so that a passenger travelling alone has a family size of 1.
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
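As a quick sanity check (a sketch, not part of the original tutorial), we can look at the survival rate for each family size:
# Survival rate per family size
print(train_data[['FamilySize', 'Survived']].groupby('FamilySize').mean())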
Final check before Machine Learning
We can now see the changes we made.
print(train_data.info())
# output is as follows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null int64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null float64
Embarked_Q    891 non-null float64
Embarked_S    891 non-null float64
FamilySize    891 non-null int64
dtypes: float64(4), int64(7)
memory usage: 76.7 KB
Similarly, getting information about our test data, we find that Fare is missing one value (417 non-null out of 418).
print(test_data.info())
# output is as follows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
Pclass        418 non-null int64
Sex           418 non-null int64
Age           418 non-null int64
SibSp         418 non-null int64
Parch         418 non-null int64
Fare          417 non-null float64
Embarked_C    418 non-null float64
Embarked_Q    418 non-null float64
Embarked_S    418 non-null float64
FamilySize    418 non-null int64
dtypes: float64(4), int64(6)
memory usage: 32.7 KB
So we will fill the missing Fare with the mean of the observed fares, which is roughly 32.
mean_fare = 32  # roughly the mean of the observed fares
test_data['Fare'] = test_data['Fare'].fillna(mean_fare)
Machine Learning Models: Training and Evaluation
We will now build models to make predictions from our data, using four classification algorithms: Logistic Regression, Support Vector Machine, K-Nearest Neighbors and Random Forest.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
Let's first split our training data into features (X_train) and labels (Y_train). We will use the first 800 rows (roughly 90% of the data) for training and the remaining 91 rows for validation.
X_train = train_data.drop("Survived", axis=1)[:800]
Y_train = train_data["Survived"][:800]
X_crossValidation = train_data.drop("Survived", axis=1)[800:]
Y_crossValidation = train_data["Survived"][800:]
X_test = test_data
Logistic Regression
Let's start with Logistic Regression: we fit it on our training features and labels, then calculate the training and validation accuracy.
model_logistic = LogisticRegression()
model_logistic.fit(X_train, Y_train)
train_accuracy = round(model_logistic.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(model_logistic.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predL = model_logistic.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 80.00; validation_accuracy = 82.42;
Support Vector Machines
Next, we will use a Support Vector Machine.
svc = SVC()
svc.fit(X_train, Y_train)
train_accuracy = round(svc.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(svc.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predS = svc.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 89.75; validation_accuracy = 78.02;
K Nearest Neighbors
We will now use the KNN algorithm.
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
train_accuracy = round(knn.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(knn.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predK = knn.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 82.50; validation_accuracy = 75.82;
Random Forest
Lastly, we will use a Random Forest classifier.
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
train_accuracy = round(random_forest.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(random_forest.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predR = random_forest.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 98.00; validation_accuracy = 89.01;
Our winner is Random Forest, as it has the highest validation accuracy.
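As a quick recap, we can tabulate the validation accuracies reported above (a small sketch using the numbers from this particular run; your exact values may differ):
# Validation accuracies from the runs above
results = pd.DataFrame({
    'Model': ['Random Forest', 'Logistic Regression', 'SVM', 'KNN'],
    'Validation accuracy': [89.01, 82.42, 78.02, 75.82]
})
print(results)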
Submission
As mentioned at the very beginning, this problem is hosted as a competition on Kaggle, so we will submit our predictions to see how well we do on the test data set. We will use the predictions made by the Random Forest, Y_predR, together with the passenger IDs we set aside earlier.
submission = pd.DataFrame({
    "PassengerId": test_passenger_ids,  # saved before dropping the column
    "Survived": Y_predR
})
submission.to_csv('submission.csv', index=False)
Result
Our prediction from the above model scored 0.76 and was ranked in the top 62%. Not bad, but not satisfactory either.
Bonus: Dealing with Overfitting
From this result we can conclude that our model is overfitting. Though it has 98% training accuracy, it only has 89.01% validation accuracy and even lower test accuracy (76%). This is a clear case of overfitting: our model is very successful at making correct predictions on the training data but fails to do so on the test data.
Hyper-parameter Tuning
One way to prevent a model from overfitting is to collect more data, but that is not a viable option in our case. We can also reduce overfitting and improve performance by tuning the hyper-parameters of our algorithm (Random Forest in our case). We will use the GridSearchCV class from scikit-learn to automate the process of choosing the best combination of hyper-parameters.
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
parameter_grid = {
    'max_depth': [4, 6, 8],
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'auto', 'log2'],
    'min_samples_split': [0.001, 0.003, 0.01],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [True, False],
}
forest = RandomForestClassifier()
cross_validation = StratifiedKFold(n_splits=5)  # model_selection's StratifiedKFold takes n_splits; the labels are supplied via GridSearchCV.fit
grid_search = GridSearchCV(forest,
                           scoring='accuracy',
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(X_train, Y_train)
model = grid_search
parameters = grid_search.best_params_
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
We get the following best parameter values from the grid we specified.
Best score: 0.835
Best parameters: {'bootstrap': True, 'min_samples_leaf': 1,
'n_estimators': 100, 'min_samples_split': 0.003,
'max_features': 'sqrt', 'max_depth': 8}
We will now train our Random Forest model with the tuned parameters. (Note that the exact best values can vary slightly between grid-search runs, which is why the dictionary below differs slightly from the output above.)
parameters = {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 100,
              'min_samples_split': 0.01, 'max_features': 'log2', 'max_depth': 8}
model = RandomForestClassifier(**parameters)
model.fit(X_train, Y_train)
train_accuracy = round(model.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(model.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_pred = model.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 90.88; validation_accuracy = 83.52;
Let’s submit our prediction as we did previously.
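Here is a minimal sketch of that resubmission, reusing the test_passenger_ids saved earlier and the tuned model's predictions Y_pred (the filename submission_tuned.csv is just an example):
submission = pd.DataFrame({
    "PassengerId": test_passenger_ids,
    "Survived": Y_pred
})
submission.to_csv('submission_tuned.csv', index=False)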
We see that we have improved our model: it now scores 79.425% accuracy on the test set, and our rank has improved significantly, placing us in the top 16% (hurrah!). From the train and test accuracies we can see that we have reduced the over-fitting problem, but it has not been completely eliminated.
Bonus: Further Improvement
As we discussed above, our model still has some problems and its accuracy can still be improved. Here are some tips that might help you improve your model (you can get started with the IPython notebook here: KaggleTitanicProject.ipynb):
Feature Engineering
We can make our model perform better by making changes to our features. Let's visualize how different features affect our model.
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = model.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh', figsize=(10, 5))
We can clearly see that Sex has the highest influence on our model. We can't change the values of Sex, but we also see that Fare and Age have significant influence. One thing we can do to improve our model is to engineer these features: representing their values in bands might improve performance. For example, we can represent Age values in bands like 0-5 years, 5-10 years, etc.
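Here is a minimal sketch of what that banding could look like with pd.cut; the band edges below are illustrative choices, not values from the original tutorial:
# Example age bands: 0-5, 5-12, 12-18, 18-35, 35-60, 60+
age_bins = [0, 5, 12, 18, 35, 60, 120]
train_data['AgeBand'] = pd.cut(train_data['Age'], bins=age_bins,
                               labels=False, include_lowest=True)
test_data['AgeBand'] = pd.cut(test_data['Age'], bins=age_bins,
                              labels=False, include_lowest=True)
# Survival rate per age band, to check whether the bands are informative
print(train_data[['AgeBand', 'Survived']].groupby('AgeBand').mean())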
Try Other Classifiers
We can also try other classifiers like Decision Trees, Naive Bayes, etc., or revisit the algorithms we have already used. For example, the Logistic Regression model performed almost as accurately as Random Forest on the test data (75.5% for Logistic Regression and 76.5% for Random Forest) even though its training and validation accuracies were lower. Tuning the hyper-parameters of Logistic Regression or other methods may give better results than what we achieved above.
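For instance, here is a quick sketch of trying two more classifiers with default settings, using the same X_train / X_crossValidation split as above:
# Try a couple of additional classifiers with their default settings
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

for clf in [DecisionTreeClassifier(), GaussianNB()]:
    clf.fit(X_train, Y_train)
    acc = round(clf.score(X_crossValidation, Y_crossValidation) * 100, 2)
    print(clf.__class__.__name__, acc)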
Solution on Google Colaboratory
The complete notebook with all the cells executed is available via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your drive.