Predicting Titanic Survivors with Machine Learning (Detailed End-to-End Example)
March 23, 2018
On 15 April 1912, the Titanic collided with an iceberg and sank. The ship was carrying 2,224 people, and the tragic accident cost the lives of 1,502 passengers.
In this tutorial we will predict which passengers survived the accident and which did not, using features like age, sex, class, etc. This problem is hosted by Kaggle as a challenge and the data can be found here: Titanic: Machine Learning from Disaster. You will have to create a Kaggle account and accept the rules to download the data.
Project Template on Google Colaboratory
Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset linked below to Google Colab after saving the notebook to your own system.
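If you're working in Colab, one way to get train.csv and test.csv into the notebook environment is Colab's file-upload helper. This is a minimal sketch (you could also mount Google Drive instead):
# Run this in a Colab cell and select train.csv and test.csv from your machine;
# the uploaded files land in the notebook's working directory.
from google.colab import files
uploaded = files.upload()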
Data Description
We will start by understanding what our data looks like and getting an overview of it. For that we will use pandas.
import pandas as pd
import numpy as np
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data.head())
A few samples from our dataset
The features included in the dataset for this problem are described below:
- Survived: Whether or not the passenger survived (0 = No, 1 = Yes)
- Pclass: A proxy for socio-economic status. (1 = Upper class. 2 = Middle class. 3 = Lower class.)
- Sex: Sex of the passenger (male / female)
- Age: Age (in years)
- SibSp: Number of siblings or spouses on board (mistresses and fiancés were ignored). The dataset defines these relations as follows:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife
- Parch: Number of parents or children on board (some children travelled only with a nanny; Parch = 0 for them). The dataset defines these family relations as follows:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Let's check whether any of the features have missing values and get some general information about our data: the total number of entries and the data type of each feature.
print(train_data.info())
# produces the following output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
We can see from the output above that the Age and Cabin features have a lot of missing values. We can get more information about our data using the describe function in pandas.
print(train_data.describe())
Data Description for integer-valued features.
Everything above looks normal. Note that 38.3% of the passengers survived; the rest did not. The above approach doesn't provide a good description of categorical data, so we'll use the following method to summarize those features:
train_data.describe(include=['O'])
Data Description for categorical features.
As we mentioned earlier, Cabin is highly incomplete (only 204 values available out of 891), so we will drop that feature. Similarly, Ticket has too many unique values (681 unique values out of 891), so we will drop that feature too. In addition, we can safely assume that Name (and PassengerId) are not helpful features, so we can drop those as well. We do, however, keep the test set's PassengerId values aside, since we will need them later for the Kaggle submission file.
test_passenger_ids = test_data['PassengerId']  # kept aside for the submission file later
train_data = train_data.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
test_data = test_data.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
train_data.head() # check everything looks okay
Training data after dropping columns
Data Exploration and Data Visualization
Let's start by visualizing the number of people who survived and the number who did not.
# plotting libraries used for the figures below
import matplotlib.pyplot as plt
import seaborn as sns

train_data.Survived.value_counts().plot(kind='bar')
plt.title("Distribution of Survival, (1 = Survived)")
Barplot of Did Not Survive vs Survived
From the figure above, it is clear that fewer people survived than died: more than 500 people did not survive.
Next, let’s see how age and sex affected survival.
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Age', bins=20)
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Sex', bins=3)
Histograms of Age by Survived = {0, 1}
Histograms of Sex by Survived = {0, 1}
From the figures above, we can see that survival and age are related. Many people in the 25 to 30 age group died, while the death rate for people between ages 50 and 65 was lower than their survival rate. We can also see that many more male passengers died in comparison to female passengers.
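To back up the visual impression with numbers, a quick check (a small sketch, not part of the original walkthrough; Sex is still the raw string column at this point) is the survival rate by sex:
# Fraction of passengers who survived, per sex
print(train_data.groupby('Sex')['Survived'].mean())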
Next, let's look at how class affected the survival rates of passengers.
train_data.Pclass.value_counts().plot(kind="barh")
plt.title("Class Distribution")
We can clearly see that there were more people from class 3, so it is likely that more people from class 3 died in absolute terms. But is the death rate proportionally distributed across classes?
pclass_survived = train_data[train_data['Survived']==1]['Pclass'].value_counts()
pclass_dead = train_data[train_data['Survived']==0]['Pclass'].value_counts()
df = pd.DataFrame([pclass_survived,pclass_dead])
df.index = ['Survived','Dead']
df.plot(kind='bar', stacked=True, figsize=(10,8))
Stacked plot of Pclass vs Survived
From the figure above, we can infer that the death rate was not proportionally distributed among the classes: people from first and second class had a higher survival rate than those from third class.
Let’s do an analogous analysis based on the port from where people embarked.
train_data.Embarked.value_counts().plot(kind='bar')
plt.title("Passengers per boarding location")
survived = train_data[train_data['Survived']==1]['Embarked'].value_counts()
dead = train_data[train_data['Survived']==0]['Embarked'].value_counts()
df = pd.DataFrame([survived,dead])
df.index = ['Survived','Dead']
df.plot(kind='bar', stacked=False, figsize=(8,6))
plt.title("Survival and Death in Different ports")
We can see that survival was also affected by the port of embarkation.
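To quantify this, a quick sketch (Embarked is still the raw categorical column at this point) is the survival rate per port:
# Fraction of passengers who survived, per port of embarkation
print(train_data.groupby('Embarked')['Survived'].mean())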
Lastly, let's visualize the relation between the fare a passenger paid and their chance of survival. We plot the mean fare for the two classes of the Survived feature.
survived_0 = train_data[train_data['Survived'] == 0]["Fare"].mean()
survived_1 = train_data[train_data['Survived'] == 1]["Fare"].mean()
xs = [survived_0, survived_1]
ys = ['Dead','Survived']
plt.bar(ys, xs, 0.6, align='center',color = 'green')
plt.xlabel('Outcome')
plt.ylabel('Mean Fare')
plt.show()
We can observe that passengers who paid a higher fare had a better chance of survival.
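Since fares are heavily skewed by a few very expensive tickets, comparing medians as well is a useful sanity check (a small sketch, not part of the original analysis):
# Median fare for passengers who died (0) and survived (1)
print(train_data.groupby('Survived')['Fare'].median())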
Data Wrangling: Convert categorical variables to integers
Before moving on to correlation analysis, we'll convert the categorical features in our dataset to integer variables. This will make calculating correlations easier, and will also put the data into a format that is easier for machine learning algorithms to process.
We first convert the values of the Sex feature, mapping female to 1 and male to 0. We also need to convert the categories of Embarked into numerical data. We will split the Embarked column into three columns - Embarked_C, Embarked_Q, Embarked_S - whose values will be 1 or 0 depending on whether the passenger embarked at that port. Note that 2 values of Embarked are missing (889 values available while it should be 891); those rows simply get 0 in all three columns, which is acceptable for just 2 out of 891.
def wrangle(dataset):
    # Sex: {male, female} -> {0, 1}
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)
    # Embarked: {S, C, Q} -> three binary indicator columns
    embarked_separate_port = pd.get_dummies(dataset['Embarked'], prefix='Embarked')
    dataset = pd.concat([dataset, embarked_separate_port], axis=1)
    return dataset.drop('Embarked', axis=1)
train_data = wrangle(train_data)
test_data = wrangle(test_data)
train_data.head()
Data after converting categorical features to integers
Data Exploration: Correlation Analysis
Now, we’re ready to do some data exploration. Let’s start by calculating correlations between every pair of features (and the survived variable).
corr = train_data.corr()
print(corr)
Correlations between features and the target variable
Data visualization helps us see the relations between our features clearly and make better decisions. For visualization we use two Python packages, seaborn and matplotlib, which we imported earlier.
We will visualize the correlations found above using a heatmap. In the heatmap we use the absolute values of the correlations; this makes variables with a correlation close to 0 appear dark, while anything strongly correlated (or anti-correlated) appears bright. The shade of the color gives the relative strength of the correlation.
sns.heatmap(np.abs(corr),  # use absolute values
            xticklabels=corr.columns,
            yticklabels=corr.columns)
Visualization of the correlations
Observe that Pclass, Age and Fare have noticeable correlation (or anti-correlation) with Survived. However, Pclass and Fare are quite correlated with each other, so the information content of the two variables may overlap. Also notice that Embarked_C and Embarked_S are important variables.
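A compact way to see this numerically (a quick sketch) is to rank the features by the absolute value of their correlation with Survived:
# Features ordered by the strength of their (anti-)correlation with the target
print(corr['Survived'].drop('Survived').abs().sort_values(ascending=False))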
Data Wrangling
Data wrangling is the process of cleaning and unifying messy, complex data sets for easier access and analysis. We've done some cleaning already.
Filling missing values
In this section, we'll fill in the missing values of Age. We will do this by computing the median age for each combination of sex and class. As we have two sexes (1, 0) and three classes (1, 2, 3), there are 6 combinations, and we compute a median age for each.
guess_ages = np.zeros((2, 3))
for i in range(0, 2):      # Sex: 0 = male, 1 = female
    for j in range(0, 3):  # Pclass: j + 1 = 1, 2, 3
        guess_data = train_data[(train_data['Sex'] == i) &
                                (train_data['Pclass'] == j + 1)]['Age'].dropna()
        age_guess = guess_data.median()
        # Round the median age to the nearest 0.5
        guess_ages[i, j] = int(age_guess / 0.5 + 0.5) * 0.5
def wrangle_age(dataset):
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) &
                        (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]
    dataset['Age'] = dataset['Age'].astype(int)
    return dataset
train_data = wrangle_age(train_data)
test_data = wrangle_age(test_data)
Creating New Features
The two features SibSp and Parch both refer to family members, so we will create a new feature called FamilySize by adding SibSp and Parch. We also add 1 so that a passenger travelling alone has a family size of 1.
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
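As a quick sanity check (a sketch, not part of the original tutorial), we can look at the survival rate for each family size:
# Survival rate per family size
print(train_data[['FamilySize', 'Survived']].groupby('FamilySize').mean())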
Final check before Machine Learning
We can now see the changes we made.
print(train_data.info())
# output is as follows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null int64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null float64
Embarked_Q    891 non-null float64
Embarked_S    891 non-null float64
FamilySize    891 non-null int64
dtypes: float64(4), int64(7)
memory usage: 76.7 KB
Similarly, getting information about our test data, we find that Fare is missing one value (417 non-null out of 418).
print(test_data.info())
# output is as follows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
Pclass        418 non-null int64
Sex           418 non-null int64
Age           418 non-null int64
SibSp         418 non-null int64
Parch         418 non-null int64
Fare          417 non-null float64
Embarked_C    418 non-null float64
Embarked_Q    418 non-null float64
Embarked_S    418 non-null float64
FamilySize    418 non-null int64
dtypes: float64(4), int64(6)
memory usage: 32.7 KB
So we will fill the missing Fare with the mean of the observed fares, which is roughly 32.
mean_fare = 32  # roughly the mean of the observed fares
test_data['Fare'] = test_data['Fare'].fillna(mean_fare)
Machine Learning Models: Training and Evaluation
We will now build models to make predictions from our data, using four classification algorithms: Logistic Regression, Support Vector Machine, K-Nearest Neighbors and Random Forest.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
Let's first split our training data into features (X_train) and labels (Y_train). We will use the first 800 rows (roughly 90% of the data) for training and the remaining 91 rows for validation.
X_train = train_data.drop("Survived", axis=1)[:800]
Y_train = train_data["Survived"][:800]
X_crossValidation = train_data.drop("Survived", axis=1)[800:]
Y_crossValidation = train_data["Survived"][800:]
X_test = test_data
Logistic Regression
Let's start with Logistic Regression: we fit it on our training features and labels, then calculate the training and validation accuracy.
model_logistic = LogisticRegression()
model_logistic.fit(X_train, Y_train)
train_accuracy = round(model_logistic.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(model_logistic.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predL = model_logistic.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 80.00; validation_accuracy = 82.42;
Support Vector Machines
Next, we will use a Support Vector Machine.
svc = SVC()
svc.fit(X_train, Y_train)
train_accuracy = round(svc.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(svc.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predS = svc.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 89.75; validation_accuracy = 78.02;
K Nearest Neighbors
We will now use the KNN algorithm.
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
train_accuracy = round(knn.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(knn.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predK = knn.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 82.50; validation_accuracy = 75.82;
Random Forest
Lastly, we will use a Random Forest classifier.
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
train_accuracy = round(random_forest.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(random_forest.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_predR = random_forest.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 98.00; validation_accuracy = 89.01;
Our winner is Random Forest, as it has the highest validation accuracy.
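As a quick recap, we can tabulate the validation accuracies reported above (a small sketch using the numbers from this particular run; your exact values may differ):
# Validation accuracies from the runs above
results = pd.DataFrame({
    'Model': ['Random Forest', 'Logistic Regression', 'SVM', 'KNN'],
    'Validation accuracy': [89.01, 82.42, 78.02, 75.82]
})
print(results)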
Submission
As mentioned at the very beginning, this problem is hosted as a competition on Kaggle, so we will submit our predictions to see how well we do on the test data set. We will use the predictions made by the Random Forest, Y_predR, together with the passenger IDs we set aside earlier.
submission = pd.DataFrame({
    "PassengerId": test_passenger_ids,  # saved before dropping the column
    "Survived": Y_predR
})
submission.to_csv('submission.csv', index=False)
Result
Our prediction from the above model scored 0.76 and was ranked in the top 62%. Not bad, but not satisfactory either.
Bonus: Dealing with Overfitting
From this result we can conclude that our model is overfitting. Though it has 98% training accuracy, it only has 89.01% validation accuracy and even lower test accuracy (76%). This is a clear case of overfitting: our model is very successful at making correct predictions on the training data but fails to do so on the test data.
Hyper-parameter Tuning
One way to prevent a model from overfitting is to collect more data, but that is not a viable option in our case. We can also reduce overfitting and improve performance by tuning the hyper-parameters of our algorithm (Random Forest in our case). We will use the GridSearchCV class from scikit-learn to automate the process of choosing the best combination of hyper-parameters.
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
parameter_grid = {
    'max_depth': [4, 6, 8],
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'auto', 'log2'],
    'min_samples_split': [0.001, 0.003, 0.01],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [True, False],
}
forest = RandomForestClassifier()
cross_validation = StratifiedKFold(n_splits=5)  # model_selection's StratifiedKFold takes n_splits; the labels are supplied via GridSearchCV.fit
grid_search = GridSearchCV(forest,
                           scoring='accuracy',
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(X_train, Y_train)
model = grid_search
parameters = grid_search.best_params_
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
We get the following best parameter values from the grid we specified.
Best score: 0.835
Best parameters: {'bootstrap': True, 'min_samples_leaf': 1,
'n_estimators': 100, 'min_samples_split': 0.003,
'max_features': 'sqrt', 'max_depth': 8}
We will now train our Random Forest model with the tuned parameters. (Note that the exact best values can vary slightly between grid-search runs, which is why the dictionary below differs slightly from the output above.)
parameters = {'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 100,
              'min_samples_split': 0.01, 'max_features': 'log2', 'max_depth': 8}
model = RandomForestClassifier(**parameters)
model.fit(X_train, Y_train)
train_accuracy = round(model.score(X_train, Y_train) * 100, 2)
validation_accuracy = round(model.score(X_crossValidation, Y_crossValidation) * 100, 2)
Y_pred = model.predict(X_test)
print(train_accuracy)
print(validation_accuracy)
We get: train_accuracy = 90.88; validation_accuracy = 83.52;
Let’s submit our prediction as we did previously.
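Here is a minimal sketch of that resubmission, reusing the test_passenger_ids saved earlier and the tuned model's predictions Y_pred (the filename submission_tuned.csv is just an example):
submission = pd.DataFrame({
    "PassengerId": test_passenger_ids,
    "Survived": Y_pred
})
submission.to_csv('submission_tuned.csv', index=False)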
We see that we have improved our model: it now scores 79.425% accuracy on the test set, and our rank has improved significantly, placing us in the top 16% (hurrah!). From the train and test accuracies we can see that we have reduced the over-fitting problem, but it has not been completely eliminated.
Bonus: Further Improvement
As we discussed above, our model still has some problems and its accuracy can still be improved. Here are some tips that might help you improve your model (you can get started with the IPython notebook here: KaggleTitanicProject.ipynb):
Feature Engineering
We can make our model perform better by making changes to our features. Let's visualize how different features affect our model.
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = model.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh', figsize=(10, 5))
We can clearly see that Sex has the highest influence on our model. We can't change the values of Sex, but we also see that Fare and Age have significant influence. One thing we can do to improve our model is to engineer these features: representing their values in bands might improve performance. For example, we can represent Age values in bands like 0-5 years, 5-10 years, etc.
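Here is a minimal sketch of what that banding could look like with pd.cut; the band edges below are illustrative choices, not values from the original tutorial:
# Example age bands: 0-5, 5-12, 12-18, 18-35, 35-60, 60+
age_bins = [0, 5, 12, 18, 35, 60, 120]
train_data['AgeBand'] = pd.cut(train_data['Age'], bins=age_bins,
                               labels=False, include_lowest=True)
test_data['AgeBand'] = pd.cut(test_data['Age'], bins=age_bins,
                              labels=False, include_lowest=True)
# Survival rate per age band, to check whether the bands are informative
print(train_data[['AgeBand', 'Survived']].groupby('AgeBand').mean())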
Try Other Classifiers
We can also try other classifiers like Decision Trees, Naive Bayes, etc., or revisit the algorithms we have already used. For example, the Logistic Regression model performed almost as accurately as Random Forest on the test data (75.5% for Logistic Regression and 76.5% for Random Forest) even though its training and validation accuracies were lower. Tuning the hyper-parameters of Logistic Regression or other methods may give better results than what we achieved above.
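For instance, here is a quick sketch of trying two more classifiers with default settings, using the same X_train / X_crossValidation split as above:
# Try a couple of additional classifiers with their default settings
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

for clf in [DecisionTreeClassifier(), GaussianNB()]:
    clf.fit(X_train, Y_train)
    acc = round(clf.score(X_crossValidation, Y_crossValidation) * 100, 2)
    print(clf.__class__.__name__, acc)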
Solution on Google Colaboratory
The complete notebook with all the cells executed is available via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your drive.