Tutorial: Titanic dataset machine learning for Kaggle

Kaggle runs an exciting competition for machine learning enthusiasts. It gives you Titanic passenger data as CSV files, and your model is supposed to predict who survived. You predict the values for the test set Kaggle provides and upload them to see your rank among other participants. A prediction accuracy of about 80% is considered a very good result.

What is Required

1. Python, NumPy, Pandas, Matplotlib, scikit-learn

2. Kaggle Titanic dataset: https://www.kaggle.com/c/titanic-gettingStarted/data

Goal

The machine learning model is supposed to predict who survived. This is a typical classification problem, and we will build a model using Decision Trees or Random Forests that reaches at least 80% prediction accuracy.

This tutorial only touches the basics of machine learning and does not go into graphical analysis of the data. Let's do it in a very simple way!

Let's Get Started

First things first: for machine learning algorithms to work, the dataset must be converted to numeric data. You have to encode all categorical labels as column vectors with binary values. Missing values (NaNs) in the dataset are an annoying problem. You have to either drop the rows that contain them or fill them with a mean or interpolated value.

Note: Kaggle provides two datasets separately: the training data and the test data you submit predictions for. After encoding, both must have the same feature columns for the model to work.

To work on the data, you can load the CSV either in Excel or in pandas. Let's load the CSV data in pandas.

import pandas as pd
df = pd.read_csv('train.csv',header=0)

Let's take a look at the data format:

>>> df.info()
 <class 'pandas.core.frame.DataFrame'>
 Int64Index: 891 entries, 0 to 890
 Data columns (total 12 columns):
 PassengerId 891 non-null int64
 Survived 891 non-null int64
 Pclass 891 non-null int64
 Name 891 non-null object
 Sex 891 non-null object
 Age 714 non-null float64
 SibSp 891 non-null int64
 Parch 891 non-null int64
 Ticket 891 non-null object
 Fare 891 non-null float64
 Cabin 204 non-null object
 Embarked 889 non-null object

If you look carefully at the summary above, there are 891 rows in total, but Age has only 714 non-null values (177 missing), Embarked has 889 (2 missing), and Cabin has only 204 (most are missing). Object data types are non-numeric, so we have to find a way to encode them as numerical values. One way is one-hot encoding, i.e. turning each distinct value of a column into its own binary column.
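If you prefer numbers over reading the summary by eye, the missing values can be counted directly. Here is a minimal sketch on a small made-up frame (the values are invented for illustration, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks every missing cell; sum() counts them per column.
missing = toy.isnull().sum()
print(missing)
```

On the real data, the same call on df should report the gaps in Age, Cabin, and Embarked that the summary hints at.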

Let's drop some of the columns that may not contribute much to our machine learning model, such as Name, Ticket, and Cabin.

cols = ['Name','Ticket','Cabin']
df = df.drop(cols,axis=1)

We dropped 3 columns:

PassengerId 891 non-null int64
 Survived 891 non-null int64
 Pclass 891 non-null int64
 Sex 891 non-null object
 Age 714 non-null float64
 SibSp 891 non-null int64
 Parch 891 non-null int64
 Fare 891 non-null float64
 Embarked 889 non-null object

Next, if we want, we can drop all rows in the data that have missing values (NaN). You can do it like this:

df = df.dropna()

Int64Index: 712 entries, 0 to 890
 Data columns (total 9 columns):
 PassengerId 712 non-null int64
 Survived 712 non-null int64
 Pclass 712 non-null int64
 Sex 712 non-null object
 Age 712 non-null float64
 SibSp 712 non-null int64
 Parch 712 non-null int64
 Fare 712 non-null float64
 Embarked 712 non-null object

As you can see, the dataset is reduced from 891 rows to 712, which means we are throwing data away. Machine learning models need data for training to perform well, so we would rather preserve the data and make use of as much of it as we can. The rest of this tutorial therefore continues with all 891 rows and fills in the missing values later instead of dropping them.

Now we convert Pclass, Sex, and Embarked to indicator columns in pandas and drop the originals after conversion.

dummies = []
cols = ['Pclass','Sex','Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))

then

titanic_dummies = pd.concat(dummies, axis=1)

The three columns are transformed into 8 indicator columns. The columns 1, 2, and 3 represent the passenger class.

   1  2  3  female  male  C  Q  S
0  0  0  1       0     1  0  0  1
1  1  0  0       1     0  1  0  0
2  0  0  1       1     0  0  0  1
3  1  0  0       1     0  0  0  1
4  0  0  1       0     1  0  0  1
5  0  0  1       0     1  0  1  0

Finally, we concatenate them to the original dataframe column-wise:

df = pd.concat((df,titanic_dummies),axis=1)

Now that we have converted the Pclass, Sex, and Embarked values into columns, we drop the redundant original columns from the dataframe:

df = df.drop(['Pclass','Sex','Embarked'],axis=1)

Let's take a look at the new dataframe:

PassengerId 891 non-null int64
 Survived 891 non-null int64
 Age 714 non-null float64
 SibSp 891 non-null int64
 Parch 891 non-null int64
 Fare 891 non-null float64
 1 891 non-null float64
 2 891 non-null float64
 3 891 non-null float64
 female 891 non-null float64
 male 891 non-null float64
 C 891 non-null float64
 Q 891 non-null float64
 S 891 non-null float64
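As an aside, the encode, concatenate, and drop steps above can probably be collapsed into a single pd.get_dummies call by passing the columns argument. This is only a sketch on a made-up frame; note that it produces prefixed column names such as Pclass_1 instead of the bare 1, 2, 3 used in this tutorial, and dtype=int is my own choice here:

```python
import pandas as pd

# Toy stand-in for the training data; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# Encode the listed columns, append the indicator columns,
# and drop the originals, all in one call.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```

The prefixed names make it easier to tell later which indicator column came from which original column.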

All is good, except Age, which has lots of missing values. We can either fill them with the median or use interpolate(). Pandas has a nice interpolate() function that replaces missing NaNs with interpolated values (linearly, by default, between the neighbouring rows).

df['Age'] = df['Age'].interpolate()

Now let's look at the data columns again. Notice that Age now has 891 non-null values, with the gaps filled by interpolated values.

Data columns (total 14 columns):
 PassengerId 891 non-null int64
 Survived 891 non-null int64
 Age 891 non-null float64
 SibSp 891 non-null int64
 Parch 891 non-null int64
 Fare 891 non-null float64
 1 891 non-null float64
 2 891 non-null float64
 3 891 non-null float64
 female 891 non-null float64
 male 891 non-null float64
 C 891 non-null float64
 Q 891 non-null float64
 S 891 non-null float64
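To get a feel for the difference between interpolating and filling with the median, here is a small sketch on a made-up series of ages (not the real Age column):

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps, for illustration only.
ages = pd.Series([20.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation fills each gap from its neighbours in row order.
interpolated = ages.interpolate()

# Median fill puts the same value (the median of the known ages) in every gap.
median_filled = ages.fillna(ages.median())

print(interpolated.tolist())   # [20.0, 25.0, 30.0, 40.0, 50.0, 60.0]
print(median_filled.tolist())  # [20.0, 30.0, 30.0, 30.0, 30.0, 60.0]
```

One thing to keep in mind: interpolation depends on the order of the rows, and neighbouring passengers in the file are not necessarily similar in age, so the median may be the more defensible choice. Which one scores better on this dataset is something you would have to try.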

Machine Learning

Now that we have converted all the data to numeric values, it's time to prepare the data for machine learning models. This is where scikit-learn and NumPy come into play:

X = input set: the 13 feature columns (all 14 columns except 'Survived')
y = output, in this case 'Survived'

Now we convert our dataframe from pandas to NumPy and assign the input and output:

X = df.values
y = df['Survived'].values

X still has the Survived values in it, which should not be there. So we delete that column from the NumPy array; it is at index 1, i.e. the second column.

import numpy as np
X = np.delete(X,1,axis=1)
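Deleting by position works, but it silently breaks if the column order ever changes. A possible alternative, sketched here on a made-up frame, is to select the feature columns by name before converting to NumPy. The variable feature_names is my own addition; it is handy later for labelling feature importances:

```python
import pandas as pd

# Toy stand-in for the encoded dataframe; the values are made up.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# Keep every column except the target, preserving the original order.
feature_names = [c for c in toy.columns if c != "Survived"]

X = toy[feature_names].values   # inputs, without Survived
y = toy["Survived"].values      # output
print(feature_names, X.shape, y.shape)
```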

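Now that X and y are ready, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's train_test_split. It lives in sklearn.model_selection; the older sklearn.cross_validation module has been removed from current scikit-learn releases.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)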
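A quick sanity check after splitting is to look at the shapes. The sketch below uses a tiny made-up array rather than the Titanic data, just to show what a 70/30 split produces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus made-up labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

With the 891 Titanic rows, the same call should leave 623 rows for training and 268 for testing.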

Let's start with a simple Decision Tree classifier and see how it goes:

from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
0.78735805970149249

Not bad: it gives a score of 78.73%.

Here is one more trick:

Decision trees pick each split so that it reduces the impurity of the data (scikit-learn uses the Gini criterion by default; entropy is an option). After you fit a decision tree on a dataset, the attribute feature_importances_ tells you how much each column contributed to that reduction, and therefore to the decisions. Let's see the output:

clf.feature_importances_

array([ 0.08944147,  0.11138782,  0.05743406,  0.02186072,  0.05721831, 0.03783151,  0.        ,  0.10597366,  0.50209245,  0.        , 0.        ,  0.        ,  0.01676001])

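This output lines up with the feature columns in order: PassengerId, Age, SibSp, Parch, Fare, 1, 2, 3, female, male, C, Q, S. The second element, 0.111, corresponds to Age, which has about 11% importance, and the fifth from the last, 0.502, corresponds to female, with about 50%. Very interesting! Yes, a large share of the Titanic survivors were women and children.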
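Reading importances off a bare array is error-prone, so it can help to pair each value with its column name. Here is a minimal sketch on synthetic data, where the feature names and the data are invented and the label depends only on the column called "signal":

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: three random features, label depends only on "signal".
rng = np.random.RandomState(0)
feature_names = ["noise_a", "signal", "noise_b"]  # hypothetical names
X = rng.rand(200, 3)
y = (X[:, 1] > 0.5).astype(int)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X, y)

# Pair each importance with its column name and sort, largest first.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

For the Titanic model you would pass the list of feature column names, in the same order as the columns of X, instead of the made-up names.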

Let's push that accuracy score a little further using Random Forests. Random forests work pretty well for almost everything out of the box.

from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.80970149253731338

Wow, we get 80.9% prediction accuracy. Not bad at all!

Let's try the Gradient Boosting algorithm and see if we can push that score up a bit.

clf = ensemble.GradientBoostingClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.81343283582089554

Nice! We pushed the prediction score to 81.3%.

Let's not give up yet, and play around with fine-tuning this gradient booster.

>>> clf = ensemble.GradientBoostingClassifier(n_estimators=50)
>>> clf.fit(X_train,y_train)
>>> clf.score(X_test,y_test)

0.83208955223880599

Very impressive!! 83.20%

Although our model is 83% accurate on our held-out split, when we feed it new data, the accuracy typically goes down by 5-10%. So if you upload the predicted values to Kaggle, the model may be only around 77% accurate on the new set of values. Lots of work still needs to be done!
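One way to get a less optimistic estimate before uploading is k-fold cross-validation, which scores the model on several different held-out slices instead of a single split. This is only a sketch on synthetic data generated with make_classification (an assumption for illustration; it is not the Titanic data), using scikit-learn's cross_val_score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Train and score the model on 5 different train/validation partitions.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```

If the mean of the fold scores is clearly below the single-split score, the single split was probably a lucky one, and the cross-validated mean is likely closer to what Kaggle will report.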

How to upload to Kaggle

1. Download the test data from Kaggle. https://www.kaggle.com/c/titanic-gettingStarted/data
2. Encode the test CSV to numerical values just like above. Remember that the training set and the new test set must have the same feature columns, in the same order, for the model to work.
3. Predict the new set of values like this:

>>> y_results = clf.predict(X_results)

4. To upload to Kaggle, the results should contain PassengerId and Survived as comma-separated columns, written to a new CSV file:

>>> output = np.column_stack((X_results[:,0],y_results))
>>> df_results = pd.DataFrame(output.astype('int'),columns=['PassengerId','Survived'])
>>> df_results.to_csv('titanic_results.csv',index=False)

5. Now upload the results CSV to Kaggle, and they will compute your score!
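Step 2 is where things usually go wrong: if a category never appears in the test file, its indicator column will be missing and the model will refuse the input. A possible way to guard against that, sketched here on made-up frames, is to reindex the encoded test set against the training columns. The helper encode is hypothetical and uses prefixed column names, unlike the bare names used earlier in this tutorial:

```python
import pandas as pd

def encode(frame):
    """One-hot encode the categorical columns (hypothetical helper)."""
    return pd.get_dummies(frame, columns=["Pclass", "Sex", "Embarked"],
                          dtype=int)

# Made-up training and test rows, for illustration only.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

train_X = encode(train).drop("Survived", axis=1)

# Add any indicator columns the test set lacks (filled with 0) and put
# the columns in the same order as the training features.
test_X = encode(test).reindex(columns=train_X.columns, fill_value=0)

print(test_X.columns.tolist())
```

After this, test_X.values has the same width and column order as the training features, so it can play the role of X_results in step 3.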

Hope you have enjoyed this tutorial. Machine learning is very exciting!