Titanic Survival Prediction

Data Science Dec 06, 2021

Introduction to Application:

Our goal is to apply machine-learning techniques to successfully predict which passengers survived the sinking of the Titanic. This is a case study based on the RMS Titanic data. We will analyse the Titanic dataset and make two predictions: one to see which passengers on board the ship would survive, and another to see whether we ourselves would have survived.

We are glad to share the standard code with you, which you can run on our CODER against the same dataset, or your own dataset, to build your own model.

Objectives of the Model:

The objective of this project is to build a predictive model that predicts which passengers survived the Titanic disaster.

Principle:

This is a Data Science model in which the machine learns from categorical and continuous variables in the form of historical data, which is in turn used to train the model for better prediction and accuracy.

Introduction to Dataset:

The data we used for this project is provided on the Kaggle website. We were given 891 passenger samples for our training set, along with labels indicating whether or not each passenger survived. The dataset is used as-is, without any modifications. You can download the dataset from the link below.

Dataset Link: https://raw.githubusercontent.com/schoolforai/AI-Models/main/titanic.csv

Methodology - SFAI's 5 Stage Approach

As you may be aware, any AI application can be broken down into these five stages. You probably learned about them in your Level 2: Data Science class. From data sourcing to deployment, this strategy makes AI application development easier to understand. Please note that all applications at SchoolforAI are broken down into these 5 stages for better control. Every machine learning engineer follows these five stages. However, it is not always possible or necessary to demonstrate every step. For example, only some datasets require Data Preparation and Feature Engineering. As a result, these 5 stages may need to be condensed.

As the purpose of this application is to introduce how we build a Data Science model, we have excluded or merged a few stages, arriving at the steps below:

1. Data Sourcing

Data sourcing is the primary stage of finding the relevant data for the model. For this model we are using the Titanic dataset from Kaggle. In this stage you can look at the data and understand its type, number of records, features, and so on. In the code below you will find the procedure for importing all the required libraries and reading the data from its location. Please observe the code and learn from it.

# Essential Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import joblib

# Data Reading (the dataset link above can be used directly, or replace it with a local file path)
data = pd.read_csv("https://raw.githubusercontent.com/schoolforai/AI-Models/main/titanic.csv")
var = data.columns

# To get information
def info(data):
    return data.info()
# Calling Function  
info(data)

# To get shape of data
def shape(data):
    return (data.shape)
# Calling Function
shape(data)
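
To get a quick sense of the records and feature types mentioned above, a few extra pandas calls can be helpful. This is an optional sketch; the exact columns it prints depend on the file you load.

# Optional: quick look at the data (assumes `data` was read as above)
print(data.head())        # first five rows
print(data.dtypes)        # data type of each feature
print(data.describe())    # summary statistics for numeric columns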

2. Model Building

In this step we have merged EDA & FE, Algorithm Selection, and Optimization for simplicity. EDA is an essential process in which you explore the interdependencies in the data.

We must check for missing values and outliers before doing EDA. The code below demonstrates how to calculate the missing count and check for data outliers.

# Checking for missing values
def missingcount(data):
    return data.isnull().sum()
# Calling Function
missingcount(data)

# To Detect Outliers

def outliers(data,var):
    num_vars=[var for var in data.columns if data[var].dtypes != 'O']
    for var in num_vars:
        data.boxplot(column=var)  # boxplot of each numeric feature
        plt.title(var)
        plt.show()
# Calling Function
outliers(data,var)

However, since the provided dataset is already cleaned, we will proceed with Data Visualization. Below is the code for Data Visualization.


# Data Visualization

# Categorical Data Visualization

# Univariate Analysis

def uni_cat(data,var):
    cat_vars=[var for var in data.columns if len(data[var].unique())<20]
    for var in cat_vars:
        sns.countplot(x=data[var],data=data)
        plt.show()

uni_cat(data,var)

# Bivariate Analysis

def bi_cat(data,var):
    cat_vars=[var for var in data.columns if len(data[var].unique())<20]
    for var in cat_vars:
        target=data['survived'] # Target column (user defined)
        sns.countplot(x=data[var],hue=target,data=data)
        plt.show()

bi_cat(data,var)

# Continuous Data Visualization

# Univariate Analysis

def uni_cont(data,var):
    num_vars=[var for var in data.columns if data[var].dtypes != 'O']
    for var in num_vars:
        data[var].hist(bins=30)
        plt.title(var)
        plt.show()

uni_cont(data,var)

# Bivariate Analysis

def bi_cont(data,var):
    num_vars=[var for var in data.columns if data[var].dtypes != 'O']
    for var in num_vars:
        target=data['survived'] # Target column (user defined)
        plt.scatter(data[var],target)
        plt.title(var)
        plt.show()

bi_cont(data,var)

Feature engineering helps in working on the features and selecting the right features (columns) that contribute to the model's predictive power. In this case we are retaining all the features.
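
If your copy of the dataset still contains text-valued columns (for example a column such as 'sex'), a tree classifier needs them encoded as numbers first. The snippet below is an optional, hedged sketch using pandas one-hot encoding; it only applies if string features remain in the file you load.

# Optional: one-hot encode any remaining text (object) columns before modelling.
# This is only needed if the dataset you load still contains string features.
text_cols = [c for c in data.columns if data[c].dtypes == 'O']
if text_cols:
    data = pd.get_dummies(data, columns=text_cols, drop_first=True)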

You must have already understood from your classes that the selection of an algorithm depends on the type of output or label in the dataset, i.e. either categorical or continuous output. As the target value in our dataset is categorical, we have treated this as a classification problem.
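
If you want to confirm this from the data itself, a quick look at the target column's distinct values makes the distinction clear. The column name 'survived' follows the earlier visualization code and may differ in your copy of the file.

# Quick check of the target: a small set of discrete values points to classification
print(data['survived'].nunique())   # number of distinct target values
print(data['survived'].unique())    # the values themselves (e.g. 0 and 1)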

For the given dataset, we tried several algorithms and finalized on the Decision Tree Classifier, which suits our classification problem. Below is the syntax for the Decision Tree Classifier. It helps to understand each parameter and its effect on the final accuracy. However, for simplicity we have considered only a few important parameters for tuning purposes.

In our SchoolforAI app, you can change the parameters to tune the algorithm on the given dataset and achieve the maximum possible accuracy.

What you are doing here is finding the best possible parameters so the model trains to the highest possible accuracy. Once you are done with tuning and training, you can proceed to the next step. Please note down the final optimization parameters you used for training the algorithm, for your reference. A code-based alternative using a grid search is sketched after the model-fitting code below.


            # Dependent & Independent Variables
    
# Dependent Variable
def dependent(data):
    x=data.iloc[:,-1]
    return x

y=dependent(data)

# Independent Variable

def Independent(data):
    x=data.iloc[:,:-1]
    return x

X=Independent(data)

The code below is for Model Building. You can see how we implemented the model in the code.


# Data Splitting into training and testing

def train_test(X,y):
    X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.8,random_state=0)
    
    return X_train,X_test,y_train,y_test

X_train,X_test,y_train,y_test = train_test(X,y)

# Model (DecisionTreeClassifier) Building and Model Fitting

def modelfitting(x,y):
    dt=DecisionTreeClassifier(criterion='gini',  
                              splitter='best', 
                              max_depth=21, 
                              min_samples_split=11,
                              max_features='sqrt'
                        )
    dt.fit(x,y)
    joblib.dump(dt,'model_joblib') # Model Serialization
    print(dt.get_params())
    
modelfitting(X_train,y_train)
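
If you prefer to search for good parameters in code rather than tuning them by hand in the app, a grid search is one common option. The sketch below is only an illustration, not the method used in the app, and the grid values are example settings rather than the final tuned parameters.

# Optional sketch: parameter search with scikit-learn's GridSearchCV
# (the grid values here are illustrative, not the final tuned parameters)
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [5, 11, 21],
    'min_samples_split': [2, 11, 20],
    'max_features': ['sqrt', None],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)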

3. Data Validation

Once the training and parameter tuning are done, it is time to cross-check or validate the model with a new set of data. In this stage you feed a new set of data to the model and have it predict the output. This is the real-life scenario where the model we built is put to the test.

You can see how we predict the outcome from the test inputs in our SchoolforAI app.

This is how you test a model.

# Data Validation
def predict(z):
    model=joblib.load('model_joblib')  # load the serialized model
    y_hat = model.predict(z)
    return y_hat

y_predict = predict(X_test)

Below is the code for evaluating the model prediction.

# Confusion Matrix
def c_matrix(y_test, y_predict):
    cm = confusion_matrix(y_test, y_predict)
    return cm

c_matrix(y_test, y_predict)
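
Since accuracy_score and classification_report were already imported in the Data Sourcing step, they can also be used here to summarise how well the predictions match the held-out labels. A small optional sketch:

# Additional evaluation metrics (uses the imports from the Data Sourcing step)
print(accuracy_score(y_test, y_predict))          # overall fraction of correct predictions
print(classification_report(y_test, y_predict))   # precision, recall and F1 per class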

4. Deployment

Once we are happy with the model, the parameter tuning, and the accuracy it offers, we proceed with deployment. It could be on the cloud or on the customer's server. This needs some additional skills, such as Flask. For the convenience and learning of the students, the standard code can be downloaded at this stage. Students can load the code into our CODER and practice.
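
As a rough illustration of what a Flask deployment could look like, the sketch below wraps the serialized model in a small prediction endpoint. This is only an assumed minimal layout, not the deployment used at SchoolforAI; the route name and request format are made up for the example.

# Minimal Flask serving sketch (illustrative only)
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('model_joblib')  # model saved in the Model Building step

@app.route('/predict', methods=['POST'])
def predict_survival():
    # Expects JSON like {"features": [[...feature values for one passenger...]]}
    features = request.get_json()['features']
    prediction = model.predict(features).tolist()
    return jsonify({'survived': prediction})

if __name__ == '__main__':
    app.run(debug=True)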

Coder Link: https://app.schoolforai.com/coder/python

Reference Link: https://app.schoolforai.com/ai-coder/titanic-survival-prediction
