Score: 0.78947

Loading the data

setwd("C:/Users/User/Desktop/JobSeeking/Project/Kaggle/titanic/120721")
train <- read.csv(file="train.csv", stringsAsFactors = FALSE, header = TRUE)
test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)

Overview

As we have read in the dataset as a data.table object, we can use head(test) to look the first 6 rows. Next, we can use summary() to check summary statistics such as mean, min, and max values for each variable to see if there are any obvious outliers in the data and if there are any nulls in any of the columns (NA's: number of nulls will appear in the output if there are any nulls).

# Examine passengers data in train dataset
head(train)

# Examine passengers data in test dataset
head(test)

# Summarise the data to check for nulls and possible outliers for the train dataset
summary(train)

##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
##

# Summarise the data to check for nulls and possible outliers for the test dataset
summary(test)

##   PassengerId         Pclass          Name               Sex           
##  Min.   : 892.0   Min.   :1.000   Length:418         Length:418        
##  1st Qu.: 996.2   1st Qu.:1.000   Class :character   Class :character  
##  Median :1100.5   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :1100.5   Mean   :2.266                                        
##  3rd Qu.:1204.8   3rd Qu.:3.000                                        
##  Max.   :1309.0   Max.   :3.000                                        
##                                                                        
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000                     
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
##  NA's   :86                                                          
##       Fare            Cabin             Embarked        
##  Min.   :  0.000   Length:418         Length:418        
##  1st Qu.:  7.896   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character  
##  Mean   : 35.627                                        
##  3rd Qu.: 31.500                                        
##  Max.   :512.329                                        
##  NA's   :1

There are 177 nulls in Age column in train dataset and 86 nulls in Age column and 1 null in Fare column in test dataset; morever, we found there are some blank cells in the train dataset. So next, we need to see how many nulls and blank cells are in each variable.

Identifying missing value

Missing values in train dataset

# Count both NA and blank cels for train dataset
sapply(train, function(x) sum(is.na(x)|x == ""))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

# Show them in percentage for train dataset
sapply(train, function(x) sum(is.na(x)|x == "")/length(x)*100)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000  19.8653199 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   0.0000000   0.0000000   0.0000000   0.0000000  77.1043771   0.2244669

There are 177 nulls in Age column, 687 blank cells in Cabin column, and 2 blank cells in Embarked column, which indicate that 19.87% of data in Age column, 77.10% of data in Cabin column, and 0.22% of data in Embarked column are missing in the train dataset. We have to leave Cabin column with those blank cells for it has too many missing data; however, we could try to fill the missing values in both Age and Embarked based on the existing data.

Missing values in test dataset

# Count both NA and blank cells for test dataset
sapply(test, function(x) sum(is.na(x)|x == ""))

## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1         327           0

# Show them in percentage for test dataset
sapply(test, function(x) sum(is.na(x)|x == "")/length(x)*100)

## PassengerId      Pclass        Name         Sex         Age       SibSp 
##   0.0000000   0.0000000   0.0000000   0.0000000  20.5741627   0.0000000 
##       Parch      Ticket        Fare       Cabin    Embarked 
##   0.0000000   0.0000000   0.2392344  78.2296651   0.0000000

There are 86 nulls in Age column, 327 blank cells in Cabin column, and 1 blank cell in Fare column, which indicate that 20.57% of data in Age column, 78.23% of data in Cabin column, and 0.24% of data in Embarked column are missing in the test dataset. We also have to leave Cabin column with those blank cells for it has too many missing data; on the other hand, we could try to fill the missing values in both Age and Fare based on the existing data.

Filling missing values

Fill the missing values in Age variable

# Fill the missing values of Age variable in train dataset with median
age.median.train <- median(train$Age,na.rm = TRUE)
train[is.na(train$Age), "Age"] <- age.median.train
sapply(train, function(x) sum(is.na(x)|x == ""))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

# Fill the missing values of Age variable in test dataset with median
age.median.test <- median(test$Age,na.rm = TRUE)
test[is.na(test$Age), "Age"] <- age.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))

## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0           0           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1         327           0

We use the median of Age to fill the missing values in the Age column for both train and test datasets. Now no missing value exists in the Age column any more.

Fill the missing values in Fare variable

# Fill the missing values of Fare variable in test dataset with median
fare.median.test <- median(test$Fare,na.rm = TRUE)
test[is.na(test$Fare), "Fare"] <- fare.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))

## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0           0           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0         327           0

We use the median of Fare to fill the missing values in the Fare column for test dataset. There is no missing Fare value now.

Fill the missing values in Embarked variable

# Fill the missing values of Embarked variable in test dataset with "S"
table(train$Embarked)

## 
##       C   Q   S 
##   2 168  77 644

train[train$Embarked =='',"Embarked"] <- 'S'
table(train$Embarked)

## 
##   C   Q   S 
## 168  77 646

sapply(train, function(x) sum(is.na(x)|x == ""))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           0

We use the ‘S’ to fill the missing Embarked cell, for it is the value with the highest frequency, which would affect the result least if we use the wrong value to fill the blank cell.

Fitting the model

Before we fit the model, we need to use str() to look at the format of each column to see whether we need change the format of some variables into appropriate ones.

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

We found Survived, Pclass, Sex, and Embarked variables need to be changed to factor format.

train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)

str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Now, we can begin to fit our model.

First, we fit the model with all variables except PassengerId, Name, Ticket and Cabin. The reason to remove Cabin is obvious, for it has too many missing values. We also removed PassengerId, Name, and Ticket, for they are all variables with unique values, which we may take into account to use them to make new variables if we have more evidences to show they are related to the survive rate.

model_0 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Parch  + Fare +  Embarked,
    family = "binomial")
summary(model_0)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6199  -0.6089  -0.4176   0.6187   2.4514  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.064159   0.472813   8.596  < 2e-16 ***
## Pclass2     -0.919468   0.297326  -3.092  0.00199 ** 
## Pclass3     -2.150048   0.297720  -7.222 5.13e-13 ***
## Sexmale     -2.719444   0.200977 -13.531  < 2e-16 ***
## Age         -0.038517   0.007855  -4.903 9.43e-07 ***
## SibSp       -0.321794   0.109193  -2.947  0.00321 ** 
## Parch       -0.093329   0.118856  -0.785  0.43232    
## Fare         0.002339   0.002469   0.947  0.34346    
## EmbarkedQ   -0.056267   0.381471  -0.148  0.88274    
## EmbarkedS   -0.434226   0.239530  -1.813  0.06986 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  785.04  on 881  degrees of freedom
## AIC: 805.04
## 
## Number of Fisher Scoring iterations: 5

In model_0, we found that the P-values of Parch and Fare are greater than 0.05, so we decided to remove Parch to see whether the new model could have all variables with p-values which are less than 0.05 and have a lower AIC value.

# Remove Parch variable
model_01 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp +  Fare +  Embarked,
    family = "binomial")
summary(model_01)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare + 
##     Embarked, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6484  -0.6082  -0.4174   0.6224   2.4576  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.043962   0.470233   8.600  < 2e-16 ***
## Pclass2     -0.940058   0.295524  -3.181  0.00147 ** 
## Pclass3     -2.180214   0.294449  -7.404 1.32e-13 ***
## Sexmale     -2.687111   0.196045 -13.707  < 2e-16 ***
## Age         -0.038327   0.007838  -4.890 1.01e-06 ***
## SibSp       -0.343859   0.105981  -3.245  0.00118 ** 
## Fare         0.001916   0.002358   0.813  0.41629    
## EmbarkedQ   -0.029488   0.379350  -0.078  0.93804    
## EmbarkedS   -0.437273   0.239330  -1.827  0.06769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  785.67  on 882  degrees of freedom
## AIC: 803.67
## 
## Number of Fisher Scoring iterations: 5

In model_01, we found the AIC value decreased which means it is a better model than model_0; however, the p-value of Fare is still larger than 0.05. So we remove Fare and fit a new model.

# Remove Fare variable
model_02 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
    family = "binomial")
summary(model_02)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6021  -0.6012  -0.4151   0.6218   2.4651  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.22353    0.41876  10.086  < 2e-16 ***
## Pclass2     -1.03964    0.26956  -3.857 0.000115 ***
## Pclass3     -2.30238    0.25453  -9.046  < 2e-16 ***
## Sexmale     -2.69944    0.19541 -13.814  < 2e-16 ***
## Age         -0.03874    0.00782  -4.954 7.26e-07 ***
## SibSp       -0.32720    0.10369  -3.155 0.001602 ** 
## EmbarkedQ   -0.05588    0.37822  -0.148 0.882536    
## EmbarkedS   -0.46779    0.23610  -1.981 0.047557 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  786.37  on 883  degrees of freedom
## AIC: 802.37
## 
## Number of Fisher Scoring iterations: 5

Now, the model_02 has all variables with p-values which are less than 0.05 and has the smallest AIC value, so it is our final fitted model.

Prediction

Before we use the fitted model to predict Survived values in test data, we need to use str() to look at the format of each column to see whether the formats of all variables are the same as those in train dataset.

str(test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

Survived, Pclass, Sex, and Embarked variables need to be changed to factor format.

test$Survived <- NA
test$Survived <- as.factor(test$Survived)
test$Pclass <- as.factor(test$Pclass)
test$Sex <- as.factor(test$Sex)
test$Embarked <- as.factor(test$Embarked)

str(test)

## 'data.frame':    418 obs. of  12 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Survived   : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...

Now, we are ready to predict with random forest algorithm.

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.5.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

model <- randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked, 
                      data = train, ntree = 500, ntry = 3, nodesize = 0.01 * nrow(test))

Survived <- predict(model, newdata=test)

PassengerId <- test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived

write.csv(output.df, file="kaggle_submission_13.csv", row.names = FALSE)

Finally, we can use our output to see how accurate our prediction is.

Titanic - Machine Learning from Disaster (Kaggle Competition)

Yixi Deng