setwd("C:/Users/User/Desktop/JobSeeking/Project/Kaggle/titanic/120721")
train <- read.csv(file="train.csv", stringsAsFactors = FALSE, header = TRUE)
test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
As we have read in the dataset as a data.table
object, we can use head(test)
to look the first 6 rows. Next, we can use summary()
to check summary statistics such as mean, min, and max values for each variable to see if there are any obvious outliers in the data and if there are any nulls in any of the columns (NA's: number of nulls
will appear in the output if there are any nulls).
# Examine passengers data in train dataset
head(train)
# Examine passengers data in test dataset
head(test)
# Summarise the data to check for nulls and possible outliers for the train dataset
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
# Summarise the data to check for nulls and possible outliers for the test dataset
summary(test)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
There are 177 nulls in Age
column in train dataset and 86 nulls in Age
column and 1 null in Fare
column in test dataset; morever, we found there are some blank cells in the train dataset. So next, we need to see how many nulls and blank cells are in each variable.
# Count both NA and blank cels for train dataset
sapply(train, function(x) sum(is.na(x)|x == ""))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
# Show them in percentage for train dataset
sapply(train, function(x) sum(is.na(x)|x == "")/length(x)*100)
## PassengerId Survived Pclass Name Sex Age
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 19.8653199
## SibSp Parch Ticket Fare Cabin Embarked
## 0.0000000 0.0000000 0.0000000 0.0000000 77.1043771 0.2244669
There are 177 nulls in Age
column, 687 blank cells in Cabin
column, and 2 blank cells in Embarked
column, which indicate that 19.87% of data in Age column, 77.10% of data in Cabin
column, and 0.22% of data in Embarked
column are missing in the train dataset. We have to leave Cabin
column with those blank cells for it has too many missing data; however, we could try to fill the missing values in both Age
and Embarked
based on the existing data.
# Count both NA and blank cells for test dataset
sapply(test, function(x) sum(is.na(x)|x == ""))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked
## 0 0 1 327 0
# Show them in percentage for test dataset
sapply(test, function(x) sum(is.na(x)|x == "")/length(x)*100)
## PassengerId Pclass Name Sex Age SibSp
## 0.0000000 0.0000000 0.0000000 0.0000000 20.5741627 0.0000000
## Parch Ticket Fare Cabin Embarked
## 0.0000000 0.0000000 0.2392344 78.2296651 0.0000000
There are 86 nulls in Age
column, 327 blank cells in Cabin
column, and 1 blank cell in Fare
column, which indicate that 20.57% of data in Age
column, 78.23% of data in Cabin
column, and 0.24% of data in Embarked
column are missing in the test dataset. We also have to leave Cabin
column with those blank cells for it has too many missing data; on the other hand, we could try to fill the missing values in both Age
and Fare
based on the existing data.
# Fill the missing values of Age variable in train dataset with median
age.median.train <- median(train$Age,na.rm = TRUE)
train[is.na(train$Age), "Age"] <- age.median.train
sapply(train, function(x) sum(is.na(x)|x == ""))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
# Fill the missing values of Age variable in test dataset with median
age.median.test <- median(test$Age,na.rm = TRUE)
test[is.na(test$Age), "Age"] <- age.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked
## 0 0 1 327 0
We use the median of Age
to fill the missing values in the Age
column for both train and test datasets. Now no missing value exists in the Age
column any more.
# Fill the missing values of Fare variable in test dataset with median
fare.median.test <- median(test$Fare,na.rm = TRUE)
test[is.na(test$Fare), "Fare"] <- fare.median.test
sapply(test, function(x) sum(is.na(x)|x == ""))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 0 0
## Parch Ticket Fare Cabin Embarked
## 0 0 0 327 0
We use the median of Fare
to fill the missing values in the Fare
column for test dataset. There is no missing Fare
value now.
# Fill the missing values of Embarked variable in test dataset with "S"
table(train$Embarked)
##
## C Q S
## 2 168 77 644
train[train$Embarked =='',"Embarked"] <- 'S'
table(train$Embarked)
##
## C Q S
## 168 77 646
sapply(train, function(x) sum(is.na(x)|x == ""))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 0
We use the ‘S’ to fill the missing Embarked
cell, for it is the value with the highest frequency, which would affect the result least if we use the wrong value to fill the blank cell.
Before we fit the model, we need to use str()
to look at the format of each column to see whether we need change the format of some variables into appropriate ones.
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
We found Survived
, Pclass
, Sex
, and Embarked
variables need to be changed to factor format.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 28 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Now, we can begin to fit our model.
First, we fit the model with all variables except PassengerId
, Name
, Ticket
and Cabin
. The reason to remove Cabin
is obvious, for it has too many missing values. We also removed PassengerId
, Name
, and Ticket
, for they are all variables with unique values, which we may take into account to use them to make new variables if we have more evidences to show they are related to the survive rate.
model_0 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
family = "binomial")
summary(model_0)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6199 -0.6089 -0.4176 0.6187 2.4514
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.064159 0.472813 8.596 < 2e-16 ***
## Pclass2 -0.919468 0.297326 -3.092 0.00199 **
## Pclass3 -2.150048 0.297720 -7.222 5.13e-13 ***
## Sexmale -2.719444 0.200977 -13.531 < 2e-16 ***
## Age -0.038517 0.007855 -4.903 9.43e-07 ***
## SibSp -0.321794 0.109193 -2.947 0.00321 **
## Parch -0.093329 0.118856 -0.785 0.43232
## Fare 0.002339 0.002469 0.947 0.34346
## EmbarkedQ -0.056267 0.381471 -0.148 0.88274
## EmbarkedS -0.434226 0.239530 -1.813 0.06986 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 785.04 on 881 degrees of freedom
## AIC: 805.04
##
## Number of Fisher Scoring iterations: 5
In model_0, we found that the P-values of Parch
and Fare
are greater than 0.05, so we decided to remove Parch
to see whether the new model could have all variables with p-values which are less than 0.05 and have a lower AIC value.
# Remove Parch variable
model_01 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked,
family = "binomial")
summary(model_01)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare +
## Embarked, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6484 -0.6082 -0.4174 0.6224 2.4576
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.043962 0.470233 8.600 < 2e-16 ***
## Pclass2 -0.940058 0.295524 -3.181 0.00147 **
## Pclass3 -2.180214 0.294449 -7.404 1.32e-13 ***
## Sexmale -2.687111 0.196045 -13.707 < 2e-16 ***
## Age -0.038327 0.007838 -4.890 1.01e-06 ***
## SibSp -0.343859 0.105981 -3.245 0.00118 **
## Fare 0.001916 0.002358 0.813 0.41629
## EmbarkedQ -0.029488 0.379350 -0.078 0.93804
## EmbarkedS -0.437273 0.239330 -1.827 0.06769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 785.67 on 882 degrees of freedom
## AIC: 803.67
##
## Number of Fisher Scoring iterations: 5
In model_01, we found the AIC value decreased which means it is a better model than model_0; however, the p-value of Fare
is still larger than 0.05. So we remove Fare
and fit a new model.
# Remove Fare variable
model_02 <- glm(train, formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
family = "binomial")
summary(model_02)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6021 -0.6012 -0.4151 0.6218 2.4651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.22353 0.41876 10.086 < 2e-16 ***
## Pclass2 -1.03964 0.26956 -3.857 0.000115 ***
## Pclass3 -2.30238 0.25453 -9.046 < 2e-16 ***
## Sexmale -2.69944 0.19541 -13.814 < 2e-16 ***
## Age -0.03874 0.00782 -4.954 7.26e-07 ***
## SibSp -0.32720 0.10369 -3.155 0.001602 **
## EmbarkedQ -0.05588 0.37822 -0.148 0.882536
## EmbarkedS -0.46779 0.23610 -1.981 0.047557 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 786.37 on 883 degrees of freedom
## AIC: 802.37
##
## Number of Fisher Scoring iterations: 5
Now, the model_02 has all variables with p-values which are less than 0.05 and has the smallest AIC value, so it is our final fitted model.
Before we use the fitted model to predict Survived
values in test data, we need to use str()
to look at the format of each column to see whether the formats of all variables are the same as those in train dataset.
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
Survived
, Pclass
, Sex
, and Embarked
variables need to be changed to factor format.
test$Survived <- NA
test$Survived <- as.factor(test$Survived)
test$Pclass <- as.factor(test$Pclass)
test$Sex <- as.factor(test$Sex)
test$Embarked <- as.factor(test$Embarked)
str(test)
## 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Survived : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
Now, we are ready to predict with random forest algorithm.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
model <- randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked,
data = train, ntree = 500, ntry = 3, nodesize = 0.01 * nrow(test))
Survived <- predict(model, newdata=test)
PassengerId <- test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
write.csv(output.df, file="kaggle_submission_13.csv", row.names = FALSE)
Finally, we can use our output to see how accurate our prediction is.