The data:

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.


Questions

  1. Use read_csv to import the Titanic data.
library(tidyverse)
df <- read_csv('titanic.csv')

  1. How many missing values are there in total? Which column has the most missing values?
sum(is.na(df))
## [1] 866
colSums(is.na(df))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
# The Cabin column has the most missing values (687)
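
The column with the most missing values can also be picked out programmatically instead of being read off the table. A minimal sketch on a made-up toy data frame (the object `toy` and its values are illustrative assumptions, not the Titanic data):

```r
# Toy data frame standing in for df (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c(NA, NA, NA, "C85"),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(toy))             # missing values per column
most_missing <- names(which.max(na_counts))  # name of the column with the largest count
most_missing
```

Note that which.max returns only the first maximum, so a tie between columns would need a separate check.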

  1. Remove the column with the most missing values.
df$Cabin <- NULL

  1. Remove rows containing missing values and save the result as a new dataset. The original dataset remains unchanged by this action.
# With Cabin removed, only Age and Embarked still contain missing values
df1 <- drop_na(df, Age, Embarked)
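
To see that drop_na returns a new object and leaves its input untouched, here is a small sketch on a made-up data frame (the names `toy` and `toy_complete` are assumptions for illustration). Calling drop_na without column names checks every column:

```r
library(tidyr)

# Toy data frame (made-up values) with missing entries in two columns
toy <- data.frame(
  Age      = c(22, NA, 35, 40),
  Embarked = c("S", "C", NA, "Q")
)

toy_complete <- drop_na(toy)  # returns a new data frame; toy is not modified

nrow(toy)           # the original keeps all of its rows
nrow(toy_complete)  # only the rows without any NA remain
```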

  1. Replace the missing values of numeric variables with the corresponding average of the columns.
df$Age <- replace_na(df$Age, mean(df$Age, na.rm = TRUE))
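
Age is the only numeric column with missing values here, so one line is enough. If several numeric columns had gaps, the same idea could be applied to all of them at once. A hedged sketch using across on a made-up tibble (the `toy` data and its values are assumptions; the columns are doubles so that the mean can be inserted without a type conflict):

```r
library(dplyr)
library(tidyr)

# Toy tibble (made-up values): two numeric columns and one character column
toy <- tibble(
  Age  = c(20, NA, 40),
  Fare = c(10, 30, NA),
  Sex  = c("male", "female", NA)
)

# Replace NA in every numeric column with that column's mean;
# non-numeric columns such as Sex are left as they are
toy_imputed <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, mean(.x, na.rm = TRUE))))

toy_imputed
```

On the real data this would also touch identifier-like numeric columns such as PassengerId, which is harmless only because they contain no missing values.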

  1. Replace the missing values of categorical variables with the corresponding mode (most frequent value) of the columns.
table(df$Embarked)
## 
##   C   Q   S 
## 168  77 644
# 'S' is the most frequent value, so use it to fill the missing entries
df$Embarked <- replace_na(df$Embarked, 'S')
colSums(is.na(df))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare    Embarked 
##           0           0           0           0           0
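
Base R has no built-in function for the statistical mode, so the most frequent value was read off the table and typed in by hand. It could instead be computed, as in this minimal sketch on a made-up vector (`embarked`, `mode_value`, and `embarked_filled` are illustrative names, not part of the original solution):

```r
library(tidyr)

embarked <- c("S", "C", "S", NA, "Q", "S")  # made-up values for illustration

counts <- table(embarked)                       # table() drops NA by default
mode_value <- names(counts)[which.max(counts)]  # on ties, the first level wins
embarked_filled <- replace_na(embarked, mode_value)

embarked_filled
```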

  1. The Survived column records whether a passenger survived or not. Survived = 1 means the passenger survived. Thus, the chance of survival for a random passenger can be estimated by
mean(df$Survived)

Compare the chance of survival between male and female passengers. Hint: use the group_by + summarise combination.

df %>% 
  group_by(Sex) %>% 
  summarise(mean(Survived))
## # A tibble: 2 x 2
##   Sex    `mean(Survived)`
##   <chr>             <dbl>
## 1 female            0.742
## 2 male              0.189
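
The mean of a 0/1 column is the proportion of ones, which is why mean(Survived) within each group gives a survival rate. A small sketch on made-up data shows the same pattern with a named summary column and a group size added (the `toy` tibble and the column names `survival_rate` and `n` are assumptions for illustration):

```r
library(dplyr)

# Toy tibble (made-up values): 2 women who survived, 3 men of whom 1 survived
toy <- tibble(
  Sex      = c("female", "female", "male", "male", "male"),
  Survived = c(1, 1, 0, 1, 0)
)

by_sex <- toy %>%
  group_by(Sex) %>%
  summarise(
    survival_rate = mean(Survived),  # share of passengers with Survived == 1
    n = n()                          # number of passengers in the group
  )

by_sex
```

Reporting the group size next to the rate makes it easier to judge how much weight each estimate deserves.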

  1. Use the summary function to find the first quartile (Q1) and the third quartile (Q3) of the variable Age. Create a new variable taking the values young (Age < Q1), middle (Age from Q1 to Q3), and old (Age > Q3). Compare the chance of survival among these three age groups.
summary(df$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   29.70   29.70   35.00   80.00
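# Cut-offs are Q1 = 22 and Q3 = 35, taken from the summary above.
# Note: the second condition uses < 35, so Age == 35 is labelled 'old';
# use <= 35 to keep Q3 itself in 'middle', as the question is worded.
df$quartile_age <- case_when(
  df$Age < 22 ~ 'young',
  df$Age < 35 ~ 'middle',
  TRUE ~ 'old'
)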
df %>% 
  group_by(quartile_age) %>% 
  summarise(mean(Survived))
## # A tibble: 3 x 2
##   quartile_age `mean(Survived)`
##   <chr>                   <dbl>
## 1 middle                  0.356
## 2 old                     0.4  
## 3 young                   0.426
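
The cut-offs 22 and 35 were copied by hand from the summary output, which prints rounded values. A hedged alternative is to take them from quantile directly. The sketch below uses a made-up age vector (`age`, `q1`, `q3`, and `age_group` are illustrative names) and places both boundary values in the middle group, following the wording of the question:

```r
library(dplyr)

age <- c(2, 10, 20, 30, 40, 50, 60, 70, 80)  # made-up ages for illustration

q1 <- quantile(age, 0.25, names = FALSE)  # first quartile
q3 <- quantile(age, 0.75, names = FALSE)  # third quartile

age_group <- case_when(
  age < q1  ~ "young",   # strictly below Q1
  age <= q3 ~ "middle",  # from Q1 to Q3, both ends included
  TRUE      ~ "old"      # strictly above Q3
)

age_group
```

Because case_when evaluates its conditions in order, the second condition only sees ages that are already at or above Q1.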