The data:
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
Questions
library(tidyverse)
df <- read_csv('titanic.csv')
sum(is.na(df))
## [1] 866
colSums(is.na(df))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
# The Cabin column has the most missing values, so drop it
df$Cabin <- NULL
df1 <- drop_na(df, Age, Embarked)
df$Age <- replace_na(df$Age, mean(df$Age, na.rm = TRUE))
table(df$Embarked)
##
## C Q S
## 168 77 644
df$Embarked <- replace_na(df$Embarked, 'S')
colSums(is.na(df))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Embarked
## 0 0 0 0 0
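The imputation above types in the most frequent port ('S') by hand after reading the table. A minimal sketch of the same two fills, with the most frequent value looked up from the data instead. The data frame toy and the name mode_port are made up for illustration and are not part of the Titanic file:

```r
library(tidyr)  # replace_na()

# Hypothetical toy data, not the Titanic file
toy <- data.frame(
  Age      = c(20, NA, 40, 30),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# Numeric column: fill gaps with the mean of the observed values
toy$Age <- replace_na(toy$Age, mean(toy$Age, na.rm = TRUE))

# Categorical column: table() ignores NA by default, so which.max()
# picks the most frequent observed value
mode_port <- names(which.max(table(toy$Embarked)))
toy$Embarked <- replace_na(toy$Embarked, mode_port)

# Check that no missing values remain
colSums(is.na(toy))
```

Looking the value up this way keeps the code valid if the data changes, at the cost of hiding the choice from the reader.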
The Survived column records whether a passenger survived; Survived = 1 means the passenger survived. Thus, the chance of survival for a random passenger can be estimated by
mean(df$Survived)
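The mean works as an estimate here because the column only holds 0s and 1s, so its mean is the share of 1s. A tiny sketch on a made-up vector (survived is hypothetical, not the real column):

```r
# Hypothetical 0/1 outcomes for five passengers
survived <- c(1, 0, 0, 1, 0)

# The mean of a 0/1 vector is the proportion of 1s
rate_mean  <- mean(survived)
rate_count <- sum(survived) / length(survived)

# Both expressions give the same proportion
all.equal(rate_mean, rate_count)
```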
Compare the chance of survival between male and female passengers. Hint: use the group_by + summarise combo.
df %>%
  group_by(Sex) %>%
  summarise(mean(Survived))
## # A tibble: 2 x 2
## Sex `mean(Survived)`
## <chr> <dbl>
## 1 female 0.742
## 2 male 0.189
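The summary column above gets the automatic name mean(Survived), which is awkward to refer to later. A small sketch, on made-up rows, of the same group_by + summarise combo with a named column and the group size added. The names toy, rates, n and survival_rate are illustrative choices only:

```r
library(dplyr)

# Hypothetical toy data, not the Titanic file
toy <- tibble(
  Sex      = c("female", "female", "male", "male", "male"),
  Survived = c(1, 0, 0, 0, 1)
)

rates <- toy %>%
  group_by(Sex) %>%
  summarise(
    n             = n(),            # passengers in the group
    survival_rate = mean(Survived)  # share of survivors in the group
  )

rates
```

Reporting the group size next to the rate helps judge how much weight each estimate deserves.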
Use the summary function to find the first quartile (Q1) and the third quartile (Q3) of the variable Age. Create a new variable taking the values young (Age < Q1), middle (Age from Q1 to Q3), and old (Age > Q3). Compare the chance of survival between these three age groups.
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 29.70 29.70 35.00 80.00
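# Cut-offs are Q1 = 22 and Q3 = 35 from summary(df$Age) above.
# Note: Age == 35 lands in 'old' here; use Age <= 35 to keep it in 'middle'.
df$quartile_age <- case_when(
  df$Age < 22 ~ 'young',
  df$Age < 35 ~ 'middle',
  TRUE ~ 'old'
)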
df %>%
  group_by(quartile_age) %>%
  summarise(mean(Survived))
## # A tibble: 3 x 2
## quartile_age `mean(Survived)`
## <chr> <dbl>
## 1 middle 0.356
## 2 old 0.4
## 3 young 0.426
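The cut-offs 22 and 35 above are copied by hand from the summary output, which prints rounded values. A sketch, assuming a made-up age vector, of computing Q1 and Q3 with quantile() and assigning the groups so that ages equal to Q1 or Q3 fall in middle, as the task defines it. The names toy, q1, q3, age_group and survival_rate are hypothetical:

```r
library(dplyr)

# Hypothetical toy data, not the Titanic file
toy <- tibble(
  Age      = c(5, 18, 22, 29, 35, 35, 50, 71),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 1)
)

# Quartiles computed from the data instead of typed in
q1 <- unname(quantile(toy$Age, 0.25))
q3 <- unname(quantile(toy$Age, 0.75))

toy <- toy %>%
  mutate(age_group = case_when(
    Age < q1  ~ "young",   # Age < Q1
    Age <= q3 ~ "middle",  # Q1 <= Age <= Q3
    TRUE      ~ "old"      # Age > Q3
  ))

# Survival rate per age group
toy %>%
  group_by(age_group) %>%
  summarise(survival_rate = mean(Survived))
```

Because case_when checks conditions in order, the second branch only sees ages that are already at or above Q1.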