Predicting intergalactic transportations with Spaceship Titanic [Guided]
Classification in Depth with Scikit-Learn

This project works with a fictional dataset based on the traditional Titanic dataset, giving you practical experience and a chance to validate your skills for the final assessment.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations of how they work and why we chose them.

input

How many null/missing values does the column `CryoSleep` have?

input

How many null/missing values does the column `FoodCourt` have?

input

How many null/missing values does the column `PassengerId` have?
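One way to answer these, sketched below, assumes `df` has already been loaded from the dataset file in the notebook's first cell (the file name here is just a placeholder):

```python
import pandas as pd

# Placeholder load; use the notebook's own first cell instead
df = pd.read_csv("spaceship_titanic.csv")

# Nulls in a single column
print(df["CryoSleep"].isna().sum())

# Null counts for every column at once
print(df.isna().sum())
```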

codevalidated

Fill the null/missing values in `Destination` with the most common value

Find the most common destination in the Destination column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
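A minimal sketch of one way to do this with pandas, assuming `df` is the working DataFrame:

```python
# The mode is the most frequent value in the column
most_common = df["Destination"].mode()[0]

# Assign back to the column so df itself is modified
df["Destination"] = df["Destination"].fillna(most_common)
```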

codevalidated

Fill the null/missing values in `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa` and `VRDeck` with the median

You must modify the df variable itself. Don't worry if you mess up the DataFrame: just reload it by re-running the first line of the notebook.
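A sketch of the median imputation, again assuming `df` is the working DataFrame:

```python
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Replace nulls in each spending column with that column's median
for col in spend_cols:
    df[col] = df[col].fillna(df[col].median())
```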

codevalidated

Fill the null/missing values in `VIP` with the most common value

Find the most common value for the VIP column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
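The same pattern used for `Destination` works here; a sketch:

```python
# Most frequent VIP value (the mode), used to fill the gaps
most_common_vip = df["VIP"].mode()[0]
df["VIP"] = df["VIP"].fillna(most_common_vip)
```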

multiplechoice

What variables (columns) are independent and should be dropped before moving forward?

There are a few variables in the dataset that are independent and won't contribute to a good prediction for our final model. Which ones are they?

codevalidated

Drop the previously defined columns

Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.
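A sketch of the drop itself; the column list below is a placeholder for whatever you identified in the previous activity, not the answer:

```python
# Placeholder names: substitute the columns you identified as independent
cols_to_drop = ["SomeColumn", "AnotherColumn"]

df.drop(columns=cols_to_drop, inplace=True)
```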

codevalidated

Drop any other row that contains a null value

Drop any other row containing a null/NaN value. The final dataframe should have NO null values whatsoever.
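A minimal sketch, assuming `df` is the working DataFrame:

```python
# Remove every remaining row that still has at least one null value
df.dropna(inplace=True)

# Sanity check: this should print 0
print(df.isna().sum().sum())
```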

multiplechoice

What features should be encoded?

What categorical features should be encoded before training our model?

codevalidated

Encode the features previously identified in its own dataframe

Given the features you previously identified as categorical and in need of encoding, encode them into a new dataframe named df_encoded. IMPORTANT: do not modify df yet. df_encoded should be a brand-new DataFrame containing only the previously selected features encoded as one-hot values, that is, 1s and 0s.

Don't perform any name changes to the columns; for example, the encoded columns for CryoSleep will be CryoSleep_False and CryoSleep_True. For HomePlanet they'll be HomePlanet_Earth, HomePlanet_Europa, HomePlanet_Mars, etc.

Important! You will (most likely) need to transform the VIP column to an object/string before encoding it. Use the .astype(str) method before encoding.
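One way this could look with pd.get_dummies; the list of categorical columns below is an assumption based on this and the surrounding activities, so adjust it to your own answer:

```python
import pandas as pd

# Assumed list of categorical features; match it to your earlier answer
categorical_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Work on a copy so df is not modified yet
to_encode = df[categorical_cols].copy()
to_encode["VIP"] = to_encode["VIP"].astype(str)

# One-hot encode; dtype=int keeps the result as 1s and 0s in recent pandas
df_encoded = pd.get_dummies(to_encode, dtype=int)
```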

codevalidated

Remove the original encoded features from `df`

Now it's time to remove the original features you previously identified as categorical and encoded. But this time, don't modify df in place: create a NEW variable named df_no_categorical that contains the result of the drop operation.
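A sketch, reusing the (assumed) categorical_cols list from the previous step:

```python
# New variable; df itself stays untouched because drop returns a new DataFrame
df_no_categorical = df.drop(columns=categorical_cols)
```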

codevalidated

Create a new dataframe combining `df_no_categorical` and `df_encoded`

Create a new DataFrame in the variable df_final that contains the combination of the two previously processed dataframes: df_no_categorical and df_encoded, in that order.

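A sketch of the combination, concatenating the two DataFrames column-wise:

```python
import pandas as pd

# Stack the numeric columns and the one-hot columns side by side, in that order
df_final = pd.concat([df_no_categorical, df_encoded], axis=1)
```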

codevalidated

Finally, separate the target variable `Transported` from the training data

Using df_final, which contains all our data correctly cleaned and prepared, create two new derivative variables:

The transported variable should be a Series containing ONLY the Transported column.

The df_train variable should be a dataframe containing ALL the columns in df_final, EXCEPT for the Transported column. This is equivalent to saying: "remove the Transported column from df_final and store the result in df_train".

Important: DO NOT modify df_final.
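A minimal sketch:

```python
# Series with only the target column
transported = df_final["Transported"]

# Every column except the target; drop returns a new DataFrame, so df_final is unchanged
df_train = df_final.drop(columns=["Transported"])
```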

input

Use a `GridSearchCV` to find the best `max_depth` parameter for a `RandomForestClassifier`

Given the RandomForestClassifier created (with random_state=42; important, don't change it!), instantiate a GridSearchCV to find the best possible value for max_depth, in the range 5 to 25.

According to our grid search, what's the best hyperparameter value for max_depth?
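A sketch of the search, assuming X_train and y_train are the training split used later in the project and that the 5-to-25 range is inclusive; cv=5 is also an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42)

# Candidate depths from 5 to 25, inclusive
param_grid = {"max_depth": range(5, 26)}

grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_["max_depth"])
```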

codevalidated

Create a `RandomForestClassifier` that achieves at least `0.8` in precision and at least `0.75` in recall

For this project, it'll be important to optimize our recall, as we are trying to save people from being transported to another galaxy. So, now create a RandomForestClassifier object in the variable model and train it with X_train and y_train.

You should select the correct hyperparameters to achieve a precision of at least 0.8 and a recall of at least 0.75.

Instantiate and train your model in the variable model.
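A sketch of what the final model might look like; max_depth=10 is only a placeholder for the value your grid search suggested, and X_test / y_test are assumed to be the held-out split from earlier in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Placeholder hyperparameter; use the best value found by your grid search
model = RandomForestClassifier(max_depth=10, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```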

Author

Santiago Basulto

This project is part of

Classification in Depth with Scikit-Learn
