All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Find the most common destination in the Destination column, and fill the null values with it. Perform the fixes in place, modifying the df variable itself. Don't worry if you break the DataFrame! Just reload it with the first line of the notebook.
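One possible approach, sketched on a small toy DataFrame (the values here are made up; in the notebook you'd work on the real df):

```python
import pandas as pd

# Toy stand-in for the notebook's df (hypothetical values).
df = pd.DataFrame({"Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", None]})

# mode() ignores nulls and returns a Series; take its first entry
# as the most common destination.
most_common = df["Destination"].mode()[0]

# Assign back to the column, so df itself is modified.
df["Destination"] = df["Destination"].fillna(most_common)
```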
Find the most common value for the VIP column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
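The same pattern works for VIP; again the data below is a toy stand-in for the real df:

```python
import pandas as pd

# Hypothetical sample of the VIP column, including a null.
df = pd.DataFrame({"VIP": [False, False, True, None]})

# Most common VIP value (nulls are ignored by mode()).
most_common_vip = df["VIP"].mode()[0]

# Fill the nulls and write the result back into df.
df["VIP"] = df["VIP"].fillna(most_common_vip)
```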
There are a few variables in the dataset that are independent and won't contribute to a good prediction for our final model. Which ones are they?
Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.
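A sketch of an in-place drop. Which columns count as independent is the activity's own question; PassengerId and Name below are just a plausible assumption (unique identifiers with no predictive signal), on toy data:

```python
import pandas as pd

# Hypothetical slice of df.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Name": ["Maham Ofracculy", "Juanna Vines"],
    "Age": [39, 24],
})

# inplace=True mutates df directly instead of returning a copy.
df.drop(columns=["PassengerId", "Name"], inplace=True)
```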
Drop any other row containing a null/NaN value. The final DataFrame should have NO null values whatsoever.
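Dropping the remaining null rows is a one-liner with dropna (toy data again):

```python
import pandas as pd
import numpy as np

# Hypothetical df with nulls left in two rows.
df = pd.DataFrame({
    "Age": [39, np.nan, 24],
    "VIP": [False, True, None],
})

# Remove every row that still contains at least one null, in place.
df.dropna(inplace=True)
```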
What categorical features should be encoded before training our model?
Given the features previously identified as categorical and that need to be encoded, encode them in a new DataFrame named df_encoded. IMPORTANT: do not modify df yet. df_encoded should be a brand-new DataFrame containing only the previously selected features encoded as one-hot values. Don't perform any name changes on the columns; for example, the encoded columns for CryoSleep will be CryoSleep_False and CryoSleep_True, and for HomePlanet they'll be HomePlanet_Earth, HomePlanet_Europa, and HomePlanet_Mars.
Important! You will (most likely) need to transform the VIP column to an object/string before encoding it. Use the .astype(str) method before encoding.
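One way to do this with pandas' get_dummies, on a toy df (the categorical column list is an assumption; use the features you identified):

```python
import pandas as pd

# Hypothetical df with categorical and non-categorical columns.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Earth"],
    "CryoSleep": [False, True, False],
    "VIP": [False, False, True],
    "Age": [39, 24, 58],
})

categorical = ["HomePlanet", "CryoSleep", "VIP"]

# Cast to str so boolean columns like VIP are treated as categories,
# then one-hot encode into a brand-new DataFrame. df is untouched.
df_encoded = pd.get_dummies(df[categorical].astype(str))
```

get_dummies keeps the default column naming, producing names like VIP_False and VIP_True, which is why the activity asks you not to rename anything.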
Now it's time to drop the original features you have previously identified as categorical and encoded. But, this time, don't remove them from df. Create a NEW variable named df_no_categorical that contains the result of the drop operation.
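Without inplace=True, drop returns a new DataFrame and leaves df alone. A sketch on toy data:

```python
import pandas as pd

# Hypothetical df; the categorical list is an assumption.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars"],
    "CryoSleep": [False, True],
    "VIP": [False, True],
    "Age": [39, 24],
})

categorical = ["HomePlanet", "CryoSleep", "VIP"]

# drop() without inplace returns a NEW DataFrame; df is not modified.
df_no_categorical = df.drop(columns=categorical)
```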
Create a new DataFrame in the variable df_final that contains the combination of the two previously processed DataFrames: df_no_categorical and df_encoded, in that order.
The result will look something like:
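The combination itself can be sketched with pandas' concat (toy frames with assumed column names):

```python
import pandas as pd

# Hypothetical stand-ins for the two processed DataFrames.
df_no_categorical = pd.DataFrame({"Age": [39, 24]})
df_encoded = pd.DataFrame({"VIP_False": [1, 0], "VIP_True": [0, 1]})

# Concatenate column-wise (axis=1), df_no_categorical first.
df_final = pd.concat([df_no_categorical, df_encoded], axis=1)
```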
Using df_final, which contains all our data correctly cleaned and prepared, create two new derivative variables. The transported variable should be a Series containing ONLY the Transported column. The df_train variable should be a DataFrame containing ALL the columns in df_final, EXCEPT for the Transported column. This is equivalent to saying: "remove the Transported column from df_final and store it in transported". Important: DO NOT modify df_final.
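A sketch of the split, on a toy df_final:

```python
import pandas as pd

# Hypothetical cleaned-and-prepared df_final.
df_final = pd.DataFrame({"Age": [39, 24], "Transported": [True, False]})

# Target Series: just the Transported column.
transported = df_final["Transported"]

# Feature DataFrame: everything except Transported.
# drop() returns a new DataFrame, so df_final is NOT modified.
df_train = df_final.drop(columns=["Transported"])
```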
Using the RandomForestClassifier created (with random_state=42; important, don't change it!), instantiate a GridSearchCV to find the best possible value for max_depth, in the range
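A sketch of the grid search. The activity's exact search range isn't shown here, so the range 1–10 below is an illustrative assumption, and synthetic data stands in for df_train and transported:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for df_train / transported.
X, y = make_classification(n_samples=200, random_state=42)

# Keep random_state=42 so results are reproducible.
rf = RandomForestClassifier(random_state=42)

# Search max_depth over an assumed range of 1..10 with 3-fold CV.
grid = GridSearchCV(rf, param_grid={"max_depth": list(range(1, 11))}, cv=3)
grid.fit(X, y)

best_depth = grid.best_params_["max_depth"]
```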
According to our grid search, what's the best hyperparameter value for max_depth?
For this project, it'll be important to optimize our recall, as we are trying to save people from being transported to another galaxy. So, now create a RandomForestClassifier object in the variable model and train it with df_train and transported.
You should select the correct hyperparameters to achieve a precision of at least
0.8 and a recall of at least
Instantiate and train your model in the variable model.
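A minimal sketch of the final step, again on synthetic data; max_depth=7 is a placeholder hyperparameter, not the answer to the activity, and in the real project you'd check precision and recall on a held-out split rather than on the training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in for df_train / transported.
X, y = make_classification(n_samples=300, random_state=42)

# Hypothetical hyperparameters; tune max_depth on your own data.
model = RandomForestClassifier(max_depth=7, random_state=42)
model.fit(X, y)

preds = model.predict(X)
precision = precision_score(y, preds)
recall = recall_score(y, preds)
```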