All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Find the most common destination in the Destination column, and fill the null values with it. Perform the fixes in place, modifying the df variable itself. Don't worry if you break the DataFrame! Just reload it with the first line of the notebook.
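One possible approach, sketched on a small toy DataFrame (the values here are made up; in the notebook you'd work on the real df):

```python
import pandas as pd

# Toy stand-in for the notebook's df (hypothetical values).
df = pd.DataFrame({"Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", None]})

# mode() ignores nulls and returns a Series; take its first entry
# as the most common destination.
most_common = df["Destination"].mode()[0]

# Assign back to the column, so df itself is modified.
df["Destination"] = df["Destination"].fillna(most_common)
```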
Find the most common value for the VIP column, and fill the null values with it. Perform the fixes in place, modifying the df variable.
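The same pattern works for VIP; again the data below is a toy stand-in for the real df:

```python
import pandas as pd

# Hypothetical sample of the VIP column, including a null.
df = pd.DataFrame({"VIP": [False, False, True, None]})

# Most common VIP value (nulls are ignored by mode()).
most_common_vip = df["VIP"].mode()[0]

# Fill the nulls and write the result back into df.
df["VIP"] = df["VIP"].fillna(most_common_vip)
```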
There are a few variables in the dataset that are independent and won't contribute to a good prediction for our final model. Which ones are they?
Drop the columns that you have previously identified as independent. Perform your drop in place, modifying the df variable. If you have made a mistake, restart your notebook from the beginning.
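A sketch of an in-place drop. Which columns count as independent is the activity's own question; PassengerId and Name below are just a plausible assumption (unique identifiers with no predictive signal), on toy data:

```python
import pandas as pd

# Hypothetical slice of df.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Name": ["Maham Ofracculy", "Juanna Vines"],
    "Age": [39, 24],
})

# inplace=True mutates df directly instead of returning a copy.
df.drop(columns=["PassengerId", "Name"], inplace=True)
```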
Drop any other row containing a null/NaN value. The final DataFrame should have NO null values whatsoever.
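Dropping the remaining null rows is a one-liner with dropna (toy data again):

```python
import pandas as pd
import numpy as np

# Hypothetical df with nulls left in two rows.
df = pd.DataFrame({
    "Age": [39, np.nan, 24],
    "VIP": [False, True, None],
})

# Remove every row that still contains at least one null, in place.
df.dropna(inplace=True)
```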
What categorical features should be encoded before training our model?
Given the features previously identified as categorical and that need to be encoded, encode them in a new DataFrame named df_encoded. IMPORTANT: do not modify df yet. df_encoded should be a brand-new DataFrame containing only the previously selected features encoded as one-hot values. Don't perform any name changes on the columns; for example, the encoded columns for CryoSleep will be CryoSleep_False and CryoSleep_True, and for HomePlanet they'll be HomePlanet_Earth, HomePlanet_Europa, and HomePlanet_Mars.
Important! You will (most likely) need to transform the VIP column to an object/string before encoding it. Use the .astype(str) method before encoding.
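One way to do this with pandas' get_dummies, on a toy df (the categorical column list is an assumption; use the features you identified):

```python
import pandas as pd

# Hypothetical df with categorical and non-categorical columns.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Earth"],
    "CryoSleep": [False, True, False],
    "VIP": [False, False, True],
    "Age": [39, 24, 58],
})

categorical = ["HomePlanet", "CryoSleep", "VIP"]

# Cast to str so boolean columns like VIP are treated as categories,
# then one-hot encode into a brand-new DataFrame. df is untouched.
df_encoded = pd.get_dummies(df[categorical].astype(str))
```

get_dummies keeps the default column naming, producing names like VIP_False and VIP_True, which is why the activity asks you not to rename anything.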
Now it's time to drop the original features you have previously identified as categorical and encoded. But, this time, don't remove them from df. Create a NEW variable named df_no_categorical that contains the result of the drop operation.
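Without inplace=True, drop returns a new DataFrame and leaves df alone. A sketch on toy data:

```python
import pandas as pd

# Hypothetical df; the categorical list is an assumption.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars"],
    "CryoSleep": [False, True],
    "VIP": [False, True],
    "Age": [39, 24],
})

categorical = ["HomePlanet", "CryoSleep", "VIP"]

# drop() without inplace returns a NEW DataFrame; df is not modified.
df_no_categorical = df.drop(columns=categorical)
```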
Create a new DataFrame in the variable df_final that contains the combination of the two previously processed DataFrames: df_no_categorical and df_encoded, in that order.
The result will look something like:
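The combination itself can be sketched with pandas' concat (toy frames with assumed column names):

```python
import pandas as pd

# Hypothetical stand-ins for the two processed DataFrames.
df_no_categorical = pd.DataFrame({"Age": [39, 24]})
df_encoded = pd.DataFrame({"VIP_False": [1, 0], "VIP_True": [0, 1]})

# Concatenate column-wise (axis=1), df_no_categorical first.
df_final = pd.concat([df_no_categorical, df_encoded], axis=1)
```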
Using df_final, which contains all our data correctly cleaned and prepared, create two new derivative variables. The transported variable should be a Series containing ONLY the Transported column. The df_train variable should be a DataFrame containing ALL the columns in df_final, EXCEPT for the Transported column. This is equivalent to saying: "remove the Transported column from df_final and store it in transported". Important: DO NOT modify df_final.
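A sketch of the split, on a toy df_final:

```python
import pandas as pd

# Hypothetical cleaned-and-prepared df_final.
df_final = pd.DataFrame({"Age": [39, 24], "Transported": [True, False]})

# Target Series: just the Transported column.
transported = df_final["Transported"]

# Feature DataFrame: everything except Transported.
# drop() returns a new DataFrame, so df_final is NOT modified.
df_train = df_final.drop(columns=["Transported"])
```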
Using the RandomForestClassifier created (with random_state=42; important, don't change it!), instantiate a GridSearchCV to find the best possible value for max_depth, in the range
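A sketch of the grid search. The activity's exact search range isn't shown here, so the range 1–10 below is an illustrative assumption, and synthetic data stands in for df_train and transported:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for df_train / transported.
X, y = make_classification(n_samples=200, random_state=42)

# Keep random_state=42 so results are reproducible.
rf = RandomForestClassifier(random_state=42)

# Search max_depth over an assumed range of 1..10 with 3-fold CV.
grid = GridSearchCV(rf, param_grid={"max_depth": list(range(1, 11))}, cv=3)
grid.fit(X, y)

best_depth = grid.best_params_["max_depth"]
```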
According to our grid search, what's the best hyperparameter value for max_depth?
For this project, it'll be important to optimize our recall, as we are trying to save people from being transported to another galaxy. So, now create a RandomForestClassifier object in the variable model and train it with df_train and transported.
You should select the correct hyperparameters to achieve a precision of at least
0.8 and a recall of at least
Instantiate and train your model in the variable model.
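A minimal sketch of the final step, again on synthetic data; max_depth=7 is a placeholder hyperparameter, not the answer to the activity, and in the real project you'd check precision and recall on a held-out split rather than on the training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in for df_train / transported.
X, y = make_classification(n_samples=300, random_state=42)

# Hypothetical hyperparameters; tune max_depth on your own data.
model = RandomForestClassifier(max_depth=7, random_state=42)
model.fit(X, y)

preds = model.predict(X)
precision = precision_score(y, preds)
recall = recall_score(y, preds)
```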