All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
To initiate the data cleaning process, it's important to understand where the data is missing in your dataset.
Your task is to find the total number of null values in the dataset. Please choose the correct method or code from the provided options to accomplish this.
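As a rough illustration of how null values can be counted in pandas, the sketch below uses a small made-up DataFrame (the values are hypothetical and not taken from the activity's dataset). It shows per-column counts first, then one grand total.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the activity's dataset (hypothetical values).
df = pd.DataFrame({
    "Rank": [1, 2, np.nan, 4],
    "Subscribers": [1000.0, np.nan, np.nan, 4000.0],
    "Country": ["US", None, "IN", "BR"],
})

# isnull() flags every missing cell as True.
# The first sum() counts missing cells per column.
missing_per_column = df.isnull().sum()

# Summing the per-column counts gives the total for the whole DataFrame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)  # 4 for this toy data
```

The per-column counts are also a quick way to see which column is missing the most data.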
Based on the results of Activity 1, which column do you think has the most missing values?
Next, analyze the dataset for non-null values.
Your task is to choose the option from the provided choices that correctly determines the count of non-null values in the Subscribers column.
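For reference, here is a minimal sketch of two common ways to count non-null values in a single column. The DataFrame is a hypothetical stand-in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical Subscribers values; two of the five are missing.
df = pd.DataFrame({"Subscribers": [1000.0, np.nan, 2500.0, np.nan, 4000.0]})

# notnull() flags present values as True; sum() counts the True flags.
non_null_via_flags = int(df["Subscribers"].notnull().sum())

# count() skips missing values by definition, so it gives the same number.
non_null_via_count = int(df["Subscribers"].count())

print(non_null_via_flags, non_null_via_count)  # 3 3
```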
Using the dropna() function, remove all rows that are entirely empty.
Store the result in the dataframe df_cleaned.
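One way this could look, sketched on a toy DataFrame with made-up values: passing how="all" to dropna() removes only the rows in which every column is missing, and leaves partially filled rows in place.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 3 are entirely empty, row 2 is only partly empty.
df = pd.DataFrame({
    "Rank": [1.0, np.nan, 3.0, np.nan],
    "Country": ["US", None, None, None],
})

# how="all" drops a row only when all of its values are missing.
df_cleaned = df.dropna(how="all")

print(df_cleaned)
```

Note that the default, how="any", is much stricter: it would also drop row 2 here because one of its values is missing.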
The result should match the following output:
Using dropna(), drop columns where more than 50% of the data is missing.
Store your results in the df_cleaned_column variable.
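A possible approach, sketched on hypothetical data: the thresh parameter of dropna() is the minimum number of non-null values a column needs in order to be kept. Setting it to half the row count (rounded up) drops exactly the columns that are more than 50% missing. The Notes column and all values below are invented for illustration.

```python
import math

import numpy as np
import pandas as pd

# Toy data with four rows (hypothetical columns and values).
df = pd.DataFrame({
    "Rank": [1.0, 2.0, 3.0, 4.0],                    # 0% missing
    "Average Views": [10.0, np.nan, np.nan, 40.0],   # exactly 50% missing
    "Notes": [np.nan, np.nan, np.nan, "x"],          # 75% missing
})

# A column survives if at least half of its values are present.
min_non_null = math.ceil(len(df) * 0.5)

# axis=1 applies the rule to columns instead of rows.
df_cleaned_column = df.dropna(axis=1, thresh=min_non_null)

print(list(df_cleaned_column.columns))  # ['Rank', 'Average Views']
```

A column that is exactly 50% missing is kept under this rule, since the task only targets columns with more than 50% missing.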
The result should match the following output:
After removing columns with excessive missing values, use isnull() to identify which column now has the fewest missing values.
Write down the column name below.
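As a sketch of how that lookup might be done, the example below counts missing values per column and picks the smallest count with idxmin(). The DataFrame is a hypothetical stand-in for df_cleaned_column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame left after dropping sparse columns.
df_cleaned_column = pd.DataFrame({
    "Rank": [1.0, np.nan, 3.0, 4.0],                  # 1 missing
    "Subscribers": [np.nan, np.nan, 300.0, 400.0],    # 2 missing
    "Category": ["Music", "Games", "News", "Music"],  # 0 missing
})

# Count missing values in each column.
missing_counts = df_cleaned_column.isnull().sum()

# idxmin() returns the label of the smallest count.
fewest_missing = missing_counts.idxmin()

print(missing_counts.sort_values())
print(fewest_missing)  # Category
```

If several columns tie for the smallest count, idxmin() returns only the first of them, so printing the sorted counts is a useful cross-check.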
Use mean imputation to handle missing values in the Rank column.
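Mean imputation replaces each missing value with the average of the values that are present. A minimal sketch, using made-up Rank values:

```python
import numpy as np
import pandas as pd

# Hypothetical Rank values with two gaps.
df = pd.DataFrame({"Rank": [1.0, 2.0, np.nan, 4.0, np.nan]})

# mean() ignores missing values, so this is (1 + 2 + 4) / 3.
rank_mean = df["Rank"].mean()

# Assign the result back to the column rather than filling in place.
df["Rank"] = df["Rank"].fillna(rank_mean)

print(df)
```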
The result should match the following output:
Why is mean imputation suitable for the Rank column?
Use median imputation to handle missing values in the Average Comments column.
Median imputation gives a more typical representation of the general comment volume because the median is not pulled up or down by extreme outliers.
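To make the outlier point concrete, the sketch below compares the mean and the median on a hypothetical Average Comments column that contains one extreme value, then fills the gap with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical comment counts; 5000 is a deliberate outlier.
df = pd.DataFrame({"Average Comments": [10.0, 12.0, np.nan, 11.0, 5000.0]})

comments_mean = df["Average Comments"].mean()      # dragged up by the outlier
comments_median = df["Average Comments"].median()  # stays near typical values

print(comments_mean)    # 1258.25
print(comments_median)  # 11.5

# Fill the missing value with the median.
df["Average Comments"] = df["Average Comments"].fillna(comments_median)
```

With the mean, the filled value would be over a hundred times larger than any typical row in this toy data.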
The result should match the following output:
Apply forward fill to handle missing values in the Country column.
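Forward fill copies the last seen value down into the gaps that follow it. The sketch below uses hypothetical Country values and the ffill() method.

```python
import pandas as pd

# Hypothetical Country values; the first row and two later rows are missing.
df = pd.DataFrame({"Country": [None, "US", None, "IN", None]})

# ffill() propagates the previous non-missing value forward.
df["Country"] = df["Country"].ffill()

print(df)
```

The first row stays missing in this example because there is no earlier value to copy from.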
The result should match the following output:
Apply backward fill to handle any remaining missing values in the Country column.
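Backward fill works in the opposite direction, copying the next seen value up into the gaps before it. As a sketch on made-up data, applying it after a forward fill covers a gap at the very top of the column:

```python
import pandas as pd

# Hypothetical Country values with a gap in the first row.
df = pd.DataFrame({"Country": [None, "US", None, "IN", None]})

# Forward fill first; the leading gap has no earlier value and stays empty.
forward_filled = df["Country"].ffill()

# Backward fill then pulls the next available value up into that gap.
df["Country"] = forward_filled.bfill()

print(df["Country"].tolist())  # ['US', 'US', 'US', 'IN', 'IN']
```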
The result should match the following output:
Why might combining forward and backward fill be beneficial?
Apply linear interpolation using interpolate() to estimate and fill missing values in the Average Views column.
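Linear interpolation fills a gap by placing evenly spaced values between the known points on either side. A minimal sketch with hypothetical Average Views values:

```python
import numpy as np
import pandas as pd

# Hypothetical view counts with two consecutive gaps between 100 and 400.
df = pd.DataFrame({"Average Views": [100.0, np.nan, np.nan, 400.0, 500.0]})

# method="linear" treats rows as equally spaced and draws a straight line
# between the neighbouring known values.
df["Average Views"] = df["Average Views"].interpolate(method="linear")

print(df["Average Views"].tolist())  # [100.0, 200.0, 300.0, 400.0, 500.0]
```

This assumes the row order is meaningful; on an arbitrarily ordered dataset, the interpolated values have no particular interpretation.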
The result should match the following output:
First apply forward fill, then impute remaining missing values in Subscribers using the mode.
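The two steps could be chained roughly as follows. The Subscribers values are invented; the leading gap is there to show what forward fill leaves behind for the mode to handle.

```python
import numpy as np
import pandas as pd

# Hypothetical Subscribers values with a gap in the first row.
df = pd.DataFrame({"Subscribers": [np.nan, 1000.0, np.nan, 1000.0, 2500.0]})

# Step 1: forward fill. The first row has no earlier value and stays missing.
df["Subscribers"] = df["Subscribers"].ffill()

# Step 2: mode() returns a Series (there can be ties), so take its first entry.
subscribers_mode = df["Subscribers"].mode().iloc[0]
df["Subscribers"] = df["Subscribers"].fillna(subscribers_mode)

print(df["Subscribers"].tolist())  # [1000.0, 1000.0, 1000.0, 1000.0, 2500.0]
```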
The result should match the following output:
Your task is to fill the missing Category values with a new category named Unknown.
This strategy keeps the handling of missing data clear and simple while preserving the integrity of your analysis by marking unknown values explicitly.
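A short sketch of this placeholder approach on hypothetical Category values; the same pattern applies to any other categorical column.

```python
import pandas as pd

# Hypothetical Category values with two gaps.
df = pd.DataFrame({"Category": ["Music", None, "Games", None]})

# Replace every missing entry with the explicit label "Unknown".
df["Category"] = df["Category"].fillna("Unknown")

print(df["Category"].value_counts())
```

Because the placeholder is an ordinary value, the rows that were originally missing remain easy to count or filter later.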
The result should match the following output:
Fill the missing values in the Content Type column with a new category named Unknown.
The result should match the following output: