Dealing with invalid values by statistical definitions
Data Cleaning with Pandas

In this project, you'll learn how to identify "invalid values" through statistical analysis. You'll identify and clean values that fall outside defined ranges, as well as outliers defined by statistical notions such as quantiles, the interquartile range (IQR), and standard deviations.

Project Activities

Clean the column `Rating` by removing invalid values

Ratings in the Google Play Store fall in the range 0 to 5. However, the histogram shows invalid values that lie outside this range.

Perform the selection of invalid values and store the results in the variable df_invalid_ratings.
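
As a minimal sketch of this range check, assuming `Rating` is already numeric: the toy DataFrame below is made up for illustration, and `df_clean` is a hypothetical name for the result after dropping the flagged rows.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.5, 7.0, -1.0, 0.0],
})

# A rating is invalid when it lies outside the closed range [0, 5].
# Comparisons with NaN are False, so missing ratings are not flagged here.
out_of_range = (df["Rating"] < 0) | (df["Rating"] > 5)
df_invalid_ratings = df[out_of_range]

# Hypothetical follow-up: keep only the rows that were not flagged.
df_clean = df.drop(df_invalid_ratings.index)
```

Building the mask first and indexing with it keeps the selection and the removal consistent with each other.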

Clean the dataset by removing rows with invalid installs and ratings

Because an app cannot reasonably have a rating greater than 0 without ever being installed, invalid values are defined as any app with a maximum install count of 0 and a rating above 0.

Perform the selection of invalid values and store the results in the variable df_invalid_install_ratings.
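
A rule that spans two columns can be written as a combined boolean mask. This is a sketch only: the column name `Maximum Installs` is an assumption (the text above does not name the column), and the toy rows are made up.

```python
import pandas as pd

# Toy data; "Maximum Installs" is an assumed column name.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Maximum Installs": [0, 0, 1500],
    "Rating": [4.2, 0.0, 3.9],
})

# Invalid: never installed, yet rated above 0.
# Each condition needs its own parentheses because & binds tighter than == and >.
mask = (df["Maximum Installs"] == 0) & (df["Rating"] > 0)
df_invalid_install_ratings = df[mask]

# Hypothetical follow-up: the cleaned dataset is everything the mask did not select.
df_clean = df[~mask]
```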

Clean the column `Installs` by removing invalid values

The world population is currently a little over 8 billion people, so invalid values are defined as any value greater than or equal to 9 billion.

Perform the selection of invalid values and store the results in the variable df_invalid_installs.
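
The threshold rule could be sketched as follows, assuming `Installs` has already been converted to a numeric column. The constant name `MAX_PLAUSIBLE_INSTALLS` and the toy values are illustrative, not part of the project.

```python
import pandas as pd

# Toy data; assumes Installs is numeric rather than a string like "10,000+".
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Installs": [1_000_000, 9_000_000_000, 12_000_000_000],
})

# Hypothetical constant for the cut-off described in the activity.
MAX_PLAUSIBLE_INSTALLS = 9_000_000_000

# ">=" because the activity counts values equal to 9 billion as invalid too.
df_invalid_installs = df[df["Installs"] >= MAX_PLAUSIBLE_INSTALLS]
df_clean = df[df["Installs"] < MAX_PLAUSIBLE_INSTALLS]
```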

Clean the column `Rating` by removing outliers

Take a look at the histogram in the Notebook. Based on it, outliers are defined as any values 3 or more standard deviations below or above the mean.

Perform the outlier identification and store the results in a new column df_rating['Rating_cleaned'].
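
One way to express the 3-standard-deviation rule is sketched below. The toy ratings are made up, and the sketch assumes that "cleaned" means outliers are replaced with NaN, which the text above does not spell out.

```python
import pandas as pd

# Toy data: 20 typical ratings plus one extreme value.
df_rating = pd.DataFrame({"Rating": [4.0] * 10 + [4.2] * 10 + [0.5]})

mean = df_rating["Rating"].mean()
std = df_rating["Rating"].std()  # pandas uses the sample standard deviation (ddof=1)

# "3 or more" standard deviations away, so the bounds themselves count as outliers.
lower, upper = mean - 3 * std, mean + 3 * std
is_outlier = (df_rating["Rating"] <= lower) | (df_rating["Rating"] >= upper)

# Assumption: keep non-outliers and replace outliers with NaN.
df_rating["Rating_cleaned"] = df_rating["Rating"].where(~is_outlier)
```

Note that on very small samples no value can be 3 standard deviations from the mean, so the rule only becomes meaningful once there are enough rows.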

Clean the column `Price` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3).

  • Note: Here, we are interested only in Paid apps, not Free apps.

Perform the outlier identification and store the results in a new column df_Price['Price_cleaned'].
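
The IQR rule for paid apps might look like the sketch below. It assumes a paid app is one with `Price` greater than 0 and that outliers are replaced with NaN in the cleaned column; both are assumptions, and the prices are made up.

```python
import pandas as pd

# Toy data: two free apps, five ordinary paid apps, one extreme price.
df = pd.DataFrame({"Price": [0.0, 0.0, 0.99, 1.99, 2.99, 3.99, 4.99, 399.99]})

# Assumption: "Paid" means Price > 0. copy() avoids modifying a view of df.
df_Price = df[df["Price"] > 0].copy()

q1 = df_Price["Price"].quantile(0.25)
q3 = df_Price["Price"].quantile(0.75)
iqr = q3 - q1

# Box-plot fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df_Price["Price"] < lower) | (df_Price["Price"] > upper)

# Assumption: outliers become NaN in the cleaned column.
df_Price["Price_cleaned"] = df_Price["Price"].where(~is_outlier)
```

Filtering to paid apps before computing the quartiles matters: the many zero prices of free apps would otherwise pull Q1 and Q3 toward 0.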

Clean the column `Size` by removing outliers

Take a look at the size counts in the Notebook. Based on them, outliers are defined as any value expressed in gigabytes (G).

Perform the outlier identification and store the results in a new column df['Size_cleaned'].
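
A sketch of the unit-based rule, assuming `Size` holds strings with a trailing unit letter such as `12M` or `1.2G` (the exact format is an assumption, and the sample values are made up). Outliers are replaced with NaN here, which is also an assumption.

```python
import pandas as pd

# Toy data with mixed units, a free-text value and a missing value.
df = pd.DataFrame({"Size": ["12M", "850k", "1.2G", "Varies with device", None]})

# na=False makes missing values count as "not gigabytes" instead of propagating NaN.
is_gigabyte = df["Size"].str.endswith("G", na=False)

# Assumption: sizes expressed in G are blanked out in the cleaned column.
df["Size_cleaned"] = df["Size"].where(~is_gigabyte)
```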

Clean the column `Released` by removing invalid values

Invalid values are defined as any value that contains a date in the future (later than now).

Perform the selection of invalid values and store the results in the variable df_invalid_release_date.
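
A possible way to flag future dates is sketched below. The ISO-style date strings are made up for illustration; the real column may use a different format, in which case the parsing step would need adjusting.

```python
import pandas as pd

# Toy data: one past date, one far-future date, one missing value.
df = pd.DataFrame({"Released": ["2019-01-05", "2100-03-01", None]})

# errors="coerce" turns anything unparseable (including missing values) into NaT.
released = pd.to_datetime(df["Released"], errors="coerce")

# Comparisons with NaT are False, so missing dates are not flagged as future dates.
now = pd.Timestamp.now()
df_invalid_release_date = df[released > now]
```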

Clean the column `Developer Email` by removing invalid values

Invalid values are defined as any email value that does not contain the @ character.

Perform the selection of invalid values and store the results in the variable invalid_emails.
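
The check for a missing `@` could be sketched like this. The addresses are made up, and the sketch assumes missing emails should not be counted as invalid, which the definition above leaves open.

```python
import pandas as pd

# Toy data with one malformed address and one missing value.
df = pd.DataFrame({
    "Developer Email": ["dev@example.com", "support.example.com", None, "a@b.org"],
})

emails = df["Developer Email"]

# regex=False treats "@" as a literal character.
has_at = emails.str.contains("@", regex=False, na=False)

# Assumption: only non-missing values without "@" count as invalid.
invalid_emails = df[emails.notna() & ~has_at]
```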

Clean the column `Size` by removing invalid values

Take a look at the histogram in the Notebook. You will find several different size units. Since mobile phones nowadays have a maximum storage of 1TB, let's define invalid values as any value greater than or equal to 1TB.

Perform the selection of invalid values and store the results in the variable df_invalid_size.

  • Note: Remember to consider only non-missing (not NA / NaN) values.
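
Comparing sizes across units requires converting them to a common unit first. The sketch below assumes sizes are strings ending in `k`, `M`, `G`, or `T` and uses decimal units (1 TB = 1,000,000 MB); the helper `size_to_mb`, the unit table, and the sample values are all hypothetical.

```python
import pandas as pd

# Assumed unit suffixes, expressed in megabytes (decimal units).
UNIT_TO_MB = {"k": 1 / 1000, "M": 1, "G": 1000, "T": 1_000_000}

def size_to_mb(size):
    """Convert a string such as '12M' or '1.5G' to megabytes; NaN if unparseable."""
    if not isinstance(size, str) or size[-1:] not in UNIT_TO_MB:
        return float("nan")
    try:
        return float(size[:-1].replace(",", "")) * UNIT_TO_MB[size[-1]]
    except ValueError:
        return float("nan")

# Toy data with mixed units, a missing value and a free-text value.
df = pd.DataFrame({
    "Size": ["12M", "850k", "1.5G", "1T", "2.5T", None, "Varies with device"],
})

# Consider only non-missing values, as the note above asks.
size_mb = df["Size"].dropna().map(size_to_mb)

# Invalid: 1 TB or more. NaN comparisons are False, so unparseable sizes are skipped.
df_invalid_size = df.loc[size_mb[size_mb >= UNIT_TO_MB["T"]].index]
```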

Clean the column `Installs` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values at or above the 95th percentile (>= 95th percentile).

Perform the outlier identification and store the results in a new column df_installs['Installs_cleaned'].
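
A percentile cut-off can be sketched with `Series.quantile`. The install counts below are made up, and replacing outliers with NaN in the cleaned column is an assumption.

```python
import pandas as pd

# Toy data: the integers 1 through 20 stand in for install counts.
df_installs = pd.DataFrame({"Installs": list(range(1, 21))})

# 95th percentile, using pandas' default linear interpolation.
p95 = df_installs["Installs"].quantile(0.95)

# ">=" because values equal to the percentile also count as outliers.
is_outlier = df_installs["Installs"] >= p95

# Assumption: outliers become NaN in the cleaned column.
df_installs["Installs_cleaned"] = df_installs["Installs"].where(~is_outlier)
```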

Clean the column `Category` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values 2.5 or more standard deviations below or above the mean.

Perform the outlier identification and store the results in a new series Category_outliers.
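
The text above does not say which numeric quantity of `Category` the mean and standard deviation refer to. The sketch below assumes it is the number of apps per category, which is only one plausible reading; the category labels and counts are made up.

```python
import pandas as pd

# Toy data: one dominant category and eleven small ones.
labels = ["Tools"] * 60
for i in range(11):
    labels += [f"Category {i}"] * 5
df = pd.DataFrame({"Category": labels})

# Assumption: the statistic of interest is the app count per category.
counts = df["Category"].value_counts()

# z-score of each category's count relative to all category counts.
z = (counts - counts.mean()) / counts.std()

# Categories whose count is 2.5 or more standard deviations from the mean.
Category_outliers = counts[z.abs() >= 2.5]
```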

Clean the column `Release Year` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.8 × IQR below the first quartile (Q1) or above the third quartile (Q3).

Perform the outlier identification and store the results in a new column df_release_year['Release_Year_cleaned'].
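
Since the IQR multiplier varies between activities, the fences can be wrapped in a small helper. `iqr_bounds` is a hypothetical function written for this sketch, the years are made up, and replacing outliers with NaN is an assumption.

```python
import pandas as pd

def iqr_bounds(series, k):
    """Return (lower, upper) fences at k * IQR below Q1 and above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data: one unusually early release year among recent ones.
df_release_year = pd.DataFrame({
    "Release Year": [2008, 2015, 2016, 2017, 2018, 2018, 2019, 2019, 2020, 2021],
})

# k=1.8 as stated in the activity, instead of the more common 1.5.
lower, upper = iqr_bounds(df_release_year["Release Year"], k=1.8)
years = df_release_year["Release Year"]
is_outlier = (years < lower) | (years > upper)

# Assumption: outliers become NaN in the cleaned column.
df_release_year["Release_Year_cleaned"] = years.where(~is_outlier)
```

A larger multiplier widens the fences, so fewer values are treated as outliers than with the conventional 1.5.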

Author

Mohamed Rawash

This project is part of

Data Cleaning with Pandas