Data Cleaning with Pandas

In this project you'll learn how to identify "invalid values" using statistical analysis. You'll learn to identify and clean values that fall outside defined ranges, as well as outliers defined by different statistical notions (such as quantiles and IQRs).

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.


Ratings in the Google Play Store fall in the range 0-5. However, by observing the histogram, you will find invalid values that lie outside this range.

Perform the selection of invalid values and store the results in the variable `df_invalid_ratings`.
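As a minimal sketch of this kind of range check, the snippet below builds a toy DataFrame and selects the rows whose rating falls outside 0-5. The column names `App` and `Rating` and the sample values are assumptions for illustration; the real dataset may name them differently.

```python
import pandas as pd

# Toy data; the column names "App" and "Rating" are assumptions.
df = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.5, -1.0, 5.0, 7.2],
})

# A rating is invalid when it lies below 0 or above 5.
# Missing ratings (NaN) compare as False and are therefore not selected.
out_of_range = (df["Rating"] < 0) | (df["Rating"] > 5)
df_invalid_ratings = df[out_of_range]

print(df_invalid_ratings)
```

Writing the two comparisons explicitly keeps missing ratings out of the result, which `~df["Rating"].between(0, 5)` would not do.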


It is not reasonable for an app to have a rating greater than 0 without ever being installed, so invalid values are defined as any app that has a maximum installs value of 0 and a rating above 0.

Perform the selection of invalid values and store the results in the variable `df_invalid_install_ratings`.
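A check that involves two columns can be written by combining boolean masks with `&`. The sketch below assumes hypothetical column names `Maximum Installs` and `Rating` and uses made-up rows.

```python
import pandas as pd

# Toy data; the column names are assumptions about the dataset.
df = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Maximum Installs": [0, 0, 100],
    "Rating": [0.0, 4.2, 3.9],
})

# Both conditions must hold: never installed, yet rated above 0.
# Each comparison needs its own parentheses because & binds tighter than == and >.
mask = (df["Maximum Installs"] == 0) & (df["Rating"] > 0)
df_invalid_install_ratings = df[mask]

print(df_invalid_install_ratings)
```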


As the world's population is currently around 8 billion people, any number of installs greater than or equal to 9 billion is defined as invalid.

Perform the selection of invalid values and store the results in the variable `df_invalid_installs`.


Take a look at the histogram in the Notebook. Based on it, outliers are defined as any values 3 or more standard deviations (std) to the left or right of the mean.

Perform the outlier identification and store the results in a new column `df_rating['Rating_cleaned']`.
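One possible reading of this rule, sketched on toy data: compute the mean and standard deviation, flag values at least 3 standard deviations away, and write a cleaned column in which the flagged values are replaced by NaN. Replacing outliers with NaN is an assumption; the activity may instead expect the rows to be dropped or handled differently.

```python
import pandas as pd

# Toy data: thirty typical ratings plus one extreme value (made up for illustration).
df_rating = pd.DataFrame({"Rating": [4.0] * 30 + [50.0]})

mean = df_rating["Rating"].mean()
std = df_rating["Rating"].std()  # sample standard deviation (ddof=1)

# Bounds located 3 standard deviations on either side of the mean.
lower, upper = mean - 3 * std, mean + 3 * std

# "3 or more std" is read here as an inclusive comparison.
is_outlier = (df_rating["Rating"] <= lower) | (df_rating["Rating"] >= upper)

# Assumption: outliers become NaN in the cleaned column.
df_rating["Rating_cleaned"] = df_rating["Rating"].mask(is_outlier)

print(df_rating.tail())
```

Note that a single extreme value inflates the standard deviation itself, so on small samples this rule can fail to flag anything.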


Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.5 × IQR to the left of the first quartile or to the right of the third quartile.

- Note: Here, we are interested only in Paid apps, not Free apps.

Perform the outlier identification and store the results in a new column `df_Price['Price_cleaned']`.
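The IQR rule can be sketched as follows. The snippet assumes a numeric `Price` column in which free apps have a price of 0, and it marks outliers as NaN in the cleaned column; both choices are assumptions made for illustration.

```python
import pandas as pd

# Toy data; a price of 0 stands for a free app (assumption).
df = pd.DataFrame({"Price": [0.0, 0.0, 0.99, 1.99, 2.99, 3.99, 4.99, 399.99]})

# Keep only paid apps; .copy() avoids modifying a view of df.
df_Price = df[df["Price"] > 0].copy()

q1 = df_Price["Price"].quantile(0.25)
q3 = df_Price["Price"].quantile(0.75)
iqr = q3 - q1

# Fences located 1.5 * IQR beyond the first and third quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df_Price["Price"] < lower) | (df_Price["Price"] > upper)

# Assumption: outliers become NaN in the cleaned column.
df_Price["Price_cleaned"] = df_Price["Price"].mask(is_outlier)

print(df_Price)
```

The same pattern works for other multipliers by replacing `1.5` with the desired factor.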


Take a look at the size counts in the Notebook. Based on them, outliers are defined as any value given in gigabytes (G).

Perform the outlier identification and store the results in a new column `df['Size_cleaned']`.
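If sizes are stored as strings with a unit suffix, a string test can flag the gigabyte entries. The sketch below assumes values such as `"10M"` or `"1.5G"` in a `Size` column; the exact format in the real dataset is not confirmed here.

```python
import pandas as pd

# Toy data; the string format with a trailing unit letter is an assumption.
df = pd.DataFrame({"Size": ["10M", "1.5G", "500k", None, "Varies with device"]})

# True where the size string ends with "G"; missing values count as False.
is_gb = df["Size"].str.endswith("G", na=False)

# Assumption: gigabyte-sized entries become NaN in the cleaned column.
df["Size_cleaned"] = df["Size"].mask(is_gb)

print(df)
```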


Invalid values are defined as any value that contains a date in the future (later than now).

Perform the selection of invalid values and store the results in the variable `df_invalid_release_date`.
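A date check of this kind usually starts by parsing the column into datetimes and then comparing against the current time. The column name `Released` and the ISO date strings below are assumptions for illustration.

```python
import pandas as pd

now = pd.Timestamp.now()

# Build one date that is guaranteed to lie in the future relative to "now".
future_date = (now + pd.Timedelta(days=30)).strftime("%Y-%m-%d")

# Toy data; the column name "Released" is an assumption.
df = pd.DataFrame({"Released": ["2015-01-05", future_date, None]})

# Unparseable or missing dates become NaT instead of raising an error.
released = pd.to_datetime(df["Released"], errors="coerce")

# NaT compares as False, so missing dates are not selected.
df_invalid_release_date = df[released > now]

print(df_invalid_release_date)
```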


Invalid values are defined as any email value that does not contain `@`.

Perform the selection of invalid values and store the results in the variable `invalid_emails`.
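A substring test is enough for this rule. In the sketch below, the column name `Developer Email` is an assumption, and missing emails are deliberately not flagged, since they are absent rather than malformed; the activity may treat them differently.

```python
import pandas as pd

# Toy data; the column name "Developer Email" is an assumption.
df = pd.DataFrame({
    "Developer Email": ["dev@example.com", "no-at-sign.example.com", None],
})

# regex=False treats "@" as a literal character.
# na=True marks missing emails as "contains @", so they are not flagged below.
has_at = df["Developer Email"].str.contains("@", regex=False, na=True)

invalid_emails = df[~has_at]

print(invalid_emails)
```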


Take a look at the histogram in the Notebook. By analyzing it, you will find different size units. As mobile phones nowadays have a maximum storage of 1 TB, let's define invalid values as any value greater than or equal to 1 TB.

Perform the selection of invalid values and store the results in the variable `df_invalid_size`.

- Note: remember to consider only non-missing values (not NA / NaN).
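Comparing sizes across units requires converting them to a common unit first. The sketch below is one way to do that under stated assumptions: sizes are strings ending in `k`, `M`, `G` or `T`, units are binary multiples of 1024, and the helper `size_to_mb` and the table `UNIT_TO_MB` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical conversion table: multiplier from each unit suffix to megabytes.
UNIT_TO_MB = {"k": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def size_to_mb(size):
    """Convert a string such as '1.5G' to megabytes; return NaN if it cannot be parsed."""
    if not isinstance(size, str) or size[-1:] not in UNIT_TO_MB:
        return float("nan")
    try:
        return float(size[:-1].replace(",", "")) * UNIT_TO_MB[size[-1]]
    except ValueError:
        return float("nan")


# Toy data; the string format is an assumption about the dataset.
df = pd.DataFrame({"Size": ["10M", "2T", "1.5G", None, "Varies with device"]})

size_mb = df["Size"].map(size_to_mb)

# Consider only non-missing sizes, then keep those of at least 1 TB.
one_tb_in_mb = UNIT_TO_MB["T"]
df_invalid_size = df[size_mb.notna() & (size_mb >= one_tb_in_mb)]

print(df_invalid_size)
```

The explicit `notna()` mirrors the note above, although comparisons against NaN already evaluate to False.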


Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values at or to the right of the 95th percentile (>= 95th percentile).

Perform the outlier identification and store the results in a new column `df_installs['Installs_cleaned']`.
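A percentile cutoff can be computed with `Series.quantile`. The sketch below uses made-up install counts from 1 to 100 and, as an assumption, replaces the flagged values with NaN in the cleaned column.

```python
import pandas as pd

# Toy data: install counts 1..100 (made up for illustration).
df_installs = pd.DataFrame({"Installs": list(range(1, 101))})

# Value below which roughly 95% of the observations fall (linear interpolation by default).
p95 = df_installs["Installs"].quantile(0.95)

# Values at or above the 95th percentile are treated as outliers.
is_outlier = df_installs["Installs"] >= p95

# Assumption: outliers become NaN in the cleaned column.
df_installs["Installs_cleaned"] = df_installs["Installs"].mask(is_outlier)

print(df_installs.tail(8))
```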


Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values 2.5 or more standard deviations (std) to the left or right of the mean.

Perform the outlier identification and store the results in a new series `Category_outliers`.


Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.8 × IQR to the left of the first quartile or to the right of the third quartile.

Perform the outlier identification and store the results in a new column `df_release_year['Release_Year_cleaned']`.
