Dealing with invalid values by statistical definitions

Clean the column `Rating` by removing invalid values

Ratings in the Google Play Store fall in the range 0-5. However, by observing the histogram, you will find invalid values that lie outside this range.

Perform the selection of invalid values and store the results in the variable df_invalid_ratings.
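A minimal sketch of this selection, using a toy DataFrame in place of the notebook's data. The name `df` and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (assumed); the notebook loads the real Google Play Store dataset.
df = pd.DataFrame({"Rating": [4.5, 0.0, 5.0, 7.2, -1.0, None]})

# A rating is invalid when it falls outside the closed range [0, 5].
# Missing ratings compare as False on both sides, so they are not selected.
df_invalid_ratings = df[(df["Rating"] < 0) | (df["Rating"] > 5)]
```

Whether missing ratings should also count as invalid is a separate decision; this sketch leaves them out.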

Clean the dataset by removing rows with invalid installs and ratings

It is not reasonable for an app to have a rating greater than 0 without ever being installed, so invalid values are defined as any app that has maximum installs of 0 and a rating above 0.

Perform the selection of invalid values and store the results in the variable df_invalid_install_ratings.
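The two conditions can be combined with a boolean AND. The sketch below assumes the columns are named `Maximum Installs` and `Rating`; adjust the names to match the notebook.

```python
import pandas as pd

# Toy data (assumed column names and values).
df = pd.DataFrame({
    "Maximum Installs": [0, 0, 150, 0],
    "Rating": [0.0, 4.2, 3.9, 5.0],
})

# Invalid: never installed, yet rated above 0.
never_installed = df["Maximum Installs"] == 0
has_rating = df["Rating"] > 0
df_invalid_install_ratings = df[never_installed & has_rating]
```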

Clean the column `Installs` by removing invalid values

As the world population is currently below 9 billion people, invalid values are defined as any value greater than or equal to 9 billion.

Perform the selection of invalid values and store the results in the variable df_invalid_installs.
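A short sketch of the threshold check, assuming `Installs` has already been converted to a numeric column (if it is stored as text in the notebook, it needs converting first).

```python
import pandas as pd

# Toy data (assumed); Installs is taken to be numeric here.
df = pd.DataFrame({
    "Installs": [1_000, 9_000_000_000, 12_000_000_000, 8_999_999_999]
})

INSTALL_LIMIT = 9_000_000_000  # threshold stated in the task

# Invalid: install counts at or above the limit.
df_invalid_installs = df[df["Installs"] >= INSTALL_LIMIT]
```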

Clean the column `Rating` by removing outliers

Take a look at the histogram in the Notebook. Based on it, outliers are defined as any values that lie 3 or more standard deviations below or above the mean.

Perform the outlier identification and store the results in a new column df_rating['Rating_cleaned'].
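One possible reading of this step is to keep every non-outlier rating in the new column and leave outliers as NaN. The sketch below follows that reading on toy data; the sample values are assumptions, and the notebook may expect a different representation.

```python
import pandas as pd

# Toy data (assumed): twenty typical ratings and one extreme low value.
df_rating = pd.DataFrame({"Rating": [4.0] * 20 + [0.0]})

mean = df_rating["Rating"].mean()
std = df_rating["Rating"].std()  # sample standard deviation (ddof=1)
lower, upper = mean - 3 * std, mean + 3 * std

# Outliers sit 3 or more standard deviations away from the mean.
is_outlier = (df_rating["Rating"] <= lower) | (df_rating["Rating"] >= upper)

# Keep non-outliers; outliers become NaN in the cleaned column.
df_rating["Rating_cleaned"] = df_rating["Rating"].where(~is_outlier)
```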

Clean the column `Price` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.5 × IQR below the first quartile or above the third quartile.

Note: Here, we are interested only in paid apps, not free apps.

Perform the outlier identification and store the results in a new column df_Price['Price_cleaned'].
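The IQR rule can be sketched as follows. Treating "paid" as `Price > 0` is an assumption made for this toy example; the notebook may instead expose a dedicated flag column.

```python
import pandas as pd

# Toy data (assumed): free apps have a price of 0.
df = pd.DataFrame({"Price": [0.0, 0.99, 1.99, 2.99, 3.99, 399.99, 0.0]})

# Restrict to paid apps before computing the quartiles.
df_Price = df[df["Price"] > 0].copy()

q1 = df_Price["Price"].quantile(0.25)
q3 = df_Price["Price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values inside the fences are kept; outliers become NaN.
inside = df_Price["Price"].between(lower, upper)
df_Price["Price_cleaned"] = df_Price["Price"].where(inside)
```

Computing the quartiles after filtering matters: including the free apps would pull both quartiles toward 0 and change the fences.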

Clean the column `Size` by removing outliers

Take a look at the size counts in the Notebook. Based on them, outliers are defined as any value expressed in gigabytes (G).

Perform the outlier identification and store the results in a new column df['Size_cleaned'].
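A sketch of the unit check, assuming `Size` is stored as text with the unit as a trailing letter (for example `10M` or `1.2G`). The sample values are illustrative only.

```python
import pandas as pd

# Toy data (assumed format): number followed by a unit letter.
df = pd.DataFrame({
    "Size": ["10M", "1.2G", "512k", "Varies with device", None, "3.0G"]
})

# Outliers are sizes expressed in gigabytes; missing values are not flagged.
is_giga = df["Size"].str.endswith("G", na=False)

# Keep everything else; gigabyte-sized entries become NaN.
df["Size_cleaned"] = df["Size"].where(~is_giga)
```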

Clean the column `Released` by removing invalid values.

Invalid values are defined as any value that contains a date in the future (later than now).

Perform the selection of invalid values and store the results in the variable df_invalid_release_date.
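A sketch of the date comparison. The date format `Feb 26, 2020` is an assumption about how `Released` is stored; if the notebook has already parsed the column to datetimes, the conversion step can be skipped.

```python
import pandas as pd

# Toy data (assumed date format).
df = pd.DataFrame({
    "Released": ["Feb 26, 2020", "Jan 1, 2100", None, "Oct 9, 2018"]
})

# Parse to datetimes; unparseable or missing entries become NaT.
released = pd.to_datetime(df["Released"], format="%b %d, %Y", errors="coerce")

# Invalid: release dates later than the current moment.
# NaT compares as False, so missing dates are not selected.
now = pd.Timestamp.now()
df_invalid_release_date = df[released > now]
```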

Clean the column `Developer Email` by removing invalid values

Invalid values are defined as any value that does not contain @ in the email.

Perform the selection of invalid values and store the results in the variable invalid_emails.
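A sketch of the `@` check on toy data. Leaving missing emails out of the result is an assumption; the task text does not say how they should be treated.

```python
import pandas as pd

# Toy data (assumed values).
df = pd.DataFrame({
    "Developer Email": [
        "dev@example.com",
        "no-at-sign.example.com",
        None,
        "team@example.org",
    ]
})

emails = df["Developer Email"]

# Literal substring match; missing values are treated as "no @ found" here
# and then excluded explicitly with notna().
has_at = emails.str.contains("@", regex=False, na=False)
invalid_emails = df[emails.notna() & ~has_at]
```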

Clean the column `Size` by removing invalid values

Take a look at the histogram in the Notebook. You will find sizes expressed in different units. As mobile phones nowadays have a maximum storage of about 1 TB, let's define invalid values as any value greater than or equal to 1 TB.

Perform the selection of invalid values and store the results in the variable df_invalid_size.

Note: consider only non-missing values (exclude NA / NaN).
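Comparing against 1 TB requires bringing every size to a common unit first. The sketch below is one way to do that under several assumptions: sizes are text such as `10M` or `2,048G`, the unit letters are `k`, `M`, and `G`, and 1 TB is taken as 1024 × 1024 MB. The names `UNIT_TO_MB` and `ONE_TB_IN_MB` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Hypothetical conversion table: multiplier from each unit to megabytes.
UNIT_TO_MB = {"k": 1 / 1024, "M": 1.0, "G": 1024.0}
ONE_TB_IN_MB = 1024.0 * 1024.0

# Toy data (assumed format).
df = pd.DataFrame({
    "Size": ["10M", "1.5G", "2,048G", "Varies with device", None, "512k"]
})

# Work only on non-missing values, as the note requires.
sizes = df["Size"].dropna()

# Split "<number><unit>" into two columns; non-matching text yields NaN.
parts = sizes.str.replace(",", "", regex=False).str.extract(
    r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>[kMG])$"
)
size_mb = parts["value"].astype(float) * parts["unit"].map(UNIT_TO_MB)

# Invalid: sizes at or above 1 TB.
df_invalid_size = df.loc[size_mb[size_mb >= ONE_TB_IN_MB].index]
```

Entries that do not match the pattern, such as "Varies with device", end up as NaN and are never flagged.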

Clean the column `Installs` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values at or above the 95th percentile (>= 95th percentile).

Perform the outlier identification and store the results in a new column df_installs['Installs_cleaned'].
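A percentile cutoff can be sketched with `Series.quantile`. The toy values are assumptions, and `Installs` is taken to be numeric.

```python
import pandas as pd

# Toy data (assumed): one value far above the rest.
df_installs = pd.DataFrame({
    "Installs": [10, 50, 100, 500, 1_000, 5_000, 10_000, 50_000,
                 100_000, 1_000_000_000]
})

# 95th percentile, using pandas' default linear interpolation.
p95 = df_installs["Installs"].quantile(0.95)

# Keep values strictly below the cutoff; values >= p95 become NaN.
below_cutoff = df_installs["Installs"] < p95
df_installs["Installs_cleaned"] = df_installs["Installs"].where(below_cutoff)
```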

Clean the column `Category` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values that lie 2.5 or more standard deviations below or above the mean.

Perform the outlier identification and store the results in a new series Category_outliers.
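`Category` is a text column, so a mean and standard deviation only make sense for a number derived from it. The sketch below assumes that number is the count of apps per category; this is an interpretation, not something the task states, so check it against the box plot in the Notebook.

```python
import pandas as pd

# Toy data (assumed): one very large category and nine small ones.
categories = ["Games"] * 40 + [
    f"Category {i}" for i in range(9) for _ in range(2)
]
df = pd.DataFrame({"Category": categories})

# Assumption: the statistic of interest is the number of apps per category.
counts = df["Category"].value_counts()

mean, std = counts.mean(), counts.std()
lower, upper = mean - 2.5 * std, mean + 2.5 * std

# Categories whose counts sit 2.5 or more standard deviations from the mean.
Category_outliers = counts[(counts <= lower) | (counts >= upper)]
```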

Clean the column `Release Year` by removing outliers

Take a look at the box plot in the Notebook. Based on it, outliers are defined as any values more than 1.8 × IQR below the first quartile or above the third quartile.

Perform the outlier identification and store the results in a new column df_release_year['Release_Year_cleaned'].
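Because the IQR multiplier varies between activities, it can be convenient to wrap the fence calculation in a small function. `iqr_bounds` below is a hypothetical helper written for this sketch, and the toy years are assumptions.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float) -> tuple[float, float]:
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (assumed): one implausibly early release year.
df_release_year = pd.DataFrame({
    "Release Year": [2016, 2017, 2018, 2018, 2019, 2019, 2020, 2021, 1970]
})

lower, upper = iqr_bounds(df_release_year["Release Year"], k=1.8)

# Years inside the fences are kept; outliers become NaN.
inside = df_release_year["Release Year"].between(lower, upper)
df_release_year["Release_Year_cleaned"] = (
    df_release_year["Release Year"].where(inside)
)
```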

Mohamed Rawash
