Capstone Project: Cleaning Google Playstore data

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

Which of the following column(s) has/have null values?

Select the columns that you have identified having null/missing values. We encourage you to use the missingno library.
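As a minimal sketch of how the null check could be done with plain pandas, the snippet below counts missing values per column. The tiny dataframe is hypothetical sample data, not the real dataset; missingno's `matrix(df)` plot shows the same information visually.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the real df has more rows and columns.
df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", None, "Paid"],
    "Reviews": ["10", "5M", "3k"],
})

# Count the missing values per column, then keep the columns with at least one.
null_counts = df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)  # ['Rating', 'Type']
```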

Clean the `Rating` column and the other columns containing null values

This is a 3-part activity:

  • Remove the invalid values from Rating (if any) by setting them to NaN.
  • Fill the null values in the Rating column with the column mean, using mean().
  • Clean the remaining non-numeric columns by simply dropping their null values.

Perform the modifications "in place", modifying df. If you make a mistake, re-load the data.
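The three steps above can be sketched as follows. Two assumptions are made here that the activity does not state: a rating above 5 is treated as invalid, and "dropping the values" is read as dropping the rows that still contain nulls. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 19.0 stands in for an invalid rating.
df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, 19.0, np.nan, 2.0],
    "Type": ["Free", "Free", "Paid", None],
})

# Step 1 (assumption: ratings above 5 are invalid): set them to NaN.
df.loc[df["Rating"] > 5, "Rating"] = np.nan

# Step 2: fill the null ratings with the mean of the remaining valid ratings.
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Step 3 (assumption): drop the rows that still hold nulls in other columns.
df.dropna(inplace=True)
```

Filling with the mean after step 1 matters: computing the mean before the invalid value is removed would skew the fill value.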

Clean the column `Reviews` and make it numeric

You'll notice that some columns in this dataframe that should be numeric were parsed as object (string). That's because some numbers are written with an M or a k suffix to indicate mega (millions) or kilo (thousands).

Clean the Reviews column by transforming the values to the correct numeric representation. For example, 5M should be 5000000.
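One possible way to do the conversion is a small helper that looks at the last character of each value. The helper name `parse_count` and the sample values are hypothetical, and the sketch assumes every value is a non-empty string.

```python
import pandas as pd

# Hypothetical sample values for the Reviews column.
df = pd.DataFrame({"Reviews": ["159", "5M", "1.5k", "3.0M"]})

MULTIPLIERS = {"k": 1_000, "M": 1_000_000}

def parse_count(value):
    """Convert strings such as '5M' or '1.5k' to an integer count."""
    value = str(value).strip()
    suffix = value[-1]
    if suffix in MULTIPLIERS:
        # Round to guard against floating point noise, e.g. 1.5 * 1000.
        return int(round(float(value[:-1]) * MULTIPLIERS[suffix]))
    return int(value)

df["Reviews"] = df["Reviews"].apply(parse_count)
print(df["Reviews"].tolist())  # [159, 5000000, 1500, 3000000]
```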

How many duplicated apps are there?

Count the number of duplicated rows. That is, if the app Twitter appears 2 times, that counts as 2.
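A sketch of this count, assuming that a duplicate means a repeated value in the App column: passing `keep=False` to `duplicated` marks every copy, so an app that appears twice contributes 2. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample: Twitter appears twice and Maps three times.
df = pd.DataFrame({
    "App": ["Twitter", "Maps", "Twitter", "Chess", "Maps", "Maps"],
})

# keep=False flags all occurrences of a repeated app, not only the extras.
dup_mask = df.duplicated(subset="App", keep=False)
n_duplicated = int(dup_mask.sum())
print(n_duplicated)  # 5
```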

Drop duplicated apps keeping only the ones with the greatest number of reviews

Now that the Reviews column is numeric, we can use it to clean duplicated apps. Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.

Hint: you'll need to sort the dataframe by App and Reviews, and that will change the order of your df.
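Following the hint, one way to keep the copy with the most reviews is to sort ascending by App and Reviews and then keep the last row of each app. This is a sketch on hypothetical sample data, and it assumes Reviews is already numeric.

```python
import pandas as pd

# Hypothetical sample with numeric Reviews.
df = pd.DataFrame({
    "App": ["Twitter", "Maps", "Twitter", "Chess", "Maps"],
    "Reviews": [100, 50, 300, 7, 20],
})

# After an ascending sort, the last row of each app has the most reviews.
df.sort_values(by=["App", "Reviews"], inplace=True)
df.drop_duplicates(subset="App", keep="last", inplace=True)
print(df)
```

The rows now follow the sorted order rather than the original one, which is the change of order the hint warns about.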

Format the `Category` column

Categories are all uppercase, with words separated by underscores. Instead, we want only the first character capitalized and the underscores replaced with spaces.

For example, the category AUTO_AND_VEHICLES should be transformed to: Auto and vehicles
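The transformation can be sketched with two chained string methods: replace the underscores, then use `capitalize`, which uppercases the first character and lowercases the rest. The sample values other than AUTO_AND_VEHICLES are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["AUTO_AND_VEHICLES", "GAME", "ART_AND_DESIGN"]})

# Swap underscores for spaces, then capitalize only the first character.
df["Category"] = (
    df["Category"]
    .str.replace("_", " ", regex=False)
    .str.capitalize()
)
print(df["Category"].tolist())  # ['Auto and vehicles', 'Game', 'Art and design']
```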

Clean and convert the `Installs` column to numeric type

Clean and transform Installs into a numeric type. Some values in Installs have a + modifier. Just remove the modifier and keep the original number (for example, +2,500 or 2,500+ should be transformed to the number 2500).
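A minimal sketch of this cleanup, assuming the only non-digit characters are the + modifier and the thousands separator. The sample values are hypothetical apart from the two shown in the example above.

```python
import pandas as pd

df = pd.DataFrame({"Installs": ["2,500+", "+2,500", "10,000+", "0"]})

# Strip the + modifier and the thousands separators, then convert.
cleaned = (
    df["Installs"]
    .str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
)
df["Installs"] = pd.to_numeric(cleaned)
print(df["Installs"].tolist())  # [2500, 2500, 10000, 0]
```

`regex=False` is passed explicitly because + is a special character in regular expressions.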

Clean and convert the `Size` column to numeric (representing bytes)

The Size column is of type object. Some values contain either a k or an M, indicating kilobytes (1024 bytes) or megabytes (1024 kB). These values should be transformed to their corresponding value in bytes. For example, 898k will become 919552 (898 * 1024).

Some other values are completely invalid (there's no way to infer a number from them). For these, just replace the value with 0.

Some values also carry + modifiers; apply the same rules as in the previous task.
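The rules above can be combined in one helper, sketched below. The name `size_to_bytes` and the sample values (other than 898k) are hypothetical; fractional byte results are truncated, which is a choice made for this sketch.

```python
import pandas as pd

# Hypothetical sample values; "Varies with device" stands in for an invalid entry.
df = pd.DataFrame({
    "Size": ["898k", "19M", "Varies with device", "1,000+", "2.5M"],
})

UNITS = {"k": 1024, "M": 1024 ** 2}

def size_to_bytes(value):
    """Convert a size string to bytes, returning 0 when it cannot be parsed."""
    # Same rule as for Installs: drop the + modifier and thousands separators.
    text = str(value).replace("+", "").replace(",", "").strip()
    factor = 1
    if text and text[-1] in UNITS:
        factor = UNITS[text[-1]]
        text = text[:-1]
    try:
        return int(float(text) * factor)
    except ValueError:
        # No number can be inferred from the value.
        return 0

df["Size"] = df["Size"].apply(size_to_bytes)
print(df["Size"].tolist())  # [919552, 19922944, 0, 1000, 2621440]
```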

Clean and convert the `Price` column to numeric

Values in the Price column are strings that represent the price with the special symbol '$'.
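A short sketch of the conversion, assuming '$' is the only non-numeric character in the column. The sample prices are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Price": ["0", "$4.99", "$399.99"]})

# Remove the currency symbol, then convert the remaining text to numbers.
df["Price"] = pd.to_numeric(df["Price"].str.replace("$", "", regex=False))
print(df["Price"].tolist())  # [0.0, 4.99, 399.99]
```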

Paid or free?

Now that you have cleaned the Price column, let's create another auxiliary Distribution column.

This column should contain Free/Paid values depending on the app's price.
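One way to build the column is `numpy.where`, which picks a label per row from a boolean condition. This sketch assumes Price is already numeric and uses hypothetical sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"App": ["A", "B", "C"], "Price": [0.0, 4.99, 0.0]})

# A price of 0 means the app is free; anything else is paid.
df["Distribution"] = np.where(df["Price"] == 0, "Free", "Paid")
print(df["Distribution"].tolist())  # ['Free', 'Paid', 'Free']
```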

What company has the most reviews?

What company has the greatest number of reviews?

Which is the category with the most uploaded apps?

Which category does the most expensive app belong to?

What's the name of the most expensive game?

Find the most expensive app in the Game category and enter its name:
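The category and price questions above follow a common pattern: count or filter, then take the position of the maximum with `idxmax`. The sketch below uses hypothetical apps and assumes Category has already been reformatted and Price is numeric.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "App": ["Chess Pro", "Racer", "Rich Ledger", "Kids A", "Kids B", "Kids C"],
    "Category": ["Game", "Game", "Finance", "Family", "Family", "Family"],
    "Price": [4.99, 0.99, 399.99, 0.0, 1.99, 0.0],
})

# Category with the most apps.
most_common_category = df["Category"].value_counts().idxmax()

# Category of the single most expensive app.
priciest_category = df.loc[df["Price"].idxmax(), "Category"]

# Most expensive app within the Game category.
games = df[df["Category"] == "Game"]
priciest_game = games.loc[games["Price"].idxmax(), "App"]

print(most_common_category, priciest_category, priciest_game)
```

Note that `idxmax` returns only the first match when several rows tie for the maximum.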

Which is the most popular Finance App?

What app (from the Finance category) has the most installs?

What *Teen* Game has the most reviews?

What app from the Game category and catalogued as Teen in Content Rating has the greatest number of reviews?

What free game has the most reviews?

What free app (i.e. price == 0) from the Game category has the greatest number of reviews?
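Questions that combine several conditions, such as this one and the Teen question before it, can be sketched with boolean masks joined by `&`. The apps below are hypothetical, and the sketch assumes Price and Reviews are numeric.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "App": ["Puzzle One", "Racer", "Word Quest", "Wallet"],
    "Category": ["Game", "Game", "Game", "Finance"],
    "Content Rating": ["Teen", "Teen", "Everyone", "Everyone"],
    "Price": [0.0, 2.99, 0.0, 0.0],
    "Reviews": [1200, 90000, 4500, 999999],
})

# Each condition needs its own parentheses when combined with &.
free_games = df[(df["Category"] == "Game") & (df["Price"] == 0)]
top_free_game = free_games.loc[free_games["Reviews"].idxmax(), "App"]

teen_games = df[(df["Category"] == "Game") & (df["Content Rating"] == "Teen")]
top_teen_game = teen_games.loc[teen_games["Reviews"].idxmax(), "App"]

print(top_free_game, top_teen_game)  # Word Quest Racer
```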

How many TB (terabytes) were transferred (overall) for the most popular Lifestyle app?

This is the app that produced the greatest amount of data transfer, in bytes. Enter your answer in terabytes as a whole number (rounding down to the nearest integer). For example, if you find the total transfer to be 780.9581 TB, just enter 780.
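A hedged sketch of the calculation, under two assumptions the activity does not spell out: total transfer is estimated as Installs multiplied by Size in bytes, and 1 TB is taken as 1024^4 bytes, in line with the 1024-based units used for Size. The apps and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical sample; Size is already expressed in bytes.
df = pd.DataFrame({
    "App": ["Home Planner", "Daily Habits", "Budget Pal"],
    "Category": ["Lifestyle", "Lifestyle", "Finance"],
    "Installs": [10_000_000, 500_000_000, 1_000_000_000],
    "Size": [50 * 1024**2, 4 * 1024**2, 10 * 1024**2],
})

lifestyle = df[df["Category"] == "Lifestyle"].copy()

# Assumption: every install downloads the full app once.
lifestyle["Transfer"] = lifestyle["Installs"] * lifestyle["Size"]

top = lifestyle.loc[lifestyle["Transfer"].idxmax()]

# Assumption: 1 TB = 1024**4 bytes; floor division rounds down.
terabytes = int(top["Transfer"] // 1024**4)
print(top["App"], terabytes)  # Daily Habits 1907
```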

Matias Caputti

This project is part of our Data Cleaning with Pandas Skill Track.
