All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Select the columns that you have identified having null/missing values. We encourage you to use the missingno library.
This is a 3-part activity:
Rating (if any). Just set them as NaN.Rating column using the mean()Perform the modifications "in place", modifying df. If you make a mistake, re-load the data.
You'll notice that some columns from this dataframe which should be numeric, were parsed as object (string). That's because sometimes the numbers are expressed with M, or k to indicate Mega or kilo.
Clean the Review column by transforming the values to the correct numeric representation. For example, 5M should be 5000000.
Count the number of duplicated rows. That is, if the app Twitter appears 2 times, that counts as 2.
Now that the Reviews column is numeric, we can use it to clean duplicated apps.
Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.
Hint: you'll need to sort the dataframe by App and Reviews, and that will change the order of your df.
Categories are all uppercase and words are separated using underscores. Instead, we want them with capitalized in the first character and the underscores transformed as whitespaces.
Example, the category AUTO_AND_VEHICLES should be transformed to: Auto and vehicles
Clean and transform Installs as a numeric type. Some values in Installs will have a + modifier. Just remove the string and honor the original number (for example +2,500 or 2,500+ should be transformed to the number 2500).
The Size column is of type object. Some values contain either a M or a k that indicate Kilobytes (1024 bytes) or Megabytes (1024 kb).
These values should be transformed to their corresponding value in bytes. For example, 898k will become 919552 (898 * 1024).
Some other values are completely invalid (there's no way to infer the numeric type from them). For these, just replace the value for 0.
Some other rules are related to + modifiers, apply the same rules as the previous task.
Values of the Price column are strings representing price with special symbol '$'.
Now that you have cleaned the Price column, let's create another auxiliary Distribution column.
This column should contain Free/Paid values depending on the app's price.
What company has the greatest number of reviews?
Find the most expensive app in the Game category and enter its name:
What app (from the Finance category) has the most installs?
What app from the Game category and catalogued as Teen in Content Rating has the greatest number of reviews?
What free app (ie. price == 0) from the Game category has the greatest number of reviews?
This app produced the greatest amount of bytes transfer. Enter your answer in Terabytes as a whole number (rounding down to the nearest integer). Example, if you find the total transfer to be 780.9581 TB, just enter 780.