All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Select the columns that you have identified having null/missing values. We encourage you to use the
This is a 3-part activity:
Rating(if any). Just set them as
Ratingcolumn using the
Perform the modifications "in place", modifying
df. If you make a mistake, re-load the data.
You'll notice that some columns from this dataframe which should be numeric, were parsed as
object (string). That's because sometimes the numbers are expressed with
k to indicate
Review column by transforming the values to the correct numeric representation. For example,
5M should be
Count the number of duplicated rows. That is, if the app
Now that the
Reviews column is numeric, we can use it to clean duplicated apps.
Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.
Hint: you'll need to sort the dataframe by
Reviews, and that will change the order of your
Categories are all uppercase and words are separated using underscores. Instead, we want them with capitalized in the first character and the underscores transformed as whitespaces.
Example, the category
AUTO_AND_VEHICLES should be transformed to:
Auto and vehicles
Clean and transform
Installs as a numeric type. Some values in
Installs will have a
+ modifier. Just remove the string and honor the original number (for example
2,500+ should be transformed to the number
Size column is of type object. Some values contain either a
M or a
k that indicate Kilobytes (1024 bytes) or Megabytes (1024 kb).
These values should be transformed to their corresponding value in bytes. For example,
898k will become
898 * 1024).
Some other values are completely invalid (there's no way to infer the numeric type from them). For these, just replace the value for 0.
Some other rules are related to
+ modifiers, apply the same rules as the previous task.
Values of the
Price column are strings representing price with special symbol '$'.
Now that you have cleaned the
Price column, let's create another auxiliary
This column should contain
Paid values depending on the app's price.
What company has the greatest number of reviews?
Find the most expensive app in the
Game category and enter its name:
What app (from the
Finance category) has the most installs?
What app from the
Game category and catalogued as
Content Rating has the greatest number of reviews?
What free app (ie.
price == 0) from the
Game category has the greatest number of reviews?
This app produced the greatest amount of bytes transfer. Enter your answer in Terabytes as a whole number (rounding down to the nearest integer). Example, if you find the total transfer to be
780.9581 TB, just enter