Data Cleaning Capstone Project 2

In this project, you will apply the data cleaning techniques you have learned with Pandas. Using the Careem Rides Dataset, you will read data from a CSV file, handle missing values, remove duplicates, and clean and transform the data to answer a variety of questions.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

multiplechoice

How do you check if a value in a Pandas DataFrame column is null?

multiplechoice

What is the Pandas function used to drop missing values?

codevalidated

Count the missing values in each column

Perform the calculation and store the results in the variable col_missing_values.
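A minimal sketch of one way to approach this, assuming the dataset is already loaded into a DataFrame named `df`. The toy rows below are invented for illustration; only the variable name `col_missing_values` comes from the activity.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rides data (values are invented for illustration).
df = pd.DataFrame({
    "Booking ID": [1, 2, 3, 4],
    "Wait Time Min": [5.0, np.nan, 7.0, np.nan],
    "Credit ID": [np.nan, "C-2", "C-3", "C-4"],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
col_missing_values = df.isna().sum()
print(col_missing_values)
```

The result is a Series indexed by column name, so a single count can be looked up with `col_missing_values["Wait Time Min"]`.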

codevalidated

Drop the columns that have more than 1000 missing values

Drop these columns permanently, since they cannot be used for any purpose.
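One possible way to express this is sketched below. The column names `Mostly Empty` and `Mostly Full` are hypothetical and exist only to make the toy frame large enough to cross the 1000 threshold.

```python
import numpy as np
import pandas as pd

# Toy frame with 1500 rows (invented data):
# "Mostly Empty" has 1200 missing values, "Mostly Full" has 10.
df = pd.DataFrame({
    "Mostly Empty": [np.nan] * 1200 + [1.0] * 300,
    "Mostly Full": [np.nan] * 10 + [2.0] * 1490,
})

missing_per_column = df.isna().sum()

# Select the names of columns whose missing count exceeds 1000.
cols_to_drop = missing_per_column[missing_per_column > 1000].index

# Reassigning df makes the drop permanent.
df = df.drop(columns=cols_to_drop)
print(list(df.columns))
```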

codevalidated

Drop the rows that have a `Service Area` other than `Karachi`

Perform this drop permanently in df.
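A short sketch of this filter, using invented rows. `Lahore` is only an assumed example of a non-Karachi value, and the real data may contain other values.

```python
import pandas as pd

# Toy rows (invented); the last one has a missing service area.
df = pd.DataFrame({
    "Trip ID": [101, 102, 103, 104],
    "Service Area": ["Karachi", "Lahore", "Karachi", None],
})

# Keep only the Karachi rows; reassigning df makes the drop permanent.
df = df[df["Service Area"] == "Karachi"]
print(df)
```

Note that a row with a missing `Service Area` is also removed here, because a missing value never compares equal to `Karachi`.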

codevalidated

Fill the missing values of the column `Wait Time Min` with the median of the column

Make sure to apply this change on the original df.
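The idea can be sketched as follows, with invented wait times. The helper name `median_wait` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy wait times (invented) with two missing entries.
df = pd.DataFrame({"Wait Time Min": [2.0, np.nan, 4.0, 10.0, np.nan]})

# median() ignores missing values by default.
median_wait = df["Wait Time Min"].median()

# Assigning the result back to the column updates the original df.
df["Wait Time Min"] = df["Wait Time Min"].fillna(median_wait)
print(df)
```

Assigning back to the column is used here instead of `inplace=True` on the selected column, which may not update `df` reliably in recent pandas versions.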

codevalidated

Fill all missing values in `Credit ID` using the backward fill method

Make sure to apply this change on the original df.
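A minimal sketch of backward filling, with invented `Credit ID` values. Each missing entry takes the next valid value below it.

```python
import numpy as np
import pandas as pd

# Toy credit IDs (invented); the last entry has no later value to copy from.
df = pd.DataFrame({
    "Credit ID": [np.nan, "C-10", np.nan, np.nan, "C-20", np.nan],
})

# bfill() propagates the next valid observation backward into each gap.
df["Credit ID"] = df["Credit ID"].bfill()
print(df)
```

A missing value at the very end of the column stays missing, since there is nothing after it to fill from.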

multiplechoice

How do you handle duplicate values in Pandas?

multiplechoice

How do you extract a substring from a Pandas DataFrame column?

codevalidated

Find and drop duplicate rows based on the `Booking ID`, `Trip ID`, `Car Model`, `Payment Type`, and `Pickup Location` columns, keeping the last row of each duplicate group

  • Make sure to permanently drop these duplicates.
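One way this could look is sketched below. The rows and the helper name `key_cols` are invented; the column names are the ones listed in the activity.

```python
import pandas as pd

# Toy rows (invented): the first two match on all five key columns.
df = pd.DataFrame({
    "Booking ID": [1, 1, 2],
    "Trip ID": [10, 10, 20],
    "Car Model": ["Alto", "Alto", "Cultus"],
    "Payment Type": ["Cash", "Cash", "Credit Card"],
    "Pickup Location": ["Saddar", "Saddar", "Clifton"],
    "Trip Price": [250, 300, 400],
})

key_cols = ["Booking ID", "Trip ID", "Car Model", "Payment Type", "Pickup Location"]

# Rows are compared on key_cols only; keep="last" retains the final row of each group.
df = df.drop_duplicates(subset=key_cols, keep="last")
print(df)
```

`Trip Price` is not part of the key, so the two rows with prices 250 and 300 count as duplicates and only the later one survives.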

input

How many users paid with `Credit Card`, according to the `Payment Type` column?

  • Count all customers who paid with Credit Card in the `Payment Type` column.
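A small sketch of the counting step on invented values. The variable name `credit_card_count` is hypothetical.

```python
import pandas as pd

# Toy payment types (invented), including a missing value and a lowercase variant.
df = pd.DataFrame({
    "Payment Type": ["Credit Card", "Cash", "Credit Card", None, "credit card"],
})

# The comparison yields True/False per row; sum() counts the True values.
credit_card_count = (df["Payment Type"] == "Credit Card").sum()
print(credit_card_count)
```

The match is exact and case-sensitive, so `credit card` is not counted in this sketch.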

codevalidated

Replace the value `Go Mini` with `GO Mini` in the `Car Type` column

Make sure to apply this change to the original df.
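A minimal sketch of the replacement, using invented `Car Type` values.

```python
import pandas as pd

# Toy car types (invented).
df = pd.DataFrame({"Car Type": ["Go Mini", "GO", "GO Mini", "Go+"]})

# replace() with two scalars swaps whole-cell matches only, so "Go+" is untouched.
df["Car Type"] = df["Car Type"].replace("Go Mini", "GO Mini")
print(df)
```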

codevalidated

Find the trips (rows) whose `Car Type` contains the substring `GO` and whose `Pickup Location` contains `University`

  • Store your selection in the variable edu_trips_with_GO.
  • Note: make sure to pass the previous activity to avoid facing any issue in your result here.
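The two conditions can be sketched as boolean masks combined with `&`. The rows and the helper names `has_go` and `at_university` are invented; `edu_trips_with_GO` is the name the activity asks for.

```python
import pandas as pd

# Toy trips (invented).
df = pd.DataFrame({
    "Car Type": ["GO Mini", "Business", "GO+", None],
    "Pickup Location": ["Karachi University", "NED University", "Airport", "University Road"],
})

# na=False treats missing values as "does not contain".
has_go = df["Car Type"].str.contains("GO", na=False)
at_university = df["Pickup Location"].str.contains("University", na=False)

# Both conditions must hold for a row to be selected.
edu_trips_with_GO = df[has_go & at_university]
print(edu_trips_with_GO)
```

Because `str.contains` is case-sensitive by default, a value such as `Go Mini` would not match `GO`, which is likely why the previous activity has to be completed first.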

input

How many values in the `Car Type` column end with `+`?
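A brief sketch of a suffix check on invented values. The variable name `plus_count` is hypothetical.

```python
import pandas as pd

# Toy car types (invented), including one missing value.
df = pd.DataFrame({"Car Type": ["GO+", "GO Mini", "Business+", None, "GO"]})

# str.endswith() matches the literal suffix; na=False counts missing values as no match.
plus_count = df["Car Type"].str.endswith("+", na=False).sum()
print(plus_count)
```

`str.endswith` takes a plain string rather than a regular expression, so `+` needs no escaping here.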

multiplechoice

Which of the following is an example of a normalization technique used in data cleaning?

codevalidated

Clean the column `Wait Time Min` by identifying and dropping outliers

  • Outliers are defined as any values 3 or more standard deviations above or below the mean.

  • Identify the outliers and drop them.

  • Important Note: Make sure to correctly solve the previous activities before solving this activity.
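The rule above can be sketched with z-scores. The wait times are invented, and the sketch assumes the sample standard deviation that `Series.std()` computes by default, which may or may not be what the activity's checker expects.

```python
import pandas as pd

# Toy wait times (invented): 19 ordinary values and one extreme value.
df = pd.DataFrame({"Wait Time Min": [4, 5, 6] * 6 + [5, 120]})

wait = df["Wait Time Min"]

# Distance of each value from the mean, measured in standard deviations.
z_scores = (wait - wait.mean()) / wait.std()

# Outliers sit 3 or more standard deviations from the mean in either direction.
is_outlier = z_scores.abs() >= 3

# Keep everything that is not an outlier.
df = df[~is_outlier]
print(len(df))
```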

codevalidated

Clean the column `Trip Price` by identifying outliers

  • Outliers are defined as any values more than 1.5 × IQR below the first quartile or above the third quartile.
  • Identify the outliers and drop them.
  • Important Note: Make sure to correctly solve the previous activities before solving this activity.
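A sketch of the IQR rule, assuming the common reading in which the fences sit 1.5 × IQR below the first quartile and above the third quartile. The prices and the names `lower` and `upper` are invented.

```python
import pandas as pd

# Toy trip prices (invented) with one extreme value.
df = pd.DataFrame({"Trip Price": [100, 110, 120, 130, 140, 150, 160, 1000]})

price = df["Trip Price"]
q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default.
df = df[price.between(lower, upper)]
print(df)
```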

codevalidated

Clean the column `Payment Type` by removing invalid values

Invalid values are defined as any value other than Credit Card or Cash.

Perform the selection of valid values and store them in column Payment_Type_Fixed while invalid values should be NaN. Then select invalid values and store the results in the variable df_invalid_payment_type.
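One possible shape for this step is sketched below on invented values. `Wallet` is an assumed example of an invalid entry, and `valid_values` and `is_valid` are hypothetical helper names.

```python
import pandas as pd

# Toy payment types (invented).
df = pd.DataFrame({
    "Payment Type": ["Cash", "Credit Card", "Wallet", "cash", None],
})

valid_values = ["Credit Card", "Cash"]
is_valid = df["Payment Type"].isin(valid_values)

# where() keeps values where the condition is True and writes NaN elsewhere.
df["Payment_Type_Fixed"] = df["Payment Type"].where(is_valid)

# Rows whose fixed value is NaN are the invalid ones.
df_invalid_payment_type = df[df["Payment_Type_Fixed"].isna()]
print(df_invalid_payment_type)
```

In this sketch a row that was already missing in `Payment Type` also lands in `df_invalid_payment_type`; whether the activity expects that is an assumption.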

codevalidated

Clean the column `Trip Currency` by removing invalid values

  • Invalid values are defined as any value other than PKR.

  • Perform the selection of invalid values and drop them from the original df.
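A short sketch of selecting and dropping the invalid rows. The rows are invented, `USD` is only an assumed example of a non-PKR value, and `invalid_rows` is a hypothetical name.

```python
import pandas as pd

# Toy trips (invented); the last one has a missing currency.
df = pd.DataFrame({
    "Trip ID": [1, 2, 3, 4],
    "Trip Currency": ["PKR", "USD", "PKR", None],
})

# Select every row whose currency is anything other than PKR.
invalid_rows = df[df["Trip Currency"] != "PKR"]

# Drop those rows by index label from the original df.
df = df.drop(invalid_rows.index)
print(df)
```

A missing currency is treated as invalid here, since a missing value is never equal to `PKR`.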

codevalidated

Clean the column `Total Distance` by removing invalid values

  • Invalid values are defined as any value that is not an integer.

  • Perform the selection of valid values and store them in column Total_Distance_Fixed while invalid values should be NaN. Then select invalid values and store the results in the variable df_invalid_distance.
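How "not an integer" should be tested depends on how the column is stored, which the activity does not state. The sketch below assumes the raw values may be text, converts them to numbers, and treats whole numbers as valid; the rows and the names `as_number` and `is_integer` are invented.

```python
import pandas as pd

# Toy distances (invented), stored as text with some bad entries.
df = pd.DataFrame({"Total Distance": ["12", "7.5", "abc", "30", None]})

# Values that cannot be parsed as numbers become NaN.
as_number = pd.to_numeric(df["Total Distance"], errors="coerce")

# A value is valid when it parsed successfully and has no fractional part.
is_integer = as_number.notna() & (as_number % 1 == 0)

# Valid values are kept; everything else becomes NaN.
df["Total_Distance_Fixed"] = as_number.where(is_integer)

# Rows whose fixed value is NaN are the invalid ones.
df_invalid_distance = df[df["Total_Distance_Fixed"].isna()]
print(df_invalid_distance)
```

Under this rule a value such as `12.0` would count as an integer, which may differ from the check the activity uses.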

Author

Mohamed Rawash

This project is part of

Data Cleaning with Pandas
