Data Cleaning Capstone: Cleaning NYC Airbnb Data

In this project you'll apply the techniques you have learned so far in Data Cleaning with Pandas: identifying null and missing values, handling duplicate data, and identifying and fixing invalid values (wrong types, statistically insignificant values, outliers, etc.). You'll also practice string handling with Pandas and the `.str` accessor, all on a dataset containing information about NYC Airbnb bookings.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Find the number of missing values in each column

Perform the calculation and store the results in the variable col_missing_values.
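
A minimal sketch of one way to approach this, assuming the data is loaded into a DataFrame called `df`. The small frame below is invented for illustration and is not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "host_name": ["John", None, "Maria"],
    "price": [150.0, np.nan, np.nan],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# isna() marks each missing cell as True; sum() then counts them per column.
col_missing_values = df.isna().sum()
print(col_missing_values)
```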

codevalidated

Drop the column `reviews_per_month`, as it has many missing values and we will not use it

Drop this column permanently, since we will not use it for any purpose.
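
As a rough sketch, assuming "permanently" means the change should persist in `df` itself (the toy frame is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "name": ["Cozy loft", "Sunny studio"],
    "reviews_per_month": [np.nan, 0.5],
})

# drop() returns a new DataFrame, so reassign it to make the change stick.
# Passing inplace=True instead would modify df directly.
df = df.drop(columns=["reviews_per_month"])
print(df.columns.tolist())
```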

codevalidated

Drop the rows that have more than one missing value

Store the resulting DataFrame in the variable df_rows_dropped.
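
One possible approach, sketched on an invented toy frame, is to count the missing cells per row and keep only the rows with at most one:

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny studio"],
    "host_name": ["John", None, None],
    "price": [150.0, 80.0, 95.0],
})

# Count missing values row by row (axis=1) and keep rows with 0 or 1 of them.
missing_per_row = df.isna().sum(axis=1)
df_rows_dropped = df[missing_per_row <= 1]
print(df_rows_dropped)
```

The same result can be obtained with `df.dropna(thresh=df.shape[1] - 1)`, where `thresh` is the minimum number of non-missing values a row needs in order to be kept.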

codevalidated

Fill the 21 missing values in `host_name` with the value `Airbnb`

Store your result in the variable host_total.
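
A short sketch, assuming `host_total` is meant to hold the filled `host_name` column (the toy values are invented):

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({"host_name": ["John", None, "Maria"]})

# fillna() replaces every missing entry with the given constant.
host_total = df["host_name"].fillna("Airbnb")
print(host_total.tolist())
```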

multiplechoice

Check whether any name in the `host_name` column contains digits
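
If you want to verify this yourself, one option is a regular-expression match through the `.str` accessor. The sketch below uses invented names, and `has_digit` is only an illustrative variable name:

```python
import pandas as pd

# Toy host names, made up for illustration.
host_names = pd.Series(["John", "Maria2", "Anna 77", "Li", None])

# r"\d" matches any single digit; na=False treats missing names as "no match".
has_digit = host_names.str.contains(r"\d", na=False)

print(has_digit.any())        # is there at least one such name?
print(host_names[has_digit])  # which names contain a digit?
```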

codevalidated

Fill the 2 missing values of the column `price` with the mean of the column

Store your result in the variable mean_df_price.
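
A minimal sketch, assuming `mean_df_price` should hold the `price` column after filling (toy values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({"price": [100.0, np.nan, 200.0, np.nan]})

# mean() skips NaN by default, so it averages only the known prices.
mean_df_price = df["price"].fillna(df["price"].mean())
print(mean_df_price.tolist())
```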

codevalidated

Fill all missing values in `last_review` using the forward-fill method

Store your result in the variable ffill_review.
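
Forward filling copies the last known value down into the gaps that follow it. A small sketch on invented dates:

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({"last_review": ["2019-05-21", None, None, "2019-07-05"]})

# ffill() propagates the most recent non-missing value forward.
ffill_review = df["last_review"].ffill()
print(ffill_review.tolist())
```

Note that a missing value in the very first row has nothing before it to copy from, so it would stay missing.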

codevalidated

Select duplicate hosts in a dataframe based on `name`, `host_id`, and `price` columns

Store your result in the variable duplicate_hosts.
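
One way to sketch this is with `duplicated()` and its `subset` argument. The toy rows are invented, and the sketch uses the default `keep` behaviour as an assumption about what the activity expects:

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "name": ["Cozy loft", "Cozy loft", "Sunny studio"],
    "host_id": [1, 1, 2],
    "price": [150, 150, 95],
})

# duplicated() compares rows on the listed columns only.
# By default the first occurrence is not flagged, only the later repeats.
is_duplicate = df.duplicated(subset=["name", "host_id", "price"])
duplicate_hosts = df[is_duplicate]
print(duplicate_hosts)
```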

multiplechoice

Which value should the `keep` parameter take when all duplicates need to be dropped?

codevalidated

Drop duplicates while keeping the first non-NaN value based on `name`, `host_id`, and `price` columns

Perform the dropping and store the results in the variable df_unique_hosts.
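
A minimal sketch with `drop_duplicates()`, assuming that keeping the first occurrence of each duplicated group is what is wanted (toy rows are invented):

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "name": ["Cozy loft", "Cozy loft", "Sunny studio"],
    "host_id": [1, 1, 2],
    "price": [150, 150, 95],
})

# keep="first" retains the first row of each duplicated group.
df_unique_hosts = df.drop_duplicates(
    subset=["name", "host_id", "price"], keep="first"
)
print(df_unique_hosts)
```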

codevalidated

How many entries in the `room_type` column are `Private room`?

Count all the `Private room` entries in the `room_type` column and sum them up.

Store your sum in the private_rooms_counts variable.
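
A short sketch of the counting step on invented values: comparing the column to the target string gives a boolean Series, and summing it counts the `True` entries.

```python
import pandas as pd

# Toy room types, made up for illustration.
room_type = pd.Series(
    ["Private room", "Entire home/apt", "Private room", "Shared room"]
)

# True counts as 1 and False as 0, so sum() gives the number of matches.
private_rooms_counts = (room_type == "Private room").sum()
print(private_rooms_counts)
```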

codevalidated

Find the values in the `name` column that contain the substring `park`

Store your selection in the variable names_having_park.
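
A possible sketch using `.str.contains()` on invented listing names. It assumes a case-sensitive match, which is the default:

```python
import pandas as pd

# Toy listing names, made up for illustration.
names = pd.Series(
    ["Room near Central park", "Sunny studio", "Prospect Park loft", None]
)

# Case-sensitive by default, so "Park" does not match "park".
# na=False treats missing names as non-matching.
names_having_park = names[names.str.contains("park", na=False)]
print(names_having_park.tolist())
```

Passing `case=False` would also match `Park`, if that is what you need.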

codevalidated

Replace `Kitchen` with `Restaurant` in the neighbourhood names

Store the output in the variable kitchen_to_restaurant.
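
A rough sketch, assuming the goal is a substring replacement inside the neighbourhood names. The Series name and the toy values are assumptions for illustration:

```python
import pandas as pd

# Toy neighbourhood names, made up for illustration.
neighbourhood = pd.Series(["Hell's Kitchen", "Harlem", "Midtown"])

# regex=False treats "Kitchen" as a literal string, not a pattern.
kitchen_to_restaurant = neighbourhood.str.replace(
    "Kitchen", "Restaurant", regex=False
)
print(kitchen_to_restaurant.tolist())
```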

codevalidated

Split the strings in the `room_type` column on ` ` (space) to find whether each one is a room or a home/apt

Store the result in the variable roomOrhome.

Note: We are interested in the second element (index 1) of each result once you split all the strings in the `room_type` column on a space.
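
A minimal sketch on invented values, assuming "second" refers to position 1 in the zero-based list produced by the split:

```python
import pandas as pd

# Toy room types, made up for illustration.
room_type = pd.Series(["Private room", "Entire home/apt", "Shared room"])

# str.split(" ") yields a list per row; .str[1] picks the second element.
roomOrhome = room_type.str.split(" ").str[1]
print(roomOrhome.tolist())
```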

codevalidated

Clean the column `availability_365` by removing invalid values

Invalid values are defined as any row where `availability_365` is 0.

Perform the selection of invalid values and store the results in the variable df_invalid_availability.
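
The selection can be sketched with a boolean mask. The toy frame is invented, and `df_valid_availability` is a hypothetical name used only to show the complementary filter:

```python
import pandas as pd

# Toy stand-in for the Airbnb data; all values are made up.
df = pd.DataFrame({
    "name": ["Cozy loft", "Sunny studio", "Park view", "Quiet room"],
    "availability_365": [365, 0, 120, 0],
})

# Rows that are never available are the invalid ones.
df_invalid_availability = df[df["availability_365"] == 0]

# The complement keeps only the valid rows (hypothetical variable name).
df_valid_availability = df[df["availability_365"] != 0]
print(df_invalid_availability)
```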

multiplechoice

Which value or values are most commonly used to fill NaN values?

codevalidated

Clean the column `minimum_nights` by removing outliers

Outliers are defined as any values 4 or more standard deviations above or below the mean.

Perform the outlier identification and store the results in a new column df_nights['Min_Nights_cleaned'].
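
A sketch of the standard-deviation rule on invented data. It assumes the new column should keep the original value for regular rows and hold NaN where an outlier was found, which the activity does not spell out:

```python
import pandas as pd

# Toy data, made up for illustration: 29 typical stays and one extreme value.
df_nights = pd.DataFrame({"minimum_nights": [2] * 29 + [1000]})

mean = df_nights["minimum_nights"].mean()
std = df_nights["minimum_nights"].std()

# Flag values whose distance from the mean is at least 4 standard deviations.
is_outlier = (df_nights["minimum_nights"] - mean).abs() >= 4 * std

# mask() replaces the flagged values with NaN and leaves the rest untouched.
df_nights["Min_Nights_cleaned"] = df_nights["minimum_nights"].mask(is_outlier)
print(df_nights.tail(3))
```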

codevalidated

Clean the column `price` by removing outliers

Outliers are defined as any values more than 1.5 IQR to the left or right (conventionally measured below the first quartile or above the third quartile).

Perform the outlier identification and store the results in a new column df_Price['Price_cleaned'].
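
A sketch of the IQR rule on invented prices. It assumes the conventional fences at 1.5 IQR below the first quartile and above the third quartile, and that outliers become NaN in the new column:

```python
import pandas as pd

# Toy prices, made up for illustration.
df_Price = pd.DataFrame({"price": [80, 90, 100, 110, 120, 5000]})

q1 = df_Price["price"].quantile(0.25)
q3 = df_Price["price"].quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
in_range = df_Price["price"].between(lower, upper)

# where() keeps values that satisfy the condition and sets the rest to NaN.
df_Price["Price_cleaned"] = df_Price["price"].where(in_range)
print(df_Price)
```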

Author

Mohamed Rawash

This project is part of

Data Cleaning with Pandas
