Data Cleaning Capstone: Cleaning NYC Airbnb Data

codevalidated

Find the number of missing values in each column

Perform the calculation and store the results in the variable col_missing_values.
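
One possible way to count missing values per column is sketched below. The DataFrame name `df` and its toy rows are assumptions made for illustration; they are not the real Airbnb data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values, not the real dataset).
df = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "host_name": ["Ann", "Bob", None],
    "price": [100.0, np.nan, 80.0],
})

# isna() marks each missing cell; sum() then counts them column by column.
col_missing_values = df.isna().sum()
print(col_missing_values)
```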

codevalidated

Drop the column `reviews_per_month`, as it has many missing values and we will not use it

You have to drop this column permanently, as we cannot use it for any purpose.
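
A minimal sketch of a permanent column drop, assuming the DataFrame is called `df` (a hypothetical name) and using made-up rows. "Permanently" is read here as "the change persists for later steps".

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "name": ["Loft", "Studio"],
    "reviews_per_month": [None, 0.5],
})

# Reassigning df keeps the change for all later steps;
# df.drop(columns=[...], inplace=True) would be an alternative.
df = df.drop(columns=["reviews_per_month"])
print(df.columns.tolist())
```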

codevalidated

Drop the rows having more than 1 missing value

Store the resulting DataFrame in the variable df_rows_dropped.
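
The rule "more than 1 missing value" can be expressed by counting missing cells per row. This is a sketch on toy data; `df` and `missing_per_row` are illustrative names, not part of the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "name": ["Loft", None, None],
    "host_name": ["Ann", "Bob", None],
    "price": [100.0, 90.0, np.nan],
})

# Count missing cells in each row (axis=1 sums across columns).
missing_per_row = df.isna().sum(axis=1)

# Keep only the rows with at most one missing value.
df_rows_dropped = df[missing_per_row <= 1]
print(df_rows_dropped)
```

`df.dropna(thresh=df.shape[1] - 1)` should give the same rows, since `thresh` is the minimum number of non-missing values a row needs to survive.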

codevalidated

Fill the 21 missing values in `host_name` with the value `Airbnb`

Store your result in the variable host_total.
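
A small sketch of filling missing host names with a constant. It assumes `host_total` is meant to hold the filled `host_name` column, which the task text does not state explicitly; the rows are made up.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({"host_name": ["Ann", None, "Bob", None]})

# fillna() replaces every missing entry with the given constant.
host_total = df["host_name"].fillna("Airbnb")
print(host_total.tolist())
```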

multiplechoice

Check whether any name in the column `host_name` contains digits
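
One way to run this check is a regular-expression match on the column. The names `has_digit` and `any_digit` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({"host_name": ["Ann", "Bob2", None]})

# \d matches any digit; na=False treats missing names as "no digit".
has_digit = df["host_name"].str.contains(r"\d", na=False)

# any() is True if at least one name contains a digit.
any_digit = has_digit.any()
print(any_digit)
```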

codevalidated

Fill the 2 missing values of the column `price` with the mean of the column

Store your result in the variable mean_df_price.
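
Mean imputation can be sketched as follows, again on a toy `df` with made-up prices. `Series.mean()` skips missing values by default, so the mean is computed from the known prices only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({"price": [100.0, np.nan, 200.0, np.nan]})

# Compute the mean of the known prices, then fill the gaps with it.
mean_df_price = df["price"].fillna(df["price"].mean())
print(mean_df_price.tolist())
```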

codevalidated

Fill all missing values in `last_review` using the forward-fill method

Store your result in the variable ffill_review.
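
Forward filling copies the last known value downward into each gap. A minimal sketch on made-up dates, with `df` as an assumed name:

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "last_review": ["2019-01-05", None, None, "2019-03-10"],
})

# ffill() propagates the most recent non-missing value forward.
ffill_review = df["last_review"].ffill()
print(ffill_review.tolist())
```

A missing value in the very first row has nothing before it to copy, so forward filling alone would leave it missing.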

codevalidated

Select the duplicate hosts in the DataFrame based on the `name`, `host_id`, and `price` columns

Store your result in the variable duplicate_hosts.
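
A sketch of selecting duplicated rows on a subset of columns. It assumes that every member of a duplicated group should be selected (`keep=False`); the task text does not say which variant is expected, and the default `keep="first"` would flag only the later repeats. The rows are made up.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "name": ["Loft", "Loft", "Studio"],
    "host_id": [1, 1, 2],
    "price": [100, 100, 80],
})

subset = ["name", "host_id", "price"]

# duplicated() returns a boolean mask; keep=False marks all rows
# that share the same values in the subset columns.
duplicate_hosts = df[df.duplicated(subset=subset, keep=False)]
print(duplicate_hosts)
```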

multiplechoice

Which value should the `keep` parameter take when all duplicates need to be dropped?

codevalidated

Drop duplicates while keeping the first non-NaN value based on `name`, `host_id`, and `price` columns

Perform the dropping and store the results in the variable df_unique_hosts.
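
The sketch below reads "keeping the first" as `keep="first"` in `drop_duplicates`, which is an interpretation rather than something the task text confirms. The extra `host_name` column and all rows are made up to show which row survives.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "name": ["Loft", "Loft", "Studio"],
    "host_id": [1, 1, 2],
    "price": [100, 100, 80],
    "host_name": ["Ann", "Ann B.", "Bob"],
})

# Rows are compared on the subset only; the first row of each
# duplicated group is kept and later repeats are dropped.
df_unique_hosts = df.drop_duplicates(
    subset=["name", "host_id", "price"], keep="first"
)
print(df_unique_hosts)
```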

codevalidated

How many listings in the column `room_type` are `Private room`?

Let's count all the Private rooms in the column room_type and sum them up.

Store your sum in the private_rooms_counts variable.
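
Counting matches can be done by summing a boolean mask, since `True` counts as 1. A sketch on made-up rows, with `df` as an assumed name:

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# The comparison yields True/False per row; sum() counts the True values.
private_rooms_counts = (df["room_type"] == "Private room").sum()
print(private_rooms_counts)
```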

codevalidated

Find the entries in the column `name` that contain the substring `park`

Store your selection in the variable names_having_park.
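
A sketch of a substring search with `str.contains`. The match here is case-sensitive, so `Park` with a capital P is not selected; whether the project expects that is an assumption, and passing `case=False` would match both. The rows are made up.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "name": ["Room near park", "Central Park view", "Cozy loft", None],
})

# na=False treats missing names as non-matching instead of NaN.
mask = df["name"].str.contains("park", na=False)
names_having_park = df.loc[mask, "name"]
print(names_having_park.tolist())
```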

codevalidated

Replace `Kitchen` with `Restaurant` in the neighbourhood names

Store the output in the variable kitchen_to_restaurant.
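
A sketch of a literal substring replacement. The column name `neighbourhood` is an assumption based on the task wording, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({"neighbourhood": ["Hell's Kitchen", "Harlem"]})

# regex=False makes the pattern a plain string, not a regular expression.
kitchen_to_restaurant = df["neighbourhood"].str.replace(
    "Kitchen", "Restaurant", regex=False
)
print(kitchen_to_restaurant.tolist())
```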

codevalidated

Split the strings in the `room_type` column at ` ` (space) to find whether it is room or home/apt

Store them in the variable roomOrhome.

Note: We are interested in the second element (index 1) of each result once you split all the strings in the column room_type on a space.
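
Splitting and then picking one piece can be chained through the `.str` accessor. The sketch assumes the piece of interest is the one at position 1 (for example `room` in `Private room`); the rows are made up.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
})

# str.split(" ") yields a list per row; .str[1] picks the item at index 1.
roomOrhome = df["room_type"].str.split(" ").str[1]
print(roomOrhome.tolist())
```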

codevalidated

Clean the column `availability_365` by removing invalid values

A value is invalid when a host offers 0 days in availability_365.

Perform the selection of invalid values and store the results in the variable df_invalid_availability.
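
Selecting the invalid rows comes down to a boolean filter. A sketch on made-up rows; the `host_id` column is included only to make the output readable.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (made-up values).
df = pd.DataFrame({
    "host_id": [1, 2, 3],
    "availability_365": [0, 120, 0],
})

# Rows where the listing is available 0 days a year count as invalid.
df_invalid_availability = df[df["availability_365"] == 0]
print(df_invalid_availability)
```

The cleaned data would then be the complement, `df[df["availability_365"] != 0]`.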

multiplechoice

Which value(s) are most commonly used to fill NaN values?

codevalidated

Clean the column `minimum_nights` by removing outliers

Outliers are defined as any values 4 or more standard deviations below or above the mean.

Perform the outlier identification and store the results in a new column df_nights['Min_Nights_cleaned'].
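
The rule above can be sketched as a distance-from-the-mean test. What exactly the new column should contain is not stated, so this sketch assumes it holds the original value for normal rows and NaN for outliers. The toy data and the helper names `col` and `is_outlier` are made up.

```python
import pandas as pd

# Toy stand-in: 29 ordinary stays and one extreme value (made-up data).
df_nights = pd.DataFrame({"minimum_nights": [2] * 29 + [1000]})

col = df_nights["minimum_nights"]
mean, std = col.mean(), col.std()  # std() is the sample standard deviation

# A value is an outlier if it lies 4 or more std away from the mean.
is_outlier = (col - mean).abs() >= 4 * std

# where() keeps values for which the condition holds, NaN elsewhere.
df_nights["Min_Nights_cleaned"] = col.where(~is_outlier)
print(df_nights["Min_Nights_cleaned"].isna().sum())
```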

codevalidated

Clean the column `price` by removing outliers

Outliers are defined as any values more than 1.5 times the IQR below the first quartile or above the third quartile.

Perform the outlier identification and store the results in a new column df_Price['Price_cleaned'].
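
A sketch of the usual IQR fence, assuming the bounds are measured from the first and third quartiles and that outliers become NaN in the new column. Neither detail is confirmed by the task text, and the prices are made up.

```python
import pandas as pd

# Toy stand-in for the price data (made-up values).
df_Price = pd.DataFrame({"price": [50, 60, 70, 80, 90, 100, 5000]})

price = df_Price["price"]
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1

# Fences sit 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside the fences (inclusive); outliers become NaN.
df_Price["Price_cleaned"] = price.where(price.between(lower, upper))
print(df_Price)
```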

Mohamed Rawash
