Cleaning duplicate data from an Online Retail store
Data Cleaning with Pandas

In this project, you'll practice identifying and cleaning duplicate data using a dataset from an online retail store.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

multiplechoice

Which of the following parameters restricts duplicate detection to certain columns and, by default, uses all of the columns?

multiplechoice

Which of the following parameters is used to determine whether to modify the DataFrame rather than creating a new one?

multiplechoice

Which of the following parameters takes 'first' as a value?

codevalidated

Select the duplicate rows in the DataFrame

Perform the selection and store the results in the variable duplicate_rows.

  • Note: use the default parameter keep='first'.

multiplechoice

What is the number of duplicate rows?
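
The selection and count asked for in the two activities above can be sketched as follows. This is a minimal illustration on hypothetical toy rows, not the project's dataset or its reference solution; the variable n_duplicates and the sample values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has more columns and many more rows.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536365"],
    "StockCode": ["85123A", "85123A", "22633", "85123A"],
    "Quantity": [6, 6, 6, 6],
    "UnitPrice": [2.55, 2.55, 1.85, 2.55],
})

# duplicated() flags every repeat of an earlier row. With keep="first"
# (the default), the first occurrence itself is not flagged.
duplicate_rows = df[df.duplicated(keep="first")]

# Number of rows flagged as duplicates (illustrative name).
n_duplicates = len(duplicate_rows)
print(n_duplicates)
```

Because the first occurrence is not flagged, three identical rows yield two duplicates here.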

codevalidated

Find and drop duplicate rows based on InvoiceNo, StockCode, Quantity, and UnitPrice columns

This data contains duplicate orders with the same quantity and unit price, so drop these duplicates.

Perform the dropping and store the results in the variable df_without_duplicate_orders.
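
As a rough sketch of this activity, drop_duplicates can be restricted to the four named columns through its subset parameter. The rows below are hypothetical, and the Description column is included only to show that columns outside the subset are ignored when comparing rows.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "85123A", "22633"],
    "Description": ["HEART HOLDER", "heart holder", "HEART HOLDER", "HAND WARMER"],
    "Quantity": [6, 6, 8, 6],
    "UnitPrice": [2.55, 2.55, 2.55, 1.85],
})

# Only these columns are compared; Description is not.
subset_cols = ["InvoiceNo", "StockCode", "Quantity", "UnitPrice"]

# Returns a new DataFrame; df itself is left unchanged.
df_without_duplicate_orders = df.drop_duplicates(subset=subset_cols)
print(df_without_duplicate_orders)
```

The second row is dropped even though its Description differs, while the third row survives because its Quantity is different.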

codevalidated

Drop duplicates while keeping the first non-NaN value based on InvoiceNo, StockCode, and CustomerID columns

Each invoice should list a given stock code only once per customer, although the customer may have ordered different quantities. Drop the duplicates while keeping the first non-NaN value.

Perform the dropping and store the results in the variable df_keep_first.
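
A minimal sketch of the "keep first" part, on hypothetical toy rows. Note that keep='first' selects the surviving row purely by row order within each group of duplicates; it does not inspect the remaining columns for missing values, so any extra NaN handling is outside what this sketch shows.

```python
import pandas as pd

# Hypothetical toy rows: the same invoice, stock code and customer
# appear twice with different quantities.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536367"],
    "StockCode": ["85123A", "85123A", "84879"],
    "CustomerID": [17850.0, 17850.0, 13047.0],
    "Quantity": [6, 10, 32],
})

# Keep the first occurrence of each (InvoiceNo, StockCode, CustomerID) combination.
df_keep_first = df.drop_duplicates(
    subset=["InvoiceNo", "StockCode", "CustomerID"],
    keep="first",
)
print(df_keep_first)
```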

codevalidated

Drop duplicates while keeping the last order based on StockCode and InvoiceWeekday columns

If you want to show the number of unique transactions per weekday and StockCode combination, you will need to drop duplicate stock codes on the same day.

Perform the dropping and store the results in the variable df_unique_stock_day.
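
One way this could look, assuming InvoiceWeekday already exists as a column holding weekday names (its actual format in the dataset is not shown here, so that is an assumption, as are the toy rows):

```python
import pandas as pd

# Hypothetical toy rows; InvoiceWeekday is assumed to hold weekday names.
df = pd.DataFrame({
    "StockCode": ["85123A", "85123A", "85123A", "22633"],
    "InvoiceWeekday": ["Monday", "Monday", "Tuesday", "Monday"],
    "InvoiceNo": ["536365", "536370", "536380", "536366"],
})

# keep="last" retains the final occurrence of each
# (StockCode, InvoiceWeekday) pair and drops the earlier ones.
df_unique_stock_day = df.drop_duplicates(
    subset=["StockCode", "InvoiceWeekday"],
    keep="last",
)
print(df_unique_stock_day)
```

Here the two Monday rows for 85123A collapse to the later one (invoice 536370).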

codevalidated

Drop all duplicate invoices, as they reflect multiple products in the same invoice

Imagine it is Black Friday and each customer is allowed to buy only one product per invoice. So, we need to drop all rows belonging to invoices that contain more than one product.

Perform the dropping and store the results in the variable df_black_friday.
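
Dropping every row of a repeated invoice, rather than keeping one representative, can be sketched with keep=False. The rows are hypothetical, and the sketch assumes each row of the data is one product line of an invoice.

```python
import pandas as pd

# Hypothetical toy rows: one row per product line of an invoice.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367", "536367", "536367"],
    "StockCode": ["85123A", "71053", "22633", "84879", "22745", "22748"],
})

# keep=False drops all rows whose InvoiceNo occurs more than once,
# leaving only invoices that contain a single product line.
df_black_friday = df.drop_duplicates(subset=["InvoiceNo"], keep=False)
print(df_black_friday)
```

Only invoice 536366 remains, since it is the only invoice with a single row.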

codevalidated

Drop duplicate countries while keeping the last row

Imagine we want to know all the unique countries in our data: drop duplicate countries, keeping the last row.

Perform the dropping and store the results in the variable df_unique_countries.

codevalidated

Drop duplicate products while keeping last based on StockCode, Description, and UnitPrice

Imagine we want to know all the products ordered from our retail store: drop duplicate products based on StockCode, Description, and UnitPrice.

Perform the dropping and store the results in the variable df_unique_products.

codevalidated

Drop all duplicate rows based on TotalCost and CustomerID while keeping first

We want to know all the unique total costs paid by each customer, so drop these duplicates.

Perform the dropping and store the results in the variable df_customer_unique_payments.

codevalidated

Drop all duplicate rows while keeping first

Perform the dropping and store the results in the variable df_unique.

Author

Mohamed Rawash

This project is part of

Data Cleaning with Pandas
