All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Enter the number of exact duplicate rows in the dataset. Exact duplicates are rows where all column values are identical. Use pandas functionality to identify these duplicates and sum up the total count. Store the result in a variable called duplicate_count.
Identify Equipment IDs that appear more than once in the dataset. First, count the occurrences of each unique Equipment_ID. Then, filter this count to keep only the IDs that occur multiple times. Store the result in a variable called duplicate_ids.
Count the number of duplicate Equipment_ID and remove these duplicates from the dataset. First, tally the number of duplicate Equipment_ID and store this count in a variable called duplicates. Then, create a new dataframe called df_cleaned that keeps only the first occurrence of each unique Equipment_ID, discarding any subsequent duplicates.
Enter the number that the duplicates variable contains.
Standardize the Status column values and remove any remaining duplicates from the dataset. First, convert all Status values to lowercase and remove any leading or trailing whitespace. Then, create a new dataframe called df_cleaned that contains only unique rows, removing any exact duplicates across all columns.
Identify duplicate rows based on specific columns and analyze the manufacturers of these duplicates. First, find rows with identical values in the Type, Model, and Year_Acquired columns, keeping all instances of duplicates. Store these in a duplicates variable. Then, count the occurrences of each manufacturer within these duplicates. Finally, calculate the percentage each manufacturer represents among the duplicates, rounding to two decimal places. Store the counts and percentages in mfg_counts and mfg_pct respectively.
Select all columns except Operational_Cost and Last_Maintenance as comparison columns. Use these to identify near-duplicate entries in the dataset. Employ the duplicated() method with keep=False to flag all duplicate rows based on the selected columns. Store these near-duplicates in the near_duplicates variable.
Define key equipment details (Equipment_ID, Type, Model, Manufacturer, Year_Acquired, Status, and Current_Base) in comparison_cols. Sort near_duplicates by these columns and Last_Maintenance. Keep only the last (most recent) maintenance record for each unique piece of equipment using drop_duplicates(), considering only the comparison_cols for determining duplicates. Store the result in resolved_maintenance. Finally, reset the index of the resulting DataFrame, dropping the old index.
Identify equipment with the same Type and Model in the original df. Use the duplicated() method with keep=False to flag all duplicate entries based on the Type and Model columns, including the first occurrence. Filter the DataFrame to keep only these flagged duplicate rows. Store these potential duplicate entries in a variable called hidden_duplicates.
Count how many times each unique combination of Type and Model appears in the hidden_duplicates DataFrame using the value_counts() method. Then, sort these counts from to lowest to highest. Store the resulting Series of counts in type_model_counts.