All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Group the data by topic and language:
Use value_counts() to find the most common chatbot for each combination.
Handle cases where there might be no common chatbot for a combination.
Create a function to fill missing chatbot names:
If a chatbot name is missing, replace it with the most common one for that topic-language pair.
If there's no common chatbot for a particular combination when a value is missing, keep it as None type.
Apply the function to fill missing values in the chatbot_name column.
Replace any remaining missing values with Unknown.
Handle negative values in the word_count column:
NaN (Not a Number).Calculate the median word count:
Fill missing values:
NaN values in the word_count column.Standardize data type:
word_count column to integers.Calculate the mode of non-null sentiments:
Fill missing values:
Calculate quartiles of the user_age column:
Calculate the Interquartile Range (IQR):
Determine the bounds for outliers:
Identify outliers:
outliers.Define age limits for the user_age column:
Clean age data outside the defined limits:
Handle missing age values:
Standardize age data type:
user_age column to integers.Note : Use
df.head().Tfor viewing yourdf.df.head().Tprovides a compact way to view the initial rows as columns, making it easier to scan the data horizontally.
Standardize the date column:
date column to datetime format.Sort the dataset:
date in ascending order (earliest to latest).response_time in descending order.Remove duplicate entries:
user_id and date columns.Reset the dataframe index:
Identify and remove duplicate entries:
chat_id column as the basis for identifying duplicates.Update the dataframe:
Create a composite key:
message, topic, and language columns.message_context.Remove duplicate entries:
message_context column to identify duplicates.Clean up temporary data:
message_context column.Update the dataframe:
Define subscription hierarchy:
'enterprise': 3, 'premium': 2, 'basic': 1, 'free': 0Create a temporary numeric column subscription rank sub_rank:
subscription_type to its corresponding numeric value.Sort the data:
Keep only the most recent or premium subscription:
user_idClean up:
sub_rank columnModify the topic column by capitalizing the first letter of each word.
last_login column to datetime format.login_month.chatbot_name column.chatbot_initials.'Short''Medium''Long'session_duration column.duration_category.date column to a string format '%Y%m%d'.chat_id, separating them with an underscore.full_chat_id.'Unknown': for missing or negative values'Under 18': for ages below 18'Young Adult': for ages 18-29'Adult': for ages 30-49'Senior': for ages 50 and aboveuser_age column.age_group.Task: Create a function to map language codes to full language names and generate a new user_location column in a DataFrame.
Define a function that:
'en' (English), 'es' (Spanish), 'fr' (French), 'de' (German), and 'it' (Italian).'Other' for any code not in the dictionary.Apply this function to the language column to get full language names.
Create a new user_location column by combining user_country and the full language name in the format Country (Language).
Handle missing or unknown data:
'Unknown' and the language is 'Other' with 'Unknown'.Ensure the original language column remains unchanged.
Remove negative values from the response_time column:
Handle outliers using the 99th percentile:
Improve precision:
Create a list of valid language codes:
'en', 'es', 'fr', 'de', and 'it'.Create a function to clean and validate language codes. This function should:
'other' if the code is not in the valid list.'unknown' if the code is missing or NaN.Apply this cleaning function to the language column.
Create a function to clean accuracy_scores. This function should:
NaN.Apply this cleaning function to the accuracy_score column.
Create a list of valid device types:
'mobile', 'tablet', and 'desktop'.Create a function to clean device types. This function should:
'unknown'.'other'.Apply this cleaning function to the device_type column.
Create a function to clean the is_returning_user values. This function should:
NaN.'true', 'yes', or '1'.Apply this cleaning function to the is_returning_user column.
Convert the entire column to boolean type:
Fill any remaining NaN values with False.
Create a list of valid sentiment categories:
positive, neutral, and negative.Create a function to normalize sentiment values. This function should:
'unknown'.'unknown'.Apply this cleaning function to the sentiment column.
Create a dictionary mapping common country names and abbreviations to their standardized forms. Include the following:
'usa', 'us', and 'united states' for 'United States''uk', 'united kingdom', and 'great britain' for 'United Kingdom''germany' for 'Germany''france' for 'France''italy' for 'Italy''spain' for 'Spain''canada' for 'Canada''australia' for 'Australia'Create a function to clean country names. This function should:
'Unknown'.Apply this cleaning function to the user_country column.
Calculate the Interquartile Range (IQR):
Define bounds for outliers:
Create a function to handle outliers. This function should:
Apply this function to the session_duration column to remove extreme outliers.
Extract the day of the week from the date column:
chat_day_of_week.Create a list defining the order of days:
Count the number of chats for each day of the week:
chat_counts.Calculate the days between date and last_login:
last_login from date.days_since_last_login.Handle potential data errors:
days_since_last_login are non-negative.Extract the week number from the date column:
week_number.Group the data by week number and count chats:
week_number column.weekly_chats.weekly_chats has columns for week number and chat count.Calculate the running total of chats:
weekly_chats.