All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
Using the pandas DataFrame df, find the video with the highest number of views. Extract its title, channel name, and view count. Store this information in the max_views_video variable as a pandas Series. This analysis will reveal the single most popular video in the dataset, providing insights into what content has achieved the highest level of viewer engagement.
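One possible sketch, assuming df is already loaded and the relevant columns are named title, channel_title, and views:

# idxmax() returns the index label of the row with the most views
max_views_video = df.loc[df['views'].idxmax(), ['title', 'channel_title', 'views']]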
Group the videos by their category_id and count how many videos are in each category. Then, calculate the mean number of views for each category. Combine these results into a single DataFrame called category_analysis with columns Video Count and Mean Views. Use pd.concat() to merge the results along axis=1 (columns). This analysis will reveal which categories are most prevalent in the dataset and which tend to attract the most views.
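A sketch of one way to combine the two aggregates with pd.concat():

import pandas as pd

# one Series per aggregate; concat aligns them on category_id
video_count = df.groupby('category_id').size().rename('Video Count')
mean_views = df.groupby('category_id')['views'].mean().rename('Mean Views')
category_analysis = pd.concat([video_count, mean_views], axis=1)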
Create a new column likes_ratio by dividing likes by views for each video. Then, group the data by channel_title and calculate the mean and maximum likes ratio for each channel. Store these results in channel_engagement, sorting channels by their mean likes ratio in descending order. This analysis will reveal which channels are most effective at generating likes relative to their view counts, potentially indicating higher audience engagement or content quality.
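A minimal sketch of this step:

df['likes_ratio'] = df['likes'] / df['views']
channel_engagement = (df.groupby('channel_title')['likes_ratio']
                        .agg(['mean', 'max'])
                        .sort_values('mean', ascending=False))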
Calculate the mean number of views for each video category and store it in a new column category_mean_views. Then, compare each video's view count to its category average. Create a new column popularity_tier that labels videos as Above Average if their views exceed their category mean, and Below Average otherwise. This analysis will help identify which videos are outperforming others in their category.
Note: the new columns are added at the end of the df DataFrame.
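A possible sketch using groupby().transform() and numpy.where:

import numpy as np

df['category_mean_views'] = df.groupby('category_id')['views'].transform('mean')
df['popularity_tier'] = np.where(df['views'] > df['category_mean_views'],
                                 'Above Average', 'Below Average')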
Convert publish_time to datetime, then extract the hour into a new publish_hour column. Create a cross-tabulation of video categories and publishing hours, storing the result in category_hour_crosstab. Ensure the final cross-tab has no missing values (fill them with 0) and uses an integer data type.
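A minimal sketch, assuming publish_time is in a format pd.to_datetime can parse:

import pandas as pd

df['publish_time'] = pd.to_datetime(df['publish_time'])
df['publish_hour'] = df['publish_time'].dt.hour
category_hour_crosstab = (pd.crosstab(df['category_id'], df['publish_hour'])
                            .fillna(0)
                            .astype(int))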
Split the tags column, which contains pipe-separated tags for each video, into a list of individual tags. Store this list in a new column called tags_list in the original DataFrame. Then, create a new DataFrame tags_df by exploding the tags_list column, so each tag gets its own row. Count the occurrences of each unique tag and find the 10 most common ones. Store these in common_tags. This analysis will reveal which tags are most frequently used by popular YouTube videos.
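One way this might look:

df['tags_list'] = df['tags'].str.split('|')
tags_df = df.explode('tags_list')  # one row per individual tag
common_tags = tags_df['tags_list'].value_counts().head(10)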
Do disabled features affect popularity?
Group videos based on whether comments and ratings are enabled or disabled. Calculate the average views, likes, and dislikes for each group. Store the results in disabled_impact. The comments_disabled and ratings_disabled columns indicate whether these features are turned off for each video. This analysis will reveal how disabling user interactions affects key engagement metrics.
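A minimal sketch, grouping on both flags at once:

disabled_impact = (df.groupby(['comments_disabled', 'ratings_disabled'])
                     [['views', 'likes', 'dislikes']]
                     .mean())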
Convert the trending_date column to a datetime format, then extract the day of the week for each trending date. Store this in a new column called trending_day in the original DataFrame. Count how many videos trend on each day of the week. The trending_date column represents when a video appeared on the trending list. Store the results in trending_day_counts. This analysis will reveal if there's a pattern in which days videos are more likely to become trending.
Note: Dates in trending_date are formatted as '%y.%d.%m' (e.g., '23.15.07' for July 15, 2023). Also, use the value_counts() method to count the values, as a different method might return the results in a different order.
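A minimal sketch using the '%y.%d.%m' format given above:

import pandas as pd

df['trending_date'] = pd.to_datetime(df['trending_date'], format='%y.%d.%m')
df['trending_day'] = df['trending_date'].dt.day_name()
trending_day_counts = df['trending_day'].value_counts()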
Divide the range of view counts into five equal-width bins and assign each video to one of these categories: 'Very Low', 'Low', 'Medium', 'High', or 'Very High'. Store the results in a new column called views_category at the end of the df DataFrame. This categorization will help you understand the distribution of video popularity and identify which videos fall into different viewership tiers.
Note: the new column is added at the end of the df DataFrame.
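A minimal sketch; pd.cut with an integer bins argument produces equal-width bins:

import pandas as pd

df['views_category'] = pd.cut(df['views'], bins=5,
                              labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])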
Divide videos into four equal-sized groups (quartiles) based on their number of likes, labeling them Q1 to Q4 in a new column likes_quartile. Then, calculate the average number of views for each quartile. Store the results in likes_analysis and sort them in descending order. This analysis will reveal if videos with more likes tend to have higher view counts.
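A minimal sketch; pd.qcut produces equal-sized (quantile-based) groups:

import pandas as pd

df['likes_quartile'] = pd.qcut(df['likes'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
likes_analysis = df.groupby('likes_quartile')['views'].mean().sort_values(ascending=False)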
Extract the hour from the publish_time column (the publish_hour column was already created in Activity 5) and categorize it into time periods: Night (9PM-4AM), Morning (5AM-11AM), Afternoon (12PM-4PM), and Evening (5PM-8PM). Create a new column publish_time_category in the original DataFrame with these labels. Then, calculate the average views for videos published in each time category, storing the results in time_analysis. Sort the time_analysis results in descending order of average views. This analysis will reveal if certain publishing times tend to result in higher view counts.
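One possible sketch; the helper name categorize_hour is illustrative:

def categorize_hour(hour):
    # period boundaries follow the activity text
    if 5 <= hour <= 11:
        return 'Morning'
    elif 12 <= hour <= 16:
        return 'Afternoon'
    elif 17 <= hour <= 20:
        return 'Evening'
    return 'Night'  # 9PM-4AM

df['publish_time_category'] = df['publish_hour'].apply(categorize_hour)
time_analysis = df.groupby('publish_time_category')['views'].mean().sort_values(ascending=False)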
Create a function that bins videos into 'Low' (0-100), 'Medium' (101-1000), 'High' (1001-10000), or 'Viral' (>10000) categories based on their comment count. Apply this function to the comment_count column, which represents the number of comments each video received. Store the results in a new comment_category column. Then, calculate the average views for each comment category, storing the results in comment_analysis. Sort the comment_analysis results in descending order of average views. This analysis will reveal if videos with more comments tend to have higher view counts.
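A sketch of one such function (the name categorize_comments is illustrative):

def categorize_comments(count):
    if count <= 100:
        return 'Low'
    elif count <= 1000:
        return 'Medium'
    elif count <= 10000:
        return 'High'
    return 'Viral'

df['comment_category'] = df['comment_count'].apply(categorize_comments)
comment_analysis = df.groupby('comment_category')['views'].mean().sort_values(ascending=False)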
Using the likes_ratio column created in Activity 3, create a new column called likes_ratio_category by dividing it into five equal-sized groups. Label these groups as Very Low, Low, Medium, High, and Very High. Then, calculate the average number of views for each category and sort the results in descending order. Store the final result in a variable called ratio_analysis.
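A minimal sketch, again using pd.qcut for equal-sized groups:

import pandas as pd

df['likes_ratio_category'] = pd.qcut(df['likes_ratio'], q=5,
                                     labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
ratio_analysis = df.groupby('likes_ratio_category')['views'].mean().sort_values(ascending=False)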
Calculate the length of each video title and store it in a new title_length column. Then categorize it as 'Short' (0-30 characters), 'Medium' (31-60 characters), or 'Long' (over 60 characters). Store these categories in a new title_length_category column. The title column contains the name of each video. Compute the average number of views for videos in each title length category, storing the results in title_analysis. Sort the title_analysis results in descending order of average views. This analysis will help determine if the length of a video's title has any correlation with its viewership.
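A minimal sketch using pd.cut with explicit character boundaries:

import pandas as pd

df['title_length'] = df['title'].str.len()
df['title_length_category'] = pd.cut(df['title_length'],
                                     bins=[0, 30, 60, float('inf')],
                                     labels=['Short', 'Medium', 'Long'],
                                     include_lowest=True)
title_analysis = df.groupby('title_length_category')['views'].mean().sort_values(ascending=False)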
Count the words in each video's description and store the count in a new description_word_count column. Before counting words, replace any NaN values in the description column with an empty string. Categorize the word counts into four equal-sized groups: Brief, Moderate, Detailed, and Very Detailed. Store these categories in a new description_detail column. The description column contains the text description for each video. Then, calculate the average views for videos in each description length category, storing the results in description_analysis. Sort the description_analysis results in descending order of average views. This analysis will reveal if more detailed descriptions correlate with higher view counts.
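A possible sketch (note that pd.qcut may need duplicates='drop' if the word counts produce duplicate bin edges, in which case the four labels would need adjusting):

import pandas as pd

df['description_word_count'] = df['description'].fillna('').str.split().str.len()
df['description_detail'] = pd.qcut(df['description_word_count'], q=4,
                                   labels=['Brief', 'Moderate', 'Detailed', 'Very Detailed'])
description_analysis = df.groupby('description_detail')['views'].mean().sort_values(ascending=False)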
Create a DataFrame category_map with columns category_id and category_name that links category IDs to their corresponding names. The category names should be: 'Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', and 'Nonprofits & Activism'. Perform a left merge of this mapping with the main dataset using the category_id column as the key. Store the result in df_with_categories. This process adds a new category_name column to the dataset, providing descriptive category names alongside the numeric category IDs.
Use the mapping below for category_id to category_name:
category_mapping = {
'category_id': [1, 2, 10, 15, 17, 20, 22, 23, 24, 25, 26, 27, 28, 29],
'category_name': ['Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals', 'Sports', 'Gaming', 'People & Blogs', 'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style', 'Education', 'Science & Technology', 'Nonprofits & Activism']
}
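With that mapping dict in scope, a minimal sketch:

import pandas as pd

category_map = pd.DataFrame(category_mapping)
df_with_categories = df.merge(category_map, on='category_id', how='left')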
No video should appear twice! Remove any duplicate rows from the dataset, keeping only the first occurrence of each unique video. The video_id column contains a unique identifier for each video.
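One way to do it, deduplicating on video_id:

df = df.drop_duplicates(subset='video_id', keep='first')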
Set a random seed of 42 for reproducibility. Create a sample DataFrame channel_metadata with extra information for the first 10 unique channels, including the columns channel_title, subscriber_count, and channel_age_years. Generate random subscriber counts between 100,000 and 10,000,000, and random channel ages between 1 and 15 years. Perform a left merge of this data with the main dataset based on the channel_title column, which represents the name of each YouTube channel. Store the result in df_enriched. This process demonstrates how to combine data from multiple sources, simulating the enrichment of our dataset with external channel information to gain more comprehensive insights into video performance in relation to channel characteristics.
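A possible sketch, assuming numpy's random generator for the synthetic values:

import numpy as np
import pandas as pd

np.random.seed(42)
channels = df['channel_title'].unique()[:10]
channel_metadata = pd.DataFrame({
    'channel_title': channels,
    'subscriber_count': np.random.randint(100_000, 10_000_001, size=len(channels)),
    'channel_age_years': np.random.randint(1, 16, size=len(channels)),
})
df_enriched = df.merge(channel_metadata, on='channel_title', how='left')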
Split the dataset by trending_date, storing the groups in a list. Concatenate these groups using the unique trending_date values as keys. Reset the index to make trending_date a regular column again. Store the result in df_concat. The trending_date column shows when a video trended on YouTube. You'll have this date in two columns; keep both for now.
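A sketch of one approach; the level name trending_date_key is a hypothetical label used to avoid a duplicate column name:

import pandas as pd

dates = df['trending_date'].unique()
groups = [df[df['trending_date'] == d] for d in dates]
# the keys become the first level of a MultiIndex
df_concat = pd.concat(groups, keys=dates)
# name the key level so reset_index turns it into a regular column
df_concat.index = df_concat.index.set_names('trending_date_key', level=0)
df_concat = df_concat.reset_index(level='trending_date_key')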
Create a new DataFrame performance_data with columns video_id and engagement_rate (calculated as (likes + dislikes + comment_count) / views * 100). Then, left merge this data with the main dataset using the video_id column as the key. Store the result in df_with_performance.
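A minimal sketch of this step:

import pandas as pd

performance_data = pd.DataFrame({
    'video_id': df['video_id'],
    'engagement_rate': (df['likes'] + df['dislikes'] + df['comment_count'])
                       / df['views'] * 100,
})
df_with_performance = df.merge(performance_data, on='video_id', how='left')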
Clean the channel_title column by creating a new column clean_channel_title with extra spaces removed and text converted to lowercase. Then, group the data by these cleaned channel names and calculate: the number of videos, average views, average likes, and average dislikes. Store the results in a new DataFrame called channel_stats. Reset the index after grouping and rename the columns to channel, video_count, avg_views, avg_likes, and avg_dislikes. This process creates a summary of channel performance for comparison.
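A possible sketch, assuming "extra spaces" means both surrounding and repeated internal whitespace:

df['clean_channel_title'] = (df['channel_title']
                             .str.replace(r'\s+', ' ', regex=True)
                             .str.strip()
                             .str.lower())
channel_stats = (df.groupby('clean_channel_title')
                   .agg(video_count=('video_id', 'size'),
                        avg_views=('views', 'mean'),
                        avg_likes=('likes', 'mean'),
                        avg_dislikes=('dislikes', 'mean'))
                   .reset_index()
                   .rename(columns={'clean_channel_title': 'channel'}))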
Create a function to analyze sentiment in titles. Define lists of positive words (amazing, awesome, best, great, love, perfect, top) and negative words (worst, terrible, bad, awful, hate, disappointing). Extract words from lowercase titles using regex. Check if any positive or negative words are present. Assign sentiment based on these conditions: 'Neutral' if both positive and negative words exist or neither exists, 'Positive' if only positive words are found, and 'Negative' if only negative words are present. Apply this function to the title column to create a new title_sentiment column.
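A sketch of one such function (the name analyze_sentiment is illustrative):

import re

positive_words = {'amazing', 'awesome', 'best', 'great', 'love', 'perfect', 'top'}
negative_words = {'worst', 'terrible', 'bad', 'awful', 'hate', 'disappointing'}

def analyze_sentiment(title):
    words = set(re.findall(r'\w+', title.lower()))
    has_positive = bool(words & positive_words)
    has_negative = bool(words & negative_words)
    if has_positive == has_negative:  # both present, or neither
        return 'Neutral'
    return 'Positive' if has_positive else 'Negative'

df['title_sentiment'] = df['title'].apply(analyze_sentiment)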
Select the numerical columns views, likes, dislikes, and comment_count. Create a new DataFrame called df_log by applying a logarithmic transformation (log(x+1)) to these columns. This transformation helps to normalize the distribution of these metrics, which often have a wide range of values. The resulting df_log DataFrame will allow for easier visualization and comparison of these metrics across different scales.
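A minimal sketch; np.log1p computes log(x + 1) elementwise:

import numpy as np

numeric_cols = ['views', 'likes', 'dislikes', 'comment_count']
df_log = np.log1p(df[numeric_cols])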