Data Wrangling with Pandas

What makes a book beloved by readers? Is it the gripping plot, the unforgettable characters, or perhaps the profound themes it tackles? This project sets out to uncover the hidden patterns that define the literary world, using a dataset of popular and highly acclaimed books across various genres and time periods.

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

Some quick trivia before we dive head first into it!

Notice how the result of the first exercise was in full date format? Let's create a column that extracts the year from the `PublishDate` column. Having a `year_published` column improves the clarity and readability of the dataset, making it easier for users to understand and interpret the data.

Make sure the column has the correct data type before extracting the year.
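
As a rough illustration of the type check and year extraction, the sketch below uses a tiny made-up DataFrame; the sample titles and date strings are assumptions, and the real dataset's date format may differ.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "PublishDate": ["2008-09-14", "1999-01-05", "not a date"],
})

# Convert to datetime first; errors="coerce" turns unparseable values into NaT.
df["PublishDate"] = pd.to_datetime(df["PublishDate"], errors="coerce")

# The .dt accessor only works on datetime columns.
# The nullable Int64 dtype keeps missing years as <NA> instead of floats.
df["year_published"] = df["PublishDate"].dt.year.astype("Int64")

print(df)
```

Coercing bad values to `NaT` is one possible choice; depending on the activity, it may be preferable to inspect or drop those rows instead.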

Who are the rockstars of the book world? Can you find the authors with the highest average ratings? Were your favorite authors on the list?

In this activity, you will calculate the average rating for each author: group the DataFrame by the `author` column, then compute the mean rating. Reset the index with `reset_index()` and save the result to a new variable, `author_avg_ratings`. Finally, sort the DataFrame in descending order by the `rating` column.
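
The group, average, reset, and sort steps could look roughly like this. The author names and ratings are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up ratings for illustration only.
df = pd.DataFrame({
    "author": ["Author A", "Author B", "Author A", "Author C"],
    "rating": [4.0, 4.5, 4.4, 3.9],
})

author_avg_ratings = (
    df.groupby("author")["rating"]
    .mean()                                   # average rating per author
    .reset_index()                            # turn the index back into a column
    .sort_values("rating", ascending=False)   # highest average first
)

print(author_avg_ratings)
```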

First, remove the text in parentheses from the `author` column. Some entries look like `Markus Zusak (Goodreads Author)`; clean them so that they read `Markus Zusak`. Then group the DataFrame by the `author` column and count the number of books associated with each author.
Reset the index and save your result to a variable named `author_book_count`.
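
One way to strip the parenthesized suffix is a regular expression, sketched below on toy rows. The book titles, the second author, and the `book_count` column name are hypothetical.

```python
import pandas as pd

# Toy rows; only the "Markus Zusak (Goodreads Author)" pattern comes from the task text.
df = pd.DataFrame({
    "title": ["Book 1", "Book 2", "Book 3"],
    "author": [
        "Markus Zusak (Goodreads Author)",
        "Markus Zusak",
        "Author B (Goodreads Author)",
    ],
})

# Remove any "(...)" group together with the whitespace before it, then trim.
df["author"] = (
    df["author"]
    .str.replace(r"\s*\([^)]*\)", "", regex=True)
    .str.strip()
)

# Count books per cleaned author name; "book_count" is an illustrative name.
author_book_count = (
    df.groupby("author")["title"]
    .count()
    .reset_index(name="book_count")
)

print(author_book_count)
```

Without the cleaning step, `Markus Zusak` and `Markus Zusak (Goodreads Author)` would be counted as two different authors.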

Calculate the mean price of books written by each author. Make sure the `price` column has the correct data type before grouping.
Reset the index and save your result to a variable named `author_avg_price`.
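
If `price` happens to be stored as text, it needs converting before averaging. The sketch below assumes string prices with an occasional non-numeric entry; the sample values are invented.

```python
import pandas as pd

# Invented sample: prices stored as strings, one of them unusable.
df = pd.DataFrame({
    "author": ["Author A", "Author A", "Author B", "Author B"],
    "price": ["5.00", "7.00", "3.50", "unknown"],
})

# Convert to a numeric dtype; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# mean() skips NaN by default, so Author B is averaged over one book.
author_avg_price = df.groupby("author")["price"].mean().reset_index()

print(author_avg_price)
```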

The BBE score is an indicator of the overall reader feedback and engagement with a particular book. What is the combined BBE score for each author, obtained by summing the BBE scores of all books written by that author?

Reset the index and save your result to a variable named `author_total_bbe_score`. Finally, sort the result in descending order.

Calculate the average number of pages per book for each author. Then organize this information into a table using the `reset_index()` method and sort it by average pages, with the highest at the top. Finally, select the top 10 authors from this sorted list to identify those who wrote the most pages on average. Save your final result in the variable `author_avg_pages`.
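
The sort-then-take-top-10 pattern can be sketched on synthetic data with twelve hypothetical authors, assuming `pages` is already numeric (in the real dataset it may first need `pd.to_numeric`).

```python
import pandas as pd

# Synthetic data: 12 hypothetical authors with two books each.
authors = [f"Author {i}" for i in range(1, 13)]
df = pd.DataFrame({
    "author": authors * 2,
    "pages": [100 * i for i in range(1, 13)] + [100 * i + 50 for i in range(1, 13)],
})

author_avg_pages = (
    df.groupby("author")["pages"]
    .mean()
    .reset_index()
    .sort_values("pages", ascending=False)  # longest average first
    .head(10)                               # keep only the top 10 authors
)

print(author_avg_pages)
```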

Count the number of books in each language. This will help us understand the distribution of books across different languages, providing insight into language-specific publishing trends and audience diversity.

Save your result in the variable `books_per_language` and reset the index.

Here, we will analyze the yearly distribution of book publications, which will reveal trends in the publishing industry and highlight significant years of activity or growth. But first, filter your dataset to include only years before 2022: the column has anomalous entries, such as dates in the future (e.g., 2027), that might be due to typos.

Save your result in the variable `books_per_year` and organize it into a table using the `.reset_index()` method.
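
A minimal sketch of the filter-then-count step, assuming the `year_published` column from the earlier activity already exists. The rows and the `book_count` column name are illustrative.

```python
import pandas as pd

# Toy rows, including one anomalous future year.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "year_published": [1999, 2008, 2008, 2027],
})

# Keep only years strictly before 2022.
valid = df[df["year_published"] < 2022]

# Count books per year; "book_count" is a hypothetical column name.
books_per_year = (
    valid.groupby("year_published")["title"]
    .count()
    .reset_index(name="book_count")
)

print(books_per_year)
```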

In this activity, your task is to identify the year with the fewest books published.

Filter the DataFrame to include only books written in English, then sort the filtered DataFrame by the `pages` column in descending order and select the top 10 entries.

Sort the dataset by the `PublishDate` column. Then select the top 10 records and create a new DataFrame containing only the `title` and `PublishDate` columns. Save this result to a variable named `oldest_books`.
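
Sorting by date and keeping two columns might be written as below. The titles and dates are placeholders, and the sketch assumes `PublishDate` has already been converted to a datetime type.

```python
import pandas as pd

# Placeholder rows; PublishDate is assumed to be datetime already.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "author": ["Author A", "Author B", "Author C"],
    "PublishDate": pd.to_datetime(["2008-09-14", "1813-01-28", "1960-07-11"]),
})

oldest_books = (
    df.sort_values("PublishDate")         # ascending by default, so oldest first
    .head(10)                             # at most the 10 earliest records
    .loc[:, ["title", "PublishDate"]]     # keep only the two requested columns
)

print(oldest_books)
```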

Which genres reign supreme in the reading world? Can you identify the most popular genres based on average ratings?
Begin by extracting a new column, `first_genre`, from the `genres` column. Then group the data by it to find the top five most popular genres based on ratings. Reset the index and save your result in a variable named `top_5_genres_by_rating`.
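
How `first_genre` is extracted depends on how `genres` is stored, which the task text does not say. The sketch below assumes a comma-separated string; if the column instead holds a stringified list, it would need parsing first (for example with `ast.literal_eval`). All sample values are invented.

```python
import pandas as pd

# ASSUMPTION: genres is a comma-separated string. Sample values are invented.
df = pd.DataFrame({
    "genres": ["Fantasy, Fiction", "Fantasy, Young Adult", "Classics, Fiction", "Classics"],
    "rating": [4.0, 4.4, 4.5, 4.1],
})

# Take the text before the first comma and trim surrounding whitespace.
df["first_genre"] = df["genres"].str.split(",").str[0].str.strip()

top_5_genres_by_rating = (
    df.groupby("first_genre")["rating"]
    .mean()
    .sort_values(ascending=False)  # best-rated genres first
    .head(5)
    .reset_index()
)

print(top_5_genres_by_rating)
```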

Is genre a key factor in book pricing? Calculate the average price for each genre to see if fantasy novels leave your wallet feeling fictional, or if self-help books offer more bang for your buck!

Sort your final result in descending order and save it in the variable `average_price_by_genre_sorted`. Organize this information into a table using the `.reset_index()` method.

Using the `first_genre` column, identify the most popular genre.

Which authors are most commonly associated with each genre? Begin by grouping the dataset by `first_genre` and `author`, then count the occurrences.
Afterwards, identify the top author for each genre based on frequency, sorting in descending order.
Save your final result to a variable named `sorted_genre_author_count` and organize this information into a table using the `.reset_index()` method.
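
One possible reading of these steps is sketched below: count rows per genre and author pair, sort by count, and keep the first row per genre. The data, the `count` column name, and the use of `drop_duplicates` to pick the top author are assumptions, not necessarily the intended solution.

```python
import pandas as pd

# Invented genre/author pairs for illustration.
df = pd.DataFrame({
    "first_genre": ["Fantasy", "Fantasy", "Fantasy",
                    "Classics", "Classics", "Classics", "Classics"],
    "author": ["Author A", "Author A", "Author B",
               "Author C", "Author C", "Author C", "Author D"],
})

# size() counts rows per (genre, author) pair.
genre_author_count = (
    df.groupby(["first_genre", "author"])
    .size()
    .reset_index(name="count")
)

# Sort by count so the most frequent author comes first,
# then keep only the first row seen for each genre.
sorted_genre_author_count = (
    genre_author_count
    .sort_values("count", ascending=False)
    .drop_duplicates(subset="first_genre")
    .reset_index(drop=True)
)

print(sorted_genre_author_count)
```

If two authors tie for the top spot in a genre, this approach keeps whichever happens to come first after sorting.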

Group the DataFrame by `first_genre` and then filter for books with ratings above 4.5.

Here, we'll calculate the average rating for each language to see if certain tongues tend to inspire higher praise (or criticism) from readers.

Reset the index of your final result and save it in a variable named `average_rating_by_language`.

Group the DataFrame by `language` and calculate the average book length (`pages`).
This explores potential differences in book length across languages.
Save your result in the variable `avg_book_length_by_language` and reset the index.

Group the DataFrame by `language` and split the data into `highly_rated_genres` (rating above 3.5) and `lowly_rated_genres` (rating of 3.5 or below) groups for each language in our data.

Analyze the distribution of genres within each group for each language. The goal is to explore potential connections between language, book rating, and genre preference.

This activity assesses the popularity of books by calculating the ratio of the percentage of users who liked the book (`LikedPercent`) to the total number of ratings (`NumRatings`).

Ranking the books by this ratio identifies those that are highly regarded relative to their reader engagement levels.

First, calculate the ratio and store it in a new column named `Liked_to_NumRatings_ratio`. Then rank the books by this ratio in descending order, creating a Rank column. Finally, sort the DataFrame by this `rank` and reset the index. Save your final result in the variable `df_ratio`.

Remember that `likedPercent` is given as a percentage. To obtain a ratio, `likedPercent` must be divided by 100 first.
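
The ratio and ranking steps could be sketched as follows. The rows are invented, and the column spellings (`LikedPercent`, `NumRatings`, `Rank`) follow one of the variants used in the task text, so they may need adjusting to match the real dataset.

```python
import pandas as pd

# Invented rows; column capitalization is an assumption.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "LikedPercent": [90.0, 80.0, 95.0],
    "NumRatings": [1000, 100, 10000],
})

# Convert the percentage to a fraction before dividing by the rating count.
df["Liked_to_NumRatings_ratio"] = (df["LikedPercent"] / 100) / df["NumRatings"]

# rank(ascending=False) assigns rank 1 to the largest ratio.
df["Rank"] = df["Liked_to_NumRatings_ratio"].rank(ascending=False)

df_ratio = df.sort_values("Rank").reset_index(drop=True)

print(df_ratio)
```

Note that a book with few ratings can score a high ratio, so this measure favors well-liked books with low engagement rather than the most-rated ones.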

Using the `df_ratio` calculated in the previous activity, identify the year that had the highest rating.
