Pandas Capstone Project: Analyzing Covid Data

In this project you'll apply the Pandas data analysis techniques covered so far, including statistical analysis, answering questions from data, filtering (using boolean and comparison operators), creating new columns, plotting, and more, all on a dataset containing information about the COVID-19 pandemic.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Read CSV File

Read the covid.csv file into a dataframe named df, using the first column as the index column.
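As a rough illustration of the idea, the sketch below reads CSV text and uses its first column as the index. The inline sample data is made up for the example and only stands in for covid.csv, whose actual contents are not shown here.

```python
import io
import pandas as pd

# Hypothetical stand-in for covid.csv; the first (unnamed) column holds row labels.
csv_text = """,location,total_cases
0,India,100.0
1,China,250.0
"""

# index_col=0 tells read_csv to use the first column as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df)
```

With a file on disk, the same call would take the file path (for example "covid.csv") in place of the StringIO object.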

multiplechoice

Select the correct shape

Choose the correct shape for the df dataframe.

multiplechoice

Select the correct datatype

Choose the correct data type. There can be multiple correct answers.
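One possible way to inspect a dataframe's shape and column data types is sketched below. The small dataframe is invented for illustration; only the column names are borrowed from the activities.

```python
import pandas as pd

# Toy data (an assumption); the real dataset has many more rows and columns.
df = pd.DataFrame({
    "location": ["India", "China", "Zimbabwe"],
    "total_cases": [100.0, 250.0, 40.0],
    "population": [1000, 2000, 300],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(df.dtypes)           # one dtype per column
```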

multiplechoice

Find the minimum and maximum values

Select the minimum and maximum values of the total_cases column in the COVID-19 dataset stored in the df dataframe.

multiplechoice

Total cases in the COVID-19 dataset

Select the total number of cases in the COVID-19 dataset using the total_cases column in the df dataframe.

multiplechoice

Find the mean cases per day

Compute the mean number of new cases per day in the COVID-19 dataset and select the correct answer. The answer is rounded to two decimal places.
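The summary statistics asked for in the last three activities could be computed along these lines. This is a minimal sketch on made-up numbers, assuming the columns are named total_cases and new_cases as in the activity text.

```python
import pandas as pd

# Toy values (an assumption), not figures from the real dataset.
df = pd.DataFrame({
    "total_cases": [10.0, 30.0, 60.0],
    "new_cases": [10.0, 20.0, 30.0],
})

min_cases = df["total_cases"].min()
max_cases = df["total_cases"].max()
sum_cases = df["total_cases"].sum()
mean_new_cases = round(df["new_cases"].mean(), 2)  # rounded to two decimals

print(min_cases, max_cases, sum_cases, mean_new_cases)
```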

codevalidated

Select values from a dataframe using indexing

Create a new dataframe named df1 that contains only the continent and location columns from the df dataframe.
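Selecting a subset of columns is commonly done by indexing with a list of column names. The sketch below shows this on invented rows; it is one way to do it, not necessarily the project's reference solution.

```python
import pandas as pd

# Toy rows (an assumption) with the column names used in the activity.
df = pd.DataFrame({
    "continent": ["Asia", "Africa"],
    "location": ["India", "Zimbabwe"],
    "total_cases": [100.0, 40.0],
})

# A list of names inside the brackets returns a dataframe with only those columns.
df1 = df[["continent", "location"]]
print(df1)
```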

codevalidated

Drop columns from the dataframe

Drop the iso_code, new_cases_smoothed, new_deaths_smoothed, total_cases_per_million, new_cases_per_million, new_cases_smoothed_per_million, total_deaths_per_million, new_deaths_per_million, and new_deaths_smoothed_per_million columns from the df dataframe.
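Dropping several columns at once might look like the sketch below. It uses a toy dataframe with only two of the listed columns, and it reassigns the result rather than modifying in place, which is a choice made for this example.

```python
import pandas as pd

# Toy data (an assumption) containing two of the columns named above.
df = pd.DataFrame({
    "iso_code": ["IND", "ZWE"],
    "location": ["India", "Zimbabwe"],
    "new_cases_smoothed": [1.5, 0.5],
})

# drop(columns=...) returns a new dataframe without the listed columns.
df = df.drop(columns=["iso_code", "new_cases_smoothed"])
print(df.columns.tolist())
```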

codevalidated

Add more rows to a dataframe

Add a new row to the df dataframe with the following values:

new_data = {'continent': ['Africa'], 'location': ['Zimbabwe'], 'date': ['2022-12-07'], 'total_cases': [259356.0], 'new_cases': [192.0], 'total_deaths': [5622.0], 'new_deaths': [2.0], 'population_density': [42.729], 'median_age': [19.6], 'aged_65_older': [2.822], 'aged_70_older': [1.845], 'gdp_per_capita': [1899.767], 'cardiovasc_death_rate': [307.846], 'diabetes_prevalence': [1.85], 'life_expectancy': [61.55], 'population': [16320539.0]}
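One common way to append rows is to wrap the new values in a dataframe and concatenate. The sketch below uses a shortened version of new_data and a toy starting dataframe; whether the exercise expects the index to be renumbered (ignore_index=True) is an assumption.

```python
import pandas as pd

# Toy starting dataframe (an assumption).
df = pd.DataFrame({"location": ["India"], "total_cases": [100.0]})

# Shortened version of the new_data dictionary, for illustration only.
new_data = {"location": ["Zimbabwe"], "total_cases": [259356.0]}

# Build a one-row dataframe and stack it under the existing rows.
new_row = pd.DataFrame(new_data)
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```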
codevalidated

Update a specific cell value in the COVID-19 dataset

Update the value of the total_cases column for the row with index 166620 to 259357.0 in the df dataframe.

codevalidated

Update multiple cell values in the COVID-19 dataset

Update the values of the total_cases column for the rows with index 166620 and 166621 to 259357.0 and 259358.0 respectively.
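Updating one cell or several cells by index label can be sketched with .loc as follows. The index labels 10, 11, and 12 and the values are placeholders chosen for this example, not the ones from the dataset.

```python
import pandas as pd

# Toy dataframe with explicit index labels (an assumption).
df = pd.DataFrame({"total_cases": [1.0, 2.0, 3.0]}, index=[10, 11, 12])

# Single cell: row label first, then column name.
df.loc[10, "total_cases"] = 5.0

# Several cells: a list of row labels paired with a list of new values.
df.loc[[11, 12], "total_cases"] = [6.0, 7.0]
print(df)
```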

codevalidated

Remove rows from the dataframe

Remove the rows with index 166620 and 166621 from the dataframe.
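Removing rows by index label could be done with drop, as in this small sketch. The labels are placeholders, and reassigning the result (instead of using inplace=True) is just the style chosen here.

```python
import pandas as pd

# Toy dataframe with explicit index labels (an assumption).
df = pd.DataFrame({"total_cases": [1.0, 2.0, 3.0]}, index=[10, 11, 12])

# drop(index=...) removes the rows whose labels are listed.
df = df.drop(index=[11, 12])
print(df)
```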

codevalidated

Use `.loc` to select rows based on a condition

Select all the rows from the dataframe where the total_cases column is greater than 1000000.0. Store the result in a variable named df_1m.

codevalidated

Select specific columns and rows

Select the total_cases and total_deaths columns for the rows with index 5168, 5172 and 163703. Store the result in a variable named df_cases_death.
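The two selection patterns above (rows by condition, then specific rows and columns by label) can be illustrated with .loc on toy data. The index labels and values below are invented, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data with explicit index labels (an assumption).
df = pd.DataFrame(
    {
        "location": ["A", "B", "C"],
        "total_cases": [500.0, 2_000_000.0, 1_500_000.0],
        "total_deaths": [5.0, 20_000.0, 9_000.0],
    },
    index=[3, 7, 9],
)

# Rows where a condition holds: pass a boolean Series to .loc.
over_1m = df.loc[df["total_cases"] > 1_000_000.0]

# Specific rows and columns: a list of labels for each axis.
cases_and_deaths = df.loc[[3, 9], ["total_cases", "total_deaths"]]
print(over_1m)
print(cases_and_deaths)
```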

codevalidated

Sort COVID-19 data in ascending order

Sort the dataframe in ascending order of the total_cases column. Store the result in a variable named df_sorted.

codevalidated

Sort COVID-19 data in descending order

Sort the dataframe in descending order of the total_cases column. Store the result in a variable named df_sorted_desc.

codevalidated

Sort the COVID-19 data by multiple columns

Sort the dataframe in descending order of the total_cases column and then in ascending order of the total_deaths column. Store the result in a variable named df_sorted_multi.
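The three sorting activities all build on sort_values. The sketch below shows ascending, descending, and multi-column sorting on made-up numbers; the result variable names are hypothetical.

```python
import pandas as pd

# Toy data (an assumption) with a tie in total_cases to show the second sort key.
df = pd.DataFrame({
    "total_cases": [5.0, 9.0, 9.0, 1.0],
    "total_deaths": [2.0, 4.0, 3.0, 0.0],
})

ascending = df.sort_values("total_cases")                    # smallest first
descending = df.sort_values("total_cases", ascending=False)  # largest first

# One ascending flag per column: cases descending, deaths ascending within ties.
multi = df.sort_values(["total_cases", "total_deaths"], ascending=[False, True])
print(multi)
```

Note that sort_values returns a new dataframe and keeps the original index labels, so the index shows where each row came from.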

codevalidated

Add new columns using arithmetic operations

Create a new column named total_cases_per_million in the dataframe df by dividing the total_cases column by the population column.

codevalidated

Using vectorized operations to update a column

Update the total_cases_per_million column in the dataframe df by multiplying it by 1000.

codevalidated

Remove columns using the `del` statement

Remove the total_cases_per_million column from the df dataframe.
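The last three activities (derive a column, rescale it, delete it) can be sketched end to end as below. The data is invented, and the column name cases_ratio is a hypothetical stand-in used to keep the example neutral.

```python
import pandas as pd

# Toy data (an assumption).
df = pd.DataFrame({
    "total_cases": [100.0, 50.0],
    "population": [400.0, 100.0],
})

# Arithmetic between columns is vectorized: it applies row by row without a loop.
df["cases_ratio"] = df["total_cases"] / df["population"]

# Updating a column in place with a scalar works the same way.
df["cases_ratio"] = df["cases_ratio"] * 1000
scaled = df["cases_ratio"].tolist()

# del removes the column from the dataframe.
del df["cases_ratio"]
print(scaled, df.columns.tolist())
```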

codevalidated

Rename columns

Rename the total_cases column to Total Cases and the total_deaths column to Total Deaths.
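Renaming columns is typically done with rename and a mapping from old names to new names. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data (an assumption).
df = pd.DataFrame({
    "location": ["India"],
    "total_cases": [100.0],
    "total_deaths": [2.0],
})

# Columns not mentioned in the mapping keep their names.
df = df.rename(columns={"total_cases": "Total Cases", "total_deaths": "Total Deaths"})
print(df.columns.tolist())
```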

codevalidated

Filter COVID-19 data using boolean indexing

Create three dataframe objects named df_india, df_china, and df_greater_new_cases by filtering the df dataframe object using boolean indexing as follows:

  • For df_india, select all rows from the COVID-19 DataFrame where the location is either "India" or "China".

  • For df_china, select all rows from the COVID-19 DataFrame where the number of new_cases is between 100000 and 200000.

  • For df_greater_new_cases, select all rows from the COVID-19 DataFrame where the number of new_cases per day is greater than or equal to 10000.
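The three kinds of filters described above can be sketched with boolean masks. The rows are invented and the result names are hypothetical. Note that between is inclusive on both ends by default; whether the activity expects inclusive bounds is an assumption.

```python
import pandas as pd

# Toy data (an assumption).
df = pd.DataFrame({
    "location": ["India", "China", "Brazil", "India"],
    "new_cases": [150000.0, 50.0, 120000.0, 9000.0],
})

# Membership in a set of values.
by_location = df[df["location"].isin(["India", "China"])]

# Values within a range (inclusive by default).
in_range = df[df["new_cases"].between(100000, 200000)]

# Simple comparison.
at_least_10k = df[df["new_cases"] >= 10000]
print(by_location, in_range, at_least_10k, sep="\n\n")
```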

codevalidated

Read the data from the COVID-19 dataset for visualization

Read the data from the covid.csv file and store it in the df_for_visualization dataframe object. Also parse the date column as a datetime object.

codevalidated

Filter data by month

Filter the df_for_visualization dataframe object to select only the rows where the date is in March 2020 and the location is India. Store the filtered dataframe object in the df_for_plot variable.
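Parsing dates while reading and then filtering by month could look like the following. This is a sketch on inline sample text that stands in for covid.csv; the variable names df_vis and df_march_india are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for covid.csv.
csv_text = """location,date,new_cases
India,2020-02-28,0
India,2020-03-01,3
India,2020-03-15,10
China,2020-03-15,20
India,2020-04-01,100
"""

# parse_dates converts the listed columns to datetime while reading.
df_vis = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The .dt accessor exposes date parts; combine masks with & (each in parentheses).
is_march_2020 = (df_vis["date"].dt.year == 2020) & (df_vis["date"].dt.month == 3)
is_india = df_vis["location"] == "India"
df_march_india = df_vis[is_march_2020 & is_india]
print(df_march_india)
```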

multiplechoice

Create a line plot

Plot a line plot using the df_for_plot dataframe object. The x-axis should be the date column and the y-axis should be the new_cases column. Based on the plot, which of the following statements is true?

multiplechoice

Create a bar plot

Plot a bar plot using the df_for_plot dataframe object. The x-axis should be the date column and the y-axis should be the total_deaths column. Based on the plot, which of the following statements is true?
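The line and bar plots from the last two activities can be drawn with the dataframe's plot method, which relies on matplotlib. The sketch below uses invented values and selects the non-interactive Agg backend so it runs without a display; in a notebook that line would normally be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import pandas as pd

# Toy data (an assumption), shaped like the filtered dataframe.
df_plot = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "new_cases": [3, 5, 10],
    "total_deaths": [0, 1, 1],
})

# Each call creates its own figure and returns the matplotlib Axes.
line_ax = df_plot.plot(x="date", y="new_cases", kind="line")
bar_ax = df_plot.plot(x="date", y="total_deaths", kind="bar")
```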

Author

Anurag Verma

What's up, friends! 👋 I'm a computer science student about to finish my last year of college. 🎓 I LOVE writing code! ❤️ It makes me so happy! 😄 Whether I'm goofing in notebooks 📓 or coding in Python 🐍, writing programs is a blast! 💥 When I'm not geeking out over AI 🤖 with my classmates or building neural networks, 🧠 you can find me buried in statistics textbooks. 📚 I know, what a nerd! 🤓 I'm always down to learn new ways to speak human 🫂 and computer 💻. Making tech more fun is my jam! 🍇 If you want a cheery data buddy 😎 who can make difficult things easy-peasy 🥝 and learning a party 🎉, I'm your guy! 🙋‍♂️ Let's chat codes 👨‍💻, numbers 🧮, and machines 🤖 over coffee! ☕ I'd love to meet more techy humans. 💁‍♂️ Can't wait to talk! 🗣️

This project is part of

Intro to Pandas for Data Analysis
