Analyzing newborn given names in Argentina

This project starts with a great dataset that contains all the names given to newborns in Argentina between 1922 and 2015. Your job is to perform some data analysis to uncover the secrets and trends of naming in the Latin American country. Before analyzing the data, though, you'll need to do some data cleaning and wrangling. The activities then rely on groupby operations and visualizations.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

input

How many rows contain null values in the `name` column?

The first step is to identify null values in our dataframe. Let's start with the name column: count its null values and answer, how many np.nan values are there?
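As a minimal sketch of the counting step, the snippet below builds a tiny made-up dataframe and counts the missing entries in `name`. The `quantity` and `year` columns and the one-row-per-name-per-year layout are assumptions about the real dataset, not something this page confirms.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset (column layout is an assumption)
df = pd.DataFrame({
    "name": ["Maria", np.nan, "Juan", np.nan],
    "quantity": [10, 3, 7, 1],
    "year": [1922, 1922, 1923, 1923],
})

# isna() flags missing entries; summing the boolean mask counts them
null_names = int(df["name"].isna().sum())
print(null_names)  # 2 for this toy frame
```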

codevalidated

Drop any rows with null values, in place

Let's clean the dataframe now by removing any rows that have null values (in any columns). Perform the cleaning task in-place, that is, modifying the original df variable.
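One way to do the in-place cleanup is `DataFrame.dropna` with `inplace=True`. The sketch below runs it on made-up data; the column names other than `name` are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the first one has no nulls
df = pd.DataFrame({
    "name": ["Maria", np.nan, "Juan"],
    "quantity": [10.0, 3.0, np.nan],
    "year": [1922, 1922, 1923],
})

# Drop every row that has a null in any column, modifying df itself
df.dropna(inplace=True)
rows_left = len(df)
print(rows_left)  # 1 for this toy frame
```

Because `inplace=True` returns `None`, the result should not be assigned back to `df`.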

input

What's the most popular name from 1953?

input

What's the most popular name from 1992?
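A hedged sketch of how the most popular name for a given year could be looked up: filter the rows for that year, then take the row with the largest `quantity`. The helper `most_popular` is hypothetical, the data is made up, and the `quantity` and `year` columns are assumed.

```python
import pandas as pd

# Made-up rows; names and counts are illustrative only
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Ana"],
    "quantity": [5, 9, 2, 7],
    "year": [1953, 1953, 1953, 1992],
})

def most_popular(frame, year):
    """Return the name with the highest quantity in the given year."""
    rows = frame[frame["year"] == year]
    # idxmax() gives the index label of the row with the largest quantity
    return rows.loc[rows["quantity"].idxmax(), "name"]

top_1953 = most_popular(df, 1953)
top_1992 = most_popular(df, 1992)
```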

input

What's the least popular name from 1978?

If there are multiple names with the same quantity, enter the name with the highest index value.

input

What's the least popular name from 2007?

If there are multiple names with the same quantity, enter the name with the highest index value.
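The tie-breaking rule above can be sketched as follows: find the minimum quantity for the year, keep every row tied at that minimum, and pick the one with the largest index label. The data is made up and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for a single year; "Luis" and "Eva" tie for the lowest quantity
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Omar"],
    "quantity": [5, 1, 1, 4],
    "year": [1978, 1978, 1978, 1978],
})

rows = df[df["year"] == 1978]
lowest = rows["quantity"].min()
tied = rows[rows["quantity"] == lowest]

# Among the tied rows, take the one with the highest index value
least_popular = tied.loc[tied.index.max(), "name"]
```

A plain `idxmin()` would return the first tied row instead, which is why the tie is resolved explicitly here.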

input

How many people were born in the year 1950?

Input answer as an integer, without any commas or dots.

input

How many people were born in the year 1980?

Input answer as an integer, without any commas or dots.
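Assuming each row holds the number of babies given one name in one year, the total born in a year is the sum of `quantity` for that year. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows; quantities are illustrative only
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva"],
    "quantity": [5, 9, 20],
    "year": [1950, 1950, 1980],
})

# Sum quantities per year, then look up the year of interest
births_per_year = df.groupby("year")["quantity"].sum()
born_1950 = int(births_per_year.loc[1950])
born_1980 = int(births_per_year.loc[1980])
```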

multiplechoice

What's the Growth Rate of newborns from 1930 to 1990?

The growth rate measures the change from one period to another. In this case, we want to see the total change in the number of babies born between 1930 and 1990. Select the option that best matches the growth rate. Keep in mind that these are all approximate figures.
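The page does not spell out a formula, so the sketch below assumes the usual definition, (end - start) / start, applied to the yearly totals. The totals are made up and are not the real figures for 1930 or 1990.

```python
import pandas as pd

# Made-up yearly totals, indexed by year (not the real dataset values)
births_per_year = pd.Series({1930: 200, 1990: 500}, name="quantity")

start = births_per_year.loc[1930]
end = births_per_year.loc[1990]

# Assumed definition: relative change between the two periods
growth_rate = (end - start) / start
print(f"{growth_rate:.0%}")  # 150% for these made-up totals
```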

input

What's the year with the most babies born?

input

What's the year with the fewest babies born?
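Once the yearly totals exist, `idxmax` and `idxmin` return the index label (here, the year) of the largest and smallest value. A small sketch with made-up totals:

```python
import pandas as pd

# Made-up yearly totals, indexed by year
births_per_year = pd.Series({1922: 14, 1923: 3, 1924: 20}, name="quantity")

year_most = births_per_year.idxmax()   # year with the highest total
year_fewest = births_per_year.idxmin() # year with the lowest total
```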

codevalidated

Plot the number of babies born per year

Create a plot showing the total babies born per year. Use the fig and ax variables already defined, and create your plot on the axis ax. The plot must have the title "Number of babies born per year", and the y-axis should be formatted using a comma (,) as the thousands separator.

Your plot must match perfectly the figure that you see below:

Note: Plot activity checks are performed on a pixel-by-pixel basis, so your plot has to match perfectly what you see in the image above, including the values of the axis, labels, titles, etc.
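A rough sketch of the title and thousands-separator formatting, using matplotlib's `StrMethodFormatter`. In the activity `fig` and `ax` are already defined; here they are created locally, and the figure size, line style, and data are assumptions, so this will not reproduce the reference image pixel for pixel.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import StrMethodFormatter

# Made-up yearly totals, indexed by year
births_per_year = pd.Series({1922: 12000, 1923: 15500, 1924: 18250}, name="quantity")

fig, ax = plt.subplots()  # stand-ins for the predefined fig and ax
births_per_year.plot(ax=ax)
ax.set_title("Number of babies born per year")

# "{x:,.0f}" renders tick values such as 12000 as "12,000"
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:,.0f}"))
```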

codevalidated

Create a dataframe representing the 'uniqueness' of names

We want to analyze how original parents were when naming newborns across the years. To do so, we'll compare the number of unique names in each year to the total number of babies born that year. Uniqueness is then defined as: Total Unique Names / Total Newborns. For example, given the following baby names in a year:

John
Jane
Jane
Mary

We get a "uniqueness" score of 0.75 (3 unique names / 4 total babies).

Store your results in a new dataframe named unique_names_df. The dataframe should contain the columns Total Unique Names, Total Newborns and Uniqueness. It should be indexed by year, in ascending order.

It should look something like this:
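A minimal sketch of the uniqueness computation, reusing the John/Jane/Mary example. It assumes one row per name per year with a `quantity` count, so Jane appears once with a quantity of 2; the year 2000 is made up.

```python
import pandas as pd

# The worked example above, encoded as one row per name (assumed layout)
df = pd.DataFrame({
    "name": ["John", "Jane", "Mary"],
    "quantity": [1, 2, 1],
    "year": [2000, 2000, 2000],
})

grouped = df.groupby("year")
unique_names_df = pd.DataFrame({
    "Total Unique Names": grouped["name"].nunique(),
    "Total Newborns": grouped["quantity"].sum(),
})
unique_names_df["Uniqueness"] = (
    unique_names_df["Total Unique Names"] / unique_names_df["Total Newborns"]
)

# Index by year in ascending order
unique_names_df = unique_names_df.sort_index()
print(unique_names_df.loc[2000, "Uniqueness"])  # 0.75
```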

input

What's the year with the most 'variation' of names?

Using the dataframe created before, which year has the most variation of names? In other words, which year has the highest uniqueness score?

input

Which year had the least (lowest) 'variation' of names?

Similar to the previous activity, now answer: which was the year with the least uniqueness?

codevalidated

Create a visualization of the 'uniqueness' of names across the years

Using the dataframe unique_names_df, create a plot displaying the uniqueness of names across the years.

The title of your chart should be "Baby name uniqueness across the years" and it should contain the legend "Uniqueness of names" for the single series plotted. It should look like this:
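One possible way to get the required title and legend entry is to pass `label` when plotting and then call `ax.legend()`. The values below are made up, and the styling of the reference chart is not known, so treat this only as a sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up uniqueness scores, indexed by year
unique_names_df = pd.DataFrame(
    {"Uniqueness": [0.50, 0.20, 0.80]},
    index=pd.Index([1922, 1923, 1924], name="year"),
)

fig, ax = plt.subplots()
unique_names_df["Uniqueness"].plot(ax=ax, label="Uniqueness of names")
ax.set_title("Baby name uniqueness across the years")
ax.legend()  # picks up the label given to the single series

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```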

input

How many babies were named 'carlos'?

Juan Carlos, Jose Carlos, Giancarlos, or just plain Carlos...

Carlos is a very popular name in Spanish-speaking countries.

So, answer the following: how many people were named "Carlos" throughout history?

Warning! The following are all valid "Carlos", so be mindful about casing: Juan Carlos, Carlos, Giancarlos.

Input your answer as an integer, without any commas or dots.
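The casing warning suggests a case-insensitive substring match. A hedged sketch using `Series.str.contains` with `case=False` on made-up data (the `quantity` column is an assumption):

```python
import pandas as pd

# Made-up rows; "Carla" and "Diego" should not be counted
df = pd.DataFrame({
    "name": ["Juan Carlos", "Carlos", "Giancarlos", "Carla", "Diego"],
    "quantity": [4, 6, 1, 9, 3],
    "year": [1950, 1950, 1950, 1950, 1950],
})

# case=False matches "Carlos" and "carlos"; na=False treats missing names as non-matches
mask = df["name"].str.contains("carlos", case=False, na=False)
total_carlos = int(df.loc[mask, "quantity"].sum())
print(total_carlos)  # 11 for this toy frame
```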

input

What is the most popular 'Carlos' name?

Is it Carlos Alberto, Roberto Carlos, or just plain Carlos?

What's the most popular name containing carlos?
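A sketch of one reading of this question: filter the names containing "carlos", add up each name's quantity over all years, and take the largest. Summing across years is an assumption about what "most popular" means here, and the data is made up.

```python
import pandas as pd

# Made-up rows spanning two years
df = pd.DataFrame({
    "name": ["Juan Carlos", "Carlos", "Juan Carlos", "Carla"],
    "quantity": [4, 6, 5, 20],
    "year": [1950, 1950, 1951, 1951],
})

mask = df["name"].str.contains("carlos", case=False, na=False)

# Total per distinct name across all years, then the name with the largest total
totals = df[mask].groupby("name")["quantity"].sum()
top_carlos = totals.idxmax()
```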

codevalidated

The 'Diego' phenomenon

Diego Maradona was a renowned soccer (football) player from Argentina who played in the 1980s and 1990s. He was an absolute sensation in Argentina. We want to know whether he influenced the names given to newborns.

Create a Series containing the total number of babies named "Diego" per year, in any variation: "Diego Martin", "Diego Alejandro", or just "Diego". In this case, the match is case-sensitive: we don't want to count names that contain "diego" (in lowercase), only the names that contain the actual "Diego" name.

Create the aggregation and store the result in the series diegos_per_year_s. It should look something like:

year
1922    22
1923    21
1924    21
1925    33
1926    41
Name: quantity, dtype: int64
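The aggregation can be sketched with a case-sensitive `str.contains` (the default) followed by a groupby on `year`. The rows below are made up, including the invented name "Rodiego", which is there only to show that lowercase "diego" is not matched.

```python
import pandas as pd

# Made-up rows; "Rodiego" contains lowercase "diego" and must be excluded
df = pd.DataFrame({
    "name": ["Diego", "Diego Martin", "Rodiego", "Juan", "Diego Alejandro"],
    "quantity": [3, 2, 5, 7, 4],
    "year": [1985, 1985, 1985, 1986, 1986],
})

# str.contains is case-sensitive by default, so only capitalized "Diego" matches
mask = df["name"].str.contains("Diego", na=False)
diegos_per_year_s = df[mask].groupby("year")["quantity"].sum()
```

The resulting series is indexed by `year` and named `quantity`, matching the shape of the output shown above.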
input

Which year had the most 'Diegos' born?

codevalidated

Create a visualization of 'Diegos' born between 1960 and 2015

Create a plot showing the number of "Diegos" born between 1960 and 2015, including both limits ([1960, 2015]). The plot should have the title Total 'Diegos' born per year [1960-2015], and it should look something like:
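Label-based slicing with `.loc` includes both endpoints, which fits the [1960, 2015] requirement. The sketch below uses a made-up series, and the chart type and styling of the reference plot are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts, indexed by year in ascending order
diegos_per_year_s = pd.Series(
    [10, 40, 900, 300],
    index=pd.Index([1958, 1960, 1986, 2015], name="year"),
    name="quantity",
)

# .loc slicing on labels keeps both 1960 and 2015
selected = diegos_per_year_s.loc[1960:2015]

fig, ax = plt.subplots()
selected.plot(ax=ax)
ax.set_title("Total 'Diegos' born per year [1960-2015]")
```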

codevalidated

Extract the most popular names per year

Create a DataFrame containing the most popular name for each year (the one with the highest quantity). Store it in the variable most_popular_per_year_df. It should be sorted by year in ascending order.

Important! The dataframe must keep the original index of the row that holds each year's most popular name. It should look like this:

For example, the most popular name of 1999 was Valentina with 3084 occurrences, or in 2015 it was Benjamin with 3695 occurrences.
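One approach that preserves the original row index is to get the index label of each year's maximum with `groupby(...).idxmax()` and then select those rows with `.loc`. This is a sketch on made-up data, not necessarily the reference solution.

```python
import pandas as pd

# Made-up rows; index labels 0..3 are the "original" index to preserve
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Omar"],
    "quantity": [5, 9, 8, 2],
    "year": [2000, 2000, 2001, 2001],
})

# Index label of the row with the highest quantity in each year
top_idx = df.groupby("year")["quantity"].idxmax()

# Selecting by label keeps those original index values
most_popular_per_year_df = df.loc[top_idx].sort_values("year")
```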

input

Which was the most popular name among the most popular names?

Which name was the "most popular name of the year" the most times?
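Counting how often each name appears in the per-year winners can be done with `value_counts`. A minimal sketch on a made-up version of most_popular_per_year_df:

```python
import pandas as pd

# Made-up per-year winners; index labels mimic rows kept from a larger frame
most_popular_per_year_df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Ana"],
        "quantity": [9, 8, 7],
        "year": [2000, 2001, 2002],
    },
    index=[4, 10, 17],
)

# Number of years each name was the winner, then the name with the most wins
wins = most_popular_per_year_df["name"].value_counts()
top_name = wins.idxmax()
```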

Author

Santiago Basulto

This project is part of

Data Wrangling with Pandas