Capstone Project: NBA 2017 season analysis
Capstone Project: NBA 2017 season analysis Data Science Project
Data Wrangling with Pandas

Capstone Project: NBA 2017 season analysis

In this project, you'll combine all your data wrangling skills to perform an analysis of the 2017 Season of the NBA. You'll need to merge datasets, clean data by removing columns, transform data types, and much more.

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

codevalidated

Merge `s2017_df` and `players_df` with a left join

Merge s2017_df and players_df using a left outer join, that means, we want to have all the stats information, but if there are missing values that can't be matched from players_df, we want to set those season values as null.

Store the results from the merge in the variable df.

multiplechoice

Are there misses (mismatches) in the resulting dataframe?

As we performed a left outer join, if some values in from s2017_df couldn't be matched in players_df, the result will be null values (also referred as "misses" or "mismatches"). Are there any?

input

How many rows couldn't be matched?

How many misses were in the resulting frame?

codevalidated

Extract the names of the players that couldn't be matched

Extract the names of the players from s2017_df that couldn't be matched with players_df in a list (it must be a list), under the variable name player_misses.

codevalidated

Modify `players_df` with the correct names to re-try a successful merge

Now it's time to do some detective work...

We can guarantee that the players missing in players_df do exist, but they just have different names. Now, you must find the players discrepancies and update the names in the players_df dataframe.

For example, if in s2017_df the player's name was "Michael J. Jordan", and in players_df it was just "Michael Jordan", the task is to modify players_df to make it now "Michael J. Jordan".

Important: Modify the players_df dataframe in place! If you "break" something, you can always read the data again.

codevalidated

Perform the merge between `s2017_df` and `players_df` again, this time, without misses

Now that you've fixed the data in players_df, perform the merge between s2017_df and players_df again. Should be the same merge as before, left outer. Store the result in df.

codevalidated

Remove unnecessary columns

We won't use some columns in our follow up analysis, so we can drop them to simplify the understanding of the data. Drop from df the following columns:

columns_to_drop = [
    "Year",
    "PER",
    "TS%",
    "3PAr",
    "FTr",
    "USG%",
    "blanl",
    "OWS",
    "DWS",
    "WS",
    "WS/48",
    "blank2",
    "OBPM",
    "DBPM",
    "BPM",
    "VORP",
    "FG%",
    "3P%",
    "eFG%",
    "FT%",
    "name",
]

Important: you must modify df in place, removing the columns directly in the same dataframe. If you think you've made a mistake, re-read the data and perform the joins again.

codevalidated

Rename teams to their full names

The Tm column contains an acronym of the team. For example, GSW for Golden State Warriors. Create a new column Team with the full name of the team. In the associated notebook, you can find a mapping to help you in the process.

codevalidated

Convert birthday to a datetime object

The column birth_date is a string in the format Month Day, Year (August 1, 1993). Convert the column to a datetime object.

codevalidated

Delete all players from the `TOT` team

Finally, if you explore the dataset, you'll notice that there's a team TOT. In reality, that team doesn't exist, and it's just an aggregation for players that have switched teams in the season.

Your task is to delete all the rows that have TOT in the column Tm. Perform the modification in the df dataframe, in place.

multiplechoice

What's the team with the most players in the league?

Count the number of players registered in each team and answer which team has the most players.

multiplechoice

What's the team with the lowest `FG`?

What's the team with the lowest sum of field goals (FG)?

input

What's the team with the best `FG%`?

FG% is defined as FG / FGA, that is, total field goals divided by the number of attempts. What team has the best FG% in the league? Enter the full team name below (example, Dallas Mavericks).

multiplechoice

What's the difference between the best and worst 3P shooters (by position)?

It is known that Shooting Guards (SG) are the best 3P throwers (by efficacy). The question is, what's the difference (in accuracy / efficacy) with the worst 3P throwers, always considering by position?

Note: use the position from the Pos column.

codevalidated

Find the best scorers in each team

Create a new dataframe containing the best scorers per team (by PTS, total points scored). The resulting dataframe should contain the columns Player, Team, Pos and PTS, and should be stored in the variable best_scorers_per_team. It should be sorted by PTS in descending mode.

Here's a preview of the expected result:

multiplechoice

Which team has the 'youngest squad', by average player age?

Calculate the average player age per team and answer which team has the "youngest squad"?

Capstone Project: NBA 2017 season analysisCapstone Project: NBA 2017 season analysis
Author

Santiago Basulto

This project is part of

Data Wrangling with Pandas

Explore other projects