All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.
All our activities include solutions with explanations on how they work and why we chose them.
players_df using a left outer join, that means, we want to have all the stats information, but if there are missing values that can't be matched from
players_df, we want to set those season values as null.
Store the results from the merge in the variable
As we performed a left outer join, if some values in from
s2017_df couldn't be matched in
players_df, the result will be null values (also referred as "misses" or "mismatches"). Are there any?
How many misses were in the resulting frame?
Extract the names of the players from
s2017_df that couldn't be matched with
players_df in a list (it must be a
list), under the variable name
Now it's time to do some detective work...
We can guarantee that the players missing in
players_df do exist, but they just have different names. Now, you must find the players discrepancies and update the names in the
For example, if in
s2017_df the player's name was
"Michael J. Jordan", and in
players_df it was just
"Michael Jordan", the task is to modify
players_df to make it now
"Michael J. Jordan".
Important: Modify the
players_df dataframe in place! If you "break" something, you can always read the data again.
Now that you've fixed the data in
players_df, perform the merge between
players_df again. Should be the same merge as before, left outer. Store the result in
We won't use some columns in our follow up analysis, so we can drop them to simplify the understanding of the data. Drop from
df the following columns:
columns_to_drop = [ "Year", "PER", "TS%", "3PAr", "FTr", "USG%", "blanl", "OWS", "DWS", "WS", "WS/48", "blank2", "OBPM", "DBPM", "BPM", "VORP", "FG%", "3P%", "eFG%", "FT%", "name", ]
Important: you must modify
df in place, removing the columns directly in the same dataframe. If you think you've made a mistake, re-read the data and perform the joins again.
Tm column contains an acronym of the team. For example,
GSW for Golden State Warriors. Create a new column
Team with the full name of the team. In the associated notebook, you can find a mapping to help you in the process.
birth_date is a string in the format
Month Day, Year (
August 1, 1993).
Convert the column to a datetime object.
Finally, if you explore the dataset, you'll notice that there's a team
TOT. In reality, that team doesn't exist, and it's just an aggregation for players that have switched teams in the season.
Your task is to delete all the rows that have
TOT in the column
Tm. Perform the modification in the
df dataframe, in place.
Count the number of players registered in each team and answer which team has the most players.
What's the team with the lowest sum of field goals (
FG% is defined as
FG / FGA, that is, total field goals divided by the number of attempts. What team has the best
FG% in the league? Enter the full team name below (example,
It is known that Shooting Guards (SG) are the best 3P throwers (by efficacy). The question is, what's the difference (in accuracy / efficacy) with the worst 3P throwers, always considering by position?
Note: use the position from the
Create a new dataframe containing the best scorers per team (by
PTS, total points scored). The resulting dataframe should contain the columns
PTS, and should be stored in the variable
best_scorers_per_team. It should be sorted by
PTS in descending mode.
Here's a preview of the expected result:
Calculate the average player age per team and answer which team has the "youngest squad"?
Data Wrangling with Pandas