  Classification in Depth with Scikit-Learn

# Supervised machine learning classification: Customer Churn prediction

In this project you'll apply the techniques and models you've learned so far, including data cleaning, feature engineering, and hyperparameter tuning, to a dataset containing information about customer churn. The project combines quizzes and practical activities to guide you toward the best possible results.

## Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.


### Determine the percentage of non-active members.

The answer should be rounded to two decimal places.
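As a minimal sketch of this kind of calculation, the snippet below computes a percentage from a 0/1 flag column. The toy data and the column name `IsActiveMember` are assumptions made for illustration; check the actual column name in your DataFrame.

```python
import pandas as pd

# Toy data standing in for the churn dataset (assumed column name).
df = pd.DataFrame({"IsActiveMember": [1, 0, 0, 1, 0, 1, 1, 0]})

# The mean of a boolean mask is the fraction of rows where it is True.
pct_non_active = round((df["IsActiveMember"] == 0).mean() * 100, 2)
print(pct_non_active)
```

Taking the mean of a boolean mask avoids counting rows by hand and works the same way on the full dataset.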


### What is the mean age of the women who have closed their accounts with the bank?

The answer should be rounded to two decimal places.
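One way to approach a question like this is to combine two boolean conditions and average the matching rows. The sketch below uses toy data and assumes that `Gender` holds the value `"Female"` and that `Exited == 1` marks a closed account; verify both against your dataset.

```python
import pandas as pd

# Toy data; the values and the meaning of Exited == 1 are assumptions.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Age": [40, 35, 50, 30, 60],
    "Exited": [1, 1, 1, 0, 0],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (df["Gender"] == "Female") & (df["Exited"] == 1)
mean_age = round(df.loc[mask, "Age"].mean(), 2)
print(mean_age)
```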


### Compute the number of people per country who have a credit card.

There could be more than just one correct answer.
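The sketch below shows two equivalent ways to count flagged rows per group in pandas, which is why more than one answer can be correct. The toy data is an assumption; only the column names `Geography` and `HasCrCard` come from this project.

```python
import pandas as pd

# Toy data standing in for the churn dataset.
df = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany", "Spain", "France"],
    "HasCrCard": [1, 0, 1, 1, 1, 0],
})

# Option 1: group by country and sum the 0/1 flag.
holders = df.groupby("Geography")["HasCrCard"].sum()

# Option 2: keep only card holders, then count rows per country.
holders_alt = df[df["HasCrCard"] == 1]["Geography"].value_counts().sort_index()

print(holders)
print(holders_alt)
```

Summing works only because the flag is coded as 0/1; the filter-then-count version also works for other encodings.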


### How do you calculate the correlation matrix for a dataset using Pearson's correlation coefficient?

There could be more than just one correct answer.
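For reference, here is a minimal sketch of computing a Pearson correlation matrix on synthetic data. The data is generated only for illustration, with `Exited` deliberately derived from `Age`, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data (assumption: not the real churn data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 80, size=100),
    "Balance": rng.normal(50_000, 10_000, size=100),
})
df["Exited"] = (df["Age"] > 50).astype(int)

# DataFrame.corr uses Pearson by default; naming it makes the intent explicit.
corr = df.corr(method="pearson")

# NumPy computes the same matrix when columns are treated as variables.
corr_np = np.corrcoef(df.to_numpy(), rowvar=False)

# Feature with the largest absolute correlation with the target column.
strongest = corr["Exited"].drop("Exited").abs().idxmax()
print(corr)
print(strongest)
```

Dropping the target from its own column before calling `idxmax` avoids picking the trivial self-correlation of 1.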


### Correlation

Based on the correlation analysis of the dataset, which variable has the highest correlation with the target column?


### Correlation analysis

Based on the correlations calculated between the variables, select the correct statements. There could be more than just one correct answer.


### Visualization

What type of plot would you use to compare the credit score distribution for customers who have left the bank (churned) versus those who have not? Please select the figure that shows the correct representation.


### What type of information can you gain from a box plot in statistical analysis and data visualization?

There could be more than just one correct answer.


### How can you create a plot to show the relationship between all variables in a single layout using Seaborn in Python?

For this task select the following variables: `CreditScore`, `Age`, `Balance`, `HasCrCard`, `EstimatedSalary` and show the relationship between these variables classified by `Exited`. Then select the correct answer.


### Histogram

Two students created histograms for the 'credit score' variable using the same bin width and boundary values, but their plots have distinctly different shapes. What could be the reason for the different shapes in their histograms?

Let's see the figures:

```python
import matplotlib.pyplot as plt

# `df` is the churn DataFrame loaded earlier in the project.
# Three histograms of the same column, titled a, b, and c.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 5))
ax1.hist(df.CreditScore, align='left', color='#0504aa', alpha=0.7)
ax2.hist(df.CreditScore, align='right', color='#0504aa', alpha=0.7)
ax3.hist(df.CreditScore, color='#0504aa', alpha=0.7)
ax1.set_xlabel('Value', fontsize=15)
ax2.set_xlabel('Value', fontsize=15)
ax3.set_xlabel('Value', fontsize=15)
ax1.set_ylabel('Frequency', fontsize=15)
ax1.set_title('a', fontsize=15)
ax2.set_title('b', fontsize=15)
ax3.set_title('c', fontsize=15)
plt.show()
```

### What is the appropriate figure or chart to represent the number of classes for the 'Exited' variable?

There could be more than just one correct answer.


### Density plot

Overlapping density plots generally avoid the problems of overlapping histograms, because the continuous density lines help the viewer distinguish between the distributions. The smooth lines make the shape of the data easier to read, even when multiple distributions are shown at once.

Select the code that shows the balance distribution by country.


### Data leakage

Data leakage can cause you to create overly optimistic, if not completely invalid, predictive models.

Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.

Which of the following columns would you remove because they would cause data leakage?

There could be more than just one correct answer.


### Drop unwanted features

Now it's time to drop unwanted features ('Surname', 'RowNumber', 'CustomerId'). Which of the following statements are correct?

There could be more than just one correct answer.
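As a quick reminder of how column dropping behaves in pandas, here is a small sketch on toy data. The rows are invented; only the three column names come from the task above.

```python
import pandas as pd

# Toy data with the three identifier columns plus two others.
df = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Surname": ["A", "B"],
    "CreditScore": [600, 700],
    "Exited": [0, 1],
})

# drop() returns a new DataFrame; the original is unchanged unless
# you reassign the result or pass inplace=True.
df_clean = df.drop(columns=["Surname", "RowNumber", "CustomerId"])

# Equivalent spelling: df.drop(["Surname", "RowNumber", "CustomerId"], axis=1)
print(df_clean.columns.tolist())
```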


### Encode categorical variables

A machine learning algorithm needs to be able to understand the data it receives. There are plenty of methods to encode categorical variables as numbers, and each method comes with its own advantages and disadvantages. Which is the correct way to encode the variables `Gender` and `Geography`?


### Encode categorical variables

Execute the following code:

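```python
import pandas as pd

# `df` is the churn DataFrame loaded earlier in the project.
# One-hot encode each categorical column with drop_first=True.
Geography = pd.get_dummies(df['Geography'], drop_first=True)
Gender = pd.get_dummies(df['Gender'], drop_first=True)

# Append the dummy columns to the DataFrame and inspect the result.
df = pd.concat([df, Geography, Gender], axis=1)
df.info()
```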

Which of the following statements are true?

There could be more than just one correct answer.


### Classification or Regression

Based on this, you should select whether this scenario is a classification or a regression problem.


### Split train and test

Select the correct way to split the dataset into 30% test and 70% train.

There could be more than just one correct answer.
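Here is a minimal sketch of a 70/30 split with scikit-learn's `train_test_split`. The arrays are synthetic, and `stratify=y` is an optional extra used here to keep the class ratio the same in both parts; it is not required by the task.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (80 zeros, 20 ones).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# test_size=0.3 holds out 30% of the rows; train_size=0.7 is equivalent.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.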


### Confusion Matrix

We ask you to build a predictive model that answers the question: “What sorts of customers were more likely to churn?”

Which of the following statements about the confusion matrix are true?

There could be more than just one correct answer.
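To make the cells of a confusion matrix concrete, here is a small sketch with hand-written labels and predictions. The values are invented for illustration and do not come from the churn model.

```python
from sklearn.metrics import confusion_matrix

# Invented labels and predictions (1 = churned, 0 = stayed).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, rows are true classes and columns are predicted
# classes, so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who churned
recall = tp / (tp + fn)     # share of actual churners that were caught
print(tn, fp, fn, tp, precision, recall)
```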


### XGBoost

Which of the following statements about XGBoost are true? There could be more than just one correct answer.


### XGBClassifier

Train an XGBoost classifier with the following parameters: `objective='binary:logistic'` and `random_state=42`, then calculate the accuracy on the training set.


### XGBoost: evaluation metrics

Now, let's train an XGBoost classifier with a logistic objective (`objective='binary:logistic'`), `n_estimators=30`, and `max_depth=2`.

Use `random_state=42`.

Plot the histogram of the predicted scores, and estimate the precision and recall at thresholds of 0.1, 0.5, 0.7, and 0.8 using the test dataset.

Based on these results, which of the following statements about the model are true? There could be more than just one correct answer.
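The thresholding step can be sketched independently of the model. In the snippet below the scores are invented; in the activity they would be the predicted probabilities of the positive class on the test set (for example from `predict_proba`, which is an assumption about how you obtain them).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and scores standing in for the model output.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.75, 0.85, 0.90])

results = {}
for threshold in [0.1, 0.5, 0.7, 0.8]:
    # A sample is predicted positive when its score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
    )

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up, while precision may move in either direction.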


### XGBoost: precision and recall

Use the following function to make a precision and recall curve for the training set.

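```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def plot_prc(name, labels, predictions, **kwargs):
    """Plot the precision-recall curve for the given labels and scores."""
    precision, recall, _ = precision_recall_curve(labels, predictions)
    # Recall goes on the x-axis and precision on the y-axis, matching the labels.
    plt.plot(recall, precision, label=name, linewidth=2, **kwargs)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')
```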

Which of the following statements are true?

There could be more than just one correct answer.


### XGBoost: Tuning Parameters

`RandomizedSearchCV` takes the following arguments:

• `estimator`: The estimator to fit; here, the XGBoost classifier.
• `param_distributions`: The distributions (or lists) of hyperparameter values to sample from.
• `cv`: The number of cross-validation folds.
• `n_iter`: The number of hyperparameter combinations to sample.
• `verbose`: Controls how much output is printed; higher values print more.

Follow the instructions and solve the exercise:

1. Create a parameter grid called `rs_param_grid` that contains:

• `'max_depth': list(range(3, 12))`
• `'alpha': [0, 0.001, 0.01, 0.1, 1]`
• `'subsample': [0.5, 0.75, 1]`
• `'learning_rate': np.linspace(0.01, 0.5, 10)`
• `'n_estimators': [10, 25, 40]`

2. Create a `RandomizedSearchCV` object called `xgb_rs`, passing in the parameter grid to `param_distributions`. Also, specify `verbose=2`, `cv=3`, and `n_iter=5`.

3. Your objective is to maximize the F1-score.

4. Fit the `RandomizedSearchCV` object to `X` and `y`.
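The steps above can be sketched as follows. To keep the example runnable without XGBoost installed, it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in and drops the `alpha` entry, which that estimator does not accept. The synthetic data, the name `search`, and the fixed `random_state` on the search are also assumptions. For the activity itself you would pass an `xgboost.XGBClassifier` and the full grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for X and y.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Same grid as in step 1, minus 'alpha' (not a parameter of the stand-in).
rs_param_grid = {
    "max_depth": list(range(3, 12)),
    "subsample": [0.5, 0.75, 1.0],
    "learning_rate": np.linspace(0.01, 0.5, 10),
    "n_estimators": [10, 25, 40],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),  # stand-in estimator
    param_distributions=rs_param_grid,
    scoring="f1",      # step 3: maximize the F1-score
    n_iter=5,          # sample 5 combinations from the grid
    cv=3,              # 3-fold cross-validation
    verbose=2,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because only 5 of the possible combinations are sampled, the best parameters depend on the random seed; without a fixed `random_state` on the search, two runs can report different winners.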

What are the best parameters?

Author

#### Verónica Barraza

This project is part of

## Classification in Depth with Scikit-Learn
