Classification in Depth with Scikit-Learn

# Binary Classification

During this project, you'll have the opportunity to practice binary classification using both Decision Tree and KNN models on a real dataset. The dataset focuses on stroke prediction, which is a major public health concern. It includes several lifestyle, medical history, and demographic features that will enable you to investigate and analyze the factors that contribute to the occurrence of stroke.

## Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and let you practice in an environment with constant feedback.

All our activities include solutions with explanations of how they work and why we chose them.


### Did you find any missing values?

Calculate the percentage of missing values in each column. If you find any, drop the affected rows using `df.dropna(inplace=True)`.
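A minimal sketch of this step, using a tiny hypothetical stand-in for the stroke DataFrame (the real `df` comes from the project's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the stroke dataset; `bmi` has one missing value
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Percentage of missing values per column: mean of the boolean mask * 100
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Drop every row that contains at least one missing value
df.dropna(inplace=True)
print(len(df))  # 3 rows remain
```

`df.isnull().mean()` works because the boolean mask averages to the fraction of `True` values per column.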


### Dropping Unnecessary Features

The `id` column is unique for every row, so it carries no predictive information and would only mislead the model. Let's remove it.

Choose the correct code to drop this feature.
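One common way to drop a single column in pandas, sketched here on a hypothetical two-row slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset that still contains the `id` column
df = pd.DataFrame({"id": [9046, 51676], "age": [67, 61], "stroke": [1, 1]})

# `columns=` targets column labels; equivalently, drop("id", axis=1)
df = df.drop(columns=["id"])
print(list(df.columns))  # ['age', 'stroke']
```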


### Categorical variables

Complete the following code with the categorical variables and run it in the notebook:

```python
cat_variables = Complete
for i in cat_variables:
    fig_dims = (10, 4)
    fig, ax = plt.subplots(figsize=fig_dims)
    sns.countplot(x=i, hue="stroke", ax=ax, data=df, palette="RdBu",
                  order=df[i].value_counts().index)
    plt.xticks(rotation=90)
    plt.legend(loc='upper right')
    plt.show()
```

### Sample the dataset

With the following code, we will undersample the negative class, drawing as many non-stroke instances (249) as there are stroke instances so that the two classes are balanced.

Let's run this code in the notebook.

```python
# Sample as many non-stroke rows as there are stroke rows
non_stroke = df.loc[df.stroke == 0].sample(df.loc[df.stroke == 1].shape[0],
                                           random_state=1)
stroke = df.loc[df.stroke == 1]
frames = [stroke, non_stroke]
result = pd.concat(frames)
# Shuffle the rows; reassign, since sample() returns a new DataFrame
result = result.sample(frac=1, random_state=1)
```

After that, drop the categorical variables and separate the features and the target into two variables: store the features in `X` and the target in `y`.
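The split above can be sketched as follows, on a hypothetical balanced sample with one remaining categorical column (`gender`):

```python
import pandas as pd

# Hypothetical balanced sample; `gender` stands in for the categorical columns
result = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67, 61, 49, 80],
    "avg_glucose_level": [228.69, 202.21, 171.23, 105.92],
    "stroke": [1, 1, 0, 0],
})

# Drop the categorical columns, then separate features from the target
numeric = result.drop(columns=["gender"])
X = numeric.drop(columns=["stroke"])
y = numeric["stroke"]
print(X.shape, y.shape)  # (4, 2) (4,)
```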


### Train and test split

First, use `train_test_split` to split the data into training and testing sets: 80% training, 20% testing, with `random_state=0`.

Store the results in the variables `X_train`, `X_test`, `y_train`, and `y_test`.
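A minimal sketch of the split, using a small synthetic array in place of the project's `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target vector
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80% train / 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```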


### Decision tree classifier

Train a `DecisionTreeClassifier` using the training data, and store the model in `dt`. You can tune model parameters such as the maximum depth of the tree (`max_depth`) or the minimum number of samples required to split an internal node (`min_samples_split`).

Calculate the accuracy of both the training and testing sets and run the code in a Jupyter Notebook.

Store the results in the variables `train_accuracy` and `test_accuracy`.

The expected accuracy varies with the specifics of the problem and data, but for a well-defined problem with a large, diverse training dataset, a well-trained model can achieve over 70% accuracy.
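The steps above can be sketched as follows. Synthetic data from `make_classification` stands in for the balanced stroke sample, and the `max_depth=4` value is just an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limiting max_depth keeps the tree from memorizing the training set
dt = DecisionTreeClassifier(max_depth=4, random_state=0)
dt.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, dt.predict(X_train))
test_accuracy = accuracy_score(y_test, dt.predict(X_test))
print(train_accuracy, test_accuracy)
```

Comparing `train_accuracy` with `test_accuracy` is a quick overfitting check: a large gap suggests the tree is too deep for the data.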


### K-Nearest Neighbors (KNN)

Train a `KNeighborsClassifier` using the training data, and store the model in the variable `knn`. In KNN, the value of *k* determines how many nearest neighbors are considered when making a prediction; you specify it by passing the `n_neighbors` parameter to the class constructor. Remember that KNN is sensitive to the scale of the features, so use `StandardScaler` to standardize them, storing the scaled features in the variables `X_train_scaler` and `X_test_scaler`.

After training the model, calculate the accuracy on both the training and testing sets in a Jupyter Notebook, and store the results in the variables `train_accuracy` and `test_accuracy`, respectively.

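The same flow can be sketched for KNN. Synthetic data again stands in for the stroke sample, and `n_neighbors=5` is simply an illustrative choice of *k*; note that the scaler is fit on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then transform both splits,
# so no information from the test set leaks into the preprocessing
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaler, y_train)

train_accuracy = accuracy_score(y_train, knn.predict(X_train_scaler))
test_accuracy = accuracy_score(y_test, knn.predict(X_test_scaler))
print(train_accuracy, test_accuracy)
```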

Project Created by

#### Verónica Barraza
