Classification in Depth with Scikit-Learn

# Binary Classification

During this project, you'll have the opportunity to practice binary classification using both Decision Tree and KNN models on a real dataset. The dataset focuses on stroke prediction, which is a major public health concern. It includes several lifestyle, medical history, and demographic features that will enable you to investigate and analyze the factors that contribute to the occurrence of stroke.

## Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and let you practice in an environment with constant feedback.

All our activities include solutions with explanations of how they work and why we chose them.


### Did you find any missing values?

Calculate the percentage of missing values in each column. If you find any, drop the affected rows using `df.dropna(inplace=True)`.
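A minimal sketch of this step, using a tiny hypothetical stand-in for the stroke DataFrame (the real `df` comes from the project's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the stroke dataset; `bmi` has one missing value
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Percentage of missing values per column: mean of the boolean mask * 100
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Drop every row that contains at least one missing value
df.dropna(inplace=True)
print(len(df))  # 3 rows remain
```

`df.isnull().mean()` works because the boolean mask averages to the fraction of `True` values per column.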


### Dropping Unnecessary Features

The `id` column is unique for every row, so it carries no predictive information and would only mislead the model. Let's remove it.

Choose the correct code to drop this feature.
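One common way to drop a single column in pandas, sketched here on a hypothetical two-row slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset that still contains the `id` column
df = pd.DataFrame({"id": [9046, 51676], "age": [67, 61], "stroke": [1, 1]})

# `columns=` targets column labels; equivalently, drop("id", axis=1)
df = df.drop(columns=["id"])
print(list(df.columns))  # ['age', 'stroke']
```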


### Categorical variables

Complete the following code with the categorical variables and run it in the notebook:

```python
cat_variables = Complete
for i in cat_variables:
    fig_dims = (10, 4)
    fig, ax = plt.subplots(figsize=fig_dims)
    sns.countplot(x=i, hue="stroke", ax=ax, data=df, palette="RdBu",
                  order=df[i].value_counts().index)
    plt.xticks(rotation=90)
    plt.legend(loc='upper right')
    plt.show()
```

### Sample the dataset

With the following code, we will undersample the negative class, drawing as many non-stroke instances (249) as there are stroke instances so that the two classes are balanced.

Let's run this code in the notebook.

```python
# Sample as many non-stroke rows as there are stroke rows
non_stroke = df.loc[df.stroke == 0].sample(df.loc[df.stroke == 1].shape[0],
                                           random_state=1)
stroke = df.loc[df.stroke == 1]
frames = [stroke, non_stroke]
result = pd.concat(frames)
# Shuffle the rows; reassign, since sample() returns a new DataFrame
result = result.sample(frac=1, random_state=1)
```

After that, drop the categorical variables and separate the features and the target into two variables: store the features in `X` and the target in `y`.
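The split above can be sketched as follows, on a hypothetical balanced sample with one remaining categorical column (`gender`):

```python
import pandas as pd

# Hypothetical balanced sample; `gender` stands in for the categorical columns
result = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67, 61, 49, 80],
    "avg_glucose_level": [228.69, 202.21, 171.23, 105.92],
    "stroke": [1, 1, 0, 0],
})

# Drop the categorical columns, then separate features from the target
numeric = result.drop(columns=["gender"])
X = numeric.drop(columns=["stroke"])
y = numeric["stroke"]
print(X.shape, y.shape)  # (4, 2) (4,)
```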


### Train and test split

First, use `train_test_split` to split the data into training and testing sets: 80% training, 20% testing, with `random_state=0`.

Store the results in the variables `X_train`, `X_test`, `y_train`, and `y_test`.
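A minimal sketch of the split, using a small synthetic array in place of the project's `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target vector
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80% train / 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```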


### Decision tree classifier

Train a `DecisionTreeClassifier` using the training data, and store the model in `dt`. You can tune model parameters such as the maximum depth of the tree (`max_depth`) or the minimum number of samples required to split an internal node (`min_samples_split`).

Calculate the accuracy of both the training and testing sets and run the code in a Jupyter Notebook.

Store the results in the variables `train_accuracy` and `test_accuracy`.

The expected accuracy varies with the specifics of the problem and data, but for a well-defined problem with a large, diverse training dataset, a well-trained model can achieve over 70% accuracy.
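The steps above can be sketched as follows. Synthetic data from `make_classification` stands in for the balanced stroke sample, and the `max_depth=4` value is just an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limiting max_depth keeps the tree from memorizing the training set
dt = DecisionTreeClassifier(max_depth=4, random_state=0)
dt.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, dt.predict(X_train))
test_accuracy = accuracy_score(y_test, dt.predict(X_test))
print(train_accuracy, test_accuracy)
```

Comparing `train_accuracy` with `test_accuracy` is a quick overfitting check: a large gap suggests the tree is too deep for the data.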


### K-Nearest Neighbors (KNN)

Train a `KNeighborsClassifier` using the training data, and store the model in the variable `knn`. In KNN, the value of *k* determines how many nearest neighbors are considered when making a prediction; you specify it by passing the `n_neighbors` parameter to the class constructor. Remember that KNN is sensitive to the scale of the features, so use `StandardScaler` to standardize them, storing the scaled features in the variables `X_train_scaler` and `X_test_scaler`.

After training the model, calculate the accuracy on both the training and testing sets in a Jupyter Notebook, and store the results in the variables `train_accuracy` and `test_accuracy`, respectively.

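The same flow can be sketched for KNN. Synthetic data again stands in for the stroke sample, and `n_neighbors=5` is simply an illustrative choice of *k*; note that the scaler is fit on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then transform both splits,
# so no information from the test set leaks into the preprocessing
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaler, y_train)

train_accuracy = accuracy_score(y_train, knn.predict(X_train_scaler))
test_accuracy = accuracy_score(y_test, knn.predict(X_test_scaler))
print(train_accuracy, test_accuracy)
```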

Project Created by

#### Verónica Barraza
