Classification in Depth with Scikit-Learn

# Binary Classification

During this project, you'll have the opportunity to practice binary classification using both Decision Tree and KNN models on a real dataset. The dataset focuses on stroke prediction, which is a major public health concern. It includes several lifestyle, medical history, and demographic features that will enable you to investigate and analyze the factors that contribute to the occurrence of stroke.

## Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and let you practice in an environment with constant feedback.

All our activities include solutions with explanations of how they work and why we chose them.


### Did you find any missing values?

Calculate the percentage of missing values in each column. If you find any, drop the affected rows using `df.dropna(inplace=True)`.
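A minimal sketch of this step, using a tiny hypothetical stand-in for the stroke DataFrame (the real `df` comes from the project's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the stroke dataset; `bmi` has one missing value
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Percentage of missing values per column: mean of the boolean mask * 100
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Drop every row that contains at least one missing value
df.dropna(inplace=True)
print(len(df))  # 3 rows remain
```

`df.isnull().mean()` works because the boolean mask averages to the fraction of `True` values per column.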


### Dropping Unnecessary Features

The `id` column is unique for every row, so it carries no predictive information and would only mislead the model. Let's remove it.

Choose the correct code to drop this feature.
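One common way to drop a single column in pandas, sketched here on a hypothetical two-row slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset that still contains the `id` column
df = pd.DataFrame({"id": [9046, 51676], "age": [67, 61], "stroke": [1, 1]})

# `columns=` targets column labels; equivalently, drop("id", axis=1)
df = df.drop(columns=["id"])
print(list(df.columns))  # ['age', 'stroke']
```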


### Categorical variables

Complete the following code with the categorical variables and run it in the notebook:

```python
cat_variables = Complete
for i in cat_variables:
    fig_dims = (10, 4)
    fig, ax = plt.subplots(figsize=fig_dims)
    sns.countplot(x=i, hue="stroke", ax=ax, data=df, palette="RdBu",
                  order=df[i].value_counts().index)
    plt.xticks(rotation=90)
    plt.legend(loc='upper right')
    plt.show()
```

### Sample the dataset

With the following code, we will undersample the negative class, drawing as many non-stroke instances (249) as there are stroke instances so that the two classes are balanced.

Let's run this code in the notebook.

```python
# Sample as many non-stroke rows as there are stroke rows
non_stroke = df.loc[df.stroke == 0].sample(df.loc[df.stroke == 1].shape[0],
                                           random_state=1)
stroke = df.loc[df.stroke == 1]
frames = [stroke, non_stroke]
result = pd.concat(frames)
# Shuffle the rows; reassign, since sample() returns a new DataFrame
result = result.sample(frac=1, random_state=1)
```

After that, drop the categorical variables and separate the features and the target into two variables: store the features in `X` and the target in `y`.
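The split above can be sketched as follows, on a hypothetical balanced sample with one remaining categorical column (`gender`):

```python
import pandas as pd

# Hypothetical balanced sample; `gender` stands in for the categorical columns
result = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67, 61, 49, 80],
    "avg_glucose_level": [228.69, 202.21, 171.23, 105.92],
    "stroke": [1, 1, 0, 0],
})

# Drop the categorical columns, then separate features from the target
numeric = result.drop(columns=["gender"])
X = numeric.drop(columns=["stroke"])
y = numeric["stroke"]
print(X.shape, y.shape)  # (4, 2) (4,)
```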


### Train and test split

First, use `train_test_split` to split the data into training and testing sets: 80% training, 20% testing, with `random_state=0`.

Store the results in the variables `X_train`, `X_test`, `y_train`, and `y_test`.
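A minimal sketch of the split, using a small synthetic array in place of the project's `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target vector
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80% train / 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```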


### Decision tree classifier

Train a `DecisionTreeClassifier` using the training data, and store the model in `dt`. You can tune model parameters such as the maximum depth of the tree (`max_depth`) or the minimum number of samples required to split an internal node (`min_samples_split`).

Calculate the accuracy of both the training and testing sets and run the code in a Jupyter Notebook.

Store the results in the variables `train_accuracy` and `test_accuracy`.

The expected accuracy varies with the specifics of the problem and data, but for a well-defined problem with a large, diverse training dataset, a well-trained model can achieve over 70% accuracy.
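The steps above can be sketched as follows. Synthetic data from `make_classification` stands in for the balanced stroke sample, and the `max_depth=4` value is just an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limiting max_depth keeps the tree from memorizing the training set
dt = DecisionTreeClassifier(max_depth=4, random_state=0)
dt.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, dt.predict(X_train))
test_accuracy = accuracy_score(y_test, dt.predict(X_test))
print(train_accuracy, test_accuracy)
```

Comparing `train_accuracy` with `test_accuracy` is a quick overfitting check: a large gap suggests the tree is too deep for the data.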


### K-Nearest Neighbors (KNN)

Train a `KNeighborsClassifier` using the training data, and store the model in the variable `knn`. In KNN, the value of *k* determines how many nearest neighbors are considered when making a prediction; you specify it by passing the `n_neighbors` parameter to the class constructor. Remember that KNN is sensitive to the scale of the features, so use `StandardScaler` to standardize them, storing the scaled features in the variables `X_train_scaler` and `X_test_scaler`.

After training the model, calculate the accuracy on both the training and testing sets in a Jupyter Notebook, and store the results in the variables `train_accuracy` and `test_accuracy`, respectively.

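The same flow can be sketched for KNN. Synthetic data again stands in for the stroke sample, and `n_neighbors=5` is simply an illustrative choice of *k*; note that the scaler is fit on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset standing in for the balanced stroke sample
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then transform both splits,
# so no information from the test set leaks into the preprocessing
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaler, y_train)

train_accuracy = accuracy_score(y_train, knn.predict(X_train_scaler))
test_accuracy = accuracy_score(y_test, knn.predict(X_test_scaler))
print(train_accuracy, test_accuracy)
```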

Project Created by

#### Verónica Barraza
