Data Cleaning with Pandas

NLP is one of the hot topics of Data Science given the impact of GPT and other LLMs. But NLP is not all about Machine Learning and predicting text. A big part of it is the preprocessing required to train the models. In this project you'll need to do the preprocessing required to train a Sentiment Analysis algorithm using NLTK. Roll up your sleeves and do some String preprocessing using Pandas!
Read dataset, but only the `Tweet` and `Sentiment` columns

Read the data in fifa_world_cup_2022_tweets.csv into a dataframe, but only the columns Tweet and Sentiment.
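One possible way to do this is the `usecols` parameter of `pd.read_csv`, which restricts parsing to the listed columns. The sketch below is only an illustration: the helper name `load_tweets` is hypothetical, and the in-memory sample (including its extra `Source` column) is invented so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_tweets(source):
    """Read only the Tweet and Sentiment columns from a CSV path or file-like object."""
    return pd.read_csv(source, usecols=["Tweet", "Sentiment"])


# Invented stand-in for the real file; the "Source" column exists only for the demo.
sample_csv = io.StringIO(
    "Source,Tweet,Sentiment\n"
    "web,What a match!,positive\n"
    "app,Kick-off in an hour,neutral\n"
)
df = load_tweets(sample_csv)
print(df.columns.tolist())
```

With the project file, the call would be `df = load_tweets("fifa_world_cup_2022_tweets.csv")`.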

Your df should contain just two columns, Tweet and Sentiment.


Lowercase the column `Tweet` in the new column `Tweet Lower`

Create a new column Tweet Lower that contains the contents of the Tweet column, but all lowercased.
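A minimal sketch of this step, assuming the dataframe is named `df`. The sample rows are invented; the `.str.lower()` accessor lowercases each value and leaves the original Tweet column unchanged.

```python
import pandas as pd

# Invented sample rows standing in for the real dataset.
df = pd.DataFrame({"Tweet": ["What a MATCH!", "Kick-off in an Hour"]})

# .str.lower() applies str.lower element-wise and returns a new Series.
df["Tweet Lower"] = df["Tweet"].str.lower()
```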


Remove all URLs from `Tweet Lower` in `Tweet Clean`

URLs, hashtags, and mentions are mostly useless when it comes to sentiment analysis. We'll start with the URLs: remove all of them from Tweet Lower and store your results in Tweet Clean.

Warning! Don't forget to remove any leading or trailing whitespace. For example, if you remove the URL from the following tweet:

what are we drinking today @tucantribe 


The result should be:

"""what are we drinking today @tucantribe 


Without a trailing space after the #worldcup2022 hashtag.
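One way to approach this is a regular expression combined with `.str.strip()`. The pattern below is an assumption, not the project's official definition of a URL: it treats anything starting with `http://` or `https://` up to the next whitespace as a URL. The sample rows are invented.

```python
import pandas as pd

# Assumption: a URL is "http://" or "https://" followed by a run of non-whitespace characters.
URL_PATTERN = r"https?://\S+"

# Invented sample rows standing in for the real dataset.
df = pd.DataFrame({"Tweet Lower": [
    "great game tonight https://example.com/abc",
    "https://example.com/x kick-off soon",
]})

df["Tweet Clean"] = (
    df["Tweet Lower"]
    .str.replace(URL_PATTERN, "", regex=True)  # delete every URL match
    .str.strip()                               # drop whitespace left at either end
)
```

Note that a URL removed from the middle of a tweet leaves two consecutive spaces behind; `.str.strip()` only handles the ends of the string.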


Remove username mentions in `Tweet Clean`

Still in Tweet Clean, remove any Twitter mentions (in the form @datawars_io). In this case, we're modifying the column Tweet Clean in place, so if you make a mistake, you'll have to re-run your previous code and start over.

Remember to strip any leading or trailing whitespace.
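A possible sketch, assuming a mention is an `@` followed by letters, digits, or underscores (the regex `@\w+`); that definition is an assumption made for illustration. The sample rows are invented.

```python
import pandas as pd

# Assumption: a mention is "@" followed by word characters (letters, digits, underscore).
MENTION_PATTERN = r"@\w+"

# Invented sample rows standing in for the real dataset.
df = pd.DataFrame({"Tweet Clean": [
    "what are we drinking today @tucantribe",
    "@datawars_io great project",
]})

# Overwrite the column in place, then strip the leftover edge whitespace.
df["Tweet Clean"] = (
    df["Tweet Clean"]
    .str.replace(MENTION_PATTERN, "", regex=True)
    .str.strip()
)
```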


Remove Hashtags from `Tweet Clean`

Still in Tweet Clean, remove any hashtags.
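The same approach can be sketched for hashtags. The snippet assumes that "removing a hashtag" means deleting both the `#` symbol and the word attached to it, and that stripping the edges afterwards is desirable; both are assumptions. The sample rows are invented.

```python
import pandas as pd

# Assumption: a hashtag is "#" followed by word characters, and the whole token is removed.
HASHTAG_PATTERN = r"#\w+"

# Invented sample rows standing in for the real dataset.
df = pd.DataFrame({"Tweet Clean": [
    "come on england #worldcup2022",
    "#qatar2022 opening ceremony",
]})

df["Tweet Clean"] = (
    df["Tweet Clean"]
    .str.replace(HASHTAG_PATTERN, "", regex=True)
    .str.strip()
)
```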


Create the list `tokenized_tweets` by applying the function `word_tokenize` to the values of the column `Tweet Clean`

We'll now start using the nltk module. Don't worry if you've never used it before, as these are all simple functions that don't require an NLP background.

We'll start by "tokenizing" the tweets. Tokenizing means basically splitting a corpus of text into different words or tokens.

Your task is to use the word_tokenize function to create a list of tweet tokens and store the result in tokenized_tweets. This means that tokenized_tweets is a list of lists, with one inner list of tokens per tweet, in the following form:

[
    ['what', 'are', 'we', 'drinking', 'today'], # tweet
    ['worth', 'reading', 'while', 'watching'],  # tweet
    ...
]
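A rough sketch of the tokenizing step. The helper `tokenize_tweets` and its optional `tokenizer` parameter are hypothetical conveniences; by default it falls back to NLTK's `word_tokenize`, which needs NLTK's tokenizer data to be downloaded first (the `punkt` resource, or `punkt_tab` in newer NLTK releases, as far as I know).

```python
def tokenize_tweets(tweets, tokenizer=None):
    """Return one list of tokens per tweet."""
    if tokenizer is None:
        # Imported here so a custom tokenizer can be used without NLTK's tokenizer data.
        from nltk.tokenize import word_tokenize
        tokenizer = word_tokenize
    return [tokenizer(tweet) for tweet in tweets]
```

In the project this would be called as `tokenized_tweets = tokenize_tweets(df["Tweet Clean"])`, which is equivalent to a list comprehension applying `word_tokenize` to each value of the column.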

Filter stop words

Stop words are words that don't contribute much to the meaning of a sentence, like conjunctions ("for", "and") or articles ("the", "a"). The nltk module includes a list of English stop words, which we can get with stopwords.words('english').

Your task is to remove any stop words from the tokens you have previously generated. Store your results in the variable filtered_tokenized_tweets, which continues to be a list of lists, but with the stop words filtered out.
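The filtering can be sketched as a nested list comprehension. The helper name `remove_stop_words` is hypothetical, and the short stop-word list is hand-picked so the snippet runs without downloading the NLTK corpus.

```python
def remove_stop_words(tokenized_tweets, stop_words):
    """Drop every token that appears in stop_words, keeping the list-of-lists shape."""
    stop_set = set(stop_words)  # set lookups are faster than scanning a list per token
    return [
        [token for token in tokens if token not in stop_set]
        for tokens in tokenized_tweets
    ]


# In the project the list would come from NLTK (the "stopwords" corpus must be downloaded):
#   from nltk.corpus import stopwords
#   stop_words = stopwords.words("english")
demo_stop_words = ["what", "are", "we", "while"]  # tiny hand-picked list for the demo

filtered_tokenized_tweets = remove_stop_words(
    [
        ["what", "are", "we", "drinking", "today"],
        ["worth", "reading", "while", "watching"],
    ],
    demo_stop_words,
)
```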


Glue all the tweets back again

Join the tokens we preprocessed in the previous tasks with a single space to rebuild each tweet. Store your results in the variable cleaned_tweets. In this case, it'll no longer be a list of lists, but a list of strings (the reassembled tweets), and it'll look something like:

['drinking today',
 'amazing launch video . shows much face canada men ’ national team changed since last world cup entry 1986. ’ wait see boys action ! canada : fifa world cup opening video',
 'worth reading watching']
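A minimal sketch of the joining step using `str.join`, with two short token lists standing in for the real data:

```python
# Sample input standing in for the output of the stop-word filtering step.
filtered_tokenized_tweets = [
    ["drinking", "today"],
    ["worth", "reading", "watching"],
]

# " ".join(...) concatenates the tokens of one tweet with a single space between them.
cleaned_tweets = [" ".join(tokens) for tokens in filtered_tokenized_tweets]
```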

Apply VADER to all the tweets

Use the analyzer.polarity_scores method to perform sentiment analysis on all the tweets in cleaned_tweets. Store the list of results in the variable tweet_sentiment_scores.

As we mentioned before, this requires just a method invocation:

>>> analyzer.polarity_scores(YOUR_TWEET)

Your tweet_sentiment_scores variable will look something like:

[
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.0, 'neu': 0.864, 'pos': 0.136, 'compound': 0.6239},
    {'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.2263},
    ...
]
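Scoring every tweet amounts to calling the method once per element. The helper name `score_tweets` is hypothetical, and the commented lines assume that `analyzer` is NLTK's `SentimentIntensityAnalyzer`, which requires the `vader_lexicon` resource to be downloaded.

```python
def score_tweets(tweets, analyzer):
    """Run analyzer.polarity_scores on every tweet and collect the resulting dictionaries."""
    return [analyzer.polarity_scores(tweet) for tweet in tweets]


# Assumed setup in the project (not executed here because it needs the VADER lexicon):
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   tweet_sentiment_scores = score_tweets(cleaned_tweets, analyzer)
```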

Calculate the sentiment of each tweet based on the following rule....

The result of analyzer.polarity_scores is a dictionary with several keys:

>>> analyzer.polarity_scores("DataWars is awesome! I love it so much!")
{'neg': 0.0, 'neu': 0.36, 'pos': 0.64, 'compound': 0.8715}

The neg, neu and pos keys represent the proportions of the text that fall in each category (Negative, Neutral and Positive). They add up to 1.

But the key that we're really interested in is compound, which is a weighted composite score that has been normalized between -1 (most extreme negative) and +1 (most extreme positive). In this case, the compound score of 0.8715 indicates a very high positive sentiment.

The general rule of thumb for interpreting the compound score is:

  • Positive sentiment: compound score > 0.05
  • Neutral sentiment: compound score between -0.05 and 0.05
  • Negative sentiment: compound score < -0.05

Calculate the sentiment of each score and store it in the variable tweet_sentiment_results that should look something like: ['neutral', 'positive', 'positive', ...].
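The rule of thumb above could be sketched as a small function. The name `label_from_compound` is hypothetical, and scores of exactly 0.05 or -0.05 are treated as neutral here, following the strict inequalities in the rule as written. The three sample dictionaries are the ones shown earlier.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"  # everything from -threshold to +threshold, inclusive


tweet_sentiment_scores = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.864, "pos": 0.136, "compound": 0.6239},
    {"neg": 0.0, "neu": 0.513, "pos": 0.487, "compound": 0.2263},
]
tweet_sentiment_results = [
    label_from_compound(scores["compound"]) for scores in tweet_sentiment_scores
]
```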


Delete the columns `Tweet Lower` and `Tweet Clean` from the dataframe and add the new column `Calculated Sentiment`

Remove the columns we previously used (Tweet Lower, Tweet Clean) and create a new one named Calculated Sentiment with the results of tweet_sentiment_results.
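A minimal sketch using `DataFrame.drop` and a plain column assignment. The sample rows are invented; assigning a list works only when its length equals the number of rows.

```python
import pandas as pd

# Invented sample standing in for the dataframe built in the previous steps.
df = pd.DataFrame({
    "Tweet": ["A", "B"],
    "Sentiment": ["neutral", "positive"],
    "Tweet Lower": ["a", "b"],
    "Tweet Clean": ["a", "b"],
})
tweet_sentiment_results = ["neutral", "negative"]

# drop(columns=...) returns a new dataframe without the helper columns.
df = df.drop(columns=["Tweet Lower", "Tweet Clean"])
df["Calculated Sentiment"] = tweet_sentiment_results  # one label per row, in order
```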


How many tweets were incorrectly classified?

Assuming the column Sentiment holds the correct sentiment, how many tweets did we classify incorrectly in our Calculated Sentiment column?
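One way to count the mismatches is to compare the two columns element-wise and sum the resulting booleans. The sketch lowercases both sides because it assumes the labels might differ in capitalization; the sample rows are invented.

```python
import pandas as pd

# Invented sample standing in for the real dataframe.
df = pd.DataFrame({
    "Sentiment": ["neutral", "positive", "negative", "negative"],
    "Calculated Sentiment": ["neutral", "negative", "negative", "positive"],
})

# Assumption: compare lowercased copies in case the two columns use different capitalization.
is_mismatch = df["Sentiment"].str.lower() != df["Calculated Sentiment"].str.lower()
incorrect_count = int(is_mismatch.sum())  # True counts as 1 when summed
```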


How many Negative tweets were incorrectly classified (either as `Neutral` or `Positive`)?
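This question can be sketched as two boolean masks combined with `&`: rows whose true label is negative, and rows whose calculated label is anything else. The lowercase comparison is an assumption about how the labels are spelled, and the sample rows are invented.

```python
import pandas as pd

# Invented sample standing in for the real dataframe.
df = pd.DataFrame({
    "Sentiment": ["negative", "negative", "negative", "positive"],
    "Calculated Sentiment": ["negative", "neutral", "positive", "negative"],
})

# Rows that are truly negative according to the Sentiment column.
truly_negative = df["Sentiment"].str.lower() == "negative"
# Rows that we labelled as something other than negative.
predicted_other = df["Calculated Sentiment"].str.lower() != "negative"

negative_misclassified = int((truly_negative & predicted_other).sum())
```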

