Matching Strings by Similarity using Levenshtein distance

In this project we'll use Levenshtein distance to clean a dataset in which the same companies appear under names that are similar but not identical. We'll use an external fuzzy-matching library to score the names and pandas for the final analysis.
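As a rough illustration of the idea, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a minimal pure-Python sketch written for this explanation (the name `levenshtein` is our own); it is not the implementation used by the fuzzy-matching library in the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    # Keep `b` as the shorter string so each row of the table stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A small distance relative to the length of the strings suggests that two names may refer to the same company.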

Project Activities

All our Data Science projects include bite-sized activities to test your knowledge and practice in an environment with constant feedback.

All our activities include solutions with explanations on how they work and why we chose them.

Create the `df` dataframe containing the product of the two CSVs

We have already read the two CSVs into the df1 and df2 variables. Now use the itertools.product function to create a dataframe df containing the Cartesian product of the two CSVs, that is, every pairing of a company from df1 with a company from df2. The columns should be named CSV 1 and CSV 2.

As df1 has 266 rows and df2 has 368, the resulting df will have 97,888 rows (266 * 368).
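One way this step could look is sketched below. The tiny df1 and df2 are made-up stand-ins for the real CSVs, and the column name `Company` is an assumption, since the actual column names are not shown here.

```python
from itertools import product

import pandas as pd

# Made-up stand-ins for the two CSVs (the real ones have 266 and 368 rows).
# The "Company" column name is an assumption for this sketch.
df1 = pd.DataFrame({"Company": ["Acme Corp", "Globex"]})
df2 = pd.DataFrame({"Company": ["Acme Corporation", "Globex Inc", "Initech LLC"]})

# product() yields every (df1 name, df2 name) pair.
pairs = list(product(df1["Company"], df2["Company"]))
df = pd.DataFrame(pairs, columns=["CSV 1", "CSV 2"])

print(len(df))  # 6, i.e. 2 * 3
```

The row count of the product is always the row count of df1 multiplied by that of df2, which is a quick sanity check on the result.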

Create a new column `Ratio Score` that contains the similarity score for all the rows in `df`

Now apply the fuzz.partial_ratio function to every pair of companies in df to measure how similar they are. It returns a score from 0 to 100, where a higher value means a closer match, so it is a similarity score rather than a raw distance. Store the score in a new column named Ratio Score.
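A minimal sketch of this step follows. It assumes the fuzzy-matching library is thefuzz (fuzzywuzzy exposes the same `fuzz.partial_ratio` call); the project text does not name the library. So that the sketch still runs without that package, it falls back to a plain `difflib` ratio, which is not identical to `partial_ratio`. The company names are made up.

```python
import pandas as pd

try:
    # Assumption: the external library is thefuzz.
    from thefuzz import fuzz

    score = fuzz.partial_ratio
except ImportError:
    # Fallback for this sketch only: a plain difflib ratio scaled to 0-100.
    # It does NOT reproduce partial_ratio's substring matching.
    from difflib import SequenceMatcher

    def score(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())


# Made-up rows standing in for the product dataframe.
df = pd.DataFrame(
    {
        "CSV 1": ["Acme Corp", "Acme Corp", "Globex"],
        "CSV 2": ["Acme Corp", "Initech LLC", "Globex Inc"],
    }
)

# Score each (CSV 1, CSV 2) pair row by row.
df["Ratio Score"] = [score(a, b) for a, b in zip(df["CSV 1"], df["CSV 2"])]

print(df)
```

Iterating over the two columns with `zip` avoids the overhead of a row-wise `df.apply`, which can matter with close to 100,000 pairs.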

How many rows have a `Ratio Score` of `90` or more?
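A question like this can be approached with a boolean filter on the score column. The sketch below uses made-up names and scores, so its count says nothing about the real dataset.

```python
import pandas as pd

# Made-up rows and scores, for illustration only.
df = pd.DataFrame(
    {
        "CSV 1": ["Acme Corp", "Acme Corp", "Globex", "Globex"],
        "CSV 2": ["Acme Corporation", "Initech LLC", "Globex Inc", "Globe Ltd"],
        "Ratio Score": [100, 45, 92, 89],
    }
)

# Keep only the rows at or above the threshold, then count them.
high = df[df["Ratio Score"] >= 90]

print(len(high))  # 2
```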

What's the corresponding company in CSV2 to `AECOM` in CSV1?

We saw that CSV1 contains the company AECOM. What's the corresponding value in CSV2?
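Lookups of this kind can be sketched as a filter on the CSV 1 name combined with a score threshold. The helper name `candidates`, the default threshold of 90, and all names and scores below are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Made-up rows and scores, for illustration only.
df = pd.DataFrame(
    {
        "CSV 1": ["Acme Corp", "Acme Corp", "Globex", "Globex"],
        "CSV 2": ["Acme Corporation", "Initech LLC", "Acme Corporation", "Globex Inc"],
        "Ratio Score": [100, 30, 25, 100],
    }
)


def candidates(frame, name, threshold=90):
    """Rows whose CSV 1 value equals `name` and that score at least `threshold`."""
    mask = (frame["CSV 1"] == name) & (frame["Ratio Score"] >= threshold)
    return frame[mask].sort_values("Ratio Score", ascending=False)


matches = candidates(df, "Acme Corp")
print(matches["CSV 2"].tolist())  # ['Acme Corporation']
```

The length of the returned frame gives the number of likely matches for a company, and an empty frame suggests there is no match at that threshold. High scores can still be false positives, so the candidates are worth checking by eye.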

What's the corresponding CSV2 company of *Starbucks*?

The CSV1 company is Starbucks. What's the corresponding name in CSV2?

Is there a matching company for `Pinnacle West Capital Corporation`?

CSV1 contains Pinnacle West Capital Corporation. Is there a match in CSV2?

How many matching companies are there for `County of Los Angeles Deferred Compensation Program`?

CSV1 contains County of Los Angeles Deferred Compensation Program. How many matching companies appear to be in CSV2?

Is there a matching company for `The Queens Health Systems`?

CSV1 contains The Queens Health Systems. Is there a match in CSV2?

Author

Santiago Basulto

This project is part of

Data Cleaning with Pandas
