For the last few weeks, I’ve been talking about using Microsoft Access tools to clean your Excel data. Excel is a popular place to store data, especially among small businesses, but its easy to end up with messy data. So far we’ve looked at correcting U.S. States that are entered inconsistently, fixing phone number formatting issues, and cleaning non-numeric characters from numeric fields.
Today we’ll tackle duplicates. Take a look at the sample data that I’m working with, which I’ve imported from Excel, and see if you can spot the problem:
Note rows 1 and 7. These appear to be duplicates, with the exact same name. There are others that are nearly matched, and we'll deal with those later. For now, we're going to focus on those that are exact matches. Let’s see how we can use Access tools to identify these records.
Find Exact Duplicates
There’s a feature in Access called a “Find Duplicates” query. It’ll look through the data in one of your columns, and find items that are the same on multiple records. In the sample data, it’ll pick up on the duplicated company names in rows 1 and 7.
The Find Duplicates query is very easy to create. Start by importing your Excel data into Access. Then, on the Create ribbon, click Query Wizard:
Walk through the Wizard screens that pop up, with the following parameters:
Choose the table that contains your imported Excel data.
Choose the column that you want to check for duplicates – in my case, the Vendor Name column.
Choose any other columns that you want to display in the results. You’ll want some additional data to appear in the results to help you identify the rows. With my sample data, I might choose to display state, phone and credit limit, for example.
Give the query a name.
Next, I execute the query, and I see this result:
This allows me to clearly see the duplicates in my table. I can now edit the table or the Excel file, and remove the duplicate record.
Close, but Not Exact
The Find Duplicates query is a great feature of Access, but its limitation is that it’ll only find exact matches. My data sample had a other duplicates: “XYZ Company” and “XYZ Co”; and "NH Cold Fill" and "NH Cold Fill Industries". These don’t appear in the Find Duplicates query, because they are not an exact match.
I can alter the query to find rows that are almost exactly the same, using a couple of techniques.
First Few Characters
First, you can identify names that match in the first few characters. Let’s talk about “XYZ Company” and “XYZ Co”. In this case, the first 6 characters of both names match. I’ll alter my Find Duplicates query to check for a match only on the first 6 characters.
Here’s the definition of my Find Duplicates query, as it exists now, checking only for exact matches:
Take a look at the Criteria row under the Vendor Name column. You may have to brush up on your SQL syntax to decipher this. It’s using a Group By statement to group the list together by Vendor Name, which gives us a unique list of vendor names. It also uses a “Count” statement to find only those Vendor Names that have a count greater than 1. In other words, show me every vendor name that has more than one record.
In order to change this to check only the first 6 digits of the vendor name, I’ll change the query to look like this:
Notice that I added a “Left” function in three places: on the top row of the grid; and twice in the Criteria row (once in the Select statement and once in the Group By statement). Now I’m telling the query to extract the first 6 digits from each vendor name, and group on that value. My result set looks like this:
Now we can see that I have three potential duplicates. But the list is confusing, because I can’t differentiate between different rows on the result set in some cases. I’m going to edit the query to show a couple more columns:
It may seem odd that I now have Vendor Name twice in the query, but that’s OK. The first column will show only the first 6 digits of the Vendor Name, while the third column will show the entire Vendor Name. I’m also showing the Vendor ID, so that we’ll be clear on the records that are problematic:
Still Limited
This method is good as long as you’re pretty sure the first 6 digits of duplicate entries will match. Let’s look at one more method that will allow you some additional options, though.
With this method, I’m going to include the same table in my query twice. It looks like this:
Notice that the two tables appear to have different names. That’s an illusion. When you add the same table into your query twice, Access automatically assigns an alias to one of them. In this case, it used the alias “Excel Vendor List_1”.
Now check out the grid. Let’s see what’s happening in the Criteria of the first column. I’m using the “&” character to append a wildcard character to either end of the vendor name. So, if the vendor name is “XYZ Co”, that will end up being “*XYZ Co*”. I’m using the “Like” operator in order to perform a wildcard search. So, I'm telling it to show me vendor names in the first table that match a wild-carded version of the vendor name in the second. In other words" "XYZ Company" in the first table matches "*XYZ Co*" in the second table.
I have another criteria on the Vendor ID column, indicating that I only want to see records where the Vendor ID from “Excel Vendor List” doesn’t match the Vendor ID from “Excel Vendor List_1”. This prevents the query from showing me a record that matches itself. I’ll only show matches where the record in the first table is not the exact same record in the second table.
Let’s see the result:
This is not bad. It correctly picked up all of the duplicates. Our only problem now is that if the names are an exact match (as in ABC-X Company), you see the record twice. In this case, it’s telling me that Vendor ID 7 in the first table matches Vendor ID 1 in the second, and also Vendor ID 1 in the first table matches Vendor ID 7 in the second table.
There are many ways you can get around this issue. In my opinion, the easiest way to handle this is by first using a “Find Duplicates” query, described above, to find and resolve all exact matches, and then use this query to find close matches.
You’ve Got Options
With this last query, you’ve got options. You can search for near matches in many different ways, by changing the Criteria row, and by using functions such as Right and Left. Let’s look at just one example. If I add the State columns from both tables to my query, we can see that NH Cold Fill has records in two different states:
If they are in two different states, that might indicate to me that they are not duplicates. I can exclude records with mismatched states by adding another criteria to my query:
Now I’ve told the query to only show me records if the states match, in addition to the other criteria. As you can see, this drops NH Cold Fill off the results:
I like this query because it query gives you the ability to add multiple criteria in order to fine tune your duplicate search.
I’m nearly finished with this series on cleaning your Excel data using Microsoft Access tools. Next week, in the final installment of the series, I’ll talk about getting your cleaned data back to Excel. If you’d like to be notified when that post goes live, subscribe to my mailing list using the form below.
Need help cleaning your Excel data? I’ve got lots of tricks up my sleeve that can make this tedious process more efficient. Call or email to start a discussion about your project. I look forward to hearing from you!
Kim
About the Author
Kimberlee Martin is the owner of North Port Solutions and has 30 years of IT experience. She's worked with several programming languages and database tools over the years, with her favorites being Microsoft Excel VBA, SQL Server, and Visual C#. Her passion is helping small businesses gain insights into their business with effective reporting and data management.
Contact Kim at kmartin@northportsoftware.com.
#northportsolutions #learningneverstops #microsoftaccess #msaccess #database #dataanalysis #smallbusiness #excel #microsoftexcel #msexcel
© North Port Solutions, LLC, 2022. All rights reserved.
Comments