Practical CRM Deduplication and Data Cleaning

Nick Petri

July 17, 2013

If you’re a CRM administrator or salesperson, chances are pretty good that you have to deal with duplicate entries in your CRM on a daily basis. Eliminating dupes from a CRM — or even better, making sure they never get in — is a constant struggle that can lead to major headaches if not addressed in a systematic way.
While most CRMs have some sort of deduplication software (in the case of Salesforce.com, there are numerous third-party apps), de-duplication is not just about having the right software, for two reasons:
1) You won’t always have the software. I often find myself deduplicating outside of a CRM for ad-hoc analysis projects or to pre-screen companies for additional research. I can’t use a Salesforce deduplication app for this purpose.
2) Software can solve the workflow problems of data deduplication, but not the logical problems. Is “John S. Smith at RA Associates” the same person as “John Smith at R.A. Associates, Inc.?” Answering this question takes knowledge of your target companies and list quality. The software doesn’t have this knowledge, so usually it will lean towards data preservation and not flag this entry as a duplicate. Only by pre-scrubbing the data before it goes into your CRM can you be sure you’ve cleared formatting inconsistencies in the way that makes sense for your company.
To address manual deduplication and data cleaning, following these best practices.

Company Deduplication

Let’s say I have a spreadsheet of leads who downloaded a whitepaper along with their company and website. I want to match these leads against another spreadsheet of Alexa traffic rankings to segment the list into SMBs and Enterprise. Each list has a company name and website.
It’s as simple as running a vlookup function on the website field, right?
Wrong. The website field in your lead list has lots of blank entries, and even the ones that are populated don’t always have the same format. In one list Amazon may be listed as, “Amazon.com,” while in the other list it’s displayed as “http://www.amazon.com/aws”. While this difference is obvious to a person, the vlookup function doesn’t know to capture this subtlety. You’ll have to take some steps to clean each field before the vlookup function is able to do its job.

1) Cleaning the Website Fields

Our first order of business is to clean each field down to a domain. Since domains are unique and have a strict standard format, this will bypass any formatting inconsistencies.
While all of the above adjustments can be made using the Find and Replace feature in excel or left(), right(), and find() functions, power users may find it much easier to write a simple script to do this for you. I’ve done this using Python, with a plugin called DataNitro that allows me to run them as excel macros.
Regardless of how you go about it, cleaning the website down to a domain will enable you to run a vlookup function and cleanly match your list of leads to an Alexa rank, or any other separate list, as well as deduping internally.
But it won’t solve all of your problems.

2) Cleaning the Company Field

If your lists are like ours, you’ll often find yourself missing the website field entirely. In that case, you’ll have to use the company field.
Unfortunately, unlike a domain, which has a standard format, the same company can go by many aliases. Take, for instance, the fictional law firm T.A.B.H. and Associates, Incorporated. The following table shows you how to clean both entries to a common “dedup key” that will be easily matched using a vlookup function.
Depending on how large and messy your list is, you may have to take additional steps to clean company names down to a common format for your deduplication key. Still, we recommend importing the entry under its original ‘dirty’ company name to prevent any data loss.

3) Stricter Matching

When you have a large number of companies, the steps above will sometimes flag too many companies as duplicates when they really aren’t the same for your purposes. For instance, “Computer Consultants Inc” and “Computer Consultants Corporation” will be incorrectly identified as duplicates using the deduplication key.
To create a stricter matching key, consider concatenating the dedup key with another piece of information, such as the state abbreviation. Therefore, “computerconsultants ma” would no longer match “computerconsultants nj”. This strategy only works when both “State” fields are populated.

Deduplicating People

Now let’s say we’re trying to merge a list of event attendees against the same list of website visitors.
Unfortunately, the steps above will miss duplicates in the case of nicknames or misspellings. To catch more duplicates, you can use just the first initial (“jsmithcomputerassociates”), but be careful — larger companies like IBM may have dozens of J. Smiths that will result in false duplicates. Only use this strategy for very small companies.Next concatenate the person dedup key with the domain, and separately, the company dedupe key. Try matching both to uncover any possible duplicate people.
Executing all of these steps in Excel is painful, error-prone and time-consuming, so we highly recommend using macros to automate the process, if you find yourself doing it frequently.
For tips on how I’ve automated much of this process using DataNitro, feel free to email me.

Practical CRM Deduplication and Data Cleaning

Company Deduplication

1) Cleaning the Website Fields

2) Cleaning the Company Field

3) Stricter Matching

Deduplicating People

Subscribe & Scale

Sign up for our newsletter

Share on Mastodon