Cleaning dirty data is considered one of the biggest obstacles in data warehousing because of the sheer number of issues that can occur. An affordable, robust data cleansing tool such as Data Ladder’s DataMatch Enterprise can save significant time and money, and spare you the hassle of correcting errors manually.
The Data Warehousing Institute (TDWI) estimates that poor quality customer data costs U.S. businesses $611 billion a year in postage, printing and staff overhead (TDWI estimates based on cost-savings cited by survey respondents and others who have cleaned up name and address data).
How can you address the pervasive problem of dirty data cost-effectively? Implementing a data cleansing program and adhering to clear data governance standards is a great start.
Let’s start by defining data cleansing. Data cleansing is the act of removing or correcting the dirty data in a database: records that are incorrect, outdated, irrelevant, or incomplete. From incorrect phone numbers to missing zip codes, these errors can come from many different sources. A common one is combining multiple data sources, which creates duplicate records because the same information is represented in different ways.
Fuzzy matching is a key component of any data cleansing program. Broadly defined, fuzzy matching uses algorithms that measure how similar two pieces of data are, so the result is a degree of similarity rather than a strict true or false. Fuzzy matching software relies on a set of parameters to find terms that are related to, but not necessarily identical to, the query terms.
A sophisticated tool in the data cleansing toolbox, fuzzy matching software uses complex processes that operate at various levels of interpretation – from sentences to phrases.
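To make the idea of a similarity score concrete, here is a minimal sketch using Python’s standard difflib module. It is an illustration only, not how DataMatch Enterprise works internally; the function names and the 0.85 cut-off are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat a pair as a match when its score reaches the cut-off.

    The 0.85 default is an arbitrary value for illustration.
    """
    return similarity(a, b) >= threshold


pairs = [("Hammer", "Hamer"), ("Folder", "Foldwer"), ("Hammer", "Folder")]
for left, right in pairs:
    print(left, right, round(similarity(left, right), 2), is_match(left, right))
```

The first two pairs score above 0.9 and count as matches, while the unrelated pair scores far lower. Raising or lowering the cut-off trades missed matches against false positives.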
Data Ladder’s DataMatch Enterprise finds the right data – even with incomplete information. Our algorithms can find the areas of similarity regardless of what fields they’re located in or however the data is aligned.
Our platform is an affordable, robust approach to making imperfect data usable. Our software can make the right connections with any type of structured data. From spelling errors to redundancies, our tool can work through many of the common issues found in large amounts of data.
Implementing a data quality program at your organization does not have to be cost-prohibitive. Many companies that use large data quality providers find they are spending in excess of $400K. The investment can be far larger once you include additional fees, additional full-time ETL staff and data quality stewards, and other internal costs. Here are a few tips to consider when shopping for a data cleansing software provider:
1. Never accept the original quote. In the data quality space, large software providers usually reduce the price by 20-50%.
2. Expect additional service fees once the project is up and running; these can double the cost of the project. Set a limit on them up front.
3. Include the cost of internal resources, usually the people trained to use the large data quality solution.
4. Use an affordable, easy-to-use, desktop-based cleansing software tool before buying a large solution. With DataMatch Enterprise, you’ll achieve 80% of the results quickly and affordably. Show your organization the value of data quality, and get the buy-in needed to implement higher-cost, higher-effort solutions.
DataMatch Enterprise can handle many of the issues that compromise your data systems. Our system is scalable: even with large datasets, the information can be analyzed with lightning-fast response times.
The result for you? Increased accuracy and less manual work required. Our software integrates directly with your database, yet functions independently and doesn’t affect any other applications.
We know our clients need good matches. That’s why we developed the Tetrahedric Model of matching, which finds specific information the first time, every time, regardless of spelling, numeric value, or limited information. DataMatch Enterprise matches data through the Tetrahedric Model via phonetic, numeric, domain-specific, and of course fuzzy search. Our approach to matching has consistently beaten IBM and SAS for best match accuracy by matching the way a person would.
DataMatch Enterprise Server can handle very difficult search problems, such as:
- Missing letters: “Hammer” or “Hamer”
- Variations: “Vinnie Smith” or “Vinny Smith”
- Extraneous letters: “Folder” or “Foldwer”
- Incomplete words: “Cleaners” or “leaners”
- Incorrect fielding in fielded data sets: “Larry Jones” for “Jones Larry”
- Incorrect or missing punctuation: “World-class data” for “World class data”
- Words that sound the same: “Stephen” and “Steven”
- Numerical variances: “100” and “99”
- Domain specific matching: “Wifi” and “Wireless connection”
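Three of the cases above (swapped fields, punctuation differences, and words that sound the same) can be sketched with a few lines of standard-library Python. This is a simplified illustration, not DataMatch Enterprise’s implementation; the helper names are hypothetical, and the phonetic step uses the classic Soundex code as a stand-in for whatever phonetic algorithm a real product applies.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and turn every run of punctuation or spaces into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def token_sort_key(text: str) -> str:
    """Sort the words so that field order no longer matters."""
    return " ".join(sorted(normalize(text).split()))


# Soundex digit for each coded consonant; vowels, h, w and y have no code.
_SOUNDEX = {ch: digit
            for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                   ("l", "4"), ("mn", "5"), ("r", "6"))
            for ch in letters}


def soundex(word: str) -> str:
    """Classic four-character Soundex code, e.g. a letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]


print(token_sort_key("Larry Jones") == token_sort_key("Jones Larry"))
print(normalize("World-class data") == normalize("World class data"))
print(soundex("Stephen") == soundex("Steven"))
```

Each comparison prints True. In practice these exact-after-normalization checks would be combined with a similarity score, since real records usually contain more than one kind of error at once.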
Additional features include:
- Cross field matching support
- Advanced matching techniques
- Scoring and cut-off modes, predicates
- Multi-table selection, semantic equivalence classes
- Advanced domain specific libraries such as nicknames, continuously updated
- Multilanguage Support (Unicode)
Our approach is to automate the way a person would judge a match. Logically speaking, your brain does not use any one algorithm for matching. Consider how you learn language as a young child. First you learn how things sound, which gives us our phonetic algorithms. Then you learn to read and write, which gives us our fuzzy matching algorithms. Then you learn about numbers: think of 99 and 100. They are not spelled the same and share no characters, but they are 99% similar.
Many other aspects of similarity are learned over time: nicknames, domain-specific expertise and libraries. Your brain does all of these simultaneously, and our software is designed to operate in the same way.
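The 99 versus 100 example can be worked through in a short sketch. The code below is an assumption-laden illustration of combining several similarity measures and keeping the best score; it is not the Tetrahedric Model itself, and the function names and the relative-difference formula are choices made for this example.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-based similarity from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """1.0 for equal values, shrinking as the relative difference grows."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero
    return max(0.0, 1.0 - abs(a - b) / largest)


def best_similarity(a: str, b: str) -> float:
    """Score a pair with every applicable measure and keep the highest."""
    scores = [text_similarity(a, b)]
    try:
        scores.append(numeric_similarity(float(a), float(b)))
    except ValueError:
        pass  # at least one value is not a number, so skip the numeric measure
    return max(scores)


print(text_similarity("99", "100"))            # the strings share no characters
print(round(best_similarity("99", "100"), 2))  # the numeric measure sees them as close
```

Compared as text, "99" and "100" score 0.0; compared as numbers, the difference of 1 out of 100 gives 0.99. Taking the highest score is one simple way to let different measures cover for each other.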
It is worth taking a few minutes to do some due diligence on how data quality affects your business. The facts show that a business that does not implement a data quality program and consistently adhere to its standards will be negatively impacted.
Data Ladder is constantly testing and improving its algorithms and workflow to improve matching percentages, both in terms of finding all matches and avoiding false positive matches. This approach leads to consistently finding 5-12% more matches than any other software, so you can get the most out of your data.