Deterministic Matching versus Probabilistic Matching
Which is better, Deterministic Matching or Probabilistic Matching?
I am not promising to give you an answer. But through this article, I would like to share some of my hands-on experiences that may offer insights to help you make an informed decision regarding your MDM implementation.
Before I got into the MDM space three years ago, I worked on systems development encompassing various industries that deal with Customer data. It was a known fact that duplicate Customers existed in those systems. But it was a problem that was too complicated to address, and it was never a priority as it wasn’t exactly revenue-generating. Therefore, the reality of the situation was simply accepted, and systems were built to work around the issue of duplicate Customers.
Corporations, particularly the large ones, are now recognizing the importance of having a better knowledge of their Customer base. In order to achieve their target market share, they need ways to retain and cross-sell to their existing Customers while at the same time, acquire new business through potential Customers. To do this, it is essential for them to truly know their Customers as individual entities, to have a complete picture of each Customer’s buying patterns, and to understand what makes each Customer tick. Hence, solving the problem of duplicate Customers has now become not just a means to achieve cost reduction, higher productivity, and improved efficiencies, but also higher revenues.
But how can you be absolutely sure that two customer records in fact represent one and the same individual? Conversely, how can you say with absolute certainty that two customer records truly represent two different individuals? The confidence level depends on a number of factors as well as on the methodology used for matching. Let us look into the two methodologies that are most widely used in the MDM space.
Deterministic Matching mainly looks for an exact match between two pieces of data. As such, one would think that it is straightforward and accurate. This may very well be true if the quality of your data is at a 100% level and your data is cleansed and standardized in the same way 100% of the time. We all know though that this is just wishful thinking. The reality is, data is collected in the various source systems across the enterprise in many different ways. The use of data cleansing and standardization tools that are available in the market may provide significant improvements, but experience has shown that there is still some level of customization required to even come close to the desired matching confidence level.
Deterministic Matching is ideal if your source systems are consistently collecting unique identifiers like Social Security Number, Driver’s License Number, or Passport Number. But in a lot of industries and businesses, the collection of such information is not required, and even if you try, most customers will refuse to give you such sensitive information. Thus, in the majority of implementations, several data elements like Name, Address, Phone Number, Email Address, Date of Birth, and Gender are deterministically matched separately and the results are tallied to come up with an overall match score.
The implementation of Deterministic Matching requires sets of business rules to be carefully analyzed and programmed. These rules dictate the matching and scoring logic. As the number of data elements to match increases, the matching rules become more complex, and the number of permutations of matching data elements to consider substantially multiplies, potentially up to a point where it may become unmanageable and detrimental to the system’s performance.
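The tallying approach described above can be sketched minimally in Python. The field names, point values, and records below are hypothetical examples for illustration, not rules from any specific MDM product.

```python
# Illustrative sketch of deterministic element-by-element matching.
# Each data element is compared for an exact match and the points
# are tallied into an overall match score.

RULES = {
    "last_name":     30,
    "first_name":    20,
    "date_of_birth": 25,
    "phone":         15,
    "email":         10,
}

def deterministic_score(rec_a, rec_b):
    """Tally a score from exact matches on individual data elements."""
    score = 0
    for field, points in RULES.items():
        a, b = rec_a.get(field), rec_b.get(field)
        # Only award points when both sides are populated and agree exactly
        # (after trivial case/whitespace normalization).
        if a and b and a.strip().upper() == b.strip().upper():
            score += points
    return score

a = {"first_name": "John", "last_name": "Smith",
     "date_of_birth": "1980-04-02", "phone": "555-0100"}
b = {"first_name": "JOHN", "last_name": "Smith",
     "date_of_birth": "1980-04-02", "email": "js@example.com"}

print(deterministic_score(a, b))  # 75: name and birth date agree; phone/email cannot
```

Even in this toy version you can see how the rule table grows: every additional data element multiplies the populated/unpopulated/agree/disagree permutations that real scoring rules must cover.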
Probabilistic Matching uses a statistical approach to measure the probability that two customer records represent the same individual. It is designed to work with a wider set of data elements for matching. It uses weights to calculate the match scores, and it uses thresholds to determine a match, non-match, or possible match. Sounds complicated? There’s more.
I recently worked on a project using the IBM InfoSphere MDM Standard Edition, formerly Initiate, which uses Probabilistic Matching. Although other experts on the team actually worked on this part of the project, here are my high-level observations. Note that other products in the market that use the Probabilistic Matching methodology generally work around similar concepts.
- It is fundamental to properly analyze the data elements, as well as the combinations of such data elements, that are needed for searching and matching. This information goes into the process of designing an algorithm where the searching and matching rules are defined.
- Access to the data up-front is crucial, or at least a good sample of the data that is representative of the entire population.
- Probabilistic Matching takes into account the frequency of the occurrence of a particular data value against all the values in that data element for the entire population. For example, the First Name ‘JOHN’ matching with another ‘JOHN’ is given a low score or weight because ‘JOHN’ is a very common name. This concept is used to generate the weights.
- Search buckets are derived based on the combinations of data elements in the algorithm. These buckets contain the hashed values of the actual data. The searching is performed on these hashed values for optimum performance. Your search criteria are basically restricted to these buckets, and this is the reason why it is very important to define your search requirements early on, particularly the combinations of data elements forming the basis of your search criteria.
- Thresholds (i.e. numeric values representing the overall match score between two records) are set to determine when two records should: (1) be automatically linked because the score gives high confidence that the two records are the same; (2) be manually reviewed because the two records may be the same but there is doubt; or (3) not be linked because the score gives high confidence that the two records are not the same.
- It is essential to go through the exercise of manually reviewing the matching results. In this exercise, sample pairs of real data that have gone through the matching process are presented to users for manual inspection. These users are preferably a handful of Data Stewards who know the data extremely well. The goal is for the users to categorize each pair as a match, non-match, or maybe.
- The categorizations done by the users in the sample pairs analysis are then compared with the calculated match scores, determining whether or not the thresholds that have been set are in line with the users’ categorizations.
- The entire process may then go through several iterations. Per iteration, the algorithm, weights, and thresholds may require some level of adjustment.
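The frequency-based weighting and threshold ideas above can be sketched minimally. This is a simplified illustration only; the tiny name corpus, the `-log2(frequency)` formula, and the threshold values are hypothetical, and real products derive weights through a more elaborate statistical process.

```python
import math
from collections import Counter

# Illustrative sketch: rarer values earn higher match weights, and the
# summed score is compared against thresholds to decide the outcome.

population = ["JOHN", "JOHN", "JOHN", "JOHN", "MARY", "MARY", "ZELDA"]
freq = Counter(population)
total = len(population)

def match_weight(value):
    """Weight an agreement by how rare the value is in the population."""
    return -math.log2(freq[value] / total)

print(round(match_weight("JOHN"), 2))   # 0.81: common name, weak evidence
print(round(match_weight("ZELDA"), 2))  # 2.81: rare name, strong evidence

AUTOLINK, REVIEW = 3.0, 1.5  # hypothetical thresholds

def decide(score):
    """Map an overall score to link / manual review / no link."""
    if score >= AUTOLINK:
        return "link"
    if score >= REVIEW:
        return "manual review"
    return "no link"
```

This is why two matching ‘JOHN’s barely move the score while a matching rare name moves it a lot, and why the weights must be regenerated as the population changes.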
As you can see, the work involved in Probabilistic Matching appears very complicated. But think about the larger pool of statistically relevant match results that you may get, of which a good portion might be missed if you were to use the relatively simpler Deterministic Matching.
Factors Influencing the Confidence Level
Before you make a decision on which methodology to use, here are some data-specific factors for you to consider. Neither the Deterministic nor the Probabilistic methodology is immune to these factors.
Knowledge of the Data and the Source Systems
First and foremost, you need to identify the Source Systems of your data. For each Source System that you are considering, do the proper analysis and pose the questions: Why are you bringing in data from this Source System? What value will the data from this Source System bring into your overall MDM implementation? Will the data from this Source System be useful to the enterprise?
For each Source System, you need to identify which data elements will be brought into your MDM hub. Which data elements will be useful across the enterprise? For each data element, you need to understand how it is captured (added, updated, deleted) and used in the Source System, the level of validation and cleansing done by the Source System when capturing it, and what use cases in the Source System affect it. Does it have a consistent meaning and usage across the various Source Systems supplying the same information?
Doing proper analysis of the Source Systems and their data will go a long way toward making the right decisions on which data elements to use or not to use for matching.
A very critical task that is often overlooked is Data Profiling. I cannot emphasize enough how important it is to profile your data early on. Data Profiling will reveal the quality of the data that you are getting from each Source System. It is particularly vital to profile the data elements that you intend to use for matching.
The results of Data Profiling will be especially useful in identifying the anonymous and equivalence values to be considered when searching and matching.
Here are some examples of Anonymous values:
- Phone Numbers: 1-234-5678, 1-111-11111, 9-999-9999
- Email Addresses: firstname.lastname@example.org, email@example.com
Here are some examples of Equivalence values:
- First Name ‘WILLIAM’ has the following equivalencies (nicknames): WILLIAM, BILL, BILLY, WILL, WILLY, LIAM
- First Name ‘ROBERT’ has the following equivalencies (nicknames): ROBERT, ROB, ROBBY, BOB, BOBBY
- In Organization Name, ‘LIMITED’ has the following equivalencies: LIMITED, LTD, LTD.
- In Organization Name, ‘CORPORATION’ has the following equivalencies: CORPORATION, CORP, CORP.
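A minimal sketch of how profiling results like these feed into matching: anonymous values are treated as missing, and equivalence values are mapped to a canonical form before comparison. The lookup tables below are tiny hypothetical samples; in practice they come from profiling the full population.

```python
# Illustrative normalization using anonymous and equivalence values.

ANONYMOUS_PHONES = {"1-234-5678", "9-999-9999"}

NICKNAMES = {  # map nicknames to a canonical first name
    "BILL": "WILLIAM", "BILLY": "WILLIAM", "WILL": "WILLIAM",
    "WILLY": "WILLIAM", "LIAM": "WILLIAM",
    "ROB": "ROBERT", "ROBBY": "ROBERT", "BOB": "ROBERT", "BOBBY": "ROBERT",
}

ORG_SUFFIXES = {"LTD": "LIMITED", "LTD.": "LIMITED",
                "CORP": "CORPORATION", "CORP.": "CORPORATION"}

def normalize_phone(phone):
    """Treat known anonymous values as missing rather than matchable data."""
    return None if phone in ANONYMOUS_PHONES else phone

def normalize_first_name(name):
    """Collapse nicknames to one canonical form so BOB can match ROBERT."""
    name = name.strip().upper()
    return NICKNAMES.get(name, name)

def normalize_org(name):
    """Expand common abbreviations so ACME CORP matches ACME CORPORATION."""
    words = [ORG_SUFFIXES.get(w, w) for w in name.strip().upper().split()]
    return " ".join(words)

print(normalize_first_name("Bob"))    # ROBERT
print(normalize_org("Acme Corp"))     # ACME CORPORATION
print(normalize_phone("1-234-5678"))  # None
```

Without this step, ‘BOB SMITH’ and ‘ROBERT SMITH’ score as a first-name disagreement, and a junk phone number scores as a strong agreement between two unrelated records.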
If the Data Profiling results reveal poor data quality, you may need to consider applying data cleansing and/or standardization routines. The last thing you want is to pollute your MDM hub with bad data. Clean and standardized data will significantly improve your match rate. If you decide to use cleansing and standardization tools available in the market, make sure that you clearly understand their cleansing and standardization rules. Experience has shown that some level of customization may be required.
Here are important points to keep in mind regarding Address standardization and validation:
- Some tools do not necessarily correct the Address to produce exactly the same standardized Address every time. This is especially true when the tool is simply validating that the Address entry is mailable. If it finds the Address entry as mailable, it considers it as successfully standardized without any correction/modification.
- There is also the matter of smaller cities being amalgamated into one big city over time. Say one Address has the old city name (e.g. Etobicoke), and another physically the same Address has the new city name (e.g. Toronto). Both Addresses are valid and mailable addresses, and thus both are considered as successfully standardized without any correction/modification.
You have to consider how these will affect your match rate.
Take the time and effort to ensure that each data element you intend to use for matching has good quality data. Your investment will pay off.
Ideally, each data element you intend to use for matching should always have a value in it, i.e. it should be a mandatory data element in all the Source Systems. However, this is not always the case. This goes back to the rules imposed by each Source System in capturing the data.
If it is important for you to use a particular data element for matching even if it is not populated 100% of the time, you have to analyze how it will affect your searching and matching rules. When that data element is not populated in both records being compared, would you consider that a match? When that data element is populated in one record but not the other, would you consider that a non-match, and if so, would your confidence in that being a non-match be the same as when both are populated with different values?
Applying a separate set of matching rules to handle null values adds another dimension to the complexity of your matching.
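One way to picture that extra dimension is a comparison function that scores nulls explicitly instead of treating them as disagreements. The score values below are hypothetical; the point is only that missing-versus-populated is weaker evidence of a non-match than an outright disagreement.

```python
# Illustrative three-valued comparison for a single data element,
# handling nulls as their own case rather than as mismatches.

def compare_element(a, b, agree=10, disagree=-10, one_missing=-2, both_missing=0):
    """Score agreement, disagreement, and the two null cases separately."""
    if a is None and b is None:
        return both_missing   # no evidence either way
    if a is None or b is None:
        return one_missing    # weak evidence of a non-match
    return agree if a == b else disagree

print(compare_element("1980-04-02", "1980-04-02"))  # 10
print(compare_element("1980-04-02", None))          # -2
print(compare_element("1980-04-02", "1979-01-01"))  # -10
print(compare_element(None, None))                  # 0
```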
Timeliness of the Data
How old or how current is the data coming from your various Source Systems? Bringing outdated and irrelevant data into the hub may unnecessarily degrade your match rate, not to mention the negative impact the additional volume may have on performance. In most cases, old data is also incomplete, and collected with fewer validation rules imposed on it. As a result, you may end up applying more cleansing, standardization, and validation rules to accommodate such data in your hub. Is it really worth it? Will the data, which might be as much as 10 years old in some cases, truly be of value across the enterprise?
Volume of the Data
Early on in the MDM implementation, you should have an idea of the volume of data that you will be bringing into the hub from the various Source Systems. It will also be worthwhile to have some knowledge of the level of Customer duplication that currently exists in each Source System.
A fundamental decision that will have to be made is the style of your MDM implementation. (I will reserve the discussion of the various implementation styles for another time.) For example, you may require a Customer hub that merely persists cross-references to the data, with the data still owned by and maintained in the Source Systems, or you may need a Customer hub that actually maintains, owns, and serves as the trusted source of the Customer’s golden record.
Your knowledge of the volume of data from the Source Systems, combined with the implementation style that you need, will give you an indication of the volume of data that will in fact reside in your Customer hub. This will then help you make a more informed decision on which matching methodology will be able to handle that volume better.
Other Factors to Consider
In addition to the data-specific factors above, here are other factors to which you should give a great deal of thought.
Goal of the Customer Hub
What are your short-term and long-term goals for your Customer hub? What will you use it for? Will it be used for marketing and analytics only, or to support your transactional operations only, or both? Will it require real-time or near-real-time interfaces with other systems in the enterprise? Will the interfaces be one-way or two-way?
Just like any software development project, it is essential to have a clear vision of what you need to achieve with your Customer hub. It is particularly important because the Customer hub will touch most, if not all, facets of your enterprise. Proper requirements definition early on is key, as well as the high-level depiction of your vision, illustrating the Customer hub and its part in the overall enterprise architecture. You have a much better chance of making the right implementation decisions, particularly as to which matching methodology to use, if you have done the vital analysis, groundwork, and planning ahead of time.
Tolerance for False Positives and False Negatives
False Positives are matching cases where two records are linked because they were found to match, when they in fact represent two different entities. False Negatives are matching cases where two records are not linked because they were found to not match, when they in fact represent the same entity.
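These definitions boil down to a simple tally when you compare the system’s link decisions against ground-truth labels, such as the Data Stewards’ categorizations from the sample pairs analysis. The pairs below are made-up illustrations.

```python
# Illustrative tally of False Positives and False Negatives against
# ground-truth labels for a handful of record pairs.

pairs = [
    # (system_linked, truly_same_entity)
    (True,  True),   # correct link
    (True,  False),  # False Positive: linked, but different entities
    (False, True),   # False Negative: not linked, but same entity
    (False, False),  # correct non-link
    (True,  True),   # correct link
]

false_positives = sum(1 for linked, same in pairs if linked and not same)
false_negatives = sum(1 for linked, same in pairs if not linked and same)

print(false_positives)  # 1
print(false_negatives)  # 1
```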
Based on the very nature of the two methodologies, Deterministic Matching tends to have more False Negatives than False Positives, while Probabilistic Matching tends to have more False Positives than False Negatives. But these tendencies may change depending on the specific searching and matching rules that you impose in your implementation.
The question is: what is your tolerance for these false matches? What are the implications to your business and your relationship with the Customer(s) when such false matches occur? Do you have a corrective measure in place?
Your tolerance may depend on the kind of business that you are in. For example, if your business deals with financial or medical data, you may have high tolerance for False Negatives and possibly zero tolerance for False Positives.
Your tolerance may also depend on what you are using the Customer hub data for. For example, if you are using the Customer hub data for marketing and analytics alone, you may have a higher tolerance for False Positives than False Negatives.
Performance and Service Level Requirements
The performance and service level requirements, together with the volume of data, need careful consideration in choosing between the two methodologies. The following, to name a few, may also impact performance and hence need to be factored in: complexity of the business rules, transactions that will retrieve and manipulate the data, the volume of these transactions, and the capacity and processing power of the machines and network in the system infrastructure.
In the Deterministic methodology, the number of data elements being used for matching and the complexity of the matching and scoring rules can seriously impact performance.
The Probabilistic methodology uses hashed values of the data to optimize searching and matching; however, there is extra overhead in deriving and persisting the hashed values when adding or updating data. A poor bucketing strategy can also degrade performance.
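To make the bucketing idea concrete, here is a minimal sketch of a hashed candidate index. The bucket definitions, hash choice, and records are hypothetical; real products use more sophisticated derivations, but the performance trade-off is the same: searches only compare records that share a bucket, and every add or update must refresh the bucket entries.

```python
import hashlib

# Illustrative candidate bucketing: records are indexed under hashes of
# chosen data-element combinations; a search compares only records that
# share at least one bucket with the probe.

BUCKET_DEFS = [
    ("last_name", "date_of_birth"),
    ("phone",),
]

index = {}

def bucket_keys(record):
    """Derive a hashed key per configured data-element combination."""
    keys = []
    for combo in BUCKET_DEFS:
        values = [record.get(f) for f in combo]
        if all(values):  # only bucket when every element is populated
            raw = "|".join(values).upper()
            keys.append(hashlib.sha1(raw.encode()).hexdigest()[:12])
    return keys

def add_record(rec_id, record):
    """Persisting a record also persists its bucket entries (the overhead)."""
    for key in bucket_keys(record):
        index.setdefault(key, set()).add(rec_id)

def candidates(record):
    """Union of records sharing at least one bucket with the probe."""
    found = set()
    for key in bucket_keys(record):
        found |= index.get(key, set())
    return found

add_record(1, {"last_name": "Smith", "date_of_birth": "1980-04-02"})
add_record(2, {"phone": "555-0100"})

print(candidates({"last_name": "SMITH", "date_of_birth": "1980-04-02"}))  # {1}
```

A bucket that is too broad (say, last name alone) pulls huge candidate sets into scoring; one that is too narrow silently excludes true matches from ever being compared.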
On-going Match Tuning
Once your Customer hub is in production, your work is not done yet. There’s still the on-going task of monitoring how your Customer hub’s match rate is working for you. As data is added from new Source Systems, new locales, or new lines of business, or even just as updates to existing data are made, you have to observe how the match rate is being affected. In the Probabilistic methodology, tuning may include adjustments to the algorithm, weights, and thresholds. For the Deterministic methodology, tuning may include adjustments to the matching and scoring rules.
Regular tuning is key, more so with the Probabilistic than the Deterministic methodology. This is due to the nature of Probabilistic Matching, which takes into account the frequency of occurrence of a particular data value against all the values in that data element for the entire population. Even if there is no new Source System, locale, or line of business, the value distributions shift as data accumulates, so the Probabilistic methodology requires tuning on a regular basis.
It is therefore prudent to also consider the time and effort required for the on-going match tuning when making a decision on which methodology to use.
So, which is better, Deterministic Matching or Probabilistic Matching? The question should actually be: ‘Which is better for you, for your specific needs?’ Your specific needs may even call for a combination of the two methodologies instead of going purely with one.
The bottom line is: allocate enough time, effort, and knowledgeable resources to figuring out your needs. Consider the factors that I have discussed here, which are by no means an exhaustive list; there could be many more to take into account. Only then will you have a better chance of making the right decision for your particular MDM implementation.