Let me start by saying that this is not an article about big data. While the source of big data is external to your organization, it is a topic of its own. Many of the concepts and approaches discussed will definitely apply to your big data initiatives, but that won’t be the focus of this article.
External data is information that is sourced from outside of your organization. This could be information you purchase from a marketing or service organization, a government agency, the post office, or a business partner. There are many potential sources of external information.
External data can be used for various purposes in your MDM implementation. You can use external data to:
- Enrich your MDM data with new information you are unable to collect on your own
- Validate information you have captured in your own systems
- Update information you have captured to improve the quality of the data
- Provide reference data that supplies additional information without being fully integrated into your MDM system
- Supply test data for environments where data privacy rules prevent the use of production data without masking, but valid data is still required
There are many approaches to integrating external data with your MDM implementation. External data can be used to update your MDM data, as is the case with enrichment and data quality initiatives. It can also be stored outside of your MDM implementation, as is the case with reference data implementations and some data quality initiatives. How you decide to integrate your external data can also have licensing implications, either with your data provider or for your MDM platform. Depending on the source, you may also be able to choose between real-time integration services and bringing the full base of information in-house.
When integrating information from external sources, you may face the issue of which source to trust over another. Typically, purchased sources from outside provide a high quality of information, while data provided by a business partner may have issues of its own. If you haven't yet addressed trusted sources for your MDM single version of truth, external data sources may highlight this need.
Using External Data for Enrichment
Data enrichment is the process of augmenting your MDM data with external information. Sources used for data enrichment are usually purchased and the provider has made significant investment ensuring the quality of that information.
When purchasing external data to be used for enrichment, it is important to assess the information provider, the data they provide, and even their information delivery methods. Things to consider include:
- Does the provider use appropriate methods to collect and validate the information? Data to be used for enrichment should be of high quality and be able to be trusted.
- Is the information provided complete and consistent? You do not want to have to scrub external data you have paid for.
- Is the information well formatted so it can be easily integrated with your own information? Addresses can be particularly troublesome to integrate if the core information is incomplete such as missing a country code, or inconsistently formatted, making parsing and standardization a challenge.
- Does the licensing of the data allow you to keep the data you have stored in your MDM platform if you decide to discontinue licensing the data from the provider? Understand your rights to use the information even after you decide you no longer want to purchase it.
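The completeness question in the list above can be answered with a quick profile of a sample extract before you commit to a provider. The following is a minimal Python sketch, not a full data profiling tool. The function name `profile_missing` and the field names in the usage note are hypothetical and only serve the illustration.

```python
from typing import Dict, Iterable, List, Mapping


def profile_missing(rows: Iterable[Mapping[str, object]],
                    required_fields: List[str]) -> Dict[str, int]:
    """Count, per required field, how many rows have no usable value."""
    counts = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts
```

Called on a sample such as `[{"street": "1 Main St", "country": "CA"}, {"street": "2 Elm", "country": ""}]` with `["street", "country"]` as the required fields, it reports one missing country code and no missing streets. A high count on a field like country code is an early signal that address parsing and standardization will be difficult.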
When integrating external data, make sure you understand the attributes you are collecting and that they are fit for the purpose intended. Status attributes and categorizations are usually tied to some business rules. Ensure that you and your consumers understand what the attributes mean and what they should be used for.
When purchasing external data, there’s often a rich set of attributes available for you to use in your MDM platform. Care should be taken to observe your MDM rules for what is and is not master data. Just because the attributes are available does not mean that you should automatically collect them. Select your external data attributes with the same rigor you apply to your internal data. Don't collect the attributes because you can!
Using External Data for Validation and Quality Improvement
External data sources can be used to validate information you have collected from other sources. The most common of these validation processes is address validation. Address validation contributes to data quality and can often also provide correction capability so that addresses that were supplied incorrectly can be fixed to represent valid addresses.
Address correction can be a tricky proposition and may have implications for your consumers. If the address being captured in MDM is tied to a legal document such as a sale or insurance policy, correction may be unwanted as it changes the legal document - which cannot be arbitrarily changed. Address correction is also very susceptible to error if the address information was incomplete and missing key attributes such as country codes or postal/zip codes. The quality of the source data and the validation data can be an issue during this process.
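The two cautions above can be expressed as a simple guard that runs before any suggested correction is applied. This is a sketch under stated assumptions: the `Address` structure, the `legally_bound` flag, and the `apply_correction` function are hypothetical names, and the suggested address stands in for whatever your validation source returns.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Address:
    line1: str
    city: str
    postal_code: Optional[str]
    country: Optional[str]
    # Hypothetical flag: the address appears on a legal document.
    legally_bound: bool = False


def apply_correction(captured: Address, suggested: Address) -> Tuple[Address, str]:
    """Return the address to store and the reason for the decision."""
    if captured.legally_bound:
        # Changing the address would change the legal document.
        return captured, "kept: tied to a legal document"
    if not captured.country or not captured.postal_code:
        # Key attributes are missing, so a correction is likely to be wrong.
        return captured, "kept: too incomplete to correct safely"
    # Accept the correction but carry over the flag from the captured record.
    return replace(suggested, legally_bound=captured.legally_bound), "corrected"
```

Returning the reason alongside the address gives consumers a way to see why a correction was or was not applied.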
Using external data for validation and quality improvement can have performance implications. The data source for the validation is usually outside of your MDM platform, so a call must be made to invoke the service. This adds validation time to each update.
Some providers may offer real time services which can be integrated with your MDM service to perform the validation and correction. The performance implications of calling an external service may also affect whether you use an in-house or external service. You may also want to consider applying the data validation and quality improvements after capturing the source data to keep MDM transaction performance high, and deal with the corrections afterwards.
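The capture-first, correct-afterwards option can be sketched as a small backlog. This is an illustration only: the `DeferredValidator` class is hypothetical, the in-memory queue stands in for whatever durable queue or batch job you would actually use, and the `validate` callable stands in for the call to the external validation service.

```python
from collections import deque
from typing import Callable, Deque, Dict


class DeferredValidator:
    """Store records as supplied, then validate them outside the transaction."""

    def __init__(self, validate: Callable[[dict], dict]):
        self._validate = validate          # stand-in for the external service call
        self._pending: Deque[str] = deque()
        self.records: Dict[str, dict] = {}

    def capture(self, record_id: str, record: dict) -> None:
        # No external call here, so the MDM transaction stays fast.
        self.records[record_id] = dict(record, validated=False)
        self._pending.append(record_id)

    def run_backlog(self) -> int:
        """Validate everything captured so far; return how many were processed."""
        processed = 0
        while self._pending:
            record_id = self._pending.popleft()
            corrected = self._validate(self.records[record_id])
            self.records[record_id] = dict(corrected, validated=True)
            processed += 1
        return processed
```

The trade-off is that consumers may briefly see unvalidated data, which is why each record carries a `validated` indicator in this sketch.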
External data sources can be a valuable tool to keep your MDM data clean and fresh. Validation routines can vastly improve the quality of your data. For example, checks against government registries can confirm that organizations exist, and post office change-of-address files can keep addresses current.
External Data as Reference Data
External data does not always have to be integrated to be useful for an MDM implementation. You can decide to use the external data as reference data. When implementing external data as reference data, your MDM object carries an identifier, or key, from the external data. The identifier establishes the link between your MDM object (such as a party) and the additional information stored outside of MDM. This is the same concept you would use when tracking the source system key on your party record.
If the reference data approach is used, then queries for party data that incorporate attributes from the external data must merge results from the two sources. The usual place for this integration is the service bus. If you do not use such an integration platform, it may be possible to extend your MDM platform to do the lookup in the external data source for you. If possible, provide a way to indicate when the additional attributes are actually required so you can avoid the extra overhead when they are not. You may use a separate service to get the two sets of information, or some kind of indicator to show that both sources need to be queried.
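The indicator approach can be sketched as follows. The names here are hypothetical: `get_party`, the `external_key` attribute, and the `include_external` flag are illustrative, and `lookup_external` stands in for whatever call reaches your external reference store.

```python
from typing import Callable, Mapping, Optional


def get_party(party_id: str,
              mdm: Mapping[str, dict],
              lookup_external: Callable[[str], Optional[dict]],
              include_external: bool = False) -> dict:
    """Return the MDM party, merged with reference data only when requested."""
    party = dict(mdm[party_id])
    if not include_external:
        # Skip the external lookup entirely to avoid the extra overhead.
        return party
    key = party.get("external_key")
    external = lookup_external(key) if key else None
    # Keep reference attributes under their own name so their origin stays clear.
    party["external"] = external or {}
    return party
```

Keeping the reference attributes in a separate `external` section, rather than flattening them into the party, makes it obvious to consumers which values are mastered in MDM and which are not.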
Often when external data is integrated into MDM as reference data, there is a desire to provide services targeted only at the reference data, as it may have significantly more attributes available than are required in your MDM platform. Design these services so that your MDM platform requirements are also met by the same set of services to avoid duplicating effort. Both DB2 and Oracle provide tools to automate the generation of web services based on database queries, which gives you a simple way to expose your reference data as a service.
Using External Data as Test Data
Every implementation of MDM requires test data for testing of database performance, application features, and integration processes. Some environments have very strict rules on the use of data and often for MDM, valid data is required for testing, as standardization and validation routines depend on real data to perform properly. External data sources may be a useful tool for generating test data, because:
- It provides the volume of information required for performance testing and sizing estimates
- It may be publicly available information so there are no security issues with it being seen by staff and consultants
- It is well formatted so it is easy to transform into the formats required for loading your MDM platform
You should check your licensing of the data before using it for long term testing purposes as you want to be able to retain test data even if you decide to stop your subscription for the external data.
Trusted Sources
One of the new issues you may face when introducing external data is the concept of trusted sources. Trusted sources are data sources that are trusted to have higher quality data than other data sources that supply information. Since the external data is of higher quality you may not want values set from the external data to be replaced by values from an internal system.
A source trust framework may be required to prevent a less trusted source from overwriting information that was set by a more trusted source. These update rules can be complicated and are often implemented with sophisticated tools such as rules engines. Source trust is often implemented in the integration platform, as it can be an enterprise problem. Source trust can also be implemented in your MDM platform for a localized solution.
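At its simplest, a source trust rule compares the rank of the incoming source with the rank of the source that last set the attribute. The sketch below assumes a hypothetical ranking table and source names; real implementations often vary trust by attribute or by age of the data, which is where rules engines come in.

```python
from typing import Dict

# Hypothetical ranking: a higher number means a more trusted source.
TRUST_RANK: Dict[str, int] = {"external_registry": 3, "crm": 2, "web_form": 1}


def apply_update(record: dict, attribute: str, value: object, source: str,
                 trust: Dict[str, int] = TRUST_RANK) -> bool:
    """Apply the update unless a more trusted source already set the attribute."""
    current = record.get(attribute)
    if current is not None and trust[source] < trust[current["source"]]:
        return False  # rejected: the incoming source is less trusted
    # Record the source with the value so later updates can be compared.
    record[attribute] = {"value": value, "source": source}
    return True
```

In this sketch a source of equal rank may overwrite the value, so the external registry can still refresh its own data. The key design point is that the source is stored alongside each value; without it, no trust comparison is possible.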
An alternative approach to this type of data protection is to store the data from the trusted source in attributes dedicated to that source. A set of custom services with additional capability is then used to maintain these attributes, keeping them out of reach of updates from normal consumers.
Integrating External Data
There are many choices you will face when trying to integrate your external data.
Sometimes licensing can be one of the drivers affecting your decisions. Your MDM platform may be licensed according to the number of parties to be stored in the system. Your external data source may provide data for every organization in the country but you don’t do business with every organization, so loading them all into your MDM platform will just inflate the number of parties and thus needlessly affect your licensing.
Some data providers only supply complete extracts of the data, while others provide a full extract and delta changes. When you are planning your use of the external data, you may need to add your own change detection process, which requires matching the previous file against the current file to detect adds, updates, and deletes.
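When only full extracts are available, change detection comes down to comparing two keyed snapshots. The following is a minimal sketch that assumes the extracts are CSV with a stable key column and small enough to hold in memory. The function names and the `id` column in the usage note are hypothetical; very large files would typically be sorted and compared in a streaming fashion or in a database instead.

```python
import csv
import io
from typing import Dict, List


def load_extract(csv_text: str, key_field: str) -> Dict[str, dict]:
    """Index the rows of one extract by the provider's key."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}


def detect_changes(previous: Dict[str, dict],
                   current: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare two extracts and list the keys that were added, updated, or deleted."""
    prev_keys, curr_keys = set(previous), set(current)
    return {
        "adds": sorted(curr_keys - prev_keys),
        "updates": sorted(k for k in prev_keys & curr_keys
                          if previous[k] != current[k]),
        "deletes": sorted(prev_keys - curr_keys),
    }
```

For example, loading last month's and this month's files with `load_extract(text, "id")` and passing both to `detect_changes` yields three lists of keys, which can then drive targeted updates rather than a full reload.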
Some data providers offer file based data as well as online, real time services for you to integrate with. You must consider:
- The amount of information you require versus the base of data that would need to be managed
- The performance and cost implications of calling outside for the service versus an in-house service call
- The cost of licensing the full file base of data compared to the cost of accessing only the data you really need in real time
When using external data to enrich your MDM data, you need to be able to update your MDM platform based on changes reflected in the external data. Depending on the size of both your own MDM data and the external data, sophisticated processes may be required to apply the updates to your MDM platform. If large amounts of data need to be applied daily, then you may need to maintain pointers to your MDM data in the external data so you can efficiently process and avoid needless MDM lookups to see if anything is impacted. In this scenario you may want to consider having the external data be a consumer of MDM change notifications to keep its pointers up-to-date so you can recognize when a change in the external data affects your MDM data.
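The pointer idea can be sketched as an index, kept on the external data side, from external keys to the MDM records that reference them. The class and method names below are hypothetical, and the shape of the MDM change notification (MDM identifier, old key, new key) is an assumption made for the illustration.

```python
from typing import Dict, Iterable, List, Optional, Set


class ExternalPointerIndex:
    """Maps each external key to the MDM records that point at it."""

    def __init__(self) -> None:
        self._by_external: Dict[str, Set[str]] = {}

    def on_mdm_change(self, mdm_id: str,
                      old_key: Optional[str], new_key: Optional[str]) -> None:
        """Consume an MDM change notification to keep the pointers current."""
        if old_key is not None:
            ids = self._by_external.get(old_key, set())
            ids.discard(mdm_id)
            if not ids:
                self._by_external.pop(old_key, None)
        if new_key is not None:
            self._by_external.setdefault(new_key, set()).add(mdm_id)

    def affected(self, changed_external_keys: Iterable[str]) -> Dict[str, List[str]]:
        """Return only the changed external keys that some MDM record uses."""
        return {key: sorted(self._by_external[key])
                for key in changed_external_keys
                if key in self._by_external}
```

With such an index, a daily delta of external changes can be filtered down to the records that matter before any MDM service is called, which avoids a lookup for every changed external row.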
While we have talked mostly about purchased external data, data sourced from a business partner is subject to the same problems and considerations discussed here. External data can be a very useful tool, especially when it comes to validation and enrichment. Be forewarned that there is work involved in integrating the external data, but the higher quality and greater breadth and depth of data that result can be of significant business value.