Best practices in enterprise data integration: A reference architecture

Introduction

Data integration involves more than connecting a source to a target with a boilerplate solution (e.g. a data pipeline). To achieve real business outcomes from data integration initiatives, as opposed to merely connecting A to B, companies need a strategic approach that identifies the conceptual building blocks and then defines a set of best practices for each one.

In this article, we lay down a conceptual blueprint for the top-level building blocks of a typical data integration project. This article is the first in a series dedicated to data integration; the rest of the series will focus on the best practices and sample processes for each of these building blocks.

The basics: Why do we need a reference architecture?

Think of how a house is constructed. The architecture has multiple components, including masonry and brickwork, plumbing and pipework, and electrical installation, to name a few. All of these components are covered by building regulations and best practices, which allows builders to develop common solutions that are compliant both effectively and efficiently. Whether you build a ranch, a condo, or a multi-storey apartment block, each has the same set of requirements in each of these building blocks.

The specific design, however, is entirely up to you. For example, you might want RCD-protected circuits, chrome-plated pipework, concrete flooring, and waterproof plasterboard throughout, but the assembly of these will need to follow a common approach, or ‘reference architecture’.

Data integration architecture is no different. The ‘reference architecture’ in this case is a set of common best practices that can be adapted to specific business contexts, but the underlying framework does not change, regardless of vertical or industry.

Introducing the data integration reference architecture

In this context, a reference architecture for data integration can be thought of as consisting of the following ‘components’ or ‘building blocks’: source data extraction, raw data staging, data profiling and quality assessment, clean data staging, data transformation, a publish-ready zone, and data publish.

The sections below outline what each of these components entails.

1-Source data extraction

This component defines the best practices for extracting data from source systems. Several scenarios arise for which companies must define common processes and guidelines. These include:

  • Change data capture: Capturing incremental data since the last change. A best practice here is to maintain the last identifier/timestamp of each batch and then only request data generated after that watermark (see the sketch after this list).
  • Refresh and snapshot loads: In many cases, the whole target database needs to be refreshed, or there is a need to process data only for a certain time range. Planning these conditionalities in advance will go a long way in preventing individual project delays.
  • Handling batched extracts: In many systems, data is supplied in batches. For example, the historical data for a large marketing campaign would typically be supplied in individual batches and the challenge is to combine these into a single load. A best practice here is to ensure that a unique campaign identifier along with a unique sequence id is present in all sub-batches.
  • Source data coming from several channels: API pulls, webhooks, batch loads, pub/sub messaging, FTP downloads, email extracts, etc. are just a few of the many ways in which data gets extracted from source systems. A distinct set of processes should be defined for each channel, and best practices need to be carefully thought out for each scenario, such as a common authentication server for webhooks or a common DNS entry for all pull requests to avoid having to register multiple endpoints.
  • Throttling limits: Many systems, such as the Google APIs, implement rate limits and throttling, which can cause severe data quality issues if not planned for properly.
  • Extraction technology: Data extraction starts with defining a standard tool to extract data. This may well be an ETL tool (e.g. Talend, Informatica, etc.), but it should not vary from project to project. Licensing, contract terms, maintenance contracts, etc. must all be negotiated centrally.
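
As an illustration of the change data capture and throttling points above, the following is a minimal sketch of a watermark-based incremental pull with retry and exponential backoff, written in Python. The endpoint URL, query parameter, response fields, and the load/save watermark helpers are hypothetical placeholders rather than any specific product's API.

    import time
    import requests

    API_URL = "https://example.com/api/records"   # hypothetical source endpoint
    WATERMARK_FILE = "last_watermark.txt"         # stores the last processed timestamp

    def load_watermark(default="1970-01-01T00:00:00Z"):
        try:
            with open(WATERMARK_FILE) as f:
                return f.read().strip()
        except FileNotFoundError:
            return default

    def save_watermark(value):
        with open(WATERMARK_FILE, "w") as f:
            f.write(value)

    def fetch_incremental(max_retries=5):
        """Pull only records created after the stored watermark, backing off when throttled."""
        watermark = load_watermark()
        for attempt in range(max_retries):
            resp = requests.get(API_URL, params={"updated_after": watermark})
            if resp.status_code == 429:        # rate limited: wait and retry with exponential backoff
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            records = resp.json()
            if records:                        # persist the newest timestamp as the new watermark
                save_watermark(max(r["updated_at"] for r in records))
            return records
        raise RuntimeError("Extraction abandoned after repeated throttling")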

Breaking down the entire data integration lifecycle into smaller components and then developing global best practices for each helps to standardize the whole process flow. With these best practices defined, individual projects can focus on business outcomes and efficiency rather than spending time on non-functional groundwork.

2-Raw data staging

With data arriving from multiple source systems, at varying intervals, and over different channels (API, pub/sub, etc.), where will raw data be stored before downstream processing? Will it be a NoSQL database, an object store such as Amazon S3, or a warehouse such as Google BigQuery? Every client will have a unique context for these decisions, but the fact that a raw staging area needs to be architected is a given and should be part of every project solution architecture. The best practices here relate to specific guidelines for working with each platform, whether a NoSQL database, Google BigQuery, or Amazon S3.
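
As a sketch of one such guideline, raw payloads could be landed in an object store under a partitioned key convention so that downstream jobs can locate a given source, date, and batch. The bucket name and key layout below are assumptions for illustration, using the boto3 client for Amazon S3.

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    RAW_BUCKET = "acme-raw-staging"   # hypothetical bucket name

    def land_raw_batch(source_system, payload, batch_id):
        """Write one raw batch to the staging bucket under a source/date/batch partitioned key."""
        now = datetime.now(timezone.utc)
        key = f"{source_system}/dt={now:%Y-%m-%d}/batch={batch_id}.json"
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
            ContentType="application/json",
        )
        return key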

3-Data profiling and quality assessment

This is a universal headache in almost all data integration projects and requires identifying and addressing multiple considerations:

  • Undocumented source formats: It is not uncommon for source systems to have outdated metadata documentation. Key issues to consider here include how to handle data types and formats that do not match the documentation, how to handle semantically incorrect data elements, and how to ingest unstructured data.
  • Inconsistent data: Case sensitivity, spelling variations when describing the same data element, and fields containing values that the existing documentation says they should not, are all examples of inconsistent data for which best practices need to be defined.
  • Outliers and missing values: This is hugely important in data mining projects. How do we handle missing values? Do we substitute values or ignore the affected records? How do we address outliers? Do we normalize them or reject the dataset entirely? While it is impossible to prepare global guidelines that apply to every scenario, there should, at a minimum, be a catalogue of the kinds of challenges the data is likely to present (the profiling sketch after this list illustrates a few basic checks).
  • Incomplete or out-of-range data: Consider a data mining project that relies heavily on personally identifiable information being present in the underlying datasets. What if it is missing in the majority of cases? Having measures in place that trigger action before such issues surface much further into the project can save a great deal of wasted effort and ill will.
  • Lack of client subject-matter knowledge: This is a very common scenario in which the business users do not know the technical details (e.g. metadata, authorization mechanisms) while the IT staff lack the functional knowledge.
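
The following is a minimal sketch of what a first-pass profiling step for some of the checks above might look like with pandas; the column names and the 0–120 age range are illustrative assumptions, not a prescribed standard.

    import pandas as pd

    def profile(df: pd.DataFrame) -> dict:
        """Produce a simple quality report: missingness, out-of-range values, case inconsistency."""
        report = {"missing_ratio": df.isna().mean().to_dict()}   # share of missing values per column
        if "age" in df:                                          # hypothetical rule: age must be 0-120
            report["age_out_of_range"] = int((~df["age"].between(0, 120)).sum())
        if "country" in df:                                      # same value spelled with different casing
            normalized = df["country"].str.strip().str.lower()
            report["inconsistent_country_spellings"] = int(df["country"].nunique() - normalized.nunique())
        return report

    # Usage: review the report before the dataset is promoted to the clean staging area
    # issues = profile(pd.read_parquet("raw/customers.parquet"))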

4-Clean data staging

It is a best practice to stage, in a dedicated area, all data that has passed the data quality checks and is now ready to be transformed. Some key considerations include:

  • Where should this clean data be located physically?
  • What should be the storage format?
  • Should it be stored in raw format or after some basic data preparation?
  • What is the sign-off process to move data into a clean data staging area?

All these are issues that will be encountered in every project. The specific answers may well change but they all need to be addressed regardless.
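
One possible answer to the storage-format question is a partitioned, columnar layout. The sketch below assumes pandas with a Parquet engine such as pyarrow (and s3fs if the destination is an S3 URI); the path, dataset name, and partition column are placeholders.

    import pandas as pd

    CLEAN_ZONE = "s3://acme-clean-staging"   # hypothetical location; could equally be a local or HDFS path

    def promote_to_clean(validated: pd.DataFrame, dataset: str) -> None:
        """Persist quality-checked records as partitioned Parquet in the clean staging zone."""
        validated.to_parquet(
            f"{CLEAN_ZONE}/{dataset}/",
            partition_cols=["load_date"],    # assumes a 'load_date' column was added during ingestion
            index=False,
        )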

5-Data transformation

This encompasses several activities that need standardizing, including but not limited to:

  • Coding standards and naming conventions for ETL scripts
  • Development of canonical data formats
  • ETL vs. ELT design pattern
  • Pre-preparing data using ETL and feeding into reporting/analytics engine or doing this on the fly
  • Version control of transformation scripts
  • Transformation tool to use (standalone scripts, off-the-shelf tool), etc.

A detailed set of best practices and guidelines should be developed for each of the sub-components above. For example, for use cases like end-of-period reporting, it might be acceptable to prepare the data in advance. But for something like ad-hoc campaign data analysis, insights would be limited if data granularity is lost as part of that preparation.
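
As an illustration of the canonical-format and naming-convention points, the sketch below maps a source-specific record into a shared canonical shape; the field names and the src_ prefix convention are made up for this example.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class CanonicalCustomer:
        """Canonical customer record shared by all downstream consumers."""
        customer_id: str
        full_name: str
        signup_date: date

    def transform_crm_record(src: dict) -> CanonicalCustomer:
        """Map a CRM-specific extract (hypothetical field names) into the canonical format."""
        return CanonicalCustomer(
            customer_id=str(src["src_cust_no"]),
            full_name=f'{src["src_first_name"].strip()} {src["src_last_name"].strip()}',
            signup_date=date.fromisoformat(src["src_created_on"][:10]),
        )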

6-Publish ready zone

This zone is for storing data that has been processed but cannot yet be consumed by the target application, which runs on a schedule. In this case, a temporary storage area must be designed to host the data. What are the general best practices here? For example, what if the target system errors out and cannot complete a pull? What are the security guidelines for access to this data? Where is this data physically hosted (cloud or on-premises)?

7-Data publish

This building block represents a set of standardized activities for the final publishing of data into the target application(s).

The best practices here would need to be built around the various possible channels of data loading, including:

  • Direct FTP to a target server – Where will connection details be stored, where will data be picked up from, and how will it be relayed? What happens with failed record transfers?
  • Webhooks – In this scenario, third-party apps pull data by making REST calls to an endpoint specified by the provider. How will authorization work here? How will the client specify which data properties are to be selected? What about rate limiting and batched pulls?
  • RDBMS bulk loading – In this technique, data is loaded into the target using native database utilities rather than third-party loaders. What are the best practices here?
  • Messaging – In this mode, data is pushed out as events to a broker, which then handles the downstream processing. How do we implement guaranteed delivery? What if there is a need for exactly-once delivery? What if processing needs to happen in parallel (e.g. where high scalability is required) rather than in FIFO order? A minimal publish sketch follows this list.
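
For the messaging channel, the following minimal sketch publishes records as events, assuming a Kafka broker and the kafka-python client; the broker address, topic name, and record key are placeholders.

    import json

    from kafka import KafkaProducer   # assumes the kafka-python package is installed

    producer = KafkaProducer(
        bootstrap_servers="broker.internal:9092",           # placeholder broker address
        value_serializer=lambda v: json.dumps(v).encode(),  # serialize dict payloads as JSON
        acks="all",                                         # wait for full acknowledgement (guaranteed delivery)
        retries=5,                                          # retry transient send failures
    )

    def publish(records, topic="canonical-customers"):
        """Send each record to the broker, keyed by customer_id so related events stay ordered per partition."""
        for record in records:
            producer.send(topic, value=record, key=record["customer_id"].encode())
        producer.flush()   # block until all buffered events are acknowledged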

Summary

The reference architecture outlined above, and the considerations identified for each building block, lay the foundation for the detailed due diligence needed to create a standardized set of processes across the entire data integration life cycle. The result is that individual projects can leverage these global best practices instead of spending time on this kind of architectural plumbing and largely non-functional work. This not only improves delivery quality but also has a bottom-line impact on project delivery times and profitability.

Check out our entire portfolio for many more such in-depth articles
