Architecting large-scale data integration projects requires framework templates that lay down the blueprint for how raw data will be converted into a form ready for business use. With so many concerns involved, including data sourcing, storage, transformation, environment design and security, companies must develop consistent integration methodologies that can guide implementation projects to successful delivery within the business constraints of time, skills availability and budget.

In this article, we put forth a high-level data integration reference architecture that can help companies develop robust data integration methodologies for repeated use.

Why adopt a reference architecture?

All data integration projects carry commercial risk: without a structured delivery approach, the investment is unlikely to deliver the intended business outcome. The principal advantage of a reference architecture is that it breaks a project down into distinct sets of activities that can be planned and tracked in order to mitigate implementation risk. With such an architecture in place, template project plans can be constructed, requirement-gathering templates prepared, inputs and outputs for each step defined, roles and responsibilities clarified, risks and associated mitigation approaches identified, governance frameworks put in place, and so on. By providing this level of implementation rigor, reference architectures allow companies not only to plan tactical project deliveries but also to develop strategic people, process and technology capabilities as part of a comprehensive technology transformation roadmap.

Data integration reference architecture building blocks

At a conceptual level, our architecture consists of seven building blocks, each of which provides a generic, high-level set of related implementation activities covering business analysis, technology design and project management. The idea is that, by adding specific business context, these generic activities can be turned into a concrete implementation plan for any data integration project.

Note that the architecture is platform-neutral: it can be applied consistently with any ETL tool. The focus is on developing a structured process for building out the project scope and implementation plan rather than on prescribing tool-specific implementation guidelines.

The seven building blocks are as follows.

Source data capture

These processes encapsulate the activities that capture data, transactional or bulk, structured or unstructured, from various sources and direct it into the initial staging landing zone. Cataloguing the source systems, the interfaces within each system, the technical dependencies of each interface, the technical infrastructure currently in place, and the roles and responsibilities for managing these interfaces are just some of the activities that make up this process block.

Preparing a plan for implementing this block gives consultants and project managers both a strategic and a tactical overview of the scope, risks and dependencies of the entire project.
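
To make this concrete, the sketch below (in Python) shows one possible way of recording such a catalogue as a lightweight interface register. It is a minimal illustration under assumed names and fields, not part of the reference architecture itself; a real register would carry many more attributes such as volumes, formats, SLAs and infrastructure details.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceInterface:
    """One inbound interface from a source system into the landing zone."""
    name: str                 # e.g. "orders_daily_extract" (hypothetical)
    source_system: str        # owning system, e.g. "ERP"
    data_shape: str           # "transactional" or "bulk"
    structured: bool          # structured vs. unstructured payload
    owner: str                # team or role responsible for the feed
    depends_on: List[str] = field(default_factory=list)  # upstream interfaces

# A tiny, purely illustrative catalogue of interfaces.
catalogue = [
    SourceInterface("customers_full_extract", "CRM", "bulk", True, "crm-team"),
    SourceInterface("orders_daily_extract", "ERP", "transactional", True,
                    "erp-team", depends_on=["customers_full_extract"]),
]

for iface in catalogue:
    print(f"{iface.name}: from {iface.source_system}, depends on {iface.depends_on}")
```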

Raw data staging

The initial staging zone is an optional “landing zone” where source data is temporarily persisted as a result of the initial source data capture. Planning this building block involves several activities, including infrastructure design of the staging area, data model design (if any), security planning, change data capture, and deciding what to do with the raw data once it has been processed downstream.
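
As an illustration of how such a landing zone might operate, the Python sketch below persists an incoming file partitioned by interface and load date and records minimal capture metadata alongside it. The directory layout, naming convention and metadata fields are assumptions made for the example rather than prescriptions of the architecture.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("/data/staging/raw")   # assumed landing-zone location

def land_raw_file(source_file: str, interface: str) -> Path:
    """Copy an incoming file into the landing zone, partitioned by interface
    and load date, and record minimal capture metadata next to it."""
    load_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = LANDING_ZONE / interface / f"load_date={load_date}"
    target_dir.mkdir(parents=True, exist_ok=True)

    target_path = target_dir / Path(source_file).name
    shutil.copy2(source_file, target_path)

    # Capture metadata supports auditing, reprocessing and retention decisions.
    metadata = {
        "interface": interface,
        "landed_at": datetime.now(timezone.utc).isoformat(),
        "original_path": source_file,
    }
    (target_dir / (target_path.name + ".meta.json")).write_text(json.dumps(metadata, indent=2))
    return target_path
```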

Data quality processes

These processes qualify and cleanse the raw data based upon various business and technical rules. Planning this block involves multiple activities, including but not limited to capturing and cataloguing the qualification criteria, implementing them at a technical level, handling rejected data files, generating reject reports, discussing rejected data with key stakeholders, and incorporating changes to the qualification algorithms.
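
The Python sketch below illustrates one way such qualification rules might be applied: each rule is a simple predicate, rows that fail any rule are routed to a reject set tagged with the rules they failed, and that set can then feed a reject report. The rules and field names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Qualification rules: each maps a row to True (keep) or False (reject).
# These example rules are invented; real criteria come from the catalogued
# business and technical requirements.
RULES: Dict[str, Callable[[dict], bool]] = {
    "customer_id_present": lambda row: bool(row.get("customer_id")),
    "amount_non_negative": lambda row: row.get("amount", 0) >= 0,
}

def qualify(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split raw rows into clean rows and rejects, tagging each reject
    with the rules it failed so a reject report can be generated."""
    clean, rejects = [], []
    for row in rows:
        failed = [name for name, rule in RULES.items() if not rule(row)]
        if failed:
            rejects.append({**row, "_failed_rules": failed})
        else:
            clean.append(row)
    return clean, rejects

clean, rejects = qualify([
    {"customer_id": "C1", "amount": 120.0},
    {"customer_id": "",   "amount": -5.0},
])
print(f"{len(clean)} clean row(s), {len(rejects)} rejected: {rejects}")
```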

Clean data staging

As a best practice, clean data is persisted in intermediate storage in its raw form before any business transformations are applied. This allows diverse data sets to be built out for different consumption needs without losing the original granularity. Planning this building block needs to consider a number of activities, including infrastructure design of the clean staging area, data model design, access control, and recycling of the data once it has been absorbed into the target environment.
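
The short Python sketch below illustrates why keeping the clean data at its original grain matters: from one persisted clean set, several differently aggregated consumption views can be derived later without going back to the source. The records and dimensions are purely illustrative.

```python
from collections import defaultdict

# Clean rows held at their original, row-level grain (illustrative records only).
clean_rows = [
    {"customer_id": "C1", "region": "EU", "order_date": "2024-01-05", "amount": 120.0},
    {"customer_id": "C2", "region": "US", "order_date": "2024-01-05", "amount": 80.0},
    {"customer_id": "C1", "region": "EU", "order_date": "2024-01-06", "amount": 40.0},
]

def totals_by(rows, key):
    """Aggregate the same granular clean data along any chosen dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# Two different consumption views derived from the same persisted clean set,
# without losing the underlying granularity.
print(totals_by(clean_rows, "region"))       # e.g. a finance view by region
print(totals_by(clean_rows, "customer_id"))  # e.g. a CRM view by customer
```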

Transform processes

These processes comprise the activities that process the staged data and make it ready for final business consumption. From a planning perspective, this involves identifying the business rules for the data transformations in each interface. It is common practice for project teams to develop interface registers and source-to-target mapping documents that clearly outline both the business context and the data transformation requirements. Depending upon the scope of the project, these activities can form a formidable chunk of the overall project effort and must be carefully planned and coordinated.
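
To show how such mapping documents translate into executable transformations, the Python sketch below applies a miniature source-to-target mapping to a single row. The column names and rules are invented for illustration and would normally come from the project's mapping documents.

```python
from typing import Callable, Dict

# A miniature source-to-target mapping in the spirit of the mapping documents
# described above; every name and rule here is hypothetical.
MAPPING: Dict[str, Dict] = {
    "cust_name": {"target": "customer_name", "rule": str.strip},
    "order_amt": {"target": "order_amount",  "rule": float},
    "ccy":       {"target": "currency",      "rule": str.upper},
}

def transform(row: dict) -> dict:
    """Apply the mapping to one clean-staged row to produce a target row."""
    out = {}
    for source_col, spec in MAPPING.items():
        rule: Callable = spec["rule"]
        out[spec["target"]] = rule(row[source_col])
    return out

print(transform({"cust_name": "  Acme Ltd ", "order_amt": "99.50", "ccy": "eur"}))
# -> {'customer_name': 'Acme Ltd', 'order_amount': 99.5, 'currency': 'EUR'}
```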

Load-ready data staging

This optional environment is used for persisting target-specific, load-ready files that require processing in batch mode. It is commonly deployed where file processing by the ETL tool runs out of sync with the consumption routine of the target system, in order to ensure operational efficiency.
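
The Python sketch below shows one possible shape for such a staging step: the ETL side writes a target-specific, timestamped batch file on its own schedule, and the target system picks it up later in its own batch window. The location and naming convention are assumptions made for the example.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

LOAD_READY_DIR = Path("/data/staging/load_ready")   # assumed location

def write_load_ready_file(rows, target_system: str) -> Path:
    """Persist a target-specific, load-ready batch file. The ETL side produces
    these on its own schedule; the target system consumes them later in its
    own batch window."""
    LOAD_READY_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = LOAD_READY_DIR / f"{target_system}_{stamp}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```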

Target data loads

These processes incorporate the activities for loading data into the target system(s). They include identifying interface scheduling and dependencies (e.g. a data load may not run until a prior load has completed), re-running failed loads, handling errors in bundled uploads, and so on.
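
As a rough illustration of these scheduling and re-run concerns, the Python sketch below runs loads in dependency order, retries failed loads a limited number of times, and marks loads that are blocked by failed prerequisites. It is a simplified stand-in for what a scheduler or orchestration tool would normally provide.

```python
from typing import Callable, Dict, List

def run_loads(loads: Dict[str, Callable[[], None]],
              depends_on: Dict[str, List[str]],
              max_retries: int = 2) -> Dict[str, str]:
    """Run target loads so that a load starts only after its prerequisite
    loads have succeeded; retry failures and mark blocked loads."""
    status: Dict[str, str] = {}
    pending = list(loads)
    while pending:
        progressed = False
        for name in list(pending):
            if any(status.get(dep) != "succeeded" for dep in depends_on.get(name, [])):
                continue  # a prerequisite load has not (yet) succeeded
            for attempt in range(1, max_retries + 1):
                try:
                    loads[name]()
                    status[name] = "succeeded"
                    break
                except Exception as exc:
                    status[name] = "failed"
                    print(f"{name}: attempt {attempt} failed ({exc})")
            pending.remove(name)
            progressed = True
        if not progressed:
            # Remaining loads are blocked by failed or circular dependencies.
            for name in pending:
                status[name] = "blocked"
            break
    return status

# Hypothetical loads: the customer load must complete before the order load runs.
result = run_loads(
    loads={"load_customers": lambda: None, "load_orders": lambda: None},
    depends_on={"load_orders": ["load_customers"]},
)
print(result)   # {'load_customers': 'succeeded', 'load_orders': 'succeeded'}
```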

Summary

Data integration is a complex undertaking, and for successful delivery these projects must be approached with meticulous planning and method. Reference architectures allow companies to develop business-specific implementation plans by tapping into a ‘bank’ of generic, best-practice activities that can be quickly customized to the current context, goals, constraints and priorities. Doing so prevents ad hoc, error-prone implementations and thereby greatly reduces the risk of failed technology investments.

About the Article

This article provides a conceptual framework in which logically related data integration activities are organized as ‘building blocks’ within a reference architecture. By recognizing these activities independently of any specific project, implementation teams can develop best-practice templates and plans that can be quickly adapted to a specific project context.

Article purpose

To demonstrate tool-agnostic, best-practice expertise in data integration.

Target audience

Enterprise Architects, Solution Architects and Project Managers handling data integration projects.