Segment.io is a popular visitor analytics tool that lets companies extract rich information from their clickstream data using SQL. It does this by downloading the data into well-organized tables in any of the supported data warehouses (e.g. BigQuery, Redshift, and Postgres). While this makes the underlying data easy to access, converting that data into meaningful, business-relevant information remains anything but straightforward.

In this post, we discuss two alternative technology approaches for transforming raw data into information within the context of a Segment.io deployment. We make the case for deploying a strategic ETL capability that provides scale and flexibility as measurement requirements evolve beyond simple count-based metrics into business outcomes.

Two approaches to data transformation

Customers can use direct SQL or a dedicated ETL tool to implement data transformations.

SQL – This approach involves writing SQL queries that fire on update/insert of the raw data tables and populate views with their output. The queries run within the database engine rather than inside any formal ETL tool, and the resulting views can then be plugged into a BI engine for reporting.
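
As a minimal sketch of the direct approach, assuming a Postgres warehouse and a hypothetical web_form_opt_in event table, a reporting view might look like this:

    -- Hypothetical reporting view over a raw Segment.io event table;
    -- table and column names are illustrative, not Segment defaults.
    CREATE OR REPLACE VIEW monthly_opt_ins AS
    SELECT
        DATE_TRUNC('month', received_at) AS month,
        opt_in_form_name,
        funnel_name,
        COUNT(*) AS total_opt_ins
    FROM web_form_opt_in
    GROUP BY 1, 2, 3;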

Taking the example of Segment.io, a data warehouse connected to it organizes visitor-level data in event tables, with one table per event and with columns that depend on the event parameters. For example, an event capturing opt-ins from a web form could have parameters including opt-in form name, funnel name, and referring campaign. When these parameters are sent to Segment.io using JavaScript, a table is created for this conversion event with columns to hold the values for opt-in form name, funnel name, and referring campaign. The link field between such event tables is the anonymous_id field, which identifies each visitor uniquely even if they have never logged into the website and have no CRM footprint. If we now wanted to find the net customer revenue in any given month, we could write a trigger on the account cancellation table which, when fired, would execute the SQL required to join the paid acquisitions, recurring billing, and account cancellations tables, returning net revenue and net active customer count like so:

Month | Net Customer Count | Net Revenue
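
A minimal sketch of the query behind that output, assuming hypothetical paid_acquisition, recurring_billing, and account_cancellation event tables (Postgres/Redshift syntax):

    -- Stitch the three per-event tables together, then roll up by month.
    -- All table and column names are assumptions for illustration.
    WITH activity AS (
        SELECT received_at,  1 AS customer_delta,  amount AS revenue_delta
        FROM paid_acquisition
        UNION ALL
        SELECT received_at,  0 AS customer_delta,  amount AS revenue_delta
        FROM recurring_billing
        UNION ALL
        SELECT received_at, -1 AS customer_delta, -refund_amount AS revenue_delta
        FROM account_cancellation
    )
    SELECT
        DATE_TRUNC('month', received_at) AS month,
        SUM(customer_delta)              AS net_customer_count,
        SUM(revenue_delta)               AS net_revenue
    FROM activity
    GROUP BY 1
    ORDER BY 1;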

Using a dedicated ETL tool – Under this approach, we would still use SQL/triggers to identify updated rows in the account cancellations table (a minimal change-detection sketch follows the list below); it is just that the joins would be implemented not within the database engine but in a dedicated data integration tool. This subtle technical detail apart, there are a number of other reasons why the ETL tool approach scales better:

  • Handling complex business rules – What if a customer cancels the trial but is persuaded to accept a discount and continue their subscription? Or if the net revenue were to be further divided into revenue from new signups vs. recurring revenue? Sure, we could write another complex SQL query that joins multiple tables, but as business rules get more complex, this hand-coded approach quickly becomes untenable.
  • Implementing data transformations in SQL – Using SQL queries to manipulate data relies on the big assumption that the underlying data is clean and already in the right format. This, however, is rarely the case. Take the example of a Campaign name column where the values are of the format <Campaign key>-<Brand>-<Geo>. For example, the value MMG-OleoPar-USA represents data for campaign MMG, brand OleoPar, and region USA. Parsing such values with SQL queries (see the parsing sketch after this list) on even a mid-sized database is a significant drain on hardware resources, if it is feasible at all within cost and speed constraints.
  • Separating out integration and transformation implementation – In any deployment of significant size, there is likely already some kind of integration tool in place that handles extraction of data from third-party sources. Scattering integration and transformation capabilities across multiple systems/technologies does not make architectural sense.
  • Handling non-functional requirements – It is common practice to implement triggered notifications such as email, SMS, and SNMP alerts in case of errors. Implementing such logic inside plain SQL is not possible.
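
To make the change-detection point concrete, here is a minimal sketch of the extraction query, assuming the ETL job maintains a watermark of its last successful run (the table name and bind parameter are illustrative):

    -- Pull only the rows that changed since the last ETL run; joins and
    -- business rules are then applied inside the ETL tool, not here.
    -- :last_run_timestamp is a bind parameter managed by the ETL job.
    SELECT *
    FROM account_cancellation
    WHERE received_at > :last_run_timestamp;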
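
And to make the parsing point concrete, here is what splitting the Campaign name column might look like using the split_part function available in Postgres and Redshift (the table name is an assumption). Manageable for one column, this kind of string surgery multiplies quickly across many columns and millions of rows:

    -- Split 'MMG-OleoPar-USA' style values into their components.
    SELECT
        campaign_name,
        SPLIT_PART(campaign_name, '-', 1) AS campaign_key,  -- e.g. MMG
        SPLIT_PART(campaign_name, '-', 2) AS brand,         -- e.g. OleoPar
        SPLIT_PART(campaign_name, '-', 3) AS geo            -- e.g. USA
    FROM campaign_events;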

Planning Segment.io deployments

The direct SQL coding approach is easier to implement and can usually be deployed with minimal technology know-how. It does not require investment in dedicated ETL software (or the associated hardware and technical skills) and is not a major activity from a project management point of view.

On the other hand, depending upon the complexity of the final business metrics required, the cost, time, and resource requirements of an ETL approach can range from moderate to significant, potentially forming a major component of the overall project plan. Segment.io recently started allowing its customers to pipe data into BI tools such as chart.io. However, the need for ETL remains intact, given that it is almost impossible to implement complex data manipulations inside any tool that is not built from the ground up for ETL.

The ideal solution is to build a Marketing Operational Data Store: a set of intermediate tables fed with transformed, reporting-ready data from the Segment data warehouse. The BI tool can then simply plug into these tables, significantly reducing the need for complex SQL scripting or for the formula/calculation capabilities native to the BI tool. By making adequate provisions for ETL hardware, hosting, and development activities, customers can greatly mitigate the risk of failed analytics/business intelligence deployments.
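
As a sketch of what one such intermediate table might look like (the name, columns, and grain here are assumptions for illustration):

    -- Reporting-ready table populated by the ETL job; the BI tool reads
    -- this directly instead of re-deriving metrics from raw event tables.
    CREATE TABLE ods_monthly_revenue (
        month              DATE          NOT NULL,
        campaign_key       VARCHAR(32)   NOT NULL,
        brand              VARCHAR(64)   NOT NULL,
        geo                VARCHAR(8)    NOT NULL,
        net_customer_count INTEGER       NOT NULL,
        net_revenue        NUMERIC(14,2) NOT NULL,
        loaded_at          TIMESTAMP     DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (month, campaign_key, brand, geo)
    );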

Article purpose

To demonstrate technical expertise in data integration for digital analytics. More specifically, to project expertise in collecting visitor-level clickstream data for setting up data management platforms in marketing.

About the Client

One of the UK’s leading analytics consultancies, focused specifically on providing Enterprise IT solutions for Google Analytics UA/360, the Adobe suite, and digital business intelligence.