A key step when creating an enterprise data warehouse is to build pipelines, to ingest and transform your source data, which are robust, reliable, flexible and fast.
There are two ways to solve this challenge: Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT). Each has its advantages and drawbacks.
ETL (Extract, Transform, Load)
ETL is the traditional approach to data ingestion, adopted when high storage and compute costs prohibited loading raw data directly into the data store. With ETL, data is cleansed and aggregated outside the warehouse, reducing the amount of data to be loaded and the amount of data that needs to be crunched when running queries against the database. While ETL lowers your storage costs, it:
- sometimes results in the loss of data that might prove valuable at a later date
- makes it impossible to apply a different transformation process later if that proves necessary
- requires external compute resources to be made perform the transformation
- often increases the time taken to deliver insights to users, because data has to be staged to the external system for processing, passed through each stage of the transformation and then loaded into the data warehouse before it’s available for analysis
ELT (Extract, Load, Transform)
ELT is a more modern data ingestion strategy where data is ingested and stored in raw form in the data store before being transformed to make it ready for consumption. ELT has been made feasible and affordable thanks to cloud-based hyper-scale databases that are not only able to store vast quantities of data without breaking your budget but which also allow transformations to take place within the database itself, rather than requiring data to be transferred an external system during this step. With ELT, you can:
- ingest all the raw data you can get your hands on and then figure out what insights you want and can get out of it
- apply different transformations to the data at a later date if your assumptions change or you realise there was value in data thrown away during the original transformation
- provide immediate access to data because there’s no need to run the entire transformation process across all the data before loading it; you only need to process the subset of data required for a particular analysis
- more easily ingest and work with unstructured and semi-structured data
- maintain an archive of data which may not have value now but could provide insights over the longer term. For example, you can store operational data which is relatively transient, such as smartphone activity streams or IoT feeds, that might give you insights into user trends when reviewed over months or even years
Even the cost advantage of ETL may no longer hold true, thanks to the latest generation of cloud data warehouse platforms such as Google BigQuery. Because they’re serverless, there’s no standing charge and you just pay for what you consume in terms of storage and compute resources each month — with access to processing power that rivals research-tier supercomputers and can return results from an uncached SQL query over a 15 billion row table in seconds.
In our experience, ELT is almost always now the best choice for organisations that are implementing modern cloud-based data analytics platforms and looking to exploit advanced analytics techniques like machine learning. To find out more, why not read about the work we’ve done with Play Sports Network to build a collaborative recommendation engine based on app activity data using this approach, or come and talk to the experts in our Data Analytics team.