Comparing ETL to ELT Data Pipelines
- Anis Hamadouche
- Jun 22, 2023
- 7 min read
ETL, or Extract, Transform, and Load, is a long-established methodology for preparing data for analytics, while the newer ELT pattern loads raw data first and transforms it on demand. In this article, we will explore the key differences between these two similar processes, including where the transformation takes place, flexibility, Big Data support, and time-to-insight. We will also talk about the increasing demand for access to raw data that is driving the evolution from ETL to ELT.
ETL Fundamentals
What is an ETL process?
ETL stands for Extract, Transform, and Load. ETL is an automated data pipeline engineering methodology, whereby data is acquired and prepared for subsequent use in an analytics environment, such as a data warehouse or data mart. ETL refers to the process of curating data from multiple sources, conforming it to a unified data format or structure, and then loading the transformed data into its new environment.
The Extraction process obtains or reads the data from one or more sources. The Transformation process wrangles the data into a format that is suitable for its destination and its intended use. The final Loading process takes the transformed data and loads it into its new environment, ready for visualisation, exploration, further transformation, and modelling. The curated data may also be utilised to support automation and decision-making.
What is data extraction?
To extract data is to configure access to it and read it into an application. Normally this is an automated process. Some common methods include:
Web scraping, where data is extracted from web pages using tools such as Python or R to parse the underlying HTML code.
Using APIs to programmatically connect to data sources and query them.
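As a minimal sketch of the API approach, the following Python snippet extracts records from a hypothetical REST endpoint using the requests library; the URL, parameters, and response shape are all assumptions, not a real service:

```python
import requests

# Hypothetical REST endpoint; substitute your real data source.
API_URL = "https://api.example.com/v1/readings"

def extract(since: str) -> list[dict]:
    """Pull records created after `since` (an ISO 8601 timestamp)."""
    response = requests.get(API_URL, params={"since": since}, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors
    return response.json()       # assumes the API returns a JSON array

records = extract("2023-06-01T00:00:00Z")
print(f"Extracted {len(records)} records")
```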
The source data may be relatively static, such as a data archive, in which case the extraction step would be a stage within a batch process. On the other hand, the data could be streaming live, and from many locations. Examples include weather station data, social networking feeds, and IoT devices.
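For streaming sources, extraction is typically a long-running consumer rather than a one-off read. Here is a sketch of consuming a line-delimited JSON feed over HTTP; the endpoint and event format are again hypothetical:

```python
import json
import requests

STREAM_URL = "https://stream.example.com/v1/iot"  # hypothetical live feed

with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    # iter_lines yields each event as it arrives, without buffering the body.
    for line in resp.iter_lines():
        if line:                      # skip keep-alive blank lines
            event = json.loads(line)  # assumes one JSON object per line
            print(event)              # hand off to the rest of the pipeline here
```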
What is data transformation?
Data transformation, also known as data wrangling, means processing data to make it conform to the requirements of both the target system and the intended use case for the curated data. Transformation can include any of the following kinds of processes:
Cleaning: fixing errors or missing values.
Filtering: selecting only what is needed.
Joining disparate data sources: merging related data.
Feature engineering: such as creating KPIs for dashboards or machine learning.
Formatting and data typing: making the data compatible with its destination.
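As a minimal pandas sketch of these five kinds of transformation; the file names and column names are invented for illustration:

```python
import pandas as pd

# Raw output of the extraction step (hypothetical file and schema).
orders = pd.read_json("raw_orders.json")

# Cleaning: fix missing values.
orders["quantity"] = orders["quantity"].fillna(0)

# Filtering: select only what is needed.
orders = orders[orders["status"] == "complete"]

# Joining disparate data sources: merge in customer attributes.
customers = pd.read_csv("customers.csv")
orders = orders.merge(customers, on="customer_id", how="left")

# Feature engineering: a simple revenue KPI for a dashboard.
orders["revenue"] = orders["quantity"] * orders["unit_price"]

# Formatting and data typing: make types match the destination schema.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["quantity"] = orders["quantity"].astype("int64")
```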
What is data loading?
Generally this just means writing data to some new destination environment. Typical destinations include databases, data warehouses, and data marts. The key goal of data loading is to make the data readily available for ingestion by analytics applications so that end users can gain value from it. Applications include dashboards, reports, and advanced analytics such as forecasting and classification.
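Continuing the sketch above, loading can be a single write to the destination; SQLite stands in here for a real data warehouse:

```python
import sqlite3

# SQLite stands in for the warehouse; swap in your real connection.
conn = sqlite3.connect("warehouse.db")

# Write the curated table, replacing any previous load, so analytics
# applications can query it directly.
orders.to_sql("fact_orders", conn, if_exists="replace", index=False)
conn.close()
```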
There are many use cases for ETL pipelines. A very large amount of information has already been recorded, or is being generated, but is not yet captured or accessible as a digital file. Examples include paper documents, photos and illustrations, and analog audio and video tapes. Digitizing analog data involves extraction by some form of scanning, analog-to-digital transformation, and, finally, storage in a repository.
Online transaction processing (OLTP) systems typically retain little or no historical data. Accordingly, ETL processes capture the transaction history and prepare it for subsequent analysis in an online analytical processing (OLAP) system.
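A common pattern for capturing that history is incremental extraction: each run pulls only the transactions committed since the previous run. Here is a minimal sketch, assuming SQLite on both sides and that the transactions and transactions_history tables (with a created_at timestamp column) already exist:

```python
import sqlite3

oltp = sqlite3.connect("oltp.db")       # transactional source system
olap = sqlite3.connect("warehouse.db")  # analytical destination

# Watermark: the latest timestamp already loaded (epoch start on the first run).
watermark = olap.execute(
    "SELECT COALESCE(MAX(created_at), '1970-01-01') FROM transactions_history"
).fetchone()[0]

# Extract only the transactions committed since the last run...
new_rows = oltp.execute(
    "SELECT id, amount, created_at FROM transactions WHERE created_at > ?",
    (watermark,),
).fetchall()

# ...and append them to the history table for OLAP analysis.
olap.executemany(
    "INSERT INTO transactions_history (id, amount, created_at) VALUES (?, ?, ?)",
    new_rows,
)
olap.commit()
```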
Other use cases include engineering ‘features’, or KPIs, from data sources, as preparation for:
Ingestion by dashboards used by operations, sales and marketing, customers, and executives.
Training and deploying machine learning models for prediction and augmented decision making.
In brief, ETL (Extract, Transform, Load) is an automated data pipeline engineering methodology whereby data is extracted from one or more sources, transformed into a format suitable for its destination and intended use, and loaded into an analytics environment such as a data warehouse or data mart, ready for visualisation, exploration, further transformation, and modelling. ETL is used for curating data and making it accessible to end users, for example for training and deploying machine learning models for prediction and augmented decision making.
ELT Basics
What is an ELT process?
ELT stands for Extract, Load, and Transform, another automated data pipeline engineering methodology. ELT is similar to ETL in that the same stages are involved, but the order in which they are performed is different. In ELT processes, data is acquired and loaded directly, as-is, into its destination environment. From its new home, usually a sophisticated analytics platform such as a data lake, the data can be transformed on demand, however users wish.
Like ETL, the first stage in the ELT process is Extraction. The Extraction process obtains the data from all sources and reads the data, often in an asynchronous fashion, into an application. The Loading process takes the raw data as-is, and loads it into its new environment, where modern analytics tools can then be used directly. The Transformation process for ELT is much more dynamic than it is for conventional ETL.
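To make the ordering concrete, here is a minimal ELT sketch in Python, with SQLite standing in for a data lake platform; the table, columns, and sample record are purely illustrative. The raw data is landed untouched, and the transformation is expressed afterwards, in the destination, as an on-demand SQL view:

```python
import sqlite3

import pandas as pd

lake = sqlite3.connect("lake.db")  # SQLite stands in for the lake platform

# Extract + Load: land the records exactly as received, with no transformation.
records = [  # illustrative raw events from the extraction step
    {"order_date": "2023-06-01", "status": "complete",
     "quantity": 2, "unit_price": 9.99},
]
pd.DataFrame(records).to_sql("raw_orders", lake, if_exists="append", index=False)

# Transform: defined later, in the destination, as an on-demand view.
lake.execute("""
    CREATE VIEW IF NOT EXISTS daily_revenue AS
    SELECT date(order_date) AS day, SUM(quantity * unit_price) AS revenue
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY day
""")
lake.commit()
print(lake.execute("SELECT * FROM daily_revenue").fetchall())
```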
Modern analytics tools in the destination environment enable interactive, on-demand exploration and visualisation of your data, including advanced analytics such as modelling and prediction. Use cases for ELT processes typically fall within the high-performance computing and Big Data realms.
Cases include:
Dealing with the massive swings in scale that come with implementing Big Data products.
Calculating real-time analytics on streaming Big Data.
Bringing together data sources that are highly distributed around the globe.
In terms of speed, moving data is usually more of a bottleneck than processing it, so the less you move it, the better. Therefore, ELT may be your best bet when you want flexibility in building a suite of data products from the same sources.
Why is ELT emerging?
Firstly, cloud computing solutions are evolving at a tremendous rate due to the demands of Big Data. They can easily handle huge amounts of asynchronous data that may be highly distributed around the world. Cloud computing resources are practically unlimited, and they can scale on demand. Unlike traditional on-premises hardware, you only pay for the computing resources you use, so you don't have to worry about underutilizing resources, that is, overspending on equipment.
Secondly, with ELT you have a clean separation between moving data and processing data, and cloud computing is equally prepared to handle the most challenging cases of either task. There may be many reasons to transform your data and just as many ways to do it, so ELT is a flexible option that enables a variety of applications from the same source of data.
Finally, because you are working with a replica of the source data, there is no information loss. Many kinds of transformations can lead to information loss, and if these happen somewhere upstream in the pipeline, it may be a long time before a change request can be met. Worse yet, the information may be lost forever if the raw data is not stored.
In a nutshell, ELT processes are used for cases where flexibility, speed, and scalability are important. Cloud-based analytics platforms are ideally suited for handling Big Data and ELT processes in a cost-efficient manner. ELT is an emerging trend mainly because cloud platform technologies are enabling it.
Comparing ETL to ELT
Differences between ETL and ELT:
For one thing, the transformations happen in a different place and order: transformations for ETL pipelines take place within the data pipeline, before the data reaches its destination, whereas transformations for ELT are decoupled from the data pipeline and happen on demand in the destination environment.
They also differ in flexibility in how they can be used: ETL is normally a fixed process meant to serve a very specific function, whereas ELT is flexible, making data readily available for self-serve analytics.
They also differ in their ability to handle Big Data. ETL processes traditionally handle structured, relational data, with on-premises computing resources handling the workflow, so scalability can be a problem. ELT, on the other hand, handles any kind of data, structured and unstructured. To handle the scalability problems posed by Big Data, ELT leverages the on-demand scalability offered by cloud computing services.
With regard to data discovery and time-to-insight: ETL pipelines take time and effort to modify, which means users must wait for the development team to implement their requested changes. ELT provides more agility.
With some training in modern analytics applications, end users can easily connect to and experiment with the raw data, create their own dashboards, and run predictive models themselves.
ELT is a natural evolution of ETL. One of the factors driving that evolution is the demand to release raw data to a wider user base across the enterprise. Traditionally, ETL processes include an intermediate storage facility called a staging area. This is a holding area for raw extracted data, where you can run processes prior to loading the resulting transformed data into a data warehouse or a data mart. This sounds a lot like an ELT process, and the staging area fits the description of a data lake, which is a modern self-serve repository for storing and manipulating raw data. A traditional staging area, however, is not usually shared across the company. It's a private, siloed area set aside for developing, monitoring, and performance-tuning the data pipeline and its built-in transformations.
Along with the ever-increasing ease of use and connection capabilities of analytics tools, raw data sources have become much more accessible to less technical end users. Accordingly, the paradigm is shifting to self-service data platforms. There is still a place for conventional ETL in developing data products, so ETL is not disappearing any time soon. However, a trend is taking place that favours modern ELT over conventional ETL. The trend is being driven by the pain points that ELT solves: the lengthy time-to-insight, the challenges (for example, scalability) imposed by Big Data, and the conventionally siloed nature of data.
In summary, the key differences between ETL and ELT are where the transformation takes place, flexibility, Big Data support, and time-to-insight. One of the factors driving the evolution from ETL to ELT is the demand to release raw data to a wider user base across the enterprise. Conventional ETL still has many applications and still has its place. ELT is more flexible than ETL, enabling end users to perform ad-hoc, self-serve data analytics in real time.