
Data Pipelines - What They Do & How to Build One

Writer: Paul

Introduction

What, intrinsically, is a Data Pipeline?


Data pipelines are collections of linked digital systems for acquiring, storing, and analysing data.

They are powerful, automated workflows for turning raw data from a multitude of sources into information and insight.

Think, for example, of the electricity demand and generation data from around the UK, used by operators to predict when to turn generation assets on and off, or the operational data generated by an aviation turbofan, used to predict maintenance schedules. In both cases, a data pipeline is the support system underpinning the capability.

You may think that data pipelines are only worthwhile for big, complex data problems, but you would be incorrect. Even the smallest or simplest devices can generate data for performance optimisation, durability improvements or cost reduction. Data pipelines are about streamlining, standardising, and scaling these types of data-leveraging opportunities.

In this article, we share some tips on building (and improving) data pipelines, from proof-of-concept systems to systems piping petabytes of data, with a focus on broad strategies rather than specific technologies or frameworks.

Implementation Advice

We think there are at least five key actions that will lead to better results when building a new data pipeline (or reworking an existing one).

1. Define the Problem to be Solved

Like any major project, defining the objectives of the work will help bring clarity to what the outcomes should look like, and help keep design and development activities on track and within scope.


Data pipelines are tools for answering questions rooted in data: what are those questions? What might the answers look like, and how will they be used, communicated, and leveraged?

We recommend documenting both a problem statement ("What are we trying to solve?") and a set of requirements for the pipeline, and keeping this documentation with the pipeline source code or project folder. As objectives change, the documentation can be updated.
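As a minimal sketch of what this could look like in practice, the brief can even live as a small, versioned structure next to the code. The names below (PipelineBrief, the example questions and requirements) are purely illustrative, not a prescribed format:

```python
# problem_statement.py - a hypothetical, minimal way to keep the problem
# statement and requirements versioned alongside the pipeline source code.
from dataclasses import dataclass, field


@dataclass
class PipelineBrief:
    """Problem statement and requirements for a data pipeline."""
    problem_statement: str                                  # "What are we trying to solve?"
    questions: list[str] = field(default_factory=list)      # questions the data should answer
    requirements: list[str] = field(default_factory=list)   # constraints on sources, latency, outputs


BRIEF = PipelineBrief(
    problem_statement="Predict when to turn generation assets on and off.",
    questions=["What will electricity demand be over the next 24 hours?"],
    requirements=["Ingest half-hourly demand data", "Refresh forecasts daily"],
)
```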

2. Map Data Acquisition & Processing Steps

All pipelines draw data in from somewhere (a website, a database, flat files of text and numbers, an IoT system...), and all pipelines will perform some automated operations on this data to prepare it for downstream use (storage, analysis, machine learning).

Whilst all pipelines perform these steps, the details of the operations are usually bespoke, developed to deal with the quirks and peculiarities of data sourcing and management.

An activity that pays dividends here is to map data acquisition and processing steps. This could include details around the format of the data, particular cleaning steps required, and calculations and formulae used in the pipeline.

Put this in a wiki or a shared area alongside the source code, so engineers and developers can work with it. This will save time, avoid rework, and keep the project focused on the problem statement.
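To illustrate what a mapped set of steps might translate into, here is a minimal Python sketch of an acquire, clean, and store sequence. It assumes pandas (with a Parquet engine such as pyarrow) is available, and the file and column names are invented purely for illustration:

```python
# Hypothetical acquire -> clean -> store sketch; file names, column names
# and cleaning rules are illustrative assumptions, not a prescription.
import pandas as pd


def acquire(path: str) -> pd.DataFrame:
    """Pull raw data in from a source (here, a local CSV file)."""
    return pd.read_csv(path)


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented cleaning steps: drop duplicates, parse timestamps."""
    df = raw.drop_duplicates()
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    return df.dropna(subset=["timestamp"])


def store(df: pd.DataFrame, path: str) -> None:
    """Write the prepared data for downstream use (storage, analysis, ML)."""
    df.to_parquet(path, index=False)


if __name__ == "__main__":
    store(clean(acquire("raw_readings.csv")), "prepared_readings.parquet")
```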

3. Develop a Reusable Project Template

Although many operations are bespoke to a particular pipeline, at an abstract level most pipelines share elements that can be captured in a reusable project template.



These elements can include quality systems and checks, how reports and documents are stored, common visualisations, and reusable elements of source code.

Templating can be implemented at two levels:

1. Common file and folder structures can be used across disparate projects.

2. Standard project component checklists can be used to guide and accelerate greenfield projects.

Templating these reusable elements saves time when building new systems, by moving towards a standard structure and avoiding reinvention during development.
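As a small sketch of the first level, a scaffolding script can lay down a common file and folder structure for every new pipeline project. The folder and file names below are illustrative assumptions, not a fixed standard:

```python
# scaffold.py - hypothetical sketch that creates a standard project layout
# for a new data pipeline; folder and file names are illustrative only.
from pathlib import Path

TEMPLATE_DIRS = ["data/raw", "data/processed", "src", "reports/figures", "docs"]
TEMPLATE_FILES = ["docs/problem_statement.md", "docs/data_map.md", "README.md"]


def scaffold(project_root: str) -> None:
    """Create the shared folder structure and placeholder documents."""
    root = Path(project_root)
    for d in TEMPLATE_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)   # common folder structure
    for f in TEMPLATE_FILES:
        (root / f).touch(exist_ok=True)                 # placeholder docs to fill in


if __name__ == "__main__":
    scaffold("new_pipeline_project")
```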

4. Focus on Visualisation & Reporting

One of the common ways a data project can fail is that the pipeline becomes 'islanded' from the rest of the organisation: the outcomes are not shared with anyone outside the project team.

This can be overcome with good visualisation and reporting techniques.

Visualisation and reporting are primarily about telling a story that can be widely understood and communicated with stakeholders. Good visualisation is about developing the right figures to drive insight, with appropriate units, scales, colours, and other parameters, and linking back to the target problem and objectives of the project. Of similar importance is developing a system for generating, archiving, and reporting updated figures as new data emerges.
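As a rough sketch of the "generate and archive" idea, the snippet below plots a labelled figure and files it away with a timestamp so that updated versions accumulate as new data arrives. It assumes matplotlib and pandas, and the column names and folder are invented for illustration:

```python
# Hypothetical sketch: regenerate and archive a labelled figure as new data arrives.
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def report_demand(df: pd.DataFrame, archive_dir: str = "reports/figures") -> Path:
    """Plot demand over time with explicit units, then archive with a timestamp."""
    fig, ax = plt.subplots()
    ax.plot(df["timestamp"], df["demand_mw"])
    ax.set_xlabel("Time")
    ax.set_ylabel("Demand (MW)")          # appropriate units on the axis
    ax.set_title("Electricity demand")    # ties the figure back to the problem statement

    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"demand_{datetime.now():%Y%m%d_%H%M}.png"
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path
```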

We will talk about this in more detail in a future article.

5. Audit & Continuously Improve

Once a pipeline goes live, it can run for weeks, months or sometimes years. New technologies are released that can augment the pipeline, older technologies are deprecated and superseded, and the nature of the raw data, or the objectives of the project, can evolve.


The only constant is change, but this can be managed by building an audit and continuous improvement system around the pipeline.

We think auditing should be an explicit, structured activity, rather than ad hoc or unmanaged, in the same way a factory audits production processes and equipment on a prescribed schedule. Auditing should start with the problem statement ("still relevant?") and look at the rest of the system from this perspective.

Improvement opportunities identified in the audit can then be planned into future sprints (continuous improvement).
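One way to make the audit explicit rather than ad hoc is a simple checklist run on a prescribed schedule, with the findings recorded for planning. The questions below are example items only, and the script is a sketch rather than a finished tool:

```python
# Hypothetical audit checklist sketch: run on a prescribed schedule
# (e.g. from cron or a CI job) and feed the findings into future sprints.
from datetime import date

AUDIT_CHECKLIST = [
    "Is the problem statement still relevant?",
    "Have the raw data sources, formats or volumes changed?",
    "Are any libraries or services in the pipeline deprecated?",
    "Do the reports still answer the stakeholders' questions?",
]


def run_audit() -> list[str]:
    """Prompt for each checklist item and collect improvement opportunities."""
    findings = []
    for item in AUDIT_CHECKLIST:
        answer = input(f"{item} [y/n] ").strip().lower()
        if answer != "y":
            findings.append(item)
    return findings


if __name__ == "__main__":
    flagged = run_audit()
    print(f"Audit {date.today()}: {len(flagged)} item(s) flagged for improvement")
```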

In Summary

In this article we have provided an overview of what a data pipeline is, and outlined a range of actions that can be taken when building new data pipelines or improving existing ones.

Hopefully, we have convinced you of the merits of building and running data pipelines, independent of your market sector or your project size/complexity, and that taking a strategic approach to designing, building and maintaining your pipeline is the way to go.

Despite being a relatively new business system, the data pipeline as a concept is moving towards greater standardisation (particularly in terms of structure and management), and many of the steps outlined in this article are really about tuning that standardisation to the organisation building the pipeline.

In a future post, we will talk about how all the data your pipeline is producing can be applied to engineering and technical problems with machine learning.

As always, thanks for reading and please do get in touch with us if you would like to know more.

 
 
 
