
SQL should be your default choice for data engineering pipelines


Originally posted: 2023-01-30.

Summary

SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high-performance dataframe libraries, and a good candidate for data of all sizes.

Introduction

My first exposure to analytics programming was using SQL in Microsoft Access in the early 2000s. Later, I became a data scientist, and loved pandas and dplyr for their expressiveness and power. As a data engineer, I dabbled in PySpark. Most recently, I’ve returned to SQL for data manipulation.

These alternative tools were developed to address deficiencies in SQL, and they are undoubtedly better in certain respects. But overall, I’m convinced that SQL is better in most circumstances, especially when working in a team or on long-term projects.

This post will make the case for SQL. I’ll then suggest when other tools may be preferable, and finish by mentioning some future directions and new libraries to keep an eye on.

Advantages of SQL

More people will be able to understand your code

Code is read more often than it is written. By using SQL, a much wider range of people can read your code, including BI developers, business analysts, data engineers and data scientists.

Data engineering pipelines have a habit of sticking around for many years, and SQL has stood the test of time. It is the language most likely to still be understood 10 or even 20 years from now.

SQL is also declarative - the author describes what results they want, rather than how to compute them - which arguably makes it closer to self-documenting than imperative data processing code.
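To make that contrast concrete, here is a minimal sketch in Python (the toy table and column names are invented, and it assumes the duckdb and pandas packages): the imperative version spells out how to build the result; the SQL version only states what is wanted.

    import duckdb
    import pandas as pd

    # A tiny invented dataset, just for illustration.
    df = pd.DataFrame({"category": ["a", "b", "a", "b"], "amount": [10, 20, 30, 40]})

    # Imperative: step-by-step instructions for *how* to compute the totals.
    totals = {}
    for row in df.itertuples():
        totals[row.category] = totals.get(row.category, 0) + row.amount

    # Declarative: describe *what* result is wanted; the engine decides how.
    # DuckDB's Python API can query a local DataFrame by its variable name.
    result = duckdb.sql(
        "SELECT category, SUM(amount) AS total FROM df GROUP BY category"
    ).df()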

Future-proof, with automatic speed improvements and ‘autoscaling’

Despite having been around since the 1970s, SQL-based tools have been one of the most active areas of innovation in data tooling in the past decade. Data pipelines written in SQL can run much faster, and on larger data, than would have been possible a decade ago, with little or no change to the underlying SQL code.

In addition to continuous improvements in more traditional SQL engines, we’ve seen the advent of distributed tools like Spark and Presto that have enabled SQL to run on huge datasets. More recently, DuckDB enables extremely fast parallelised analytics queries on a single machine - competitive with some of the fastest alternatives such as data.table and polars, and able to operate directly on csv and parquet files. If you’re using non-standard features of a particular flavour of SQL, SQLGlot can automatically translate between dialects.
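As a rough sketch of what this looks like in practice (the file, table and column names here are invented; assumes the duckdb and sqlglot Python packages):

    import duckdb
    import sqlglot

    # DuckDB can query a parquet (or csv) file in place - no load step required.
    top_customers = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total_spend
        FROM 'orders.parquet'
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10
    """).df()

    # SQLGlot translates between SQL dialects, e.g. Spark SQL -> DuckDB.
    duckdb_sql = sqlglot.transpile(
        "SELECT date_format(order_ts, 'yyyy-MM') AS month FROM orders",
        read="spark",
        write="duckdb",
    )[0]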

Overall, SQL is probably the most future-proof tool for writing data pipelines - which have a habit of sticking around for longer than expected. Whilst there are many competitors, SQL is the most likely to still be in use in 20 years’ time.

Make data typing someone else’s problem

Experienced data engineers will know that data types (datetimes, strings, etc.) can be a notorious time sink. The main problem is the lack of one-to-one mappings between data types in different programming languages and SQL engines. A secondary problem is that some tools, such as pandas and SQLite, do not enforce data type constraints - meaning a column can contain a mix of data types.

Using SQL mitigates these problems because SQL engines are usually strongly typed. The ecosystem also provides tools that go further: Apache Arrow seeks to address this challenge by enabling robust and consistent handling of data types across different tools (such as R, Python and different databases). If you choose SQL engines that are compatible with Arrow, many of these data typing compatibility problems between tools disappear.
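A minimal sketch of the Arrow round trip, assuming the duckdb and pyarrow packages (the file and column names are invented):

    import duckdb
    import pyarrow.parquet as pq

    # Read a parquet file into an Arrow table, which carries explicit types.
    events = pq.read_table("events.parquet")

    # DuckDB can scan the Arrow table referenced by the local variable name,
    # preserving its schema rather than re-inferring types, and can hand the
    # result back as Arrow for the next tool in the chain.
    daily = duckdb.sql("""
        SELECT CAST(event_ts AS DATE) AS day, COUNT(*) AS n
        FROM events
        GROUP BY day
    """).arrow()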

Simpler maintenance, with less dependency management

Dependency management adds a significant maintenance burden to data pipelines and means maintainers need additional skills. Whilst the use of SQL does not eliminate the problem, it simplifies it considerably since SQL syntax changes much less frequently, and requires few dependencies to run.

For example, a pipeline written in R or Python five years ago could require days or even weeks of work to bring up to date. It’s likely to take significant effort to set up an environment to even run the code. The same pipeline written in SQL would need far fewer changes to bring up to date, and a reader seeking to simply understand the code could easily execute the SQL without setting up a new development environment.

SQL can also be executed from almost any programming language, making it easier to migrate pipelines to different tools, or to embed logic in other applications.
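For example, the same standard SQL can be run from Python against two different engines - SQLite via the standard library (no extra dependencies at all) and DuckDB - without changing the query. A minimal sketch, with invented table and file names:

    import sqlite3
    import duckdb

    query = "SELECT status, COUNT(*) AS n FROM jobs GROUP BY status"

    # SQLite ships with Python itself - nothing to install to run the SQL.
    con = sqlite3.connect("pipeline.db")
    rows_sqlite = con.execute(query).fetchall()
    con.close()

    # The identical query runs unchanged on DuckDB.
    ddb = duckdb.connect("pipeline.duckdb")
    rows_duckdb = ddb.execute(query).fetchall()
    ddb.close()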

Compatibility with good practice software engineering

Many data engineers will be familiar with the pain of uncovering an incomprehensible thousand-line SQL script that was written a decade ago but is a critical part of an organisation’s data pipelines. Or the challenges of making small changes to Spark SQL pipelines that run on huge volumes of data. Overall, it has historically been challenging to write SQL that aligns with good engineering practice - clear, concise, tested code that is split into easily digestible components.

Some of these challenges are now much easier to overcome with the use of three components:

  • DuckDB, a zero-dependency SQL engine optimised for analytics queries that can be used to run unit tests, and more generally makes it much more ergonomic to rapidly iterate SQL code
  • CTEs (Common Table Expressions) - a way of splitting up large queries into multiple semantically meaningful parts, which can be tested individually. These tests can form part of your CI pipelines
  • SQLGlot, a SQL transpilation engine that allows you to more easily test your code in DuckDB even if it’s written against a different target engine (e.g. PySpark).

There are even tools such as dbt which take similar ideas and compose them into a framework.
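To make the first three components concrete, here is a minimal sketch of how they fit together: a single pipeline stage (which would be one CTE in the full query) is run against a tiny fixture in DuckDB inside a pytest-style test, with SQLGlot transpiling it from Spark SQL first. The stage, table names and expected values are all invented for illustration.

    import duckdb
    import sqlglot

    # One stage of an (invented) pipeline, written against Spark SQL.
    CLEANED_ORDERS_SQL = """
        SELECT order_id, LOWER(TRIM(status)) AS status
        FROM raw_orders
        WHERE order_id IS NOT NULL
    """

    def test_cleaned_orders_stage():
        con = duckdb.connect()  # in-memory database
        # Arrange: a tiny fixture standing in for the real input table.
        con.execute("""
            CREATE TABLE raw_orders AS
            SELECT * FROM (VALUES (1, '  Shipped'), (NULL, 'lost')) AS t(order_id, status)
        """)
        # Translate the stage from the production dialect to DuckDB's.
        stage_sql = sqlglot.transpile(CLEANED_ORDERS_SQL, read="spark", write="duckdb")[0]
        # Act and assert: nulls dropped, status normalised.
        assert con.execute(stage_sql).fetchall() == [(1, "shipped")]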

SQL is more expressive and versatile than it used to be

Modern SQL engines support a range of functions that make complex operations much simpler than they used to be, addressing earlier shortcomings.

Other features with narrower applications include full text search, geospatial functions, PIVOT operations and user-defined functions - though these should be used with care because support across engines is limited.
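One example of a now widely supported feature is window functions - e.g. keeping only the most recent record per entity, something that used to require awkward self-joins. A sketch run via DuckDB’s Python API (the file and column names are invented):

    import duckdb

    latest = duckdb.sql("""
        WITH ranked AS (
            SELECT
                *,
                ROW_NUMBER() OVER (
                    PARTITION BY customer_id
                    ORDER BY updated_at DESC
                ) AS rn
            FROM 'customers.parquet'
        )
        SELECT * FROM ranked WHERE rn = 1
    """).df()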

Situations in which SQL may be inappropriate

When may there be a strong argument for using other tools? I’ve argued in this post that SQL will often enable you to write pipelines that are simple, readable and easy to test, and where the totality of code and infrastructure is comparatively simple and maintainable in the long run.

But there are some situations where the opposite is true. For example, to interpolate a time series in pandas, you can use the resample method - a single line of code, where the intention is clear. In many SQL engines, the equivalent SQL is more complex and difficult to read. Similarly, SQL is probably not the best tool to operate on graph-like data structures.
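For reference, the pandas approach looks roughly like this (the toy DataFrame is invented): upsample to daily frequency, then linearly interpolate the gaps.

    import pandas as pd

    readings = pd.DataFrame(
        {"value": [1.0, 4.0]},
        index=pd.to_datetime(["2023-01-01", "2023-01-04"]),
    )

    # Resample to daily frequency, then fill the new gaps by linear interpolation.
    daily = readings.resample("D").mean().interpolate()
    # 2023-01-02 and 2023-01-03 come out as 2.0 and 3.0.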

Ultimately SQL should not be the only tool you consider - but I suggest a presumption against other tools without a strong justification.

Putting it all together: Why I wrote Splink using SQL

Part of the motivation for writing this post is the success I’ve had in using SQL to power Splink, a library for probabilistic record linkage at scale. All the core computations are written in SQL.

This has the following advantages:

  • Splink is able to support multiple execution backends: workloads can be executed using DuckDB, SQLite, Spark or AWS Athena (Presto). DuckDB is by far the fastest option for small to medium size data, but Spark and Athena are available for very large jobs that need a distributed backend. This means that Splink is optimised for a wide variety of scenarios without the team having to maintain any of that execution logic.
  • Splink is future-proofed. If a new SQL engine comes along that’s faster, we can support it by writing a small adaptor. In fact, that’s exactly what happened with DuckDB.
  • Once a model is trained, Splink is able to output SQL code that expresses the model, enabling the logic to be embedded in other applications without Splink being a dependency. In this way, a Splink model could be productionised e.g. in dbt!
  • All SQL generated by Splink is output as a step-by-step CTE pipeline. This makes it easy to understand and unit test each part of the algorithm, making its inner workings much less of a black box.

This has all been possible despite the need to express some fairly complex iterative algorithms (such as the Expectation Maximisation algorithm) using SQL.

Future trends

Far from becoming outdated, SQL is going from strength to strength.

One interesting trend is the development of alternative APIs that ‘compile down’ to SQL for execution. The idea is that SQL engines are so well optimised that library authors are better off reusing this work than writing their own execution logic.

Some interesting things I have my eye on:

  • Malloy promises to offer a clearer and more concise declarative alternative to SQL that compiles to SQL.
  • Similarly, Ibis offers a replacement API for SQL queries that is more familiar to Python users, but ‘compiles down’ to SQL for execution against its supported backends (see the sketch after this list). dbplyr is similar in R. Polars also allows SQL to be executed, though there the SQL appears to be translated into Polars expressions rather than the other way around.
  • SQLGlot has the potential for lots of interesting applications, such as keeping track of lineage. It seems possible that it might be used in future to rewrite a CTE pipeline for more efficient execution without changing its meaning (since CTEs can sometimes be bad for performance).
  • dbt has gained widespread adoption for data engineering. I’m particularly interested in future directions such as functions that can provide more information about lineage, or that introspect the SQL and improve the efficiency of execution.
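As a small illustration of the ‘compile down to SQL’ idea referenced in the Ibis bullet above, here is a hedged sketch (the table name and schema are invented, and it assumes a recent version of the ibis-framework package):

    import ibis

    # An unbound table: just a name and a schema, no backend attached yet.
    payments = ibis.table(
        [("customer_id", "int64"), ("amount", "float64")],
        name="payments",
    )

    expr = payments.group_by("customer_id").aggregate(
        total_spend=payments.amount.sum()
    )

    # Ask Ibis for the SQL this expression compiles down to.
    print(ibis.to_sql(expr))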

Taking stock of my


Twelve software projects and companies I use all the time as a working data journalist.

I’ve been doing data work for nearly a decade and, at this point, I’ve got a set of tools that I really enjoy and that make me really productive. At least I feel that they do.

I wrote a recent post where I gushed about Datasette, a tool I use all the time. I likened it to my "data hammer" because it is always within reach of my work.

Well, now I want to talk about the other tools I use. In part to strain the metaphor and in part because I’ve never talked about the software I use day in and day out. There’s only one criterion for something to show up on this list: I have to use it all the time.

These tools are always close at hand because they’re that useful. But I also use them because they feel like magic: casting data alchemy spells with non-trivial geospatial analysis gives you the sort of feeling that hooks you, that makes you go “I guess computers aren’t all terrible.”

It’s a bit hyperbolic, but I love each and every software project on this list. And I’m super grateful for the hundreds of thousands of hours that have been put into them.

I’ve broken my toolset into a few categories to provide some sort of organization. Also note that most of these tools are in the JavaScript ecosystem. It’s a language that is super expressive and jibes with my brain, for better or worse.

D3

Books have been written about how useful, and complicated, the D3 library is. And even though I don't use it so much for visualization anymore, it has some features that I just can't live without. Namely:

  1. CSV parsing and serializing
  2. Date and time parsing
  3. Scales and interpolation
  4. Basic math like sums and means

lodash

This collection of utility functions is in nearly every one of my projects. I use it to sort data, ensure uniqueness, and do basic string manipulation.

p-queue

The thing for when I need promise-based control flow. It makes it easy to run tasks in parallel and lets me adjust how many tasks run at once. Why is that cool? Because it means I can dramatically reduce the run time of some of my web scrapers.

cheerio

A jQuery-inspired API for parsing and manipulating HTML outside of a browser.

puppeteer

Scraping websites can mean you need to automate a browser. There are a number of options for controlling browsers, but I've found that I like puppeteer's API the most, so it's the one I use.

mapshaper

Need to convert geographic data between file formats? Need to simplify it in the process? Or filter out particular features that you don't want to include in a dataset?

Mapshaper does it all and because it's a command-line tool it's easy to integrate into data pipelines.

turfjs

This project is a fucking gem; it enables me to conduct geospatial analysis from the comfort of JavaScript and JSON. The most magical function of this library is booleanPointInPolygon, which can tell you whether a point is inside a polygon.

Sounds simple but it turns out to be really powerful and is the crux of how I built a tool for California voters to see how redistricting affected their home address.

QGIS

And sometimes you just need a GUI to do... well, something. I'm just a novice when it comes to this open-source geospatial powerhouse, but I use it a lot to explore new data by putting it on a map.

SvelteKit

If I'm making a website, SvelteKit is the first thing I'm going to reach for. Not only does it use the very popular component model for building interfaces, but it also makes it very easy to keep API/data fetching code and UI code in the same repository.

Observable notebooks

Doing data analysis is hard and ensuring that it's repeatable and auditable is crucial. When possible, I use Observable notebooks to handle that. The major thing I like about this tool over other notebook options, besides that it's Javascript, is that I can just share a URL with a colleague. I love that.

GitHub Actions

GitHub Actions is where all of my scrapers and recurring tasks run. It's easy to schedule scrapers, read the logs of when they've succeeded and failed, and it can easily commit data to a repository.

Netlify

I don't have the skill to run web servers, let alone the interest. I want my websites to be deployed whenever a change happens on GitHub, and I want to be able to see when a deploy succeeds and when it fails. Netlify does it all.

Do you use any of these tools in your day to day work? Love them? Hate 'em? I'm super curious to hear from you.


NASA plans to retire the International Space Station by 2031 by crashing it into the Pacific Ocean


CNN  — 

NASA intends to keep operating the International Space Station until the end of 2030, after which the ISS would be crashed into a remote part of the Pacific Ocean known as Point Nemo, according to newly published plans outlining its future.

Launched in 2000, the space lab has orbited 227 nautical miles above Earth with more than 200 astronauts from 19 different countries enjoying stints aboard – representing a continuous human presence in space.

NASA said that commercially operated space platforms would replace the ISS as a venue for collaboration and scientific research.

“The private sector is technically and financially capable of developing and operating commercial low-Earth orbit destinations, with NASA’s assistance. We look forward to sharing our lessons learned and operations experience with the private sector to help them develop safe, reliable, and cost-effective destinations in space,” said Phil McAlister, director of commercial space at NASA Headquarters in a statement.

“The report we have delivered to Congress describes, in detail, our comprehensive plan for ensuring a smooth transition to commercial destinations after retirement of the International Space Station in 2030.”

In the International Space Station Transition Report, NASA said the plan was for the ISS to fall to Earth in an area known as the South Pacific Oceanic Uninhabited Area – also known as Point Nemo. The report said that its budget estimate assumed that the deorbit would happen in January 2031.

Named after the submarine sailor in Jules Verne’s novel “Twenty Thousand Leagues Under the Sea,” Point Nemo is the point in the ocean that is farthest from land and has been a watery grave for many other spacecraft.

The area is approximately 3,000 miles off New Zealand’s eastern coast and 2,000 miles north of Antarctica, and it’s estimated that space-faring nations such as the US, Russia, Japan and European countries have sunk more than 263 pieces of space debris there since 1971.

The report said the ISS would perform thrusting maneuvers that would ensure “safe atmospheric entry.”

The ISS won’t rest on its laurels for the next eight years. NASA said goals for the next decade include using the ISS as an “analog for a Mars transit mission,” according to the report.

“The International Space Station is entering its third and most productive decade as a groundbreaking scientific platform in microgravity,” said Robyn Gatens, director of the International Space Station at NASA Headquarters, in the statement.

“This third decade is one of results, building on our successful global partnership to verify exploration and human research technologies to support deep space exploration, continue to return medical and environmental benefits to humanity, and lay the groundwork for a commercial future in low-Earth orbit.”

“We look forward to maximizing these returns from the space station through 2030 while planning for transition to commercial space destinations that will follow.”

The space station has been home to many scientific firsts. The first item to be 3D-printed on the space station was produced in 2014. NASA astronaut Kate Rubins sequenced DNA in space for the first time in 2016. And the fifth state of matter, called a Bose-Einstein condensate, was produced in space by NASA’s Cold Atom Lab on the station in 2018.

Astronauts have learned how to grow lettuces and leafy greens in space. The first space-grown salad was sampled by astronauts in 2015. Now, they’re even growing radishes and chilis on the station. This could one day help astronauts grow their own food on deep space missions.

China, whose astronauts have long been excluded from the ISS, launched the first module of its planned space station last year. While not as large as the ISS, the Chinese space station is expected to be fully operational by the end of this year.

Russia has said it will leave the ISS project in 2025 and plans to build its own space station that could launch in 2030.

CNN’s Ashley Strickland contributed to this report.


The rockets NASA and SpaceX plan to send to the moon


Now that NASA has successfully completed the first flight of the SLS and Orion, a mission known as Artemis I, the agency is looking to put a crew of as many as four astronauts on Orion for a flight around the moon. That Artemis II mission could come as soon as 2024, with a landing on the lunar surface planned for a year or two later.

SpaceX’s Starship could also soon hit a major milestone, with the first launch attempt to reach orbit expected later this year.

About this story

Additional design and development by Betty Chavarria. Editing by Kate Rabinowitz, Manuel Canales and Jeff Dooley. Copy editing by Wayne Lockwood and Liz McGehee.


Code Lifespan

Surely (no one/everyone) will (recognize how flexible and useful this architecture is/spend a huge amount of effort painstakingly preserving and updating this garbage I wrote in 20 minutes)
5 public comments
DexX (Melbourne, Australia): Depressingly true.
ChrisDL (New York): 10/10 can confirm.
JayM (Atlanta, GA): Yep
fanguad (Boston, MA): Pure, unadulterated truth
alt_text_bot: Surely (no one/everyone) will (recognize how flexible and useful this architecture is/spend a huge amount of effort painstakingly preserving and updating this garbage I wrote in 20 minutes)

If you ever want to read The Winds of Winter, pray for the New York Giants


A theory

