At work, I had to reconcile some data spread across multiple databases, and turned to the programmer’s favorite data format: comma-separated values, the mighty CSV. I planned to use diff and awk to do the grunt work until I realized I had to compare more than one column per row, meaning I would have to clean the data, organize it, and so on…

Suddenly, I had an epiphany: I had been reading for months, perhaps years, about people using DuckDB for all sorts of purposes, and remembered (from the single time I read their docs) that it can import CSVs and query them using SQL. Guess what? You can even import multiple CSVs and they will behave like individual database tables, so you can do JOINs, ANTI JOINs, WHEREs, and most of the rest of ANSI SQL.

Using DuckDB couldn’t be easier - install it and drop into the REPL by running duckdb. Then create a table from the CSV file. If the CSV file includes headers, they will automatically be picked up as the column names:

CREATE TABLE original_records AS FROM 'original_records.csv';
CREATE TABLE replicated_records AS FROM 'replicated_records.csv';
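
Header and type detection worked fine for me, but it’s worth checking what DuckDB actually inferred. DESCRIBE shows the resulting schema, and if the sniffer guesses wrong you can spell the options out with read_csv - a minimal sketch, using the header and delim options I know from the docs:

DESCRIBE original_records;

-- If detection misfires, be explicit about the format:
CREATE OR REPLACE TABLE original_records AS
FROM read_csv('original_records.csv', header = true, delim = ',');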

You can then query the tables as usual:

SELECT *
FROM original_records WHERE id NOT IN (
    SELECT id
    FROM replicated_records
    WHERE answer = '42'
      AND animal = 'dolphin'
);
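
Remember the ANTI JOINs I mentioned? The same “rows missing on the other side” check can be written with DuckDB’s ANTI JOIN, which I find more readable than NOT IN - a sketch, assuming id is the key shared by both tables:

SELECT o.*
FROM original_records o
ANTI JOIN (
    SELECT id
    FROM replicated_records
    WHERE answer = '42'
      AND animal = 'dolphin'
) r USING (id);

As a bonus, ANTI JOIN sidesteps NOT IN’s surprising behavior when the subquery returns NULLs (NOT IN matches nothing in that case).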

And as the cherry on top, you can export the results back to CSV:

COPY (SELECT * FROM original_records) TO 'results.csv' (HEADER, DELIMITER ',');
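
COPY accepts any query, not just a table scan, so you can export the comparison result directly (missing_rows.csv is just an example name):

COPY (
    SELECT *
    FROM original_records
    WHERE id NOT IN (
        SELECT id
        FROM replicated_records
        WHERE answer = '42'
          AND animal = 'dolphin'
    )
) TO 'missing_rows.csv' (HEADER, DELIMITER ',');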

Cool, right? DuckDB is now part of my toolbox, and I may already have another use for it - just need to convince my peers at work. 🤓