Importing CSV files with SQLite
GitHub offers only a superficial view of how GitHub Actions runners spend their minutes on private repositories. Currently, the only way to get detailed information is the Get usage report button on the project or organization billing page. The problem is that the generated report is a CSV file, which shifts the burden of filtering and visualizing the data onto the user. While most users of this report are used to dealing with CSV files, be they developers or spreadsheet-savvy accountants, this is hardly the most user-friendly way to offer insights into billing data.
When facing this issue, at first I thought about using harelba/q to query the CSV files directly from the command line. The problem is that q isn't that straightforward to install, as apparently it isn't available via apt or pip, and it doesn't let you easily change the data once it's imported, the way a regular database would. The first time I ran into this, I resorted to creating a PostgreSQL database and importing the CSV file into it, but I can never remember its CSV import syntax, and it requires a running daemon just for that. I kept thinking that there should be a simpler way: what if I used SQLite for that?
To avoid having to CAST() each TEXT column when working with dates or numbers, the following schema.sql can be used:
CREATE TABLE billing (
    date DATE,
    product TEXT,
    repository TEXT,
    quantity NUMERIC,
    unit TEXT,
    price NUMERIC,
    workflow TEXT,
    notes TEXT
);
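This works thanks to SQLite's type affinity rules: quantity values that look like numbers are stored as numbers, while ISO-8601 dates stay as text that still compares correctly against the output of date(). Once the data is imported (next step), this can be double-checked with something along these lines; the output is omitted here since it depends on the actual report:
sqlite> SELECT typeof(date), typeof(quantity) FROM billing LIMIT 1;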
After that, it's possible to import the CSV file with the sqlite3 CLI tool. The --skip 1 argument to the .import command is needed to avoid importing the CSV header as data, given that SQLite considers it to be a regular row when the table already exists:
$ sqlite3 github.db
SQLite version 3.36.0 2021-06-18 18:58:49
Enter ".help" for usage hints.
sqlite> .read schema.sql
sqlite> .mode csv
sqlite> .import --skip 1 c2860a05_2021-12-23_01.csv billing
sqlite> SELECT COUNT(*) FROM billing;
1834
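For a non-interactive version, recent sqlite3 builds also accept the dot commands directly as arguments, which makes it possible to script the whole import in a single invocation. A sketch of that, reusing the schema and report file names from above:
$ sqlite3 github.db ".read schema.sql" ".mode csv" ".import --skip 1 c2860a05_2021-12-23_01.csv billing"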
Now it’s easy to dig into the billing data. For a better presentation, .mode column can be enabled to both show the column names and align the output. We can, for instance, find out which workflows consumed the most minutes in the last week, along with their respective repositories:
sqlite> .mode column
sqlite> SELECT date, repository, workflow, quantity FROM billing WHERE date > date('now', '-7 days') AND product = 'actions' ORDER BY quantity DESC LIMIT 5;
date repository workflow quantity
---------- ----------------- ------------------------------------ --------
2021-12-21 contoso/api .github/workflows/main.yml 392
2021-12-18 contoso/terraform .github/workflows/staging-images.yml 361
2021-12-22 contoso/api .github/workflows/main.yml 226
2021-12-21 contoso/api .github/workflows/qa.yml 185
2021-12-20 contoso/api .github/workflows/main.yml 140
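The same data can be sliced in other ways too. As a sketch, here is the total number of minutes per workflow over the same period (output omitted, as it depends on the report):
sqlite> SELECT workflow, SUM(quantity) AS minutes FROM billing WHERE date > date('now', '-7 days') AND product = 'actions' GROUP BY workflow ORDER BY minutes DESC LIMIT 5;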
Another useful example is the cost per repository in the last week, summing the cost of all of its workflows. An UPDATE statement is required to apply a small data fix first, given that the rows of the price column contain a dollar sign ($) that needs to be dropped:
sqlite> UPDATE billing SET price = REPLACE(price, '$', '');
sqlite> SELECT repository, SUM(quantity) * price AS amount FROM billing WHERE date > date('now', '-7 days') AND product = 'actions' GROUP BY repository;
repository amount
------------------ ------
contoso/api 11.68
contoso/public-web 0.128
contoso/status 1.184
contoso/terraform 2.92
contoso/webapp 0.6
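A caveat about the query above: price sits outside both the SUM() and the GROUP BY, so SQLite fills it in from an arbitrary row of each group. That's fine as long as every row of a repository shares the same per-minute price, but if runners with different rates appear in the report (Windows and macOS minutes are billed at higher rates than Linux), a safer variant multiplies inside the aggregate:
sqlite> SELECT repository, SUM(quantity * price) AS amount FROM billing WHERE date > date('now', '-7 days') AND product = 'actions' GROUP BY repository;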
Not as intuitive as a web page where one can just click around to filter and sort a report, but definitely doable. As a side note, one cool aspect of SQLite is that it doesn't require a database file to be used. If started as sqlite3, with no arguments, all of its storage needs are handled entirely in memory. This makes it even more interesting for data exploration cases like this, offering all of its query capabilities without ever persisting data to disk.
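As a sketch of that fully in-memory workflow, the special :memory: filename can also be passed explicitly and combined with the single-invocation style shown earlier, so the import and the queries run without ever creating a database file (the COUNT here is just a placeholder for the actual analysis):
$ sqlite3 :memory: ".read schema.sql" ".mode csv" ".import --skip 1 c2860a05_2021-12-23_01.csv billing" "SELECT COUNT(*) FROM billing"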