ETL to QE, Update 6, Native SQL Decision
Date: 2023-10-10
See Discord Binding for project context
Issues and Evolution of the ETL Design
The same friend who recommended the psycopg2 optimizations in ETL to QE, Update 5, Optimizations mentioned that I was using Postgres like MongoDB when I should be using proper native SQL. In this update I will defend my initial decision to use JSON tables, then circle back to why that design decision is not as optimal as I first expected and why JSON Files to Native SQL is the way to go.
Please read ETL to QE, Update 1, SQLite to Postgres for additional context.
Initial Design
JSON Files -> SQLite JSON Tables
I initially chose SQLite for the following reasons:
- I had previously used it in my Keybase Binding
- No dependencies, SQLite is built into Python
- No need to worry about connecting to server
- Portable, database is a file
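To make the first hop of that pipeline concrete, here is a minimal sketch of loading raw JSON files into a single-column SQLite staging table. The directory, file, table, and column names are illustrative assumptions, not the project's actual schema.

```python
import json
import sqlite3
from pathlib import Path

# Sketch: dump each raw JSON file into a one-column staging table.
# "etl.db", "json_dumps", and "raw_messages" are placeholder names.
conn = sqlite3.connect("etl.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_messages (doc TEXT)")

for path in Path("json_dumps").glob("*.json"):
    doc = json.loads(path.read_text(encoding="utf-8"))  # also validates the file
    conn.execute("INSERT INTO raw_messages (doc) VALUES (?)", (json.dumps(doc),))

conn.commit()
conn.close()
```

Everything lives in one file and nothing needs a server, which is exactly the appeal listed above.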
Then I added a step to load the data into Postgres:
JSON Files -> SQLite JSON Tables -> Native SQLite Tables -> CSV Tables -> Postgres Tables -> Concatenate Postgres Tables -> Cast Timestamps in Postgres Tables -> Any desired format
At one point I realized I had so much data that it would be better managed in a proper database like Postgres. Getting the data from SQLite into Postgres proved much more difficult than I had anticipated. If I had practiced the ETL pipeline with synthetic data first, things might have worked out better.
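For the SQLite to Postgres hop, the approach amounts to dumping each table to CSV and bulk-loading it with COPY rather than row-by-row INSERTs. A hedged sketch, assuming a simple messages table; the schema, file names, and connection string are placeholders:

```python
import csv
import sqlite3

import psycopg2

# Export a SQLite table to CSV; table and column names are placeholders.
sqlite_conn = sqlite3.connect("etl.db")
with open("messages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in sqlite_conn.execute("SELECT id, author, content, ts FROM messages"):
        writer.writerow(row)
sqlite_conn.close()

# Bulk-load the CSV into Postgres with COPY, far faster than INSERTs.
pg_conn = psycopg2.connect("dbname=etl user=postgres")
with pg_conn, pg_conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(id BIGINT, author TEXT, content TEXT, ts TEXT)"
    )
    with open("messages.csv", encoding="utf-8") as f:
        cur.copy_expert("COPY messages FROM STDIN WITH (FORMAT csv)", f)
pg_conn.close()
```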
Postgres-Only Design, Using JSON Tables like MongoDB
JSON Files -> JSON Postgres SQL Tables -> Native Postgres SQL Tables -> Any desired format
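The second hop, flattening JSON tables into native tables, can be done entirely inside Postgres with its JSON operators. A minimal sketch, assuming a staging table raw_messages with a single jsonb column named doc; the key and column names are illustrative:

```python
import psycopg2

# Flatten a jsonb staging table into typed native columns.
# Assumes raw_messages(doc jsonb) exists; names are placeholders.
conn = psycopg2.connect("dbname=etl user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS messages AS
        SELECT (doc->>'id')::bigint             AS id,
               doc->>'author'                   AS author,
               doc->>'content'                  AS content,
               (doc->>'timestamp')::timestamptz AS ts
        FROM raw_messages
        """
    )
conn.close()
```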
In ETL to QE, Update 1, SQLite to Postgres I explained why I initially wanted to use JSON tables rather than native SQL tables.
At this point, while practicing loading the ETL pipeline into the database, I noticed it took hours to run even with The Optimizations, so it was time to write some real SQL.
Native SQL Postgres Design
JSON Files -> Native Postgres SQL Tables -> Any desired format
Just like a calculus proof, you arrive at a simple description at the end.
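In practice the final design reduces to parsing the JSON in Python and batch-inserting straight into a native table. A hedged sketch using psycopg2's execute_values helper for batching; the directory, schema, and key names are assumptions, not the project's actual ones:

```python
import json
from pathlib import Path

import psycopg2
from psycopg2.extras import execute_values

# Parse JSON files in Python and batch-insert into a native table.
# Directory, table, and key names are placeholders, not the real schema.
conn = psycopg2.connect("dbname=etl user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(id BIGINT PRIMARY KEY, author TEXT, content TEXT, ts TIMESTAMPTZ)"
    )
    rows = []
    for path in Path("json_dumps").glob("*.json"):
        for msg in json.loads(path.read_text(encoding="utf-8")):
            rows.append((msg["id"], msg["author"], msg["content"], msg["timestamp"]))
    # execute_values collapses the VALUES list into a few round trips.
    execute_values(
        cur,
        "INSERT INTO messages (id, author, content, ts) VALUES %s "
        "ON CONFLICT (id) DO NOTHING",
        rows,
    )
conn.close()
```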