Kaoru Kohashigawa

# Dummy Data with PSQL


Tags: postgresql, database, psql

I'm gonna be honest: almost two weeks have gone by and I haven't finished my blog post on Postgres indexes, mainly because I wanted to reproduce the behavior I saw. Alas, I could not, because Postgres' query planner behaves differently based on your query **and the data in your database**. Because interesting things happen when your database is filled with rows, I had to fill 'er up! That in and of itself was an adventure and a lesson.

### How not to fill up your database with dummy data

The closest tool to hand was Rails. Yeah, it's not ideal: it's single-threaded and geared more towards usability than speed. Whatever, I didn't have to get off my fat butt and dig through the shelf (the internet) for another tool. Off I went to create 5 million rows using ActiveRecord. I wrote a seed file which looped 5 million times, generating random data for each row. It took 3+ hours. I actually walked away, did errands, and came back to find my computer sleeping, the lazy thing... I had to change the settings on my computer and re-run the seed file. This was _after_ I had remembered that migration files run in transaction blocks. **UGH** That's right! I forgot I initially tried this with a migration! When my computer went to sleep, the transaction threw an error, making it as if nothing had happened. Yeah, it was so horrible I forgot I did that until now...

### PSQL to the rescue!

While reading up on indexes I remember seeing a blog post which had a snippet on how to insert rows into Postgres to then run `EXPLAIN` against. I ignored it at the time because that wasn't what I was looking for. I'm such a tool. Welp, I can't find it now, because that's just my luck, so I had to dig through the internet to learn this one Stack Overflow question / answer at a time... _le sigh_

Here's what my test table looks like:

```sql
        Table "public.flow"
    Column    |       Type        | Modifiers
--------------+-------------------+-----------
 run_id       | character varying | not null
 event_uuid   | character varying | not null
 state        | smallint          | not null
 last_updated | bigint            | not null
 timeout      | bigint            | not null
```

I'll need to create random text for `run_id` and `event_uuid`, a random integer for `state`, and epoch times for `last_updated` and `timeout`. Remember, I'm trying to create data as close to production as possible. Times need to be relatively close to today: `timeout` should be within the next 2 weeks, `last_updated` within the last week.

Okay, here's how I did it so you don't have to scroll all the way to the bottom:

```sql
INSERT INTO flow
SELECT
  run_id,
  md5(random()::text) AS event_uuid,
  round((random() * 5)) AS state,
  round(extract('epoch' FROM now() - random() * (now() - timestamp '07-10-2017')) * 1000),  -- last_updated
  round(extract('epoch' FROM now() - random() * (timestamp '07-24-2017' - now())) * 1000)   -- timeout
FROM generate_series(1,5000000) AS run_id;
```

Still interested, eh? Good, because now I'll explain what's going on!

#### Create fake text / ids

When you need to create random strings, use `md5` + `random()`. `random()` will give you a double from 0.0 to 1.0, and `::text` will cast that double into text so you can hash it, giving you a random string!

```sql
SELECT md5(random()::text);

               md5
----------------------------------
 31687e13ac4e58491a8c745bc1c9c188
(1 row)
```
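The `state` column uses the same `random()` call, just scaled and rounded instead of hashed. Here's a quick standalone sketch of the `round(random() * 5)` expression from the INSERT above (note that rounding makes 0 and 5 show up about half as often as the values in between, which is fine for dummy data):

```sql
-- random integer from 0 to 5; the result changes on every call
SELECT round(random() * 5) AS state;
```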
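The two timestamp expressions lean on a related trick: multiplying `random()` by an interval scales that interval, so subtracting the result from `now()` lands you at a random point inside a window. A minimal sketch, using a made-up 7-day window rather than the exact dates from the INSERT above:

```sql
-- a random timestamp somewhere in the last 7 days (the window here is just an example)
SELECT now() - random() * interval '7 days' AS random_recent_time;
```

Feed that into the `extract` call described next and you get the epoch values that `last_updated` and `timeout` expect.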
#### Extracting epoch

`timestamp` is great because it'll generate a timestamp for you without a lot of work. Sucks if you're storing time as an epoch, though! Not really, because that's what `extract` is for.

```sql
SELECT EXTRACT('epoch' FROM now());

    date_part
------------------
 1497725140.25024
(1 row)
```

#### Generate lots of data

The secret sauce is definitely in `generate_series`. This will create a series of values from the first argument to the second.

```sql
SELECT generate_series(1, 5);

 generate_series
-----------------
               1
               2
               3
               4
               5
(5 rows)
```

### How fast is it?

Running this script (gen-data.sql):

```sql
SELECT now();

INSERT INTO flow
SELECT
  run_id,
  md5(random()::text) AS event_uuid,
  round((random() * 5)) AS state,
  round(extract('epoch' FROM now() - random() * (now() - timestamp '07-10-2017')) * 1000),
  round(extract('epoch' FROM now() - random() * (timestamp '07-24-2017' - now())) * 1000)
FROM generate_series(1,5000000) AS run_id;

SELECT now();
```

It takes about 1.5 minutes on my 2013 MacBook Pro.

```sql
flow_runner_test=# \i gen-data.sql
              now
-------------------------------
 2017-06-17 12:35:49.284048-07
(1 row)

INSERT 0 5000000
              now
-------------------------------
 2017-06-17 12:37:03.664021-07
(1 row)
```

**AMAZING**

![](https://media.giphy.com/media/3FXZxuQZaepRS/giphy.gif)

## References

- Math functions in Postgres: [https://www.postgresql.org/docs/9.6/static/functions-math.html](https://www.postgresql.org/docs/9.6/static/functions-math.html)
- Create random strings: [https://stackoverflow.com/a/4566583](https://stackoverflow.com/a/4566583)
- Generate series: [https://www.compose.com/articles/postgresql-series-random-with/](https://www.compose.com/articles/postgresql-series-random-with/)
- Inserting random character data: [http://www.postgresql-archive.org/How-to-insert-random-character-data-into-tables-for-testing-purpose-THanks-td5680973.html](http://www.postgresql-archive.org/How-to-insert-random-character-data-into-tables-for-testing-purpose-THanks-td5680973.html)