Skip to content

Latest commit

 

History

History
220 lines (164 loc) · 7.96 KB

Bulk_data_query.md

File metadata and controls

220 lines (164 loc) · 7.96 KB

Using SQL to generate bulk data for Postgres table

Introduction

  • Generating sample data for database testing is one of the common steps.
  • In the previous articles, we covered how to generate sample data in bulk using plain Java and using regex.
  • In this article, we will learn how we can use SQL and Postgres functions to generate N number of sample records in seconds.

Schema

  • Our schema is an Account table that looks like below which contains different columns such as username, password, email, etc.

Generating Single Account Record

Generating Random Usernames

  • Generating a username is the concatenation of a few names and numbers which is getting selected using the random() function.
-- usernames
with usernames as (
select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username
),

Generating Random Passwords

  • Returning md5 of random text as password.
-- password
passwords as (
SELECT md5(random()::text) as password
),

Generating Random Email Domain

  • Picking email domains randomly from the list.
email_domain as (
 select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain
),

Joining to Return Single Record

  • Now that we have randomness to choose different values for different columns we can write SQL to return them as a single record each time we execute the SQL.
account_record as (
select username, password, (username || domain) as email 
from usernames 
join passwords
on 1=1 
join email_domain 
on 1=1
)
  • Now selecting each field from the account record.
select username, password, email, now(), now(), 1 from account_record;

Convert to Postgres Function

  • We know how to generate a single unique account record, we can convert that logic to the postgres function as below.
CREATE OR REPLACE FUNCTION sample_account_record() 
RETURNS TABLE(username text, password text, email text, created_at timestamp, last_login timestamp, permissions_id int)
AS 
$$
-- usernames
with usernames as (
select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username
),

-- password passwords as ( SELECT md5(random()::text) as password ),

-- email email_domain as ( select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain ),

account_record as ( select username, password, (username || domain) as email from usernames join passwords on 1=1 join email_domain on 1=1 )

select username, password, email, now(), now(), 1 from account_record; $$ LANGUAGE sql;

  • Our function sample_account_record got created in Postgres.
  • Now that we have everything natively as postgres function, we can just query select to that function and we will get our output.
select * from sample_account_record()

Generating Record In Bulk

  • So far we are only generating a single record, but then we can use the generate_series() function from Postgres to generate N number of records.
select 
sample_account_record() as record
from generate_series(1,5)
  • The output contains different sample records for account schema but its format is not what we want. We are looking for each record as a separate column instead of a single CSV record inside the bracket.
  • Let’s do some data processing to split that into multiple columns.
  • The very first thing we can do is to replace the bracket with nothing so that we can csv record without the bracket.
 select 
 REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g')  as record
 from generate_series(1,50)
  • The bracket has been removed.
  • Once we have csv record we can split them and assign them to different columns that they belong to.
select 
split_part(cast(record as text), ',', 1) as username,
split_part(cast(record as text), ',', 2) as password,
split_part(cast(record as text), ',', 3) as email,
split_part(cast(record as text), ',', 4) as created_at, 
split_part(cast(record as text), ',', 5) as last_login,
split_part(cast(record as text), ',', 6) as permission
from(
 select 
 REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g')  as record
 from generate_series(1,5)
) x
  • Our records now contain multiple columns with different records.

One thing to know is that we should be afraid to use this logic on STG and PROD instances since generate_series and other used Postgres functions might take a lot of computing time that might impact database. We can use it for developmement and testing purposes.

Conclusion

  • In this article, we learned how to generate sample records using SQL in postgres.
  • We also learned how to convert our SQL logic to the Postgres function and scale it to generate bulk inserts.