- Generating sample data for database testing is one of the common steps.
- In the previous articles, we covered how to generate sample data in bulk using plain Java and using regex.
- In this article, we will learn how we can use SQL and Postgres functions to generate N number of sample records in seconds.
- Our schema is an Account table that looks like below which contains different columns such as username, password, email, etc.
Generating Random Usernames
- Generating a username is the concatenation of a few names and numbers which is getting selected using the random() function.
-- usernames with usernames as ( select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username ),
Generating Random Passwords
- Returning md5 of random text as password.
-- password passwords as ( SELECT md5(random()::text) as password ),
Generating Random Email Domain
- Picking email domains randomly from the list.
email_domain as ( select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain ),
Joining to Return Single Record
- Now that we have randomness to choose different values for different columns we can write SQL to return them as a single record each time we execute the SQL.
account_record as ( select username, password, (username || domain) as email from usernames join passwords on 1=1 join email_domain on 1=1 )
- Now selecting each field from the account record.
select username, password, email, now(), now(), 1 from account_record;
- We know how to generate a single unique account record, we can convert that logic to the postgres function as below.
CREATE OR REPLACE FUNCTION sample_account_record() RETURNS TABLE(username text, password text, email text, created_at timestamp, last_login timestamp, permissions_id int) AS $$ -- usernames with usernames as ( select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username ),-- password passwords as ( SELECT md5(random()::text) as password ),
-- email email_domain as ( select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain ),
account_record as ( select username, password, (username || domain) as email from usernames join passwords on 1=1 join email_domain on 1=1 )
select username, password, email, now(), now(), 1 from account_record; $$ LANGUAGE sql;
- Our function sample_account_record got created in Postgres.
- Now that we have everything natively as postgres function, we can just query select to that function and we will get our output.
select * from sample_account_record()
- So far we are only generating a single record, but then we can use the generate_series() function from Postgres to generate N number of records.
select sample_account_record() as record from generate_series(1,5)
- The output contains different sample records for account schema but its format is not what we want. We are looking for each record as a separate column instead of a single CSV record inside the bracket.
- Let’s do some data processing to split that into multiple columns.
- The very first thing we can do is to replace the bracket with nothing so that we can csv record without the bracket.
select REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g') as record from generate_series(1,50)
- The bracket has been removed.
- Once we have csv record we can split them and assign them to different columns that they belong to.
select split_part(cast(record as text), ',', 1) as username, split_part(cast(record as text), ',', 2) as password, split_part(cast(record as text), ',', 3) as email, split_part(cast(record as text), ',', 4) as created_at, split_part(cast(record as text), ',', 5) as last_login, split_part(cast(record as text), ',', 6) as permission from( select REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g') as record from generate_series(1,5) ) x
- Our records now contain multiple columns with different records.
One thing to know is that we should be afraid to use this logic on STG and PROD instances since generate_series and other used Postgres functions might take a lot of computing time that might impact database. We can use it for developmement and testing purposes.
- In this article, we learned how to generate sample records using SQL in postgres.
- We also learned how to convert our SQL logic to the Postgres function and scale it to generate bulk inserts.