Python script using Faker for generating mock data.
Steps:
1 - Create a generation plan from the plan-template.csv file, which is stored in the root directory. The plan tells the script the order in which the entities will be generated. Save it in csv format in the root dir. See test_mock_plan.csv for an example.
2 - Create the schema csv files. You can use the schmema-template.csv file. Each schema file contains the definition of a data entity so the mock generator can emulate your data structure. Save it in the csv-schema dir. See test_employees.csv for an example.
3 - Generate the lookups that entity fields will reference by calling the generate_lookups.py program:
python3 generate_lookups.py --lookup job --numrows 100 --filename test_jobs_lkp.csv
The lookups currently available are:
- exhibit
- phone_number
- job
4 - Call the generate_mock_data.py script, passing the mock plan file name:
python3 generate_mock_data.py -p test_mock_plan.csv
5 - If you need to convert any lookup table to parquet, you can use the script utils/convert_to_parquet.py. All the files are converted to parquet by default.
6 - If you need to partition the parquet files to simulate a prod file structure, you can run 1) partition_parquet_files.py to partition all the entities in a plan or 2) partition_parquet_file.py to partition a single entity.
Plan file columns: processing_order, entity_name, number_of_rows, type (all are base at this stage), reference
Schema file columns: field_name, mandatory, data_type, max_size, unique, type, reference
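As a rough illustration of how the plan and schema columns listed above could be consumed, here is a minimal sketch using only the standard csv module. The helper names load_plan and load_schema, the entity names and the row counts are hypothetical and are not taken from the project's scripts.

```python
import csv
import io

# Hypothetical plan content using the plan columns (sample values are made up).
PLAN_CSV = """processing_order,entity_name,number_of_rows,type,reference
2,test_employees,50,base,
1,test_offices,10,base,
"""

# Hypothetical schema content using the schema columns.
SCHEMA_CSV = """field_name,mandatory,data_type,max_size,unique,type,reference
office_num,yes,string,150,No,lookup,test_offices.office_code
"""


def load_plan(text):
    """Parse plan rows and sort them by processing_order (assumed numeric)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return sorted(rows, key=lambda row: int(row["processing_order"]))


def load_schema(text):
    """Parse schema rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


plan = load_plan(PLAN_CSV)
schema = load_schema(SCHEMA_CSV)

# Entities come back in the order they should be generated.
print([row["entity_name"] for row in plan])
print(schema[0]["type"], schema[0]["reference"])
```

Sorting by processing_order matters when one entity looks up values from another, since the referenced entity has to exist first.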
generate_mock_data.py: expects the mock plan file name.
generate_lookups.py: expects the type of lookup, the number of rows and the file name for the output.
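The arguments above could be parsed along these lines. This is only a sketch built from the flags shown in the example commands (--lookup, --numrows, --filename and -p); the function names, the dest name "plan" and the use of choices are assumptions, not the scripts' actual code.

```python
import argparse


def build_lookup_parser():
    """Sketch of a parser for the generate_lookups.py flags."""
    parser = argparse.ArgumentParser(description="Generate a lookup table (sketch).")
    # The three lookup types listed in this README.
    parser.add_argument("--lookup", required=True,
                        choices=["exhibit", "phone_number", "job"])
    parser.add_argument("--numrows", type=int, required=True)
    parser.add_argument("--filename", required=True)
    return parser


def build_plan_parser():
    """Sketch of a parser for the generate_mock_data.py flag."""
    parser = argparse.ArgumentParser(description="Generate mock data (sketch).")
    # dest="plan" is an assumed attribute name.
    parser.add_argument("-p", dest="plan", required=True)
    return parser


# Parse the same arguments as the example commands in the steps.
args = build_lookup_parser().parse_args(
    ["--lookup", "job", "--numrows", "100", "--filename", "test_jobs_lkp.csv"]
)
plan_args = build_plan_parser().parse_args(["-p", "test_mock_plan.csv"])
print(args.lookup, args.numrows, args.filename)
print(plan_args.plan)
```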
- lookup
It gets a random value from another data entity. For example, to get a country name from test_countries_lkp, use this function and specify <lookup_entity>.<field_name> in the reference field.
Example of a column with lookup:
field_name,mandatory,data_type,max_size,unique,type,reference
office_num,yes,string,150,No,lookup,test_offices.office_code
- gen_int
Generates a random integer with max_size digits
- gen_text_free
Generates random text of max_size length
- gen_imei
Generates an IMEI number/char
- gen_literal
It generates the literal value given in the reference field.
Example of a column with literal:
field_name,mandatory,data_type,max_size,unique,type,reference
source_doc,yes,string,,No,gen_literal, mockdata-v1.xlsx
- gen_nomis_no
It generates a nomis number.
- gen_dateyyyymmddrand4y
It generates a random date within the last 4 years.
- gen_upperstr3
It generates a random uppercase string of 3 chars
- gen_upperstr2
It generates a random uppercase string of 2 chars
- gen_intstr4
It generates a random string of 4 digits
- gen_device_family
It generates a random device family. Values: ['Disk', 'Phone', 'Laptop', 'Tablet']
- gen_office_name
It generates an office name with the following pattern: -
- gen_person_name
It generates a person name with first and last name
- gen_email
It generates an email address
- gen_filepath
It generates a file path
- gen_webdomain
It generates a web domain
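To make a few of the generator types above more concrete, here is a minimal sketch written with the standard library only. These are stand-ins, not the project's Faker-based implementations. The exact-digit-count reading of max_size, the YYYYMMDD output format, the 4 x 365 day window, the entities dictionary and the sample office codes are all assumptions made for illustration.

```python
import random
import string
from datetime import date, timedelta


def gen_int(max_size):
    """Random integer with exactly max_size digits (assumed interpretation)."""
    low = 10 ** (max_size - 1) if max_size > 1 else 0
    return random.randint(low, 10 ** max_size - 1)


def gen_upperstr(length):
    """Random uppercase string; length 3 or 2 mirrors gen_upperstr3 / gen_upperstr2."""
    return "".join(random.choices(string.ascii_uppercase, k=length))


def gen_intstr4():
    """Random string made of 4 digits."""
    return "".join(random.choices(string.digits, k=4))


def gen_dateyyyymmddrand4y(today=None):
    """Random date within roughly the last 4 years, formatted as YYYYMMDD (assumed)."""
    today = today or date.today()
    offset = random.randint(0, 4 * 365)
    return (today - timedelta(days=offset)).strftime("%Y%m%d")


def gen_device_family():
    """Random device family from the values listed in this README."""
    return random.choice(["Disk", "Phone", "Laptop", "Tablet"])


def lookup(reference, entities):
    """Pick a random value for a '<lookup_entity>.<field_name>' reference."""
    entity_name, field_name = reference.split(".", 1)
    return random.choice(entities[entity_name])[field_name]


# Hypothetical, already-generated entity used as a lookup source.
entities = {"test_offices": [{"office_code": "LON-01"}, {"office_code": "MAN-02"}]}

print(gen_int(5), gen_upperstr(3), gen_intstr4())
print(gen_dateyyyymmddrand4y(), gen_device_family())
print(lookup("test_offices.office_code", entities))
```

Keeping each generator as a small function keyed by the schema's type column would let a dispatcher map a type name to a callable, though how the real script wires this up is not stated here.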