sara-almedacaballero-cognizant/synth-data-gen

Mock Data Generator

Python script using Faker for generating mock data.

Description

Steps:

1 - Create the generation plan from the plan-template.csv file in the root directory. The plan tells the script the order in which to generate the entities. Save it in CSV format in the root directory. See test_mock_plan.csv for an example.

2 - Create the schema CSV files. You can use the schmema-template.csv file as a starting point. Each schema file contains the definition of a data entity so that the mock generator can emulate your data structure. Save it in the csv-schema directory. See test_employees.csv for an example.

3 - Generate the lookups that entity fields will reference. To do this, run the generate_lookups.py program:

python3 generate_lookups.py --lookup job --numrows 100 --filename test_jobs_lkp.csv

The lookups you can currently generate are:

  • exhibit
  • phone_number
  • job

4 - Run the generate_mock_data.py script, passing the mock plan file name:

python3 generate_mock_data.py -p test_mock_plan.csv

5 - If you need to convert a lookup table to Parquet, you can use the script utils/convert_to_parquet.py. All the files are converted to Parquet by default.

6 - If you need to partition the Parquet files to simulate a prod file structure, you can run either 1) partition_parquet_files.py to partition all the entities in a plan, or 2) partition_parquet_file.py to partition a single entity.

Usage

Generation Plan

processing_order, entity_name, number_of_rows, type (all are base at this stage), reference
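As an illustration, a plan with these columns could be read and put in generation order as sketched below. This is a minimal sketch, not the code in generate_mock_data.py: the entity names and row counts are made up, and `load_plan` is a hypothetical helper.

```python
import csv
import io

# Hypothetical plan contents; entity names and row counts are illustrative only.
PLAN_CSV = """processing_order,entity_name,number_of_rows,type,reference
2,test_employees,500,base,
1,test_offices,20,base,
"""


def load_plan(text):
    """Parse plan rows and return them sorted by processing_order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        # CSV values arrive as strings; convert the numeric columns.
        row["processing_order"] = int(row["processing_order"])
        row["number_of_rows"] = int(row["number_of_rows"])
    return sorted(rows, key=lambda r: r["processing_order"])


if __name__ == "__main__":
    for step in load_plan(PLAN_CSV):
        print(step["processing_order"], step["entity_name"], step["number_of_rows"])
```

Sorting by processing_order means an entity that other entities reference can be listed with a lower number so that it is generated first.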

Entity Schema

field_name, mandatory, data_type, max_size, unique, type, reference
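To make the column meanings concrete, the sketch below parses a schema with these columns into typed field definitions. It is only an assumption about how such a file might be interpreted, not the generator's actual parser; `load_schema` is a hypothetical helper, and the two rows are the example columns shown in this README.

```python
import csv
import io

# Two schema rows taken from the examples in this README.
SCHEMA_CSV = """field_name,mandatory,data_type,max_size,unique,type,reference
office_num,yes,string,150,No,lookup,test_offices.office_code
source_doc,yes,string,,No,gen_literal, mockdata-v1.xlsx
"""


def load_schema(text):
    """Parse schema rows into dicts with booleans and an optional max_size."""
    fields = []
    for row in csv.DictReader(io.StringIO(text)):
        max_size = row["max_size"].strip()
        fields.append({
            "field_name": row["field_name"].strip(),
            # Assumption: "yes"/"no" are matched case-insensitively.
            "mandatory": row["mandatory"].strip().lower() == "yes",
            "data_type": row["data_type"].strip(),
            "max_size": int(max_size) if max_size else None,
            "unique": row["unique"].strip().lower() == "yes",
            "type": row["type"].strip(),
            "reference": row["reference"].strip(),
        })
    return fields


if __name__ == "__main__":
    for field in load_schema(SCHEMA_CSV):
        print(field)
```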

Generate Mock Data

Expects the mock plan file name, passed with -p.

Generate Lookups

Expects the lookup type (--lookup), the number of rows (--numrows), and the output file name (--filename).

Available functions for Data Entities

  • lookup

Gets a random value from another data entity. For example, to get a country name from test_countries_lkp, use this function and specify <lookup_entity>.<field_name> in the reference field.

Example of a column with lookup:

field_name,mandatory,data_type,max_size,unique,type,reference
office_num,yes,string,150,No,lookup,test_offices.office_code
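A lookup of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the script's own implementation: the `lookup` helper, the `entities` dictionary, and the office rows are all hypothetical.

```python
import csv
import io
import random

# Hypothetical contents of the test_offices entity; values are made up.
TEST_OFFICES_CSV = """office_code,city
LDN01,London
MAN02,Manchester
"""


def lookup(reference, entities, rng=random):
    """Return a random value for a '<lookup_entity>.<field_name>' reference."""
    entity_name, field_name = reference.split(".", 1)
    rows = entities[entity_name]
    return rng.choice(rows)[field_name]


# Assumption: entities generated earlier are kept in memory, keyed by name.
entities = {"test_offices": list(csv.DictReader(io.StringIO(TEST_OFFICES_CSV)))}

if __name__ == "__main__":
    print(lookup("test_offices.office_code", entities))
```

Because the value is drawn from rows that already exist, the referenced entity has to be generated before the entity that looks it up.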

  • gen_int

Generates a random integer with max_size digits.
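One possible reading of this behaviour is sketched below. It assumes the result has exactly max_size digits (the README does not say whether shorter numbers are allowed), and it is not the repository's implementation.

```python
import random


def gen_int(max_size, rng=random):
    """Return a random non-negative integer with exactly max_size digits."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    # A single-digit number may be 0; longer numbers must not start with 0.
    low = 0 if max_size == 1 else 10 ** (max_size - 1)
    high = 10 ** max_size - 1
    return rng.randint(low, high)


if __name__ == "__main__":
    print(gen_int(5))
```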

  • gen_text_free

Generates random text of max_size length.

  • gen_imei

Generates an IMEI number/char.

  • gen_literal

Generates the literal given in the reference field.

Example of a column with a literal:

field_name,mandatory,data_type,max_size,unique,type,reference
source_doc,yes,string,,No,gen_literal,mockdata-v1.xlsx

  • gen_nomis_no

It generates a nomis number.

  • gen_dateyyyymmddrand4y

It generates a random date within the last 4 years.
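A date generator like this can be sketched with the standard library alone. The YYYYMMDD output format is an assumption inferred from the function name, the 4-year window is approximated as 4 * 365 days, and the `today` parameter is added here only to make the sketch testable.

```python
import random
from datetime import date, timedelta


def gen_dateyyyymmddrand4y(today=None, rng=random):
    """Return a random date from roughly the last 4 years as a YYYYMMDD string."""
    today = today or date.today()
    window_days = 4 * 365  # approximation; ignores leap days
    offset = rng.randint(0, window_days)
    return (today - timedelta(days=offset)).strftime("%Y%m%d")


if __name__ == "__main__":
    print(gen_dateyyyymmddrand4y())
```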

  • gen_upperstr3

Generates a random uppercase string of 3 characters.

  • gen_upperstr2

Generates a random uppercase string of 2 characters.

  • gen_intstr4

Generates a random string of 4 digits.
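The fixed-length string generators (gen_upperstr3, gen_upperstr2, gen_intstr4) can be pictured as thin wrappers around two general helpers. The sketch below is an assumption about their behaviour, not the repository's code; `gen_upperstr` and `gen_intstr` are hypothetical names.

```python
import random
import string


def gen_upperstr(length, rng=random):
    """Return a random string of uppercase ASCII letters of the given length."""
    return "".join(rng.choices(string.ascii_uppercase, k=length))


def gen_intstr(length, rng=random):
    """Return a random string of decimal digits; leading zeros are allowed."""
    return "".join(rng.choices(string.digits, k=length))


if __name__ == "__main__":
    print(gen_upperstr(3), gen_upperstr(2), gen_intstr(4))
```

Returning digits as a string rather than an int keeps leading zeros, which matters for codes such as "0042".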

  • gen_device_family

It generates a random device family. Values: ['Disk', 'Phone', 'Laptop', 'Tablet']

  • gen_office_name

It generates an office name with the following pattern: -

  • gen_person_name

Generates a person name with first and last name.

  • gen_email

It generates an email address

  • gen_filepath

It generates a file path

  • gen_webdomain

It generates a web domain
