Enhancement: Add a rejection sampler #174

lsumption · 2023-04-12T10:29:07Z

Summary

Currently the batch method raises a validation error if any of the generated rows fails the schema validators. To make the package usable in a testing environment, it would be useful to be able to generate a dataframe of any size with a rejection sampler: build a row, discard it if it fails validation, and retry until enough valid rows exist. The sampler should store the random seeds of the successful builds so that the same dataframe can be reproduced each time.

I have written a class, included below, that does this. Since I needed it for my own project, it could be a useful feature for others who want to use Polyfactory for testing. I based it on the original pydantic factories package, but I imagine it would look much the same for the additional Factory options in Polyfactory.

Basic Example

import time
import json
import pandas as pd
from pydantic import ValidationError  # caught in setup_seeds when a build is rejected
from polyfactory.factories.pydantic_factory import ModelFactory

class RejectionSampler:
    """Class to create a synthetic dataset based on the pydantic schema,
    dropping rows that do not pass the validation set up in the schema.

    Parameters
    ----------

    factory (ModelFactory): ModelFactory subclass created from the pydantic schema
    size (int): Number of rows in the dataset to create
    """

    def __init__(self, factory: type[ModelFactory], size: int) -> None:

        self.factory = factory
        self.size = size
        self.used_seeds = []

    def setup_seeds(self):

        start = time.time()

        synthetic_data = pd.DataFrame()

        # Start at seed 1 and advance by 1 after every build attempt, pass or fail,
        # so the same sequence of seeds is tried on every run.
        seed_no = 1

        for _ in range(self.size):
            accepted = False
            while not accepted:
                try:
                    self.factory.seed_random(seed_no)
                    result = self.factory.build()
                    result_dict = json.loads(result.json())
                    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
                    synthetic_data = pd.concat(
                        [synthetic_data, pd.DataFrame(result_dict, index=[0])]
                    )
                    self.used_seeds.append(seed_no)
                    accepted = True
                except ValidationError:
                    pass  # rejected: fall through and try the next seed
                seed_no += 1

        end = time.time()

        print(f"finished, took {seed_no-1} attempts to generate {self.size} rows")
        print(f"took {end-start} seconds to setup seeds")

    def generate(self):

        start = time.time()

        synthetic_data = pd.DataFrame()

        for seed in self.used_seeds:
            self.factory.seed_random(seed)
            result = self.factory.build()
            result_dict = json.loads(result.json())
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            synthetic_data = pd.concat([synthetic_data, pd.DataFrame(result_dict, index=[0])])

        end = time.time()

        print(f"took {end-start} seconds to generate new data")

        return synthetic_data

Drawbacks and Impact

No response

Unresolved questions

No response

Goldziher · 2023-04-12T15:13:18Z

Hiya, sure - but this requires a dependency on pandas or Polars, no?

lsumption · 2023-04-18T07:21:38Z

True, it could instead just return a list of JSON objects, removing the pandas dependency but keeping the reproducible valid-batch component?

Goldziher · 2023-05-29T16:18:13Z

Yes, this should not depend on any third-party library.

williamjamir · 2023-11-23T15:45:27Z

How about making it installable as an optional extra?
I mean something like
pip install polyfactory[pandas] or pip install polyfactory[extras]

guacs · 2023-11-24T04:39:08Z

@williamjamir why do you want to have pandas for this? Also, I'm not sure about the need for this feature. polyfactory should not be creating instances that fail the validation of any of the libraries it supports. If it does, then that's a bug which should be fixed.

EDIT: Actually, this could be useful where you have your own custom validators which polyfactory cannot support. Though if this is the case, I think the better option would be to use polyfactory's Use, or to implement a classmethod for those fields, so that the generated values pass your custom validators as well.

lsumption added the enhancement label Apr 12, 2023

Goldziher changed the title ~~Enhancement: <Create valid reproducible dataframes>~~ Enhancement: Add a rejection sampler May 29, 2023

Goldziher added the help wanted and good first issue labels May 29, 2023
