Enhancement: Add a rejection sampler #174

Open
lsumption opened this issue Apr 12, 2023 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@lsumption
lsumption commented Apr 12, 2023

Summary

Currently the batch method fails with a validation error if any generated row fails the schema validators. To make the package usable in a testing environment, it would be useful to generate a dataframe of any size using a rejection sampler. The sampler should store the random seeds of successful builds so that the same dataframe can be reproduced each time.

I have created a class that does this, included below. I needed it for my own project, so it could be useful to others who want to use Polyfactory for testing. I built it on the original pydantic factories package, but I imagine it would be pretty similar for the additional Factory options in Polyfactory.

Basic Example

import time
import json
import pandas as pd
from pydantic import ValidationError
from polyfactory.factories.pydantic_factory import ModelFactory

class RejectionSampler:
    """Create a synthetic dataset from a pydantic schema,
    skipping rows that do not pass the validation set up in the schema.

    Parameters
    ----------
    factory : ModelFactory
        Factory created from the pydantic schema.
    size : int
        Number of rows to create.
    """

    def __init__(self, factory: ModelFactory, size: int) -> None:

        self.factory = factory
        self.size = size
        self.used_seeds = []

    def setup_seeds(self):
        """Find and store `size` seeds for which factory.build() passes validation."""

        start = time.time()

        # Start at seed 1 and advance by 1 after every attempt, pass or fail,
        # so the search visits the same seeds in the same order on every run.
        seed_no = 1

        while len(self.used_seeds) < self.size:
            try:
                self.factory.seed_random(seed_no)
                self.factory.build()
                self.used_seeds.append(seed_no)
            except ValidationError:
                # Rejected: this seed produced a row that fails the schema validators.
                pass
            seed_no += 1

        end = time.time()

        print(f"finished, took {seed_no-1} attempts to generate {self.size} rows")
        print(f"took {end-start} seconds to setup seeds")

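    def generate(self):
        """Rebuild the dataset by replaying the stored seeds."""

        start = time.time()

        # DataFrame.append was removed in pandas 2.0, so collect one-row frames
        # and concatenate them once at the end.
        frames = []

        for seed in self.used_seeds:
            self.factory.seed_random(seed)
            result = self.factory.build()
            result_dict = json.loads(result.json())
            frames.append(pd.DataFrame(result_dict, index=[0]))

        # pd.concat raises on an empty list, so fall back to an empty frame.
        synthetic_data = pd.concat(frames) if frames else pd.DataFrame()

        end = time.time()

        print(f"took {end-start} seconds to generate new data")

        return synthetic_data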

Drawbacks and Impact

No response

Unresolved questions

No response

@lsumption lsumption added the enhancement New feature or request label Apr 12, 2023
@Goldziher
Contributor

Hiya, sure - but this requires a dependency on pandas or Polars, no?

@lsumption
Author

True, it could instead just return a list of JSON objects, which removes the pandas dependency but keeps the reproducible valid batch component?
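A minimal sketch of what a dependency-free version could look like, using only the standard library. It is not Polyfactory code: `SeedRejectionSampler`, `build_row`, and `adult` are hypothetical names, where `build` stands in for `factory.build()` and `is_valid` stands in for the schema validators. The `max_attempts` cap is an added assumption so the search cannot loop forever.

```python
import random
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class SeedRejectionSampler:
    """Keep the seeds whose generated rows pass validation (stdlib only)."""

    def __init__(
        self,
        build: Callable[[random.Random], Row],
        is_valid: Callable[[Row], bool],
        size: int,
    ) -> None:
        self.build = build        # hypothetical stand-in for factory.build()
        self.is_valid = is_valid  # hypothetical stand-in for schema validators
        self.size = size
        self.used_seeds: List[int] = []

    def setup_seeds(self, max_attempts: int = 10_000) -> None:
        # Try seeds 1, 2, 3, ... and remember the ones that yield a valid row.
        self.used_seeds = []
        for seed in range(1, max_attempts + 1):
            if len(self.used_seeds) == self.size:
                return
            if self.is_valid(self.build(random.Random(seed))):
                self.used_seeds.append(seed)
        if len(self.used_seeds) < self.size:
            raise RuntimeError("not enough valid rows within max_attempts")

    def generate(self) -> List[Row]:
        # Replaying the stored seeds reproduces the same rows on every call.
        return [self.build(random.Random(seed)) for seed in self.used_seeds]


def build_row(rng: random.Random) -> Row:
    # Toy generator: some rows will be rejected by the validator below.
    return {"age": rng.randint(0, 120), "score": rng.random()}


def adult(row: Row) -> bool:
    # Toy custom validator.
    return row["age"] >= 18


sampler = SeedRejectionSampler(build_row, adult, size=5)
sampler.setup_seeds()
rows = sampler.generate()
```

Because each row is a plain dict, the result can be passed to `json.dumps` directly, and callers who want a dataframe can build one from the list themselves.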

@Goldziher Goldziher changed the title from "Enhancement: <Create valid reproducible dataframes>" to "Enhancement: Add a rejection sampler" May 29, 2023
@Goldziher Goldziher added help wanted Extra attention is needed good first issue Good for newcomers labels May 29, 2023
@Goldziher
Contributor

Yes, this should not depend on any third-party library.

@williamjamir

How about adding the possibility of installing it as an extension?
I mean something like
pip install polyfactory[pandas] or pip install polyfactory[extras]
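One way an optional extra is commonly supported in Python is a lazy import that fails with a clear message when the package is missing. The sketch below assumes a hypothetical helper `rows_to_dataframe`; the extra name `polyfactory[pandas]` is only the suggestion made in this thread, not an existing install option.

```python
from typing import Any, Dict, List


def rows_to_dataframe(rows: List[Dict[str, Any]]):
    """Convert rows to a DataFrame, importing pandas only when it is needed."""
    try:
        import pandas as pd  # optional dependency, imported lazily
    except ImportError as exc:
        raise ImportError(
            "pandas is required for DataFrame output; the extra name "
            "'polyfactory[pandas]' is only the suggestion from this thread"
        ) from exc
    return pd.DataFrame(rows)
```

With this shape the core sampler stays free of third-party imports, and only users who ask for a dataframe need pandas installed.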

@guacs
Member

guacs commented Nov 24, 2023

@williamjamir why do you want to have pandas for this? Also, I'm not sure about the need for this feature. polyfactory should not be creating instances that fail the validation of any of the libraries it supports. If it does, then that's a bug which should be fixed.

EDIT: Actually, this could be useful where you have your own custom validators which polyfactory cannot support. Though if this is the case, I think the better option would be to use Use or implement a classmethod for those fields to generate values that will pass your custom validators as well.
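The difference between the two approaches can be illustrated in plain Python. This is a stand-in for the idea only, not Polyfactory's API: `even_by_rejection` and `even_by_construction` are hypothetical functions, and "value must be even" plays the role of a custom validator.

```python
import random


def even_by_rejection(rng: random.Random) -> int:
    # Rejection sampling: draw until the custom rule (even number) holds.
    while True:
        value = rng.randint(0, 100)
        if value % 2 == 0:
            return value


def even_by_construction(rng: random.Random) -> int:
    # Constrained generation: every draw satisfies the rule, nothing is discarded.
    return 2 * rng.randint(0, 50)
```

Both functions only ever return even numbers between 0 and 100, but the second never wastes a draw, which is the argument for overriding the field's generator instead of filtering after the fact.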

4 participants