
fix: respect both multiple_of and minimum/maximum constraints #505

Open
wants to merge 2 commits into
base: main
Conversation

richardxia

@richardxia richardxia commented Feb 26, 2024

Pull Request Checklist

  • New code has 100% test coverage
  • (If applicable) The prose documentation has been updated to reflect the changes introduced by this PR
  • (If applicable) The reference documentation has been updated to reflect the changes introduced by this PR
  • Pre-Commit Checks were run and passed
  • Tests were run and passed

Description

Library Code Changes

Previously, generate_constrained_number() would potentially generate invalid numbers when multiple_of is not None and exactly one of either minimum or maximum is not None, since it would just return multiple_of without respecting the upper or lower bound.

This significantly changes the implementation to handle this case correctly. The generate_constrained_number() method has been completely removed and replaced with a generate_constrained_multiple_of() function. A major difference between the old function and the new one is that the new function does not accept a method argument for generating random numbers: it always uses create_random_integer(), since the problem reduces to generating a random integer multiplier.

The high-level algorithm behind generate_constrained_multiple_of() is that we need to constrain the random integer generator to generate numbers such that when they are multiplied with multiple_of, they still fit within the original bounds constraints. This simply involves dividing the original bounds by multiple_of, with some special handling for negative multiple_of values as well as carefully chosen rounding behavior.
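For the integer case, the idea above can be sketched as follows. This is an illustrative sketch, not the PR's actual implementation: the signature, the fallback bounds used when a limit is absent, and the assumption that a valid multiple exists in range are all simplifications.

```python
from math import ceil, floor
from random import Random
from typing import Optional

def generate_constrained_multiple_of(
    random: Random,
    multiple_of: int,
    minimum: Optional[int],
    maximum: Optional[int],
) -> int:
    """Return a random multiple of ``multiple_of`` within [minimum, maximum]."""
    # A negative multiple_of flips the direction of the inequalities when the
    # bounds are divided through, so swap the bounds first.
    if multiple_of < 0:
        minimum, maximum = maximum, minimum
    # Round toward the inside of the interval so every multiplier maps back
    # into range: ceil for the lower bound, floor for the upper bound.
    # The -10/10 fallbacks for missing bounds are purely illustrative.
    low = ceil(minimum / multiple_of) if minimum is not None else -10
    high = floor(maximum / multiple_of) if maximum is not None else 10
    # Assumes low <= high, i.e. at least one multiple exists in the range.
    return random.randint(low, high) * multiple_of
```

Note how a one-sided bound is now respected: with `multiple_of=7` and `minimum=10`, the multiplier is drawn from `[ceil(10/7), ...] = [2, ...]`, so the result is always at least 14 rather than just 7.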

We also need to make some changes to other functions.

get_increment() needs to take an additional argument for the actual value that the increment is for. This is because floating-point numbers can't use a static increment or else it might get rounded away if the numbers are too large. Python fortunately provides a math.ulp() function for computing this for a given float value, so we make use of that function. We still use the original float_info.epsilon constant as a lower bound on the increment, though, since in the case that the value is too close to zero, we still need to make sure that the increment doesn't disappear when used against other numbers.

Finally, we rename and modify passes_pydantic_multiple_validator() to is_almost_multiple_of(), modifying its implementation to defer the casting of values to float() to minimize rounding errors. This specifically affects Decimal numbers, where casting to float too early causes too much loss of precision.
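The deferred-cast idea can be sketched like this (an illustration of the technique, not the PR's exact implementation; the tolerance value is arbitrary):

```python
import math
from decimal import Decimal
from typing import Union

Numeric = Union[int, float, Decimal]

def is_almost_multiple_of(value: Numeric, multiple_of: Numeric) -> bool:
    # Compute the remainder in the operands' own type first (exact for
    # Decimal); casting ``value`` to float up front would lose precision for
    # high-precision Decimals. Only the small remainder is cast to float for
    # the tolerance check.
    mod = value % multiple_of
    # A near-multiple has a remainder close to 0, or close to multiple_of
    # itself (floating-point error can land it just below the divisor).
    return math.isclose(float(mod), 0.0, abs_tol=1e-9) or math.isclose(
        float(mod), abs(float(multiple_of)), abs_tol=1e-9
    )
```

For example, `0.3 % 0.1` evaluates to roughly `0.0999...` in floating point, so checking closeness to both `0` and `multiple_of` is what makes `is_almost_multiple_of(0.3, 0.1)` hold.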

Test Changes

A significant number of changes were made to the tests as well, since the original tests missed the bug being fixed here. Each of the integer, floating-point, and decimal tests has been updated to assert that the result actually falls within the minimum and maximum constraints. In addition, we remove some unnecessary sorting of the randomly generated test input values, since the sorting was constraining multiple_of to be greater than or less than the minimum and maximum values, which caused many of the scenarios involving negative values to be skipped.

Lastly, the floating-point and decimal tests need additional constraints to prevent unrealistic extreme values from hitting precision issues. This was done by adding a number of constraints on the number of significant digits in the input numbers and on the relative magnitudes of the input numbers.

Close Issue(s)

@richardxia richardxia requested review from a team as code owners February 26, 2024 21:28
@guacs
Member

guacs commented Feb 28, 2024

@richardxia thanks for the writeup! I haven't had a chance to look into it in detail yet, but I'll do that soon :)

Also, feel free to change the API if it's necessary. Or add a new function to handle these cases. For example, you mentioned passes_pydantic_multiple_validator - you can change that to handle floats and Decimal types differently if that'll make the implementation easier.

@guacs
Member

guacs commented Mar 13, 2024

@richardxia sorry for the delay, but I've had a chance to look into the PR.

So I think the biggest issue we're facing here is the opaque nature of method in generate_constrained_number. I am comfortable with deprecating that and then using dedicated functions for each numerical type, i.e. generate_constrained_integer, generate_constrained_float, etc. These would have no need to take in a method but could decide how to create the constrained number depending on the constraints and other context. Furthermore, each of them could determine whether the generated number passes the constraints however they wish without relying on passes_pydantic_multiple_validator (this is a faulty implementation in pydantic as well, as seen in this issue, though I think that's only for pydantic v1). WDYT?

However, these would have to be deterministic since users have to be able to reliably create the same instances with the same values if needed (based on the seeding of the Random instance on the factory).

I think we have room for making whatever changes we need here, and deprecate accordingly, without too many issues since most, if not all, users wouldn't be using these functions but only the APIs directly related to the factories.
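A per-type function along the lines guacs suggests could look like the following sketch. Everything here is hypothetical: the name, the fallback bounds used when a limit is absent, and the simplifying assumption that multiple_of, when given, is positive.

```python
from random import Random
from typing import Optional

def generate_constrained_integer(
    random: Random,
    minimum: Optional[int] = None,
    maximum: Optional[int] = None,
    multiple_of: Optional[int] = None,
) -> int:
    # The strategy is chosen from the constraints themselves rather than an
    # opaque ``method`` argument, and the result is deterministic given a
    # seeded Random instance. Assumes multiple_of, if given, is positive.
    if multiple_of is not None:
        # Ceiling/floor division maps the bounds onto multiplier bounds.
        low = 1 if minimum is None else -(-minimum // multiple_of)
        high = low + 10 if maximum is None else maximum // multiple_of
        return random.randint(low, high) * multiple_of
    return random.randint(
        minimum if minimum is not None else -10,
        maximum if maximum is not None else 10,
    )
```

Because the only source of randomness is the passed-in Random instance, two calls with identically seeded instances reproduce the same value, which satisfies the determinism requirement above.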

@guacs
Member

guacs commented Mar 13, 2024

Also, I think we should be fine as long as reasonable scenarios are working. If Hypothesis keeps coming up with extreme cases that won't work, then we can just restrict Hypothesis or something along those lines and document that such extreme cases may fail.

@richardxia
Author

@guacs, thanks for taking the time to review my PR and leaving comments. What you said sounds good to me, in terms of deprecating the method argument to generate_constrained_number(). I think I'll probably be busy for the rest of the work week, but I may have some time over the weekend to revisit this and hopefully come up with a simpler change.

@richardxia richardxia force-pushed the fix-gen-constrained-number-ge-multipleof branch from a65482d to 8881603 Compare March 18, 2024 07:18
@@ -1,8 +1,9 @@
from __future__ import annotations

from decimal import Decimal
from math import ceil, floor, ulp
Author


It looks like I'm failing the Python 3.8 tests because I'm importing ulp, which was only added in Python 3.9. I'll probably have to implement an alternative way of computing the correct increment for a given floating-point number.

@richardxia
Author

I pushed a completely rewritten solution to this branch, but it looks like I'm currently failing the Python 3.8 tests because I accidentally used a function that was only implemented in Python 3.9. I'll have to revisit this tomorrow.

In the meantime, please let me know if the new changes I made look like they're in the right direction. I made some breaking changes to the functions in constrained_numbers.py. I wasn't sure if it was important to keep these functions backwards-compatible, but let me know if it is, and I can think about how to do that. Preserving backwards compatibility for get_increment() will be tricky, since it's technically impossible for that function to do its job correctly with its current signature.

Python 3.8 doesn't have the ulp() function, so we need to manually
compute an increment ourselves.

Documentation preview will be available shortly at https://litestar-org.github.io/polyfactory-docs-preview/505

}
return cast("T", values[t_type])
# See https://github.com/python/mypy/issues/17045 for why the redundant casts are ignored.
if t_type == int:
Member


All these checks should maybe use is_safe_subclass so that we handle custom int as well. Same for float and Decimal.

Member

@guacs guacs left a comment


@richardxia thanks for this! I think this looks good. We just have to make sure that any functions that we're changing (such as names or the arguments it takes) should be deprecated and we create new functions as replacements for them. Once that's done, I think this is good to go :)

Comment on lines +130 to +139
# When ``value`` is large in magnitude, we need to choose an increment that is large enough
# to not be rounded away, but when ``value`` is small in magnitude, we need to prevent the
# increment from vanishing. ``float_info.epsilon`` is defined as the smallest delta that
# can be represented between 1.0 and the next largest number, but it's not sufficient for
# larger values. We instead open up the floating-point representation to grab the exponent
# and calculate our own increment. This can be replaced with ``math.ulp`` in Python 3.9 and
# later.
_, exp = frexp(value)
increment = float_info.radix ** (exp - float_info.mant_dig + 1)
return cast("T", max(increment, float_info.epsilon))
Member


I think it's probably fine to use math.ulp in the case of Python 3.9 and then fall back to this in case it's not 3.8. We have a constant called PY_38 in polyfactory.constants.py that you can use to handle this. Or maybe even something like:

if PY_38:
    def get_float_increment(value: float) -> float:
        # the current logic
else:
    def get_float_increment(value: float) -> float:
        # math.ulp based logic

Then we can just call this function without having to worry about the Python version.



def get_increment(t_type: type[T]) -> T:
def get_increment(value: T, t_type: type[T]) -> T:
Member


This would be a breaking change. Instead, we should have another get_increment_v2 function and deprecate this one.

@@ -99,8 +100,8 @@ def is_multiply_of_multiple_of_in_range(
return False


def passes_pydantic_multiple_validator(value: T, multiple_of: T) -> bool:
"""Determine whether a given value passes the pydantic multiple_of validation.
def is_almost_multiple_of(value: T, multiple_of: T) -> bool:
Member


We should keep passes_pydantic_multiple_validator as it is and deprecate it, and then create a new is_almost_multiple_of function as a replacement.

Member


Actually, we could just do something like this:

def passes_pydantic_multiple_validator(value: T, multiple_of: T) -> bool:
    return is_almost_multiple_of(value, multiple_of)

That is, changing the logic in it is fine since it's doing the same, but we would need to keep the old name around.

@@ -210,33 +225,36 @@ def get_constrained_number_range(
return minimum, maximum


def generate_constrained_number(
def generate_constrained_multiple_of(
Member


Same as above regarding this being a breaking change.
