Skip to content

Support minLength and maxLength for stringMatching #5562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
TomerAberbach opened this issue Dec 26, 2024 · 5 comments
Open

Support minLength and maxLength for stringMatching #5562

TomerAberbach opened this issue Dec 26, 2024 · 5 comments
Labels
💡 Idea to investigate Non straightforward features that seems great but need to be assessed and designed carefully.

Comments

@TomerAberbach
Copy link
Contributor

TomerAberbach commented Dec 26, 2024

🚀 Feature Request

stringMatching generates based on the regular expression, but doesn't provide a way to set a min or max length on the generated strings.

Motivation

Sometimes you want to test specific lengths of a string matching a regex, but modifying a regular expression to have a specific min and/or max length can be pretty hard and confusing. It would be a lot easier to just be able to pass minLength and maxLength like the string arbitrary.

Example

const nameRegex = /^[a-z]+$/

// Test different lengths without having to modify regex
fc.stringMatching(nameRegex, { minLength: 10 })
fc.stringMatching(nameRegex, { maxLength: 10 })
fc.stringMatching(nameRegex, { minLength: 5, maxLength: 10 })
@dubzzz
Copy link
Owner

dubzzz commented Jan 6, 2025

So far the only option I have for it is to play with .filter but unfortunately performance-wise it will not be ideal. I'd need to think about a way to make such constraint more optimal before offering it on the arbitrary.

@gruhn
Copy link
Contributor

gruhn commented Jan 7, 2025

Hmm, interesting problem. Regular expressions are closed under intersection. So one idea might be to "just" compute the intersection of nameRegex and for example /^.{5,10}$/. That gives a new regular expression which is restricted to strings with min length 5 and max length 10. I think this can be done in ~O(n^2). Not super cheap but it's an upfront cost and maybe negligible for small user defined regular expressions.

One way to do it might be:

  1. Convert both regular expressions to NFAs (I think: O(n))
  2. Compute NFA intersection (I think: O(n^2))
  3. Convert resulting NFA back to regular expression (I think: O(n))

Any route involving DFAs seems to come with the danger of exponential blow up.

For the regular expressions from formal language theory, this is maybe not too complicated but I can imagine you need a ton of case distinctions for the the full JavaScript regular expressions. So not sure if this is the best approach.

@TomerAberbach
Copy link
Contributor Author

This certainly sounds quite tricky! Thanks for all the research here :)

By the way @dubzzz, in case it changes priority, in addition to needing this for typespec-fast-check, we actually coincidentally needed the same exact thing at Stainless (where I work).

We were trying to automatically generate example values for OpenAPI spec schemas like the following (this is just an example schema):

type: string
pattern: [a-z]+
minLength: 3
maxLength: 10

I ended doing fc.stringMatching(...).filter(s => s.length >= minLength && s.length <= maxLength) and passing that to fc.sample.

It's still very quick, but I only needed to generate one value from fc.sample so that probably helped

@dubzzz dubzzz added 💡 Idea 💡 Idea to investigate Non straightforward features that seems great but need to be assessed and designed carefully. labels Mar 1, 2025
@dubzzz dubzzz removed the 💡 Idea label Mar 1, 2025
Copy link

stale bot commented Apr 30, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 30, 2025
@dubzzz dubzzz removed the stale label May 1, 2025
@gruhn
Copy link
Contributor

gruhn commented May 7, 2025

@TomerAberbach It's been a few months so I'm not sure if you still bother 😅 but you nerd sniped me with this problem. I really wanted to see if this regex intersection approach could work. Turns out there are almost no implementations for this (in any programming language). I thought this is standard computer science blabla? So I started hacking on a library myself. The API looks like this:

import { intersection } from '@gruhn/regex-utils'

const combinedRegex = intersection(/^[a-z]+$/, /^.{3,10}$/)

Preliminary results are not too bad. I created this benchmark that generates random emails addresses with 3-10 characters. Once by going the generate+filter route:

fc.sample(
    fc.stringMatching(emailRegex).filter(
        str => 3 <= str.length && str.length <= 10
   ), 
   sampleCount
)

And once by computing the intersection with the regex /^.{3,10}$/ first and then sampling directly without filtering:

fc.sample(
    fc.stringMatching(intersection(/^.{3,10}$/, emailRegex)), 
    sampleCount
)

Predictably, filtering is faster for smaller samples sizes because computing the intersection has quite some overhead. But for larger sample sizes it's clearly much faster:

sample count: 10
time (post-hoc filter)    :  84 ms
time (regex intersection) :  506 ms

sample count: 50
time (post-hoc filter)    :  272 ms
time (regex intersection) :  467 ms

sample count: 100
time (post-hoc filter)    :  571 ms
time (regex intersection) :  455 ms

sample count: 500
time (post-hoc filter)    :  3831 ms
time (regex intersection) :  473 ms

sample count: 1000
time (post-hoc filter)    :  7802 ms
time (regex intersection) :  467 ms

sample count: 2000
time (post-hoc filter)    :  17397 ms
time (regex intersection) :  479 ms

Even bigger samples size only with intersection:

sample count: 10_000
time (regex intersection) :  604 ms

sample count: 20_000
time (regex intersection) :  723 ms

sample count: 50_000
time (regex intersection) :  959 ms

sample count: 100_000
time (regex intersection) :  1351 ms

Although this is exciting, I would enjoy this with caution. It's still very easy to find inputs where this intersection function completely blows up for whatever reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💡 Idea to investigate Non straightforward features that seems great but need to be assessed and designed carefully.
Projects
None yet
Development

No branches or pull requests

3 participants