Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17181: Using apache commons implementation for wildcard matching for glob patterns #2301

Merged
merged 5 commits into from
Feb 26, 2024

Conversation

justinrsweeney
Copy link
Contributor

@justinrsweeney justinrsweeney commented Feb 26, 2024

https://issues.apache.org/jira/browse/SOLR-17181

Description

This is intended to fix a bug raised where glob pattern matching suffered a performance degradation after #1996.

Solution

This reverts back to using the Apache Commons IO FilenameUtils.wildcardMatch method to do glob pattern matching. This is the original implementation used prior to 9.5 and thus should match performance. This does introduce commons-io as a new library for solrj since glob pattern matching is needed in solrj as well to be available for all use cases.

Tests

Using existing tests to validate functionality.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@janhoy
Copy link
Contributor

janhoy commented Feb 26, 2024

We just got rid of commons-io, https://issues.apache.org/jira/browse/SOLR-15784

Can you find an alternative that is already a dependency, or copy just the code we need into one of Solr's util classes?

@madrob
Copy link
Contributor

madrob commented Feb 26, 2024

Do we have a test case here that includes Unicode in the file name? For example https://docs.modular.com/mojo/faq.html#why-does-mojo-have-the-file-extension

Edit: that might not apply, I skimmed the change without realising we're talking about field names here

@justinrsweeney
Copy link
Contributor Author

We just got rid of commons-io, https://issues.apache.org/jira/browse/SOLR-15784

Can you find an alternative that is already a dependency, or copy just the code we need into one of Solr's util classes?

Updated this to just reuse the applicable code from commons-io without the dependency. I'd like to use that code since we know it will perform the same as previous Solr versions.

@justinrsweeney
Copy link
Contributor Author

Do we have a test case here that includes Unicode in the file name? For example https://docs.modular.com/mojo/faq.html#why-does-mojo-have-the-file-extension

Edit: that might not apply, I skimmed the change without realising we're talking about field names here

Not including a unicode test here since field names should be alphanumeric and that is what current use cases are checking.

@justinrsweeney
Copy link
Contributor Author

justinrsweeney commented Feb 26, 2024

@risdenk I noticed that for the one test failing that seems to be unrelated to changes here, but more of a random occurrence. Looking at the code, I think

params("q", "*:*", "rows", "10000", "sort", "_version_ desc"), client0, client1);
should be changed to sort by ID since the failure here is just an issue with ordering of the documents in the responses. What do you think?

@risdenk
Copy link
Contributor

risdenk commented Feb 26, 2024

@justinrsweeney would probably open a separate MR to fix that

Copy link
Contributor

@janhoy janhoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@@ -27,11 +28,161 @@ public class GlobPatternUtil {
* PathMatcher to match glob patterns in the same way to how glob patterns are matches for file
* paths, rather than implementing our own glob pattern matching.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the javadoc to reflect the change

@justinrsweeney justinrsweeney merged commit d3a5908 into main Feb 26, 2024
3 of 4 checks passed
@justinrsweeney justinrsweeney deleted the SOLR-17181-glob-pattern-performance-degradation branch February 26, 2024 22:34
justinrsweeney pushed a commit that referenced this pull request Feb 26, 2024
… for glob patterns (#2301)

* Using apache commons implementation for wildcard matching for glob patterns

* Removing commons io dependency and instead reusing applicable code from that library

* Fixing comments with Javadoc links

* Actually fixing comments with Javadoc links

* Fixing javadoc comments
@gus-asf
Copy link
Contributor

gus-asf commented Feb 27, 2024

We just got rid of commons-io, https://issues.apache.org/jira/browse/SOLR-15784
Can you find an alternative that is already a dependency, or copy just the code we need into one of Solr's util classes?

Updated this to just reuse the applicable code from commons-io without the dependency. I'd like to use that code since we know it will perform the same as previous Solr versions.

Hmm, problem with copying code out of other libraries is we won't get any bug-fixes with new versions or be aware of relevant CVE's in the copied code. Might be better to just have the dependency?

@janhoy
Copy link
Contributor

janhoy commented Feb 27, 2024

Might be better to just have the dependency?

For SolrJ we really have to keep removing external deps, not adding. We can of course keep looking for a performant glob method in one of the other existing libs we have in SolrJ as a replacement. But it is not worth re-adding entire commons-io for this one method imo.

@justinrsweeney
Copy link
Contributor Author

The functionality we are trying to implement here is not super complicated so I don't think we likely see it change that much in the future or needing to incorporate commons-io changes necessarily. The FilenameUtils approach is mostly just a convenience that it does what we need and could certainly be changed replaced on its own in the Solr codebase.

@gus-asf
Copy link
Contributor

gus-asf commented Feb 28, 2024

None the less there should be some mechanism to track it so that if a CVE for commons IO does come out somebody looks at it to verify that that CVE has nothing to do with glob matching. This particular case probably low risk, but pulling code out of dependencies doesn't reduce our exposure to CVE's in other dependencies, and there's also still a license question. I think we track NOTICE.txt entries based on what libs we use right? Well we still need to atribute this, and I'm pretty sure this PR is not complying with the final bullet here:

  1. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
  • You must give any other recipients of the Work or Derivative Works a copy of this License; and
  • You must cause any modified files to carry prominent notices stating that You changed the files; and
  • You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
  • If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License

If we put an entry back in NOTICE.txt, how to we keep it from getting removed when someone reviews dependency information.

@janhoy
Copy link
Contributor

janhoy commented Feb 28, 2024

Are you suggesting we add a snippet to solr/NOTICE.txt with a mention of the copied content, in addition to the generic attribution we have in place for commons-in?

@gus-asf
Copy link
Contributor

gus-asf commented Feb 28, 2024

I haven't taken time to verify it but I presume the removal of commons-io would have also removed references to it from NOTICE.txt? If we are pulling their code back in that has to return. If it was missed and still is there, that's good. Either way, there's a risk that at a later date someone will audit NOTICE.txt, not find commons-io in the output of ./gradlew dependencies and clean it up... how do we prevent that? (I don't see any changes to NOTICE.txt in this pr)

@janhoy
Copy link
Contributor

janhoy commented Feb 28, 2024

Commons-io is still a core dependency and will likely be forever, so both license and notice is still there.

But it is not a dependency of solrj.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants