Skip to content

Display extra-words in detection_log if present #4402

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

alok1304
Copy link
Contributor

@alok1304 alok1304 commented May 27, 2025

Fixes #4400

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Signed-off-by: Alok Kumar [email protected]

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch from ff9eaec to 2a4c4c6 Compare May 27, 2025 10:22
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alok1304 thanks! See comments for changes.

Please always run a subset of the tests which could be effected locally and inspect failures before raising a PR, and make sure you also regenerate test expectations for these failures, and verify that they are intended. See https://scancode-toolkit.readthedocs.io/en/stable/contribute/contrib_dev.html#running-tests

Please also add a test from #4400 just focused on extra words, without tests it's much more work for reviewers to verify this is working correctly.

@alok1304 alok1304 closed this May 27, 2025
@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch from 4042c2b to a4415e7 Compare May 27, 2025 14:58
@alok1304 alok1304 reopened this May 27, 2025
@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch 2 times, most recently from 1e24b13 to 49bac47 Compare May 27, 2025 16:16
@alok1304
Copy link
Contributor Author

still, we are not getting extra-words in detection_log because of this

 "matcher": "2-aho",

I am trying to find examples files where extra-words are present.

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch from c33c3c0 to bad458f Compare May 27, 2025 18:49
@alok1304
Copy link
Contributor Author

@AyanSinhaMahapatra I added test when we get extra-words and one more things, this happens only when matcher is 3-seq.

{
      "identifier": "bsd_new-95249a8d-f533-e7c7-159a-9b6e173cba42",
      "license_expression": "bsd-new",
      "license_expression_spdx": "BSD-3-Clause",
      "detection_count": 3,
      "detection_log": [
        "extra-words"
      ],
      "reference_matches": [
        {
          "license_expression": "bsd-new",
          "license_expression_spdx": "BSD-3-Clause",
          "from_file": "diff-4.0.2.tgz/package/LICENSE",
          "start_line": 1,
          "end_line": 31,
          "matcher": "3-seq",
          "score": 93.89,
          "matched_length": 215,
          "match_coverage": 100.0,
          "rule_relevance": 100,
          "rule_identifier": "bsd-new_578.RULE",
          "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/bsd-new_578.RULE",
          "matched_text": "Software License Agreement (BSD License)\n\nCopyright (c) 2009-2015, Kevin Decker <[email protected]>\n\nAll rights reserved.\n\nRedistribution and use of this software in source and binary forms, with or without modification,\nare permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above\n  copyright notice, this list of conditions and the\n  following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n  copyright notice, this list of conditions and the\n  following disclaimer in the documentation and/or other\n  materials provided with the distribution.\n\n* Neither the name of Kevin Decker nor the names of its\n  contributors may be used to endorse or promote products\n  derived from this software without specific prior\n  written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR\nIMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND\nFITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR\nCONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,\nDATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER\nIN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT\nOF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.",
          "matched_text_diagnostics": "Software License Agreement (BSD License)\n\n[Copyright] ([c]) [2009]-[2015], [Kevin] [Decker] <[kpdecker]@[gmail].[com]>\n\n[All] [rights] [reserved].\n\nRedistribution and use [of] [this] [software] in source and binary forms, with or without modification,\nare permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above\n  copyright notice, this list of conditions and the\n  following disclaimer.\n\n* Redistributions in binary form must reproduce the above\n  copyright notice, this list of conditions and the\n  following disclaimer in the documentation and/or other\n  materials provided with the distribution.\n\n* Neither the name of [Kevin] [Decker] nor the names of its\n  contributors may be used to endorse or promote products\n  derived from this software without specific prior\n  written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR\nIMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND\nFITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR\nCONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,\nDATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER\nIN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT\nOF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."
        }
      ]
    }
    ```

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch from bad458f to 7516775 Compare May 27, 2025 19:00
@alok1304
Copy link
Contributor Author

These failing test cases are passing on my system see

(venv) PS C:\Users\lenovo\Documents\GitHub\web dev\scancode-toolkit> pytest tests/scancode/test_cli.py                                
======================================================== test session starts =========================================================
platform win32 -- Python 3.11.0, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\lenovo\Documents\GitHub\web dev\scancode-toolkit, configfile: pyproject.toml
plugins: forked-1.4.0, xdist-2.5.0
collecting ...
74 tests selected, 5 tests skipped.
collected 79 items / 5 deselected / 74 selected

tests\scancode\test_cli.py ............................s....s.....ss.................................                           [100%]

====================================== 70 passed, 4 skipped, 5 deselected in 194.73s (0:03:14) ======================================= 

Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alok1304, looking good mostly, ready to merge with a couple changes, see comments for more details.

Don't worry about the extra test failures, there are there because of #4369, in the tests where we perform a force upgrade of all our dependencies and run the tests there. See https://github.com/aboutcode-org/scancode-toolkit/blob/develop/azure-pipelines.yml#L224

@@ -1726,7 +1734,7 @@ def analyze_detection(license_matches, package_license=False):
):
return DetectionCategory.LICENSE_CLUES.value

# Case where all matches have `matcher` as `1-hash` or `4-spdx-id`
# Case where all matches have `matcher` as `1-hash` or `4-spdx-id` or 2-aho
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct, could you revert is_correct_detection back to:
all(matcher in ("1-hash", "1-spdx-id") for matcher in matchers)

The 2-aho cases are meant to be caught below here, only if all other cases in between are not present:

    # Cases where Match Coverage is a perfect 100 for all matches
    else:
          return DetectionCategory.PERFECT_DETECTION.value

The only bug we needed to fix was in get_detected_license_expression, where we were missing catching the analysis == DetectionCategory.EXTRA_WORDS.value and thus the detection log not being populated..

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch 2 times, most recently from 5204567 to 197b261 Compare May 28, 2025 15:01
@alok1304
Copy link
Contributor Author

These failing test cases happening when i change this all(matcher in ("1-hash", "1-spdx-id", "2-aho") for matcher in matchers) to all(matcher in ("1-hash", "1-spdx-id") for matcher in matchers) in is_correct_detection.

see: https://github.com/aboutcode-org/scancode-toolkit/pull/4402/files#diff-5f6ccc1d8f9540cae715bddaf337246f4c69c48ca596b2086e05b18c541aef4aL1074

Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, please see comment below.

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch 2 times, most recently from b15bfee to 8f29232 Compare May 29, 2025 05:12
@alok1304
Copy link
Contributor Author

alok1304 commented May 29, 2025

@AyanSinhaMahapatra I did all modifications or changes that you said, in case of 2-aho matcher, we didn't get extra-words in detection_log, otherwise all good.
Thanks for the review :)

Also, I squash all my commits into a single commit.

@AyanSinhaMahapatra
Copy link
Member

@alok1304 could you add the 2-aho test back? Also we need to cover both cases here for detection_log and see that it does not affect other tested cases.

@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch 2 times, most recently from b253cd9 to 0532da8 Compare May 31, 2025 11:14
@alok1304 alok1304 force-pushed the improve-detection-log-for-extra-words branch from d8acde4 to 0b2d3a3 Compare May 31, 2025 11:52
@alok1304
Copy link
Contributor Author

Hii @AyanSinhaMahapatra I do this things,
is_correct_detection This function is used in many places, so I write a new function i.e, is is_correct_detection_2 This includes
only 1-hash or 1-spdx-id.
Now, we got extra-words for both 2-aho and 3-seq. Also, all test cases are passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extra-words does not show up in detection_log properly
2 participants