Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Add a flexible way to assign files with arbitrary names to languages #4945

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

oowekyala
Copy link
Member

Describe the PR

This adds this functionality to LanguageVersionDiscoverer. We have a list of (patterns, language) tuples that are matched in order. The first pattern that matches decides the language of the file. This is wired into AbstractConfiguration and the CLI. It is also supported in CPD for free.

The ticket #461 asked about a preference file. This is sort-of possible through @-files of PicoCLI. Maybe this should also be documented

Related issues

Ready?

  • Add tests for CPD I suppose
  • Review documentation
  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

@pmd-test
Copy link

pmd-test commented Apr 9, 2024

1 Message
📖 Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Download full report as build artifact

Generated by 🚫 Danger



Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and
will not allow analyzing several languages in one go:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to give less space to --force-language in this documentation as it is really inferior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?

If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. You are probably right that we should deprecate it, I don't see a use case for it anymore

Comment on lines +50 to +51
// Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded.
addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done from the language module itself, the same way it registers versions and CPD / PMD capabilities.

Copy link
Member Author

@oowekyala oowekyala Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so, but the order in which languages are loaded is not specified so it isn't clear which pattern would win in case of conflicts. So for now I think it is a good start.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could drop the semantics of ordering, and say that if several patterns match, we log a warning and ignore the file. This is already what happens with the current file extension matching. But I find it intuitive and useful to have an ordering between patterns, because it allows a user to override the default patterns...

Copy link
Member

@jsotuyod jsotuyod Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could try something different…

All the cases I've identified so far are suffixes, may that be:

  • a standard language extension (ie: .xml)
  • a standard file name (ie: pom.xml), which is the extreme case of a "suffix"
  • a longer suffix (ie: SF flow files: .flow-meta.xml)

I'd say that that we have multiple valid options:

  • processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)
  • alternatively, if we assume it's always file suffixes, we don't need a regex, but a string, and can decide on applying the most-specific (longest) one, so pom.xml or .flow-meta.xml would always have precedence over standard .xml irrespective of the order of definition.

Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)

Tbh I don't think this should be done by applying several languages to the same file. I'd like to be able to say that the language POM is a dialect of the language XML. Then XML rules apply on a POM file because POM is-a XML, but there is still exactly one language that gets picked, namely POM because it is the most specific. This means the file is only parsed once. It would also allow to share other things between XML dialects, eg parser and other language extensions.

Assuming all .xml files are XML files (or some dialect of it, like POM for pom.xml) we can arbitrate conflicts using the specificity of the language. But the user should still be able to override the default mappings I think.

Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.

I think it's a good thing to have a separate user-defined mapping. That way we don't need to modify the Language instances themselves at runtime. And the user can override mappings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.

About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.

In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

Longer term (outside of this PR), I guess, we need to think about:

  • can a file have multiple languages and therefore need to be parsed multiple times?
  • should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?
  • How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to convert the extensions into LanguageFilePatterns

If you want me to make this change in this PR I will, but as shown here it works without, and that is an implementation detail.

The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

That's why I think the more intuitive order is that the last pattern added wins. Intuitively it is just overriding the patterns that are already there. It also makes sense because users that add patterns on the CLI are appending patterns to this list, which overrides the existing patterns.

So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

IMO the point of defining an ordering is to be able to override existing patterns (like saying my XML files are actually some dialect of XML). Overriding means, ignoring other patterns that matched and taking precedence over them. Therefore for this to be useful there will necessarily be other patterns that matched, and it is a normal occurence. We should log this in debug, as we log other file-language assignments, but I would not print a warning for what is the normal operation of things.

@@ -88,6 +98,24 @@ protected void setPositionalInputPaths(final List<Path> inputPaths) {
this.setInputPaths(inputPaths);
}

private Path ignoreListPath;

@Option(names = "--ignore-list",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like --exclude which we already have?

ok, it seems we named it "exclude" in CPD and "ignore-list" in PMD. So, we have here another opportunity to streamline CLI paramaters...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CPD's --exclude does not take a file list but a path. Maybe we should add --exclude to PMD?

@@ -59,6 +59,18 @@ The tool comes with a rather extensive help text, simply running with `--help`!
<p>See also [Providing the auxiliary classpath](pmd_languages_java.html#providing-the-auxiliary-classpath).</p>"
languages="Java"
%}
{% include custom/cli_option_row.html options="--assign-language"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to add the same for CPD CLI doc.



Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and
will not allow analyzing several languages in one go:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?

If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.

@@ -208,6 +208,19 @@ protected void checkLanguageIsAcceptable(Language lang) throws UnsupportedOperat
// do nothing
}

/**
* Add a pattern that will be matched to a language. If the language is unknown,
* return false. File patterns are matched in the reverse order they were added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... matched in reverse order..." - Can we make it the other way round? IMHO this makes it a bit more difficult to grasp. It would be easier, if we said "language file patterns are applied in the order they are added here - the first match wins".

Maybe we can give the impl. detail, that patterns from CLI are added first and take therefore precedence over the default patterns from the language modules.

Not sure, if that makes the implementation more complicated or simpler - I didn't look at the actual implementation yet.

Ok, so the patterns from CLI are actually added in AbstractAnalysisPmdSubcommand - in the order they are given in the CLI. I would find it more intuitive, if we preserve the order - the first match given with "--assign-language" wins.

public final class PathMatcher implements Predicate<String> {

private final Pattern pattern;
private final String glob;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final String glob;
private final String regex;

glob patterns usually work a bit different (e.g. "*" itself stands for one or more characters, while for regex the same is ".*").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add an example on how to externalize the CLI parameters and load them with @...

Otherwise we didn't fix #461

extension will be recognized.

This option may be repeated to make several assignments. If several regexes match the same file, only
the last occurrence of the option is taken into account. If the language is not loaded, the assignment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for defining the first match/occurrence wins...

for (LanguageFilePattern pat : languageFilePatterns) {
Language lang = configuration.getLanguageRegistry().getLanguageById(pat.getLanguageid());
if (lang == null) {
configuration.getReporter().warn("Language {0} mentioned in --assign-language option is not loaded.", pat.getLanguageid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just not loaded or does it not exist? On the CLI, you can provide arbitrary strings...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't know in the CLI :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning or a ParameterException? Actually, why is this not validated directly in LanguageFilePatternConverter? Rather than have langId be a String use the Language and have it strongly typed in LanguageFilePattern. At the end of the day, te validation the Language exists should still be needed regardless of the entry point (CLI in this case). Requiring a non-null Language instance to create the LanguageFilePattern enforces that throughout.

Also, I'd speak of a language being "unknown" instead of loaded / not loaded. This would be consistent with what we currently throw when we set an invalid language elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a ParameterException yes...

I thought it would be useful to be able to define patterns even for languages that are not loaded... The use case I had in mind is adding patterns for POM even if it is not loaded, so that XML does not take precedence. If we make languages register their patterns upon loading then that should not be a valid use case anymore.

Comment on lines +50 to +51
// Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded.
addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.

About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.

In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

Longer term (outside of this PR), I guess, we need to think about:

  • can a file have multiple languages and therefore need to be parsed multiple times?
  • should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?
  • How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?

// match patterns from most recent to most ancient
for (LanguageFilePattern pat : IteratorUtil.asReversed(languageFilePatterns)) {
if (pat.matches(fileName)) {
return pat;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, we should maybe return here a list of all matched patterns, so that users get a warning, if there are multiple different languages matched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants