New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Add a flexible way to assign files with arbitrary names to languages #4945
base: master
Are you sure you want to change the base?
Conversation
TODO don't use Java nio glob implementation. It's FS dependent and has shitty error messages
Generated by 🚫 Danger |
|
||
|
||
Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and | ||
will not allow analyzing several languages in one go: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to give less space to --force-language
in this documentation as it is really inferior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?
If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. You are probably right that we should deprecate it, I don't see a use case for it anymore
// Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded. | ||
addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done from the language module itself, the same way it registers versions and CPD / PMD capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess so, but the order in which languages are loaded is not specified so it isn't clear which pattern would win in case of conflicts. So for now I think it is a good start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we could drop the semantics of ordering, and say that if several patterns match, we log a warning and ignore the file. This is already what happens with the current file extension matching. But I find it intuitive and useful to have an ordering between patterns, because it allows a user to override the default patterns...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could try something different…
All the cases I've identified so far are suffixes, may that be:
- a standard language extension (ie:
.xml
) - a standard file name (ie:
pom.xml
), which is the extreme case of a "suffix" - a longer suffix (ie: SF flow files:
.flow-meta.xml
)
I'd say that that we have multiple valid options:
- processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)
- alternatively, if we assume it's always file suffixes, we don't need a regex, but a string, and can decide on applying the most-specific (longest) one, so
pom.xml
or.flow-meta.xml
would always have precedence over standard.xml
irrespective of the order of definition.
Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)
Tbh I don't think this should be done by applying several languages to the same file. I'd like to be able to say that the language POM is a dialect of the language XML. Then XML rules apply on a POM file because POM is-a XML, but there is still exactly one language that gets picked, namely POM because it is the most specific. This means the file is only parsed once. It would also allow to share other things between XML dialects, eg parser and other language extensions.
Assuming all .xml files are XML files (or some dialect of it, like POM for pom.xml) we can arbitrate conflicts using the specificity of the language. But the user should still be able to override the default mappings I think.
Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.
I think it's a good thing to have a separate user-defined mapping. That way we don't need to modify the Language instances themselves at runtime. And the user can override mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.
About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.
In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)
Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).
Longer term (outside of this PR), I guess, we need to think about:
- can a file have multiple languages and therefore need to be parsed multiple times?
- should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?
- How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to convert the extensions into LanguageFilePatterns
If you want me to make this change in this PR I will, but as shown here it works without, and that is an implementation detail.
The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)
That's why I think the more intuitive order is that the last pattern added wins. Intuitively it is just overriding the patterns that are already there. It also makes sense because users that add patterns on the CLI are appending patterns to this list, which overrides the existing patterns.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).
IMO the point of defining an ordering is to be able to override existing patterns (like saying my XML files are actually some dialect of XML). Overriding means, ignoring other patterns that matched and taking precedence over them. Therefore for this to be useful there will necessarily be other patterns that matched, and it is a normal occurence. We should log this in debug, as we log other file-language assignments, but I would not print a warning for what is the normal operation of things.
@@ -88,6 +98,24 @@ protected void setPositionalInputPaths(final List<Path> inputPaths) { | |||
this.setInputPaths(inputPaths); | |||
} | |||
|
|||
private Path ignoreListPath; | |||
|
|||
@Option(names = "--ignore-list", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like --exclude
which we already have?
ok, it seems we named it "exclude" in CPD and "ignore-list" in PMD. So, we have here another opportunity to streamline CLI paramaters...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CPD's --exclude
does not take a file list but a path. Maybe we should add --exclude
to PMD?
@@ -59,6 +59,18 @@ The tool comes with a rather extensive help text, simply running with `--help`! | |||
<p>See also [Providing the auxiliary classpath](pmd_languages_java.html#providing-the-auxiliary-classpath).</p>" | |||
languages="Java" | |||
%} | |||
{% include custom/cli_option_row.html options="--assign-language" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to add the same for CPD CLI doc.
|
||
|
||
Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and | ||
will not allow analyzing several languages in one go: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?
If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.
@@ -208,6 +208,19 @@ protected void checkLanguageIsAcceptable(Language lang) throws UnsupportedOperat | |||
// do nothing | |||
} | |||
|
|||
/** | |||
* Add a pattern that will be matched to a language. If the language is unknown, | |||
* return false. File patterns are matched in the reverse order they were added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"... matched in reverse order..." - Can we make it the other way round? IMHO this makes it a bit more difficult to grasp. It would be easier, if we said "language file patterns are applied in the order they are added here - the first match wins".
Maybe we can give the impl. detail, that patterns from CLI are added first and take therefore precedence over the default patterns from the language modules.
Not sure, if that makes the implementation more complicated or simpler - I didn't look at the actual implementation yet.
Ok, so the patterns from CLI are actually added in AbstractAnalysisPmdSubcommand - in the order they are given in the CLI. I would find it more intuitive, if we preserve the order - the first match given with "--assign-language" wins.
public final class PathMatcher implements Predicate<String> { | ||
|
||
private final Pattern pattern; | ||
private final String glob; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private final String glob; | |
private final String regex; |
glob patterns usually work a bit different (e.g. "*"
itself stands for one or more characters, while for regex the same is ".*"
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to add an example on how to externalize the CLI parameters and load them with @
...
Otherwise we didn't fix #461
extension will be recognized. | ||
|
||
This option may be repeated to make several assignments. If several regexes match the same file, only | ||
the last occurrence of the option is taken into account. If the language is not loaded, the assignment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for defining the first match/occurrence wins...
for (LanguageFilePattern pat : languageFilePatterns) { | ||
Language lang = configuration.getLanguageRegistry().getLanguageById(pat.getLanguageid()); | ||
if (lang == null) { | ||
configuration.getReporter().warn("Language {0} mentioned in --assign-language option is not loaded.", pat.getLanguageid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it just not loaded or does it not exist? On the CLI, you can provide arbitrary strings...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't know in the CLI :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a warning or a ParameterException
? Actually, why is this not validated directly in LanguageFilePatternConverter
? Rather than have langId
be a String
use the Language
and have it strongly typed in LanguageFilePattern
. At the end of the day, te validation the Language
exists should still be needed regardless of the entry point (CLI in this case). Requiring a non-null Language
instance to create the LanguageFilePattern
enforces that throughout.
Also, I'd speak of a language being "unknown" instead of loaded / not loaded. This would be consistent with what we currently throw when we set an invalid language elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a ParameterException yes...
I thought it would be useful to be able to define patterns even for languages that are not loaded... The use case I had in mind is adding patterns for POM even if it is not loaded, so that XML does not take precedence. If we make languages register their patterns upon loading then that should not be a valid use case anymore.
// Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded. | ||
addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.
About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.
In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)
Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).
Longer term (outside of this PR), I guess, we need to think about:
- can a file have multiple languages and therefore need to be parsed multiple times?
- should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?
- How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?
// match patterns from most recent to most ancient | ||
for (LanguageFilePattern pat : IteratorUtil.asReversed(languageFilePatterns)) { | ||
if (pat.matches(fileName)) { | ||
return pat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above, we should maybe return here a list of all matched patterns, so that users get a warning, if there are multiple different languages matched.
Describe the PR
This adds this functionality to LanguageVersionDiscoverer. We have a list of (patterns, language) tuples that are matched in order. The first pattern that matches decides the language of the file. This is wired into AbstractConfiguration and the CLI. It is also supported in CPD for free.
The ticket #461 asked about a preference file. This is sort-of possible through
@
-files of PicoCLI. Maybe this should also be documentedRelated issues
Ready?
./mvnw clean verify
passes (checked automatically by github actions)