[core] Add a flexible way to assign files with arbitrary names to languages #4945

oowekyala · 2024-04-09T16:37:41Z

Describe the PR

This adds this functionality to LanguageVersionDiscoverer. We have a list of (patterns, language) tuples that are matched in order. The first pattern that matches decides the language of the file. This is wired into AbstractConfiguration and the CLI. It is also supported in CPD for free.

The ticket #461 asked about a preference file. This is sort-of possible through @-files of PicoCLI. Maybe this should also be documented

Related issues

Fix [xml] Please externalize the mapping of file extensions to languages as preference file #461
Fix [pom] Maven POM Rules don't check files named "pom.xml" #1540

Ready?

Add tests for CPD I suppose
Review documentation
Added unit tests for fixed bug/feature
Passing all unit tests
Complete build ./mvnw clean verify passes (checked automatically by github actions)
Added (in-code) documentation (if needed)

Fix pmd#1540

TODO don't use Java nio glob implementation. It's FS dependent and has shitty error messages

pmd-test · 2024-04-09T19:09:02Z

	1 Message
📖	Compared to master: This changeset changes 0 violations, introduces 0 new violations, 0 new errors and 0 new configuration errors, removes 0 violations, 0 errors and 0 configuration errors. Download full report as build artifact

Generated by 🚫 Danger

oowekyala · 2024-04-10T15:00:09Z

docs/pages/pmd/userdocs/cli_reference.md

+
+
+Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and 
+will not allow analyzing several languages in one go:


Maybe we need to give less space to --force-language in this documentation as it is really inferior

With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?

If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.

That's correct. You are probably right that we should deprecate it, I don't see a use case for it anymore

jsotuyod · 2024-04-10T17:27:40Z

pmd-core/src/main/java/net/sourceforge/pmd/lang/LanguageVersionDiscoverer.java

+ // Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded.
+ addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml"));


This should be done from the language module itself, the same way it registers versions and CPD / PMD capabilities.

I guess so, but the order in which languages are loaded is not specified so it isn't clear which pattern would win in case of conflicts. So for now I think it is a good start.

Alternatively we could drop the semantics of ordering, and say that if several patterns match, we log a warning and ignore the file. This is already what happens with the current file extension matching. But I find it intuitive and useful to have an ordering between patterns, because it allows a user to override the default patterns...

Alternatively, we could try something different…

All the cases I've identified so far are suffixes, may that be:

a standard language extension (ie: .xml)

a standard file name (ie: pom.xml), which is the extreme case of a "suffix"

a longer suffix (ie: SF flow files: .flow-meta.xml)

I'd say that that we have multiple valid options:

processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)

alternatively, if we assume it's always file suffixes, we don't need a regex, but a string, and can decide on applying the most-specific (longest) one, so pom.xml or .flow-meta.xml would always have precedence over standard .xml irrespective of the order of definition.

Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.

processing the file with all configured rules for all aplicable languages would be reasonable (ie: I don't need an additional rule to require pom files to specify an XML encoding)

Tbh I don't think this should be done by applying several languages to the same file. I'd like to be able to say that the language POM is a dialect of the language XML. Then XML rules apply on a POM file because POM is-a XML, but there is still exactly one language that gets picked, namely POM because it is the most specific. This means the file is only parsed once. It would also allow to share other things between XML dialects, eg parser and other language extensions.

Assuming all .xml files are XML files (or some dialect of it, like POM for pom.xml) we can arbitrate conflicts using the specificity of the language. But the user should still be able to override the default mappings I think.

Bonus points, if we assume this is always suffixes, we don't need a separate mechanism for normal language mapping and user-specified mappings, even if we decide to apply all matching languages.

I think it's a good thing to have a separate user-defined mapping. That way we don't need to modify the Language instances themselves at runtime. And the user can override mappings.

I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.

About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.

In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

Longer term (outside of this PR), I guess, we need to think about:

can a file have multiple languages and therefore need to be parsed multiple times?

should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?

How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?

We would need to convert the extensions into LanguageFilePatterns

If you want me to make this change in this PR I will, but as shown here it works without, and that is an implementation detail.

The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

That's why I think the more intuitive order is that the last pattern added wins. Intuitively it is just overriding the patterns that are already there. It also makes sense because users that add patterns on the CLI are appending patterns to this list, which overrides the existing patterns.

So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

IMO the point of defining an ordering is to be able to override existing patterns (like saying my XML files are actually some dialect of XML). Overriding means, ignoring other patterns that matched and taking precedence over them. Therefore for this to be useful there will necessarily be other patterns that matched, and it is a normal occurence. We should log this in debug, as we log other file-language assignments, but I would not print a warning for what is the normal operation of things.

adangel · 2024-05-02T06:53:06Z

...i/src/main/java/net/sourceforge/pmd/cli/commands/internal/AbstractAnalysisPmdSubcommand.java

@@ -88,6 +98,24 @@ protected void setPositionalInputPaths(final List<Path> inputPaths) {
 this.setInputPaths(inputPaths);
 }

+ private Path ignoreListPath;
+
+ @Option(names = "--ignore-list",


Sounds like --exclude which we already have?

ok, it seems we named it "exclude" in CPD and "ignore-list" in PMD. So, we have here another opportunity to streamline CLI paramaters...

I think CPD's --exclude does not take a file list but a path. Maybe we should add --exclude to PMD?

adangel · 2024-05-02T06:58:24Z

docs/pages/pmd/userdocs/cli_reference.md

@@ -59,6 +59,18 @@ The tool comes with a rather extensive help text, simply running with `--help`!
 <p>See also [Providing the auxiliary classpath](pmd_languages_java.html#providing-the-auxiliary-classpath).</p>"
 languages="Java"
 %}
+ {% include custom/cli_option_row.html options="--assign-language"


We'll need to add the same for CPD CLI doc.

adangel · 2024-05-02T07:02:24Z

docs/pages/pmd/userdocs/cli_reference.md

+
+
+Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and 
+will not allow analyzing several languages in one go:


With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?

If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.

adangel · 2024-05-02T07:20:19Z

pmd-core/src/main/java/net/sourceforge/pmd/AbstractConfiguration.java

@@ -208,6 +208,19 @@ protected void checkLanguageIsAcceptable(Language lang) throws UnsupportedOperat
 // do nothing
 }

+ /**
+ * Add a pattern that will be matched to a language. If the language is unknown,
+ * return false. File patterns are matched in the reverse order they were added.


"... matched in reverse order..." - Can we make it the other way round? IMHO this makes it a bit more difficult to grasp. It would be easier, if we said "language file patterns are applied in the order they are added here - the first match wins".

Maybe we can give the impl. detail, that patterns from CLI are added first and take therefore precedence over the default patterns from the language modules.

Not sure, if that makes the implementation more complicated or simpler - I didn't look at the actual implementation yet.

Ok, so the patterns from CLI are actually added in AbstractAnalysisPmdSubcommand - in the order they are given in the CLI. I would find it more intuitive, if we preserve the order - the first match given with "--assign-language" wins.

adangel · 2024-05-02T07:23:45Z

pmd-core/src/main/java/net/sourceforge/pmd/internal/util/PathMatcher.java

+public final class PathMatcher implements Predicate<String> {
+
+ private final Pattern pattern;
+ private final String glob;


Suggested change

private final String glob;

private final String regex;

glob patterns usually work a bit different (e.g. "*" itself stands for one or more characters, while for regex the same is ".*").

adangel · 2024-05-02T07:28:08Z

docs/pages/pmd/userdocs/cli_reference.md

We also need to add an example on how to externalize the CLI parameters and load them with @...

Otherwise we didn't fix #461

adangel · 2024-05-02T07:33:57Z

docs/pages/pmd/userdocs/cli_reference.md

+ extension will be recognized.
+
+ This option may be repeated to make several assignments. If several regexes match the same file, only
+ the last occurrence of the option is taken into account. If the language is not loaded, the assignment


I would vote for defining the first match/occurrence wins...

adangel · 2024-05-02T07:35:10Z

...i/src/main/java/net/sourceforge/pmd/cli/commands/internal/AbstractAnalysisPmdSubcommand.java

+ for (LanguageFilePattern pat : languageFilePatterns) {
+ Language lang = configuration.getLanguageRegistry().getLanguageById(pat.getLanguageid());
+ if (lang == null) {
+ configuration.getReporter().warn("Language {0} mentioned in --assign-language option is not loaded.", pat.getLanguageid());


Is it just not loaded or does it not exist? On the CLI, you can provide arbitrary strings...

We can't know in the CLI :/

Should this be a warning or a ParameterException? Actually, why is this not validated directly in LanguageFilePatternConverter? Rather than have langId be a String use the Language and have it strongly typed in LanguageFilePattern. At the end of the day, te validation the Language exists should still be needed regardless of the entry point (CLI in this case). Requiring a non-null Language instance to create the LanguageFilePattern enforces that throughout.

Also, I'd speak of a language being "unknown" instead of loaded / not loaded. This would be consistent with what we currently throw when we set an invalid language elsewhere.

Maybe a ParameterException yes...

I thought it would be useful to be able to define patterns even for languages that are not loaded... The use case I had in mind is adding patterns for POM even if it is not loaded, so that XML does not take precedence. If we make languages register their patterns upon loading then that should not be a valid use case anymore.

adangel · 2024-05-02T07:55:43Z

pmd-core/src/main/java/net/sourceforge/pmd/lang/LanguageVersionDiscoverer.java

+ // Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded.
+ addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml"));


I like it, that we don't just use suffixes and use patterns and check them against the whole path - that way, this is a very flexible solution.

About XML vs. POM - I'm not sure yet. Currently they are two completely separate languages and just happen to share some implementation and are therefore in the same maven module. But logically they have nothing in common (as we don't have such a concept like dialect or so). They are treated as completely different languages with there own rules etc. So, I don't think the rules can be shared at the moment, as a rule is explicitly for one language only.

In terms of this PR, in order to get a defined ordering, the POM Language module could depend on XML. Would that help? Would that ensure that POM is always initialized after XML? If so, we build up a complete ordered languageFilePatterns-list which includes the patterns from CLI (assign-language) first as well as the default patterns from the loaded languages (we would need to convert the extensions into LanguageFilePatterns) in the "correct" order. (The correct order of the language modules is maybe reverse: if pom depends on xml, then xml is before pom, but we need to add the patterns of pom before xml)

Then we can implement getLanguagesForFile by just using the first match in the languageFilePatterns-list. I just noticed, that we return here a list of languages, so a file could have multiple languages...
If that is true, we maybe don't need an ordered languageFilePatterns-list at all? Ok, we currently take only the first language... and at least log a warning, if multiple languages matched.
So for that PR, I would then return all matched languages from the ordered languageFilePatterns-list keeping the same order (and not just the first match). That way, we log a warning, if two assign-language patterns would match. In the end, only the first match will be used (just like it is now).

Longer term (outside of this PR), I guess, we need to think about:

can a file have multiple languages and therefore need to be parsed multiple times?

should we introduce the concept of language dialects (e.g. POM is a dialect of XML, TypeScript is a dialect of JavaScript (or the other way round))?

How can we reuse rules for different languages, e.g. XML's MissingEncoding for both xml and pom (and wsdl, xslt)?

adangel · 2024-05-02T08:01:52Z

pmd-core/src/main/java/net/sourceforge/pmd/lang/LanguageVersionDiscoverer.java

+ // match patterns from most recent to most ancient
+ for (LanguageFilePattern pat : IteratorUtil.asReversed(languageFilePatterns)) {
+ if (pat.matches(fileName)) {
+ return pat;


As commented above, we should maybe return here a list of all matched patterns, so that users get a warning, if there are multiple different languages matched.

oowekyala added 11 commits April 9, 2024 12:56

Add language file patterns mentionable on the CLI

63a6849

Test CLI

1630237

Add default pattern for POM

a6caf80

Fix pmd#1540

Doc

fd283bf

Make option repeatable

bd05b72

Replace with glob

628fabd

TODO don't use Java nio glob implementation. It's FS dependent and has shitty error messages

Use custom glob implementation

257f6b1

Checkstyle

de63594

Simplify

027a6a7

Remove glob impl, use regex

01684f2

checkstyle

165092e

oowekyala commented Apr 10, 2024

View reviewed changes

jsotuyod reviewed Apr 10, 2024

View reviewed changes

adangel reviewed May 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[core] Add a flexible way to assign files with arbitrary names to languages #4945

[core] Add a flexible way to assign files with arbitrary names to languages #4945

oowekyala commented Apr 9, 2024

pmd-test commented Apr 9, 2024

oowekyala Apr 10, 2024

adangel May 2, 2024

oowekyala May 2, 2024

jsotuyod Apr 10, 2024

oowekyala Apr 10, 2024 •

edited

oowekyala Apr 10, 2024

jsotuyod Apr 10, 2024 •

edited

oowekyala Apr 11, 2024

adangel May 2, 2024

oowekyala May 13, 2024

adangel May 2, 2024

oowekyala May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

adangel May 2, 2024

oowekyala May 13, 2024

jsotuyod May 13, 2024

oowekyala May 13, 2024

adangel May 2, 2024

adangel May 2, 2024



		Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and
		will not allow analyzing several languages in one go:

		// Add pattern to recognize POM. This can be overridden by patterns added later. POM defaults to XML if not loaded.
		addLanguageFilePattern(LanguageFilePattern.ofRegex(Pattern.compile("(.*/)?pom\\.xml$"), "pom", "xml"));

[core] Add a flexible way to assign files with arbitrary names to languages #4945

Are you sure you want to change the base?

[core] Add a flexible way to assign files with arbitrary names to languages #4945

Conversation

oowekyala commented Apr 9, 2024

Describe the PR

Related issues

Ready?

pmd-test commented Apr 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

oowekyala Apr 10, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsotuyod Apr 10, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

oowekyala Apr 10, 2024 •

edited

jsotuyod Apr 10, 2024 •

edited