Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Add a flexible way to assign files with arbitrary names to languages #4945

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from
Draft
26 changes: 25 additions & 1 deletion docs/pages/pmd/userdocs/cli_reference.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add an example on how to externalize the CLI parameters and load them with @...

Otherwise we didn't fix #461

Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,18 @@ The tool comes with a rather extensive help text, simply running with `--help`!
<p>See also [Providing the auxiliary classpath](pmd_languages_java.html#providing-the-auxiliary-classpath).</p>"
languages="Java"
%}
{% include custom/cli_option_row.html options="--assign-language"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to add the same for CPD CLI doc.

option_arg="langId regex"
description="Specify that files whose name match the regex should be assigned the given language.
This option is useful when the default extension-based matching does not match some file names in
your project. In particular, languages like XML have a variety of extensions, but by default only the `.xml`
extension will be recognized.

This option may be repeated to make several assignments. If several regexes match the same file, only
the last occurrence of the option is taken into account. If the language is not loaded, the assignment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for defining the first match/occurrence wins...

is still valid and no further regexes/extension matching will occur, but the file will be ignored. A warning
will be logged once on the CLI before the execution.
%}
{% include custom/cli_option_row.html options="--benchmark,-b"
description="Enables benchmark mode, which outputs a benchmark report upon completion.
The report is sent to standard error."
Expand Down Expand Up @@ -279,7 +291,19 @@ All formats are described at [PMD Report formats](pmd_userdocs_report_formats.ht

### Analyze other xml formats

If your xml language doesn't use `xml` as file extension, you can still use PMD with `--force-language`:
If your XML language doesn't use `xml` as file extension, you can match these files with the option `--assign-language`:

{% include cli_example.html
id="force"
linux="pmd check -d src/xml-file.ext -f text -R ruleset.xml --assign-language xml '.*\.ext'"
windows="pmd.bat check -d src\xml-file.ext -f text -R ruleset.xml --assign-language xml '.*\.ext'" %}

You can repeat the `--assign-language` option several time to match different patterns. The first argument
is a language ID, the second is a Java regular expression. You can therefore write eg `--assign-language xml '.*\.(ext|ext2)|foo.myxmlformat'`.


Another, less flexible way to do this is to use `--force-language`, but this will assign all files to the same language and
will not allow analyzing several languages in one go:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to give less space to --force-language in this documentation as it is really inferior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "--force-language java", you could e.g. parse a "File.txt" as a Java-File. With "--assign-language java '.*.txt'" you could achieve the same, correct?

If so, we might consider to deprecate "force-language" then... as it seems all use cases are covered by assign-language.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. You are probably right that we should deprecate it, I don't see a use case for it anymore


{% include cli_example.html
id="force"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,25 @@
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.Stack;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

import org.slf4j.LoggerFactory;

import net.sourceforge.pmd.AbstractConfiguration;
import net.sourceforge.pmd.cli.commands.mixins.internal.EncodingMixin;
import net.sourceforge.pmd.cli.internal.CliExitCode;
import net.sourceforge.pmd.cli.internal.PmdRootLogger;
import net.sourceforge.pmd.lang.Language;
import net.sourceforge.pmd.lang.LanguageVersionDiscoverer.LanguageFilePattern;
import net.sourceforge.pmd.util.log.internal.SimpleMessageReporter;

import picocli.CommandLine;
import picocli.CommandLine.Mixin;
import picocli.CommandLine.Option;
import picocli.CommandLine.ParameterException;
Expand All @@ -29,26 +39,26 @@ public abstract class AbstractAnalysisPmdSubcommand<C extends AbstractConfigurat
// see the setters #setInputPaths and setPositionalInputPaths for @Option and @Parameters annotations
// Note: can't use annotations on the fields here, as otherwise the complete list would be replaced
// rather than accumulated.
protected Set<Path> inputPaths;
private Set<Path> inputPaths;

@Option(names = "--file-list",
description =
"Path to a file containing a list of files to analyze, one path per line. "
+ "One of --dir, --file-list or --uri must be provided.")
protected Path fileListPath;

@Option(names = { "--uri", "-u" },
description = "Database URI for sources. "
+ "One of --dir, --file-list or --uri must be provided.")
protected URI uri;

@Option(names = "--no-fail-on-violation",
description = "By default PMD exits with status 4 if violations are found. "
+ "Disable this option with '--no-fail-on-violation' to exit with 0 instead and just write the report.",
defaultValue = "true", negatable = true)
protected boolean failOnViolation;

protected List<Path> relativizeRootPaths;
private List<Path> relativizeRootPaths;

@Option(names = { "--relativize-paths-with", "-z"}, description = "Path relative to which directories are rendered in the report. "
+ "This option allows shortening directories in the report; "
Expand Down Expand Up @@ -88,6 +98,24 @@ protected void setPositionalInputPaths(final List<Path> inputPaths) {
this.setInputPaths(inputPaths);
}

private Path ignoreListPath;

@Option(names = "--ignore-list",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like --exclude which we already have?

ok, it seems we named it "exclude" in CPD and "ignore-list" in PMD. So, we have here another opportunity to streamline CLI paramaters...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CPD's --exclude does not take a file list but a path. Maybe we should add --exclude to PMD?

description = "Path to a file containing a list of files to exclude from the analysis, one path per line. "
+ "This option can be combined with --dir, --file-list and --uri.")
public void setIgnoreListPath(final Path ignoreListPath) {
this.ignoreListPath = ignoreListPath;
}

@Option(names = "--assign-language", description = "Use a regex pattern to assign filenames to a language."
+ " Eg `--assign-language html '.*\\.twig'` will recognize files with extension \\.twig and assign them the language HTML."
+ " This only affects language assignment for the files that are mentioned with other options like --dir, it will not search for "
+ " new files outside of these. These patterns take precedence over the default language assignment. If several patterns match,"
+ " only the latest pattern to be mentioned on the CLI will be considered. Note that the regex will be applied on the absolute path"
+ " of the file, with file separators normalized to '/'.", parameterConsumer = LanguageFilePatternConverter.class)
protected List<LanguageFilePattern> languageFilePatterns = new ArrayList<>();


@Override
protected final void validate() throws ParameterException {
super.validate();
Expand All @@ -99,6 +127,30 @@ protected final void validate() throws ParameterException {
}
}

protected void configureCommonOptions(C configuration) {

// Setup CLI message reporter
configuration.setReporter(new SimpleMessageReporter(LoggerFactory.getLogger(PmdCommand.class)));

if (inputPaths != null) {
configuration.setInputPathList(new ArrayList<>(inputPaths));
}
configuration.setInputFilePath(fileListPath);
configuration.setIgnoreFilePath(ignoreListPath);
configuration.setInputUri(uri);
configuration.setSourceEncoding(encoding.getEncoding());
if (relativizeRootPaths != null) {
configuration.addRelativizeRoots(relativizeRootPaths);
}
for (LanguageFilePattern pat : languageFilePatterns) {
Language lang = configuration.getLanguageRegistry().getLanguageById(pat.getLanguageid());
if (lang == null) {
configuration.getReporter().warn("Language {0} mentioned in --assign-language option is not loaded.", pat.getLanguageid());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just not loaded or does it not exist? On the CLI, you can provide arbitrary strings...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't know in the CLI :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning or a ParameterException? Actually, why is this not validated directly in LanguageFilePatternConverter? Rather than have langId be a String use the Language and have it strongly typed in LanguageFilePattern. At the end of the day, te validation the Language exists should still be needed regardless of the entry point (CLI in this case). Requiring a non-null Language instance to create the LanguageFilePattern enforces that throughout.

Also, I'd speak of a language being "unknown" instead of loaded / not loaded. This would be consistent with what we currently throw when we set an invalid language elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a ParameterException yes...

I thought it would be useful to be able to define patterns even for languages that are not loaded... The use case I had in mind is adding patterns for POM even if it is not loaded, so that XML does not take precedence. If we make languages register their patterns upon loading then that should not be a valid use case anymore.

}
configuration.addLanguageFilePattern(pat);
}
}


protected abstract C toConfiguration();

Expand All @@ -113,4 +165,23 @@ protected CliExitCode execute() {
this::doExecute);
}

static class LanguageFilePatternConverter implements CommandLine.IParameterConsumer {

@Override
public void consumeParameters(Stack<String> args, CommandLine.Model.ArgSpec argSpec, CommandLine.Model.CommandSpec commandSpec) {
if (args.size() < 2) {
throw new ParameterException(commandSpec.commandLine(),
"Expected two arguments for the language file pattern: one language ID and one pattern.");
}
String langId = args.pop();
String globString = args.pop();
Pattern pattern;
try {
pattern = Pattern.compile(globString);
} catch (PatternSyntaxException pse) {
throw new ParameterException(commandSpec.commandLine(), "Invalid regex specification for language " + langId + ": " + pse.getMessage());
}
((List<LanguageFilePattern>) argSpec.getValue()).add(LanguageFilePattern.ofRegex(pattern, langId));
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ public class CpdCommand extends AbstractAnalysisPmdSubcommand<CPDConfiguration>
defaultValue = CpdLanguagePropertiesDefaults.DEFAULT_SKIP_BLOCKS_PATTERN)
private String skipBlocksPattern;

// todo move this option up to the base class
@Option(names = "--exclude", arity = "1..*", description = "Files to be excluded from the analysis")
private List<Path> excludes = new ArrayList<>();

Expand All @@ -98,15 +99,9 @@ public class CpdCommand extends AbstractAnalysisPmdSubcommand<CPDConfiguration>
@Override
protected CPDConfiguration toConfiguration() {
final CPDConfiguration configuration = new CPDConfiguration();
configureCommonOptions(configuration);
configuration.setExcludes(excludes);
if (relativizeRootPaths != null) {
configuration.addRelativizeRoots(relativizeRootPaths);
}
configuration.setFailOnViolation(failOnViolation);
configuration.setInputFilePath(fileListPath);
if (inputPaths != null) {
configuration.setInputPathList(new ArrayList<>(inputPaths));
}
configuration.setIgnoreAnnotations(ignoreAnnotations);
configuration.setIgnoreIdentifiers(ignoreIdentifiers);
configuration.setIgnoreLiterals(ignoreLiterals);
Expand All @@ -120,8 +115,6 @@ protected CPDConfiguration toConfiguration() {
configuration.setSkipBlocksPattern(skipBlocksPattern);
configuration.setSkipDuplicates(skipDuplicates);
configuration.setSkipLexicalErrors(skipLexicalErrors);
configuration.setSourceEncoding(encoding.getEncoding());
configuration.setInputUri(uri);

return configuration;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,6 @@
import net.sourceforge.pmd.reporting.ReportStats;
import net.sourceforge.pmd.util.StringUtil;
import net.sourceforge.pmd.util.log.PmdReporter;
import net.sourceforge.pmd.util.log.internal.SimpleMessageReporter;

import picocli.CommandLine.Command;
import picocli.CommandLine.Option;
Expand Down Expand Up @@ -78,8 +77,6 @@ public class PmdCommand extends AbstractAnalysisPmdSubcommand<PMDConfiguration>

private List<String> rulesets;

private Path ignoreListPath;

private String format;

private int threads;
Expand Down Expand Up @@ -117,13 +114,6 @@ public void setRulesets(final List<String> rulesets) {
this.rulesets = rulesets;
}

@Option(names = "--ignore-list",
description = "Path to a file containing a list of files to exclude from the analysis, one path per line. "
+ "This option can be combined with --dir, --file-list and --uri.")
public void setIgnoreListPath(final Path ignoreListPath) {
this.ignoreListPath = ignoreListPath;
}

@Option(names = { "--format", "-f" },
description = "Report format.%nValid values: ${COMPLETION-CANDIDATES}%n"
+ "Alternatively, you can provide the fully qualified name of a custom Renderer in the classpath.",
Expand Down Expand Up @@ -251,20 +241,11 @@ public void setShowProgressBar(final boolean showProgressBar) {
@Override
protected PMDConfiguration toConfiguration() {
final PMDConfiguration configuration = new PMDConfiguration();
if (inputPaths != null) {
configuration.setInputPathList(new ArrayList<>(inputPaths));
}
configuration.setInputFilePath(fileListPath);
configuration.setIgnoreFilePath(ignoreListPath);
configuration.setInputUri(uri);
configureCommonOptions(configuration);
configuration.setReportFormat(format);
configuration.setSourceEncoding(encoding.getEncoding());
configuration.setMinimumPriority(minimumPriority);
configuration.setReportFile(reportFile);
configuration.setReportProperties(properties);
if (relativizeRootPaths != null) {
configuration.addRelativizeRoots(relativizeRootPaths);
}
configuration.setRuleSets(rulesets);
configuration.setShowSuppressedViolations(showSuppressed);
configuration.setSuppressMarker(suppressMarker);
Expand All @@ -284,9 +265,6 @@ protected PMDConfiguration toConfiguration() {
configuration.setForceLanguageVersion(forcedLangVer);
}

// Setup CLI message reporter
configuration.setReporter(new SimpleMessageReporter(LoggerFactory.getLogger(PmdCommand.class)));

try {
configuration.prependAuxClasspath(auxClasspath);
} catch (IllegalArgumentException e) {
Expand Down
27 changes: 27 additions & 0 deletions pmd-cli/src/test/java/net/sourceforge/pmd/cli/PmdCliTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ class PmdCliTest extends BaseCliTest {

private static final String DUMMY_RULESET_WITH_VIOLATIONS = "net/sourceforge/pmd/cli/FakeRuleset2.xml";
static final String RULESET_WITH_VIOLATION = "net/sourceforge/pmd/cli/RuleSetWithViolations.xml";
private static final String RULESET_WITH_VIOLATION_FOR_DUMMY2_LANG = "net/sourceforge/pmd/cli/RuleSetWithViolationsForDummy2.xml";
private static final String RULESET_NO_VIOLATIONS = "net/sourceforge/pmd/cli/FakeRuleset.xml";
private static final String NOT_A_RULESET = "ThisRuleSetDoesNotExist.xml";
private static final String STRING_TO_REPLACE = "__should_be_replaced__";
Expand All @@ -72,6 +73,7 @@ void setup() throws IOException {
// create a few files
srcDir = Files.createDirectories(tempRoot().resolve("src"));
writeString(srcDir.resolve("someSource.dummy"), "dummy text");
writeString(srcDir.resolve("someSource.DUMMY_2_CUSTOM_EXT"), "dummy text");
}


Expand Down Expand Up @@ -318,6 +320,31 @@ void exitStatusWithViolations() throws Exception {
));
}

@Test
void testAssignLanguageOption() throws Exception {

// nothing found bc no file matches
runCli(OK, "-d", srcDir.toString(), "-f", "text", "-R", RULESET_WITH_VIOLATION_FOR_DUMMY2_LANG)
.verify(r -> r.checkStdOut(equalTo("")));

runCli(VIOLATIONS_FOUND, "-d", srcDir.toString(), "-f", "text", "-R", RULESET_WITH_VIOLATION_FOR_DUMMY2_LANG,
"--assign-language", "dummy2", ".*\\.DUMMY_2_CUSTOM_EXT")
.verify(r -> r.checkStdOut(containsString("someSource.DUMMY_2_CUSTOM_EXT:1:\tReportAllRootNodes")));
}


@Test
void testRepeatAssignLanguageOption() throws Exception {

runCli(VIOLATIONS_FOUND, "-d", srcDir.toString(), "-f", "text", "-R", RULESET_WITH_VIOLATION_FOR_DUMMY2_LANG,
"--assign-language", "dummy2", ".*\\.DUMMY_2_CUSTOM_EXT",
"--assign-language", "dummy2", ".*/someSource.dummy")
.verify(r -> {
r.checkStdOut(containsString("someSource.DUMMY_2_CUSTOM_EXT"));
r.checkStdOut(containsString("someSource.dummy"));
});
}

@Test
void testZipFileAsSource() throws Exception {
Path zipArchive = createTemporaryZipArchive("sources.zip");
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
<?xml version="1.0"?>
<ruleset name="Test Ruleset" xmlns="http://pmd.sourceforge.net/ruleset/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 https://pmd.sourceforge.io/ruleset_2_0_0.xsd">

<description>
Ruleset used by test RuleSetFactoryTest
</description>

<rule name="ReportAllRootNodes" language="dummy2" since="1.0" message="Violation from ReportAllRootNodes"
class="net.sourceforge.pmd.lang.rule.xpath.XPathRule"
externalInfoUrl="${pmd.website.baseurl}/rules/test/TestRuleset3.xml#Ruleset3Rule1">
<description>Just for test</description>
<priority>3</priority>
<properties>
<!-- Reports one violation per file. -->
<property name="xpath" value="(/)"/>
</properties>
</rule>

</ruleset>
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,19 @@ protected void checkLanguageIsAcceptable(Language lang) throws UnsupportedOperat
// do nothing
}

/**
* Add a pattern that will be matched to a language. If the language is unknown,
* return false. File patterns are matched in the reverse order they were added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... matched in reverse order..." - Can we make it the other way round? IMHO this makes it a bit more difficult to grasp. It would be easier, if we said "language file patterns are applied in the order they are added here - the first match wins".

Maybe we can give the impl. detail, that patterns from CLI are added first and take therefore precedence over the default patterns from the language modules.

Not sure, if that makes the implementation more complicated or simpler - I didn't look at the actual implementation yet.

Ok, so the patterns from CLI are actually added in AbstractAnalysisPmdSubcommand - in the order they are given in the CLI. I would find it more intuitive, if we preserve the order - the first match given with "--assign-language" wins.

* This behavior allows later patterns to take precedence over already added patterns.
* The first match stops the search. If the language is unknown (not loaded), the
* search is stopped anyway.
*
* @param pattern A pattern
*/
public void addLanguageFilePattern(LanguageVersionDiscoverer.LanguageFilePattern pattern) {
languageVersionDiscoverer.addLanguageFilePattern(pattern);
}

/**
* Get the LanguageVersion of the source file with given name. This depends
* on the fileName extension, and the java version.
Expand All @@ -220,9 +233,7 @@ protected void checkLanguageIsAcceptable(Language lang) throws UnsupportedOperat
* Name of the file, can be absolute, or simple.
* @return the LanguageVersion
*/
// FUTURE Delete this? I can't think of a good reason to keep it around.
// Failure to determine the LanguageVersion for a file should be a hard
// error, or simply cause the file to be skipped?
// TODO Delete this, only used in tests, duplicates logic from FileCollector.
public @Nullable LanguageVersion getLanguageVersionOfFile(String fileName) {
LanguageVersion forcedVersion = getForceLanguageVersion();
if (forcedVersion != null) {
Expand Down