-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Replace univocity-parsers with FastCSV #4606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Replace univocity-parsers with FastCSV #4606
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very promising! 👍
...ooling-support-tests/src/test/java/platform/tooling/support/tests/ModularUserGuideTests.java
Outdated
Show resolved
Hide resolved
platform-tooling-support-tests/platform-tooling-support-tests.gradle.kts
Outdated
Show resolved
Hide resolved
* The `CsvFileSource.lineSeparator()` parameter is deprecated because line separators | ||
are now detected automatically during CSV parsing. This setting is no longer required | ||
and will be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does auto-detection work in all cases? What happens if \n
is used in a cell like in the following example with 4 columns?
a;b;\n
c;d\r\n
e;f;g;h\r\n
(assuming \n
and \r
are replaced with the corresponding character)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does auto-detection work in all cases?
The auto-detection treats each of \r
, \n
, and \r\n
as a line separator. For example, given the following input:
a;b\r
c;d\n
e;f\r\n
g;h
The result is:
[["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
In contrast, univocity-parsers (when configured with \n
as the line separator) produces different results:
["a", "b\rc", "d"], ["e", "f"], ["g", "h"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if \n is used in a cell like in the following example with 4 columns?
In this case, the results from FastCSV and univocity-parsers are mostly similar.
FastCSV:
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]
univocity-parsers:
// .lineSeparator("\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]
// .lineSeparator("\r\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]
// .lineSeparator("\r") - differs from FastCSV
[["a", "b", "c", "d"], ["e", "f", "g", "h"], [null]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m afraid this breaks compatibility if someone uses a character sequence as a line delimiter that is not a newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, considering 3 possible scenarios, all of them imply a breaking change 😞
-
User explicitly relies on
\r\n
as the line separator:
\r
- causes an unexpected line break;
\n
- causes an unexpected line break; -
User explicitly relies on
\r
as the line separator:
\n
- causes an unexpected line break;
\r\n
- no change, since\r
is already interpreted as a line break; -
User explicitly relies on
\n
as the line separator:
\r
- causes an unexpected line break;
\r\n
- no change, since\n
is already interpreted as a line break;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@osiegmar, would it be possible to add support for a lineSeparator()
parameter in FastCSV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially, yes. Of course, this wouldn’t be a valid CSV file at all. Is this really a desired feature or just lack of specification/documentation and a good chance to change that with the new major version of JUnit?
Is there a (good) reason, someone separates text records by anything that is not a newline sequence?
Is there known usage of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping this in the new major version makes sense. I'm not aware of any cases other than using the same line separator on different operating systems. IIRC I initially introduced it because univocity-parsers would use the system line separator (and only that) be default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dropping this in the new major version makes sense.
Great 👍
I'll make sure to clarify that in the release notes. Adding a few tests wouldn't hurt either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, do we intend to remove lineSeparator()
entirely, or simply deprecate it? I initially considered deprecation, but now I’m not sure it’s appropriate here, since deprecation generally implies the feature will continue to function as before, which is not the case.
} | ||
return String.join("\n", csvSource.value()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does FastCSV provide an API for line-by-line reading so we don't have to create a string first? It's probably not a big deal since it comes from literals in an annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With osiegmar/FastCSV@1077389 there is one now. @vdmitrienko You may want to give it a try if it simplify things for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @osiegmar. This works well with individual strings, but it doesn't support headers. I think adding an overload that accepts an array (or varargs) of strings could simplify this use case:
build(final CsvCallbackHandler<T> callbackHandler, final String... data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the validation of empty records, having a setting for that could be quite handy. It would also be great if the exception message included the index of the empty record. That said, we could also handle this on our side 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. In that case, I'd rather stick to the current String.join()
implementation. The effort for this extra build-method (bigger than initially thought) doesn't seem to match this edge-case use. I also highly doubt that it would improve performance.
junit-jupiter-params/src/main/java/org/junit/jupiter/params/provider/CsvArgumentsProvider.java
Outdated
Show resolved
Hide resolved
@@ -24,6 +24,7 @@ dependencies { | |||
testImplementation(libs.kotlinx.coroutines) | |||
testImplementation(libs.groovy4) | |||
testImplementation(libs.memoryfilesystem) | |||
testImplementation(libs.fastcsv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the library is now shadowed, this is no longer necessary 👍
@@ -0,0 +1,21 @@ | |||
MIT License | |||
|
|||
Copyright (c) 2021 Oliver Siegmar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed that this hasn't been updated for a while. Now, I did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part of FastCSV's jar file? If so, we should probably extract it from there to avoid it becoming outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added META-INF/LICENSE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vdmitrienko Please let me know if you need help with Gradle to achieve that! 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcphilipp, please, take a look at d5139e2
static CsvReader<? extends CsvRecord> createReaderFor(CsvSource csvSource, String data) { | ||
String delimiter = selectDelimiter(csvSource.delimiter(), csvSource.delimiterString()); | ||
// @formatter:off | ||
CsvReader.CsvReaderBuilder builder = CsvReader.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you rely on skipEmptyLines(true)
(current default), you may want to set that explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Updated
|
||
String delimiter = selectDelimiter(csvFileSource.delimiter(), csvFileSource.delimiterString()); | ||
// @formatter:off | ||
CsvReader.CsvReaderBuilder builder = CsvReader.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you rely on skipEmptyLines(true)
(current default), you may want to set that explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
|
||
// @formatter:off | ||
if (useHeadersInDisplayName) { | ||
return NamedCsvRecordHandler.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation allows duplicate header names. For compatibility, add allowDuplicateHeaderFields(true)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
Object[] arguments = new Object[record.getFields().size()]; | ||
|
||
for (int i = 0; i < record.getFields().size(); i++) { | ||
String field = record.getFields().get(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should either call getFields()
only once (as it constructs new objects internally) or (preferably) use the methods getFieldCount()
and getField(int)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Updated
@@ -147,6 +103,10 @@ static Arguments processCsvRecord(@Nullable String[] csvRecord, Set<String> null | |||
return Named.of(name, column); | |||
} | |||
|
|||
private static List<String> getHeaders(CsvRecord record) { | |||
return ((NamedCsvRecord) record).getHeader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the header field names were trimmed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also just noticed, that empty header fields (like in foo,,bar
) are causing NullPointerExceptions with the univocity implementation. I also noticed that while the emptyValue is applied to header fields, nullValues are not.
The new implementation currently uses the fieldModifier also for the header record and would produce unexpected results (NULL_MARKER, nullValues, emptyValue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the header field names were trimmed.
Currently, they are still trimmed because header fields are treated as regular fields, meaning that ignoreLeadingAndTrailingWhitespace()
(default true
) applies to them as well.
The new implementation currently uses the fieldModifier also for the header record and would produce unexpected results (NULL_MARKER, nullValues, emptyValue).
Thanks for the observation! Now it applies to headers consistently as well.
I believe treating headers as regular fields would be the simplest and most reliable solution, this way, attributes like ignoreLeadingAndTrailingWhitespace()
, nullValues()
, etc. apply uniformly to both headers and records.
@marcphilipp @osiegmar WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the header field names were trimmed.
Currently, they are still trimmed because header fields are treated as regular fields, meaning that
ignoreLeadingAndTrailingWhitespace()
(defaulttrue
) applies to them as well.
Regardless of the setting ignoreLeadingAndTrailingWhitespace
and regardless of whether quoted or not, headers were always trimmed:
Lines 107 to 112 in b580aa9
// Cannot get parsed headers until after parsing has started. | |
static String[] getHeaders(CsvParser csvParser) { | |
return Arrays.stream(csvParser.getContext().parsedHeaders())// | |
.map(String::trim)// | |
.toArray(String[]::new); | |
} |
I believe treating headers as regular fields would be the simplest and most reliable solution, this way, attributes like
ignoreLeadingAndTrailingWhitespace()
,nullValues()
, etc. apply uniformly to both headers and records.@marcphilipp @osiegmar WDYT?
I believe the same, just wanted to point out that difference.
} | ||
return String.join("\n", csvSource.value()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. In that case, I'd rather stick to the current String.join()
implementation. The effort for this extra build-method (bigger than initially thought) doesn't seem to match this edge-case use. I also highly doubt that it would improve performance.
…-parsers-with-fastcsv # Conflicts: # junit-jupiter-params/src/main/java/org/junit/jupiter/params/provider/CsvArgumentsProvider.java
for (int i = 0; i < csvSource.value().length; i++) { | ||
if (csvSource.value()[i].isEmpty()) { | ||
int finalI = i; | ||
Preconditions.condition(!csvSource.value()[i].isEmpty(), // | ||
() -> "CSV record at index %d is empty".formatted(finalI + 1) // | ||
); | ||
} | ||
Preconditions.notNull(csvRecord, | ||
() -> "Record at index " + index + " contains invalid CSV: \"" + input + "\""); | ||
argumentsList.add(processCsvRecord(csvRecord, nullValues, useHeadersInDisplayName, headers)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: why are empty strings in the value array handled as an error? In text blocks empty lines are simply skipped. I also couldn't find something in the javadoc about it. Seems like an unnecessary difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior originates from univocity parseLine() implementation, which we use to parse individual lines: it returns null
in case a single line is empty. For text blocks, however, we use a different method: parseAll(), which simply discards empty lines.
I preserved this behavior for the compatibility sake, but seems like we could actually drop it and unify the handling of value arrays and text blocks.
@@ -147,6 +103,10 @@ static Arguments processCsvRecord(@Nullable String[] csvRecord, Set<String> null | |||
return Named.of(name, column); | |||
} | |||
|
|||
private static List<String> getHeaders(CsvRecord record) { | |||
return ((NamedCsvRecord) record).getHeader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the header field names were trimmed.
Currently, they are still trimmed because header fields are treated as regular fields, meaning that
ignoreLeadingAndTrailingWhitespace()
(defaulttrue
) applies to them as well.
Regardless of the setting ignoreLeadingAndTrailingWhitespace
and regardless of whether quoted or not, headers were always trimmed:
Lines 107 to 112 in b580aa9
// Cannot get parsed headers until after parsing has started. | |
static String[] getHeaders(CsvParser csvParser) { | |
return Arrays.stream(csvParser.getContext().parsedHeaders())// | |
.map(String::trim)// | |
.toArray(String[]::new); | |
} |
I believe treating headers as regular fields would be the simplest and most reliable solution, this way, attributes like
ignoreLeadingAndTrailingWhitespace()
,nullValues()
, etc. apply uniformly to both headers and records.@marcphilipp @osiegmar WDYT?
I believe the same, just wanted to point out that difference.
if (useHeadersInDisplayName) { | ||
column = asNamed(requireNonNull(headers)[i] + " = " + column, column); | ||
String header = resolveNullMarker(getHeaders(record).get(i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't see this before: getHeaders
(which calls getHeader
on NamedCsvRecord
) should also better not called in a loop - it creates objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed that. Updated
static void validate(CsvSource csvSource) { | ||
validateMaxCharsPerColumn(csvSource.maxCharsPerColumn()); | ||
validateDelimiter(csvSource.delimiter(), csvSource.delimiterString(), csvSource); | ||
} | ||
|
||
static void validate(CsvFileSource csvFileSource) { | ||
validateMaxCharsPerColumn(csvFileSource.maxCharsPerColumn()); | ||
validateDelimiter(csvFileSource.delimiter(), csvFileSource.delimiterString(), csvFileSource); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A validation should be added to ensure that the (now deprecated) lineSeparator contains only \r\n, \r, or \n.
Overview
#4339
I hereby agree to the terms of the JUnit Contributor License Agreement.
Definition of Done
@API
annotations