
Replace univocity-parsers with FastCSV #4606


Draft · wants to merge 37 commits into base: main

Conversation

Contributor

@vdmitrienko vdmitrienko commented Jun 1, 2025

Overview

#4339


I hereby agree to the terms of the JUnit Contributor License Agreement.


Definition of Done

Member

@marcphilipp marcphilipp left a comment


This looks very promising! 👍

Comment on lines +113 to +115
* The `CsvFileSource.lineSeparator()` parameter is deprecated because line separators
are now detected automatically during CSV parsing. This setting is no longer required
and will be ignored.
Member

Does auto-detection work in all cases? What happens if \n is used in a cell like in the following example with 4 columns?

a;b;\n
c;d\r\n
e;f;g;h\r\n

(assuming \n and \r are replaced with the corresponding character)

Contributor Author

Does auto-detection work in all cases?

The auto-detection treats each of \r, \n, and \r\n as a line separator. For example, given the following input:

a;b\r
c;d\n
e;f\r\n
g;h

The result is:

[["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]

In contrast, univocity-parsers (when configured with \n as the line separator) produces different results:

[["a", "b\rc", "d"], ["e", "f"], ["g", "h"]]
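The auto-detection semantics can be reproduced with a minimal, self-contained sketch (plain Java standing in for FastCSV; the `parse` method below is a hypothetical simplification that ignores quoting):

```java
import java.util.Arrays;
import java.util.List;

public class LineSeparatorDemo {

    // Hypothetical simplification of the auto-detection described above:
    // \r\n, \r, and \n are each treated as a record separator (quoting ignored).
    static List<String[]> parse(String input, String fieldSeparator) {
        return Arrays.stream(input.split("\\r\\n|\\r|\\n"))
                .map(line -> line.split(fieldSeparator, -1))
                .toList();
    }

    public static void main(String[] args) {
        List<String[]> records = parse("a;b\rc;d\ne;f\r\ng;h", ";");
        // Each of the three separator styles produces its own record boundary.
        records.forEach(record -> System.out.println(String.join(",", record)));
    }
}
```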

Contributor Author

What happens if \n is used in a cell like in the following example with 4 columns?

In this case, the results from FastCSV and univocity-parsers are mostly similar.

FastCSV:

[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

univocity-parsers:

// .lineSeparator("\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

// .lineSeparator("\r\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

// .lineSeparator("\r") - differs from FastCSV
[["a", "b", "c", "d"], ["e", "f", "g", "h"], [null]]


I’m afraid this breaks compatibility if someone uses a character sequence as a line delimiter that is not a newline.

Contributor Author

@vdmitrienko vdmitrienko Jun 2, 2025

So, considering the 3 possible scenarios, all of them imply a breaking change 😞

  1. User explicitly relies on \r\n as the line separator:
    \r - causes an unexpected line break;
    \n - causes an unexpected line break;

  2. User explicitly relies on \r as the line separator:
    \n - causes an unexpected line break;
    \r\n - no change, since \r is already interpreted as a line break;

  3. User explicitly relies on \n as the line separator:
    \r - causes an unexpected line break;
    \r\n - no change, since \n is already interpreted as a line break;

Contributor Author

@osiegmar, would it be possible to add support for a lineSeparator() parameter in FastCSV?


Potentially, yes. Of course, this wouldn’t be a valid CSV file at all. Is this really a desired feature or just lack of specification/documentation and a good chance to change that with the new major version of JUnit?

Is there a (good) reason why someone would separate text records by anything that is not a newline sequence?

Is there known usage of this?

Member

@marcphilipp marcphilipp Jun 3, 2025

I think dropping this in the new major version makes sense. I'm not aware of any use cases other than using the same line separator on different operating systems. IIRC I initially introduced it because univocity-parsers would use the system line separator (and only that) by default.

Contributor Author

I think dropping this in the new major version makes sense.

Great 👍
I'll make sure to clarify that in the release notes. Adding a few tests wouldn't hurt either.

Contributor Author

Just to clarify, do we intend to remove lineSeparator() entirely, or simply deprecate it? I initially considered deprecation, but now I’m not sure it’s appropriate here, since deprecation generally implies the feature will continue to function as before, which is not the case.

}
return String.join("\n", csvSource.value());
Member

Does FastCSV provide an API for line-by-line reading so we don't have to create a string first? It's probably not a big deal since it comes from literals in an annotation.


With osiegmar/FastCSV@1077389 there is one now. @vdmitrienko You may want to give it a try to see if it simplifies things for you.

Contributor Author

Thanks, @osiegmar. This works well with individual strings, but it doesn't support headers. I think adding an overload that accepts an array (or varargs) of strings could simplify this use case:

build(final CsvCallbackHandler<T> callbackHandler, final String... data)

Contributor Author

Regarding the validation of empty records, having a setting for that could be quite handy. It would also be great if the exception message included the index of the empty record. That said, we could also handle this on our side 🙂


Got it. In that case, I'd rather stick to the current String.join() implementation. The effort for this extra build-method (bigger than initially thought) doesn't seem to match this edge-case use. I also highly doubt that it would improve performance.

@vdmitrienko vdmitrienko requested a review from osiegmar June 5, 2025 20:16
@@ -24,6 +24,7 @@ dependencies {
testImplementation(libs.kotlinx.coroutines)
testImplementation(libs.groovy4)
testImplementation(libs.memoryfilesystem)
testImplementation(libs.fastcsv)

still necessary?

Contributor Author

@vdmitrienko vdmitrienko Jun 7, 2025

Since the library is now shadowed, this is no longer necessary 👍

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Oliver Siegmar

Just noticed that this hadn't been updated in a while, so I've now updated it.

Member

Is this part of FastCSV's jar file? If so, we should probably extract it from there to avoid it becoming outdated.


Just added META-INF/LICENSE

Member

@vdmitrienko Please let me know if you need help with Gradle to achieve that! 🙂

Contributor Author

@marcphilipp, please take a look at d5139e2

static CsvReader<? extends CsvRecord> createReaderFor(CsvSource csvSource, String data) {
String delimiter = selectDelimiter(csvSource.delimiter(), csvSource.delimiterString());
// @formatter:off
CsvReader.CsvReaderBuilder builder = CsvReader.builder()

As you rely on skipEmptyLines(true) (current default), you may want to set that explicitly.

Contributor Author

Good point. Updated


String delimiter = selectDelimiter(csvFileSource.delimiter(), csvFileSource.delimiterString());
// @formatter:off
CsvReader.CsvReaderBuilder builder = CsvReader.builder()

As you rely on skipEmptyLines(true) (current default), you may want to set that explicitly.

Contributor Author

Updated


// @formatter:off
if (useHeadersInDisplayName) {
return NamedCsvRecordHandler.builder()

The current implementation allows duplicate header names. For compatibility, add allowDuplicateHeaderFields(true).

Contributor Author

Updated

Comment on lines 87 to 90
Object[] arguments = new Object[record.getFields().size()];

for (int i = 0; i < record.getFields().size(); i++) {
String field = record.getFields().get(i);

You should either call getFields() only once (as it constructs new objects internally) or (preferably) use the methods getFieldCount() and getField(int).
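The suggested change can be sketched with a stand-in record class (hypothetical; it only mimics the accessor shapes mentioned above, where getFields() constructs new objects while getFieldCount()/getField(int) do not):

```java
import java.util.ArrayList;
import java.util.List;

public class FieldAccessDemo {

    // Stand-in for a CSV record: getFields() allocates a fresh list on every
    // call, while getFieldCount()/getField(int) read the backing array directly.
    static class Record {
        private final String[] fields;

        Record(String... fields) {
            this.fields = fields;
        }

        List<String> getFields() {
            return new ArrayList<>(List.of(fields)); // new object per call
        }

        int getFieldCount() {
            return fields.length;
        }

        String getField(int index) {
            return fields[index];
        }
    }

    // Preferred loop shape: no repeated getFields() allocation per iteration.
    static Object[] toArguments(Record record) {
        Object[] arguments = new Object[record.getFieldCount()];
        for (int i = 0; i < record.getFieldCount(); i++) {
            arguments[i] = record.getField(i);
        }
        return arguments;
    }

    public static void main(String[] args) {
        Object[] arguments = toArguments(new Record("a", "b", "c"));
        System.out.println(arguments.length);
    }
}
```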

Contributor Author

Thanks! Updated

@@ -147,6 +103,10 @@ static Arguments processCsvRecord(@Nullable String[] csvRecord, Set<String> null
return Named.of(name, column);
}

private static List<String> getHeaders(CsvRecord record) {
return ((NamedCsvRecord) record).getHeader();

Previously, the header field names were trimmed.

@osiegmar osiegmar Jun 6, 2025

I also just noticed that empty header fields (like in foo,,bar) cause NullPointerExceptions with the univocity implementation. I also noticed that while the emptyValue is applied to header fields, nullValues are not.

The new implementation currently applies the fieldModifier to the header record as well and would produce unexpected results (NULL_MARKER, nullValues, emptyValue).

Contributor Author

Previously, the header field names were trimmed.

Currently, they are still trimmed because header fields are treated as regular fields, meaning that ignoreLeadingAndTrailingWhitespace() (default true) applies to them as well.

The new implementation currently uses the fieldModifier also for the header record and would produce unexpected results (NULL_MARKER, nullValues, emptyValue).

Thanks for the observation! Now it applies to headers consistently as well.

I believe treating headers as regular fields would be the simplest and most reliable solution; this way, attributes like ignoreLeadingAndTrailingWhitespace(), nullValues(), etc. apply uniformly to both headers and records.

@marcphilipp @osiegmar WDYT?


Previously, the header field names were trimmed.

Currently, they are still trimmed because header fields are treated as regular fields, meaning that ignoreLeadingAndTrailingWhitespace() (default true) applies to them as well.

Regardless of the ignoreLeadingAndTrailingWhitespace setting, and regardless of whether fields were quoted or not, headers were always trimmed:

// Cannot get parsed headers until after parsing has started.
static String[] getHeaders(CsvParser csvParser) {
return Arrays.stream(csvParser.getContext().parsedHeaders())//
.map(String::trim)//
.toArray(String[]::new);
}

I believe treating headers as regular fields would be the simplest and most reliable solution; this way, attributes like ignoreLeadingAndTrailingWhitespace(), nullValues(), etc. apply uniformly to both headers and records.

@marcphilipp @osiegmar WDYT?

I believe the same, just wanted to point out that difference.

Comment on lines 63 to 70
for (int i = 0; i < csvSource.value().length; i++) {
if (csvSource.value()[i].isEmpty()) {
int finalI = i;
Preconditions.condition(!csvSource.value()[i].isEmpty(), //
() -> "CSV record at index %d is empty".formatted(finalI + 1) //
);
}
Preconditions.notNull(csvRecord,
() -> "Record at index " + index + " contains invalid CSV: \"" + input + "\"");
argumentsList.add(processCsvRecord(csvRecord, nullValues, useHeadersInDisplayName, headers));
}

I'm curious: why are empty strings in the value array handled as an error? In text blocks, empty lines are simply skipped. I also couldn't find anything in the javadoc about it. Seems like an unnecessary difference.

Contributor Author

This behavior originates from univocity's parseLine() implementation, which we use to parse individual lines: it returns null when a single line is empty. For text blocks, however, we use a different method, parseAll(), which simply discards empty lines.
I preserved this behavior for compatibility's sake, but it seems we could actually drop it and unify the handling of value arrays and text blocks.


if (useHeadersInDisplayName) {
column = asNamed(requireNonNull(headers)[i] + " = " + column, column);
String header = resolveNullMarker(getHeaders(record).get(i));

Didn't see this before: getHeaders (which calls getHeader on NamedCsvRecord) should likewise not be called in a loop, as it creates new objects.

Contributor Author

Missed that. Updated

Comment on lines +46 to +54
static void validate(CsvSource csvSource) {
validateMaxCharsPerColumn(csvSource.maxCharsPerColumn());
validateDelimiter(csvSource.delimiter(), csvSource.delimiterString(), csvSource);
}

static void validate(CsvFileSource csvFileSource) {
validateMaxCharsPerColumn(csvFileSource.maxCharsPerColumn());
validateDelimiter(csvFileSource.delimiter(), csvFileSource.delimiterString(), csvFileSource);
}

A validation should be added to ensure that the (now deprecated) lineSeparator contains only \r\n, \r, or \n.
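Such a check could look roughly like this (a hypothetical sketch; the method name and message are assumptions, not the PR's actual code):

```java
public class LineSeparatorValidation {

    // Hypothetical sketch: the deprecated lineSeparator() attribute is ignored by
    // the new parser, so any value other than a standard newline sequence would
    // silently change behavior and should be rejected up front.
    static void validateLineSeparator(String lineSeparator) {
        boolean valid = "\r\n".equals(lineSeparator)
                || "\r".equals(lineSeparator)
                || "\n".equals(lineSeparator);
        if (!valid) {
            throw new IllegalArgumentException(
                    "lineSeparator must be \\r, \\n, or \\r\\n");
        }
    }

    public static void main(String[] args) {
        validateLineSeparator("\n"); // accepted
        try {
            validateLineSeparator("|"); // anything else is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```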

@vdmitrienko vdmitrienko requested a review from marcphilipp June 10, 2025 16:08