When delim=groupmark=x
, treat x
as delim unless input is quoted
#182
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
co-authored with @Drvi
When parsing unquoted numbers, and
delim
andgroupmark
are set to the same character, what should happen?e.g.
"1,foo"
or"1,\"foo\""
or"1,2.3,foo"
or"1,2.3,\"foo\""
?On
main
there is no documented behaviour for this situation.In practice current behaviour on
main
is:With this PR the proposed behaviour is:
That is, the current behaviour on
main
is to "try" to treat the character as a groupmark if possible, e.g. withFloat64
the input"1,2.3,foo"
parses to12.3
with the first,
being interpreted as a groupmark due to the following number, but"1,foo"
parses to0.0
(not1.0
). And withInt
"1,foo"
parses to1
but is markedINVALID
whereas withFloat64
1,2.3,foo
parses to12.3
but is markedOK
. These inconsistencies are due to the parsers having no principled approach to resolving the ambiguous,
character.This PR changes the parsers to instead treat the character as a delim unless inside a quoted field. i.e. when groupmark and delim are the same (e.g. both
,
) the input1000,foo
and\"1,000\",foo"
both parse to1000
, but1,000,foo
parses to1
.This brings groupmark inline with how we handle special or ambiguous characters when parsing to
String
, i.e. they can only appear in quoted valuesIs this a breaking change?
The old behaviour was not documented, inconsistent, and an artefact of implementation details. However, this PR does change the behaviour of some existing tests, since unfortunately our tests were not written with a careful eye for
delim==groupmark
cases.The tests actually tested both that we could and that we couldn't parse numbers in these cases, e.g. we had both
i.e. testing we got
code='SUCCESS: OK | DELIMITED '
andval=1.0000000099e8
and
i.e. testing we code
code='INVALID: EOF | INVALID_DELIMITER '
and not testing the valueSo whilst it's always possible for someone to come to rely on a bug, i don't think there's any consistent behaviour here that someone could reasonably have been relying upon
Secondly, all tests in CSV.jl, WeakRefStrings.jl, JSON.jl, InlineStrings.jl, PowerFlowData.jl, JSON3.jl pass with this change, which gives some further validation that the old behaviour wasn't relied upon (at least as far as these test suites reflect usage).
Since breaking changes are not entirely free (all the dependency packages need to update, even though they don't require any code changes), i'll mark this as a new feature (we now support
groupmark==delim
with consistent semantics).