When delim=groupmark=`x`, treat `x` as delim unless input is quoted #182

nickrobinson251 · 2023-11-06T12:55:34Z

co-authored with @Drvi

When parsing unquoted numbers, and delim and groupmark are set to the same character, what should happen?
e.g. "1,foo" or "1,\"foo\"" or "1,2.3,foo" or "1,2.3,\"foo\""?

On main there is no documented behaviour for this situation.

In practice current behaviour on main is:

julia> for T in (Int, Float64)
          println(T)
          for c in ("1,foo","1,\"foo\"","1,2.3,foo","1,2.3,\"foo\"")
              print(rpad(repr(c), 15), " => ")
              println(Parsers.xparse(T, c; delim=',', groupmark=','))
           end
       end

Int64
"1,foo"         => Parsers.Result{Int64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=5, val=1)
"1,\"foo\""     => Parsers.Result{Int64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=7, val=1)
"1,2.3,foo"     => Parsers.Result{Int64}(code=`INVALID: OK | DELIMITED | INVALID_DELIMITER `, tlen=6, val=12)
"1,2.3,\"foo\"" => Parsers.Result{Int64}(code=`INVALID: OK | DELIMITED | INVALID_DELIMITER `, tlen=6, val=12)
Float64
"1,foo"         => Parsers.Result{Float64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=5, val=0.0)
"1,\"foo\""     => Parsers.Result{Float64}(code=`INVALID: EOF | INVALID_DELIMITER `, tlen=7, val=0.0)
"1,2.3,foo"     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=6, val=12.3)
"1,2.3,\"foo\"" => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=6, val=12.3)

With this PR the proposed behaviour is:

Int64
"1,foo"         => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,\"foo\""     => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,2.3,foo"     => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
"1,2.3,\"foo\"" => Parsers.Result{Int64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1)
Float64
"1,foo"         => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,\"foo\""     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,2.3,foo"     => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)
"1,2.3,\"foo\"" => Parsers.Result{Float64}(code=`SUCCESS: OK | DELIMITED `, tlen=2, val=1.0)

That is, the current behaviour on main is to "try" to treat the character as a groupmark if possible, e.g. with Float64 the input "1,2.3,foo" parses to 12.3 with the first , being interpreted as a groupmark due to the following number, but "1,foo" parses to 0.0 (not 1.0). And withInt "1,foo" parses to 1 but is marked INVALID whereas with Float64 1,2.3,foo parses to 12.3 but is marked OK. These inconsistencies are due to the parsers having no principled approach to resolving the ambiguous , character.

This PR changes the parsers to instead treat the character as a delim unless inside a quoted field. i.e. when groupmark and delim are the same (e.g. both ,) the input 1000,foo and \"1,000\",foo" both parse to 1000, but 1,000,foo parses to 1.

This brings groupmark inline with how we handle special or ambiguous characters when parsing to String, i.e. they can only appear in quoted values

Is this a breaking change?

The old behaviour was not documented, inconsistent, and an artefact of implementation details. However, this PR does change the behaviour of some existing tests, since unfortunately our tests were not written with a careful eye for delim==groupmark cases.

The tests actually tested both that we could and that we couldn't parse numbers in these cases, e.g. we had both

@test Parsers.xparse(Float64, "100,000,000.99,100"; groupmark=',', openquotechar='"', closequotechar='"') == Parsers.Result{Float64}(Int16(9), 15, 1.0000000099e8)

i.e. testing we got code='SUCCESS: OK | DELIMITED ' and val=1.0000000099e8
and

res = Parsers.xparse(Int64, "100,000,000,aaa"; groupmark=',')
@test res.code == EOF | INVALID | INVALID_DELIMITER

i.e. testing we code code='INVALID: EOF | INVALID_DELIMITER ' and not testing the value

So whilst it's always possible for someone to come to rely on a bug, i don't think there's any consistent behaviour here that someone could reasonably have been relying upon

Secondly, all tests in CSV.jl, WeakRefStrings.jl, JSON.jl, InlineStrings.jl, PowerFlowData.jl, JSON3.jl pass with this change, which gives some further validation that the old behaviour wasn't relied upon (at least as far as these test suites reflect usage).

Since breaking changes are not entirely free (all the dependency packages need to update, even though they don't require any code changes), i'll mark this as a new feature (we now support groupmark==delim with consistent semantics).

codecov · 2023-11-08T12:45:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d1c6fc5) 88.36% compared to head (d7e15c6) 88.42%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #182      +/-   ##
==========================================
+ Coverage   88.36%   88.42%   +0.06%     
==========================================
  Files          10       10              
  Lines        1882     1892      +10     
==========================================
+ Hits         1663     1673      +10     
  Misses        219      219

Files	Coverage Δ
src/Parsers.jl	`88.55% <100.00%> (+0.59%)`	⬆️
src/floats.jl	`88.93% <100.00%> (ø)`
src/ints.jl	`93.54% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

test/runtests.jl

Drvi

LGTM!

nickrobinson251 and others added 3 commits November 6, 2023 12:48

When delim=groupmark=x, treat x as delim

c82720f

Move groupmark checker to helper function

9ff2db6

Adapt tests for new groupmark handling

127cfae

Drvi reviewed Nov 8, 2023

View reviewed changes

test/runtests.jl Show resolved Hide resolved

Simplify tests

a9836ca

nickrobinson251 mentioned this pull request Nov 9, 2023

Ensure decimal being the same as delim is handled consistently #185

Open

Drvi approved these changes Nov 9, 2023

View reviewed changes

nickrobinson251 added 2 commits November 9, 2023 13:58

More test cases (based on old test cases)

8494136

Bump version

d7e15c6

nickrobinson251 changed the title ~~When delim=groupmark=x, treat x as delim~~ When delim=groupmark=x, treat x as delim unless input is quoted Nov 9, 2023

nickrobinson251 marked this pull request as ready for review November 9, 2023 19:15

nickrobinson251 merged commit 0eb5f46 into main Nov 9, 2023
12 checks passed

nickrobinson251 deleted the npr-delim-outweighs-groupmark branch November 9, 2023 19:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When delim=groupmark=`x`, treat `x` as delim unless input is quoted #182

When delim=groupmark=`x`, treat `x` as delim unless input is quoted #182

nickrobinson251 commented Nov 6, 2023 •

edited

Loading

codecov bot commented Nov 8, 2023 •

edited

Loading

Drvi left a comment

When delim=groupmark=x, treat x as delim unless input is quoted #182

When delim=groupmark=x, treat x as delim unless input is quoted #182

Conversation

nickrobinson251 commented Nov 6, 2023 • edited Loading

codecov bot commented Nov 8, 2023 • edited Loading

Codecov Report

Drvi left a comment

Choose a reason for hiding this comment

When delim=groupmark=`x`, treat `x` as delim unless input is quoted #182

When delim=groupmark=`x`, treat `x` as delim unless input is quoted #182

nickrobinson251 commented Nov 6, 2023 •

edited

Loading

codecov bot commented Nov 8, 2023 •

edited

Loading