Fix improperly parsed subsections #238

waldoj · 2013-02-26T15:02:54Z

The structure of § 8.8A-102 is a mess. There are two problems:

Roman numerals are in use as subsection identifiers, but we have no regular expression to identify them.
Sometimes closing-section identifier label parentheses are being replaced with periods—e.g., "(a."

The first problem is especially tricky, because it's tough to distinguish between Roman numeral "i" and the letter "i."

waldoj · 2013-04-02T19:41:59Z

This is a problem in the parser and, yes, the problem is that it's not set up to ID Roman numerals. I've added a new PCRE to support those, along with a change in how subsections are identified. Now identifications are made statefully—the matched PCRE is unshifted to the bottom of the stack, so that on the next loop through (for the subsequent subsection), the first matching attempt will be made with the prior prefix. This is both more efficient and a method of helping to keep identifications from wandering.

Note that this was a previously identified identified problem, and definitely a difficult one.

Ultimately, I think that the only real solution is going to be to make the identifier matching stateful and intelligent. If the identifier is "i" but we haven't already seen "h", then we're more likely looking at a Roman numeral. If the identifier is "ii" and the prior one was "ih" (and not "i"), then we're looking at an (absurdly high-numbered; 269 items) outlined list.

Then there's the second problem—"(a.". I think the problem there is that the prefix was being separated from the text based on the presence of a space, but some Roman numeral prefixes were long enough (e.g., "(iii) ") that they exceeded the five-character chunk of text being hacked off for comparison. I have raised that to six characters, and I'm optimistic that will solve things.

waldoj · 2013-05-29T14:04:13Z

Identifier matching can be improved by using an ordered list of all types of structural components, and having it look not just at the type of identifier, but whether it's appearing in the expected order.

For example:

e.

i.

ii.

iii.

iv.

f.

g.

h.

i.

j.

Presently, "i." matches the same regex as "e.", and so the parser would not store or render this properly. But, by using an ordered list of each element type, there should be no doubt as to whether i. is a child element of e., because f. should be the next element.

The trick will be to avoid over-engineering this. While it's logical that subsections should be ordered as A, B, C, D, E, it is by no means impossible that a law could be ordered A, B, D, E, omitting C. Some conditionals will be required to handle these possibilities.

Note that this does not solve our ultimate problem:

(h) Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    (i) Integer tincidunt, sem eu pretium condimentum.
    (ii) Sed dui justo, euismod nec mattis a, aliquet quis ante.
(i) Nulla dapibus sem et ligula consectetur vitae sagittis arcu varius.
(j) Proin a mauris sit amet enim ullamcorper ultricies vitae id lectus.

Because (i) follows (h), this will be tripped up just the same. Solving this final problem will involve some look-ahead functionality.

waldoj · 2013-06-12T19:06:48Z

Because the problematic code is no longer a part of The State Decoded—it now lives within Subsection Identifier—this fix doesn't show up in our codebase. This was fixed in the changes committed towards resolving this issue.

waldoj mentioned this issue Jun 1, 2013

add parens around section prefixes (just use CSS) openlawdc/browser#60

Merged

waldoj closed this as completed Jun 12, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix improperly parsed subsections #238

Fix improperly parsed subsections #238

waldoj commented Feb 26, 2013

waldoj commented Apr 2, 2013

waldoj commented May 29, 2013

waldoj commented Jun 12, 2013

Fix improperly parsed subsections #238

Fix improperly parsed subsections #238

Comments

waldoj commented Feb 26, 2013

waldoj commented Apr 2, 2013

waldoj commented May 29, 2013

waldoj commented Jun 12, 2013