Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix improperly parsed subsections #238

Closed
waldoj opened this issue Feb 26, 2013 · 3 comments
Closed

Fix improperly parsed subsections #238

waldoj opened this issue Feb 26, 2013 · 3 comments
Milestone

Comments

@waldoj
Copy link
Member

waldoj commented Feb 26, 2013

The structure of § 8.8A-102 is a mess. There are two problems:

  1. Roman numerals are in use as subsection identifiers, but we have no regular expression to identify them.
  2. Sometimes closing-section identifier label parentheses are being replaced with periods—e.g., "(a."

The first problem is especially tricky, because it's tough to distinguish between Roman numeral "i" and the letter "i."

@waldoj
Copy link
Member Author

waldoj commented Apr 2, 2013

This is a problem in the parser and, yes, the problem is that it's not set up to ID Roman numerals. I've added a new PCRE to support those, along with a change in how subsections are identified. Now identifications are made statefully—the matched PCRE is unshifted to the bottom of the stack, so that on the next loop through (for the subsequent subsection), the first matching attempt will be made with the prior prefix. This is both more efficient and a method of helping to keep identifications from wandering.

Note that this was a previously identified identified problem, and definitely a difficult one.

Ultimately, I think that the only real solution is going to be to make the identifier matching stateful and intelligent. If the identifier is "i" but we haven't already seen "h", then we're more likely looking at a Roman numeral. If the identifier is "ii" and the prior one was "ih" (and not "i"), then we're looking at an (absurdly high-numbered; 269 items) outlined list.

Then there's the second problem—"(a.". I think the problem there is that the prefix was being separated from the text based on the presence of a space, but some Roman numeral prefixes were long enough (e.g., "(iii) ") that they exceeded the five-character chunk of text being hacked off for comparison. I have raised that to six characters, and I'm optimistic that will solve things.

@waldoj
Copy link
Member Author

waldoj commented May 29, 2013

Identifier matching can be improved by using an ordered list of all types of structural components, and having it look not just at the type of identifier, but whether it's appearing in the expected order.

For example:

  • e.
    • i.
    • ii.
    • iii.
    • iv.
  • f.
  • g.
  • h.
  • i.
  • j.

Presently, "i." matches the same regex as "e.", and so the parser would not store or render this properly. But, by using an ordered list of each element type, there should be no doubt as to whether i. is a child element of e., because f. should be the next element.

The trick will be to avoid over-engineering this. While it's logical that subsections should be ordered as A, B, C, D, E, it is by no means impossible that a law could be ordered A, B, D, E, omitting C. Some conditionals will be required to handle these possibilities.

Note that this does not solve our ultimate problem:

(h) Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    (i) Integer tincidunt, sem eu pretium condimentum.
    (ii) Sed dui justo, euismod nec mattis a, aliquet quis ante.
(i) Nulla dapibus sem et ligula consectetur vitae sagittis arcu varius.
(j) Proin a mauris sit amet enim ullamcorper ultricies vitae id lectus.

Because (i) follows (h), this will be tripped up just the same. Solving this final problem will involve some look-ahead functionality.

@waldoj
Copy link
Member Author

waldoj commented Jun 12, 2013

Because the problematic code is no longer a part of The State Decoded—it now lives within Subsection Identifier—this fix doesn't show up in our codebase. This was fixed in the changes committed towards resolving this issue.

@waldoj waldoj closed this as completed Jun 12, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant