-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix improperly parsed subsections #238
Comments
This is a problem in the parser and, yes, the problem is that it's not set up to ID Roman numerals. I've added a new PCRE to support those, along with a change in how subsections are identified. Now identifications are made statefully—the matched PCRE is unshifted to the bottom of the stack, so that on the next loop through (for the subsequent subsection), the first matching attempt will be made with the prior prefix. This is both more efficient and a method of helping to keep identifications from wandering. Note that this was a previously identified identified problem, and definitely a difficult one. Ultimately, I think that the only real solution is going to be to make the identifier matching stateful and intelligent. If the identifier is "i" but we haven't already seen "h", then we're more likely looking at a Roman numeral. If the identifier is "ii" and the prior one was "ih" (and not "i"), then we're looking at an (absurdly high-numbered; 269 items) outlined list. Then there's the second problem—"(a.". I think the problem there is that the prefix was being separated from the text based on the presence of a space, but some Roman numeral prefixes were long enough (e.g., "(iii) ") that they exceeded the five-character chunk of text being hacked off for comparison. I have raised that to six characters, and I'm optimistic that will solve things. |
Identifier matching can be improved by using an ordered list of all types of structural components, and having it look not just at the type of identifier, but whether it's appearing in the expected order. For example:
Presently, "i." matches the same regex as "e.", and so the parser would not store or render this properly. But, by using an ordered list of each element type, there should be no doubt as to whether The trick will be to avoid over-engineering this. While it's logical that subsections should be ordered as A, B, C, D, E, it is by no means impossible that a law could be ordered A, B, D, E, omitting C. Some conditionals will be required to handle these possibilities. Note that this does not solve our ultimate problem:
Because |
Because the problematic code is no longer a part of The State Decoded—it now lives within Subsection Identifier—this fix doesn't show up in our codebase. This was fixed in the changes committed towards resolving this issue. |
The structure of § 8.8A-102 is a mess. There are two problems:
The first problem is especially tricky, because it's tough to distinguish between Roman numeral "i" and the letter "i."
The text was updated successfully, but these errors were encountered: