Improve handling of Roman numerals #1
Comments
This is a step towards solving issue #1, which requires stateful knowledge of the ordinal values of prefixes.
Per #1. This is a lazy solution, but it's also very achievable, so it has that going for it.
This is an interesting puzzle for sure. If you're at all inclined to work [...] It needs some upkeep, which I'd be happy to provide, but its basic use is [...] One slightly creative way to come at this might be to just detect ambiguity [...]
On Wed, Jun 12, 2013 at 2:53 PM, Waldo Jaquith wrote:
Because this is to be used within The State Decoded, unfortunately it really should be PHP. The good news is that I've solved this conceptually; it only remains to execute it. I'm going to break up what's now one pass into two, with the second pass looking both back and forward to see if the identified structural unit is preceded and followed by the expected identifiers, giving special attention to any Roman numerals that could plausibly be letters, and vice versa. "x" should have been preceded by a "w" and followed by a "y" (if, indeed, the document continues to that point). If "x" is preceded by an "ix," then we know that it's actually a Roman numeral. That's why I'm storing the list of viable identifiers in order, which I'm barely using at this point. All of which sounds a lot like what you've already done in schemes.py, which seems like a good sign. :)

The trick is going to be recognizing that hierarchical documents don't necessarily proceed properly, and being able to deal with that. Mistakes happen, as I'm sure you've seen in the structures of laws. Having a human touch it would be a worst-case scenario (as you can imagine, that could be a real mess when importing 40,000 laws), but I think you're right, and it's inevitable that such circumstances are possible.
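To make the second pass concrete, here is a minimal sketch of the look-back/look-forward check. It is written in Python purely for illustration (the real implementation would need to be PHP, as noted above), and every name in it (`SCHEMES`, `score`, `classify`) is hypothetical rather than taken from the project or from schemes.py. It also assumes a flat list of identifiers that the first pass placed at the same level, which is a simplification of a real nested document.

```python
import string

# Ordered lists of viable identifiers, one per scheme (assumed, truncated at "xx").
ALPHA = list(string.ascii_lowercase)
ROMAN = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x",
         "xi", "xii", "xiii", "xiv", "xv", "xvi", "xvii", "xviii", "xix", "xx"]
SCHEMES = {"alpha": ALPHA, "roman": ROMAN}


def score(scheme, identifiers, index):
    """Count the neighbours that agree with reading identifiers[index] under scheme.

    Returns -1 if the identifier is not valid in that scheme at all.
    """
    ordered = SCHEMES[scheme]
    ident = identifiers[index]
    if ident not in ordered:
        return -1
    pos = ordered.index(ident)
    points = 0
    # Look back: is the previous identifier the expected predecessor?
    if index > 0 and pos > 0 and identifiers[index - 1] == ordered[pos - 1]:
        points += 1
    # Look forward: is the next identifier the expected successor?
    if (index + 1 < len(identifiers) and pos + 1 < len(ordered)
            and identifiers[index + 1] == ordered[pos + 1]):
        points += 1
    return points


def classify(identifiers, index):
    """Return 'alpha', 'roman', 'ambiguous' (tie) or 'unknown' (fits no scheme)."""
    scores = {name: score(name, identifiers, index) for name in SCHEMES}
    best = max(scores.values())
    if best < 0:
        return "unknown"
    winners = [name for name, value in scores.items() if value == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


print(classify(["w", "x", "y"], 1))       # neighbours "w" and "y" support a letter
print(classify(["viii", "ix", "x"], 2))   # preceded by "ix", so a Roman numeral
print(classify(["h", "i", "i"], 2))       # neither reading is supported
```

Ties are reported as `"ambiguous"` instead of being guessed, which leaves room for the "detect ambiguity" approach and, in the worst case, human review.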
@twneale points out a use case that is not allowed for, but that should be:
This is a non-trivial modification, because it requires statefulness: an understanding, upon “realizing” that it’s in the midst of a list of Roman numerals, that it must backtrack, reevaluate where that list began, and modify the ancestry of those subsections accordingly. If it encountered only a single subsection of (i), that's especially problematic, because it’s two “i”s in a row, and there’s no hint available that one of them should be a Roman numeral and, thus, a child of (h). That requires an understanding of order (alphabetic, numeric, and Roman numeric) that is not currently present in this, but that seems conceptually straightforward to add.

Thom has found the example problem within the U.S. Code, so it’s not merely hypothetical.
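One way to picture the missing "understanding of order" is a helper that reports every ordinal value an identifier could have. The sketch below is an assumption-laden illustration in Python, not the project's code: `roman_to_int` and `possible_ordinals` are hypothetical names, only single-letter alphabetic identifiers are handled, and Roman numerals are accepted only in canonical lowercase or uppercase form.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Canonical Roman numerals from 1 to 3999 (the empty string is rejected separately).
ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")


def roman_to_int(text):
    """Return the value of a canonical Roman numeral, or None if it is not one."""
    text = text.lower()
    if not text or not ROMAN_RE.fullmatch(text):
        return None
    total = 0
    largest_seen = 0
    # Scan right to left: a symbol smaller than one already seen is subtractive.
    for ch in reversed(text):
        value = ROMAN_VALUES[ch]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


def possible_ordinals(identifier):
    """Map each scheme the identifier could belong to onto its ordinal value."""
    readings = {}
    if identifier.isascii() and identifier.isdigit():
        readings["numeric"] = int(identifier)
    elif len(identifier) == 1 and identifier.isascii() and identifier.isalpha():
        readings["alpha"] = ord(identifier.lower()) - ord("a") + 1
    roman = roman_to_int(identifier)
    if roman is not None:
        readings["roman"] = roman
    return readings


print(possible_ordinals("h"))   # only a letter
print(possible_ordinals("i"))   # both a letter and a Roman numeral
print(possible_ordinals("ix"))  # only a Roman numeral
```

An identifier with more than one reading, such as "i" or "x", is exactly the case that needs context from its neighbours before its place in the hierarchy can be decided.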
Realistically, this is two problems. The first is the ability to recognize and handle Roman numerals properly, which is to say to understand that "i" (the letter) isn't necessarily the same as "i" (the Roman numeral). The second is the ability to look ahead and understand the unusual-but-extant problem of the Roman numeral "i" immediately following "h."