I am running some experiments on a few of the "Statutes at Large" searchable PDFs on FDsys. The text layer presumably contains raw OCR output, since it contains a lot of errors. I am extracting the text layer and sending it to cite-server running locally.
The following code snippets return false positives:
    Citation.find("pursuant to 5 use 552(a)(1)(E) and") // "use" instead of "usc"
    Citation.find("pursuant to 5 GARBAGE 552(a)(1)(E) and")
    Citation.find("The sum of 27 and 42 is a number between 68 and 70.") // two citations found!
I see the first case ("use") a lot, because US Code citations in historical documents often omit the periods in the abbreviation "USC" (see https://www.gpo.gov/fdsys/pkg/STATUTE-70/content-detail.html: open the PDF, search for the string "use", and note how often it is highlighted in the margins). I think the OCR engine that generated the text guessed "use", a word more common in everyday English than "USC". (To be clear, I'm NOT suggesting that it is the responsibility of the citation finder to anticipate and fix things like OCR errors.)
The last case has been popping up every once in a while, where you have a single word in between two numbers (see #100).
Generally, the issue seems to be that citations of the reporter type are not being properly validated before being returned to the caller of Citation.find.
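To illustrate the kind of validation I mean, here is a minimal standalone sketch. It is not the library's actual implementation: `KNOWN_REPORTERS` and `findReporterCitations` are hypothetical names, and the reporter list is a tiny placeholder. The idea is to build the pattern from a whitelist of reporter abbreviations, so an arbitrary word between two numbers never matches.

```javascript
// Hypothetical whitelist of reporter abbreviations (assumption: a real
// list would be much longer and would come from a maintained data set).
const KNOWN_REPORTERS = ["U.S.", "F.2d", "F.3d", "S. Ct.", "Stat."];

// Escape regex metacharacters so "U.S." matches literal periods only.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Pattern: <volume> <known reporter> <page>. Only whitelisted reporters
// can appear in the middle position.
const reporterAlternation = KNOWN_REPORTERS.map(escapeRegExp).join("|");
const REPORTER_CITE = new RegExp(
  `\\b(\\d{1,4})\\s+(${reporterAlternation})\\s+(\\d{1,5})\\b`,
  "g"
);

// Return every volume/reporter/page triple whose reporter is whitelisted.
function findReporterCitations(text) {
  const results = [];
  for (const m of text.matchAll(REPORTER_CITE)) {
    results.push({
      match: m[0],
      volume: m[1],
      reporter: m[2],
      page: m[3],
      index: m.index,
    });
  }
  return results;
}
```

With this approach, all three strings above yield no results, while something like "410 U.S. 113" is still found. The trade-off is that any reporter missing from the whitelist is silently skipped.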
This is a great writeup of the problem, thank you! I'll take a look into this, though I don't have an ETA for it. If you're using this in something where time is of the essence, let me know -- and I'd welcome a pull request with a fix, if you have one.
@konklone not time sensitive for me. I'm happy to collaborate on this issue though. I wouldn't be able to take it on entirely myself (I'm not too knowledgeable about law and legal citations), but I am quite good with regular expressions. @mlissner each of the examples I mentioned is being interpreted as a citation of type reporter.