lack of validation on returned case citations #142

markmatney · 2018-05-18T02:37:24Z

I am running some experiments on some of the "Statutes at Large" searchable PDFs on FDsys. The text layer presumably contains raw OCR output, since it contains a lot of errors. I am extracting the text layer and sending it to cite-server running locally.

The following code snippets return false positives:

Citation.find("pursuant to 5 use 552(a)(1)(E) and") // "use" instead of "usc"
Citation.find("pursuant to 5 GARBAGE 552(a)(1)(E) and")
Citation.find("The sum of 27 and 42 is a number between 68 and 70.") // two citations found!

I see the first case ("use") often because US Code citations in historical documents frequently omit the periods in the abbreviation "USC" (see https://www.gpo.gov/fdsys/pkg/STATUTE-70/content-detail.html, open the PDF, search for the string "use", and note how often it is highlighted in the margins). I think the OCR engine that generated the text guessed "use", a word more common in everyday English than "USC". (To be clear, I'm NOT suggesting that it is the responsibility of the citation finder to anticipate and fix things like OCR errors.)

The last case pops up every once in a while, whenever a single word sits between two numbers (see #100).

Generally, the issue seems to be that citations of the reporter type are not being properly validated before being returned to the caller of Citation.find.
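To illustrate the kind of validation meant here, below is a minimal, self-contained sketch. It does not use the library's actual code or API: the loose "volume word page" pattern, the `KNOWN_REPORTERS` allowlist, and the `findCandidates` and `validateReporters` helpers are all hypothetical names made up for this example, and the allowlist is deliberately tiny and limited to single-token abbreviations.

```javascript
// Hypothetical allowlist of reporter abbreviations (an assumption for this
// sketch; a real list would be much longer and include multi-word reporters).
const KNOWN_REPORTERS = new Set(["U.S.", "F.2d", "F.3d"]);

// Deliberately loose pattern: a number, any single token, another number.
// This is an assumed stand-in for whatever produces the false positives.
const LOOSE_REPORTER = /\b(\d{1,3})\s+(\S+)\s+(\d{1,4})\b/g;

// Collect every "number token number" run as a candidate citation.
function findCandidates(text) {
  const candidates = [];
  for (const m of text.matchAll(LOOSE_REPORTER)) {
    candidates.push({
      match: m[0],
      index: m.index,
      volume: m[1],
      reporter: m[2],
      page: m[3],
    });
  }
  return candidates;
}

// Keep only candidates whose middle token is a known reporter abbreviation.
function validateReporters(candidates, known = KNOWN_REPORTERS) {
  return candidates.filter((c) => known.has(c.reporter));
}

const samples = [
  "pursuant to 5 use 552(a)(1)(E) and",
  "pursuant to 5 GARBAGE 552(a)(1)(E) and",
  "The sum of 27 and 42 is a number between 68 and 70.",
  "See Brown v. Board of Education, 347 U.S. 483 (1954).",
];

for (const text of samples) {
  const kept = validateReporters(findCandidates(text));
  console.log(text, "=>", kept.map((c) => c.match));
}
```

With a check like this, the three inputs above would be dropped because "use", "GARBAGE", and "and" are not in the allowlist, while a citation such as "347 U.S. 483" would still be returned. The trade-off is that any reporter missing from the list is silently lost, so the list has to be maintained.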

konklone · 2018-06-02T19:04:09Z

This is a great writeup of the problem, thank you! I'll take a look into this, though I don't have an ETA for it. If you're using this in something where time is of the essence, let me know -- and I'd welcome a pull request with a fix, if you have one.

mlissner · 2018-06-04T06:17:48Z

Weird. Seems like the regex here would only allow USC or U.S.C.:

https://github.com/unitedstates/citation/blob/master/citations/usc.js#L51

Is it being picked up as a U.S.C. citation?

markmatney · 2018-06-04T23:20:44Z

@konklone not time sensitive for me. I'm happy to collaborate on this issue though. I wouldn't be able to take it on entirely myself (not too knowledgeable about law and legal citations) but I am quite good with regular expressions.
@mlissner each of the examples I mentioned is being interpreted as a citation of type reporter
