Eliminate duplicate results #18
happening for USDOT, too. so 👍 |
Is there a check to see if a file has already been saved in the db? I looked through the code and couldn't determine whether that was happening -- it looks like the celery task always creates a new model. If that's the case, then if a bunch of celery workers all start working on the same site at the same time, they might all add duplicate records. |
I note that multiple results are not clustered together in the database, as in the public display. Here are some representative records from the database:
That's four records, recorded twice consecutively, in the same order. Seems to me that a constraint should be put on the database to prohibit the same file URL from being stored twice (assuming that SQLite supports that). But, of course, the cause of this needs to be determined. @bsmithgall's suggestion sounds like a good one. |
This could be a promising lead: https://www.sqlite.org/lang_createtable.html#uniqueconst |
My fear is that what'll actually happen here is that SQLite will throw an error when a duplicate insertion attempt is made, breaking the whole process. But I'm going to try it out and see what happens! |
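SQLite does support `UNIQUE` constraints, and the feared behavior is real: a duplicate insert raises an error rather than being silently dropped. A minimal sketch (the `search_result` table and `url` column names are hypothetical, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column schema with a UNIQUE constraint on the file URL.
conn.execute("CREATE TABLE search_result (url TEXT UNIQUE)")

conn.execute("INSERT INTO search_result (url) VALUES ('https://example.com/a.xls')")
try:
    # Second insert of the same URL violates the constraint...
    conn.execute("INSERT INTO search_result (url) VALUES ('https://example.com/a.xls')")
except sqlite3.IntegrityError as e:
    # ...and raises IntegrityError instead of crashing the whole process,
    # provided the caller catches it.
    print("duplicate rejected:", e)
```

So the constraint alone doesn't break anything, as long as the insert is wrapped in a `try`/`except`.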
@waldoj I think you'll probably have to catch whatever exception is raised and just |
Yeah, assuming exception handling already exists for those inserts (I haven't checked), it might not be a problem at all. |
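If adding exception handling everywhere is a hassle, SQLite can also swallow the duplicates itself with `INSERT OR IGNORE` -- a sketch using the same hypothetical one-column schema as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_result (url TEXT UNIQUE)")  # hypothetical schema

# Insert the same URL three times; OR IGNORE drops the duplicates
# without raising IntegrityError.
for url in ["https://example.com/a.xls"] * 3:
    conn.execute("INSERT OR IGNORE INTO search_result (url) VALUES (?)", (url,))

count = conn.execute("SELECT COUNT(*) FROM search_result").fetchone()[0]
```

Only one row survives, and no exception handling is needed at the call site.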
I'm happy to take a look at this and put in a PR later today; just need to finish up some work on a different project first. |
Here's an open question about this -- from an efficiency perspective: is it better to do a bunch of |
I want to leave this open until we're sure this is fixed (in part to prevent duplicate issues from being opened). |
Unfortunately, I haven't gotten this to work. I get this error:
Why it cannot import it, I don't have any idea. I'm going to keep working on debugging this. |
In the meantime, now we're getting the same file showing up 21 times consecutively. My suspicion is that result paging is the problem. I think that LMGTDFY is finding X results—say, 1,000 of them—and paging through them 50 at a time—but not actually advancing through the pages. The counter increments, but it's fetching exactly the same set of 50 results 20 times. @knowtheory, this seems like a pretty hefty bug. Is it plausible? |
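The suspected failure mode is easy to reproduce in isolation: the page counter increments, but the offset passed to the search API never does. A self-contained sketch (the function names and page size are illustrative, not LMGTDFY's actual code):

```python
def fetch_page(offset, page_size=50, total=150):
    """Stand-in for one paged API call; returns result IDs for one page."""
    return list(range(offset, min(offset + page_size, total)))

def fetch_all_buggy(total=150, page_size=50):
    """Counter advances, but the offset never does, so every
    iteration re-fetches the same first page of 50 results."""
    results = []
    offset = 0
    for page in range(total // page_size):
        results.extend(fetch_page(offset, page_size, total))  # offset never updated
    return results

def fetch_all_fixed(total=150, page_size=50):
    """Advance the offset with the page counter to actually walk the pages."""
    results = []
    for page in range(total // page_size):
        results.extend(fetch_page(page * page_size, page_size, total))
    return results
```

The buggy loop returns the right *count* of results (which would make the bug easy to miss) but only 50 distinct ones; the fixed loop returns 150 distinct results.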
@waldoj Sorry this is my bad; try attaching DoesNotExist to the object:
|
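For context on the fix: Django attaches a `DoesNotExist` exception class to each model, so the lookup failure is caught on the model itself (`except SomeModel.DoesNotExist:`) rather than via a module-level import. A minimal stdlib stand-in for the pattern (the `SearchResult` model name is hypothetical; in real Django this is what `objects.get()` / `objects.get_or_create()` provide):

```python
class SearchResult:
    """Minimal stand-in for a Django model with a per-model exception."""

    class DoesNotExist(Exception):
        """Raised when a lookup finds no matching row."""

    _rows = {}  # stands in for the database table

    def __init__(self, url):
        self.url = url

    @classmethod
    def get(cls, url):
        try:
            return cls._rows[url]
        except KeyError:
            raise cls.DoesNotExist(url) from None

    @classmethod
    def get_or_create(cls, url):
        """Return (object, created); only creates when the lookup misses."""
        try:
            return cls.get(url), False
        except cls.DoesNotExist:
            obj = cls._rows[url] = cls(url)
            return obj, True
```

The second `get_or_create` for the same URL returns the existing row with `created=False`, which is the behavior that would keep concurrent-ish inserts from piling up duplicates.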
That fixed it, @bsmithgall! Or, rather, that prevents the error from appearing—unfortunately, duplicates are still being indexed, and it's not immediately obvious why that should be so. I'm going to walk through this in a deliberate way and figure out what's going wrong here. I appreciate your efforts! |
Notably, the results are duplicated in the database in windows. That is, if the file |
For this query, there are 9 results. But 36 results were retrieved (4 times each), and then 72 were loaded into SQLite (2 times each). Now, this was the 2nd query for this domain name, so 36 becoming 72 might have been a result of bad duplicate matching. A clear bug, though, is the repeated retrieval of the same results. Again, this looks like a windowing problem. It's surely somewhere within here. |
By putting a unique constraint on
Based on a few test queries, this results in better data than the status quo. The caveat is that LMGTDFY thinks that it's never done searching, since it never reaches the end of the results array. The page says:
So the page reloads every 10 seconds, but of course no more results ever appear. The next step, I think, is to work on getting this PR debugged, to keep duplicates from being sent to the database in the first place. |
We're sometimes seeing each dataset appear six times, e.g., for Louisville, KY:
We've got to figure out what the problem is here and fix it.