This repository has been archived by the owner on May 5, 2020. It is now read-only.

Eliminate duplicate results #18

Open
waldoj opened this issue Feb 18, 2015 · 17 comments · Fixed by #34

waldoj (Member) commented Feb 18, 2015

We're sometimes seeing each dataset appear six times, e.g., for Louisville, KY:

[screenshot "repeats": the same datasets listed six times each]

We've got to figure out what the problem is here and fix it.

dsmorgan77 commented

Happening for USDOT, too. So 👍

bsmithgall (Contributor) commented

Is there checking to see whether a file has already been saved in the db? I looked through the code and couldn't determine whether that was happening -- it looks like when you kick out to the Celery task, it always creates a new model. If that's the case, then a bunch of Celery workers all starting on the same site at the same time might each add duplicate records.
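The guard being asked about, and why it isn't enough on its own, can be sketched in plain sqlite3 (hypothetical table and URLs, not the project's actual Django models): a SELECT-then-INSERT check stops duplicates within one worker, but two workers can both pass the SELECT before either INSERTs, which is why a database-level constraint is the more robust fix.

```python
import sqlite3

# Hypothetical schema for illustration only; the real project stores
# results through Django models, not raw SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (url TEXT)")

def save_if_new(url):
    # App-level guard: insert only if the URL isn't already stored.
    # NOTE: between the SELECT and the INSERT, another worker can slip
    # in, so this alone doesn't prevent duplicates under concurrency.
    exists = conn.execute(
        "SELECT 1 FROM result WHERE url = ?", (url,)
    ).fetchone()
    if exists is None:
        conn.execute("INSERT INTO result (url) VALUES (?)", (url,))
        conn.commit()

for _ in range(3):  # simulate repeated saves of the same file
    save_if_new("http://example.gov/data.xls")

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 1 within a single process
```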

waldoj (Member, Author) commented Feb 19, 2015

I note that multiple results are not clustered together in the database, as in the public display. Here are some representative records from the database:

12850   http://www.pittsburghpa.gov/publicworks/files/commercial_haulers_10-11.xls
12851   http://www.pittsburghpa.gov/police/motorunit/motorskills/files/final_results_data.xls
12852   http://www.pittsburghpa.gov/landrecycling/ast/MI_PA_LB_cklst_chart_rev_3-29-2010.xls
12853   http://www.pittsburghpa.gov/neighborhoodinitiatives/files/CDBG_Vacant_Lot_Clean_Up_Eligible_Properties.xls
12854   http://www.pittsburghpa.gov/publicworks/files/commercial_haulers_10-11.xls
12855   http://www.pittsburghpa.gov/police/motorunit/motorskills/files/final_results_data.xls
12856   http://www.pittsburghpa.gov/landrecycling/ast/MI_PA_LB_cklst_chart_rev_3-29-2010.xls
12857   http://www.pittsburghpa.gov/neighborhoodinitiatives/files/CDBG_Vacant_Lot_Clean_Up_Eligible_Properties.xls

That's four records, recorded twice consecutively, in the same order. Seems to me that a constraint should be put on the database to prohibit the same file URL from being stored twice (assuming that SQLite supports that). But, of course, the cause of this needs to be determined. @bsmithgall's suggestion sounds like a good one.
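For what it's worth, SQLite does support that kind of constraint. A minimal sketch (hypothetical table name, not the real schema) of how a UNIQUE column behaves, including catching the resulting IntegrityError so a duplicate doesn't kill the whole run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE makes SQLite reject a second row with the same url.
conn.execute("CREATE TABLE result (url TEXT UNIQUE)")

urls = [
    "http://example.gov/a.xls",
    "http://example.gov/b.xls",
    "http://example.gov/a.xls",  # duplicate
]

for url in urls:
    try:
        conn.execute("INSERT INTO result (url) VALUES (?)", (url,))
    except sqlite3.IntegrityError:
        # Constraint rejected the duplicate; skip it and keep going
        # rather than letting the whole process break.
        continue
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 2
```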

bsmithgall (Contributor) commented

This could be a promising lead: https://www.sqlite.org/lang_createtable.html#uniqueconst

waldoj (Member, Author) commented Feb 21, 2015

My fear is that what'll actually happen here is that SQLite will throw an error when a duplicate insertion attempt is made, breaking the whole process. But I'm going to try it out and see what happens!

bsmithgall (Contributor) commented

@waldoj I think you'll probably have to catch whatever exception is raised and just pass or continue.

waldoj (Member, Author) commented Feb 21, 2015

Yeah, assuming exception handling already exists for those inserts (I haven't checked), it might not be a problem at all.

bsmithgall (Contributor) commented

I'm happy to take a look at this and put in a PR later today; just need to finish up some work on a different project first.

waldoj added the bug label Feb 21, 2015
bsmithgall (Contributor) commented

Here's an open question about this, from an efficiency perspective: is it better to do a bunch of update_or_create calls, or to do a get call into the DB and then a bulk insert of the things that aren't there?
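One way to frame that trade-off, in a plain-sqlite3 sketch with hypothetical data (the real code would go through the Django ORM): per-row upserts cost a statement per record, while fetching the existing URLs first allows a single bulk insert of only the missing rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (url TEXT UNIQUE)")
conn.execute("INSERT INTO result (url) VALUES ('http://example.gov/a.xls')")
conn.commit()

incoming = [
    "http://example.gov/a.xls",  # already stored
    "http://example.gov/b.xls",
    "http://example.gov/c.xls",
]

# Option 1: one upsert-style statement per record (simple, but a
# round trip per row):
#   for url in incoming:
#       conn.execute("INSERT OR IGNORE INTO result (url) VALUES (?)", (url,))

# Option 2: one SELECT to learn what's already there, then a single
# bulk insert of only the rows that are missing.
existing = {row[0] for row in conn.execute("SELECT url FROM result")}
new_rows = [(u,) for u in incoming if u not in existing]
conn.executemany("INSERT INTO result (url) VALUES (?)", new_rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 3
```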

waldoj (Member, Author) commented Feb 22, 2015

I want to leave this open until we're sure this is fixed (in part to prevent duplicate issues from being opened).

waldoj (Member, Author) commented Mar 24, 2015

Unfortunately, I haven't gotten this to work. I get this error:

File "/vol/www/lmgtdfy.usopendata.org/htdocs/lmgtfy/tasks.py", line 8, in <module>
  from django.core.exceptions import DoesNotExist
ImportError: cannot import name DoesNotExist

Why it cannot import it, I don't have any idea. I'm going to keep working on debugging this.

waldoj (Member, Author) commented Mar 24, 2015

In the meantime, now we're getting the same file showing up 21 times consecutively. My suspicion is that result paging is the problem. I think that LMGTDFY is finding X results—say, 1,000 of them—and paging through them 50 at a time—but not actually advancing through the pages. The counter increments, but it's fetching exactly the same set of 50 results 20 times. @knowtheory, this seems like a pretty hefty bug. Is it plausible?
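If the offset really isn't advancing, the corrected loop would look something like this (a sketch with a stub `fetch_page` standing in for the actual Bing API call; names are hypothetical):

```python
# Stub standing in for the real search-API call: returns up to
# page_size results starting at `offset`, and an empty list when the
# results are exhausted.
ALL_RESULTS = [f"result-{i}" for i in range(137)]

def fetch_page(offset, page_size=50):
    return ALL_RESULTS[offset:offset + page_size]

collected = []
offset = 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    collected.extend(page)
    # The crucial step: advance the offset by the number of results
    # actually returned. If this line is missing, every iteration
    # refetches exactly the same page, producing the duplicates seen here.
    offset += len(page)

print(len(collected))  # 137, no duplicates
```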

bsmithgall (Contributor) commented

@waldoj Sorry, this is my bad; use the DoesNotExist exception that's attached to the model class instead of importing it:

The DoesNotExist exception is raised when an object is not found for the given parameters of a query. Django provides a DoesNotExist exception as an attribute of each model class to identify the class of object that could not be found and to allow you to catch a particular model class with try/except.

https://docs.djangoproject.com/en/1.7/ref/exceptions/
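The pattern the Django docs describe can be mimicked in plain Python (a stub for illustration, not real Django code; `DomainSearchResult` here is a stand-in class, not the actual model):

```python
class ObjectDoesNotExist(Exception):
    """Base class, analogous to django.core.exceptions.ObjectDoesNotExist."""

class DomainSearchResult:  # stand-in for the real Django model
    # Each model carries its own DoesNotExist as a class attribute, so
    # callers catch DomainSearchResult.DoesNotExist rather than importing
    # a global DoesNotExist (which is why the import above failed).
    class DoesNotExist(ObjectDoesNotExist):
        pass

    _rows = {}

    @classmethod
    def get(cls, url):
        try:
            return cls._rows[url]
        except KeyError:
            raise cls.DoesNotExist(url)

caught = False
try:
    DomainSearchResult.get("http://example.gov/missing.xls")
except DomainSearchResult.DoesNotExist:
    caught = True
print(caught)  # True
```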

waldoj (Member, Author) commented Mar 24, 2015

That fixed it, @bsmithgall! Or, rather, that prevents the error from appearing. Unfortunately, duplicates are still being indexed, and it's not immediately obvious why that should be so. I'm going to walk through this in a deliberate way and figure out what's going wrong here. I appreciate your efforts!

waldoj (Member, Author) commented Mar 31, 2015

Notably, the results are duplicated in the database in windows. That is, if the file UCM424433.xls appears 20 times, it's not rows 5,000–5,020 in the database. Instead, the same cohort of files appears over and over again. That file, in fact, is result 1, 42, 115, 142, 169, 196, 223, 250, 277, 304, 331, 358, 385, 412, and 439. (I stopped counting there.) It's always preceded and followed by the same results. Many of these results are incremented by 27, which is a weird window size.

waldoj (Member, Author) commented May 4, 2016

For this query, there are 9 results. But 36 results were retrieved (4 times each), and then 72 were loaded into SQLite (2 times each). Now, this was the 2nd query for this domain name, so 36 becoming 72 might have been a result of bad duplicate matching. A clear bug, though, is the repeated retrieval of the same results. Again, this looks like a windowing problem. It's surely somewhere within here.

waldoj (Member, Author) commented May 4, 2016

By putting a unique constraint on lmgtfy_domainsearchresult.result, we're not getting duplicates. That's the good news. The bad news is that, at some point, a duplicate is encountered and LMGTDFY dies:

[2016-05-04 15:05:41,912: ERROR/MainProcess] Task lmgtfy.tasks.search_bing_task[9841be4a-b158-480b-89ef-80cc8f38a2ca] raised unexpected: IntegrityError('column result is not unique',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/vol/www/lmgtdfy.usopendata.org/htdocs/lmgtfy/tasks.py", line 60, in search_bing_task
    DomainSearchResult.objects.bulk_create( result_model_objects )
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 92, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 409, in bulk_create
    self._batched_insert(objs_without_pk, fields, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 938, in _batched_insert
    using=self.db)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 92, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 921, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 920, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 81, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/sqlite3/base.py", line 485, in execute
    return Database.Cursor.execute(self, query, params)
IntegrityError: column result is not unique

Based on a few test queries, this results in better data than the status quo. The caveat is that LMGTDFY thinks it's never done searching, since it never reaches the end of the results array. The page says:

We're gathering more results right now. This page will refresh in 10 seconds.

So the page reloads every 10 seconds, but of course no more results ever appear.

The next step, I think, is to work on getting this PR debugged, to prevent duplicates from being sent to the database in the first place.
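Until the paging bug itself is fixed, deduplicating each batch in memory before it reaches bulk_create would at least keep a stray IntegrityError from killing the task. A sketch with hypothetical data (variable names are illustrative, not from the project's code):

```python
# Results as fetched, including the repeated pages described above.
fetched = [
    "http://example.gov/a.xls",
    "http://example.gov/b.xls",
    "http://example.gov/a.xls",  # duplicate from a refetched page
    "http://example.gov/b.xls",
]

# URLs already in the database (in the real code, one query per batch).
already_stored = {"http://example.gov/b.xls"}

# Order-preserving dedupe: drop in-batch repeats and anything already
# stored, so the bulk insert can't trip the UNIQUE constraint.
seen = set(already_stored)
to_insert = []
for url in fetched:
    if url not in seen:
        seen.add(url)
        to_insert.append(url)

print(to_insert)  # ['http://example.gov/a.xls']
```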
