This repository has been archived by the owner on May 5, 2020. It is now read-only.

Eliminate duplicate results #18

Open
waldoj opened this issue Feb 18, 2015 · 17 comments · Fixed by #34

waldoj (Member) commented Feb 18, 2015

We're sometimes seeing each dataset appear six times, e.g., for Louisville, KY:

[screenshot "repeats": the same datasets listed six times each]

We've got to figure out what the problem is here and fix it.

dsmorgan77 commented

Happening for USDOT, too. So 👍

bsmithgall (Contributor) commented

Is there checking to see whether a file has already been saved in the db? I looked through the code and couldn't determine whether that was happening -- it looks like when you kick out to the Celery task, it always creates a new model. If that's the case, then a bunch of Celery workers all starting on the same site at the same time might each add duplicate records.
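The guard being asked about, and why it isn't enough on its own, can be sketched in plain sqlite3 (hypothetical table and URLs, not the project's actual Django models): a SELECT-then-INSERT check stops duplicates within one worker, but two workers can both pass the SELECT before either INSERTs, which is why a database-level constraint is the more robust fix.

```python
import sqlite3

# Hypothetical schema for illustration only; the real project stores
# results through Django models, not raw SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (url TEXT)")

def save_if_new(url):
    # App-level guard: insert only if the URL isn't already stored.
    # NOTE: between the SELECT and the INSERT, another worker can slip
    # in, so this alone doesn't prevent duplicates under concurrency.
    exists = conn.execute(
        "SELECT 1 FROM result WHERE url = ?", (url,)
    ).fetchone()
    if exists is None:
        conn.execute("INSERT INTO result (url) VALUES (?)", (url,))
        conn.commit()

for _ in range(3):  # simulate repeated saves of the same file
    save_if_new("http://example.gov/data.xls")

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 1 within a single process
```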

waldoj (Member, Author) commented Feb 19, 2015

I note that multiple results are not clustered together in the database, as in the public display. Here are some representative records from the database:

12850   http://www.pittsburghpa.gov/publicworks/files/commercial_haulers_10-11.xls
12851   http://www.pittsburghpa.gov/police/motorunit/motorskills/files/final_results_data.xls
12852   http://www.pittsburghpa.gov/landrecycling/ast/MI_PA_LB_cklst_chart_rev_3-29-2010.xls
12853   http://www.pittsburghpa.gov/neighborhoodinitiatives/files/CDBG_Vacant_Lot_Clean_Up_Eligible_Properties.xls
12854   http://www.pittsburghpa.gov/publicworks/files/commercial_haulers_10-11.xls
12855   http://www.pittsburghpa.gov/police/motorunit/motorskills/files/final_results_data.xls
12856   http://www.pittsburghpa.gov/landrecycling/ast/MI_PA_LB_cklst_chart_rev_3-29-2010.xls
12857   http://www.pittsburghpa.gov/neighborhoodinitiatives/files/CDBG_Vacant_Lot_Clean_Up_Eligible_Properties.xls

That's four records, recorded twice consecutively, in the same order. Seems to me that a constraint should be put on the database to prohibit the same file URL from being stored twice (assuming that SQLite supports that). But, of course, the cause of this needs to be determined. @bsmithgall's suggestion sounds like a good one.
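For what it's worth, SQLite does support that kind of constraint. A minimal sketch (hypothetical table name, not the real schema) of how a UNIQUE column behaves, including catching the resulting IntegrityError so a duplicate doesn't kill the whole run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE makes SQLite reject a second row with the same url.
conn.execute("CREATE TABLE result (url TEXT UNIQUE)")

urls = [
    "http://example.gov/a.xls",
    "http://example.gov/b.xls",
    "http://example.gov/a.xls",  # duplicate
]

for url in urls:
    try:
        conn.execute("INSERT INTO result (url) VALUES (?)", (url,))
    except sqlite3.IntegrityError:
        # Constraint rejected the duplicate; skip it and keep going
        # rather than letting the whole process break.
        continue
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 2
```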

bsmithgall (Contributor) commented

This could be a promising lead: https://www.sqlite.org/lang_createtable.html#uniqueconst

waldoj (Member, Author) commented Feb 21, 2015

My fear is that what'll actually happen here is that SQLite will throw an error when a duplicate insertion attempt is made, breaking the whole process. But I'm going to try it out and see what happens!

bsmithgall (Contributor) commented

@waldoj I think you'll probably have to catch whatever exception is raised and just pass or continue.

waldoj (Member, Author) commented Feb 21, 2015

Yeah, assuming exception handling already exists for those inserts (I haven't checked), it might not be a problem at all.

bsmithgall (Contributor) commented

I'm happy to take a look at this and put in a PR later today; just need to finish up some work on a different project first.

waldoj added the bug label Feb 21, 2015
bsmithgall (Contributor) commented

Here's an open question about this, from an efficiency perspective: is it better to do a bunch of update_or_create calls, or to do a get call into the DB and then a bulk insert of the things that aren't there?
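One way to frame that trade-off, in a plain-sqlite3 sketch with hypothetical data (the real code would go through the Django ORM): per-row upserts cost a statement per record, while fetching the existing URLs first allows a single bulk insert of only the missing rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (url TEXT UNIQUE)")
conn.execute("INSERT INTO result (url) VALUES ('http://example.gov/a.xls')")
conn.commit()

incoming = [
    "http://example.gov/a.xls",  # already stored
    "http://example.gov/b.xls",
    "http://example.gov/c.xls",
]

# Option 1: one upsert-style statement per record (simple, but a
# round trip per row):
#   for url in incoming:
#       conn.execute("INSERT OR IGNORE INTO result (url) VALUES (?)", (url,))

# Option 2: one SELECT to learn what's already there, then a single
# bulk insert of only the rows that are missing.
existing = {row[0] for row in conn.execute("SELECT url FROM result")}
new_rows = [(u,) for u in incoming if u not in existing]
conn.executemany("INSERT INTO result (url) VALUES (?)", new_rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(count)  # 3
```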

waldoj (Member, Author) commented Feb 22, 2015

I want to leave this open until we're sure this is fixed (in part to prevent duplicate issues from being opened).

waldoj (Member, Author) commented Mar 24, 2015

Unfortunately, I haven't gotten this to work. I get this error:

File "/vol/www/lmgtdfy.usopendata.org/htdocs/lmgtfy/tasks.py", line 8, in <module>
  from django.core.exceptions import DoesNotExist
ImportError: cannot import name DoesNotExist

Why it cannot import it, I don't have any idea. I'm going to keep working on debugging this.

waldoj (Member, Author) commented Mar 24, 2015

In the meantime, now we're getting the same file showing up 21 times consecutively. My suspicion is that result paging is the problem. I think that LMGTDFY is finding X results—say, 1,000 of them—and paging through them 50 at a time—but not actually advancing through the pages. The counter increments, but it's fetching exactly the same set of 50 results 20 times. @knowtheory, this seems like a pretty hefty bug. Is it plausible?
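If the offset really isn't advancing, the corrected loop would look something like this (a sketch with a stub `fetch_page` standing in for the actual Bing API call; names are hypothetical):

```python
# Stub standing in for the real search-API call: returns up to
# page_size results starting at `offset`, and an empty list when the
# results are exhausted.
ALL_RESULTS = [f"result-{i}" for i in range(137)]

def fetch_page(offset, page_size=50):
    return ALL_RESULTS[offset:offset + page_size]

collected = []
offset = 0
while True:
    page = fetch_page(offset)
    if not page:
        break
    collected.extend(page)
    # The crucial step: advance the offset by the number of results
    # actually returned. If this line is missing, every iteration
    # refetches exactly the same page, producing the duplicates seen here.
    offset += len(page)

print(len(collected))  # 137, no duplicates
```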

bsmithgall (Contributor) commented

@waldoj Sorry, this is my bad; use the DoesNotExist exception that's attached to the model class instead of importing it:

The DoesNotExist exception is raised when an object is not found for the given parameters of a query. Django provides a DoesNotExist exception as an attribute of each model class to identify the class of object that could not be found and to allow you to catch a particular model class with try/except.

https://docs.djangoproject.com/en/1.7/ref/exceptions/
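The pattern the Django docs describe can be mimicked in plain Python (a stub for illustration, not real Django code; `DomainSearchResult` here is a stand-in class, not the actual model):

```python
class ObjectDoesNotExist(Exception):
    """Base class, analogous to django.core.exceptions.ObjectDoesNotExist."""

class DomainSearchResult:  # stand-in for the real Django model
    # Each model carries its own DoesNotExist as a class attribute, so
    # callers catch DomainSearchResult.DoesNotExist rather than importing
    # a global DoesNotExist (which is why the import above failed).
    class DoesNotExist(ObjectDoesNotExist):
        pass

    _rows = {}

    @classmethod
    def get(cls, url):
        try:
            return cls._rows[url]
        except KeyError:
            raise cls.DoesNotExist(url)

caught = False
try:
    DomainSearchResult.get("http://example.gov/missing.xls")
except DomainSearchResult.DoesNotExist:
    caught = True
print(caught)  # True
```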

waldoj (Member, Author) commented Mar 24, 2015

That fixed it, @bsmithgall! Or, rather, that prevents the error from appearing. Unfortunately, duplicates are still being indexed, and it's not immediately obvious why that should be so. I'm going to walk through this in a deliberate way and figure out what's going wrong here. I appreciate your efforts!

waldoj (Member, Author) commented Mar 31, 2015

Notably, the results are duplicated in the database in windows. That is, if the file UCM424433.xls appears 20 times, it's not rows 5,000–5,020 in the database. Instead, the same cohort of files appears over and over again. That file, in fact, is result 1, 42, 115, 142, 169, 196, 223, 250, 277, 304, 331, 358, 385, 412, and 439. (I stopped counting there.) It's always preceded and followed by the same results. Many of these results are incremented by 27, which is a weird window size.

waldoj (Member, Author) commented May 4, 2016

For this query, there are 9 results. But 36 results were retrieved (4 times each), and then 72 were loaded into SQLite (2 times each). Now, this was the 2nd query for this domain name, so 36 becoming 72 might have been a result of bad duplicate matching. A clear bug, though, is the repeated retrieval of the same results. Again, this looks like a windowing problem. It's surely somewhere within here.

waldoj (Member, Author) commented May 4, 2016

By putting a unique constraint on lmgtfy_domainsearchresult.result, we're not getting duplicates. That's the good news. The bad news is that, at some point, a duplicate is encountered and LMGTDFY dies:

[2016-05-04 15:05:41,912: ERROR/MainProcess] Task lmgtfy.tasks.search_bing_task[9841be4a-b158-480b-89ef-80cc8f38a2ca] raised unexpected: IntegrityError('column result is not unique',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/vol/www/lmgtdfy.usopendata.org/htdocs/lmgtfy/tasks.py", line 60, in search_bing_task
    DomainSearchResult.objects.bulk_create( result_model_objects )
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 92, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 409, in bulk_create
    self._batched_insert(objs_without_pk, fields, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 938, in _batched_insert
    using=self.db)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 92, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 921, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 920, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 81, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/utils.py", line 65, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/sqlite3/base.py", line 485, in execute
    return Database.Cursor.execute(self, query, params)
IntegrityError: column result is not unique

Based on a few test queries, this results in better data than the status quo. The caveat is that LMGTDFY thinks it's never done searching, since it never reaches the end of the results array. The page says:

We're gathering more results right now. This page will refresh in 10 seconds.

So the page reloads every 10 seconds, but of course no more results ever appear.

The next step, I think, is to work on getting this PR debugged, to prevent duplicates from being sent to the database in the first place.
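Until the paging bug itself is fixed, deduplicating each batch in memory before it reaches bulk_create would at least keep a stray IntegrityError from killing the task. A sketch with hypothetical data (variable names are illustrative, not from the project's code):

```python
# Results as fetched, including the repeated pages described above.
fetched = [
    "http://example.gov/a.xls",
    "http://example.gov/b.xls",
    "http://example.gov/a.xls",  # duplicate from a refetched page
    "http://example.gov/b.xls",
]

# URLs already in the database (in the real code, one query per batch).
already_stored = {"http://example.gov/b.xls"}

# Order-preserving dedupe: drop in-batch repeats and anything already
# stored, so the bulk insert can't trip the UNIQUE constraint.
seen = set(already_stored)
to_insert = []
for url in fetched:
    if url not in seen:
        seen.add(url)
        to_insert.append(url)

print(to_insert)  # ['http://example.gov/a.xls']
```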
