This repository has been archived by the owner on May 5, 2022. It is now read-only.

Source pages misrepresenting the actual process #640

Open
stefanb opened this issue May 21, 2017 · 5 comments

Comments

@stefanb
Contributor

stefanb commented May 21, 2017

http://results.openaddresses.io/sources/si/countrywide
This page is not getting any new rows with each new batch; only the date in the first row changes, and everything else stays the same.
In the last batch, the run date changed from 2017-05-15 to 2017-05-20, but the linked log (https://s3.amazonaws.com/data.openaddresses.io/runs/188019/output.txt) still shows timestamps from 2017-05-11 (one of the previous batches). I would expect existing rows to remain the same and new rows to be added on top of the table, with new links and a new address count (now 554000+, as a new source zip is prepared daily).

Diff:
screenshot_20170521-073819

@migurski
Member

We have a 10-day timeout on batch runs: older versions are re-used if a new batch process is requested within that time. Is it possible that’s what you are seeing?
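
The re-use rule described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `should_reuse`, the constant `REUSE_WINDOW`, and the strict "less than" comparison at the boundary are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "10 day timeout" is a fixed window measured from the
# time of the earlier run. The name REUSE_WINDOW is hypothetical.
REUSE_WINDOW = timedelta(days=10)


def should_reuse(previous_run_time, requested_at, window=REUSE_WINDOW):
    """Return True if an earlier run is recent enough to be copied
    instead of re-processing the upstream source."""
    if previous_run_time is None:
        # No earlier run exists, so there is nothing to copy.
        return False
    return requested_at - previous_run_time < window


# Timestamps taken from this thread: the run from May 11 and the
# batch that started on May 20.
earlier = datetime(2017, 5, 11, 6, 41, 7, tzinfo=timezone.utc)
batch = datetime(2017, 5, 20, 19, 23, 15, tzinfo=timezone.utc)
print(should_reuse(earlier, batch))
```

Under these assumptions, a batch requested on May 20 falls inside the window of a run from May 11, so the older run would be re-used.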

@stefanb
Contributor Author

stefanb commented May 21, 2017 via email

@migurski
Member

Here’s a simplified view of what’s in the database for Slovenia:

| id     | success? | copy of | set_id | version | datetime               |
|--------|----------|---------|--------|---------|------------------------|
| 187276 | f        |         |        | ba13428 | 2017-05-08 17:22:16+00 |
| 187278 | f        |         |        | 39ed867 | 2017-05-08 18:11:15+00 |
| 187280 | f        | 187278  |        | 9e6a5c9 | 2017-05-08 19:20:37+00 |
| 187282 | t        |         |        | d8d0991 | 2017-05-08 19:39:11+00 |
| 187284 | t        | 187282  |        | e753eca | 2017-05-08 19:51:19+00 |
| 187286 | t        | 187282  |        | ea80fa7 | 2017-05-08 21:18:10+00 |
| 187296 | t        | 187282  |        | e753eca | 2017-05-08 21:48:45+00 |
| 187301 | t        |         |        | 12261e6 | 2017-05-08 23:13:20+00 |
| 187302 | t        |         |        | 12261e6 | 2017-05-08 23:13:23+00 |
| 187351 | t        | 187282  | 187303 | 822dcb5 | 2017-05-08 23:13:34+00 |
| 187480 | t        | 187302  |        | 12261e6 | 2017-05-08 23:18:33+00 |
| 187740 | t        | 187302  |        | 12261e6 | 2017-05-08 23:30:19+00 |
| 187889 | t        | 187302  |        | 0fe3dd6 | 2017-05-08 23:36:56+00 |
| 187991 | t        | 187302  |        | 594fff8 | 2017-05-09 09:42:23+00 |
| 188005 | t        | 187302  |        | 7a6319b | 2017-05-10 06:57:21+00 |
| 188008 | t        |         |        | 98513dc | 2017-05-10 08:37:06+00 |
| 188009 | t        |         |        | 98513dc | 2017-05-10 08:37:10+00 |
| 188011 | t        | 188009  |        | 8b531ae | 2017-05-10 11:08:30+00 |
| 188013 | f        |         |        | 909901f | 2017-05-10 11:27:01+00 |
| 188015 | t        |         |        | fa978d0 | 2017-05-10 11:34:41+00 |
| 188017 | t        | 188009  |        | ecc277b | 2017-05-10 11:31:18+00 |
| 188019 | t        |         |        | 3035d43 | 2017-05-11 06:41:07+00 |
| 188069 | t        | 187282  | 188020 | 05a5856 | 2017-05-11 18:37:59+00 |
| 190149 | t        | 188019  |        | efee8f9 | 2017-05-11 22:58:56+00 |
| 190753 | t        | 188019  | 190150 | efee8f9 | 2017-05-12 21:38:05+00 |
| 192875 | t        | 188019  | 192272 | efee8f9 | 2017-05-15 21:37:28+00 |
| 195041 | t        | 188019  | 194433 | 483df2f | 2017-05-20 19:23:15+00 |
| 195206 | t        |         |        | c8e1b61 | 2017-05-20 20:41:16+00 |

Run 195041 (second-to-last row) is public in Set 194433, and it was just a copy of Run 188019 from May 11, which fell within the prior ten days. Any row with a value in the "copy of" column is a copy like that. Looks like the newest one, 195206, was processed in this PR about a day ago. Sorry if this is confusing! Many of the tradeoffs here are designed to get something recent-enough onto the OA page without unnecessarily re-processing upstream sources too frequently.
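
To make the "copy of" relationship concrete, here is a small sketch that follows a run back to the run it was copied from, using a few rows as read from the table above. The tuple layout and the names `RUNS` and `original_of` are hypothetical; the real database schema is not shown in this thread.

```python
from datetime import datetime, timezone


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp as UTC."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# A subset of the rows above as (id, copy_of, set_id, datetime).
# The layout is an assumption made for illustration only.
RUNS = [
    (188019, None, None, ts("2017-05-11 06:41:07")),
    (190149, 188019, None, ts("2017-05-11 22:58:56")),
    (190753, 188019, 190150, ts("2017-05-12 21:38:05")),
    (192875, 188019, 192272, ts("2017-05-15 21:37:28")),
    (195041, 188019, 194433, ts("2017-05-20 19:23:15")),
    (195206, None, None, ts("2017-05-20 20:41:16")),
]


def original_of(run_id, runs=RUNS):
    """Follow 'copy of' links until reaching a run that was actually processed."""
    by_id = {run[0]: run for run in runs}
    run = by_id[run_id]
    seen = set()
    # The loop also handles chains of copies and guards against cycles,
    # although the rows above only ever point directly at the original.
    while run[1] is not None and run[1] in by_id and run[0] not in seen:
        seen.add(run[0])
        run = by_id[run[1]]
    return run


copied = original_of(195041)
print(copied[0], copied[3].date().isoformat())
```

With this subset, run 195041 resolves to run 188019 from 2017-05-11, while run 195206 resolves to itself because its "copy of" value is empty.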

@stefanb
Contributor Author

stefanb commented May 22, 2017

Seeing so many records marked as "copy of" is a whole different story, and far more explanatory than seeing just the date change within the same existing row while the linked data and logs stay old (as seen in the diff screenshot above).

Recent-enough sources could be skipped altogether; changing the date of an old import (even if only visually) does not look sane.

Perhaps the solution to the confusion is to:

  • show the date when the original (the run being copied) was imported, so for id 195041 (which is a copy of 188019) show 2017-05-11 06:41:07+00 (the time can be dropped for brevity)
  • make it clear that new ones will not be imported until the most recent is at least x days old (for example, by giving it a "fresh enough" badge)
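
As a rough illustration of the two suggestions above, the dates to display could be derived like this. It is only a sketch: the names `FRESH_FOR`, `source_page_dates`, and `is_fresh` are hypothetical, and "x days" is assumed to equal the 10-day timeout mentioned earlier in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "x days" is the same 10-day window used for re-using runs.
FRESH_FOR = timedelta(days=10)


def source_page_dates(original_import_time, fresh_for=FRESH_FOR):
    """Return (import date, fresh-until date) as ISO date strings,
    based on the original import rather than on any later copy."""
    fresh_until = original_import_time + fresh_for
    return (
        original_import_time.date().isoformat(),
        fresh_until.date().isoformat(),
    )


def is_fresh(original_import_time, now, fresh_for=FRESH_FOR):
    """Return True while the original import still counts as fresh enough."""
    return now < original_import_time + fresh_for


# Run 195041 is a copy of run 188019, imported at 2017-05-11 06:41:07+00.
imported = datetime(2017, 5, 11, 6, 41, 7, tzinfo=timezone.utc)
print(source_page_dates(imported))
```

For the Slovenia example, this would show 2017-05-11 as the import date instead of the date of the copy, along with the date after which a re-import becomes due.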

Both could be achieved by improving the individual source page, making it show:

  • more details about last successful import at top of the page:
    • Title / attribution / License (for SEO and legal reasons)
    • URL (as it is now)
    • date of the last import (not the date an existing import was copied)
    • the date until which it is considered fresh and therefore not due for re-import
    • a picture overview (preview.png, same as with CI jobs), maybe even a link to or an iframe with a slippy map (Added frameset with slippymap preview #579)
  • a table of all imports (with links to details, as it is now, but with dates changed to the actual import date rather than when the copy was made)

@migurski
Member

migurski commented Jun 3, 2017

Great suggestions, Stefan! Would you be interested in making a contribution to the project? I could help you get set up with the code so that you could develop these changes directly.

@stefanb changed the title from "Slovenia is not being processed" to "Source pages misrepresenting the actual process" on Nov 10, 2017