
dotmap missing DC #77

Open · sbma44 opened this issue Sep 19, 2019 · 11 comments

sbma44 (Contributor) commented Sep 19, 2019

results.openaddresses.io shows a run as recently as 9/10, and the sample data and cache file look OK to me. It's present in the US South extract (haven't checked global -- should I?). Not sure what could be going on.

(screenshot)

iandees added the bug label Sep 19, 2019
nvkelso (Member) commented Oct 8, 2019

I hit this today, too:

(two screenshots)

iandees (Member) commented Oct 8, 2019

Our style has four Mapbox Studio layers ("southwest", "southeast", "northeast", "northwest") with the OA data split across them. When I look at one of these layers, the "select data" tab shows an error, e.g. "source layer 'openaddresses' or Source 'mapbox://open-addresses.southwest' not found.":

(screenshot 1)

… but it's there:

(screenshot 2)

When I try to select it, Studio shows a different source getting selected at the top. (I selected southwest, but the source selector at the top shows southeast):

(screenshot 3)
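
For what it's worth, one way to confirm outside Studio that the tilesets really exist is to list the account's tilesets with the Mapbox Tilesets API. A minimal sketch, assuming the requests library; the account name is a placeholder and the token comes from the environment:

# Sketch: list the account's tilesets via the Mapbox Tilesets API to
# confirm "open-addresses.southwest" etc. exist outside Studio.
# MAPBOX_USER is a placeholder account name.
import os
import requests

MAPBOX_USER = "open-addresses"
resp = requests.get(
    f"https://api.mapbox.com/tilesets/v1/{MAPBOX_USER}",
    params={"access_token": os.environ["MAPBOX_TOKEN"]},
)
resp.raise_for_status()
for tileset in resp.json():
    print(tileset["id"], tileset.get("name"))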

iandees (Member) commented Oct 8, 2019

@migurski do you remember why we wrote out chunks of mbtiles instead of one big one?

migurski (Member) commented Oct 8, 2019

Yeah, we split the world into four to slide under a Mapbox limitation on upload size: openaddresses/machine#631 (comment)
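For anyone reading along, the quadrant split is conceptually simple. A minimal sketch of bucketing line-delimited GeoJSON point features into four files before running Tippecanoe on each; the file names are illustrative, not the actual machine paths:

# Sketch: bucket line-delimited GeoJSON point features into four
# quadrant files so each resulting MBTiles stays under the upload limit.
import json

outputs = {
    name: open(f"{name}.geojson", "w")
    for name in ("northwest", "northeast", "southwest", "southeast")
}

with open("openaddresses.geojson") as src:
    for line in src:
        feature = json.loads(line)
        lon, lat = feature["geometry"]["coordinates"]
        ns = "north" if lat >= 0 else "south"
        ew = "east" if lon >= 0 else "west"
        outputs[ns + ew].write(line)

for f in outputs.values():
    f.close()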

iandees (Member) commented Oct 8, 2019

@sbma44 or @ingalls, can you (a) check whether Mapbox's limitation on file upload size is still present, and (b) see if there's anything we can do to fix the Mapbox Studio issue described above?

ingalls (Member) commented Oct 9, 2019

@iandees Looks like the limit is 25 GB, per Ref

What is the current size of the file, and are we uploading prebuilt vector tiles or raw GeoJSON?

I'm not super familiar with the Studio/tiles side of things, but https://docs.mapbox.com/api/maps/#create-a-tileset looks like it could be a more permanent solution to our problem. However, I imagine uploading individual tiles would take considerably longer than our current approach.
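
In the meantime, a sketch of guarding the upload against that 25 GB cap before hitting the Uploads API, assuming the mapbox-sdk-py Uploader; the tileset id and path are placeholders:

# Sketch: refuse to upload an MBTiles file over the documented 25 GB
# Uploads API cap. Tileset id and path are placeholders.
import os
from mapbox import Uploader

LIMIT_BYTES = 25 * 1024 ** 3  # 25 GB upload cap
path = "southwest.mbtiles"

size = os.path.getsize(path)
if size > LIMIT_BYTES:
    raise SystemExit(f"{path} is {size / 1024 ** 3:.1f} GB, over the cap")

uploader = Uploader()  # reads MAPBOX_ACCESS_TOKEN from the environment
with open(path, "rb") as src:
    resp = uploader.upload(src, "open-addresses.southwest")
print(resp.status_code)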

migurski (Member) commented Oct 9, 2019

We are uploading prebuilt MBTiles files full of vector tiles after running Tippecanoe in the build process.
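For context, that build step looks roughly like the sketch below; the flags shown are common Tippecanoe options, not necessarily the exact ones machine uses:

# Sketch of the Tippecanoe build step: turn one quadrant's GeoJSON
# into an MBTiles file of vector tiles.
import subprocess

subprocess.run(
    [
        "tippecanoe",
        "-o", "southwest.mbtiles",   # output MBTiles
        "-l", "openaddresses",       # layer name referenced by the style
        "-zg",                       # let Tippecanoe guess a max zoom
        "--drop-densest-as-needed",  # thin dense dot clusters at low zooms
        "-f",                        # overwrite any existing output
        "southwest.geojson",
    ],
    check=True,
)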

ingalls (Member) commented Oct 9, 2019

@migurski any chance you can pull the size of the current file? My guess is this is the issue once again :/

iandees (Member) commented Oct 9, 2019

Looking at the logs, it seems that the most recent two mbtiles builds, on 9/29 and 10/1, stopped without finishing or uploading to Mapbox. I don't see any errors, so maybe the instance got terminated? Each ran from ~1600 UTC to ~0100 UTC, which is ~9 hours. Maybe there's a time limit we need to bump up?

On the 9/11 build, most of the downloads failed because I changed permissions on the S3 bucket. The upload to Mapbox worked, though, which is likely why there are so few points on the map: that partial build is the most recent tileset in our account.

Actions to take:

  • Double-check machine's openaddr-update-dotmap is using the code that makes authenticated S3 requests instead of plain ol' HTTP downloads (it is; see the boto3 sketch after this list)
  • Increase the timeout on the job if there is one (although the slowness might come from trying to make requests through the CDN instead of using S3 directly) (timeout isn't the problem)
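
A minimal sketch of the authenticated S3 download in question, assuming boto3; the bucket and key names are placeholders:

# Sketch: fetch a cached source with authenticated S3 requests (boto3)
# instead of a plain HTTP GET through the CDN.
import boto3

s3 = boto3.client("s3")  # picks up credentials from the environment/role
s3.download_file(
    Bucket="data.openaddresses.io",   # placeholder bucket
    Key="cache/us/dc/statewide.zip",  # placeholder key
    Filename="statewide.zip",
)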

iandees (Member) commented Oct 10, 2019

It looks like this is an issue with a hanging Postgres query while pulling down the cached sources to use when iterating through all the data.

This query is run and takes a very long time (times out after ~2 days):

SELECT MAX(id), source_path FROM runs
WHERE source_path IN (
    -- Get all source paths for successful runs in this set.
    SELECT source_path FROM runs
    WHERE set_id = 684369
  )
  -- Get only successful, merged runs.
  AND status = true
  AND (is_merged = true OR is_merged IS NULL)
GROUP BY source_path;

The Postgres planner (via EXPLAIN) says it should be relatively fast:

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=9914.83..9937.92 rows=2309 width=32)
   Group Key: runs.source_path
   ->  Nested Loop  (cost=859.73..8098.89 rows=363188 width=32)
         ->  HashAggregate  (cost=850.31..850.45 rows=14 width=28)
               Group Key: runs_1.source_path
               ->  Index Scan using runs_set_ids on runs runs_1  (cost=0.42..843.36 rows=2780 width=28)
                     Index Cond: (set_id = 684369)
         ->  Bitmap Heap Scan on runs  (cost=9.42..516.46 rows=129 width=32)
               Recheck Cond: ((source_path = runs_1.source_path) AND status AND (is_merged OR (is_merged IS NULL)))
               ->  Bitmap Index Scan on runs_source_path_idx  (cost=0.00..9.39 rows=129 width=0)
                     Index Cond: (source_path = runs_1.source_path)
(11 rows)

I'll dive into why this is taking so long later.
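
In the meantime, a sketch of capping the query client-side so a bad plan fails fast instead of hanging for days, assuming psycopg2; the DSN is a placeholder:

# Sketch: run the runs query with a statement timeout so a bad plan
# fails fast instead of hanging for ~2 days. DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("postgresql://localhost/openaddr")
with conn, conn.cursor() as cur:
    cur.execute("SET statement_timeout = '5min'")
    cur.execute(
        """
        SELECT MAX(id), source_path FROM runs
        WHERE source_path IN (SELECT source_path FROM runs WHERE set_id = %s)
          AND status = true
          AND (is_merged = true OR is_merged IS NULL)
        GROUP BY source_path
        """,
        (684369,),
    )
    rows = cur.fetchall()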

iandees (Member) commented Oct 17, 2019

With all the changes above, I'm seeing more dots on the dotmap now, including DC:

(screenshot)

The update process takes quite a bit longer now that we have more data (it's still running), but it's getting somewhere!
