Large dataset timeout #53

Open
jacksonvoelkel opened this issue May 22, 2018 · 4 comments

Comments

@jacksonvoelkel

When downloading a very large dataset, esri2geojson encounters this issue:

./esri2geojson https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2 asdf.geojson
2018-05-22 12:52:15,082 - cli.esridump - INFO - Built 615 requests using OID where clause method
Traceback (most recent call last):
  File "./esri2geojson", line 11, in <module>
    sys.exit(main())
  File "/home/<username>/esridump/local/lib/python2.7/site-packages/esridump/cli.py", line 111, in main
    feature = next(feature_iter)
  File "/home/<username>/esridump/local/lib/python2.7/site-packages/esridump/dumper.py", line 425, in __iter__
    raise EsriDownloadError("Could not connect to URL", e)
esridump.errors.EsriDownloadError: ('Could not connect to URL', EsriDownloadError('https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2/query: Could not retrieve this chunk of objects HTTP 500 <html><head><title>Apache Tomcat/7.0.57 - Error report</title></head><body><h1>HTTP Status 500 - </h1><p><b>type</b> Exception report</p><p><b>message</b> <u></u></p><p><b>description</b> <u>The server encountered an internal error that prevented it from fulfilling this request.</u></p><p><b>exception</b> <pre>java.lang.NullPointerException\n</pre></p><p><b>note</b> <u>The full stack trace of the root cause is available in the Apache Tomcat/7.0.57 logs.</u></p><h3>Apache Tomcat/7.0.57</h3></body></html>',))

Multiple runs of the same command download a file of anywhere between 10 MB and 600 MB, depending on when the connection is lost. I think it would be very beneficial for esri2geojson not to exit on this error, but to continue down the queue of batches to download.
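In the meantime, a rough workaround is to call the library's Python API directly and keep whatever was fetched before the failure instead of losing the partial output. This is only a sketch, assuming the EsriDumper / EsriDownloadError interface visible in the traceback above; it also holds every feature in memory, so it's only practical if the dataset fits in RAM:

    import json

    from esridump.dumper import EsriDumper
    from esridump.errors import EsriDownloadError

    url = 'https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2'
    features = []

    try:
        # EsriDumper yields one GeoJSON feature dict per iteration
        for feature in EsriDumper(url):
            features.append(feature)
    except EsriDownloadError as err:
        # A chunk failed (e.g. the HTTP 500 above); stop here but keep what we have
        print('Download interrupted: {}'.format(err))

    with open('asdf.geojson', 'w') as f:
        json.dump({'type': 'FeatureCollection', 'features': features}, f)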

@sindile

sindile commented Aug 8, 2018

I am also experiencing this issue. Any suggestions on a workaround?

@spaceof7

spaceof7 commented Aug 8, 2018

I have had this problem with large datasets as well. In my case I don't need the whole dataset, so the workaround I used was to pass some additional parameters to query a subset of the data. You can query with a lat/lon bounding box (at least on some servers) if you use:

esri2geojson -p geometryType=esriGeometryEnvelope -p geometry=-123.24,47.29,-122.50,48.01 -p spatialRel=esriSpatialRelIntersects -p inSR=4326 ...

If you need the whole dataset, I suppose you could break it up into multiple subsets, though I imagine you'd also have to clean up duplicate features.
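For what it's worth, deduplicating the merged subsets is simple if the layer exposes a unique ID. A minimal sketch, assuming an OBJECTID property and some hypothetical subset file names:

    import json

    merged = {}
    for path in ['subset1.geojson', 'subset2.geojson', 'subset3.geojson']:
        with open(path) as f:
            for feature in json.load(f)['features']:
                # features downloaded in more than one subset share an OBJECTID,
                # so keeping one entry per ID drops the duplicates
                merged[feature['properties']['OBJECTID']] = feature

    with open('merged.geojson', 'w') as f:
        json.dump({'type': 'FeatureCollection', 'features': list(merged.values())}, f)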

Edit:
Another workaround, if you need the whole dataset or a geometry subset doesn't work, is to note how far your query got before timing out and run another query that skips the features you've already downloaded. Run esri2geojson with the -v flag; that way you'll see each request and can tell how much data you've downloaded and where you need to resume from. Note whether the requests use 'resultOffset' or 'where' to iterate through the features. For something like 'resultOffset': 20000, run esri2geojson again with -p resultOffset=20000

If the requests use something like 'where': '(OBJECTID >= 226001 AND OBJECTID <= 227000)',
run esri2geojson with:
-p 'where=OBJECTID >= 226001'

Just make sure you don't overwrite the first dataset you downloaded.

Finally, to get the files that were interrupted to work, you'll need to edit them a little, but it's fairly simple to do in Python. First I use:

    # open in binary mode so that seeking relative to the end of the file works
    with open('asdf.geojson', 'rb') as f:
        f.seek(-1000, 2)  # jump to 1000 bytes before the end
        print(f.read().decode())

to read the last 1000 characters and make sure the last feature is complete. It always has been, but I like to check. Then:

    # esri2geojson streams out a FeatureCollection, so a truncated file
    # only needs the closing ']}' appended to become valid GeoJSON again
    with open('asdf.geojson', 'a') as f:
        f.write('\n')
        f.write(']}')

I've only tested this a couple of times but it seems to work and I haven't had to deal with cleaning duplicate data.

@andrewharvey (Contributor)

@jacksonvoelkel If the layer has an ID field, I've found that forcing esri2geojson to query by ID range avoids timeouts; you can now do this with --paginate-oid, from a4c68db. Could you try that and see if it makes a difference?
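For reference, and untested against this particular layer, the invocation would look something like:

esri2geojson --paginate-oid https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2 asdf.geojson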

@bluetyson

Thanks for the tips!
