Large dataset timeout #53

Open
jacksonvoelkel opened this issue May 22, 2018 · 4 comments

Comments

@jacksonvoelkel

When downloading a very large dataset, esri2geojson encounters this issue:

./esri2geojson https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2 asdf.geojson
2018-05-22 12:52:15,082 - cli.esridump - INFO - Built 615 requests using OID where clause method
Traceback (most recent call last):
  File "./esri2geojson", line 11, in <module>
    sys.exit(main())
  File "/home/<username>/esridump/local/lib/python2.7/site-packages/esridump/cli.py", line 111, in main
    feature = next(feature_iter)
  File "/home/<username>/esridump/local/lib/python2.7/site-packages/esridump/dumper.py", line 425, in __iter__
    raise EsriDownloadError("Could not connect to URL", e)
esridump.errors.EsriDownloadError: ('Could not connect to URL', EsriDownloadError('https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2/query: Could not retrieve this chunk of objects HTTP 500 <html><head><title>Apache Tomcat/7.0.57 - Error report</title></head><body><h1>HTTP Status 500 - </h1><p><b>type</b> Exception report</p><p><b>message</b> <u></u></p><p><b>description</b> <u>The server encountered an internal error that prevented it from fulfilling this request.</u></p><p><b>exception</b> <pre>java.lang.NullPointerException\n</pre></p><p><b>note</b> <u>The full stack trace of the root cause is available in the Apache Tomcat/7.0.57 logs.</u></p><h3>Apache Tomcat/7.0.57</h3></body></html>',))

Multiple runs of the same command download a file of anywhere between 10 MB and 600 MB, depending on when the connection is lost. I think it would be very beneficial for esri2geojson not to exit on this error, but to continue down the queue of batches to download.
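In the meantime, a rough workaround is to call the library's Python API directly and keep whatever was fetched before the failure instead of losing the partial output. This is only a sketch, assuming the EsriDumper / EsriDownloadError interface visible in the traceback above; it also holds every feature in memory, so it's only practical if the dataset fits in RAM:

    import json

    from esridump.dumper import EsriDumper
    from esridump.errors import EsriDownloadError

    url = 'https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2'
    features = []

    try:
        # EsriDumper yields one GeoJSON feature dict per iteration
        for feature in EsriDumper(url):
            features.append(feature)
    except EsriDownloadError as err:
        # A chunk failed (e.g. the HTTP 500 above); stop here but keep what we have
        print('Download interrupted: {}'.format(err))

    with open('asdf.geojson', 'w') as f:
        json.dump({'type': 'FeatureCollection', 'features': features}, f)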

@sindile

sindile commented Aug 8, 2018

I am also experiencing this issue. Any suggestions on a workaround?

@spaceof7

spaceof7 commented Aug 8, 2018

I have had this problem with large datasets as well. In my case I don't need the whole dataset, so the workaround I used was to pass some additional parameters to query a subset of the data. You can query with a lat/lon bounding box (at least on some servers) if you use:

esri2geojson -p geometryType=esriGeometryEnvelope -p geometry=-123.24,47.29,-122.50,48.01 -p spatialRel=esriSpatialRelIntersects -p inSR=4326 ...

If you need the whole dataset, I suppose you could break it up into multiple subsets, though I imagine you'd also have to clean up duplicate features.
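For what it's worth, deduplicating the merged subsets is simple if the layer exposes a unique ID. A minimal sketch, assuming an OBJECTID property and some hypothetical subset file names:

    import json

    merged = {}
    for path in ['subset1.geojson', 'subset2.geojson', 'subset3.geojson']:
        with open(path) as f:
            for feature in json.load(f)['features']:
                # features downloaded in more than one subset share an OBJECTID,
                # so keeping one entry per ID drops the duplicates
                merged[feature['properties']['OBJECTID']] = feature

    with open('merged.geojson', 'w') as f:
        json.dump({'type': 'FeatureCollection', 'features': list(merged.values())}, f)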

Edit:
Another workaround, if you need the whole dataset or a geometry subset doesn't work, is to note how far your query got before timing out and run another query that skips the features you've already downloaded. Run esri2geojson with the -v flag; that way you'll see each request and can tell how much data you've downloaded and where you need to resume from. Note whether the requests use 'resultOffset' or 'where' to iterate through the features. For something like 'resultOffset': 20000, run esri2geojson again with -p resultOffset=20000

If the requests use something like 'where': '(OBJECTID >= 226001 AND OBJECTID <= 227000)',
run esri2geojson with:
-p 'where=OBJECTID >= 226001'

Just make sure you don't overwrite the first dataset you downloaded.

Finally, to get the files that were interrupted to work, you'll need to edit them a little, but it's fairly simple to do in Python. First I use:

    # open in binary mode so that seeking relative to the end of the file works
    with open('asdf.geojson', 'rb') as f:
        f.seek(-1000, 2)  # jump to 1000 bytes before the end
        print(f.read().decode())

to read the last 1000 characters and make sure the last feature is complete. It always has been, but I like to check. Then:

    # esri2geojson streams out a FeatureCollection, so a truncated file
    # only needs the closing ']}' appended to become valid GeoJSON again
    with open('asdf.geojson', 'a') as f:
        f.write('\n')
        f.write(']}')

I've only tested this a couple of times but it seems to work and I haven't had to deal with cleaning duplicate data.

@andrewharvey (Contributor)

@jacksonvoelkel If the layer has an ID field, I've found that forcing esri2geojson to query by ID range avoids timeouts; you can now do this with --paginate-oid, from a4c68db. Could you try that and see if it makes a difference?
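For reference, and untested against this particular layer, the invocation would look something like:

esri2geojson --paginate-oid https://gismaps.kingcounty.gov/arcgis/rest/services/Property/KingCo_PropertyInfo/MapServer/2 asdf.geojson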

@bluetyson

Thanks for the tips!
