Download shell scripts for .tar files forbidden #22

Open
RobertSellers opened this issue Dec 20, 2018 · 14 comments

Comments

@RobertSellers

This is partly cross-posted from aria2/aria2#973. Downloads with wget, curl, and aria2 all appear to be blocked. The .gz extension is also missing now. Are there any known workarounds?

12/20 16:25:03 [ERROR] CUID#8 - Download aborted. URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
Exception: [AbstractCommand.cc:351] errorCode=29 URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
  -> [HttpSkipResponseCommand.cc:231] errorCode=29 The response status is not successful. status=503
@iandees
Member

iandees commented Dec 20, 2018

I ran into this last year and spoke with some IT folks at Census about it. Apparently they were enforcing some rules about SSL, so requests needed forged User-Agent and Strict-Transport-Security headers. That worked last year, but it isn't working this year. I think they're also blocking wide ranges of AWS IP addresses.

I got around this temporarily by downloading the files from my home and uploading them to the server doing the data load. I subsequently ran into a couple other problems:

  • this year the nationwide .tar files contain state-level .zip files that are packaged differently than before
  • one of the estimate files doesn't match the sequence/metadata files in terms of header count

I haven't had a chance to look into these issues yet, which is why Census Reporter hasn't gotten the latest release added yet. I'm hoping to figure it out this weekend.

@RobertSellers
Author

I appreciate the feedback. Also, yes, I'm running on AWS and haven't tested anywhere else so far.

@RobertSellers
Author

I can add that the exact same problem occurs from my local PC using wget under the Windows Subsystem for Linux on Windows 10, so this might not be a problem targeted at AWS.

@iandees
Member

iandees commented Dec 20, 2018

Can you try something that forges the User-Agent header? For example:

wget --debug \
   --header="User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0" \
   --header="Strict-Transport-Security: max-age=31536000" \
   https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar

@RobertSellers
Author

No luck. It's a wall of 403 errors. uGet Desktop in Windows 10 also isn't working. Yeesh. This data isn't hosted anywhere else in bulk?

@loganpowell

Hi everyone. I'm sorry to hear you're having issues with this. @iandees with whom did you speak at Census? Can you copy me/forward the email ([email protected])?

@iandees
Member

iandees commented Dec 20, 2018

Hi @loganpowell! I spoke with Jeff Meisel and Lori Carrig last year. I'll forward the email chain.

@iandees
Member

iandees commented Dec 20, 2018

@loganpowell It seems that your Akamai CDN might be blocking .tar downloads from some user agents. I can use wget on the .zip files OK, but the .tar files are failing.
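
One way to make the .zip versus .tar difference easier to report is to record the HTTP status each URL returns. The following is a minimal sketch, assuming curl is installed; `classify_status` and `probe_url` are hypothetical helper names, and a HEAD request may not be treated the same way as a full GET by the CDN, so the result is only indicative.

```shell
#!/bin/sh
# Sketch: compare the HTTP status returned for a .zip and a .tar on the same host.
# classify_status and probe_url are hypothetical helpers, not part of any tool.

classify_status() {
    # Map an HTTP status code to a rough label for this comparison.
    case "$1" in
        2??) echo "ok" ;;
        403) echo "forbidden" ;;
        503) echo "unavailable" ;;
        *)   echo "other" ;;
    esac
}

probe_url() {
    # Send a HEAD request and print only the status code.
    # curl reports 000 when no HTTP response was received at all.
    probe_code=$(curl --silent --head --output /dev/null --write-out '%{http_code}' "$1")
    echo "$1 -> $probe_code ($(classify_status "$probe_code"))"
}

# Only touch the network when explicitly asked to, e.g. PROBE=1 sh probe.sh
if [ "${PROBE:-0}" = "1" ]; then
    base="https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf"
    probe_url "$base/2017_ACS_Geography_Files.zip"
    probe_url "$base/Tracts_Block_Groups_Only.tar"
fi
```

Running it from both a home connection and an AWS machine would show whether the status differs by file type, by network, or by both.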

@iandees
Member

iandees commented Dec 21, 2018

I was able to get the download working on AWS with this:

aria2c \
    --allow-overwrite=true \
    --auto-file-renaming=false \
    --dir=/mnt/tmp/acs2017_5yr \
    --max-connection-per-server=5 \
    --force-sequential=true \
    --header='Connection: keep-alive' \
    --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' \
    --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
    --header='Accept-Encoding: gzip, deflate, br' \
    --header='Accept-Language: en-US,en;q=0.9' \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/2017_ACS_Geography_Files.zip" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/documentation/user_tools/ACS_5yr_Seq_Table_Number_Lookup.txt"

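For anyone who wants to rerun this for another release, the same command can be wrapped in a small script so the header set lives in one place. This is only a sketch: `acs_download`, `DEST_DIR`, `BASE`, `USER_AGENT`, and `DRY_RUN` are hypothetical names introduced here, while the aria2c flags, header values, and URLs are copied from the command above. It defaults to a dry run that prints the arguments instead of downloading.

```shell
#!/bin/sh
# Sketch only: acs_download is a hypothetical helper name. The aria2c flags and
# header values are copied from the command above; nothing new is introduced.

DEST_DIR="${DEST_DIR:-/mnt/tmp/acs2017_5yr}"
BASE="https://www2.census.gov/programs-surveys/acs/summary_file/2017"
USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

acs_download() {
    # Replace this function's positional parameters with the full argument list.
    set -- \
        --allow-overwrite=true \
        --auto-file-renaming=false \
        --dir="$DEST_DIR" \
        --max-connection-per-server=5 \
        --force-sequential=true \
        --header='Connection: keep-alive' \
        --header="User-Agent: $USER_AGENT" \
        --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
        --header='Accept-Encoding: gzip, deflate, br' \
        --header='Accept-Language: en-US,en;q=0.9' \
        "$BASE/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar" \
        "$BASE/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar" \
        "$BASE/data/5_year_entire_sf/2017_ACS_Geography_Files.zip" \
        "$BASE/documentation/user_tools/ACS_5yr_Seq_Table_Number_Lookup.txt"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print one argument per line instead of touching the network.
        printf '%s\n' aria2c "$@"
    else
        aria2c "$@"
    fi
}

acs_download
```

Setting `DRY_RUN=0` runs the real download. Whether these headers keep working is not something the script can guarantee, since the server-side rules have already changed once.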
@RobertSellers
Author

This seems to be working as required. Thank you for your diligent work on this.

@loganpowell

@iandees are .tars now cooperating for you?

@loganpowell

Naive question, do all AWS requests stem from a small set/same IP?

@iandees
Member

iandees commented Dec 21, 2018

@loganpowell they are, but it sure would be nice to figure out a way to download this data without having to go through all this header trickery. Other parts of the government might call forging these headers fraud 😬.

Requests from AWS come from different IP addresses, but there is a relatively small range of IP addresses and Akamai is probably able to figure them out. My guess that it was an IP block was based on it working from home and not from AWS machines. It's more likely that Census is using some Akamai product to prevent denial of service attacks and it's set to be too restrictive.

@loganpowell

@iandees I've had this actually happen to me on my own IP (from home using wget for cartography files). I was blacklisted and had to be manually removed from the blacklist. I'm not an expert here, but I believe the problem is when trying to pull a lot of data over the wire very quickly. Have you tried it with some throttling of your requests?
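
A throttled attempt could look something like the sketch below. It assumes GNU wget; `throttled_wget`, `RATE`, `WAIT_SECONDS`, and `DRY_RUN` are hypothetical names, and the values are illustrative guesses, not limits documented by Census. It is unknown whether throttling alone avoids the block.

```shell
#!/bin/sh
# Sketch: throttled download with wget. RATE and WAIT_SECONDS are illustrative
# values chosen for this example, not thresholds published by Census.

RATE="${RATE:-500k}"
WAIT_SECONDS="${WAIT_SECONDS:-10}"

throttled_wget() {
    # --limit-rate caps bandwidth, --wait pauses between retrievals,
    # --random-wait varies that pause, --continue resumes partial files.
    set -- wget --limit-rate="$RATE" --wait="$WAIT_SECONDS" --random-wait --continue "$@"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}

throttled_wget "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar"
```

With aria2c, the comparable knob would be `--max-overall-download-limit`, combined with a lower `--max-connection-per-server`.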

Btw, I'm very happy you figured out a workaround. I don't think what you're doing to get around the blacklisting issue would be considered fraud. You're simply doing what is needed to provide a very important public service.


3 participants