archive.today is unavailable #92

Open
hemind opened this issue Aug 3, 2021 · 16 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

hemind commented Aug 3, 2021

Bug Report

Current Behavior
When running the wayback command to archive a web page, I got the following error message:

html to archive.today failed: archive.today is unavailable.

Environment

  • Wayback version(s): 0.14.1
  • Golang version: go1.14 darwin/amd64
  • OS: macOS 11.2 (20D64)

Possible Solution

archive.today now redirects to archive.ph. Maybe we should also use the archive.ph domain?

@waybackarchiver
Contributor

@hemind Thank you for reporting. When archiving to archive.today, Wayback already tries all of its domains, and it also accesses the onion service if a Tor bundle is available. In this case the failure may be caused by a CAPTCHA, and there is currently a solution for that.

If you have other, more dependable solutions, please feel free to suggest them.
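
As a rough illustration of the "try every domain" behaviour described above, here is a minimal Go sketch. It is not the code used by Wayback or the archive.is package: `firstAvailable`, `candidateDomains`, and the `probe` callback are hypothetical names, and the domain list contains only the two domains mentioned in this thread.

```go
package main

import "fmt"

// candidateDomains holds the two domains named in this thread.
// Any further mirrors would be appended here (assumption).
var candidateDomains = []string{"archive.today", "archive.ph"}

// firstAvailable returns the first domain for which probe reports success.
// probe is a hypothetical callback, for example an HTTP HEAD request.
func firstAvailable(domains []string, probe func(domain string) bool) (string, bool) {
	for _, d := range domains {
		if probe(d) {
			return d, true
		}
	}
	return "", false
}

func main() {
	// Stand-in probe for illustration: pretend only archive.ph answers.
	probe := func(domain string) bool { return domain == "archive.ph" }

	if d, ok := firstAvailable(candidateDomains, probe); ok {
		fmt.Println("using", d)
	} else {
		fmt.Println("archive.today is unavailable")
	}
}
```

Passing the probe as a function keeps the fallback logic testable without network access.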

@waybackarchiver waybackarchiver added enhancement New feature or request good first issue Good for newcomers labels Aug 4, 2021
@waybackarchiver waybackarchiver removed the enhancement New feature or request label Aug 13, 2021

ahxxm commented Oct 22, 2021

I got a URL like http://archive.today?url={{submitted-url}}

The message suggests that the bot failed to connect to the Tor service. Is this related?

arc_1  | [2021-10-22T02:34:20] [DEBUG] [tor.go:98:useProxy] Try to connect tor proxy failed: dial tcp 127.0.0.1:9050: connect: connection refused
arc_1  | Oct 22 02:34:20.320 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 22 02:34:20.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.

The message is from a container running the Docker image with:

  • entrypoint: wayback -d telegram --ia --is --tor --ph --ip
  • WAYBACK_USE_TOR=true
  • WAYBACK_TELEGRAM_TOKEN WAYBACK_BOLT_PATH WAYBACK_STORAGE_DIR

@waybackarchiver
Contributor

@ahxxm Thanks for your feedback!

When sending a request to archive.today, the archive.is package tries to reach archive.today's onion service through a Tor proxy, and starts a temporary Tor proxy if it cannot connect to port 9050.

We can see from the logs that a temporary Tor proxy was started:

arc_1  | Oct 22 02:34:20.320 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 22 02:34:20.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.


ahxxm commented Oct 23, 2021

@waybackarchiver But I still get a URL like http://archive.today?url={{submitted-url}}, which requires manual submission (usually after dealing with an annoying Cloudflare CAPTCHA).

It seems that the current torrc disables port 9050 by setting SOCKSPort 0? According to the sample torrc comments:

## Tor opens a SOCKS proxy on port 9050 by default -- even if you don't
## configure one below. Set "SOCKSPort 0" if you plan to run Tor only
## as a relay, and not make any local application connections yourself.
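
To check a torrc for this setting programmatically, something like the sketch below might help. `socksDisabled` is a hypothetical helper, not part of Wayback; it assumes that lines beginning with `#` are comments and matches the option name case-insensitively, so both `SOCKSPort` and `SocksPort` are recognised.

```go
package main

import (
	"fmt"
	"strings"
)

// socksDisabled reports whether the torrc text contains an active
// "SOCKSPort 0" line. Comment lines (starting with '#') are ignored.
func socksDisabled(torrc string) bool {
	for _, line := range strings.Split(torrc, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		if strings.EqualFold(fields[0], "SOCKSPort") && fields[1] == "0" {
			return true
		}
	}
	return false
}

func main() {
	sample := "## Set \"SOCKSPort 0\" if you plan to run Tor only as a relay\nSOCKSPort 0\n"
	fmt.Println("SOCKS proxy disabled:", socksDisabled(sample))
}
```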


ahxxm commented Oct 23, 2021

I tried the following:

  • comment out SOCKSPort 0 line
  • start tor separately, curl --socks5 127.0.0.1:9050 ifconfig.me, get response with an exit ip address
  • wabarc a link

still failed to submit to archive.today, the debug log says connected to 9050, but it starts another tor anyway?

arc_1  | [2021-10-23T06:55:12] [DEBUG] [tor.go:103:useProxy] Connected: 127.0.0.1:9050
arc_1  | Oct 23 06:55:12.507 [warn] Tor was compiled with zstd 1.4.5, but is running with zstd 1.4.9. For safety, we'll avoid using advanced zstd functionality.
arc_1  | Oct 23 06:55:12.000 [warn] You are running Tor as root. You don't need to, and you probably shouldn't.

ah sorry, I assumed that the onion address won't require captcha, it turns out ARCHIVE_COOKIE=cf_clearance= is still needed.

How often does this cookie expire?

@waybackarchiver
Contributor

@ahxxm Thanks for reporting!

It seems that the current torrc disables port 9050 by setting SOCKSPort 0?

Yes, SocksPort should be set to 9050. I will update it later.

still failed to submit to archive.today, the debug log says connected to 9050, but it starts another tor anyway?

Actually, these log lines come from a Tor instance started by wayback/wbipfs, which is controlled by the --tor flag.

It looks like this needs to be optimized.

How often does this cookie expire?

For the captcha, I can't determine its expiration time.

waybackarchiver referenced this issue Oct 23, 2021
Remove ExcludeNodes and ExcludeExitNodes

Set ExitRelay to 0

Set LongLivedPorts to 8964

ahxxm commented Feb 8, 2022

Turns out that its expiration time is quite short.
How would you feel about implementing paid service APIs (along with the privacy-pass plugin in a headless browser, which can make one recognition "last" longer)?

@waybackarchiver
Contributor

waybackarchiver commented Feb 8, 2022

Turns out that its expiration time is quite short. How would you feel about implementing paid service APIs (along with the privacy-pass plugin in a headless browser, which can make one recognition "last" longer)?

Privacy Pass is currently supported by Cloudflare to allow users to redeem validly signed tokens instead of completing CAPTCHA solutions. privacypass/challenge-bypass-extension

Unfortunately, it appears that only hCaptcha is currently Privacy Pass compatible, while the annoying reCAPTCHA is not. Aside from that, I'm aware of the following options.

@waybackarchiver
Contributor

The dessant/buster approach is possible, and we will focus next on developing a similar strategy.

@hellodword

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) really has a good ecosystem for bypassing this kind of thing.

So how about building a tool that can convert its plugins so we can use it in go?

@waybackarchiver
Contributor

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) really has a good ecosystem for bypassing this kind of thing.

So how about building a tool that can convert its plugins so we can use it in go?

It's a fantastic idea, but I'm not sure how feasible it would be to implement.

@waybackarchiver
Contributor

puppeteer-extra-*

I'd like to use Go, but puppeteer (Node.js) really has a good ecosystem for bypassing this kind of thing.
So how about building a tool that can convert its plugins so we can use it in go?

It's a fantastic idea, but I'm not sure how feasible it would be to implement.

Running Chrome with Xvfb and then achieving the same goal via an extension might be a possible solution.


ahxxm commented Mar 17, 2022

@waybackarchiver
Contributor

https://blog.cloudflare.com/friendly-bots/

Thank you for sharing. This is important news for us, and we will submit the form as soon as possible.

@waybackarchiver
Contributor

waybackarchiver commented Mar 26, 2022

The proposed fix is only partial and needs additional test scenarios, so anyone willing to help is welcome to download the binaries from this workflow's runs for testing.

@waybackarchiver
Contributor

waybackarchiver commented May 8, 2023

We found that even when using a proxy, the default Golang client was unable to pass CAPTCHA. After troubleshooting, we confirmed that the issue was related to TLS fingerprinting, and so we added a client using uTLS that allows for the specification of TLS fingerprints.

We have now successfully addressed the issue by routing traffic through a network proxy, such as Cloudflare WARP.

To use a proxy like Cloudflare WARP, follow these steps:

  1. Sign up for a Cloudflare account.
  2. Obtain WARP credentials.
  3. Launch a Wireguard proxy.
  4. Export http_proxy and https_proxy.
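
To illustrate what the last step does for a Go program, here is a minimal sketch that reads the exported variables and builds a client from them. `proxyFromEnv` is a hypothetical helper written for this example, and no particular proxy address is assumed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
)

// proxyFromEnv returns the proxy URL found in https_proxy or http_proxy
// (checked in that order), or nil if neither variable is set.
// getenv is injected so the lookup can be exercised without touching the
// real environment.
func proxyFromEnv(getenv func(string) string) (*url.URL, error) {
	for _, key := range []string{"https_proxy", "http_proxy"} {
		if v := getenv(key); v != "" {
			return url.Parse(v)
		}
	}
	return nil, nil
}

func main() {
	proxy, err := proxyFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("invalid proxy URL:", err)
		return
	}
	if proxy == nil {
		fmt.Println("no proxy configured; requests go out directly")
		return
	}

	client := &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(proxy)}}
	_ = client // client.Get(...) would now be routed through the proxy
	fmt.Println("routing requests through", proxy.Host)
}
```

Go's default HTTP transport already honours these variables through `http.ProxyFromEnvironment`, so in many programs exporting them is enough and the explicit lookup above is only needed when a custom transport is built.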
