
Unexpected termination of scrapy-playwright jobs; caused by node:events:496 #316

renatodvc opened this issue Sep 3, 2024 · 4 comments

@renatodvc

I've been debugging this problem for a while; it's intermittent, which makes it harder to reproduce.

When running some jobs with scrapy-playwright, the jobs get abruptly terminated. If you observe the log of the job, it doesn't even acknowledge the termination, as it would in a SIGTERM case. The process apparently gets killed.

As an example, here is how the log of a simple scrapy-playwright spider scraping webstaurant.com terminates (literally the last 3 lines):

2024-09-02 23:15:59 [scrapy.core.scraper] WARNING: Dropped: Duplicate item found: {...}
2024-09-02 23:16:02 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 0 pages/min), scraped 173 items (at 0 items/min)
2024-09-02 23:16:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.webstaurantstore.com/10-strawberry-street-arctic-9c-arctic-blue-6-oz-porcelain-cup-pack/850ARCTIC9C.html>: HTTP status code is not handled or not allowed

I first noticed the problem when running the jobs with scrapyd, and here is what scrapyd logs when the problem happens:

2024-07-31T18:07:33+0000 [-] Process died: exitstatus=None  project='{my_project}' spider='webstaurant' job='ff9296c04f6211ef87de0242ac110004' pid=18659 args=['/usr/local/bin/python', '-m', 'scrapyd.runner', 'crawl', 'webstaurant', '-a', '_job=ff9296c04f6211ef87de0242ac110004', '-s', 'LOG_FILE=/var/lib/scrapyd/logs/{my_project}/webstaurant/ff9296c04f6211ef87de0242ac110004.log']

This is just extra data; the problem is unrelated to scrapyd, since it's reproducible without it.

In all occurrences, the error that seems to be the cause is a Node error:

2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Response: <200 https://assets.customer.io/assets/in-app.js>
2024-09-02T23:16:29+0000 [Launcher,16/stderr] DEBUG:scrapy-playwright:[Context=default] Request: <GET https://code.gist.build/web/latest/gist.min.js> (resource type: script, referrer: https://www.webstaurantstore.com/)
2024-09-02T23:16:29+0000 [Launcher,16/stderr] node:events:496
	      throw er; // Unhandled 'error' event
	      ^
	
	Error: write EPIPE
	    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
	Emitted 'error' event on Socket instance at:
	    at emitErrorNT (node:internal/streams/destroy:169:8)
	    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
	    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
	  errno: -32,
	  code: 'EPIPE',
	  syscall: 'write'
	}
	
	Node.js v20.12.2
  • This happens with other spiders as well; it's not related to a specific page/netloc.
  • Sometimes it happens within the first 15 minutes, sometimes it takes more than 15 hours for the issue to appear. (For some reason it seems to happen quicker when using containers.)
  • I've reproduced this in Docker containers, on GCP Cloud Run and directly on a VPS. I'm confident this isn't related to resources, as it happens with plenty of CPU/RAM available.
  • Decreasing PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and PLAYWRIGHT_MAX_CONTEXTS all the way down to 1 had no effect (see the settings sketch after this list).
  • I've tested in 3 different situations, running the same input in a pure Playwright script (no Scrapy), and the problem didn't occur. All 3 times the job completed successfully after running for more than 24 hours, which leads me to believe the problem is somewhere in the "interaction" between Scrapy and Playwright.
  • At this point I've used versions from 0.0.38 to 0.0.41; all had the issue.
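
For reference, the relevant parts of the settings during these runs looked roughly like this (a sketch: only the scrapy-playwright related pieces are shown, the rest of the project configuration is omitted):

# settings.py (sketch of the configuration used in these runs)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Both were lowered all the way to 1 with no effect on the crashes.
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1
PLAYWRIGHT_MAX_CONTEXTS = 1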

Finally, I found two issues in playwright-python that bear some resemblance; the first one appears to log the same exception and is caused by the handling of the event loop.

microsoft/playwright-python#2275
microsoft/playwright-python#2454

@renatodvc renatodvc changed the title Unexpected termination of scrapy-playwright jobs Unexpected termination of scrapy-playwright jobs; caused by node:events:496 Sep 3, 2024
@elacuesta
Member

elacuesta commented Sep 4, 2024

Please provide a minimal reproducible example, as well as additional information about your environment (e.g. the output of scrapy version -v). This is to understand the situation better, because there is no direct handling of event loops except when running on Windows.

@renatodvc
Author

Hi @elacuesta, thank you for getting back to me.

Regarding the environment, I'm running Ubuntu 22.04.1 LTS on the VPS and scrapy==2.11.2. I've also reproduced the problem in other environments, but they were all Linux.

Unfortunately I don't know how to provide an MRE, given that the issue is intermittent and I can't pinpoint the cause. I can provide the spider (which is pretty standard: start_requests -> parse -> item) and the input used by start_requests (which is fairly large), but that wouldn't be "minimal".
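
For what it's worth, the general shape of the spider is roughly this (a simplified sketch, not the real code; the actual start URLs, parsing logic and item fields differ):

import scrapy


class WebstaurantSpider(scrapy.Spider):
    name = "webstaurant"

    def start_requests(self):
        # The real input list is large; a single URL stands in for it here.
        for url in ["https://www.webstaurantstore.com/"]:
            yield scrapy.Request(url, meta={"playwright": True})

    def parse(self, response):
        # Standard parse -> item flow, details omitted.
        yield {"url": response.url, "title": response.css("title::text").get()}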

So far there isn't a specific page where this seems to happen; to reproduce, I just let it run until the process unexpectedly dies (anywhere from minutes to hours).
Any insight on how to better debug this, or how I could provide more info, would be great.

BTW, I mentioned the event loop because it seems to be the cause of the problem in that particular linked issue; of course, it may not be related to this issue at all.

@elacuesta
Member

elacuesta commented Sep 10, 2024

For some time I've suspected the request & response loggers might be causing some trouble by keeping references to the requests and responses longer than strictly necessary. This is by no means a tested hypothesis, but at this point I don't have any other explanation about this specific subject. Would you be able to try the code from the disable-request-response-logger branch (5de5a52) and set PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED=False?
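
If it helps, trying that branch could look roughly like this (the git install line is just one way of fetching the branch, adjust to your setup):

# Install the branch, e.g.:
#   pip install "git+https://github.com/scrapy-plugins/scrapy-playwright.git@disable-request-response-logger"
# Then, in settings.py:
PLAYWRIGHT_REQUEST_RESPONSE_LOGGER_ENABLED = False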

@renatodvc
Author

renatodvc commented Sep 17, 2024

Thank you @elacuesta,

I've executed three jobs so far with the branch you linked; a quick summary of my findings:

  1. It seems that the loggers are indeed related to the problem.
  2. I'll still run more tests, as my findings so far could just be a fluke.
  3. The jobs that didn't have an early termination ended up stalling: not performing new requests, but also not ending the job. ATM I'm not sure whether there are no more requests in the scheduler, or whether there are but the spider can't perform them in Playwright for whatever reason.
  4. Of course there is also the chance that the stalling is a completely different issue, unrelated to this one.

For more details:

  1. My first job had the same early termination issue, with the same node:events:496 error logged. However, this job was using PLAYWRIGHT_ABORT_REQUEST (see the settings sketch after this list), and because of it the job kept logging the requests that were getting aborted. The other DEBUG request/response messages from scrapy-playwright weren't logged, as expected. E.g.:
2024-09-11T21:45:54+0000 [Launcher,18/stderr] DEBUG:scrapy-playwright:[Context=default] Aborted Playwright request <GET https://bat.bing.com/bat.js>
  2. Since your suspicion was related to the loggers, I ran 2 other jobs without PLAYWRIGHT_ABORT_REQUEST, so there wouldn't be any abort-request log messages. In both jobs:
  • Neither had the early termination problem; one of the jobs ran for more than 48 hours.
  • Both jobs stalled: they don't terminate, but also don't perform new requests. Below is a piece of the log to illustrate:
2024-09-17 06:13:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
2024-09-17 06:14:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
...
2024-09-17 15:53:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
2024-09-17 15:54:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
  • ATM I can't say whether they exhausted the input and then stalled, or stalled mid-job. Would Playwright crashing mid-job cause the job to stall?
  • Since one job stalled, I ran the second with CLOSESPIDER_TIMEOUT_NO_ITEM; it logs the message, but doesn't end the job:
2024-09-17 10:11:41 [scrapy.extensions.closespider] INFO: Closing spider since no items were produced in the last 7200 seconds.
2024-09-17 10:11:41 [scrapy.core.engine] INFO: Closing spider (closespider_timeout_no_item)
2024-09-17 10:11:42 [scrapy.extensions.logstats] INFO: Crawled 4001 pages (at 0 pages/min), scraped 6648 items (at 0 items/min)
...
  • Sending SIGTERM once gets acknowledged but doesn't end the process either; it has to receive SIGTERM twice to force it to end.
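
For completeness, the settings involved in these runs were along these lines (a sketch; the real abort predicate filters more than shown here, and the helper name is illustrative):

# settings.py (sketch of the settings mentioned above)

def should_abort_request(request):
    # Illustrative predicate: drop tracking scripts; the real one is broader.
    return request.resource_type == "script" and "bat.bing.com" in request.url

# Used only in the first run; removed in the later two runs so the
# "Aborted Playwright request" debug messages wouldn't be logged.
PLAYWRIGHT_ABORT_REQUEST = should_abort_request

# Used in the second stalled run: the closing message is logged, but the
# job never actually terminates.
CLOSESPIDER_TIMEOUT_NO_ITEM = 7200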
