I am not able to acquire players from the scraper. Here is an excerpt from my log file:

make acquire_local ARGS="--asset players --season 2022"
PYTHONPATH=:`pwd`/. python scripts/acquire.py local --asset players --season 2022
2023-07-25 21:05:43,299 [INFO]: Schedule players for season 2022
2023-07-25 21:05:43,300 [INFO]: Overridden settings:
{'FEED_URI_PARAMS': 'tfmkt.utils.uri_params',
'HTTPCACHE_ENABLED': True,
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['tfmkt'],
'USER_AGENT': 'transfermarkt-datasets/1.0 '
'(https://github.com/dcaribou/transfermarkt-datasets)'}
2023-07-25 21:05:43,306 [INFO]: Telnet Password: 77e807760644c197
2023-07-25 21:05:43,312 [INFO]: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-07-25 21:05:43,345 [INFO]: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2023-07-25 21:05:43,346 [INFO]: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-07-25 21:05:43,347 [INFO]: Enabled item pipelines:
[]
2023-07-25 21:05:43,350 [INFO]: Spider opened
2023-07-25 21:05:43,417 [INFO]: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-07-25 21:05:43,417 [INFO]: Telnet console listening on 127.0.0.1:6023
2023-07-25 21:05:45,842 [INFO]: Closing spider (finished)
2023-07-25 21:05:45,843 [INFO]: Dumping Scrapy stats:
{'downloader/request_bytes': 90051,
'downloader/request_count': 241,
'downloader/request_method_count/GET': 241,
'downloader/response_bytes': 18647027,
'downloader/response_count': 241,
'downloader/response_status_count/200': 241,
'elapsed_time_seconds': 2.426156,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 7, 25, 19, 5, 45, 843008),
'httpcache/hit': 241,
'httpcompression/response_bytes': 80917418,
'httpcompression/response_count': 241,
'log_count/DEBUG': 246,
'log_count/INFO': 10,
'memusage/max': 145530880,
'memusage/startup': 145530880,
'response_received_count': 241,
'scheduler/dequeued': 241,
'scheduler/dequeued/memory': 241,
'scheduler/enqueued': 241,
'scheduler/enqueued/memory': 241,
'start_time': datetime.datetime(2023, 7, 25, 19, 5, 43, 416852)}
2023-07-25 21:05:45,843 [INFO]: Spider closed (finished)

Would be great if you could help me out here =)
Replies: 1 comment
Hey @HaiFred, I noticed the following two lines in the log you provided:
'downloader/request_count': 241
...
'httpcache/hit': 241,

This suggests that the crawler is in fact working, since 241 requests is roughly what is to be expected for a run of the players crawler (I get about the same numbers when I run it locally). However, the crawler is serving responses from your local cache, which could be why you are not getting any output. Scrapy may save responses to the cache regardless of the return status, and it has happened to me that I ended up with a corrupted cache because of this. Can you try running it with the cache deactivated? You can do so by changing the HTTPCACHE_ENABLED setting to False in the config (transfermarkt-datasets/config.yml, line 30 as of commit 53f8b71).
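If you would rather clear the cache than disable it, here is a minimal sketch. It assumes Scrapy's default on-disk cache location, `.scrapy/httpcache` under the project root; if the project overrides the HTTPCACHE_DIR setting, adjust the path accordingly:

```python
import shutil
from pathlib import Path

# Scrapy's on-disk HTTP cache defaults to <project>/.scrapy/httpcache
# (configurable through the HTTPCACHE_DIR setting).
cache_dir = Path(".scrapy") / "httpcache"

if cache_dir.exists():
    # Discard any possibly-corrupted cached responses so the
    # next run fetches fresh pages from the site.
    shutil.rmtree(cache_dir)
    print(f"Removed HTTP cache at {cache_dir}")
else:
    print("No HTTP cache found; nothing to remove")
```

After clearing the cache (or setting HTTPCACHE_ENABLED to False), re-running `make acquire_local ARGS="--asset players --season 2022"` should fetch the pages again instead of serving all 241 responses from the cache.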