I am not able to acquire players from the scraper. Here is an excerpt from my log file:

make acquire_local ARGS="--asset players --season 2022"
PYTHONPATH=:`pwd`/. python scripts/acquire.py local --asset players --season 2022
2023-07-25 21:05:43,299 [INFO]: Schedule players for season 2022
2023-07-25 21:05:43,300 [INFO]: Overridden settings:
{'FEED_URI_PARAMS': 'tfmkt.utils.uri_params',
'HTTPCACHE_ENABLED': True,
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['tfmkt'],
'USER_AGENT': 'transfermarkt-datasets/1.0 '
'(https://github.com/dcaribou/transfermarkt-datasets)'}
2023-07-25 21:05:43,306 [INFO]: Telnet Password: 77e807760644c197
2023-07-25 21:05:43,312 [INFO]: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-07-25 21:05:43,345 [INFO]: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2023-07-25 21:05:43,346 [INFO]: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-07-25 21:05:43,347 [INFO]: Enabled item pipelines:
[]
2023-07-25 21:05:43,350 [INFO]: Spider opened
2023-07-25 21:05:43,417 [INFO]: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-07-25 21:05:43,417 [INFO]: Telnet console listening on 127.0.0.1:6023
2023-07-25 21:05:45,842 [INFO]: Closing spider (finished)
2023-07-25 21:05:45,843 [INFO]: Dumping Scrapy stats:
{'downloader/request_bytes': 90051,
'downloader/request_count': 241,
'downloader/request_method_count/GET': 241,
'downloader/response_bytes': 18647027,
'downloader/response_count': 241,
'downloader/response_status_count/200': 241,
'elapsed_time_seconds': 2.426156,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 7, 25, 19, 5, 45, 843008),
'httpcache/hit': 241,
'httpcompression/response_bytes': 80917418,
'httpcompression/response_count': 241,
'log_count/DEBUG': 246,
'log_count/INFO': 10,
'memusage/max': 145530880,
'memusage/startup': 145530880,
'response_received_count': 241,
'scheduler/dequeued': 241,
'scheduler/dequeued/memory': 241,
'scheduler/enqueued': 241,
'scheduler/enqueued/memory': 241,
'start_time': datetime.datetime(2023, 7, 25, 19, 5, 43, 416852)}
2023-07-25 21:05:45,843 [INFO]: Spider closed (finished)

Would be great if you could help me out here =)
Replies: 1 comment
Hey @HaiFred, I noticed the following two lines in the log you provided:
'downloader/request_count': 241
...
'httpcache/hit': 241,

This suggests that the crawler is in fact working, since 241 requests is roughly what is to be expected for a run of the players crawler (I get about the same numbers when I run it locally). However, the crawler is serving responses from your local cache, which could be why you are not getting any output. Scrapy may save responses to the cache regardless of the return status, and it has happened to me that I ended up with a corrupted cache because of this. Can you try running it with the cache deactivated? You can do so by changing the HTTPCACHE_ENABLED setting to False in the config (transfermarkt-datasets/config.yml, line 30 as of commit 53f8b71).
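If you would rather clear the cache than disable it, here is a minimal sketch. It assumes Scrapy's default on-disk cache location, `.scrapy/httpcache` under the project root; if the project overrides the HTTPCACHE_DIR setting, adjust the path accordingly:

```python
import shutil
from pathlib import Path

# Scrapy's on-disk HTTP cache defaults to <project>/.scrapy/httpcache
# (configurable through the HTTPCACHE_DIR setting).
cache_dir = Path(".scrapy") / "httpcache"

if cache_dir.exists():
    # Discard any possibly-corrupted cached responses so the
    # next run fetches fresh pages from the site.
    shutil.rmtree(cache_dir)
    print(f"Removed HTTP cache at {cache_dir}")
else:
    print("No HTTP cache found; nothing to remove")
```

After clearing the cache (or setting HTTPCACHE_ENABLED to False), re-running `make acquire_local ARGS="--asset players --season 2022"` should fetch the pages again instead of serving all 241 responses from the cache.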