[RFC] A new journey #203
It would be super useful to also add the feature of feeding more context to the spiders: not just a list of start_urls, but a list of JSON objects like so:

{
    "start_urls": [
        {
            "start_url": "https://example.com/",
            "sku": 1234
        }
    ]
}

This was already proposed a while back in #156.
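(For illustration only: a message in that shape could be queued with any Redis client. The connection details and the my_spider:start_urls key below are assumptions based on scrapy-redis defaults, not part of the proposal, and the spider still needs to override make_request_from_data to consume the extra fields, as the next comment demonstrates.)

import json

import redis  # redis-py client, assumed to be installed

# Hypothetical connection details; scrapy-redis reads start messages from
# '<spider name>:start_urls' by default, so adjust the key to your spider.
client = redis.Redis(host='localhost', port=6379)

message = {
    "start_url": "https://example.com/",
    "sku": 1234,
}
# Push one JSON object per start URL; the spider decodes it in
# make_request_from_data and can attach "sku" to request.meta.
client.lpush('my_spider:start_urls', json.dumps(message))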
Hello @Sm4o, I wrote an example according to your description. Has this achieved your purpose?

import json

from scrapy import Request, Spider
from scrapy.http import Response
from scrapy_redis.spiders import RedisSpider


class SpiderError(Exception):
    """Raised when the spider receives data it cannot handle."""


class BaseParser:
    name = None

    def __init__(self, spider: Spider):
        # Keep a reference to the spider, e.g. for logging via self.spider.logger.
        self.spider = spider

    def parse(
        self,
        *,
        response: Response,
        **kwargs
    ) -> list[str]:
        raise NotImplementedError('`parse()` must be implemented.')


class HtmlParser(BaseParser):
    name = 'html'

    def parse(
        self,
        *,
        response: Response,
        rows_rule: str | None = '//tr',
        row_start: int | None = 0,
        row_end: int | None = -1,
        cells_rule: str | None = 'td',
        field_rule: str | None = 'text()',
    ) -> list[str]:
        """Extract cell values from an HTML table using the given XPath rules."""
        # Minimal illustrative implementation (the original sketch left this as a stub).
        items = []
        for row in response.xpath(rows_rule)[row_start:row_end]:
            cells = row.xpath(cells_rule)
            items.extend(cells.xpath(field_rule).getall())
        return items


def parser_factory(name: str, spider: Spider) -> BaseParser:
    if name == 'html':
        return HtmlParser(spider)
    else:
        raise SpiderError(f'Cannot find parser named "{name}"')


class MySpider(RedisSpider):
    name = 'my_spider'

    def make_request_from_data(self, data):
        # `data` is the raw message popped from Redis; decode it and parse it as JSON.
        text = data.decode(encoding=self.redis_encoding)
        params = json.loads(text)
        return Request(
            params.get('url'),
            dont_filter=True,
            meta={
                'parser_name': params.get('parser_name'),
                'parser_params': {
                    'rows_rule': params.get('rows_rule'),    # e.g. '//tbody/tr'
                    'row_start': params.get('row_start'),    # e.g. 1
                    'row_end': params.get('row_end'),        # e.g. -1
                    'cells_rule': params.get('cells_rule'),  # e.g. 'td'
                    'field_rule': params.get('field_rule'),  # e.g. 'text()'
                }
            }
        )

    def parse(self, response: Response, **kwargs):
        # Pick the parser requested by the Redis message and delegate to it.
        name = response.meta.get('parser_name')
        params = response.meta.get('parser_params')
        parser = parser_factory(name, self)
        items = parser.parse(response=response, **params)
        for item in items:
            yield item
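(A note on running it: the example assumes the usual scrapy-redis wiring in settings.py. A minimal sketch, with illustrative values only:)

# settings.py (sketch; values are illustrative, adjust to your deployment)

# Route scheduling and duplicate filtering through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis connection and the key the spider reads its start messages from;
# %(name)s expands to the spider name, so MySpider reads 'my_spider:start_urls'.
REDIS_URL = "redis://localhost:6379"
REDIS_START_URLS_KEY = "%(name)s:start_urls"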
Sounds perfect. Please take the lead! @LuckyPigeon has been given permissions to the repo.
That's exactly what I needed. Thanks a lot!
I am working on it...
I'm trying to reach 1500 requests/min, but it seems like using a single spider might not be the best. I noticed that scrapy-redis reads urls from redis in batches equal to the CONCURRENT_REQUESTS setting. So if I set CONCURRENT_REQUESTS=1000, scrapy-redis waits until all requests are done before requesting another batch of 1000 from redis. I feel like I'm using this tool wrong, so any tips or suggestions would be greatly appreciated.
I think this could be improved by having a background thread that keeps a buffer of urls to feed the scrapy scheduler when there is capacity. The current approach relies on the idle mechanism to tell whether we can fetch another batch. This is suited for start urls that generate a lot of subsequent requests. If your start urls generate one or very few additional requests, then you need to use more spider instances with a lower batch size.
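(For reference, the batch size discussed here is controlled from the settings; a minimal sketch of the two knobs involved. Whether REDIS_START_URLS_BATCH_SIZE is honoured depends on the scrapy-redis version, so treat this as an assumption to verify against your installed release.)

# settings.py (sketch; exact behaviour depends on the scrapy-redis version)

# How many requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 16

# How many messages scrapy-redis pops from Redis per fetch; when this is not
# set it falls back to CONCURRENT_REQUESTS. Use a lower value (and run more
# spider processes) when each start URL yields few follow-up requests.
REDIS_START_URLS_BATCH_SIZE = 16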
So far, I have done:
Now I'm having some problems with the documentation. I am Chinese and my written English is not strong, so I would like someone to take over the documentation. I think the current documentation is too simplistic; perhaps we need to rearrange its structure and content.
@whg517
Hello everyone, I will reorganize the features later and try to create a new feature PR. As the New Year begins, I still have many plans, and I will arrange them as soon as possible.
@whg517 thanks for the initiative. Could you also include the pros and cons of moving the project to scrapy-plugins org? |
@whg517 any progress? |
fix #226
Hi, scrapy-redis is one of the most commonly used extensions for Scrapy, but it seems to me that this project has not been maintained for a long time, and some parts of the project have not been kept up to date.
Given the current updates to the Python and Scrapy versions, I would like to make some feature contributions to the project. If you agree, I will arrange the follow-up work.
Tasks: