[RFC] A new journey #203
It would be super useful to also add the feature of feeding more context to the spiders: not just a list of start_urls, but a list of JSON objects like so:

{
    "start_urls": [
        {
            "start_url": "https://example.com/",
            "sku": 1234
        }
    ]
}

This was already proposed a while back in #156.
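(For illustration only: a message in that shape could be queued with any Redis client. The connection details and the my_spider:start_urls key below are assumptions based on scrapy-redis defaults, not part of the proposal, and the spider still needs to override make_request_from_data to consume the extra fields, as the next comment demonstrates.)

import json

import redis  # redis-py client, assumed to be installed

# Hypothetical connection details; scrapy-redis reads start messages from
# '<spider name>:start_urls' by default, so adjust the key to your spider.
client = redis.Redis(host='localhost', port=6379)

message = {
    "start_url": "https://example.com/",
    "sku": 1234,
}
# Push one JSON object per start URL; the spider decodes it in
# make_request_from_data and can attach "sku" to request.meta.
client.lpush('my_spider:start_urls', json.dumps(message))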
Hello @Sm4o, I wrote an example according to your description. Has this achieved your purpose?

import json

from scrapy import Request, Spider
from scrapy.http import Response
from scrapy_redis.spiders import RedisSpider


class SpiderError(Exception):
    """Raised when the spider receives data it cannot handle."""


class BaseParser:
    name = None

    def __init__(self, spider: Spider):
        # Keep a reference to the spider, e.g. for logging via self.spider.logger.
        self.spider = spider

    def parse(
        self,
        *,
        response: Response,
        **kwargs
    ) -> list[str]:
        raise NotImplementedError('`parse()` must be implemented.')


class HtmlParser(BaseParser):
    name = 'html'

    def parse(
        self,
        *,
        response: Response,
        rows_rule: str | None = '//tr',
        row_start: int | None = 0,
        row_end: int | None = -1,
        cells_rule: str | None = 'td',
        field_rule: str | None = 'text()',
    ) -> list[str]:
        """Extract cell values from an HTML table using the given XPath rules."""
        # Minimal illustrative implementation (the original sketch left this as a stub).
        items = []
        for row in response.xpath(rows_rule)[row_start:row_end]:
            cells = row.xpath(cells_rule)
            items.extend(cells.xpath(field_rule).getall())
        return items


def parser_factory(name: str, spider: Spider) -> BaseParser:
    if name == 'html':
        return HtmlParser(spider)
    else:
        raise SpiderError(f'Cannot find parser named "{name}"')


class MySpider(RedisSpider):
    name = 'my_spider'

    def make_request_from_data(self, data):
        # `data` is the raw message popped from Redis; decode it and parse it as JSON.
        text = data.decode(encoding=self.redis_encoding)
        params = json.loads(text)
        return Request(
            params.get('url'),
            dont_filter=True,
            meta={
                'parser_name': params.get('parser_name'),
                'parser_params': {
                    'rows_rule': params.get('rows_rule'),    # e.g. '//tbody/tr'
                    'row_start': params.get('row_start'),    # e.g. 1
                    'row_end': params.get('row_end'),        # e.g. -1
                    'cells_rule': params.get('cells_rule'),  # e.g. 'td'
                    'field_rule': params.get('field_rule'),  # e.g. 'text()'
                }
            }
        )

    def parse(self, response: Response, **kwargs):
        # Pick the parser requested by the Redis message and delegate to it.
        name = response.meta.get('parser_name')
        params = response.meta.get('parser_params')
        parser = parser_factory(name, self)
        items = parser.parse(response=response, **params)
        for item in items:
            yield item
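(A note on running it: the example assumes the usual scrapy-redis wiring in settings.py. A minimal sketch, with illustrative values only:)

# settings.py (sketch; values are illustrative, adjust to your deployment)

# Route scheduling and duplicate filtering through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis connection and the key the spider reads its start messages from;
# %(name)s expands to the spider name, so MySpider reads 'my_spider:start_urls'.
REDIS_URL = "redis://localhost:6379"
REDIS_START_URLS_KEY = "%(name)s:start_urls"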
Sounds perfect. Please take the lead! @LuckyPigeon has been given permissions to the repo.
That's exactly what I needed. Thanks a lot!
I am working on it...
I'm trying to reach 1500 requests/min, but it seems like using a single spider might not be the best. I noticed that scrapy-redis reads urls from redis in batches equal to the CONCURRENT_REQUESTS setting. So if I set CONCURRENT_REQUESTS=1000, scrapy-redis waits until all requests are done before requesting another batch of 1000 from redis. I feel like I'm using this tool wrong, so any tips or suggestions would be greatly appreciated.
I think this could be improved by having a background thread that keeps a buffer of urls to feed the scrapy scheduler when there is capacity. The current approach relies on the idle mechanism to tell whether we can fetch another batch. This is suited for start urls that generate a lot of subsequent requests. If your start urls generate one or very few additional requests, then you need to use more spider instances with a lower batch size.
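(For reference, the batch size discussed here is controlled from the settings; a minimal sketch of the two knobs involved. Whether REDIS_START_URLS_BATCH_SIZE is honoured depends on the scrapy-redis version, so treat this as an assumption to verify against your installed release.)

# settings.py (sketch; exact behaviour depends on the scrapy-redis version)

# How many requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 16

# How many messages scrapy-redis pops from Redis per fetch; when this is not
# set it falls back to CONCURRENT_REQUESTS. Use a lower value (and run more
# spider processes) when each start URL yields few follow-up requests.
REDIS_START_URLS_BATCH_SIZE = 16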
So far, I have done:
Now I'm having some problems with the documentation. I am Chinese and my written English is not strong, so I would like someone to take over the documentation. I think the current documentation is too simplistic; perhaps we need to rearrange its structure and content.
@whg517
Hello everyone, I will reorganize the features later and try to create a new feature PR. As the New Year begins, I still have many plans, and I will arrange them as soon as possible.
@whg517 thanks for the initiative. Could you also include the pros and cons of moving the project to scrapy-plugins org? |
@whg517 any progress? |
fix #226
Hi, scrapy-redis is one of the most commonly used extensions for Scrapy, but it seems to me that this project has not been maintained for a long time, and some parts of the project have not been kept up to date.
Given the current updates to the Python and Scrapy versions, I would like to make some feature contributions to the project. If you agree, I will arrange the follow-up work.
Tasks: