-
Hi @krstp, there has been a problem with the latency and I'm investigating; it was between 0.5 and 1 sec before that. The server is lean, and Trafilatura distributes requests among hosts to allow for fluid downloads and also to avoid being blocked.
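
To illustrate what "distributing requests among hosts" can look like, here is a minimal sketch that reorders a URL list so that consecutive requests go to different hosts. It is not Trafilatura's actual implementation; the function name `interleave_by_host` and the round-robin strategy are assumptions made for this example only.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin by host (hypothetical helper, not a Trafilatura API)."""
    # Group URLs per host, keeping the original order within each host.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)

    ordered = []
    # Take one URL from each host in turn until every queue is empty.
    while queues:
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered


if __name__ == "__main__":
    sample = [
        "https://a.org/1",
        "https://a.org/2",
        "https://b.org/1",
        "https://a.org/3",
        "https://b.org/2",
    ]
    for url in interleave_by_host(sample):
        print(url)
```

Spreading the load this way means no single host receives a burst of back-to-back requests, which is one plausible way to reduce the risk of being blocked.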
-
Currently Trafilatura works in a largely synchronous way because of its underlying `urllib` dependency. What I am finding is that at approx. 60-80 req/sec the engine more or less locks up. I wonder how people manage multiple extraction requests so that they run concurrently and efficiently?

PS. I am not finding any existing conversations on the subject.

PS2. I see there is a Trafilatura API that supports large-volume requests... I wonder how the backend handles this given the underlying `urllib` 🤔 Ref: https://rapidapi.com/trafapi/api/trafilatura

What is interesting: on the "large volume" RapidAPI endpoint that Trafilatura hosts, the latency of a single request is approx. `6000ms`, i.e. `6sec`, which is not a blazing-fast turnaround for high-volume data extraction (see screenshot). Note, this is not a complaint about Trafilatura itself, but rather about the backend solution for handling multiple concurrent requests (ideally in a fully `async` way, i.e. `aiohttp/asyncio`). I care less about how long the extraction or the full extract-and-return loop takes, could be `1sec`, could be `30sec`; I would rather have good support for concurrent requests. However, from what I see atm, Trafilatura works in a largely synchronous way.
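
For context, one common workaround for a blocking library is to push each call onto a worker thread from `asyncio` and cap how many run at once. The sketch below shows only that pattern, using the standard library. `blocking_extract` is a stand-in that sleeps instead of downloading, and `extract_many` and `max_concurrency` are names made up for this example. In real use the worker would presumably call `trafilatura.fetch_url(url)` followed by `trafilatura.extract(...)`, which is an assumption about how one would wire it up, not something confirmed in this thread.

```python
import asyncio
import time
from typing import Callable, Dict, Iterable, Optional


def blocking_extract(url: str) -> Optional[str]:
    """Stand-in for a blocking download-and-extract call (hypothetical).

    In real use this might wrap trafilatura.fetch_url() and
    trafilatura.extract(); here it only simulates I/O latency.
    """
    time.sleep(0.1)
    return f"text from {url}"


async def extract_many(
    urls: Iterable[str],
    worker: Callable[[str], Optional[str]] = blocking_extract,
    max_concurrency: int = 20,
) -> Dict[str, Optional[str]]:
    """Run a blocking worker over many URLs without blocking the event loop."""
    # The semaphore caps the number of worker calls in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(url: str):
        async with semaphore:
            # asyncio.to_thread (Python 3.9+) runs the blocking call in a thread.
            return url, await asyncio.to_thread(worker, url)

    pairs = await asyncio.gather(*(run_one(u) for u in urls))
    return dict(pairs)


if __name__ == "__main__":
    urls = [f"https://example.org/page/{i}" for i in range(10)]
    start = time.perf_counter()
    results = asyncio.run(extract_many(urls, max_concurrency=5))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} pages in {elapsed:.2f}s")
```

This does not make the library itself asynchronous; it only keeps the event loop responsive while threads wait on the network. CPU-heavy extraction would still be limited by the GIL, so a process pool may be a better fit for that part.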