[bug?] Batch API status stuck at registering
#1497
I accidentally started another batch at 00:47 because I'm dumb 😆 Also stuck, still.
registering
registering
Maybe related, old finished requests are also stuck in
@rudiedirkx: there was a migration to use Docker on batch, there was a spike on Wednesday, and it's being investigated at the moment, related:
BTW I never tried it, but it should be possible to cancel a request, see the OpenAPI specification. |
The original 2 requests are Is it safe to start new requests (2 x ~250 domains every night), or should I use the old/individual/scraper checks for now? #1255 (comment) |
RCA:
The queue was cleared as a quick temporary solution, since the queue was only slowly moving. As a result, some requests in |
Tonight's batches of 374 domains hung again. Requested at 18:02, still |
Because of the option in the dashboard to scan once or twice a month, the 1st and 15th of each month have some extra load. We noticed a task queue building up from 15 September 0:00 UTC until it was back at 0 at 8:40 UTC (10:40) on 16 September. We're unsure whether this behavior is new (because of the new setup) or old and we're only seeing it now because there is more monitoring. Sadly the report generation can 'clog' the queue, and that also holds up the registering of domains. Please let us know if this happens again before the end of the month. It can be expected to be a problem on the first of the month. We're looking into improving the current situation. |
Tonight's batch is stuck at Is there a service status page or something? It would also be nice to get some more metadata in the Request status API call. Maybe there's a queue with # of waiting before me? Or a checked-on timestamp? Something to indicate it's moving ahead and not just stuck forever. |
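Until the status call exposes something like a queue position or a checked-on timestamp, a client can at least put a limit on how long it waits before treating a request as stuck. Below is a minimal sketch of such a watchdog. It assumes a caller-supplied `fetch_status` function, which is hypothetical and stands in for whatever call retrieves the request status; the three-hour limit and five-minute poll interval are arbitrary illustrations, not recommendations from the maintainers.

```python
import time
from typing import Callable, Optional


def wait_for_batch(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status string
    stuck_status: str = "registering",
    max_stuck_seconds: float = 3 * 60 * 60,
    poll_seconds: float = 300,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the status leaves `stuck_status`.

    Returns the first status that differs from `stuck_status`, or None when
    the request stayed in `stuck_status` for `max_stuck_seconds` or longer.
    """
    started = clock()
    while True:
        status = fetch_status()
        if status != stuck_status:
            return status
        if clock() - started >= max_stuck_seconds:
            return None  # caller can cancel the request or fall back
        sleep(poll_seconds)
```

A `None` result is the point at which a nightly job could cancel the request and fall back to individual checks, instead of someone having to notice hours later.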
@rudiedirkx we found the issue. It is probably not the same cause as issues mentioned earlier but at the moment I can't tell. The cause of this issue was the |
Both status I don't know what the beat scheduler is, but does it make sense that it was stuck at Is there a Batch Service status page or something? Or would this have been undetectable anyway? |
It's only been an hour since requesting, but tonight's batch (2x 271 domains) seems to be stuck at
Still registering after 3 hours =(
Cancelled after 6 hours. |
Tonight's batch (2x 359) worked perfectly. |
There are some issues after the Docker migration, both in single tests and in batch (the API), that were not found during testing. This might be due to some bugs thought to be fixed and some optimizations that were made for a few large batch jobs, while the production load is probably different (more, smaller jobs). Still to be investigated what causes the build-up. |
I'll keep the 1st and 15th in mind next time. |
Tonight's batch (~341) hangs at
Added at 18:02, still
Still stuck at registering at 21:40. I've cancelled it.
Started another one, waited 20 mins, stuck at |
Tonight's batch (~621) is stuck at |
Thanks for reporting, the queue seems stuck at the moment with 1.8 million tasks queued. |
Tonight's is stuck at
1.8M, ouch 😮 Good luck! |
I personally only have access to the monitoring, nothing to reset the queue, but it's reported to the team.
The 1.8 million queue is for
Internet.nl/internetnl/settings.py, lines 390 to 391 in 86f39f7
I've a hunch somehow the
There is a mutex lock in the generation:
Internet.nl/interface/batch/util.py, lines 267 to 274 in 86f39f7
However I don't see a mutex here:
Internet.nl/interface/batch/util.py, lines 922 to 923 in 86f39f7
I don't know Celery well enough to tell whether this does some unique-constraint check on the parameters, or whether it will just keep on adding
Next to this, one or multiple |
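On the Celery question: as far as I know, Celery does not deduplicate tasks by name or arguments on its own, so a periodic task that is scheduled faster than the workers can drain it will keep piling up unless something guards the enqueue. The sketch below shows the general idea of such a guard. It is a standalone illustration, not the project's code: `ExpiringLocks` is a hypothetical in-memory stand-in for a shared cache with atomic "add if absent" semantics, `enqueue_once` and the plain list used as a queue are likewise made up for the example, and the TTL is there only so a crashed worker cannot hold the lock forever.

```python
import threading
import time
from typing import Callable, Dict, List


class ExpiringLocks:
    """Hypothetical in-memory stand-in for a shared cache with atomic add."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires: Dict[str, float] = {}
        self._guard = threading.Lock()
        self._clock = clock

    def acquire(self, key: str, ttl: float) -> bool:
        """Take the lock for `key` unless an unexpired lock already exists."""
        with self._guard:
            now = self._clock()
            expires = self._expires.get(key)
            if expires is not None and expires > now:
                return False  # a copy of the task is still pending or running
            self._expires[key] = now + ttl
            return True

    def release(self, key: str) -> None:
        """Drop the lock; the worker would call this when the task finishes."""
        with self._guard:
            self._expires.pop(key, None)


def enqueue_once(
    locks: ExpiringLocks, queue: List[str], task_name: str, ttl: float = 300.0
) -> bool:
    """Append `task_name` to `queue` only if no copy currently holds the lock."""
    if not locks.acquire(task_name, ttl):
        return False  # skip: avoids stacking identical tasks on the queue
    queue.append(task_name)
    return True
```

With a guard like this the scheduler can fire as often as it likes; the queue holds at most one pending copy of each guarded task, at the cost of relying on the TTL when a worker dies without releasing.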
c35786c3ddda46c2ade734ee5fd5b249
1e96cbedc8f04955aec3ef87f672b729
We've been using the API for a few weeks, and it works almost perfectly, but tonight it's stuck. We created a batch for 384 domains at 18:02:14, and 1.5h later both requests are still registering. Usually the entire batch takes less than 1h to complete, results and report and all. Sometimes for 200 domains, sometimes for 500. Maybe I'm too quick to complain, but this has never happened before, so I'm concerned.