[bug?] Batch API status stuck at registering #1497

Open
rudiedirkx opened this issue Sep 11, 2024 · 19 comments
@rudiedirkx

  • Web request: c35786c3ddda46c2ade734ee5fd5b249
  • Mail request: 1e96cbedc8f04955aec3ef87f672b729

We've been using the API for a few weeks, and it works almost perfectly, but tonight it's stuck. We created a batch for 384 domains at 18:02:14, and 1.5h later both requests are still registering. Usually the entire batch takes less than 1h to complete, results and report and all. Sometimes for 200 domains, sometimes for 500.

Maybe I'm complaining too soon, but this has never happened before, so I'm concerned.

@rudiedirkx (Author)

I accidentally started another batch at 00:47 because I'm dumb 😆 Also stuck, still.

web request: 152946d5c6154c6996da92120ec86dcc
mail request: 68b427cc9e7a47049022806e61ffdfc7

@rudiedirkx rudiedirkx changed the title Batch API status stuck at registering [bug?] Batch API status stuck at registering Sep 12, 2024
@bwbroersma bwbroersma transferred this issue from internetstandards/Internet.nl-dashboard Sep 13, 2024
@bwbroersma bwbroersma added the bug Unexpected or unwanted behaviour of current implementations label Sep 13, 2024
@bwbroersma (Collaborator) commented Sep 13, 2024

Maybe related: old, finished requests are also stuck in "status": "generating".

@rudiedirkx: there was a migration to Docker on batch, there was a spike on Wednesday, and it's being investigated at the moment; related:

BTW I never tried it, but it should be possible to cancel a request; see the OpenAPI specification.
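For illustration, something like this is what I have in mind — a minimal sketch, assuming cancelling is a PATCH on the request resource with a cancel status (please verify the exact path, verb and body against the OpenAPI specification; the base URL, credentials and request ID below are placeholders taken from this thread):

```python
# Hypothetical sketch of cancelling a stuck batch request via the API.
# Endpoint, verb and payload are assumptions; check the OpenAPI spec.
import requests

API_BASE = "https://batch.internet.nl/api/batch/v2"   # assumed base URL
REQUEST_ID = "152946d5c6154c6996da92120ec86dcc"        # one of the stuck request ids
AUTH = ("account-name", "account-password")           # placeholder credentials

resp = requests.patch(
    f"{API_BASE}/requests/{REQUEST_ID}",
    json={"status": "cancel"},  # assumed cancel payload
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```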

@rudiedirkx (Author)

The original 2 requests are "status": "done" now. Both finished around 2024-09-12 18:00. The newer 2 (accidental) requests are still "status": "registering".

Is it safe to start new requests (2 x ~250 domains every night), or should I use the old/individual/scraper checks for now? #1255 (comment)

@bwbroersma (Collaborator) commented Sep 13, 2024

RCA:

The queue was cleared as a quick temporary fix, since it was only moving slowly. As a result, some requests in registering or generating are in limbo. We're looking into a solution; the generating status will probably resolve over time (I have some experience with large crashing reports).

@rudiedirkx (Author)

Tonight's batches of 374 domains hung again. Requested at 18:02, still "registering" just now. I cancelled them and they're going through the internet.nl frontend now.

@bwbroersma (Collaborator)

Because of the option in the dashboard to scan once or twice a month, the 1st and 15th of each month have some extra load. We noticed a task queue building up from 0:00 UTC on the 15th of September until it was back to 0 at 8:40 UTC (10:40) on the 16th of September. We're unsure if this behaviour is new (because of the new setup) or was already there and we're only seeing it now because there is more monitoring. Sadly the report generation can 'clog' the queue, as can the registering of domains. Please let us know if this happens again before the end of the month. It can be expected to be a problem on the first of the month. We're looking into improving the current situation.

@rudiedirkx (Author) commented Sep 26, 2024

Tonight's batch is stuck at running. They (web & mail, 227 domains) have been running for at least 2.5 hours. It's not registering, but it's not much better 😆 I've never seen it stuck at running before.

Is there a service status page or something?

It would also be nice to get some more metadata in the request status API call. Maybe a queue position (number of requests waiting before mine)? Or a last-checked timestamp? Something to indicate it's moving ahead and not just stuck forever.
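For context, this is roughly how I poll now — a minimal sketch, assuming the status call is GET /api/batch/v2/requests/{request_id} and the status sits under request["status"] in the response (both assumptions; the OpenAPI spec is leading), with placeholder credentials:

```python
# Rough polling loop for a batch request's status; endpoint and response
# shape are assumptions, credentials and request id are placeholders.
import time
import requests

API_BASE = "https://batch.internet.nl/api/batch/v2"   # assumed base URL
REQUEST_ID = "c35786c3ddda46c2ade734ee5fd5b249"
AUTH = ("account-name", "account-password")

while True:
    data = requests.get(f"{API_BASE}/requests/{REQUEST_ID}", auth=AUTH).json()
    status = data.get("request", {}).get("status")     # assumed response shape
    print(time.strftime("%H:%M:%S"), status)
    if status in ("done", "cancelled", "error"):       # assumed terminal statuses
        break
    time.sleep(300)                                    # check every 5 minutes
```

All this tells me is the same "registering" string over and over; a queue position or last-checked timestamp would show whether anything is actually moving.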

@aequitas (Collaborator)

@rudiedirkx we found the issue. It is probably not the same cause as the issues mentioned earlier, but at the moment I can't tell.

The cause of this issue was the beat scheduling process apparently being stuck. Restarting the processes resolved the issue. It will take a while for the backlog to clear and your batch request to finish.

@rudiedirkx (Author)

Both status done at 23:44 👍

I don't know what the beat scheduler is, but does it make sense that it was stuck at running?

Is there a Batch Service status page or something? Or would this have been undetectable anyway?

@rudiedirkx (Author) commented Oct 1, 2024

It's only been an hour since requesting, but tonight's batch (2x 271 domains) seems to be stuck at registering again. It might just be busy, and not stuck, but usually it's completely done in an hour, not still registering.


Still registering after 3 hours =(


Cancelled after 6 hours.

@rudiedirkx (Author)

Tonight's batch (2x 359) worked perfectly.

@bwbroersma (Collaborator) commented Oct 2, 2024

There are some issues after the Docker migration, both in the single test and in batch (the API), that were not found during testing. This might be due to some bugs that were thought to be fixed and some optimizations that were made for a few large batch jobs, while the production load is probably different (more, smaller jobs).
The 1st of October, as expected, probably triggered the same bug as the 15th. The queue kept increasing for a full day (quite a few of us were at a conference), until it was cleared and everything started working again.

What causes the build-up is still to be investigated.
Maybe related:

@rudiedirkx (Author) commented Oct 3, 2024

I'll keep the 1st and 15th in mind next time.

@bwbroersma (Collaborator)

@rudiedirkx (Author) commented Oct 16, 2024

Tonight's batch (~341) hangs at registering again. I intentionally skipped yesterday's batch because it was the 15th 👻 and it's set up to skip the 1st too. Tonight was fair game, I thought. Still busy?

Added at 18:02, still registering at 19:26.


Still stuck at registering at 21:40. I've cancelled it.


Started another one, waited 20 mins, stuck at registering, cancelled it.

@rudiedirkx (Author)

Tonight's batch (~621) is stuck at registering again (2.5h by now). Maybe it's not just the 1st and 15th that are busy, but also the 2nd and 16th?

@bwbroersma (Collaborator)

Thanks for reporting, the queue seems stuck at the moment with 1.8 million tasks queued.
For developers (authenticated endpoint): see the Grafana dashboard of the Celery tasks.

@rudiedirkx (Author)

Tonight's is stuck at registering again. I occasionally catch up on hanging batches using the GUI tests at https://internet.nl/ and that works well every time.

1.8M, ouch 😮 Good luck!

@bwbroersma (Collaborator) commented Nov 3, 2024

I personally only have access to the monitoring, not to anything that can reset the queue, but it has been reported to the team.
Monitoring batch and alerting on this is being planned:

The 1.8 million queue is for batch_slow; I'm no expert here, but there seem to be only two tasks routed to this queue:

"interface.batch.util.batch_async_generate_results": {"queue": "batch_slow"},
"interface.batch.util.batch_async_register": {"queue": "batch_slow"},

I have a hunch that batch_async_generate_results tasks for the same request are somehow added to the queue multiple times; otherwise it couldn't get to these proportions.

There is a mutex lock in the generation:

if not batch_request.has_report_file():
    with memcache_lock(lock_id, lock_ttl) as acquired:
        if acquired:
            results = gather_batch_results(user, batch_request, site_url)
            save_batch_results_to_file(user, batch_request, results)
            del results
            results = gather_batch_results_technical(user, batch_request, site_url)
            save_batch_results_to_file(user, batch_request, results, technical=True)

However I don't see a mutex here:

if batch_request.status == BatchRequestStatus.done and not batch_request.has_report_file():
    batch_async_generate_results.delay(user=user, batch_request=batch_request, site_url=get_site_url(request))

I'm not knowledgeable enough about Celery to know whether this enforces some uniqueness constraint on the parameters, or whether it just keeps adding batch_async_generate_results tasks.
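As far as I know, .delay() simply puts a new message on the queue every time it is called, with no uniqueness constraint on the arguments, so every call that hits this branch would add another batch_async_generate_results task. A minimal sketch of the kind of guard I mean (hypothetical helper, using Django's cache as a once-only flag — not how the current code works):

```python
# Hypothetical guard so result generation is enqueued at most once per batch
# request; the helper name and the cache-flag approach are illustrative only.
from django.core.cache import cache

from interface.batch.util import batch_async_generate_results  # existing task


def schedule_results_generation_once(user, batch_request, site_url):
    flag = f"batch_results_scheduled_{batch_request.id}"
    # cache.add() only sets the key if it does not exist yet, so repeated
    # status calls enqueue the task at most once per timeout window.
    if cache.add(flag, True, timeout=60 * 60):
        batch_async_generate_results.delay(
            user=user, batch_request=batch_request, site_url=site_url
        )
```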

Next to this, one or more batch_async_generate_results tasks might be failing with an AttributeError (both are > 100,000 in the Top failed tasks [1w] and Top task exceptions [1w]).
