fix: wrong terminal counts calculated during migration check #5400

Open
wants to merge 1 commit into
base: release/1.39.x
Sidddddarth
Member

@Sidddddarth Sidddddarth commented Dec 27, 2024

Description

The migration check, which runs every 30 seconds, calculates a wrong count.
We fetch the count of terminal statuses in the status table, but at startup we "clean up" old jobs by appending another aborted status irrespective of each job's current state, so that query can count more than one terminal status per job.

As a result, when the jobs are actually migrated, more jobs are moved than the check expected, because the expected number was too low.
The issue surfaced now because archival tables have a default retention of 24 hours, so on successive restarts more and more statuses were appended for the same job. We expect the number of migrated jobs to be given by the following expression:

numExpectedNumberOfMigratedJobs(e) = number of jobs(a) - number of terminal statuses in status table(b)

Because of the cleanup at startup, b increases depending on the retention duration even when a stays the same, which effectively decreases e. The server panics when it actually migrates more jobs than e.
With this fix, we change b to the number of job IDs with a terminal status in the status table, which is bound to remain the same even if we append more statuses for the same job.

Linear Ticket

Resolves PIPE-1827

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@Sidddddarth Sidddddarth changed the title from "fix: processing pickup race condition" to "wrong terminal counts calculated during migration check" Dec 27, 2024
@Sidddddarth Sidddddarth marked this pull request as ready for review December 27, 2024 11:16
@Sidddddarth Sidddddarth changed the title from "wrong terminal counts calculated during migration check" to "fix: wrong terminal counts calculated during migration check" Dec 27, 2024

codecov bot commented Dec 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.72%. Comparing base (3605e90) to head (d40ff5e).

Additional details and impacted files
@@                Coverage Diff                 @@
##           release/1.39.x    #5400      +/-   ##
==================================================
+ Coverage           74.70%   74.72%   +0.01%     
==================================================
  Files                 437      437              
  Lines               61269    61269              
==================================================
+ Hits                45773    45785      +12     
+ Misses              12954    12944      -10     
+ Partials             2542     2540       -2     

Contributor

@ktgowtham ktgowtham left a comment

What kind of cleanup do we do at startup? And if a terminal status already exists for a job, why do we append more?

@ktgowtham ktgowtham requested a review from cisse21 December 30, 2024 11:03
@Sidddddarth
Member Author

What kind of cleanup do we do at startup? And if a terminal status already exists for a job, why do we append more?

It just appends an aborted status without checking whether a status is already present, like so:

INSERT INTO
	status_table (job_id, job_state, error_response)
SELECT
	job_id, 'aborted', '{"reason": "job max age exceeded"}'
FROM
	jobs_table
WHERE
	created_at <= $deadline;
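
As a rough illustration of why this appends duplicates, the statement above can be run twice against an in-memory SQLite database. The schema, the integer `created_at` values, and the `:deadline` named parameter (standing in for `$deadline`) are assumptions for the sketch, not the server's actual tables or driver.

```python
import sqlite3

# The cleanup statement from above, with :deadline standing in for $deadline.
CLEANUP_SQL = """
INSERT INTO status_table (job_id, job_state, error_response)
SELECT job_id, 'aborted', '{"reason": "job max age exceeded"}'
FROM jobs_table
WHERE created_at <= :deadline
"""

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; created_at is a plain integer timestamp here.
conn.execute("CREATE TABLE jobs_table (job_id INTEGER PRIMARY KEY, created_at INTEGER)")
conn.execute("CREATE TABLE status_table (job_id INTEGER, job_state TEXT, error_response TEXT)")
conn.executemany("INSERT INTO jobs_table VALUES (?, ?)", [(1, 100), (2, 200), (3, 900)])

# Two startups with the same deadline: nothing checks for an existing status.
for _ in range(2):
    conn.execute(CLEANUP_SQL, {"deadline": 500})

rows_per_job = dict(
    conn.execute(
        "SELECT job_id, COUNT(*) FROM status_table GROUP BY job_id ORDER BY job_id"
    ).fetchall()
)
print(rows_per_job)  # {1: 2, 2: 2}
```

Jobs 1 and 2 are older than the deadline, so each receives one more aborted row per run, while job 3 is untouched.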

@cisse21
Member

cisse21 commented Jan 3, 2025

Let's change the base to master and release it in 1.40.
