
Conversation

@singholt
Contributor

@singholt commented Dec 17, 2025

Summary

This PR fixes the following race condition in the ECS agent:

When the ECS agent receives a task payload from ACS, the payload contains the task's execution role credentials. The agent uses these credentials to create task resources before the task can be run.

If the ECS agent is restarted between the time a task payload arrives and the time the task's resources are created, the task currently fails with an error like "unable to find execution role credentials".

This is because the ECS agent only saves the credentials ID to its state during shutdown, not the credentials themselves. After a restart, the agent therefore needs to wait for the execution role credentials to (re)arrive from ACS, instead of trying to progress the task too soon and failing. The pre-fix failure mode is sketched below.
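To make the failure mode concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical (credentialsStore stands in for the agent's credentials manager); only the GetTaskCredentials lookup-by-ID shape mirrors the snippet quoted later in this thread:

package main

import (
	"errors"
	"fmt"
)

// credentialsStore is a hypothetical stand-in for the agent's credentials manager.
type credentialsStore map[string]string

// GetTaskCredentials mirrors the lookup-by-ID call quoted later in this thread.
func (s credentialsStore) GetTaskCredentials(id string) (string, bool) {
	creds, ok := s[id]
	return creds, ok
}

func main() {
	// After a restart, the agent restores only the credentials ID from state...
	executionCredentialsID := "exec-creds-id"
	// ...but the in-memory store stays empty until ACS resends the credentials.
	store := credentialsStore{}

	if _, ok := store.GetTaskCredentials(executionCredentialsID); !ok {
		// Pre-fix behavior: the resource transition failed here immediately.
		fmt.Println(errors.New("unable to find execution role credentials"))
	}
}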

Implementation details

Updated the task manager's resource transition logic to check whether execution role credentials are needed to create a resource and, if so, whether they are available. If they are not yet available, the task manager waits, just as it already waits for execution role credentials before pulling a task's container images. The wait timeout is configured to be 1 minute; a rough sketch of this behavior follows.
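As an illustration of the new waiting behavior, here is a minimal, runnable sketch. The function name, the polling loop, and the simulated credential arrival are all hypothetical; only the 1-minute timeout comes from this change, and the real agent's wait is driven by its task state machine rather than polling:

package main

import (
	"fmt"
	"time"
)

// waitForExecutionCredentials waits until the execution role credentials
// become available, or gives up after the configured timeout
// (1 minute in this change).
func waitForExecutionCredentials(available func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if available() {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("timed out after %v waiting for execution role credentials", timeout)
}

func main() {
	start := time.Now()
	// Simulate the credentials (re)arriving from ACS shortly after restart.
	err := waitForExecutionCredentials(func() bool {
		return time.Since(start) > 300*time.Millisecond
	}, time.Minute)
	fmt.Println("wait result:", err)
}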

Also added a RequiresExecutionRoleCredentials method to the taskresource interface that reports whether a resource requires execution role credentials. The task manager uses this method during resource transitions; a sketch of the interface change is below.
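A minimal sketch of the interface change, assuming the interface is named TaskResource and the method is simply added alongside its existing methods (elided here; the Firelens implementation quoted further down is from the actual diff):

package taskresource

// TaskResource gains the new method; all pre-existing methods are elided.
type TaskResource interface {
	// RequiresExecutionRoleCredentials returns true if creating this
	// resource requires the task's execution role credentials.
	RequiresExecutionRoleCredentials() bool
	// ... existing taskresource methods elided ...
}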

Testing

New tests cover the changes: yes, added unit and integration tests. The integration test reproduces the race condition and asserts that the fix resolves it.

Description for the changelog

bugfix: wait for execution role credentials before creating task resources upon agent restart

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions? No

Does this PR include the addition of new environment variables in the README? No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@singholt force-pushed the fix-taskresource-race branch 3 times, most recently from 8372714 to 5ae4c0b on December 17, 2025 17:29
@singholt changed the base branch from master to dev on December 17, 2025 17:30
@singholt force-pushed the fix-taskresource-race branch 5 times, most recently from 446d0c2 to b6b3a0a on December 17, 2025 18:46
@singholt marked this pull request as ready for review on December 17, 2025 19:00
@singholt requested a review from a team as a code owner on December 17, 2025 19:00
@singholt enabled auto-merge (rebase) on December 17, 2025 19:01
@singholt force-pushed the fix-taskresource-race branch from b6b3a0a to fa40298 on December 17, 2025 21:31
@singholt force-pushed the fix-taskresource-race branch from fa40298 to cb0b3da on December 17, 2025 22:40
@singholt force-pushed the fix-taskresource-race branch from cb0b3da to 9e4f457 on December 17, 2025 22:44
@singholt force-pushed the fix-taskresource-race branch from 9e4f457 to d382456 on December 17, 2025 23:32
@singholt force-pushed the fix-taskresource-race branch from d382456 to 0289725 on December 18, 2025 00:22
// RequiresExecutionRoleCredentials returns true if the resource requires execution role credentials.
// We only need execution role credentials when there's an external config to pull from S3.
func (firelens *FirelensResource) RequiresExecutionRoleCredentials() bool {
	return firelens.externalConfigType == ExternalConfigTypeS3
}
Contributor


Can this not fall back to instance credentials?

Contributor Author


It can, but today it does not:

creds, ok := firelens.credentialsManager.GetTaskCredentials(firelens.executionCredentialsID)
if !ok {
	return errors.New("unable to get execution role credentials")
}

Contributor Author


I would consider adding the fallback logic as a separate enhancement
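For context, a purely hypothetical sketch of what such a fallback could look like; pickCredentials and both parameter names are illustrative only and not part of this PR:

package main

import "fmt"

// pickCredentials prefers the task's execution role credentials and falls
// back to instance credentials when they are unavailable.
func pickCredentials(execCreds string, execOK bool, instanceCreds string) string {
	if execOK {
		return execCreds
	}
	return instanceCreds
}

func main() {
	// With no execution role credentials available, fall back.
	fmt.Println(pickCredentials("", false, "instance-role-credentials"))
}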

@aws deleted a comment from ShelbyZ on Dec 18, 2025
@singholt merged commit 966c44d into aws:dev on Dec 18, 2025
43 of 44 checks passed
