Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CannotPullContainerError/ASM error message enhancements #4181

Merged
merged 18 commits into from
Jun 11, 2024

Conversation

nineonine
Copy link

@nineonine nineonine commented May 16, 2024

Summary

This change does a few things:

  • introduces new abstraction to parse known Docker errors and augment error message with extra useful context. The first error that we handle now is CannotPullContainerError, spefically the missing ECR pull permissions, broken or non-existent image repo url .
  • adds corresponding test coverage
  • refactors asm_test.go implementation to use gomock
  • enhances error message for failed ASM get secret API calls, specifically the ResourceNotFoundException
  • Unit tests added in asm_test.go
  • updated asmsecret_test.go to account for removed prefix and updated message

Implementation details

  • new module was added with all the encapsulated logic - error_messages.go
  • in engine.pullAndUpdateContainerReference we use new functionality to attempt to update error message when image pull failed due CannotPullContainerError

Testing

  • Unit tests added in error_messages_test.go
  • another test in docker_task_engine_test.go to capture augmented message.

New tests cover the changes:
yes

Manual testing steps and describe-tasks outputs

  1. Invalid registry - use non existent registry URL.
"containers": [
    {
        ... 
        "name": "foo",
        "image": "some/nonsense:latest",
        "lastStatus": "STOPPED",
        "reason": "CannotPullContainerError: The task can’t pull the image. Check whether the image exists. Error response from daemon: pull access denied for some/nonsense, repository does not exist or may require 'docker login': denied: requested access to the resource ",
        "networkInterfaces": [ ... ],
        "healthStatus": "UNKNOWN",
        "cpu": "0",
        ...
    }
  1. Missing permissions - remove ecr:BatchGetImage from policy attached to task execution role
"containers": [
    {
        ...
        "name": "main_container",
        "image": "123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image:latest",
        "lastStatus": "STOPPED",
        "reason": "CannotPullContainerError: The task can’t pull the image. Check that the role has the permissions to pull images from the registry. Error response from daemon: pull access denied for 123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image, repository does not exist or may require 'docker login': denied: User: arn:aws:sts::123123123123:assumed-role/MissinECRPull_ECSTaskExecutionRoleRole/60bece4b8db140d18dad0acd964c0889 is not authorized to perform: ecr:BatchGetImage on resource: arn:aws:ecr:us-east-1:123123123123:repository/test_image because no identity-based policy allows the ecr:BatchGetImage action",
        "healthStatus": "UNKNOWN",
        "cpu": "0",
        ...
    }
],
  1. Missing permissions - remove ecr:BatchGetImage from policy attached to task ec2 instance role
"containers": [
    {
        ...
        "name": "main_container",
        "image": "123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image:latest",
        "lastStatus": "STOPPED",
        "reason": "CannotPullContainerError: The task can’t pull the image. Check that the role has the permissions to pull images from the registry. Error response from daemon: pull access denied for 123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image, repository does not exist or may require 'docker login': denied: User: arn:aws:sts::123123123123:assumed-role/ecsInstanceRole/i-0123456789fff is not authorized to perform: ecr:BatchGetImage on resource: arn:aws:ecr:us-east-1:123123123123:repository/test_image because no identity-based policy allows the ecr:BatchGetImage action",
        "healthStatus": "UNKNOWN",
        "cpu": "0",
        ...
    }
],
  1. CannotPull - network issue. Delete SG outbound rule (pulling from ECR).
"containers": [
    {
        ...
        "name": "main_container",
        "image": "123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image:latest",
        "lastStatus": "STOPPED",
        "reason": "CannotPullECRContainerError: The task can’t pull the image. Check your network configuration. RequestError: send request failed\ncaused by: Post \"https://api.ecr.us-east-1.amazonaws.com/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
        "healthStatus": "UNKNOWN",
        "cpu": "0"
        ...
    }
],
  1. CannotPull - network issue. Delete SG outbound rule (pulling from private registry).
"containers": [
    {
        ...
        "name": "main_container",
        "image": "123123123123.dkr.ecr.us-east-1.amazonaws.com/test_image:latest",
        "lastStatus": "STOPPED",
        "reason": "CannotPullContainerError: The task can’t pull the image. Check your network configuration. Error response from daemon: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
        "healthStatus": "UNKNOWN",
        "cpu": "0"
        ...
    }
],
  1. Missing secret - pass non-existent secret via taskDef.container.secrets
{
    "tasks": [
          {
             "stopCode": "TaskFailedToStart",
            "stoppedAt": "2024-06-03T19:03:47.816000-07:00",
            "stoppedReason": "ResourceNotFoundException: The task can't retrieve the secret with ARN 'arn:aws:secretsmanager:us-east-1:123123123123:secret:foo-bar-baz' from AWS Secrets Manager. Check whether the secret exists in the specified Region: ResourceNotFoundException: Secrets Manager can't find the specified secret.",
            "stoppingAt": "2024-06-03T19:03:34.828000-07:00",
            "tags": [],
          }
     ]
}

Description for the changelog

Enhancement: update CannotPullContainerError error messages

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@nineonine nineonine requested a review from a team as a code owner May 16, 2024 05:02
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch from 5dd5027 to 65a2859 Compare May 16, 2024 16:37
Improve CannotPullContainerError message
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch 2 times, most recently from 0a68d36 to ffcd9bf Compare May 16, 2024 22:05
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch from ffcd9bf to dc51281 Compare May 16, 2024 22:41
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch from 470ee7c to ec5361e Compare May 18, 2024 01:16
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch from ec5361e to 5d7418b Compare May 18, 2024 01:27
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
ecs-agent/api/errors/error_messages.go Outdated Show resolved Hide resolved
agent/engine/docker_task_engine_test.go Outdated Show resolved Hide resolved
agent/engine/docker_task_engine_test.go Outdated Show resolved Hide resolved
agent/asm/asm_test.go Outdated Show resolved Hide resolved
agent/asm/asm.go Outdated Show resolved Hide resolved
Alex Dudarenko added 2 commits May 30, 2024 23:40
* CannotPullContainerError: cleanup, renaming
* ASM error: do not use added errormessages code - match on AWS SDK
  error directly, drop tracing prefix, fix/add more unit tests
agent/asm/asm.go Outdated Show resolved Hide resolved
@nineonine nineonine changed the title Improve CannotPullContainerError message CannotPullContainerError/ASM error message enhancements May 31, 2024
@amogh09
Copy link
Contributor

amogh09 commented May 31, 2024

Thanks for this change! Let's please add details about any manual tests performed to validate this changes. The new error messages seen in describe-tasks response in particular would be very nice to have in the PR description.

amogh09
amogh09 previously approved these changes May 31, 2024
@nineonine nineonine force-pushed the cannot-pull-container-err-msg branch from 316f62e to a4ee7f4 Compare June 4, 2024 01:00
amogh09
amogh09 previously approved these changes Jun 6, 2024
@nineonine nineonine changed the base branch from feature/project-telescope to dev June 6, 2024 22:16
@nineonine nineonine dismissed amogh09’s stale review June 6, 2024 22:16

The base branch was changed.

amogh09
amogh09 previously approved these changes Jun 6, 2024
@@ -1606,6 +1607,8 @@ func (engine *DockerTaskEngine) pullAndUpdateContainerReference(task *apitask.Ta

findCachedImage := false
if !pullSucceeded {
// Extend error message
metadata.Error = apierrormsgs.AugmentNamedErrMsg(metadata.Error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this is the only place where AugmentNamedErrMsg is used? Is there some plan to add more errors to this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time, there are no plans to continue updating other messages using AugmentNamedErrMsg. This function was implemented and applied to specifically for the scenarios we are trying to address. If future requirements necessitate similar enhancements for additional error messages, we can consider extending its usage then.

}

// A map associating DockerErrorType with parser functions.
var errorParsers = map[DockerErrorType]func(string) (DockerError, error){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a map? I don't see it's keys being used anywhere?

Copy link
Author

@nineonine nineonine Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, not used right now. I did it just for symmetry with formatters and (arguably) slightly better readability of code. We can change it if you feel strongly. Let me know what you think. Thanks.


// An interface that provides means to reconstruct Named Errors. This is
// used during error message augmentation process.
type AugmentableNamedError interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand the point of this interface. Why not just have a function in the errormessages package that can take in a named error, augment the message if it can (using the available parser functions), and then return it? Why the need to create an interface and add this function to particular named errors?

Copy link
Author

@nineonine nineonine Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion, Thanks. I considered this approach too. Once slight complication is that we would need to somehow rebuild NamedError and return it. This creates dependencies on all modules that would need to be imported to bring error into scope.
The decision to create an interface was made to improve modularity. By doing this, we avoid the need to import all the specific error types into the errormessages package. This design allows each error type to implement the interface independently, keeping the errormessages package clean and free from dependencies. I felt this approach makes the codebase more maintainable and flexible, allowing new error types to be augmented without modifying the main parsing driver - AugmentNamedErrMsg and AugmentMessage function.
If you think it is an overkill - we can certainly change it, I am fine doing it either way.

@nineonine nineonine requested a review from sparrc June 7, 2024 16:28
@pallymore pallymore self-requested a review June 10, 2024 23:01
@sparrc sparrc merged commit f46cc75 into aws:dev Jun 11, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants