Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ecs): update applicationcachingagent to store applications/relationships #5377

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

piradeepk
Copy link
Contributor

@piradeepk piradeepk commented Jun 1, 2021

This change updates the application caching agent to store the application name as well as the services associated to that application as relationships. Storing these objects allows the EcsApplicationProvider to be able to query and retrieve all applications and their related services. Improving the search experience and returning the records quicker.

Previously, if users had too many services in their associated AWS accounts, the search would time out, and throw an exception.

Testing:

  • IN PROGRESS
    Testing by using the current logic, to perform multiple application searches (both using the application search and well as the shared search modal), as well as clicking through an application and deploying to ECS. Then deployed these changes and redid the same tests to validate that the previous behaviour continued to work and the search was able to function as expected.

Fixes: spinnaker/spinnaker#6084

services.forEach(
key -> {
Map<String, String> parsedKey = Keys.parse(key);
if (application.getClusterNames().get(parsedKey.get("account")) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a null check on getClusterNames here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but the existing logic doesn't, so didn't lean that way. That being said, I'm not opposed to adding it.

Map<String, Map<String, Collection<String>>> appRelationships = new HashMap<>();

for (Service service : services) {
String applicationKey = service.getApplicationName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this with monikers? in that case is the returned getApplicationName() the actual application (as determined by tags) or the service prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for determining applications is the same as it exists today. The only difference being that we're performing this logic and writing it to the cache rather than performing it at query runtime.

EcsApplication application = new EcsApplication(appName, attributes, clusterNames);

Set<String> services = getServiceRelationships(cacheData);
log.info("Found {} services for app {}", services.size(), appName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest we make this debug level instead of info - it's useful but produces quite a bit of noise in the logs (tested this locally myself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Set<Application> retrievedApplication = client.getApplications(true);

// Then
assertTrue(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're asserting it's the right application, but shouldn't we also assert on the expected service relationship(s) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The application object contains the services as the cluster names, so it's actually comparing the entire application object not just the app name itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see we're comparing the HashSets - nvm!

@piradeepk piradeepk force-pushed the search branch 2 times, most recently from 98bca2a to 6726ee7 Compare June 4, 2021 21:20
Copy link
Contributor

@allisaurus allisaurus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with the caveat that in-progress testing does not reveal any issues. I'll be unavailable for the next week and don't want to hold up merging this in if all is green.

@deverton
Copy link
Contributor

deverton commented Jun 7, 2021

We've run a short test of this patch (along with #5375 ) and unfortunately it doesn't seem to fix the issue. We're still seeing a large number of queries in the form of

SELECT `body` AS `body` , ? AS `id` , ? AS `rel_id` , ? AS `rel_type` FROM `cats_v1_alarms` WHERE `ID` IN (...) UNION ALL SELECT ? AS `body` , `id` AS `id` , `rel_id` AS `rel_id` , `rel_type` AS `rel_type` FROM `cats_v1_alarms_rel` WHERE `ID` IN (...) 

where the IN values look like 'ecs;alarms;ecs-account-id;us-west-2;arn:aws:cloudwatch:us-west-2:1234567890:alarm:nameofalarm-MemoryAlarmScalingOutPolicy-ID

and

SELECT `body` AS `body` , ? AS `id` , ? AS `rel_id` , ? AS `rel_type` FROM `cats_v1_loadBalancers` WHERE `ID` IN (...) UNION ALL SELECT ? AS `body` , `id` AS `id` , `rel_id` AS `rel_id` , `rel_type` AS `rel_type` FROM `cats_v1_loadBalancers_rel` WHERE ( `ID` IN (...) AND `rel_type` LIKE ? ) 

where the IN values look like aws:loadBalancers:aws-account-idp:us-west-2:lb-id:vpc-ID:application

Seeing plenty of messages like Cached 115 applications for 974 services and Found 974 ECS services for which to cache applications in the logs from the com.netflix.spinnaker.clouddriver.ecs.provider.agent.ApplicationCachingAgent so I assume it's doing its thing.

@allisaurus
Copy link
Contributor

@deverton Thanks so much for giving this change a shot and reporting back! Can you elaborate a little bit on what you specifically did to test (e.g., used the general /search field in deck, hit a specific gate endpoint, etc.) so we can work on repo/validation of further changes?

@deverton
Copy link
Contributor

The primary way this shows for us is the general search endpoint from the front page of Deck. From a user perspective the search never returns and we see long running queries against the /search endpoint in Gate and Clouddriver. Search from the Application tab is fine so presumably this is specific to the infrastructure search.

From the Clouddriver side this shows up as multi-hour queries as you can see in this chart from our deployment.

image

We only have five AWS accounts on-boarded at the moment with 1 region each and ECS enabled for all five. What might be making the difference is that those accounts have a large number of alarms and load balancers (not Spinnaker managed) which might be causing the slow down. Looking at the type of resource queried by Clouddriver we see a lot of calls for those types:

image

We did grab a quick flamegraph of one of the Clouddriver pods which I've attached.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants