
[YUNIKORN-2907] Queue config processing log spew #1009

Closed
wants to merge 10 commits

Conversation


@kousei47747 kousei47747 commented Dec 26, 2024

What is this PR for?

Silence the logs of the shadow queue structure during configuration updates.
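
For context, a minimal sketch of the idea behind the patch, assuming hypothetical stand-ins for the real YuniKorn types: a silence flag threaded into the queue constructors gates logging (and other side effects) when the structure is only built to validate a config.

```go
package main

import "log"

// Illustrative stand-ins for the real YuniKorn types; names are not the real API.
type QueueConfig struct{ Name string }

type Queue struct {
	name   string
	parent *Queue
}

// newConfiguredQueue mirrors the silence-flag pattern: validation-only
// callers pass silence=true so the shadow queue structure is built
// without emitting log lines or other side effects.
func newConfiguredQueue(conf QueueConfig, parent *Queue, silence bool) (*Queue, error) {
	q := &Queue{name: conf.Name, parent: parent}
	if !silence {
		log.Printf("configured queue added: queueName=%s", q.name)
	}
	return q, nil
}

func main() {
	_, _ = newConfiguredQueue(QueueConfig{Name: "root"}, nil, false) // real update: logs
	_, _ = newConfiguredQueue(QueueConfig{Name: "root"}, nil, true)  // validation pass: silent
}
```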

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring
What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2907


codecov bot commented Dec 28, 2024

Codecov Report

Attention: Patch coverage is 91.17647% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.50%. Comparing base (7391aeb) to head (f15ede1).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/scheduler/partition.go | 82.60% | 1 Missing and 3 partials ⚠️ |
| pkg/scheduler/context.go | 50.00% | 1 Missing ⚠️ |
| pkg/scheduler/objects/queue.go | 91.66% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1009      +/-   ##
==========================================
+ Coverage   82.48%   82.50%   +0.02%     
==========================================
  Files          97       97              
  Lines       15627    15637      +10     
==========================================
+ Hits        12890    12902      +12     
+ Misses       2457     2456       -1     
+ Partials      280      279       -1     


Member

@chia7712 chia7712 left a comment


@kousei47747 thanks for this patch. There is one small question remaining. Please take a look.

return queue, err
}

// NewConfiguredShadowQueue creates a new queue from scratch based on the configuration and logs at debug level
Contributor


Hi @kousei47747, thank you for your contribution!

Could you provide more details about the differences between NewConfiguredQueue and NewConfiguredShadowQueue?
Currently, it appears the only difference is the log level, which may be confusing for those unfamiliar with shadow queues.

It would be helpful to explain when developers should use NewConfiguredQueue versus NewConfiguredShadowQueue.
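
For readers following along, a hedged sketch of the distinction as this revision stood. Only the two constructor names come from the diff; the internals here are hypothetical. Both build the same structure, differing only in log level, which is exactly the reviewer's concern.

```go
package main

import "log"

type QueueConfig struct{ Name string }
type Queue struct{ name string }

// newQueue is a hypothetical shared helper; only the log level differs
// between the two exported constructors.
func newQueue(conf QueueConfig, debug bool) (*Queue, error) {
	q := &Queue{name: conf.Name}
	if debug {
		log.Printf("DEBUG configured queue added: %s", q.name) // filtered out at default level
	} else {
		log.Printf("INFO  configured queue added: %s", q.name)
	}
	return q, nil
}

// NewConfiguredQueue builds a live queue and logs at info level.
func NewConfiguredQueue(conf QueueConfig) (*Queue, error) { return newQueue(conf, false) }

// NewConfiguredShadowQueue builds the identical structure for a
// validation-only pass, logging at debug level instead.
func NewConfiguredShadowQueue(conf QueueConfig) (*Queue, error) { return newQueue(conf, true) }

func main() {
	_, _ = NewConfiguredQueue(QueueConfig{Name: "root"})
	_, _ = NewConfiguredShadowQueue(QueueConfig{Name: "root"})
}
```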

Author

@blueBlue0102 thanks for reviewing! I will provide a more detailed description in the new patch.

@kousei47747 kousei47747 marked this pull request as draft December 29, 2024 08:27
@wilfred-s
Contributor

I dropped a large comment into the Jira to provide some guidance and point out another issue.

@kousei47747
Author

@wilfred-s thanks for reviewing! I've replied on Jira; please take a look.

@kousei47747
Author

I’ve updated the PR based on the guidance and noticed that NewPlacementManager also logs. Should it be silenced too?

@kousei47747 kousei47747 marked this pull request as ready for review January 2, 2025 07:24
Contributor

@pbacsko pbacsko left a comment


See comments. I'm not sure about the current approach. But if you keep it, please make sure it's tested properly. PartitionContext does not have the best unit test coverage (at least direct coverage), but new code should definitely be covered.

@kousei47747 kousei47747 marked this pull request as draft January 24, 2025 02:42
@kousei47747 kousei47747 marked this pull request as ready for review February 7, 2025 07:04
@blueBlue0102
Contributor

If we need to silence all logs when creating shadow queues, it seems that some places were missed.

For example:

  • ACL.setUsers():
    • Related code: acl.go#L54
    • Call stack:
      NewConfiguredQueue
        └── sq.applyConf(conf)
              └── security.NewACL()
                    └── acl.setUsers()
      

It seems that either adding an additional parameter or creating new functions adds quite a lot of complexity to the codebase, but I don't have a good idea for a third approach at the moment.
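
To make that threading cost concrete, a hedged sketch of how the flag must pass through every intermediate layer to reach the leaf that logs. The validation logic here is hypothetical (the real setUsers checks user-name syntax); only the function names come from the call stack above.

```go
package main

import (
	"log"
	"strings"
)

// ACL is a minimal stand-in for the real security.ACL type.
type ACL struct{ users map[string]bool }

// setUsers is the leaf where the log line actually lives (acl.go#L54 in
// the real code), so the flag has to be plumbed all the way down to it.
func (a *ACL) setUsers(users []string, silence bool) {
	a.users = map[string]bool{}
	for _, u := range users {
		if u == "" {
			if !silence { // the only change at this level: gate the log
				log.Printf("ignoring empty user in ACL")
			}
			continue
		}
		a.users[u] = true
	}
}

// NewACL accepts the flag purely to forward it down the call stack.
func NewACL(aclStr string, silence bool) ACL {
	acl := ACL{}
	acl.setUsers(strings.Split(aclStr, ","), silence)
	return acl
}

func main() {
	_ = NewACL("alice,,bob", true) // shadow/validation pass: no output
}
```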

@kousei47747
Author

@blueBlue0102 good catch! I've updated the PR to silence ACL.setUsers() and NewPlacementManager. PTAL

@@ -191,17 +199,18 @@ func (pc *PartitionContext) updatePartitionDetails(conf configs.PartitionConfig)
return ugm.GetUserManager().UpdateConfig(queueConf, conf.Queues[0].Name)
Contributor


What about this one? This is still called regardless of the value of silence.

Author

@kousei47747 kousei47747 Feb 14, 2025


Thanks for reviewing! updatePartitionDetails is called only when performing the real update, so I think we can skip silence here.

// checks passed perform the real update
log.Log(log.SchedContext).Info("updating partitions", zap.String("partitionName", partitionName))
err = part.updatePartitionDetails(p)

Contributor

@blueBlue0102 blueBlue0102 left a comment


Thanks for the patches! It seems there are still some places that were missed. It made me think of another approach.

@@ -116,7 +119,7 @@ func NewACL(aclStr string) (ACL, error) {
// trim and check for wildcard
acl.setAllAllowed(aclStr)
// parse users and groups
- acl.setUsers(strings.Split(fields[0], common.Separator))
+ acl.setUsers(strings.Split(fields[0], common.Separator), silence)
if len(fields) == 2 {
acl.setGroups(strings.Split(fields[1], common.Separator))
Contributor


If we still want to use the silence flag approach, acl.setGroups also needs to be silenced.

Author

Silence flag added to setGroups, thanks!

@@ -79,7 +79,7 @@ type PartitionContext struct {
locking.RWMutex
}

- func newPartitionContext(conf configs.PartitionConfig, rmID string, cc *ClusterContext) (*PartitionContext, error) {
+ func newPartitionContext(conf configs.PartitionConfig, rmID string, cc *ClusterContext, silence bool) (*PartitionContext, error) {
if conf.Name == "" || rmID == "" {
log.Log(log.SchedPartition).Info("partition cannot be created",
Contributor


If we still want to use the silence flag approach, we should check silence before logging.

Author

@kousei47747 kousei47747 Feb 17, 2025


This is logged only when there's a misconfiguration, and the function returns immediately with an error, so no further logs are generated. I think we can keep it for debugging. What do you think?

Contributor


Keeping the log for misconfiguration makes sense to me.

Comment on lines 367 to 369
// make sure the new info passes all checks
- _, err = newPartitionContext(p, rmID, nil)
+ _, err = newPartitionContext(p, rmID, nil, true)
if err != nil {
Contributor


It seems we call newPartitionContext here only for validating configs.PartitionConfig.
If my understanding is correct, the main purpose of this JIRA is to avoid any side effects when validating the config before updatePartitionDetails,
not only:

  • Logging
  • Sending out events

but also:

  • Loading the user manager and updating user settings.
    ugm.GetUserManager().UpdateConfig() would be called in updatePartitionDetails() when validation passes.
  • etc.

So rather than adding a lot of flag parameters to each underlying function, extracting all the validation logic from newPartitionContext into a separate function, or moving it into updatePartitionDetails, might be a better approach.
Let newPartitionContext do what it needs to do without consideration of "silence". What do you think?
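
A hedged sketch of this alternative, with names that are illustrative only, not the real YuniKorn API: a side-effect-free validation entry point, so newPartitionContext never needs a silence parameter at all.

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for configs.PartitionConfig and its queue entries.
type PartitionConfig struct {
	Name   string
	Queues []QueueConfig
}
type QueueConfig struct{ Name string }

// validatePartitionConfig runs only the checks: no queue objects, no
// placement manager, no user-manager update, no logging. Callers that
// need a real partition still go through newPartitionContext unchanged.
func validatePartitionConfig(conf PartitionConfig) error {
	if conf.Name == "" {
		return errors.New("partition name must not be empty")
	}
	for _, q := range conf.Queues {
		if q.Name == "" {
			return fmt.Errorf("queue in partition %q has an empty name", conf.Name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePartitionConfig(PartitionConfig{Name: "default"})) // <nil>
}
```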

Author

Thanks for the nice approach! It looks like the trade-off would be duplicate logic/methods when extracting it from stacked callers. I tried a similar approach in the first few commits, but it turned out to be a bit cumbersome. So I think I still prefer adding the silence flag.
Please let me know if I’m missing something!

Contributor

@pbacsko pbacsko left a comment


LGTM

Contributor

@chenyulin0719 chenyulin0719 left a comment


LGTM. Just a confirmation.

- return ugm.GetUserManager().UpdateConfig(queueConf, conf.Queues[0].Name)
+ if !silence {
+     return ugm.GetUserManager().UpdateConfig(queueConf, conf.Queues[0].Name)
+ }
Contributor


@wilfred-s This change also fixes the unnecessary ugm reload issue you mentioned in Jira.

@kousei47747 I think we don't need another Jira. Does that sound right to you?

Author

@chenyulin0719 yes, thanks :)

Contributor

@ryankert01 ryankert01 left a comment


+1, lgtm
