Step in plan silently not running when using list of targets #3321

Closed
WembleyFord2 opened this issue May 23, 2024 · 10 comments
Labels
Bug Bug reports and fixes.

Comments

@WembleyFord2

WembleyFord2 commented May 23, 2024

Describe the Bug

I have a simple plan that does two downloads from targets. This plan works as expected when I list two targets by hostname when invoking the plan. If I provide a list of targets, via -t @list.txt , then only the first download step appears to run. With log-level set to trace, there's no indication that anything is attempting to execute the second step. I have probably missed something obvious and stupid as I am new to bolt, but cannot see what.

Expected Behavior

The second step in my plan should execute.

Steps to Reproduce

My plan is

parameters:
  targets:
    type: TargetSpec
steps:
  - download: '/var/log/test.txt'
    destination: 'logs'
    run_as: root
    targets: $targets
    description: 'Download txt'
  - download: '/var/log/test.json'
    destination: 'json'
    run_as: root
    targets: $targets
    description: 'download json'

This works as I would expect if I use

bolt plan run module::plan -t host1,host2

But if I use a target file (a flat text file of hostnames, one per line, passed with -t @list), only the first step, the .txt download, produces any output. It completes successfully, and nothing further is downloaded.

The output of trace log does not contain the text 'json' at all.

Environment

Bolt 3.29.0 on Ubuntu mint.

@WembleyFord2 WembleyFord2 added the Bug Bug reports and fixes. label May 23, 2024
@donoghuc
Member

Using --targets with a plan was added as a shortcut. It is really not a robust way to send parameters to a plan. Plans have no inherent "targets" like other actions. The reliable way to send any parameter to a plan (including a list of targets) is via parameters. I would encourage you to not use --targets with plans generally.
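
As a rough illustration of passing targets as a plan parameter, the hosts could be placed in an inventory group and the group name given as the value of the plan's targets parameter, for example bolt plan run module::plan targets=logservers. The group name logservers is hypothetical, and the sketch assumes a project-level inventory.yaml:

```yaml
# inventory.yaml (sketch; the group name "logservers" is an assumption)
groups:
  - name: logservers
    targets:
      - host1
      - host2
```

Because the plan declares targets as a TargetSpec, a group name should be accepted in the same place as a comma-separated list of hostnames.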

@WembleyFord2
Author

The reliable way to send any parameter to a plan (including a list of targets) is via parameters. I would encourage you to not use --targets with plans generally.

So - I ditched using --targets @list and instead used a group in my inventory file, pointing to a plugin which reads the same text file and turns it into JSON. The plugin works fine, and I see the targets listed with bolt inventory show. However, I get exactly the same behaviour from my plan as above: it returns the text files, and does nothing at all about the json documents. No errors, warnings, messages, no instance of the word 'json' in the output or a trace log.

I also tried this from a different system - so it's not my OS or installation of bolt.

Whatever this is, it's happening for me regardless of the way I pass targets to bolt.

@WembleyFord2
Author

WembleyFord2 commented May 25, 2024

Okay - I think I've worked out what's happening - but I'm failing to understand why.
Two of these targets I know have problems: one of them doesn't have the source files to be downloaded in /var/log, the other has a full disk and logon fails.

If these two failing nodes are present in the list of targets the run completes without attempting the second json file download step on any of the targets - including the ones that are fine. All the working targets have their text file downloaded via the first step - but there's absolutely nothing in the output (inc. trace level) about json. No failure messages, or downloads, nothing - the download/json folder on the bolt host isn't even created. The two failing nodes are highlighted in the output as expected and are unable to download the test.txt file - but from the rest of the targets I do get the test.txt file - but not the json.

If I remove the two problematic nodes, I get both the json and txt files from all remaining nodes.

This also explains why I was getting different output between using --targets @list and --targets host1,host2 - because in this specific case @list contained the failing nodes, neither of which was host1 or host2.

Adding catch_errors: true to both steps in the plan means I get the downloads from all (working) targets - and while this is a usable workaround in this use case it doesn't seem very satisfactory.

This all seems rather counterintuitive to me. Unless I'm misunderstanding something or have set something up very badly wrong, it looks like the failure of the first step on one target prevents subsequent steps being attempted on all targets. This might be by design - though it seems very strange to me - but it doesn't really explain why there is nothing at all in the logs when this occurs.
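
For reference, the workaround mentioned above (catch_errors: true on both steps) would look roughly like this when applied to the plan from the original report. This is a sketch of that change only, not a recommended final form:

```yaml
parameters:
  targets:
    type: TargetSpec
steps:
  - download: '/var/log/test.txt'
    destination: 'logs'
    run_as: root
    targets: $targets
    description: 'Download txt'
    # Continue to the next step even if this one fails on some targets
    catch_errors: true
  - download: '/var/log/test.json'
    destination: 'json'
    run_as: root
    targets: $targets
    description: 'download json'
    # Same for the second step, so a single failure does not halt the plan
    catch_errors: true
```

With this in place the second step is still attempted on every target, including the ones that already failed the first step.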

@donoghuc
Member

In the case where the first step has failures and the second step is not started (which is expected) does the plan report that it has completed successfully?

@WembleyFord2
Author

The plan reports the failure of the first step on the failed nodes. And reports, otherwise, that the plan succeeded. However, there is absolutely zero mention of the second step in any of the logs. If I remove the two failing nodes from the list of targets, the second step is attempted.

Is it really expected that the second step would not be attempted on the nodes where the first step did succeed?

I've worked round this for now by setting catch_errors: true on the first step. While this all makes sense for a single node, the idea that a single failing step on just one node would prevent the second step from running on any node - including those which succeeded with the first step - and without any indication of what is happening in the logs, is extremely confusing.

If this is the intended behaviour - and it seems odd to me that it might be - I could find nothing in the documentation discussing how the steps of a multi-step plan are ordered across targets (clearly Bolt must be performing step 1 on all nodes before attempting step 2). But to fail in this manner, with no indication at all in the logs that all subsequent steps have been skipped due to a failure on a single target, is a serious failing.

@donoghuc
Member

donoghuc commented May 28, 2024

Everything appears to be working as intended. In a plan the default behavior is to stop execution on failure (regardless of whether it's YAML or Puppet). We have the concept of catch_errors either as an argument to a step or as a block in a Puppet plan. For the following two plans I would expect execution to stop after the first run_command exits non-zero:

plan proj::pup(){
  run_command('exit 3', 'local://localhost')
  run_command('whoami', 'local://localhost')
}

steps:
  - command: "exit 3"
    targets: "local://localhost"
  - command: "whoami"
    targets: "local://localhost"

In both cases plan execution halts and the result of the plan is "finished" with an exit code of 1 for the plan run. This is documented here: https://www.puppet.com/docs/bolt/latest/writing_yaml_plans.html#steps

@WembleyFord2
Author

WembleyFord2 commented May 29, 2024

Hello - fully understand the above example - however I'm not sure it follows when extended to multiple targets. In my scenario, I have 98 targets. Two targets have conditions that cause a step in my plan to fail. The other 96 targets could execute the plan in its entirety.

But when the step on one of the targets fails, not only do subsequent steps not get executed on that target, but they do not get executed on any targets. This seems unexpected to me. Equally unexpected is the lack of any logging information indicating why a step 2 has not been executed on a target where step 1 completed and did not throw an error. The lack of any errors regarding the 96 working targets suggested, without further information, that other steps ran. Very unexpected that no further steps ran on 'good' targets.

E.g. if the targets are node1, node2 and node3:

steps:
  - command: "[ `hostname` != 'node1' ]"
    targets: node1,node2,node3
  - command: "echo 'got here'"
    targets: node1,node2,node3

Because this will fail on node1, step two isn't ever executed on any of the targets, and there's no indication from the logs as to why, or even that it hasn't run, even though - to my mind - I would expect it to work on two out of the three targets. Or at least provide some hint that it's not even trying. A log entry that step two has been skipped due to failure of step 1 (possibly on a different node) would be helpful.

Hope that makes sense.

@WembleyFord2
Author

Hello - just to clarify - is what I have described above the intended behaviour of bolt?

@donoghuc
Member

Yes this is the intended behavior. A step fails if any target fails. If you want to add logic for retry or proceeding only on targets in which previous steps have succeeded then you can do that. When you say there is no logging... I'm not seeing that behavior, both the CLI output and bolt-debug.log with all defaults show the failed step.
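
One possible way to proceed only on targets where the previous step succeeded is sketched below, using the plan from the original report. It assumes that a named step's result is available as a variable, that with catch_errors: true the result is a ResultSet exposing ok_set and targets, and that a bare string starting with $ is evaluated as an expression. The variable name txt_result is made up for illustration:

```yaml
parameters:
  targets:
    type: TargetSpec
steps:
  # Store the result of the first step in $txt_result (hypothetical name)
  - name: txt_result
    download: '/var/log/test.txt'
    destination: 'logs'
    run_as: root
    targets: $targets
    description: 'Download txt'
    # Do not halt the plan when some targets fail this step
    catch_errors: true
  - download: '/var/log/test.json'
    destination: 'json'
    run_as: root
    # Assumption: run only on the targets that succeeded in the first step
    targets: $txt_result.ok_set.targets
    description: 'download json'
```

Under those assumptions, targets that fail the first step are left out of the second step, while the remaining targets run the plan to completion.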

@WembleyFord2
Author

I assumed, given there's nothing clear that I could find in the documentation about this scenario, that there would be some sort of messaging explaining why steps beyond the failed step didn't execute on any of the nodes where the previous step succeeded. I was expecting steps in a plan to be executed on nodes where there are no errors in prior steps, so the total lack of any output regarding the steps which could have run but were never tried was very surprising.

That particular behaviour has encouraged me to look at other tooling, since it will leave all of our nodes in an inconsistent state, with a partially executed plan on every node, even if a failure occurs on only one. We'll either switch to some other tool or write a wrapper script to execute bolt on one target at a time to make it behave more safely.

Thanks, I'll close this.
