CentOS8: Errors while making cache if GPG keys change #67

Open
pfuntner opened this issue Feb 5, 2020 · 23 comments
Labels: bug, planned

Comments

pfuntner commented Feb 5, 2020

I'm seeing an error using GCP CentOS8 instances for my master and worker nodes:

TASK [geerlingguy.kubernetes : Make cache if Kubernetes GPG key changed.] *******************************************************************************************************************************************************************
Wednesday 05 February 2020  14:44:51 +0000 (0:00:01.290)       0:03:34.103 ****
fatal: [54.161.207.59]: FAILED! => {"changed": true, "cmd": ["yum", "-q", "makecache", "-y", "--disablerepo=*", "--enablerepo=kubernetes"], "delta": "0:00:00.726136", "end": "2020-02-05 14:44:56.576533", "msg": "non-zero return code", "rc": -13, "start": "2020-02-05 14:44:55.850397", "stderr": "Importing GPG key 0xA7317B0F:\n Userid     : \"Google Cloud Packages Automatic Signing Key <[email protected]>\"\n Fingerprint: D0BC 747F D8CA F711 7500 D6FA 3746 C208 A731 7B0F\n From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg", "stderr_lines": ["Importing GPG key 0xA7317B0F:", " Userid     : \"Google Cloud Packages Automatic Signing Key <[email protected]>\"", " Fingerprint: D0BC 747F D8CA F711 7500 D6FA 3746 C208 A731 7B0F", " From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg"], "stdout": "", "stdout_lines": []}
...

When I ssh into one of the instances and run the command directly, it seems to work ok:

[root@ip-172-31-51-134 ~]# yum -q makecache -y --disablerepo=\* --enablerepo=kubernetes
Importing GPG key 0xBA07F4FB:
 Userid     : "Google Cloud Packages Automatic Signing Key <[email protected]>"
 Fingerprint: 54A6 47F9 048D 5688 D7DA 2ABE 6A03 0B21 BA07 F4FB
 From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg
Importing GPG key 0x3E1BA8D5:
 Userid     : "Google Cloud Packages RPM Signing Key <[email protected]>"
 Fingerprint: 3749 E1BA 95A8 6CE0 5454 6ED2 F09C 394C 3E1B A8D5
 From       : https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
[root@ip-172-31-51-134 ~]# echo $?
0

The GCP keys and fingerprints are different from those in the failure, but I don't know what the significance is. If I start over from scratch with new instances, it fails at the same point with the same key and fingerprint as in the failure.

timmay75 commented Mar 5, 2020

Hi @pfuntner, I have the same issue and it's blocking me from using this on CentOS 8. Have you found any way around it? I'm pretty new to this deep usage of Ansible and I'm not sure if there is a way to not run this task. I tried to do the GPG import in a pre_task, but that doesn't seem to work either (see the sketch below). Any news would be greatly appreciated.
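
A pre_task along these lines is roughly what I mean by that (the key URL is the one shown in the failure output above), but it didn't seem to help:

pre_tasks:
  - name: Pre-import the Google Cloud packages signing key.
    rpm_key:
      key: https://packages.cloud.google.com/yum/doc/yum-key.gpg   # URL from the failure output
      state: present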

geerlingguy (Owner) commented

I'm also seeing this now, only on CentOS 8 builds, in Travis CI:

Importing GPG key 0xA7317B0F:
 Userid     : "Google Cloud Packages Automatic Signing Key <[email protected]>"
 Fingerprint: D0BC 747F D8CA F711 7500 D6FA 3746 C208 A731 7B0F
 From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg
Importing GPG key 0xBA07F4FB:
 Userid     : "Google Cloud Packages Automatic Signing Key <[email protected]>"
 Fingerprint: 54A6 47F9 048D 5688 D7DA 2ABE 6A03 0B21 BA07 F4FB
 From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg

On the task geerlingguy.kubernetes : Make cache if Kubernetes GPG key changed..

This started happening sometime between April 22 and April 30, according to cron-triggered CI builds: https://travis-ci.org/github/geerlingguy/ansible-role-kubernetes/builds

geerlingguy added the bug label May 7, 2020
geerlingguy (Owner) commented

Weird... if I run it locally, it passes. Exact same CI test.

I'm going to re-run the last failed build and see if maybe it's something in the Travis CI environment?

geerlingguy (Owner) commented

I think this could possibly be related to the test container. Locally I'm running a 3-month-old image pulled from Docker Hub, while the latest is from 17 days ago...

And in the centos8 image build CI task on Travis CI, I'm seeing the error (https://travis-ci.com/github/geerlingguy/docker-centos8-ansible/jobs/326423010#L631):

Error: GPG check FAILED

Going to debug there.

geerlingguy (Owner) commented

After updating to the latest version of the centos8 image, which seems to have the initial GPG key issue fixed, I'm getting:

--> Action: 'idempotence'
ERROR: Idempotence test failed because of the following tasks:
* [instance] => geerlingguy.docker : Add Docker GPG key.
* [instance] => geerlingguy.kubernetes : Add Kubernetes GPG keys.
* [instance] => geerlingguy.kubernetes : Add Kubernetes GPG keys.
* [instance] => geerlingguy.kubernetes : Make cache if Kubernetes GPG key changed.

See failed build: https://travis-ci.org/github/geerlingguy/ansible-role-kubernetes/jobs/683881462

So it seems something's amiss with keys in yum in CentOS 8, but only on Travis CI in my case (and it sounds like also on @pfuntner's servers).

@pfuntner / @timmay75 - What kind of servers/instances are you deploying against?

geerlingguy (Owner) commented

Actually, now I'm able to reproduce the issue locally:

--> Action: 'idempotence'
ERROR: Idempotence test failed because of the following tasks:
* [instance] => geerlingguy.docker : Add Docker GPG key.
* [instance] => geerlingguy.kubernetes : Add Kubernetes GPG keys.
* [instance] => geerlingguy.kubernetes : Add Kubernetes GPG keys.
* [instance] => geerlingguy.kubernetes : Make cache if Kubernetes GPG key changed.

geerlingguy (Owner) commented

After running the playbook a number of times, I see the keys just keep importing over and over again:

# rpm -qi gpg-pubkey-\* | grep -E ^Packager
error: rpmdbNextIterator: skipping h#     173 blob size(4836): BAD, 8 + 16 * il(70) + dl(3708)
error: rpmdb: damaged header #173 retrieved -- skipping.
error: rpmdb: damaged header #173 retrieved -- skipping.
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>
Packager    : Docker Release (CE rpm) <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages Automatic Signing Key <[email protected]>
Packager    : Google Cloud Packages RPM Signing Key <[email protected]>


geerlingguy commented May 7, 2020

I can't even rebuild the rpmdb:

[root@instance ~]# rm -f /var/lib/rpm/.*.lock
[root@instance ~]# rm -f /var/lib/rpm/__db.*
[root@instance ~]# rpm --rebuilddb
error: rpmdbNextIterator: skipping h#     173 blob size(4836): BAD, 8 + 16 * il(70) + dl(3708)
error: failed to replace old database with new database!
error: replace files in /var/lib/rpm with files from /var/lib/rpmrebuilddb.66870 to recover

geerlingguy (Owner) commented

No clue what's going on here, but also see jellyfin/jellyfin#2563

geerlingguy (Owner) commented

Someone else also ran into the corrupt db issue: ansible/awx#6306

timmay75 commented May 7, 2020

Hi Jeff. Thanks for the reply. I was trying to get the example you had out there (https://github.com/geerlingguy/ansible-for-devops/tree/master/kubernetes) working with Vagrant and CentOS 8. I seem to remember getting this going with a workaround, but I've forgotten it now. We ran into issues with the k8s internal flannel networking that were a core problem, so we ended up going another route. If you have any pointers for getting that working on CentOS 8 I would love to revisit it, since another team member took it, wrote a new playbook from scratch, used Calico, and separated the roles out.

geerlingguy (Owner) commented

It looks like the major issue might relate to using overlayfs—see Bug 1680124 - rpmdb --rebuilddb fails inside a container.

Basically, I had a separate build layer in the Dockerfile that ran a yum -y update, and that sometimes updates packages like rpm, which triggers that create-tmp-then-rename bug. The resulting image would then fail the first time people tried doing yum/dnf/rpm activities, and the db couldn't be rebuilt since it was genuinely corrupt.

So in geerlingguy/docker-centos8-ansible#7 I removed that separate yum -y update layer, and we'll see if that fixes things.

geerlingguy (Owner) commented

Drat, I fixed the issue with yum and built-in GPG keys over in the issue linked above... but now we're back to:

    TASK [geerlingguy.kubernetes : Make cache if Kubernetes GPG key changed.] ******
fatal: [instance]: FAILED! => {"changed": true, "cmd": ["yum", "-q", "makecache", "-y", "--disablerepo=*", "--enablerepo=kubernetes"], "delta": "0:00:00.509995", "end": "2020-05-07 23:08:22.993389", "msg": "non-zero return code", "rc": -13, "start": "2020-05-07 23:08:22.483394", "stderr": "Importing GPG key 0xA7317B0F:\n Userid     : \"Google Cloud Packages Automatic Signing Key <[email protected]>\"\n Fingerprint: D0BC 747F D8CA F711 7500 D6FA 3746 C208 A731 7B0F\n From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg", "stderr_lines": ["Importing GPG key 0xA7317B0F:", " Userid     : \"Google Cloud Packages Automatic Signing Key <[email protected]>\"", " Fingerprint: D0BC 747F D8CA F711 7500 D6FA 3746 C208 A731 7B0F", " From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg"], "stdout": "", "stdout_lines": []}

dmlb2000 commented May 7, 2020

@geerlingguy So, is this issue being worked off of a different branch?

dmlb2000 commented May 7, 2020

Oddly, the second run of a converge seems to get past the initial GPG key error. Is that the case for others?

dmlb2000 commented May 8, 2020

Interesting related issue: containers/podman#4431

I'm trying to change the task from a command to a shell and add the environment variable GPG_TTY=/dev/null, roughly as sketched below.
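
Roughly what I have in mind (the task name is taken from the error output; the kubernetes_rpm_key condition is an assumption about the role's register variable):

- name: Make cache if Kubernetes GPG key changed.
  shell: yum -q makecache -y --disablerepo=\* --enablerepo=kubernetes
  environment:
    GPG_TTY: /dev/null                    # the variable this experiment is about
  when: kubernetes_rpm_key is changed     # assumed register variable from the role
  args:
    warn: false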

dmlb2000 commented May 8, 2020

I'm wondering if the yum command needs some sort of TTY input to open and close...

dmlb2000 commented May 8, 2020

Interestingly enough, I got the task to work by using expect. It's a TTY input/output emulator driven by a scripting language.

- name: Install Expect
  package:
    name: ['expect']

- name: Make cache if Kubernetes GPG key changed.
  shell: |
    spawn yum -q makecache -y --disablerepo=* --enablerepo=kubernetes
    expect eof
    puts $expect_out(buffer)
    lassign [wait] pid spawnid os_error_flag value
    exit $value
  when: kubernetes_rpm_key is changed
  args:
    warn: false
    executable: /usr/bin/expect

The lassign statement captures the return information from the spawned process; the last value is the return code of the spawned process.

Don't get me wrong, this is a big workaround just to get the task to work properly. I don't think we should have to do this; there should be some other way that doesn't require a TTY.

dmlb2000 commented May 8, 2020

I thought of another option last night: ignore_errors: true also seems to get past the issue, though I'm not sure if that's inviting more problems later.
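
For reference, this is all it would take on the same task (again, the task name and the kubernetes_rpm_key condition are assumed from the role; ignore_errors is the only addition):

- name: Make cache if Kubernetes GPG key changed.
  command: yum -q makecache -y --disablerepo=* --enablerepo=kubernetes
  when: kubernetes_rpm_key is changed   # assumed register variable from the role
  ignore_errors: true                   # lets the play continue past the rc -13 failure, but may mask real cache problems
  args:
    warn: false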

andybrook commented

In my testing it doesn't appear necessary to import the keys in the task "Add Kubernetes GPG keys", since they are already imported by the task "Ensure Kubernetes repository exists". As a result, the cache refresh in the task "Make cache if Kubernetes GPG key changed" doesn't happen, and doesn't error.

I tested on freshly deployed VMs from the same template (built and updated on the 9th of May).

Build 1
Ran a simple playbook copied from the readme, let it error, then reran the playbook successfully.

Build 2
Ran the same simple playbook copied from the readme, let it error, then ran the command from "Make cache if Kubernetes GPG key changed" manually, which produced output (it gives no output when run a second time immediately afterwards), then reran the playbook successfully.

Build 3
Deleted the task "Add Kubernetes GPG keys" from the role and ran the playbook successfully on the first go.

Comparing the output of the installed packages on the three builds, they are identical, so there seems to be no reason to add the keys or update the cache on CentOS 8. Could that task be made conditional on the ansible_distribution_major_version parameter not being 8, as in the sketch after the build outputs below?

Output from Build 1
[sysadmin@d-10-0-0-33 ~]$ dnf list installed | md5sum
c22a567f3e0f3567ddd1afb0f22f2e19 -

Output from Build 2
[sysadmin@d-10-0-0-34 ~]$ dnf list installed | md5sum
c22a567f3e0f3567ddd1afb0f22f2e19

Output from Build 3
[sysadmin@d-10-0-0-35 ~]$ dnf list installed | md5sum
c22a567f3e0f3567ddd1afb0f22f2e19
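
To illustrate the conditional I mean (a hypothetical sketch only: the module, loop, and variable names are placeholders rather than the role's actual code; the when: guard is the actual suggestion):

- name: Add Kubernetes GPG keys.
  rpm_key:
    key: "{{ item }}"
    state: present
  loop: "{{ kubernetes_gpg_keys }}"                     # placeholder variable name, for illustration only
  when: ansible_distribution_major_version | int != 8   # skip the import on CentOS 8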

Very happy to provide further output or a PR if required, just didn't want to fill this issue with text!

santidhammo commented May 29, 2020

I want to note that we are experiencing the same issue, and it does not have anything to do with Kubernetes. Using the centos:8 Docker image, Jenkins, a proxy, and DNF causes exactly the same issue: it corrupts the database. This is indeed new (and wrong) behaviour.

Indeed, running the dnf update and install sequences in a single layer resolves the issue, but the database still ends up corrupted (therefore making any derived images impossible).

geerlingguy (Owner) commented

Could this have been fixed this week upstream? I just got a passing test a few minutes ago...
