squid: mgr/cephadm: update default NVMEoF container image version #57516

Open
wants to merge 2 commits into
base: squid

adk3798
Contributor

@adk3798 adk3798 commented May 16, 2024

backport tracker: https://tracker.ceph.com/issues/65958


backport of #57182
parent tracker: https://tracker.ceph.com/issues/65718

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

Otherwise the nvmeof daemon fails to start up

i148-njpgei[69585]: Traceback (most recent call last):
i148-njpgei[69585]:   File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
i148-njpgei[69585]:     return _run_code(code, main_globals, None,
i148-njpgei[69585]:   File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
i148-njpgei[69585]:     exec(code, run_globals)
i148-njpgei[69585]:   File "/src/control/__main__.py", line 43, in <module>
i148-njpgei[69585]:     gateway.serve()
i148-njpgei[69585]:   File "/src/control/server.py", line 177, in serve
i148-njpgei[69585]:     omap_lock = OmapLock(omap_state, gateway_state)
i148-njpgei[69585]:   File "/src/control/state.py", line 201, in __init__
i148-njpgei[69585]:     self.omap_file_lock_retry_sleep_interval = self.omap_state.config.getint_with_default("gateway",
i148-njpgei[69585]:   File "/src/control/config.py", line 47, in getint_with_default
i148-njpgei[69585]:     return self.config.getint(section, param, fallback=value)
i148-njpgei[69585]:   File "/usr/lib64/python3.9/configparser.py", line 818, in getint
i148-njpgei[69585]:     return self._get_conv(section, option, int, raw=raw, vars=vars,
i148-njpgei[69585]:   File "/usr/lib64/python3.9/configparser.py", line 808, in _get_conv
i148-njpgei[69585]:     return self._get(section, conv, option, raw=raw, vars=vars,
i148-njpgei[69585]:   File "/usr/lib64/python3.9/configparser.py", line 803, in _get
i148-njpgei[69585]:     return conv(self.get(section, option, **kwargs))
i148-njpgei[69585]: ValueError: invalid literal for int() with base 10: '1.0'

I've been told 1.2.5 has a patch that allows this value to be a float
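
For context, the failure in the traceback can be reproduced with the standard library alone. This is a minimal sketch, not the gateway's code: the option name `omap_file_lock_retry_sleep_interval` is assumed from the attribute shown in the traceback, and the float-based reader is only one plausible shape for the fix reportedly present in 1.2.5.

```python
import configparser

# Assumed config: the option name mirrors the attribute seen in the traceback.
SAMPLE = """
[gateway]
omap_file_lock_retry_sleep_interval = 1.0
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)


def read_as_int(section: str, option: str, default: int) -> int:
    # fallback only applies when the option is missing, so a present
    # value of '1.0' still goes through int() and raises ValueError.
    return config.getint(section, option, fallback=default)


def read_as_float(section: str, option: str, default: float) -> float:
    # Hypothetical tolerant reader: float() accepts both '1' and '1.0'.
    return config.getfloat(section, option, fallback=default)


try:
    read_as_int("gateway", "omap_file_lock_retry_sleep_interval", 1)
    int_error = None
except ValueError as exc:
    int_error = str(exc)

float_value = read_as_float("gateway", "omap_file_lock_retry_sleep_interval", 1.0)
print(int_error)
print(float_value)
```

The `fallback` argument does not help here because the option exists; it is the conversion of the existing value that fails.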

Fixes: https://tracker.ceph.com/issues/65718

Signed-off-by: Adam King <[email protected]>
(cherry picked from commit 0cebec2)

The python/mypy combination on the Jenkins nodes that CI runs on doesn't seem to care, but locally I get

mypy: commands[0]> mypy --config-file=../mypy.ini -p ceph
ceph/deployment/service_spec.py: note: In member "validate" of class "NvmeofServiceSpec":
ceph/deployment/service_spec.py:1497: error: Unsupported operand types for > ("float" and "None")  [operator]
ceph/deployment/service_spec.py:1497: note: Left operand is of type "Optional[float]"
ceph/deployment/service_spec.py:1500: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1500: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1503: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1503: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1506: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1506: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1509: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1509: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1512: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1512: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1515: error: Unsupported operand types for > ("float" and "None")  [operator]
ceph/deployment/service_spec.py:1515: note: Left operand is of type "Optional[float]"
ceph/deployment/service_spec.py:1518: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1518: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1521: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1521: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1524: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1524: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1527: error: Unsupported operand types for > ("int" and "None")  [operator]
ceph/deployment/service_spec.py:1527: note: Left operand is of type "Optional[int]"
ceph/deployment/service_spec.py:1530: error: Unsupported operand types for > ("float" and "None")  [operator]
ceph/deployment/service_spec.py:1530: note: Left operand is of type "Optional[float]"
Found 12 errors in 1 file (checked 27 source files)

The errors make sense to me, so I think we should fix them
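
The pattern mypy is flagging, and one way to satisfy it, can be sketched as follows. The class and field names are hypothetical and are not taken from `NvmeofServiceSpec`; the only point is that an `Optional` value needs an explicit `None` check before it is compared.

```python
from typing import Optional


class ExampleSpec:
    """Hypothetical spec with optional numeric fields (not the real class)."""

    def __init__(self, interval: Optional[float] = None,
                 retries: Optional[int] = None) -> None:
        self.interval = interval
        self.retries = retries

    def validate(self) -> None:
        # Comparing self.interval directly would be rejected by mypy,
        # since the value may be None; the None check narrows it to float.
        if self.interval is not None and self.interval < 0:
            raise ValueError("interval must be non-negative")
        # Same narrowing for the Optional[int] field.
        if self.retries is not None and self.retries < 0:
            raise ValueError("retries must be non-negative")
```

With the guard in place, unset fields are skipped at runtime instead of raising a `TypeError` on comparison with `None`.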

Signed-off-by: Adam King <[email protected]>
(cherry picked from commit 7520f65)
@adk3798 adk3798 requested a review from a team as a code owner May 16, 2024 12:09
@adk3798 adk3798 added this to the squid milestone May 16, 2024
Contributor

@idryomov idryomov left a comment

Note: rbd:nvmeof suite would still fail, but with a different error (further into the test):

2024-05-16T09:15:10.367 INFO:tasks.workunit.client.2.smithi100.stdout:Failure creating subsystem nqn.2016-06.io.spdk:cnode1: HA must be enabled for subsystems

(This is from today's run on main: https://pulpito.ceph.com/dis-2024-05-16_08:07:49-rbd-wip-dis-testing-distro-default-smithi/7708811)

@adk3798
Contributor Author

adk3798 commented May 16, 2024

> Note: rbd:nvmeof suite would still fail, but with a different error (further into the test):
>
> 2024-05-16T09:15:10.367 INFO:tasks.workunit.client.2.smithi100.stdout:Failure creating subsystem nqn.2016-06.io.spdk:cnode1: HA must be enabled for subsystems
>
> (This is from today's run on main: https://pulpito.ceph.com/dis-2024-05-16_08:07:49-rbd-wip-dis-testing-distro-default-smithi/7708811)

Alright, would you like me to ping you to review the results of that run when it happens before we merge this?

@idryomov
Contributor

idryomov commented May 16, 2024

> Alright, would you like me to ping you to review the results of that run when it happens before we merge this?

I can review, but as long as it fails the same way as https://pulpito.ceph.com/dis-2024-05-16_08:07:49-rbd-wip-dis-testing-distro-default-smithi/7708811 (i.e. status 22 from qa/workunits/rbd/nvmeof_setup_subsystem.sh and that error in the log), you can consider it approved. All we are doing here is catching up to main.

@adk3798
Contributor Author

adk3798 commented May 21, 2024

@idryomov this patch worked for the nvmeof test in the orch suite, but the test in the rbd suite runs `ceph config set mgr mgr/cephadm/container_image_nvmeof quay.io/ceph/nvmeof:latest`, which nullifies the effect of this change: https://pulpito.ceph.com/adking-2024-05-21_13:52:35-rbd-wip-adk4-testing-2024-05-17-0821-squid-distro-default-smithi/7718636/. Do we want to adjust the test, or should we request an update of the latest tag for the nvmeof image from the nvmeof team?
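
To illustrate why the override masks the new default, here is a rough sketch of the precedence involved. It is not cephadm's code: `resolve_image` is a hypothetical helper, and the `1.2.5` default tag is an assumption based on the version mentioned in the description.

```python
from typing import Dict

# Assumed new built-in default; the exact image reference is illustrative.
DEFAULT_IMAGES: Dict[str, str] = {
    "mgr/cephadm/container_image_nvmeof": "quay.io/ceph/nvmeof:1.2.5",
}


def resolve_image(option: str, overrides: Dict[str, str]) -> str:
    # An explicitly set value always wins over the built-in default.
    return overrides.get(option, DEFAULT_IMAGES[option])


# No override: the updated default is used.
orch_image = resolve_image("mgr/cephadm/container_image_nvmeof", {})

# The rbd suite sets the option explicitly, so the default is never consulted.
rbd_image = resolve_image(
    "mgr/cephadm/container_image_nvmeof",
    {"mgr/cephadm/container_image_nvmeof": "quay.io/ceph/nvmeof:latest"},
)
print(orch_image)
print(rbd_image)
```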

@idryomov
Contributor

idryomov commented May 21, 2024

> Do we want to adjust the test, or should we request an update of the latest tag for the nvmeof image from the nvmeof team?

The latter, but this is pending discussion with Aviv's team. @barakda indicated in #57522 (comment) that latest not getting updated automatically is intended. I think that needs to change -- we shouldn't be requesting anything and it should be done automatically. Having latest point at 1.0.0 when e.g. 1.2.5 is available and used in other places is not only weird, but also breaks the idea of CI.
