pvc is created but pod is unable to mount the volume #13

Open
brunnels opened this issue Jun 8, 2024 · 15 comments

@brunnels commented Jun 8, 2024

Latest QTS5 version

Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               105s              default-scheduler        Successfully assigned default/write-ntswf to tc1
  Normal   SuccessfulAttachVolume  104s              attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884"
  Warning  FailedMount             3s (x5 over 81s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884" : rpc error: code = Internal desc = failed to stage volume: exit status 2

The trident controller logs don't show anything wrong:

time="2024-06-08 23:40:25" level=info msg="---> Receive Storage Add Request"
time="2024-06-08 23:40:25" level=info msg="Add Storage Successfully" Host=10.60.0.5 Name=qts-nasf586af
time="2024-06-08 23:40:29" level=info msg="---> Receive Qvolume Create Request"
time="2024-06-08 23:40:37" level=info msg="Create Volume Successfully" Capacity=5 Index=0 Name=trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884
time="2024-06-08 23:40:37" level=info msg="---> Receive Iscsi Create Request"
time="2024-06-08 23:40:39" level=info msg="Create Iscsi Target Successfully" Alias=iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884 IQN="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" Index=0 Name=iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884
time="2024-06-08 23:40:39" level=info msg="---> Receive Iscsi Map Request"
time="2024-06-08 23:40:41" level=info msg="Map Volume and Iscsi Target Successfully" Iscsi Target Index=0 Volume Index=0
time="2024-06-08 23:40:41" level=info msg="---> Receive Qvolume FullGet Request"
time="2024-06-08 23:43:00" level=info msg="---> Receive Qvolume List Request"
time="2024-06-08 23:43:00" level=info msg="List Volume Successfully"
time="2024-06-08 23:43:00" level=info msg="---> Receive Qvolume FullGet Request"
time="2024-06-08 23:45:41" level=info msg="---> Receive Qvolume List Request"
time="2024-06-08 23:45:41" level=info msg="List Volume Successfully"
time="2024-06-08 23:45:41" level=info msg="---> Receive Qvolume FullGet Request"

The trident-node-linux logs do have errors:

time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.getDeviceInfoForLUN" iSCSINodeName="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" logLayer=csi_frontend lunID=0 needFSType=false requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Found device." devices="[sdb]" iqn="iqn.2004-04.com.qnap:tvs-871u-rp:iscsi.iscsi-trident-pvc-ff88d0e4-3eed-4ea7-b79a-bab140864884.f586af" logLayer=csi_frontend multipathDevice= requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI scsiLun=0 workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Device found." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.getDeviceFSType" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Device found." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.waitForDevice" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[/dev/sdb]" command=blkid logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeoutAndInput." args="[/dev/sdb]" command=blkid logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeoutSeconds=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeoutAndInput." command=blkid error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeout." logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=info msg="Could not get FSType for device; err: exit status 2." device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.getDeviceFSType" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> devices.isDeviceUnformatted" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeout." args="[if=/dev/sdb bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeout=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg=">>>> osutils.execCommandWithTimeoutAndInput." args="[if=/dev/sdb bs=4096 count=512 status=none]" command=dd logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI timeoutSeconds=5s workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeoutAndInput." command=dd error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< osutils.execCommandWithTimeout." logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=error msg="failed to read the device" device=/dev/sdb error="exit status 2" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< devices.isDeviceUnformatted" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=error msg="Unable to identify if the device is unformatted; err: exit status 2" device=/dev/sdb logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="<<<< iscsi.AttachISCSIVolume" logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
time="2024-06-08T23:50:28Z" level=debug msg="Attach iSCSI volume is not complete, waiting." error="exit status 2" increment=5.51696103s logLayer=csi_frontend requestID=87c09a05-119e-4e25-adfe-4ec498925d39 requestSource=CSI workflow="node_server=stage"
@davidcheng0716 (Collaborator)

Hi @brunnels,

Can you exec the command "dd if=/dev/sdb bs=4096 count=512 status=none" on the node?

The function just executes the above command and reads the result.
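
For reference, a hedged way to run that same check from inside the node plugin pod (the pod and namespace names below are placeholders, and this assumes the container image has dd at all; a successful read should stream back 2 MiB):

# placeholders: substitute your Trident namespace and the node pod on the affected node
kubectl -n <trident-namespace> get pods -o wide
kubectl -n <trident-namespace> exec <trident-node-linux-pod> -- \
  dd if=/dev/sdb bs=4096 count=512 status=none | wc -c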

@brunnels (Author)

The node runs Talos, so there's no SSH access. I can exec on the trident-node-linux pod if that would help, but the logs indicate it would probably fail. All my Talos nodes have the iscsi-tools system extension.

@brunnels (Author)

@davidcheng0716 It looks like the synology-csi, which uses iSCSI for RWO and CIFS for RWX, works well under Talos: https://www.talos.dev/v1.7/kubernetes-guides/configuration/synology-csi/

Perhaps qnap-csi needs some additional configuration steps, like the openebs jiva ones detailed here?

@brunnels (Author)

@davidcheng0716 I found the talos-debug-tools project, which creates a daemonset similar to the qnap csi driver's. I changed the namespace to match the qnap csi driver's namespace, and I am able to exec into the pod and run the dd command without error.

Maybe the debug daemonset could provide some insight into something the qnap csi daemonset might be missing?

@brunnels (Author)

Comparing daemonset-trident-node-linux with the debug daemonset, the only thing that pops out is the securityContext. The debug daemonset has this container securityContext:

        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            add:
              - SYS_ADMIN
          privileged: true

and this for spec.template.spec.securityContext:

      securityContext:
        runAsNonRoot: false
        seccompProfile:
          type: RuntimeDefault
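
If the operator ever allows it, one hedged way to test this theory would be to patch the node daemonset by hand (the namespace and the container name "trident-main" are assumptions, and the operator may well revert the change):

# assumptions: namespace, daemonset, and container names; adjust to your deployment
kubectl -n <trident-namespace> patch daemonset trident-node-linux --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: trident-main
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
          capabilities:
            add: ["SYS_ADMIN"]'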

@LeonaChen2727 (Contributor)

Hi @brunnels,
We have received your report and suspect the issue may be caused by a specific Linux distribution.
We are actively working to resolve this problem.

@brunnels (Author)

@davidcheng0716 Do you plan to release the source for the operator? If the source were available, I could make the necessary changes and test them.

@brunnels (Author)

@davidcheng0716 any movement on this? I'm happy to assist with testing.

@davidcheng0716 (Collaborator)

@brunnels
Sorry for the late reply; I'll take a look at this issue.
To reproduce it, I'm going to build the same environment. What kind of Talos installation do you have? Bare metal, VirtualBox, or something else?

@brunnels (Author) commented Jul 23, 2024

@davidcheng0716 Bare metal. I'm using the https://github.com/onedr0p/cluster-template template, which makes it very quick to get a repeatable, secure, best-practices cluster up in a few minutes.

@davidcheng0716 (Collaborator)

@brunnels
I guess the problem is that the node doesn't have the 'dd' tool; our service doesn't ship 'dd' in the container.

The talos-debug-tools daemonset installs the tools in the container instead of calling the tool on the node. That's why it works when you exec into that pod.

I'm wondering whether the node itself has 'dd', outside of any container.
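
Since Talos has no SSH, one hedged way to check the node directly is kubectl debug, which mounts the host filesystem at /host inside the debug container (the node name tc1 comes from the events above; this assumes the cluster's pod security settings let the debug pod run):

# host root is mounted at /host in a node debug pod
kubectl debug node/tc1 -it --image=busybox -- \
  sh -c 'ls -l /host/bin/dd /host/usr/bin/dd 2>/dev/null || echo "no dd on the host"'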

@brunnels (Author) commented Aug 2, 2024

@davidcheng0716 no, that's not the problem. As noted in my prior comment, I can use the talos debug daemonset to get into a pod in the same namespace on the same node and run dd against the iSCSI PVC that qnap created. I think the issue lies with the pod securityContext.

By default, the Talos security policy prevents privilege escalation in pods, but it can be overridden with a namespace annotation plus the added capabilities: https://www.talos.dev/v1.7/kubernetes-guides/configuration/pod-security/
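
For anyone else hitting this on Talos, a minimal sketch of the exemption that doc describes (Pod Security Admission reads it as a namespace label; the namespace name here is taken from this thread):

apiVersion: v1
kind: Namespace
metadata:
  name: qnap-trident  # assumption: namespace name as mentioned in this thread
  labels:
    pod-security.kubernetes.io/enforce: privileged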

I have the annotation on the qnap-trident namespace, but I cannot change the pod securityContext set by the qnap-trident operator.

@brunnels (Author) commented Aug 2, 2024

@davidcheng0716 I just re-read your comment and looked at the debug-tools Dockerfile, and you are correct. It is installing dd into the pod. My apologies.

The node won't have dd. Talos is the bare minimum, so anything the pod needs, it must provide itself.
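
Following that logic, the eventual fix would presumably be for the qnap csi node image to bundle the tools it shells out to, the way the debug-tools Dockerfile does. A minimal sketch, assuming an Alpine-based image (the base image name is hypothetical):

# hypothetical base image; the real one is not published
FROM <qnap-trident-node-base>
# coreutils provides dd, util-linux provides blkid -- the two commands failing in the logs above
RUN apk add --no-cache coreutils util-linux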

@brunnels (Author)

@Shanna-chen Why did you close this? The issue is not completed or fixed.

Shanna-chen reopened this Sep 30, 2024
@Shanna-chen (Collaborator)

> @Shanna-chen Why did you close this? The issue is not completed or fixed.

Sorry, I closed it by mistake. I will reopen it to continue evaluating this issue.
