Warp with versioned, enablemd5 true, ran into "We encountered an internal error. Please try again" PUT method errors #8524

Open
rkomandu opened this issue Nov 13, 2024 · 15 comments

rkomandu commented Nov 13, 2024

Environment info

  • NooBaa Version: noobaa-core-5.17.1-20241104.el9.ppc64le
  • Platform: standalone (RHEL 9, ppc64le)

Actual behavior

Warp was run against a version-enabled bucket and the following errors were observed:

./warp versioned --insecure --duration 180m --host gpfs-p10-s3-ces.rtp.raleigh.ibm.com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-vers-mup-13nov

warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: Benchmark data written to "warp-versioned-2024-11-13[002802]-3u4C.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 2h59m58s.
 * Throughput: 0.46 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 3h0m4s.
 * Throughput: 500.46 MiB/s, 2.05 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 3h0m4s.
Errors: 4
 * Throughput: 166.92 MiB/s, 0.68 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 2h59m58s.
 * Throughput: 1.37 obj/s

Cluster Total: 667.40 MiB/s, 4.56 obj/s, 4 errors over 3h0m0s.
Total Errors:4.
warp: Cleanup Done

ERROR in noobaa.log (snippet copied below; more detail in the attached logs):

Nov 13 02:20:39 node-gui1 [3689895]: [nsfs/3689895] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR InternalError
We encountered an internal error. Please try again./newbucket-warp-vers-mup-13nov/n9pjzZ6F/5.7eWO9uYt8k5sK7Wn.rnd?uploadId=bd897211-764d-48a6-a572-73bf184e750em3fjvnp1-1bh1z0-1nv POST /newbucket-warp-vers-mup-13nov/n9pjzZ6F/5.7eWO9uYt8k5sK7Wn.rnd?uploadId=bd897211-764d-48a6-a572-73bf184e750e {"host":"gpfs-p10-s3-ces.rtp.raleigh.ibm.com:6443","user-agent":"MinIO (linux; ppc64le) minio-go/v7.0.70 warp/(dev)","content-length":"1450","authorization":"AWS4-HMAC-SHA256 Credential=KCxP4AN9937kVqoCrNIs/20241113/us-east-1/s3/aws4_request, SignedHeaders=content-type;host;x-amz-checksum-crc32c;x-amz-content-sha256;x-amz-date, Signature=259d9b307e73f794210a9ea45b43da9818ba49cbca69c3ac409e0e1f76f499ce","content-type":"application/octet-stream","x-amz-checksum-crc32c":"DiTesA==","x-amz-content-sha256":"dbc6f5a50de4dfc013242d1332e6e6caf77d458a4040248a85a5d6b8ef56085a","x-amz-date":"20241113T071859Z"}
Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at ServerResponse.setHeader (node:_http_outgoing:652:11)
    at Object.post_object_uploadId [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_post_object_uploadId.js:43:13)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
Nov 13 02:20:39 node-gui1 [3689895]: [nsfs/3689895] [L0] core.endpoint.s3.s3_rest:: Sending error xml in body, but too late for headers...
...

S3 configuration

ALLOW_HTTP : false
DEBUGLEVEL : default
ENABLEMD5 : true
ENDPOINT_FORKS : 2
ENDPOINT_PORT : 6001
ENDPOINT_SSL_PORT : 6443

Expected behavior

Why does this error occur when there is no connection reset?

Four errors were observed, as shown below:

Operation: PUT (7391). Ran 3h0m4s. Size: 256000000 bytes. Concurrency: 20.
Errors: 4
First Errors:

  • https://:6443, 2024-11-13 02:20:40 -0500 EST: We encountered an internal error. Please try again.
  • https://:6443, 2024-11-13 02:20:40 -0500 EST: We encountered an internal error. Please try again.
  • https://:6443, 2024-11-13 02:23:01 -0500 EST: We encountered an internal error. Please try again.
  • https://:6443, 2024-11-13 02:30:02 -0500 EST: We encountered an internal error. Please try again.

Steps to reproduce

Run Warp with the options shown above and ENABLEMD5 set to true.

More information - Screenshots / Logs / Other output

Logs will be uploaded to a Box folder and linked here (starting here: https://ibm.box.com/s/5jsco2y3f50zcpwr39yt0f0scpo2tns8)

rkomandu added the NS-FS label Nov 13, 2024
nadavMiz self-assigned this Nov 13, 2024

nadavMiz commented:

@rkomandu the Box folder is empty. Any news about the logs?


rkomandu commented Nov 18, 2024

Reposting the Box folder link: https://ibm.ent.box.com/folder/293689458577

It has files from Nov 13th, as posted in the Slack screenshot.


nadavMiz commented Nov 18, 2024

At the time these 4 errors happened, we see the following error:

Nov 13 02:20:39  [3689895]: [nsfs/3689895] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/newbucket-warp-vers-mup-13nov/n9pjzZ6F/5.7eWO9uYt8k5sK7Wn.rnd?uploadId=bd897211-764d-48a6-a572-73bf184e750e</Resource><RequestId>m3fjvnp1-1bh1z0-1nv</RequestId></Error> POST /newbucket-warp-vers-mup-13nov/n9pjzZ6F/5.7eWO9uYt8k5sK7Wn.rnd?uploadId=bd897211-764d-48a6-a572-73bf184e750e {"host":"gpfs-p10-s3-ces.rtp.raleigh.ibm.com:6443","user-agent":"MinIO (linux; ppc64le) minio-go/v7.0.70 warp/(dev)","content-length":"1450","authorization":"AWS4-HMAC-SHA256 Credential=KCxP4AN9937kVqoCrNIs/20241113/us-east-1/s3/aws4_request, SignedHeaders=content-type;host;x-amz-checksum-crc32c;x-amz-content-sha256;x-amz-date, Signature=259d9b307e73f794210a9ea45b43da9818ba49cbca69c3ac409e0e1f76f499ce","content-type":"application/octet-stream","x-amz-checksum-crc32c":"DiTesA==","x-amz-content-sha256":"dbc6f5a50de4dfc013242d1332e6e6caf77d458a4040248a85a5d6b8ef56085a","x-amz-date":"20241113T071859Z"} 
Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client    at ServerResponse.setHeader (node:_http_outgoing:652:11)    at Object.post_object_uploadId [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_post_object_uploadId.js:43:13)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)

Looks like the issue happens during a multipart upload.
It might not be related, but right before that we get warnings about FS commands taking too long:

Nov 13 02:20:39 [3689891]: 2024-11-13 02:20:39.580321 [PID-3689891/TID-3689928] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/25cf8cc4-f1bb-4d01-8a4c-81dbb239a9ce/part-7  took too long: 132.432 ms
Nov 13 02:20:39 [3689895]: 2024-11-13 02:20:39.581724 [PID-3689895/TID-3689943] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/bd897211-764d-48a6-a572-73bf184e750e/part-1  took too long: 132.659 ms
Nov 13 02:20:39 [3689895]: 2024-11-13 02:20:39.581829 [PID-3689895/TID-3689955] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/bd897211-764d-48a6-a572-73bf184e750e/part-15  took too long: 109.97 ms
Nov 13 02:20:39 [3689891]: 2024-11-13 02:20:39.586127 [PID-3689891/TID-3689933] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/25cf8cc4-f1bb-4d01-8a4c-81dbb239a9ce/part-2  took too long: 138.213 ms
Nov 13 02:20:39 [3689891]: 2024-11-13 02:20:39.591514 [PID-3689891/TID-3689937] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/25cf8cc4-f1bb-4d01-8a4c-81dbb239a9ce/part-6  took too long: 143.57 ms
Nov 13 02:20:39 [3689891]: 2024-11-13 02:20:39.600428 [PID-3689891/TID-3689926] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/25cf8cc4-f1bb-4d01-8a4c-81dbb239a9ce/part-1  took too long: 134.905 ms
Nov 13 02:20:39 [3689891]: 2024-11-13 02:20:39.600500 [PID-3689891/TID-3689931] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-13nov/.noobaa-nsfs_673437d3edd92ddbf1c85cdd/multipart-uploads/25cf8cc4-f1bb-4d01-8a4c-81dbb239a9ce/part-15  took too long: 101.728 ms

nadavMiz commented:

@rkomandu can you collect full logs so I can investigate this further:
mms3 config change DEBUGLEVEL="all"


rkomandu commented Nov 19, 2024

@nadavMiz, you can check this run; it had 8 errors in the Warp summary.

./warp-script.sh 2>&1 | tee ./warp-vers-issue8524-19Nov.log
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: <ERROR> upload error:  We encountered an internal error. Please try again.
warp: <ERROR> short upload. want:256000000, got:0
warp: Benchmark data written to "warp-versioned-2024-11-19[000053]-MgrT.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 2h59m56s.
 * Throughput: 0.44 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 3h0m3s.
 * Throughput: 480.33 MiB/s, 1.97 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 3h0m1s.
Errors: 8
 * Throughput: 160.12 MiB/s, 0.66 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 2h59m59s.
 * Throughput: 1.31 obj/s

Cluster Total: 640.49 MiB/s, 4.37 obj/s, 8 errors over 2h59m59s.
Total Errors:8.
warp: Cleanup Done

Warp analyze showed PUT errors at the following timestamps:

Operation: PUT - total: 7077, 15.0%, Size: 256000000 bytes. Concurrency: 20, Ran 3h0m1s, starting 2024-11-19 00:00:56.946 -0500 EST
Errors: 8
https://<ces-ip>:6443, 2024-11-19 01:03:44 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 01:03:44 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 01:31:00 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 01:42:04 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 01:42:04 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 02:43:25 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 02:52:08 -0500 EST: We encountered an internal error. Please try again.
https://<ces-ip>:6443, 2024-11-19 02:52:08 -0500 EST: We encountered an internal error. Please try again.
 * Throughput: 160.12 MiB/s, 0.66 obj/s

The output is located here: https://ibm.ent.box.com/folder/294550758268

./warp versioned --insecure --duration 180m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-vers-mup-19nov


rkomandu commented Nov 19, 2024

The run started around "Nov 18 23:58:30"; the noobaa log files in Box are from hour 00 of Nov 19th. There is not much in the older logs (and no errors from warp analyze there either).

Nov 18 23:58:30 gui0-node [4006211]: [nsfs/4006211] [L1] core.endpoint.s3.s3_rest:: S3 REQUEST PUT /newbucket-warp-vers-mup-19nov op put_bucket request_id m3nzi3yy-4d47kz-m3d

romayalon commented:

@guymguym I talked with @nadavMiz about this bug today. Is there a reason we await the deletion of the parts? It seems to me this could be done in the background so we don't delay our response (see the sketch below).
In addition, Nadav mentioned he saw a TODO in the original keep-alive implementation regarding the version-id response; he can add more details.
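
A minimal sketch of the idea (hypothetical function names, not the actual namespace_fs code): instead of awaiting the parts cleanup before responding, fire it off and only log failures.

    // Sketch only: finalize_upload and delete_multipart_upload_dir are hypothetical names.
    async function complete_object_upload(params) {
        const version_id = await finalize_upload(params); // link/rename the final object

        // Fire-and-forget cleanup so the S3 response is not delayed by the unlinks.
        // Errors are only logged; a background reclaimer could retry later.
        delete_multipart_upload_dir(params.mpu_upload_path)
            .catch(err => console.warn('background MPU cleanup failed', params.mpu_upload_path, err));

        return { version_id };
    }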

guymguym commented:

@romayalon if we don't wait for the deletion of the parts, we can't delete the multipart directory either. Why are we waiting for so long anyway? If we want to speed it up, we could create a single fs_napi operation that deletes the entire directory and all the files in it, so we don't have to submit so many work items (see the sketch below).
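
For illustration only, here is the difference between the current per-file shape and a single recursive removal, using Node's standard fs/promises API as a stand-in for the fs_napi work item that would actually be added:

    const fs = require('fs/promises');

    // Roughly the current shape (simplified): one work item per part file, then remove the directory.
    async function cleanup_per_file(mpu_dir, part_names) {
        await Promise.all(part_names.map(name => fs.unlink(`${mpu_dir}/${name}`)));
        await fs.rmdir(mpu_dir);
    }

    // The suggested shape: one call that removes the directory and everything in it.
    // In NooBaa this would be a new native fs_napi operation; fs.rm is only a stand-in here.
    async function cleanup_recursive(mpu_dir) {
        await fs.rm(mpu_dir, { recursive: true, force: true });
    }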

nadavMiz commented:

@romayalon @guymguym the TODO mentioned is this:

    // TODO: Should be refactored when coding the complete solution
    // We need to know all of these things prior to completion
    // This also leaves a gap on long versioned/encrypted uploads (failure to assign headers after responding)

It looks like http_utils.set_keep_alive_whitespace_interval periodically writes whitespace to the response as a heartbeat after a timeout, which implicitly flushes the response headers. This is what causes the "Cannot set headers after they are sent to the client" error when we later try to set the version-id header; see the sketch below.
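
A standalone Node.js reproduction of that failure mode (not NooBaa code; the intervals are arbitrary): once the heartbeat writes any body bytes, the headers are flushed and a later setHeader throws.

    const http = require('http');

    http.createServer(async (req, res) => {
        // Keep-alive heartbeat: writing body bytes implicitly flushes the headers.
        const heartbeat = setInterval(() => res.write(' '), 100);

        await new Promise(resolve => setTimeout(resolve, 500)); // stand-in for a slow complete-MPU

        clearInterval(heartbeat);
        // Too late: the headers already went out with the first whitespace write.
        // Throws ERR_HTTP_HEADERS_SENT, just like setting the version-id header in the real flow.
        res.setHeader('x-amz-version-id', 'example');
        res.end();
    }).listen(0);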

I am not sure why unlink is taking so long, but other operations occasionally take that long as well. For example:

[2056231]: 2024-11-19 02:44:48.964316 [PID-2056231/TID-2056252] [L0] FS::FSWorker::Execute: WARNING FileRead _wrap->_path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/)tePSHyK/.versions/7.(8RNMlI6eghWis7E.rnd_mtime-d5pz1qiv6ps0-ino-1dezd _wrap->_fd=19 _pos=8388608 _offset=0 _len=8388608  took too long: 105.5 ms

romayalon commented:

@guymguym Why do we need to wait until after the deletion of the directory to send the response?

guymguym commented:

We don't have to, but otherwise the multipart upload is still assumed to exist by other API calls.


romayalon commented Nov 20, 2024

@guymguym Thanks Guy, that's what I thought. I think we can really improve this flow; maybe we could await only the rename of the folder and not wait for the unlinks (see the sketch below).
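
Roughly what that could look like (a sketch with hypothetical helper names, not the actual namespace_fs code): await only a cheap rename that hides the upload from other API calls, and let the unlinks run afterwards.

    // Sketch only: rename_path and remove_dir_contents are hypothetical helpers.
    async function cleanup_mpu(fs_context, mpu_dir) {
        const trash_dir = `${mpu_dir}.deleting`;             // hypothetical naming scheme
        await rename_path(fs_context, mpu_dir, trash_dir);   // awaited: the upload disappears atomically
        remove_dir_contents(fs_context, trash_dir)           // not awaited: unlinks happen in the background
            .catch(err => console.warn('deferred MPU cleanup failed', trash_dir, err));
    }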


nadavMiz commented Nov 20, 2024

@guymguym @romayalon even if we don't wait for the unlinks, uploading the file itself might still take a long time, which seems to be the reason we have the keep-alive mechanism in the first place. Maybe we could somehow set the version-id header before uploading the file and deleting the folder; that way we could wait as long as needed (see the sketch below).
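
A sketch of that idea only (hypothetical helpers; whether the version-id can actually be determined before the long-running work is exactly the open question for the NSFS versioning flow):

    // Not the current s3_post_object_uploadId code.
    async function post_object_uploadId(req, res) {
        const version_id = await decide_version_id(req);     // hypothetical helper
        res.setHeader('x-amz-version-id', version_id);        // headers are not flushed yet, so this is safe

        http_utils.set_keep_alive_whitespace_interval(res);   // the heartbeat may now flush headers freely

        // The existing long-running completion (link parts, unlink the MPU dir) can then take as long as needed.
        return complete_multipart_upload(req);                // hypothetical stand-in for the real call
    }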

romayalon commented:

@nadavMiz regardless of a fix for http_utils.set_keep_alive_whitespace_interval(res), we can improve this flow in namespace_fs. Can you please check that?


nadavMiz commented Dec 2, 2024

@madhuthorat FS commands are taking very long in this run. In the snippet below, the Link command took about 98 seconds and each Unlink took over 200 ms:

Nov 19 02:52:07 gpfs-p10-gui1 [2056231]: 2024-11-19 02:52:07.971538 [PID-2056231/TID-2056258] [L0] FS::FSWorker::Execute: WARNING Link _oldpath=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/parts-size-16777216 _newpath=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/final  took too long: 98038 ms
Nov 19 02:52:08 gpfs-p10-gui1 [2056231]: 2024-11-19 02:52:08.371059 [PID-2056231/TID-2056250] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/part-7  took too long: 265.178 ms
Nov 19 02:52:08 gpfs-p10-gui1 [2056231]: 2024-11-19 02:52:08.371098 [PID-2056231/TID-2056231] [L1] FS::FSWorker::OnOK: undefined Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/part-7
Nov 19 02:52:08 gpfs-p10-gui1 [2056232]: 2024-11-19 02:52:08.372460 [PID-2056232/TID-2056275] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/7441e260-93a1-4d94-82bb-9963a488e6a3/part-7  took too long: 221.655 ms
Nov 19 02:52:08 gpfs-p10-gui1 [2056232]: 2024-11-19 02:52:08.372501 [PID-2056232/TID-2056232] [L1] FS::FSWorker::OnOK: undefined Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/7441e260-93a1-4d94-82bb-9963a488e6a3/part-7
Nov 19 02:52:08 gpfs-p10-gui1 [2056231]: 2024-11-19 02:52:08.373491 [PID-2056231/TID-2056258] [L0] FS::FSWorker::Execute: WARNING Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/part-2  took too long: 267.578 ms
Nov 19 02:52:08 gpfs-p10-gui1 [2056231]: 2024-11-19 02:52:08.373535 [PID-2056231/TID-2056231] [L1] FS::FSWorker::OnOK: undefined Unlink _path=/ibm/fvt_fs/s3user-17001-dir/newbucket-warp-vers-mup-19nov/.noobaa-nsfs_673c1af6eee6ca215c7ee394/multipart-uploads/ec269088-f490-4719-89ae-737a677bbbf5/part-2

I tested this on my cluster, and for a 30 GB file the Link took only 0.321312 ms. It looks like the main issue is with the filesystem.

Regardless, I have opened a new issue (#8569) to decide how to handle this scenario on our side when there are filesystem problems. Maybe we should just let it fail.
