Migration jobs stuck and Pool appears unresponsive #7578

Open
cfgamboa opened this issue May 22, 2024 · 1 comment

Comments

@cfgamboa

Hello,

dCache release 9.2.17

Pool dc267_10 appears to be stuck; there are currently 400 p2p transfers to the pool.
The maximum number allowed has been increased to 100. The transfers come from migration jobs.

For example, this migration job appears to have an active transfer for the file 0000500691EC18CD4BF787E0AA7022D2E96B (size 8196943164):

[dccore03] (dcdoor31_1@dcdoor31oneDomain) admin > migration info 179

Command : migration move -storage=MCTAPE:MC -permanent -concurrency=120 -eager -replicas=1 -target=pgroup -- MCTAPE-write
State : RUNNING
Queued : 0
Attempts : 292992
Targets : dc242_10,dc240_10,dc263_10,dc269_10,dc246_10,dc267_10,dc248_10,dc265_10,dc261_10,dc254_10,dc252_10,dc237_10,dc239_10,dc256_10,dc266_10,dc241_10,dc245_10,dc268_10,dc262_10,dc249_10,dc264_10,dc260_10,dc259_10,dc236_10,dc238_10,dc270_10,dc255_10
Completed : 128736 files; 303479319986394 bytes; 99%
Total : 303521277057964 bytes
Concurrency: 150
Running tasks:
[439494] 0000500691EC18CD4BF787E0AA7022D2E96B: TASK.Copying -> [dc267_10@local]
[439632] 00002C75947271B947F4802282CE5286B119: TASK.Copying -> [dc267_10@local]
[439879] 0000A1D47825261B485F8857749DC1863FD4: TASK.Copying -> [dc267_10@local]
[439990] 00004C03422A41694C60852B4FA131CEFE4B: TASK.Copying -> [dc267_10@local]
[440138] 0000467AD762815E428981B424846A7A6B35: TASK.Copying -> [dc267_10@local]
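To see at a glance which pool the running tasks are pointing at, the task lines can be tallied per target. This is only a minimal sketch: it assumes the `migration info` output has been saved as plain text, and the helper name `tally_targets` and the regular expression are illustrative, not part of dCache.

```python
import re
from collections import Counter

# Hypothetical helper, not a dCache API. The line format is taken from
# the "Running tasks" section pasted above:
#   [task id] PNFSID: STATE -> [pool@domain]
TASK_RE = re.compile(
    r"\[(\d+)\]\s+([0-9A-F]+):\s+(\S+)\s+->\s+\[([^\]@]+)@[^\]]*\]"
)

def tally_targets(text):
    """Count running migration tasks per target pool name."""
    counts = Counter()
    for line in text.splitlines():
        match = TASK_RE.search(line)
        if match:
            counts[match.group(4)] += 1
    return counts

sample = """\
[439494] 0000500691EC18CD4BF787E0AA7022D2E96B: TASK.Copying -> [dc267_10@local]
[439632] 00002C75947271B947F4802282CE5286B119: TASK.Copying -> [dc267_10@local]
"""

for pool, count in tally_targets(sample).items():
    print(pool, count)
```

In the listing above, every running task shown targets dc267_10.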

But there is no activity:

[root@dc267 data]# ls -l 0000500691EC18CD4BF787E0AA7022D2E96B
-rw-r--r-- 1 root root 1939860332 May 21 23:54 0000500691EC18CD4BF787E0AA7022D2E96B
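The numbers above can be compared directly. The sketch below uses only values copied from this report (the variable names are mine) to show how far the partial replica on dc267 is from the expected file size, and how much of the whole migration job is still outstanding.

```python
# Values copied from the output above; nothing here queries dCache.
expected_size = 8196943164  # file size reported for the migration task
on_disk_size = 1939860332   # size shown by `ls -l` on dc267

fraction = on_disk_size / expected_size
missing = expected_size - on_disk_size
print(f"replica is {fraction:.1%} complete, {missing} bytes missing")

job_total = 303521277057964      # "Total" from `migration info 179`
job_completed = 303479319986394  # "Completed" from `migration info 179`
print(f"job has {job_total - job_completed} bytes left to copy")
```

The file's modification time (May 21 23:54) is also well before the debug log entries from 22 May, which would be consistent with the copy having stopped writing rather than merely being slow.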

Also the destination pool appears to be stuck

image

Load on the pool server is not high; however, commands like sweeper purge do not seem to take effect.

https://dcache.sdcc.bnl.gov/usatlas/pools/list/PoolManager//dmz-pools/spaces

There is no notable information from System@dc267_10-Domain.

In debug mode, the pool only shows entries of this type:

22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/fd3270d3.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/fdf90b95.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/f30dd6ad.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/9629661e.signing_policy.
22 May 2024 14:49:00 (dc267_10) [] Reloaded CRL from file:/etc/grid-security/certificates/3fb4d8a6.r0.
22 May 2024 14:49:18 (dc267_10) [] Sweeper tries to reclaim 9223372036854775807 bytes.
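A note on the last log line: the byte count is exactly the largest signed 64-bit integer (Java's `Long.MAX_VALUE`). My reading, which is an assumption based on the number alone and not on the dCache source, is that the sweeper was asked to reclaim as much as it can rather than a specific amount.

```python
# Value copied from the sweeper log line above.
reclaim_target = 9223372036854775807

# Compare against the maximum value of a signed 64-bit integer.
is_long_max = reclaim_target == 2**63 - 1
print(is_long_max)
```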

I have created a dump file, dc267_10-Domain_15756098301082517203.jfr, which I can send if needed.

Please advise,

All the best,
Carlos

@cfgamboa
Author

Cancelling the transfers of these stuck migration jobs helped restore control of the pool:

22 May 2024 15:31:33 [pool-32-thread-1] [] Flushed 0000DAADA4797AD8430FA6267BFFDF8C527C to nearline storage: [osm://osm?bfid=/home/dcatlas/BNLT1D0/data24_13p6TeV/RAW/other/data24_13p6TeV.00476218.physics_Main.daq.RAW/data24_13p6TeV.00476218.physics_Main.daq.RAW._lb0395._SFO-18._0005.data]
22 May 2024 15:31:33 [sweeper-purge] [] remove entry 00000DCD049FF5F44ACCB7569B47B85AFB90: 'sweeper purge' command
22 May 2024 15:31:33 [sweeper-purge] [] remove entry 000057C1EB9F2E3B4B559B34528588ECCCD0: 'sweeper purge' command
22 May 2024 15:31:34 [sweeper-purge] [] remove entry 0000D8A832FD665F498593B534D284A8B2F9: 'sweeper purge' command
22 May 2024 15:31:34 [sweeper-purge] [] remove entry 00002417AF41FDBC44BCB66AD12A67FB4D06: 'sweeper purge' command
22 May 2024 15:31:34 [sweeper-purge] [] remove entry 0000D8E53997532F4D85BBE11809B95B2E2A: 'sweeper purge' command
22 May 2024 15:31:35 [sweeper-purge] [] remove entry 000043C771E4FBBE41B6966F9367D7C1EE3D: 'sweeper purge' command
22 May 2024 15:31:35 [sweeper-purge] [] remove entry 00003B238B9527C24038957889C8CC84A42E: 'sweeper purge' command
22 May 2024 15:31:35 [sweeper-purge] [] remove entry 0000ACAFC4AD3C8A4453A1D4534A00AC43D6: 'sweeper purge' command
