Migration jobs stuck and Pool appears unresponsive #7578
Comments
Cancelling the transfers of these stuck migration jobs helped to recover control of the pool.
Hello,
dCache release 9.2.17
Pool dc267_10 appears to be stuck; there are currently 400 p2p transfers to the pool.
The maximum number allowed was increased to 100. The transfers come from migration jobs.
For example, this migration job appears to have an active transfer for the file 0000500691EC18CD4BF787E0AA7022D2E96B (size 8196943164):
[dccore03] (dcdoor31_1@dcdoor31oneDomain) admin > migration info 179
Command : migration move -storage=MCTAPE:MC -permanent -concurrency=120 -eager -replicas=1 -target=pgroup -- MCTAPE-write
State : RUNNING
Queued : 0
Attempts : 292992
Targets : dc242_10,dc240_10,dc263_10,dc269_10,dc246_10,dc267_10,dc248_10,dc265_10,dc261_10,dc254_10,dc252_10,dc237_10,dc239_10,dc256_10,dc266_10,dc241_10,dc245_10,dc268_10,dc262_10,dc249_10,dc264_10,dc260_10,dc259_10,dc236_10,dc238_10,dc270_10,dc255_10
Completed : 128736 files; 303479319986394 bytes; 99%
Total : 303521277057964 bytes
Concurrency: 150
Running tasks:
[439494] 0000500691EC18CD4BF787E0AA7022D2E96B: TASK.Copying -> [dc267_10@local]
[439632] 00002C75947271B947F4802282CE5286B119: TASK.Copying -> [dc267_10@local]
[439879] 0000A1D47825261B485F8857749DC1863FD4: TASK.Copying -> [dc267_10@local]
[439990] 00004C03422A41694C60852B4FA131CEFE4B: TASK.Copying -> [dc267_10@local]
[440138] 0000467AD762815E428981B424846A7A6B35: TASK.Copying -> [dc267_10@local]
But there is no activity on the file:
[root@dc267 data]# ls -l 0000500691EC18CD4BF787E0AA7022D2E96B
-rw-r--r-- 1 root root 1939860332 May 21 23:54 0000500691EC18CD4BF787E0AA7022D2E96B
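For anyone reproducing this check, here is a minimal Python sketch of how the lack of activity could be confirmed on the pool host. It assumes the size reported by the migration job is in bytes and that the script runs in the pool's data directory; the helper names `copy_progress` and `is_stalled` and the polling interval are illustrative choices, not part of dCache. With the numbers above, the replica on disk is roughly 24% of the expected size.

```python
import os
import time


def copy_progress(path, expected_size):
    """Return (bytes_on_disk, fraction_of_expected) for a partially copied file."""
    size = os.stat(path).st_size
    return size, size / expected_size


def is_stalled(path, interval=30.0, samples=3):
    """Return True if the file size stays the same across `samples` polls."""
    last = os.stat(path).st_size
    for _ in range(samples):
        time.sleep(interval)
        current = os.stat(path).st_size
        if current != last:
            return False  # the file grew (or changed), so the copy is alive
        last = current
    return True


if __name__ == "__main__":
    # Values taken from the report; the size is assumed to be in bytes.
    replica = "0000500691EC18CD4BF787E0AA7022D2E96B"
    expected = 8196943164
    size, fraction = copy_progress(replica, expected)
    print(f"{size} of {expected} bytes ({fraction:.1%})")
    print("stalled" if is_stalled(replica) else "still growing")
```

A file whose size does not change over several polls while the task is still listed as TASK.Copying is consistent with the stuck transfer described here, although it does not by itself say why the pool stopped writing.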
The destination pool also appears to be stuck.
Load on the pool server is not high, but commands such as sweeper purge do not seem to take effect.
https://dcache.sdcc.bnl.gov/usatlas/pools/list/PoolManager//dmz-pools/spaces
There is no special information from System@dc267_10-Domain.
With the pool in debug mode, the log shows only one type of entry.
I have created a dump file, dc267_10-Domain_15756098301082517203.jfr, which I can send if needed.
Please advise,
All the best,
Carlos