Migration move jobs random flag verification #7550

Open
cfgamboa opened this issue Apr 17, 2024 · 10 comments

Comments

@cfgamboa

Dear all,

As reported today in the Tier1 dev meeting:
Our DMZ pools use migration move jobs to distribute files to TAPE and DISK-ONLY pool groups.
The following are examples of the migration jobs used to move files from DMZ pools to TAPE-like pools in a pool group.

migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE-write

migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write

There are 16 DMZ pools that are enabled/configured in a similar way.

Attached is a picture of the pool monitor; it corresponds to a period in which the DMZ pools are saturated (many TAPE files waiting to be moved to the internal TAPE pool groups).


It is not clear why only a few pools are chosen as destinations by the migration jobs.

This situation was first observed when we used the default setting for the -select parameter.

I was expecting a more evenly distributed allocation of destination pools across the TAPE pool group.

Could you please advise?

All the best,
Carlos

@lemora
Member

lemora commented Apr 17, 2024

Hi Carlos.

Could you please provide more information on these jobs via migration info?
Is there anything logged on the origin pools or in the PoolManager? Were these other pools ever tried?

Thanks.
Lea

@cfgamboa
Author

Hello Lea,

 migration info 179
Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- MCTAPE\-write
State      : SLEEPING
Queued     : 0
Attempts   : 2929
Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
Completed  : 2821 files; 4344642122070 bytes; 100%
Total      : 4344642122070 bytes
Concurrency: 40
Running tasks:
Most recent errors:
08:26:06 [4655] 0000FF579F67CAAF40D7926FBE1A57B40250: File does not exist, skipped
08:26:16 [4660] 00009A193BE526A244ECB444F4A210EC56A1: Transfer to [dc269_10@local] failed (No such file or directory: 00009A193BE526A244ECB444F4A210EC56A1); will not be retried

Carlos

@cfgamboa
Author

@lemora here is an example where the selection goes to one pool:

    Command    : migration move -storage=bnlt1d0:BNLT1D0 -permanent -concurrency=40 -eager -select=random -replicas=1 -target=pgroup -- DATATAPE\-write
    State      : RUNNING
    Queued     : 0
    Attempts   : 1731
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10,dc258_10
    Completed  : 1675 files; 7823832367778 bytes; 98%
    Total      : 7968129878941 bytes
    Concurrency: 40
    Running tasks:
    [16785] 000038B8C629722342969DA89EFF9978416D: TASK.Copying -> [dc258_10@local]
    [16899] 0000AAC5B024EE984B5B8C9C748D0384C90C: TASK.Copying -> [dc258_10@local]
    [16928] 0000550312C67B0F499BA575AE34B0E82E03: TASK.Copying -> [dc258_10@local]
    [16937] 000091015BAE1D12491BB97546EE57906F20: TASK.Copying -> [dc258_10@local]
    [16994] 0000330AEAE002D942B3BCBA526AEDCF96D5: TASK.Copying -> [dc258_10@local]
    [17031] 0000476DD01AE9554AD9A9F1338A983C7F8A: TASK.Copying -> [dc258_10@local]
    [17351] 0000276485F3CE9349648672BCC6E65684BA: TASK.Copying -> [dc258_10@local]
    [17447] 0000F6747705F7CA4946BF641F828ED7007F: TASK.Copying -> [dc258_10@local]
    [17459] 0000132D43F0281D45D0B9481DDFC2F1D790: TASK.Copying -> [dc258_10@local]
    [17472] 0000156D3D980FB744CB85AF804115C5BD8E: TASK.Copying -> [dc258_10@local]
    [17651] 00005430A7E0A45F479DA1C7E0E3C4F80338: TASK.Copying -> [dc258_10@local]
    [17930] 00005EE96E2A319644B6B0152F19A9DD8790: TASK.Copying -> [dc258_10@local]
    [18300] 0000AE20E0EDEC8D4EC08538C148ED24A892: TASK.Copying -> [dc258_10@local]
    [18617] 00001390D214813F44449AFCFD9D9B855EDC: TASK.Copying -> [dc253_10@local]
    [18752] 00003BD558A54ADA430C81FBE2AAB170042B: TASK.Copying -> [dc258_10@local]
    [18764] 000011532DE9D5FA468E8516C38055CB6DD5: TASK.Copying -> [dc258_10@local]
    [18993] 00002D3A5F89BF85471BA65B6873D4C9B8C5: TASK.Copying -> [dc258_10@local]
    [19047] 00007415AE95AD1A4F649E83CDD9BD6FB8F7: TASK.Copying -> [dc258_10@local]
    [19125] 000029FC4B8489D2476FAD4EB078DE636875: TASK.Copying -> [dc258_10@local]
    [19171] 00002E6EF98D3FEA45B7B2ECC957866F22AA: TASK.Copying -> [dc258_10@local]
    [19257] 0000BCBBB24C9AD147AFB847CE43AA2E7327: TASK.Copying -> [dc253_10@local]
    [19293] 0000588897B48404491E9D2289658255D90C: TASK.Copying -> [dc253_10@local]
    [19329] 0000ECB0152EEDF34A4C9FB38E0DC5CDFF24: TASK.Copying -> [dc253_10@local]
    [19336] 0000C263D49A3AC74CC4B6F37E12A99F9F8D: TASK.Copying -> [dc258_10@local]

Many migration jobs select the same pool.


@cfgamboa
Author

Only when I cancelled the ongoing migration that was stuck on the hot pool and excluded the hot pool from the migration job destinations did the destination pools for the transfers start to become more diverse.

    Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=40 -eager -select=random -replicas=1 -exclude=dc258_10 -target=pgroup -- MCTAPE\-write
    State      : RUNNING
    Queued     : 380
    Attempts   : 103
    Targets    : dc266_10,dc263_10,dc245_10,dc268_10,dc269_10,dc249_10,dc246_10,dc267_10,dc248_10,dc264_10,dc265_10,dc259_10,dc260_10,dc261_10,dc254_10,dc253_10,dc252_10,dc270_10,dc255_10
    Completed  : 63 files; 108070950853 bytes; 6%
    Total      : 1766407491660 bytes
    Concurrency: 40
    Running tasks:
    [20366] 0000FF68EBD77BA84E39B94455CC0B90DF0A: TASK.Copying -> [dc254_10@local]
    [20368] 00003467F90D4D524A488AC8EC789E18C780: TASK.Copying -> [dc246_10@local]
    [20370] 0000580E1B24CF104896A7C8F5D03DDA3CDA: TASK.Copying -> [dc254_10@local]
    [20373] 00009ECC758E07F840E182D55601B52023AA: TASK.Copying -> [dc266_10@local]
    [20374] 000009980CDD21E840D39B7AE2DD21A4F49C: TASK.Copying -> [dc254_10@local]
    [20375] 0000D64A5282E0AD499D90A3036B3D685FFD: TASK.Copying -> [dc249_10@local]
    [20379] 0000352929A645D2463C8518E351980B498B: TASK.Copying -> [dc249_10@local]
    [20381] 00005FC09C36DB52460092C2F963589CC22E: TASK.Copying -> [dc253_10@local]
    [20393] 0000D5BF282B5C864C729F927122C279F551: TASK.Copying -> [dc264_10@local]
    [20399] 0000D276AB59EE5C4036B249AAFBD503EE0C: TASK.Copying -> [dc253_10@local]
    [20404] 00003DECF626706E41AF861BFF261AB69EAC: TASK.Copying -> [dc259_10@local]
    [20407] 00007417BDEE533C4E449DF48EA5C64F3469: TASK.Copying -> [dc249_10@local]
    [20411] 0000FAC0B45CF886425EAE11BE8F64672F69: TASK.Copying -> [dc254_10@local]
    [20413] 0000626112F23C7A4362B9F184C907A70C6E: TASK.Copying -> [dc254_10@local]
    [20426] 0000F9B07024EA5F4765883F4D9BCECE51C2: TASK.Copying -> [dc255_10@local]
    [20428] 0000E91CF3FEA9104B96AD086714F246EA23: TASK.Copying -> [dc254_10@local]
    [20435] 000042579D86E46B4E0290D34A723BA4AC46: TASK.Copying -> [dc265_10@local]
    [20436] 00007E30937DF40F432BBE6B3164C2AEACFF: TASK.Copying -> [dc254_10@local]
    [20437] 00005BCEE90C79DD46359F2A7AD05398585D: TASK.Copying -> [dc268_10@local]
    [20438] 00006E2857CB031042788240A9C7B45F85DB: TASK.Copying -> [dc245_10@local]
    [20449] 0000ABF2340E29364D09831942BF148445C5: TASK.Copying -> [dc254_10@local]
    [20453] 000067DE01BD61BC487CA29E91AA53E4958C: TASK.Copying -> [dc266_10@local]
    [20454] 000000D23598EEA2447996B47C0660E30B26: TASK.Copying -> [dc253_10@local]
    [20456] 0000091F04719AFC4984A1DA08753086629B: TASK.Copying -> [dc264_10@local]
    [20458] 0000CBAF5B648EEC4EA481366A8B87543CEF: TASK.Copying -> [dc254_10@local]
    [20460] 0000A9E9C70532074B918B264289E5039DAF: TASK.Copying -> [dc254_10@local]
    [20461] 00004E9436A3151142C6B2C4F3F31CA2DB1B: TASK.Copying -> [dc253_10@local]
    [20463] 00000D635C6F2DBA488EAB74383AD976E361: TASK.Copying -> [dc252_10@local]
    [20464] 00009DC49ADFF6724B56BF306C26F117E626: TASK.Copying -> [dc264_10@local]
    [20466] 0000F74DCE54C5A74217B31AC75F109B0E61: TASK.Copying -> [dc263_10@local]
    [20467] 0000893BA234596A4575B837A0ADDD4A45E7: TASK.Copying -> [dc270_10@local]
    [20468] 0000083A69109D5D4DA6BB5AF9B06F2C3CCA: TASK.Copying -> [dc268_10@local]
    [20471] 000094EBE42BE8174FD7B2079B967032FC06: TASK.Copying -> [dc261_10@local]
    [20472] 0000F1745EF967904B318C9595AF24BD6527: TASK.Copying -> [dc260_10@local]
    [20476] 0000267757BADBCB4BC9AACD99196F606619: TASK.Copying -> [dc248_10@local]
    [20477] 000018D1DEC8F9A1472F8A93B70F7C3B8C70: TASK.Copying -> [dc245_10@local]
    [20478] 0000B8D1CC202EF74E1EA28E45760D8A72A4: TASK.Copying -> [dc245_10@local]
    [20491] 0000B22DD9631AF14FADB85A59AD701F9A9D: TASK.Copying -> [dc245_10@local]
    [20492] 00006A94B6B915564D6390226D8987B7F95E: TASK.Copying -> [dc267_10@local]
    [20493] 00002FF8EE11B9634689997D73AE2FAABFF5: TASK.Copying -> [dc246_10@local]

@kofemann
Member

kofemann commented Apr 18, 2024 via email

@cfgamboa
Author

The migration from the source does not stop. The problem here is that it keeps choosing the same destination pool; it does not seem to be a purely random process.

@kofemann
Member

@cfgamboa can you check in billing and confirm whether all p2p transfers went into one pool while all the others got less traffic, or whether on average the data distribution is flat?
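
(For illustration only, not an official dCache tool: one way to answer this from the billing files could be a small tally script along the lines of the sketch below. The billing path, the p2p marker string, and the [pool:<name>:...] field layout are assumptions and need to be adapted to the local billing format.)

    #!/usr/bin/env python3
    # Rough sketch: count p2p records per destination pool in billing files,
    # to see whether the distribution over the pool group is flat or skewed.
    # The path, marker string and field layout below are assumptions.
    import glob
    import re
    from collections import Counter

    BILLING_GLOB = "/var/lib/dcache/billing/2024/04/billing-2024.04.*"  # hypothetical path
    P2P_MARKER = "p2p"                                                  # hypothetical marker
    POOL_FIELD = re.compile(r"\[pool:([^:\]]+)")                        # hypothetical field layout

    counts = Counter()
    for path in glob.glob(BILLING_GLOB):
        with open(path, errors="replace") as f:
            for line in f:
                if P2P_MARKER not in line:
                    continue
                m = POOL_FIELD.search(line)
                if m:
                    counts[m.group(1)] += 1

    total = sum(counts.values()) or 1
    for pool, n in counts.most_common():
        print(f"{pool:12s} {n:8d}  {100.0 * n / total:5.1f}%")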

@cfgamboa
Author

cfgamboa commented Apr 23, 2024 via email

@DmitryLitvintsev
Member

DmitryLitvintsev commented Apr 23, 2024

This is the best indication that there is a load pattern that sculpts the initially random distribution.
Do you have other activities on the destination pools? That may sculpt the initially random distribution, whereas not specifying random takes pool load (and space) into account.

(An example of sculpting: a slow pool will seem to be "attracting" many transfers when pools are selected randomly.)
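
(To make the sculpting effect concrete, here is a toy simulation, purely illustrative and not dCache code; the pool names, speeds, concurrency and file count are assumptions loosely modelled on the job above, and it assumes -select=random picks a target uniformly. With a fixed number of concurrent tasks, a pool that is ten times slower ends up holding roughly ten times as many of the running tasks as any other pool at a given instant, even though every pool is started equally often, similar to what the Running tasks snapshot above shows.)

    #!/usr/bin/env python3
    # Toy model of "sculpting": uniform random target selection plus one slow
    # pool. Per-pool start counts stay flat, but the slow pool dominates the
    # set of currently running tasks.
    import random

    POOLS = [f"dc{i}_10" for i in range(245, 265)]  # 20 hypothetical target pools
    SLOW_POOL = "dc258_10"                          # assumed ~10x slower than the rest
    CONCURRENCY = 40
    FILES = 20000

    def duration(pool):
        # Mean transfer time: 1 unit on a healthy pool, 10 units on the slow pool.
        return random.expovariate(1.0) * (10.0 if pool == SLOW_POOL else 1.0)

    clock = 0.0
    running = []                                    # (finish_time, pool) for in-flight tasks
    started = {p: 0 for p in POOLS}

    for _ in range(FILES):
        if len(running) == CONCURRENCY:             # wait for a slot to free up
            running.sort()
            clock, _ = running.pop(0)
        pool = random.choice(POOLS)                 # uniform random target choice
        started[pool] += 1
        running.append((clock + duration(pool), pool))

    snapshot = {}                                   # what "Running tasks" would list
    for _, pool in running:
        snapshot[pool] = snapshot.get(pool, 0) + 1

    print("files started per pool (flat):", sorted(started.values()))
    print("running tasks at this instant:", snapshot)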

@cfgamboa
Author

cfgamboa commented May 8, 2024

Yes, there are other activities at the destination pools. Also, on the DMZ pools there are other migration jobs to other pool groups.
