Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch failure of rename when archiving to mwacache boxes #15

Open
gsleap opened this issue Sep 16, 2022 · 0 comments
Open

Catch failure of rename when archiving to mwacache boxes #15

gsleap opened this issue Sep 16, 2022 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@gsleap
Copy link
Member

gsleap commented Sep 16, 2022

This is a very rare condition which occurred once only.

mwax02 was processing a file (1332129440_20220324035702_ch110_000.fits). It had transferred it to mwacache01 with a temp filename (as it should). It was then about to issue an ssh 'mv' command on mwacache01, but before it could complete the remote mv command, the mwax_subfile_distributor was told to shutdown. This somehow caused the result of the mv to be counted as successful and so the file was deleted from the mwax host. For several months the partial file sat on mwacache01. Then at some point mwacache01 was restarted, causing it to see the partial file and, seeing it was so old, deleting it (this is expected behaviour).

So this issue is to ensure an interruption does not result in success and a delete, but instead a fail, and this should leave the file on the mwax02 server, and upon restart it will be archived again successfully. The old partial file will stay on mwacache01 until it is removed upon restart.

2022-03-24 11:57:25,586, DEBUG, work_vis_outgoing, Executing /usr/local/bin/xrdcp --cksum adler32 --posc --rm-bad-cksum --silent --streams 2 --tlsnodata /visdata/outgoing/1332129440_20220324035702_ch110_000.fits xroot://192.168.120.101://volume2/incoming/1332129440_20220324035702_ch110_000.fits.partce1a555b-a1ee-477b-aaaf-3b082292337c...
2022-03-24 11:57:27,321, INFO, work_vis_outgoing, /visdata/outgoing/1332129440_20220324035702_ch110_000.fits archive_file_xrootd success (0.543GB in 1.735 seconds at 2.505 Gbps)
2022-03-24 11:57:27,321, DEBUG, work_vis_outgoing, Executing ssh [email protected] 'mv //volume2/incoming/1332129440_20220324035702_ch110_000.fits.partce1a555b-a1ee-477b-aaaf-3b082292337c //volume2/incoming/1332129440_20220324035702_ch110_000.fits'...

2022-03-24 11:57:27,829, WARNING, MainThread, Interrupted. Shutting down 2 processors...

2022-03-24 11:57:27,829, INFO, MainThread, Watcher stopping on /dev/shm/mwax/*.sub...
2022-03-24 11:57:27,829, DEBUG, MainThread, Watcher watch_sub Stopping...
2022-03-24 11:57:27,830, DEBUG, MainThread, Watcher watch_sub Stopped
2022-03-24 11:57:27,830, DEBUG, MainThread, QueueWorker work_sub Stopping...
2022-03-24 11:57:27,830, INFO, work_vis_outgoing, /visdata/outgoing/1332129440_20220324035702_ch110_000.fits archive_file_xrootd successfully renamed //volume2/incoming/1332129440_20220324035702_ch110_000.fits.partce1a555b-a1ee-477b-aaaf-3b082292337c to //volume2/incoming/1332129440_20220324035702_ch110_000.fits on the remote host 192.168.120.101
2022-03-24 11:57:27,831, DEBUG, work_vis_outgoing, /visdata/outgoing/1332129440_20220324035702_ch110_000.fits- archive_handler() Deleting file
2022-03-24 11:57:27,874, INFO, work_vis_outgoing, /visdata/outgoing/1332129440_20220324035702_ch110_000.fits- file deleted
2022-03-24 11:57:27,875, INFO, work_vis_outgoing, /visdata/outgoing/1332129440_20220324035702_ch110_000.fits- archive_handler() Finished
2022-03-24 11:57:27,875, INFO, work_vis_outgoing, Complete. Queue size: 0 Elapsed: 2.29 sec
2022-03-24 11:57:27,974, DEBUG, MainThread, QueueWorker work_sub Stopped
2022-03-24 11:57:27,974, INFO, MainThread, Watcher stopping on /voltdata/incoming/*.sub...
2022-03-24 11:57:27,974, INFO, MainThread, Watcher stopping on /visdata/incoming/*.fits...
2022-03-24 11:57:27,975, INFO, MainThread, Watcher stopping on /visdata/processing_stats/*.fits...
2022-03-24 11:57:27,975, INFO, MainThread, Watcher stopping on /voltdata/outgoing/*.sub...
2022-03-24 11:57:27,975, INFO, MainThread, Watcher stopping on /visdata/outgoing/*.fits...
2022-03-24 11:57:27,975, DEBUG, MainThread, Watcher watch_volt_incoming Stopping...

Later... when mwacache_archiver gets restarted...

mwa@mwacache01:~$ cat logs/old/* | grep 1332129440_20220324035702_ch110_000.fits
2022-05-11 15:10:58,453, WARNING, MainThread, Partial file /volume2/incoming/1332129440_20220324035702_ch110_000.fits.partce1a555b-a1ee-477b-aaaf-3b082292337c is older than 3600 seconds and will be removed...
2022-05-11 15:10:58,454, WARNING, MainThread, Partial file /volume2/incoming/1332129440_20220324035702_ch110_000.fits.partce1a555b-a1ee-477b-aaaf-3b082292337c deleted
@gsleap gsleap added the bug Something isn't working label Sep 16, 2022
@gsleap gsleap self-assigned this Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant