-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rebased LOFAR parallelism #372
Conversation
3d4b060
to
95e6e31
Compare
As mentioned in the last xradio meeting, we can probably come up with a better name than Something that would either suggest that:
One thing to note is that there is already an option called if lofar == True and parallel == True:
logger.warn(
"Unsupported config. `lofar` and `parallel` both true isn't supported.\nSwitching off `lofar` mode"
)
lofar = False This strongly suggests that |
If changing the meaning/type of
|
I think Peter's suggestion of I was about to write a suggestion similar to @Jan-Willem would it be worth deprecating the |
@Jan-Willem I've just realised that most of the usual checks haven't run because this is a PR from a fork. You might want to change the other checks to "on: pull_request" too |
I have added I have no objections to changing parallel to parallelism_mode. A check will need to be added for when parallelism_mode='time' to ensure that there is a single DDI and observation present in the MSv2 being converted. The dask documentation recommends not calling delayed within delayed functions (https://docs.dask.org/en/latest/delayed-best-practices.html). If we go with just read_col_conversion_dask() and parallel_mode is not 'time' then we end up with the dask.delayed(convert_and_write_partition) in measurement_set/convert_msv2_to_processing_set.py containing a map_blocks (a delayed object). In the best practices, they say, "While this may actually work, it’s usually slow and results in hard-to-understand solutions." so, we might be able to get away with this. I am planning on doing some testing. @sjperkins, do you have any additional insight? |
Within my own code-bases, I've exclusively used dask.array.blockwise to generate I/O reads, rather than combining it with |
@Jan-Willem Just to 100% clarify, this would mean the updated interface would only have a |
@v-morello yes, only one parameter called parallelism_mode. |
cccc7d1
to
807f688
Compare
The `if is_single_dish` clause was only being called when `overwrite=False`. This is undesired behaviour
a611380
to
4640868
Compare
Ready for second review Commit c361da2 contains a bug fix that's independent of the parallelism changes |
It looks good to me. If there are no objections I will merge it today. |
Adds a new method
read_col_conversion_dask
that allows larger than memory columns to be converted. It also provides speed-ups for columns that comfortably fit in memory. Various changes:TableManager
class has been added so that multi-thread/process conversion can happen without having to serialize casacore table objects. This replacesopen_table_ro
andopen_query
inconvert_and_write_partition
.read_col_conversion_dask
uses dask'smap_blocks
to create tasks for each chunk of aDataArray
which reads data from a MSv2 column and reshapes itThis has been used to convert 9TB of lofar data in ~4.5 hours using a compute node with ~100GB of system memory. Previously this was impossible unless a compute node with >9TB of memory could be procured
Notes:
lofar=False
andlofar=True
have been validated against one another. They are identical.lofar=True
andparallel=True
, it defaults toparallel
to adhere to dask best practices.read_col_conversion
to improve maintainability.