
Request for Feedback on Parallel I/O Performance with PnetCDF-Python on HPC Cluster #63


Open
wkliao opened this issue Apr 8, 2025 · 8 comments
Labels
question Further information is requested

Comments

@wkliao (Member) commented Apr 8, 2025

On Apr 8, 2025, at 3:59 AM, Sumanth Gopalagowda wrote:

Dear Professor Liao,

I hope this message finds you well.

My name is Sumanth Gopalagowda, ...
As part of my academic project, I am working on computing various deformation measures from snapshots of atomistic simulations.

After computing these measures, I am attempting to write the resulting data to disk in parallel using the PnetCDF-Python interface. However, I am facing performance issues during the I/O stage on our HPC cluster. For instance, writing approximately 72 GB of data takes around 269 seconds on a single node, but the time increases to about 337 seconds when using two nodes. This I/O bottleneck is negating the computational speed-up achieved through parallelization.

I have attached the relevant portion of my code that performs the file writing. I would be extremely grateful if you could take a moment to review it and provide any suggestions or insights on how I might improve its scalability and performance.

Thank you very much for your time and consideration.

Best regards,
Sumanth Gopalagowda

write_out.py.txt

Please let us know what file system you are using (is it a parallel file system?).
If you can edit your short program to add a main function so we can reproduce
the performance numbers, that would be helpful.

wkliao added the question (Further information is requested) label on Apr 8, 2025
@Sumanthbg commented

Hello,
I am using the Panasas file system (panfs). I have also attached the edited code to reproduce the issue.

write_out.py.txt

@wkliao (Member, Author) commented Apr 9, 2025

I was able to run the test program. Thanks.

Since I do not have access to a Panasas system, it would be great if you could test a few
changes to the code and let me know whether they make a difference.
First, can you comment out those ROMIO hints? The data partitioning pattern
used in your test program does not need any of them.

@Sumanthbg commented

Hi, I tried that, and it didn't change the time taken.

@wkliao (Member, Author) commented Apr 10, 2025

I have a few suggestions.

  1. For Panasas, you may need to consult the system administrator, as there may be some
    command-line flags required for using MPI-IO. For now, you can add the prefix "ufs:" to
    the output file name, i.e. file_save = "ufs:test_output_parallel". This bypasses the
    Panasas-specific settings and uses the generic Unix file system (UFS) driver in MPI-IO.
  2. To see whether the problem lies in PnetCDF or MPI-IO, can you try a pure MPI-IO test
    program, such as coll_perf.c?
  3. This one is just an observation. Is it possible to swap the dimensions of "OutArray", i.e. to
        OutArray = np.zeros((10, atoms_per_rank))
    
    Numpy arrays are stored in row-major order, so your test program's use of "OutArray[:,j]"
    results in a noncontiguous I/O buffer. A short sketch illustrating the difference follows
    this list.
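
A minimal sketch of point 3, using NumPy only (the array name OutArray comes from the
attached script, but the sizes here are made up):

    import numpy as np

    atoms_per_rank = 1_000_000

    # Current layout: one row per atom, one column per deformation measure.
    # Slicing a column gives a strided (noncontiguous) view, so every write
    # must first pack it into a temporary buffer.
    OutArray = np.zeros((atoms_per_rank, 10))
    print(OutArray[:, 3].flags['C_CONTIGUOUS'])   # False

    # Suggested layout: one row per deformation measure.  A row slice is a
    # contiguous block of memory and can be passed to the write call directly.
    OutArray = np.zeros((10, atoms_per_rank))
    print(OutArray[3, :].flags['C_CONTIGUOUS'])   # True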

@Sumanthbg commented

I tested the first two points:

  1. Adding the "ufs:" prefix to the file name did not change the timings.
  2. The coll_perf.c benchmark results, for a data file of size 57 GB, are below:
    On 2 nodes:
    Collective write time = 351.039425 sec, Collective write bandwidth = 164.084134 Mbytes/sec
    Collective read time = 229.400698 sec, Collective read bandwidth = 251.089036 Mbytes/sec
    On 1 node:
    Collective write time = 185.402592 sec, Collective write bandwidth = 310.675268 Mbytes/sec
    Collective read time = 73.226014 sec, Collective read bandwidth = 786.605696 Mbytes/sec

@wkliao (Member, Author) commented Apr 10, 2025

Thanks. These timing results show that MPI-IO is not performing as it is supposed to.
Most likely it is an issue with the Panasas file system settings. Please reach out to
your system administrator; you may need to set some environment variables.

I wonder if you can run coll_perf.c with more MPI processes and on more compute nodes.
If the timings fail to improve, that is an indicator that the issue has something to do
with the file system.
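
If rebuilding coll_perf.c on the cluster is inconvenient, a rough stand-in written with
mpi4py's MPI-IO bindings is sketched below. It is not the ROMIO benchmark itself, just a
minimal collective-write test in which each rank writes one contiguous block; the block
size and file name are placeholders to adjust for your system.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nprocs = comm.Get_size()

    block_bytes = 1024 * 1024 * 1024           # 1 GB per rank (adjust as needed)
    buf = np.full(block_bytes // 8, rank, dtype=np.float64)

    fname = "coll_write_test"                  # try with and without the "ufs:" prefix
    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE

    comm.Barrier()
    t0 = MPI.Wtime()

    fh = MPI.File.Open(comm, fname, amode)
    fh.Write_at_all(rank * block_bytes, buf)   # collective write, one block per rank
    fh.Close()

    comm.Barrier()
    t1 = MPI.Wtime()

    if rank == 0:
        total_gb = nprocs * block_bytes / 1024**3
        print(f"{nprocs} ranks wrote {total_gb:.1f} GB in {t1 - t0:.2f} s "
              f"({total_gb / (t1 - t0):.2f} GB/s)")

If the bandwidth reported here also stops scaling beyond one node, that again points at the
file system rather than PnetCDF.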

@wkliao (Member, Author) commented Apr 11, 2025

FYI, there are a few MPI-IO hints for Panasas. See pages 6 and 7 of the ROMIO user guide;
there is an example on page 7. You can try adjusting them to see whether they make any
difference.
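
In case it helps, MPI-IO hints can be passed from PnetCDF-Python through an mpi4py Info
object. The sketch below is only an illustration: the hint names are the PanFS ones listed
in the ROMIO user guide, the values are placeholders you would need to confirm with your
admins, and it assumes the pnetcdf.File constructor accepts comm and info keyword arguments
(adjust to the version you have installed).

    from mpi4py import MPI
    import pnetcdf

    comm = MPI.COMM_WORLD

    # Hint names from the PanFS section of the ROMIO user guide; the values
    # here are placeholders -- check the guide and your system documentation.
    info = MPI.Info.Create()
    info.Set("panfs_concurrent_write", "1")
    info.Set("panfs_layout_type", "2")
    info.Set("panfs_layout_stripe_unit", "1048576")
    info.Set("panfs_layout_total_num_comps", "8")

    # Pass the Info object when creating the file so ROMIO sees the hints.
    f = pnetcdf.File("test_output_parallel", mode="w", comm=comm, info=info)
    # ... define dimensions/variables and write as in write_out.py ...
    f.close()

Alternatively, ROMIO can read hints from a text file named by the ROMIO_HINTS environment
variable, which avoids touching the Python code at all.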

@Sumanthbg commented

Thanks, I am in contact with the system admins. I will let you know if there are any updates.
