starting visit in a directory with 200K files is slow #19473

cyrush · 2024-04-30T20:26:30Z

Describe the bug

A user reports that starting VisIt in a directory with 200K files on a GPFS file system hangs while launching the GUI and takes quite some time.

Pretty sure it's listing those files, and there may not be much we can do to avoid that cost.

JustinPrivitera · 2024-04-30T21:07:21Z

Do they have to start it there?

markcmiller86 · 2024-04-30T21:09:56Z

> Pretty sure it's listing those files, and there may not be much we can do to avoid that cost.

I think we could be smarter about this. I think we may be calling stat() on files in addition to iterating over a readdir() thing. I don't think we need to populate a GUI with thousands of file names unless the user explicitly requests that. So, we could set a threshold of the number of files we try to scan.
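A minimal sketch of the threshold idea, assuming a hypothetical helper `scan_dir_capped` and a caller-chosen `limit` (neither is existing VisIt code). It walks the directory with readdir() only, never calls stat(), and stops as soon as more than `limit` entries are found, so the caller could show a truncated list or ask the user before scanning further.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Hypothetical helper (not VisIt code): count the entries of dir_path using
 * readdir() only, stopping once `limit` entries have been counted and at
 * least one more exists. Returns the number counted (at most `limit`), or
 * -1 if the directory cannot be opened. *truncated is set to 1 when the
 * scan stopped early. "." and ".." are counted if readdir() returns them. */
long scan_dir_capped(const char *dir_path, long limit, int *truncated)
{
    assert(dir_path != NULL && truncated != NULL);
    *truncated = 0;

    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    long seen = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (seen == limit) {      /* one entry past the cap: stop here */
            *truncated = 1;
            break;
        }
        seen++;
    }

    closedir(dir);
    return seen;
}
```

A caller would pick `limit` from a preference and only populate the file list in full when `truncated` comes back 0 or the user explicitly asks for everything.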

I'd bet other software doesn't take long to launch in such a context.

biagas · 2024-04-30T21:16:03Z

Why do we scan at startup? Why not only when the user requests File->Open or any other such command that needs the file list?

markcmiller86 · 2024-04-30T21:21:45Z

> Why do we scan at startup? Why not only when the user requests File->Open or any other such command that needs the file list?

Great question and, honestly, I don't know for sure that we do scan on startup. I think we do somewhere early though and, in line with your thoughts, I think we do it unnecessarily. We should do it as you describe and, perhaps, with some limits on what we try to scan.

markcmiller86 · 2024-05-01T22:38:50Z

FYI, there are these notes from a 2015 tutorial series.

cyrush · 2024-05-01T22:43:07Z

Good find!

cyrush · 2024-05-01T22:50:25Z

user reports that starting in another dir and then browsing to that dir takes like 10 seconds to list the files.

Starting VisIt in that dir takes like 20 min to start up.

I wonder if there are other checks going on.

markcmiller86 · 2024-05-01T23:06:48Z

> user reports that starting in another dir and then browsing to that dir takes like 10 seconds to list the files.

Still concerned that this user seems to be working in a way that suggests they think this approach is "normal". I thought all of our codes are now putting .root files one dir above the dir where all the domain-level files are. In which case, they should never have to descend into a directory with >200K files.

markcmiller86 · 2024-05-01T23:33:20Z

I did some quick tests on my laptop with a dir with 20,000 .silo and .txt files in it. Launching from within the dir (using a terminal command) and browsing to the dir from a click-the-icon launch showed little difference in performance: less than a 2 second delay in both cases. Though I think with file grouping set to off, it's a little worse.

markcmiller86 · 2024-05-02T04:42:16Z

Is there any config setting the user could have that may exacerbate issues?

markcmiller86 · 2024-05-02T06:20:08Z

I added a -dtruss option to internallauncher (dtruss is the macOS counterpart of strace) to capture whether it's looking at a ton of files.

./bin/visit -dtruss -f -t getdirentries64 mdserver

Didn't see any attempts to scan file contents. But, on macOS, we may be able to get regular file vs. directory entry back from a dirent object instead of having to stat the files. On Lustre, I think we wind up actually stat()ing files to determine if they are regular files or directory entries. I think that is the crux of what is going on.

From ChatGPT...

In general, when using readdir() to read directory entries in Unix-like systems, including file systems like Lustre, the information returned by readdir() is encapsulated in a structure of type struct dirent. This structure primarily includes the name of the file and an inode number, but it does not typically include complete metadata about the file, such as whether it is a regular file, a directory, or some other type of file.

The struct dirent provided by readdir() typically includes:

d_name: the name of the file.
d_ino: the inode number.
d_type (in some implementations): an indicator of the file type.

The d_type field, which is available in many modern Unix-like systems including Linux and some BSD variants (and potentially in Lustre if configured with this feature), can tell you the type of the file (e.g., regular file, directory, symbolic link, etc.) without needing to perform a stat() call. The d_type field can have values like DT_REG for regular files and DT_DIR for directories.

Using `d_type` in Lustre

Check for d_type Support: Not all file systems or system configurations provide the d_type field in struct dirent. You should first verify if your particular setup of Lustre supports it. This support can depend on the specific version and configuration of the file system.

Efficient File Type Detection: If d_type is supported and available in your implementation, you can use it to check the type of each entry directly from the result of readdir(), which is more efficient than calling stat() for each entry. Here's a basic example in C to illustrate how you might use this:

#include <dirent.h>
#include <stdio.h>

int main() {
    DIR *dir;
    struct dirent *entry;

    dir = opendir("."); // Open the current directory
    if (dir == NULL) {
        perror("Failed to open directory");
        return 1;
    }

    while ((entry = readdir(dir)) != NULL) {
        printf("%s: ", entry->d_name);
        switch (entry->d_type) {
            case DT_REG:
                printf("Regular file\n");
                break;
            case DT_DIR:
                printf("Directory\n");
                break;
            default:
                printf("Other\n");
        }
    }

    closedir(dir);
    return 0;
}

Fallback to stat() if Necessary: If d_type is not supported, you will need to fall back on using stat() to determine the file type.

Conclusion

In summary, while readdir() itself doesn't return complete metadata about files, the d_type field in the dirent structure (if available) can provide immediate file type information. This is beneficial for file systems like Lustre, where minimizing the overhead of metadata operations can significantly impact performance. Always confirm the specific capabilities and configuration of your file system to make optimal use of available features.

markcmiller86 · 2024-05-02T06:21:03Z

I think the issue is that Lustre doesn't support the d_type entry and we are forced to stat() to determine a directory entry's state.

markcmiller86 · 2024-05-02T06:24:43Z

Here is what we could do. In cases where the size of a directory is above K entries (or K entries matching some common naming pattern), we could opt to assume all entries are regular files and avoid the stat(). But what do we do then if an entry is really a directory? We could offer the user an option (perhaps via right-click) to try to treat the entry as a directory, then do the stat() call to confirm that and, if so, descend into it. Or, we just treat all double-clicks on entries as possible attempts to descend into a dir before we treat them as an open. Or, maybe a shift-click on an entry means to try to treat it as a directory.
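A rough sketch of the "assume regular file, confirm on click" idea, with hypothetical names (`lazy_entry`, `resolve_on_activate`) that are not part of VisIt. Every entry starts out as an assumed file, and the single stat() is deferred until the user activates that entry, so a listing of K entries costs zero stat() calls up front.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* What we currently believe about an entry in the file list. */
enum entry_kind {
    KIND_ASSUMED_FILE,   /* never checked; shown as a regular file */
    KIND_FILE,           /* confirmed not a directory */
    KIND_DIR,            /* confirmed directory */
    KIND_MISSING         /* stat() failed (removed, no permission, ...) */
};

/* Hypothetical list item: a path plus its (possibly unverified) kind. */
struct lazy_entry {
    const char *path;
    enum entry_kind kind;
};

/* Called when the user activates an entry (double-click, shift-click, ...).
 * Performs the deferred stat() at most once and caches the answer. */
enum entry_kind resolve_on_activate(struct lazy_entry *e)
{
    assert(e != NULL && e->path != NULL);

    if (e->kind != KIND_ASSUMED_FILE)
        return e->kind;              /* already resolved earlier */

    struct stat sb;
    if (stat(e->path, &sb) != 0)
        e->kind = KIND_MISSING;
    else if (S_ISDIR(sb.st_mode))
        e->kind = KIND_DIR;          /* caller would descend into it */
    else
        e->kind = KIND_FILE;         /* caller would open it */
    return e->kind;
}
```

The caller would descend on `KIND_DIR` and attempt an open on `KIND_FILE`, which matches the "treat every double-click as a possible descend first" variant described above.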

markcmiller86 · 2024-05-02T14:36:32Z

So, I am wrong. I just tested the above d_type testing code on both lustre (quartz) and gpfs (lassen) and d_type is supported there. I created a dir with 20,000 files and launched VisIt from it and it opens quickly and fine.

cyrush · 2024-05-02T20:13:54Z

I think this case was on GPFS, which might not be able to handle the metadata ops well.

markcmiller86 · 2024-05-02T20:55:43Z

> I think this case was on GPFS, which might not be able to handle the metadata ops well.

Well, as I mention above, I tested both GPFS (IBM Spectrum Scale now) and Lustre on a dir 10% the size (in terms of inodes anyways) and saw nothing approaching even 10 seconds worth of delay.

markcmiller86 · 2024-05-02T20:56:05Z

Was this on SCF by any chance?

cyrush · 2024-05-02T21:03:56Z

Yes

markcmiller86 · 2024-05-02T21:58:11Z

Ok, I'll perform some similar tests on SCF next tuesday.

markcmiller86 · 2024-05-07T20:38:49Z

@cyrush do we know what version of VisIt this was?

markcmiller86 · 2024-05-07T21:49:41Z

Ok, I created a dir on /p/gpfs on SCF with 100,000 silo files. All with random names and no extensions. Launching VisIt there took about 1-2 mins for GUI to pop up.

markcmiller86 · 2024-05-08T05:25:57Z

Ok, I've taken a closer look with strace, -timing, -debug 5 and with some simple test codes I tried.

I created a dir with 100,000 temp symlinks (random names) all pointing to a single, 30 MB, silo file. The situation with VisIt doing stat()s is more complicated than I initially thought. We are definitely stat()ing all the files. AND, we really need to fix that.

However, 100,000 stat() calls from VisIt take just a few seconds in most situations. There was a situation where /p/gpfs1 seemed unbearably slow, but that was rare and I believe related to load created by other activity on the system. I did an ls -l tmp* (which I think must do some kind of stat() also) and that completes in < 2 seconds. I ran a loop calling the stat shell command...that takes 7 minutes.
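For anyone wanting to repeat the in-process measurement, here is a small sketch of the kind of test code meant above; the function name `time_readdir_stat` is made up for illustration and this is not the exact code that was run. It times one readdir() pass plus one fstatat() per entry inside a single process. The shell loop is probably so much slower because each iteration starts a separate stat process, which this sketch avoids, though that explanation is an assumption and was not measured here.

```c
#define _DEFAULT_SOURCE   /* expose fstatat() and clock_gettime() on glibc */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>        /* AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: walk dir_path, call fstatat() once per entry, and
 * report the wall-clock time for the whole readdir() + fstatat() loop in
 * *elapsed_sec. Returns the number of entries visited, or -1 if the
 * directory cannot be opened. */
long time_readdir_stat(const char *dir_path, double *elapsed_sec)
{
    assert(dir_path != NULL && elapsed_sec != NULL);
    *elapsed_sec = 0.0;

    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long calls = 0;
    struct dirent *entry;
    struct stat sb;
    while ((entry = readdir(dir)) != NULL) {
        /* AT_SYMLINK_NOFOLLOW makes this behave like lstat(); the result
         * is ignored because only the cost of the call matters here. */
        (void)fstatat(dirfd(dir), entry->d_name, &sb, AT_SYMLINK_NOFOLLOW);
        calls++;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    closedir(dir);

    *elapsed_sec = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return calls;
}
```

Dropping `AT_SYMLINK_NOFOLLOW` makes the call follow symlinks like plain stat(), which may be the more relevant variant for a directory full of symlinks.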

I do not think the delay in VisIt startup is related to mdserver getting the current file list. From what I can tell, the mdserver starts quickly but the GUI splash screen seems to hang after printing the message "Creating plugin windows". So, whatever is delaying the startup is happening sometime after that.

The timings files all show very small values for each activity and then end with a total time of ~100 seconds.

cyrush added the bug, likelihood medium, and impact medium labels Apr 30, 2024

markcmiller86 self-assigned this May 2, 2024

markcmiller86 added this to the 3.4.2 milestone May 2, 2024
