starting visit in a directory with 200K files is slow #19473

Open
cyrush opened this issue Apr 30, 2024 · 22 comments

@cyrush
Member

cyrush commented Apr 30, 2024

Describe the bug

A user reports that when starting VisIt in a directory with 200K files on a GPFS file system, launching the GUI hangs and takes quite some time.

Pretty sure it's listing those files, and there may not be much we can do to avoid that cost.

cyrush added the labels bug (Something isn't working), likelihood medium (Neither low nor high likelihood), and impact medium (Productivity partially degraded (not easily mitigated bug) or improved (enhancement)) on Apr 30, 2024

@JustinPrivitera
Member

Do they have to start it there?

@markcmiller86
Member

> Pretty sure it's listing those files, and there may not be much we can do to avoid that cost.

I think we could be smarter about this. I think we may be calling stat() on files in addition to iterating over a readdir() thing. I don't think we need to populate a GUI with thousands of file names unless the user explicitly requests that. So, we could set a threshold of the number of files we try to scan.
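A minimal sketch of the threshold idea, not VisIt's actual mdserver code: it iterates with readdir() only and stops once a caller-supplied limit is reached, so a huge directory costs at most `limit` entries of work. The function name `count_entries_capped` and its parameters are hypothetical, introduced here only for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Count the entries in `path`, stopping once `limit` entries have been seen.
 * Returns the number counted (at most `limit`), or -1 if the directory
 * cannot be opened. `*truncated` is set to 1 when the scan stopped early.
 * Hypothetical helper: the name and signature are not from VisIt. */
long count_entries_capped(const char *path, long limit, int *truncated)
{
    assert(path != NULL && truncated != NULL);
    *truncated = 0;

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip the "." and ".." pseudo-entries. */
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;
        if (count == limit) {
            /* At least one more real entry exists beyond the limit. */
            *truncated = 1;
            break;
        }
        count++;
    }
    closedir(dir);
    return count;
}
```

A caller could show the first `limit` names and, when `truncated` is set, offer a "list everything" action instead of populating the GUI up front.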

I'd bet other software doesn't take long to launch in such a context.

@biagas
Contributor

biagas commented Apr 30, 2024

Why do we scan at startup? Why not only when the user requests File->Open or any other such command that needs the file list?

@markcmiller86
Member

> Why do we scan at startup? Why not only when the user requests File->Open or any other such command that needs the file list?

Great question and, honestly, I don't know for sure that we do scan on startup. I think we do somewhere early though and, in line with your thoughts, I think we do it unnecessarily. We should do it as you describe and, perhaps, with some limits on what we try to scan.

@markcmiller86
Member

FYI, there are these notes from a 2015 tutorial series.

@cyrush
Member Author

cyrush commented May 1, 2024

Good find!

@cyrush
Member Author

cyrush commented May 1, 2024

User reports that starting in another dir and then browsing to that dir takes about 10 seconds to list the files.

Starting VisIt in that dir takes about 20 minutes to start up.

I wonder if there are other checks going on.

@markcmiller86
Member

> User reports that starting in another dir and then browsing to that dir takes about 10 seconds to list the files.

Still concerned that this user seems to be working in a way that suggests they think this approach is "normal". I thought all of our codes now put .root files one dir above the dir where all the domain-level files are. In which case, they should never have to descend into a directory with 200K+ files.

@markcmiller86
Member

I did some quick tests on my laptop with a dir with 20,000 .silo and .txt files in it. Launching from within the dir (using a terminal command) and browsing to the dir from a click-the-icon launch showed little difference in performance. Less than a 2 second delay in both cases. Though I think with file grouping set to off, it's a little worse.

@markcmiller86
Member

Is there any config setting the user could have that may exacerbate issues?

@markcmiller86
Member

I added a dtruss option to internallauncher (dtruss is the macOS version of strace) to capture whether it's looking at a ton of files.

./bin/visit -dtruss -f -t getdirentries64 mdserver

Didn't see any attempts to scan file contents. But, on macOS, we may be able to get regular file vs. directory entry back from a dirent object instead of having to stat the files. On Lustre, I think we wind up actually stat()ing files to determine if they are regular files or directory entries. I think that is the crux of what is going on.

From ChatGPT...

In general, when using readdir() to read directory entries in Unix-like systems, including file systems like Lustre, the information returned by readdir() is encapsulated in a structure of type struct dirent. This structure primarily includes the name of the file and an inode number, but it does not typically include complete metadata about the file, such as whether it is a regular file, a directory, or some other type of file.

The struct dirent provided by readdir() typically includes:

  • d_name: the name of the file.
  • d_ino: the inode number.
  • d_type (in some implementations): an indicator of the file type.

The d_type field, which is available in many modern Unix-like systems including Linux and some BSD variants (and potentially in Lustre if configured with this feature), can tell you the type of the file (e.g., regular file, directory, symbolic link, etc.) without needing to perform a stat() call. The d_type field can have values like DT_REG for regular files and DT_DIR for directories.

Using d_type in Lustre

  1. Check for d_type Support: Not all file systems or system configurations provide the d_type field in struct dirent. You should first verify if your particular setup of Lustre supports it. This support can depend on the specific version and configuration of the file system.

  2. Efficient File Type Detection: If d_type is supported and available in your implementation, you can use it to check the type of each entry directly from the result of readdir(), which is more efficient than calling stat() for each entry. Here's a basic example in C to illustrate how you might use this:

    #include <dirent.h>
    #include <stdio.h>
    
    int main() {
        DIR *dir;
        struct dirent *entry;
    
        dir = opendir("."); // Open the current directory
        if (dir == NULL) {
            perror("Failed to open directory");
            return 1;
        }
    
        while ((entry = readdir(dir)) != NULL) {
            printf("%s: ", entry->d_name);
            switch (entry->d_type) {
                case DT_REG:
                    printf("Regular file\n");
                    break;
                case DT_DIR:
                    printf("Directory\n");
                    break;
                default:
                    printf("Other\n");
            }
        }
    
        closedir(dir);
        return 0;
    }
  3. Fallback to stat() if Necessary: If d_type is not supported, you will need to fall back on using stat() to determine the file type.

Conclusion

In summary, while readdir() itself doesn't return complete metadata about files, the d_type field in the dirent structure (if available) can provide immediate file type information. This is beneficial for file systems like Lustre, where minimizing the overhead of metadata operations can significantly impact performance. Always confirm the specific capabilities and configuration of your file system to make optimal use of available features.

@markcmiller86
Member

I think the issue is that Lustre doesn't support the d_type entry and we are forced to stat() to determine a directory entry's state.

@markcmiller86
Member

markcmiller86 commented May 2, 2024

Here is what we could do. In cases where the size of a directory is above K entries (or K entries falling into some common pattern naming), we could opt to assume all entries are regular files and avoid the stat(). But, what do we do then if an entry is really a directory? We could offer the user an option (perhaps with right clicking) to try to treat the entry as a directory and then do the stat() call to confirm that and if so, descend into it. Or, we just treat all double-clicks on entries as possible attempts to descend into a dir before we treat them as an open. Or, maybe a shift-click on an entry means to try to treat it as a directory.

@markcmiller86
Member

So, I am wrong. I just tested the above d_type testing code on both Lustre (quartz) and GPFS (lassen) and d_type is supported there. I created a dir with 20,000 files and launched VisIt from it and it opens quickly and fine.

@cyrush
Member Author

cyrush commented May 2, 2024

I think this case was on GPFS, which might not be able to handle the metadata ops well.

@markcmiller86
Member

> I think this case was on GPFS, which might not be able to handle the metadata ops well.

Well, as I mention above, I tested both GPFS (IBM Spectrum Scale now) and Lustre on a dir 10% the size (in terms of inodes anyways) and saw nothing approaching even 10 seconds worth of delay.

@markcmiller86
Member

Was this on SCF by any chance?

@cyrush
Member Author

cyrush commented May 2, 2024

Yes

@markcmiller86
Member

Ok, I'll perform some similar tests on SCF next Tuesday.

markcmiller86 self-assigned this May 2, 2024
markcmiller86 added this to the 3.4.2 milestone May 2, 2024

@markcmiller86
Member

@cyrush do we know what version of VisIt this was?

@markcmiller86
Member

Ok, I created a dir on /p/gpfs on SCF with 100,000 silo files, all with random names and no extensions. Launching VisIt there took about 1-2 minutes for the GUI to pop up.

@markcmiller86
Member

Ok, I've taken a closer look with strace, -timing, -debug 5 and with some simple test codes I tried.

I created a dir with 100,000 temp symlinks (random names) all pointing to a single, 30 MB, silo file. The situation with VisIt doing stat()s is more complicated than I initially thought. We are definitely stat()ing all the files. AND, we really need to fix that.

However, 100,000 stat() calls from VisIt take just a few seconds in most situations. There was a situation where /p/gpfs1 seemed unbearably slow, but that was rare and I believe related to load created from other activity on the system. I did an ls -l tmp* (which I think must do some kind of stat() also) and that completes in < 2 seconds. I did a loop calling the stat shell command...that takes 7 minutes.

I do not think the delay in VisIt startup is related to mdserver getting the current file list. From what I can tell, the mdserver starts quickly but the GUI splash screen seems to hang after printing the message "Creating plugin windows". So, whatever is delaying the startup is happening sometime after that.

The timings files all show very small values for each activity and then end with a total time of ~100 seconds.


4 participants