Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive handler error: matching $format: read /tmp/tmp_archive123456: is a directory #2426

Open
rgmz opened this issue Feb 12, 2024 · 3 comments
Labels

Comments

@rgmz
Copy link
Contributor

rgmz commented Feb 12, 2024

Please review the Community Note before submitting

TruffleHog Version

3.67.5

Trace Output

2024-02-11T20:04:34-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "Ap0tp", "repo": "https://github.com/Mzack9999/subnet.git", "commit": "4bcb643", "path": "src/golang.org/x/tools/go/gcimporter15/testdata/versions/test_go1.7_1.a", "timeout": 30, "error": "matching bz2: read /tmp/tmp_archive3933205369: is a directory"}

Expected Behavior

The archive handler should not call archive.Identify on directories.

Actual Behavior

Frequent errors like matching bz2: read /tmp/tmp_archive3933205369: is a directory are printed to the console.

Steps to Reproduce

Scan https://github.com/Mzack9999/subnet.git.

Environment

N/A

Additional Context

The error comes from the underlying mbolt/archiver library, specifically the Indentify function.

So far I have only encountered this with .a files. I haven't seen it with .rpm files, but I also haven't explicitly tested it with those either. This may be a slightly different problem from #2071 or the fix was incomplete. I am far too sick to properly investigate at the moment (sorry).

Edit: I've also encountered it with .rlib and .rpm files.

2024-02-11T22:40:42-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "MugHm", "repo": "https://github.com/bevyengine/bevy.git", "commit": "a5a7edf", "path": "crates/glsl-to-spirv/target/debug/deps/libtypenum-fb88411ee80f9742.rlib", "timeout": 30, "error": "matching sz: read /tmp/tmp_archive3144325389: is a directory"}
...
2024-02-12T10:17:06-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "rMZzm", "repo": "https://github.com/elastic/beats.git", "commit": "815ef78", "path": "dev-tools/vendor/github.com/cavaliercoder/go-rpm/testdata/epel-release-7-5.noarch.rpm", "timeout": 30, "error": "matching lz4: read /tmp/tmp_archive1073985454: is a directory"}

Edit2: stack trace from inserting a manual panic

panic: matching xz: read /tmp/tmp_archive3262228446: is a directory

goroutine 1018 [running]:
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).openArchive(0xc0214637d0, {0x3ee7680?, 0xc0212bc630}, 0x0, {0x3eac220, 0xc01f4cc178}, 0xc01f497bc0)
        /tmp/trufflehog/pkg/handlers/archive.go:114 +0x59b
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile.func1()
        /tmp/trufflehog/pkg/handlers/archive.go:86 +0x1da
created by github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile in goroutine 574
        /tmp/trufflehog/pkg/handlers/archive.go:81 +0xf6

References

@rgmz rgmz added the bug label Feb 12, 2024
@rgmz
Copy link
Contributor Author

rgmz commented Feb 13, 2024

@ahrav @bugbaba The "is a directory" error is surprisingly convoluted, but I think I've narrowed down the cause. #2178 didn't fix the root cause, only reduced the likelihood of it happening.

  1. handlers.processHandler calls Archive.HandleSpecialized
    func processHandler(ctx logContext.Context, h Handler, reReader *diskbufferreader.DiskBufferReader, chunkSkel *sources.Chunk, reporter sources.ChunkReporter) bool {
    if specialHandler, ok := h.(SpecializedHandler); ok {
    file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader)
  2. HandleSpecialized calls either extractDebContent or extractRpmContent
    reader, err = a.extractDebContent(ctx, reader)
  3. extractDebContent/extractRpmContent creates a directory in /tmp/ and extracts the contents to it
    tmpEnv, err := a.createTempEnv(ctx, file)
    if err != nil {
    return nil, err
    }
    defer os.Remove(tmpEnv.tempFileName)
    defer os.RemoveAll(tmpEnv.extractPath)
    cmd := exec.Command("ar", "x", tmpEnv.tempFile.Name())
  4. extractDebContent/extractRpmContent then calls handleExtractedFiles
    dataArchiveName, err := a.handleExtractedFiles(ctx, tmpEnv, handler)
  5. handleExtractedFiles iterates through each file in the temporary directory and tries to assign a value to dataArchiveName
    var dataArchiveName string
    for _, file := range extractedFiles {
    filename := file.Name()
    filePath := filepath.Join(env.extractPath, filename)
    fileInfo, err := os.Stat(filePath)
    if err != nil {
    return "", fmt.Errorf("unable to get file info for filename %s: %w", filename, err)
    }
    if fileInfo.IsDir() {
    continue
    }
    name, err := handleFile(ctx, env, filename)
    if err != nil {
    return "", err
    }
    if name != "" {
    dataArchiveName = name
    break
    }
  6. Each iteration calls the handleFile func (directories are skipped)
    handler := func(ctx logContext.Context, env tempEnv, file string) (string, error) {
    if strings.HasPrefix(file, "data.tar.") {
    return file, nil
    }
    return a.handleNestedFileMIME(ctx, env, file)
    }
  7. The handleFile func calls handleNestedFileMIME
  8. handleNestedFileMIME returns an empty string due to 4 possible cases
    nestedFile, err := os.Open(filepath.Join(tempEnv.extractPath, fileName))
    if err != nil {
    return "", err
    }

    mimeType, reader, err := determineMimeType(nestedFile)
    if err != nil {
    return "", fmt.Errorf("unable to determine MIME type of nested filename: %s, %w", nestedFile.Name(), err)
    }

    switch mimeType {
    case arMimeType, rpmMimeType:
    _, _, err = a.HandleSpecialized(ctx, reader)
    default:
    return "", nil

    if err != nil {
    return "", fmt.Errorf("unable to extract file with MIME type %s: %w", mimeType, err)
    }
  9. No value is assigned to dataArchiveName, so handleExtractedFiles returns ""
    return dataArchiveName, nil
  10. extractDebContent/extractRpmContent returns openDataArchive, which opens the path /tmp/generatedDirName + dataArchiveName ("") and returns a io.ReadCloser.
    return openDataArchive(tmpEnv.extractPath, dataArchiveName)
  11. extractDebContent/extractRpmContent returns that io.ReadCloser to HandleSpecialized
  12. HandleSpecialized returns that io.ReadCloser to handlers.processHandler.
  13. processHandler calls archive.FromFile
    file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader)
    if isSpecial {
    return handleChunks(ctx, h.FromFile(ctx, file), chunkSkel, reporter)
    }
  14. FromFile calls openArchive
    func (a *Archive) FromFile(originalCtx logContext.Context, data io.Reader) chan []byte {
    if a.skipArchives {
    return nil
    }
    archiveChan := make(chan []byte, defaultBufferSize)
    go func() {
    ctx, cancel := logContext.WithTimeout(originalCtx, maxTimeout)
    logger := logContext.AddLogger(ctx).Logger()
    defer cancel()
    defer close(archiveChan)
    err := a.openArchive(ctx, 0, data, archiveChan)
  15. openArchive calls mbolt/archiver.Identify
    func (a *Archive) openArchive(ctx logContext.Context, depth int, reader io.Reader, archiveChan chan []byte) error {
    if common.IsDone(ctx) {
    return ctx.Err()
    }
    if depth >= maxDepth {
    return fmt.Errorf(errMaxArchiveDepthReached)
    }
    format, arReader, err := archiver.Identify("", reader)
  16. Identify goes "wtf, this is a directory and not a file"
    https://github.com/mholt/archiver/blob/81f9e06b11ad6ba424f8311c0bc18ceb01f2b67a/formats.go#L53

What does all this mean?

I don't know. I understand how this happens, but not why nor what the proper fix is. The obvious solution is to throw in another IsDir() check somewhere, but at this point I'm not sure if the contents of these specialized archives are actually being scanned. (e.g., I don't understand why handleExtractedFiles iterates through each file but only returns a single value for dataArchiveName.)

It seems to me that the core design issue is that archive.FromFile assumes it's being given a file, but archive.HandleSpecialized decompresses archives into a directory first.

file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader)
if isSpecial {
return handleChunks(ctx, h.FromFile(ctx, file), chunkSkel, reporter)
}

@ahrav
Copy link
Collaborator

ahrav commented Mar 5, 2024

I think we are planning on re-working the archive handler in general. There is too much recursion for my liking 😅 Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

@rgmz
Copy link
Contributor Author

rgmz commented Apr 7, 2024

There is too much recursion for my liking 😅

Tbh, I'm not sure how you would handle directories without recursion. The logic could definitely be cleaner.

Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

Perhaps this is a naive suggestion, but would modifying FromFile to accept both files and directories (i.e., walk them) fix the issue? It would avoid a large rewrite of the current logic/API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants