Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory` #2426

rgmz · 2024-02-12T01:45:37Z

Please review the Community Note before submitting

TruffleHog Version

3.67.5

Trace Output

2024-02-11T20:04:34-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "Ap0tp", "repo": "https://github.com/Mzack9999/subnet.git", "commit": "4bcb643", "path": "src/golang.org/x/tools/go/gcimporter15/testdata/versions/test_go1.7_1.a", "timeout": 30, "error": "matching bz2: read /tmp/tmp_archive3933205369: is a directory"}

Expected Behavior

The archive handler should not call archive.Identify on directories.

Actual Behavior

Frequent errors like matching bz2: read /tmp/tmp_archive3933205369: is a directory are printed to the console.

Steps to Reproduce

Scan https://github.com/Mzack9999/subnet.git.

Environment

N/A

Additional Context

The error comes from the underlying mbolt/archiver library, specifically the Indentify function.

So far I have only encountered this with .a files. I haven't seen it with .rpm files, but I also haven't explicitly tested it with those either. This may be a slightly different problem from #2071 or the fix was incomplete. I am far too sick to properly investigate at the moment (sorry).

Edit: I've also encountered it with .rlib and .rpm files.

2024-02-11T22:40:42-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "MugHm", "repo": "https://github.com/bevyengine/bevy.git", "commit": "a5a7edf", "path": "crates/glsl-to-spirv/target/debug/deps/libtypenum-fb88411ee80f9742.rlib", "timeout": 30, "error": "matching sz: read /tmp/tmp_archive3144325389: is a directory"}
...
2024-02-12T10:17:06-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "rMZzm", "repo": "https://github.com/elastic/beats.git", "commit": "815ef78", "path": "dev-tools/vendor/github.com/cavaliercoder/go-rpm/testdata/epel-release-7-5.noarch.rpm", "timeout": 30, "error": "matching lz4: read /tmp/tmp_archive1073985454: is a directory"}

Edit2: stack trace from inserting a manual panic

panic: matching xz: read /tmp/tmp_archive3262228446: is a directory

goroutine 1018 [running]:
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).openArchive(0xc0214637d0, {0x3ee7680?, 0xc0212bc630}, 0x0, {0x3eac220, 0xc01f4cc178}, 0xc01f497bc0)
        /tmp/trufflehog/pkg/handlers/archive.go:114 +0x59b
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile.func1()
        /tmp/trufflehog/pkg/handlers/archive.go:86 +0x1da
created by github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile in goroutine 574
        /tmp/trufflehog/pkg/handlers/archive.go:81 +0xf6

References

[Bug] Missing check for directories in extractRpmContent function is causing unable to determine MIME type of nested filename error #2071

The text was updated successfully, but these errors were encountered:

rgmz · 2024-02-13T00:39:48Z

@ahrav @bugbaba The "is a directory" error is surprisingly convoluted, but I think I've narrowed down the cause. #2178 didn't fix the root cause, only reduced the likelihood of it happening.

handlers.processHandler calls Archive.HandleSpecialized

trufflehog/pkg/handlers/handlers.go

Lines 72 to 74 in b69e2c6

 func processHandler(ctx logContext.Context, h Handler, reReader *diskbufferreader.DiskBufferReader, chunkSkel *sources.Chunk, reporter sources.ChunkReporter) bool { 

 if specialHandler, ok := h.(SpecializedHandler); ok { 

 file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader)

HandleSpecialized calls either extractDebContent or extractRpmContent

trufflehog/pkg/handlers/archive.go

Line 328 in b69e2c6

reader, err = a.extractDebContent(ctx, reader)

extractDebContent/extractRpmContent creates a directory in /tmp/ and extracts the contents to it

trufflehog/pkg/handlers/archive.go

Lines 354 to 361 in b69e2c6

 tmpEnv, err := a.createTempEnv(ctx, file) 

 if err != nil { 

 return nil, err 

 } 

 defer os.Remove(tmpEnv.tempFileName) 

 defer os.RemoveAll(tmpEnv.extractPath) 

 cmd := exec.Command("ar", "x", tmpEnv.tempFile.Name())

extractDebContent/extractRpmContent then calls handleExtractedFiles

trufflehog/pkg/handlers/archive.go

Line 374 in b69e2c6

dataArchiveName, err := a.handleExtractedFiles(ctx, tmpEnv, handler)

handleExtractedFiles iterates through each file in the temporary directory and tries to assign a value to dataArchiveName

trufflehog/pkg/handlers/archive.go

Lines 481 to 500 in b69e2c6

 var dataArchiveName string 

 for _, file := range extractedFiles { 

 filename := file.Name() 

 filePath := filepath.Join(env.extractPath, filename) 

 fileInfo, err := os.Stat(filePath) 

 if err != nil { 

 return "", fmt.Errorf("unable to get file info for filename %s: %w", filename, err) 

 } 

 if fileInfo.IsDir() { 

 continue 

 } 

 name, err := handleFile(ctx, env, filename) 

 if err != nil { 

 return "", err 

 } 

 if name != "" { 

 dataArchiveName = name 

 break 

 }

Each iteration calls the handleFile func (directories are skipped)

trufflehog/pkg/handlers/archive.go

Lines 367 to 372 in b69e2c6

 handler := func(ctx logContext.Context, env tempEnv, file string) (string, error) { 

 if strings.HasPrefix(file, "data.tar.") { 

 return file, nil 

 } 

 return a.handleNestedFileMIME(ctx, env, file) 

 }

The handleFile func calls handleNestedFileMIME
handleNestedFileMIME returns an empty string due to 4 possible cases

trufflehog/pkg/handlers/archive.go

Lines 422 to 425 in b69e2c6

nestedFile, err := os.Open(filepath.Join(tempEnv.extractPath, fileName))

if err != nil {

return "", err

}

trufflehog/pkg/handlers/archive.go

Lines 428 to 431 in b69e2c6

mimeType, reader, err := determineMimeType(nestedFile)

if err != nil {

return "", fmt.Errorf("unable to determine MIME type of nested filename: %s, %w", nestedFile.Name(), err)

}

trufflehog/pkg/handlers/archive.go

Lines 433 to 437 in b69e2c6

switch mimeType {

case arMimeType, rpmMimeType:

_, _, err = a.HandleSpecialized(ctx, reader)

default:

return "", nil

trufflehog/pkg/handlers/archive.go

Lines 440 to 442 in b69e2c6

if err != nil {

return "", fmt.Errorf("unable to extract file with MIME type %s: %w", mimeType, err)

}
No value is assigned to dataArchiveName, so handleExtractedFiles returns ""

trufflehog/pkg/handlers/archive.go

Line 503 in b69e2c6

return dataArchiveName, nil
extractDebContent/extractRpmContent returns openDataArchive, which opens the path /tmp/generatedDirName + dataArchiveName ("") and returns a io.ReadCloser.

trufflehog/pkg/handlers/archive.go

Line 379 in b69e2c6

return openDataArchive(tmpEnv.extractPath, dataArchiveName)
extractDebContent/extractRpmContent returns that io.ReadCloser to HandleSpecialized
HandleSpecialized returns that io.ReadCloser to handlers.processHandler.
processHandler calls archive.FromFile

trufflehog/pkg/handlers/handlers.go

Lines 74 to 77 in b69e2c6

file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader)

if isSpecial {

return handleChunks(ctx, h.FromFile(ctx, file), chunkSkel, reporter)

}

FromFile calls openArchive

trufflehog/pkg/handlers/archive.go

Lines 75 to 86 in b69e2c6

 func (a *Archive) FromFile(originalCtx logContext.Context, data io.Reader) chan []byte { 

 if a.skipArchives { 

 return nil 

 } 

 archiveChan := make(chan []byte, defaultBufferSize) 

 go func() { 

 ctx, cancel := logContext.WithTimeout(originalCtx, maxTimeout) 

 logger := logContext.AddLogger(ctx).Logger() 

 defer cancel() 

 defer close(archiveChan) 

 err := a.openArchive(ctx, 0, data, archiveChan)

openArchive calls mbolt/archiver.Identify

trufflehog/pkg/handlers/archive.go

Lines 98 to 107 in b69e2c6

 func (a *Archive) openArchive(ctx logContext.Context, depth int, reader io.Reader, archiveChan chan []byte) error { 

 if common.IsDone(ctx) { 

 return ctx.Err() 

 } 

 if depth >= maxDepth { 

 return fmt.Errorf(errMaxArchiveDepthReached) 

 } 

 format, arReader, err := archiver.Identify("", reader)

Identify goes "wtf, this is a directory and not a file"
https://github.com/mholt/archiver/blob/81f9e06b11ad6ba424f8311c0bc18ceb01f2b67a/formats.go#L53

What does all this mean?

I don't know. I understand how this happens, but not why nor what the proper fix is. The obvious solution is to throw in another IsDir() check somewhere, but at this point I'm not sure if the contents of these specialized archives are actually being scanned. (e.g., I don't understand why handleExtractedFiles iterates through each file but only returns a single value for dataArchiveName.)

It seems to me that the core design issue is that archive.FromFile assumes it's being given a file, but archive.HandleSpecialized decompresses archives into a directory first.

trufflehog/pkg/handlers/handlers.go

Lines 74 to 77 in b69e2c6

 file, isSpecial, err := specialHandler.HandleSpecialized(ctx, reReader) 

 if isSpecial { 

 return handleChunks(ctx, h.FromFile(ctx, file), chunkSkel, reporter) 

 }

ahrav · 2024-03-05T18:38:58Z

I think we are planning on re-working the archive handler in general. There is too much recursion for my liking 😅 Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

rgmz · 2024-04-07T16:55:58Z

There is too much recursion for my liking 😅

Tbh, I'm not sure how you would handle directories without recursion. The logic could definitely be cleaner.

Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

Perhaps this is a naive suggestion, but would modifying FromFile to accept both files and directories (i.e., walk them) fix the issue? It would avoid a large rewrite of the current logic/API.

rgmz added the bug label Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory` #2426

Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory` #2426

rgmz commented Feb 12, 2024 •

edited

rgmz commented Feb 13, 2024 •

edited

ahrav commented Mar 5, 2024

rgmz commented Apr 7, 2024 •

edited

Archive handler error: matching $format: read /tmp/tmp_archive123456: is a directory #2426

Archive handler error: matching $format: read /tmp/tmp_archive123456: is a directory #2426

Comments

rgmz commented Feb 12, 2024 • edited

TruffleHog Version

Trace Output

Expected Behavior

Actual Behavior

Steps to Reproduce

Environment

Additional Context

References

rgmz commented Feb 13, 2024 • edited

What does all this mean?

ahrav commented Mar 5, 2024

rgmz commented Apr 7, 2024 • edited

Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory` #2426

Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory` #2426

rgmz commented Feb 12, 2024 •

edited

rgmz commented Feb 13, 2024 •

edited

rgmz commented Apr 7, 2024 •

edited