
meta-extract

CLI tool that pulls metadata from files and URLs. Built this because I got tired of opening 5 different apps to check file info.

Install

npm install

You'll also need ffmpeg if you want audio/video extraction. On Ubuntu: sudo apt install ffmpeg

Usage

# basic
node src/index.js ./report.pdf
node src/index.js https://github.com

# quiet mode - just JSON, no status messages
node src/index.js -q ./photo.jpg

# skip security scan (faster)
node src/index.js --no-security ./archive.zip

What it extracts

Type     What you get
PDF      pages, author, producer, text stats, language, watermark detection
Images   dimensions, EXIF, GPS coords, camera/lens info
Audio    duration, codec, bitrate, sample rate
Video    resolution, fps, codecs, audio tracks
URLs     title, meta tags, OG/Twitter cards, load time
ZIP      file listing, sizes, nested zip detection

Output

Always JSON. Always this shape:

{
  "status": "success",
  "input_type": "file",
  "source": "./report.pdf",
  "metadata": { ... },
  "extracted_at": "2024-01-15T12:00:00.000Z"
}

If something fails, you get "status": "error" or "status": "partial" with an errors array.
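As a rough sketch of how that envelope could be assembled (the real logic lives in core/normalizer.js; the `buildEnvelope` name and the exact status rules are assumptions):

```javascript
// Sketch only: derive status from whether errors occurred and whether any
// metadata was still recovered, mirroring the shape documented above.
function buildEnvelope({ inputType, source, metadata = {}, errors = [] }) {
  const status = errors.length === 0 ? 'success'
    : Object.keys(metadata).length > 0 ? 'partial'
    : 'error';
  const out = {
    status,
    input_type: inputType,
    source,
    metadata,
    extracted_at: new Date().toISOString(),
  };
  if (errors.length > 0) out.errors = errors;
  return out;
}

console.log(JSON.stringify(buildEnvelope({
  inputType: 'file',
  source: './report.pdf',
  metadata: { pdf: { pages: 12 } },
}), null, 2));
```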

Examples

URL extraction:

$ node src/index.js https://example.com
{
  "status": "success",
  "input_type": "url",
  "source": "https://example.com",
  "metadata": {
    "web": {
      "title": "Example Domain",
      "statusCode": 200,
      "loadTimeMs": 245,
      "pageStats": {
        "linkCount": 1,
        "imageCount": 0
      }
    }
  }
}
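The HTML side of that extraction boils down to pulling the title and counting elements. The sketch below is illustrative only, not the code in handlers/web.js, and the regexes are simplifications:

```javascript
// Illustrative: extract title and basic page stats from an HTML string.
function pageStatsFromHtml(html) {
  const titleMatch = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    linkCount: (html.match(/<a\s/gi) || []).length,
    imageCount: (html.match(/<img\s/gi) || []).length,
  };
}

const sample = '<html><head><title>Example Domain</title></head>' +
  '<body><a href="https://www.iana.org/domains/example">More</a></body></html>';
console.log(pageStatsFromHtml(sample));
// → { title: 'Example Domain', linkCount: 1, imageCount: 0 }
```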

Error case:

$ node src/index.js ./doesnt-exist.pdf
{
  "status": "error",
  "source": "./doesnt-exist.pdf",
  "errors": ["File not found or not accessible"]
}

Project structure

src/
  index.js          # CLI entry
  core/
    extractor.js    # routes to handlers
    normalizer.js   # builds output JSON
  handlers/
    pdf.js, image.js, audio.js, video.js, web.js, zip.js
  utils/
    file.js, hash.js, time.js
  security/
    scan.js         # macro detection, pattern matching

Security stuff

  • The virus scan is mocked - swap in ClamAV or VirusTotal for real use
  • ZIP extraction has depth limits (3 levels) and file count limits (100) to prevent zip bombs
  • Scans for suspicious patterns like powershell, eval(), encoded strings
  • Flags external URLs in documents
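The pattern matching in the last two points might look something like this. It is only a sketch: the actual pattern list and return shape in security/scan.js are assumptions here.

```javascript
// Illustrative suspicious-pattern scan: each entry pairs a label with a regex.
const SUSPICIOUS = [
  { name: 'powershell', re: /powershell/i },
  { name: 'eval', re: /\beval\s*\(/ },
  // Very long base64-looking runs are a crude proxy for encoded payloads.
  { name: 'base64-blob', re: /[A-Za-z0-9+/]{80,}={0,2}/ },
];

function scanText(text) {
  return SUSPICIOUS.filter(({ re }) => re.test(text)).map(({ name }) => name);
}

console.log(scanText('run powershell -enc QUJD'));
// → [ 'powershell' ]
```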

Known limitations

  • DOCX/XLSX text extraction not implemented yet (just security scan)
  • YouTube URLs treated as regular web pages
  • GPS extraction only works on images with EXIF data (obviously)
  • ffprobe errors out if ffmpeg isn't installed

License

MIT

