A powerful utility for extracting metadata from PDF and DOCX documents using Google's Gemini AI.
This tool scans documents in a specified directory, extracts their text content, and uses Gemini AI to identify and extract structured metadata. The results are compiled into an Excel spreadsheet for easy review and analysis.
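The scanning step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the names `SUPPORTED_EXTENSIONS` and `find_documents` are hypothetical, and the sketch assumes only files directly inside the given directory are considered (no recursion into subfolders).

```python
from pathlib import Path

# Extensions the tool is documented to support (hypothetical constant name).
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def find_documents(directory):
    """Return a sorted list of PDF and DOCX files directly inside `directory`."""
    root = Path(directory)
    return sorted(
        path
        for path in root.iterdir()
        # Skip subdirectories and compare extensions case-insensitively.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Each returned path would then be passed to the text extraction and Gemini steps, which are not shown here.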
- Supports PDF and DOCX file formats
- Extracts comprehensive metadata including:
  - General document information (title, type, dates, version)
  - Document-specific metadata based on document type
  - Keywords, authors, and other contextual information
- Exports results to a formatted Excel file
- Includes logging for troubleshooting
- Handles errors gracefully
- Includes progress reporting for large batches
- Python 3.7+
- Dependencies:
  - python-docx: For DOCX file processing
  - PyPDF2: For PDF file processing
  - google-generativeai: For AI metadata extraction
  - openpyxl: For Excel output generation
  - pathlib: For path handling (part of the Python standard library, so no separate install is needed)
- Clone or download this repository
- Install required dependencies:
pip install python-docx PyPDF2 google-generativeai openpyxl
- Create a config.json file with your Gemini API key:
{
"api_key": "YOUR_GEMINI_API_KEY"
}
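A config file of this shape could be read with a few lines of standard-library Python. The sketch below is an assumption about how such loading might look, not the script's own code; the function name `load_api_key` and the check for the unedited placeholder are illustrative.

```python
import json
from pathlib import Path


def load_api_key(config_path="config.json"):
    """Read the Gemini API key from a JSON config file (hypothetical helper)."""
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    api_key = config.get("api_key", "")
    # Reject a missing key or the unedited placeholder value.
    if not api_key or api_key == "YOUR_GEMINI_API_KEY":
        raise ValueError(f"No usable 'api_key' value in {path}")
    return api_key
```

Failing early on a placeholder key gives a clearer error than a rejected API request later in the run.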
Run the script from the command line:
python metadata_extractor.py --dir "path/to/documents" --output "results.xlsx" --output-dir "output" --config "config.json"
- --dir, -d: Directory containing documents to process (required)
- --output, -o: Output Excel filename (default: "metadata_output.xlsx")
- --output-dir: Directory to save JSON responses and other output files (default: "output")
- --config, -c: Path to config file with API key (default: "config.json")
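For reference, the options above map naturally onto Python's `argparse`. The following is a sketch that mirrors the documented flags and defaults; the function name `build_parser` and the help strings are assumptions, and the real script may define its parser differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        description="Extract metadata from PDF and DOCX documents."
    )
    parser.add_argument("--dir", "-d", required=True,
                        help="Directory containing documents to process")
    parser.add_argument("--output", "-o", default="metadata_output.xlsx",
                        help="Output Excel filename")
    parser.add_argument("--output-dir", default="output",
                        help="Directory for JSON responses and other output files")
    parser.add_argument("--config", "-c", default="config.json",
                        help="Path to config file with API key")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dir, args.output, args.output_dir, args.config)
```

Note that `argparse` exposes `--output-dir` as `args.output_dir`, with the hyphen replaced by an underscore.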
The script generates the following outputs:
- An Excel file with all extracted metadata (saved to the output directory)
- JSON files for each processed document containing the raw Gemini API responses (saved in the output directory with the same base filename as the original document)
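The naming rule for the JSON files (same base filename as the source document, placed in the output directory) can be illustrated with `pathlib`. This is a sketch under stated assumptions: the helper names `json_output_path` and `save_raw_response` are hypothetical, and the raw response is assumed to be available as a text string.

```python
from pathlib import Path


def json_output_path(document_path, output_dir="output"):
    """Build the JSON path that shares the document's base filename."""
    return Path(output_dir) / (Path(document_path).stem + ".json")


def save_raw_response(response_text, document_path, output_dir="output"):
    """Write the raw response text next to the other outputs and return its path."""
    target = json_output_path(document_path, output_dir)
    # Create the output directory on first use.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(response_text, encoding="utf-8")
    return target
```

With this rule, `docs/report.pdf` would produce `output/report.json`. Two documents that differ only by extension (for example `report.pdf` and `report.docx`) would map to the same JSON file under this scheme, so a real implementation may need to handle that case.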
The tool extracts metadata according to a comprehensive schema, including:
- Document title, type, creation date
- Version/revision information
- Source, keywords, and summary
- File details and confidentiality level
- Relevant products and geographic regions
Tailored metadata extraction based on document type:
- Research Papers: Authors, journal name, DOI, methodology, findings
- Test Documents: Test name, standards, equipment, materials, results
- EPDs: Declared unit, GWP, LCA practitioner, validity period
- Case Studies: Project details, challenges, benefits, outcomes
- Technical Product Data: Product specs, application instructions, references
- ASTM Standards: Designation, issue year, title, relevant sections
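One way to picture this schema is as a list of general fields plus a per-type mapping. The sketch below is purely illustrative: the field identifiers and type labels are hypothetical spellings derived from the lists above, not the names the script or its Gemini prompt actually uses.

```python
# General fields requested for every document (hypothetical identifiers).
GENERAL_FIELDS = [
    "title", "document_type", "creation_date", "version", "source",
    "keywords", "summary", "file_details", "confidentiality_level",
    "relevant_products", "geographic_regions",
]

# Extra fields requested only for a given document type (hypothetical identifiers).
TYPE_SPECIFIC_FIELDS = {
    "Research Paper": ["authors", "journal_name", "doi", "methodology", "findings"],
    "Test Document": ["test_name", "standards", "equipment", "materials", "results"],
    "EPD": ["declared_unit", "gwp", "lca_practitioner", "validity_period"],
    "Case Study": ["project_details", "challenges", "benefits", "outcomes"],
    "Technical Product Data": ["product_specs", "application_instructions", "references"],
    "ASTM Standard": ["designation", "issue_year", "title", "relevant_sections"],
}


def expected_fields(document_type):
    """Return general fields followed by type-specific ones, without duplicates."""
    specific = TYPE_SPECIFIC_FIELDS.get(document_type, [])
    return GENERAL_FIELDS + [name for name in specific if name not in GENERAL_FIELDS]
```

A list like the one returned by `expected_fields("EPD")` could serve as the column order for that document type in the Excel output. Unknown types fall back to the general fields only.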
- Time Saving: Automate metadata extraction from large document collections
- Consistency: Apply the same extraction criteria across all documents
- AI-Powered: Leverage Gemini AI for intelligent content analysis
- Structured Output: Get organized results ready for database import
- Complete Records: Store both the processed metadata and raw AI responses
Contributions, bug reports, and feature requests are welcome! Feel free to open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Fayaz K