GrobidArticleExtractor

This Python tool extracts content from PDF files using GROBID and organizes it by sections. It provides a structured way to extract both metadata and content from academic papers and other structured documents.

Features

  • Direct PDF processing using GROBID API
  • Metadata extraction (title, authors, abstract, publication date)
  • Hierarchical section organization with subsections

Prerequisites

  1. Start GROBID Service:

    docker pull lfoppiano/grobid:0.8.0
    docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0

    Setting JAVA_OPTS="-XX:+UseZGC" helps resolve the following error on macOS:

    [thread 44 also had an error]
    
    A fatal error has been detected by the Java Runtime Environment:
    
    SIGSEGV (0xb) at pc=0x00007ffffef8ad07, pid=8, tid=47
    
    JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
    Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
    Problematic frame:
    [thread 41 also had an error]
    [thread 45 also had an error]
    [thread 46 also had an error]
  2. Installation:

    Install this package via:

    pip install GrobidArticleExtractor

    Or install the latest development version via:

    pip install git+https://github.com/sensein/GrobidArticleExtractor.git

    Note: If upgrading from a previous version, you may need to reinstall the package to ensure the CLI command is properly installed:

    pip uninstall GrobidArticleExtractor
    pip install GrobidArticleExtractor

Usage

Command Line Interface

The tool provides a user-friendly command-line interface for batch processing PDF files:

# Basic usage (processes PDFs from 'pdfs' directory)
grobidextractor

# Process PDFs from a specific directory
grobidextractor path/to/pdfs

# Specify custom output directory
grobidextractor path/to/pdfs -o path/to/output

# Use custom GROBID server and disable content preview
grobidextractor path/to/pdfs --grobid-url http://custom:8070 --no-preview

Available options:

$ grobidextractor --help
Usage: grobidextractor [OPTIONS] [INPUT_FOLDER]  

  Process PDF files from INPUT_FOLDER and extract their content using GROBID.

  The extracted content is saved as JSON files in the output directory.
  Each JSON file is named after its source PDF file.

Options:
  -o, --output-dir PATH  Directory to save extracted JSON files (default: output)
  -g, --grobid-url TEXT  GROBID service URL (default: http://localhost:8070)
  --preview / --no-preview
                        Show preview of extracted content (default: True)
  --help                Show this message and exit.

Example:
  grobidextractor path/to/pdfs -o path/to/output

Python API Usage

You can also use the tool programmatically in your Python code:

from GrobidArticleExtractor.app import GrobidArticleExtractor

# Initialize extractor (default GROBID URL: http://localhost:8070)
extractor = GrobidArticleExtractor()

# Process a PDF file
xml_content = extractor.process_pdf("path/to/your/paper.pdf")

if xml_content:
    # Extract and organize content
    result = extractor.extract_content(xml_content)

    # Access metadata
    print(result['metadata'])

    # Access sections
    for section in result['sections']:
        print(section['heading'])
        if 'content' in section:
            print(section['content'])
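
If you want to persist the extracted content yourself, for example in the same JSON form the CLI writes, the standard json module is enough; the filename below is only an illustration:

import json

# Save the extracted metadata and sections to disk
with open("paper.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)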

Custom GROBID server:

extractor = GrobidArticleExtractor(grobid_url="http://your-grobid-server:8070")

Example Notebook

Check out the example notebook for a step-by-step guide on how to run it.

Output Structure

The extracted content is organized as follows:

{
    'metadata': {
        'title': 'Paper Title',
        'authors': ['Author 1', 'Author 2'],
        'abstract': 'Paper abstract...',
        'publication_date': '2023'
    },
    'sections': [
        {
            'heading': 'Introduction',
            'content': ['Paragraph 1...', 'Paragraph 2...'],
            'subsections': [
                {
                    'heading': 'Background',
                    'content': ['Subsection content...']
                }
            ]
        }
        # More sections...
    ]
}
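
Because sections can nest, a small recursive helper is convenient for walking the whole tree. The sketch below is only an illustration and relies solely on the heading, content, and subsections keys shown above (result is the dictionary returned by extract_content):

def walk_sections(sections, level=0):
    # Print each heading and its paragraphs, recursing into nested subsections
    for section in sections:
        print("  " * level + section.get('heading', ''))
        for paragraph in section.get('content', []):
            print("  " * (level + 1) + paragraph)
        walk_sections(section.get('subsections', []), level + 1)

walk_sections(result['sections'])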

Project Structure

The project is organized into two main files:

  • app.py - Contains the core GrobidArticleExtractor class with all the PDF processing and content extraction functionality
  • cli.py - Contains the command-line interface implementation using Click

Error Handling

The tool includes comprehensive error handling for common scenarios:

  • PDF file not found
  • GROBID service unavailable
  • XML parsing errors
  • Invalid content structure

All errors are logged with appropriate messages using Python's logging module.
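
Because errors are reported through Python's logging module, enabling a basic logging configuration in your own script makes them visible. A minimal sketch, assuming process_pdf returns a falsy value on failure (as the if xml_content: check in the API example suggests):

import logging

from GrobidArticleExtractor.app import GrobidArticleExtractor

# Show the extractor's log messages on the console
logging.basicConfig(level=logging.INFO)

extractor = GrobidArticleExtractor()
xml_content = extractor.process_pdf("path/to/your/paper.pdf")

if not xml_content:
    # Assumed: process_pdf returns a falsy value when the PDF is missing or GROBID is unreachable
    logging.error("Extraction failed; check the log output above for details.")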

Contributing

Feel free to submit issues and enhancement requests!

License

MIT License