Fabula is an AI-powered narrative analysis engine that transforms unstructured narrative texts (scripts, novels, etc.) into richly structured knowledge graphs. By combining LLM-driven extraction with a robust entity resolution pipeline, Fabula enables deep analysis of story structure, character development, and thematic elements.
Inspired by the BBC's Mythology Engine, Fabula aims to unlock narrative information and make it explorable through graph-based queries.
- Script Pre-processing: Converts TV/film scripts from various formats into standardized JSON using script2json.py
- LLM-Based Entity Extraction: Uses BAML for structured extraction of:
  - Characters (Agents)
  - Locations
  - Objects
  - Organizations
  - Events
  - Relationships
- Two-Pass Processing Pipeline:
  - First Pass: Raw entity extraction
  - Second Pass: Detailed scene metadata, events, and participations
- Entity Resolution: Reconciles and merges duplicate entities using fuzzy matching and LLM-assisted resolution
- Cypher Generation: Converts processed data to Neo4j Cypher queries for graph database import
- Natural Language Graph Queries: Work-in-progress tool for converting natural language to Cypher queries
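The fuzzy-matching half of entity resolution can be illustrated with a minimal sketch. It is not Fabula's actual resolver: it uses `difflib` from the Python standard library as a stand-in for thefuzz, skips the LLM-assisted step entirely, and the function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def resolve_entities(names, threshold=0.85):
    """Group mentions whose similarity to a canonical name meets the threshold.

    Returns a dict mapping each canonical name (the first spelling seen)
    to the list of alias spellings merged into it.
    """
    clusters = {}
    for name in names:
        # Find the first existing canonical name that is close enough.
        match = next((c for c in clusters if similarity(name, c) >= threshold), None)
        if match is None:
            clusters[name] = []           # new canonical entity
        elif name != match:
            clusters[match].append(name)  # record a variant spelling as an alias
    return clusters


if __name__ == "__main__":
    mentions = ["Sally Sparrow", "sally sparrow", "Sally Sparow", "The Doctor"]
    print(resolve_entities(mentions))
```

A purely string-based pass like this cannot tell that "The Doctor" and "John Smith" are the same character, which is the kind of case the LLM-assisted resolution step is there to handle.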
# Required Python packages (requirements.txt coming soon)
pip install requests beautifulsoup4 thefuzz neo4j pydantic openai
You'll also need:
- Neo4j Desktop (free version available)
- Access to an LLM API (by default the code uses OpenAI o3-mini)
- Convert a script to JSON:
python script2json.py "http://chakoteya.net/DoctorWho/29-10.html" output.json
- Process the script and generate the knowledge graph:
# Basic usage with default settings
python main.py input_file.json --output output_graph.json
# Using combined extraction mode (recommended for speed and efficiency)
python main.py input_file.json --combined --output output_graph.json
# Using fully combined extraction mode for max performance
python main.py input_file.json --fully-combined --output output_graph.json
# Using batch resolution for large datasets
python main.py input_file.json --batch-resolution --output output_graph.json
- Convert the processed data to Cypher:
from json_cypher import main as generate_cypher
generate_cypher()
Fabula offers several processing modes to balance speed, cost, and accuracy:
Standard mode processes each entity type separately and is good for small datasets or when you need maximum precision.
python main.py input_file.json
Combined mode extracts all primary entities (agents, locations, organizations, objects) in a single pass, reducing API calls and improving consistency.
python main.py input_file.json --combined
Fully combined mode extracts both primary and secondary entities (events, participations, etc.) in combined calls, offering the best performance. It is recommended for most use cases.
python main.py input_file.json --fully-combined
For large datasets, batch resolution runs entity resolution in smaller groups, improving performance and reducing API costs.
python main.py input_file.json --batch-resolution
You can combine multiple flags for optimal performance:
python main.py input_file.json --fully-combined --batch-resolution
fabula/
├── main.py # Main orchestration
├── episode_processor.py # Episode-level processing
├── scene_processor.py # Scene-level extraction
├── entity_registry.py # Entity management/resolution
├── validation.py # Reference validation
├── context.py # Global context management
├── utils/
│ ├── script2json.py # Script preprocessing
│ ├── fabula_graphrag.py # Natural language query tool
│ └── json_cypher.py # Generate graph as Cypher statements
└── baml_src/
    └── myth06.baml # BAML extraction definitions
The script converter is designed to work with TV/film scripts from sources like chakoteya.net. It:
- Parses HTML/text scripts into structured JSON
- Extracts scene boundaries, dialogue, and stage directions
- Handles multi-episode stories
- Supports various script formats
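As a rough illustration of how scene boundaries, dialogue, and stage directions might be separated, here is a minimal sketch. It is not the code in script2json.py: it assumes a plain-text transcript in which scene headings appear in square brackets, stage directions in parentheses, and dialogue as `NAME: line`. The function name `parse_script` and the regular expressions are hypothetical; the output keys follow the converter's documented JSON fields.

```python
import re

# Assumed transcript conventions (not confirmed by the Fabula source):
SCENE_RE = re.compile(r"^\[(?P<scene>.+)\]$")      # [Wester Drumlins]
DIRECTION_RE = re.compile(r"^\((?P<text>.+)\)$")   # (Sally enters the house)
DIALOGUE_RE = re.compile(r"^(?P<who>[A-Z][A-Z0-9 .'-]*):\s*(?P<line>.+)$")


def parse_script(text: str) -> list:
    """Split a plain-text transcript into a list of scene dictionaries."""
    scenes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = SCENE_RE.match(line)
        if m:
            # A bracketed heading opens a new scene.
            current = {"Scene": m["scene"].upper(), "Dialogue": []}
            scenes.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first scene heading
        m = DIRECTION_RE.match(line)
        if m:
            current["Dialogue"].append({"Stage Direction": m["text"]})
            continue
        m = DIALOGUE_RE.match(line)
        if m:
            current["Dialogue"].append({"Character": m["who"].strip(), "Line": m["line"]})
    return scenes
```

Lines that match none of the three patterns are silently dropped in this sketch; a real converter would likely need to handle continuation lines and format variations between sources.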
Example output structure:
{
  "Story": "Blink",
  "Airdate": "2007-06-09",
  "Episodes": [
    {
      "Episode": "Episode One",
      "Scenes": [
        {
          "Scene": "WESTER DRUMLINS",
          "Dialogue": [
            {
              "Character": "SALLY",
              "Line": "Hello? Is someone there?"
            },
            {
              "Stage Direction": "Sally enters the abandoned house"
            }
          ]
        }
      ]
    }
  ]
}
The Cypher generator (json_cypher.py) converts processed story data into Neo4j Cypher queries for graph database import. Features:
- Generates schema cleanup and constraint creation
- Creates nodes for all entity types
- Establishes relationships between entities
- Handles metadata and properties
- Supports incremental updates
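The features above can be sketched as a few string-building helpers. This is an illustration under stated assumptions, not the contents of json_cypher.py: the `uuid` property, the helper names, and the use of JSON encoding for literals are all hypothetical. `MERGE` is used so that re-running the output updates existing nodes instead of duplicating them, which is one way to support incremental updates.

```python
import json


def cypher_literal(value) -> str:
    """Render a scalar or list as a Cypher literal.

    JSON encoding is used as an approximation: it yields double-quoted,
    backslash-escaped strings plus true/false/null. Dicts are not supported.
    """
    if isinstance(value, dict):
        raise TypeError("nested maps are not handled in this sketch")
    return json.dumps(value, ensure_ascii=False)


def constraint_statement(label: str) -> str:
    """Uniqueness constraint on the (assumed) uuid property of a label."""
    return f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE n.uuid IS UNIQUE;"


def node_statement(label: str, uuid: str, props: dict) -> str:
    """MERGE a node by uuid, then set its remaining properties."""
    stmt = f"MERGE (n:{label} {{uuid: {cypher_literal(uuid)}}})"
    assignments = ", ".join(
        f"n.{key} = {cypher_literal(val)}" for key, val in sorted(props.items())
    )
    if assignments:
        stmt += f" SET {assignments}"
    return stmt + ";"


def relationship_statement(src_uuid: str, rel_type: str, dst_uuid: str) -> str:
    """MERGE a typed relationship between two existing nodes."""
    return (
        f"MATCH (a {{uuid: {cypher_literal(src_uuid)}}}), "
        f"(b {{uuid: {cypher_literal(dst_uuid)}}}) "
        f"MERGE (a)-[:{rel_type}]->(b);"
    )


if __name__ == "__main__":
    print(constraint_statement("Agent"))
    print(node_statement("Agent", "agent_sally", {"name": "Sally Sparrow"}))
    print(relationship_statement("agent_sally", "VISITS", "location_wester_drumlins"))
```

Labels and relationship types are interpolated directly, so they must come from a trusted, fixed vocabulary. When loading through the Neo4j Python driver instead of a generated script, query parameters are generally a safer choice than inlined literals.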
The fabula_graphrag.py tool enables natural language querying of your narrative knowledge graphs, allowing exploration without writing Cypher queries:
- Dynamic schema extraction from Neo4j database
- LLM-based question decomposition for complex narrative inquiries
- BAML-constrained Cypher generation for accurate queries
- Multi-hop path exploration for deep narrative connections
- Rich answer synthesis with narrative context
# Basic usage
python fabula_graphrag.py
# With custom Neo4j connection
python fabula_graphrag.py --uri bolt://localhost:7687 --username neo4j --password password
Example questions:
- "What happens prior to the Doctor's arrival?"
- "How does Josh's emotional state change throughout the episode?"
- "Which characters visit the White House Situation Room?"
- "What objects are significant to the main character's development?"
The system:
- Extracts and enriches the database schema with narrative semantics
- Decomposes complex questions into focused sub-questions
- Generates Cypher queries constrained by BAML templates
- Traverses multi-hop relationships to find narrative connections
- Synthesizes results into coherent narrative answers
The architecture combines BAML's structured output guarantees with rich narrative exploration capabilities, allowing both precise fact retrieval and deeper thematic analysis.
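The flow described above (decompose, generate Cypher, query, synthesize) can be outlined as a small orchestration function. This is a sketch only and does not reflect the internals of fabula_graphrag.py: every callable is a hypothetical stand-in that a real implementation would back with an LLM call or a Neo4j session, and the schema is passed in as a plain string.

```python
from typing import Callable, Dict, List


def answer_question(
    question: str,
    schema: str,
    decompose: Callable[[str], List[str]],
    to_cypher: Callable[[str, str], str],
    run_query: Callable[[str], List[Dict]],
    synthesize: Callable[[str, List[Dict]], str],
) -> str:
    """Run one question through the decompose/query/synthesize steps."""
    evidence = []
    for sub_question in decompose(question):
        # Each sub-question gets its own schema-aware Cypher query.
        cypher = to_cypher(sub_question, schema)
        evidence.append(
            {"question": sub_question, "cypher": cypher, "rows": run_query(cypher)}
        )
    # The final step sees the original question plus all collected rows.
    return synthesize(question, evidence)


if __name__ == "__main__":
    # Stub callables so the sketch runs without an LLM or a database.
    answer = answer_question(
        "Which characters appear?",
        "(:Agent {name})",
        decompose=lambda q: [q],
        to_cypher=lambda sq, schema: "MATCH (a:Agent) RETURN a.name AS name",
        run_query=lambda cypher: [{"name": "Sally"}, {"name": "Larry"}],
        synthesize=lambda q, ev: ", ".join(r["name"] for e in ev for r in e["rows"]),
    )
    print(answer)
```

Keeping each step behind a plain callable makes it possible to test the orchestration with stubs, as in the usage example, before wiring in real model and database calls.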
- Core extraction pipeline
- Entity resolution with batch processing
- Combined extraction modes
- Basic validation
- Script preprocessing
- Cypher generation
- Natural language query tool
- Enhanced entity resolution
- Requirements specification
- Documentation improvements
- Test coverage
- Multi-modal support (extract from video)
- Additional script format support
- Interactive visualization
- API documentation
- Performance optimizations
While Fabula is in active development, we welcome:
- Bug reports
- Feature suggestions
- Documentation improvements
- Script format contributions
- Ontology enhancements
Please open an issue to discuss potential changes.
[License TBD]
- BBC Mythology Engine for inspiration
- chakoteya.net for script resources
- BAML team for the extraction framework
- Neo4j community for graph database expertise
For more information or to report issues, please open a GitHub issue.




