The NER & LLM Gen AI Application represents a sophisticated, serverless document processing solution built on Amazon Web Services (AWS). This enterprise-grade system combines Named Entity Recognition (NER) and Large Language Models (LLM) to automate document analysis, information extraction, and content summarization at scale.
In today's data-driven environment, organizations face the challenge of processing vast amounts of textual information efficiently. Traditional methods of manual document processing are time-consuming, inconsistent, and don't scale well. This application addresses these challenges by leveraging state-of-the-art language models: Mixtral-8x7B for intelligent text generation and summarization, and a BERT NER model for precise entity recognition.
The application implements a serverless architecture with SageMaker endpoints that are automatically deployed and destroyed around each processing run. This architecture balances processing power, cost efficiency, and scalability. Each component is designed to handle a specific aspect of the document processing workflow while maintaining loose coupling for flexibility and resilience.
The application follows a modular, maintainable architecture with clear separation of concerns:
```
.
├── cfn_templates/                      # CloudFormation templates
│   ├── data.yaml                       # Storage and database resources
│   └── sm_gen_ai.yaml                  # ML infrastructure components
├── functions/
│   ├── genai/                          # AI processing functions
│   │   ├── abstractive_summarization/
│   │   ├── author_extraction/
│   │   ├── extractive_summarization/
│   │   └── generated_title/
│   └── shared/                         # Infrastructure functions
│       ├── create_sagemaker_endpoint/
│       ├── delete_sagemaker_endpoint/
│       ├── extract_paths_in_s3/
│       ├── push_to_queue/
│       ├── sns/
│       ├── sqs_batch_receive/
│       └── trigger_endpoints/
├── notebooks/                          # Model preparation notebooks
├── statemachines/                      # Processing workflow definitions
└── template.yaml                       # Main AWS SAM template
```
Each component serves a specific purpose in the processing pipeline:
The `genai/` directory contains the core AI processing functions:
- extractive_summarization: Creates initial summaries using key sentence extraction
- abstractive_summarization: Generates fluent, contextual summaries from the extractive summary using the LLM
- author_extraction: Identifies and extracts author information using NER
- generated_title: Produces document titles based on the extractive summary
The `shared/` directory houses infrastructure management functions:
- create_sagemaker_endpoint: Initializes ML model endpoints
- delete_sagemaker_endpoint: Cleans up resources after processing
- extract_paths_in_s3: Manages document discovery and tracking
- push_to_queue: Handles the document processing queue
- sns: Manages notifications
- sqs_batch_receive: Coordinates batch processing
- trigger_endpoints: Initiates the processing pipeline
When a document enters the system, it triggers a carefully orchestrated sequence of events:
- Application Trigger: The application compares the contents of the extracts bucket against the extractive summary bucket to identify unprocessed files (a sketch of this diff appears after this list). This approach ensures efficient processing by:
  - Preventing redundant processing of already analyzed documents
  - Enabling batch processing for improved efficiency
  - Maintaining processing state across system restarts
- Model Endpoint Management: The system employs a dynamic approach to model endpoint management:
  - Endpoints are created only when needed, reducing costs
  - Multiple documents are processed in batches while endpoints are active
  - Endpoints are deleted automatically after processing completes
- Document Processing: Each document undergoes multiple processing stages:
  - Parallel processing where appropriate
  - Sequential processing for dependent operations
  - Separation of results in storage
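The unprocessed-file check in the application trigger amounts to a set difference over S3 object keys. A minimal sketch with boto3, assuming object keys are identical across the extracts and extractive summary buckets:

```python
import boto3

s3 = boto3.client("s3")

def list_keys(bucket: str) -> set[str]:
    """Collect every object key in a bucket, following pagination."""
    keys: set[str] = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys

def unprocessed_keys(extracts_bucket: str, summary_bucket: str) -> set[str]:
    """Keys present in the extracts bucket but missing from the summary bucket."""
    return list_keys(extracts_bucket) - list_keys(summary_bucket)
```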
The infrastructure implements a secure, scalable foundation for document processing through multiple specialized components:
The multi-bucket S3 strategy serves specific purposes in the processing pipeline:
| Bucket Type | Purpose |
| --- | --- |
| Extracts | Source documents |
| Extractive Summary | Key sentence summaries |
| Abstractive Summary | LLM-generated summaries |
| Generated Titles | Document titles |
| Author Information | Extracted metadata |
Each Lambda function serves a specific purpose in the processing pipeline:
AI Processing Functions:
- Extractive Summarization
  - Purpose: Creates an initial summary using key sentence extraction
  - Input: Raw document text
  - Processing: Implements the TextRank or Luhn algorithm
  - Output: Condensed document with key sentences
- Abstractive Summarization
  - Purpose: Generates a natural language summary
  - Input: Extractive summary (removing low-impact sentences reduces compute time and cost)
  - Processing: Uses the LLM for coherent text generation
  - Output: Fluent, contextual summary
- Title Generation
  - Purpose: Creates contextual document titles
  - Input: Extractive summary (removing low-impact sentences reduces compute time and cost)
  - Processing: LLM-based title generation
  - Output: Descriptive title
- Author Extraction
  - Purpose: Identifies document authors
  - Input: Original document text (first 1,500 characters, an estimate of the document's first page)
  - Processing: NER-based name identification
  - Output: Structured author information
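All four functions share the same Lambda shape: read the source object from S3, run a model-specific step, and write the result to the corresponding output bucket. A hedged sketch of that shape; the event fields and the `process` callable are illustrative assumptions, not the repository's actual interface:

```python
from typing import Callable
import boto3

s3 = boto3.client("s3")

def handle(event: dict, process: Callable[[str], str]) -> dict:
    """Read a document, apply a model-specific step, store the result."""
    key = event["key"]  # assumed event field
    body = s3.get_object(Bucket=event["source_bucket"], Key=key)["Body"]
    result = process(body.read().decode("utf-8"))  # summarize, title, NER, ...
    s3.put_object(Bucket=event["output_bucket"], Key=key, Body=result.encode("utf-8"))
    return {"key": key, "status": "processed"}
```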
Example Processing Chain:
- Original Document (Extracts Bucket):
  "Q3 Financial Report, prepared by John Smith and Sarah Johnson. The company experienced significant growth..."
- Extractive Summary:
  "The company experienced significant growth. Revenue increased by 45%. Market share expanded in key regions." (retains sentences of high importance)
- Abstractive Summary:
  "The organization demonstrated exceptional performance in Q3, with substantial revenue growth and market expansion." (from the extractive summary)
- Generated Title:
  "Q3 2024 Financial Performance Analysis" (from the extractive summary)
- Author Information:
  "John Smith" (from the first 1,500 characters)
  "Sarah Johnson" (from the first 1,500 characters)
Customization and Extensibility
The provided Lambda functions serve as examples of document processing capabilities, but they are designed with extensibility in mind. Each function demonstrates a particular use case, and these implementations represent only a sample of what is possible with this architecture.
Organizations can easily modify these functions to meet their specific document processing requirements. For instance, the abstractive summarization prompt could be adjusted to generate executive summaries, technical abstracts, or marketing copy. The NER model could extract entities beyond people and authors, such as organizations or locations. The extractive summarization algorithm could be modified to prioritize different types of content based on domain-specific rules, or replaced with alternative summarization techniques.
Helper Lambda Functions:
- Document Discovery and Queue Management
  - Extract Paths in S3
    - Purpose: Manages document discovery and initiates processing
    - Behavior: Compares contents between the extracts and extractive summary buckets to identify new documents
    - Integration: Triggers the Push to Queue function for newly identified documents
  - Push to Queue
    - Purpose: Manages the document processing queue
    - Behavior: Adds documents to the SQS queue and updates DynamoDB with document status
    - Integration: Interfaces with SQS and DynamoDB for reliable message handling and state management
- SageMaker Endpoint Management
  - Create SageMaker Endpoint (LLM)
    - Purpose: Initializes the Large Language Model endpoint
    - Behavior: Creates a SageMaker endpoint for the LLM using specified configurations
    - Integration: Interfaces with the SageMaker API to manage model deployments
  - Create SageMaker Endpoint (NER)
    - Purpose: Initializes the Named Entity Recognition model endpoint
    - Behavior: Creates a SageMaker endpoint for NER using specified configurations
    - Integration: Interfaces with the SageMaker API to manage model deployments
  - Delete SageMaker Endpoint (LLM)
    - Purpose: Cleans up LLM resources after processing
    - Behavior: Deletes the LLM SageMaker endpoint and associated resources, then sends an email notification about the deletion so users know the endpoint does not persist after job completion
    - Integration: Interfaces with the SageMaker API and sends notifications via SNS
  - Delete SageMaker Endpoint (NER)
    - Purpose: Cleans up NER resources after processing
    - Behavior: Deletes the NER SageMaker endpoint and associated resources, then sends an email notification about the deletion so users know the endpoint does not persist after job completion
    - Integration: Interfaces with the SageMaker API and sends notifications via SNS
- Processing Coordination
  - SQS Batch Receive
    - Purpose: Retrieves batches of messages from SQS for processing
    - Behavior: Fetches messages from the SQS queue with a specified batch size
    - Integration: Interfaces with SQS and provides input for Step Functions
  - SNS (Job Complete)
    - Purpose: Notifies users of job completion
    - Behavior: Sends a notification when all documents are processed
    - Integration: Uses SNS to deliver email notifications
  - Trigger Endpoints
    - Purpose: Initiates the creation of ML model endpoints
    - Behavior: Invokes Lambda functions to create the LLM and NER endpoints
    - Integration: Coordinates with the endpoint creation functions to start processing
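The create/delete pattern these helpers implement can be sketched with boto3; the endpoint, endpoint-config, and SNS topic names below are illustrative placeholders rather than the repository's actual values:

```python
import boto3

sagemaker = boto3.client("sagemaker")
sns = boto3.client("sns")

def create_endpoint(endpoint_name: str, config_name: str) -> None:
    # The endpoint config (model + instance type) is assumed to already exist.
    sagemaker.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    # Block until the endpoint is InService so processing can begin.
    sagemaker.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

def delete_endpoint(endpoint_name: str, topic_arn: str) -> None:
    # Tear down the endpoint so no idle instance cost accrues, then notify.
    sagemaker.delete_endpoint(EndpointName=endpoint_name)
    sns.publish(
        TopicArn=topic_arn,
        Subject="SageMaker endpoint deleted",
        Message=f"Endpoint {endpoint_name} was deleted after job completion.",
    )
```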
Our NER implementation uses a BERT NER model to identify and classify named entities within documents. The system focuses particularly on author detection, examining the first 1,500 characters of each document, where author information typically appears.
The NER process follows a sequence:
- Token Detection identifies potential name components with confidence scoring
- Name Assembly combines related tokens into full names
- Validation confirms proper formatting and eliminates false positives
For example, processing the text "Authored by Jane Smith" produces the following results:
| Processing Stage | Output |
| --- | --- |
| Initial Detection | ["Jane", "Smith"] |
| Assembly | "Jane Smith" |
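The assembly step can be pictured as grouping consecutive person tokens; a minimal sketch, assuming token/tag pairs have already been returned by the NER endpoint:

```python
def assemble_names(tagged_tokens: list[tuple[str, str]]) -> list[str]:
    """Group person tokens into full names: B-PER starts a new name,
    I-PER continues the current one, anything else closes it."""
    names: list[str] = []
    current: list[str] = []
    for token, tag in tagged_tokens:
        if tag == "B-PER":
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I-PER":
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

# assemble_names([("Authored", "O"), ("by", "O"),
#                 ("Jane", "I-PER"), ("Smith", "I-PER")]) -> ["Jane Smith"]
```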
Below are the entity tags the BERT NER model can detect:
| Tag | Description |
| --- | --- |
| O | Outside of a named entity |
| B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
| B-PER | Beginning of a person's name right after another person's name |
| B-ORG | Beginning of an organization right after another organization |
| B-LOC | Beginning of a location right after another location |
| I-MISC | Miscellaneous entity |
| I-PER | Person's name |
| I-ORG | Organization |
| I-LOC | Location |
The system implements both extractive and abstractive summarization techniques. Extractive summarization, powered by sumy and NLTK, functions like a skilled editor selecting key sentences from the original text. It employs the TextRank algorithm, treating sentences as nodes in a graph and determining importance based on their interconnections.
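A minimal sketch of this extractive step with sumy's TextRank summarizer (it relies on the NLTK punkt data fetched by download_punkt.sh; the sentence count is an illustrative choice, and the repository's actual parameters may differ):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def extractive_summary(text: str, sentence_count: int = 3) -> str:
    """Rank sentences with TextRank and keep the top few, in document order."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    sentences = TextRankSummarizer()(parser.document, sentence_count)
    return " ".join(str(sentence) for sentence in sentences)
```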
Abstractive summarization, utilizing the Mixtral-8x7B model, generates entirely new text that captures the document's essential meaning. This approach enables paraphrasing, information restructuring, and concept combination for more natural summaries.
Consider this example:
Original Text:
"The company reported revenue growth of 12% in Q1 2024. Operating expenses increased by 5%. Net profit margins expanded to 15%. The board approved a new share buyback program..."
Extractive Summary:
"The company reported revenue growth of 12% in Q1 2024. Net profit margins expanded to 15%."
Abstractive Summary:
"The company demonstrated strong financial performance in Q1 2024, with double-digit revenue growth and improved profitability."
Before deploying the application, ensure your environment and AWS account are properly configured. This section details the requirements and preparation steps for successful deployment.
This application was tested on the following AWS regions:
| Partition | Region |
| --- | --- |
| Commercial | us-east-1 |
| GovCloud | us-gov-west-1 |
A proper development environment requires:
- Local Development Setup
  - Python 3.11 or later
  - AWS SAM CLI for infrastructure deployment
  - AWS CLI v2 for AWS interaction
- AWS Account
  - Service quota increase for the ml.p4d.24xlarge instance
  - SageMaker Domain/Notebook instance to download the models (optional)
  - Hugging Face access token
  - Request model access for Mixtral-8x7B-Instruct-v0.1
  - Grab the model bucket name from S3, which follows this pattern: [aws-account-num]-[region]-[environment (default is dev)]-model-weights-[random string]
  - The application requires two primary models: Mixtral-8x7B for text generation and BERT-NER for entity recognition. These models are prepared using the provided Jupyter notebook in the notebooks directory, which can be executed in a SageMaker Domain, a Notebook instance, or on your local machine. Its outputs are .tar.gz files that are uploaded to the models bucket; ensure that you have write access to the bucket. Note: the Mixtral model is 140+ GB.
```
ModelsBucket/
├── llm-model.tar.gz   # Mixtral model
└── ner-model.tar.gz   # BERT-NER model
```
- Clone the repository:
  git clone <repository-url>
- Go to the project directory:
  cd <project-directory>
- Download the NLTK files:
  bash download_punkt.sh
- Configure AWS credentials:
  - Ensure the AWS CLI is configured with appropriate credentials
  - Verify access to required services
  - Check service quotas
- Build the application:
  sam build
- Deploy the infrastructure:
  sam deploy --parameter-overrides [email protected]
  Note: the email address provided will receive a verification message during deployment and a notification at job completion.
- To remove the deployed stack when finished:
  sam destroy
Before uploading documents, ensure they meet these requirements:
- Format: Plain text (.txt)
- Encoding: UTF-8
- Content: Raw text (for other file types such as PDFs or images, consider running AWS Textract first)
- Upload documents by navigating to your extracts bucket in the S3 AWS Console (a CLI alternative is shown after this list)
- Find the trigger_endpoints function in the Lambda console and execute it
- Processing Stages:
  - Create the required SageMaker endpoints
  - Identify documents in the extracts bucket that are not yet in the extractive summary bucket to determine which files need to be processed
  - Process documents in parallel
  - Generate all outputs
  - Clean up endpoints
  - Send the completion notification
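As an alternative to the console upload in the first step, a document can be copied in with the AWS CLI (the bucket name is a placeholder for your deployed extracts bucket):
aws s3 cp document.txt s3://<extracts-bucket-name>/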
Results are organized in separate buckets:
- Extractive summaries: {ExtractiveSummaryBucket}/{original_filename}
- Abstractive summaries: {AbstractiveSummaryBucket}/{original_filename}
- Generated titles: {GeneratedTitleBucket}/{original_filename}
- Author information: {AuthorExtractionBucket}/{original_filename}
Example Output Structure:
```
ExtractiveSummaryBucket/
└── filename.txt   # Key sentences from document
AbstractiveSummaryBucket/
└── filename.txt   # Fluent summary
GeneratedTitleBucket/
└── filename.txt   # Generated title
AuthorExtractionBucket/
└── filename.txt   # Extracted author names
```
- Email Notification:
  - Receive completion notification
  - Indicates endpoint cleanup
- Results Verification:
  - Check each output bucket
  - Verify file contents
- Nick Biso, Machine Learning Engineer - Amazon Web Services Inc.
- Ian Lunsford, Aerospace Engineer - Amazon Web Services Inc.
- Max Rathmann, Senior DevOps Engineer - Amazon Web Services Inc.
- Jeff Ryan, DevOps Engineer - Amazon Web Services Inc.
- Michael Massey, Cloud Application Architect - Amazon Web Services Inc.
- Brian Weston, Cloud Transformation for Mission Science Program Lead - LLNL
- Samson Odock, Security Engineer - Amazon Web Services Inc.
- Kyle Rose, Data Scientist - Amazon Web Services Inc.