An AWS Lambda function that uses the Docling library to convert documents to Markdown.
- Convert documents (PDF, DOCX, etc.) to markdown format
- Support for both URL-based and base64-encoded document input
- Run locally with Docker for development and testing
- Deploy as an AWS Lambda function using a container image
- Built on IBM Research's Docling library
- Start the local Lambda emulator:
  docker-compose up --build
- Test the function:
  ./test_lambda.sh
  Or manually:
  # Convert from URL
  curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
    -H "Content-Type: application/json" \
    -d '{
      "body": "{\"source_url\": \"https://arxiv.org/pdf/2408.09869\"}"
    }'

  # Convert from base64-encoded document
  curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
    -H "Content-Type: application/json" \
    -d '{
      "body": "{\"document\": \"BASE64_ENCODED_CONTENT\", \"filename\": \"document.pdf\"}"
    }'
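For scripted testing, the manual curl call can be mirrored in Python. The sketch below is not part of the repository: the helper names `build_request` and `invoke_local` are made up for illustration, and it assumes the emulator is listening on the same port as in the curl examples.

```python
import json
import urllib.request

# Emulator endpoint, copied from the curl examples.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_request(inner_body: dict, url: str = RIE_URL) -> urllib.request.Request:
    """Wrap the parameters in an API Gateway style event and prepare a POST."""
    # "body" must be a JSON string, not a nested object.
    event = {"body": json.dumps(inner_body)}
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(inner_body: dict, timeout: float = 300.0) -> dict:
    """Send the event to the running emulator and decode its JSON reply."""
    with urllib.request.urlopen(build_request(inner_body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the emulator running, `invoke_local({"source_url": "https://arxiv.org/pdf/2408.09869"})` should return the same response object that curl prints.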
The Lambda function accepts events with the following format (API Gateway proxy integration):
{
"body": "{\"source_url\": \"https://example.com/document.pdf\"}"
}
Or for base64-encoded documents:
{
"body": "{\"document\": \"base64_encoded_content\", \"filename\": \"document.pdf\"}"
}
Parameters (inside the body JSON):
- source_url (string, optional): URL to the document to convert
- document (string, optional): Base64-encoded document content
- filename (string, optional): Original filename (used when providing base64 content)
Note: Either source_url or document must be provided.
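A small client-side helper can make the double encoding less error-prone. This is a minimal sketch, not code from the project: `build_event` is a hypothetical name, and rejecting calls that pass both inputs is an assumption, since the format above does not say which one takes precedence.

```python
import base64
import json
from pathlib import Path
from typing import Optional


def build_event(source_url: Optional[str] = None,
                file_path: Optional[str] = None) -> dict:
    """Build an event carrying either a source_url or a base64 document."""
    # Assumption: exactly one input is allowed in this sketch.
    if (source_url is None) == (file_path is None):
        raise ValueError("Provide exactly one of source_url or file_path")
    if source_url is not None:
        inner = {"source_url": source_url}
    else:
        path = Path(file_path)
        inner = {
            "document": base64.b64encode(path.read_bytes()).decode("ascii"),
            "filename": path.name,
        }
    # The outer event holds the parameters as a JSON string under "body".
    return {"body": json.dumps(inner)}
```

Calling `build_event(file_path="document.pdf")` yields an event in the second format shown above, with the file contents base64-encoded.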
Success (200):
{
"statusCode": 200,
"headers": {
"Content-Type": "application/json",
"Access-Control-Allow-Origin": "*"
},
"body": "{\"success\": true, \"markdown\": \"# Converted Document\\n\\n...\", \"metadata\": {\"num_pages\": 10, \"source\": \"https://example.com/document.pdf\"}}"
}
Error (400/500):
{
"statusCode": 400,
"body": "{\"success\": false, \"error\": \"Error message\", \"error_type\": \"ExceptionType\"}"
}
- AWS account
- AWS CLI configured
- Docker installed (for building container images)
- Build the Docker image:
  docker build -t docling-lambda .
- Create an ECR repository:
  aws ecr create-repository --repository-name docling-lambda --region us-east-1
- Authenticate Docker with ECR:
  aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
- Tag and push the image:
  docker tag docling-lambda:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/docling-lambda:latest
  docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/docling-lambda:latest
- Create the Lambda function from the container image:
  aws lambda create-function \
    --function-name docling-converter \
    --package-type Image \
    --code ImageUri=<account-id>.dkr.ecr.us-east-1.amazonaws.com/docling-lambda:latest \
    --role arn:aws:iam::<account-id>:role/lambda-execution-role \
    --memory-size 2048 \
    --timeout 300
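- Test the function:
  # AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
  aws lambda invoke \
    --function-name docling-converter \
    --cli-binary-format raw-in-base64-out \
    --payload '{"body": "{\"source_url\": \"https://arxiv.org/pdf/2408.09869\"}"}' \
    response.json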
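The response.json written by the invoke call follows the response format described earlier, so its body needs a second JSON decode. A minimal sketch of unpacking it, where `extract_markdown` and `ConversionError` are illustrative names rather than part of the project:

```python
import json


class ConversionError(Exception):
    """Raised when the function reports a failure (hypothetical helper)."""


def extract_markdown(response: dict) -> str:
    """Return the markdown from a Lambda response, or raise ConversionError."""
    # "body" is itself a JSON string, so decode it separately.
    body = json.loads(response["body"])
    if response.get("statusCode") != 200 or not body.get("success"):
        error_type = body.get("error_type", "Error")
        message = body.get("error", "unknown error")
        raise ConversionError(f"{error_type}: {message}")
    return body["markdown"]
```

To use it on the file from the previous step, load it with `json.load(open("response.json"))` and pass the resulting dictionary to `extract_markdown`.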
To expose the Lambda function as an HTTP API:
- Create an HTTP API in API Gateway
- Add a POST route (e.g., /convert)
- Integrate with the Lambda function
- Enable CORS if needed
docling-nest/
├── lambda_handler.py # Lambda function handler
├── requirements.txt # Python dependencies
├── Dockerfile # AWS Lambda container image
├── docker-compose.yml # Local Lambda RIE setup
├── test_lambda.sh # Local testing script
└── terraform/ # Legacy Terraform configuration
    ├── main.tf
    ├── variables.tf
    └── README.md
- Python 3.11
- Docker and Docker Compose (for local development)
- AWS CLI (for deployment)
The local Docker setup uses AWS Lambda Runtime Interface Emulator (RIE) to simulate the Lambda environment:
# Start the Lambda emulator
docker-compose up --build
# In another terminal, run tests
./test_lambda.sh
# Or run specific tests
./test_lambda.sh url # Test URL-based conversion
./test_lambda.sh base64 # Test base64 document conversion
./test_lambda.sh error # Test error handling
Recommended Lambda settings:
- Memory: 2048 MB (Docling's ML models require significant memory)
- Timeout: 300 seconds (5 minutes, for processing large documents)
- Ephemeral storage: 512 MB (default is sufficient)
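These settings can also be applied from Python. The sketch below assumes boto3 is installed and AWS credentials are configured. The helper names are illustrative, and `docling-converter` is the function name used in the deployment steps.

```python
def recommended_settings(function_name: str) -> dict:
    """Keyword arguments for boto3's update_function_configuration."""
    return {
        "FunctionName": function_name,
        "MemorySize": 2048,                 # MB
        "Timeout": 300,                     # seconds
        "EphemeralStorage": {"Size": 512},  # MB, the Lambda default
    }


def apply_settings(function_name: str) -> dict:
    """Apply the settings to a deployed function (needs AWS credentials)."""
    import boto3  # imported lazily so recommended_settings has no dependencies

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **recommended_settings(function_name)
    )
```

Calling `apply_settings("docling-converter")` updates the deployed function. Keeping the values in one dictionary makes it easy to adjust memory and timeout together.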
Docling supports various document formats, including:
- PDF
- DOCX
- PPTX
- HTML
- Images (PNG, JPG, etc.)
- And more...
See the Docling documentation for the full list.
AWS Lambda pricing is based on:
- Requests: $0.20 per 1 million requests
- Duration: $0.0000166667 per GB-second (about $0.0000333 per second at 2GB memory)
With 2GB memory allocation:
- A 30-second conversion costs approximately $0.001
- Free tier includes 400,000 GB-seconds per month
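The arithmetic behind these figures can be reproduced with a short calculation. This sketch uses only the prices listed above; actual rates vary by region and architecture, and the function names are illustrative.

```python
REQUEST_PRICE = 0.20 / 1_000_000  # USD per request
GB_SECOND_PRICE = 0.0000166667    # USD per GB-second
FREE_GB_SECONDS = 400_000         # monthly free tier


def invocation_cost(memory_mb: int, duration_s: float) -> float:
    """Estimated USD cost of one invocation, ignoring the free tier."""
    gb_seconds = (memory_mb / 1024) * duration_s
    return REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE


def free_tier_invocations(memory_mb: int, duration_s: float) -> int:
    """How many such invocations fit in the monthly free GB-seconds."""
    gb_seconds = (memory_mb / 1024) * duration_s
    return int(FREE_GB_SECONDS // gb_seconds)
```

For example, `invocation_cost(2048, 30)` gives roughly $0.001, and `free_tier_invocations(2048, 30)` shows how many 30-second conversions the free compute allowance covers.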
Container fails to start:
- Ensure Docker has enough memory (at least 4GB recommended)
- Check Docker logs:
docker-compose logs
Conversion fails:
- Some documents may require additional system dependencies
- Check the Docling logs for specific errors
- Ensure the Lambda has sufficient memory
Function timeout:
- Increase Lambda timeout (max 900 seconds)
- Large documents may take longer to process
Memory errors:
- Increase Lambda memory allocation
- Docling's ML models require at least 2GB memory
Cold start latency:
- First invocation may be slow (30-60 seconds) due to model loading
- Consider provisioned concurrency for production workloads
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is provided as-is. The Docling library is licensed under the MIT License.