Commit 1de17ea

docs: ✏️ add PDF Vision guide and script example
1 parent 5b4749e commit 1de17ea

3 files changed: +169 -8 lines changed

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
---
title: "PDF Vision"
keywords: ["genai", "pdf", "markdown", "ocr", "beginner"]
sidebar:
    order: 60
---
import { Code } from "@astrojs/starlight/components"
import source from "../../../../../packages/sample/genaisrc/pdfocr.genai.mts?raw"

Extracting markdown from PDFs is a tricky task that may involve customized toolchains.

Several techniques are used in the field to get the best results:

- One can read the text using pdfjs (which GenAIScript uses). This may give some results, but the text might be garbled or out of order, tables are a challenge, and it won't work for PDFs that contain only images.
- Another technique is to apply an OCR algorithm to segments of the page image to "read" the rendered text.

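The image-only limitation can be checked before choosing a technique. The sketch below is plain TypeScript and not part of GenAIScript: the helper name `looksImageOnly` is hypothetical, and it assumes the text of each page has already been extracted into an array of strings.

```ts
// Hypothetical helper (not a GenAIScript API): reports whether text
// extraction returned nothing useful for any page of a PDF.
function looksImageOnly(pages: string[]): boolean {
    // A scanned PDF typically yields empty or whitespace-only text per page.
    return pages.length > 0 && pages.every((page) => page.trim().length === 0)
}

console.log(looksImageOnly(["", "  \n"])) // true
console.log(looksImageOnly(["# Title", ""])) // false
```

When every page comes back empty, text extraction alone cannot help, and a vision-based approach is the remaining option.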

In this guide, we will build a GenAIScript that uses an LLM with vision support to extract text and images from a PDF, converting each page into markdown.

Let's assume that the user is running our script on a PDF file, so it is the first element of `env.files`.
We use the PDF parser to extract both the pages and images from the PDF file. The `renderAsImage` option is set to `true`, which means each page is also converted into an image.

```ts "renderAsImage: true"
const { pages, images } = await parsers.PDF(env.files[0], {
    renderAsImage: true,
})
```

We create an `ocrs` array to collect the results, then begin a loop that iterates over each page in the PDF.

```ts
const ocrs: string[] = []
for (let i = 0; i < pages.length; ++i) {
    const page = pages[i]
    const image = images[i]
```

For each iteration, we extract the current page and its corresponding image.
We use the `runPrompt` function to process both text and image data.

```ts
    // mix of text and vision
    const res = await runPrompt(
        (ctx) => {
            if (i > 0) ctx.def("PREVIOUS_PAGE", pages[i - 1])
            ctx.def("PAGE", page)
            if (i + 1 < pages.length) ctx.def("NEXT_PAGE", pages[i + 1])
            ctx.defImages(image, { autoCrop: true, greyscale: true })
```

The prompt context `ctx` defines the text of the current page as `PAGE` and, when they exist, the previous and next pages as `PREVIOUS_PAGE` and `NEXT_PAGE`. The page image is attached with auto-cropping and greyscale conversion.

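The neighbor selection can be looked at in isolation. The following sketch is plain TypeScript with no GenAIScript dependency; `pageWindow` and `PageWindow` are hypothetical names, and the sketch assumes that pages are plain strings. It returns the same three values that the callback defines as `PREVIOUS_PAGE`, `PAGE`, and `NEXT_PAGE`, leaving the neighbors undefined at the document boundaries.

```ts
// Hypothetical helper mirroring the boundary checks in the prompt callback.
interface PageWindow {
    previous?: string
    current: string
    next?: string
}

function pageWindow(pages: string[], i: number): PageWindow {
    return {
        // the first page has no previous page
        previous: i > 0 ? pages[i - 1] : undefined,
        current: pages[i],
        // the last page has no next page
        next: i + 1 < pages.length ? pages[i + 1] : undefined,
    }
}

const sample = ["page one", "page two", "page three"]
console.log(pageWindow(sample, 0)) // previous is undefined
console.log(pageWindow(sample, 2)) // next is undefined
```

Passing the neighboring text presumably gives the model context for sentences or tables that continue across a page break.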

```ts
            ctx.$`You are an expert in reading and extracting markdown from a PDF image stored in the attached images.

Your task is to convert the attached image to markdown.

- We used pdfjs-dist to extract the text of the current page in PAGE, the previous page in PREVIOUS_PAGE and the next page in NEXT_PAGE.
- Generate markdown. Do NOT emit explanations.
- Generate CSV tables for tables.
- For images, generate a short alt-text description.
`
```

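This prompt instructs the model to convert the page image into markdown. It explains that the text in `PAGE`, `PREVIOUS_PAGE`, and `NEXT_PAGE` was extracted with `pdfjs-dist`, and it specifies the output: markdown without explanations, CSV for tables, and a short alt-text description for images.
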
```ts
        },
        {
            model: "small",
            label: `page ${i + 1}`,
            cache: "pdf-ocr",
            system: [
                "system",
                "system.assistant",
                "system.safety_jailbreak",
                "system.safety_harmful_content",
            ],
        }
    )
```

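The second argument to `runPrompt` configures the request: `model` selects the `small` model alias, `label` tags each run with its page number, `cache` stores responses in a cache named `pdf-ocr`, and `system` lists the system prompts to include, among them two safety prompts.
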
```ts
    ocrs.push(parsers.unfence(res.text, "markdown") || res.error?.message)
}
```

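For each page, `parsers.unfence` removes the `markdown` code fence that the model may wrap around its answer, and the resulting text is added to the `ocrs` array. If the response has no text, the error message is stored instead.
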
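To make the fence handling concrete, here is a simplified stand-in written in plain TypeScript. It is not the implementation of `parsers.unfence`; the function `stripFence` and its matching rules are assumptions made for illustration. It removes a surrounding code fence with the given language tag and otherwise returns the text unchanged.

```ts
// Simplified stand-in for illustration; NOT the parsers.unfence implementation.
const FENCE = "`".repeat(3) // built at runtime to avoid nesting fences here

function stripFence(text: string, lang: string): string {
    const trimmed = text.trim()
    const open = FENCE + lang
    if (trimmed.startsWith(open + "\n") && trimmed.endsWith("\n" + FENCE)) {
        // drop the opening fence line and the closing fence
        return trimmed.slice(open.length + 1, trimmed.length - FENCE.length - 1)
    }
    // no matching fence: leave the text as it is
    return text
}

const fenced = FENCE + "markdown\n# Title\n\nSome text.\n" + FENCE
console.log(stripFence(fenced, "markdown")) // heading and paragraph, no fence
console.log(stripFence("plain answer", "markdown")) // unchanged
```

An empty string is falsy in JavaScript, which is why the script can fall back to `res.error?.message` with `||` when no text comes back.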

```ts
console.log(ocrs.join("\n\n"))
```

Finally, we print the collected page results, joined by blank lines, as a single markdown document.

## Running the Script

To run this script with the GenAIScript CLI, open a terminal and execute:

```bash
npx --yes genaiscript run pdfocr <mypdf.pdf>
```

For more details on installing and setting up the GenAIScript CLI, refer to the [official documentation](https://microsoft.github.io/genaiscript/getting-started/installation).

This script provides a straightforward way to convert PDFs into markdown, making it easier to work with their contents programmatically. Happy coding! 🚀

## Full source

The full script source code is available below:

<Code
    code={source}
    wrap={true}
    lang="js"
    title="pdfocr.genai.mts"
/>

packages/sample/genaisrc/pdf-ocr.genai.mjs

Lines changed: 0 additions & 8 deletions
This file was deleted.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
script({
    files: "src/pdf/jacdac.pdf",
})

for (const file of env.files.filter((f) => f.filename.endsWith(".pdf"))) {
    // extract text and render pages as images
    const { pages, images } = await parsers.PDF(file, {
        renderAsImage: true,
    })
    console.log(`pages: ${pages.length}`)
    const ocrs: string[] = []

    for (let i = 0; i < pages.length; ++i) {
        const page = pages[i]
        const image = images[i]
        // todo: orientation

        // mix of text and vision
        const res = await runPrompt(
            (ctx) => {
                if (i > 0) ctx.def("PREVIOUS_PAGE", pages[i - 1])
                ctx.def("PAGE", page)
                if (i + 1 < pages.length) ctx.def("NEXT_PAGE", pages[i + 1])
                ctx.defImages(image, { autoCrop: true, greyscale: true })
                ctx.$`You are an expert in reading and extracting markdown from a PDF image stored in the attached images.

Your task is to convert the attached image to markdown.

- We used pdfjs-dist to extract the text of the current page in PAGE, the previous page in PREVIOUS_PAGE and the next page in NEXT_PAGE.
- Generate markdown. Do NOT emit explanations.
- Generate CSV tables for tables.
- For images, generate a short alt-text description.
`
            },
            {
                model: "small",
                label: `page ${i + 1}`,
                cache: "pdf-ocr",
                system: [
                    "system",
                    "system.assistant",
                    "system.safety_jailbreak",
                    "system.safety_harmful_content",
                ],
            }
        )

        ocrs.push(parsers.unfence(res.text, "markdown") || res.error?.message)
    }

    await workspace.writeText(file.filename + ".md", ocrs.join("\n\n"))
}
