| 1 | +--- |
| 2 | +title: "PDF Vision" |
| 3 | +keywords: ["genai", "pdf", "markdown", "ocr", "beginner"] |
| 4 | +sidebar: |
| 5 | + order: 60 |
| 6 | +--- |
| 7 | +import { Code } from "@astrojs/starlight/components" |
| 8 | +import source from "../../../../../packages/sample/genaisrc/pdfocr.genai.mts?raw" |
| 9 | + |
| 10 | +Extracting markdown from PDFs is a tricky task that may involve customized toolchains. |
| 11 | + |
| 12 | +Several techniques are used in practice, each with trade-offs: |
| 13 | + |
| 14 | +- Read the embedded text using pdfjs (this is what GenAIScript uses). This may give usable results, but the text might be garbled or out of order, tables are hard to recover, and it does not work for PDFs that contain only images. |
| 15 | +- Apply an OCR algorithm to segments of the rendered image to "read" the text. |
| 16 | + |
| 17 | +In this guide, we will build a GenAIScript that uses an LLM with vision support to extract text and images from a PDF, converting each page into markdown. |
| 18 | + |
| 19 | +Let's assume that the user is running our script on a PDF file, so it is the first element of `env.files`. |
| 20 | +We use the PDF parser to extract both the pages and images from the PDF file. The `renderAsImage` option is set to `true`, which means each page is also converted into an image. |
| 21 | + |
| 22 | +```ts "renderAsImage: true" |
| 23 | +const { pages, images } = await parsers.PDF(env.files[0], { |
| 24 | +    renderAsImage: true, |
| 25 | +}) |
| 26 | +``` |
| 27 | + |
| 28 | +We begin a loop that iterates over each page in the PDF. |
| 29 | + |
| 30 | +```ts |
| 31 | +for (let i = 0; i < pages.length; ++i) { |
| 32 | +    const page = pages[i] |
| 33 | +    const image = images[i] |
| 34 | +``` |
| 35 | + |
| 36 | +For each iteration, we extract the current page and its corresponding image. |
| 37 | +We use the `runPrompt` function to process both text and image data. |
| 38 | + |
| 39 | +```ts |
| 40 | +    // mix of text and vision |
| 41 | +    const res = await runPrompt( |
| 42 | +        (ctx) => { |
| 43 | +            if (i > 0) ctx.def("PREVIOUS_PAGE", pages[i - 1]) |
| 44 | +            ctx.def("PAGE", page) |
| 45 | +            if (i + 1 < pages.length) ctx.def("NEXT_PAGE", pages[i + 1]) |
| 46 | +            ctx.defImages(image, { autoCrop: true, greyscale: true }) |
| 47 | +``` |
| 48 | + |
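| 49 | +The context `ctx` defines the extracted text of the current page as `PAGE` and, when they exist, the previous and next pages as `PREVIOUS_PAGE` and `NEXT_PAGE`. The first page has no previous page and the last page has no next page, hence the two guards. The page image is attached with auto-cropping and greyscale conversion enabled. |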
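
The boundary handling for neighboring pages can be isolated and checked on its own. The following is a minimal sketch in plain TypeScript, not part of GenAIScript: `pageWindow` and `PageWindow` are hypothetical names, and pages are assumed to be plain strings rather than the values returned by the PDF parser.

```ts
// Hypothetical helper (not a GenAIScript API): list the named text
// sections that the prompt would define for the page at index i.
type PageWindow = { name: string; text: string }[]

function pageWindow(pages: string[], i: number): PageWindow {
    const sections: PageWindow = []
    // the first page has no predecessor
    if (i > 0) sections.push({ name: "PREVIOUS_PAGE", text: pages[i - 1] })
    sections.push({ name: "PAGE", text: pages[i] })
    // the last page has no successor
    if (i + 1 < pages.length)
        sections.push({ name: "NEXT_PAGE", text: pages[i + 1] })
    return sections
}
```

For a three-page document, index `0` yields `PAGE` and `NEXT_PAGE`, index `1` yields all three sections, and index `2` yields `PREVIOUS_PAGE` and `PAGE`.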
| 50 | + |
| 51 | +```ts |
| 52 | +ctx.$`You are an expert in reading and extracting markdown from a PDF image stored in the attached images. |
| 53 | + |
| 54 | + Your task is to convert the attached image to markdown. |
| 55 | + |
| 56 | + - We used pdfjs-dist to extract the text of the current page in PAGE, the previous page in PREVIOUS_PAGE and the next page in NEXT_PAGE. |
| 57 | + - Generate markdown. Do NOT emit explanations. |
| 58 | + - Generate CSV tables for tables. |
| 59 | + - For images, generate a short alt-text description. |
| 60 | + ` |
| 61 | +``` |
| 62 | + |
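| 63 | +This prompt instructs the model to convert the page image into markdown. It explains that the text in `PAGE`, `PREVIOUS_PAGE` and `NEXT_PAGE` was extracted with `pdfjs-dist`, and it specifies the output: markdown without explanations, CSV for tables, and a short alt-text description for images. |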
| 64 | + |
| 65 | +```ts |
| 66 | +        }, |
| 67 | +        { |
| 68 | +            model: "small", |
| 69 | +            label: `page ${i + 1}`, |
| 70 | +            cache: "pdf-ocr", |
| 71 | +            system: [ |
| 72 | +                "system", |
| 73 | +                "system.assistant", |
| 74 | +                "system.safety_jailbreak", |
| 75 | +                "system.safety_harmful_content", |
| 76 | +            ], |
| 77 | +        } |
| 78 | +    ) |
| 79 | +``` |
| 80 | +
|
| 81 | +The second argument configures the run: it selects the `small` model alias, labels each run with its page number, caches responses under the `pdf-ocr` name, and loads a set of system prompts, two of which are safety prompts. |
| 82 | + |
| 83 | +```ts |
| 84 | +    ocrs.push(parsers.unfence(res.text, "markdown") || res.error?.message) |
| 85 | +} |
| 86 | +``` |
| 87 | + |
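| 88 | +For each page, `parsers.unfence` strips the `markdown` code fence that the model may wrap around its answer, and the result is added to the `ocrs` array, which must be declared as an empty array before the loop. If the response contains no text, the error message is stored instead. |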
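
To make the fence stripping and the `||` fallback concrete, here is a simplified sketch in plain TypeScript. It is a stand-in for illustration only and is not how GenAIScript implements `parsers.unfence`: `stripFence` and `pageOutput` are hypothetical names, and the sketch assumes the whole response is at most one fenced block.

```ts
// Simplified stand-in for fence stripping; NOT the GenAIScript implementation.
function stripFence(text: string, language: string): string {
    const lines = text.trim().split("\n")
    const opens = lines[0].trim() === "```" + language
    const closes = lines.length > 1 && lines[lines.length - 1].trim() === "```"
    // only unwrap when the whole text is a single fenced block
    if (!(opens && closes)) return text
    return lines.slice(1, -1).join("\n")
}

// Mirrors the `||` fallback in the script: an empty string is falsy,
// so the error message (if any) is used instead.
function pageOutput(text: string, errorMessage?: string): string | undefined {
    return stripFence(text, "markdown") || errorMessage
}
```

Because the fallback relies on `||`, a page whose response is an empty string is treated the same as a failed request.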
| 89 | + |
| 90 | +```ts |
| 91 | +console.log(ocrs.join("\n\n")) |
| 92 | +``` |
| 93 | + |
| 94 | +Finally, we print out all the collected OCR results in markdown format. |
| 95 | + |
| 96 | +## Running the Script |
| 97 | + |
| 98 | +To run this script with the GenAIScript CLI, open a terminal and execute: |
| 99 | + |
| 100 | +```bash |
| 101 | +npx --yes genaiscript run pdfocr <mypdf.pdf> |
| 102 | +``` |
| 103 | + |
| 104 | +For more details on installing and setting up the GenAIScript CLI, refer to the [official documentation](https://microsoft.github.io/genaiscript/getting-started/installation). |
| 105 | + |
| 106 | +This script provides a straightforward way to convert PDFs into markdown, making it easier to work with their contents programmatically. Happy coding! 🚀 |
| 107 | + |
| 108 | +## Full source |
| 109 | + |
| 110 | +The full script source code is available below: |
| 111 | + |
| 112 | +<Code |
| 113 | + code={source} |
| 114 | + wrap={true} |
| 115 | + lang="ts" |
| 116 | + title="pdfocr.genai.mts" |
| 117 | +/> |