PDF Scrape

Print demo.html to demo.pdf, or use your own PDF document
Go to https://mozilla.github.io/pdf.js/getting_started
Download the Stable build
Extract pdf.js, pdf.worker.js, and their corresponding *.map files into this directory
Create index.html and reference pdf.js from it:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
  </head>
  <body>

  </body>
</html>

Create index.js and reference it from index.html:

index.js

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>

  </body>
</html>

Update index.js with code to load the document and get its first page:

index.js

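void async function () {
  // This constant shadows the DOM `document`; use `window.document` for the DOM
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  // Page numbers start at 1
  const page = await document.getPage(1);
}()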
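
PDF.js parses documents in the separate pdf.worker.js script. Depending on the PDF.js version, the library may find the worker next to pdf.js on its own or may expect the path to be set through `pdfjsLib.GlobalWorkerOptions.workerSrc`. The following is a minimal sketch of setting it explicitly; the helper name `configureWorker` and the default path are assumptions for illustration, not part of this repo.

```javascript
// Point PDF.js at its worker script explicitly.
// `lib` is expected to be the global `pdfjsLib`. The default path assumes
// pdf.worker.js sits next to index.html (an assumption of this sketch).
function configureWorker(lib, workerSrc = 'pdf.worker.js') {
  lib.GlobalWorkerOptions.workerSrc = workerSrc;
  return lib;
}

// In the browser, run this before calling pdfjsLib.getDocument(...).
// The guard keeps the snippet harmless where PDF.js is not loaded.
if (typeof pdfjsLib !== 'undefined') {
  configureWorker(pdfjsLib);
}
```

If used, this would go at the top of index.js, ahead of the code that loads demo.pdf.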

Add a canvas element to index.html where the page will be rendered:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>
    <canvas id="pageCanvas"></canvas>
  </body>
</html>

Extend the code to wait for the window to load, then render the page to the canvas context:

index.js

window.addEventListener('load', async () => {
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
  const viewport = page.getViewport({ scale: 1 });
  const canvas = window.document.getElementById('pageCanvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  const context = canvas.getContext('2d');
  // render() returns a task; await its promise so rendering has finished
  await page.render({ canvasContext: context, viewport }).promise;
});
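
With `scale: 1` the canvas can look blurry on high-density displays. One common approach, sketched here, is to size the canvas backing store in device pixels, keep its CSS size in page units, and pass a matching `transform` to `page.render`. The helper names `hiDpiCanvasLayout` and `renderSharp` are hypothetical and not part of this repo.

```javascript
// Compute canvas dimensions for a given device pixel ratio.
// `viewport` only needs `width` and `height` (as returned by page.getViewport).
function hiDpiCanvasLayout(viewport, outputScale = 1) {
  return {
    // Backing store size in device pixels
    width: Math.floor(viewport.width * outputScale),
    height: Math.floor(viewport.height * outputScale),
    // On-screen size in CSS pixels
    cssWidth: `${Math.floor(viewport.width)}px`,
    cssHeight: `${Math.floor(viewport.height)}px`,
    // Extra transform for page.render, or null when no scaling is needed
    transform: outputScale !== 1 ? [outputScale, 0, 0, outputScale, 0, 0] : null,
  };
}

// Browser-only usage sketch; defined but not called here.
async function renderSharp(page, canvas) {
  const viewport = page.getViewport({ scale: 1 });
  const layout = hiDpiCanvasLayout(viewport, window.devicePixelRatio || 1);
  canvas.width = layout.width;
  canvas.height = layout.height;
  canvas.style.width = layout.cssWidth;
  canvas.style.height = layout.cssHeight;
  const canvasContext = canvas.getContext('2d');
  await page.render({ canvasContext, transform: layout.transform, viewport }).promise;
}
```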

Hook up code to extract the text and to highlight text and images (see the code in this repo, TomasHubelbauer/pdf-scrape)
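
The text part of this step can be sketched with `page.getTextContent()`, which resolves to an object whose `items` carry a string (`str`) and a six-element `transform`. The sketch below is not the repo's implementation. It assumes unrotated text, assumes `item.width` is in unscaled page units, and uses hypothetical helper names (`multiplyTransforms`, `itemToCanvasRect`, `groupItemsIntoLines`, `highlightPageText`). Image highlighting is not covered.

```javascript
// Multiply two 2D affine transforms given as [a, b, c, d, e, f] arrays.
// Written out here so the sketch runs without PDF.js.
function multiplyTransforms(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Convert a text item to a rectangle in canvas coordinates.
// `viewport` needs `transform` and `scale` (as on a PDF.js viewport).
function itemToCanvasRect(item, viewport) {
  const tx = multiplyTransforms(viewport.transform, item.transform);
  const height = Math.hypot(tx[2], tx[3]); // approximate font height
  return { x: tx[4], y: tx[5] - height, width: item.width * viewport.scale, height };
}

// Group items whose baselines are within `tolerance` units into lines,
// top to bottom (PDF y grows upward), each line ordered left to right.
function groupItemsIntoLines(items, tolerance = 2) {
  const byY = [...items].sort((a, b) => b.transform[5] - a.transform[5]);
  const lines = [];
  for (const item of byY) {
    const y = item.transform[5];
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - y) <= tolerance) {
      last.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(' ')
  );
}

// Browser-only usage sketch; defined but not called here.
async function highlightPageText(page, viewport, context) {
  const { items } = await page.getTextContent();
  context.strokeStyle = 'red';
  for (const item of items) {
    const { x, y, width, height } = itemToCanvasRect(item, viewport);
    context.strokeRect(x, y, width, height);
  }
  return groupItemsIntoLines(items);
}
```

`highlightPageText` would be called after the page has finished rendering, with the same `viewport` and 2D `context` used for rendering, so the outlines are drawn on top of the page.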
