- Print demo.html to demo.pdf or use your own document
- Go to https://mozilla.github.io/pdf.js/getting_started
- Download Stable
- Extract pdf.js and pdf.worker.js and their corresponding *.map files here
- Make index.html and reference PDF.js:
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>PDF Scrape</title>
<script src="pdf.js"></script>
</head>
<body>
</body>
</html>
- Create index.js and reference it from index.html:
index.js
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>PDF Scrape</title>
<script src="pdf.js"></script>
<script src="index.js"></script>
</head>
<body>
</body>
</html>
- Update index.js with code to load the document and get its first page:
index.js
void async function () {
  // Load the PDF and fetch its first page (page numbers are 1-based)
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
}()
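The snippet above only fetches page 1. If your own document has more pages, a loop over `numPages` is one way to collect them all. This is a minimal sketch: `getAllPages` is a hypothetical helper name (not part of PDF.js), and it assumes the object passed in is what `pdfjsLib.getDocument(...).promise` resolves to.

```javascript
// Sketch: collect every page of a loaded PDF.js document.
// `getAllPages` is a hypothetical helper, not a PDF.js function.
async function getAllPages(pdfDocument) {
  const pages = [];
  // PDF.js page numbers are 1-based, so count from 1 to numPages inclusive.
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    pages.push(await pdfDocument.getPage(pageNumber));
  }
  return pages;
}

// Usage inside index.js (browser only):
//   const pdfDocument = await pdfjsLib.getDocument('demo.pdf').promise;
//   const pages = await getAllPages(pdfDocument);
```

Fetching pages one at a time keeps the order predictable. For large documents it may be better to fetch pages lazily as they are needed.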
- Add a canvas element to index.html where the page will be rendered:
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>PDF Scrape</title>
<script src="pdf.js"></script>
<script src="index.js"></script>
</head>
<body>
<canvas id="pageCanvas"></canvas>
</body>
</html>
- Extend the code to render the page to the canvas context:
index.js
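window.addEventListener('load', async () => {
  // Note: this local `document` shadows the global DOM document
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
  const viewport = page.getViewport({ scale: 1 });
  // The DOM document is still reachable through `window.document`
  const canvas = window.document.getElementById('pageCanvas');
  // Size the canvas to match the page at the chosen scale
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  const context = canvas.getContext('2d');
  // Wait for rendering to finish so that failures surface as rejections
  await page.render({ canvasContext: context, viewport }).promise;
});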
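At `scale: 1` the page can look blurry on high-density screens. The PDF.js examples handle this by enlarging the canvas backing store and passing a `transform` to `page.render`. The sketch below isolates that arithmetic. `hiDpiCanvasLayout` is a hypothetical helper name, and `outputScale` is assumed to come from `window.devicePixelRatio`.

```javascript
// Sketch: compute canvas dimensions for a sharp render on HiDPI screens.
// `hiDpiCanvasLayout` is a hypothetical helper, not a PDF.js function.
function hiDpiCanvasLayout(viewport, outputScale) {
  return {
    // Backing-store size in device pixels
    width: Math.floor(viewport.width * outputScale),
    height: Math.floor(viewport.height * outputScale),
    // On-screen size in CSS pixels
    styleWidth: Math.floor(viewport.width) + 'px',
    styleHeight: Math.floor(viewport.height) + 'px',
    // Scale the drawing operations only when the screen is not 1:1
    transform: outputScale !== 1 ? [outputScale, 0, 0, outputScale, 0, 0] : null,
  };
}

// Usage inside the load handler (browser only):
//   const layout = hiDpiCanvasLayout(viewport, window.devicePixelRatio || 1);
//   canvas.width = layout.width;
//   canvas.height = layout.height;
//   canvas.style.width = layout.styleWidth;
//   canvas.style.height = layout.styleHeight;
//   await page.render({ canvasContext: context, viewport, transform: layout.transform }).promise;
```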
- Hook up code to extract text and highlight text and images (see this repo)
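As a starting point for the text extraction step, here is a minimal sketch built on `page.getTextContent()`. It covers text only, not highlighting or images, and it is not necessarily how the linked repo does it. `extractPageLines` and its `tolerance` parameter are hypothetical names, and grouping items by their baseline (`transform[5]`) is a heuristic that assumes unrotated text.

```javascript
// Sketch: pull the text of one page and group it into lines.
// `extractPageLines` and `tolerance` are hypothetical, not part of PDF.js.
async function extractPageLines(page, tolerance = 2) {
  const textContent = await page.getTextContent();
  const lines = [];
  for (const item of textContent.items) {
    // Skip entries that carry no text (for example marked-content markers)
    if (typeof item.str !== 'string' || item.str === '') continue;
    // transform is [a, b, c, d, x, y]; y is the baseline in PDF units
    const baseline = item.transform[5];
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.baseline - baseline) <= tolerance) {
      // Same baseline (within tolerance): continue the current line
      last.text += item.str;
    } else {
      lines.push({ baseline, text: item.str });
    }
  }
  return lines.map((line) => line.text);
}

// Usage inside the load handler (browser only):
//   const lines = await extractPageLines(page);
//   console.log(lines.join('\n'));
```

Each item also carries its position in `transform`, which is what a highlighting step would need to map text back onto the canvas.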