add a ton of example docs from the unstructured site
alnutile committed Apr 21, 2024
1 parent 07ce85a commit 0f59316
Showing 181 changed files with 189,045 additions and 0 deletions.
Binary file not shown.
Binary file added tests/example-docs/CantinaBand3.wav
Binary file not shown.
Binary file added tests/example-docs/DA-1p.heic
Binary file not shown.
Binary file added tests/example-docs/DA-1p.jpg
Binary file added tests/example-docs/DA-1p.pdf
Binary file not shown.
Binary file added tests/example-docs/DA-1p.png
Binary file added tests/example-docs/DA-619p.pdf
Binary file not shown.
23 changes: 23 additions & 0 deletions tests/example-docs/README.md
@@ -0,0 +1,23 @@
## Example Docs

The sample docs directory contains the following files:

- `example-10k.html` - A 10-K SEC filing in HTML format
- `layout-parser-paper.pdf` - A PDF copy of the layout parser paper
- `factbook.xml`/`factbook.xsl` - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are
instructions for pulling in some sample docs that are too big to store in the repo.

#### XBRL 10-K

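You can download an example 10-K in inline XBRL format with the following `curl` command.
You must set the user agent header, or the SEC site will reject your request. Substitute
your own organization name and email address for the `${organization}` and `${email}` placeholders.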

```bash
curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
```

You can parse this document using the HTML parser.
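
The same download can be sketched in Python using only the standard library. This is a
minimal illustration, not part of the library: the helper names `build_sec_request` and
`download_filing` are hypothetical, and the sketch assumes the SEC only needs a
`User-Agent` header of the form `organization email`, as in the `curl` command above.

```python
from urllib.request import Request, urlopen

# URL of the example filing, taken from the curl command above.
FILING_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(organization: str, email: str, url: str = FILING_URL) -> Request:
    """Build a GET request whose User-Agent identifies the caller (hypothetical helper)."""
    if not organization or not email:
        raise ValueError("organization and email are both required")
    return Request(url, headers={"User-Agent": f"{organization} {email}"})


def download_filing(organization: str, email: str, dest: str) -> None:
    """Download the filing to `dest`. This performs a network call."""
    with urlopen(build_sec_request(organization, email)) as response:
        data = response.read()
    with open(dest, "wb") as fh:
        fh.write(data)
```

Calling `download_filing("Example Org", "contact@example.com", "0001171843-21-001344.txt")`
with your own organization and email should save the filing under the same name that
`curl -O` would use.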
27 changes: 27 additions & 0 deletions tests/example-docs/README.org
@@ -0,0 +1,27 @@
* Example Docs

The sample docs directory contains the following files:

- ~example-10k.html~ - A 10-K SEC filing in HTML format
- ~layout-parser-paper.pdf~ - A PDF copy of the layout parser paper
- ~factbook.xml~ / ~factbook.xsl~ - Example XML/XSL files that you
  can use to test stylesheets

These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.

** XBRL 10-K

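You can download an example 10-K in inline XBRL format with the following
~curl~ command. You must set the user agent header, or the SEC site will
reject your request. Substitute your own organization name and email
address for the ~${organization}~ and ~${email}~ placeholders.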

#+BEGIN_SRC bash
curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
#+END_SRC

You can parse this document using the HTML parser.
28 changes: 28 additions & 0 deletions tests/example-docs/README.rst
@@ -0,0 +1,28 @@
Example Docs
------------

The sample docs directory contains the following files:

- ``example-10k.html`` - A 10-K SEC filing in HTML format
- ``layout-parser-paper.pdf`` - A PDF copy of the layout parser paper
- ``factbook.xml``/``factbook.xsl`` - Example XML/XSL files that you
  can use to test stylesheets

These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.

XBRL 10-K
^^^^^^^^^

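You can download an example 10-K in inline XBRL format with the following
``curl`` command. You must set the user agent header, or the SEC site will
reject your request. Substitute your own organization name and email
address for the ``${organization}`` and ``${email}`` placeholders.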

.. code:: bash

    curl -O \
      -A '${organization} ${email}' \
      https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.
Binary file added tests/example-docs/all-number-table.pdf
Binary file not shown.
63,845 changes: 63,845 additions & 0 deletions tests/example-docs/book-war-and-peace-1225p.txt

Large diffs are not rendered by default.

62 changes: 62 additions & 0 deletions tests/example-docs/book-war-and-peace-1p.txt
@@ -0,0 +1,62 @@
CHAPTER I

"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist--I really believe he is Antichrist--I will have nothing more
to do with you and you are no longer my friend, no longer my 'faithful
slave,' as you call yourself! But how do you do? I see I have frightened
you--sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known Anna Pavlovna
Scherer, maid of honor and favorite of the Empress Marya Fedorovna. With
these words she greeted Prince Vasili Kuragin, a man of high rank and
importance, who was the first to arrive at her reception. Anna Pavlovna
had had a cough for some days. She was, as she said, suffering from la
grippe; grippe being then a new word in St. Petersburg, used only by the
elite.

All her invitations without exception, written in French, and delivered
by a scarlet-liveried footman that morning, ran as follows:

"If you have nothing better to do, Count (or Prince), and if the
prospect of spending an evening with a poor invalid is not too terrible,
I shall be very charmed to see you tonight between 7 and 10--Annette
Scherer."

"Heavens! what a virulent attack!" replied the prince, not in the least
disconcerted by this reception. He had just entered, wearing an
embroidered court uniform, knee breeches, and shoes, and had stars on
his breast and a serene expression on his flat face. He spoke in that
refined French in which our grandfathers not only spoke but thought, and
with the gentle, patronizing intonation natural to a man of importance
who had grown old in society and at court. He went up to Anna Pavlovna,
kissed her hand, presenting to her his bald, scented, and shining head,
and complacently seated himself on the sofa.

"First of all, dear friend, tell me how you are. Set your friend's mind
at rest," said he without altering his tone, beneath the politeness and
affected sympathy of which indifference and even irony could be
discerned.

"Can one be well while suffering morally? Can one be calm in times like
these if one has any feeling?" said Anna Pavlovna. "You are staying the
whole evening, I hope?"

"And the fete at the English ambassador's? Today is Wednesday. I must
put in an appearance there," said the prince. "My daughter is coming for
me to take me there."

"I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome."

"If they had known that you wished it, the entertainment would have been
put off," said the prince, who, like a wound-up clock, by force of habit
said things he did not even wish to be believed.

"Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything."

"What can one say about it?" replied the prince in a cold, listless
tone. "What has been decided? They have decided that Buonaparte has
burnt his boats, and I believe that we are ready to burn ours."
Binary file added tests/example-docs/category-level.docx
Binary file not shown.
Binary file added tests/example-docs/chevron-page.pdf
Binary file not shown.
Binary file added tests/example-docs/chi_sim_image.jpeg
Binary file added tests/example-docs/copy-protected.pdf
Binary file not shown.
Binary file added tests/example-docs/docx-hdrftr.docx
Binary file not shown.
Binary file added tests/example-docs/docx-shapes.docx
Binary file not shown.
Binary file added tests/example-docs/docx-tables.docx
Binary file not shown.
Binary file added tests/example-docs/double-column-A.jpg
Binary file added tests/example-docs/double-column-B.jpg
Binary file added tests/example-docs/embedded-images-tables.jpg
Binary file added tests/example-docs/embedded-images-tables.pdf
Binary file not shown.
Binary file added tests/example-docs/embedded-images.pdf
Binary file not shown.
Binary file added tests/example-docs/embedded-link.pdf
Binary file not shown.
67 changes: 67 additions & 0 deletions tests/example-docs/eml/email-equals-attachment-filename.eml
@@ -0,0 +1,67 @@
Return-Path: <[email protected]>
Delivered-To: [email protected]
Received: from mail-il1-x135.google.com (mail-il1-x135.google.com [IPv6:2607:f8b0:4864:20::135])
by spool.mail.gandi.net (Postfix) with ESMTPS id 30071740049
for <[email protected]>; Sun, 13 Aug 2023 22:00:09 +0000 (UTC)
Received: by mail-il1-x135.google.com with SMTP id e9e14a558f8ab-34aa0845837so895295ab.1
for <[email protected]>; Sun, 13 Aug 2023 15:00:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20221208; t=1691964008; x=1692568808;
h=to:subject:message-id:date:from:mime-version:from:to:cc:subject
:date:message-id:reply-to;
bh=u2zZbdTcme/MRpud6yh6mKzbHh7iBKn7qvZ1YZJcZuQ=;
b=LxHBRFvl8tDcIithe7Il7GC7rAEu5QHGoko+PZll4SUDgh0gYHu35ksEuMO3bBT3sB
UGM5/Obbn+17F+DL0Mk/Zyc/6gG15lNMVLcr9+Fzjt2hDkrcUsEAkmS9chFiF0asGebj
F3vn1FJ9ZDi3IISHeD80PzmhT23Zp4ELjrfEGv2go7Psb320wzL58mHObkhz2spXEK0c
YzlCkJd8hBz2wI5mKedzf4mLdbTUZhPpmycvS+NkNwxQzaMXouLEkBvOXticqPQHvbTe
IiTb2JsaTFEJCfDVjhzIuGA6fFqNmH7hz7Fjh6eW66msB2QCIAhWHIIQ0Uy0Lx0FaQeo
pA5w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20221208; t=1691964008; x=1692568808;
h=to:subject:message-id:date:from:mime-version:x-gm-message-state
:from:to:cc:subject:date:message-id:reply-to;
bh=u2zZbdTcme/MRpud6yh6mKzbHh7iBKn7qvZ1YZJcZuQ=;
b=InxJceRYlLILD0JdetCcOea42zYYLr8BYlhXxpB07TuFAiKbIV28vxmp7XaEa6YIuA
IkpNpvLWrlRJpLsKvJF9QRdIAt83p+91zvd2BW8M0P7AP04KofvzGbbCFG67tjr23K7S
YYuVSIXgVli3sjbIMsxq/JnaHWk1fnrGBvpnMLEEekqsdXpyL5GJ0yN0Qb/4lBZO1uO9
oZ0gbwqEMA/eHAnpH5W/g9ubkVcXzfjSPCRzzNhXfOEGn3Cc5sAuEH03iVuVIKMe9FJg
sO5iyah9+tjnm1NBWCk2qSIuCJrA0YvqcoztgpmJYDDQtG6scHRL83DdMx7phwRlVd/l
S6rQ==
X-Gm-Message-State: AOJu0YzoTpbToiITeHpRUQB8Tc5krfAtkhP2TRgs0WdgPAgfeUixZft6
vGUz3KcsN2V+qf2+RQPiveSjelXe81VfycqaH+I2hUNd
X-Google-Smtp-Source: AGHT+IFzHJ5xiuLxHriivr/CAV7z2Qo6Jep/LEhlzu4GiHEoXTFGC1DZ/MTDROwUz3fXKlKLU6uBzylF4XSOdKWfTW8=
X-Received: by 2002:a05:6e02:1bee:b0:349:2d1d:e463 with SMTP id
y14-20020a056e021bee00b003492d1de463mr12404311ilv.13.1691964008036; Sun, 13
Aug 2023 15:00:08 -0700 (PDT)
MIME-Version: 1.0
From: Test User <[email protected]>
Date: Sun, 13 Aug 2023 14:59:56 -0700
Message-ID: <CABBgHeGAW=UW77EE7p4CsCuaudixAYUU8iPqsm3=[email protected]>
Subject: Odd filename example
To: [email protected]
Content-Type: multipart/mixed; boundary="000000000000ac11b20602d51124"
X-GND-Status: LEGIT

--000000000000ac11b20602d51124
Content-Type: multipart/alternative; boundary="000000000000ac11b10602d51122"

--000000000000ac11b10602d51122
Content-Type: text/plain; charset="UTF-8"

Below is an example of an odd filename

--000000000000ac11b10602d51122
Content-Type: text/html; charset="UTF-8"

<div dir="ltr">Below is an example of an odd filename</div>

--000000000000ac11b10602d51122--
--000000000000ac11b20602d51124
Content-Type: text/plain; charset="US-ASCII"; name="odd=file=name.txt"
Content-Disposition: attachment; filename="odd=file=name.txt"
Content-Transfer-Encoding: base64
Content-ID: <f_ll9zod670>
X-Attachment-Id: f_ll9zod670

T2RkIGZpbGVuYW1lCg==
--000000000000ac11b20602d51124--
23 changes: 23 additions & 0 deletions tests/example-docs/eml/email-inline-content-disposition.eml
@@ -0,0 +1,23 @@
From: Test User <[email protected]>
To: [email protected]
Subject: Testing Inline
Message-ID: <CAPgCCDEzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.example.com>
X-Mailer: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="MP_/pqqQE0ldE7RYQcFW3Kd0aQV"

--MP_/pqqQE0ldE7RYQcFW3Kd0aQV
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

This is a test of inline

--MP_/pqqQE0ldE7RYQcFW3Kd0aQV
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=t.txt

test

--MP_/pqqQE0ldE7RYQcFW3Kd0aQV--
139 changes: 139 additions & 0 deletions tests/example-docs/eml/email-no-html-content-1.eml
@@ -0,0 +1,139 @@
X-Gmail-Labels: Archived,Category Updates,Unread
Delivered-To: [email protected]
Received: by 2002:a05:651c:98a:0:0:0:0 with SMTP id b10csp1857290ljq;
Mon, 10 Oct 2022 05:14:10 -0700 (PDT)
X-Google-Smtp-Source: AMsMyM59wmLQXJ73eH+8nTYP0CP3MwUawP3ir019apX61OB0CVRaP92eFI9REGn5dNV33Tf9I/vw
X-Received: by 2002:a6b:be86:0:b0:6b9:7a46:479f with SMTP id o128-20020a6bbe86000000b006b97a46479fmr8507790iof.130.1665404050586;
Mon, 10 Oct 2022 05:14:10 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1665404050; cv=none;
d=google.com; s=arc-20160816;
b=EKi4xY7ACAVUDqLPuhCP59ecwcnpS+zY+pOcRCiUIiuvLfKwdZ39nuQKuXzhKMuaKR
NTfIJY7Y0tWtuPQGJ+2ZCtjqJyqqorZhB71CQXe9yLQxT1iu5Z39XBbFyfix+3ylRQH/
G4N7OXn/P1baycsYs15y3/uetDje+NvFrkkq2OYjcBhXEwZn531vEiEp2zcL+wBDvjVI
NtGaXkN7HuH6X38Siz0CMHj1YLHtjGnrCjXNHa9iXjIV4Uja4WzS9Jl65Nu5bTjZegtr
uIHB+q2XOiOU1GhZB4GCRLxrj4msCBgHqtJ0qBId9k1WhVku57ovMiiUvJDtF3gyMk/h
9gOw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
h=content-transfer-encoding:mime-version:subject:message-id:to:from
:date:dkim-signature;
bh=Ub9henOpe5XmdF/KsWVEJ2CSvdKWuLqCTVvS2++iE6A=;
b=a3fGmj/w8651iTjLw887qBJS32RMpFKrxxvYoegQU9b6Gt2vqzvsB90dg67rUq4iV1
uqrSmPphkOCKu6C0kYf22t1xOmG745zxnnfIGwi79I03kQZ1HX0IQBecGooVOlROIyN8
Q00Y3256iXAXB2qEPc4cBCSFamO9XjWuCkP2PNLqxYZrdGjUYUVo5vmvdI3EP1hvxipq
nWxjNQ+lH2UiLnBzb4Gfe7acrvkhKz9gbfXIhMPsvDBIi3bxNXmi3ID2nHXJmpYHWMq8
UwsmAq/X3lxmhLfmkkOWTuujYVjDWOzbzfZbv7sWJgdgur4sOUgxGpRLaZJPO3gib2B7
F0dA==
ARC-Authentication-Results: i=1; mx.google.com;
dkim=pass [email protected] header.s=pf2014 header.b=NcXDBg+m;
spf=pass (google.com: domain of [email protected] designates 192.30.252.211 as permitted sender) smtp.mailfrom=[email protected];
dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=github.com
Return-Path: <[email protected]>
Received: from out-28.smtp.github.com (out-28.smtp.github.com. [192.30.252.211])
by mx.google.com with ESMTPS id s3-20020a056e0210c300b002f8e9246e7asi9145395ilj.14.2022.10.10.05.14.10
for <[email protected]>
(version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256);
Mon, 10 Oct 2022 05:14:10 -0700 (PDT)
Received-SPF: pass (google.com: domain of [email protected] designates 192.30.252.211 as permitted sender) client-ip=192.30.252.211;
Authentication-Results: mx.google.com;
dkim=pass [email protected] header.s=pf2014 header.b=NcXDBg+m;
spf=pass (google.com: domain of [email protected] designates 192.30.252.211 as permitted sender) smtp.mailfrom=[email protected];
dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=github.com
Received: from github-lowworker-bf9f1da.ash1-iad.github.net (github-lowworker-bf9f1da.ash1-iad.github.net [10.56.117.24])
by smtp.github.com (Postfix) with ESMTP id 01F7390008F
for <[email protected]>; Mon, 10 Oct 2022 05:14:10 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com;
s=pf2014; t=1665404050;
bh=Ub9henOpe5XmdF/KsWVEJ2CSvdKWuLqCTVvS2++iE6A=;
h=Date:From:To:Subject:From;
b=NcXDBg+mI5wZwGHJkZSkoax1lOe+r3MZEyz2E+47Ce1s/4a4EQoeSGyQroedLgUnA
lVoA13APnvt/kYa8k5Y348gK4qEKfQwpcC1gQkZfgUjgjQ6tacnRsZDl96AGH9vclW
Kj/tO7vx4xOGBUHqfzkdokHV0ms5rzrBJ0tv8lic=
Date: Mon, 10 Oct 2022 05:14:09 -0700
From: GitHub <[email protected]>
To: [email protected]
Message-ID: <[email protected]>
Subject: [GitHub] Subscribed to 63 nf-core repositories
Mime-Version: 1.0
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Auto-Response-Suppress: All

Hey there, we=E2=80=99re just writing to let you know that you=E2=80=99ve=
automatically started watching several repositories on GitHub.=0D
=0D
You=E2=80=99ll receive notifications for all issues, pull requests, and c=
omments that happen inside the repository. If you would like to stop watc=
hing any of these repositories, you can manage your settings here:=0D
=0D
https://github.com/nf-core/cookiecutter/subscription=0D
https://github.com/nf-core/tools/subscription=0D
https://github.com/nf-core/logos/subscription=0D
https://github.com/nf-core/methylseq/subscription=0D
https://github.com/nf-core/test-datasets/subscription=0D
https://github.com/nf-core/rnaseq/subscription=0D
https://github.com/nf-core/exoseq/subscription=0D
https://github.com/nf-core/nf-co.re/subscription=0D
https://github.com/nf-core/chipseq/subscription=0D
https://github.com/nf-core/vipr/subscription=0D
https://github.com/nf-core/mag/subscription=0D
https://github.com/nf-core/eager/subscription=0D
https://github.com/nf-core/hlatyping/subscription=0D
https://github.com/nf-core/smrnaseq/subscription=0D
https://github.com/nf-core/ampliseq/subscription=0D
https://github.com/nf-core/neutronstar/subscription=0D
https://github.com/nf-core/rnafusion/subscription=0D
https://github.com/nf-core/atacseq/subscription=0D
https://github.com/nf-core/nascent/subscription=0D
https://github.com/nf-core/configs/subscription=0D
https://github.com/nf-core/epitopeprediction/subscription=0D
https://github.com/nf-core/airrflow/subscription=0D
https://github.com/nf-core/bacass/subscription=0D
https://github.com/nf-core/scrnaseq/subscription=0D
https://github.com/nf-core/hic/subscription=0D
https://github.com/nf-core/proteomicslfq/subscription=0D
https://github.com/nf-core/sarek/subscription=0D
https://github.com/nf-core/cageseq/subscription=0D
https://github.com/nf-core/bactmap/subscription=0D
https://github.com/nf-core/mnaseseq/subscription=0D
https://github.com/nf-core/kmermaid/subscription=0D
https://github.com/nf-core/crisprvar/subscription=0D
https://github.com/nf-core/imcyto/subscription=0D
https://github.com/nf-core/modules/subscription=0D
https://github.com/nf-core/nanoseq/subscription=0D
https://github.com/nf-core/demultiplex/subscription=0D
https://github.com/nf-core/viralrecon/subscription=0D
https://github.com/nf-core/gwas/subscription=0D
https://github.com/nf-core/dualrnaseq/subscription=0D
https://github.com/nf-core/clipseq/subscription=0D
https://github.com/nf-core/cutandrun/subscription=0D
https://github.com/nf-core/circrna/subscription=0D
https://github.com/nf-core/vscode-extensionpack/subscription=0D
https://github.com/nf-core/rnavar/subscription=0D
https://github.com/nf-core/fetchngs/subscription=0D
https://github.com/nf-core/raredisease/subscription=0D
https://github.com/nf-core/hicar/subscription=0D
https://github.com/nf-core/circdna/subscription=0D
https://github.com/nf-core/funcscan/subscription=0D
https://github.com/nf-core/ssds/subscription=0D
https://github.com/nf-core/nanostring/subscription=0D
https://github.com/nf-core/genomeannotator/subscription=0D
https://github.com/nf-core/taxprofiler/subscription=0D
https://github.com/nf-core/spatialtranscriptomics/subscription=0D
https://github.com/nf-core/proteinfold/subscription=0D
https://github.com/nf-core/prettier-plugin-nextflow/subscription=0D
https://github.com/nf-core/genomeassembler/subscription=0D
https://github.com/nf-core/hgtseq/subscription=0D
https://github.com/nf-core/isoseq/subscription=0D
https://github.com/nf-core/rnadnavar/subscription=0D
https://github.com/nf-core/crisprseq/subscription=0D
https://github.com/nf-core/genomeskim/subscription=0D
https://github.com/nf-core/fastquorum/subscription=0D
=0D
You automatically watched these repositories because you=E2=80=99ve been =
given access to them.=0D
=0D
Thanks!=0D

From 1751394755051884550@xxx Mon Dec 05 17:09:55 +0000 2022
X-GM-THRID: 1751394755051884550