Documenting and cleaning up manifest file logic (#448)
jamesbraza authored Sep 21, 2024
1 parent 040e69a commit a3a069b
Showing 3 changed files with 38 additions and 9 deletions.
36 changes: 32 additions & 4 deletions README.md
@@ -18,8 +18,8 @@ question answering, summarization, and contradiction detection.
- [What's New in Version 5 (aka PaperQA2)?](#whats-new-in-version-5-aka-paperqa2)
- [PaperQA2 Algorithm](#paperqa2-algorithm)
- [Installation](#installation)
- [CLI Usage](#cli-usage)
- [Bundled Settings](#bundled-settings)
- [CLI Usage](#cli-usage)
- [Bundled Settings](#bundled-settings)
- [Library Usage](#library-usage)
- [`ask` manually](#ask-manually)
- [Adding Documents Manually](#adding-documents-manually)
@@ -30,6 +30,8 @@ question answering, summarization, and contradiction detection.
- [Adjusting number of sources](#adjusting-number-of-sources)
- [Using Code or HTML](#using-code-or-html)
- [Using External DB/Vector DB and Caching](#using-external-dbvector-db-and-caching)
- [Creating Index](#creating-index)
- [Manifest Files](#manifest-files)
- [Reusing Index](#reusing-index)
- [Running on LitQA v2](#running-on-litqa-v2)
- [Using Clients Directly](#using-clients-directly)
@@ -169,7 +171,7 @@ you will likely want an API key for both [Crossref](https://www.crossref.org/doc
which will allow you to avoid hitting public rate limits using these metadata services.
Those can be exported as `CROSSREF_API_KEY` and `SEMANTIC_SCHOLAR_API_KEY` variables.

-### CLI Usage
+## CLI Usage

The fastest way to test PaperQA2 is via the CLI. First, navigate to a directory with some papers and use the `pqa` CLI:

@@ -236,7 +238,7 @@ Both the CLI and module have pre-configured settings based on prior performance
pqa --settings <setting name> ask 'Are there nm scale features in thermoelectric materials?'
```

-#### Bundled Settings
+### Bundled Settings

Inside [`paperqa/configs`](paperqa/configs) we bundle known useful settings:

@@ -524,6 +526,32 @@ for ... in my_docs:
docs.add_texts(texts, doc)
```

### Creating Index

Indexes will be placed in the [home directory][home dir] by default.
This can be controlled via the `PQA_HOME` environment variable.

Indexes are made by reading files in the `Settings.paper_directory`.
By default, we recursively read from subdirectories of the paper directory,
unless disabled using `Settings.index_recursively`.
The paper directory is not modified in any way; it is only read from.

[home dir]: https://docs.python.org/3/library/pathlib.html#pathlib.Path.home
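
The lookup and reading behavior described above can be sketched with the standard library alone. This is a minimal illustration, not PaperQA's implementation: `resolve_pqa_home` and `list_paper_files` are hypothetical names, and the sketch does not show where indexes are placed beneath the resolved directory.

```python
import os
from pathlib import Path


def resolve_pqa_home() -> Path:
    """Return the PQA_HOME override if it is set, else the user's home directory.

    Hypothetical helper: it only mirrors the fallback order described in the README.
    """
    override = os.environ.get("PQA_HOME")
    return Path(override) if override else Path.home()


def list_paper_files(paper_directory: Path, recursive: bool = True) -> list[Path]:
    """List files in the paper directory, read-only.

    With recursive=True, files in subdirectories are included as well,
    which is the default behavior the README describes.
    """
    pattern = "**/*" if recursive else "*"
    return sorted(p for p in paper_directory.glob(pattern) if p.is_file())
```

Passing `recursive=False` here plays the role that disabling `Settings.index_recursively` plays in the README text: only files directly inside the paper directory are returned.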

#### Manifest Files

The indexing process attempts to infer paper metadata like title and DOI
using LLM-powered text processing.
You can avoid this point of uncertainty using a "manifest" file,
which is a CSV containing three columns (order doesn't matter):

- `file_location`: relative path to the paper's PDF within the index directory
- `doi`: DOI of the paper
- `title`: title of the paper

By providing this information,
we ensure queries to metadata providers like Crossref are accurate.
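
As a rough sketch of producing such a manifest, the three columns listed above can be written with Python's `csv` module. The helper name `write_manifest` and the constant `MANIFEST_COLUMNS` are assumptions made for illustration; only the column names come from the README text.

```python
import csv
from pathlib import Path

# Column names taken from the README; column order does not matter to a DictReader.
MANIFEST_COLUMNS = ("file_location", "doi", "title")


def write_manifest(rows: list[dict[str, str]], manifest_path: Path) -> None:
    """Write one CSV row per paper, with a header row naming the three columns.

    Hypothetical helper: a row with an unexpected key raises ValueError,
    and a missing key is written as an empty cell.
    """
    with manifest_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=MANIFEST_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_manifest([{"file_location": "papers/a.pdf", "doi": "10.1234/example", "title": "An Example Paper"}], Path("manifest.csv"))` would produce a two-line file; the path, DOI, and title in that call are made-up placeholders.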

### Reusing Index

The local search indexes are built based on a hash of the current `Settings` object.

9 changes: 5 additions & 4 deletions paperqa/agents/search.py
@@ -333,16 +333,17 @@ async def query(
]


-async def maybe_get_manifest(filename: anyio.Path | None) -> dict[str, DocDetails]:
+async def maybe_get_manifest(
+    filename: anyio.Path | None = None,
+) -> dict[str, DocDetails]:
     if not filename:
         return {}
     if filename.suffix == ".csv":
         try:
             async with await anyio.open_file(filename, mode="r") as file:
                 content = await file.read()
-                reader = csv.DictReader(StringIO(content))
-                records = [DocDetails(**row) for row in reader]
-                return {str(r.file_location): r for r in records if r.file_location}
+            records = [DocDetails(**row) for row in csv.DictReader(StringIO(content))]
+            return {str(r.file_location): r for r in records if r.file_location}
         except FileNotFoundError:
             logging.warning(f"Manifest file at {filename} could not be found.")
         except Exception:
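
The parsing pattern in this hunk (read the whole file, feed it to `csv.DictReader`, key the records by `file_location`) can be tried without `anyio` or PaperQA installed. The following is a synchronous sketch under stated assumptions: `ManifestEntry` is a hypothetical stand-in for `DocDetails`, `load_manifest` is not a PaperQA function, and unlike the original it ignores extra CSV columns rather than passing them through.

```python
import csv
import logging
from dataclasses import dataclass
from io import StringIO
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

FIELDS = ("file_location", "doi", "title")


@dataclass
class ManifestEntry:
    """Hypothetical stand-in for DocDetails, holding only the manifest columns."""

    file_location: str = ""
    doi: str = ""
    title: str = ""


def load_manifest(filename: Optional[Path] = None) -> dict[str, ManifestEntry]:
    """Return manifest rows keyed by file_location, or {} if nothing usable exists."""
    if not filename or filename.suffix != ".csv":
        return {}
    try:
        content = filename.read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.warning("Manifest file at %s could not be found.", filename)
        return {}
    # Keep only the three known columns; missing cells become empty strings.
    records = [
        ManifestEntry(**{name: row.get(name) or "" for name in FIELDS})
        for row in csv.DictReader(StringIO(content))
    ]
    # Rows without a file_location cannot be matched to a paper, so skip them.
    return {r.file_location: r for r in records if r.file_location}
```

As in the diff, the returned mapping is built only after the file has been fully read, so the parsing step does not hold the file open.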

2 changes: 1 addition & 1 deletion paperqa/settings.py
@@ -432,7 +432,7 @@ class Settings(BaseSettings):
         default=None,
         description=(
             "Optional manifest CSV, containing columns which are attributes for a"
-            " DocDetails object. Only 'file_location','doi', and 'title' will be used"
+            " DocDetails object. Only 'file_location', 'doi', and 'title' will be used"
             " when indexing."
         ),
     )