classifier-pipeline

A software suite that enables remote extraction, transformation and loading of data.

This repository is geared heavily towards drawing articles from PubMed, identifying scientific articles containing information about biological pathways, and loading the records into a data store.

Requirements

Access the article feed via the web service:

  • Access the Swagger documentation at /docs.
  • Access the Redoc documentation at /redoc.

Usage

Create a conda environment, here named pipeline:

$ conda create --name pipeline python=3.8 --yes
$ conda activate pipeline

Clone the repository:

$ git clone https://github.com/jvwong/classifier-pipeline
$ cd classifier-pipeline

Install the dependencies:

$ poetry install
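
If Poetry is not already available inside the conda environment, it can be installed first; this extra step is an assumption and can be skipped if poetry is already on your PATH:

$ pip install poetry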

Web server

To start up the server:

$ uvicorn classifier_pipeline.main:app --port 8000 --reload
  • uvicorn options
    • --reload: enable auto-reload when source files change
    • --port INTEGER: bind the socket to this port (default: 8000)

Now go to http://127.0.0.1:8000/redoc (swap out the port if necessary) to see the automatically generated API documentation.
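
To confirm the server is running, you can also fetch one of the documentation pages from another terminal (curl is used here only as an illustrative check; it is not part of the project scripts):

$ curl -s http://127.0.0.1:8000/docs | head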

Pipeline

Launch a pipeline to process daily updates from PubMed and dump the RethinkDB database:

$ ./scripts/cron/install.sh
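
If install.sh registers the job as a cron entry (an assumption based on its location under scripts/cron), you can verify that the entry was added with:

$ crontab -l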

Elements of the 'Pipeline'

The pipeline

The scripts directory contains Python scripts that chain functions in classifier_pipeline to (a minimal sketch follows the list):

  • read in identifiers (e.g. PubMed IDs or update file names) from a csv file
  • retrieve records/files from PubMed (pubmed_transformer)
  • apply various filters to the individual records (citation_pubtype_filter, citation_date_filter)
  • apply a deep-learning classifier to text fields (classification_transformer)
  • load the formatted data into a RethinkDB instance (db_loader)
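
A minimal sketch of how a script might chain these stages is shown below; only the function names come from classifier_pipeline, while the signatures, parameters, and module layout are assumptions for illustration:

# Hypothetical sketch of a pipeline script; signatures and parameters are assumed.
from classifier_pipeline import (
    pubmed_transformer,
    citation_pubtype_filter,
    citation_date_filter,
    classification_transformer,
    db_loader,
)

def run(pmids):
    records = pubmed_transformer(pmids)                            # retrieve article records from PubMed
    records = citation_pubtype_filter(records)                     # drop unwanted publication types
    records = citation_date_filter(records, min_year=2020)         # drop articles older than ARG_MINYEAR
    records = classification_transformer(records, threshold=0.5)   # keep articles likely describing pathways
    db_loader(records, table="articles")                           # load the formatted records into RethinkDB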

Launchers

  • Pipelines are launched through bash scripts that retrieve PubMed article records in two ways:
    • ./scripts/cron/cron.sh: retrieves all new content via the FTP file server
    • ./scripts/csv/pmids.sh: retrieves records via the NCBI E-Utilities, given a set of PubMed IDs
  • Variables (an example invocation follows this list)
    • DATA_DIR: root directory where your data files exist
    • DATA_FILE: name of the csv file in DATA_DIR
    • ARG_IDCOLUMN: the csv header (column) name containing either
      • a list of update files to extract (dailyupdates.sh)
      • a list of PubMed IDs to extract (pmids.sh)
    • JOB_NAME: the name of this pipeline job
    • CONDA_ENV: the name of the conda environment created in the first steps (here, pipeline)
    • ARG_TYPE
      • use fetch to download individual PubMed IDs
      • use download to retrieve FTP update files
    • ARG_MINYEAR: articles published before this year are filtered out (optional)
    • ARG_TABLE: the name of the RethinkDB table to dump results into
    • ARG_THRESHOLD: the lowest classification probability from pathway-abstract-classifier at which an article is labelled 'positive'
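
As an illustration, a pmids.sh run might be configured as follows; every value is a placeholder, and whether the variables are exported in the environment (as shown) or edited directly inside the script is an assumption:

$ DATA_DIR=/path/to/data \
  DATA_FILE=pmids.csv \
  ARG_IDCOLUMN=pmid \
  JOB_NAME=pmid_job \
  CONDA_ENV=pipeline \
  ARG_TYPE=fetch \
  ARG_MINYEAR=2020 \
  ARG_TABLE=articles \
  ARG_THRESHOLD=0.5 \
  ./scripts/csv/pmids.sh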

Testing

There is a convenience script that can be launched:

$ ./test.sh

This will run the tests in ./tests, lint with flake8 and type check with mypy.
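
This is roughly equivalent to running the underlying tools yourself; the exact commands wired into test.sh are an assumption:

$ poetry run pytest ./tests
$ poetry run flake8
$ poetry run mypy .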