Exploratory analyses of GEO (Gene Expression Omnibus) metadata from the PEPhub geo namespace (~229K projects). Work is organized around two discussions on databio/lab:
Embeds series-level text (title + summary + design) with sentence-transformers, projects into 2D with UMAP, and clusters with HDBSCAN/k-means to map the landscape of GEO research topics. Includes temporal trend analysis, country/institution profiles, and centroid trajectory tracking across eras.
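One way to read "centroid trajectory tracking across eras" is: for each cluster, average the 2D UMAP coordinates of its series within each era and follow how that mean moves. The sketch below illustrates that reading only. The era cut points, the labels, and the `(cluster, year, x, y)` row layout are assumptions for illustration, not taken from the analysis scripts.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical era boundaries and labels; the actual eras may differ.
ERA_STARTS = (2010, 2018)
ERA_LABELS = ("before 2010", "2010-2017", "2018 onward")


def era_index(year):
    """Return the index of the era that contains `year`."""
    return bisect_right(ERA_STARTS, year)


def centroid_trajectories(rows):
    """Compute per-cluster centroids for each era.

    rows: iterable of (cluster_id, year, x, y), where x and y are
    2D projection coordinates (assumed to come from UMAP).
    Returns {cluster_id: [(era_label, mean_x, mean_y), ...]} in era order.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for cluster, year, x, y in rows:
        acc = sums[(cluster, era_index(year))]
        acc[0] += x
        acc[1] += y
        acc[2] += 1

    trajectories = defaultdict(list)
    # Sorting by (cluster, era index) keeps each trajectory chronological.
    for (cluster, era), (sx, sy, n) in sorted(sums.items()):
        trajectories[cluster].append((ERA_LABELS[era], sx / n, sy / n))
    return dict(trajectories)
```

Each list can then be drawn as a path in the UMAP plane to show how a topic's center drifts over time.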
Analyzes the sample-table column names across all GEO PEPs to characterize schema diversity. Classifies columns as GEO-standard vs. user-defined, clusters synonymous column names (e.g. tissue, tissue_type, Tissue), and produces a benchmark dataset for schema mapping evaluation.
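As a rough illustration of the two steps described above (classifying columns and grouping synonymous names), here is a minimal standard-library sketch. It is not the repository's method: the `GEO_STANDARD` set is a hypothetical subset, and the normalization rule (lowercase, unify separators, drop a trailing `_type`) is a simple stand-in for whatever matching the scripts actually use (the dependency list suggests fuzzy matching with rapidfuzz).

```python
import re
from collections import defaultdict

# Hypothetical subset of GEO-standard sample fields, for illustration only.
GEO_STANDARD = {"sample_title", "sample_geo_accession", "sample_platform_id"}


def classify(name):
    """Label a column as GEO-standard or user-defined."""
    return "geo_standard" if name.strip().lower() in GEO_STANDARD else "user_defined"


def normalize(name):
    """Lowercase, turn runs of non-alphanumerics into '_', drop a trailing '_type'.

    The '_type' rule is an illustrative heuristic and will over-merge
    some names (e.g. 'cell_type' and 'cell').
    """
    key = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    return re.sub(r"_type$", "", key)


def group_synonyms(columns):
    """Group column names that share a normalized key; keep groups of 2+."""
    groups = defaultdict(set)
    for col in columns:
        groups[normalize(col)].add(col)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

For the example in the text, `group_synonyms(["tissue", "tissue_type", "Tissue"])` puts all three names under the key `tissue`.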
preprocessing/     Shared data ingestion (download archive, parse PEPs)
issue55/
  scripts/         Analysis scripts (numbered, run in order from 03)
  output/          Figures, CSVs, cluster results, report
issue65/
  scripts/         Analysis scripts (numbered, run in order from 03)
  output/          Figures, CSVs, benchmark files, report
data/              Shared data dir (gitignored; large parquet files)
plans/             Session plans and implementation logs
Scripts are numbered to indicate execution order. Steps 00-02 in preprocessing/ are shared (data download and parsing); steps 03+ are issue-specific.
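The numbering convention can be checked mechanically. The helper below is a small sketch that assumes script file names begin with a numeric prefix followed by an underscore (as in `02_parse_peps.R`); the function name `ordered_scripts` and its `first_step` parameter are hypothetical. It returns the names in execution order, starting from a given step.

```python
import re

# Assumed naming pattern: digits, an underscore, then the rest of the name.
PREFIX = re.compile(r"^(\d+)_")


def ordered_scripts(names, first_step=0):
    """Return script names with a numeric prefix >= first_step, sorted by step.

    Names without a numeric prefix are ignored.
    """
    steps = []
    for name in names:
        match = PREFIX.match(name)
        if match and int(match.group(1)) >= first_step:
            steps.append((int(match.group(1)), name))
    # Sort numerically so step 10 follows step 9 even without zero padding.
    return [name for _, name in sorted(steps)]
```

To list the issue-specific steps of one analysis, pass the file names from its scripts/ directory with `first_step=3`.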
- R packages: arrow, data.table, ggplot2, lubridate, fs, yaml, httr2, furrr, progressr, here
- Python packages: pandas, pyarrow, numpy, sentence-transformers, umap-learn, hdbscan, scikit-learn, matplotlib, rapidfuzz, adjustText, python-dotenv
- Run preprocessing/01_download_archive.R to fetch the GEO archive, then preprocessing/02_parse_peps.R to build data/geo_metadata.parquet.
- Run issue-specific scripts from their scripts/ directory (they use relative paths expecting cwd = issueNN/scripts/).