geo-schemas

Exploratory analyses of GEO (Gene Expression Omnibus) metadata from the PEPhub geo namespace (~229K projects). Work is organized around two discussions on databio/lab:

Issues

Issue 55 — GEO Series-Level Semantic Meta-Analysis

Embeds series-level text (title + summary + design) with sentence-transformers, projects into 2D with UMAP, and clusters with HDBSCAN/k-means to map the landscape of GEO research topics. Includes temporal trend analysis, country/institution profiles, and centroid trajectory tracking across eras.
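As a rough illustration of the centroid trajectory idea, the sketch below groups 2D points by era and measures how far the centroid moves between consecutive eras. It assumes the points are already-projected coordinates (for example UMAP output) tagged with a submission year. The 5-year era length and the names `era_of`, `centroid_trajectory`, and `step_lengths` are hypothetical and not taken from the repository's scripts.

```python
from collections import defaultdict
from math import hypot


def era_of(year, era_length=5):
    """Map a year to the first year of its era (assumed 5-year bins)."""
    return year - year % era_length


def centroid_trajectory(points, era_length=5):
    """points: iterable of (year, x, y). Returns [(era, cx, cy), ...] sorted by era."""
    buckets = defaultdict(list)
    for year, x, y in points:
        buckets[era_of(year, era_length)].append((x, y))
    trajectory = []
    for era in sorted(buckets):
        xs, ys = zip(*buckets[era])
        trajectory.append((era, sum(xs) / len(xs), sum(ys) / len(ys)))
    return trajectory


def step_lengths(trajectory):
    """Euclidean distance the centroid moves between consecutive eras."""
    return [
        hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(trajectory, trajectory[1:])
    ]


# Toy coordinates, not real GEO data.
points = [(2006, 0.0, 0.0), (2008, 2.0, 0.0), (2011, 4.0, 3.0), (2014, 4.0, 5.0)]
trajectory = centroid_trajectory(points)
steps = step_lengths(trajectory)
```

In practice the same grouping would likely be applied per cluster or per country rather than to the whole dataset at once.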

Issue 65 — PEPhub Schema Diversity Analysis

Analyzes the sample-table column names across all GEO PEPs to characterize schema diversity. Classifies columns as GEO-standard vs. user-defined, clusters synonymous column names (e.g. tissue, tissue_type, Tissue), and produces a benchmark dataset for schema mapping evaluation.
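A minimal sketch of the two steps described above: labeling columns as GEO-standard or user-defined, then grouping near-identical names. The repository lists rapidfuzz among its dependencies; this example swaps in the standard library's `difflib` so it runs without extra packages. The `GEO_STANDARD` set, the 0.7 threshold, and all function names are illustrative assumptions, not the project's actual list or method.

```python
import re
from difflib import SequenceMatcher

# Assumption: a placeholder subset, not the authoritative GEO-standard column list.
GEO_STANDARD = {"sample_title", "sample_geo_accession"}


def normalize(name):
    """Lowercase and collapse runs of non-alphanumeric characters to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def classify(columns, standard=GEO_STANDARD):
    """Label each raw column name as 'geo_standard' or 'user_defined'."""
    return {
        c: "geo_standard" if normalize(c) in standard else "user_defined"
        for c in columns
    }


def cluster_synonyms(columns, threshold=0.7):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative, [raw names])
    for col in columns:
        key = normalize(col)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, key).ratio() >= threshold:
                members.append(col)
                break
        else:
            clusters.append((key, [col]))
    return {rep: members for rep, members in clusters}


columns = ["tissue", "tissue_type", "Tissue", "cell_line", "sample_geo_accession"]
labels = classify(columns)
clusters = cluster_synonyms(columns)
```

Greedy clustering depends on input order and on the threshold, so a real analysis would probably need a more careful grouping step and manual review of borderline merges.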

Repo structure

preprocessing/          Shared data ingestion (download archive, parse PEPs)
issue55/
  scripts/              Analysis scripts (numbered, run in order from 03)
  output/               Figures, CSVs, cluster results, report
issue65/
  scripts/              Analysis scripts (numbered, run in order from 03)
  output/               Figures, CSVs, benchmark files, report
data/                   Shared data dir (gitignored — large parquet files)
plans/                  Session plans and implementation logs

Scripts are numbered to indicate execution order. Steps 00-02 in preprocessing/ are shared (data download and parsing); steps 03+ are issue-specific.
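The numbering convention can be sketched as a sort on the numeric filename prefix. The two preprocessing filenames come from the Setup section; the issue-specific names (`03_embed_series.py`, `04_cluster_topics.py`) and the helper functions are hypothetical and used only for illustration.

```python
import re


def step_number(filename):
    """Return the leading step number of a script name, or None if it has none."""
    match = re.match(r"(\d+)_", filename)
    return int(match.group(1)) if match else None


def execution_order(shared, issue_specific):
    """Sort shared and issue-specific scripts together by their step number."""
    numbered = [
        (step_number(f), f)
        for f in shared + issue_specific
        if step_number(f) is not None
    ]
    return [f for _, f in sorted(numbered)]


shared = ["02_parse_peps.R", "01_download_archive.R"]
issue_specific = ["04_cluster_topics.py", "03_embed_series.py"]  # hypothetical names
order = execution_order(shared, issue_specific)
```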

Setup

R packages: arrow, data.table, ggplot2, lubridate, fs, yaml, httr2, furrr, progressr, here
Python packages: pandas, pyarrow, numpy, sentence-transformers, umap-learn, hdbscan, scikit-learn, matplotlib, rapidfuzz, adjustText, python-dotenv
Run preprocessing/01_download_archive.R to fetch the GEO archive, then preprocessing/02_parse_peps.R to build data/geo_metadata.parquet.
Run issue-specific scripts from their scripts/ directory. They use relative paths and expect the working directory to be issueNN/scripts/.
