GBIF Data Profiler

GBIF Data Parsing Utilities, Profilers, and Web-based publishing site

Website

You can explore an existing inventory of interactive profiles by visiting the repository website: https://ben-norton.github.io/GBIF-Data-Profiler/

Project Directory Structure

├───app
|   ├───analysis    | A collection of scripts for running basic structural analysis of source data files
|   ├───media       | Utilities to download, migrate, and inventory multimedia files in occurrence datasets
|   ├───profilers   | Scripts to generate interactive dataset profiles, see dataset profiler libraries
|   ├───schemas     | Tabular schemas and CSV file statistics generated from the source dataset files
|   └───utils       | General support utility scripts for performing specific tasks such as converting tab-delimited to comma-separated
├───docs            | Documentation markdown files
├───source-data     | Source occurrence datasets
├───web             | Static site generator for publishing interactive profiles using VitePress

Source Data

All source datasets downloaded from Dataset Search in the GBIF Portal are extracted and stored in the source-datasets folder. Many occurrence datasets are very large > 1GB, which can exceed the allocation provided by certain GitHub plans. Therefore, a simple example is included with the repo. Except for the example dataset, the contents of the source-datasets are set to ignore in order to prevent exceeding the GitHub limitations.

Datasets are organized by package ID, the identifier assigned by GBIF when a dataset is selected for download. See the source--data-inventory.md document for a tabulated summary. Please see the README file in the source-datasets folder and the README.md file in the project-docs for more information including procedures for downloading your own datasets.

Datset Profiler Libraries

Library	PIP	URL	Description
YData Profiler		https://docs.profiling.ydata.ai/latest/	Data quality profiling and exploratory data analysis are crucial steps in the process of Data Science and Machine Learning development. YData-profiling is a leading tool in the data understanding step of the data science workflow as a pioneering Python package.
Sweetviz	sweetviz	https://pypi.org/project/sweetviz/	A pandas-based library to visualize and compare datasets.

Profiling Library Notes

Sweetviz requires downgrading numpy to 1.26.4 (last version before 2.0.0)

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.idea		.idea
app		app
docs		docs
project-docs		project-docs
source-data		source-data
web		web
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
deploy.bat		deploy.bat
globals.py		globals.py
init.bat		init.bat
meta.yml		meta.yml
requirements.txt		requirements.txt
schemas.py		schemas.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GBIF Data Profiler

Website

Project Directory Structure

Source Data

Datset Profiler Libraries

Profiling Library Notes

About

Releases

Packages

Languages

ben-norton/GBIF-Data-Profiler

Folders and files

Latest commit

History

Repository files navigation

GBIF Data Profiler

Website

Project Directory Structure

Source Data

Datset Profiler Libraries

Profiling Library Notes

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages