# 🥐 croissant-rdf

[PyPI](https://pypi.org/project/croissant-rdf/)
[Tests](https://github.com/david4096/croissant-rdf/actions/workflows/test.yml)

A proud [Biohackathon](http://www.biohackathon.org/) project 🧑‍💻🧬👩‍💻

* [Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF](https://osf.io/preprints/biohackrxiv/msv7x_v1)
* (In progress) Preprint from Elixir Biohackathon

<a target="_blank" href="https://colab.research.google.com/github/david4096/croissant-rdf/blob/main/example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[Hugging Face Space demo](https://huggingface.co/spaces/david4096/huggingface-rdf)

`croissant-rdf` is a Python tool that generates RDF (Resource Description Framework) data from the metadata of datasets available on Hugging Face. It lets researchers and developers convert that metadata into a machine-readable format for enhanced querying and data analysis.

This is made possible by an effort to align with the [MLCommons Croissant](https://github.com/mlcommons/croissant) schema, which Hugging Face and other platforms conform to.

## Features

- Fetch dataset metadata from Hugging Face or Kaggle.
- Convert the dataset metadata to RDF.
- Generate Turtle (`.ttl`) files for easy integration with SPARQL endpoints.

## Installation

croissant-rdf is available on PyPI!

```bash
pip install croissant-rdf
```

## Usage

After installing the package, you can use the command-line interface (CLI) to generate RDF data:

```sh
export HF_API_KEY={YOUR_KEY}
huggingface-rdf --fname huggingface.ttl --limit 10
```
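
As with `kaggle-rdf` below, you can pass an optional positional argument to filter the dataset search by keyword (the same invocation appears in the Development section):

```sh
huggingface-rdf --fname huggingface.ttl --limit 10 covid
```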

Check out the `qlever_scripts` directory for help loading the RDF into [QLever](https://github.com/ad-freiburg/qlever) for querying.

You can also use Apache Jena Fuseki and load the generated `.ttl` file from the Fuseki UI:

```sh
docker run -it -p 3030:3030 stain/jena-fuseki
```
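
Once Fuseki is up, you can also query it over HTTP. A minimal sketch, assuming you created a dataset named `croissant` in the Fuseki UI (the dataset name is your choice) and uploaded your `.ttl` file to it:

```sh
# POST a SPARQL query to the Fuseki dataset's query endpoint
curl http://localhost:3030/croissant/sparql \
  --data-urlencode 'query=SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 25' \
  -H 'Accept: application/sparql-results+json'
```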

### Extracting data from Kaggle

You'll need a Kaggle API key, which is delivered in a file called `kaggle.json`; put the username and key from that file into environment variables:

```sh
export KAGGLE_USERNAME={YOUR_USERNAME}
export KAGGLE_KEY={YOUR_KEY}
kaggle-rdf --fname kaggle.ttl --limit 10

# Optionally you can provide a positional argument to filter the dataset search
kaggle-rdf --fname kaggle.ttl --limit 10 covid
```
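
If you'd rather not copy the credentials by hand, you can read them straight out of `kaggle.json`. A sketch assuming `jq` is installed and the file is in its default location, `~/.kaggle/kaggle.json`:

```sh
# kaggle.json has the shape {"username": "...", "key": "..."}
export KAGGLE_USERNAME=$(jq -r .username ~/.kaggle/kaggle.json)
export KAGGLE_KEY=$(jq -r .key ~/.kaggle/kaggle.json)
```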

### Running via Docker

You can use the `huggingface-rdf` or `kaggle-rdf` tools via Docker:

```bash
docker run -t -v $(pwd):/app david4096/croissant-rdf huggingface-rdf --fname docker.ttl
```

This will create a Turtle file `docker.ttl` in the current working directory.
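
If the tool needs credentials (see above), Docker's `-e` flag forwards variables already exported in your shell into the container. A sketch for `kaggle-rdf`; the same pattern works for `HF_API_KEY` with `huggingface-rdf`:

```bash
export KAGGLE_USERNAME={YOUR_USERNAME}
export KAGGLE_KEY={YOUR_KEY}
# -e VAR with no value copies VAR from the host environment
docker run -t -e KAGGLE_USERNAME -e KAGGLE_KEY -v $(pwd):/app \
  david4096/croissant-rdf kaggle-rdf --fname kaggle.ttl --limit 10
```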

### Using Common Workflow Language (CWL)

First install [cwltool](https://www.commonwl.org/user_guide/introduction/prerequisites.html) and then you can run the workflow using:

```bash
cwltool https://raw.githubusercontent.com/david4096/croissant-rdf/refs/heads/main/workflows/huggingface-rdf.cwl --fname cwl.ttl --limit 5
```

This will output a Turtle file called `cwl.ttl` in your local directory.
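
Note that `cwltool` does not pass your environment through by default; if the workflow needs your `HF_API_KEY` (see Usage above), its `--preserve-environment` option forwards a named variable. A sketch under that assumption:

```bash
export HF_API_KEY={YOUR_KEY}
# --preserve-environment passes the named variable into the tool's environment
cwltool --preserve-environment HF_API_KEY \
  https://raw.githubusercontent.com/david4096/croissant-rdf/refs/heads/main/workflows/huggingface-rdf.cwl \
  --fname cwl.ttl --limit 5
```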

### Using Docker to run a Jupyter server

To launch a Jupyter notebook server for running and developing the project locally:

Build:

```sh
docker build -t croissant-rdf-jupyter -f notebooks/Dockerfile .
```

Run Jupyter:

```sh
docker run -p 8888:8888 -v $(pwd):/app croissant-rdf-jupyter
```

This run command works on macOS and Linux; on Windows in PowerShell, use the following:

```sh
docker run -p 8888:8888 -v ${PWD}:/app croissant-rdf-jupyter
```

After that, you can access the Jupyter notebook server at http://localhost:8888.

## Useful SPARQL Queries

SPARQL (SPARQL Protocol and RDF Query Language) is a query language for retrieving and manipulating data stored in RDF format, typically within a triplestore. Here are a few useful SPARQL queries you can try on https://huggingface.co/spaces/david4096/huggingface-rdf.

The basic structure of a SPARQL query has two parts: `SELECT` lists the variables you want returned in the result, and `WHERE` defines the triple patterns to match in the RDF dataset.

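A minimal query showing that shape (`LIMIT` is optional, but handy while exploring):

```sparql
SELECT ?s ?p ?o       # variables to return
WHERE { ?s ?p ?o }    # triple pattern to match
LIMIT 10              # cap the number of result rows
```
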
1. Retrieve the distinct predicates used in the Hugging Face RDF dataset:

```sparql
SELECT DISTINCT ?b WHERE {?a ?b ?c}
```

2. Retrieve each dataset's name along with its predicates and the count of objects per predicate, restricted to resources of type <https://schema.org/Dataset> (note that every selected non-aggregate variable must appear in `GROUP BY`):

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?name ?p (COUNT(?o) AS ?predicate_count)
WHERE {
  ?dataset a schema:Dataset ;
      schema:name ?name ;
      ?p ?o .
}
GROUP BY ?dataset ?name ?p
```

3. Retrieve distinct keyword values containing "bio" (case-insensitive) for the property <https://schema.org/keywords>:

```sparql
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?keyword
WHERE {
  ?s schema:keywords ?keyword .
  FILTER(CONTAINS(LCASE(?keyword), "bio"))
}
```

4. Retrieve the distinct column values attached via the Croissant `cr:column` predicate:

```sparql
PREFIX cr: <http://mlcommons.org/croissant/>
SELECT DISTINCT ?column
WHERE {
  ?s cr:column ?column
}
```

5. Retrieve the names of creators and the count of items they are associated with:

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?creatorName (COUNT(?s) AS ?count)
WHERE {
  ?s schema:creator ?creator .
  ?creator schema:name ?creatorName .
}
GROUP BY ?creatorName
ORDER BY DESC(?count)
```
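
You don't need a running endpoint to experiment: the generated Turtle file can be queried directly in Python with [`rdflib`](https://github.com/RDFLib/rdflib). A minimal sketch, assuming `huggingface.ttl` was produced by the steps above, running query 3:

```python
from rdflib import Graph

# Load the Turtle file generated by huggingface-rdf
g = Graph()
g.parse("huggingface.ttl", format="turtle")

# Query 3 from above: keywords containing "bio", case-insensitive
query = """
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?keyword
WHERE {
  ?s schema:keywords ?keyword .
  FILTER(CONTAINS(LCASE(?keyword), "bio"))
}
"""
for row in g.query(query):
    print(row.keyword)
```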

## Contributing

We welcome contributions! Please open an issue or submit a pull request!

### Development

> We recommend using [`uv`](https://docs.astral.sh/uv/getting-started/installation/) for development; it handles virtual environments and dependencies automatically and quickly.

Create a `.env` file with the required API keys:

```sh
HF_API_KEY=hf_YYY
KAGGLE_USERNAME=you
KAGGLE_KEY=0000
```

Run for Hugging Face:

```sh
uv run --env-file .env huggingface-rdf --fname huggingface.ttl --limit 10 covid
```

Run for Kaggle:

```sh
uv run --env-file .env kaggle-rdf --fname kaggle.ttl --limit 10 covid
```

Run tests:

```sh
uv run pytest
```

> Test with HTML coverage report:
>
> ```sh
> uv run pytest --cov-report html && uv run python -m http.server 3000 --directory ./htmlcov
> ```

Run formatting and linting:

```sh
uvx ruff format && uvx ruff check --fix
```

Start a SPARQL endpoint on the generated files using [`rdflib-endpoint`](https://github.com/vemonet/rdflib-endpoint):

```sh
uv run rdflib-endpoint serve --store Oxigraph *.ttl
```
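
`rdflib-endpoint` speaks the standard SPARQL protocol, so once it is running you can query it with `curl`. A sketch assuming the default address of `http://localhost:8000` (check the startup log for the exact URL):

```sh
# POST a SPARQL query to the local rdflib-endpoint server
curl http://localhost:8000/ \
  --data-urlencode 'query=SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 25' \
  -H 'Accept: application/sparql-results+json'
```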