Commit e6d8860
Bring croissant-rdf into Croissant repo (#848)
croissant-rdf is a tool for working with Croissant JSON-LD using RDF. It currently supports converting data from the major dataset providers. It's useful for getting started with creating Knowledge Graphs of datasets and is a good place to demonstrate further semantic features.

* Started at DBCLS Biohackathon Japan 2024
  * https://osf.io/preprints/biohackrxiv/msv7x_v1
* Continued at Elixir Biohackathon 2024
  * https://osf.io/4sgdq_v1/
* Lastly at Semantic Web Applications and Tools for Healthcare and Life Sciences (SWAT4HCLS)
  * Preprint in progress

Currently at https://github.com/david4096/croissant-rdf

This PR will attempt to bring it into this repo.

TODOs:

- [ ] Get CI tests and code quality checks integrated
- [ ] Normalize docs and license as requested by MLCommons
- [ ] Move over issues from origin repository
- [ ] Other things? let me know!
1 parent ca10fde commit e6d8860

33 files changed: +369052 additions, 0 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -111,6 +111,9 @@ ENV/
 env.bak/
 venv.bak/
 
+# Mac files
+.DS_Store
+
 # Spyder project settings
 .spyderproject
 .spyproject
```

croissant-rdf/Dockerfile

Lines changed: 14 additions & 0 deletions
```dockerfile
FROM python:3.13-slim

WORKDIR /app

RUN apt-get update && \
    apt-get install -y git && \
    apt-get clean

COPY . .

RUN pip install .

# Start Jupyter notebook
#CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

croissant-rdf/LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2024 David Steinberg

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

croissant-rdf/README.md

Lines changed: 218 additions & 0 deletions
# 🥐 croissant-rdf

[![PyPI - Version](https://img.shields.io/pypi/v/croissant-rdf.svg?logo=pypi&label=PyPI&logoColor=silver)](https://pypi.org/project/croissant-rdf/)
[![Tests](https://github.com/david4096/croissant-rdf/actions/workflows/test.yml/badge.svg)](https://github.com/david4096/croissant-rdf/actions/workflows/test.yml)

A proud [Biohackathon](http://www.biohackathon.org/) project 🧑‍💻🧬👩‍💻

* [Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF](https://osf.io/preprints/biohackrxiv/msv7x_v1)
* (In progress) Preprint from the Elixir Biohackathon

<a target="_blank" href="https://colab.research.google.com/github/david4096/croissant-rdf/blob/main/example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/david4096/huggingface-rdf)

![image](https://github.com/user-attachments/assets/444fe9e9-0838-4f67-be7b-0f22a0789817)

`croissant-rdf` is a Python tool that generates RDF (Resource Description Framework) data from dataset metadata available on providers such as Hugging Face and Kaggle. It enables researchers and developers to convert that metadata into a machine-readable format for enhanced querying and data analysis.

This is made possible by an effort to align with the [MLCommons Croissant](https://github.com/mlcommons/croissant) schema, to which Hugging Face and other providers conform.
## Features

- Fetch dataset metadata from Hugging Face or Kaggle.
- Convert dataset metadata to RDF.
- Generate Turtle (`.ttl`) files for easy integration with SPARQL endpoints.

## Installation

croissant-rdf is available on PyPI!

```bash
pip install croissant-rdf
```

## Usage

After installing the package, you can use the command-line interface (CLI) to generate RDF data:

```sh
export HF_API_KEY={YOUR_KEY}
huggingface-rdf --fname huggingface.ttl --limit 10
```
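The generated Turtle file is ordinary RDF, so it can be inspected programmatically. Below is a minimal sketch using [`rdflib`](https://rdflib.readthedocs.io/) (not a stated project dependency, so install it separately); it assumes the `huggingface.ttl` file produced by the command above and the `https://schema.org/` namespace used in the queries later in this README:

```python
from rdflib import Graph, URIRef

# Parse the Turtle file produced by huggingface-rdf
g = Graph()
g.parse("huggingface.ttl", format="turtle")
print(f"Loaded {len(g)} triples")

# Print the name of each resource that has a schema.org name
name = URIRef("https://schema.org/name")
for subject, dataset_name in g.subject_objects(name):
    print(subject, dataset_name)
```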
Check out the `qlever_scripts` directory for help loading the RDF into QLever for querying.

You can also use Apache Jena Fuseki and load the generated `.ttl` file from the Fuseki UI:

```sh
docker run -it -p 3030:3030 stain/jena-fuseki
```
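Once Fuseki is running and the `.ttl` file has been uploaded through the UI, the endpoint can be queried over standard SPARQL-over-HTTP. A sketch using Python's `requests` library; the dataset name `ds` is an assumption, so substitute whatever name you created in the Fuseki UI:

```python
import requests

# POST a SPARQL query to the Fuseki endpoint; "ds" is a placeholder dataset name
resp = requests.post(
    "http://localhost:3030/ds/sparql",
    data={"query": "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"][0]["triples"]["value"])
```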
### Extracting data from Kaggle

You'll need a Kaggle API key, which comes in a file called `kaggle.json`; put the username and key into environment variables:

```sh
export KAGGLE_USERNAME={YOUR_USERNAME}
export KAGGLE_KEY={YOUR_KEY}
kaggle-rdf --fname kaggle.ttl --limit 10

# Optionally you can provide a positional argument to filter the dataset search
kaggle-rdf --fname kaggle.ttl --limit 10 covid
```
### Running via Docker

You can use the `huggingface-rdf` or `kaggle-rdf` tools via Docker:

```bash
docker run -t -v $(pwd):/app david4096/croissant-rdf huggingface-rdf --fname docker.ttl
```

This will create a Turtle file `docker.ttl` in the current working directory.

### Using Common Workflow Language (CWL)

First install [cwltool](https://www.commonwl.org/user_guide/introduction/prerequisites.html), and then you can run the workflow using:

```bash
cwltool https://raw.githubusercontent.com/david4096/croissant-rdf/refs/heads/main/workflows/huggingface-rdf.cwl --fname cwl.ttl --limit 5
```

This will output a Turtle file called `cwl.ttl` in your local directory.

### Using Docker to run a Jupyter server

To launch a Jupyter notebook server for running and developing the project locally:

Build:

```sh
docker build -t croissant-rdf-jupyter -f notebooks/Dockerfile .
```

Run Jupyter:

```sh
docker run -p 8888:8888 -v $(pwd):/app croissant-rdf-jupyter
```

The run command above works on macOS and Linux; on Windows in PowerShell, use the following:

```sh
docker run -p 8888:8888 -v ${PWD}:/app croissant-rdf-jupyter
```

After that, you can access the Jupyter notebook server at http://localhost:8888.
## Useful SPARQL Queries

SPARQL (SPARQL Protocol and RDF Query Language) is a query language used to retrieve and manipulate data stored in RDF (Resource Description Framework) format, typically within a triplestore. Here are a few useful SPARQL queries you can try on https://huggingface.co/spaces/david4096/huggingface-rdf

The basic structure of a SPARQL query is a SELECT clause, which lists the variables you would like returned in the result, and a WHERE clause, which defines the triple pattern to match in the RDF dataset.
1. Retrieve the distinct predicates from a Hugging Face RDF dataset:

```sparql
SELECT DISTINCT ?b WHERE {?a ?b ?c}
```
2. Retrieve information about each dataset, including its name, its predicates, and the count of objects associated with each predicate. The `a schema:Dataset` pattern restricts results to resources of type <https://schema.org/Dataset>:

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?name ?p (COUNT(?o) AS ?predicate_count)
WHERE {
  ?dataset a schema:Dataset ;
      schema:name ?name ;
      ?p ?o .
}
GROUP BY ?dataset ?name ?p
```
3. Retrieve distinct keyword values containing "bio", regardless of case, associated with the property <https://schema.org/keywords>:

```sparql
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?keyword
WHERE {
  ?s schema:keywords ?keyword .
  FILTER(CONTAINS(LCASE(?keyword), "bio"))
}
```
4. Retrieve distinct values for Croissant columns associated with the `cr:column` predicate:

```sparql
PREFIX cr: <http://mlcommons.org/croissant/>
SELECT DISTINCT ?column
WHERE {
  ?s cr:column ?column
}
```
5. Retrieve the names of creators and the count of items they are associated with:

```sparql
PREFIX schema: <https://schema.org/>
SELECT ?creatorName (COUNT(?s) AS ?count)
WHERE {
  ?s schema:creator ?creator .
  ?creator schema:name ?creatorName .
}
GROUP BY ?creatorName
ORDER BY DESC(?count)
```
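These queries can also be run in-process, without standing up a triplestore. A small sketch with `rdflib` (again an optional extra, not a documented project dependency) that executes query 3 against a locally generated file:

```python
from rdflib import Graph

g = Graph()
g.parse("huggingface.ttl", format="turtle")

# Query 3 from above: keywords containing "bio", case-insensitively
results = g.query("""
PREFIX schema: <https://schema.org/>
SELECT DISTINCT ?keyword
WHERE {
  ?s schema:keywords ?keyword .
  FILTER(CONTAINS(LCASE(?keyword), "bio"))
}
""")
for row in results:
    print(row.keyword)
```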
## Contributing

We welcome contributions! Please open an issue or submit a pull request!

### Development

> We recommend using [`uv`](https://docs.astral.sh/uv/getting-started/installation/) for development; it handles virtual environments and dependencies automatically and very quickly.

Create a `.env` file with the required API keys:

```sh
HF_API_KEY=hf_YYY
KAGGLE_USERNAME=you
KAGGLE_KEY=0000
```
Run for Hugging Face:

```sh
uv run --env-file .env huggingface-rdf --fname huggingface.ttl --limit 10 covid
```

Run for Kaggle:

```sh
uv run --env-file .env kaggle-rdf --fname kaggle.ttl --limit 10 covid
```

Run tests:

```sh
uv run pytest
```

> Test with an HTML coverage report:
>
> ```sh
> uv run pytest --cov-report html && uv run python -m http.server 3000 --directory ./htmlcov
> ```

Run formatting and linting:

```sh
uvx ruff format && uvx ruff check --fix
```
Start a SPARQL endpoint on the generated files using [`rdflib-endpoint`](https://github.com/vemonet/rdflib-endpoint):

```sh
uv run rdflib-endpoint serve --store Oxigraph *.ttl
```
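The local endpoint can then be queried with any SPARQL client. A sketch using [`SPARQLWrapper`](https://sparqlwrapper.readthedocs.io/), assuming the server is listening at `http://localhost:8000/` (check the startup logs for the actual address):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Point at the locally served endpoint; adjust host/port if yours differs
sparql = SPARQLWrapper("http://localhost:8000/")
sparql.setQuery("SELECT DISTINCT ?b WHERE { ?a ?b ?c }")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["b"]["value"])
```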
