Refactor RO Crate #47

Merged 4 commits on Sep 11, 2024
51 changes: 26 additions & 25 deletions README.md
@@ -1,37 +1,42 @@
# WorkflowHub Knowledge Graph

## Getting started
A tool to generate a knowledge graph from a source of RO Crates. By default, this tool sources and generates an RDF graph of crates from [WorkflowHub](https://workflowhub.eu/).

### Obtaining workflowhub-graph
## Getting Started

workflowhub-graph is available packaged as a Docker container. You can pull the latest version of the container by running:
This tool is run as a Snakemake workflow. We recommend building a Docker container to run the workflow:

```bash
docker pull ghcr.io/uomresearchit/workflowhub-graph:latest
```bash
docker build -t knowledgegraph .
```

This provides a wrapper for the executable `workflowhub-graph`, which can be used to run the various tools provided by the package.
Then, you can run the workflow using the following command:

### Running workflowhub-graph
```bash
docker run --rm -v $(pwd):/app -w /app knowledgegraph --cores 4 -s /app/Snakefile
```

There are several tools provided by the `workflowhub-graph` package. These are:
- 'help': Display help information.
- 'source-crates': Download ROCrates from the WorkflowHub API.
- 'absolutize': Make all paths in an ROCrate absolute.
- 'upload': Upload an ROCrate to Zenodo.
- 'merge': Merge multiple ROCrates into an RDF graph.
This command runs a Docker container using the `knowledgegraph` image. It mounts the working directory to `/app`
inside the container, sets `/app` as the working directory, and then runs the workflow. Once the workflow completes,
the container is automatically removed.

To run any of these tools, you can use the following command:
## Structure

```bash
docker run ghcr.io/uomresearchit/workflowhub-graph:latest <tool> <args>
```mermaid
flowchart TD
A[Source RO Crates] --> B[Check Outputs];
B[Check Outputs] --> C[Report Downloaded RO Crates];
B[Check Outputs]-->D[Merge RO Crates];
D[Merge RO Crates]-->E[Create Merged Workflow Run RO Crate]
```

For example, to download ROCrates from the WorkflowHub API, you can run:
- **`source_ro_crates`**: This rule sources RO crates from the WorkflowHub API (`source_crates.py`) and then checks
the output (`check_outputs.py`). The check builds a list of expected file paths from the workflow IDs and versions and
records the paths that exist in `created_files.json`.

```bash
docker run ghcr.io/uomresearchit/workflowhub-graph:latest source-crates
```
- **`report_created_files`**: Optional. This rule reports the downloaded RO crates to the user.
- **`merge_files`**: This rule merges the downloaded RO crates into a single RDF graph (`merge_ro_crates.py`).
- **`create_ro_crate`**: This rule creates a merged workflow run RO crate from the merged RDF graph (`create_ro_crate.py`).
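
As a rough illustration of the structure above, the flowchart can be read as a dependency graph and ordered with a topological sort. This is only a sketch: the step names are copied from the diagram, while `DEPENDENCIES` and `execution_order` are hypothetical names, and Snakemake resolves the real order from each rule's inputs and outputs rather than from a table like this.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on, as drawn in the flowchart.
# (Hypothetical table for illustration; not part of the Snakefile.)
DEPENDENCIES = {
    "Source RO Crates": set(),
    "Check Outputs": {"Source RO Crates"},
    "Report Downloaded RO Crates": {"Check Outputs"},
    "Merge RO Crates": {"Check Outputs"},
    "Create Merged Workflow Run RO Crate": {"Merge RO Crates"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

The reporting and merging steps both depend only on the output check, so their relative order is not fixed.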

## Contributing

@@ -46,10 +51,6 @@ docker run ghcr.io/uomresearchit/workflowhub-graph:latest source-crates
- **Development Branch**: The `develop` branch is currently our main integration branch. Features and fixes should target `develop` through PRs.
- **Feature Branches**: These feature branches should be short-lived and focused. Once done, please create a pull request to merge it into `develop`.

## Overview

![arch_diagram.png](./docs/images/arch_diagram.png)

## License

[BSD 2-Clause License](https://opensource.org/license/bsd-2-clause)
2 changes: 1 addition & 1 deletion Snakefile
@@ -88,7 +88,7 @@ rule create_ro_crate:
# pip uninstall urllib3
# Install required packages
pip install requests urllib3 rocrate rocrate-zenodo
pip install requests urllib3 rocrate
# Run the create_ro_crate script
python workflowhub_graph/create_ro_crate.py {input} {params.workflow_file} {output}
Empty file.
106 changes: 106 additions & 0 deletions ro-crate-metadata/#12c6426a-fe66-48e6-9863-bde836ce0b16/absolutize.py
@@ -0,0 +1,106 @@
import argparse
import copy
import json
from urllib.parse import urlparse
import arcp
import rdflib


# TODO: following https://github.com/workflowhub-eu/workflowhub-graph/issues/12
# building upon is_all_absolute
# add extended RO-Crate profile validation
# get information like schema.org domain and check if the graph is compliant with the schema
# normative schema.org dev docs: https://schema.org/docs/developers.html
# make a note for validation of the graph


def is_all_absolute(G: rdflib.Graph) -> bool:
    for triple in G:
        for item in triple:
            if isinstance(item, rdflib.URIRef):
                # TODO: is this enough?
                parsed = urlparse(item)

                # we accept file:// with a netloc, even if netloc is not a FQDN,
                # see https://github.com/workflowhub-eu/workflowhub-graph/issues/1#issuecomment-2127351752
                if parsed.netloc == "" and parsed.scheme != "mailto":
                    print(
                        f"found non-absolute path <{item}> {parsed.netloc}, {urlparse(item)}"
                    )
                    return False
    return True


def make_paths_absolute(
    json_data: dict, workflowhub_url: str, workflow_id: int, workflow_version: int
) -> dict:
    """
    Makes all paths in the JSON content absolute by adding an '@base' key to the JSON-LD context.
    :param json_data: The JSON content as a dictionary.
    :param workflowhub_url: The base URL for WorkflowHub.
    :param workflow_id: The workflow ID to construct the absolute paths.
    :param workflow_version: The workflow version.
    :return: The modified JSON content with absolute paths.
    :raises ValueError: If '@context' key is missing or if '@base' key already exists in the JSON content.
    """

    json_data = copy.deepcopy(json_data)

    workflow_url = (
        f"{workflowhub_url}/workflows/{workflow_id}/ro_crate?version={workflow_version}"
    )

    if "@context" not in json_data:
        raise ValueError(
            "The JSON content does not contain a '@context' key, refusing to add it, can not absolutize paths"
        )

    if not isinstance(json_data["@context"], list):
        json_data["@context"] = [json_data["@context"]]

    if any(
        isinstance(item, dict) and "@base" in item for item in json_data["@context"]
    ):
        raise ValueError(
            "The JSON content already contains an '@base' key, it was probably already processed."
        )

    json_data["@context"].append({"@base": arcp.arcp_location(workflow_url)})

    return json_data


def main():
    parser = argparse.ArgumentParser(
        description="Make all paths in a JSON file absolute."
    )
    parser.add_argument("json_file", help="The JSON file to process.")
    parser.add_argument("output_file", help="The output file.")
    parser.add_argument("workflow_id", help="The Workflow ID.")
    parser.add_argument("workflow_version", help="The Workflow version.")
    parser.add_argument(
        "-u",
        "--workflowhub-url",
        help="The WorkflowHub URL.",
        default="https://workflowhub.eu",
    )

    args = parser.parse_args()

    with open(args.json_file, "r") as f:
        json_data = json.load(f)

    processed_json_data = make_paths_absolute(
        json_data, args.workflowhub_url, args.workflow_id, args.workflow_version
    )

    if args.output_file == "-":
        print(json.dumps(processed_json_data, indent=2))
    else:
        with open(args.output_file, "w") as f:
            json.dump(processed_json_data, f, indent=2)


if __name__ == "__main__":
    main()
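
To make the effect of the `@base` entry concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR: `add_base`, `is_absolute`, and the `https://example.org/...` base are hypothetical, a plain HTTPS URL stands in for the `arcp` location the real script computes, and `urljoin` only approximates how a JSON-LD processor resolves relative `@id` values against `@base`.

```python
import copy
from urllib.parse import urljoin, urlparse


def add_base(json_data: dict, base_url: str) -> dict:
    """Return a copy of json_data whose '@context' list ends with {'@base': base_url}."""
    result = copy.deepcopy(json_data)
    context = result["@context"]
    if not isinstance(context, list):
        context = [context]
    context.append({"@base": base_url})
    result["@context"] = context
    return result


def is_absolute(identifier: str) -> bool:
    """Same test as is_all_absolute: a netloc is required, except for mailto: URIs."""
    parsed = urlparse(identifier)
    return parsed.netloc != "" or parsed.scheme == "mailto"


# Hypothetical crate metadata with relative identifiers.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [{"@id": "./"}, {"@id": "workflow.ga"}],
}
BASE = "https://example.org/crates/1/"  # assumption: stand-in for the arcp location

with_base = add_base(crate, BASE)
# Approximate what a JSON-LD processor does with '@base' when expanding '@id' values.
resolved = [urljoin(BASE, entity["@id"]) for entity in with_base["@graph"]]
print(resolved)
```

Working on a deep copy, as `make_paths_absolute` does, keeps the caller's dictionary unchanged.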
@@ -0,0 +1,81 @@
import json
import os
import re
from unittest.mock import patch, MagicMock
from contextlib import contextmanager
import io
from urllib.parse import urlparse
from urllib.request import urlopen


def url_to_filename(url):
    """
    Converts a URL to a filename by removing non-alphanumeric characters and replacing them with dashes.
    :param url: The URL to convert.
    :return: The filename.
    """

    parsed = urlparse(url)
    if parsed.scheme not in ["http", "https"]:
        raise ValueError(f"Unsupported scheme {parsed.scheme}")

    return re.sub("[^0-9a-z]+", "-", (parsed.netloc + parsed.path).lower().strip("_"))


@contextmanager
def patch_rdflib_urlopen(
    cache_base_dir=None,
    write_cache=True,
    allowed_urls_pattern=r"https://w3id.org/ro/crate/1\.[01]/context",
):
    """
    Context manager to patch rdflib.parser.urlopen to cache and return the content of a URL.
    :param cache_base_dir: The base directory to store the cached files.
    :param write_cache: Whether to write the cache if the file is not found.
    :param allowed_urls_pattern: A regex pattern to match the allowed URLs to cache.
    """

    allowed_urls_re = re.compile(allowed_urls_pattern)
    if cache_base_dir is None:
        cache_base_dir = "cached_urlopen"
    os.makedirs(cache_base_dir, exist_ok=True)

    def cached_urlopen(request):
        url = request.get_full_url()

        class Response(io.StringIO):
            content_type = "text/html"
            headers = {"Content-Type": "text/html"}

            def info(self):
                return self.headers

            def geturl(self):
                return url

        if not allowed_urls_re.match(url):
            return Response(json.dumps({"@context": {}}))
            # raise ValueError(
            #     f"URL {url} not allowed to cache, allowed: {allowed_urls_pattern}"
            # )

        cached_filename = os.path.join(cache_base_dir, url_to_filename(url))

        if not os.path.exists(cached_filename):
            if write_cache:
                response = urlopen(request)
                content = response.read().decode("utf-8")

                with open(cached_filename, "wt") as f:
                    f.write(content)
            else:
                raise ValueError(
                    f"Cache file {cached_filename} not found, not allowed to download and update cache"
                )

        # Read the cached copy back, closing the file handle afterwards
        with open(cached_filename, "rt") as f:
            content = f.read()

        return Response(content)

    with patch("rdflib.parser.urlopen", cached_urlopen):
        yield
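
The caching idea in this file can be tried without rdflib or network access. The sketch below reuses the `url_to_filename` expression from the diff, but `cached_fetch` and `fake_fetch` are hypothetical helpers added for illustration; the real code patches `rdflib.parser.urlopen` instead of taking a fetch callable.

```python
import os
import re
import tempfile
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Same expression as in the diff: lowercased host and path, runs of other characters become '-'."""
    parsed = urlparse(url)
    if parsed.scheme not in ["http", "https"]:
        raise ValueError(f"Unsupported scheme {parsed.scheme}")
    return re.sub("[^0-9a-z]+", "-", (parsed.netloc + parsed.path).lower().strip("_"))


def cached_fetch(url: str, cache_dir: str, fetch) -> str:
    """Return the content for url, calling fetch(url) only when no cache file exists yet."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, url_to_filename(url))
    if not os.path.exists(path):
        with open(path, "wt", encoding="utf-8") as f:
            f.write(fetch(url))
    with open(path, "rt", encoding="utf-8") as f:
        return f.read()


calls = []


def fake_fetch(url: str) -> str:
    """Stand-in for a network request; records each call so cache hits are visible."""
    calls.append(url)
    return '{"@context": {}}'


with tempfile.TemporaryDirectory() as tmp:
    context_url = "https://w3id.org/ro/crate/1.1/context"
    first = cached_fetch(context_url, tmp, fake_fetch)
    second = cached_fetch(context_url, tmp, fake_fetch)  # served from the cache file

print(url_to_filename(context_url), len(calls))
```

Passing the fetch function in as a parameter is a design choice of this sketch; it keeps the example testable offline.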
@@ -0,0 +1,110 @@
import argparse
import json
import os
import re


def parse_args() -> argparse.Namespace:
    """
    Parse command-line arguments.
    :return: Parsed command-line arguments.
    """
    parser = argparse.ArgumentParser(
        description="Generate list of created files based on workflow IDs and versions."
    )
    parser.add_argument(
        "--workflow-ids",
        type=str,
        help="Range of workflow IDs to process (e.g., '1-10').",
    )
    parser.add_argument(
        "--versions",
        type=str,
        required=True,
        help="Comma-separated list of versions to process (e.g., '1,2,3').",
    )
    parser.add_argument(
        "--output-dir",
        type=str,
        default="data",
        help="Directory where the output files are stored (default: 'data').",
    )
    return parser.parse_args()


def get_max_id_from_files(output_dir: str) -> int:
    """
    If no workflow ID parameter is provided, get the maximum workflow ID from the files in the output directory.
    :param output_dir: The directory where output files are stored.
    :return: The maximum workflow ID.
    """
    max_id = 0
    pattern = re.compile(r"^(\d+)_\d+_ro-crate-metadata\.json$")
    for filename in os.listdir(output_dir):
        match = pattern.match(filename)
        if match:
            wf_id = int(match.group(1))
            if wf_id > max_id:
                max_id = wf_id
    return max_id


def generate_expected_files(
    output_dir: str, workflow_ids: range, versions: list[str]
) -> list[str]:
    """
    Generate a list of expected file paths based on the workflow IDs and versions.
    :param output_dir: The directory where output files are stored.
    :param workflow_ids: The range of workflow IDs to process.
    :param versions: The list of versions to process.
    :return: A list of expected file paths.
    """

    expected_files = []
    for wf_id in workflow_ids:
        for ver in versions:
            expected_files.append(f"{output_dir}/{wf_id}_{ver}_ro-crate-metadata.json")
    return expected_files


def verify_created_files(expected_files: list[str]) -> list[str]:
    """
    Verify which files from the list of expected files actually exist.
    :param expected_files: The list of expected file paths.
    :return: A list of file paths that actually exist.
    """
    return [f for f in expected_files if os.path.exists(f)]


def main():
    args = parse_args()

    if args.workflow_ids:
        min_id, max_id = map(int, args.workflow_ids.split("-"))
        workflow_ids = range(min_id, max_id + 1)
    else:
        max_id = get_max_id_from_files(args.output_dir)
        workflow_ids = range(1, max_id + 1)

    versions = args.versions.split(",")

    # Generate expected file paths
    expected_files = generate_expected_files(args.output_dir, workflow_ids, versions)

    # Check which files were actually created
    created_files = verify_created_files(expected_files)

    # Output the list of created files to a JSON file
    with open("created_files.json", "w") as f:
        json.dump(created_files, f)

    print("\nFile names written to created_files.json")


if __name__ == "__main__":
    main()
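
The expected-versus-created check above can be exercised in isolation. The following is a small sketch rather than the script itself: `parse_id_range` and `expected_files` are hypothetical helpers that mirror the logic in `main` and `generate_expected_files`, and a temporary directory stands in for the `data` output directory.

```python
import os
import tempfile
from itertools import product


def parse_id_range(spec: str) -> range:
    """Turn an inclusive 'min-max' string such as '1-3' into a range."""
    low, high = map(int, spec.split("-"))
    return range(low, high + 1)


def expected_files(output_dir: str, workflow_ids, versions) -> list:
    """One '<id>_<version>_ro-crate-metadata.json' path per (workflow ID, version) pair."""
    return [
        os.path.join(output_dir, f"{wf_id}_{ver}_ro-crate-metadata.json")
        for wf_id, ver in product(workflow_ids, versions)
    ]


with tempfile.TemporaryDirectory() as tmp:
    expected = expected_files(tmp, parse_id_range("1-3"), ["1", "2"])
    # Simulate a run in which only the first crate was downloaded.
    open(expected[0], "w").close()
    created = [path for path in expected if os.path.exists(path)]

print(len(expected), len(created))
```

Because every ID in the range is paired with every requested version, most expected paths may not exist on disk, which is why the script filters the list before writing `created_files.json`.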