SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets

Data lakes have emerged as a solution for managing vast and diverse datasets for modern data analytics. To prevent them from becoming ungoverned, semantic data management techniques are crucial, which involve connecting metadata with knowledge graphs, following the principles of Linked Data. This semantic layer enables more expressive data management, integration from various sources and enhances data access utilizing the concepts and relations to semantically enrich the data. Some frameworks have been proposed, but requirements like data versioning, linking of datasets, managing machine learning projects, automated semantic modeling and ontology-based data access are not supported in one uniform system. We demonstrate SEDAR, a comprehensive semantic data lake that includes support for data ingestion, storage, processing, and governance with a special focus on semantic data management. The demo will showcase how the system allows for various ingestion scenarios, metadata enrichment, data source linking, profiling, semantic modeling, data integration and processing inside a machine learning life cycle.

Demo Video

Publications

The code in this repository is related to the earlier publication (ERCIM NEWS). The code related to semantic modeling and OBDA for the CIKM publication will be released later.

Cite this work:

Sayed Hoseini, Ahmed Ali, Haron Shaker, and Christoph Quix. 2023. SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), October 21–25, 2023, Birmingham, United Kingdom. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3583780.3614753

@inproceedings{DBLP:conf/cikm/HoseiniASQ23,
  author       = {Sayed Hoseini and
                  Ahmed Ali and
                  Haron Shaker and
                  Christoph Quix},
  editor       = {Ingo Frommholz and
                  Frank Hopfgartner and
                  Mark Lee and
                  Michael Oakes and
                  Mounia Lalmas and
                  Min Zhang and
                  Rodrygo L. T. Santos},
  title        = {{SEDAR:} {A} Semantic Data Reservoir for Heterogeneous Datasets},
  booktitle    = {Proceedings of the 32nd {ACM} International Conference on Information
                  and Knowledge Management, {CIKM} 2023, Birmingham, United Kingdom,
                  October 21-25, 2023},
  pages        = {5056--5060},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3583780.3614753},
  doi          = {10.1145/3583780.3614753}
}

Contact

For any setup difficulties or other inquiries, please contact: [email protected]

License

This project is openly shared under the terms of the Apache License v2.0 (read for more).

Acknowledgement

We acknowledge the cooperation with A. Martin, M. Thiel, R. Kuller, L. Beer, F. Lentzen, F. Bongartz, M. Noman, T. Claas, M. Fallouh, Z. Abdullah gratefully who made this work possible. This work has been sponsored by the German Federal Ministry of Education and Research, Germany in the funding program “Forschung an Fachhochschulen”, project \href{https://www.hs-niederrhein.de/i2dach}{$I^2DACH$} (grant no. 13FH557KX0).

Overview

This is a figure that represents the architecture of the system. All individual elements are shipped in Docker containers.

Installation

The following installation instruction have been tested on a Debian GNU/Linux 11 (bullseye) virtual machine with 32GB RAM and 8 CPUs, Architecture x86_64.

For the development we have been connecting to the VM via SSH using Visual Studio Code. This complicates the setup a little, because ports have to be forwarded to your local machine. In case you are developing locally, the setup is simplified, however you might face other problems in configuring for example the spark cluster, HDFS, etc.

Install the following via terminal.

Git:

      apt-get install git curl build-essential gcc ffmpeg libsm6 libxext6  -y

Clone the repository:

  git clone https://github.com/hsnr-data-science/SEDAR.git

Docker :
An installation guide for Docker can also be found at: https://docs.docker.com/desktop/install/linux-install/

Setup the repository, update the apt package index and install packages to allow apt to use a repository over HTTPS:
```
  sudo apt-get update
  sudo apt-get install ca-certificates curl gnupg
```

Add Docker’s official GPG key:

  sudo install -m 0755 -d /etc/apt/keyrings 
  curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc apt/keyrings/docker.gpg 
  sudo chmod a+r /etc/apt/keyrings/docker.gpg

Use the following command to set up the repository:

  echo \ "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux debian \ "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Docker Engine:
- Update the apt package index:
```
  sudo apt-get update
```
- Install Docker Engine, containerd, and Docker Compose. To install the latest version, run:
```
  sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
- Verify that the Docker Engine installation is successful by running the hello-world image:
```
  sudo docker run hello-world
```
  Continue to Linux postinstall to allow non-privileged users to run Docker commands and for other optional configuration steps.

NVM (Node Version Manager)
With NVM it is much easier to make changes in NodeJS and npm.

      curl https://raw.githubusercontent.com/creationix/nvm/master/install.sh | bash 

      source ~/.bashrc

npm and NodeJS:
- Install npm and Node.js by using NVM.
```
  nvm install v18.16.0 
```
-  Use this to check which Node.js version you installed.
```
  node -v && npm --version
```
- execute this command below to be able to use npm without sudo on port 80.
```
  sudo setcap 'cap_net_bind_service=+ep' `which node`
```
- installation guide for NodeJS can also be found at this site:
  https://www.knowledgehut.com/blog/web-development/install-nodejs-on-ubuntu
Python3:
- We're running PySpark version 3.1.2 on Python 3.9
```
          sudo apt-get install python3.9
```

pip: & python-venv

      sudo apt install python3-pip python3-venv -y

Upgrade pip:
```
  pip install --upgrade pip
```

Configure the system Check out the example.env. This is the main config file to define all ports and host IP addresses, secret keys, URLs etc. It is configured to work as a standalone solution with no external JupyterHub or GitLab Identity provider. An important key is
```
 BACKEND_HOST=<ipaddressofhostsystem>
```
In our VM setup, we used the IP address of the VM host, e.g. 192.168.220.100. This key is important to ensure services, frontend and backend will be able to communicate with each other.
Copy the content of example.env into a file called .env and adjust according to your needs.
Build, pull and run containers. First, change Directory to SEDAR/src, then:
1. commons.yaml
  Now start running the command bellow:
```
 docker compose -f commons.yaml build
```
2. services.yaml
```
 docker compose -f services.yaml build
```
3. dev.yaml
```
 docker compose -f dev.yaml build
```

Running requires only services.yaml and dev.yaml. Commons does not need to be run.
4. dev.yaml

        docker compose -f dev.yaml up -d

Make sure Kafka is up and running. The services need to be able to succesfully connect to Kafka.
5. services.yaml

        docker compose -f services.yaml up -d

Frontend
- Use the command below to install dependencies which locates in SEDAR/src/frontend/:
```
  npm install
```
- Run sedar frontend:
```
  npm run start
```
Backend
- Create virtual enviroment (named flask) using the command:
```
  python3 -m venv flask
```
  And activate the virtual enviroment. After activating you will see your directory will change. Directory will be now under (flask) enviroment.
```
  source flask/bin/activate
```
- install the requirement files that are shown below to virtual enviroment:
  SEDAR/backend/mainbackend/requirements.txt
  SEDAR/backend/commons/requirements.txt
```
      pip install -r ./SEDAR/src/backend/commons/requirements.txt
      pip install -r ./SEDAR/src/backend/mainbackend/requirements.txt
```
HDFS connectivity:
- run the command:
```
      sudo nano /etc/hosts
```
  type namenode datanode1 next to YOUR:IP:ADDRESS localhost then exit with ctrl+X and save changes by pressing Y.

Debugging in VS Code To debug the mainbackend or individual services we have been using python extension in vscode.

Create a launch.json from Run & Debug Tab and past the following content. Make sure to adjust the filepaths to your system and also to kill corresponding containers first.

      {
      "version": "0.2.0",
      "configurations": [
      {
      "name": "Mainbackend",
      "python":"/home/<USER_NAME>/SEDAR/src/backend/mainbackend/flask/bin/python3",
      "type": "python",
      "request": "launch",
      "cwd": "/home/<USER_NAME>/SEDAR/src/backend/mainbackend",
      "program": "main_server.py",
      "console": "integratedTerminal",
      "envFile": "/home/<USER_NAME>/SEDAR/src/.env",
      },
      {
      "name": "Ingestion Service",
      "python":"/home/<USER_NAME>/SEDAR/src/backend/mainbackend/flask/bin/python3",
      "type": "python",
      "request": "launch",
      "cwd": "/home/<USER_NAME>/SEDAR/src/backend/mainbackend",
      "program": "server.py",
      "console": "integratedTerminal",
      "envFile": "/home/<USER_NAME>/SEDAR/src/.env",
      },
      {
      "name": "Continuation Service",
      "python":"/home/<USER_NAME>/SEDAR/src/backend/mainbackend/flask/bin/python3",
      "type": "python",
      "request": "launch",
      "cwd": "/home/<USER_NAME>/SEDAR/src/backend/mainbackend",
      "program": "server.py",
      "console": "integratedTerminal",
      "envFile": "/home/<USER_NAME>/SEDAR/src/.env",
      },
      {
      "name": "Schema Service",
      "python":"/home/<USER_NAME>/SEDAR/src/backend/mainbackend/flask/bin/python3",
      "type": "python",
      "request": "launch",
      "cwd": "/home/<USER_NAME>/SEDAR/src/backend/mainbackend",
      "program": "server.py",
      "console": "integratedTerminal",
      "envFile": "/home/<USER_NAME>/SEDAR/src/.env",
      },
      {
      "name": "Profiling Service",
      "python":"/home/<USER_NAME>/SEDAR/src/backend/mainbackend/flask/bin/python3",
      "type": "python",
      "request": "launch",
      "cwd": "/home/<USER_NAME>/SEDAR/src/backend/mainbackend",
      "program": "server.py",
      "console": "integratedTerminal",
      "envFile": "/home/<USER_NAME>/SEDAR/src/.env",
      }
    ]
}

Alternatively, you can run the servides inside terminals as well. Create four terminals to run ingestion, continuation, schema and profiling services. Make sure you have activated virtual enviroment 'flask' and for all terminals use the commands in order. Make sure to adjust the filepaths to your system and also to kill corresponding containers first:
```
      cd SEDAR/src/backend/services/ingestion/
      python3 server.py

      cd SEDAR/src/backend/services/continuation/
      python3 server.py

      cd SEDAR/src/backend/services/schema/
      python3 server.py

      cd SEDAR/src/backend/services/profiling/
      python3 server.py
```

Port-Forwarding The follwing ports need to be forwarded to host system in case of running SEDAR inside a SSH-V. These are based on the standard definition given in example.env:
80 Frontend port
5000 Backend port
8088 WebVOWL
7687 Neo4j
8000 Jupyter

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
documents		documents
sedar		sedar
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets

Demo Video

Publications

Cite this work:

Contact

License

Acknowledgement

Overview

Installation

About

Releases

Packages

Contributors 2

Languages

License

hsnr-data-science/SEDAR

Folders and files

Latest commit

History

Repository files navigation

SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets

Demo Video

Publications

Cite this work:

Contact

License

Acknowledgement

Overview

Installation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages