Version: 0.1.1
Repository: SEMT_py GitHub Repository
SemT-py is a Python library for the semantic enrichment of tabular data. It supports transforming, modifying, and enhancing tables with additional semantic information. The package is modular, so it can be adapted to different workflows, and it aims to give both expert and non-expert users an intuitive approach to complex data enrichment tasks.
With this package, users can reconcile values against external sources and extend their tables with external data, producing enriched datasets that are more useful for downstream analysis.
SemT_py
│
├── __init__.py
├── modification_manager.py
├── dataset_manager.py
├── extension_manager.py
├── reconciliation_manager.py
├── token_manager.py
├── utils.py
│
├── setup.py
├── LICENSE
├── README.md
- Root Directory (SemT_py): Contains core library files.
  - __init__.py: Initializes the package when imported.
  - data_handler.py: Manages data input/output and processing.
  - modification_manager.py: Handles modification and enrichment of data.
  - dataset_manager.py: Manages dataset operations like loading and merging.
  - extension_manager.py: Controls the addition of extensions or plugins to expand library functionalities.
  - main.py: The primary entry point for the library; orchestrates key tasks.
  - reconciliation_manager.py: Manages the reconciliation of data with external sources.
  - semtui_evals.py: Provides evaluation tools or metrics to assess data enrichment.
  - token_manager.py: Handles authentication tokens for communication with external services.
  - utils.py: Contains utility functions that assist the core functionality.
- Modular Structure: Adaptable for various data enrichment workflows.
- Semantic Enrichment: Add meaningful semantic context to tabular data.
- Reconciliation: Match table data with external sources for verification.
- Dataset Management: Efficient handling and modification of large datasets.
- Extensions: Seamlessly integrate additional features through extensions.
- Evaluation: Assess the quality of enriched datasets.
- Ease of Use: Intuitive for both experts and non-experts.
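To make the reconciliation feature more concrete, here is a minimal, self-contained sketch of the general idea: each cell value is compared against a set of reference labels, and the best match is kept together with its score. It does not use the SemT-py API. The function name `reconcile_column`, the reference list, and the choice of `difflib` as the matcher are illustrative assumptions that stand in for an external reconciliation service.

```python
from difflib import SequenceMatcher


def reconcile_column(values, reference_labels, threshold=0.8):
    """Hypothetical helper: pair each value with its best reference label.

    Returns one dict per input value holding the original value, the best
    matching label (None if nothing reaches the threshold) and the score.
    """
    results = []
    for value in values:
        best_label, best_score = None, 0.0
        for label in reference_labels:
            # Case-insensitive similarity in the range 0.0 to 1.0.
            score = SequenceMatcher(None, value.lower(), label.lower()).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_score < threshold:
            best_label = None  # too dissimilar to count as a match
        results.append(
            {"value": value, "match": best_label, "score": round(best_score, 2)}
        )
    return results


# Illustrative data only: a local list stands in for an external source.
cities = ["Milan", "Rome", "Turinn", "Atlantis"]
reference = ["Milan", "Rome", "Turin", "Naples"]
for row in reconcile_column(cities, reference):
    print(row)
```

Keeping the score next to each match makes it possible to review borderline cases by hand instead of trusting every automatic match.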
- Create a Python Virtual Environment
  - Run:
    python3 -m venv myenv
- Activate the Virtual Environment
  - For macOS and Linux:
    source myenv/bin/activate
  - For Windows:
    myenv\Scripts\activate
- Install the SemT-py Library
  - Run:
    pip install --upgrade git+https://github.com/unimib-datAI/Semtui-python.git
- Set Up Jupyter Kernel for the Virtual Environment
  - While inside your virtual environment, install the ipykernel package so that Jupyter can use this environment as a kernel:
    pip install ipykernel
  - Then, add your virtual environment to Jupyter as a new kernel:
    python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"
- Download the Sample Notebooks
  - To access the sample notebooks (sample_notebook.ipynb, SEMTUI_test_Notebook.ipynb), download them individually from the following GitHub folder:
  - To download a notebook:
    - Click on the notebook name (e.g., sample_notebook.ipynb).
    - Find the download button at the top-right corner to download the file.
    - Save the notebook in the myenv/Sample Notebooks/ directory.
  - Optionally Move Downloaded Files Using the Terminal: If you've downloaded the notebooks to the Downloads folder, you can move them to the myenv directory using the terminal.
  - For example, to move sample_table.csv and SEMTUI_test_Notebook.ipynb:
    mv ~/Downloads/sample_table.csv ~/myenv/Sample\ Notebooks/
    mv ~/Downloads/SEMTUI_test_Notebook.ipynb ~/myenv/Sample\ Notebooks/
  - Suggested Folder Structure:
    project-folder/
    │
    ├── myenv/                          # Virtual environment folder
    │   ├── Sample Notebooks/           # Folder to store notebooks and data files
    │   │   ├── sample_notebook.ipynb
    │   │   ├── SEMTUI_test_Notebook.ipynb
    │   │   ├── sample_data.csv         # Newly added sample data file
    │   │   └── your_script.py          # Any Python scripts you create
  - This way, all the necessary files will be accessible from Jupyter, even if its access is restricted to the myenv folder.
- Explore the Sample Notebook
  - Launch Jupyter Notebook:
    - Run:
      jupyter notebook
  - Switch the Kernel to the Virtual Environment:
    - After opening a notebook, ensure the kernel is set to the correct virtual environment:
      - In the Jupyter notebook interface, click on Kernel > Change Kernel.
      - Select Python (myenv) to use the virtual environment you set up earlier.
  - Open the Sample Notebooks:
    - Navigate to the myenv/Sample Notebooks/ folder and open either:
      - sample_notebook.ipynb
      - SEMTUI_test_Notebook.ipynb
  - Run and Review:
    - Execute the cells to see example implementations.
Note
- Ensure Git is installed on your system since the library is fetched from a GitHub repository.
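After completing the installation steps, a short check like the following can confirm the setup from a Python prompt or a notebook cell. It is a sketch and not part of SemT-py: `check_environment` is a hypothetical helper, and the import name `SemT_py` is assumed from the package directory name, so it may differ in the installed package.

```python
import importlib.util
import shutil
import sys


def check_environment(package_name="SemT_py"):
    """Hypothetical helper: report basic facts about the current setup."""
    return {
        # Inside a venv, sys.prefix points at the environment while
        # sys.base_prefix still points at the base interpreter.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Git is needed for `pip install git+https://...`.
        "git_available": shutil.which("git") is not None,
        # find_spec returns None when a top-level package cannot be found.
        "package_importable": importlib.util.find_spec(package_name) is not None,
    }


for check, ok in check_environment().items():
    print(f"{check}: {'ok' if ok else 'missing'}")
```

Running it in a notebook also shows whether the selected kernel really uses the virtual environment.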
SemT-py relies on the following Python libraries:
- pandas - for efficient data handling and manipulation.
- numpy - for numerical computations.
- chardet - for character encoding detection.
- PyJWT - for secure token handling and authentication.
- fake-useragent - to generate random user agents for web scraping.
- requests - for making HTTP requests to external APIs.
All dependencies are automatically installed when using pip.
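To see which versions of these dependencies ended up in the environment, a small check based on the standard library's `importlib.metadata` can be used. This is a sketch: `installed_versions` is a hypothetical helper, and the distribution names are taken from the list above.

```python
from importlib import metadata

# Distribution names as given in the dependency list.
DEPENDENCIES = ["pandas", "numpy", "chardet", "PyJWT", "fake-useragent", "requests"]


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```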
Here’s a quick start guide to using SemT-py:
from SemT_py import dataset_manager, data_modifier, reconciliation_manager
# Load a dataset
dataset = dataset_manager.load_dataset('path_to_dataset.csv')
# Modify the dataset by applying enrichment
modified_dataset = data_modifier.modify_data(dataset)
# Reconcile values from an external source
reconciled_data = reconciliation_manager.reconcile_data(modified_dataset)
# Save the enriched and reconciled dataset
dataset_manager.save_dataset(reconciled_data, 'enriched_dataset.csv')
SemT-py lets users load tabular data, enrich it with external semantic information, reconcile it against external data sources, and evaluate the final dataset. A typical workflow has five steps:
- Load Data: Load raw tabular data from a CSV or other supported formats.
- Modify Data: Apply transformations and add semantic information from external sources.
- Reconcile Data: Match and validate table data with authoritative external sources (e.g., APIs).
- Evaluate Data: Ensure the enriched data is accurate and of high quality.
- Save Data: Export the final, enriched dataset for further use or analysis.
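The five steps can be pictured with a small standalone sketch that uses only the Python standard library. None of the functions below belong to SemT-py: their names, the in-memory lookup table that stands in for an external source, and the sample values are all assumptions made for illustration.

```python
import csv
import io


def load_rows(text):
    """Load: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def modify_rows(rows, column):
    """Modify: normalise whitespace and capitalisation in one column."""
    return [{**row, column: row[column].strip().title()} for row in rows]


def reconcile_rows(rows, column, lookup):
    """Reconcile: attach an identifier taken from a lookup table.

    `lookup` stands in for an external source such as an API.
    """
    return [{**row, f"{column}_id": lookup.get(row[column], "")} for row in rows]


def evaluate_rows(rows, id_column):
    """Evaluate: fraction of rows that received an identifier."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[id_column]) / len(rows)


def save_rows(rows):
    """Save: serialise the rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample table and identifiers, for illustration only.
raw = "city,visits\n milan ,10\nROME,20\natlantis,0\n"
lookup = {"Milan": "ID-001", "Rome": "ID-002"}

rows = load_rows(raw)
rows = modify_rows(rows, "city")
rows = reconcile_rows(rows, "city", lookup)
coverage = evaluate_rows(rows, "city_id")
print(f"matched {coverage:.0%} of rows")
print(save_rows(rows))
```

Each step takes rows in and returns new rows, so steps can be reordered or replaced without touching the others.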
SemT-py is designed to be modular and extensible. You can add custom functionalities by writing extensions.
To help you get started, SemT-py comes with example Jupyter notebooks that showcase its functionalities. Open and run these notebooks to see how to implement various tasks such as data loading, enrichment, reconciliation, and evaluation.
jupyter notebook "Sample Notebooks/sample_notebook.ipynb"
Feel free to contribute to the project by forking the repository and submitting a pull request.
To install SemT-py locally, follow these steps:
- Clone the repository:
  git clone https://github.com/unimib-datAI/Semtui-python.git
- Navigate to the cloned directory:
  cd Semtui-python
- Create and activate a virtual environment (optional but recommended):
  - For macOS/Linux:
    python3 -m venv myenv
    source myenv/bin/activate
  - For Windows:
    python -m venv myenv
    myenv\Scripts\activate
- Install the library:
  pip install .
This project is licensed under the MIT License. See the LICENSE file for more information.
For any questions, suggestions, or feedback, feel free to reach out to:
- Special thanks to the open-source community for providing essential libraries that power this project.