MITHRIL is a tool designed to discover schemas within Property Graph Databases. It supports incremental schema discovery and helps identify the structure, patterns, and relationships in graph data, making datasets easier to understand and explore.
The project supports schema discovery on popular datasets like LDBC, FIB25, and MB6, and integrates with Neo4j for seamless graph data management.
To install and set up MITHRIL, follow these steps:
git clone https://github.com/sophisid/MITHRIL.git
cd MITHRIL/
Navigate to the schemadiscovery directory and build the project using sbt (Scala Build Tool):
cd schemadiscovery
sbt compile
MITHRIL relies on Neo4j for managing and querying the property graphs. To set up Neo4j:
wget https://dist.neo4j.org/neo4j-community-4.4.0-unix.tar.gz
tar -xzf neo4j-community-4.4.0-unix.tar.gz
cd neo4j-community-4.4.0
You can set the NEO4J_DIR environment variable to this directory:
export NEO4J_DIR=$(pwd)
bin/neo4j start
Neo4j will be available at http://localhost:7474.
- Set the initial password for the default user neo4j when prompted.
- Access the Neo4j browser and verify the connection.
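Because bin/neo4j start returns before the server accepts connections, it can help to poll the HTTP endpoint before opening the browser. The following is a minimal sketch, assuming curl is installed; wait_for_neo4j and its default of 30 attempts are hypothetical and not part of MITHRIL.

```shell
# wait_for_neo4j [URL] [ATTEMPTS]
# Hypothetical helper: polls URL once per second until it answers
# or the attempts run out. Not part of MITHRIL.
wait_for_neo4j() {
  local url="${1:-http://localhost:7474}"
  local attempts="${2:-30}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s hides progress output, -f turns HTTP errors into a non-zero exit.
    if curl -sf -o /dev/null "$url"; then
      echo "Neo4j is reachable at $url"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  echo "Neo4j did not respond at $url after $attempts attempts" >&2
  return 1
}
```

A typical call would be wait_for_neo4j http://localhost:7474 60 right after bin/neo4j start.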
The project includes evaluation datasets (FIB25, LDBC, MB6) that need to be unzipped and loaded into Neo4j.
cd datasets
unzip FIB25/fib25_neo4j_inputs.zip
unzip LDBC/ldbc_neo4j_inputs.zip
unzip MB6/mb6_neo4j_inputs1.zip
Before importing, ensure that:
- Neo4j is stopped if currently running.
- NEO4J_DIR is set to your Neo4j installation directory.
- current_dataset_dir is set to the path containing the dataset CSV files.
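The three preconditions above can be checked in one place before any import is attempted. This is a sketch only: check_import_env is a hypothetical helper, and it assumes that bin/neo4j status exits with 0 only while the server is running.

```shell
# check_import_env
# Hypothetical helper: verifies the preconditions listed above.
check_import_env() {
  if [ -z "${NEO4J_DIR:-}" ] || [ ! -d "$NEO4J_DIR" ]; then
    echo "NEO4J_DIR is not set to an existing directory" >&2
    return 1
  fi
  if [ -z "${current_dataset_dir:-}" ] || [ ! -d "$current_dataset_dir" ]; then
    echo "current_dataset_dir is not set to an existing directory" >&2
    return 1
  fi
  # Assumption: `neo4j status` returns 0 only when the server is running.
  if [ -x "$NEO4J_DIR/bin/neo4j" ] && "$NEO4J_DIR/bin/neo4j" status >/dev/null 2>&1; then
    echo "Neo4j appears to be running; stop it before importing" >&2
    return 1
  fi
  return 0
}
```

Calling check_import_env || exit 1 at the top of an import script stops it early with a readable message.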
General Preparation Steps:
cd $NEO4J_DIR
bin/neo4j stop # Stop Neo4j if running
rm -rf $NEO4J_DIR/data/databases/neo4j # Delete old database if needed
export current_dataset_dir=<path-to-datasets>
Import the LDBC dataset:
$NEO4J_DIR/bin/neo4j-admin import --database=neo4j --delimiter='|' \
--nodes=Forum="$current_dataset_dir/forum_0_0_corrupted.csv" \
--nodes=Person="$current_dataset_dir/person_0_0_corrupted.csv" \
--nodes=Post="$current_dataset_dir/post_0_0_corrupted.csv" \
--nodes=Place="$current_dataset_dir/place_0_0_corrupted.csv" \
--nodes=Organisation="$current_dataset_dir/organisation_0_0_corrupted.csv" \
--nodes=TagClass="$current_dataset_dir/tagclass_0_0_corrupted.csv" \
--nodes=Tag="$current_dataset_dir/tag_0_0_corrupted.csv" \
--relationships=CONTAINER_OF="$current_dataset_dir/forum_containerOf_post_0_0_corrupted.csv" \
--relationships=HAS_MEMBER="$current_dataset_dir/forum_hasMember_person_0_0_corrupted.csv" \
--relationships=HAS_MODERATOR="$current_dataset_dir/forum_hasModerator_person_0_0_corrupted.csv" \
--relationships=HAS_TAG="$current_dataset_dir/forum_hasTag_tag_0_0_corrupted.csv" \
--relationships=HAS_INTEREST="$current_dataset_dir/person_hasInterest_tag_0_0_corrupted.csv" \
--relationships=IS_LOCATED_IN="$current_dataset_dir/person_isLocatedIn_place_0_0_corrupted.csv" \
--relationships=KNOWS="$current_dataset_dir/person_knows_person_0_0_corrupted.csv" \
--relationships=LIKES="$current_dataset_dir/person_likes_post_0_0_corrupted.csv" \
--relationships=STUDIES_AT="$current_dataset_dir/person_studyAt_organisation_0_0_corrupted.csv" \
--relationships=WORKS_AT="$current_dataset_dir/person_workAt_organisation_0_0_corrupted.csv" \
--relationships=HAS_CREATOR="$current_dataset_dir/post_hasCreator_person_0_0_corrupted.csv" \
--relationships=HAS_TAG="$current_dataset_dir/post_hasTag_tag_0_0_corrupted.csv" \
--relationships=IS_LOCATED_IN="$current_dataset_dir/post_isLocatedIn_place_0_0_corrupted.csv" \
--relationships=IS_LOCATED_IN="$current_dataset_dir/organisation_isLocatedIn_place_0_0_corrupted.csv" \
--relationships=IS_PART_OF="$current_dataset_dir/place_isPartOf_place_0_0_corrupted.csv" \
--relationships=HAS_TYPE="$current_dataset_dir/tag_hasType_tagclass_0_0_corrupted.csv" \
--relationships=IS_SUBCLASS_OF="$current_dataset_dir/tagclass_isSubclassOf_tagclass_0_0_corrupted.csv"
Import the MB6 dataset:
$NEO4J_DIR/bin/neo4j-admin import --database=neo4j --delimiter=',' \
--nodes=Meta="$current_dataset_dir/Neuprint_Meta_mb6_corrupted.csv" \
--nodes=Neuron="$current_dataset_dir/Neuprint_Neurons_mb6_corrupted.csv" \
--relationships=CONNECTS_TO="$current_dataset_dir/Neuprint_Neuron_Connections_mb6_corrupted.csv" \
--nodes=SynapseSet="$current_dataset_dir/Neuprint_SynapseSet_mb6_corrupted.csv" \
--relationships=CONNECTS_TO="$current_dataset_dir/Neuprint_SynapseSet_to_SynapseSet_mb6_corrupted.csv" \
--relationships=CONTAINS="$current_dataset_dir/Neuprint_Neuron_to_SynapseSet_mb6_corrupted.csv" \
--nodes=Synapse="$current_dataset_dir/Neuprint_Synapses_mb6_corrupted.csv" \
--relationships=SYNAPSES_TO="$current_dataset_dir/Neuprint_Synapse_Connections_mb6_corrupted.csv" \
--relationships=CONTAINS="$current_dataset_dir/Neuprint_SynapseSet_to_Synapses_mb6_corrupted.csv"
Import the FIB25 dataset:
$NEO4J_DIR/bin/neo4j-admin import --database=neo4j --delimiter=',' \
--nodes=Meta="$current_dataset_dir/Neuprint_Meta_fib25_corrupted.csv" \
--nodes=Neuron="$current_dataset_dir/Neuprint_Neurons_fib25_corrupted.csv" \
--relationships=CONNECTS_TO="$current_dataset_dir/Neuprint_Neuron_Connections_fib25_corrupted.csv" \
--nodes=SynapseSet="$current_dataset_dir/Neuprint_SynapseSet_fib25_corrupted.csv" \
--relationships=CONNECTS_TO="$current_dataset_dir/Neuprint_SynapseSet_to_SynapseSet_fib25_corrupted.csv" \
--relationships=CONTAINS="$current_dataset_dir/Neuprint_Neuron_to_SynapseSet_fib25_corrupted.csv" \
--nodes=Synapse="$current_dataset_dir/Neuprint_Synapses_fib25_corrupted.csv" \
--relationships=SYNAPSES_TO="$current_dataset_dir/Neuprint_Synapse_Connections_fib25_corrupted.csv" \
--relationships=CONTAINS="$current_dataset_dir/Neuprint_SynapseSet_to_Synapses_fib25_corrupted.csv"
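Each import command above references many CSV files, and one mistyped or missing file makes the import fail. A small pre-flight check can list the missing files first. This is a sketch; missing_csvs is a hypothetical helper and not part of MITHRIL.

```shell
# missing_csvs DIR FILE...
# Hypothetical helper: reports every FILE that is absent from DIR and
# returns non-zero if at least one is missing.
missing_csvs() {
  local dir="$1" missing=0 f
  shift
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For FIB25, for example, it could be called as missing_csvs "$current_dataset_dir" Neuprint_Meta_fib25_corrupted.csv Neuprint_Neurons_fib25_corrupted.csv, extended with the remaining file names from the command.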
- Verify the Import:
$NEO4J_DIR/bin/neo4j-admin check-consistency --database=neo4j
- Start the Neo4j Server:
cd $NEO4J_DIR
bin/neo4j start
- Access the Neo4j Browser:
Open http://localhost:7474 to visualize the graph and run queries.
- Delimiters:
  - Use | for the LDBC dataset.
  - Use , for the MB6 and FIB25 datasets.
- CSV Format Requirements:
  - Nodes must have a unique id field.
  - Relationships must have :START_ID, :END_ID, and :TYPE columns.
- Reloading Data:
If you need to re-import, remove the database first:
rm -rf $NEO4J_DIR/data/databases/neo4j
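The delimiter and header rules above can be written as two small checks. This is a sketch under stated assumptions: delimiter_for and has_relationship_columns are hypothetical helpers, and the header test is a loose substring match rather than a full validation of the Neo4j import format.

```shell
# delimiter_for DATASET
# Prints the field delimiter given in the notes above.
delimiter_for() {
  case "$1" in
    ldbc|LDBC) echo '|' ;;
    mb6|MB6|fib25|FIB25) echo ',' ;;
    *) echo "unknown dataset: $1" >&2; return 1 ;;
  esac
}

# has_relationship_columns FILE
# Checks that the first line of FILE mentions :START_ID, :END_ID and :TYPE.
has_relationship_columns() {
  local header col
  header=$(head -n 1 "$1") || return 1
  for col in ':START_ID' ':END_ID' ':TYPE'; do
    case "$header" in
      *"$col"*) ;;
      *) echo "$1: header lacks $col" >&2; return 1 ;;
    esac
  done
}
```

The result of delimiter_for ldbc can be passed to the --delimiter option, for instance --delimiter="$(delimiter_for ldbc)".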
Once the setup is complete and the datasets are loaded, you can run MITHRIL to perform schema discovery.
MITHRIL can be executed in multiple ways:
Scripts (e.g., run_mithril_ldbc.sh, run_mithril_fib25.sh, run_mithril_mb6.sh) are provided to automate the entire process:
- Set environment variables and directories inside these scripts.
- Make them executable:
chmod +x run_mithril_ldbc.sh
chmod +x run_mithril_fib25.sh
chmod +x run_mithril_mb6.sh
- Run the script for the desired dataset:
./run_mithril_ldbc.sh
./run_mithril_fib25.sh
./run_mithril_mb6.sh
The scripts handle:
- Removing and re-extracting Neo4j.
- Importing data.
- Running schema discovery (LSH clustering).
- Stopping Neo4j and cleaning up.
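The four stages listed above can be outlined as a single function. This is a hypothetical sketch, not the contents of the shipped scripts, which may order or implement the steps differently. It assumes the tarball extracts to the given Neo4j directory, and it only prints the steps unless DRY_RUN=0 is set.

```shell
# step CMD...
# Prints the command in dry-run mode (the default), otherwise runs it.
step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# run_pipeline TARBALL NEO4J_DIR IMPORT_CMD...
# Assumptions: TARBALL extracts to NEO4J_DIR, and IMPORT_CMD is the
# dataset-specific neo4j-admin import command line.
# Order: remove old install, re-extract, import, start, discover, stop.
run_pipeline() {
  local tarball="$1" neo4j_dir="$2"
  shift 2
  step rm -rf "$neo4j_dir" &&
    step tar -xzf "$tarball" &&
    step "$@" &&
    step "$neo4j_dir/bin/neo4j" start &&
    step sh -c 'cd schemadiscovery && sbt "run l"' &&
    step "$neo4j_dir/bin/neo4j" stop
}
```

Running it in the default dry-run mode prints one line per step, which is a quick way to review the plan before setting DRY_RUN=0. A real run would probably also need to wait for Neo4j to accept connections before starting sbt.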
If you prefer more control, follow the manual steps:
- Stop Neo4j, remove old database, re-extract Neo4j.
- Import the desired dataset with neo4j-admin import.
- Start Neo4j.
- Run:
cd schemadiscovery
sbt "run l" # LSH clustering
- Stop Neo4j and clean up if needed.
An incremental script, run_mithril_ldbc_incremental.sh, is also provided to process datasets incrementally. It follows a similar pattern but runs the Scala program in an "incremental" mode (indicated by sbt "run l i 500000") after importing the dataset. It then stops Neo4j, cleans up processes, and moves on to the next dataset.
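The two sbt invocations mentioned in this document differ only in their trailing arguments, so a script can build the argument string from one optional value. This is a sketch; mithril_run_args is a hypothetical helper, and reading 500000 as a batch size is an assumption, since its meaning is not stated here.

```shell
# mithril_run_args [N]
# Hypothetical helper: prints the sbt argument string.
# Without N it selects plain LSH clustering; with N it selects the
# incremental mode. Treating N as a batch size is an assumption.
mithril_run_args() {
  if [ "$#" -eq 0 ]; then
    echo "run l"
    return 0
  fi
  case "$1" in
    ''|*[!0-9]*)
      echo "expected a whole number, got: $1" >&2
      return 1
      ;;
  esac
  echo "run l i $1"
}
```

From the repository root it could be used as (cd schemadiscovery && sbt "$(mithril_run_args 500000)").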
Incremental Script Usage:
- Set environment variables and directories inside the script.
- Make it executable:
chmod +x run_mithril_ldbc_incremental.sh
- Run the script for the desired dataset:
./run_mithril_ldbc_incremental.sh
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions to MITHRIL are welcome! If you find bugs, have suggestions, or want to contribute features, feel free to open an issue or submit a pull request.
For questions, feedback, or support, please contact the repository maintainer.