-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GFD mining #465
base: main
Are you sure you want to change the base?
Add GFD mining #465
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 29. Check the log or trigger a new build to see more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
c3eeecf
to
a59e3cc
Compare
: query_(query_), iso_(iso_), res_(res_) {} | ||
|
||
template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1> | ||
bool operator()(CorrespondenceMap1To2 f, CorrespondenceMap2To1) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second parameter is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boost Graph Library requires such syntax from callback function. Unfortunately, if I change the function signature, the code will not compile
Add PR description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
You are doing great! Please don't forget to notify when you are done with changes requested. Also please mark conversations if they are resolved |
60aeadc
to
9ce90a3
Compare
Token name_token = Token(i, name.first); | ||
for (auto& value : name.second) { | ||
Token value_token = Token(-1, value); | ||
result.push_back(Literal(name_token, value_token)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result.reserve(name.second.size() * attrs_info.at(label).size())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, name.second
can have different number of values, depending on name.first
. The reserve may not work correctly.
11d6a2e
to
13b76ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
10ff8ca
to
b1a77f0
Compare
|
Change commit name |
// Defines a specific attribute of the pattern. | ||
// The first element is the index of the vertex, | ||
// the second is the name of the attribute. | ||
// An alias for user convenience. | ||
using Token = std::pair<int, std::string>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are indexes supposed to have negative values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I encode the constant token with the value -1
std::vector<Literal> premises = {}; | ||
Gfd gfd = {pattern, premises, conclusion}; | ||
if (Validate(graph, gfd, embeddings, sigma)) { | ||
rules.push_back({{}, l}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid unnecessary temporary object creation
rules.push_back({{}, l}); | |
rules.emplace_back({}, l); |
Consider everywhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the suggested line gives a compiler error:
no matching function for call to ‘std::vector<std::pair<std::vector<std::pair<std::pair<int, std::__cxx11::basic_string >, std::pair<int, std::__cxx11::basic_string > > >, std::pair<std::pair<int, std::__cxx11::basic_string >, std::pair<int, std::__cxx11::basic_string > > > >::emplace_back(, gfd::Literal&)’
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will review gfd_miner.cpp
tomorrow (:
src/tests/test_gfd_mining.cpp
Outdated
std::unique_ptr<algos::GfdMiner> algorithm = | ||
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | ||
algorithm->Execute(); | ||
std::vector<Gfd> gfd_list = algorithm->GfdList(); | ||
algorithm = CreateGfdMiningInstance(graph_path, {.k = 3, .sigma = 3}); | ||
algorithm->Execute(); | ||
std::size_t expected_size = gfd_list.size(); | ||
gfd_list = algorithm->GfdList(); | ||
ASSERT_EQ(expected_size, gfd_list.size()); | ||
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | ||
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we call CreateGfdMiningInstance
and Execute
two times in a row?
Should't this be just
std::unique_ptr<algos::GfdMiner> algorithm = | |
CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
algorithm->Execute(); | |
std::vector<Gfd> gfd_list = algorithm->GfdList(); | |
algorithm = CreateGfdMiningInstance(graph_path, {.k = 3, .sigma = 3}); | |
algorithm->Execute(); | |
std::size_t expected_size = gfd_list.size(); | |
gfd_list = algorithm->GfdList(); | |
ASSERT_EQ(expected_size, gfd_list.size()); | |
for (std::size_t index = 0; index < gfd_list.size(); ++index) { | |
ASSERT_TRUE(gfds.at(index) == gfd_list.at(index)); | |
} | |
std::unique_ptr<algos::GfdMiner> algorithm = CreateGfdMiningInstance(graph_path, {.k = 2, .sigma = 3}); | |
ExecuteAndCompare(algorithm, expected_gfds); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test tests the minimality of the dependencies, that is, independence from the choice of k. In this example, the algorithm should not return patterns with 3 vertices, since they are not minimal. So, the result with k = 2 and k = 3 should be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
Add class for GFD miner
Add minimal GFD and GFD with multiple conclusion tests
Added two examples with searching for dependencies in small graphs.
This PR implements an algorithm for mining graph functional dependencies based on article "Discovering Graph Functional Dependencies" by Fan Wenfei, Hu Chunming, Liu Xueli, and Lu Ping. Algorithm, given an input graph, returns a set of dependencies satisfied on this graph. The algorithm also has two configurable parameters:
k
is the maximum number of vertices in the pattern of the mined dependency andsigma
is its minimum frequency.In addition, the PR implements the ability to run the algorithm in Python, and also contains examples of its use.