Skip to content

Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
ycw66 opened this issue Mar 21, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@ycw66
Copy link

ycw66 commented Mar 21, 2025

Describe the bug

Problem statement

A critical memory corruption occurs in form_reverse_links_ when new_neighbors (a view of context.top) becomes invalidated during iteration. new_neighbors is a view of context.top, during traversal of new_neighbors, context.top may be modified to complete the function refine_, therefore corrupt the 'new_neighbors' as it is just a view.

Subsequent accesses to new_neighbors after refine_ function on a corrupted/stale memory range, leading to a violation of HNSW algorithm

In a word, the following code will cause some correctness problem, it will lead to some wrong links among nodes that will never appear in original HNSW algorithm.

form_reverse_links_ function is as follows:

 template <typename value_at, typename metric_at>
    void form_reverse_links_( //
        metric_at&& metric, compressed_slot_t new_slot, candidates_view_t new_neighbors, value_at&& value,
        level_t level, context_t& context) usearch_noexcept_m {

        top_candidates_t& top = context.top_candidates;
        std::size_t const connectivity_max = level ? config_.connectivity : config_.connectivity_base;

        // Reverse links from the neighbors:
        for (auto new_neighbor : new_neighbors) {
            compressed_slot_t close_slot = new_neighbor.slot;
            if (close_slot == new_slot)
                continue;
            node_lock_t close_lock = node_lock_(close_slot);
            node_t close_node = node_at_(close_slot);
            neighbors_ref_t close_header = neighbors_(close_node, level);

            // The node may have no neighbors only in one case, when it's the first one in the index,
            // but that is problematic to track in multi-threaded environments, where the order of insertion
            // is not guaranteed.
            // usearch_assert_m(close_header.size() || new_slot == 1, "Possible corruption - isolated node");
            usearch_assert_m(close_header.size() <= connectivity_max, "Possible corruption - overflow");
            usearch_assert_m(close_slot != new_slot, "Self-loops are impossible");
            usearch_assert_m(level <= close_node.level(), "Linking to missing level");

            // If `new_slot` is already present in the neighboring connections of `close_slot`
            // then no need to modify any connections or run the heuristics.
            if (close_header.size() < connectivity_max) {
                close_header.push_back(new_slot);
                continue;
            }

            // To fit a new connection we need to drop an existing one.
            top.clear();
            usearch_assert_m((top.capacity() >= (close_header.size() + 1)),
                             "The memory must have been reserved in `add`");
            top.insert_reserved({context.measure(value, citerator_at(close_slot), metric), new_slot});
            for (compressed_slot_t successor_slot : close_header)
                top.insert_reserved(
                    {context.measure(citerator_at(close_slot), citerator_at(successor_slot), metric), successor_slot});

            // Export the results:
            close_header.clear();
            candidates_view_t top_view =
                refine_(metric, connectivity_max, top, context, context.computed_distances_in_reverse_refines);
            usearch_assert_m(top_view.size(), "This would lead to isolated nodes");
            for (std::size_t idx = 0; idx != top_view.size(); idx++)
                close_header.push_back(top_view[idx].slot);
        }
    }

Steps to reproduce

  1. Write a test program, trying to continuely add the vectors into usearch index one by one, ~1000 vectors.
  2. Set a breakpoint in funtion form_reverse_link_, here is accurate code postion: link
  3. Run the test program and it will hit the breakpoint.
  4. Watch the elements in new_neighbors when you reach the breakpoint.
  5. Then run the program step by step, you will find the change of the new_neighbours, which should not be changed.

Expected behavior

During the execution of the function form_reverse_link_, the new_neighbors should never be changed.

USearch version

latest

Operating System

Ubuntu 24.04

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ashvardanian
Copy link
Contributor

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

@ycw66
Copy link
Author

ycw66 commented Mar 21, 2025

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

Thank you Ash! USearch is a popular high-performance library for vector search. I am trying to adapt USearch to our project, a universal data layer, to build the distributed vector database. Our system have a universal data layer can provide the ability of auto-scale, cache, ACID transaction and fault-tolerance. I just want to adapt the USearch to our universal data layer. When I do this, I need slightly modify the original code of USearch and find this bug.

@ycw66
Copy link
Author

ycw66 commented Apr 1, 2025

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

Hi ash, I wanna know if this bug is verified and fixed?

@ycw66
Copy link
Author

ycw66 commented Apr 24, 2025

Hi @ashvardanian ! I conduct a recall test and find when fix this bug with this PR#575, recall can improve more than 10%. My test config is as follows:

  1. Randomly insert 10000 vectors with dim-96 into HNSW index.
  2. Randomly generate 100 query vectors, and search top 100 nearest neighbors. Test result show when fix this bug recall can be improved from 61% to 77%.

@ashvardanian
Copy link
Contributor

Nice! Any chance you have any larger-scale results? Maybe at least 10M vectors?

@ycw66
Copy link
Author

ycw66 commented Apr 24, 2025

Nice! Any chance you have any larger-scale results? Maybe at least 10M vectors?

I will test this and then show the result

@ycw66
Copy link
Author

ycw66 commented Apr 24, 2025

@ashvardanian I increase the numbers of vector to 1M. When bug not fixed, the average recall is 8.82%. When this bug fixed, the average recall is 17.27%. I think because the dataset is randomly generated so the recall rate is very low compared with these general dataset in the ANN Index Benchmark.

Anyway, I think this bug can lead to a decrease of recall rate and can be fixed with minimal change. I kindly ask whether the PR#575 can be merged or any other code should change?

@ycw66 ycw66 changed the title Bug: Wrong memory override lead to a violation of HNSW algorithm Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. Apr 24, 2025
@ycw66
Copy link
Author

ycw66 commented Apr 24, 2025

Hi @ashvardanian , I further conduct tests on another two datasets from ANN Benchmark, SIFT and DEEP1B. Test results are as follows:

Dataset numbers of vectors Recall of current Usearch Recall after bug fixed
SIFT 1M 89.34% 91.89%
DEEP1B 10M 78.95% 84.74%

Results show this bug will decrease recall rate a lot!

@ashvardanian
Copy link
Contributor

Looks very promising! Thank you @ycw66!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants