Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

ycw66 · 2025-03-21T15:12:44Z

Describe the bug

Problem statement

A critical memory corruption occurs in form_reverse_links_ when new_neighbors (a view of context.top) becomes invalidated during iteration. new_neighbors is a view of context.top, during traversal of new_neighbors, context.top may be modified to complete the function refine_, therefore corrupt the 'new_neighbors' as it is just a view.

Subsequent accesses to new_neighbors after refine_ function on a corrupted/stale memory range, leading to a violation of HNSW algorithm

In a word, the following code will cause some correctness problem, it will lead to some wrong links among nodes that will never appear in original HNSW algorithm.

form_reverse_links_ function is as follows:

 template <typename value_at, typename metric_at>
    void form_reverse_links_( //
        metric_at&& metric, compressed_slot_t new_slot, candidates_view_t new_neighbors, value_at&& value,
        level_t level, context_t& context) usearch_noexcept_m {

        top_candidates_t& top = context.top_candidates;
        std::size_t const connectivity_max = level ? config_.connectivity : config_.connectivity_base;

        // Reverse links from the neighbors:
        for (auto new_neighbor : new_neighbors) {
            compressed_slot_t close_slot = new_neighbor.slot;
            if (close_slot == new_slot)
                continue;
            node_lock_t close_lock = node_lock_(close_slot);
            node_t close_node = node_at_(close_slot);
            neighbors_ref_t close_header = neighbors_(close_node, level);

            // The node may have no neighbors only in one case, when it's the first one in the index,
            // but that is problematic to track in multi-threaded environments, where the order of insertion
            // is not guaranteed.
            // usearch_assert_m(close_header.size() || new_slot == 1, "Possible corruption - isolated node");
            usearch_assert_m(close_header.size() <= connectivity_max, "Possible corruption - overflow");
            usearch_assert_m(close_slot != new_slot, "Self-loops are impossible");
            usearch_assert_m(level <= close_node.level(), "Linking to missing level");

            // If `new_slot` is already present in the neighboring connections of `close_slot`
            // then no need to modify any connections or run the heuristics.
            if (close_header.size() < connectivity_max) {
                close_header.push_back(new_slot);
                continue;
            }

            // To fit a new connection we need to drop an existing one.
            top.clear();
            usearch_assert_m((top.capacity() >= (close_header.size() + 1)),
                             "The memory must have been reserved in `add`");
            top.insert_reserved({context.measure(value, citerator_at(close_slot), metric), new_slot});
            for (compressed_slot_t successor_slot : close_header)
                top.insert_reserved(
                    {context.measure(citerator_at(close_slot), citerator_at(successor_slot), metric), successor_slot});

            // Export the results:
            close_header.clear();
            candidates_view_t top_view =
                refine_(metric, connectivity_max, top, context, context.computed_distances_in_reverse_refines);
            usearch_assert_m(top_view.size(), "This would lead to isolated nodes");
            for (std::size_t idx = 0; idx != top_view.size(); idx++)
                close_header.push_back(top_view[idx].slot);
        }
    }

Steps to reproduce

Write a test program, trying to continuely add the vectors into usearch index one by one, ~1000 vectors.
Set a breakpoint in funtion form_reverse_link_, here is accurate code postion: link
Run the test program and it will hit the breakpoint.
Watch the elements in new_neighbors when you reach the breakpoint.
Then run the program step by step, you will find the change of the new_neighbours, which should not be changed.

Expected behavior

During the execution of the function form_reverse_link_, the new_neighbors should never be changed.

USearch version

latest

Operating System

Ubuntu 24.04

Hardware architecture

x86

Which interface are you using?

C++ implementation

Contact Details

No response

Are you open to being tagged as a contributor?

I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

ashvardanian · 2025-03-21T17:15:01Z

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

ycw66 · 2025-03-21T17:46:16Z

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

Thank you Ash! USearch is a popular high-performance library for vector search. I am trying to adapt USearch to our project, a universal data layer, to build the distributed vector database. Our system have a universal data layer can provide the ability of auto-scale, cache, ACID transaction and fault-tolerance. I just want to adapt the USearch to our universal data layer. When I do this, I need slightly modify the original code of USearch and find this bug.

ycw66 · 2025-04-01T06:38:11Z

Hi @ycw66! When I've shared the code for USearch, I didn't expect people to go that deep into it's logic, so I'm very excited to see this issue the proposed #575 patch. I hope this waits a few days until the next week, was planning to refactor a few bits and pieces in the USearch and your suggestions would fit right in 😉

Hi ash, I wanna know if this bug is verified and fixed?

ycw66 · 2025-04-24T09:42:43Z

Hi @ashvardanian ! I conduct a recall test and find when fix this bug with this PR#575, recall can improve more than 10%. My test config is as follows:

Randomly insert 10000 vectors with dim-96 into HNSW index.
Randomly generate 100 query vectors, and search top 100 nearest neighbors. Test result show when fix this bug recall can be improved from 61% to 77%.

ashvardanian · 2025-04-24T09:44:35Z

Nice! Any chance you have any larger-scale results? Maybe at least 10M vectors?

ycw66 · 2025-04-24T09:55:12Z

Nice! Any chance you have any larger-scale results? Maybe at least 10M vectors?

I will test this and then show the result

ycw66 · 2025-04-24T11:20:57Z

@ashvardanian I increase the numbers of vector to 1M. When bug not fixed, the average recall is 8.82%. When this bug fixed, the average recall is 17.27%. I think because the dataset is randomly generated so the recall rate is very low compared with these general dataset in the ANN Index Benchmark.

Anyway, I think this bug can lead to a decrease of recall rate and can be fixed with minimal change. I kindly ask whether the PR#575 can be merged or any other code should change?

ycw66 · 2025-04-24T16:23:05Z

Hi @ashvardanian , I further conduct tests on another two datasets from ANN Benchmark, SIFT and DEEP1B. Test results are as follows:

Dataset	numbers of vectors	Recall of current Usearch	Recall after bug fixed
SIFT	1M	89.34%	91.89%
DEEP1B	10M	78.95%	84.74%

Results show this bug will decrease recall rate a lot!

ashvardanian · 2025-04-24T17:47:35Z

Looks very promising! Thank you @ycw66!

ycw66 added the bug Something isn't working label Mar 21, 2025

ycw66 mentioned this issue Mar 21, 2025

Fix a correctness bug caused by a wrong memory override #575

Open

ycw66 changed the title ~~Bug: Wrong memory override lead to a violation of HNSW algorithm~~ Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. Apr 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

ycw66 commented Mar 21, 2025 •

edited

Loading

ashvardanian commented Mar 21, 2025

ycw66 commented Mar 21, 2025

ycw66 commented Apr 1, 2025

ycw66 commented Apr 24, 2025

ashvardanian commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ashvardanian commented Apr 24, 2025

Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

Bug: Wrong memory override lead to a violation of HNSW algorithm which will lead to a decrease of recall rate. #576

Comments

ycw66 commented Mar 21, 2025 • edited Loading

Describe the bug

Problem statement

Steps to reproduce

Expected behavior

USearch version

Operating System

Hardware architecture

Which interface are you using?

Contact Details

Are you open to being tagged as a contributor?

Is there an existing issue for this?

Code of Conduct

ashvardanian commented Mar 21, 2025

ycw66 commented Mar 21, 2025

ycw66 commented Apr 1, 2025

ycw66 commented Apr 24, 2025

ashvardanian commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ycw66 commented Apr 24, 2025

ashvardanian commented Apr 24, 2025

ycw66 commented Mar 21, 2025 •

edited

Loading