Redis
Python >=3.11
git clone --recursive https://github.com/keiserlab/rad.git
cd rad
pip install .
We also provide a Dockerfile containing all required software.
Running RAD takes two steps:
- Constructing the HNSW graph
- Defining a scoring function to use for traversal
Constructing the HNSW graph consists of setting the construction parameters expansion_add and connectivity, then adding each molecule by providing a numerical key and its fingerprint.
expansion_add controls the number of candidates considered as potential neighbors during element insertion, while connectivity controls how many of these candidates are actually connected to the inserted element.
import numpy as np
from usearch.index import Index
hnsw = Index(
    ndim=1024,  # 1024-bit fingerprint
    dtype='b1',  # For packed binary fingerprints
    metric='tanimoto',
    connectivity=8,
    expansion_add=400,
)
fingerprints = ...  # (n x d/8) array of packed binary fingerprints
keys = np.arange(len(fingerprints))  # One integer key per molecule
hnsw.add(keys, fingerprints, log="Building HNSW")
The fingerprints are expected to be an (n x d/8) NumPy array where each row is a packed binary fingerprint. For example, np.packbits() turns a 1024-bit binary fingerprint into a packed fingerprint of 128 uint8 values. See the example notebook for more details.
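The packing step can be sketched as follows. This is a minimal illustration that assumes the unpacked fingerprints are available as a 0/1 array; the random `bits` array is a hypothetical stand-in for real molecular fingerprints.

```python
import numpy as np

# Hypothetical unpacked fingerprints: 3 molecules x 1024 bits, values 0 or 1.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(3, 1024), dtype=np.uint8)

# Pack each row of 1024 bits into 128 uint8 values (8 bits per byte).
fingerprints = np.packbits(bits, axis=1)

# Unpacking recovers the original bits, so no information is lost.
unpacked = np.unpackbits(fingerprints, axis=1)
```

The resulting `fingerprints` array has shape (3, 128), which matches the (n x d/8) layout described above for d = 1024.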
To traverse the HNSW graphs, you must provide a scoring function that maps a molecule's key to a score, where numerically smaller scores are better. Here is a mock example of a scoring function:
def score_fn(key):
    fp = fingerprints[key]
    score = score_from_fp(fp)
    return score
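As one concrete possibility, the scoring function could be the Tanimoto distance to a reference fingerprint, so that smaller scores mean more similar molecules. This is only a sketch: `tanimoto_distance`, `reference_fp`, and the toy 16-bit fingerprints are hypothetical and not part of RAD. In practice the score would usually come from docking or a model.

```python
import numpy as np

def tanimoto_distance(a, b):
    """Tanimoto distance between two packed uint8 fingerprints (hypothetical helper)."""
    intersection = int(np.unpackbits(a & b).sum())
    union = int(np.unpackbits(a | b).sum())
    return 0.0 if union == 0 else 1.0 - intersection / union

# Toy 16-bit fingerprints, packed into 2 bytes each.
bits = np.zeros((3, 16), dtype=np.uint8)
bits[0, :4] = 1    # reference molecule: bits 0-3 set
bits[1, :4] = 1    # identical to the reference
bits[2, 2:6] = 1   # bits 2-5 set: shares 2 bits, union of 6 bits
fingerprints = np.packbits(bits, axis=1)
reference_fp = fingerprints[0]

def score_fn(key):
    # Smaller is better: 0.0 means identical to the reference.
    return tanimoto_distance(fingerprints[key], reference_fp)
```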
With the HNSW graphs built and the scoring function defined, we can initialize a RADTraverser:
from rad.traverser import RADTraverser
traverser = RADTraverser(hnsw=hnsw, scoring_fn=score_fn)
This initialization does two things:
- Starts a process which handles the HNSW neighbor queries.
- Starts a process which runs a redis server and handles the traversal queue.
You can also connect to an already running redis server by passing the host and port of the server. This is useful for HPC environments where a single node can manage the traversal queue for many worker nodes doing the scoring:
traverser = RADTraverser(hnsw=hnsw, scoring_fn=score_fn, redis_host='xxx:xxx:xxx', redis_port=6379)
The traverser is 'primed' by finding and scoring the nodes on the top layer of the HNSW graph and initializing the priority queue. This should only be run once.
traverser.prime()
The traversal then proceeds until a max number of molecules is scored or a timeout is reached.
# Note that n_workers must be set to 1 for now until I fix a bug.
traverser.traverse(n_workers=1, n_to_score=100_000) # Run the traversal until 100k molecules are scored
or
traverser.traverse(n_workers=1, timeout=120) # Run the traversal for 2 minutes
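To give a feel for how a priority-queue traversal with these two stopping conditions behaves, here is a generic best-first search over a toy graph. It is a simplified sketch, not RAD's actual implementation: `best_first_traverse`, the `neighbors` dictionary, and `toy_scores` are all hypothetical, and the real traverser works on the HNSW layers with a Redis-backed queue.

```python
import heapq
import time

def best_first_traverse(neighbors, score_fn, seeds, n_to_score=None, timeout=None):
    """Generic best-first traversal: always expand the best-scoring node next."""
    start = time.monotonic()
    scored = {}   # key -> score; insertion order is the traversal order
    queue = []    # min-heap of (score, key), so smaller scores pop first
    for key in seeds:
        if key not in scored:
            scored[key] = score_fn(key)
            heapq.heappush(queue, (scored[key], key))
    while queue:
        if n_to_score is not None and len(scored) >= n_to_score:
            break
        if timeout is not None and time.monotonic() - start >= timeout:
            break
        _, key = heapq.heappop(queue)
        for nb in neighbors[key]:
            if nb in scored:
                continue
            if n_to_score is not None and len(scored) >= n_to_score:
                break
            scored[nb] = score_fn(nb)
            heapq.heappush(queue, (scored[nb], nb))
    return list(scored.items())

# Toy graph and scores (hypothetical data).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
toy_scores = {0: 5.0, 1: 1.0, 2: 3.0, 3: 0.5, 4: 4.0}

full = best_first_traverse(neighbors, toy_scores.get, seeds=[0])
limited = best_first_traverse(neighbors, toy_scores.get, seeds=[0], n_to_score=4)
```

In this toy run the limited traversal stops after four nodes are scored, and node 3 (reached through the well-scoring node 1) is scored before node 4.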
The results can be accessed by looping over the scored set which contains the keys in the order that they were traversed.
results = []
for key, score in traverser.scored_set:
    results.append((key, score))
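Because the scored set is ordered by traversal rather than by score, a common follow-up is to sort the collected pairs to find the best molecules. The sketch below uses a hypothetical hard-coded `results` list in place of real traversal output.

```python
# Hypothetical stand-in for the (key, score) pairs collected from the scored set.
results = [(12, -7.5), (3, -9.1), (48, -6.2), (7, -8.8)]

# Smaller scores are better, so sort ascending by score and keep the top k.
top_k = 2
best = sorted(results, key=lambda pair: pair[1])[:top_k]
```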
To gracefully shut down the HNSW and redis servers:
traverser.shutdown()
In the example folder, there is a Jupyter notebook for constructing and traversing the DUDE-Z DOCK HNSW investigated in the original RAD paper.
For a larger billion-scale application and integration with Chemprop see the repo at https://github.com/bwhall61/lsd
The original HNSW paper by Yury Malkov and Dmitry Yashunin.
The original RAD paper by Brendan Hall and Michael Keiser.
The lsd.docking.org paper by Brendan Hall, Tia Tummino, et al. shows a billion-scale application and integration with Chemprop ML models.
Most importantly, the HNSW graph code is built on the usearch library, so many thanks to Ash Vardanian for his awesome HNSW library!