Creating a Tiny Vector Database, Part 1

Introduction

Vector databases allow you to search for approximate nearest neighbors in a large set of vectors. They provide an alternative to brute-force nearest neighbor search at the cost of some accuracy and additional memory. Previously, many advancements in NLP and deep learning were limited to small scales due to the lack of scalable nearest neighbor search. Recently, with RAG and generative AI, the applications for vector databases are exploding. Metric learning is commonly used to project a document into an embedding space; a vector database then indexes a large number of document embeddings and returns the nearest neighbors for a query embedding. Common use-cases for vector databases are semantic search and multi-modal search.

Popular Methods

  • Early algorithms such as the k-d tree were used, but they perform poorly with higher-dimensional embeddings. The k-d tree splits the dataset along a chosen dimension at a specific value and keeps splitting recursively until a stopping criterion is reached. During inference, it traverses the tree to find the subset of points nearest to the query point.
  • Locality Sensitive Hashing (LSH) computes a hash for each embedding and groups points by their hash values. As the name suggests, the hashing function is designed so that points close to each other get the same (or similar) hash values. A common method takes a random vector $v'$ with the same dimension as the embedding and uses, for example, the sign of its dot product with the embedding as one bit of the hash; several such random vectors define the buckets (a minimal sketch follows this list).
  • Graph algorithms like NSW and HNSW build graphs with embeddings as vertices and bidirectional edges weighted by the distance between them. During construction, new points are added iteratively and connected to some of their nearest neighbours. During inference, a greedy traversal is performed starting from an initial random point.
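As a concrete illustration of the random-vector idea in the LSH bullet above, here is a minimal standalone sketch. The function name LshHash and the sign-of-dot-product scheme (random hyperplanes) are one common choice for illustration, not code from this series.

#include <vector>

// Standalone illustration of random-hyperplane LSH (not part of this series' code).
// Each random vector contributes one bit of the hash: the sign of its dot product
// with the embedding. Points whose signs agree on every hyperplane share a bucket.
// Assumes at most ~30 random vectors so the bits fit in an int.
using vf = std::vector<float>;

int LshHash(const vf &embedding, const std::vector<vf> &randomVectors) {
  int hash = 0;
  for (size_t i = 0; i < randomVectors.size(); i++) {
    float dot = 0.0f;
    for (size_t d = 0; d < embedding.size(); d++)
      dot += embedding[d] * randomVectors[i][d];
    if (dot > 0.0f)
      hash |= (1 << i); // set bit i when the embedding lies on the positive side
  }
  return hash;
}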

NSW (Navigable Small World)

NSW is a graph structure with a high clustering coefficient, in which nodes are connected to their nearest neighbours. The number of hops on the shortest path between any two nodes grows proportionally to the logarithm of the total number of nodes. The NSW graph is built iteratively from all embeddings, and during a query this graph is navigated to quickly retrieve nearest neighbours. This is what gives NSW its roughly logarithmic search complexity.

Building Graph with NSW

  • Add each node (vector) to the graph iteratively
  • For each new vector, its $\text{ef}_\text{construction}$ nearest neighbours in the existing graph are found using the search API ($\text{ef}_\text{construction} ≥ 2$). From these, the $M$ points closest to the current node are selected.
  • Create bidirectional connections between the current node and the selected $M$ nodes
  • Edges are pruned whenever a node ends up with more than $e_\text{max}$ edges. This is done in HNSW to maintain the small-world property (a pruning sketch is shown after the code below).
void NSW::AddToGraph(Graph &graph, const vf &element) {
  int nodes = graph.size();        // index the new element will receive
  vector<int> neighbors;           // edge list of the new node

  // find the M closest existing nodes, explored with efConstruction candidates
  auto res = Search(element, this->M, this->efConstruction);
  for (auto r : res) {
    int index = r.first;
    graph[index].push_back(nodes); // edge from the existing node to the new element
    neighbors.push_back(index);    // edge from the new element back to it
  }
  graph.push_back(neighbors);      // append the new node with its edge list
}
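The snippet above only adds edges; the pruning to at most $e_\text{max}$ edges mentioned in the list is not shown. Below is a minimal sketch of what such a step could look like, reusing the Graph, Dot and documents members from above. The helper name PruneEdges is hypothetical and the actual repository code may differ.

// Illustrative sketch only: cap a node's degree at eMax by keeping its closest neighbours.
// Assumes the same Graph, Dot and documents members used in AddToGraph, and <algorithm> for std::sort.
void NSW::PruneEdges(Graph &graph, int node, int eMax) {
  auto &edges = graph[node];
  if ((int)edges.size() <= eMax)
    return;
  // sort neighbours by similarity to this node, most similar first
  std::sort(edges.begin(), edges.end(), [&](int a, int b) {
    return Dot(this->documents[node], this->documents[a]) >
           Dot(this->documents[node], this->documents[b]);
  });
  edges.resize(eMax); // drop the least similar edges
}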


Finding Nearest Neighbours (single nearest point)

  • Given a query node $q$, we start from a random node in the graph
  • We move to whichever of its neighbours is closest to $q$, i.e. in the direction that reduces the distance to $q$
  • We stop when no neighbour is closer than the current node and conclude that the current node is our closest point
  • This can get stuck in a local optimum. To mitigate that, the same loop is repeated from multiple random starting points (a minimal sketch of this greedy loop follows the list)
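A minimal sketch of this greedy loop, written against the same Dot, graph and documents members used in the rest of this post; the name GreedySearch is hypothetical, and the full Search routine shown later generalizes this to top-k.

// Illustrative greedy search for a single nearest neighbour (hypothetical helper).
// Starts from a random node and keeps moving to the most similar neighbour
// until no neighbour improves on the current node (a local optimum).
int NSW::GreedySearch(const vf &query) {
  int curr = rand() % this->graph.size();
  float currSim = Dot(query, this->documents[curr]);

  while (true) {
    int best = curr;
    float bestSim = currSim;
    for (auto idx : this->graph[curr]) {
      float sim = Dot(query, this->documents[idx]);
      if (sim > bestSim) {
        best = idx;
        bestSim = sim;
      }
    }
    if (best == curr) // no neighbour is closer to the query: stop
      return curr;
    curr = best;
    currSim = bestSim;
  }
}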


Finding Nearest Neighbours (top k points)

  • Input: query node $q$, exploration parameter $\text{ef}$, number of nearest neighbours to return $K$
  • We create
    • A min-heap that keeps the $\text{ef}$ most similar points found so far, call it minheap
    • A priority queue, ordered by similarity to the query, that always yields the best candidate to explore next, call it pq
  • We push a random node into pq
  • We keep fetching the best node from pq
    • Continue while the node improves the minheap, i.e. sim(node, query) > min(minheap). If the current node does not improve the minheap, no later point in pq can either, since pq is ordered by similarity
    • Explore the node’s neighbours: skip the ones already visited and push the new ones into pq
  • Repeat the above process from $L$ random starting points and select the $K$ best points across all minheaps
  • Note: if $\text{ef} = 1$, this algorithm reduces to the single-nearest-point search above
vpif NSW::Search(const vf &query, int K = 10, int ef = 20) {
  vector<pif> ans(0);
  int nodes = this->graph.size();
  std::set<int> visited;

  // repeat the search from L random starting points to find the K nearest neighbours
  // and avoid getting stuck in a local optimum
  for (int x = 0; x < this->L; x++) {
    vector<pif> minheap(0);

    std::priority_queue<pif, vector<pif>, CompareSecond> q;

    int currNode = rand() % nodes;
    auto currSim = Dot(query, this->documents[currNode]);

    q.push({currNode, currSim});

    while (q.size()) {
      pif node = q.top();
      q.pop();
      int currNode = node.first;
      float currSim = node.second;

      // break when the result cannot be further improved: pq is ordered by similarity,
      // so once the best remaining candidate is worse than the worst element of a
      // full minheap, no later candidate can improve the result either
      if ((int)minheap.size() >= ef && currSim < minheap[0].second)
        break;
      // already visited node
      if (visited.find(currNode) != visited.end())
        continue;
      visited.insert(node.first);

      // keep the ef highest-similarity points seen so far in minheap
      KeepTrackHighest(minheap, node, ef);
      
      auto &neighbors = this->graph[currNode];
      for (auto idx : neighbors) {
        auto neighborSim = Dot(query, this->documents[idx]);
        q.push({idx, neighborSim});
      }
    }

    // global optimum across all random searches
    for (auto each : minheap) {
      KeepTrackHighest(ans, each, K);
    }
  }
  return ans;
}
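The KeepTrackHighest helper is not shown above. Below is a plausible sketch, reusing the vector/pif aliases from the surrounding code and assuming the helper maintains minheap as a bounded min-heap on similarity, so that minheap[0] is always the worst kept element (which is what the break condition above relies on). The actual implementation in the repository may differ.

#include <algorithm> // std::push_heap, std::pop_heap

// Plausible sketch of KeepTrackHighest (the repository's implementation may differ):
// keep at most `limit` elements with the highest similarity in `heap`, arranged as a
// min-heap on similarity so heap[0] is always the worst element that is still kept.
void KeepTrackHighest(vector<pif> &heap, const pif &candidate, int limit) {
  auto cmp = [](const pif &a, const pif &b) { return a.second > b.second; }; // min-heap order
  if ((int)heap.size() < limit) {
    heap.push_back(candidate);
    std::push_heap(heap.begin(), heap.end(), cmp);
  } else if (candidate.second > heap[0].second) {
    std::pop_heap(heap.begin(), heap.end(), cmp); // move the current worst to the back
    heap.back() = candidate;                      // replace it with the better candidate
    std::push_heap(heap.begin(), heap.end(), cmp);
  }
}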


Benchmark

We benchmark the above algorithm on a dummy dataset of randomly generated vectors. We use a vector dimension of 128 and draw each component uniformly between -1 and 1. Finally, we normalize every vector by its $L_2$ norm to create unit vectors. The benchmarks below were run on an Intel i7 MacBook Pro with cosine similarity as the metric. You can find the complete code on GitHub.
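For reference, the data generation just described can be sketched in a few lines. RandomUnitVector is a hypothetical name and Dot mirrors the helper used in the snippets above; this is an illustrative sketch, not necessarily the exact benchmark code.

#include <cmath>
#include <cstdlib>
#include <vector>

using vf = std::vector<float>;

// Cosine similarity reduces to a plain dot product once vectors are L2-normalized.
float Dot(const vf &a, const vf &b) {
  float s = 0.0f;
  for (size_t i = 0; i < a.size(); i++)
    s += a[i] * b[i];
  return s;
}

// Illustrative sketch of the dummy data described above: components drawn uniformly
// from [-1, 1], then the vector is normalized to unit L2 norm.
vf RandomUnitVector(int dim = 128) {
  vf v(dim);
  float norm = 0.0f;
  for (int i = 0; i < dim; i++) {
    v[i] = 2.0f * rand() / RAND_MAX - 1.0f;
    norm += v[i] * v[i];
  }
  norm = std::sqrt(norm);
  for (int i = 0; i < dim; i++)
    v[i] /= norm;
  return v;
}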

| Config  | ndim | ef_construction | M  | ef | K  | L |
|---------|------|-----------------|----|----|----|---|
| config1 | 128  | 32              | 32 | 20 | 10 | 1 |
| config2 | 128  | 32              | 32 | 64 | 10 | 1 |
| config3 | 128  | 64              | 64 | 32 | 10 | 2 |

Configurations used for benchmarks
| Configuration              | NN Search Time (for 1K queries) | Average Number of Hops | Average Number of Similarity Evaluations | Recall@10 |
|----------------------------|---------------------------------|------------------------|------------------------------------------|-----------|
| nPoints=100K, Brute Force  | 4189 ms                         | NA                     | 100K                                     | 1         |
| nPoints=100K, NSW, config1 | ~422 ms                         | ~28                    | ~2400                                    | 0.22      |
| nPoints=100K, NSW, config2 | ~1017 ms                        | ~70                    | ~5800                                    | 0.43      |
| nPoints=100K, NSW, config3 | ~2100 ms                        | ~77                    | ~11000                                   | 0.73      |

Benchmark results

Challenges with NSW:

The NSW algorithm takes a completely greedy approach, which creates problems during search. One major issue is its inability to make large jumps early on. Only a small fraction of edges are long-range: these are mostly created in the early stages of construction, and as more points are inserted, long edges become proportionally even rarer. A greedy search starting at a random node therefore spends many hops and similarity evaluations on nodes that are still far from the query. Ideally, the search should take long jumps at the beginning to cover a large area quickly, and then switch to shorter jumps as it closes in on the target. This strategy, used by the HNSW algorithm, avoids unnecessary similarity computations between far-off nodes and makes the process more efficient. [1]

What’s Next?

HNSW addresses the limitations of NSW by creating a hierarchy of points, where each point is assigned a level and appears on that level and all levels below it (layer 0 contains every point). The search starts at the sparse upper layers, where long-range hops quickly locate an approximate region, and then descends to the denser lower layers to refine the result. Because the upper layers contain far fewer points, the number of similarity computations drops significantly, leading to faster search.
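As a small preview, the level assignment that creates this hierarchy is typically sampled from an exponentially decaying distribution, as in the HNSW paper (the normalization factor mL is usually set to 1/ln(M)); the helper name SampleLevel is purely illustrative.

#include <cmath>
#include <cstdlib>

// Illustrative sketch of HNSW level assignment (standard scheme from the HNSW paper).
// Layer 0 holds every point; a point also appears on every layer up to its sampled level,
// and each successive layer contains exponentially fewer points.
int SampleLevel(double mL) {
  double u = (rand() + 1.0) / (RAND_MAX + 1.0); // uniform in (0, 1], avoids log(0)
  return (int)std::floor(-std::log(u) * mL);
}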

References

  1. https://chat.openai.com/c/cdd37e7a-b24c-49ca-afe7-73da52b9e030
