NeuVSA: A Unified and Efficient Accelerator for Neural Vector Search

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2025

Neural Vector Search (NVS) has exhibited superior search quality over traditional key-based strategies for information retrieval tasks. An effective NVS architecture requires high recall, low latency, and high throughput to enhance user experience and cost-efficiency. However, implementing NVS on existing neural network accelerators and vector search accelerators is sub-optimal due to the separation between the embedding stage and vector search stage at both algorithm and architecture levels. Fortunately, we unveil that Product Quantization (PQ) opens up an opportunity to break separation. However, existing PQ algorithms and accelerators still focus on either the embedding stage or the vector search stage, rather than both simultaneously. Simply combining existing solutions still follows the beaten track of separation and suffers from insufficient parallelization, frequent data access conflicts, and the absence of scheduling, thus failing to reach optimal recall, latency, and throughput.

To this end, we propose a unified and efficient NVS accelerator dubbed NeuVSA based on algorithm and architecture co-design philosophy. Specifically, on the algorithm level, we propose a learned PQ-based unified NVS algorithm that consolidates two separate stages into the same computing and memory access paradigm. It integrates an end-to-end joint training strategy to learn the optimal codebook and index for enhanced recall and reduced PQ complexity, thus achieving smoother acceleration. On the architecture level, we customize a homogeneous NVS accelerator based on the unified NVS algorithm. Each sub-accelerator is optimized to exploit all parallelism exposed by unified NVS, incorporating a structured index assignment strategy and an elastic on-chip buffer to alleviate buffer conflicts for reduced latency. All sub-accelerators are coordinated using a hardware-aware scheduling strategy for boosted throughput. Experimental results show that the joint training strategy improves recall by 4.6% over the separated strategy and accuracy by 43.5% over LUT-NN. NeuVSA achieves 2.82× to 416.17× lower latency over CPU, GPU, DFX+ANNA, and PQA+ANNA, and up to 49.60× and 10.57× higher average throughput over CPU and GPU, respectively. NeuVSA also reduces chip area by 65.2% over PQA+ANNA.

Paper download

Recommended citation: N/A.