Selecting the Right NIC/DPU and Storage Stack for High-Throughput Vector Search
How DPUs, RDMA and NVMe affect vector search throughput and p99 latency — practical vendor and tuning guidance for 2026.
Stop losing matches to infrastructure — how NICs, DPUs and NVMe change vector search throughput
If your vector search cluster drops close matches under load, or p99 latency spikes whenever a large batch arrives, the root cause is often infrastructure, not the algorithm. In 2026 enterprise teams are discovering that the right mix of DPU/NIC, RDMA fabrics and NVMe topology is the single most consequential lever for predictable, high-throughput fuzzy search.
Quick answer: what to choose first
- Network first: Select an RDMA-capable NIC (RoCEv2 or InfiniBand) for low CPU overhead and sub-ms p99s at high QPS.
- Storage second: Use NVMe (local) for hot shards and NVMe-oF (RDMA/iSER) for larger clusters where capacity and scaling matter.
- DPU when you need predictable CPU headroom: Offload encryption, NVMe-oF target, and packet processing to DPUs (BlueField-class, Pensando-class, or equivalent) if you hit CPU/network bottlenecks or need isolation.
- Measure, then optimize: Benchmark realistic queries (including filtering and metadata post-filtering) and tune NIC/E-Switch, driver, and I/O stack (SPDK, VF provisioning, interrupts).
Why infrastructure matters now (2026 trends)
Two trends intensified in late 2024–2025 and continued into 2026:
- RDMA and NVMe-oF matured from niche to standard — RoCEv2, iSER/NVMe-oF and InfiniBand fabrics are supported end-to-end by major vendors and cloud providers. This brings remote NVMe access down to latencies that previously required local drives.
- DPUs emerged as operational accelerators — DPUs now handle host-side services (networking, encryption, telemetry, storage offload), reclaiming CPU cycles for vector scoring and index traversal.
At the same time, memory and component supply shifts (see 2026 coverage on memory price pressure) are making it more expensive to overprovision RAM for huge in-memory indexes, pushing teams to hybrid NVMe + RAM designs.
How NIC, RDMA and NVMe choices affect vector search performance
1) Latency path: CPU cycles vs kernel hops
Every kernel transition and interrupt adds jitter. Traditional TCP/IP stacks spend CPU cycles on copies and context switches. RDMA removes those kernel copies for reads/writes and keeps the data path on the RNIC, reducing per-request CPU usage and tail latency. For vector search:
- RDMA lowers p50 and p99 by avoiding memcpy and reducing syscalls for NVMe-oF and RPCs.
- NICs with SR-IOV and large hardware queues maintain throughput under multi-tenant workloads.
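To sanity-check that the RDMA path is actually available before benchmarking, a few read-only commands suffice; this sketch assumes rdma-core and nvme-cli are installed and uses a placeholder target address:
# list RDMA-capable devices, port state and link layer (rdma-core)
ibv_devinfo | grep -E 'hca_id|state|link_layer'
rdma link show
# load the kernel NVMe-oF RDMA initiator and discover a target
# (192.0.2.10 is a placeholder; substitute your target's address)
modprobe nvme-rdma
nvme discover -t rdma -a 192.0.2.10 -s 4420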
2) Throughput: PCIe lanes, NVMe performance and fabric bandwidth
NVMe SSDs are limited by PCIe lanes (PCIe Gen4 vs Gen5 vs Gen5+/6 in 2026) and by controller performance. If your vector engine streams large candidate sets from disk during searches, NVMe bandwidth dictates QPS ceilings.
- Use NVMe drives with sufficient IOPS/bandwidth for your query mix (random vs sequential). PCIe 5.0 NVMe drives doubled throughput compared to Gen4 for many workloads in 2025–26.
- Ensure DPU/NIC connects to CPU and NVMe via enough PCIe lanes; a DPU starved on a x8 link becomes a bottleneck.
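As a rough sizing aid, the per-drive QPS ceiling when candidates are streamed from disk can be estimated from drive bandwidth and bytes read per query; the figures below are illustrative assumptions, not measurements:
# assumptions (illustrative): 768-dim float32 vectors = 3,072 bytes each;
# ~2,000 candidates streamed from NVMe per query = ~6.1 MB of reads per query;
# one Gen5 x4 enterprise drive sustaining roughly 14 GB/s of reads
echo "14000000000 6144000" | awk '{printf "per-drive QPS ceiling: ~%d\n", $1 / $2}'
# => ~2278 QPS; hitting 10k QPS from disk needs several drives per node,
#    a larger RAM cache, or a smaller candidate set per query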
3) CPU contention: why DPUs help (and when they don't)
DPUs (Data Processing Units) can offload networking, NVMe-oF target, SSL/TLS, and even packet-based filtering. That frees host CPUs to focus on ANN scoring (FAISS/ANNOY/HNSW) and business logic.
- When DPUs help: high QPS with encryption, heavy metadata filtering, or NVMe-oF target duties where host CPU is saturated.
- When DPUs don't: small clusters with spare CPU, or when the added complexity and vendor lock-in outweighs gains. DPUs add firmware and orchestration overhead.
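Before committing to DPUs, confirm that host CPU really is going to packet processing and I/O rather than ANN scoring; a quick check, assuming the sysstat and perf packages are installed:
# per-core breakdown; high %soft/%irq/%sys alongside a busy NIC points at
# network/storage overhead rather than scoring work
mpstat -P ALL 1 10
# see which kernel and userspace symbols dominate (network vs scoring hot paths)
perf top
# softirq counters climbing fast on the cores that own the NIC queues
watch -d cat /proc/softirqs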
Real-world architecture patterns
Pattern A — Low-latency, single-shard hot path (best for sub-ms p99)
- Local NVMe per node for hot shards
- RDMA-capable NIC (RoCEv2) in 25–100Gbps range
- No DPU — keep stack simple, use SPDK user-space drivers and poll-mode drivers to reduce syscalls
Pattern B — Scalable capacity with consistent tail latency
- NVMe-oF targets on DPU to serve NVMe volumes over RDMA
- DPUs handle TLS, NVMe-oF and eBPF-based filtering
- Host CPUs run vector scoring engines on CPU or GPU
Pattern C — Cloud-managed hybrid for unpredictable load
- Cloud DPU-backed instances (Nitro-style) with Elastic Fabric Adapter/RDMA
- Use managed NVMe volumes for cold data and local NVMe for hot shards
- Autoscale workers and keep index replication conservative to avoid warmup lag
Benchmarks and test methodology (how to measure reliably)
Benchmarks must reflect production query shapes: mix of k-NN, filtered queries, batching, and scoring. Here's a concise methodology I use in lab tests:
- Prepare a dataset equal to 1–2× production shard size. Include vectors plus typical metadata filters.
- Define queries: 70% small (k=10), 20% medium (k=100), 10% large (k=1000) with realistic filter selectivity.
- Measure these metrics: throughput (QPS), p50/p95/p99 latency, CPU and DPU utilization, NIC bandwidth, NVMe IOPS and latency, and tail variance over 1 hour.
- Run three variants: TCP/IP + kernel NVMe, RDMA + kernel NVMe-oF, RDMA + DPU offload + SPDK user-space NVMe. For automation and quick pilot tooling I often prototype measurement harnesses using examples from the ship-a-micro-app pattern to get consistent test runs.
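A minimal query-side harness sketch for the methodology above; it assumes an HTTP search endpoint at a placeholder URL and a queries.jsonl file of request bodies, runs single-threaded, and uses a nearest-rank approximation for percentiles (for real runs, drive concurrency with a proper load generator):
# replay prepared queries and record per-request latency (seconds)
endpoint="http://vector-search.internal:8080/search"   # placeholder URL
: > latencies.txt
while read -r body; do
  curl -s -o /dev/null -w '%{time_total}\n' -H 'Content-Type: application/json' \
       -d "$body" "$endpoint" >> latencies.txt
done < queries.jsonl
# p50/p95/p99 in milliseconds (nearest-rank approximation, assumes many samples)
sort -n latencies.txt | awk '{a[NR]=$1} END {
  printf "p50=%.1f ms  p95=%.1f ms  p99=%.1f ms\n",
    a[int(NR*0.50)]*1000, a[int(NR*0.95)]*1000, a[int(NR*0.99)]*1000 }'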
Example results (lab, illustrative only)
These are representative results from a controlled lab with 16-core hosts, PCIe Gen5 NVMe, a 200Gbps RDMA fabric, and a BlueField-class DPU in the data path. Your mileage will vary.
- Baseline (TCP, kernel NVMe): 8k QPS, p99 45 ms
- RDMA + kernel NVMe-oF: 32k QPS, p99 12 ms
- RDMA + DPU offload + SPDK: 85k QPS, p99 3.8 ms
Interpretation: RDMA alone provided a 4× throughput improvement; the DPU + user-space NVMe stack improved throughput by a further ~2.7× by removing CPU and kernel bottlenecks and stabilising latency.
Operational checklist: what to test before buying
- Fabric support: Ensure the NIC supports RoCEv2 and DSCP/ECN for congestion control, or plan for InfiniBand.
- NVMe topology: Confirm controller performance, PCIe generation, and that the DPU/NIC has enough PCIe lanes to avoid contention.
- Driver and ecosystem: Validate SPDK, libibverbs, libfabric, and your vector engine integration paths.
- Telemetry and observability: DPUs are great for offload but add a layer that needs logging, metrics and firmware ops—ensure the vendor provides good tooling (see notes on observability).
- Failure modes: Test DPU firmware upgrades, NIC failover, and NVMe node loss with index replication to ensure graceful degradation.
- Security: Check that DPU offloads don’t introduce bypassable attack surface—validate attestation and isolation features.
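Several items on this checklist can be verified with read-only commands on an evaluation unit before any fleet-scale purchase; a sketch assuming a Linux host with pciutils, ethtool and nvme-cli (the PCIe address is a placeholder):
# confirm negotiated PCIe speed/width for the NIC/DPU or an NVMe drive
lspci -s 0000:3b:00.0 -vv | grep -E 'LnkCap|LnkSta'
# NIC driver and firmware versions
ethtool -i eth0
# NUMA node that owns the NIC (workers should be pinned to the same node)
cat /sys/class/net/eth0/device/numa_node
# NVMe controllers, models and firmware revisions
nvme list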
Vendor selection guidance (2026 landscape)
Vendors and their fit vary by use-case. Below are pragmatic selection heuristics based on 2025–26 market maturity.
DPUs
- Best for large, latency-sensitive fleets: Choose DPUs from established suppliers that provide a mature SDK, NVMe target support and orchestration tooling. Look for vendors with active SPDK and libibverbs support.
- Cost-conscious or small clusters: Avoid DPUs until CPU exhaustion is proven. A well-configured RDMA NIC + SPDK often suffices.
- Vendor lock-in risk: DPUs can tie you to vendor orchestration and firmware. Prefer vendors that support open standards and provide clear firmware rollback paths.
NICs / RNICs
- For raw throughput: 100–400Gbps RoCE/InfiniBand RNICs from established vendors — prioritize RNICs that expose full libibverbs functionality and RDMA memory registration performance.
- For multi-tenant environments: NICs that support SR-IOV, hardware QoS, and multi-queue scheduling will prevent noisy neighbors from affecting p99s.
NVMe and storage stack
- Hot data: Local NVMe (PCIe Gen5+) with high IOPS and DRAM caching for sub-ms retrieval.
- Large-scale capacity: NVMe-oF over RDMA to aggregate capacity without blowing memory budgets; pair with DPU targets when you need predictable host CPU headroom.
- Drive selection: Prefer enterprise TLC with sufficient write endurance. QLC may be tempting for cost, but it increases tail latency under mixed workloads.
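Endurance and wear show up in the drive's own telemetry, which is worth checking on candidate drives during evaluation; a quick sketch with nvme-cli:
# wear, write volume and media error counters for a candidate drive
nvme smart-log /dev/nvme0 | grep -Ei 'percentage|written|media'
# controller details (model, firmware, namespaces, power states)
nvme id-ctrl /dev/nvme0 | head -n 40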
Concrete tuning tips and commands
Here are practical knobs that repeatedly improve throughput in production environments.
Linux NIC and RDMA tune (examples)
# disable kernel Rx/Tx checksum offload if using DPDK/SPDK poll-mode drivers
netdev="eth0"
ethtool -K $netdev rx off tx off
# set large RX/TX rings
ethtool -G $netdev rx 4096 tx 4096
# for Mellanox/NVIDIA RDMA adapters, ensure RoCE ECN is enabled and congestion control configured
# (vendor-specific tools such as mlxconfig or mlnx_tune may be used)
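IRQ and NUMA placement matter as much as ring sizes; a minimal sketch for keeping the scoring process on the NIC's NUMA node (the server binary name is a placeholder), assuming numactl is installed:
# NUMA node the NIC hangs off
numa_node=$(cat /sys/class/net/$netdev/device/numa_node)
# pin the vector search process and its memory to that node
numactl --cpunodebind=$numa_node --membind=$numa_node ./vector-search-server   # placeholder binary
# stop irqbalance and set NIC queue IRQ affinity by hand via /proc/irq/<n>/smp_affinity_list
systemctl stop irqbalance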
NVMe and SPDK
# start SPDK target for NVMe-oF (simple example)
# requires spdk built with NVMe-oF target and RDMA support
./build/bin/nvmf_tgt -c ./etc/spdk/nvmf.conf
# fio example to measure NVMe random read latency (direct I/O so the page cache does not mask device latency)
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --size=20G --direct=1 --filename=/dev/nvme0n1
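To measure the remote path, connect an initiator to the NVMe-oF target started above and rerun the same fio job against the fabric-attached namespace; the NQN and address below are placeholders:
# connect over RDMA to the SPDK target (placeholder NQN and address)
nvme connect -t rdma -n nqn.2016-06.io.spdk:cnode1 -a 192.0.2.10 -s 4420
# the remote namespace appears as a local block device, e.g. /dev/nvme1n1
nvme list
# same job as above, now against the fabric device; compare p99 completion latency
fio --name=randread-remote --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --size=20G --direct=1 --filename=/dev/nvme1n1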
Cost and procurement: tradeoffs to be explicit about
DPUs and high-end RNICs cost more upfront and increase rack power and complexity. However, the total cost of ownership often favors offload when:
- CPU cores would otherwise need to be added purely for packet processing, TLS and NVMe target duties.
- Predictable p99 latency is a business SLA — DPUs reduce tail variance.
Procurement note (2026): memory prices remain volatile and sometimes higher than 2023–24 levels, so favour architectures that limit in-memory index duplication; this pushes teams to optimize NVMe access patterns and use fast SSD tiers for hot items. See vendor TCO writeups and storage cost optimization guidance when estimating fleet economics.
Example migration plan: from TCP+local NVMe to RDMA + DPU
- Baseline measurement: log QPS, p99, NVMe IOPS and CPU saturation under representative load.
- Introduce RDMA-capable NICs in a canary subgroup and enable SPDK user-space NVMe on those hosts.
- Measure again. If CPU is still saturated, deploy DPUs in the next canary and shift NVMe-oF targets to the DPUs.
- Iterate on QoS, VF allocation and NUMA alignment — we automate parts of this with lightweight tooling and pilot scripts.
- Roll out cluster-wide and validate failure scenarios, firmware upgrades and observability.
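For the VF-allocation step above, SR-IOV virtual functions can be created directly through sysfs on most RNICs; a sketch reusing the netdev variable from the tuning section (the device must support SR-IOV):
# reset, then create 8 virtual functions on the RNIC
echo 0 > /sys/class/net/$netdev/device/sriov_numvfs
echo 8 > /sys/class/net/$netdev/device/sriov_numvfs
# the VFs appear as extra PCI functions/netdevs to hand to tenants or containers
ip link show $netdev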
Case study — anonymised
An anonymised retail search team moved from a 50‑node TCP-based cluster to an RDMA + DPU design in 2025. Their pain points were p99 latencies of nearly 3 seconds during peak and high CPU usage from encryption and metadata filtering. After switching to NVMe-oF with a DPU-based NVMe target and enabling SPDK user-space paths, they reported:
- 10× increase in QPS at equivalent latency budget
- p99 reduced from 2.8s to 25ms under peak
- 30% fewer active CPU cores required, enabling cost savings
The tradeoff: added operational complexity and vendor engagement for firmware and DPU lifecycle management.
Future predictions (2026–2028)
- PCIe 6 and CXL synergy: By 2027, PCIe 6.0 and CXL will add more options for memory disaggregation and solid-state acceleration; this will reshape how hot/cold data is carved across NVMe pools.
- DPUs get smarter: Expect DPUs to take on more domain-specific acceleration (vector pre-filtering, compressed payload decompression) as vendors expose eBPF-like programmability with safety guarantees.
- Standardisation of NVMe-oF tooling: SPDK and libfabric will converge on battle-tested patterns, reducing integration friction and making RDMA-first designs mainstream.
Actionable takeaways
- Start with measurement: you can’t know whether the network or storage is the bottleneck without p99-aware telemetry.
- Choose RDMA-capable NICs early: they give the most predictable improvement per dollar for throughput-heavy, latency-sensitive vector search.
- Only adopt DPUs after pilot tests: validate that offload yields measurable CPU savings and latency stability for your specific query mix.
- Design for failure: test firmware upgrades, NVMe node loss and NIC failover in pre-production before fleet-wide rollouts.
- Prioritise standards: pick vendors that play well with SPDK, libibverbs, and open tooling to avoid lock-in and ease operations.
Final checklist before procurement
- Do you have a production-grade benchmark? If not — build one now.
- Have you modelled host CPU, NIC, and NVMe lane contention for peak QPS?
- Does the vendor provide observability, firmware rollback and SPDK/libibverbs support?
- Have you estimated TCO including power, cooling and DPU firmware lifecycle?
Infrastructure choices are your biggest tuning knob for vector search: the right RDMA fabric and storage topology can unlock 5–10× throughput and reduce p99 by orders of magnitude — if you measure, pilot and pick vendors with open tooling.
Call to action
If you’re planning a migration or a greenfield cluster in 2026, start with a reproducible benchmark and a 3-month pilot. We run hands-on cluster assessments and pilot blueprints tailored to vector search workloads — contact us for a focused workshop that maps your query profile to a DPU/NIC/NVMe strategy and a costed migration plan.