How do you choose a vector database for a production RAG system in 2027?

Curated by Kory White · Fractional CRO, CRO Syndicate
📅 Published · Updated · 6 min read

Choosing a vector database for production RAG comes down to matching four things to your workload: the operational model you can support (managed vs. self-hosted), your scale and latency targets, your filtering and hybrid-search needs, and your cost ceiling. In practice, teams that already run PostgreSQL should start with pgvector, teams that want zero operations should start with Pinecone, and teams that need self-hosted scale or heavy filtering should evaluate Qdrant, Weaviate, or Milvus.

The decision is reversible because embeddings are portable, so pick the simplest option that meets your requirements and benchmark before you scale.

Start with the workload, not the vendor

The most common mistake is choosing a vector database by reputation instead of by workload shape. Before comparing products, write down four numbers: how many vectors you will store (thousands, millions, or billions), your embedding dimension, your acceptable p95 query latency, and your peak queries per second.

These numbers eliminate most options immediately. A 200,000-chunk internal knowledge base has almost nothing in common with a 2-billion-vector product-search index, and the right database is different for each.
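As a rough illustration, the four numbers can be captured in a small record and turned into a first-order storage estimate. This is a minimal sketch: the `WorkloadProfile` name, the 768-dimension embeddings, and the latency and QPS figures are assumptions chosen for illustration, and the estimate covers only raw float32 vectors, not index overhead or metadata.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT32 = 4


@dataclass(frozen=True)
class WorkloadProfile:
    """The four numbers to write down before comparing products."""
    num_vectors: int
    dimension: int
    p95_latency_ms: float
    peak_qps: float

    def raw_vector_bytes(self) -> int:
        # Raw float32 storage only; real indexes add overhead on top.
        return self.num_vectors * self.dimension * BYTES_PER_FLOAT32

    def raw_vector_gib(self) -> float:
        return self.raw_vector_bytes() / 2**30


# Hypothetical profiles; only the vector counts come from the text above.
knowledge_base = WorkloadProfile(200_000, 768, 200.0, 20.0)
product_search = WorkloadProfile(2_000_000_000, 768, 50.0, 2_000.0)

for name, profile in [("knowledge base", knowledge_base),
                      ("product search", product_search)]:
    print(f"{name}: {profile.raw_vector_gib():,.2f} GiB of raw vectors")
```

Under these assumptions the first workload fits in well under a gibibyte while the second needs several tebibytes before any index overhead, which is why the two rarely share a shortlist.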

Also document your filtering requirements. Almost every production RAG system filters by tenant, document type, recency, or access permissions. If filtering is central, engines with efficient filtered search such as Qdrant and Pinecone rise to the top.

If you need to blend keyword and semantic matching, prioritize hybrid-capable engines such as Weaviate, Elasticsearch, OpenSearch, and Redis.

Decide your operational model first

The single biggest lever on total cost of ownership is whether you run the database yourself.

flowchart TD
    A[Pick operational model] --> B{Do you have an ops/platform team?}
    B -- No --> C[Managed service: Pinecone, Weaviate Cloud, Qdrant Cloud, Zilliz]
    B -- Yes --> D{Already run a database you can extend?}
    D -- Postgres --> E[pgvector]
    D -- Elasticsearch/OpenSearch --> F[Native dense vectors]
    D -- MongoDB --> G[Atlas Vector Search]
    D -- None --> H{Scale?}
    H -- Millions, filtered --> I[Qdrant or Weaviate self-hosted]
    H -- Billions --> J[Milvus or Vespa]

A managed service like Pinecone or Zilliz Cloud removes index builds, sharding, replication, and upgrades from your plate. You pay more per query but spend far less engineering time. A self-hosted engine like Qdrant or Milvus is cheaper at steady high volume but demands real operational maturity: monitoring, scaling, backups, and version upgrades.

If you do not already have a platform team comfortable running stateful distributed systems, managed almost always wins on total cost.

Match scale to architecture

Vector databases differ sharply in how they scale. pgvector is excellent up to low millions of vectors with HNSW indexes, especially when your filters are selective, but it is not designed to be a billion-vector engine. Qdrant and Weaviate scale comfortably into the tens or hundreds of millions with horizontal sharding.

Milvus and Vespa are built for the largest workloads, separating storage and compute and offering disk-based indexes like DiskANN so you are not forced to hold every vector in RAM.

If your dataset is large, pay attention to memory cost, which usually dominates the bill. Quantization — scalar, product, or binary — can shrink memory use by a large factor at a modest recall cost. Qdrant and Milvus have strong quantization support, and choosing IVF-PQ or DiskANN over plain HNSW can change your infrastructure cost by an order of magnitude at billion-vector scale.
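To make the memory trade-off concrete, here is a minimal sketch of scalar quantization: each float32 component is mapped to an 8-bit integer using a single per-vector scale. This is a simplified illustration of the idea, not the scheme any particular engine uses; the function names and the 768-dimension random vector are assumptions.

```python
import random


def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one per-vector scale."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [code * scale for code in codes]


rng = random.Random(0)
vector = [rng.uniform(-1.0, 1.0) for _ in range(768)]  # illustrative embedding

codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

float32_bytes = len(vector) * 4   # 4 bytes per component
int8_bytes = len(codes) * 1 + 4   # 1 byte per component plus one float32 scale

print(f"{float32_bytes} bytes -> {int8_bytes} bytes, max error {max_error:.5f}")
```

The per-component error is bounded by half the scale step, which is the "modest recall cost" side of the trade; the storage drops to roughly a quarter. Product and binary quantization compress further with a larger accuracy penalty.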

Test recall and latency on your own data

Vendor benchmarks use public datasets that rarely match your embeddings or query distribution. The only benchmark that matters is yours. Take a representative sample of your real documents and queries, build the index with each candidate engine, and measure two things: recall@k against a brute-force ground truth, and latency at p95 and p99 under realistic concurrency.
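The recall measurement can be sketched in a few lines. The example below computes an exact top-k by scoring every vector, then compares it with a candidate result list. The random 16-dimension corpus and the hand-built `approx` list are stand-ins for your real data and for whatever a candidate engine returns; cosine similarity is an assumed metric, so substitute the one your embeddings expect.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(query, corpus, k):
    """Exact nearest neighbors: score every vector, keep the k best ids."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: cosine_similarity(query, corpus[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the candidate result also contains."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(42)
corpus = {f"doc-{i}": [rng.gauss(0, 1) for _ in range(16)] for i in range(500)}
query = [rng.gauss(0, 1) for _ in range(16)]

exact = brute_force_top_k(query, corpus, k=10)
# Stand-in for an engine's approximate answer: 8 true neighbors, 2 misses.
approx = exact[:8] + ["doc-miss-1", "doc-miss-2"]

print(f"recall@10 = {recall_at_k(approx, exact):.2f}")
```

In a real test you would average this over the whole query sample rather than a single query, since recall varies from query to query.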

flowchart LR
    A[Sample real corpus + queries] --> B[Compute brute-force ground truth]
    B --> C[Index in each candidate DB]
    C --> D[Measure recall@k]
    C --> E[Measure p95/p99 latency at target QPS]
    D --> F{Recall acceptable?}
    E --> F
    F -- Yes --> G[Compare cost]
    F -- No --> H[Tune index params or quantization]
    H --> C
    G --> I[Choose]

Tune index parameters honestly during this test. For HNSW that means ef_construction, M, and query-time ef; for IVF it means the number of lists and probes. A poorly tuned engine can look worse than it is. Run the test long enough to see tail latency under sustained load, not just a few warm queries.
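For the latency half of the test, a small harness that issues queries from several threads and reports tail percentiles is often enough to start. This is a sketch under assumptions: `fake_search` is a placeholder for a real client call, the nearest-rank percentile is one of several valid definitions, and a fixed thread pool measures closed-loop concurrency rather than a strict target QPS.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_call(search, query):
    """Return the wall-clock latency of one search call in milliseconds."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0


def measure_latency(search, queries, concurrency):
    """Run all queries with a fixed number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda q: timed_call(search, q), queries))
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


def fake_search(query):
    # Placeholder: replace with a call to the candidate engine's client.
    time.sleep(0.002)
    return []


report = measure_latency(fake_search, queries=list(range(200)), concurrency=8)
print(report)
```

A few hundred queries are only enough to smoke-test the harness; tail percentiles stabilize over much longer runs, which is the point of testing under sustained load.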

Plan for hybrid search and metadata filtering

Pure dense-vector retrieval misses queries that hinge on exact terms — product SKUs, error codes, names, acronyms. Hybrid search, which fuses BM25 keyword scoring with vector similarity, consistently improves recall on real user queries. If your users ask questions with specific identifiers, prioritize engines with native hybrid support: Weaviate, Elasticsearch, OpenSearch, Qdrant, and Redis.
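One common way to fuse a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), sketched below. This is an illustrative technique, not necessarily what any of the engines above does internally; the document ids are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query containing an exact SKU.
keyword_hits = ["sku-4471", "doc-a", "doc-b"]  # e.g. from BM25
vector_hits = ["doc-a", "doc-c", "sku-4471"]   # e.g. from ANN search

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF works on ranks rather than raw scores, so it sidesteps the problem that BM25 scores and vector similarities live on different scales. Documents that appear in both lists rise above those that appear in only one.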

Metadata filtering deserves equal attention. In multi-tenant systems you must filter by tenant before returning results, and in permissioned systems you must enforce access at query time. Confirm that your candidate database does pre-filtering or efficient filtered ANN rather than naive post-filtering, which can silently destroy recall when filters are selective.

Qdrant and Pinecone are particularly strong here.
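The recall problem with naive post-filtering is easy to reproduce on toy data. In this sketch, 100 documents already carry a similarity score, and the tenant being queried owns only the lowest-scoring ten. The data, field names, and helper functions are all hypothetical.

```python
def top_k(scored_docs, k):
    return sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:k]


def post_filter(docs, tenant, k):
    """Take the global top-k first, then drop other tenants' documents."""
    return [d for d in top_k(docs, k) if d["tenant"] == tenant]


def pre_filter(docs, tenant, k):
    """Restrict to the tenant first, then take the top-k of what remains."""
    return top_k([d for d in docs if d["tenant"] == tenant], k)


# Tenant "acme" owns ids 90-99, which are also the ten lowest scores.
docs = [
    {"id": i, "tenant": "acme" if i >= 90 else "other", "score": 1.0 - i / 100}
    for i in range(100)
]

print("post-filter:", [d["id"] for d in post_filter(docs, "acme", 5)])
print("pre-filter: ", [d["id"] for d in pre_filter(docs, "acme", 5)])
```

Post-filtering returns nothing here because every one of the global top five belongs to another tenant, while pre-filtering returns the tenant's five best matches. Real engines avoid the brute-force scan shown in `pre_filter` by integrating the filter into the ANN search itself.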

Weigh cost and lock-in honestly

Cost has several components: storage of vectors, compute for queries, and the engineering time to operate the system. Managed serverless options (Pinecone serverless, Zilliz, Qdrant Cloud) bill by usage and suit spiky RAG traffic. Fixed self-hosted clusters are cheaper at steady high load but waste money when idle.

pgvector is often the cheapest credible option because it adds vectors to a database you already pay for.

Lock-in risk is low for the vectors themselves — they are just arrays you can re-embed or re-index elsewhere. The lock-in that matters is operational: proprietary query features, filtering syntax, and tooling you build around one engine. Keep your retrieval layer behind a thin interface so swapping databases later is a contained change, not a rewrite.
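A thin interface of this kind might look like the following sketch. The `VectorStore` protocol, its two methods, and the in-memory implementation are assumptions for illustration; a real adapter would wrap one engine's client behind the same two calls.

```python
from typing import Protocol, Sequence


class VectorStore(Protocol):
    """The only surface the rest of the application is allowed to touch."""

    def upsert(self, doc_id: str, vector: Sequence[float], metadata: dict) -> None: ...

    def search(self, vector: Sequence[float], k: int, filters: dict) -> list[str]: ...


class InMemoryStore:
    """Reference implementation: exact search with equality filters."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, doc_id, vector, metadata):
        self._rows[doc_id] = (list(vector), dict(metadata))

    def search(self, vector, k, filters):
        def matches(meta):
            return all(meta.get(key) == value for key, value in filters.items())

        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        candidates = [
            (dot(vector, vec), doc_id)
            for doc_id, (vec, meta) in self._rows.items()
            if matches(meta)
        ]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in candidates[:k]]


def retrieve(store: VectorStore, query_vector, tenant, k=3):
    """Application code depends on the protocol, never on a vendor client."""
    return store.search(query_vector, k=k, filters={"tenant": tenant})


store = InMemoryStore()
store.upsert("a", [1.0, 0.0], {"tenant": "acme"})
store.upsert("b", [0.9, 0.1], {"tenant": "acme"})
store.upsert("c", [1.0, 0.0], {"tenant": "globex"})

results = retrieve(store, [1.0, 0.0], tenant="acme")
print(results)
```

Swapping engines then means writing one new class that satisfies `VectorStore`. The in-memory version also doubles as a test fixture and as the brute-force baseline for recall checks.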

Frequently Asked Questions

Is pgvector good enough for production?

For many systems, yes. With HNSW indexing and selective filters, pgvector handles low millions of vectors with good latency, and it keeps your embeddings transactionally consistent with your relational data. It becomes the wrong tool mainly at very large scale or extreme QPS.

When should I choose a managed service over self-hosting?

Choose managed (Pinecone, Zilliz, Qdrant Cloud, Weaviate Cloud) when you lack a platform team to operate stateful distributed systems, when time-to-market matters more than per-query cost, or when your traffic is spiky and serverless billing saves money.

How do I benchmark recall correctly?

Compute a brute-force exact nearest-neighbor ground truth on a sample, then measure how often each engine's approximate results match it at your chosen k. Tune index parameters for each engine before comparing, and test under realistic concurrency for tail latency.

Do I need hybrid search?

If your users include exact terms like codes, names, or SKUs in their queries, hybrid search noticeably improves recall. If your queries are purely conceptual, dense vectors alone may suffice. When unsure, test both on real queries.

How do I control memory cost at scale?

Use quantization (scalar, product, or binary) and disk-based indexes like DiskANN to avoid keeping every full-precision vector in RAM. Qdrant and Milvus offer strong support for these techniques, often cutting memory cost dramatically with a small recall trade-off.

Can I change vector databases after launch?

Yes. Because embeddings are portable, migrating means re-indexing the same vectors into a new engine and re-tuning parameters. Keeping retrieval behind a thin abstraction makes the swap a contained task rather than a major rewrite.
