Enterprise RAG Architecture: Vector DB, Chunking and Evaluation Guide
Seven practical decisions for production RAG systems: vector DB selection, chunking strategy, hybrid search, reranker, evaluation framework. Lessons from enterprise AI projects. An eCloud Tech engineering note.
Over the past three years RAG (Retrieval-Augmented Generation) has become the dominant architecture for enterprise AI projects. Requirements such as compliance with KVKK (Türkiye's personal data protection law), sensitive-data governance and working with in-house knowledge bases make it a far more practical path than fine-tuning open models: leave the base model as it is, feed the organisation's documents to the model at call time, and have it produce the answer. The idea is simple; what sinks projects is a lack of discipline in execution. The difference between production-grade RAG and demo-grade RAG is architectural quality: every layer, from vector DB selection and chunking strategy through hybrid search and the reranker to the evaluation framework and production monitoring, must be measured and tested.
Within our RAG systems engineering and vector database engineering services we have worked on 12 enterprise RAG projects over the past 18 months, and we operate our own AIGENCY v4 platform on a RAG-heavy architecture. In this article we walk through seven critical decisions for enterprise RAG architecture in order: RAG versus fine-tuning, doc store selection, chunking strategy, embedding model, hybrid search, reranker layer, and the evaluation framework (including production monitoring).
1. Strategic decision — RAG, fine-tuning or hybrid?
The first question is not "let's use RAG" but "is RAG the right tool for this problem?" Three approaches are on the table:
- Pure RAG — the base model is unchanged; for every query, organisation documents are pulled at call time. Ideal for scenarios with frequently updated data, audit requirements and traceability (legal, customer support, corporate knowledge base).
- Fine-tuning — the base model is specialised for a specific tone, format or domain task. Good for stable, narrow tasks (medical reporting, code-style transformation, product description generation).
- Hybrid (RAG + fine-tuning) — the fine-tuned model holds the base knowledge (domain terminology, format), while RAG provides continuously updated data. The dominant approach for complex, high-volume enterprise scenarios.
Decision test: if knowledge updates daily, RAG; if behaviour/format must be learned, fine-tuning; if both are required, hybrid. About 70% of enterprise projects start with pure RAG; once they hit real production volume, about half move to hybrid. Pure fine-tuning is a minority in enterprise use because retraining on every knowledge change is costly and audit is hard.
Second practical point: RAG is no silver bullet. Applied wrongly, hallucination rates rise (the LLM treats irrelevant docs as sources), latency increases and costs explode. Applied correctly, it makes the model say "I don't know" where it should, and answer with citations where it does know. The difference is architectural discipline.
2. Doc store selection — pgvector vs Qdrant vs Milvus vs managed services
Vector database selection is the skeletal decision of RAG architecture; changing it later means migration pain and months of friction. Four main categories:
PostgreSQL + pgvector — installed as an extension on top of your existing PostgreSQL. One database, one backup, one operation. Practical at 100K-500K embedding scale. Advantages: SQL-level metadata filter, transactional consistency, the tool your team already knows. Drawbacks: HNSW indexing slows down at 1M+ scale, and filter + vector search combinations require special tuning. Our recommendation: default choice for small-to-medium enterprise projects.
Qdrant — written in Rust, a vector-first database. Excellent HNSW index performance, native payload filter, REST + gRPC APIs. Sweet spot at 1M-50M embeddings. Self-hosted or managed cloud. Has gained popularity in enterprise projects since 2023. In our AIGENCY pipeline we hold 8M embeddings in a Qdrant cluster.
Milvus — distributed architecture designed for very large scale (10M-1B+ embeddings). Operational load is high (Kubernetes, etcd, MinIO components); hard without a dedicated DevOps presence on the team. Advantage: GPU-accelerated indexing and advanced index options (IVF/HNSW/DiskANN).
Pinecone / Weaviate Cloud / Zilliz Cloud — managed services. No operational load, production-ready in minutes. Cost: monthly USD 70-1,500 for small-to-medium load, USD 3,000-15,000 for enterprise SLA. Data location is critical from the KVKK perspective — Pinecone has an EU region, Weaviate has self-hosted or AWS Frankfurt options; a DPA (Data Processing Agreement) is mandatory in the contract.
Decision matrix:
| Criterion | pgvector | Qdrant | Milvus | Managed |
|---|---|---|---|---|
| Sweet spot scale | <500K | 1M-50M | 10M-1B+ | Any scale |
| Operational load | Low | Medium | High | None |
| Monthly infra cost | TRY 500-3,000 | TRY 3,000-15,000 | TRY 10,000-40,000 | USD 70-15,000 |
| KVKK control | Full | Full | Full | Contract-dependent |
| Hybrid search | Manual | Native | Native | Native |
| Latency p95 | 30-100ms | 10-50ms | 5-30ms | 20-100ms |
Practical recommendation: start with pgvector. As you approach the 500K embedding limit, plan a Qdrant migration. Above 10M, consider Milvus or a managed service. Over-provisioning up front creates ops fatigue and unnecessary cost in most projects.
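The recommendation above can be written down as a small helper. This is only a sketch of the thresholds stated in the text; the function name and the `has_devops_team` flag are illustrative assumptions, and real projects would also weigh cost, KVKK constraints and filter requirements.

```python
def recommend_doc_store(n_embeddings: int, has_devops_team: bool = False) -> str:
    """Map an expected embedding count to a starting doc store.

    Thresholds follow the practical recommendation in the text.
    The has_devops_team flag is a hypothetical input used to choose
    between Milvus and a managed service at very large scale.
    """
    if n_embeddings < 0:
        raise ValueError("n_embeddings must be non-negative")
    if n_embeddings < 500_000:
        return "pgvector"
    if n_embeddings <= 10_000_000:
        return "qdrant"
    # Above 10M: Milvus carries a high ops load (Kubernetes, etcd, MinIO),
    # so fall back to a managed service when no dedicated DevOps exists.
    return "milvus" if has_devops_team else "managed"
```

For example, `recommend_doc_store(8_000_000)` returns `"qdrant"`, which matches the 8M-embedding Qdrant cluster mentioned for AIGENCY.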
3. Chunking strategy — fixed size isn't enough
The single most critical factor in RAG quality is chunking strategy. Bad chunks waste even the world's best embedding model. Three core patterns:
Fixed-size chunking — the document is split into a fixed token count (e.g. 500 tokens, 15% overlap). Simplest, fastest; acceptable for free-form text (emails, chat logs). Drawback: cuts can land in the middle of paragraphs or sentences; semantic boundaries shift.
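A minimal sketch of the fixed-size pattern, assuming the document has already been tokenised into a list of strings. In practice the tokens would come from the embedding model's own tokenizer; the function name and defaults here are illustrative.

```python
def fixed_size_chunks(
    tokens: list[str], size: int = 500, overlap_ratio: float = 0.15
) -> list[list[str]]:
    """Split a token list into windows of `size` tokens with fractional overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(size * overlap_ratio)
    step = size - overlap  # always >= 1 because overlap < size
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(tokens):
            break
    return chunks
```

With 500-token windows and 15% overlap, consecutive chunks share 75 tokens, so a sentence cut at one boundary still appears whole in the neighbouring chunk most of the time.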
Structure-aware chunking — split by document structure: by Markdown heading (#, ##), by HTML section/article tag, in PDF by font changes or bullet points. Gives 2-3× better recall for technical docs, legal texts and product specs. Implementation: ready-made parsers exist in Unstructured.io, LlamaIndex node parsers, LangChain text splitters.
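To illustrate the Markdown case only, here is a small heading-based splitter written with the standard library. It is a sketch, not a replacement for the parsers named above: it does not handle `#` characters inside fenced code blocks, and the output dictionary keys are assumptions chosen for this example.

```python
import re

# ATX-style Markdown headings: one to six '#' characters, then text.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one section per heading, keeping the heading as metadata."""
    sections: list[dict] = []
    heading, level, lines = None, 0, []

    def flush() -> None:
        text = "\n".join(lines).strip()
        # Keep the preamble only if it has content; keep every real heading.
        if heading is not None or text:
            sections.append({"heading": heading, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections
```

Sections that exceed the target chunk size can then be split further, while the heading travels with every sub-chunk as metadata.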
Semantic chunking — embedding-aware splitting; a new chunk starts when the cosine distance between two consecutive sentences' embeddings exceeds a threshold. Most sophisticated but compute-heavy (extra embedding pass per document). Delivers real value for complex, long texts (academic articles, reports).
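The boundary rule described above can be sketched as follows, assuming sentence embeddings have already been computed by whatever model is in use. The threshold value of 0.3 is a placeholder for illustration, not a recommendation from the source; it needs tuning per corpus.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: list[str], embeddings: list[list[float]], threshold: float = 0.3
) -> list[list[str]]:
    """Start a new chunk whenever consecutive sentences drift apart."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return chunks
```

The extra embedding pass per document mentioned in the text is exactly the `embeddings` argument here: every sentence has to be embedded once before any chunk exists.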
Optimal chunk size is domain-dependent:
| Document type | Recommended chunk | Overlap | Strategy |
|---|---|---|---|
| Contract, legal text | 600-900 tokens | 20% | Structure-aware (article/clause) |
| Technical documentation | 400-600 tokens | 15% | Structure-aware (headings) |
| Customer support FAQ | 200-400 tokens | 10% | Q-A pair as one chunk |
| Email, chat log | 300-500 tokens | 15% | Fixed + sliding window |
| Academic article | 500-800 tokens | 20% | Semantic |
| Code file | Function as a whole | 0 | AST-aware (Tree-sitter) |
In our AIGENCY pipeline a doc-type classifier first classifies the document (contract / email / FAQ / log / article) and then runs a chunker specialised for that type. This approach lifted top-5 recall by 35% compared with one-size-fits-all fixed chunking.
Third practical point: metadata enrichment. Attach not just text but also doc title, section heading, document type, source URL, language, last_updated date and author to every chunk. At retrieval time this metadata is gold for filters (e.g. only documents from the last 6 months); at generation time it is gold for source attribution.
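One possible shape for an enriched chunk is sketched below, with the metadata fields listed above. The class and helper names are hypothetical, and the in-memory filter only illustrates the idea; in production the same condition would normally be pushed down to the vector DB as a SQL or payload filter.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Chunk:
    text: str
    doc_title: str
    section: str
    doc_type: str
    source_url: str
    language: str
    last_updated: date
    author: str


def updated_within(chunks: list[Chunk], days: int, today: date) -> list[Chunk]:
    """Retrieval-time filter: keep chunks updated in the last `days` days."""
    cutoff = today - timedelta(days=days)
    return [c for c in chunks if c.last_updated >= cutoff]


def citation(chunk: Chunk) -> str:
    """Generation-time source attribution built purely from metadata."""
    return (
        f"{chunk.doc_title}, {chunk.section} "
        f"({chunk.source_url}, updated {chunk.last_updated.isoformat()})"
    )
```

Because the attribution string is built from stored metadata rather than from model output, the citation cannot be hallucinated even when the answer text is wrong.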
4. Embedding model — multilingual, dense, sparse, hybrid
Embedding model quality is the ceiling of retrieval quality. With the wrong model, even the best chunking gives low recall. As of 2026, four practical families:
OpenAI text-embedding-3-large: 3072 dimensions, strong in English, Turkish and Arabic, ~USD 0.00013 per 1K tokens (so a typical short query costs a fraction of that). Popular in production, but every query goes to OpenAI; this can create a data-location issue from the KVKK perspective and requires a DPA in the contract.
Cohere embed-multilingual-v3 — 1024 dimensions, 100+ languages; slightly weaker than OpenAI in Turkish and Arabic but still strong. Cohere's EU region offering is an advantage for KVKK. Cost is close to OpenAI.
Open-source (BGE-M3, mE5-large, multilingual-MiniLM): self-hosted, running on GPU. BGE-M3 can operate in dense, sparse and multi-vector modes (the multi-vector mode follows a ColBERT-style late-interaction approach). Advantages: full KVKK control, no per-query cost, you can fine-tune. Drawbacks: GPU operational load; raw latency can be 2-5× higher than OpenAI/Cohere.
Domain-specific fine-tuned — the organisation contrastive-fine-tunes BGE or E5 on its own corpus. Training for 1-3 epochs on 5K-50K query-positive_doc pairs lifts recall by 10-25%. Real value exists for domain-specific language (legal, finance, medical).
Practical recommendation: OpenAI or Cohere for pilots and PoCs; self-hosted BGE-M3 for KVKK-critical enterprise use; fine-tuned BGE or mE5 for real volume plus domain-specific precision. Switching the embedding model requires re-embedding the whole corpus, so budget this cost and time from the start (for example, for 1M docs: roughly 2 hours of re-embedding × GPU hourly cost).
5. Hybrid search — vector + BM25 + RRF
Pure vector search misses certain queries:
- Exact keyword match required (XR-2050-A product code, Law 5651)
- Person/place names (embedding models represent proper nouns poorly)
- Acronyms and technical terms (KVKK, MASAK, TS 13638)
- Negation ("systems not KVKK-compliant"; embeddings handle negation weakly)
The solution is hybrid search: BM25 (sparse) + vector (dense) run together, results merged. Two merging approaches:
Reciprocal Rank Fusion (RRF) — each system's document rank is run through 1/(k+rank) (k=60 default), and the two scores are summed. No score normalisation needed; robust. Default choice in production.
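A minimal sketch of RRF as described above, assuming each retriever returns an ordered list of document ids with the best result first. The function name is illustrative; ties are broken by document id only to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists by summing 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc id as a deterministic tie-breaker.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Here `d1` (ranks 1 and 2) ends up ahead of `d3` (ranks 3 and 1), and both beat documents seen by only one retriever. Only ranks are used, which is why no score normalisation is needed.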
Weighted score — BM25 score and vector score are normalised separately, then merged with a weight (e.g. 0.4 BM25, 0.6 vector). Weights must be tuned per domain; more flexible but tuning-heavy.
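For comparison, a sketch of the weighted variant. The text does not say which normalisation is used; min-max scaling is assumed here, and a document missing from one retriever's results is treated as scoring 0 in that retriever.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(
    bm25: dict[str, float],
    vector: dict[str, float],
    w_bm25: float = 0.4,
    w_vector: float = 0.6,
) -> list[tuple[str, float]]:
    """Normalise each score set separately, then combine with fixed weights."""
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc: w_bm25 * b.get(doc, 0.0) + w_vector * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Unlike RRF, the outcome depends on the raw score distributions, which is why the weights have to be re-tuned whenever the embedding model or the BM25 configuration changes.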
Implementation:
| Vector DB | Hybrid support |
|---|---|
| pgvector | Manual (tsvector + cosine, merged in code) |
| Qdrant | Native (Fusion API) |
| Weaviate | Native (hybrid query) |
| Milvus | Native (BM25 + vector + RRF) |
| Elasticsearch | Native (knn + match query) |
In our measurements, hybrid search lifted top-5 recall by 18-32% and top-1 (best result) accuracy by 25-40% versus pure vector. Outside PoC, we treat it as mandatory for every production RAG.
6. Reranker — the pipeline's precision filter
The retrieval stage is fast and broad (top-50 or top-100 documents); the reranker is slow and precise (a cross-encoder scores each query-document pair individually and trims the list to top-5). The reasoning behind the two-layer approach is simple: a bi-encoder embeds query and doc into separate vectors and compares them by cosine similarity, so any interaction between query terms and document terms is lost. A cross-encoder feeds query and doc to the model together and produces a more precise score, but because each query-document pair needs its own forward pass, it is slower.
Practical flow:
- Query → embedding → vector DB top-100 retrieval (10-50ms)
- Top-100 → reranker model → score per pair (50-300ms for 100 docs)
- Top-5 / top-10 → placed in LLM context
- LLM produces the answer
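The first three steps of this flow can be sketched with two pluggable callables. Everything here is an assumption for illustration: `retrieve` stands in for the vector DB call, `score_pair` for the cross-encoder, and the toy implementations exist only so the example runs without external services.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, top_k) -> candidate docs
PairScorer = Callable[[str, str], float]     # (query, doc) -> relevance score


def retrieve_and_rerank(
    query: str,
    retrieve: Retriever,
    score_pair: PairScorer,
    candidates: int = 100,
    top_n: int = 5,
) -> list[str]:
    """Broad, cheap retrieval followed by precise per-pair reranking."""
    docs = retrieve(query, candidates)
    scored = [(score_pair(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


# Toy stand-ins so the sketch is runnable; not real retrieval or reranking.
CORPUS = ["how to reset your password", "billing faq", "password policy"]


def toy_retrieve(query: str, top_k: int) -> list[str]:
    return CORPUS[:top_k]


def toy_score(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


context = retrieve_and_rerank("reset password", toy_retrieve, toy_score, top_n=2)
```

The returned `context` list is what gets placed into the LLM prompt in the final step.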
Popular rerankers:
- Cohere Rerank v3 — managed API, USD 0.001-0.002 per query, 100ms latency. EU region for KVKK. The most common choice in production.
- BGE Reranker (open-source) — self-hosted, 50-150ms on GPU. Full KVKK control, no per-query cost. Variants: base/large/v2-m3.
- ColBERT v2 — late-interaction architecture, score per query-token per document-token. Most precise but operationally complex. Only for very high-precision use cases.
- Cross-encoder ms-marco-MiniLM — small model, fast (reasonable even on CPU), limited recall gain.
In AIGENCY we use Cohere Rerank v3 with self-hosted BGE Reranker as fallback. Automatic fallback when Cohere is unavailable keeps the latency budget intact.
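A fallback of this kind might look like the sketch below. It is not the AIGENCY implementation; the function and parameter names are hypothetical. It assumes the primary client is configured with its own request timeout so that an unavailable API surfaces as an exception rather than a hang.

```python
from typing import Callable

Reranker = Callable[[str, list[str]], list[str]]  # returns docs, best first


def rerank_with_fallback(
    query: str,
    docs: list[str],
    primary: Reranker,
    fallback: Reranker,
    errors: tuple = (Exception,),
) -> tuple[list[str], str]:
    """Try the primary reranker; on failure use the fallback.

    Returns the reranked docs plus the name of the path taken, so the
    fallback rate can be tracked in monitoring.
    """
    try:
        return primary(query, docs), "primary"
    except errors:
        return fallback(query, docs), "fallback"
```

Returning which path served the request is a deliberate choice: a rising fallback rate is itself a signal worth alerting on.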
Proof of reranker value: on a 200-query benchmark, retrieval + LLM alone gave 62% accuracy; retrieval + rerank + LLM rose to 84%. For production RAG, the reranker is no longer optional — it is treated as a mandatory layer.
7. Evaluation framework — how do you know the system actually works?
The most frequently skipped step in production RAG is evaluation. The system is taken to production because it works in the demo; three months later the complaints start: "why is our model saying I don't know?" The solution is continuous, three-layer evaluation.
Layer 1 — Retrieval evaluation. A golden dataset (50-200 manually labelled query-positive_doc pairs) is prepared; the system returns top-K for each query; metrics:
- Recall@K: is the correct document inside top-K?
- MRR (Mean Reciprocal Rank): at what average rank is the correct document?
- NDCG: graded ranking quality.
Automated: Ragas, TruLens, Phoenix (Arize) and DeepEval frameworks provide ready-made retrieval + generation eval patterns. Weekly automated runs + dashboard is the minimum standard in enterprise projects.
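For teams that want to see what these metrics compute before adopting a framework, here is a plain-Python sketch. It assumes each golden query has a set of relevant document ids and, for NDCG, a graded gain per document; the function names are illustrative.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the returned order divided by DCG of the ideal order."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single labelled positive per query, `recall_at_k` reduces to the yes/no question "is the correct document inside top-K?" and averaging it over the golden dataset gives the headline Recall@K figure.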
Layer 2 — Generation evaluation. LLM answer quality:
- Faithfulness: is the answer loyal to the retrieved context (no hallucination)?
- Answer relevance: does the answer respond to the query?
- Context precision/recall: was the necessary context used?
The LLM-as-judge pattern (asking GPT-4o or Claude Opus to "rate this answer's faithfulness to this source, 1-5") is fast and scales, but it is subjective: quarterly human calibration is mandatory (50-100 manually labelled samples).
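A sketch of the judge loop and the calibration check, with the model call abstracted away. `call_llm` is a hypothetical callable that takes a prompt and returns the model's reply as text; the prompt wording and the tolerance of one point are assumptions for illustration, not values from the source.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = (
    "Rate how faithful the ANSWER is to the SOURCE on a scale of 1 to 5, "
    "where 5 means every claim is supported by the source.\n\n"
    "SOURCE:\n{source}\n\nANSWER:\n{answer}\n\n"
    "Reply with a single integer."
)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_faithfulness(
    source: str, answer: str, call_llm: Callable[[str], str]
) -> Optional[int]:
    """Ask the judge model for a 1-5 faithfulness score."""
    return parse_score(call_llm(JUDGE_PROMPT.format(source=source, answer=answer)))


def agreement_rate(
    judge_scores: list[Optional[int]], human_scores: list[int], tolerance: int = 1
) -> float:
    """Share of samples where judge and human differ by at most `tolerance`."""
    pairs = [(j, h) for j, h in zip(judge_scores, human_scores) if j is not None]
    if not pairs:
        return 0.0
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)
```

Running `agreement_rate` on the quarterly human-labelled sample gives a single number to track; a drop suggests the judge prompt or judge model needs revisiting.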
Layer 3 — Production monitoring. On real user queries:
- Latency distribution (p50, p95, p99) — retrieval, rerank and LLM stages separately.
- Cost per query — token consumption and vector DB query cost.
- Thumbs up/down feedback rate — user feedback.
- I don't know / clarification rate — is the model too uncertain or too aggressive?
- Hallucination spot-check — weekly manual review of 20-50 random queries.
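The per-stage latency breakdown can be computed with the standard library alone. This sketch assumes latencies are collected in milliseconds per stage; the stage names and the dictionary layout are illustrative, and a real deployment would more likely rely on the percentile functions of its metrics backend.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """p50 / p95 / p99 of one stage, using inclusive interpolation."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def stage_report(stages: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Percentiles for each pipeline stage (e.g. retrieval, rerank, llm)."""
    return {name: latency_percentiles(samples) for name, samples in stages.items()}
```

Reporting the stages separately matters because an end-to-end p95 alone cannot tell whether a slowdown comes from the vector DB, the reranker or the LLM.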
Without all three layers running together, RAG degradation takes months to detect. An embedding model upgrade, a chunking parameter tweak, a vector DB index rebuild — any of these can silently reduce quality. Integrating an eval suite into the CI/CD pipeline (retrieval eval run on every PR, block on regression) enables disciplined RAG evolution.
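A regression gate of this kind can be very small. The sketch below assumes the eval suite writes its metrics as a name-to-value mapping and that higher is better for every metric; the metric names and the tolerance of 0.01 are placeholders.

```python
def check_regression(
    baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.01
) -> list[str]:
    """Return one message per metric that dropped below baseline - tolerance."""
    failures: list[str] = []
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base - tolerance:
            failures.append(
                f"{metric}: {value:.3f} is below baseline {base:.3f} "
                f"(tolerance {tolerance})"
            )
    return failures


failures = check_regression(
    baseline={"recall@5": 0.82, "mrr": 0.61},
    current={"recall@5": 0.79, "mrr": 0.62},
)
```

In a CI job, a non-empty `failures` list would be printed and turned into a non-zero exit code so that the PR is blocked until the regression is explained or the baseline is deliberately updated.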
Our AI governance framework details how these evaluation layers integrate with enterprise audit requirements. Combined with the KVKK Data Controller obligation, this eval also doubles as legal evidence: a documented answer to "how does our model work, and to which source is it faithful?"
Practical summary — a starting checklist
The correct order for your first production RAG project:
- Is RAG actually right? If knowledge updates often, lean RAG; if behaviour or format must be learned, lean fine-tuning; if both, hybrid.
- Doc store: start with pgvector (<500K embeddings); migrate to Qdrant as volume grows. Pick a managed service only when KVKK DPA is in place.
- Chunking: doc-type classifier + per-type chunker. Fixed-size only for free-form text. Structure-aware as default whenever possible.
- Embedding: start with OpenAI/Cohere (PoC); for production consider BGE-M3 self-hosted or domain fine-tuned.
- Hybrid search: BM25 + vector + RRF. Always, outside PoC.
- Reranker: Cohere Rerank or BGE Reranker. Mandatory in production.
- Evaluation: golden dataset → retrieval eval → LLM-judge → production monitoring. All three, automated weekly.
- Metadata: attach doc title, section, type, source URL, date and language to every chunk. Gold for filter and attribution.
- KVKK: data location (model, vector DB, logs), DPA, retention policy, traceability of user-derived data.
- Continuous improvement: close the feedback loop — analyse low-rated production queries, update the golden dataset, iterate chunker/embedding/reranker.
This list is the minimum discipline. Domain-specific requirements (citation for legal RAG, confidence threshold for medical RAG, escalation pattern for customer support RAG) are added on top. To extract real value from RAG in production, you need to move past "the demo works" and reach "the evaluation framework runs every week and alerts the moment something breaks".
Our team in Şanlıurfa Karaköprü has internalised this discipline by operating our AIGENCY v4 platform on a RAG-heavy architecture and by delivering enterprise RAG systems engineering. For an enterprise RAG pilot, vector DB selection or production-grade RAG architecture assessment, you can reach us through the contact form — the first assessment call is free of charge.
eCloud Tech — A team based in Şanlıurfa, Türkiye, working on enterprise software, AI, blockchain and cybersecurity. Building Tomorrow.
Frequently Asked Questions
The choice of vector database rests on three variables. (1) Scale: up to roughly 500K embeddings, PostgreSQL with the pgvector extension is enough; one database, one backup, one ops team. Between 1M and 10M, Qdrant or Milvus are preferred because HNSW/IVF indexing and filter performance are much better than in a classical DB. Above 10M, Milvus or managed services (Pinecone, Weaviate Cloud) are realistic. (2) Operational load: pgvector runs on the PostgreSQL you already have; Qdrant requires its own cluster with separate backup/upgrade routines. For small teams, pgvector is a lifesaver. (3) Filter + metadata requirements: Qdrant's payload filter is more predictable in production; pgvector + GiST/SP-GiST combinations are also strong but require tuning effort. Practical recommendation: under 500K, start with pgvector and migrate to Qdrant if needed.