Why is chunking strategy so critical and how is it designed?

Chunk size **directly determines retrieval quality**. Chunks that are too small (50-100 tokens) lose context — the model gets half of the relevant paragraph and misses the other half. Chunks that are too large (1500+ tokens) reduce embedding quality and consume the LLM context window. The practical range: **300-600 tokens** (roughly 200-400 words) with 10-20% overlap. But proper chunking is not just about size; **semantic boundaries** (paragraph, heading, list end) must be preserved. For contracts, legal texts and technical documentation, **structure-aware chunking** (by Markdown heading, by HTML section) gives 2-3× better recall than fixed-size chunking. For free-form text like chat logs and emails, fixed-size + overlap is enough. In our AIGENCY pipeline we use a doc-type classifier plus a chunker specialised per type.

What is hybrid search and why isn't pure vector search enough?

Vector search finds semantic similarity: a query like *KVKK-compliant CRM* will surface a document titled *personal data protection system* even with no word overlap. But some situations require a **literal keyword match**: product codes (XR-2050-A), person names (Mehmet Tunç), legal references (Law 5651). Pure vector search weakens in these cases because embedding models represent proper nouns and codes poorly. The solution is **hybrid search**: BM25 (classic full-text) and vector search run together, and results are merged with **Reciprocal Rank Fusion (RRF)** or a weighted score. In production, hybrid search gives **15-30% better top-5 recall** than pure vector. Qdrant, Weaviate and Elasticsearch support hybrid natively; pgvector requires manual implementation (separate tsvector full-text and cosine-distance queries, merged in code).

Is a reranker layer really necessary, and how much value does it add?

The three-stage pipeline (retrieval → rerank → generation) is now the **default** for production RAG. In the retrieval stage, the top-50 / top-100 documents are pulled for speed (vector + hybrid); then a reranking model, such as the cross-encoders Cohere Rerank and BGE Reranker or the late-interaction model ColBERT, re-orders this list, and the top-5 / top-10 go to the LLM. A cross-encoder reranker is much more precise than bi-encoder embeddings because it evaluates each query-document pair individually. Cost: +30-100ms additional latency per query and +USD 0.001-0.005 fee (with Cohere). Value: in our measurements a reranker lifts top-3 recall by **25-40%**, which directly improves the rate at which the LLM finds the correct answer. RAG without a reranker is acceptable for a small PoC; for production, a reranker layer is always recommended.

How do we evaluate a RAG system — how do we know it actually works?

Three layers: (1) **Retrieval evaluation** — a golden dataset (50-200 manually labelled query-doc pairs), top-K recall, MRR (Mean Reciprocal Rank) and NDCG metrics. Frameworks like Ragas, TruLens and Phoenix automate this layer. (2) **Generation evaluation** — accuracy of LLM output, faithfulness (loyalty to source), answer relevance. The LLM-as-judge pattern (asking GPT-4 or Claude *is this answer faithful to the source?*) is fast but subjective; human evaluation is mandatory every quarter for calibration. (3) **Production monitoring** — real query logs, latency distribution, user feedback (thumbs up/down), *I don't know* / hallucination rate. Without all three layers running together, it takes months to notice that RAG has degraded — an embedding model change, a doc store update or a prompt change can silently reduce quality. For our enterprise clients, the minimum standard is: weekly automated retrieval eval, monthly LLM-judge eval, continuous production monitoring.

Artificial Intelligence

Enterprise RAG Architecture: Vector DB, Chunking and Evaluation Guide

Seven practical decisions for production RAG systems: vector DB selection, chunking strategy, hybrid search, reranker, evaluation framework. Lessons from enterprise AI projects. An eCloud Tech engineering note.

Published: May 24, 2026 · 12 min read

rag, vector-database, chunking, hybrid-search

Over the past three years RAG (Retrieval-Augmented Generation) has become the dominant architecture for enterprise AI projects. Requirements like KVKK compliance, sensitive-data governance and working with in-house knowledge bases have made it a far more practical path than fine-tuning open models: leave the base model as it is, feed the organisation's documents to the model at call time, and have it produce the answer. The idea is simple, but projects fail when it is executed without discipline. The difference between production-grade RAG and demo-grade RAG is architectural quality: every layer, from vector DB selection and chunking strategy to hybrid search, reranking, the evaluation framework and production monitoring, must be measured and tested.

Within our RAG systems engineering and vector database engineering services we have worked on 12 enterprise RAG projects over the past 18 months, and we operate our own AIGENCY v4 platform on a RAG-heavy architecture. In this article we walk through seven critical decisions for enterprise RAG architecture in order: the RAG vs fine-tuning decision, doc store selection, chunking strategy, embedding model, hybrid search, reranker layer, and the evaluation framework (including production monitoring).

1. Strategic decision — RAG, fine-tuning or hybrid?

Figure 1 — Strategic decision — RAG, fine-tuning or hybrid?

The first question is not "let's use RAG"; it should be "is RAG the right tool for this problem?" Three approaches are on the table:

Pure RAG — the base model is unchanged; for every query, organisation documents are pulled at call time. Ideal for scenarios with frequently updated data, audit requirements and traceability (legal, customer support, corporate knowledge base).
Fine-tuning — the base model is specialised for a specific tone, format or domain task. Good for stable, narrow tasks (medical reporting, code-style transformation, product description generation).
Hybrid (RAG + fine-tuning) — the fine-tuned model holds the base knowledge (domain terminology, format), while RAG provides continuously updated data. The dominant approach for complex, high-volume enterprise scenarios.

Decision test: if knowledge updates daily, RAG; if behaviour/format must be learned, fine-tuning; if both are required, hybrid. About 70% of enterprise projects start with pure RAG; once they hit real production volume, about half move to hybrid. Pure fine-tuning is a minority in enterprise use because retraining on every knowledge change is costly and audit is hard.

Second practical point: RAG is no silver bullet. Applied wrongly, hallucination rates rise (the LLM treats irrelevant docs as sources), latency increases and costs explode. Applied correctly, it makes the model say "I don't know" where it should, and answer with citations where it does know. The difference is architectural discipline.

2. Doc store selection — pgvector vs Qdrant vs Milvus vs managed services

Figure 2 — Doc store selection — pgvector vs Qdrant vs Milvus vs managed services

Vector database selection is the backbone decision of RAG architecture; changing it later means migration pain and months of friction. Four main options:

PostgreSQL + pgvector — installed as an extension on top of your existing PostgreSQL. One database, one backup, one operation. Practical at 100K-500K embedding scale. Advantages: SQL-level metadata filter, transactional consistency, the tool your team already knows. Drawbacks: HNSW indexing slows down at 1M+ scale, and filter + vector search combinations require special tuning. Our recommendation: default choice for small-to-medium enterprise projects.

Qdrant — written in Rust, a vector-first database. Excellent HNSW index performance, native payload filter, REST + gRPC APIs. Sweet spot at 1M-50M embeddings. Self-hosted or managed cloud. Has gained popularity in enterprise projects since 2023. In our AIGENCY pipeline we hold 8M embeddings in a Qdrant cluster.

Milvus — distributed architecture designed for very large scale (10M-1B+ embeddings). Operational load is high (Kubernetes, etcd, MinIO components); hard without a dedicated DevOps presence on the team. Advantage: GPU-accelerated indexing and advanced index options (IVF/HNSW/DiskANN).

Pinecone / Weaviate Cloud / Zilliz Cloud — managed services. No operational load, production-ready in minutes. Cost: monthly USD 70-1,500 for small-to-medium load, USD 3,000-15,000 for enterprise SLA. Data location is critical from the KVKK perspective — Pinecone has an EU region, Weaviate has self-hosted or AWS Frankfurt options; a DPA (Data Processing Agreement) is mandatory in the contract.

Decision matrix:

Criterion	pgvector	Qdrant	Milvus	Managed
Sweet spot scale	<500K	1M-50M	10M-1B+	Any scale
Operational load	Low	Medium	High	None
Monthly infra cost	TRY 500-3,000	TRY 3,000-15,000	TRY 10,000-40,000	USD 70-15,000
KVKK control	Full	Full	Full	Contract-dependent
Hybrid search	Manual	Native	Native	Native
Latency p95	30-100ms	10-50ms	5-30ms	20-100ms

Practical recommendation: start with pgvector. As you approach the 500K embedding limit, plan a Qdrant migration. Above 10M, consider Milvus or a managed service. Choosing a large-scale system up front creates ops fatigue and unnecessary cost in most projects.

3. Chunking strategy — fixed size isn't enough

The single most critical factor in RAG quality is chunking strategy. Bad chunks waste even the world's best embedding model. Three core patterns:

Fixed-size chunking — the document is split into a fixed token count (e.g. 500 tokens, 15% overlap). Simplest, fastest; acceptable for free-form text (emails, chat logs). Drawback: cuts can land in the middle of paragraphs or sentences; semantic boundaries shift.
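
A minimal sketch of the fixed-size pattern, assuming the text is already tokenised. The function name `chunk_fixed` and the whitespace-style tokens are illustrative stand-ins; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split a token list into windows of `size` tokens with fractional overlap."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    # Each new window starts `step` tokens after the previous one.
    step = max(1, size - int(size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for a real tokenizer's output (assumption).
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, size=500, overlap_ratio=0.15)
```

With 500-token windows and 15% overlap, consecutive chunks share 75 tokens, which is what softens a cut that lands mid-paragraph.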

Structure-aware chunking — split by document structure: by Markdown heading (#, ##), by HTML section/article tag, in PDF by font changes or bullet points. Gives 2-3× better recall for technical docs, legal texts and product specs. Implementation: ready-made parsers exist in Unstructured.io, LlamaIndex node parsers, LangChain text splitters.
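
For Markdown sources, heading-based splitting could look roughly like the sketch below. It is a simplified stand-in for the library parsers named above, not their API: `chunk_by_heading` is a hypothetical helper, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section, keeping the heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for the current section
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that belongs to the previous heading
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Contract
Intro text.
## Article 1
Payment terms.
## Article 2
Termination.
"""
sections = chunk_by_heading(doc)
```

Keeping the heading path next to the text means each chunk still carries its place in the document after it is separated from its neighbours.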

Semantic chunking — embedding-aware splitting; a new chunk starts when the cosine distance between two consecutive sentences' embeddings exceeds a threshold. Most sophisticated but compute-heavy (extra embedding pass per document). Delivers real value for complex, long texts (academic articles, reports).
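
The threshold rule can be sketched as follows. The bag-of-words `toy_embed` is only a placeholder for a real embedding model, and the 0.8 threshold is an illustrative value that would need tuning on real data.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Bag-of-words stand-in for a real embedding model (assumption)."""
    return dict(Counter(sentence.lower().split()))


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def chunk_semantic(
    sentences: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,
) -> list[list[str]]:
    """Start a new chunk when two consecutive sentences are farther apart than `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # the extra embedding pass per document
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "the invoice total is due",
    "the invoice is due in thirty days",
    "qdrant stores vectors on disk",
    "qdrant vectors use hnsw",
]
semantic_chunks = chunk_semantic(sentences)
```

In this toy run the split lands between the invoice sentences and the Qdrant sentences, because that is the only adjacent pair with no shared vocabulary.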

Optimal chunk size is domain-dependent:

Document type	Recommended chunk	Overlap	Strategy
Contract, legal text	600-900 tokens	20%	Structure-aware (article/clause)
Technical documentation	400-600 tokens	15%	Structure-aware (headings)
Customer support FAQ	200-400 tokens	10%	Q-A pair as one chunk
Email, chat log	300-500 tokens	15%	Fixed + sliding window
Academic article	500-800 tokens	20%	Semantic
Code file	Function as a whole	0	AST-aware (Tree-sitter)

In our AIGENCY pipeline a doc-type classifier first classifies the document (contract / email / FAQ / log / article) and then runs a chunker specialised for that type. This approach lifted top-5 recall by 35% compared with one-size-fits-all fixed chunking.
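
The classify-then-dispatch idea can be sketched with a small policy table. This is not the AIGENCY implementation: the keyword rules in `classify_doc_type` are a hypothetical placeholder for a trained classifier, and the policy values are simply copied from the table above (the code-file row is left out because it has no token range).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPolicy:
    min_tokens: int
    max_tokens: int
    overlap: float
    strategy: str


# Values taken from the chunk-size table; the keys are illustrative labels.
POLICIES = {
    "contract": ChunkPolicy(600, 900, 0.20, "structure-aware"),
    "technical_doc": ChunkPolicy(400, 600, 0.15, "structure-aware"),
    "faq": ChunkPolicy(200, 400, 0.10, "qa-pair"),
    "email": ChunkPolicy(300, 500, 0.15, "fixed-sliding-window"),
    "article": ChunkPolicy(500, 800, 0.20, "semantic"),
}


def classify_doc_type(text: str) -> str:
    """Keyword heuristic standing in for a trained classifier (assumption)."""
    lowered = text.lower()
    if "hereinafter" in lowered or "the parties agree" in lowered:
        return "contract"
    if lowered.startswith(("from:", "subject:")):
        return "email"
    if lowered.lstrip().startswith("q:"):
        return "faq"
    if "abstract" in lowered[:200]:
        return "article"
    return "technical_doc"


def policy_for(text: str) -> ChunkPolicy:
    """Pick the chunking policy for a document based on its detected type."""
    return POLICIES[classify_doc_type(text)]
```

Keeping the policies in one table makes it easy to re-run the retrieval eval after changing a single row.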

Third practical point: metadata enrichment. Attach not just text but also doc title, section heading, document type, source URL, language, last_updated date and author to every chunk. At retrieval time this metadata is gold for filters (e.g. only documents from the last 6 months); at generation time it is gold for source attribution.

4. Embedding model — multilingual, dense, sparse, hybrid

Embedding model quality is the ceiling of retrieval quality. With the wrong model, even the best chunking gives low recall. As of 2026, four practical families:

OpenAI text-embedding-3-large: 3072 dimensions, strong in English, Turkish and Arabic, ~USD 0.00013 per 1K tokens. Popular in production, but every query goes to OpenAI; this can create a data-location issue from the KVKK perspective and requires a DPA in the contract.

Cohere embed-multilingual-v3 — 1024 dimensions, 100+ languages; slightly weaker than OpenAI in Turkish and Arabic but still strong. Cohere's EU region offering is an advantage for KVKK. Cost is close to OpenAI.

Open-source (BGE-M3, mE5-large, multilingual-MiniLM): self-hosted, running on GPU. BGE-M3 can operate in dense, sparse and multi-vector modes (the multi-vector mode resembles ColBERT-style late interaction). Advantages: full KVKK control, no per-query cost, you can fine-tune. Drawbacks: GPU operational load; raw latency can be 2-5× higher than OpenAI/Cohere.

Domain-specific fine-tuned — the organisation contrastive-fine-tunes BGE or E5 on its own corpus. Training for 1-3 epochs on 5K-50K query-positive_doc pairs lifts recall by 10-25%. Real value exists for domain-specific language (legal, finance, medical).

Practical recommendation: OpenAI or Cohere for pilots and PoCs; BGE-M3 self-hosted for KVKK-critical enterprise; fine-tuned BGE or mE5 for real volume + domain-specific precision. Switching the embedding model requires re-embedding the whole corpus, so bake this cost and time into the plan from the start (for example, roughly 2 GPU hours to re-embed 1M docs, multiplied by the GPU hourly cost).

5. Hybrid search — vector + BM25 + RRF

Pure vector search misses certain queries:

Exact keyword match required (XR-2050-A product code, Law 5651)
Person/place names (embedding models represent proper nouns poorly)
Acronyms and technical terms (KVKK, MASAK, TS 13638)
Negation (systems not KVKK-compliant — embeddings handle negation weakly)

The solution is hybrid search: BM25 (sparse) + vector (dense) run together, results merged. Two merging approaches:

Reciprocal Rank Fusion (RRF) — each system's document rank is run through 1/(k+rank) (k=60 default), and the two scores are summed. No score normalisation needed; robust. Default choice in production.
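
The RRF formula above fits in a few lines. This is a generic sketch, not any particular vector DB's fusion API; the document IDs are made up.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = rrf([bm25_ranking, vector_ranking])
```

Here `doc_b` comes out first (ranks 2 and 1), ahead of `doc_a` (ranks 1 and 3): a document that both retrievers rank highly beats one that only a single retriever strongly prefers.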

Weighted score — BM25 score and vector score are normalised separately, then merged with a weight (e.g. 0.4 BM25, 0.6 vector). Weights must be tuned per domain; more flexible but tuning-heavy.
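
A sketch of the weighted variant, assuming min-max normalisation and treating a document missing from one result list as scoring 0 there. Both choices are assumptions; other normalisation schemes are equally valid.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def weighted_merge(
    bm25: dict[str, float],
    vector: dict[str, float],
    w_bm25: float = 0.4,
    w_vector: float = 0.6,
) -> list[tuple[str, float]]:
    b, v = min_max(bm25), min_max(vector)
    merged = {
        doc: w_bm25 * b.get(doc, 0.0) + w_vector * v.get(doc, 0.0)
        for doc in b.keys() | v.keys()
    }
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


bm25_scores = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}
vector_scores = {"doc_b": 0.82, "doc_c": 0.80, "doc_d": 0.60}
merged = weighted_merge(bm25_scores, vector_scores)
```

The weights are the tuning knob: moving from 0.4/0.6 towards BM25 would push the exact-match hit `doc_a` up the list.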

Implementation:

Vector DB	Hybrid support
pgvector	Manual (tsvector + cosine, merged in code)
Qdrant	Native (Fusion API)
Weaviate	Native (hybrid query)
Milvus	Native (BM25 + vector + RRF)
Elasticsearch	Native (knn + match query)

In our measurements, hybrid search lifted top-5 recall by 18-32% and top-1 (best result) accuracy by 25-40% versus pure vector. Outside PoC, we treat it as mandatory for every production RAG.

6. Reranker — the pipeline's precision filter

The retrieval stage is fast + broad (top-50, top-100 documents); the reranker is slow + precise (a cross-encoder scores each query-document pair individually and trims to top-5). The reasoning behind the two-stage approach is simple: a bi-encoder converts query and doc to separate vectors and then compares them by cosine distance, so fine-grained interaction between the two is lost. A cross-encoder feeds query and doc to the model together and produces a more precise score; but because each query-document pair requires a separate model call, it is slower.

Practical flow:

Query → embedding → vector DB top-100 retrieval (10-50ms)
Top-100 → reranker model → score per pair (50-300ms for 100 docs)
Top-5 / top-10 → placed in LLM context
LLM produces the answer
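
The flow above can be sketched as a two-stage function. The retriever and pair scorer are passed in as callables; `toy_retrieve` and `toy_score` are hypothetical stand-ins for a vector DB query and a cross-encoder, used only to make the example runnable.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]   # (query, k) -> candidate documents
PairScorer = Callable[[str, str], float]      # (query, doc) -> relevance score


def retrieve_then_rerank(
    query: str,
    retrieve: Retriever,
    score_pair: PairScorer,
    retrieve_k: int = 100,
    final_k: int = 5,
) -> list[str]:
    """Broad, cheap retrieval followed by precise per-pair scoring."""
    candidates = retrieve(query, retrieve_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]  # these go into the LLM context


corpus = [
    "reset your password from the login page",
    "password policy requires twelve characters",
    "invoice payment terms are thirty days",
    "login page shows an error after reset",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    return corpus[:k]  # stand-in for the vector DB top-k call


def toy_score(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)  # word overlap as a stand-in for a cross-encoder


top = retrieve_then_rerank("how to reset password", toy_retrieve, toy_score, final_k=2)
```

The shape is the point: the scorer runs once per candidate, which is why the candidate list is capped at around 100 before reranking.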

Popular rerankers:

Cohere Rerank v3 — managed API, USD 0.001-0.002 per query, 100ms latency. EU region for KVKK. The most common choice in production.
BGE Reranker (open-source) — self-hosted, 50-150ms on GPU. Full KVKK control, no per-query cost. Variants: base/large/v2-m3.
ColBERT v2 — late-interaction architecture, score per query-token per document-token. Most precise but operationally complex. Only for very high-precision use cases.
Cross-encoder ms-marco-MiniLM — small model, fast (reasonable even on CPU), limited recall gain.

In AIGENCY we use Cohere Rerank v3 with self-hosted BGE Reranker as fallback. Automatic fallback when Cohere is unavailable keeps the latency budget intact.
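
A fallback wrapper along these lines is one way to express that pattern. It is a sketch, not the AIGENCY code: it assumes the primary client enforces its own request timeout and raises `TimeoutError` or `ConnectionError`, and both rerankers below are hypothetical stubs.

```python
from typing import Callable

Reranker = Callable[[str, list[str]], list[str]]  # (query, docs) -> docs in ranked order


def rerank_with_fallback(
    query: str,
    docs: list[str],
    primary: Reranker,
    fallback: Reranker,
) -> tuple[list[str], str]:
    """Try the managed reranker first; switch to the self-hosted one on failure."""
    try:
        return primary(query, docs), "primary"
    except (TimeoutError, ConnectionError):
        # Assumption: the primary client raises one of these when its timeout expires.
        return fallback(query, docs), "fallback"


def flaky_primary(query: str, docs: list[str]) -> list[str]:
    raise TimeoutError("primary reranker timed out")  # simulated outage


def local_fallback(query: str, docs: list[str]) -> list[str]:
    return sorted(docs, key=len)  # placeholder ordering, not a real reranker


ranked, used = rerank_with_fallback("q", ["bbb", "a", "cc"], flaky_primary, local_fallback)
```

Returning which path was used makes it possible to track the fallback rate in production monitoring.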

Proof of reranker value: on a 200-query benchmark, retrieval + LLM alone gave 62% accuracy; retrieval + rerank + LLM rose to 84%. For production RAG, the reranker is no longer optional — it is treated as a mandatory layer.

7. Evaluation framework — how do you know the system actually works?

The most frequently skipped step in production RAG is evaluation. The system is taken to production because it works in the demo; three months later the complaints start: "why is our model saying I don't know?" Solution: continuous, three-layer evaluation.

Layer 1 — Retrieval evaluation. A golden dataset (50-200 manually labelled query-positive_doc pairs) is prepared; the system returns top-K for each query; metrics:

Recall@K: is the correct document inside top-K?
MRR (Mean Reciprocal Rank): the average of 1/rank of the first correct document across queries; how close to the top does the correct document appear?
NDCG: graded ranking quality.
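
The three metrics can be computed directly from a golden dataset, as in this sketch using the standard textbook definitions. The document IDs and relevance grades are made up; the frameworks mentioned below ship their own implementations.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, 0 if none was returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the returned order divided by the DCG of the ideal order."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 1) for i, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# (system ranking, labelled relevant docs) per golden query; ids are illustrative.
golden_runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4"}),
    (["d5", "d6", "d8"], {"d0"}),
]
recall_3 = mean([recall_at_k(r, rel, 3) for r, rel in golden_runs])
mrr = mean([reciprocal_rank(r, rel) for r, rel in golden_runs])
ndcg_3 = ndcg_at_k(["d3", "d1", "d7"], {"d1": 3.0, "d7": 1.0}, 3)
```

Recall@K only asks whether the document was found; MRR and NDCG also penalise finding it late, which matters once only the top-5 reach the LLM.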

Automated: Ragas, TruLens, Phoenix (Arize) and DeepEval frameworks provide ready-made retrieval + generation eval patterns. Weekly automated runs + dashboard is the minimum standard in enterprise projects.

Layer 2 — Generation evaluation. LLM answer quality:

Faithfulness: is the answer loyal to the retrieved context (no hallucination)?
Answer relevance: does the answer respond to the query?
Context precision/recall: was the necessary context used?

The LLM-as-judge pattern (asking GPT-4o or Claude Opus to "rate this answer's faithfulness to this source, 1-5") is fast and scales, but it is subjective: quarterly human calibration is mandatory (50-100 manually labelled samples).
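
A sketch of the judge prompt and the calibration check, with the model call itself left out. The prompt wording, the function names and the agreement statistics are illustrative assumptions rather than a fixed standard.

```python
import re
from typing import Optional

JUDGE_TEMPLATE = (
    "Rate how faithful the ANSWER is to the SOURCE on a 1-5 scale.\n"
    "Reply with a single digit.\n\nSOURCE:\n{source}\n\nANSWER:\n{answer}\n"
)


def build_judge_prompt(source: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(source=source, answer=answer)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def calibration(judge: list[int], human: list[int]) -> dict[str, float]:
    """Compare judge scores with human labels on the same samples."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge, human)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }


report = calibration(judge=[5, 4, 2, 3, 5], human=[5, 3, 2, 1, 4])
```

If agreement drops between calibration rounds, the judge prompt or the judge model has drifted and the automated scores should not be trusted until it is fixed.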

Layer 3 — Production monitoring. On real user queries:

Latency distribution (p50, p95, p99) — retrieval, rerank and LLM stages separately.
Cost per query — token consumption and vector DB query cost.
Thumbs up/down feedback rate — user feedback.
I don't know / clarification rate — is the model too uncertain or too aggressive?
Hallucination spot-check — weekly manual review of 20-50 random queries.
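
Per-stage latency percentiles can be computed from raw timings as in this sketch. It uses the nearest-rank percentile definition; the stage names and sample values are synthetic.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """p50 / p95 / p99 per pipeline stage."""
    return {
        stage: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for stage, values in samples_ms.items()
    }


# Synthetic timings in milliseconds, one list per stage (illustrative only).
timings = {
    "retrieval": [float(ms) for ms in range(1, 101)],
    "rerank": [80.0, 95.0, 110.0, 300.0],
}
stage_report = latency_report(timings)
```

Reporting the stages separately shows whether a p99 spike comes from the vector DB, the reranker or the LLM call.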

Without all three layers running together, RAG degradation takes months to detect. An embedding model upgrade, a chunking parameter tweak, a vector DB index rebuild — any of these can silently reduce quality. Integrating an eval suite into the CI/CD pipeline (retrieval eval run on every PR, block on regression) enables disciplined RAG evolution.
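
A regression gate for CI can be as small as the sketch below. The metric names, baseline values and 0.02 tolerance are illustrative assumptions; a CI job would exit non-zero when the returned list is non-empty.

```python
def regression_gate(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return one message per metric that is missing or dropped beyond the tolerance."""
    failures = []
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base - tolerance:
            failures.append(f"{metric}: {value:.3f} is below baseline {base:.3f} minus {tolerance:.3f}")
    return failures


baseline_metrics = {"recall@5": 0.86, "mrr": 0.71}
pr_metrics = {"recall@5": 0.81, "mrr": 0.72}
failures = regression_gate(baseline_metrics, pr_metrics)
```

The tolerance absorbs run-to-run noise on a small golden dataset; without it, a 50-200 query set would block PRs on single-query fluctuations.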

Our AI governance framework details how these evaluation layers integrate with enterprise audit requirements. Combined with the KVKK Data Controller obligation, this evaluation also doubles as legal evidence: a documented answer to "how does our model work, and to which source is it faithful?"

Practical summary — a starting checklist

The correct order for your first production RAG project:

Is RAG actually right? If knowledge updates often, lean RAG; if behaviour or format must be learned, lean fine-tuning; if both, hybrid.
Doc store: start with pgvector (<500K embeddings); migrate to Qdrant as volume grows. Pick a managed service only when KVKK DPA is in place.
Chunking: doc-type classifier + per-type chunker. Fixed-size only for free-form text. Structure-aware as default whenever possible.
Embedding: start with OpenAI/Cohere (PoC); for production consider BGE-M3 self-hosted or domain fine-tuned.
Hybrid search: BM25 + vector + RRF. Always, outside PoC.
Reranker: Cohere Rerank or BGE Reranker. Mandatory in production.
Evaluation: golden dataset → retrieval eval → LLM-judge → production monitoring. All three layers, automated: retrieval eval weekly, LLM-judge monthly, monitoring continuously.
Metadata: attach doc title, section, type, source URL, date and language to every chunk. Gold for filter and attribution.
KVKK: data location (model, vector DB, logs), DPA, retention policy, traceability of user-derived data.
Continuous improvement: close the feedback loop — analyse low-rated production queries, update the golden dataset, iterate chunker/embedding/reranker.

This list is the minimum discipline. Domain-specific requirements (citation for legal RAG, confidence threshold for medical RAG, escalation pattern for customer support RAG) are added on top. To extract real value from RAG in production, you need to move past "the demo works" and reach "the evaluation framework runs every week and alerts the moment something breaks".

Our team in Şanlıurfa Karaköprü has internalised this discipline by operating our AIGENCY v4 platform on a RAG-heavy architecture and by delivering enterprise RAG systems engineering. For an enterprise RAG pilot, vector DB selection or production-grade RAG architecture assessment, you can reach us through the contact form — the first assessment call is free of charge.

eCloud Tech — A team based in Şanlıurfa, Türkiye, working on enterprise software, AI, blockchain and cybersecurity. Building Tomorrow.

Frequently Asked Questions

The decision rests on three variables. (1) Scale: up to 100K embeddings, PostgreSQL + the pgvector extension is enough; one database, one backup, one ops team. Between 1M and 10M, Qdrant or Milvus are preferred — HNSW/IVF indexing and filter performance are much better than a classical DB. Above 10M, Milvus or managed services (Pinecone, Weaviate Cloud) are realistic. (2) Operational load: pgvector runs on the PostgreSQL you already have; Qdrant requires its own cluster with separate backup/upgrade routines. For small teams, pgvector is a lifesaver. (3) Filter + metadata requirements: Qdrant's payload filter is more predictable in production; pgvector + GiST/SP-GiST combinations are also strong but require tuning effort. Practical recommendation: under 500K, start with pgvector and migrate to Qdrant if needed.

Related articles

Enterprise AI Agent Automation: Multi-Agent Architecture, Tool Calling and Human-in-the-Loop
Seven practical decisions for production-grade AI agent systems: single vs multi-agent, tool calling contracts, orchestrator patterns, human-in-the-loop gates, audit and KVKK compliance, observability and evaluation. Enterprise lessons from the AIGENCY v4 platform. An eCloud Tech engineering note.

KVKK-Compliant Artificial Intelligence Guide: A 5-Layer Practical Architecture
A five-layer architectural guide for KVKK-compliant enterprise AI systems: data residency, explicit consent, anonymisation, cross-border API risk and audit trail. An engineering note from eCloud Tech.