RAG pipeline with Gemma 4 and Elasticsearch: from concept to production

In early 2025, the BKGlobal team began experimenting with integrating an LLM into an internal customer-support system. The first problem we hit was heavy hallucination: the model answered confidently but was completely wrong, especially on questions about project-specific business rules. Prompt engineering alone cannot fix that when the model knows nothing about the domain in question.

The solution was RAG (Retrieval-Augmented Generation). Because Elasticsearch had already been the team's core infrastructure for years, adding vector search to Elasticsearch to serve RAG was the most natural choice. This post shares the whole process, from architecture and C#/.NET code to the pitfalls of working with Vietnamese.

1. Why RAG, and why Elasticsearch?

What RAG is and why it is needed

RAG (Retrieval-Augmented Generation) addresses two of the biggest weaknesses of a plain LLM: knowledge cutoff and domain blindness. Instead of relying entirely on knowledge baked into the model at training time, RAG splits the pipeline into 2 steps:

Retrieval: find the most relevant passages in the knowledge base
Generation: pass those passages to the model as context so it can produce an accurate answer

User Query
    │
    ▼
[Embedding] ──► [Vector Search] ──► [Top-K Documents]
                      │                      │
                      │                      ▼
                 Elasticsearch         [Prompt Builder]
                                             │
                                             ▼
                                       [Gemma 4 LLM]
                                             │
                                             ▼
                                       Final Answer
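The flow in the diagram can be sketched end to end with toy stand-ins. This is only an illustration under stated assumptions, not the team's implementation: `embed`, `retrieve`, and `build_prompt` are hypothetical names, a bag-of-words vector replaces the real bi-encoder, and an in-memory list replaces Elasticsearch.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: an L2-normalized bag of words (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Both inputs are unit-length, so the dot product is the cosine similarity."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Retrieval step: rank documents by similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Generation step input: retrieved passages become the model's context."""
    joined = "\n---\n".join(context)
    return f"CONTEXT:\n{joined}\n\nQUESTION: {query}\nANSWER:"


docs = [
    "refund requests are processed within 7 days",
    "the office opens at 9 am",
    "refund policy applies to annual plans only",
]
question = "how long does a refund take"
top = retrieve(question, docs, k=2)
prompt = build_prompt(question, top)
```

The prompt would then be sent to the LLM; only the two refund passages reach it, which is the whole point of the retrieval step.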

Why Elasticsearch is a good fit for RAG

Elasticsearch is more than a search engine: since version 8.0 it can also serve as a vector database, with these features:

dense_vector field: stores embedding vectors with HNSW indexing support
kNN search: approximate nearest neighbor search with high accuracy
Hybrid search: combines BM25 lexical and vector semantic search in a single query
Mature ecosystem: the team has run ES for years, so monitoring, scaling, and backup are already in place

Compared with adding a separate vector database (Pinecone, Qdrant, Weaviate), reusing the existing Elasticsearch cluster significantly reduces infrastructure complexity for the team.

2. Overall architecture of the RAG pipeline

Before diving into code, it helps to understand the pipeline's two phases:

Indexing phase (offline)

Documents
    │
    ▼
[Text Chunking]        ← Split on semantic boundaries
    │
    ▼
[Embedding Generation] ← Vietnamese bi-encoder model
    │
    ▼
[Elasticsearch Index]  ← dense_vector + text + metadata

Querying phase (online/real-time)

User Question
    │
    ├──► [Embed Query]
    │           │
    │           ▼
    │    [kNN + BM25 Hybrid Search] → Elasticsearch
    │           │
    │           ▼
    └──► [Top-K Chunks] ──► [Build Prompt] ──► [Gemma 4] ──► Answer

3. Elasticsearch: setting up the index mapping for vector search

Index mapping with dense_vector

This is the most important step: the mapping must be defined correctly before any document is indexed:

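PUT /knowledge-base
{
  "settings": {
    "analysis": {
      "filter": {
        "vi_stop": {
          "type": "stop",
          "stopwords": [
            "và", "của", "là", "trong", "có", "cho", "với", "từ",
            "được", "các", "một", "những", "này", "đó", "khi",
            "tôi", "bạn", "họ", "chúng", "mình", "rất", "như"
          ]
        }
      },
      "analyzer": {
        "vi_custom_analyzer": {
          "type": "custom",
          "tokenizer": "vi_tokenizer",
          "filter": ["lowercase", "vi_stop"]
        }
      }
    }
  },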
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "vi_custom_analyzer"
      },
      "title": {
        "type": "text",
        "analyzer": "vi_custom_analyzer",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },
      "chunk_index": { "type": "integer" },
      "source_id": { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}

Important notes:

dims: 768 must match the output dimension of the embedding model
similarity: cosine suits most text embedding models
index: true enables HNSW indexing for approximate kNN (required for kNN search)
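A mismatch between the vector length and dims only surfaces as an indexing error, so a cheap guard before indexing can help. The sketch below is an assumption-laden illustration: `validate_embedding` and `MAPPING_DIMS` are hypothetical names, and the zero-vector check reflects the fact that cosine similarity is undefined for a vector of zero magnitude.

```python
import math
from typing import Sequence

MAPPING_DIMS = 768  # assumed to mirror "dims" in the index mapping


def validate_embedding(vector: Sequence[float], dims: int = MAPPING_DIMS) -> None:
    """Raise ValueError if the vector cannot be indexed under a cosine dense_vector field."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("zero-magnitude vector: cosine similarity is undefined")


validate_embedding([0.1] * 768)  # passes silently
```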

4. Embedding generation for Vietnamese

Choosing an embedding model for Vietnamese

This is an important difference from English. General-purpose sentence-embedding models trained mostly on English text (such as all-MiniLM-L6-v2) perform poorly on Vietnamese because:

Vietnamese has a complex system of tone marks
Word segmentation is entirely different from English (compound words)
Little Vietnamese text is present in the training corpus

The BKGlobal team evaluated the options and chose bkai-foundation-models/vietnamese-bi-encoder, which is trained on a large Vietnamese corpus and outputs 768 dimensions:

# embedding_service.py
# Service that creates embeddings for Vietnamese documents and queries

from sentence_transformers import SentenceTransformer
from typing import List

EMBEDDING_MODEL = "bkai-foundation-models/vietnamese-bi-encoder"
EMBEDDING_DIMS = 768

class VietnameseEmbeddingService:
    def __init__(self):
        # Load the model once and reuse it across calls
        self.model = SentenceTransformer(EMBEDDING_MODEL)

    def embed_documents(self, texts: List[str], batch_size: int = 32) -> List[List[float]]:
        """
        Batch embedding for documents: optimizes throughput when indexing large sets.
        batch_size=32 is a good balance between memory and speed on CPU.
        Raise it to 64-128 if a GPU is available.
        """
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            normalize_embeddings=True  # L2-normalize so cosine similarity equals the dot product
        )
        return embeddings.tolist()

    def embed_query(self, query: str) -> List[float]:
        """Single embedding for a query: low latency matters here."""
        embedding = self.model.encode(
            query,
            normalize_embeddings=True
        )
        return embedding.tolist()
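What normalize_embeddings=True buys can be checked with plain arithmetic: once two vectors have unit length, their dot product is their cosine similarity. A small sketch with hypothetical helper names and made-up two-dimensional vectors:

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Cosine of the raw vectors equals the dot product of the normalized ones (24/25)
assert abs(cosine(a, b) - dot(a_unit, b_unit)) < 1e-12
```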

Chunking strategy for Vietnamese

Chunking has a large effect on retrieval quality. The BKGlobal team chunks on paragraph and sentence boundaries instead of using fixed-size windows:

# document_chunker.py
# Chunk a document along paragraph and sentence boundaries

import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DocumentChunk:
    content: str
    chunk_index: int
    source_id: str
    char_start: int
    char_end: int

def _split_spans(pattern: str, text: str, start: int, end: int) -> List[Tuple[int, int]]:
    """Split text[start:end] on `pattern`; return whitespace-trimmed, non-empty (start, end) spans."""
    spans, pos = [], start
    separators = [(m.start(), m.end()) for m in re.compile(pattern).finditer(text, start, end)]
    for sep_start, sep_end in separators + [(end, end)]:
        piece = text[pos:sep_start]
        if piece.strip():
            span_start = pos + len(piece) - len(piece.lstrip())
            spans.append((span_start, span_start + len(piece.strip())))
        pos = sep_end
    return spans

def chunk_vietnamese_text(
    text: str,
    source_id: str,
    max_chunk_size: int = 512,
    overlap: int = 64
) -> List[DocumentChunk]:
    """
    Split Vietnamese text by paragraph first, then by sentence when a paragraph is too long.
    max_chunk_size and overlap are measured in characters (len), not tokens.
    Every chunk is the exact slice text[char_start:char_end]; a new chunk reaches back
    up to `overlap` characters to keep context between adjacent chunks.
    """
    units: List[Tuple[int, int]] = []
    # Split on paragraphs (2+ newlines)
    for p_start, p_end in _split_spans(r'\n{2,}', text, 0, len(text)):
        if p_end - p_start > max_chunk_size:
            # A single paragraph that is too long: split it into sentences
            units.extend(_split_spans(r'(?<=[.!?])\s+', text, p_start, p_end))
        else:
            units.append((p_start, p_end))

    chunks: List[DocumentChunk] = []

    def flush(start: int, end: int) -> None:
        chunks.append(DocumentChunk(text[start:end], len(chunks), source_id, start, end))

    chunk_start = chunk_end = -1  # -1 means "no open chunk"
    for u_start, u_end in units:
        if chunk_start < 0:
            chunk_start = u_start
        elif u_end - chunk_start > max_chunk_size:
            flush(chunk_start, chunk_end)
            # Overlap: start the next chunk up to `overlap` characters before this unit
            chunk_start = max(chunk_start, u_start - overlap)
        chunk_end = u_end

    # Last chunk
    if chunk_start >= 0:
        flush(chunk_start, chunk_end)

    return chunks

5. C# implementation with the Elastic .NET client

The BKGlobal stack is .NET, so the entire RAG orchestration layer is written in C#. The team uses Elastic.Clients.Elasticsearch (the official client, which replaces NEST as of ES 8.x).

Model definitions

// Models/KnowledgeDocument.cs
// Document model matching the Elasticsearch index mapping

using System.Text.Json.Serialization;

public class KnowledgeDocument
{
    [JsonPropertyName("content")]
    public string Content { get; set; } = string.Empty;

    [JsonPropertyName("title")]
    public string Title { get; set; } = string.Empty;

    [JsonPropertyName("embedding")]
    public float[] Embedding { get; set; } = Array.Empty<float>();

    [JsonPropertyName("chunk_index")]
    public int ChunkIndex { get; set; }

    [JsonPropertyName("source_id")]
    public string SourceId { get; set; } = string.Empty;

    [JsonPropertyName("created_at")]
    public DateTime CreatedAt { get; set; }
}

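Index service: loading documents into Elasticsearch

// Services/DocumentIndexService.cs
// Indexes documents with their embeddings into Elasticsearch.
// DocumentChunk is assumed to be a C# counterpart of the Python dataclass (not shown here).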

public class DocumentIndexService
{
    private readonly ElasticsearchClient _client;
    private readonly IEmbeddingService _embeddingService;
    private readonly ILogger<DocumentIndexService> _logger;
    private const string IndexName = "knowledge-base";

    public DocumentIndexService(
        ElasticsearchClient client,
        IEmbeddingService embeddingService,
        ILogger<DocumentIndexService> logger)
    {
        _client = client;
        _embeddingService = embeddingService;
        _logger = logger;
    }

    public async Task IndexDocumentsAsync(
        IEnumerable<DocumentChunk> chunks,
        CancellationToken ct = default)
    {
        // Materialize once so the sequence is not enumerated twice
        var chunkList = chunks.ToList();

        // Batch embedding: call the Python service over HTTP or gRPC
        var texts = chunkList.Select(c => c.Content).ToList();
        var embeddings = await _embeddingService.EmbedBatchAsync(texts, ct);

        var documents = chunkList.Zip(embeddings, (chunk, embedding) => new KnowledgeDocument
        {
            Content = chunk.Content,
            Title = chunk.SourceId,
            Embedding = embedding,
            ChunkIndex = chunk.ChunkIndex,
            SourceId = chunk.SourceId,
            CreatedAt = DateTime.UtcNow
        }).ToList();

        // Bulk index: more efficient than indexing documents one at a time
        var bulkResponse = await _client.BulkAsync(b => b
            .Index(IndexName)
            .IndexMany(documents), ct);

        if (bulkResponse.Errors)
        {
            var errorCount = bulkResponse.ItemsWithErrors.Count();
            _logger.LogError("Bulk index had {ErrorCount} errors", errorCount);
            throw new InvalidOperationException($"Bulk index failed with {errorCount} errors");
        }

        _logger.LogInformation("Indexed {Count} chunks successfully", documents.Count);
    }
}

Search service: hybrid kNN + BM25

This is the core of the RAG pipeline. The team uses hybrid search rather than pure vector search:

// Services/HybridSearchService.cs
// Hybrid search combining BM25 lexical search and kNN vector search

public class HybridSearchService
{
    private readonly ElasticsearchClient _client;
    private readonly IEmbeddingService _embeddingService;
    private const string IndexName = "knowledge-base";
    private const int TopK = 10;
    private const int NumCandidates = 100;

    public HybridSearchService(
        ElasticsearchClient client,
        IEmbeddingService embeddingService)
    {
        _client = client;
        _embeddingService = embeddingService;
    }

    public async Task<IReadOnlyList<KnowledgeDocument>> SearchAsync(
        string userQuery,
        int maxResults = 5,
        CancellationToken ct = default)
    {
        // Embed the query first: the vector is needed to build the kNN clause
        var queryEmbedding = await _embeddingService.EmbedQueryAsync(userQuery, ct);

        // Hybrid search: kNN vector + BM25 keyword
        // With both clauses present, Elasticsearch adds each document's kNN and query scores (this is not RRF)
        var response = await _client.SearchAsync<KnowledgeDocument>(s => s
            .Index(IndexName)
            .Size(maxResults)
            // Vector search: match by semantic similarity
            .Knn(k => k
                .Field(f => f.Embedding)
                .QueryVector(queryEmbedding)
                .k(TopK)
                .NumCandidates(NumCandidates))
            // Keyword search: BM25 exact/fuzzy match
            .Query(q => q
                .Match(m => m
                    .Field(f => f.Content)
                    .Query(userQuery)
                    .Fuzziness(new Fuzziness("AUTO")))),
            ct);

        if (!response.IsValidResponse)
        {
            throw new InvalidOperationException(
                $"Elasticsearch search failed: {response.ElasticsearchServerError?.Error?.Reason}");
        }

        // Copy into a list so the result matches the IReadOnlyList return type
        return response.Documents.ToList();
    }
}

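A note on RRF: when knn and query are used together in one ES 8.x search request, Elasticsearch merges the results by adding each document's kNN score to its query score (each side can be weighted with boost). It does not apply Reciprocal Rank Fusion in this mode. To use RRF explicitly (Enterprise license), use the newer retrievers API.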
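For reference, Reciprocal Rank Fusion itself is easy to state: each document's fused score is the sum, over the result lists it appears in, of 1 / (rank_constant + rank). The sketch below is a plain-Python illustration of that formula, not Elasticsearch's implementation; the function name and the sample document IDs are made up, and rank_constant = 60 is the commonly used default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], rank_constant: int = 60
) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by descending score; break ties by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]  # lexical ranking
knn_hits = ["c", "a", "d"]   # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
# "a" and "c" appear in both lists, so they outrank "b" and "d"
```

Because RRF uses only ranks, it avoids having to put BM25 scores and cosine scores on a common scale, which is the main weakness of summing raw scores.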

RAG orchestration service

// Services/RagService.cs
// Orchestration layer: combines retrieval and generation

public class RagService
{
    private readonly HybridSearchService _searchService;
    private readonly IGemmaGenerationService _generationService;
    private readonly ILogger<RagService> _logger;

    // Standard prompt template for Vietnamese RAG
    private const string SystemPrompt = """
        Bạn là trợ lý hỗ trợ kỹ thuật của BKGlobal. 
        Trả lời câu hỏi DỰA TRÊN thông tin trong phần CONTEXT bên dưới.
        Nếu context không đủ thông tin, hãy nói rõ điều đó thay vì đoán mò.
        Trả lời bằng tiếng Việt, ngắn gọn và chính xác.
        """;

    public RagService(
        HybridSearchService searchService,
        IGemmaGenerationService generationService,
        ILogger<RagService> logger)
    {
        _searchService = searchService;
        _generationService = generationService;
        _logger = logger;
    }

    public async Task<RagResponse> AnswerAsync(
        string userQuestion,
        CancellationToken ct = default)
    {
        // Step 1: Retrieve relevant chunks
        var relevantDocs = await _searchService.SearchAsync(userQuestion, maxResults: 5, ct);

        if (!relevantDocs.Any())
        {
            _logger.LogWarning("No context found for question: {Question}", userQuestion);
            return new RagResponse
            {
                Answer = "Không tìm thấy thông tin liên quan trong knowledge base.",
                SourceChunks = Array.Empty<string>()
            };
        }

        // Step 2: Build the prompt from the retrieved context
        var context = BuildContext(relevantDocs);
        var prompt = $"""
            {SystemPrompt}

            CONTEXT:
            {context}

            CÂU HỎI: {userQuestion}

            TRẢ LỜI:
            """;

        // Step 3: Generate with Gemma 4
        var answer = await _generationService.GenerateAsync(prompt, ct);

        return new RagResponse
        {
            Answer = answer,
            SourceChunks = relevantDocs.Select(d => d.SourceId).Distinct().ToArray()
        };
    }

    private static string BuildContext(IEnumerable<KnowledgeDocument> docs)
    {
        // Keep the context bounded: at most 5 chunks of up to 512 characters each,
        // which fits comfortably within the model's context window
        return string.Join("\n\n---\n\n",
            docs.Select((doc, i) => $"[Nguồn {i + 1}: {doc.SourceId}]\n{doc.Content}"));
    }
}

public record RagResponse
{
    public string Answer { get; init; } = string.Empty;
    public string[] SourceChunks { get; init; } = Array.Empty<string>();
}
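BuildContext relies on the caller passing at most five chunks. When chunk sizes vary, an explicit budget is a little safer. The sketch below shows the same joining logic with a character budget; it is an illustration only, `build_context` and `max_chars` are hypothetical, and 2560 is simply 5 x 512 characters.

```python
from typing import Iterable, List, Tuple


def build_context(docs: Iterable[Tuple[str, str]], max_chars: int = 2560) -> str:
    """
    docs: (source_id, content) pairs in ranking order.
    Adds blocks until the next one would exceed max_chars
    (separators are not counted, so the budget is approximate).
    """
    parts: List[str] = []
    used = 0
    for i, (source_id, content) in enumerate(docs, start=1):
        block = f"[Nguồn {i}: {source_id}]\n{content}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)


ranked = [("faq-01", "x" * 100), ("faq-02", "y" * 100), ("faq-03", "z" * 100)]
context = build_context(ranked, max_chars=250)  # keeps the first two blocks only
```

Stopping at the first block that does not fit keeps the highest-ranked chunks, which is usually preferable to truncating a chunk mid-sentence.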

DI registration

// Program.cs (or Startup.cs)
// Register the services in the DI container

builder.Services.AddSingleton<ElasticsearchClient>(sp =>
{
    var settings = new ElasticsearchClientSettings(
        new Uri(builder.Configuration["Elasticsearch:Uri"]!))
        .DefaultIndex("knowledge-base");

    // Debug mode exposes raw requests/responses, so enable it in development only
    if (builder.Environment.IsDevelopment())
    {
        settings.EnableDebugMode();
    }

    return new ElasticsearchClient(settings);
});

builder.Services.AddHttpClient<IEmbeddingService, EmbeddingHttpService>(client =>
{
    // Python embedding service running locally or in a separate container
    client.BaseAddress = new Uri(builder.Configuration["EmbeddingService:Uri"]!);
    client.Timeout = TimeSpan.FromSeconds(30);
});

builder.Services.AddHttpClient<IGemmaGenerationService, VllmGenerationService>(client =>
{
    // vLLM serving Gemma 4, compatible with the OpenAI API spec
    client.BaseAddress = new Uri(builder.Configuration["VllmService:Uri"]!);
    client.Timeout = TimeSpan.FromSeconds(120);
});

builder.Services.AddScoped<HybridSearchService>();
builder.Services.AddScoped<DocumentIndexService>();
builder.Services.AddScoped<RagService>();

6. Python: the embedding service and calling Gemma 4 through vLLM

Embedding REST service

# embedding_api.py
# FastAPI service exposing an embedding endpoint for the C# client

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from typing import List
import uvicorn

app = FastAPI(title="Vietnamese Embedding Service")

# Load once at startup to avoid reloading the model on every request
model = SentenceTransformer("bkai-foundation-models/vietnamese-bi-encoder")

class EmbedRequest(BaseModel):
    texts: List[str]
    batch_size: int = 32

class EmbedResponse(BaseModel):
    embeddings: List[List[float]]
    model: str
    dimensions: int

@app.post("/embed", response_model=EmbedResponse)
def embed_texts(request: EmbedRequest):
    # Plain `def` (not `async def`): FastAPI runs it in a threadpool,
    # so the CPU-bound model.encode call does not block the event loop
    if not request.texts:
        raise HTTPException(status_code=400, detail="texts must not be empty")

    if len(request.texts) > 500:
        raise HTTPException(status_code=400, detail="At most 500 texts per request")

    embeddings = model.encode(
        request.texts,
        batch_size=request.batch_size,
        normalize_embeddings=True,
        show_progress_bar=False
    )

    return EmbedResponse(
        embeddings=embeddings.tolist(),
        model="bkai-foundation-models/vietnamese-bi-encoder",
        dimensions=embeddings.shape[1]
    )

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)
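Since the endpoint rejects more than 500 texts per request, a caller indexing a large corpus has to split its input. A minimal client-side sketch follows; `batched` and `embed_all` are hypothetical helpers, and `post_embed` stands for whatever function actually sends one batch to POST /embed and returns the "embeddings" list from the response.

```python
from typing import Callable, Iterator, List, Sequence

MAX_TEXTS_PER_REQUEST = 500  # limit enforced by the /embed endpoint


def batched(texts: Sequence[str], size: int = MAX_TEXTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def embed_all(
    texts: Sequence[str],
    post_embed: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed any number of texts, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts):
        vectors.extend(post_embed(batch))
    return vectors
```

Injecting `post_embed` keeps the batching logic independent of the HTTP library and makes it easy to test with a fake.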

Calling Gemma 4 through vLLM

vLLM serves Gemma 4 behind an OpenAI-compatible API, so clients stay simple. A Python test script for the endpoint:

# Test script - verify pipeline end-to-end

import httpx

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def generate_with_gemma4(prompt: str, max_tokens: int = 512) -> str:
    """
    Call Gemma 4 through the vLLM OpenAI-compatible endpoint.
    Gemma 4 is stated to support a context window of up to 128K tokens
    (check the model card of the variant you deploy).
    """
    payload = {
        "model": "google/gemma-4-9b-it",  # Or gemma-4-27b-it for higher quality
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.1,  # Low, to keep RAG output closer to deterministic
        "top_p": 0.9
    }

    response = httpx.post(VLLM_URL, json=payload, timeout=60.0)
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]

The C# counterpart, VllmGenerationService, calls the REST endpoint directly through HttpClient (an OpenAI SDK would work as well):

// Services/VllmGenerationService.cs
// Calls Gemma 4 through the vLLM OpenAI-compatible API

using System.Text;
using System.Text.Json;

public class VllmGenerationService : IGemmaGenerationService
{
    private readonly HttpClient _httpClient;

    public VllmGenerationService(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<string> GenerateAsync(string prompt, CancellationToken ct = default)
    {
        var request = new
        {
            model = "google/gemma-4-9b-it",
            messages = new[]
            {
                new { role = "user", content = prompt }
            },
            max_tokens = 512,
            temperature = 0.1
        };

        var json = JsonSerializer.Serialize(request);
        using var content = new StringContent(json, Encoding.UTF8, "application/json");

        var response = await _httpClient.PostAsync("/v1/chat/completions", content, ct);
        response.EnsureSuccessStatusCode();

        var responseBody = await response.Content.ReadAsStringAsync(ct);
        using var parsed = JsonDocument.Parse(responseBody); // JsonDocument is IDisposable

        return parsed.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString() ?? string.Empty;
    }
}

7. Handling Vietnamese in Elasticsearch

Installing the Vietnamese Analysis Plugin

# Install Duy Đỗ's Vietnamese analysis plugin
bin/elasticsearch-plugin install \
  https://github.com/duydo/elasticsearch-analysis-vietnamese/releases/download/v8.13.0/elasticsearch-analysis-vietnamese-8.13.0.zip

The plugin provides vi_tokenizer and vi_analyzer. The team describes the tokenizer as built on Prof. Lê Hồng Phương's vn_tokenizer library, with >95% accuracy for Vietnamese word segmentation; check the plugin README for the tokenizer bundled with the release you install.

Analyzer configuration

{
  "settings": {
    "analysis": {
      "filter": {
        "vi_stop": {
          "type": "stop",
          "stopwords": [
            "và", "của", "là", "trong", "có", "cho", "với", "từ",
            "được", "các", "một", "những", "này", "đó", "khi",
            "tôi", "bạn", "họ", "chúng", "mình", "rất", "như"
          ]
        }
      },
      "analyzer": {
        "vi_custom_analyzer": {
          "type": "custom",
          "tokenizer": "vi_tokenizer",
          "filter": ["lowercase", "vi_stop"]
        }
      }
    }
  }
}

The problem of Vietnamese typed without diacritics

A real-world edge case: users often type Vietnamese without diacritics in the chat box. The team added a no_diacritics sub-field under content (addressed as content.no_diacritics) with a custom analyzer for this case:

"content": {
  "type": "text",
  "analyzer": "vi_custom_analyzer",
  "fields": {
    "no_diacritics": {
      "type": "text",
      "analyzer": "vi_no_diacritics_analyzer"
    }
  }
}

In the query, search both fields with multi_match:

.Query(q => q
    .MultiMatch(m => m
        .Fields(f => f
            .Field("content", boost: 1.0f)
            .Field("content.no_diacritics", boost: 0.7f))
        .Query(userQuery)
        .Type(TextQueryType.BestFields)))
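The definition of vi_no_diacritics_analyzer is not shown in this post. As an illustration of the normalization such an analyzer has to perform, the sketch below strips Vietnamese diacritics in Python; the function name is hypothetical, and it could also be used application-side, for example to detect whether an incoming query was typed without diacritics.

```python
import unicodedata


def strip_vietnamese_diacritics(text: str) -> str:
    """Remove tone and vowel marks, and map đ/Đ to d/D."""
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers tone marks, circumflex, breve, and horn
    without_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # đ/Đ are separate letters with no decomposition, so map them explicitly
    return without_marks.replace("đ", "d").replace("Đ", "D")


def has_diacritics(text: str) -> bool:
    return strip_vietnamese_diacritics(text) != text


print(strip_vietnamese_diacritics("Tiếng Việt có dấu"))  # Tieng Viet co dau
```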

8. Performance optimization

Chunking strategy: it determines quality and latency

After many experiments, the team settled on a few heuristics:

Document type	Chunk size	Overlap	Rationale
FAQ / Q&A	256 tokens	0	Each Q&A pair is an independent unit
Technical docs	512 tokens	64	Balance between context and precision
Long articles	768 tokens	128	More context is needed
Code + comments	400 tokens	50	Code blocks should not be split
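One way to apply these heuristics is a small lookup keyed by document type. The sketch below is hypothetical: the `ChunkProfile` type, the key names, and the choice of the technical-docs profile as the fallback are assumptions; only the numbers come from the table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChunkProfile:
    chunk_size: int
    overlap: int


# Values taken from the heuristics table; key names are made up for this sketch
CHUNK_PROFILES: Dict[str, ChunkProfile] = {
    "faq": ChunkProfile(chunk_size=256, overlap=0),
    "technical_docs": ChunkProfile(chunk_size=512, overlap=64),
    "long_article": ChunkProfile(chunk_size=768, overlap=128),
    "code": ChunkProfile(chunk_size=400, overlap=50),
}


def profile_for(doc_type: str) -> ChunkProfile:
    """Return the profile for a document type, falling back to technical docs."""
    return CHUNK_PROFILES.get(doc_type, CHUNK_PROFILES["technical_docs"])


faq = profile_for("faq")
```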

Embedding batch size

# Benchmark on CPU (Intel i7-12700)
# batch_size=1:   ~800ms/doc  → 80s total for 100 docs
# batch_size=16:  ~120ms/doc  → 12s total for 100 docs
# batch_size=32:  ~80ms/doc   → 8s total for 100 docs
# batch_size=64:  ~75ms/doc   → 7.5s total (higher OOM risk)

OPTIMAL_BATCH_SIZE = 32  # On CPU, without a GPU

Latency breakdown của một RAG request

Measured in production (AWS t3.xlarge):

Query embedding:    ~45ms   (Python service, CPU)
Elasticsearch kNN:  ~25ms   (p99)
Prompt building:    ~2ms
Gemma 4 inference:  ~800ms  (vLLM, 4 A10G GPUs, 9B model)
─────────────────────────────
Total P50:          ~870ms
Total P99:          ~1.2s

The main bottleneck is LLM inference. The team is evaluating speculative decoding with Gemma 4 to bring latency down to ~600ms.
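The breakdown is easy to sanity-check with a few lines of arithmetic. The figures below are the ones quoted above; the dictionary keys are just labels chosen for this sketch.

```python
# Per-stage latencies in milliseconds, as reported in the breakdown
stages_ms = {
    "query_embedding": 45,
    "elasticsearch_knn": 25,
    "prompt_building": 2,
    "gemma_inference": 800,
}

total_ms = sum(stages_ms.values())  # 872, consistent with the ~870ms P50
share = {name: ms / total_ms for name, ms in stages_ms.items()}
bottleneck = max(share, key=share.get)

for name, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {fraction:6.1%}")
```

LLM inference accounts for roughly 92% of the total, so shaving even 10ms off retrieval changes little compared with a faster decoding path.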

Elasticsearch index optimization

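{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 100 }
      }
    }
  }
}

In Elasticsearch, HNSW build parameters are set through index_options on the dense_vector field (m: 16 and ef_construction: 100 are the defaults). The search-time equivalent of ef_search is num_candidates on the knn clause: higher means more accurate but slower. 100, the NumCandidates value used in HybridSearchService, is the sweet spot for production with <50k documents.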
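On the query side, the accuracy/latency trade-off is controlled per request through k and num_candidates in the knn clause. The sketch below builds such a request body as a plain Python dict mirroring the C# hybrid query; the function names are hypothetical, and the validation encodes two constraints of the search API (k must not exceed num_candidates, and num_candidates must not exceed 10,000).

```python
from typing import Any, Dict, Sequence


def build_knn_clause(
    query_vector: Sequence[float],
    k: int = 10,
    num_candidates: int = 100,
    field: str = "embedding",
) -> Dict[str, Any]:
    """Build the `knn` section of a _search request body."""
    if not 1 <= k <= num_candidates:
        raise ValueError("k must be between 1 and num_candidates")
    if num_candidates > 10_000:
        raise ValueError("num_candidates must not exceed 10000")
    return {
        "field": field,
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,  # candidates examined per shard
    }


def build_hybrid_body(user_query: str, query_vector: Sequence[float], size: int = 5) -> Dict[str, Any]:
    """Hybrid request: kNN clause plus a BM25 match query on `content`."""
    return {
        "size": size,
        "knn": build_knn_clause(query_vector),
        "query": {"match": {"content": {"query": user_query, "fuzziness": "AUTO"}}},
    }


body = build_hybrid_body("chính sách hoàn tiền", [0.1, 0.2, 0.3])
```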

9. RAG vs fine-tuning: when to use which?

This is the question the team hears most often across projects. A practical comparison:

Criterion	RAG	Fine-tuning
Knowledge updates	Real-time (just re-index)	Requires retraining
Domain accuracy	High (grounded in retrieved docs)	Very high if trained well
Infrastructure cost	Elasticsearch + inference	High training cost
Hallucination risk	Low (context is provided)	Medium (the model "remembers")
Data privacy	Documents do not leak into the model	Data goes into the training set
Setup time	1-2 weeks	2-4 weeks + GPU budget
Best suited for	FAQ, docs, frequently changing knowledge bases	Tone/style, task-specific behavior

The BKGlobal team's conclusion: for most enterprise use cases, RAG is the first choice. Fine-tuning complements it when the model needs a deeper grasp of domain-specific terminology.

In practice, some of the team's projects run both: the model is fine-tuned to understand the terminology, and RAG supplies real-time context. Best of both worlds.

10. Monitoring and debugging the RAG pipeline

A pipeline that runs is not enough: you need to know why it answers correctly or incorrectly:

// Add tracing to debug retrieval quality (requires: using System.Diagnostics;)
public async Task<RagResponse> AnswerWithTracingAsync(
    string userQuestion,
    CancellationToken ct = default)
{
    var sw = Stopwatch.StartNew();
    var relevantDocs = await _searchService.SearchAsync(userQuestion, ct: ct);
    var retrievalMs = sw.ElapsedMilliseconds;

    // KnowledgeDocument carries no score, so log chunk identifiers instead;
    // relevance scores would have to be read from response.Hits inside HybridSearchService
    _logger.LogInformation(
        "Retrieval: {Count} docs in {Ms}ms. Chunks: {Chunks}",
        relevantDocs.Count,
        retrievalMs,
        string.Join(", ", relevantDocs.Select(d => $"{d.SourceId}:{d.ChunkIndex}")));

    // Log the retrieved context to debug hallucinations
    foreach (var doc in relevantDocs)
    {
        _logger.LogDebug(
            "Retrieved chunk [{SourceId}:{ChunkIndex}]: {Preview}",
            doc.SourceId,
            doc.ChunkIndex,
            doc.Content[..Math.Min(100, doc.Content.Length)]);
    }

    sw.Restart();
    // BuildPrompt is assumed to be a helper wrapping the prompt template used in AnswerAsync
    var answer = await _generationService.GenerateAsync(
        BuildPrompt(userQuestion, relevantDocs), ct);
    var generationMs = sw.ElapsedMilliseconds;

    _logger.LogInformation("Generation: {Ms}ms", generationMs);

    return new RagResponse
    {
        Answer = answer,
        SourceChunks = relevantDocs.Select(d => d.SourceId).Distinct().ToArray()
    };
}

Metrics to track:

Retrieval recall@5: is the correct chunk in the top 5?
Answer groundedness: does the LLM answer from the context, or does it make things up?
Latency P50/P99: broken down by step
Query distribution: which kinds of questions tend to miss at retrieval?
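Retrieval recall@5 can be computed offline from a small labeled set of questions. The sketch below assumes each evaluation item pairs the chunk IDs the search returned with the set of chunk IDs a human marked as relevant; the function names and sample IDs are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate_at_k(eval_set: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for retrieved, relevant in eval_set if set(retrieved[:k]) & relevant)
    return hits / len(eval_set)


eval_set = [
    (["c1", "c9", "c3"], {"c3", "c4"}),  # one of two relevant chunks retrieved
    (["c7", "c8"], {"c2"}),              # complete miss
]
per_query = [recall_at_k(retrieved, relevant) for retrieved, relevant in eval_set]
overall_hit_rate = hit_rate_at_k(eval_set)
```

Queries with a recall of 0 are the ones to inspect first, since no prompt can recover context that retrieval never returned.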

Conclusion

RAG with Elasticsearch and Gemma 4 is a strong stack, especially for teams that already run Elasticsearch in production. The key takeaways from BKGlobal's hands-on experience:

Hybrid search (BM25 + kNN) consistently beat pure vector search for Vietnamese in the team's experience: BM25 catches exact terms, vectors catch semantic similarity
The default embedding model for Vietnamese is bkai-foundation-models/vietnamese-bi-encoder; do not rely on generic multilingual models
Chunking strategy matters more than you might think: with bad chunking, retrieval misses context no matter how good the embeddings are
The Vietnamese Analysis Plugin is a must for the BM25 part of hybrid search
vLLM + Gemma 4 gives good throughput, especially with batched requests

Upcoming posts in BKGlobal's AI series will go deeper into setting up production-grade vLLM, fine-tuning Gemma 4 for Vietnamese domains, and agentic RAG with tool calling.

BKGlobal Tech Team
