
Fine-tuning Gemma 4 with QLoRA for Vietnamese domains: a hands-on guide

QLoRA makes it possible to fine-tune Gemma 4 on an RTX 3090/4090 with only 8–12GB of VRAM, about 80% less than full fine-tuning, while reaching roughly 85–90% of its quality. This post walks through the process end to end: setting up the environment, preparing a Vietnamese dataset, training with Unsloth, evaluating, and exporting to Ollama for local deployment.

The problem

At BKGlobal we keep running into the same problem: we need an LLM that deeply understands a specific line of business, such as legal contract content, internal technical terminology, or a client's own customer-support workflow (CSKH, from the Vietnamese "chăm sóc khách hàng").

The Gemma 4 base model handles general tasks well, but domain-specific work, especially specialized Vietnamese, calls for fine-tuning. The question is how to fine-tune without a million-dollar GPU cluster.

The answer: QLoRA (Quantized Low-Rank Adaptation), one of the most efficient fine-tuning techniques for small teams and startups.

Explanation: what is QLoRA and why use it?

LoRA: training without touching the original weights

Instead of updating all of the model's billions of parameters, LoRA adds a pair of small low-rank matrices (called adapters) next to the targeted weight matrices in each layer. Only these adapters are trained; the original model stays completely frozen.

Full fine-tune:  updates 7B parameters  → needs 40GB+ VRAM
LoRA:            updates a ~50MB adapter → needs 8–16GB VRAM
QLoRA:           LoRA + 4-bit quantized model → needs 4–12GB VRAM
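To make the size difference concrete, here is a minimal sketch that counts the trainable parameters LoRA adds to a single weight matrix. The 4096×4096 layer shape is an illustrative assumption, not Gemma 4's actual dimensions.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix that the adapter sits next to."""
    return d_in * d_out


# Hypothetical layer shape, chosen only for illustration.
D_IN = D_OUT = 4096

for rank in (8, 16, 32):
    adapter = lora_trainable_params(D_IN, D_OUT, rank)
    share = adapter / full_params(D_IN, D_OUT)
    print(f"r={rank}: {adapter:,} trainable params ({share:.2%} of the full matrix)")
```

Under this assumption, rank 16 trains 131,072 parameters per matrix, under 1% of the 16,777,216 in the frozen weight, which is why the adapter file stays small.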

QLoRA: adding quantization to the base model

QLoRA loads the base model in 4-bit format (instead of 16-bit or 32-bit). The result:

Gemma 4 E4B full fine-tune: needs ~40GB VRAM
Gemma 4 E4B QLoRA: needs ~8–10GB VRAM, which an RTX 3090/4090 can handle

The quality trade-off: roughly 85–90% of full fine-tuning. For most domain tasks this is a reasonable trade-off.
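The saving from 4-bit loading can be estimated with simple arithmetic. The sketch below counts weight storage only; activations, gradients, optimizer state, and quantization metadata come on top, so real VRAM use is higher. The 7B parameter count is the same illustrative figure used in the comparison above.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # illustrative model size, not a measured Gemma 4 figure

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit cuts weight storage by a factor of four, which is what leaves room for the adapters and activations on a 24GB consumer card.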

Comparing the three methods

Method	VRAM needed	Time (3 epochs)	Cloud cost	Quality
Full fine-tune	40–120GB	12–48h	$500–2000	100%
LoRA	8–16GB	3–6h	$50–200	95–98%
QLoRA	4–12GB	4–8h	$20–80	85–90%

Recommendation: start with QLoRA, and move up to LoRA or full fine-tuning only if QLoRA does not deliver enough quality.

Environment setup

# Create a virtual environment
python -m venv venv && source venv/bin/activate

# Install Unsloth (recommended: about 2x faster than plain Hugging Face)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install xformers bitsandbytes trl datasets

# Check the GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"

Minimum requirements:

GPU: RTX 3090 (24GB) or RTX 4090 (24GB); an RTX 4080 16GB also works if you reduce the batch size
RAM: 32GB of system memory
Storage: 20GB for the model and dataset

Preparing a Vietnamese dataset

Data format

The training data uses an Alpaca-style instruction format, one record per example:

{
  "instruction": "Tóm tắt điều khoản quan trọng trong hợp đồng sau.",
  "input": "ĐIỀU 3: Bên B cam kết hoàn thành dự án trong vòng 90 ngày kể từ ngày ký kết...",
  "output": "Điều khoản chính: (1) Thời hạn thực hiện 90 ngày từ ngày ký. (2) ..."
}

For large datasets, use a .jsonl file with one record per line:

{"instruction": "...", "input": "...", "output": "..."}
{"instruction": "...", "input": "...", "output": "..."}
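A minimal sketch of writing and reading this format with the standard library follows. The helper names and the validation rule (all three keys must be present) are assumptions for illustration; `ensure_ascii=False` keeps Vietnamese text readable in the file instead of escaping it.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def write_jsonl(path, records):
    """Write one JSON object per line, keeping Vietnamese characters unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read records back, skipping blank lines and rejecting incomplete records."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            records.append(record)
    return records
```

Validating at load time catches malformed records before they reach the trainer, where a missing field would otherwise surface as a confusing `KeyError` mid-run.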

Available Vietnamese datasets

Dataset	Size	Use case	Link
Bactrian-X (VI)	3.4M pairs across 52 languages (about 67K per language)	General instruction	HuggingFace: bactrian-x/bactrian-x
5CD-AI Viet-Alpaca	~52K pairs	GPT-4 translated	HuggingFace: 5CD-AI/Viet-Alpaca-data-gpt4
ViMMRC (UIT)	5.1K QA	Reading comprehension	GitHub: UIT-NLP
ZaloE2E	10K+	Open-ended QA	Zalo AI Challenge
VN News Corpus	50GB raw text	Pre-training / general	OSCAR / CulturaX

For domain-specific work you usually need to build your own dataset. Our usual pipeline:

Collect raw data (internal documents, emails, customer-support tickets)
Use GPT-4 to generate instruction-output pairs from the raw data
Manually review a random 10–15% sample before use
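The review step in the pipeline above can be sketched as a reproducible random draw. The function name and the integer `percent` parameter are illustrative assumptions; a fixed seed means reviewers can regenerate exactly the same sample later.

```python
import random


def draw_review_sample(pairs, percent=10, seed=42):
    """Pick a reproducible random subset of generated pairs for manual review."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling of len(pairs) * percent / 100, so small sets still get reviewed.
    k = (len(pairs) * percent + 99) // 100
    rng = random.Random(seed)
    return rng.sample(list(pairs), k)


# Usage with placeholder records standing in for generated pairs.
generated = [{"id": i} for i in range(1000)]
to_review = draw_review_sample(generated, percent=15)
print(f"Reviewing {len(to_review)} of {len(generated)} pairs")
```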

Vietnamese preprocessing

Gemma 4's tokenizer handles Vietnamese diacritics (á, à, ả, ã, ạ...) well by default, so no special preprocessing is needed. However, if the dataset contains raw text that needs normalizing:

import unicodedata

def normalize_vietnamese(text: str) -> str:
    """Normalize to NFC so Vietnamese diacritics are encoded consistently."""
    return unicodedata.normalize("NFC", text)

# Optional: word segmentation with pyvi
from pyvi import ViTokenizer
segmented = ViTokenizer.tokenize("Tôi đang học lập trình AI")
# → "Tôi đang học lập_trình AI"
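Why NFC matters: the same Vietnamese word can arrive as precomposed characters or as base letters followed by combining marks, and the two forms compare as different strings. The sketch below shows the mismatch on the word "Việt" and how normalization removes it.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Normalize to NFC so visually identical strings share one encoding."""
    return unicodedata.normalize("NFC", text)


composed = "Vi\u1ec7t"           # "ệ" as a single precomposed code point
decomposed = "Vie\u0323\u0302t"  # "e" + combining dot below + combining circumflex

print(composed == decomposed)                        # raw strings differ
print(normalize_vietnamese(decomposed) == composed)  # equal after NFC
print(len(decomposed), len(normalize_vietnamese(decomposed)))
```

Mixed encodings in a dataset mean the tokenizer sees different token sequences for the same word, so normalizing once at ingestion keeps training and inference inputs consistent.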

Fine-tuning code: Unsloth (recommended)

Unsloth is about 2x faster than plain Hugging Face and uses up to 70% less VRAM. It is our team's choice for fast experiments:

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# ─── 1. Load the model with 4-bit quantization ──────────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-4b-it-bnb-4bit",
    max_seq_length=2048,
    dtype=None,           # auto-detect: bfloat16 on Ampere+, float16 on older GPUs
    load_in_4bit=True,    # QLoRA: quantize the base model down to 4-bit
)

# ─── 2. Attach LoRA adapters ─────────────────────────────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank: 8 (faster/lighter) or 32 (higher quality)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,        # Scaling factor; usually set equal to r
    lora_dropout=0,       # 0 works best with Unsloth
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized
    random_state=42,
)

# ─── 3. Load and format the dataset ──────────────────────────────────────────
dataset = load_dataset("5CD-AI/Viet-Alpaca-data-gpt4")
# Or a local file: load_dataset("json", data_files="./data/cskh_train.jsonl")

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns where a response ends

def format_instruction(example):
    """Format one example with the Vietnamese instruction-tuning template."""
    if example.get("input") and example["input"].strip():
        return {
            "text": (
                f"### Hướng dẫn:\n{example['instruction']}\n\n"
                f"### Đầu vào:\n{example['input']}\n\n"
                f"### Phản hồi:\n{example['output']}{EOS_TOKEN}"
            )
        }
    return {
        "text": (
            f"### Hướng dẫn:\n{example['instruction']}\n\n"
            f"### Phản hồi:\n{example['output']}{EOS_TOKEN}"
        )
    }

dataset = dataset.map(format_instruction)

# ─── 4. Training configuration ───────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./gemma4-vi-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch = 2×4 = 8
    warmup_ratio=0.1,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",               # Memory-efficient optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    report_to="none",
)

# ─── 5. Build the trainer and start training ─────────────────────────────────
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,   # Pack several examples per sequence → better GPU utilization
    args=training_args,
)

trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Loss: {trainer_stats.metrics['train_loss']:.4f}")

# ─── 6. Save the adapter and export ──────────────────────────────────────────
model.save_pretrained("./gemma4-vi-lora")
tokenizer.save_pretrained("./gemma4-vi-lora")

# Export to GGUF to run with Ollama
model.save_pretrained_gguf(
    "./gemma4-vi-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Balanced quality/size
)

Estimated time on an RTX 4090 with a 50K-example dataset:

Epoch 1: ~45 minutes
Total for 3 epochs: ~2.5 hours
Loss at the end of epoch 3: typically 1.2–1.8 (good for instruction tuning)
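The training configuration above translates into a step count that is useful for choosing `save_steps` and `logging_steps`. This is a rough sketch assuming a single GPU and no packing; with `packing=True` several examples share one sequence, so the real number of steps is lower, and the trainer's own rounding may differ by a step.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps for a single-GPU run without sequence packing."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the training configuration and the 50K-example estimate.
schedule = training_schedule(
    n_examples=50_000, per_device_batch=2, grad_accum=4, epochs=3, warmup_ratio=0.1
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```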

Inference with the fine-tuned model

# Load the adapter on top of the base model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Use the same base model that the adapter was trained on.
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e4b-it",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./gemma4-vi-lora")
tokenizer = AutoTokenizer.from_pretrained("./gemma4-vi-lora")

def ask_vi(question: str, context: str = "") -> str:
    """Build the training-time prompt and return only the generated response."""
    prompt = f"### Hướng dẫn:\n{question}"
    if context:
        prompt += f"\n\n### Đầu vào:\n{context}"
    prompt += "\n\n### Phản hồi:\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=True is required for temperature to take effect
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Phản hồi:\n")[-1]

# Test
print(ask_vi(
    "Điều khoản nào trong hợp đồng có rủi ro cao nhất?",
    context="ĐIỀU 5: Bên B chịu hoàn toàn trách nhiệm về chất lượng sản phẩm..."
))
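The inference prompt has to match the training template exactly, so it can help to keep the template logic in small pure functions that are testable without a GPU. The sketch below is one way to do that; `build_prompt` and `extract_response` are hypothetical helper names, not part of any library.

```python
RESPONSE_MARKER = "### Phản hồi:\n"


def build_prompt(instruction: str, context: str = "") -> str:
    """Recreate the training template up to the point where the model answers."""
    prompt = f"### Hướng dẫn:\n{instruction}"
    if context.strip():
        prompt += f"\n\n### Đầu vào:\n{context}"
    return prompt + "\n\n" + RESPONSE_MARKER


def extract_response(decoded: str) -> str:
    """Keep only the text generated after the last response marker."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


# Usage: the decoded string stands in for real model output.
prompt = build_prompt("Tóm tắt điều khoản sau.", "ĐIỀU 3: ...")
print(extract_response(prompt + "Thời hạn thực hiện là 90 ngày."))
```

A mismatch as small as a missing newline between sections can noticeably degrade a fine-tuned model's answers, which is the reason to share one template between training and inference.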

Deploying with Ollama

After exporting to GGUF, load the model into Ollama so the team can use it via the API or CLI:

# Create the Modelfile (point FROM at the .gguf file the export step actually wrote)
cat > Modelfile << 'EOF'
FROM ./gemma4-vi-gguf/gemma4-vi-q4_k_m.gguf

SYSTEM """Bạn là trợ lý AI chuyên về [domain cụ thể] của BKGlobal.
Luôn trả lời bằng tiếng Việt, chính xác và súc tích."""

PARAMETER temperature 0.7
PARAMETER top_k 50
PARAMETER top_p 0.9
EOF

# Create the model in Ollama
ollama create gemma4-vi-cskh -f Modelfile

# Test
ollama run gemma4-vi-cskh "Khách hàng phàn nàn về hóa đơn sai, phải xử lý thế nào?"
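For API access, the following sketch calls Ollama's local REST endpoint using only the standard library. It assumes Ollama is running on its default port 11434 and that the model was created under the name used above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request for the Ollama REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example call (requires a running Ollama server):
# print(generate("gemma4-vi-cskh", "Khách hàng phàn nàn về hóa đơn sai, phải xử lý thế nào?"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the client simple for internal tools.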

Evaluating quality

Don't look only at the loss. After fine-tuning, run a set of realistic test cases:

# Quantitative evaluation with BLEU + ROUGE
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

# Note: rouge_score's default tokenizer keeps only ASCII letters and digits,
# so check how it treats Vietnamese diacritics before trusting the scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
smooth = SmoothingFunction().method1  # avoids zero BLEU on short answers

test_cases = [
    {"input": "...", "expected": "...", "predicted": ask_vi("...")},
    # add more test cases
]

bleu_scores = []
rouge_scores = []
for case in test_cases:
    bleu = sentence_bleu(
        [case["expected"].split()], case["predicted"].split(), smoothing_function=smooth
    )
    rouge = scorer.score(case["expected"], case["predicted"])
    bleu_scores.append(bleu)
    rouge_scores.append(rouge["rougeL"].fmeasure)

print(f"Avg BLEU: {sum(bleu_scores)/len(bleu_scores):.3f}")
print(f"Avg ROUGE-L: {sum(rouge_scores)/len(rouge_scores):.3f}")

Acceptable thresholds for production: BLEU > 0.35 and ROUGE-L > 0.45 on domain-specific Vietnamese tasks.
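To see what ROUGE-L measures, here is a dependency-free sketch that computes the F-measure from the longest common subsequence of whitespace-separated tokens. Splitting on whitespace keeps Vietnamese diacritics intact; it is a simplification for illustration, not a drop-in replacement for a full evaluation library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F-measure over whitespace tokens."""
    ref, pred = reference.split(), prediction.split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(f"{rouge_l_f1('thời hạn thực hiện 90 ngày', 'thời hạn 90 ngày'):.3f}")
```

In the usage line, all four predicted tokens appear in order in the six-token reference, so precision is 1.0, recall is 4/6, and the F-measure is 0.8.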

Best practices and lessons from the field

Choosing a suitable LoRA rank:

r=8: fewer parameters and faster training; enough for simple tasks (classification, short Q&A)
r=16: a good balance; the default for most domain tasks
r=32: higher quality, needs more VRAM; use for complex generation tasks

Dataset size:

Minimum viable: ~1,000 high-quality examples (better than 10,000 low-quality ones)
Sweet spot for domain fine-tuning: 5,000–50,000 examples
Above 100K examples: consider LoRA instead of QLoRA

Avoiding catastrophic forgetting:

Add 5–10% general Vietnamese data to the domain-specific training set. Otherwise the model can "forget" how to answer everyday Vietnamese once it has been fine-tuned too narrowly on one domain.
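A minimal sketch of that mixing step, using plain lists in place of real datasets. The function name and the integer `general_percent` parameter are assumptions for illustration; the count of general examples is solved from g / (d + g) = p / 100.

```python
import random


def mix_general(domain, general, general_percent=10, seed=42):
    """Return domain data plus enough general data to reach ~general_percent of the mix."""
    if not 0 <= general_percent < 100:
        raise ValueError("general_percent must be in [0, 100)")
    # g / (d + g) = p / 100  =>  g = d * p / (100 - p)
    n_general = round(len(domain) * general_percent / (100 - general_percent))
    n_general = min(n_general, len(general))  # cannot sample more than is available
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)  # avoid one long block of general data at the end
    return mixed


# Usage with placeholder records.
domain_data = [("domain", i) for i in range(900)]
general_data = [("general", i) for i in range(5000)]
mixed = mix_general(domain_data, general_data, general_percent=10)
print(len(mixed), sum(1 for source, _ in mixed if source == "general"))
```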

Checkpoint frequently:

# In TrainingArguments
save_strategy="steps",
save_steps=200,
load_best_model_at_end=True,  # needs an eval dataset and an evaluation strategy that matches save_strategy

Conclusion

Fine-tuning Gemma 4 with QLoRA is the natural next step once you have the Gemma 4 base model running. At a compute cost of $20–80 per run on cloud GPUs, or no cloud cost at all if you already have an RTX 3090/4090 in-house, a team can iterate quickly and build a dedicated model for each client domain.

At BKGlobal we are applying this pipeline to several internal tools. Early results are promising, especially for extracting information from Vietnamese legal documents.

If you are rolling out fine-tuning for a specific domain and are stuck on datasets or hyperparameters, leave a question in the comments and we will share more of our hands-on experience.

References

#	Source	URL
1	Unsloth – Gemma 4 fine-tuning docs	https://unsloth.ai/docs/models/gemma-4/train
2	HuggingFace – Gemma 4 blog	https://huggingface.co/blog/gemma4
3	Google AI – Gemma 4 model card	https://ai.google.dev/gemma/docs/core/model_card_4
4	QLoRA paper (ArXiv)	https://arxiv.org/abs/2305.14314
5	5CD-AI Vietnamese Alpaca dataset	https://huggingface.co/datasets/5CD-AI/Viet-Alpaca-data-gpt4
6	Bactrian-X multilingual dataset	https://huggingface.co/datasets/bactrian-x/bactrian-x
7	VietAI ViT5 baseline	https://github.com/VietAI/vit5

BKGlobal Tech Team
