Commit 82c8a94 (verified, parent b229e78) by pathcosmos

Upload dpo-r1/README.md with huggingface_hub

Files changed (1): dpo-r1/README.md (+57 −0, added)
---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- rlhf
- alignment
- lora
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B: DPO Round 1

First DPO (Direct Preference Optimization) alignment round, applied on top of SFT v2.
LoRA adapters are included alongside the base weights.

## Training Stage

DPO alignment, Round 1. Based on the SFT v2 checkpoint.

## Key Details

- **Steps**: 3,000
- **LoRA rank**: 32
- **Beta**: 0.1
- **DPO loss (start → end)**: 0.693 → 0.565
- **LoRA weights file**: `lora_weights.pt` (~41 MB)

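For reference, the per-pair DPO objective behind the loss numbers above is `-log(sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)))))`. The following is a minimal pure-Python sketch of that formula, not the actual training code; the log-probability values are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At step 0 the policy still equals the reference model, so both margins are
# zero and the loss is -log(0.5) ~= 0.693 -- the initial value reported above.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 3))  # 0.693
```

As training pushes the chosen margin above the rejected margin, the loss falls below 0.693, which is what the start-to-end drop above reflects.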
## Metrics

| Metric | Value |
|--------|-------|
| DPO loss (initial) | 0.693 |
| DPO loss (final) | 0.565 |
| Loss reduction | ~18.5% |

## Notes

LoRA adapters are stored separately as `lora_weights.pt`. To use the full merged model,
prefer the merged checkpoint in [dpo-r2](../dpo-r2/) or the [SLERP merge](../slerp/).

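In the standard LoRA formulation, a rank-r adapter stores two small matrices A (r x d_in) and B (d_out x r), and merging adds their scaled product to the frozen weight: W' = W + (alpha / r) * B @ A. The toy sketch below illustrates that arithmetic with plain Python lists; `merge_lora` is a hypothetical helper and the numbers are invented, so it says nothing about the actual layout of `lora_weights.pt`.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), written out with plain nested lists."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    rank = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(d_out)]

# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # rank x d_in
B = [[0.5], [0.25]]         # d_out x rank
print(merge_lora(W, A, B, alpha=1.0, r=1))  # [[1.5, 1.0], [0.25, 1.5]]
```

This is why a rank-32 adapter for a 3B model stays around 41 MB: only the small A and B factors are stored, never a full-size weight delta.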
## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# This repo ships the base weights alongside the LoRA adapters:
# load the base model first, then attach the DPO adapters on top.
base = AutoModelForCausalLM.from_pretrained("path/to/dpo-r1", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/dpo-r1")
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r1")
```