Commit 82c8a94 (verified, parent b229e78) by pathcosmos

Upload dpo-r1/README.md with huggingface_hub

Files changed (1): dpo-r1/README.md (+57 −0, added)
---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- rlhf
- alignment
- lora
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B: DPO Round 1

First DPO (Direct Preference Optimization) alignment round, applied on top of SFT v2.
LoRA adapters are included alongside the base weights.

## Training Stage

DPO alignment, Round 1. Based on the SFT v2 checkpoint.

## Key Details

- **Steps**: 3,000
- **LoRA rank**: 32
- **Beta**: 0.1
- **DPO loss (start → end)**: 0.693 → 0.565
- **LoRA weights file**: `lora_weights.pt` (~41 MB)

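For reference, the per-pair DPO objective behind the loss numbers above is `-log(sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)))))`. The following is a minimal pure-Python sketch of that formula, not the actual training code; the log-probability values are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At step 0 the policy still equals the reference model, so both margins are
# zero and the loss is -log(0.5) ~= 0.693 -- the initial value reported above.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 3))  # 0.693
```

As training pushes the chosen margin above the rejected margin, the loss falls below 0.693, which is what the start-to-end drop above reflects.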
## Metrics

| Metric | Value |
|--------|-------|
| DPO loss (initial) | 0.693 |
| DPO loss (final) | 0.565 |
| Loss reduction | ~18.5% |

## Notes

LoRA adapters are stored separately as `lora_weights.pt`. To use the full merged model,
prefer the merged checkpoint in [dpo-r2](../dpo-r2/) or the [SLERP merge](../slerp/).

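In the standard LoRA formulation, a rank-r adapter stores two small matrices A (r x d_in) and B (d_out x r), and merging adds their scaled product to the frozen weight: W' = W + (alpha / r) * B @ A. The toy sketch below illustrates that arithmetic with plain Python lists; `merge_lora` is a hypothetical helper and the numbers are invented, so it says nothing about the actual layout of `lora_weights.pt`.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), written out with plain nested lists."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    rank = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(d_in)] for i in range(d_out)]

# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # rank x d_in
B = [[0.5], [0.25]]         # d_out x rank
print(merge_lora(W, A, B, alpha=1.0, r=1))  # [[1.5, 1.0], [0.25, 1.5]]
```

This is why a rank-32 adapter for a 3B model stays around 41 MB: only the small A and B factors are stored, never a full-size weight delta.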
## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# This repo ships the base weights alongside the LoRA adapters:
# load the base model first, then attach the DPO adapters on top.
base = AutoModelForCausalLM.from_pretrained("path/to/dpo-r1", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/dpo-r1")
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r1")
```