Korean PII (multilingual-e5-base)
Span-level Korean PII detection, fine-tuned from
intfloat/multilingual-e5-base
(a multilingual XLM-RoBERTa bidirectional encoder). It detects 9 PII categories as
character-offset spans and is trained for multi-domain Korean coverage
(conversational, news, and a range of document domains).
Open PII Notebook: load the model and redact Korean PII interactively.
Capabilities
| Category | Description | Example |
|---|---|---|
| `private_person` | Personal name (Korean / Western / handles) | 김민수, John Smith |
| `private_address` | Physical / postal address | 서울특별시 강남구 테헤란로 123 |
| `private_phone` | Phone number | 010-1234-5678 |
| `private_email` | Email address | minsu@example.com |
| `private_date` | Birthday / personally-identifying date | 1985년 3월 12일 |
| `private_url` | Personal URL | github.com/minsu |
| `account_number` | Bank, card, RRN, passport, etc. | 110-234-567890 |
| `personal_handle` | Username / handle | rainbow879612 |
| `ip_address` | IP address | 192.168.1.5 |
Benchmark Results
Evaluated on three evaluation sets with exact character-span F1, after deterministic span
normalization (see `extract_pii` below).
| eval set | what it measures | Overall F1 |
|---|---|---|
| KDPII test (2,252) | conversational Korean (in-domain) | 0.943 |
| Held-out document domains (insurance, government) | unseen domains | 0.995 |
| KLUE-NER person | real Korean news text | 0.866 (recall 0.92) |
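The card does not include its scoring script, so the following is only a minimal sketch of how exact character-span F1 is commonly computed: a prediction counts as correct only when label, start, and end all match a gold span. The function name `span_prf` and the toy spans are illustrative assumptions, not part of the released code.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over (label, start, end) triples."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    tp = len(gold_set & pred_set)  # a hit needs identical label and offsets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy data (assumed): the predicted person span is one character too long.
gold = [("private_person", 0, 3), ("private_phone", 10, 23)]
pred = [("private_person", 0, 4), ("private_phone", 10, 23)]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this metric an off-by-one boundary is a full miss, which is why trailing-particle normalization can move the reported numbers noticeably.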
KDPII per-class (conversational, in-domain)
| label | F1 | label | F1 |
|---|---|---|---|
| `private_email` | 1.000 | `private_person` | 0.909 |
| `private_url` | 1.000 | `private_address` | 0.922 |
| `ip_address` | 1.000 | `account_number` | 0.979 |
| `private_date` | 0.980 | `personal_handle` | 0.863 |
| `private_phone` | 0.993 | | |
Quick Start
Install
```bash
pip install "transformers>=4.40" torch safetensors
```
Load
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "FrameByFrame/korean-pii-e5-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()
if torch.cuda.is_available():
    model.cuda()
```
Inference
The model emits per-token BIOES labels. The `extract_pii` helper decodes them into
character-offset spans and applies light, deterministic span normalization: it strips
trailing Korean particles, honorific suffixes, whitespace, and punctuation from a span
(e.g. 민수씨 → 민수, 송파구에 → 송파구). The benchmark numbers above include this normalization.
```python
import re

# Trailing particles (josa) and honorific suffixes to strip, longest first.
_TRAILING_JOSA = ["이에요", "이라고", "입니다", "이야", "이랑", "한테", "에게", "으로", "이가", "이나",
                  "에서", "이고", "에요", "씨", "님", "이", "가", "은", "는", "을", "를", "야", "아", "의",
                  "에", "도", "께", "고"]
# Greedy match up to the last "일" (day) or digit; anything after it is cut off.
_DATE_END = re.compile(r".*(?:일|[0-9])", re.S)


def _normalize(text, label, s, e):
    # Trim whitespace and light punctuation from both ends.
    while s < e and text[s] in " .,\t\n":
        s += 1
    while e > s and text[e - 1] in " .,\t\n":
        e -= 1
    if label == "private_date":
        m = _DATE_END.match(text[s:e])
        if m and m.end() > 0:
            e = s + m.end()
    elif label in ("private_person", "personal_handle", "private_address"):
        # Strip at most two trailing suffixes, keeping at least 2 characters.
        for _ in range(2):
            seg = text[s:e]
            for j in _TRAILING_JOSA:
                if seg.endswith(j) and (e - s) - len(j) >= 2:
                    e -= len(j)
                    break
            else:
                break  # nothing matched on this pass
    return s, e


def extract_pii(text: str, max_length: int = 256):
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**{k: v.to(model.device) for k, v in enc.items()}).logits
    pred = logits.argmax(-1)[0].tolist()
    id2label = model.config.id2label

    spans, active = [], None  # active = [label, start, end]
    for i, lid in enumerate(pred):
        label = id2label[int(lid)]
        cs, ce = offsets[i]
        if cs == ce:  # special token
            if active:
                spans.append(active)
                active = None
            continue
        if label == "O":
            if active:
                spans.append(active)
                active = None
            continue
        prefix, cat = label.split("-", 1)
        # B/S, or a category change, starts a new span; I/E extends the current one.
        if prefix in ("B", "S") or not active or active[0] != cat:
            if active:
                spans.append(active)
            active = [cat, cs, ce]
        else:
            active[2] = ce
    if active:
        spans.append(active)

    out = []
    for cat, s, e in spans:
        s, e = _normalize(text, cat, s, e)
        if text[s:e].strip():
            out.append({"label": cat, "start": s, "end": e, "text": text[s:e]})
    return out
```
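Because `extract_pii` tokenizes with `truncation=True` and `max_length=256`, text beyond that window is not scored. One possible workaround, sketched here under assumptions, is to split the input at sentence boundaries, run the extractor per chunk, and shift the offsets back. `chunk_text`, `extract_pii_long`, and the 200-character budget are hypothetical names and values; characters are not tokens, so the budget may need tuning, and a single sentence longer than the budget is passed through unsplit.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple

Extractor = Callable[[str], List[Dict]]


def chunk_text(text: str, max_chars: int = 200) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs that concatenate back to `text`."""
    # Split after sentence-ending punctuation or newlines; delimiters stay attached.
    pieces = re.split(r"(?<=[.!?\n])", text)
    offset, buf = 0, ""
    for piece in pieces:
        if buf and len(buf) + len(piece) > max_chars:
            yield offset, buf
            offset += len(buf)
            buf = ""
        buf += piece
    if buf:
        yield offset, buf


def extract_pii_long(text: str, extractor: Extractor, max_chars: int = 200) -> List[Dict]:
    """Run `extractor` on each chunk and shift spans to offsets in `text`."""
    spans: List[Dict] = []
    for offset, chunk in chunk_text(text, max_chars):
        for span in extractor(chunk):
            spans.append({**span, "start": span["start"] + offset,
                          "end": span["end"] + offset})
    return spans
```

With the model loaded, this would be called as `extract_pii_long(text, extract_pii)`. A span that straddles a chunk boundary can still be missed, since chunks do not overlap.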
Redaction
```python
def redact(text: str) -> str:
    # Replace right to left so the offsets of earlier spans stay valid.
    spans = sorted(extract_pii(text), key=lambda s: s["start"], reverse=True)
    for s in spans:
        text = text[:s["start"]] + f"[{s['label'].upper()}]" + text[s["end"]:]
    return text

>>> redact("김민수님의 번호는 010-1234-5678입니다.")
"[PRIVATE_PERSON]님의 번호는 [PRIVATE_PHONE]입니다."
```
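To exercise the placeholder logic without loading the model, the same replacement can be written against a plain list of spans. This is a sketch only: `redact_spans` is a hypothetical helper, and the spans below are hand-written rather than model output.

```python
from typing import Dict, List


def redact_spans(text: str, spans: List[Dict]) -> str:
    """Replace each span with a [LABEL] placeholder, working right to left."""
    # A replacement only shifts the text to its right, so processing the
    # right-most span first leaves the remaining offsets untouched.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['label'].upper()}]" + text[s["end"]:]
    return text


# Hand-written spans (assumed), using half-open [start, end) offsets.
text = "Call Minsu at 010-1234-5678."
spans = [
    {"label": "private_person", "start": 5, "end": 10},
    {"label": "private_phone", "start": 14, "end": 27},
]
print(redact_spans(text, spans))  # Call [PRIVATE_PERSON] at [PRIVATE_PHONE].
```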
Output Schema
| field | description |
|---|---|
| `label` | one of the 9 categories above |
| `start` | character offset start (inclusive) |
| `end` | character offset end (exclusive) |
| `text` | the matched substring |
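The schema implies two invariants: offsets are half-open and in range, and `text` equals the slice they select. A small check like the one below can catch offset drift when spans are post-processed downstream. `PiiSpan` and `check_spans` are illustrative names, not part of the released helper.

```python
from typing import List, TypedDict


class PiiSpan(TypedDict):
    label: str
    start: int  # inclusive
    end: int    # exclusive
    text: str


def check_spans(text: str, spans: List[PiiSpan]) -> None:
    """Raise ValueError if a span violates the schema's offset invariants."""
    for s in spans:
        if not (0 <= s["start"] < s["end"] <= len(text)):
            raise ValueError(f"offsets out of range: {s}")
        if text[s["start"]:s["end"]] != s["text"]:
            raise ValueError(f"text does not match offsets: {s}")


# Hand-built example span (assumed), with offsets derived from the text itself.
text = "mail minsu@example.com now"
start = text.index("minsu@example.com")
span: PiiSpan = {"label": "private_email", "start": start,
                 "end": start + len("minsu@example.com"), "text": "minsu@example.com"}
check_spans(text, [span])  # passes silently
```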
Training Details
| Setting | Value |
|---|---|
| Base model | intfloat/multilingual-e5-base (XLM-RoBERTa, ~278M) |
| Task | token classification, BIOES (9 PII classes → 37 labels) |
| Method | full fine-tune (token head randomly initialized; encoder fully trained) |
| Datasets | multi-domain Korean mix: KDPII (conversational, CC BY 4.0) + KLUE-NER person spans (news) + LLM-generated multi-domain documents (medical, legal, finance, e-commerce, HR, real-estate, social, gaming, IT, telecom, education, travel, delivery, email) with placeholder-filled PII + distribution-matched synthetic PII. All PII is synthetic/generated, never real. |
| Split | KDPII test held out (seed 42); 2 document domains (insurance, government) fully held out for unseen-domain eval; KLUE-val held out |
| Optimizer | AdamW, lr 3e-5, linear schedule, warmup 0.05 |
| Batch / seq | 32 per device, max_length 256 |
| Epochs | 3, best checkpoint by eval_span_f1 |
| Precision | bf16 |
| Hardware | 1× NVIDIA RTX A5000 |
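The label count in the Task row follows from the BIOES scheme: four position tags (B, I, E, S) for each of the 9 categories, plus a single `O` tag. The sketch below rebuilds such a label set for illustration. The ordering is an assumption; the authoritative mapping is `model.config.id2label` in the checkpoint.

```python
CATEGORIES = [
    "private_person", "private_address", "private_phone", "private_email",
    "private_date", "private_url", "account_number", "personal_handle", "ip_address",
]

# BIOES: Begin, Inside, End, Single; "O" marks tokens outside any span.
labels = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in ("B", "I", "E", "S")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 37 = 9 categories * 4 tags + 1
```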
Known Limitations
- `personal_handle` (~0.86 in-domain) is the weakest class: handles are open-vocabulary (arbitrary usernames) and overlap with names, and the score is near its practical ceiling.
- Held-out document-domain F1 (0.995) is optimistic: those domains are unseen, but they share the generator and entity distribution of the synthetic training data. It shows domain-content transfer, not guaranteed robustness on real-world text. Treat real-world performance as bounded by the KDPII (0.94, real conversational) and KLUE news (0.87, real news) numbers.
- Evaluate on your own domain before high-stakes use. Coverage is broad but not exhaustive; Korean PII annotation conventions vary by source.
- Structured PII (phone/email/url/ip/account/RRN) is best paired with a regex/checksum validator in production for guaranteed precision.
- The `extract_pii` helper applies span normalization; if you decode logits yourself, apply equivalent trimming to reproduce the reported numbers.
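The validator pairing suggested in the list above could look roughly like the following sketch, which uses only the standard library. All names (`luhn_ok`, `ip_ok`, `email_ok`, `VALIDATORS`, `filter_spans`) are hypothetical, and the email pattern is deliberately loose. The Luhn check applies to payment card numbers only, so it is offered as a standalone helper rather than wired to `account_number`, which also covers bank accounts, RRNs, and passports.

```python
import ipaddress
import re

_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def luhn_ok(number: str) -> bool:
    """Luhn checksum as used by payment card numbers."""
    digits = re.sub(r"[\s-]", "", number)
    if not re.fullmatch(r"[0-9]{12,19}", digits):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def ip_ok(value: str) -> bool:
    """True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def email_ok(value: str) -> bool:
    return _EMAIL.fullmatch(value) is not None


# Assumed mapping from model label to validator; other labels pass through.
VALIDATORS = {"ip_address": ip_ok, "private_email": email_ok}


def filter_spans(spans):
    """Keep a span unless a validator exists for its label and rejects its text."""
    return [s for s in spans if VALIDATORS.get(s["label"], lambda _t: True)(s["text"])]
```

Filtering this way trades recall for precision: a malformed but real identifier would be dropped, so whether to filter or merely flag is a deployment decision.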
License
MIT, inherited from the base model intfloat/multilingual-e5-base (MIT). Training data includes KDPII (CC BY 4.0).
Citation
```bibtex
@misc{framebyframe-korean-pii-e5-base-2026,
  title  = {Korean PII (multilingual-e5-base): token classification for Korean PII},
  author = {Mariappan, Vijayachandran},
  year   = {2026},
  url    = {https://huggingface.co/FrameByFrame/korean-pii-e5-base}
}
```
Contact
For inquiries, please contact vijay@artelligence.ai