NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts
Paper • 2305.16023 • Published
How to use HillZhang/pseudo_native_bart_CGEC with Transformers:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("HillZhang/pseudo_native_bart_CGEC")
model = AutoModelForSeq2SeqLM.from_pretrained("HillZhang/pseudo_native_bart_CGEC")This model is a cutting-edge CGEC model based on Chinese BART-large. It is trained with about 100M pseudo native speaker CGEC training data generated by heuristic rules. More details can be found in our Github and the paper.
pip install transformers
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
tokenizer = BertTokenizer.from_pretrained("HillZhang/pseudo_native_bart_CGEC")
model = BartForConditionalGeneration.from_pretrained("HillZhang/pseudo_native_bart_CGEC")
encoded_input = tokenizer(["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"], return_tensors="pt", padding=True, truncation=True)
if "token_type_ids" in encoded_input:
del encoded_input["token_type_ids"]
output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
@inproceedings{zhang-etal-2023-nasgec,
title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
author = "Zhang, Yue and
Zhang, Bo and
Jiang, Haochen and
Li, Zhenghua and
Li, Chen and
Huang, Fei and
Zhang, Min"
booktitle = "Findings of ACL",
year = "2023"
}