Pedro Ortiz Suarez
AI & ML interests
Language modeling, parsing, sequence tagging, NER, historical languages.
Set `sep="\s+"` for the duplicates file
2
#1 opened almost 2 years ago
by
lhoestq
Porn-related strings in the datasets (zh)
2
#8 opened over 2 years ago
by
kiwakwok
Colab crashed after trying to load the dataset
1
#4 opened over 2 years ago
by
MhondGhod
Change folder names
👍 2
4
#3 opened over 2 years ago
by
hac541309
Unsafe Files
20
#12 opened almost 3 years ago
by
GetzPro
About the number of documents
6
#6 opened over 2 years ago
by
lixin4ever
Upload the rest of the data for 05-06-23
#1 opened over 2 years ago
by
pjox
Changing into Parquet
👍 1
2
#5 opened almost 3 years ago
by
hac541309
The link to the RoBERTa base model directs us to bert-base-uncased
1
#1 opened about 3 years ago
by
hurrial
Deduplicated English Corpus
👍 1
2
#3 opened about 3 years ago
by
conceptofmind
Data hosting on Huggingface
1
#2 opened about 3 years ago
by
hieuhocnlp
How to download only one language?
2
#1 opened about 3 years ago
by
musabg
Full of sexy content and doesn't have 200G in the zh corpus
1
#10 opened about 3 years ago
by
Hzhiqiang