---
language:
- en
pipeline_tag: sentence-similarity
---

# Model Card for gowitheflow/LASER-cubed-bert-base-unsup

Official model checkpoints of **LA(SER)<sup>3</sup>** (LASER-cubed) from the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics".
|
### Model Summary

LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. Without requiring long texts in its training data, it generalizes surprisingly well to long-document retrieval.
|
- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed
- **Shared by:** Chenghao Xiao
- **Model type:** BERT-base
- **Language(s) (NLP):** English
- **Finetuned from model:** BERT-base-uncased
|
### Model Sources

- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
- **Paper:** https://aclanthology.org/2023.emnlp-main.86/
|
### Usage

Use the model with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

text = "LASER-cubed is a dope model - it generalizes to long texts without needing the training sets to have long texts."
representation = model.encode(text)  # one fixed-size embedding vector
```
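
For instance, you can compare a short query against a longer document with cosine similarity (`util.cos_sim` from Sentence Transformers; the query/document pair below is a made-up example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

# hypothetical query/document pair, for illustration only
query = "What is document-level semantic similarity?"
document = ("Document-level semantics studies how the meaning of whole documents "
            "can be represented, compared, and retrieved, beyond single sentences.")

query_emb, doc_emb = model.encode([query, document])
print(util.cos_sim(query_emb, doc_emb))  # cosine similarity; higher = more similar
```
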
### Evaluation

Evaluate it with the BEIR framework:

```python
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# download the datasets with the original BEIR repo yourself first
data_path = './datasets/arguana'
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```
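
If the dataset is not already on disk, BEIR's `util.download_and_unzip` helper can fetch it first (a minimal sketch following BEIR's quickstart; adjust the output directory to your setup):

```python
from beir import util

# download and extract ArguAna into ./datasets/arguana
dataset = "arguana"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "./datasets")
```
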
### Downstream Use

Information Retrieval
|
### Out-of-Scope Use

The model is not intended for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks with similarity matching.
|
## Training Details

- max sequence length: 256
- batch size: 128
- learning rate: 3e-05
- epochs: 1
- warmup: 10%
- hardware: 1x A100
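
For orientation only, a generic unsupervised contrastive run with the hyperparameters above might look like the sketch below in Sentence Transformers. This is not the LA(SER)<sup>3</sup> training code (the actual objective differs; see the paper and the GitHub repo), and `wiki1m.txt` is a placeholder for the training file:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# BERT-base encoder; mean pooling here is an assumption, not the paper's spec
word_embedding = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# wiki1m.txt: placeholder for the 1M-sentence training file
with open("wiki1m.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

# SimCSE-style self-pairs (dropout acts as augmentation); the paper's
# length-aware objective is different -- refer to the paper for details
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% warmup
    optimizer_params={"lr": 3e-5},
)
```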
|
### Training Data

wiki1M (1 million sentences sampled from English Wikipedia).
|
### Training Procedure

Please refer to the paper.
|
## Evaluation

### Results

Please refer to the paper for the full results.
|
**BibTeX:**

```bibtex
@inproceedings{xiao2023length,
  title={Length is a Curse and a Blessing for Document-level Semantics},
  author={Xiao, Chenghao and Li, Yizhi and Hudson, G Thomas and Lin, Chenghua and Al Moubayed, Noura},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages={1385--1396},
  year={2023}
}
```