---
language:
- en
pipeline_tag: sentence-similarity
---

# Model Card for gowitheflow/LASER-cubed-bert-base-unsup

Official model checkpoints of **LA(SER)<sup>3</sup>** (LASER-cubed) from the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics".
|
### Model Summary

LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. Without requiring long texts in its training data, it generalizes surprisingly well to long-document retrieval.
|
- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed
- **Shared by:** Chenghao Xiao
- **Model type:** BERT-base
- **Language(s) (NLP):** English
- **Finetuned from model:** BERT-base-uncased
|
### Model Sources

- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
- **Paper:** https://aclanthology.org/2023.emnlp-main.86/
|
### Usage

Use the model with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

text = "LASER-cubed is a dope model - it generalizes to long texts without needing the training sets to have long texts."
representation = model.encode(text)  # one fixed-size embedding vector
```
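
For instance, you can compare a short query against a longer document with cosine similarity (`util.cos_sim` from Sentence Transformers; the query/document pair below is a made-up example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

# hypothetical query/document pair, for illustration only
query = "What is document-level semantic similarity?"
document = ("Document-level semantics studies how the meaning of whole documents "
            "can be represented, compared, and retrieved, beyond single sentences.")

query_emb, doc_emb = model.encode([query, document])
print(util.cos_sim(query_emb, doc_emb))  # cosine similarity; higher = more similar
```
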
### Evaluation

Evaluate it with the BEIR framework:

```python
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# download the datasets with the original BEIR repo yourself first
data_path = './datasets/arguana'
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```
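
If the dataset is not already on disk, BEIR's `util.download_and_unzip` helper can fetch it first (a minimal sketch following BEIR's quickstart; adjust the output directory to your setup):

```python
from beir import util

# download and extract ArguAna into ./datasets/arguana
dataset = "arguana"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "./datasets")
```
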
### Downstream Use

Information Retrieval
|
### Out-of-Scope Use

The model is not intended for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks with similarity matching.
|
## Training Details

- max sequence length: 256
- batch size: 128
- learning rate: 3e-05
- epochs: 1
- warmup: 10%
- hardware: 1x A100
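
For orientation only, a generic unsupervised contrastive run with the hyperparameters above might look like the sketch below in Sentence Transformers. This is not the LA(SER)<sup>3</sup> training code (the actual objective differs; see the paper and the GitHub repo), and `wiki1m.txt` is a placeholder for the training file:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# BERT-base encoder; mean pooling here is an assumption, not the paper's spec
word_embedding = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# wiki1m.txt: placeholder for the 1M-sentence training file
with open("wiki1m.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

# SimCSE-style self-pairs (dropout acts as augmentation); the paper's
# length-aware objective is different -- refer to the paper for details
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% warmup
    optimizer_params={"lr": 3e-5},
)
```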
|
### Training Data

wiki1M (1 million sentences sampled from English Wikipedia).
|
### Training Procedure

Please refer to the paper.
|
## Evaluation

### Results

Please refer to the paper for the full results.
|
**BibTeX:**

```bibtex
@inproceedings{xiao2023length,
  title={Length is a Curse and a Blessing for Document-level Semantics},
  author={Xiao, Chenghao and Li, Yizhi and Hudson, G Thomas and Lin, Chenghua and Al Moubayed, Noura},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages={1385--1396},
  year={2023}
}
```