File size: 1,785 Bytes
2e6cde3 882f0dc 2e6cde3 b2b7351 2e6cde3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | ---
language:
- en
---
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
SPLADE Copyright (c) 2021-present NAVER Corp.
This model was trained with [Sparsembed](https://github.com/raphaelsty/sparsembed). You can find details on how to use it in the [Sparsembed](https://github.com/raphaelsty/sparsembed) repository.
```sh
pip install sparsembed
```
```python
from sparsembed import model, retrieve
from transformers import AutoModelForMaskedLM, AutoTokenizer
device = "cuda" # cpu
batch_size = 10
# List documents to index:
documents = [
{'id': 0,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'Paris is the capital and most populous city of France.'},
{'id': 1,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
{'id': 2,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
}]
model = model.Splade(
model=AutoModelForMaskedLM.from_pretrained("raphaelsty/splade-max").to(device),
tokenizer=AutoTokenizer.from_pretrained("raphaelsty/splade-max"),
device=device
)
retriever = retrieve.SpladeRetriever(
key="id", # Key identifier of each document.
on=["title", "text"], # Fields to search.
model=model # Splade retriever.
)
retriever = retriever.add(
documents=documents,
batch_size=batch_size,
k_tokens=256, # Number of activated tokens.
)
retriever(
["paris", "Toulouse"], # Queries
k_tokens=20, # Maximum number of activated tokens.
k=100, # Number of documents to retrieve.
batch_size=batch_size
)
``` |