---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)**, the task of retrieving relevant cited papers for a given talk.
It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.

The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).

---

## 🎯 Usage

Example with `transformers`:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
model.eval()

# Example input
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. Published in {year}. "
    "In this talk, we introduce the Transformer architecture and discuss its impact on sequence modeling."
)

# Tokenize and encode
inputs = tokenizer(query_text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings over non-padding positions (the model's pooling strategy)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # (1, hidden_dim)
```
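
To retrieve papers for a talk, the same recipe is applied on the document side with the paired cited-paper encoder, and candidates are ranked by similarity. A minimal sketch, assuming dot-product scoring as in DPR; the `embed` helper and the text format fed to the paper encoder are illustrative assumptions, not a format prescribed by this card:

```python
from transformers import AutoTokenizer, AutoModel
import torch

def embed(texts, model_name):
    # Tokenize, encode, and mean-pool, mirroring the usage example above
    tok = AutoTokenizer.from_pretrained(model_name)
    mod = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = mod(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

talk = ("The following presentation is about the paper of the title: "
        "'Attention Is All You Need'. Published in 2017.")
papers = [
    "Attention Is All You Need. We propose the Transformer, based solely on attention.",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.",
]

talk_emb = embed([talk], "s8frbroy/talk2ref_query_talk_encoder")
paper_embs = embed(papers, "s8frbroy/talk2ref_ref_key_cited_paper_encoder")

# Rank candidate papers by dot-product similarity to the talk embedding
scores = (talk_emb @ paper_embs.T).squeeze(0)
for idx in scores.argsort(descending=True):
    print(f"{scores[idx].item():.3f}  {papers[int(idx)][:60]}")
```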

---

## 🧩 Model Overview

| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talks into a shared semantic space with cited papers |
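
For intuition on the **Objective** row: DPR-style training scores each talk against its cited papers and pushes those scores above the scores of other papers. Below is a minimal sketch using in-batch negatives with softmax cross-entropy, a common DPR formulation; the actual Talk2Ref objective and negative-sampling scheme may differ, and `dpr_in_batch_loss` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def dpr_in_batch_loss(talk_embs, paper_embs):
    # Row i of talk_embs should score highest against row i of paper_embs;
    # the other papers in the batch serve as negatives.
    scores = talk_embs @ paper_embs.T          # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0))     # positives lie on the diagonal
    return F.cross_entropy(scores, targets)

# Toy check with random vectors (batch of 4, hidden size 384 as in MiniLM)
loss = dpr_in_batch_loss(torch.randn(4, 384), torch.randn(4, 384))
print(loss.item())
```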
|
| | --- |
| |
|
| |
|
| |
|
| | ## Citation |
| |
|
| | If you use this dataset, please cite the following paper: |
| |
|
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```