---
language:
- ko
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- transformers
---

## PwC-Embedding-expr

We trained **PwC-Embedding-expr** on top of the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) embedding model.
To enhance Korean performance, we applied our curated data augmentation to STS datasets and fine-tuned the E5 model with a carefully balanced sampling ratio across the datasets.

> ⚠️ This is an experimental model and is under continuous development.
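
The snippet below is a minimal usage sketch. It assumes the standard sentence-transformers API and the instruction-prefixed query format inherited from the E5-instruct base model; the repo id is a placeholder, not the published path.

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- substitute the model's actual Hugging Face Hub path.
model = SentenceTransformer("PwC-Embedding-expr")

# E5-instruct-style models prepend a task instruction to queries;
# passages are encoded as-is (this mirrors the base model's usage).
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "대한민국의 수도는 어디인가요?")]  # "What is the capital of South Korea?"
passages = [
    "서울은 대한민국의 수도이다.",      # "Seoul is the capital of South Korea."
    "부산은 대한민국 제2의 도시이다.",  # "Busan is South Korea's second-largest city."
]

query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is a plain dot product.
print(query_emb @ passage_emb.T)
```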

### To-do
- [x] MTEB Leaderboard
- [ ] Technical Report

## MTEB
PwC-Embedding-expr was evaluated on the Korean subset of MTEB.
A leaderboard link will be added once it is published.

| Task             | PwC-Embedding-expr |
|------------------|--------------------|
| KLUE-STS         | 0.88               |
| KLUE-TC          | 0.73               |
| Ko-StrategyQA    | 0.80               |
| KorSTS           | 0.84               |
| MIRACL-Reranking | 0.72               |
| MIRACL-Retrieval | 0.65               |
| **Average**      | **0.77**           |
## Model
- Base Model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Model Size: 0.56B
- Embedding Dimension: 1024
- Max Input Tokens: 514
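
As a quick sanity check of the figures above, a short sketch assuming the sentence-transformers API (placeholder repo id):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PwC-Embedding-expr")  # placeholder repo id

print(model.get_sentence_embedding_dimension())  # expected: 1024
# The 514 above reflects the position-embedding table of the XLM-R backbone;
# the effective sequence limit exposed here is typically 512.
print(model.max_seq_length)
```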
## Requirements
The model works with the dependencies included in the latest release of the [mteb](https://github.com/embeddings-benchmark/mteb) package.
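
To reproduce the scores above, a minimal evaluation sketch using the mteb package; the repo id is a placeholder, and the task names must match the registry of the installed mteb version.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PwC-Embedding-expr")  # placeholder repo id

# A subset of the Korean tasks reported in the table above.
tasks = mteb.get_tasks(tasks=["KLUE-STS", "KorSTS", "Ko-StrategyQA"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/PwC-Embedding-expr")
```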
## Citation

TBD (technical report expected September 2025)