---
language:
- zh
- en
- de
- fr
library_name: transformers
license: mit
pipeline_tag: feature-extraction
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
- peft
base_model:
- Qwen/Qwen3-Embedding-0.6B
- Qwen/Qwen3-Embedding-4B
---

# THETA: Textual Hybrid Embedding-based Topic Analysis

[Paper](https://huggingface.co/papers/2603.05972) | [GitHub](https://github.com/CodeSoul-co/THETA)

## Model Description

THETA (Textual Hybrid Embedding-based Topic Analysis) is a domain-specific embedding framework designed for scalable qualitative research in sociology and the social sciences. This repository contains LoRA adapters fine-tuned on top of Qwen3-Embedding models (0.6B and 4B) using **Domain-Adaptive Fine-tuning (DAFT)**.

The model is optimized to capture semantic vector structures within specific social contexts, making it suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)

**Fine-tuning Methods:**
- **Unsupervised:** SimCSE (contrastive learning)
- **Supervised:** Label-guided contrastive learning with LoRA
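
As a rough illustration of the contrastive objective behind the unsupervised setting, the sketch below computes a SimCSE-style InfoNCE loss on plain Python lists. It is not the training code from this repository: the function names, the temperature of 0.05, and the toy vectors are assumptions chosen for illustration. In SimCSE, the two views of a sentence usually come from encoding it twice with different dropout masks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    view_a[i] and view_b[i] are two encodings of the same sentence (the
    positive pair); every view_b[j] with j != i acts as an in-batch negative.
    The temperature value is an assumption, not taken from this repository.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)

# Toy 2-d "embeddings" (hypothetical values, not model output).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each positive lies close to its anchor
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped with negatives

loss_aligned = simcse_loss(view_a, aligned)
loss_shuffled = simcse_loss(view_a, shuffled)
```

The loss is small when each anchor is most similar to its own positive and grows when a negative scores higher, which is the signal that pulls paired views together during fine-tuning.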

## Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations in the sociology and social science domains.

It is **not** designed for text generation or decision-making in high-risk scenarios.
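
The similarity and retrieval uses listed above reduce to comparing embedding vectors. The following minimal sketch ranks documents by cosine similarity to a query; `normalize`, `rank_documents`, and the 3-d vectors are hypothetical stand-ins for real model output.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    query = normalize(query_embedding)
    scores = []
    for index, embedding in enumerate(document_embeddings):
        doc = normalize(embedding)
        # The dot product of unit vectors equals their cosine similarity.
        scores.append((index, sum(q * d for q, d in zip(query, doc))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-d vectors standing in for real model embeddings.
query_embedding = [0.2, 0.9, 0.1]
document_embeddings = [
    [0.9, 0.1, 0.0],   # document 0
    [0.1, 0.8, 0.2],   # document 1
    [0.0, 0.2, 0.9],   # document 2
]
ranking = rank_documents(query_embedding, document_embeddings)
best_index = ranking[0][0]
```

Here `best_index` points at document 1, whose direction is closest to the query. For large collections, a vector index would normally replace this linear scan.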

## Model Architecture

| Component | Detail |
|---|---|
| Base model | Qwen3-Embedding (0.6B / 4B) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
| Output dimension | 1024 (0.6B) / 2560 (4B) |
| Framework | Transformers + PEFT (PyTorch) |
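
To make the LoRA row concrete: low-rank adaptation keeps a base weight matrix frozen and learns an update of the form `(alpha / rank) * B @ A`, where `B` and `A` are small factors. The sketch below shows the update and the resulting parameter count on plain lists. The rank, `alpha`, and matrix sizes are hypothetical, since this card does not state the adapter hyperparameters.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def lora_delta(B, A, alpha, rank):
    """Compute the weight update (alpha / rank) * B @ A on nested lists."""
    scale = alpha / rank
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]

# Hypothetical sizes: a square 1024 x 1024 projection adapted with rank 8.
full_params = 1024 * 1024
adapter_params = lora_trainable_params(1024, 1024, rank=8)
fraction = adapter_params / full_params

# Tiny worked update with rank 1 (illustrative values only).
delta = lora_delta(B=[[1.0], [2.0]], A=[[3.0, 4.0]], alpha=2.0, rank=1)
```

Under these assumed sizes the adapter trains 16,384 parameters for that matrix, about 1.6% of the full 1,048,576, which is why the adapters in this repository are much smaller than the base checkpoints.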

## Repository Structure

```
CodeSoulco/THETA/
├── 0.6B/
│   ├── supervised/
│   └── unsupervised/
├── 4B/
│   ├── supervised/
│   └── unsupervised/
└── logs/
```

Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/THETA-embeddings](https://huggingface.co/datasets/CodeSoulco/THETA-embeddings)

## Training Details

- **Fine-tuning method:** LoRA (DAFT)
- **Training domain:** Sociology and social science texts
- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- **Objective:** Improve domain-specific semantic representation
- **Hardware:** Two NVIDIA GPUs

## How to Use

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="0.6B/unsupervised/germanCoal"
)

# Generate embeddings
text = "Social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

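# Qwen3-Embedding is a decoder-only model: position 0 cannot attend to later
# tokens, so pool the last token instead of the first.
# With a single unpadded input, index -1 is the final real token.
embeddings = outputs.last_hidden_state[:, -1, :]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)  # unit length for cosine similarity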
```
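
When several texts are embedded in one padded batch, the position to pool can differ per row. The Qwen3-Embedding base models are decoder-only, and their upstream usage example pools the last non-padding token. The helper below sketches that selection on plain nested lists so the indexing is easy to follow. `last_token_pool` here is an illustrative stand-in written for this card, and a tensor version would index `last_hidden_state` with the same per-row positions.

```python
def last_token_pool(hidden_states, attention_mask):
    """Select, for each sequence, the hidden vector of its last real token.

    hidden_states:  [batch][seq_len][dim] nested lists
    attention_mask: [batch][seq_len] with 1 for real tokens and 0 for padding
    Works for both left- and right-padded batches.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        # Index of the last position whose mask value is 1.
        last_index = max(i for i, flag in enumerate(mask) if flag == 1)
        pooled.append(token_vectors[last_index])
    return pooled

# Toy batch (hypothetical values): two sequences of length 3, hidden size 2.
hidden_states = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],   # left-padded: real tokens at positions 1 and 2
    [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]],   # no padding
]
attention_mask = [
    [0, 1, 1],
    [1, 1, 1],
]
pooled = last_token_pool(hidden_states, attention_mask)
```

Passing `padding=True` to the tokenizer produces the `attention_mask` needed for this selection; with left padding, the last real token always sits at the final position.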

## Limitations

- Fine-tuned for the sociology and social science domain; may not generalize well to unrelated topics.
- Embedding quality depends on input length and text quality; the usage example truncates inputs to 512 tokens.
- Does not generate text and should not be used for generative tasks.

## License

This model is released under the **MIT License**.

## Citation

```bibtex
@article{duan2026theta,
  title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
  author={Duan, Zhenke and Pan, Jiqun and Li, Xin},
  journal={arXiv preprint arXiv:2603.05972},
  year={2026},
  doi={10.48550/arXiv.2603.05972}
}
```