LossFunctionLover/pairwise-orm-model

@@ -36,9 +36,7 @@ model-index:
 **A Robust Preference Learning Model for Agentic Reasoning Systems**
-[![Paper](https://img.shields.io/badge/Paper-ArXiv-red)](link-to-arxiv)
-[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/akleshmishra/orm-pairwise-preference-pairs)
-[![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/Coder-12)
 </div>
@@ -200,21 +198,38 @@ L = -log(sigmoid(f(x_chosen) - f(x_rejected)))
 ### Installation
 ```bash
-pip install transformers torch
 ```
 ### Basic Usage
 ```python
-from transformers import AutoModel, AutoTokenizer
 import torch
-# Load model and tokenizer
-model_name = "akleshmishra/pairwise-orm-model"
-model = AutoModel.from_pretrained(model_name)
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model.eval()
-model.to("cuda" if torch.cuda.is_available() else "cpu")
 # Score a single reasoning trace
 def score_trace(trace_text: str) -> float:
@@ -229,12 +244,15 @@ def score_trace(trace_text: str) -> float:
         max_length=512,
         padding=True
     )
-    inputs = {k: v.to(model.device) for k, v in inputs.items()}
     with torch.no_grad():
-        outputs = model(**inputs)
-        # Assuming outputs.logits is shape [batch, 1]
-        score = outputs.logits.squeeze(-1).cpu().item()
     return score
@@ -332,26 +350,25 @@ This work builds upon and complements:
 If you use this model in your research, please cite:
 ```bibtex
-@article{mishra2025orm,
-  title={An Empirical Study of Robust Preference Learning under Minimal Supervision},
   author={Mishra, Aklesh},
-  journal={arXiv preprint arXiv:XXXX.XXXXX},
-  year={2025}
 }
 ```
 ## 🔗 Resources
-- 📄 **Paper**: [ArXiv](link-to-arxiv) (Coming soon)
 - 💾 **Dataset**: [HuggingFace](https://huggingface.co/datasets/LossFunctionLover/orm-pairwise-preference-pairs)
-- 💻 **Code**: [GitHub](https://github.com/Coder-12)
-- 📊 **Training Logs**: [Weights & Biases](wandb-link) (if available)
 ## 📧 Contact
 **Aklesh Mishra**
 - Email: akleshmishra7@gmail.com
-- GitHub: [@Coder-12](https://github.com/Coder-12)
 ## 📝 License
@@ -382,4 +399,4 @@ This research builds upon months of dedicated work in preference learning and ag
 ---
-**Last Updated**: November 27, 2025

 **A Robust Preference Learning Model for Agentic Reasoning Systems**
+[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/LossFunctionLover/orm-pairwise-preference-pairs)
 </div>
 ### Installation
 ```bash
+pip install transformers torch huggingface_hub
 ```
 ### Basic Usage
 ```python
 import torch
+from transformers import AutoModel, AutoTokenizer
+from huggingface_hub import hf_hub_download
+# Download the trained model weights
+model_path = hf_hub_download(
+    repo_id="LossFunctionLover/pairwise-orm-model",
+    filename="pairwise_orm.pt"
+)
+# Load the base encoder (frozen during training)
+base_model = AutoModel.from_pretrained("facebook/opt-1.3b")
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
+# Load the trained scoring head weights
+scoring_head_weights = torch.load(model_path, map_location="cpu")
+# Initialize scoring head (single linear layer)
+hidden_size = base_model.config.hidden_size
+scoring_head = torch.nn.Linear(hidden_size, 1)
+scoring_head.load_state_dict(scoring_head_weights)
+# Move to device
+device = "cuda" if torch.cuda.is_available() else "cpu"
+base_model.eval().to(device)
+scoring_head.eval().to(device)
 # Score a single reasoning trace
 def score_trace(trace_text: str) -> float:
         max_length=512,
         padding=True
     )
+    inputs = {k: v.to(device) for k, v in inputs.items()}
     with torch.no_grad():
+        # Get base model embeddings
+        encoder_outputs = base_model(**inputs)
+        # Pool the final position (EOS); index -1 is the last real token only when the input has no right-padding
+        pooled = encoder_outputs.last_hidden_state[:, -1, :]
+        # Get reward score
+        score = scoring_head(pooled).squeeze(-1).cpu().item()
     return score
 If you use this model in your research, please cite:
 ```bibtex
+@article{mishra2026orm,
+  title={Stable Outcome Reward Modeling via Pairwise Preference Learning},
   author={Mishra, Aklesh},
+  journal={arXiv preprint},
+  year={2026},
+  note={Under review}
 }
 ```
 ## 🔗 Resources
+- 📄 **Paper**: Submitted to arXiv (under review)
 - 💾 **Dataset**: [HuggingFace](https://huggingface.co/datasets/LossFunctionLover/orm-pairwise-preference-pairs)
 ## 📧 Contact
 **Aklesh Mishra**
 - Email: akleshmishra7@gmail.com
+- Independent Researcher
 ## 📝 License
 ---
+**Last Updated**: January 22, 2026
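
The second hunk header above quotes the model card's training objective, `L = -log(sigmoid(f(x_chosen) - f(x_rejected)))`. A minimal sketch of that formula for a single pair, in plain Python, may help when sanity-checking scores returned by `score_trace`. The function name `pairwise_loss` and the example scores are illustrative assumptions, not part of the repository, and the actual training code may compute the loss differently (for example, batched in PyTorch).

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) for one pair."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)). Branch on the sign of m so that
    # exp() is only ever called on a non-positive argument and cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores standing in for f(x_chosen) and f(x_rejected).
print(pairwise_loss(2.0, 0.5))  # chosen scored higher: smaller loss
print(pairwise_loss(0.5, 2.0))  # chosen scored lower: larger loss
```

The loss depends only on the score difference, so adding the same constant to both scores leaves it unchanged. Under this objective the raw scores are therefore meaningful only relative to each other.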