@@ -1,199 +1,191 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+tags:
+- trl
+- sft
+- metric-attention
+- mixture-of-attentions
+- triangle-inequality
+- blackhole-rope
+- discrepancy-calculus
+- discover
+license: cc
+datasets:
+- nohurry/Opus-4.6-Reasoning-3000x-filtered
+- openbmb/UltraData-Math
+- yahma/alpaca-cleaned
+language:
+- en
+pipeline_tag: text-generation
 ---
+# DiscoverLM-70M-Base
+A 70M-parameter causal language model built on the **Mixture-of-Attentions (MoA)** architecture: distance-based metric attention, trained with a regularizer that pushes the learned distances to satisfy the triangle inequality.
+Every attention head scores tokens by distance in a learned space. The geometry is trained for explicitly, not hoped for.
+## What Makes This Different
+Standard transformers compute attention scores as a dot product, Q·Kᵀ. A dot product is a bilinear form, not a distance: it is not zero for identical vectors, it grows with vector norm, and it obeys no triangle inequality. Two vectors can score high by dot product while being far apart.
+MoA replaces this with **negative squared distance** under a learned diagonal Mahalanobis metric and adds a triangle-inequality regularizer over random triples sampled during training. The result: attention weights reflect geometric proximity in a space trained so that d(a,c) ≤ d(a,b) + d(b,c) holds on the sampled triples.
+This isn't a constraint that fights the model. It's structure the model uses.
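The scoring rule described above can be sketched without any framework. This is a minimal illustration, not the model's code: the function name `metric_attention_weights`, the toy vectors, and the plain causal softmax are assumptions, and ball pruning and BlackHoleRoPE are left out.

```python
import math

def metric_attention_weights(queries, keys, scale):
    """Causal attention weights from negative squared diagonal-Mahalanobis distance.

    queries, keys: lists of equal-length vectors, one per token position.
    scale: non-negative per-dimension weights (the diagonal of the metric).
    """
    weights = []
    for i, q in enumerate(queries):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [
            -sum(s * (qd - kd) ** 2 for s, qd, kd in zip(scale, q, k))
            for k in keys[: i + 1]
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(x - top) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# Toy example (hypothetical values): three positions, two feature dimensions.
queries = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
keys = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
attn = metric_attention_weights(queries, keys, scale=[1.0, 0.5])
for row in attn:
    print([round(w, 3) for w in row])
```

The closest key always receives the largest weight in its row, which is the property a dot-product score does not guarantee.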
+## Architecture
+```
+Input → Token Embedding (48K vocab, custom tokenizer)
+  │
+  ▼
+┌──────────────────────────────────────────────────┐
+│              MoA Block × 4                       │
+│                                                  │
+│  ┌─────────┐ ┌──────────┐ ┌────────┐ ┌────────┐  │
+│  │  Local  │ │  Global  │ │Channel │ │  MQA   │  │
+│  │  Conv   │ │  Metric  │ │  Mix   │ │ Metric │  │
+│  │         │ │(64 heads)│ │        │ │(64 Q)  │  │
+│  └────┬────┘ └────┬─────┘ └───┬────┘ └───┬────┘  │
+│       └──────┬────┴───────────┴──────────┘       │
+│              ▼                                   │
+│     Feature Gates + Token Router (top-2)         │
+│              ▼                                   │
+│        Residual + DropPath                       │
+└──────────────────────┬───────────────────────────┘
+                       ▼
+         HyperFFN (SwiGLU + CausalConv + LowRank)
+                       ▼
+                   LayerNorm
+                       ▼
+┌──────────────────────────────────────────────────┐
+│            MoA Language Model Head               │
+│  (same 4-path mixture → SwiGLU → tied vocab)     │
+└──────────────────────┬───────────────────────────┘
+                       ▼
+                 Logits (48,000)
+```
+### Core Components
+**Metric Attention.** Queries attend to keys via learned Mahalanobis distance. Each of 64 heads has an 8-dimensional head space with its own diagonal scaling, learnable ball origin, and adaptive radius for sparse pruning. Pairs outside the ball are masked before softmax.
+**Mixture-of-Attentions Routing.** Four parallel paths per token — local depthwise convolution, full multi-head metric attention, gated channel mixing, and multi-query metric attention. A learned router selects top-2 paths per token position. Feature gates scale each path's output before mixing.
+**BlackHoleRoPE.** Rotary position encoding with learned phase perturbations from a compact Fourier basis. Q/K rotations stay unitary. V amplitudes get bounded energy gating clamped to [0.5, 2.0] with optional discrepancy-state modulation.
+**HyperFFN.** Three-branch feedforward: SwiGLU channel MLP, causal depthwise separable convolution, and gated low-rank bottleneck — routed per-token with top-2 sparse selection.
+**MoA LM Head.** The vocabulary projection runs its own mixture-of-attentions (32 heads, head_dim=16) before projecting to logits through a SwiGLU transform. Weight-tied to the input embedding.
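Top-2 routing, used by both the MoA blocks and HyperFFN, can be sketched for a single token position as follows. The function name `top2_route` is hypothetical, and renormalizing the router weights with a softmax over the two selected paths is an assumption: the card does not say how the selected paths are weighted, and the feature gates are omitted here.

```python
import math

def top2_route(router_logits, path_outputs):
    """Mix the two highest-scoring paths for one token position.

    router_logits: one score per path.
    path_outputs: one output vector per path, all the same length.
    """
    # Indices of the two largest router scores.
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:2]
    # Assumed weighting: softmax restricted to the selected paths.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    mix = {i: exps[i] / total for i in top}
    dim = len(path_outputs[0])
    mixed = [sum(mix[i] * path_outputs[i][d] for i in top) for d in range(dim)]
    return mixed, mix

# Path order follows the card: local conv, global metric, channel mix, MQA metric.
logits = [0.1, 2.0, -1.0, 2.0]
outputs = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [4.0, 0.0]]
mixed, mix = top2_route(logits, outputs)
print(mix, mixed)
```

Only the two selected paths contribute to the mixed output; the other two are ignored for that token.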
+## Parameter Budget
+| Component | Parameters | % |
+|---|---|---|
+| Token embedding (tied) | 24.6M | 35.5% |
+| MoA blocks × 4 | 28.9M | 41.8% |
+| HyperFFN (shared) | 4.2M | 6.1% |
+| MoA LM head | 10.8M | 15.6% |
+| RoPE + norms | 0.6M | 0.9% |
+| **Total** | **69.1M** | |
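The table can be sanity-checked with a few lines of arithmetic. The model width of 512 is an inference from the stated 64 heads × 8 dimensions per head (it also matches the LM head's 32 × 16), not a figure the card gives directly.

```python
# Figures copied from the parameter table, in millions.
budget_millions = {
    "token_embedding": 24.6,
    "moa_blocks": 28.9,
    "hyper_ffn": 4.2,
    "moa_lm_head": 10.8,
    "rope_and_norms": 0.6,
}

vocab_size = 48_000
model_dim = 64 * 8  # inferred width: 64 heads x 8 dims per head

# A tied embedding table is counted once: vocab_size x model_dim weights.
embedding_params = vocab_size * model_dim
total_millions = round(sum(budget_millions.values()), 1)

print(f"embedding: {embedding_params / 1e6:.1f}M, total: {total_millions}M")
```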
+## vs Standard Transformers
+| | Transformer | MoA |
+|---|---|---|
+| Attention scoring | Dot product (Q·Kᵀ) | Negative squared Mahalanobis distance |
+| Geometric guarantee | None | Triangle inequality regularized |
+| Position encoding | RoPE | BlackHoleRoPE (learned phase + bounded V energy) |
+| Attention sparsity | Causal mask only | Ball pruning + top-k routing |
+| Head combination | Concatenation | Per-token routed mixture of 4 path types |
+| FFN | Single MLP | 3-branch routed (SwiGLU + CausalConv + LowRank) |
+| LM head | Linear projection | Full MoA mixture → SwiGLU → tied projection |
+## Training
+### Data
+| Dataset | Domain |
+|---|---|
+| [Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | Multi-step reasoning |
+| [UltraData-Math](https://huggingface.co/datasets/openbmb/UltraData-Math) | Mathematical problem solving |
+| [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | General instruction following |
+## Usage
+```python
+from transformers import AutoTokenizer
+# MoA is not part of transformers; this import assumes the model's MoA module is on the Python path.
+from MoA import MoAMetricLM, MoAMetricConfig
+# Load the tokenizer and the weights from the Hub.
+tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
+model = MoAMetricLM.from_pretrained("reaperdoesntknow/DiscoverLM-70M")
+# Tokenize a prompt and generate up to 128 new tokens.
+inputs = tokenizer("The triangle inequality guarantees that", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=128)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+### Chat Format
+The tokenizer includes built-in special tokens for structured generation:
+| Token | Role |
+|---|---|
+| `<\|system\|>` | System prompt boundary |
+| `<\|user\|>` | User turn boundary |
+| `<\|assistant\|>` | Assistant turn boundary |
+| `<\|think\|>` | Internal reasoning start |
+| `<\|reasoning\|>` | Reasoning chain marker |
+| `<\|bos\|>` | Beginning of sequence |
+| `<\|eos\|>` | End of sequence |
+| `<\|pad\|>` | Padding |
+```python
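+# Chat-style prompting (reuses `tokenizer` and `model` from the Usage example)
+prompt = "<|system|>You are DiscoverLM, a small language model with metric attention.<|user|>What is the triangle inequality?<|assistant|><|think|><|reasoning|>"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=256)
+# Keep special tokens in the decoded text so the turn and reasoning markers stay visible.
+print(tokenizer.decode(outputs[0], skip_special_tokens=False))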
+```
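Building the prompt string by hand is error-prone, so a small helper can assemble it from a message list. `build_prompt` is a hypothetical helper, not part of the tokenizer; it reproduces the layout of the example prompt above and makes no assumption about whether earlier assistant turns need an `<|eos|>` terminator, since the card does not say.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}

def build_prompt(messages, think=True):
    """Join role-tagged messages and open a new assistant turn.

    messages: list of {"role": ..., "content": ...} dicts.
    think: when True, also open the reasoning markers used in the card's example.
    """
    parts = [ROLE_TOKENS[m["role"]] + m["content"] for m in messages]
    parts.append(ROLE_TOKENS["assistant"])
    if think:
        parts.append("<|think|><|reasoning|>")
    return "".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are DiscoverLM, a small language model with metric attention."},
    {"role": "user", "content": "What is the triangle inequality?"},
])
print(prompt)
```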
+## Mathematical Foundation
+The metric attention mechanism is grounded in the Discrepancy Calculus (DISC), a measure-theoretic framework for singularity analysis developed by the author. The triangle inequality regularizer penalizes violations of d(a,c) ≤ d(a,b) + d(b,c) on sampled triples, pushing the distance function used for attention scoring toward a proper metric rather than a mere similarity function.
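A regularizer of this kind can be sketched as a hinge penalty averaged over random triples. The function names, the sampling scheme, and the 1-D toy points are assumptions for illustration; the card does not give the loss used in training or say which quantity (distance or squared distance) it constrains. The example also shows why a penalty has work to do: a squared distance violates the triangle inequality on collinear points, while the plain distance does not.

```python
import random

def triangle_violation(dist, a, b, c):
    """Hinge penalty: how far d(a,c) exceeds d(a,b) + d(b,c), or 0 if it does not."""
    return max(0.0, dist(a, c) - dist(a, b) - dist(b, c))

def triangle_penalty(dist, points, num_triples, rng):
    """Mean hinge penalty over randomly sampled triples of distinct points."""
    total = 0.0
    for _ in range(num_triples):
        a, b, c = rng.sample(points, 3)
        total += triangle_violation(dist, a, b, c)
    return total / num_triples

def squared_distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def distance(x, y):
    return squared_distance(x, y) ** 0.5

# Hypothetical 1-D points; any collinear triple exposes the squared form.
points = [(0.0,), (1.0,), (2.0,), (3.5,)]
penalty_squared = triangle_penalty(squared_distance, points, 200, random.Random(0))
penalty_metric = triangle_penalty(distance, points, 200, random.Random(0))
print(penalty_squared > 0.0, penalty_metric)
```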
+The ball pruning mechanism (learnable per-head origins and radii) creates adaptive sparse attention patterns that emerge from the geometry itself rather than from fixed masking heuristics.
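One possible reading of ball pruning is sketched below. The card does not specify how the origin enters the test, so the rule used here (keep a pair when the displacement from query to key, measured from the head's origin offset, lies within the radius) is an assumption, as are the function names.

```python
import math

NEG_INF = float("-inf")

def ball_pruned_scores(queries, keys, scale, origin, radius):
    """Negative squared distances, with causal and out-of-ball pairs set to -inf."""
    scores = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            if j > i:
                row.append(NEG_INF)  # causal mask
                continue
            # Assumed ball test: displacement k - q, measured from the head's origin.
            sq = sum(s * (kd - qd - od) ** 2 for s, qd, kd, od in zip(scale, q, k, origin))
            row.append(-sq if sq <= radius ** 2 else NEG_INF)
        scores.append(row)
    return scores

def masked_softmax(row):
    """Softmax that gives masked (-inf) entries a weight of exactly 0."""
    finite = [x for x in row if x != NEG_INF]
    if not finite:
        return [0.0] * len(row)
    top = max(finite)
    exps = [math.exp(x - top) if x != NEG_INF else 0.0 for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 1-D tokens: the third token is too far from the first two.
tokens = [[0.0], [1.0], [4.0]]
scores = ball_pruned_scores(tokens, tokens, scale=[1.0], origin=[0.0], radius=2.0)
weights = [masked_softmax(row) for row in scores]
print(weights)
```

With a radius of 2, the last token attends only to itself, so the sparsity pattern follows from the learned geometry rather than from a fixed window.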
+BlackHoleRoPE extends standard rotary position encoding with learned phase perturbations synthesized from a Fourier basis. It keeps the Q/K rotations unitary while adding bounded amplitude modulation on V, ensuring position-dependent energy gating stays within Lyapunov-stable bounds (the gate is clamped to [0.5, 2.0]).
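The two properties named here can be checked on a single feature pair. This is a sketch under assumptions: the sine/cosine form of the Fourier perturbation, the coefficient values, and all function names are hypothetical; only the rotation on Q/K and the [0.5, 2.0] clamp on V come from the card.

```python
import math

def phase(position, base_freq, fourier_coeffs):
    """Standard RoPE angle plus a perturbation from a small Fourier basis (assumed form)."""
    perturbation = sum(
        a * math.sin((n + 1) * base_freq * position)
        + b * math.cos((n + 1) * base_freq * position)
        for n, (a, b) in enumerate(fourier_coeffs)
    )
    return position * base_freq + perturbation

def rotate_pair(x, y, angle):
    """Rotate one (x, y) feature pair; a rotation never changes its length."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

def gate_value(v, raw_gate, low=0.5, high=2.0):
    """Scale a value vector by a gate clamped to [low, high]."""
    g = min(high, max(low, raw_gate))
    return [g * vi for vi in v]

angle = phase(position=7, base_freq=0.1, fourier_coeffs=[(0.05, -0.02), (0.01, 0.03)])
rx, ry = rotate_pair(3.0, 4.0, angle)
print(math.hypot(rx, ry))            # length is preserved whatever the learned phase is
print(gate_value([1.0, -2.0], 7.3))  # an oversized gate is clamped to 2.0
```

Because the perturbation only changes the angle, the Q/K transform stays a pure rotation; the clamp bounds how much any position can amplify or attenuate V.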
+## Lineage
+This architecture derives from research in metric-native neural computation:
+- **DISC** — Discrepancy Calculus: measure-theoretic singularity analysis (Colca, 2025)
+- **MoA** — Mixture-of-Attentions with triangle inequality enforcement
+- **BlackHoleRoPE** — Learned rotary position encoding with bounded energy gating
+## Limitations
+- Trained on 262K tokens — the architecture works, but this is a proof-of-concept scale. Generalization to unseen distributions is not yet validated.
+- No eval split was used; training metrics only.
+- 8 epochs over 64 batches means the model has seen each example eight times. Overfitting is likely at this data scale.
+- fp32 training only — bf16/fp16 behavior untested.
+## Citation
+```bibtex
+@misc{CILLC2026discoverLM,
+  author = {Convergent Intelligence LLC: Research Division},
+  title = {DiscoverLM-70M: Metric-Attention Mixture of Attentions with Triangle Inequality Enforcement},
+  year = {2026},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/reaperdoesntknow/DiscoverLM-70M}
+}
+```
+## Author
+Roy Colca Jr. — [Convergent Intelligence LLC](https://convergentintel.com)
+HuggingFace: [reaperdoesntknow](https://huggingface.co/reaperdoesntknow)