I finally changed the architecture of my 15M French LLM. It worked. Then I almost fooled myself about how much, and catching that was the real win.
After proving last time that architecture is a threshold, not a lever, I got stubborn: could I change how the model learns? Four honest attempts: Lion, a sharper AdamW β2, multi-token prediction, LayerScale. Four failures. The bottleneck wasn't the learning rule either.
So I changed the shape of the computation instead: loop the same transformer blocks 4×, deeper reasoning, zero added parameters. It beat the baseline on perplexity, the first thing in the whole project to move that number. Then I added my own twist: let each token decide how deep to think, halting on its own entropy.
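To make the looping idea concrete, here is a minimal sketch of a weight-shared loop with per-token entropy halting. It is a plain-Python toy, not the code behind the three models: `looped_forward`, `block`, `readout`, and the `entropy_threshold` value are assumptions for illustration, and the toy `block` acts on each token independently, whereas a real transformer block also mixes tokens through attention.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def looped_forward(
    hidden: Sequence[Vector],
    block: Callable[[Vector], Vector],
    readout: Callable[[Vector], Vector],
    max_loops: int = 4,
    entropy_threshold: float = 0.5,  # hypothetical value
) -> Tuple[List[Vector], List[int]]:
    """Apply the SAME block up to max_loops times per token.

    A token stops looping once the entropy of its predicted
    distribution falls below entropy_threshold.
    """
    states = [list(h) for h in hidden]
    depth = [0] * len(states)
    active = [True] * len(states)
    for _ in range(max_loops):
        if not any(active):
            break
        for i in range(len(states)):
            if not active[i]:
                continue
            states[i] = block(states[i])  # shared weights: no new parameters
            depth[i] += 1
            if entropy(softmax(readout(states[i]))) < entropy_threshold:
                active[i] = False  # this token is confident enough to halt
    return states, depth


# Toy usage: a "block" that sharpens the state, identity readout.
states, depth = looped_forward(
    [[1.0, 0.0], [0.0, 0.0]],
    block=lambda h: [2.0 * x for x in h],
    readout=lambda h: h,
)
```

In this toy run the first token halts after one loop, while the second never becomes confident and uses all four loops.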
My first evaluation was spectacular. Coherence up 65%. Hallucinated names down 62%.
It was noise.
Eight prompts, one seed. I re-ran on 50 prompts × 200 tokens and watched the gains shrink to "modest". On out-of-domain prompts, recurrence actually made things worse. No universal winner. And none of it is new: it's Adaptive Computation Time (2016), the Universal Transformer (2018), and LoopViT (2026), recombined and measured honestly.
The real lesson:
A number from 8 prompts is a rumor. The eval harness that kills your own best result is worth more than the result it kills. Cite your lineage. Stay preliminary until multiple seeds say otherwise.
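One way to see why 8 prompts is a rumor is to put an interval around the gain instead of quoting a single number. The sketch below is a generic paired bootstrap over per-prompt scores. It is not the eval harness from the write-up: `paired_bootstrap_ci` and its parameters are assumed names, and the scores in the usage lines are made-up placeholders.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean gain, lower bound, upper bound) for candidate - baseline.

    Both sequences hold one score per prompt, in the same prompt order.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample prompts with replacement and record the mean gain each time.
    resampled = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


# Placeholder scores for 8 prompts (illustrative only, not real results).
base = [0.40, 0.55, 0.30, 0.62, 0.48, 0.51, 0.35, 0.44]
cand = [0.52, 0.50, 0.41, 0.60, 0.58, 0.49, 0.47, 0.43]
gain, low, high = paired_bootstrap_ci(base, cand)
print(f"mean gain {gain:+.3f}, 95% interval [{low:+.3f}, {high:+.3f}]")
```

If the interval straddles zero, the gain is not yet distinguishable from noise. Repeating the same check per seed and per prompt domain is a cheap way to stay preliminary until the numbers agree.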
The three models are live. The write-up is honest about every caveat 👇
I created a research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth. The "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.
This is also the original SpikeWhale project, the one that sparked all the other SpikeWhale-related projects. Every spiking primitive here is hand-written in plain PyTorch: the leaky integrate-and-fire (LIF) neuron dynamics, the fast-sigmoid surrogate gradient, and the backprop-through-time training loop. No snntorch, no spikingjelly, no norse, no bindsnet: the network is a genuine from-scratch SNN.
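For readers new to these primitives, here is a minimal sketch of the two pieces named above, written in plain Python rather than PyTorch so it runs anywhere. It is not SpikeWhale's code: the function names, the decay `beta`, the reset-by-subtraction rule, and the `slope` value are assumptions. It shows only the forward LIF dynamics and the fast-sigmoid surrogate derivative that would stand in for the spike's true (zero almost everywhere) derivative during backprop through time.

```python
from typing import List, Sequence, Tuple


def lif_forward(
    currents: Sequence[float],
    beta: float = 0.9,        # membrane decay per step (assumed value)
    threshold: float = 1.0,
) -> Tuple[List[int], List[float]]:
    """Simulate one leaky integrate-and-fire neuron over time.

    Returns the spike train and the pre-reset membrane potentials.
    """
    membrane = 0.0
    spikes: List[int] = []
    trace: List[float] = []
    for current in currents:
        membrane = beta * membrane + current   # leak, then integrate
        fired = membrane >= threshold
        spikes.append(1 if fired else 0)
        trace.append(membrane)
        if fired:
            membrane -= threshold              # reset by subtraction (assumed)
    return spikes, trace


def fast_sigmoid_grad(distance_to_threshold: float, slope: float = 25.0) -> float:
    """Surrogate derivative of the spike with respect to the membrane potential.

    The hard threshold has no useful gradient, so training uses
    1 / (1 + slope * |u|)^2 instead, where u = membrane - threshold.
    """
    return 1.0 / (1.0 + slope * abs(distance_to_threshold)) ** 2


spikes, trace = lif_forward([0.6, 0.6, 0.6], beta=0.5)
surrogate = [fast_sigmoid_grad(v - 1.0) for v in trace]
```

The surrogate is largest when the membrane sits right at the threshold and decays quickly away from it, which is what lets gradient flow through spike decisions that were close calls.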
AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere — how efficiently you use the GPUs you already have.
Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU — the effect of plugging in one more "virtual GPU."
VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):
- Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
- Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 tok/s (11.2×)
- 10,000+ tok/s peak aggregate under concurrency

The key: it's reproducible. Model and serving ship as one container.

docker pull vidraft/qwen35-vkae:601

Don't take our word for it: run it yourself. The mechanism will be released as a paper.
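The cost-per-token argument is simple arithmetic, sketched below. The throughput figures are the ones quoted above. The GPU price is a made-up placeholder, and `speedup` and `cost_per_million_tokens` are hypothetical helper names, not part of VKAE.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Throughput ratio between two measurements on the same hardware."""
    return after_tok_s / before_tok_s


def cost_per_million_tokens(gpu_price_per_hour: float, tok_per_s: float) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tok_per_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


HYPOTHETICAL_GPU_PRICE = 4.00  # currency units per GPU-hour, placeholder only

print(round(speedup(25.7, 601.0), 1))    # 23.4, matches the quoted figure
print(round(speedup(25.0, 280.8), 1))    # 11.2, matches the quoted figure

before = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, 25.7)
after = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, 601.0)
print(f"cost per 1M tokens: {before:.2f} -> {after:.2f}")
```

At a fixed GPU price, cost per token falls by exactly the speedup factor, so the price placeholder changes the absolute numbers but not the ratio.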
This is a short office demo showing how Aiden works in practice.
Aiden is a physical mobile AI agent device that plugs into any phone or computer via USB. It sees the screen, hears your voice, and operates the device for you — no app install required.
In this video, Aiden receives a voice command and completes a multi-step task.
Built for the AI-Native Era. Works on the phone you already have.
I've been experimenting with "pure" model alignment.
The core idea is to train only a verifiable version of a capacity until the model generalizes it to the non-verifiable version. For example, train the model on factual self-knowledge, like its scale, architecture, and runtime situation, and on predicting its own behavior, betting that this generalizes to real introspection about states that cannot be verified.
The same principle applies to general instruction following -- no training on subjective judgement, only verifiable claims and inferences, betting the skill generalizes to instructions where correctness is a matter of judgment.
The primary alignment claim is that the identity and taste that emerge this way will be much more robust and honest than hand-scripted ones (e.g. "As an AI language model...").
During training, we should never teach the model to make subjective claims or to invent experiences that we assume it has, like "I don't have taste" or "I'm not self-aware in the way you think", and never to narrate internal states like "I'm curious now".
The main threat, of course, is that we'll simply inherit the training distribution for things like "taste", and we'll get an average. However, given the recent research on models' introspection abilities, it may well be the case that we'll get something more honest than a model that tries to adhere to a specific spec file.
- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16

Languages:

- English
- German
- Modern Standard Arabic
- Spanish
- Mandarin Chinese
The benchmark uses Google FLEURS test clips as dataset references. Each row includes the reference audio, generated audio, speaker similarity, WER/CER, generated audio length, and RTF.
Main result in this run: OmniVoice was the strongest all-around row set, with 0.707 mean speaker cosine across all five languages, 0.0% ASR error, and mean RTF 0.45. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German/Arabic similarity but slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.
This is an engineering benchmark, not a human MOS study. The speaker-similarity values should be compared within this table because every row uses the same local speaker-embedding pipeline.
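For reference, the three per-row metrics can be sketched in a few lines of plain Python. These are the common textbook definitions, not the benchmark's actual pipeline: the function names are assumptions, RTF is assumed to mean synthesis time divided by generated audio length, and the speaker embeddings here are just lists of floats from whatever embedding model is in use.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine between two speaker embeddings of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(word_error_rate("the cat sat", "the dog sat"))  # one substitution in three words
print(real_time_factor(4.5, 10.0))                    # 0.45
```

CER is the same edit-distance computation over characters instead of words. Because cosine values depend on the embedding model, they are only comparable between rows scored with the same pipeline.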
What's holding your code back? Outrider finds, implements, and validates methods for your repo.
While testing Outrider on a fork of huggingface/peft, I discovered "Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models" (arXiv: 2402.02347).
The work offers improved stability and faster convergence in LoRA finetuning by adjusting updates for curvature that LoRA optimizers typically ignore.
Not the most recent paper, so I was pleasantly surprised my action surfaced this method as a candidate before implementing a PR. Even more surprised this method had not already been merged upstream.
Turns out, the author did try contributing to peft a couple years ago, but people get busy and the PR was closed after going stale.
So I decided to revive it! I opened an issue and soon after the author engaged to help land the feature. Now huggingface/peft #3382 is open, a joint effort with the paper's author.
This whole episode has me thinking about the future of OSS maintenance with AI coding. The software projects which endure will be well-shaped to quickly land and help test new ideas.
Across 30 forks, I've seen several papers land as clean PRs for multiple repos, which offers a perspective on how methods impact applications. Recent methods matching multiple frameworks: STARE, Entity Binding, BINEVAL
✅ Article highlight: *Mega-Parse Bridge: Large Context Compression Without Losing Governance Semantics* (art-60-190, v0.1)
TL;DR: This article argues that summarizing a huge input is not the same as parsing it.
Large documents, evidence bundles, long histories, multimodal case packets, and world-state slices cannot be treated as one vague “context.” 190 turns large-input handling into a governed mega-parse: shard, parse, retain semantics, declare loss, preserve re-expandability, and decide what the compressed artifact can honestly support.
Why it matters:
• prevents “I read the whole thing” from becoming an overclaim
• keeps shard-level provenance instead of trusting a summary blob
• makes compression loss explicit and reviewable
• protects contradictions, authority-sensitive clauses, and protected-subject distinctions
• lets reviewers re-expand compressed claims back to source structure

What’s inside:
• mega-parse intake envelopes for large text, multimodal batches, and long-running packets
• shard-parse receipts for local grounded structure
• semantic-retention policies for what must survive compression
• compression artifacts with declared retention and bounded loss
• loss-declaration receipts for dropped, blurred, or unavailable surfaces
• re-expandability maps linking compressed claims back to recoverable shards
• admissibility and reentry artifacts for deciding where compressed outputs may be used
Key idea: Do not say:
*“the system summarized the context.”*
Say:
*“this large input was sharded, locally parsed, compressed under this retention policy, loss-declared, re-expandable through these refs, and admitted only for these effect surfaces.”*
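A toy sketch can make the shard, retain, declare-loss, and re-expand steps more tangible. Everything below is hypothetical: the class names, fields, and the keyword-based retention policy are illustrative stand-ins, not the article's schemas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ShardReceipt:
    """Provenance for one shard: where it lives in the source."""
    shard_id: str
    start: int
    end: int


@dataclass
class CompressedArtifact:
    """What survived compression, what was dropped, and how to get back."""
    retention_policy: List[str]
    retained: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    loss_declared: List[str] = field(default_factory=list)


def shard_lines(source: str) -> List[ShardReceipt]:
    """Shard by line, recording character offsets for re-expansion."""
    receipts, offset = [], 0
    for index, line in enumerate(source.splitlines(keepends=True)):
        receipts.append(ShardReceipt(f"shard-{index}", offset, offset + len(line)))
        offset += len(line)
    return receipts


def compress(source: str, shards: Sequence[ShardReceipt],
             must_retain: Sequence[str]) -> CompressedArtifact:
    """Keep shards that mention a retained term; declare every other shard as loss."""
    artifact = CompressedArtifact(retention_policy=list(must_retain))
    for shard in shards:
        text = source[shard.start:shard.end]
        if any(term in text for term in must_retain):
            artifact.retained[shard.shard_id] = (shard.start, shard.end)
        else:
            artifact.loss_declared.append(shard.shard_id)
    return artifact


def re_expand(artifact: CompressedArtifact, source: str, shard_id: str) -> str:
    """Follow the re-expandability map back to the original text."""
    start, end = artifact.retained[shard_id]
    return source[start:end]


source = "clause A: the tenant must pay\nsmall talk\nclause B: the owner must not enter\n"
artifact = compress(source, shard_lines(source), must_retain=["must"])
```

The point of the sketch is the shape of the output: the artifact carries its policy, an explicit list of what was dropped, and references that lead back to source spans, so a reviewer never has to trust a summary blob.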
I created a causal language model with a non-standard channel-mixing block. It keeps a conventional transformer backbone for token mixing (attention), but replaces the per-layer MLP with a QuazimotoBlock: a bank of coupled phase oscillators (Kuramoto dynamics) arranged in concentric rings, run for a few differentiable Euler steps and read out through [cos θ, sin θ].
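As a rough illustration of the dynamics involved, here is the textbook Kuramoto update with explicit Euler steps and a [cos θ, sin θ] readout, in plain Python. It is not the QuazimotoBlock: it uses all-to-all coupling instead of concentric rings, has no learned parameters, and the names `kuramoto_step`, `readout`, and `order_parameter` are assumptions.

```python
import math
from typing import List, Sequence


def kuramoto_step(theta: Sequence[float], omega: Sequence[float],
                  coupling: float, dt: float) -> List[float]:
    """One explicit Euler step of
    dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(theta)
    updated = []
    for i in range(n):
        pull = sum(math.sin(theta[j] - theta[i]) for j in range(n)) / n
        updated.append(theta[i] + dt * (omega[i] + coupling * pull))
    return updated


def readout(theta: Sequence[float]) -> List[float]:
    """Map phases to a bounded feature vector [cos θ, sin θ]."""
    return [math.cos(t) for t in theta] + [math.sin(t) for t in theta]


def order_parameter(theta: Sequence[float]) -> float:
    """Synchrony measure in [0, 1]; 1 means all phases coincide."""
    n = len(theta)
    real = sum(math.cos(t) for t in theta) / n
    imag = sum(math.sin(t) for t in theta) / n
    return math.hypot(real, imag)


theta = [0.0, 0.5, 1.0, 1.5]       # initial phases (illustrative)
omega = [0.0, 0.0, 0.0, 0.0]       # identical natural frequencies
before = order_parameter(theta)
for _ in range(50):
    theta = kuramoto_step(theta, omega, coupling=2.0, dt=0.1)
after = order_parameter(theta)
features = readout(theta)
```

With identical frequencies and positive coupling, the phases pull together and the order parameter rises toward 1. Every operation is smooth, which is what makes a few such steps usable as a differentiable layer.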
Today, we have released our latest, somewhat instruction-tuned model! We have also resolved a major issue in our modeling_nova1.py custom code file! We made a ZeroGPU Space to test our model, so please check it out! It's decent at coding, terrible at maths, and not good at haikus. We will continue to improve the model as time passes, and the UI will use the latest model by default! https://huggingface.co/spaces/hugging-science/Nova-1-official-chat
At SKT AI Labs, we are pushing the boundaries of AI architecture and research, and today we are thrilled to open our doors to the global research community!
We warmly welcome researchers, developers, and AI enthusiasts to join us and contribute to our R&D efforts.
🧪 What You Can Explore:
We invite you to experiment with our WMF (Weight Manifold Fusion) technology. You can test this high-dimensional fusion technique on smaller models to gain a deeper understanding of its behavior and token convergence.
If it works: Fantastic! Share your results with us and contribute directly to the core vision of SKT AI Labs.
If it doesn't work: No problem at all! Your critical feedback is just as valuable to us. Every experiment and anomaly helps us refine this architecture to make it more stable and robust.
We firmly believe that true innovation stems from community collaboration and transparent testing. Let's build the future of advanced AI together. Your ideas, test results, and feedback are always welcome!
You can still do research and development on WMF; only the SKT-SURYA-H model has been discontinued.
☄️Let's innovate and build together! 💡
📱 TinyPhoneLM - LLMs on a Phone
I built TinyPhoneLM because I wanted to see how far tiny local LMs can go on a real Android phone. Not just a server app. Not just an API wrapper. Not “AI on your phone” that secretly sends everything somewhere else.
TinyPhoneLM lets you run small language models directly on Android. It uses llama.cpp via JNI. There are a lot of default model options, and custom GGUF import is supported. I am running Qwen3.5 4B locally on my Redmi Note 12 Pro 5G at 4 tokens per second. That may seem slow, but the fact that it runs on my phone at all is insane. I can also run Qwen3.5 0.8B at 10 TPS! Look at this chart from Artificial Analysis: Qwen3.5 4B is better than GPT 4.1 and GPT 5 Mini at minimal reasoning! And even the smallest model, the 800M-parameter Qwen3.5 0.8B, still beats GPT 3.5 Turbo!
The bad news: to get it on the Play Store, we need 12 testers.
Please only submit your Google Play email if you have an Android phone. If you want to test TinyPhoneLM, enter your Google Play email here:
Trained under the Multiple Instance Learning (MIL) paradigm with the Robust Temporal Feature Magnitude (RTFM) loss, SigMamba achieves 89.82% frame-level AUC on the UCF-Crime benchmark while processing over 1000 frames per second on a single GPU.
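The core of a feature-magnitude MIL objective can be sketched briefly. The code below loosely follows the top-k feature-magnitude idea from the RTFM paper: videos carry only video-level labels, each video is a bag of snippet features, and the loss pushes the strongest snippets of an abnormal video to have larger magnitude than those of a normal video. It is not SigMamba's training code. The function names, `k`, and `margin` are assumptions, and a full objective would typically add a snippet classification term.

```python
import math
from typing import Sequence


def topk_mean_magnitude(snippet_features: Sequence[Sequence[float]], k: int) -> float:
    """Mean L2 norm of the k snippets with the largest feature magnitude."""
    if not snippet_features or k < 1:
        raise ValueError("need at least one snippet and k >= 1")
    magnitudes = sorted(
        (math.sqrt(sum(x * x for x in feature)) for feature in snippet_features),
        reverse=True,
    )
    top = magnitudes[:k]
    return sum(top) / len(top)


def magnitude_margin_loss(
    abnormal_video: Sequence[Sequence[float]],
    normal_video: Sequence[Sequence[float]],
    k: int = 3,            # hypothetical value
    margin: float = 100.0,  # hypothetical value
) -> float:
    """Hinge loss on the gap between abnormal and normal top-k magnitudes."""
    gap = topk_mean_magnitude(abnormal_video, k) - topk_mean_magnitude(normal_video, k)
    return max(0.0, margin - gap)


abnormal = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]   # snippet norms 5, 0, 10
normal = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]     # snippet norms 1, 1, 0
print(magnitude_margin_loss(abnormal, normal, k=2, margin=10.0))  # 3.5
```

Using only the top-k snippets is what makes this a multiple-instance setup: most snippets in an abnormal video are still normal, so only the strongest ones are asked to stand out.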