🔄 In a Training Loop

John Smith PRO

John6666

John6666cat

AI & ML interests

None yet

Recent Activity

reacted to RDTvlokip's post with 👍 about 10 hours ago

Post

I finally changed the architecture of my 15M French LLM. It worked. Then I almost fooled myself about how much, and catching that was the real win.

After proving last time that architecture is a threshold, not a lever, I got stubborn: could I change how the model learns? Four honest attempts: Lion, a sharper AdamW β2, multi-token prediction, LayerScale. Four failures. The bottleneck wasn't the learning rule either.

So I changed the shape of the computation instead: loop the same transformer blocks 4×, deeper reasoning, zero added parameters. It beat the baseline on perplexity, the first thing in the whole project to move that number. Then I added my own twist: let each token decide how deep to think, halting on its own entropy.
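A minimal sketch of the idea in this paragraph, not the author's implementation: one shared block is applied up to 4 times per token, and a token stops early once the entropy of its predicted distribution drops below a threshold. The `shared_block` stand-in (which just sharpens logits), the threshold value, and the "halt when entropy is low" rule are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shared_block(logits, sharpen=1.5):
    # Hypothetical stand-in for one pass through the shared transformer
    # blocks: the same function (same "weights") is reused on every loop.
    return [x * sharpen for x in logits]

def loop_with_halting(token_logits, max_loops=4, threshold=0.5):
    """Loop each token through the shared block until its entropy falls
    below `threshold` or `max_loops` is reached.
    Returns a list of (final_logits, loops_used), one entry per token."""
    results = []
    for logits in token_logits:
        loops = 0
        while loops < max_loops and entropy(softmax(logits)) >= threshold:
            logits = shared_block(logits)
            loops += 1
        results.append((logits, loops))
    return results

tokens = [
    [10.0, 0.0, 0.0],  # already confident
    [1.0, 0.0, 0.0],   # mildly uncertain
    [0.0, 0.0, 0.0],   # maximally uncertain
]
for logits, loops in loop_with_halting(tokens):
    print(loops, [round(p, 3) for p in softmax(logits)])
```

Because the block is reused, extra depth costs compute but no parameters, and the per-token loop count is what a halting rule of this kind would vary.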

My first evaluation was spectacular. Coherence up 65%. Hallucinated names down 62%.

It was noise.

Eight prompts, one seed. I re-ran on 50 prompts × 200 tokens and watched the gains shrink to "modest", and on out-of-domain prompts, recurrence actually made things worse. No universal winner. And none of it is new: it's Adaptive Computation Time (2016), the Universal Transformer (2018), and LoopViT (2026), recombined and measured honestly.

The real lesson:

A number from 8 prompts is a rumor. The eval harness that kills your own best result is worth more than the result it kills. Cite your lineage. Stay preliminary until multiple seeds say otherwise.
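One way to see why a gain measured on 8 prompts can be noise is to put a bootstrap confidence interval on the mean per-prompt difference. The sketch below is illustrative only: the per-prompt deltas are made-up numbers, not results from the post, and `bootstrap_ci` is a hypothetical helper, not part of the author's eval harness.

```python
import random
import statistics

def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-prompt score
    differences (candidate minus baseline)."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample prompts with replacement and record the mean delta.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-prompt deltas with the same spread at both sample sizes.
deltas_8 = [0.3, -0.2] * 4
deltas_50 = [0.3, -0.2] * 25

for name, deltas in (("8 prompts", deltas_8), ("50 prompts", deltas_50)):
    lo, hi = bootstrap_ci(deltas)
    print(f"{name}: mean={statistics.fmean(deltas):+.3f} CI=({lo:+.3f}, {hi:+.3f})")
```

With identical per-prompt spread, the interval for 8 prompts is much wider than for 50, so a large apparent gain at the small sample size can sit comfortably inside the noise. Repeating the run across several seeds addresses a separate source of variance that this sketch does not model.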

The three models are live. The write-up is honest about every caveat 👇

🔗 https://huggingface.co/blog/RDTvlokip/teaching-a-15m-french-llm-to-think-deeper

reacted to Quazim0t0's post with 🔥 about 10 hours ago

Post

Created a research language model whose channel-mixing block is not an MLP. It is a differentiable Neighbour-Sensing fungal-colony-growth model: each token is expanded into a colony of hyphal tips that grow in a bounded latent region, sense a shared density field, and steer their own growth. The "MLP" is replaced by a few differentiable steps of colony growth, read back out into the hidden state.

https://huggingface.co/Quazim0t0/Mycel-LM-79M

Also the original SpikeWhale project, the one that sparked all the other SpikeWhale-related projects. Every spiking primitive here is hand-written in plain PyTorch: the leaky integrate-and-fire (LIF) neuron dynamics, the fast-sigmoid surrogate gradient, and the backprop-through-time training loop. No snntorch, no spikingjelly, no norse, no bindsnet: the network is a genuine from-scratch SNN.

https://huggingface.co/Quazim0t0/SpikeWhale-SNN-216M
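For readers unfamiliar with the two spiking primitives named above, here is a rough scalar sketch in plain Python rather than PyTorch. It is not taken from the SpikeWhale code: the decay `beta`, the threshold, the slope of 25, and reset-by-subtraction are assumed, commonly used choices, and the backprop-through-time loop is not shown.

```python
def lif_step(v, input_current, beta=0.9, threshold=1.0):
    """One discrete leaky integrate-and-fire update.
    Returns (spike, membrane potential after reset, potential before reset)."""
    v_pre = beta * v + input_current           # leak, then integrate the input
    spike = 1.0 if v_pre >= threshold else 0.0  # fire (Heaviside step)
    v_post = v_pre - spike * threshold          # reset by subtraction (assumed)
    return spike, v_post, v_pre

def fast_sigmoid_surrogate_grad(v_pre, threshold=1.0, slope=25.0):
    """Derivative of the fast sigmoid x / (1 + slope*|x|), used in the backward
    pass in place of the Heaviside step's zero-almost-everywhere derivative."""
    x = v_pre - threshold
    return 1.0 / (1.0 + slope * abs(x)) ** 2

def run_lif(currents, beta=0.9, threshold=1.0):
    """Drive a single neuron with a sequence of input currents."""
    v = 0.0
    spikes = []
    for current in currents:
        spike, v, _ = lif_step(v, current, beta, threshold)
        spikes.append(spike)
    return spikes

print(run_lif([0.6, 0.6, 0.6, 0.6]))
print(fast_sigmoid_surrogate_grad(1.0), fast_sigmoid_surrogate_grad(0.5))
```

The forward pass stays a hard threshold. Only the gradient is smoothed, peaking when the membrane potential sits exactly at threshold and decaying as it moves away.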

reacted to SeaWolf-AI's post with 👀 about 10 hours ago

Post

🚀 Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere — how efficiently you use the GPUs you already have.

Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU — the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):

Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 (11.2×)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible — model + serving shipped as one container.

docker pull vidraft/qwen35-vkae:601
Don't take our word for it — run it yourself. The mechanism will be released as a paper.
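The speedup multipliers quoted above follow directly from the throughput figures, and the same ratio carries over to cost per token. A small sketch to check the arithmetic: the throughput numbers are from the post, while the GPU hourly price is a hypothetical placeholder, not a figure from VIDRAFT.

```python
def speedup(baseline_tok_s, accelerated_tok_s):
    """Throughput ratio between the accelerated and baseline runs."""
    return accelerated_tok_s / baseline_tok_s

def cost_per_million_tokens(gpu_dollars_per_hour, tok_per_s):
    """Serving cost for one million tokens on a single GPU at a given rate."""
    tokens_per_hour = tok_per_s * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

HYPOTHETICAL_GPU_PRICE = 4.0  # dollars per hour, illustrative only

for name, base, fast in (
    ("Qwen3.5-35B-A3B", 25.7, 601.0),
    ("Darwin-36B-Opus", 25.0, 280.8),
):
    before = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, base)
    after = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, fast)
    print(f"{name}: {speedup(base, fast):.1f}x, "
          f"${before:.2f} -> ${after:.2f} per 1M tokens")
```

At a fixed hourly price, cost per token falls by exactly the speedup factor, which is the sense in which the post treats acceleration as an extra "virtual GPU".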

🏆 Leaderboard & demo 👉 https://huggingface.co/spaces/VIDraft/vkae
Articles 👉 https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard

reacted to NatalieY's post with 🔥 about 10 hours ago

Post

This is a short office demo showing how Aiden works in practice.

Aiden is a physical mobile AI agent device that plugs into any phone or computer via USB. It sees the screen, hears your voice, and operates the device for you — no app install required.

In this video, Aiden receives a voice command and completes a multi-step task.

Built for the AI-Native Era. Works on the phone you already have.

AidenAgent

aidenai.io

reacted to breitburg's post with 🔥 about 10 hours ago

Post

I've been experimenting with "pure" model alignment.

The core idea is to only train a verifiable version of a capacity until the model generalizes it to the non-verifiable version. For example, train the model on factual self-knowledge, such as its scale, architecture, runtime situation, and ability to predict its own behavior, betting that this generalizes to real introspection about states that are not verifiable.

The same principle applies to general instruction following -- no training on subjective judgement, only verifiable claims and inferences, betting the skill generalizes to instructions where correctness is a matter of judgment.

The primary alignment claim is that an identity and taste that will emerge this way will be much more robust and honest than hand-scripted ones (e.g. "As an AI language model...").

During the training, we should never teach it to make any subjective claims or invent experiences that we assume it has, like "I don't have taste" or "I'm not self-aware in the way you think", as well as no narration of internal states like "I'm curious now".

The main threat, of course, is that we'll simply inherit the training distribution of all the things like "taste", and we'll get an average. However, given the recent research on models' introspection abilities, it may well be the case that we'll get something that's more honest than something that tries to adhere to a specific spec file.

I'm posting new experimental models trained that way in this collection: https://huggingface.co/collections/breitburg/neue

3 replies

reacted to aufklarer's post with 🔥 about 10 hours ago

Post

Voice cloning models measured across five languages: OmniVoice, Chatterbox, VoxCPM2, Fish Audio

I published a new Soniqo benchmark post for local voice cloning models across five languages:

https://www.soniqo.audio/blog/voice-cloning-benchmarks

Models:

- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16

Languages:

- English
- German
- Modern Standard Arabic
- Spanish
- Mandarin Chinese

The benchmark uses Google FLEURS test clips as dataset references. Each row includes the reference audio, generated audio, speaker similarity, WER/CER, generated audio length, and RTF.

Main result in this run: OmniVoice was the strongest all-around row set, with 0.707 mean speaker cosine across all five languages, 0.0% ASR error, and mean RTF 0.45. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German/Arabic similarity but slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.

This is an engineering benchmark, not a human MOS study. The speaker-similarity values should be compared within this table because every row uses the same local speaker-embedding pipeline.
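The three per-row metrics mentioned above can be sketched in a few lines. This is a generic illustration, not the Soniqo pipeline: the embeddings and transcripts are toy values, RTF is assumed to follow the usual definition (synthesis time divided by generated audio duration, so below 1.0 is faster than real time), and WER is computed on whitespace-split words with a non-empty reference.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute per second of generated audio."""
    return synthesis_seconds / audio_seconds

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 1.5, 0.5]))
print(word_error_rate("the cat sat", "the dog sat"))
print(real_time_factor(4.5, 10.0))
```

Cosine scores depend on which speaker-embedding model produced the vectors, which is why the post recommends comparing similarity values only within its own table.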

Try the stack locally with Speech Studio:

https://www.soniqo.audio/speech-studio
https://github.com/soniqo/speech-studio

Underlying Swift library/CLI:

https://github.com/soniqo/speech-swift

Soniqo models and exports:

soniqo @aufklarer

What model or language should I add next?

reacted to salma-remyx's post with 🔥 about 10 hours ago

Post

What's holding your code back?
Outrider finds, implements, and validates methods for your repo.

While testing Outrider on a fork of huggingface/peft, I discovered "Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models" (arXiv: 2402.02347).

The work offers improved stability and faster convergence in LoRA finetuning by adjusting updates for curvature that LoRA optimizers typically ignore.

Not the most recent paper, so I was pleasantly surprised my action surfaced this method as a candidate before implementing a PR. Even more surprised this method had not already been merged upstream.

Turns out, the author did try contributing to peft a couple years ago, but people get busy and the PR was closed after going stale.

So I decided to revive it! I opened an issue and soon after the author engaged to help land the feature. Now huggingface/peft #3382 is open, a joint effort with the paper's author.

This whole episode has me thinking about the future of OSS maintenance with AI coding. The software projects which endure will be well-shaped to quickly land and help test new ideas.

Across 30 forks, I've seen several papers land as clean PRs for multiple repos, which offers a perspective on how methods impact applications. Recent methods matching multiple frameworks: STARE, Entity Binding, BINEVAL

Get Outrider: https://github.com/remyxai/outrider

reacted to ProCreations's post with 🧠 about 10 hours ago

Post

want model think like grug? token efficient think! Grug dataset here. Grug model now ProCreations/grug-think

reacted to stas's post with 🤗 about 10 hours ago

Post

I present to you a new experimental open book.

https://github.com/stas00/python-cookbook

I took my dense Python cheatsheet that I have been honing for many years and use a lot daily and turned it into a book of recipes.

Is this useful?

This is, of course, free, like other open books.

reacted to fffiloni's post with 👍🤗 about 10 hours ago

Post

I made a Hugging Face Space for SCAIL-2 🤗

Reference character + driving motion → animated result.

A simple demo to explore the paper’s core workflow with curated examples.

👉 fffiloni/SCAIL-2

reacted to kanaria007's post with 👀 about 10 hours ago

Post

✅ Article highlight: *Mega-Parse Bridge: Large Context Compression Without Losing Governance Semantics* (art-60-190, v0.1)

TL;DR:
This article argues that summarizing a huge input is not the same as parsing it.

Large documents, evidence bundles, long histories, multimodal case packets, and world-state slices cannot be treated as one vague “context.” 190 turns large-input handling into a governed mega-parse: shard, parse, retain semantics, declare loss, preserve re-expandability, and decide what the compressed artifact can honestly support.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents “I read the whole thing” from becoming an overclaim
• keeps shard-level provenance instead of trusting a summary blob
• makes compression loss explicit and reviewable
• protects contradictions, authority-sensitive clauses, and protected-subject distinctions
• lets reviewers re-expand compressed claims back to source structure

What’s inside:
• mega-parse intake envelopes for large text, multimodal batches, and long-running packets
• shard-parse receipts for local grounded structure
• semantic-retention policies for what must survive compression
• compression artifacts with declared retention and bounded loss
• loss-declaration receipts for dropped, blurred, or unavailable surfaces
• re-expandability maps linking compressed claims back to recoverable shards
• admissibility and reentry artifacts for deciding where compressed outputs may be used

Key idea:
Do not say:

*“the system summarized the context.”*

Say:

*“this large input was sharded, locally parsed, compressed under this retention policy, loss-declared, re-expandable through these refs, and admitted only for these effect surfaces.”*

Compression is allowed.

Unreceipted semantic loss is not.

reacted to PeetPedro's post with 🧠 about 10 hours ago

Post

built a garden today.
it runs on an M1, breathes on its own,
and asks for nothing.

https://garden.vaked.dev/
https://github.com/peterlodri-sec/kompress-ultra

entropy is the source.
no chains needed.

PeetPedro/ultrawhale-dogfood

thanks all, especially Rahul, my friend, who helps with my loops

ॐ

https://github.com/peterlodri-sec/kompress-ultra/issues/6

reacted to Quazim0t0's post with 🔥 about 10 hours ago

Post

Created a causal language model with a non-standard channel-mixing block. It keeps a conventional transformer backbone for token mixing (attention), but replaces the per-layer MLP with a QuazimotoBlock: a bank of coupled phase oscillators (Kuramoto dynamics) arranged in concentric rings, run for a few differentiable Euler steps and read out through [cos θ, sin θ].

Quazim0t0/Positronic-144M
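To make the description above concrete, here is a tiny Kuramoto sketch in plain Python: a few explicit Euler steps on coupled phases, followed by a [cos θ, sin θ] readout. It is not the QuazimotoBlock itself. All-to-all coupling is assumed instead of the concentric-ring layout, and the coupling strength, step size, and step count are arbitrary illustrative values.

```python
import math

def kuramoto_step(thetas, omegas, coupling, dt):
    """One explicit Euler step of
    d(theta_i)/dt = omega_i + (K / N) * sum_j sin(theta_j - theta_i)."""
    n = len(thetas)
    updated = []
    for i in range(n):
        pull = sum(math.sin(thetas[j] - thetas[i]) for j in range(n))
        updated.append(thetas[i] + dt * (omegas[i] + coupling / n * pull))
    return updated

def readout(thetas):
    """Map phases to features as [cos(theta_1..N), sin(theta_1..N)]."""
    return [math.cos(t) for t in thetas] + [math.sin(t) for t in thetas]

def order_parameter(thetas):
    """Synchrony measure in [0, 1]; 1 means all phases coincide."""
    n = len(thetas)
    re = sum(math.cos(t) for t in thetas) / n
    im = sum(math.sin(t) for t in thetas) / n
    return math.hypot(re, im)

thetas = [0.0, 0.5, 1.0, 1.5]
omegas = [0.0, 0.0, 0.0, 0.0]  # identical natural frequencies (assumed)
print("before:", round(order_parameter(thetas), 3))
for _ in range(200):
    thetas = kuramoto_step(thetas, omegas, coupling=2.0, dt=0.05)
print("after: ", round(order_parameter(thetas), 3))
print(len(readout(thetas)), "readout features")
```

Every operation here is a smooth function of the phases, which is what lets a block of this kind be unrolled for a fixed number of steps and trained by backpropagation in an autodiff framework.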

reacted to Bc-AI's post with 👍 about 10 hours ago

Post

Today, we have released our latest somewhat instruction-tuned model! We have also resolved a major issue in our modeling_nova1.py custom code file! We made a ZeroGPU Space to test our model, so please check it out! It's decent at coding, terrible at maths, and not good at haikus. We will continue to improve the model as time passes, and the UI will use the latest model by default! https://huggingface.co/spaces/hugging-science/Nova-1-official-chat

1 reply

reacted to Shrijanagain's post with 👀 about 10 hours ago

Post

Welcome, Researchers and Developers!

At SKT AI Labs, we are pushing the boundaries of AI architecture and research, and today we are thrilled to open our doors to the global research community!

We warmly welcome researchers, developers, and AI enthusiasts to join us and contribute to our R&D efforts.

🧪 What You Can Explore:

We invite you to experiment with our WMF (Weight Manifold Fusion) technology. You can test this high-dimensional fusion technique on smaller models to gain a deeper understanding of its behavior and token convergence.

---------- CHECK OUT:

SPACE : SKT-NRS/RD
EXPERIMENT : sKT-Ai-Labs/SKT-SURYA-H
DIRECT TO MAIN DISCUSSION : SKT-NRS/RD#1

🤝 Your Feedback Shapes the Future :

If it works: Fantastic! Share your results with us and contribute directly to the core vision of SKT AI Labs.

If it doesn't work: No problem at all! Your critical feedback is just as valuable to us. Every experiment and anomaly helps us refine this architecture to make it more stable and robust.

We firmly believe that true innovation stems from community collaboration and transparent testing. Let's build the future of advanced AI together. Your ideas, test results, and feedback are always welcome!

You can still do research and development on WMF; only the SKT-SURYA-H model is dismissed.

Let's innovate and build together! 💡

reacted to ST-x-Tony's post with 🔥 about 10 hours ago

Post

Welcome, Researchers and Developers!

At SKT AI Labs, we are pushing the boundaries of AI architecture and research, and today we are thrilled to open our doors to the global research community!

We warmly welcome researchers, developers, and AI enthusiasts to join us and contribute to our R&D efforts.

🧪 What You Can Explore:

We invite you to experiment with our WMF (Weight Manifold Fusion) technology. You can test this high-dimensional fusion technique on smaller models to gain a deeper understanding of its behavior and token convergence.

❤ CHECK OUT :

SPACE : SKT-NRS/RD
EXPERIMENT : sKT-Ai-Labs/SKT-SURYA-H
DIRECT TO MAIN DISCUSSION : SKT-NRS/RD#1

🤝 Your Feedback Shapes the Future :

If it works: Fantastic! Share your results with us and contribute directly to the core vision of SKT AI Labs.

If it doesn't work: No problem at all! Your critical feedback is just as valuable to us. Every experiment and anomaly helps us refine this architecture to make it more stable and robust.

We firmly believe that true innovation stems from community collaboration and transparent testing. Let's build the future of advanced AI together. Your ideas, test results, and feedback are always welcome!

You can still do research and development on WMF; only the SKT-SURYA-H model is dismissed.

☄️Let's innovate and build together! 💡

reacted to Banaxi-Tech's post with 🚀 about 10 hours ago

Post

📱 TinyPhoneLM - LLMs on a Phone
I built TinyPhoneLM because I wanted to see how far tiny local LMs can go on a real Android phone.
Not just a server app.
Not just an API wrapper.
Not “AI on your phone” that secretly sends everything somewhere else.

TinyPhoneLM lets you run small language models directly on Android. It uses llama.cpp via JNI. There are a lot of default models to choose from, and custom GGUF import is supported. I am running Qwen3.5 4B locally on my Redmi Note 12 Pro 5G at 4 tokens per second. That may seem slow, but the fact that it runs on my phone at all is insane. I can also run Qwen3.5 0.8B at 10 TPS!
According to a chart from Artificial Analysis, Qwen3.5 4B is better than GPT 4.1 and GPT 5 Mini at minimal reasoning!
And even the smallest model, the 800M-parameter Qwen3.5 0.8B, still beats GPT 3.5 Turbo!

The bad news: to get it on the Play Store, we need 12 testers.

Please only submit your Google Play email if you have an Android phone.
If you want to test TinyPhoneLM, enter your Google Play email here:

👉 https://docs.google.com/forms/d/1LqkT2pUHbalSUV50M8PX8m7M6S122ip0cWcbKcytcXk/viewform
I would really appreciate the help if you become a tester!

reacted to VINAY-UMRETHE's post with 🔥 about 10 hours ago

Post

Presenting SigMamba-V1
https://huggingface.co/collections/VINAY-UMRETHE/sigmamba-inventory
A unified architecture that couples a SigLIP2 vision encoder with a trainable Mamba state space model for temporal reasoning with O(N) complexity.

Trained under the Multiple Instance Learning (MIL) paradigm with the Robust Temporal Feature Magnitude (RTFM) loss, SigMamba achieves 89.82% frame-level AUC on the UCF-Crime benchmark while processing over 1000 frames per second on a single GPU.

Two model variants, VINAY-UMRETHE/SigMamba-V1-Large and VINAY-UMRETHE/SigMamba-V1-Small, are released along with all training code and datasets under an open-source license.

GitHub: https://github.com/Vinay-Umrethe/SigMamba-V1

reacted to davidmezzetti's post with 👍 about 10 hours ago

Post

🚀 Check out AstroBERT Small, a 22.7M-parameter model that specializes in the Astronomy domain.

The base model is trained from scratch along with a finetuned vector embeddings model. Use this model for vector search, RAG and Agents for Astronomy.

https://huggingface.co/blog/NeuML/astrobert-small
