Introducing QIE-2511-Zoom-Master for highlight-guided area zoom-in, enabling lossless zooming within a drawn square area, and QIE-2511-Object-Remover-v2 for precise object or highlight-guided area cleanup. These experimental adapters are trained on top of QIE-2511. Find the adapters below.
MAD-GRPO: https://huggingface.co/blog/telcom/mad-grpo In R1-Zero-like training, Dr.GRPO addresses GRPO's normalization bias by dropping the per-group std, but that often comes with a hidden side effect: length-weighted updates that can nudge the model toward verbosity. MAD-GRPO instead uses a robust scale (MAD + epsilon) for per-token normalization, keeping updates stable without the verbosity bias.
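A minimal sketch of the core idea (my own illustration, not the reference implementation from the blog post): compute per-group advantages with a median/MAD scale instead of mean/std.

```python
import torch

def mad_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative robust advantage normalization for one group of rollouts.

    rewards: (group_size,) scalar reward per sampled completion for the same prompt.
    GRPO uses (r - mean) / std; Dr.GRPO drops the std. Here the scale is the
    median absolute deviation plus eps (centering by the median is my assumption).
    """
    med = rewards.median()
    mad = (rewards - med).abs().median()
    return (rewards - med) / (mad + eps)

# One group of 6 sampled completions, one with an outlier reward
rewards = torch.tensor([0.10, 0.15, 0.12, 0.20, 0.18, 0.95])
print(mad_normalized_advantages(rewards))
```

Because the median and MAD are insensitive to the single outlier, the scale stays bounded and no completion's length or extremity dominates the update.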
The LTX-2 Camera-Control LoRA demo, with dolly-in/out and dolly-left/right controls, is now available on Hugging Face, paired with ltx-2-19b-distilled-lora for fast inference. It also includes dynamic GPU duration adjustment for long video generations. Click the related Space links below.
Qwen-Image-Edit-2511-Object-Remover is an adapter (LoRA) developed for Qwen’s Qwen-Image-Edit-2511 image-to-image model. It is specifically designed for precise object removal from images.
Qwen-Image-Edit-2511-Object-Adder is an adapter (LoRA) developed for Qwen’s Qwen-Image-Edit-2511 image-to-image model. It is specifically designed for precise object addition to images.
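For local experimentation, a rough sketch with diffusers is below; the pipeline class, call signature, and repo IDs are assumptions on my side, so check the model cards for the canonical snippet.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Repo IDs below are placeholders; see the adapter model cards for the exact names.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("telcom/Qwen-Image-Edit-2511-Object-Remover")  # or the Object-Adder LoRA

image = load_image("input.png")
result = pipe(
    image=image,
    prompt="remove the red car from the street",
    num_inference_steps=30,
).images[0]
result.save("edited.png")
```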
Update: The TRELLIS.2 (Text-to-3D, Image-to-3D) Gradio demo with embedded Rerun and improved 3D model previewer visualization is now available on Hugging Face. Generate assets and view them in the 3D viewer, powered and streamlined with Microsoft's TRELLIS.2 and Tongyi-MAI's Z-Image-Turbo models.
Introducing the Qwen-Image-Edit-2511-LoRAs-Fast demo, featuring image property comparison and contrast, built on top of Gradio combined with the Rerun SDK. It supports single- and multi-image edits with existing LoRAs that are lazily loaded. (Note: This is still an experimental Space for Qwen-Image-Edit-2511.)
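Under the hood, lazy loading roughly means keeping one base pipeline resident and attaching each LoRA only the first time it is requested, then just switching the active adapter. A sketch of that pattern (registry contents, repo IDs, and adapter names are illustrative, not the Space's actual code):

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical adapter registry; the real Space maintains its own list.
LORA_REGISTRY = {
    "object-remover": "telcom/Qwen-Image-Edit-2511-Object-Remover",
    "object-adder": "telcom/Qwen-Image-Edit-2511-Object-Adder",
}
_loaded: set[str] = set()

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16
).to("cuda")

def run_with_lora(name: str, **kwargs):
    """Fetch and attach the requested LoRA on first use, then reuse it on later calls."""
    if name not in _loaded:
        pipe.load_lora_weights(LORA_REGISTRY[name], adapter_name=name)
        _loaded.add(name)
    pipe.set_adapters([name])  # activate only this adapter for the current call
    return pipe(**kwargs).images[0]
```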
NVIDIA’s Groq deal ... I think inference efficiency is becoming the main driver of profitability, and NVIDIA’s Groq deal is evidence the market is moving from “who can train biggest” to “who can serve cheapest and fastest at scale.” That points to a maturing phase of AI, not necessarily the end of a bubble, but definitely a correction in what “wins” long-term. What do you think?
CIFAR-10, your handy image dataset ... CIFAR-10 is a small, standard computer-vision dataset used to quickly test and compare ideas.
- 60,000 color images, each 32×32 pixels, labeled into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
- Label mapping (important):
  - 0 airplane
  - 1 automobile
  - 2 bird
  - 3 cat
  - 4 deer
  - 5 dog
  - 6 frog
  - 7 horse
  - 8 ship
  - 9 truck
- Split: 50,000 train and 10,000 test.
- Why people use it: fast benchmarking for image classifiers (small CNNs, ResNet, ViT), and quick experiments for training pipelines, augmentation, regularization, pruning, distillation, and demos.
- Sizes (downloads): Python version about 163 MB, binary about 162 MB. Hugging Face shows about 144 MB for the dataset files.
- Where to get it: the official CIFAR page (University of Toronto) and the Hugging Face CIFAR-10 dataset page (uoft-cs/cifar10); a quick loading snippet follows the table below.

If you want something more, check the table below:

| Dataset | Resolution | Classes | Best For |
|---|---|---|---|
| ImageNet-1K | 224–256×256 | 1000 | Real-world large-scale classification |
| ImageNet-256 | 256×256 | 1000 | Direct high-res training |
| TinyImageNet | 64×64 | 200 | Mid-range benchmark |
| UC Merced Land Use | 256×256 | ~21 | Higher-resolution small classification |
| MS COCO | >256×256 | ~80 objects | Detection / segmentation |
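Here is the promised loading snippet using the datasets library and the uoft-cs/cifar10 ID mentioned above (the img/label column names follow the Hub dataset card; adjust if yours differ):

```python
from datasets import load_dataset

# Loads the 50k/10k train/test split; images come back as PIL objects.
ds = load_dataset("uoft-cs/cifar10")
print(ds)  # DatasetDict with 'train' (50,000) and 'test' (10,000)

example = ds["train"][0]
label_names = ds["train"].features["label"].names  # ['airplane', 'automobile', ...]
print(example["img"].size, label_names[example["label"]])
```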
To endorse another user to submit to the cs.AI (Artificial Intelligence) subject class, an arXiv submitter must have submitted 3 papers to any of cs.AI, cs.AR, cs.CC, cs.CE, cs.CG, cs.CL, cs.CR, cs.CV, cs.CY, cs.DB, cs.DC, cs.DL, cs.DM, cs.DS, cs.ET, cs.FL, cs.GL, cs.GR, cs.GT, cs.HC, cs.IR, cs.IT, cs.LG, cs.LO, cs.MA, cs.MM, cs.MS, cs.NA, cs.NE, cs.NI, cs.OH, cs.OS, cs.PF, cs.PL, cs.RO, cs.SC, cs.SD, cs.SE, cs.SI or cs.SY earlier than three months ago and less than five years ago.
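Put differently, each qualifying paper must have been submitted more than three months ago and less than five years ago, and you need at least three of them. A toy eligibility check, purely illustrative (months and years are approximated in days; arXiv applies its own exact rules):

```python
from datetime import date, timedelta

def counts_toward_endorsement(submitted: date, today: date | None = None) -> bool:
    """A paper counts if it was submitted more than ~3 months ago and less than ~5 years ago."""
    today = today or date.today()
    age = today - submitted
    return timedelta(days=91) < age < timedelta(days=5 * 365)

# Hypothetical submission dates for one author's cs.* papers
papers = [date(2024, 1, 10), date(2025, 6, 1), date(2019, 3, 2)]
eligible = sum(counts_toward_endorsement(d) for d in papers) >= 3
print("Can endorse for cs.AI:", eligible)
```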
Introducing demos for new SOTA models from AI2: SAGE-MM (Smart Any-Horizon Agents for Long-Video Reasoning) and Molmo-2, an open vision-language model that supports multi-image (QA and pointing) and video (QA, pointing, and tracking). The respective demo-related collections are listed below. 🎃🔥
Introducing TRELLIS.2 Text-to-3D. The demo for the TRELLIS.2-4B (Image-to-3D) model is streamlined with the Z-Image Turbo image generation model to enable Text-to-3D functionality. There is no need for input assets, making a small leap forward for ideation. Optionally, it also includes default support for Image-to-3D inference using direct image assets. Find the demo and related collections below... 🤗🔥
The Molmo2 demo is now live on Hugging Face, including Single/Multi-Image VQA, Visual Pointing/Grounding, Video VQA, and Video Point Tracking. Find the demo and related collections below. 🔥🤗
Introducing the Z Image Turbo LoRA DLC App, a gallery space for plug-and-play Z-Image-Turbo LoRAs. It features a curated collection of impressive LoRAs for generating high-quality images. By default, it runs on the base model. Simply choose a LoRA, type your prompt, and generate images. You can find the app and more details below. 🤗🧪
Recently I was playing with my model. What is your idea about "unlearning"? I need it 😀 telcom/deewaiREALCN: I have the original on the main branch, and the trained versions "cp550" and "n_680" on another branch, both trained on telcom/deewaiREALCN-training. I got three results for the prompt: "Athlete portrait, 26-year-old woman, post-training sweat, gym ambient light, chalk dust particles, intense gaze, crisp detail." Apparently, the model is sensitive to the word "old". You can see that training on more faces improved over main; however, it is still not ideal... I am now working on unlearning. I would like to hear your opinion. #unlearning
Introducing the D.Markdown Experimental Models, Proxima and Epsilon, OCR models built on top of Qwen3-VL and Qwen2.5-VL respectively. Proxima is optimized for Markdown generation and is capable of embedding inline programming code snippets and generating rich structured content such as HTML, XML, JSON, and YAML. Epsilon is optimized for reconstructing complex layouts, including tables, forms, and mathematical content. 🌌✨
Try the CUA GUI Operator 🖥️ Space, a demo that brings several interesting ultra-compact multimodal Computer Use Agent (CUA) models into a single app, including Fara-7B, UI-TARS-1.5-7B, and Holo models, to perform GUI localization tasks.
I plan to add Chrome sandboxes to streamline it and turn it into a browser-based multimodal CUA tool, which will be added to the same Space soon.
To learn more, visit the app page or the respective model pages!
One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. It performs vision (image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN): the demo is live. 🗣️🔥