---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---
# MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation
MultiDiffSense is a ControlNet-based diffusion model that generates realistic, physically grounded tactile sensor images, dual-conditioned on rendered depth maps of 3D objects and text prompts. From a depth map and a prompt, it produces tactile outputs across three sensor modalities.
## Model Details

| | |
|---|---|
| Architecture | ControlNet built on Stable Diffusion 1.5 |
| Task | Depth map + text prompt to multi-modal tactile sensor image generation |
| Input | 512x512 depth map (viridis colourmap) + text prompt |
| Output | 512x512 tactile sensor image |
| Training | ~150 epochs, frozen SD backbone, lr=1e-5, batch size 8 |
| Parameters | ~860M (SD 1.5) + ~360M (ControlNet copy) |
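The conditioning input is a 512x512 depth map rendered with the viridis colourmap. A minimal sketch of preparing such an input from a raw depth array; the file names and the min-max normalisation scheme here are assumptions, not part of the released pipeline:

```python
import numpy as np
from matplotlib import cm
from PIL import Image

# Hypothetical raw depth array, e.g. exported from a renderer of the 3D object.
depth = np.load("depth_raw.npy")  # assumed shape (H, W), float

# Min-max normalise to [0, 1] before applying the colourmap (assumed scheme).
d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)

# Apply the viridis colourmap and drop the alpha channel.
rgb = (cm.viridis(d)[..., :3] * 255).astype(np.uint8)

# Resize to the 512x512 resolution the model expects.
Image.fromarray(rgb).resize((512, 512)).save("depth_map.png")
```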
## Supported Tactile Sensor Modalities

| Sensor | Description |
|---|---|
| TacTip | Optical tactile sensor with pin-based deformation markers |
| ViTac | Vision-based tactile sensor (no markers) |
| ViTacTip | Combined vision-tactile sensor |
## Files

| File | Description |
|---|---|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |
## Usage

Clone the GitHub repository, follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt
```

```bash
# Single depth map:
python multidiffsense/controlnet/generate.py \
  --source_image path/to/depth_map.png \
  --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'
```
```bash
# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
  --dataset_dir datasets \
  --prompt_json datasets/test/prompt_ViTacTip.json
```
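Outside the repo tooling, a ControlNet checkpoint of this kind can typically be loaded with Hugging Face diffusers once converted to diffusers format. The sketch below assumes that conversion has been done; the local path `./multidiffsense-controlnet` and the sampler settings are assumptions, not part of the official instructions:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Assumes multidiffsense.ckpt has been converted to diffusers format;
# this local path is hypothetical.
controlnet = ControlNetModel.from_pretrained(
    "./multidiffsense-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # frozen SD 1.5 backbone
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = load_image("depth_map.png")  # 512x512 viridis depth map
image = pipe(
    prompt='{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", '
           '"object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}',
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("tactile_out.png")
```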
See the GitHub repository for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.
## Citation

```bibtex
@inproceedings{multidiffsense2026,
  title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
  author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.19348}
}
```
## License

MIT


