---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---
<h1 align="center">MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation</h1>
<p align="center">
<a href="https://github.com/sirine-b/MultiDiffSense"><img src="https://img.shields.io/badge/Code-GitHub-black?logo=github" alt="GitHub"></a>
<a href="https://arxiv.org/abs/2602.19348"><img src="https://img.shields.io/badge/Paper-ICRA%202026-blue" alt="Paper"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-green" alt="License"></a>
</p>
MultiDiffSense is a **ControlNet-based diffusion model** that generates realistic, physically grounded tactile sensor images. Dual-conditioned on rendered depth maps of 3D objects and structured text prompts, it translates a single depth map into outputs across three tactile sensor modalities.
## Model Details
| | |
|---|---|
| **Architecture** | ControlNet built on Stable Diffusion 1.5 |
| **Task** | Depth map + text prompt → multi-modal tactile sensor image generation |
| **Input** | 512x512 depth map (viridis colourmap) + text prompt |
| **Output** | 512x512 tactile sensor image |
| **Training** | ~150 epochs, frozen SD backbone, lr=1e-5, batch size 8 |
| **Parameters** | ~860M (SD 1.5) + ~360M (ControlNet copy) |
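The conditioning input is a 512×512 depth map rendered with the viridis colourmap. As a minimal sketch (assuming a raw single-channel depth array and that matplotlib is available; `depth_to_viridis` is a hypothetical helper, not part of the repository), the input can be prepared like this:

```python
import numpy as np
import matplotlib.cm as cm
from PIL import Image

def depth_to_viridis(depth, size=(512, 512)):
    """Normalise a raw depth array to [0, 1] and render it with the
    viridis colourmap as an RGB image of the given size."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (np.ptp(d) + 1e-8)                 # normalise to [0, 1]
    rgb = (cm.viridis(d)[..., :3] * 255).astype(np.uint8)  # drop alpha channel
    return Image.fromarray(rgb).resize(size, Image.BILINEAR)

depth = np.random.rand(240, 320)  # placeholder for a rendered depth map
img = depth_to_viridis(depth)
img.save("depth_map.png")         # conditioning image for generation
```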
## Supported Tactile Sensor Modalities
<table>
<thead>
<tr>
<th>Sensor</th>
<th>Description</th>
<th>Image Example</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>TacTip</strong></td>
<td>Optical tactile sensor with pin-based deformation markers</td>
<td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/HFLb9F7xYiNmlfQAkh3KO.png" width="120"/></td>
</tr>
<tr>
<td><strong>ViTac</strong></td>
<td>Vision-based tactile sensor (no markers)</td>
<td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/2R9-qwRVSl6UUpXdl-6HC.png" width="120"/></td>
</tr>
<tr>
<td><strong>ViTacTip</strong></td>
<td>Combined vision-tactile sensor</td>
<td><img src="https://cdn-uploads.huggingface.co/production/uploads/67b5f6b9abfba5ff6dd1b645/24s4nbM-Vx9vrAIONOqUI.png" width="120"/></td>
</tr>
</tbody>
</table>
## Files
| File | Description |
|------|-------------|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |
## Usage
Clone the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:
```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt
# Single depth map:
python multidiffsense/controlnet/generate.py \
--source_image path/to/depth_map.png \
--prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'
# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
--dataset_dir datasets \
--prompt_json datasets/test/prompt_ViTacTip.json
```
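The `--prompt` argument is a JSON string combining a sensor-context description with the object's contact pose. A minimal sketch of building it programmatically (`make_prompt` is a hypothetical helper; the field names follow the example above):

```python
import json

def make_prompt(sensor_context, x, y, z, yaw):
    """Serialise the dual-conditioning prompt: sensor context text
    plus the object's contact pose, as a JSON string."""
    return json.dumps({
        "sensor_context": sensor_context,
        "object_pose": {"x": x, "y": y, "z": z, "yaw": yaw},
    })

prompt = make_prompt(
    "captured by a high-resolution vision only sensor ViTac.",
    x=0.12, y=-0.34, z=1.5, yaw=15.0,
)
print(prompt)
```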
See the [GitHub repository](https://github.com/sirine-b/MultiDiffSense) for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.
## Citation
```bibtex
@inproceedings{multidiffsense2026,
title = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
author = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2026},
url = {https://arxiv.org/abs/2602.19348}
}
```
## License
MIT