---
license: mit
language:
- en
tags:
- tactile-sensing
- controlnet
- stable-diffusion
- depth-to-tactile
- image-generation
- robotics
- multi-modal
- diffusion
- ICRA
pipeline_tag: image-to-image
library_name: pytorch
---
# MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation
MultiDiffSense is a ControlNet-based diffusion model that generates realistic, physically grounded tactile sensor images, dual-conditioned on rendered depth maps of 3D objects and text prompts. From a depth map and a prompt, it produces tactile outputs across three sensor modalities.
## Model Details

| | |
|---|---|
| Architecture | ControlNet built on Stable Diffusion 1.5 |
| Task | Depth map + text prompt to multi-modal tactile sensor image generation |
| Input | 512x512 depth map (viridis colourmap) + text prompt |
| Output | 512x512 tactile sensor image |
| Training | ~150 epochs, frozen SD backbone, lr=1e-5, batch size 8 |
| Parameters | ~860M (SD 1.5) + ~360M (ControlNet copy) |
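The conditioning input is a 512x512 depth map rendered with the viridis colourmap. A minimal sketch of preparing such an input from a raw depth array; the file names and the min-max normalisation scheme here are assumptions, not part of the released pipeline:

```python
import numpy as np
from matplotlib import cm
from PIL import Image

# Hypothetical raw depth array, e.g. exported from a renderer of the 3D object.
depth = np.load("depth_raw.npy")  # assumed shape (H, W), float

# Min-max normalise to [0, 1] before applying the colourmap (assumed scheme).
d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)

# Apply the viridis colourmap and drop the alpha channel.
rgb = (cm.viridis(d)[..., :3] * 255).astype(np.uint8)

# Resize to the 512x512 resolution the model expects.
Image.fromarray(rgb).resize((512, 512)).save("depth_map.png")
```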
## Supported Tactile Sensor Modalities

| Sensor | Description |
|---|---|
| TacTip | Optical tactile sensor with pin-based deformation markers |
| ViTac | Vision-based tactile sensor (no markers) |
| ViTacTip | Combined vision-tactile sensor |
## Files

| File | Description |
|---|---|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |
## Usage

Clone the GitHub repository, follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt
```

```bash
# Single depth map:
python multidiffsense/controlnet/generate.py \
  --source_image path/to/depth_map.png \
  --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'
```
```bash
# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
  --dataset_dir datasets \
  --prompt_json datasets/test/prompt_ViTacTip.json
```
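Outside the repo tooling, a ControlNet checkpoint of this kind can typically be loaded with Hugging Face diffusers once converted to diffusers format. The sketch below assumes that conversion has been done; the local path `./multidiffsense-controlnet` and the sampler settings are assumptions, not part of the official instructions:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Assumes multidiffsense.ckpt has been converted to diffusers format;
# this local path is hypothetical.
controlnet = ControlNetModel.from_pretrained(
    "./multidiffsense-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # frozen SD 1.5 backbone
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = load_image("depth_map.png")  # 512x512 viridis depth map
image = pipe(
    prompt='{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", '
           '"object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}',
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("tactile_out.png")
```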
See the GitHub repository for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.
## Citation

```bibtex
@inproceedings{multidiffsense2026,
  title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
  author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.19348}
}
```
## License

MIT


