# Official models of "MoChat: Joints-Grouped Spatio-Temporal Grounding LLM for Multi-Turn Motion Comprehension and Description"

## Overview
MoChat is a Multimodal Large Language Model (MLLM) for human motion understanding with precise spatio-temporal grounding. Unlike conventional motion analysis systems, MoChat integrates:
- **Motion Understanding**: Performs fundamental motion comprehension and summarization.
- **Spatial Limb Grounding**: Accurately locates the body parts involved in described movements.
- **Temporal Action Grounding**: Precisely identifies the time boundaries corresponding to specific motion descriptions (illustrated in the sketch below).
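A minimal sketch of what a multi-turn exchange covering these three capabilities might look like. The prompts, responses, and frame numbers here are illustrative assumptions, not outputs of the released models; see the codebase linked below for the actual inference interface:

```python
# Illustrative only: prompts, responses, and frame indices are assumed
# for exposition; they are not produced by the released MoChat models.
dialogue = [
    # Motion understanding: summarize the whole sequence.
    ("Describe the motion.",
     "The person raises both arms overhead, then squats down."),
    # Spatial limb grounding: name the body parts performing a sub-action.
    ("Which limbs perform the raising motion?",
     "Both arms (the left and right upper limbs)."),
    # Temporal action grounding: locate the frame span of a sub-action.
    ("When does the squat occur?",
     "Approximately frames 45-90 of the sequence."),
]

for question, answer in dialogue:
    print(f"User: {question}\nMoChat: {answer}\n")
```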

## Models
We provide the following trained models for download (a download sketch follows this list):
- **[Joints-Grouped Skeleton Encoder](https://huggingface.co/CSUBioGroup/MoChat/blob/main/JGSE_epoch120)** for motion sequence representation.
- Two variants of the motion comprehension model:
  - [MoChat](https://huggingface.co/CSUBioGroup/MoChat/tree/main/MoChat): Base model.
  - [MoChat-R](https://huggingface.co/CSUBioGroup/MoChat/tree/main/MoChat-R): Extended model with a regression head.
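A minimal sketch of fetching these checkpoints programmatically with `huggingface_hub` (the repository id comes from the links above; loading the weights into the model architecture is handled by the codebase linked below):

```python
# Sketch: download the released checkpoints from the Hugging Face Hub.
from huggingface_hub import hf_hub_download, snapshot_download

# Joints-Grouped Skeleton Encoder checkpoint (a single file).
jgse_path = hf_hub_download(
    repo_id="CSUBioGroup/MoChat",
    filename="JGSE_epoch120",
)

# MoChat and MoChat-R model directories.
mochat_dir = snapshot_download(
    repo_id="CSUBioGroup/MoChat",
    allow_patterns=["MoChat/*"],
)
mochat_r_dir = snapshot_download(
    repo_id="CSUBioGroup/MoChat",
    allow_patterns=["MoChat-R/*"],
)

print(jgse_path, mochat_dir, mochat_r_dir)
```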

## Resources
- **Codebase**: [GitHub](https://github.com/CSUBioGroup/MoChat)
- **Paper**: [arXiv](https://arxiv.org/abs/2410.11404)