---
license: apache-2.0
pipeline_tag: visual-question-answering
---

# Building Your Own Multimodal Large Model from Scratch

For the Chinese version of this README, please refer to the [Chinese documentation](README_zh.md).

## Model Architecture

In this VLM (Visual Language Model), the vision encoder is a `CLIP` or `SIGLIP` model, both of which already provide preliminary image-text semantic alignment. A two-layer MLP projects the visual features into the language model's embedding space. The `forward` method of `QWenModel` is overridden so that the embeddings of the placeholder `image` tokens are replaced with the projected visual features.
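The two pieces described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's actual code: `IMAGE_TOKEN_ID`, `VisionProjector`, and `merge_image_features` are hypothetical names, and the real placeholder-token id depends on the tokenizer used.

```python
import torch
import torch.nn as nn

# Hypothetical id reserved for the <image> placeholder token; the actual id
# depends on the tokenizer configuration in the repository.
IMAGE_TOKEN_ID = 151646


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(patch_features)


def merge_image_features(input_ids: torch.Tensor,
                         text_embeds: torch.Tensor,
                         image_embeds: torch.Tensor) -> torch.Tensor:
    """Replace embeddings at <image> token positions with visual features.

    Assumes the prompt contains exactly one <image> placeholder token per
    visual patch feature (LLaVA-style token splicing).
    """
    merged = text_embeds.clone()
    mask = input_ids == IMAGE_TOKEN_ID  # (batch, seq) boolean mask
    merged[mask] = image_embeds.reshape(-1, image_embeds.shape[-1]).to(merged.dtype)
    return merged
```

In the actual model, this splice would happen inside the overridden `forward` of `QWenModel`: after the text tokens are embedded but before the transformer layers run, the placeholder positions are overwritten with the projected patch features.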

## GitHub Repository

The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main).

## References

Special thanks to the following projects for their great work:

- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA

## Contact

If you have any questions or ideas, feel free to reach out to me:

hsinyanghuang7@gmail.com

I will respond as soon as I see your email!