---
library_name: transformers
datasets:
- Blinorot/ALARM-Corpora
base_model:
- Qwen/Qwen3-4B-Thinking-2507
---
# Model Card for AL-SSLAM-R

This is a checkpoint for AL-SSLAM-R, an audio-understanding reasoning language model proposed in [ALARM: Audio–Language Alignment for Reasoning Models](https://arxiv.org/abs/2603.09556).

For more details on the model and its usage, please refer to our [GitHub repository](https://github.com/Blinorot/ALARM).
## Inference

We provide [vLLM](https://github.com/vllm-project/vllm) support via the [vLLM Prompt Embedding API](https://docs.vllm.ai/en/stable/features/prompt_embeds/).

Since ALARM keeps the Qwen3 backbone frozen, vLLM serves the original Qwen3 checkpoint, while the ALARM checkpoint is only used to extract the LLM input embeddings.
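To make that split concrete: the feature extractor turns (audio, text) into a sequence of input embeddings, and vLLM generates from those embeddings instead of token IDs. The sketch below is a minimal illustration of that prompt-embedding mechanism, not ALARM code; it assumes a recent vLLM release with `enable_prompt_embeds`, uses the Qwen3 backbone listed on this card, and substitutes a random placeholder tensor for the embeddings that the ALARM feature extractor would normally produce (so the generated text is meaningless).

```python
# Minimal sketch of vLLM's prompt-embedding interface (illustration only).
# Assumptions: a vLLM version that supports enable_prompt_embeds; a hidden size
# of 2560 for Qwen3-4B; `fake_embeds` stands in for the ALARM feature-extractor output.
import torch
from vllm import LLM, SamplingParams

llm = LLM("Qwen/Qwen3-4B-Thinking-2507", enable_prompt_embeds=True)

# Placeholder (seq_len, hidden_size) embeddings; ALARM derives these from audio + text.
fake_embeds = torch.randn(16, 2560, dtype=torch.bfloat16)

outputs = llm.generate(
    {"prompt_embeds": fake_embeds},  # embeddings are passed instead of token IDs
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

In the full pipeline below, `get_response` performs this hand-off internally.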
After you have cloned the repository and installed the dependencies, you can run the pretrained model as follows:
```python
# Import libraries
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # optional

# Run this import before importing torch because generate_vllm sets the multiprocessing method
from generate_vllm import get_response

from src.model.wrapped_llms.qwen3 import Qwen3AudioWrappedFeatureExtractor

from omegaconf import OmegaConf
from torchaudio.utils import _download_asset
from torchcodec.decoders import AudioDecoder
from transformers import AutoTokenizer
from vllm import LLM

# The model configuration.
# Handles vLLM-related configuration and defines the feature extractors,
# i.e., the audio -> encoder input embedding conversion.
# All other configuration, including the model architecture, will be
# loaded from the checkpoint.
default_model_config_name = "src/configs/model/default_inference.yaml"
model_config = OmegaConf.load(default_model_config_name)

# checkpoint_name = which model to run
# Single-model version (no inference-time ensemble):
#   checkpoint_name = "Blinorot/AL-Whisper-Instruct-R"
# ALARM-E embedding-fusion version (inference-time ensemble):
#   checkpoint_name = ["Blinorot/ALARM-CA", "Blinorot/AL-Whisper-Instruct-R"]
checkpoint_name = "Blinorot/AL-SSLAM-R"
device = "cuda"

# Load the tokenizer for text processing
tokenizer = AutoTokenizer.from_pretrained(model_config.llm)

# Load the ALARM/AL-*-R checkpoint(s) for extraction of LLM input embeddings
if isinstance(checkpoint_name, list):  # ALARM-E-style embedding fusion (inference-time ensemble)
    feature_extractor_list = []
    for name in checkpoint_name:
        # Load weights into the (audio, text) -> LLM embeddings converter
        feature_extractor = Qwen3AudioWrappedFeatureExtractor(
            model_config=model_config,
            checkpoint_name=name,
            tokenizer=tokenizer,
        )
        feature_extractor.to(device)
        feature_extractor_list.append(feature_extractor)
    feature_extractor = feature_extractor_list
else:  # single-model version (no inference-time ensemble)
    # Load weights into the (audio, text) -> LLM embeddings converter
    feature_extractor = Qwen3AudioWrappedFeatureExtractor(
        model_config=model_config,
        checkpoint_name=checkpoint_name,
        tokenizer=tokenizer,
    )
    feature_extractor.to(device)

# Start the offline vLLM instance of the original Qwen3 RLM.
# The model will be loaded onto the CUDA_VISIBLE_DEVICES id.
llm = LLM(
    model_config.llm,
    enable_prefix_caching=True,
    max_model_len=model_config.max_model_len,
    max_num_seqs=model_config.max_num_seq,
    max_num_batched_tokens=model_config.max_num_batched_tokens,
    gpu_memory_utilization=model_config.gpu_memory_utilization,
    enable_prompt_embeds=True,
)

# Set the sampling arguments for the RLM
sample = llm.get_default_sampling_params()
sample.seed = model_config.seed
sample.max_tokens = model_config.max_tokens

# Define the audio and the prompt.
# The audio must come from torchcodec's AudioDecoder.
audio_example_path = _download_asset("tutorial-assets/ctc-decoding/1688-142285-0007.wav")
audio = AudioDecoder(audio_example_path)
prompt = "Describe the audio content."

# Define a system prompt
system_prompt = "You are an audio-understanding model."

# Obtain a response from the Audio RLM
response = get_response(
    prompts=[prompt],  # list of all the prompts
    audio_list=[audio],  # list of corresponding audio
    llm=llm,
    feature_extractor=feature_extractor,
    sample=sample,
    tokenizer=tokenizer,
    system_prompt=system_prompt,
    max_thinking_tokens=model_config.max_thinking_tokens,  # controls the thinking budget of the RLM
    debug=False,
)

# `response` is a list of responses, one per (prompt, audio) input pair.
# We have only one input pair, so the final response is at index 0.
response = response[0]
print(f"Model response:\n\n{response}")
```
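Because `get_response` takes parallel lists of prompts and audio clips and returns one response per pair (as the comments above note), running several inputs in one call only requires longer lists. Below is a minimal batching sketch that reuses the `llm`, `feature_extractor`, `sample`, `tokenizer`, `system_prompt`, and `model_config` objects defined above; the `.wav` paths and second prompt are placeholders.

```python
# Batched inference sketch: reuses objects from the example above.
# The .wav paths below are placeholders for your own audio files.
from torchcodec.decoders import AudioDecoder

prompts = [
    "Describe the audio content.",
    "What is the speaker's emotion?",
]
audios = [AudioDecoder(path) for path in ["clip_0.wav", "clip_1.wav"]]

responses = get_response(
    prompts=prompts,
    audio_list=audios,
    llm=llm,
    feature_extractor=feature_extractor,
    sample=sample,
    tokenizer=tokenizer,
    system_prompt=system_prompt,
    max_thinking_tokens=model_config.max_thinking_tokens,
    debug=False,
)

# One response per (prompt, audio) pair, in input order.
for question, answer in zip(prompts, responses):
    print(f"Q: {question}\nA: {answer}\n")
```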
## Citation

If you use this work, please cite:

```bibtex
@article{grinberg2026alarm,
  title={ALARM: Audio-Language Alignment for Reasoning Models},
  author={Grinberg, Petr and Shahmohammadi, Hassan},
  journal={arXiv preprint arXiv:2603.09556},
  year={2026}
}
```
## License

The model checkpoint is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
It may only be used for non-commercial research purposes.