| | --- |
| | license: apache-2.0 |
| | language: |
| | - en |
| | tags: |
| | - moe |
| | - olmo |
| | - olmoe |
| | co2_eq_emissions: 1 |
| | datasets: |
| | - allenai/OLMoE-mix-0924 |
| | library_name: transformers |
| | --- |
| | |
| |
|
| | # OLMoE with Adapters |
| |
|
| | This repository extends the OLMoE model with adapter layers for parameter-efficient fine-tuning. Small adapter modules are added to the model and trained on downstream tasks while most of the original parameters stay frozen, so only a small fraction of the weights needs gradients and optimizer state. |
| |
|
| | ## Model Architecture |
| |
|
| | The `OlmoEWithAdaptersForCausalLM` model extends the original OLMoE architecture by: |
| |
|
| | 1. Adding small adapter layers (bottleneck layers) to each MLP block |
| | 2. Allowing selective freezing of the base model's parameters |
| | 3. Training only the adapter parameters (~0.1-1% of total parameters) |
| |
|
| | Key components: |
| | - `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules |
| | - `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs |
| | - `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers |
| | - `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters |
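
To make the idea of a bottleneck adapter concrete, here is a minimal PyTorch sketch. It is not the repository's implementation: the class names `BottleneckAdapter` and `MLPWithAdapter`, the GELU nonlinearity, the zero initialization, and the rule of matching parameter names on the substring `adapter` are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """Hypothetical adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden_size: int, adapter_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_size, hidden_size)
        # Assumption: zero-init the up-projection so the adapter starts as identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class MLPWithAdapter(nn.Module):
    """Stand-in for an MLP block followed by an adapter (not the repository's class)."""

    def __init__(self, hidden_size: int, intermediate_size: int, adapter_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.adapter = BottleneckAdapter(hidden_size, adapter_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.mlp(x))


def freeze_all_but_adapters(model: nn.Module) -> None:
    """Keep only parameters whose name contains 'adapter' trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name


# Toy sizes, far smaller than a real model, just to show the bookkeeping.
block = MLPWithAdapter(hidden_size=256, intermediate_size=1024, adapter_size=16)
freeze_all_but_adapters(block)

trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```

With the zero-initialized up-projection, the block initially produces the same output as the MLP without an adapter, so fine-tuning starts from the base model's behaviour. In a full model the trainable share is much smaller than in this toy block, because attention and embedding weights are frozen as well.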
| |
|
| | ## Training Script |
| |
|
| | The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model: |
| |
|
| | ### Features: |
| | - Parameter-efficient fine-tuning using adapters |
| | - Support for various datasets through the Hugging Face `datasets` library |
| | - Customizable adapter size |
| | - Option to freeze/unfreeze different components |
| | - Training with AdamW optimizer and learning rate scheduling |
| | - Evaluation with perplexity metrics |
| | - Model checkpointing and saving |
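
The perplexity metric listed above is commonly computed as the exponential of the mean per-token negative log-likelihood. The sketch below shows that calculation in plain Python; the function name `perplexity` and the toy input are illustrative, and the script's own evaluation code may differ in detail.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy case: a model that assigns probability 0.25 to every target token.
nlls = [-math.log(0.25)] * 8
print(perplexity(nlls))
```

A model that is uniformly uncertain among four options per token has a perplexity of about 4, which is why lower values indicate a better fit to the evaluation text.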
| |
|
| | ### Usage: |
| |
|
| | ```bash |
| | python train_olmoe_adapters.py \ |
| | --model_name_or_path allenai/OLMo-7B \ |
| | --adapter_size 64 \ |
| | --freeze_base_model True \ |
| | --dataset_name wikitext \ |
| | --dataset_config_name wikitext-2-raw-v1 \ |
| | --output_dir ./olmoe-adapter-finetuned \ |
| | --num_train_epochs 3 \ |
| | --per_device_train_batch_size 4 \ |
| | --per_device_eval_batch_size 4 \ |
| | --learning_rate 5e-5 \ |
| | --warmup_steps 100 \ |
| | --logging_steps 100 \ |
| | --save_steps 1000 \ |
| | --seed 42 |
| | ``` |
| |
|
| | ## Benefits of Adapter-Based Fine-Tuning |
| |
|
| | 1. **Efficiency**: Only ~0.1-1% of the parameters are trained, so gradients and optimizer state are kept for that small fraction only, which reduces GPU memory use during training |
| | 2. **Storage**: Store only adapter weights rather than full fine-tuned models |
| | 3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time |
| | 4. **Reduced Overfitting**: Lower parameter count helps prevent overfitting on small datasets |
| |
|
| | ## How to Use the Fine-Tuned Model |
| |
|
| | ```python |
| | from transformers import AutoTokenizer |
| | # modeling_olmoe.py from this repository must be importable (e.g. in the working directory) |
| | from modeling_olmoe import OlmoEWithAdaptersForCausalLM |
| | |
| | # Load the fine-tuned model and its tokenizer from the training output directory |
| | model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned") |
| | tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned") |
| | |
| | # Generate text |
| | inputs = tokenizer("Once upon a time", return_tensors="pt") |
| | outputs = model.generate(**inputs, max_length=50) |
| | print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| | ``` |
| |
|
| | ## Adapter Size Recommendations |
| |
|
| | The adapter size determines the parameter efficiency vs. performance trade-off: |
| |
|
| | - **Small datasets**: 16-32 dimensions |
| | - **Medium datasets**: 64-128 dimensions |
| | - **Large datasets**: 128-256 dimensions |
| |
|
| | For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance. |
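
To see how adapter size translates into trainable parameters, the following sketch counts the weights and biases of a two-layer bottleneck (down-projection plus up-projection). The hidden size, the number of adapters, and the base parameter count are illustrative assumptions at roughly the scale of a 7B dense model with one adapter per layer; they are not read from this repository's configuration.

```python
def adapter_param_count(hidden_size: int, adapter_size: int, num_adapters: int) -> int:
    """Weights and biases of a down-projection plus an up-projection, per adapter."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return (down + up) * num_adapters


# Illustrative assumptions, not values taken from the model config.
HIDDEN_SIZE = 4096
NUM_ADAPTERS = 32            # e.g. one adapter per decoder layer
BASE_PARAMS = 7_000_000_000  # approximate size of the frozen base model

for size in (16, 64, 256):
    n = adapter_param_count(HIDDEN_SIZE, size, NUM_ADAPTERS)
    print(f"adapter_size={size:>3}: {n:,} params ({100 * n / BASE_PARAMS:.3f}% of base)")
```

The count grows linearly with adapter size, so doubling the bottleneck roughly doubles both the trainable parameters and the size of the stored adapter weights. Placing an adapter in every expert of a mixture-of-experts layer would multiply the count accordingly.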