Charlie81
/

SkipMoE

Text Generation

Mixture of Experts

chengyanwu

stuff

ccda2ec 9 months ago

	---
	license: apache-2.0
	language:
	- en
	tags:
	- moe
	- olmo
	- olmoe
	co2_eq_emissions: 1
	datasets:
	- allenai/OLMoE-mix-0924
	library_name: transformers
	---


	# OLMoE with Adapters

	This repository extends the OLMoE model with adapter layers for parameter-efficient fine-tuning. Small adapter modules are added to the model and trained on downstream tasks while most of the original parameters stay frozen, so only a small fraction of the weights needs gradients and optimizer state.

	## Model Architecture

	The `OlmoEWithAdaptersForCausalLM` model extends the original OLMoE architecture by:

	1. Adding small adapter layers (bottleneck layers) to each MLP block
	2. Allowing selective freezing of the base model's parameters
	3. Training only the adapter parameters (~0.1-1% of total parameters)

	Key components:
	- `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules
	- `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs
	- `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers
	- `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters
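
To make the bottleneck idea concrete, here is a minimal PyTorch sketch of an adapter and of freezing everything except the adapter parameters. It is not the repository's `OlmoEWithAdaptersMLP`. The class `BottleneckAdapter`, the helper `freeze_all_but_adapters`, the GELU activation, the zero initialization, and the assumption that adapter parameter names contain the substring `adapter` are all illustrative choices.

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """Hypothetical adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden_size: int, adapter_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_size, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))


def freeze_all_but_adapters(model: nn.Module, marker: str = "adapter") -> float:
    """Freeze parameters whose name lacks `marker`; return the trainable fraction."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = marker in name
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total


# Toy stand-in for one MLP block plus its adapter (sizes are arbitrary).
block = nn.ModuleDict({
    "mlp": nn.Linear(512, 512),
    "adapter": BottleneckAdapter(hidden_size=512, adapter_size=16),
})
fraction = freeze_all_but_adapters(block)
print(f"trainable fraction: {fraction:.3%}")
```

In this toy block the adapter is large relative to the single linear layer. In a full model the frozen attention, expert, and embedding weights dominate, which is how the trainable share can fall into the ~0.1-1% range quoted above.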

	## Training Script

	The `train_olmoe_adapters.py` script provides a complete workflow for fine-tuning the model:

	### Features:
	- Parameter-efficient fine-tuning using adapters
	- Support for various datasets through Hugging Face datasets library
	- Customizable adapter size
	- Option to freeze/unfreeze different components
	- Training with AdamW optimizer and learning rate scheduling
	- Evaluation with perplexity metrics
	- Model checkpointing and saving
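
The evaluation code itself is not reproduced in this card. As a rough sketch, perplexity is conventionally the exponential of the mean per-token negative log-likelihood (in nats). The function name `perplexity` and the batch numbers below are made up for illustration.

```python
import math


def perplexity(total_nll: float, num_tokens: int) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll / num_tokens)


# Accumulate the summed loss and the token count over evaluation batches,
# then exponentiate once at the end (averaging per-batch perplexities
# would weight short batches too heavily).
batches = [(250.0, 100), (130.0, 50)]  # (summed NLL, token count), illustrative
total_nll = sum(nll for nll, _ in batches)
total_tokens = sum(count for _, count in batches)
ppl = perplexity(total_nll, total_tokens)
print(f"perplexity: {ppl:.2f}")
```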

	### Usage:

	```bash
	python train_olmoe_adapters.py \
	--model_name_or_path allenai/OLMo-7B \
	--adapter_size 64 \
	--freeze_base_model True \
	--dataset_name wikitext \
	--dataset_config_name wikitext-2-raw-v1 \
	--output_dir ./olmoe-adapter-finetuned \
	--num_train_epochs 3 \
	--per_device_train_batch_size 4 \
	--per_device_eval_batch_size 4 \
	--learning_rate 5e-5 \
	--warmup_steps 100 \
	--logging_steps 100 \
	--save_steps 1000 \
	--seed 42
	```
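
The command above sets `--learning_rate 5e-5` and `--warmup_steps 100`, but this card does not say which schedule the script uses. The sketch below assumes a linear warmup followed by a linear decay to zero, the same shape as `get_linear_schedule_with_warmup` in `transformers`. The function name `linear_warmup_decay` and the total of 1000 steps are hypothetical.

```python
def linear_warmup_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the usage example; total_steps is an assumed figure.
BASE_LR = 5e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 1000

for step in (0, 50, 100, 550, 1000):
    lr = linear_warmup_decay(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>4}: lr = {lr:.2e}")
```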

	## Benefits of Adapter-Based Fine-Tuning

	1. Efficiency: Only ~0.1-1% of the parameters are trained, so gradients and optimizer state are kept for the adapters alone, which lowers GPU memory use during training
	2. Storage: Only the adapter weights need to be stored per task, not a full copy of the fine-tuned model
	3. Composability: Separate adapters can be trained for different tasks and swapped at inference time
	4. Reduced Overfitting: The small number of trainable parameters can help limit overfitting on small datasets
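
The storage and composability points can be sketched with plain PyTorch state dicts. This is an illustration only: the helper `adapter_state_dict`, the toy model, and the assumption that adapter parameter names contain `adapter` are not taken from this repository.

```python
import torch
from torch import nn


def adapter_state_dict(model: nn.Module, marker: str = "adapter") -> dict:
    """Keep only the state-dict entries whose key contains `marker`."""
    return {k: v.clone() for k, v in model.state_dict().items() if marker in k}


def make_toy_model() -> nn.ModuleDict:
    # Stand-in for a frozen base layer plus one adapter layer.
    return nn.ModuleDict({"mlp": nn.Linear(8, 8), "adapter": nn.Linear(8, 8)})


# Extract the adapter weights from a (notionally fine-tuned) model.
task_model = make_toy_model()
adapter_weights = adapter_state_dict(task_model)

# Swap them into another copy of the base model. strict=False tolerates
# the base-model keys that are absent from the adapter-only dict.
fresh_model = make_toy_model()
result = fresh_model.load_state_dict(adapter_weights, strict=False)
print("loaded keys:", sorted(adapter_weights))
print("keys left untouched:", sorted(result.missing_keys))
```

Checking `result.unexpected_keys` after loading is a cheap way to catch an adapter file that does not match the model it is loaded into.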

	## How to Use the Fine-Tuned Model

	```python
	from transformers import AutoTokenizer
	from modeling_olmoe import OlmoEWithAdaptersForCausalLM  # local module from this repository

	# Load the fine-tuned model and the tokenizer from the same directory
	model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
	tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

	# Generate text
	inputs = tokenizer("Once upon a time", return_tensors="pt")
	outputs = model.generate(**inputs, max_length=50)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Adapter Size Recommendations

	The adapter size (the bottleneck dimension) sets the trade-off between parameter efficiency and performance:

	- Small datasets: 16-32 dimensions
	- Medium datasets: 64-128 dimensions
	- Large datasets: 128-256 dimensions

	For most fine-tuning scenarios, an adapter size of 64 is a reasonable starting point that balances efficiency and performance.
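
To see how the adapter size drives the parameter count, the sketch below counts the weights and biases of a two-layer bottleneck (down-projection plus up-projection). The hidden size of 2048, the count of 16 adapters, and the function name `adapter_param_count` are illustrative assumptions, not values read from this repository's config.

```python
def adapter_param_count(hidden_size: int, adapter_size: int, num_adapters: int = 1) -> int:
    """Parameters in `num_adapters` bottleneck adapters, biases included."""
    down = hidden_size * adapter_size + adapter_size  # down-projection weight + bias
    up = adapter_size * hidden_size + hidden_size     # up-projection weight + bias
    return num_adapters * (down + up)


HIDDEN_SIZE = 2048  # assumed for illustration
NUM_ADAPTERS = 16   # assumed: one adapter per decoder layer

for size in (16, 64, 256):
    count = adapter_param_count(HIDDEN_SIZE, size, NUM_ADAPTERS)
    print(f"adapter_size={size:>3}: {count:,} adapter parameters")
```

The count grows almost linearly with the adapter size, so moving from 64 to 128 roughly doubles the number of trainable parameters.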