---
language: en
license: mit
tags:
- code
- git
- commit-message
- qwen2
- lora
datasets:
- bigcode/commitpackft
---

# Git Commit Message Generator

Fine-tuned Qwen-0.5B model for generating professional Git commit messages from code diffs.

## Model Description

This model was fine-tuned using LoRA (Low-Rank Adaptation) on the CommitPackFT dataset to generate concise, professional commit messages from git diffs.

- **Base Model**: Qwen-0.5B
- **Fine-tuning Method**: LoRA (r=16, alpha=32)
- **Training Data**: 55K filtered commits from CommitPackFT
- **Languages**: Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more

## Intended Use

Generate commit messages for staged changes in a Git repository.

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "rajtiwariee/auto-commit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Prepare your diff
diff = """
Diff:
File: src/auth.py
Language: Python

Old content:
def login(username, password):
    user = get_user(username)
    if user.password == password:
        return True
    return False

New content:
def login(username, password):
    user = get_user(username)
    if user and user.password == password:
        return True
    return False
"""

# Generate commit message
prompt = f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=False,  # Greedy decoding, so the output is deterministic
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
# Example output: "Check for user existence before accessing password"
```
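
The prompt in the example above follows a fixed layout: an instruction line, a `Diff:` block with file name, language, old content, and new content, and a trailing `Commit message:` cue. A minimal sketch of a helper that assembles that layout from its parts is shown below. The function name `build_prompt` and its parameters are hypothetical and not part of the model or the CLI; only the layout is copied from the example, and how the model behaves when the layout changes is not documented here.

```python
def build_prompt(path: str, language: str, old: str, new: str) -> str:
    """Assemble a prompt in the layout used by the Quick Start example.

    Hypothetical helper: only the layout is taken from the model card.
    """
    diff = (
        "\nDiff:\n"
        f"File: {path}\n"
        f"Language: {language}\n"
        "\n"
        # Strip only surrounding newlines so code indentation is preserved.
        f"Old content:\n{old.strip(chr(10))}\n"
        "\n"
        f"New content:\n{new.strip(chr(10))}\n"
    )
    return f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"


prompt = build_prompt(
    path="src/auth.py",
    language="Python",
    old="if user.password == password:\n    return True",
    new="if user and user.password == password:\n    return True",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written `prompt` from the Quick Start.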

### CLI Tool

For easier usage, install the companion CLI tool from the [GitHub repository](https://github.com/rajtiwariee/GitCommitGenerator). Run the following from the root of a local clone:

```bash
pip install -e .
commit-gen generate --commit
```
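
The CLI's internals are not described in this card, so the following is only a sketch of one way to collect the inputs the Quick Start prompt needs (file path, language, old content, new content) from staged changes. It relies on standard Git commands: `git diff --cached --name-only` lists staged paths, `git show HEAD:<path>` prints the last committed version, and `git show :<path>` prints the staged version. The helper names and the suffix-to-language table are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

# Illustrative mapping only; extend it for the languages you work with.
LANGUAGE_BY_SUFFIX = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript",
    ".java": "Java", ".cpp": "C++", ".go": "Go", ".rs": "Rust",
}


def guess_language(path: str) -> str:
    """Map a file suffix to a language name, or "Unknown" if unmapped."""
    return LANGUAGE_BY_SUFFIX.get(Path(path).suffix, "Unknown")


def git(*args: str) -> subprocess.CompletedProcess:
    """Run a git command in the current directory and capture its output."""
    return subprocess.run(["git", *args], capture_output=True, text=True)


def staged_changes() -> list[dict]:
    """Return path, language, old and new content for each staged file."""
    listing = git("diff", "--cached", "--name-only")
    listing.check_returncode()
    changes = []
    for path in listing.stdout.splitlines():
        old = git("show", f"HEAD:{path}")  # fails for newly added files
        new = git("show", f":{path}")      # fails for staged deletions
        changes.append({
            "path": path,
            "language": guess_language(path),
            "old": old.stdout if old.returncode == 0 else "",
            "new": new.stdout if new.returncode == 0 else "",
        })
    return changes
```

Calling `staged_changes()` inside a repository with staged files yields one dictionary per file. Binary files and renames would need extra handling that this sketch leaves out.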

## Training Details

### Training Data

- **Dataset**: CommitPackFT (filtered subset)
- **Training samples**: 55,730
- **Validation samples**: 6,966
- **Test samples**: 6,967
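
As a quick sanity check on these numbers, the three splits sum to 69,663 samples and correspond to roughly an 80/10/10 division. The snippet below only reproduces that arithmetic; how the split was actually drawn (random seed, stratification) is not stated in this card.

```python
# Split sizes as listed in the Training Data section.
splits = {"train": 55_730, "validation": 6_966, "test": 6_967}

total = sum(splits.values())
print(f"total: {total}")
for name, count in splits.items():
    # Share of each split, printed as a percentage with one decimal.
    print(f"{name}: {count / total:.1%}")
```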

### Training Procedure

- **Epochs**: 3
- **Batch Size**: 4 (effective batch size: 32 with gradient accumulation)
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **LoRA Config**:
  - r: 16
  - alpha: 32
  - dropout: 0.05
  - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
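
To make these settings concrete, the sketch below collects the LoRA values in a plain dictionary and derives a few quantities from them. The keyword names follow the `LoraConfig` class in the `peft` library (`r`, `lora_alpha`, `lora_dropout`, `target_modules`), but this card does not say which training framework was used, so treat that mapping as an assumption. The accumulation factor of 8 is likewise inferred from 32 / 4 on a single GPU, and the step counts assume the final partial batch of each epoch still triggers an optimizer step.

```python
import math

# LoRA settings from the list above, keyed by peft.LoraConfig argument names
# (assumed mapping; the training script itself is not part of this card).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

per_device_batch = 4
effective_batch = 32
grad_accum_steps = effective_batch // per_device_batch  # inferred, not stated

train_samples = 55_730  # from the Training Data section
epochs = 3
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(lora_scaling, grad_accum_steps, steps_per_epoch, total_steps)
```

Under those assumptions the run comes to about 1,742 optimizer steps per epoch and 5,226 in total.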

### Hardware

- **GPU**: NVIDIA Tesla T4 (16GB)
- **Precision**: Mixed Precision (FP32 weights + FP16 compute)
- **Training Time**: ~7.5 hours

## Evaluation Results

- **BLEU Score**: 0.0244
- **ROUGE-1**: 0.1968
- **ROUGE-2**: 0.0420
- **ROUGE-L**: 0.1816
- **Exact Match Rate**: 0.00%
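
For readers unfamiliar with these metrics, the sketch below shows simplified versions of exact match and ROUGE-1 (unigram-overlap F1) on lowercased, whitespace-tokenized text. It is not the evaluation script behind the numbers above, which is not included in this card, and real ROUGE implementations differ in tokenization and stemming. The reference message in the example is made up for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the two messages are identical after trimming."""
    return prediction.strip() == reference.strip()


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping lowercase unigrams."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "Check for user existence before accessing password"
reference = "Fix crash when user does not exist"  # hypothetical reference
print(exact_match(prediction, reference))
print(round(rouge1_f1(prediction, reference), 3))
```

Both messages describe the same change, yet they share a single word, so unigram overlap is low and exact match fails. Commit messages admit many valid phrasings, which is worth keeping in mind when reading overlap-based scores.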

## Limitations

- The model is trained primarily on English commit messages
- Best suited for code changes in common programming languages
- May not handle diffs longer than 384 tokens well
- Generated messages should be reviewed before committing
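
One way to act on the length limitation is to check a token count before generating. The sketch below is tokenizer-agnostic: `count_tokens` is any callable that returns a token count, and a whitespace splitter stands in for the real tokenizer so the example runs without downloading the model. The 384-token threshold comes from the list above; whether it applies to the diff alone or to the full prompt is not specified, so the helper name, its default, and the choice of what to measure are all assumptions.

```python
from typing import Callable

MAX_DIFF_TOKENS = 384  # threshold quoted in the Limitations list


def within_budget(text: str, count_tokens: Callable[[str], int],
                  max_tokens: int = MAX_DIFF_TOKENS) -> bool:
    """Return True if `text` has at most `max_tokens` tokens."""
    return count_tokens(text) <= max_tokens


# Stand-in tokenizer: whitespace splitting. With the Hugging Face tokenizer
# one could instead pass: lambda s: len(tokenizer(s)["input_ids"])
def whitespace_count(text: str) -> int:
    return len(text.split())


small_diff = "x = 1\n" * 10    # 30 whitespace tokens
large_diff = "x = 1\n" * 200   # 600 whitespace tokens
print(within_budget(small_diff, whitespace_count))
print(within_budget(large_diff, whitespace_count))
```

When a diff exceeds the budget, splitting the change into smaller commits or summarizing per file are possible workarounds; neither has been evaluated in this card.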

## Ethical Considerations

This model is intended to assist developers in writing commit messages, not to replace human judgment. Users should:
- Review generated messages and confirm they accurately describe the changes
- Follow their team's commit message conventions

## Citation

```bibtex
@misc{git-commit-generator,
  author = {Raj Tiwari},
  title = {Git Commit Message Generator},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rajtiwariee/auto-commit}},
}
```

## License

MIT License