---
library_name: transformers
datasets:
- bigcode/the-stack-v2
- andreagurioli1995/SynthCode2Code2NL-neardedup
license: bigcode-openrail-m
base_model:
- andreagurioli1995/ModularStarEncoder
---

# ModularStarEncoder-1B Fine-Tuned Model

ModularStarEncoder-finetuned is an encoder obtained by fine-tuning [ModularStarEncoder-1B Pre-trained](https://huggingface.co/andreagurioli1995/ModularStarEncoder) on [SynthCode2Code2NL](https://huggingface.co/datasets/andreagurioli1995/SynthCode2Code2NL-neardedup).
It is a modular encoder for various retrieval tasks: its multiple exit points let the end user select the model size that meets their memory and computational constraints.
We built ModularStarEncoder on top of [StarCoder-2](https://huggingface.co/bigcode/starcoder2-15b), reducing its size from 15B to 1B parameters in bfloat16.

The model is fine-tuned with the [CLIP objective](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py), a symmetric contrastive loss that pulls matching code-text pairs together in embedding space and pushes mismatched pairs apart; a minimal sketch of the loss follows the list below.

- **Paper:** [Link](arxiv.paper)
- **Languages:** English, Go, Ruby, Python, Java, C++, PHP, C, JavaScript

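The following is a minimal PyTorch sketch of that CLIP-style loss, written from the linked open_clip implementation; the function and variable names are ours, and the actual training code may differ in details such as the learnable temperature and distributed gathering.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(code_emb: torch.Tensor, text_emb: torch.Tensor, logit_scale: float = 100.0) -> torch.Tensor:
    # L2-normalize so that dot products are cosine similarities.
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix between every code/text pair in the batch.
    logits = logit_scale * code_emb @ text_emb.T

    # Matching pairs lie on the diagonal; treat retrieval as classification.
    targets = torch.arange(code_emb.size(0), device=code_emb.device)

    # Symmetric cross-entropy over both retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```
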
### How to use
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (trust_remote_code is needed for the custom architecture)
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned", trust_remote_code=True)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned")

language = "yourlanguagelowercased"

# Instruction for embedding a snippet written in a given programming language
instruction_code = f"Represent this {language} code snippet for retrieval:"

# Instruction for embedding a natural-language (English) query
instruction_natural_language = "Represent this code description for retrieving supporting snippets of code:"

code_snippet = "your code to embed here"

# Follow this pattern to embed a code snippet or a natural-language query
sentence = f"{tokenizer.sep_token}{instruction_code}{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

# Tokenize the sentence
tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)

# Embed the tokenized sentence
embedded_sentence = model(**tokenized_sentence)
```

The output contains three elements:

- `projected_pooled_normalized`: a list of the projected, pooled, and normalized embeddings from the five exit points (see the retrieval sketch below);
- `raw_hidden_states`: the raw hidden states of the model, without pooling, normalization, or projection;
- `attentions`: the attention scores from the encoder.

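For retrieval, you can score a natural-language query against a code snippet by comparing embeddings from one of the exit points. A minimal sketch continuing from the snippet above, under stated assumptions: the output fields are exposed as attributes, the list of exit points is ordered from shallowest to deepest, and the helper `embed` is our own illustrative addition.

```python
import torch

def embed(text: str, instruction: str) -> torch.Tensor:
    # Build the input following the pattern shown above.
    sentence = f"{tokenizer.sep_token}{instruction}{tokenizer.sep_token}{text}{tokenizer.cls_token}"
    tokens = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        output = model(**tokens)
    # Index -1 takes the deepest exit point (full model); a smaller index
    # would use fewer layers, trading quality for memory and compute
    # (assumption on the ordering of the exit points).
    return output.projected_pooled_normalized[-1]

query_emb = embed("sort a list of integers", instruction_natural_language)
code_emb = embed("def sort_list(xs): return sorted(xs)", instruction_code)

# Embeddings are already normalized, so the dot product is cosine similarity.
similarity = (query_emb @ code_emb.T).item()
```
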

### Training

We fine-tuned ModularStarEncoder with a batch size of 2,048 contrastive samples for 20,000 training steps.
Pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64GB) GPUs using the [Leonardo](https://arxiv.org/abs/2307.16885) supercomputer, requiring a total of 450,000 GPU hours.

| Hyperparameter           | Value     |
|--------------------------|-----------|
| Hidden size              | 1024      |
| Max. position embeddings | 2048      |
| Num. of attention heads  | 12        |
| Num. of key-value heads  | 4         |
| Num. of hidden layers    | 36        |
| Attention                | GQA       |
| Num. of parameters       | ≈1B       |
| Loss function            | CLIP loss |
| Multi-layer loss         | yes       |

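As a quick sanity check, most of these values should also be visible on the configuration loaded from the Hub. A small sketch, assuming the custom configuration class follows the usual transformers attribute names:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned", trust_remote_code=True)

# Attribute names are an assumption; the custom config class may differ.
print(config.hidden_size)               # expected: 1024
print(config.max_position_embeddings)   # expected: 2048
print(config.num_attention_heads)       # expected: 12
print(config.num_key_value_heads)       # expected: 4 (grouped-query attention)
print(config.num_hidden_layers)         # expected: 36
```
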
## License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).