| | --- |
| | license: cc-by-nc-sa-4.0 |
| | datasets: |
| | - ed001/ds-coder-instruct-v1 |
| | language: |
| | - en |
| | pipeline_tag: text-generation |
| | tags: |
| | - code |
| | - data science |
| | --- |
| | |
| | # The Data Science Coder |
| |
|
| | Data Science coder is a group of fine tuned models designed to help with coding for data science applications. It comes in 2 variants: 1.3b and 6.7b. Models are fine tuned from DeepSeek Coder instruct versions. Fine tuning was performed on the [ed001/ds-coder-instruct-v1](https://huggingface.co/datasets/ed001/ds-coder-instruct-v1) dataset which is constructed by filtering publicly available datasets on HuggingFace. |
| |
|
| | ## Usage |
| |
|
| | ```python |
| | from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline |
| | |
| | def build_instruction_prompt(instruction): |
| | return ''' |
| | You are the Data Science Coder, a helpful AI assistant created by a man named Ed. |
| | You help people with data science coding and you answer questions about data science in a helpful manner. |
| | ### Instruction: |
| | {} |
| | ### Response: |
| | '''.format(instruction.strip()).lstrip() |
| | |
| | tokenizer = AutoTokenizer.from_pretrained("ed001/datascience-coder-1.3b", trust_remote_code=True) |
| | model = AutoModelForCausalLM.from_pretrained("ed001/datascience-coder-1.3b", trust_remote_code=True).cuda() |
| | pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1024, top_p=0.95) |
| | result = pipe(build_instruction_prompt("Perform EDA on the Iris dataset")) |
| | print(result[0]['generated_text']) |
| | ``` |
| |
|
| | ## Training Details |
| | lora_r: 16 |
| | lora_alpha: 8 |
| | lora_dropout: 0.05 |
| | target_modules: q, k, v, o, gate_proj, down_proj, up_proj, lm_head |
| | weight_decay: 0 |
| | optmizer: paged_adamw_32bit |
| | lr: 1e-4 |
| | lr_scheduler: cosine |
| | max_seq_len: 4096 |
| | batch_size: 4 |
| | max_grad_norm: 0.5 |
| | warmup_ratio: 0.05 |
| | num_epochs: 1 |
| | |
| | Training was performed on the python subset of the ds-coder-instruct dataset. |
| | |
| | ## Examples |
| | |
| | |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62618f3e6dae705b2567fb13/d3qCHXdrNNlq4VMus7e_S.png" width="90%"/> |
| | |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62618f3e6dae705b2567fb13/pU7flGRav_h1WDCj12RwP.png" width="90%"/> |
| | |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62618f3e6dae705b2567fb13/txFZANcIhaY-6mEe49kTE.png" width="90%"/> |
| | |
| | ## Contact |
| | GitHub: [Ea0011](https://github.com/Ea0011) |