| --- |
| license: apache-2.0 |
| datasets: |
| - huggingface-course/codeparrot-ds-train |
| - huggingface-course/codeparrot-ds-valid |
| language: |
| - en |
| metrics: |
| - code_eval |
| pipeline_tag: text-generation |
| tags: |
| - code |
| - gpt2 |
| - pytorch |
| - causal-lm |
| --- |
| |
| # python-ds-accelerate (GPT-2 124M) |
|
|
| This is a GPT-2 (124M-parameter) causal language model trained from scratch for **Python code completion** in data science contexts. |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| This model uses the GPT-2 architecture and is trained to generate Python code snippets. Training used a custom pipeline with a **keytoken weighted loss**, which gives extra weight to tokens that are common in data science code (such as `plt`, `pd`, `fit`, and `predict`), with the aim of improving suggestions for data science-related code. |
|
|
| - **Developed by:** [Pranav Guhan R](https://github.com/PranavGuhanR) |
| - **Model type:** Transformer-based Causal Language Model |
| - **Language(s):** Python (English comments) |
| - **License:** Apache 2.0 |
| - **Finetuned from model:** None (trained from scratch) |
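
The keytoken weighted loss described above can be illustrated with a small, framework-free sketch. It assumes one common formulation: the per-sample mean cross-entropy is scaled by `alpha * (1 + number of key-token occurrences in the sample)` and then averaged over the batch. The function names, the `alpha` parameter, and the exact weighting are assumptions for illustration; the training code in the repository may differ, and a real training loop would use tensor operations instead of Python lists.

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy at one position: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def keytoken_weighted_loss(input_ids, logits, keytoken_ids, alpha=1.0):
    """Batch loss in which samples containing key tokens count for more.

    input_ids:    list of token-id sequences, one per sample
    logits:       logits[b][t] holds the vocabulary scores at position t
    keytoken_ids: ids of the tokens to emphasise (hypothetical values)
    alpha:        global scale applied to every weight (assumed parameter)
    """
    keys = set(keytoken_ids)
    weighted = []
    for ids, sample_logits in zip(input_ids, logits):
        # Position t predicts token t + 1, so the last logits row is unused.
        losses = [
            token_cross_entropy(sample_logits[t], ids[t + 1])
            for t in range(len(ids) - 1)
        ]
        mean_loss = sum(losses) / len(losses)
        # The weight grows with the number of key-token occurrences.
        weight = alpha * (1.0 + sum(1 for tok in ids if tok in keys))
        weighted.append(weight * mean_loss)
    return sum(weighted) / len(weighted)
```

With an empty `keytoken_ids` list and `alpha=1.0`, every weight is 1 and the function reduces to the ordinary mean cross-entropy, which makes the effect of the weighting easy to compare against a plain language-modelling loss.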
|
|
| ### Model Sources |
|
|
| - **Repository:** [GPT-2-124M-pretraining-for-code-completion](https://github.com/PranavGuhanR/GPT-2-124M-pretraining-for-code-completion) |
|
|
| ## Uses |
|
|
| ### Direct Use |
| The model is intended for code completion, specifically completing Python scripts that use libraries such as `pandas`, `matplotlib`, and `scikit-learn`. |
|
|
| ### Out-of-Scope Use |
| The model is not suitable for general-purpose natural language conversation or generating code in languages other than Python. |
|
|
| ## How to Get Started with the Model |
|
|
| You can use the model directly with the Hugging Face Transformers `pipeline` API: |
|
|
| ```python |
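| from transformers import pipeline |
| |
| # Load the model into a text-generation pipeline |
| pipe = pipeline("text-generation", model="PranavGuhan/python-ds-accelerate") |
| |
| # Prompt: a comment plus the first line of code for the model to continue |
| txt = """# create dataframe from x and y |
| df = pd.DataFrame({'x':x, 'y':y}) |
| """ |
| # The generated text contains the prompt followed by the completion |
| print(pipe(txt, num_return_sequences=1)[0]["generated_text"]) |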