| ---
|
| tags:
|
| - code
|
| - programming
|
| - dataset
|
| pretty_name: "Coding Dataset"
|
| ---
|
|
|
| # Coding Dataset
|
|
|
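| A dataset of code snippets paired with natural-language descriptions and quality metadata, intended for training AI coding agents. The current release is a 6-example demo.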
|
|
|
| ## Dataset Summary
|
|
|
| - **Total Examples**: 6 (demo)
|
| - **Languages**: Python, JavaScript, Java
|
| - **Task Types**: Code Generation
|
| - **License**: CC0-1.0
|
|
|
| ## Dataset Structure
|
|
|
| ### Data Splits
|
|
|
| - train: 70% of data
|
| - validation: 15% of data
|
| - test: 15% of data
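
The card does not say how the splits were produced. As a minimal sketch, assuming a seeded shuffle followed by proportional slicing, a 70/15/15 partition could be built as follows. The helper name `split_indices`, the rounding rule, and the seed are illustrative assumptions, not part of the dataset tooling.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Partition range(n) into train/validation/test index lists.

    Hypothetical helper: the dataset card does not document the real
    splitting procedure. The test split receives whatever remains after
    the train and validation slices are taken.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


splits = split_indices(100)
print({name: len(idx) for name, idx in splits.items()})
```

Under this rounding rule, the 6 demo examples would divide 4/1/1, which may or may not match the published splits.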
|
|
|
| ### Features
|
|
|
| - `id` (string): Unique identifier
|
| - `code` (string): Source code snippet
|
| - `code_description` (string): Natural language description
|
| - `programming_language` (string): Language (python, javascript, java, etc.)
|
| - `task_type` (string): Type of task
|
| - `difficulty_level` (string): Difficulty (beginner, intermediate, advanced, expert)
|
| - `quality_score` (float): Quality score from 0.0 to 1.0
|
| - `is_tested` (bool): Whether the code has been tested
|
| - `has_bugs` (bool): Whether the code has known bugs
|
| - `lines_of_code` (int): Number of lines of code in the snippet
|
| - `collected_at` (string): Collection timestamp
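
The feature list above can be turned into a lightweight record check. The sketch below is illustrative only: `FEATURE_TYPES`, `validate_record`, and the sample record are hypothetical names and data, and treating the four difficulty levels as an exhaustive set is an assumption.

```python
# Expected Python type for each feature named in the list above.
FEATURE_TYPES = {
    "id": str,
    "code": str,
    "code_description": str,
    "programming_language": str,
    "task_type": str,
    "difficulty_level": str,
    "quality_score": float,
    "is_tested": bool,
    "has_bugs": bool,
    "lines_of_code": int,
    "collected_at": str,
}

# Assumption: the four levels named in the card are the only valid ones.
DIFFICULTY_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in FEATURE_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    score = record.get("quality_score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside 0.0-1.0")
    level = record.get("difficulty_level")
    if isinstance(level, str) and level not in DIFFICULTY_LEVELS:
        problems.append(f"unknown difficulty_level: {level}")
    return problems


# Made-up record for illustration; not taken from the dataset.
sample = {
    "id": "example-001",
    "code": "print('hello')",
    "code_description": "Print a greeting",
    "programming_language": "python",
    "task_type": "Code Generation",
    "difficulty_level": "beginner",
    "quality_score": 0.9,
    "is_tested": True,
    "has_bugs": False,
    "lines_of_code": 1,
    "collected_at": "2025-10-25",
}
print(validate_record(sample))
```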
|
|
|
| ## Usage
|
|
|
| ```python
|
| from datasets import load_dataset
|
|
|
| # Load dataset
|
| dataset = load_dataset("romcmu863/code-dataset")
|
|
|
| # Access splits
|
| train = dataset['train']
|
| validation = dataset['validation']
|
| test = dataset['test']
|
|
|
| # Get first example
|
| example = train[0]
|
| print(example['code_description'])
|
| print(example['code'])
|
| ```
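
A common next step is to keep only examples that are tested, free of known bugs, and above some quality threshold. The sketch below shows one way to express that as a predicate; the name `is_clean`, the 0.8 threshold, and the stand-in rows are assumptions made for illustration.

```python
def is_clean(example, min_quality=0.8):
    """Keep tested, bug-free examples at or above a quality threshold."""
    return (
        example["is_tested"]
        and not example["has_bugs"]
        and example["quality_score"] >= min_quality
    )


# Hypothetical rows standing in for records of the real dataset.
rows = [
    {"id": "a", "is_tested": True, "has_bugs": False, "quality_score": 0.92},
    {"id": "b", "is_tested": True, "has_bugs": True, "quality_score": 0.95},
    {"id": "c", "is_tested": False, "has_bugs": False, "quality_score": 0.99},
    {"id": "d", "is_tested": True, "has_bugs": False, "quality_score": 0.50},
]

kept = [row for row in rows if is_clean(row)]
print([row["id"] for row in kept])
```

Because the predicate takes a single example dictionary, it should also work with the `filter` method of a loaded split, for instance `train.filter(is_clean)`.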
|
|
|
| ## License
|
|
|
| CC0-1.0
|
|
|
| ## Created
|
|
|
| 2025-10-25
|
|
|