	---
	tags:
	- code
	- programming
	- dataset
	pretty_name: "Coding Dataset"
	---

	# Coding Dataset

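	Dataset of annotated code snippets for training AI coding agents. Each example pairs a snippet with a natural language description and quality metadata. The current release is a small demo.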

	## Dataset Summary

	- Total Examples: 6 (demo)
	- Languages: Python, JavaScript, Java
	- Task Types: Code Generation
	- License: CC0-1.0

	## Dataset Structure

	### Data Splits

	- train: 70% of data
	- validation: 15% of data
	- test: 15% of data
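
How the published splits were produced is not documented here. As a minimal sketch, assuming a seeded shuffle followed by rounding (the function name `split_indices` and the seed are illustrative, not part of the dataset tooling), a 70/15/15 split could be computed like this:

```python
import random


def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Train and validation sizes are rounded; test receives the remainder,
    so every index lands in exactly one split.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])

    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


# Split sizes for the 6-example demo and for a hypothetical 100-example set.
for n in (6, 100):
    sizes = {name: len(idx) for name, idx in split_indices(n).items()}
    print(n, sizes)
```

With only 6 examples the percentages cannot be met exactly; under this rounding scheme the demo would come out as 4/1/1, which may differ from the splits actually shipped.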

	### Features

	- `id` (string): Unique identifier
	- `code` (string): Source code snippet
	- `code_description` (string): Natural language description
	- `programming_language` (string): Language (python, javascript, java, etc.)
	- `task_type` (string): Type of task (Code Generation in this release)
	- `difficulty_level` (string): Difficulty (beginner, intermediate, advanced, expert)
	- `quality_score` (float): Quality score in the range 0.0 to 1.0
	- `is_tested` (bool): Whether the code has been tested
	- `has_bugs` (bool): Whether the code has known bugs
	- `lines_of_code` (int): Number of lines in the snippet
	- `collected_at` (string): Collection timestamp
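
The field list above can be turned into a lightweight record check. The following is a sketch only: `validate_record` is a hypothetical helper, and the sample record is invented for illustration (including the timestamp format, which the card does not specify).

```python
# Field names and Python types, taken from the Features list.
EXPECTED_FIELDS = {
    "id": str,
    "code": str,
    "code_description": str,
    "programming_language": str,
    "task_type": str,
    "difficulty_level": str,
    "quality_score": float,
    "is_tested": bool,
    "has_bugs": bool,
    "lines_of_code": int,
    "collected_at": str,
}

DIFFICULTY_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif type(record[name]) is not expected_type:
            # Exact type match, so a bool is not accepted where an int is expected.
            problems.append(f"{name}: expected {expected_type.__name__}")

    score = record.get("quality_score")
    if type(score) is float and not 0.0 <= score <= 1.0:
        problems.append("quality_score: outside 0.0 to 1.0")

    level = record.get("difficulty_level")
    if type(level) is str and level not in DIFFICULTY_LEVELS:
        problems.append(f"difficulty_level: unknown value {level!r}")

    return problems


# Made-up record for illustration; not taken from the dataset.
sample = {
    "id": "example-001",
    "code": "def add(a, b):\n    return a + b\n",
    "code_description": "Add two numbers.",
    "programming_language": "python",
    "task_type": "code_generation",
    "difficulty_level": "beginner",
    "quality_score": 0.9,
    "is_tested": True,
    "has_bugs": False,
    "lines_of_code": 2,
    "collected_at": "2025-10-25T00:00:00Z",  # format is an assumption
}

print(validate_record(sample))
```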

	## Usage

	```python
	from datasets import load_dataset

	# Load dataset
	dataset = load_dataset("romcmu863/code-dataset")

	# Access splits
	train = dataset['train']
	validation = dataset['validation']
	test = dataset['test']

	# Get first example
	example = train[0]
	print(example['code_description'])
	print(example['code'])
	```
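
The metadata fields can be used to select a subset of examples. The sketch below defines a plain predicate over one example; the name `is_clean_example`, the 0.8 threshold, and the sample rows are all illustrative assumptions rather than anything the dataset prescribes.

```python
def is_clean_example(example, min_quality=0.8, language=None):
    """Keep tested, bug-free examples at or above a quality threshold.

    If `language` is given, also require a matching `programming_language`.
    """
    if example["has_bugs"] or not example["is_tested"]:
        return False
    if example["quality_score"] < min_quality:
        return False
    if language is not None and example["programming_language"] != language:
        return False
    return True


# Made-up rows carrying only the fields the predicate reads.
rows = [
    {"id": "a", "programming_language": "python", "quality_score": 0.95,
     "is_tested": True, "has_bugs": False},
    {"id": "b", "programming_language": "python", "quality_score": 0.60,
     "is_tested": True, "has_bugs": False},
    {"id": "c", "programming_language": "java", "quality_score": 0.90,
     "is_tested": True, "has_bugs": False},
    {"id": "d", "programming_language": "python", "quality_score": 0.99,
     "is_tested": False, "has_bugs": True},
]

kept = [row["id"] for row in rows if is_clean_example(row, language="python")]
print(kept)
```

The same predicate can be passed to the `filter` method of a loaded split, for example `train.filter(is_clean_example)`, since `filter` calls the function once per example with a dict of its fields.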

	## License

	CC0-1.0 (Creative Commons Zero 1.0 Universal, a public domain dedication)

	## Created

	2025-10-25