
HF Repo Layout — composer-replication-framework

Per the HF multi-artifact research project pattern, this project will eventually span multiple HF repos. This document records the layout.

Current state (2026-05-25)

Only the methodology repo exists; no trained variants or datasets have been published yet.

| Repo | Type | Status | Purpose |
|---|---|---|---|
| Codeseys/composer-replication-framework | model | ✅ exists (this repo) | Methodology, ADRs, framework spec, research deep-dives |

Planned splits (post-spike)

When the v0.0 spike produces a result, the following repos will be created:

| Repo | Type | Created when | Contents |
|---|---|---|---|
| Codeseys/composer-replication-traces-v0 | dataset | v0.0 spike data is collected | 100 frozen agentic-coding traces (JSON), used for trace-replay-distillation experiments |
| Codeseys/composer-replication-qwen3-7b-v0 | model | v0.0 spike produces a checkpoint | LoRA adapter or full fine-tune of Qwen3-7B trained with GRPO + trace-replay-DPO |
| Codeseys/composer-replication-qwen3-7b-v0-baseline | model | v0.0 spike produces a baseline checkpoint | Same training, plain GRPO only (A/B comparison) |

After v0.1:

| Repo | Type | Contents |
|---|---|---|
| Codeseys/composer-replication-traces-v1 | dataset | Larger trace corpus + Feature-Deletion environment seed repos |
| Codeseys/composer-replication-feature-deletion-env-v1 | dataset | Repos with passing tests, with deletion masks for the env to apply |
| Codeseys/composer-replication-qwen3-32b-v1 | model | Full Composer-recipe v1 trained variant |

All trained-variant repos will:

  • Link back to this repo (Codeseys/composer-replication-framework) in their README.md as the methodology source.
  • Live in an HF Collection (composer-replication-*) created when the second member repo is added.

Why this split

Per the huggingface-hub skill's references/multi-artifact-research-layout.md:

  1. Type semantics matter — HF dataset repos have native handling for jsonl/parquet (streaming load, dataset viewer). The model repo type used for this repo treats markdown research as first-class.
  2. Cite-ability — each trained variant gets its own DOI / citation.
  3. Variant training is unbounded — we don't know how many variants will ship; per-variant repos keep eval results, model cards, and weights cleanly separated.
  4. Discoverability via Collection — single URL surfaces the whole study.

Conventions

  • Repo prefix: composer-replication- for every repo in this study.
  • Variant suffix: <base-model>-<size>-<scale-tag> (e.g. qwen3-7b-v0, qwen3-32b-v1).
  • Dataset suffix: -traces-v<N>, -feature-deletion-env-v<N>, -bench-v<N>.
  • Branch: master locally → push to HF as main (refspec master:main).
  • License: MIT for methodology and code; per-trained-variant license depends on base model's license.
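
The naming conventions above can be checked mechanically before a repo is created. The following is a minimal sketch, not part of the framework: the function name `classify_repo` and both regular expressions are hypothetical, and they assume that sizes are written like `7b`, that tags look like `v0`, and that an optional `-baseline` suffix marks the A/B comparison checkpoint.

```python
import re

PREFIX = "composer-replication-"
METHODOLOGY_REPO = "composer-replication-framework"

# Dataset suffixes from the conventions: traces-v<N>, feature-deletion-env-v<N>, bench-v<N>.
DATASET_RE = re.compile(r"^(traces|feature-deletion-env|bench)-v\d+$")

# Variant suffix <base-model>-<size>-<scale-tag>, e.g. qwen3-7b-v0.
# Assumptions: size looks like "7b", the tag looks like "v0", and an optional
# "-baseline" suffix is allowed for the comparison checkpoint.
VARIANT_RE = re.compile(r"^[a-z0-9]+-\d+b-v\d+(-baseline)?$")


def classify_repo(repo_id: str) -> str:
    """Return the HF repo type ("model" or "dataset") implied by the name.

    Raises ValueError when the name does not follow the conventions.
    """
    # Drop an optional "<owner>/" namespace.
    name = repo_id.rpartition("/")[2]
    if name == METHODOLOGY_REPO:
        return "model"
    if not name.startswith(PREFIX):
        raise ValueError(f"{repo_id!r} lacks the {PREFIX!r} prefix")
    suffix = name[len(PREFIX):]
    if DATASET_RE.match(suffix):
        return "dataset"
    if VARIANT_RE.match(suffix):
        return "model"
    raise ValueError(f"{repo_id!r} does not match a known suffix pattern")
```

For example, `classify_repo("Codeseys/composer-replication-traces-v0")` returns `"dataset"`, while a name without the study prefix raises `ValueError`.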

Sync pattern

When adding a new variant repo, follow the shape of the huggingface-hub skill's references/sync-to-hf-template.py: call create_repo, upload_folder, and add_collection_item(exists_ok=True) in a single script, so that shipping a new variant is one command.
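
A minimal sketch of that three-call shape is shown below. It is not the contents of sync-to-hf-template.py, which is not reproduced in this document. The function name `ship_variant` and its parameters are hypothetical. The `api` argument is assumed to be a `huggingface_hub.HfApi` instance, passed in so the function can be exercised with a stand-in object and no network access.

```python
def ship_variant(api, repo_id, folder, collection_slug, repo_type="model"):
    """Create the repo if needed, upload a local folder, and add it to the collection.

    `api` is expected to behave like huggingface_hub.HfApi.
    """
    # Idempotent: exist_ok=True makes this a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type=repo_type, exist_ok=True)

    # Upload the local folder contents to the root of the repo.
    api.upload_folder(repo_id=repo_id, repo_type=repo_type, folder_path=str(folder))

    # exists_ok=True avoids an error when the item is already in the collection.
    api.add_collection_item(
        collection_slug=collection_slug,
        item_id=repo_id,
        item_type=repo_type,
        exists_ok=True,
    )
    return repo_id


# Usage (needs `pip install huggingface_hub` and a token with write access;
# the folder path and collection slug are placeholders):
#   from huggingface_hub import HfApi
#   ship_variant(HfApi(), "Codeseys/composer-replication-qwen3-7b-v0",
#                "./checkpoint", "<collection-slug>")
```

Because every call is idempotent, re-running the script after a partial failure should be safe. Note that `create_repo` spells its flag `exist_ok`, whereas `add_collection_item` uses `exists_ok`.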