---
license: apache-2.0
tags:
- code
- multi-language
- pretraining-data
---

# code-graph-v4

Packaged git clones, with full git history, for the graphjepa / code-transformer project.

## Contents

- clones_csharp_full.tar.gz
- clones_java_full.tar.gz
- clones_javascript_full.tar.gz
- clones_python_full.tar.gz
- clones_typescript_full.tar.gz

Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere, then
point the parser at the extracted directory.
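
To make the `{language}/{repo_id}/...` layout concrete, here is a minimal sketch that lists the repositories found under an extracted directory. The helper name `list_repos` is hypothetical and not part of this project; it only assumes the two-level layout described above.

```python
from pathlib import Path


def list_repos(root):
    """Return sorted (language, repo_id) pairs found under an extracted root.

    Assumes the {language}/{repo_id}/... layout; stray files at either
    level are ignored. Returns an empty list if the root does not exist.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    pairs = []
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for repo_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
            pairs.append((lang_dir.name, repo_dir.name))
    return pairs
```

Calling `list_repos("./data_multilang")` after extraction would yield one pair per extracted repository, which can then be fed to the parser one at a time.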

## On the receiving (big) machine

```python
import os
import tarfile

from huggingface_hub import hf_hub_download

# Download one language tarball from the Hub into the current directory.
path = hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
)

# Extract it; same effect as `tar -xzf $path -C ./data_multilang/`.
os.makedirs("./data_multilang", exist_ok=True)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./data_multilang")

# Then process each repo with build_bundle.py (needs include_git=True for
# temporal processing; or single-snapshot parsing if code-only).
```

## Limitations

- If packaged without `.git` (the `_code` variants), **no temporal
  processing is possible** downstream; only single-snapshot SSL is available.
- If packaged with `.git` (the `_full` variants), tarballs are larger
  but the full commit history is preserved for `build_bundle.py`.
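
The distinction above can be checked per repository before processing. The sketch below assumes that a `_full` extraction leaves a `.git` directory at the top of each repo and a `_code` extraction does not; the function names `has_git_history` and `processing_mode` are hypothetical, and how the result is passed on to `build_bundle.py` is not shown because its interface is not documented here.

```python
from pathlib import Path


def has_git_history(repo_dir):
    """True if the repo directory still contains a top-level .git directory."""
    return (Path(repo_dir) / ".git").is_dir()


def processing_mode(repo_dir):
    """Pick 'temporal' when commit history is available, else 'snapshot'."""
    return "temporal" if has_git_history(repo_dir) else "snapshot"
```

A repo that reports `"snapshot"` can only be used for single-snapshot parsing, as noted in the list above.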