---
license: apache-2.0
tags:
- code
- multi-language
- pretraining-data
---

# code-graph-v4

Packaged git clones for the graphjepa / code-transformer project, with full git history.

## Contents

- clones_csharp_full.tar.gz
- clones_java_full.tar.gz
- clones_javascript_full.tar.gz
- clones_python_full.tar.gz
- clones_typescript_full.tar.gz

Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere, then point the parser at the extracted directory.

## On the receiving (big) machine

```bash
# Download one tarball with huggingface_hub and capture the local path it prints.
path=$(python - <<'EOF'
from huggingface_hub import hf_hub_download

print(hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
))
EOF
)

# tar -C requires the target directory to exist.
mkdir -p ./data_multilang/
tar -xzf "$path" -C ./data_multilang/

# Then process each repo with build_bundle.py (needs include_git=True for
# temporal processing; or single-snapshot parsing if code-only).
```

## Limitations

- If packaged without `.git` (the `_code` variants), **no temporal processing is possible** downstream; only single-snapshot SSL is available.
- If packaged with `.git` (the `_full` variants), tarballs are larger, but the full commit history is preserved for `build_bundle.py`.
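
The `{language}/{repo_id}/...` layout and the `.git` distinction described above can be checked after extraction with a small script. The following is a minimal sketch, not part of the project's tooling: `list_repos` is a hypothetical helper name, and `./data_multilang` is assumed to be the extraction directory used in the commands above.

```python
from pathlib import Path


def list_repos(root):
    """Yield (language, repo_id, has_git) for each {language}/{repo_id} directory.

    Hypothetical helper: it only inspects the directory layout and does not
    call build_bundle.py or any other project code.
    """
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files at the top level
        for repo_dir in sorted(lang_dir.iterdir()):
            if repo_dir.is_dir():
                # A .git entry means commit history was packaged with the clone.
                yield lang_dir.name, repo_dir.name, (repo_dir / ".git").exists()


if __name__ == "__main__":
    root = Path("./data_multilang")  # assumed extraction directory
    if root.is_dir():
        for language, repo_id, has_git in list_repos(root):
            mode = "temporal" if has_git else "single-snapshot only"
            print(f"{language}/{repo_id}: {mode}")
```

A repo reported as `single-snapshot only` has no `.git` directory, so it falls under the first limitation: only the current code snapshot can be parsed.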