metadata
license: apache-2.0
tags:
  - code
  - multi-language
  - pretraining-data

code-graph-v4

Packaged git clones, with full git history, for the graphjepa / code-transformer project.

Contents

    - clones_csharp_full.tar.gz
    - clones_java_full.tar.gz
    - clones_javascript_full.tar.gz
    - clones_python_full.tar.gz
    - clones_typescript_full.tar.gz

Each tarball unpacks to {language}/{repo_id}/...; extract it anywhere, then point the parser at the extracted directory.
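To make the layout concrete, here is a minimal sketch that walks an extracted directory and lists every {language}/{repo_id} pair it finds. The function name `list_repos` and the root path are hypothetical, chosen only for illustration; they are not part of the project's tooling.

```python
from pathlib import Path


def list_repos(root):
    """Return (language, repo_id, path) for each {language}/{repo_id} directory under root.

    Hypothetical helper: it only assumes the two-level layout described above.
    """
    repos = []
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files next to the language directories
        for repo_dir in sorted(lang_dir.iterdir()):
            if repo_dir.is_dir():
                repos.append((lang_dir.name, repo_dir.name, repo_dir))
    return repos
```

Calling `list_repos("./data_multilang")` after extraction should yield one tuple per cloned repository, which can then be handed to the parser one at a time.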

On the receiving (big) machine

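import os
import tarfile
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
)
# Extract into ./data_multilang/ (same effect as: tar -xzf $path -C ./data_multilang/)
os.makedirs("./data_multilang", exist_ok=True)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./data_multilang")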

# Then process each repo with build_bundle.py: pass include_git=True for
# temporal processing, or parse a single snapshot if the clone is code-only.

Limitations

  • Packaged without .git (the _code variants): no temporal processing is possible downstream, only single-snapshot SSL.
  • Packaged with .git (the _full variants): tarballs are larger, but the full commit history is preserved for build_bundle.py.
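
The two cases above can be told apart on disk by checking whether an extracted repo still has its .git entry. The sketch below is one way to do that; `processing_mode` and its return values are hypothetical names, not part of build_bundle.py.

```python
from pathlib import Path


def processing_mode(repo_dir):
    """Return "temporal" if repo_dir contains a .git entry, else "snapshot".

    Hypothetical helper: it only inspects the filesystem and does not call
    build_bundle.py.
    """
    return "temporal" if (Path(repo_dir) / ".git").exists() else "snapshot"
```

A driver script could use the result to decide whether to request temporal processing (include_git=True) or fall back to single-snapshot parsing for a given repo.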