
ProtFunc Checklist

Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer

Pipeline running: artifacts/pipeline_run.log | PID: bash 1963, python 1966


CRITICAL FINDINGS

| Finding | Impact |
|---|---|
| graph_hpo_best.pth: val_fmax 0.9540, test_fmax 0.9533, CAFA 0.6536 (joint insect+mammal training) | Massive jump over ablation A (0.8947/0.6338). No overfit. Mammals eval pending. |
| HPO only ran 2 trials (target was 40) | Best params likely suboptimal. Re-run HPO with full budget. |
| Threshold A (current v3): ~1448 preds/protein, precision=0.002 | Broken for inference. Use threshold C (novelty-gated) instead. |
| All gen_ratios < 0.50, mammals n=7 only | Stats unreliable. Need ≥100 mammal proteins. |
| AF features (Model C) kill transfer: gen_ratio=0.02 | Never use esm_all features for cross-taxon models. |
| graph_hpo mammal gen_ratio=0.233 (n=4672) | Joint insect+mammal HPO didn't significantly improve transfer vs ablation A (0.4347 at n=7, unreliable). Phase 7 needed. |
| protfunc_v3_fixed.pth referenced in server.py but NOT in HF Space | server.py loads it as priority; it will silently fall back if missing. |
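
The silent-fallback risk in the last finding could be made visible with a loader that logs every miss. This is a minimal sketch, not the code in server.py: `resolve_checkpoint`, the logger name, and the candidate list are all hypothetical, and the checklist does not say which file server.py actually falls back to.

```python
import logging
from pathlib import Path

logger = logging.getLogger("protfunc.checkpoints")  # hypothetical logger name


def resolve_checkpoint(model_dir, candidates):
    """Return the first existing candidate in model_dir, in priority order.

    Every missing file is logged, and a fallback to a lower-priority
    file produces an explicit warning instead of happening silently.
    """
    model_dir = Path(model_dir)
    for rank, name in enumerate(candidates):
        path = model_dir / name
        if path.is_file():
            if rank > 0:
                logger.warning(
                    "Falling back to %s; preferred %s not found",
                    name, candidates[0],
                )
            return path
        logger.warning("Checkpoint missing: %s", path)
    # Nothing usable: fail loudly rather than serve an unknown model.
    raise FileNotFoundError(f"None of {list(candidates)} found in {model_dir}")
```

Usage would look like `resolve_checkpoint("models", ["protfunc_v3_fixed.pth", "some_older_model.pth"])`, where the second name is a placeholder for whatever fallback the server really uses.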

Phase 1: Ablation ✅ DONE

| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio |
|---|---|---|---|---|
| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
| B: ESM+seq (331d) | - | 0.8999 | 0.6360 | 0.4167 |
| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |

Winner: Model A (ESM only), the best insect+mammal balance.


Phase 2: HPO (graph_hpo joint pipeline)

  • ✅ 2a. HPO script ran, but only 2/40 trials completed
  • ✅ 2b. Best params saved: graph_hpo/hpo_results.json (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
  • ✅ 2c. Full train on best params done → graph_hpo_best.pth, val_fmax=0.9533
  • ⬜ 2d. Re-run HPO with full 40 trials; current best may not be global optimum
    cd "/Users/siddhantbhat/Desktop/Research Files"
    .venv/bin/python3 scripts/hpo.py \
      --mammal artifacts/generalization/mammal_full_v1.parquet \
      --n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
      --startup_trials 5 --warmup_steps 5 \
      --multivariate_tpe --group_tpe \
      --out artifacts/graph_hpo/hpo_results.json
    
  • ✅ 2e. Test eval on graph_hpo_best.pth → test_micro_fmax=0.9533, CAFA=0.6536, P=0.9553, R=0.9514, t*=0.94
  • ✅ 2f. Mammal gen eval on graph_hpo_best.pth → n=4672, micro_fmax=0.2224, CAFA=0.201, gen_ratio=0.233 ⚠️ poor transfer
    .venv/bin/python3 scripts/eval_generalization.py \
      --checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
      --thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
      --mlb "Important Files/mlb_public_v1.pkl" \
      --taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
      --taxon_name mammals_graph_hpo --obo go-basic.obo \
      --out artifacts/graph_hpo/generalization_results.json
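
For reference, the micro_fmax, P, R and t* values reported in 2e and 2f can be read as the result of a threshold sweep. The sketch below assumes a single global threshold and dense 0/1 label rows; it is an illustration of the metric, not the code in the eval scripts, and it does not cover the protein-centric CAFA variant. The function name and the 0.01 grid are assumptions.

```python
def micro_fmax(y_true, y_score, thresholds=None):
    """Return (fmax, precision, recall, threshold) for micro-averaged F1.

    y_true:  list of rows of 0/1 labels, one row per protein
    y_score: list of rows of scores in [0, 1], same shape as y_true
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]  # assumed sweep grid
    best = (0.0, 0.0, 0.0, None)
    for t in thresholds:
        tp = fp = fn = 0
        # Pool counts over all proteins and labels (micro averaging).
        for truth_row, score_row in zip(y_true, y_score):
            for truth, score in zip(truth_row, score_row):
                pred = score >= t
                if pred and truth:
                    tp += 1
                elif pred:
                    fp += 1
                elif truth:
                    fn += 1
        if tp == 0:
            continue  # F1 is zero or undefined at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, precision, recall, t)
    return best
```

As a sanity check on the reported numbers, P=0.9553 and R=0.9514 give F1 = 2PR/(P+R) ≈ 0.9533, which matches test_micro_fmax.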
    

Phase 3: Threshold Fix ⚠️ BROKEN

Current v3 thresholds output 1448 preds/protein at 0.2% precision, which is unusable.

  • ✅ 3a. Comparison run → artifacts/threshold_comparison_results.json
  • ✅ 3b. Winner: C (novelty-gated), with F1=0.0733, 6.69 preds/protein, novelty subset F1=0.2757
  • ✅ 3c. Threshold comparison on graph_hpo_best.pth → A (per-label t*): P=0.143/R=0.984/F1=0.250/20.6 preds ✅ use this | B: P=0.878/F1=0.219/0.43 preds (83% zero) | C≈B
    # Edit the script to point to graph_hpo_best.pth before running
    .venv/bin/python3 scripts/threshold_comparison.py
    
  • ⬜ 3d. Update server.py to use novelty-gated thresholds by default (currently falls back to broken A thresholds)
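
The preds/protein, precision and "% zero" figures quoted in this phase could be recomputed with a small report like the one below. It is a sketch under assumptions: labels are dense 0/1 rows, thresholds are one value per label, and `threshold_report` and its field names are illustrative rather than taken from scripts/threshold_comparison.py.

```python
def threshold_report(y_true, y_score, thresholds):
    """Summarize one per-label threshold set on a labelled evaluation set."""
    tp = fp = fn = 0
    empty = 0  # proteins that receive no prediction at all
    for truth_row, score_row in zip(y_true, y_score):
        preds = [s >= t for s, t in zip(score_row, thresholds)]
        if not any(preds):
            empty += 1
        for truth, pred in zip(truth_row, preds):
            if pred and truth:
                tp += 1
            elif pred:
                fp += 1
            elif truth:
                fn += 1
    n = len(y_true)
    n_pred = tp + fp
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "preds_per_protein": n_pred / n,
        "frac_zero_preds": empty / n,
    }
```

Reading all five fields together matters here: option B has high precision but leaves 83% of proteins without any prediction, which a precision-only comparison would hide.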

Phase 4: Mammal Dataset Expansion ⚠️ URGENT

n=7 proteins → all gen_ratio stats are noise.

  • ⬜ 4a. Run build_mammal_dataset.py for ≥100 mammal proteins with GO-MF annotations
    .venv/bin/python3 scripts/build_mammal_dataset.py
    # Check script args; output should go to artifacts/generalization/mammal_full_v2.parquet
    
  • ⬜ 4b. Re-run gen eval for A, B, graph_hpo_best with new mammal set
  • ⬜ 4c. Update CHECKLIST gen_ratio table with reliable numbers

Phase 5: Broader Taxon Coverage

  • ⬜ 5a. Get FASTAs: fungi, plants (arabidopsis), fish (zebrafish), archaea, nematode
  • ⬜ 5b. For each: prep_taxon.py → eval_generalization.py
    BASE="/Users/siddhantbhat/Desktop/Research Files"
    cd "$BASE"  # the relative paths below assume this directory
    .venv/bin/python3 scripts/prep_taxon.py \
      --fasta "Important Files/<taxon>.fasta" \
      --taxon_name <taxon> \
      --mlb "Important Files/mlb_public_v1.pkl" \
      --out artifacts/generalization/<taxon>_embeddings.parquet
    
  • ⬜ 5c. Fill generalization table below
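
Step 5b could be driven by a small loop that builds both commands per taxon. The sketch below reuses only the flags that appear in the commands in this checklist; the per-taxon results path passed to `--out` of eval_generalization.py and the helper names are assumptions, and FASTA files are assumed to be named `<taxon>.fasta` as in the template above.

```python
import shlex
import subprocess

TAXA = ["fungi", "plants", "fish", "archaea", "nematode"]
PYTHON = ".venv/bin/python3"
MLB = "Important Files/mlb_public_v1.pkl"


def prep_command(taxon):
    """Build the prep_taxon.py call for one taxon."""
    return [
        PYTHON, "scripts/prep_taxon.py",
        "--fasta", f"Important Files/{taxon}.fasta",
        "--taxon_name", taxon,
        "--mlb", MLB,
        "--out", f"artifacts/generalization/{taxon}_embeddings.parquet",
    ]


def eval_command(taxon):
    """Build the eval_generalization.py call that consumes the prep output."""
    return [
        PYTHON, "scripts/eval_generalization.py",
        "--checkpoint", "artifacts/graph_hpo/graph_hpo_best.pth",
        "--thresholds", "artifacts/graph_hpo/graph_hpo_best_thresholds.json",
        "--mlb", MLB,
        "--taxon_parquet", f"artifacts/generalization/{taxon}_embeddings.parquet",
        "--taxon_name", taxon,
        "--obo", "go-basic.obo",
        # Assumed output path: one results file per taxon.
        "--out", f"artifacts/generalization/{taxon}_results.json",
    ]


def run_all(taxa=TAXA, dry_run=True):
    """Print (dry run) or execute prep then eval for each taxon."""
    commands = []
    for taxon in taxa:
        for cmd in (prep_command(taxon), eval_command(taxon)):
            commands.append(cmd)
            if dry_run:
                print(shlex.join(cmd))
            else:
                subprocess.run(cmd, check=True)  # stop on first failure
    return commands
```

Calling `run_all()` with the default `dry_run=True` only prints the commands, which makes it easy to check paths before spending time on embedding.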

Phase 6: HF Upload & Webapp

  • ✅ 6a. server.py updated locally (generalization API + v3_fixed priority), commit bd99db9e
  • ✅ 6b. Uploaded graph_hpo_best.pth → HF as protfunc_v3_fixed.pth + thresholds (test_fmax=0.9533 confirmed better)
    huggingface-cli upload Sbhat2026/protfunc-models \
      "artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
    huggingface-cli upload Sbhat2026/protfunc-models \
      "artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
    
  • ✅ 6c. Pushed static/interface.html to HF Space (commit 2aa49963): collapsible lower-confidence UI
    cd /Users/siddhantbhat/insecta_webapp
    git add server.py static/interface.html
    git commit -m "fix: use novelty-gated thresholds; add generalization panel"
    git push
    
  • ⬜ 6d. Add /api/generalization endpoint to serve generalization_results.json for all taxa (currently only mammals)

Phase 7: Generalization Improvement (if gen_ratio < 0.85)

  • ⬜ 7a. Mixed-taxon fine-tuning with more mammal data (Phase 4 first)
  • ⬜ 7b. Domain adaptation: freeze ESM layers, fine-tune MLP head on target taxon
  • ⬜ 7c. Re-eval after changes

Generalization Table

gen_ratio = taxon micro_fmax / insect test micro_fmax. Target ≥ 0.85.

| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
|---|---|---|---|---|---|---|
| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 | ⚠️ AF kills transfer |
| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | 0.233 ⚠️ | ✅ 2f done |
| fungi | - | - | - | - | - | ⬜ Phase 5 |
| plants | - | - | - | - | - | ⬜ Phase 5 |
| fish | - | - | - | - | - | ⬜ Phase 5 |
| archaea | - | - | - | - | - | ⬜ Phase 5 |
| nematode | - | - | - | - | - | ⬜ Phase 5 |
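
The gen_ratio column follows directly from the definition above. A minimal sketch that recomputes the mammal rows from the micro_fmax values in this checklist; the function name and the row layout are illustrative only.

```python
TARGET = 0.85  # gen_ratio target stated above


def gen_ratio(taxon_micro_fmax, insect_test_micro_fmax):
    """gen_ratio = taxon micro_fmax / insect test micro_fmax."""
    if insect_test_micro_fmax <= 0:
        raise ValueError("insect reference micro_fmax must be positive")
    return taxon_micro_fmax / insect_test_micro_fmax


# (taxon, model, taxon micro_fmax, insect test micro_fmax for the same model)
ROWS = [
    ("mammals", "A", 0.3889, 0.8947),
    ("mammals", "B", 0.3750, 0.8999),
    ("mammals", "C", 0.0200, 0.8902),
    ("mammals", "graph_hpo", 0.2224, 0.9533),
]

for taxon, model, fmax, ref in ROWS:
    ratio = gen_ratio(fmax, ref)
    status = "meets target" if ratio >= TARGET else "below target"
    print(f"{taxon:8s} {model:10s} gen_ratio={ratio:.4f} ({status})")
```

Each ratio uses the insect test Fmax of the same model as its reference, which is why models A, B and C have different denominators.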

Priority Order (do in this order)

  1. 🔄 Phase 4a: mammal build running (500k max, artifacts/pipeline_run.log)
  2. 🔄 Phase 2e: test eval on graph_hpo_best.pth (queued after mammal build)
  3. 🔄 Phase 2f: mammal gen eval on graph_hpo_best.pth (queued)
  4. 🔄 Phase 2d: 40-trial HPO re-run (queued)
  5. 🔄 Phase 6b-6c: upload + push to HF (queued)
  6. ⬜ Phase 3c-3d: fix threshold in server.py (skipped per user; revisit after push)
  7. ⬜ Phase 5: broader taxon coverage
  8. ⬜ Phase 7: generalization improvement if needed

New scripts:

  • scripts/eval_checkpoint.py: test eval of any .pth on the insect test set
  • scripts/run_full_pipeline.sh: chains all steps 1-5 above

Directory

Research Files/
├── artifacts/
│   ├── checkpoints/        ← model .pth files
│   │   ├── ablation_A_ESM_only.pth    ✅ best ablation
│   │   ├── ablation_B_ESM_seq.pth     ✅
│   │   ├── ablation_C_ESM_seq_AF.pth  ✅ (AF hurts transfer)
│   │   └── improved_res.pth           baseline
│   ├── graph_hpo/          ← joint insect+mammal HPO pipeline
│   │   ├── graph_hpo_best.pth         ✅ val_fmax=0.9533 (BEST)
│   │   ├── graph_hpo_best_thresholds.json
│   │   ├── graph_hpo_best_log.json
│   │   ├── hpo_results.json           ← only 2 trials run
│   │   └── methodology.json
│   ├── logs/               ← training JSON logs
│   ├── thresholds/         ← per-label threshold JSON files
│   ├── splits/             ← train/val/test index splits
│   ├── generalization/     ← taxon embeddings + eval results
│   │   ├── mammal_embeddings_v3.parquet  (7 proteins only ⚠️)
│   │   ├── mammal_full_v1.parquet        (used in HPO training)
│   │   └── generalization_results.json
│   ├── threshold_comparison_results.json  ← use C (novelty-gated)
│   └── hpo_test.json       ← old 2-trial HPO result
├── scripts/
│   ├── train_v3_fixed.py
│   ├── hpo.py              ← graph-aware joint HPO
│   ├── graph_hpo_sequence.py  ← runs full pipeline
│   ├── eval_generalization.py
│   ├── prep_taxon.py
│   ├── build_mammal_dataset.py
│   ├── threshold_comparison.py
│   └── archive/
├── Important Files/        ← mlb, parquets, fastas
└── CHECKLIST.md