# ProtFunc Checklist

*Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer*

*Pipeline running: `artifacts/pipeline_run.log` | PID: bash 1963, python 1966*

---

## CRITICAL FINDINGS

| Finding | Impact |
|---------|--------|
| `graph_hpo_best.pth` val_fmax **0.9540**, test_fmax **0.9533**, CAFA **0.6536** — joint insect+mammal training | Massive jump over ablation A (0.8947/0.6338). No overfit. Mammals eval pending. |
| HPO only ran **2 trials** (target was 40) | Best params likely suboptimal. Re-run HPO with full budget. |
| Threshold A (current v3): ~1448 preds/protein, precision=0.002 | **Broken for inference**. Use threshold C (novelty-gated) instead. |
| All gen_ratios < 0.50, mammals n=7 only | Stats unreliable. Need ≥100 mammal proteins. |
| AF features (Model C) kill transfer: gen_ratio=0.02 | Never use `esm_all` features for cross-taxon models. |
| graph_hpo mammal gen_ratio=**0.233** (n=4672) | Joint insect+mammal HPO didn't significantly improve transfer vs ablation A (0.4347 at n=7, unreliable). Phase 7 needed. |
| `protfunc_v3_fixed.pth` referenced in server.py but NOT in HF Space | server.py loads it as priority — will silently fall back if missing. |

---

## Phase 1 — Ablation ✅ DONE

| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio |
|-------|----------|-----------|-----------|-----------------|
| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
| B: ESM+seq (331d) | — | 0.8999 | 0.6360 | 0.4167 |
| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |

Winner: **Model A** (ESM only) — best insect+mammal balance.

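
The mammal gen_ratio column can be reproduced from the micro-Fmax values listed in the Generalization Table of this checklist. A minimal sketch, assuming gen_ratio is taxon micro_fmax divided by insect test micro_fmax as that table states; the function name `gen_ratio` and the `RESULTS` dictionary are hypothetical and not part of the repo scripts.

```python
# Sketch only: recompute gen_ratio from numbers already in this checklist.
# Assumption: gen_ratio = taxon micro_fmax / insect test micro_fmax.

TARGET = 0.85  # target gen_ratio stated in the Generalization Table


def gen_ratio(taxon_fmax: float, insect_test_fmax: float) -> float:
    """Ratio of taxon micro-Fmax to the insect reference micro-Fmax."""
    if insect_test_fmax <= 0:
        raise ValueError("insect reference Fmax must be positive")
    return taxon_fmax / insect_test_fmax


# model -> (mammal micro_fmax, insect test micro_fmax), copied from the tables
RESULTS = {
    "A": (0.3889, 0.8947),
    "B": (0.3750, 0.8999),
    "C": (0.0200, 0.8902),
    "graph_hpo": (0.2224, 0.9533),
}

ratios = {
    model: round(gen_ratio(mammal, insect), 4)
    for model, (mammal, insect) in RESULTS.items()
}

for model, ratio in ratios.items():
    status = "meets target" if ratio >= TARGET else "below target"
    print(f"{model}: gen_ratio={ratio:.4f} ({status})")
```

All four models land below the 0.85 target, which matches the table; the n=7 rows remain unreliable regardless of the arithmetic.
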

---

## Phase 2 — HPO (graph_hpo joint pipeline)

- ✅ **2a.** HPO script ran — but only 2/40 trials completed
- ✅ **2b.** Best params saved: `graph_hpo/hpo_results.json` (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
- ✅ **2c.** Full train on best params done → `graph_hpo_best.pth`, val_fmax=**0.9533**
- ⬜ **2d.** Re-run HPO with full 40 trials — current best may not be global optimum

  ```bash
  cd "/Users/siddhantbhat/Desktop/Research Files"
  .venv/bin/python3 scripts/hpo.py \
    --mammal artifacts/generalization/mammal_full_v1.parquet \
    --n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
    --startup_trials 5 --warmup_steps 5 \
    --multivariate_tpe --group_tpe \
    --out artifacts/graph_hpo/hpo_results.json
  ```

- ✅ **2e.** Test eval on `graph_hpo_best.pth` → test_micro_fmax=**0.9533**, CAFA=**0.6536**, P=0.9553, R=0.9514, t*=0.94
- ✅ **2f.** Mammal gen eval on `graph_hpo_best.pth` → n=4672, micro_fmax=**0.2224**, CAFA=0.201, **gen_ratio=0.233** ⚠️ poor transfer

  ```bash
  .venv/bin/python3 scripts/eval_generalization.py \
    --checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
    --thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
    --mlb "Important Files/mlb_public_v1.pkl" \
    --taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
    --taxon_name mammals_graph_hpo --obo go-basic.obo \
    --out artifacts/graph_hpo/generalization_results.json
  ```

---

## Phase 3 — Threshold Fix ⚠️ BROKEN

Current v3 thresholds output **1448 preds/protein at 0.2% precision** — unusable.

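
The two numbers quoted above (predictions per protein and precision) can be computed for any threshold set with a few lines. This is a sketch on toy data, assuming a label counts as predicted when its probability is at or above its per-label threshold and that precision is micro-averaged; `threshold_stats` is a hypothetical helper, not a function from `threshold_comparison.py`.

```python
# Sketch only: predictions per protein and micro precision for a threshold set.
# Assumption: label j is predicted for a protein when prob >= thresholds[j].
from typing import Sequence, Tuple


def threshold_stats(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return (mean predictions per protein, micro-averaged precision)."""
    n_pred = 0
    n_true_pos = 0
    for prob_row, label_row in zip(probs, labels):
        for prob, label, thr in zip(prob_row, label_row, thresholds):
            if prob >= thr:
                n_pred += 1
                n_true_pos += label
    preds_per_protein = n_pred / len(probs)
    precision = n_true_pos / n_pred if n_pred else 0.0
    return preds_per_protein, precision


# Toy data: 2 proteins x 4 GO terms (illustrative values, not pipeline output)
probs = [[0.9, 0.2, 0.6, 0.1],
         [0.3, 0.8, 0.7, 0.05]]
labels = [[1, 0, 0, 0],
          [0, 1, 0, 0]]

loose = [0.05, 0.05, 0.05, 0.05]   # near-zero thresholds: everything fires
strict = [0.5, 0.5, 0.65, 0.5]     # tighter per-label thresholds

loose_stats = threshold_stats(probs, labels, loose)
strict_stats = threshold_stats(probs, labels, strict)
print("loose :", loose_stats)
print("strict:", strict_stats)
```

Thresholds that sit too low inflate predictions per protein and drag precision down, which is the failure pattern described for the current v3 thresholds.
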

- ✅ **3a.** Comparison run → `artifacts/threshold_comparison_results.json`
- ✅ **3b.** Winner: **C (novelty-gated)** — F1=0.0733, 6.69 preds/protein, novelty subset F1=0.2757
- ✅ **3c.** Threshold comparison on `graph_hpo_best.pth` → A (per-label t*): P=0.143/R=0.984/F1=0.250/20.6preds ✅ use this | B: P=0.878/F1=0.219/0.43preds (83% zero) | C≈B

  ```bash
  # Edit script to point to graph_hpo_best.pth first
  .venv/bin/python3 scripts/threshold_comparison.py
  ```

- ⬜ **3d.** Update server.py to use novelty-gated thresholds by default (currently falls back to broken A thresholds)

---

## Phase 4 — Mammal Dataset Expansion ⚠️ URGENT

n=7 proteins → all gen_ratio stats are noise.

- ⬜ **4a.** Run `build_mammal_dataset.py` for ≥100 mammal proteins with GO-MF annotations

  ```bash
  # Check script args — output should go to artifacts/generalization/mammal_full_v2.parquet
  .venv/bin/python3 scripts/build_mammal_dataset.py
  ```

- ⬜ **4b.** Re-run gen eval for A, B, graph_hpo_best with new mammal set
- ⬜ **4c.** Update CHECKLIST gen_ratio table with reliable numbers

---

## Phase 5 — Broader Taxon Coverage

- ⬜ **5a.** Get FASTAs: fungi, plants (arabidopsis), fish (zebrafish), archaea, nematode
- ⬜ **5b.** For each: `prep_taxon.py` → `eval_generalization.py`

  ```bash
  BASE="/Users/siddhantbhat/Desktop/Research Files"
  .venv/bin/python3 scripts/prep_taxon.py \
    --fasta "Important Files/.fasta" \
    --taxon_name \
    --mlb "Important Files/mlb_public_v1.pkl" \
    --out artifacts/generalization/_embeddings.parquet
  ```

- ⬜ **5c.** Fill generalization table below

---

## Phase 6 — HF Upload & Webapp

- ✅ **6a.** server.py updated locally (generalization API + v3_fixed priority) — commit `bd99db9e`
- ✅ **6b.** Uploaded `graph_hpo_best.pth` → HF as `protfunc_v3_fixed.pth` + thresholds (test_fmax=0.9533 confirmed better)

  ```bash
  huggingface-cli upload Sbhat2026/protfunc-models \
    "artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
  huggingface-cli upload Sbhat2026/protfunc-models \
    "artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
  ```

- ✅ **6c.** Pushed `static/interface.html` to HF Space (commit 2aa49963) — collapsible lower-confidence UI

  ```bash
  cd /Users/siddhantbhat/insecta_webapp
  git add server.py static/interface.html
  git commit -m "fix: use novelty-gated thresholds; add generalization panel"
  git push
  ```

- ⬜ **6d.** Add `/api/generalization` endpoint to serve `generalization_results.json` for all taxa (currently only mammals)

---

## Phase 7 — Generalization Improvement (if gen_ratio < 0.85)

- ⬜ **7a.** Mixed-taxon fine-tuning with more mammal data (Phase 4 first)
- ⬜ **7b.** Domain adaptation: freeze ESM layers, fine-tune MLP head on target taxon
- ⬜ **7c.** Re-eval after changes

---

## Generalization Table

*gen_ratio = taxon micro_fmax / insect test micro_fmax. Target ≥ 0.85.*

| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
|-------|-------|---|------------|-----------|-----------|--------|
| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 ⚠️ | AF kills transfer |
| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | **0.233** ⚠️ | ✅ 2f done |
| fungi | — | — | — | — | — | ⬜ Phase 5 |
| plants | — | — | — | — | — | ⬜ Phase 5 |
| fish | — | — | — | — | — | ⬜ Phase 5 |
| archaea | — | — | — | — | — | ⬜ Phase 5 |
| nematode | — | — | — | — | — | ⬜ Phase 5 |

---

## Priority Order (do in this order)

1. 🔄 **Phase 4a** — mammal build running (500k max, `artifacts/pipeline_run.log`)
2. 🔄 **Phase 2e** — test eval on graph_hpo_best.pth (queued after mammal build)
3. 🔄 **Phase 2f** — mammal gen eval on graph_hpo_best.pth (queued)
4. 🔄 **Phase 2d** — 40-trial HPO re-run (queued)
5. 🔄 **Phase 6b-6c** — upload + push to HF (queued)
6. ⬜ **Phase 3c-3d** — fix threshold in server.py (skipped per user; revisit after push)
7. ⬜ **Phase 5** — broader taxon coverage
8. ⬜ **Phase 7** — generalization improvement if needed

New scripts:

- `scripts/eval_checkpoint.py` — test eval any .pth on insect test set
- `scripts/run_full_pipeline.sh` — chains all steps 1-5 above

---

## Directory

```
Research Files/
├── artifacts/
│   ├── checkpoints/                        ← model .pth files
│   │   ├── ablation_A_ESM_only.pth         ✅ best ablation
│   │   ├── ablation_B_ESM_seq.pth          ✅
│   │   ├── ablation_C_ESM_seq_AF.pth       ✅ (AF hurts transfer)
│   │   └── improved_res.pth                baseline
│   ├── graph_hpo/                          ← joint insect+mammal HPO pipeline
│   │   ├── graph_hpo_best.pth              ✅ val_fmax=0.9533 (BEST)
│   │   ├── graph_hpo_best_thresholds.json
│   │   ├── graph_hpo_best_log.json
│   │   ├── hpo_results.json                ← only 2 trials run
│   │   └── methodology.json
│   ├── logs/                               ← training JSON logs
│   ├── thresholds/                         ← per-label threshold JSON files
│   ├── splits/                             ← train/val/test index splits
│   ├── generalization/                     ← taxon embeddings + eval results
│   │   ├── mammal_embeddings_v3.parquet    (7 proteins only ⚠️)
│   │   ├── mammal_full_v1.parquet          (used in HPO training)
│   │   └── generalization_results.json
│   ├── threshold_comparison_results.json   ← use C (novelty-gated)
│   └── hpo_test.json                       ← old 2-trial HPO result
├── scripts/
│   ├── train_v3_fixed.py
│   ├── hpo.py                              ← graph-aware joint HPO
│   ├── graph_hpo_sequence.py               ← runs full pipeline
│   ├── eval_generalization.py
│   ├── prep_taxon.py
│   ├── build_mammal_dataset.py
│   ├── threshold_comparison.py
│   └── archive/
├── Important Files/                        ← mlb, parquets, fastas
└── CHECKLIST.md
```
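
Since a missing checkpoint can cause a silent fallback (see Critical Findings), it may help to check the layout above before a pipeline run or an upload. A minimal sketch, assuming the relative paths shown in the directory tree; the `EXPECTED` list is an illustrative subset and `missing_artifacts` is a hypothetical helper, not an existing script.

```python
# Sketch only: report which expected files are absent under a project root.
# Assumption: paths are relative to the "Research Files/" root shown above.
from pathlib import Path
from typing import Iterable, List

EXPECTED = [
    "artifacts/graph_hpo/graph_hpo_best.pth",
    "artifacts/graph_hpo/graph_hpo_best_thresholds.json",
    "artifacts/graph_hpo/hpo_results.json",
    "artifacts/generalization/mammal_full_v1.parquet",
    "scripts/hpo.py",
    "scripts/eval_generalization.py",
]


def missing_artifacts(root: str, expected: Iterable[str] = EXPECTED) -> List[str]:
    """Return the relative paths from `expected` that do not exist under `root`."""
    root_path = Path(root)
    return [rel for rel in expected if not (root_path / rel).exists()]


# Example: check the current working directory and list anything absent.
for rel in missing_artifacts("."):
    print(f"MISSING: {rel}")
```

Running it from the project root prints one line per absent file and nothing when everything in the list is present.
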