# ProtFunc Checklist
*Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer*
*Pipeline running: `artifacts/pipeline_run.log` | PID: bash 1963, python 1966*
---
## CRITICAL FINDINGS
| Finding | Impact |
|---------|--------|
| `graph_hpo_best.pth`: val_fmax **0.9540**, test_fmax **0.9533**, CAFA **0.6536** (joint insect+mammal training) | Large jump over ablation A (0.8947/0.6338). No sign of overfitting. Mammal eval done (Phase 2f). |
| HPO only ran **2 trials** (target was 40) | Best params likely suboptimal. Re-run HPO with full budget. |
| Threshold A (current v3): ~1448 preds/protein, precision=0.002 | **Broken for inference**. Use threshold C (novelty-gated) instead. |
| All gen_ratios < 0.50, mammals n=7 only | Stats unreliable. Need ≥100 mammal proteins. |
| AF features (Model C) kill transfer: gen_ratio=0.02 | Never use `esm_all` features for cross-taxon models. |
| graph_hpo mammal gen_ratio=**0.233** (n=4672) | Joint insect+mammal HPO didn't significantly improve transfer vs ablation A (0.4347 at n=7, unreliable). Phase 7 needed. |
| `protfunc_v3_fixed.pth` referenced in server.py but NOT in HF Space | server.py loads it as the priority checkpoint and will silently fall back if it is missing. |
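
One way to make the silent fallback visible is to resolve the checkpoint through a helper that logs a warning whenever the preferred file is absent. This is a minimal sketch, not the actual server.py logic: `resolve_checkpoint`, the logger name, and the candidate list are hypothetical.

```python
import logging
from pathlib import Path
from typing import Sequence

# Hypothetical logger name; server.py may use a different one.
logger = logging.getLogger("protfunc.checkpoints")


def resolve_checkpoint(candidates: Sequence[Path]) -> Path:
    """Return the first existing checkpoint from an ordered priority list.

    A warning is logged whenever the top-priority file is missing, so a
    fallback never happens without leaving a trace in the logs.
    """
    if not candidates:
        raise ValueError("no checkpoint candidates given")
    for rank, path in enumerate(candidates):
        if path.is_file():
            if rank > 0:
                logger.warning(
                    "preferred checkpoint %s is missing; falling back to %s",
                    candidates[0],
                    path,
                )
            return path
    # Nothing usable at all: fail loudly instead of serving without a model.
    raise FileNotFoundError(
        "none of the checkpoints exist: " + ", ".join(str(p) for p in candidates)
    )
```

Called with `protfunc_v3_fixed.pth` first and any older checkpoint second, this returns the older file but records the substitution, which is enough to catch a missing upload.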
---
## Phase 1: Ablation ✅ DONE
| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio |
|-------|----------|-----------|-----------|-----------------|
| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
| B: ESM+seq (331d) | - | 0.8999 | 0.6360 | 0.4167 |
| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |
Winner: **Model A** (ESM only), which has the best insect+mammal balance.
---
## Phase 2: HPO (graph_hpo joint pipeline)
- ✅ **2a.** HPO script ran, but only 2/40 trials completed
- ✅ **2b.** Best params saved: `graph_hpo/hpo_results.json` (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
- ✅ **2c.** Full train on best params done → `graph_hpo_best.pth`, val_fmax=**0.9533**
- ⬜ **2d.** Re-run HPO with the full 40 trials; the current best may not be the global optimum
```bash
cd "/Users/siddhantbhat/Desktop/Research Files"
.venv/bin/python3 scripts/hpo.py \
--mammal artifacts/generalization/mammal_full_v1.parquet \
--n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
--startup_trials 5 --warmup_steps 5 \
--multivariate_tpe --group_tpe \
--out artifacts/graph_hpo/hpo_results.json
```
- ✅ **2e.** Test eval on `graph_hpo_best.pth` → test_micro_fmax=**0.9533**, CAFA=**0.6536**, P=0.9553, R=0.9514, t*=0.94
- ✅ **2f.** Mammal gen eval on `graph_hpo_best.pth` → n=4672, micro_fmax=**0.2224**, CAFA=0.201, **gen_ratio=0.233** ⚠️ poor transfer
```bash
.venv/bin/python3 scripts/eval_generalization.py \
--checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
--thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
--mlb "Important Files/mlb_public_v1.pkl" \
--taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
--taxon_name mammals_graph_hpo --obo go-basic.obo \
--out artifacts/graph_hpo/generalization_results.json
```
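
For reference, the micro Fmax and t* values reported in 2e can be read as the best micro-averaged F1 over a sweep of global thresholds. The sketch below assumes that definition; the actual eval scripts may compute it differently, and `f1` and `micro_fmax` are illustrative names. As a consistency check, the harmonic mean of the reported P=0.9553 and R=0.9514 rounds to 0.9533.

```python
from typing import Sequence, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def micro_fmax(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return (best micro F1, threshold that achieves it).

    scores[i][j] is the predicted probability of label j for protein i,
    labels[i][j] is 1 if the annotation is true, else 0. Counts are pooled
    over all (protein, label) pairs, which is what "micro" means here.
    """
    best_f, best_t = 0.0, thresholds[0]
    for t in thresholds:
        tp = fp = fn = 0
        for row_scores, row_labels in zip(scores, labels):
            for s, y in zip(row_scores, row_labels):
                predicted = s >= t
                if predicted and y:
                    tp += 1
                elif predicted:
                    fp += 1
                elif y:
                    fn += 1
        if tp == 0:
            continue  # precision and recall are both zero at this threshold
        f = f1(tp / (tp + fp), tp / (tp + fn))
        if f > best_f:  # strict ">" keeps the lowest threshold on ties
            best_f, best_t = f, t
    return best_f, best_t
```

Note that this pooled variant differs from the protein-centric CAFA Fmax, which averages precision and recall per protein before taking the maximum; that difference is one reason the two columns in the tables are not directly comparable.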
---
## Phase 3: Threshold Fix ⚠️ BROKEN
Current v3 thresholds output **1448 preds/protein at 0.2% precision**, which is unusable.
- ✅ **3a.** Comparison run → `artifacts/threshold_comparison_results.json`
- ✅ **3b.** Winner: **C (novelty-gated)** with F1=0.0733, 6.69 preds/protein, novelty subset F1=0.2757
- ✅ **3c.** Threshold comparison on `graph_hpo_best.pth` → A (per-label t*): P=0.143/R=0.984/F1=0.250/20.6 preds ✅ use this | B: P=0.878/F1=0.219/0.43 preds (83% zero) | C≈B
```bash
.venv/bin/python3 scripts/threshold_comparison.py
# Edit script to point to graph_hpo_best.pth first
```
- ⬜ **3d.** Update server.py to use novelty-gated thresholds by default (currently falls back to broken A thresholds)
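
The quantities compared in 3b and 3c (precision, recall, F1, preds/protein, share of proteins with zero predictions) can all be derived from one pass over the score matrix with a per-label threshold vector. The following is a sketch under that assumption; `threshold_report` is a hypothetical helper, not the code in `threshold_comparison.py`.

```python
from typing import Dict, Sequence


def threshold_report(
    scores: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    thresholds: Sequence[float],
) -> Dict[str, float]:
    """Summarise a per-label threshold set on a scored evaluation split.

    thresholds[j] is the cutoff for label j. Precision and recall are
    micro-averaged over all (protein, label) pairs.
    """
    tp = fp = fn = 0
    total_preds = 0
    proteins_without_preds = 0
    for row_scores, row_labels in zip(scores, labels):
        preds = [s >= t for s, t in zip(row_scores, thresholds)]
        n_preds = sum(preds)
        total_preds += n_preds
        if n_preds == 0:
            proteins_without_preds += 1
        for predicted, y in zip(preds, row_labels):
            if predicted and y:
                tp += 1
            elif predicted:
                fp += 1
            elif y:
                fn += 1
    n = len(scores)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / pr_sum if pr_sum else 0.0,
        "preds_per_protein": total_preds / n if n else 0.0,
        "zero_pred_fraction": proteins_without_preds / n if n else 0.0,
    }
```

Reporting `preds_per_protein` and `zero_pred_fraction` next to F1 is what exposes the two failure modes seen above: a threshold set that floods each protein with predictions, and one that leaves most proteins with none.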
---
## Phase 4: Mammal Dataset Expansion ⚠️ URGENT
n=7 proteins → all gen_ratio stats are noise.
- ⬜ **4a.** Run `build_mammal_dataset.py` for ≥100 mammal proteins with GO-MF annotations
```bash
.venv/bin/python3 scripts/build_mammal_dataset.py
# Check script args; output should go to artifacts/generalization/mammal_full_v2.parquet
```
- ⬜ **4b.** Re-run gen eval for A, B, graph_hpo_best with new mammal set
- ⬜ **4c.** Update CHECKLIST gen_ratio table with reliable numbers
---
## Phase 5: Broader Taxon Coverage
- ⬜ **5a.** Get FASTAs: fungi, plants (Arabidopsis), fish (zebrafish), archaea, nematode
- ⬜ **5b.** For each: `prep_taxon.py` → `eval_generalization.py`
```bash
BASE="/Users/siddhantbhat/Desktop/Research Files"
.venv/bin/python3 scripts/prep_taxon.py \
--fasta "Important Files/<taxon>.fasta" \
--taxon_name <taxon> \
--mlb "Important Files/mlb_public_v1.pkl" \
--out artifacts/generalization/<taxon>_embeddings.parquet
```
- ⬜ **5c.** Fill generalization table below
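
Step 5b repeats the same two commands per taxon, so the argument lists can be generated rather than typed by hand. This is a dry-run sketch that only prints the commands: the flags are copied from the invocations shown in this checklist, while the taxon keys, the `<taxon>.fasta` naming, and the per-taxon `_results.json` output path are assumptions.

```python
import shlex
from typing import List

# Taxa from step 5a; the short keys are assumed file-name stems.
TAXA = ["fungi", "plants", "fish", "archaea", "nematode"]
PYTHON = ".venv/bin/python3"
MLB = "Important Files/mlb_public_v1.pkl"


def prep_taxon_cmd(taxon: str) -> List[str]:
    """Argument list for scripts/prep_taxon.py (flags as shown in 5b)."""
    return [
        PYTHON, "scripts/prep_taxon.py",
        "--fasta", f"Important Files/{taxon}.fasta",
        "--taxon_name", taxon,
        "--mlb", MLB,
        "--out", f"artifacts/generalization/{taxon}_embeddings.parquet",
    ]


def eval_cmd(taxon: str) -> List[str]:
    """Argument list for scripts/eval_generalization.py (flags as in 2f)."""
    return [
        PYTHON, "scripts/eval_generalization.py",
        "--checkpoint", "artifacts/graph_hpo/graph_hpo_best.pth",
        "--thresholds", "artifacts/graph_hpo/graph_hpo_best_thresholds.json",
        "--mlb", MLB,
        "--taxon_parquet", f"artifacts/generalization/{taxon}_embeddings.parquet",
        "--taxon_name", taxon,
        "--obo", "go-basic.obo",
        # Hypothetical per-taxon output path, so runs do not overwrite each other.
        "--out", f"artifacts/generalization/{taxon}_results.json",
    ]


if __name__ == "__main__":
    for taxon in TAXA:
        # shlex.join quotes arguments with spaces, e.g. "Important Files/...".
        print(shlex.join(prep_taxon_cmd(taxon)))
        print(shlex.join(eval_cmd(taxon)))
```

Passing the same lists to `subprocess.run(cmd, check=True)` would execute them; keeping them as lists avoids quoting problems with the space in `Important Files/`.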
---
## Phase 6: HF Upload & Webapp
- ✅ **6a.** server.py updated locally (generalization API + v3_fixed priority), commit `bd99db9e`
- ✅ **6b.** Uploaded `graph_hpo_best.pth` → HF as `protfunc_v3_fixed.pth` + thresholds (test_fmax=0.9533 confirmed better)
```bash
huggingface-cli upload Sbhat2026/protfunc-models \
"artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
huggingface-cli upload Sbhat2026/protfunc-models \
"artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
```
- ✅ **6c.** Pushed `static/interface.html` to HF Space (commit `2aa49963`): collapsible lower-confidence UI
```bash
cd /Users/siddhantbhat/insecta_webapp
git add server.py static/interface.html
git commit -m "fix: use novelty-gated thresholds; add generalization panel"
git push
```
- ⬜ **6d.** Add `/api/generalization` endpoint to serve `generalization_results.json` for all taxons (currently only mammals)
---
## Phase 7: Generalization Improvement (if gen_ratio < 0.85)
- ⬜ **7a.** Mixed-taxon fine-tuning with more mammal data (Phase 4 first)
- ⬜ **7b.** Domain adaptation: freeze ESM layers, fine-tune MLP head on target taxon
- ⬜ **7c.** Re-eval after changes
---
## Generalization Table
*gen_ratio = taxon micro_fmax / insect test micro_fmax. Target ≥ 0.85.*
| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
|-------|-------|---|------------|-----------|-----------|--------|
| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 ⚠️ | AF kills transfer |
| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | **0.233** ⚠️ | ✅ 2f done |
| fungi | - | - | - | - | - | ⬜ Phase 5 |
| plants | - | - | - | - | - | ⬜ Phase 5 |
| fish | - | - | - | - | - | ⬜ Phase 5 |
| archaea | - | - | - | - | - | ⬜ Phase 5 |
| nematode | - | - | - | - | - | ⬜ Phase 5 |
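
The gen_ratio column follows directly from the definition above the table, with each model's own insect test micro_fmax as the denominator (0.8999 for B and 0.8902 for C, taken from the Phase 1 table). A small sketch that recomputes the filled rows; `EvalRow` and `gen_ratio` are illustrative names, not part of the repo scripts.

```python
from typing import NamedTuple

TARGET_GEN_RATIO = 0.85  # target stated above the table


class EvalRow(NamedTuple):
    taxon: str
    model: str
    taxon_fmax: float        # micro_fmax on the target taxon
    insect_test_fmax: float  # micro_fmax of the same model on the insect test set


def gen_ratio(taxon_fmax: float, insect_test_fmax: float) -> float:
    """gen_ratio = taxon micro_fmax / insect test micro_fmax."""
    if insect_test_fmax <= 0:
        raise ValueError("insect test micro_fmax must be positive")
    return taxon_fmax / insect_test_fmax


# Values copied from the Phase 1 and generalization tables.
ROWS = [
    EvalRow("mammals", "A", 0.3889, 0.8947),
    EvalRow("mammals", "B", 0.3750, 0.8999),
    EvalRow("mammals", "C", 0.0200, 0.8902),
    EvalRow("mammals", "graph_hpo", 0.2224, 0.9533),
]

if __name__ == "__main__":
    for row in ROWS:
        ratio = gen_ratio(row.taxon_fmax, row.insect_test_fmax)
        status = "ok" if ratio >= TARGET_GEN_RATIO else "below target"
        print(f"{row.taxon:8s} {row.model:10s} gen_ratio={ratio:.4f} ({status})")
```

Because the ratio says nothing about sample size, the n column still has to be read alongside it: the three n=7 rows reproduce arithmetically but remain statistically unreliable.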
---
## Priority Order (do in this order)
1. 🔄 **Phase 4a**: mammal build running (500k max, `artifacts/pipeline_run.log`)
2. 🔄 **Phase 2e**: test eval on graph_hpo_best.pth (queued after mammal build)
3. 🔄 **Phase 2f**: mammal gen eval on graph_hpo_best.pth (queued)
4. 🔄 **Phase 2d**: 40-trial HPO re-run (queued)
5. 🔄 **Phase 6b-6c**: upload + push to HF (queued)
6. ⬜ **Phase 3c-3d**: fix threshold in server.py (skipped per user; revisit after push)
7. ⬜ **Phase 5**: broader taxon coverage
8. ⬜ **Phase 7**: generalization improvement if needed
New scripts:
- `scripts/eval_checkpoint.py`: test eval of any .pth on the insect test set
- `scripts/run_full_pipeline.sh`: chains all steps 1-5 above
---
## Directory
```
Research Files/
├── artifacts/
│   ├── checkpoints/                          ← model .pth files
│   │   ├── ablation_A_ESM_only.pth           ✅ best ablation
│   │   ├── ablation_B_ESM_seq.pth            ✅
│   │   ├── ablation_C_ESM_seq_AF.pth         ✅ (AF hurts transfer)
│   │   └── improved_res.pth                  baseline
│   ├── graph_hpo/                            ← joint insect+mammal HPO pipeline
│   │   ├── graph_hpo_best.pth                ✅ val_fmax=0.9533 (BEST)
│   │   ├── graph_hpo_best_thresholds.json
│   │   ├── graph_hpo_best_log.json
│   │   ├── hpo_results.json                  ← only 2 trials run
│   │   └── methodology.json
│   ├── logs/                                 ← training JSON logs
│   ├── thresholds/                           ← per-label threshold JSON files
│   ├── splits/                               ← train/val/test index splits
│   ├── generalization/                       ← taxon embeddings + eval results
│   │   ├── mammal_embeddings_v3.parquet      (7 proteins only ⚠️)
│   │   ├── mammal_full_v1.parquet            (used in HPO training)
│   │   └── generalization_results.json
│   ├── threshold_comparison_results.json     ← use C (novelty-gated)
│   └── hpo_test.json                         ← old 2-trial HPO result
├── scripts/
│   ├── train_v3_fixed.py
│   ├── hpo.py                                ← graph-aware joint HPO
│   ├── graph_hpo_sequence.py                 ← runs full pipeline
│   ├── eval_generalization.py
│   ├── prep_taxon.py
│   ├── build_mammal_dataset.py
│   ├── threshold_comparison.py
│   └── archive/
├── Important Files/                          ← mlb, parquets, fastas
└── CHECKLIST.md
```