# ProtFunc Checklist

Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer

Pipeline running: `artifacts/pipeline_run.log` | PID: bash 1963, python 1966

## CRITICAL FINDINGS

| Finding | Impact |
|---|---|
| `graph_hpo_best.pth`: val_fmax 0.9540, test_fmax 0.9533, CAFA 0.6536 (joint insect+mammal training) | Large jump over ablation A (test 0.8947 / CAFA 0.6338). No overfitting: val and test Fmax agree within 0.001. Mammal transfer is poor (see the graph_hpo row below). |
| HPO ran only 2 of the 40 target trials | Best params are likely suboptimal. Re-run HPO with the full budget (2d). |
| Threshold A on v3 (current): ~1448 preds/protein, precision 0.002 | Unusable for inference on v3, where threshold C (novelty-gated) won (3b). On `graph_hpo_best.pth`, per-label A came out best instead (3c). |
| Ablation gen_ratios are all < 0.50 but were measured on only n=7 mammal proteins | Stats unreliable. Need ≥100 mammal proteins. |
| AF features (Model C) kill transfer: gen_ratio 0.02 | Never use `esm_all` features for cross-taxon models. |
| graph_hpo mammal gen_ratio 0.233 (n=4672) | Joint insect+mammal HPO did not improve transfer: 0.233 is below ablation A's 0.4347, although that figure (n=7) is unreliable. Phase 7 needed. |
| `protfunc_v3_fixed.pth` is referenced in server.py but NOT present in the HF Space | server.py loads it first and silently falls back if it is missing. |
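The last finding notes that server.py silently falls back when `protfunc_v3_fixed.pth` is missing. Below is a minimal sketch of making that fallback visible. It assumes server.py tries an ordered list of checkpoint file names. The names `resolve_checkpoint`, the logger name, and the `exists` hook are hypothetical and are not taken from server.py.

```python
import logging
from pathlib import Path
from typing import Callable, Optional, Sequence

logger = logging.getLogger("protfunc.server")  # hypothetical logger name


def resolve_checkpoint(
    candidates: Sequence[str],
    model_dir: str = ".",
    exists: Callable[[Path], bool] = Path.is_file,
) -> Optional[Path]:
    """Return the first existing checkpoint in priority order, warning on any fallback.

    `exists` defaults to a real filesystem check; it is injectable so the
    selection logic can be tested without creating files.
    """
    base = Path(model_dir)
    for rank, name in enumerate(candidates):
        path = base / name
        if exists(path):
            if rank > 0:
                # Make the fallback visible instead of silent.
                logger.warning("Missing %s; falling back to %s", list(candidates[:rank]), path)
            return path
    logger.error("No checkpoint found in %s among %s", base, list(candidates))
    return None
```

For example, `resolve_checkpoint(["protfunc_v3_fixed.pth", "fallback.pth"])` returns the path of `protfunc_v3_fixed.pth` when that file is present. Otherwise it logs a warning and returns the fallback. `fallback.pth` is a placeholder name.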

## Phase 1: Ablation ✅ DONE

| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio (n=7) |
|---|---|---|---|---|
| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
| B: ESM+seq (331d) | n/a | 0.8999 | 0.6360 | 0.4167 |
| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |

Winner: Model A (ESM only). It has the best mammal gen_ratio, and its insect test Fmax is only about 0.005 below B.

## Phase 2: HPO (graph_hpo joint pipeline)

- ✅ 2a. HPO script ran, but only 2/40 trials completed
- ✅ 2b. Best params saved: `graph_hpo/hpo_results.json` (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
- ✅ 2c. Full train on best params done → `graph_hpo_best.pth`, val_fmax=0.9533

⬜ 2d. Re-run HPO with the full 40 trials. The current best comes from only 2 trials and may not be the global optimum.

```bash
cd "/Users/siddhantbhat/Desktop/Research Files"
.venv/bin/python3 scripts/hpo.py \
  --mammal artifacts/generalization/mammal_full_v1.parquet \
  --n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
  --startup_trials 5 --warmup_steps 5 \
  --multivariate_tpe --group_tpe \
  --out artifacts/graph_hpo/hpo_results.json
```

✅ 2e. Test eval on graph_hpo_best.pth → test_micro_fmax=0.9533, CAFA=0.6536, P=0.9553, R=0.9514, t*=0.94
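Micro-averaged Fmax pools every (protein, GO term) pair, sweeps one global threshold, and keeps the best F1. The reported P=0.9553 and R=0.9514 give 2PR/(P+R) ≈ 0.9533, which matches test_micro_fmax. Below is a minimal pure-Python sketch. It assumes binary label rows and score rows aligned to the same label columns. Whether the eval script uses the same 0.01-step threshold grid is also an assumption. The CAFA column is the protein-centric Fmax: precision is averaged over proteins with at least one prediction, and recall over all proteins. It is not computed here.

```python
from typing import Optional, Sequence, Tuple


def micro_fmax(
    y_true: Sequence[Sequence[int]],
    y_score: Sequence[Sequence[float]],
    thresholds: Optional[Sequence[float]] = None,
) -> Tuple[float, float, float, float]:
    """Micro-averaged Fmax: returns (fmax, precision, recall, t*).

    y_true[i][j] is 1 if protein i has GO term j; y_score[i][j] is the model score.
    """
    if thresholds is None:
        thresholds = [k / 100 for k in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best = (0.0, 0.0, 0.0, 0.0)
    for t in thresholds:
        tp = fp = fn = 0
        # Pool counts over every (protein, term) pair: this is what makes it "micro".
        for truth_row, score_row in zip(y_true, y_score):
            for y, s in zip(truth_row, score_row):
                if s >= t:
                    if y:
                        tp += 1
                    else:
                        fp += 1
                elif y:
                    fn += 1
        if tp == 0:
            continue  # F1 is 0 when no correct term is predicted
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:  # strict ">" keeps the lowest threshold among ties
            best = (f1, precision, recall, t)
    return best
```

For example, `micro_fmax([[1, 0, 1], [0, 1, 0]], [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])` returns `(1.0, 1.0, 1.0, 0.31)`.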

✅ 2f. Mammal gen eval on graph_hpo_best.pth → n=4672, micro_fmax=0.2224, CAFA=0.201, gen_ratio=0.233 ⚠️ poor transfer

```bash
.venv/bin/python3 scripts/eval_generalization.py \
  --checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
  --thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
  --mlb "Important Files/mlb_public_v1.pkl" \
  --taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
  --taxon_name mammals_graph_hpo --obo go-basic.obo \
  --out artifacts/graph_hpo/generalization_results.json
```

## Phase 3: Threshold Fix ⚠️ BROKEN

Current v3 thresholds output 1448 preds/protein at 0.2% precision — unusable.

- ✅ 3a. Comparison run → `artifacts/threshold_comparison_results.json`
- ✅ 3b. Winner (v3): C (novelty-gated), F1=0.0733, 6.69 preds/protein, novelty-subset F1=0.2757
- ✅ 3c. Threshold comparison on `graph_hpo_best.pth`:
  - A (per-label t*): P=0.143, R=0.984, F1=0.250, 20.6 preds/protein ✅ use this
  - B: P=0.878, F1=0.219, 0.43 preds/protein (83% of proteins get zero predictions)
  - C ≈ B
```bash
# Edit the script to point to graph_hpo_best.pth first
.venv/bin/python3 scripts/threshold_comparison.py
```
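⬜ 3d. Update server.py to use novelty-gated thresholds by default (it currently falls back to the broken v3 A thresholds). Reconcile with 3c first: on graph_hpo_best.pth, per-label A outperformed novelty-gated C.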
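The metrics compared in Phase 3 can be reproduced from a binary prediction matrix: precision, preds/protein, and the share of proteins with zero predictions. Below is a minimal sketch. It assumes that "per-label t*" (option A) means choosing each GO term's threshold to maximize that term's F1 on a validation split; that is an assumption about threshold_comparison.py. The novelty-gated option C is not modeled, because its gating rule is not described here. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def per_label_thresholds(
    y_true: Sequence[Sequence[int]],
    y_score: Sequence[Sequence[float]],
    grid: Sequence[float],
) -> List[float]:
    """Per GO term, pick the grid threshold with the best F1 on validation data."""
    ordered = sorted(grid, reverse=True)  # high -> low: ties keep the stricter threshold
    chosen: List[float] = []
    for j in range(len(y_true[0])):
        best_f1, best_t = -1.0, ordered[0]
        for t in ordered:
            tp = fp = fn = 0
            for truth_row, score_row in zip(y_true, y_score):
                pred = score_row[j] >= t
                y = bool(truth_row[j])
                tp += int(pred and y)
                fp += int(pred and not y)
                fn += int(y and not pred)
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        chosen.append(best_t)
    return chosen


def apply_thresholds(y_score: Sequence[Sequence[float]], thresholds: Sequence[float]) -> List[List[int]]:
    """Binarize scores column-wise with one threshold per label."""
    return [[int(s >= t) for s, t in zip(row, thresholds)] for row in y_score]


def summarize(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Precision, mean predictions per protein, and fraction of proteins with no predictions."""
    n_pred = sum(sum(row) for row in y_pred)
    tp = sum(p * y for pred_row, true_row in zip(y_pred, y_true) for p, y in zip(pred_row, true_row))
    n_proteins = len(y_pred)
    return {
        "precision": tp / n_pred if n_pred else 0.0,
        "preds_per_protein": n_pred / n_proteins,
        "zero_pred_fraction": sum(1 for row in y_pred if not any(row)) / n_proteins,
    }
```

The grid is scanned from high to low. This means a term with no validation positives keeps the most conservative threshold rather than the lowest one, which would otherwise predict that term for almost every protein.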

## Phase 4: Mammal Dataset Expansion ⚠️ URGENT

The ablation evals (A, B, C) used only n=7 mammal proteins, so those gen_ratio stats are noise. The graph_hpo eval in 2f used n=4672.

⬜ 4a. Run build_mammal_dataset.py for ≥100 mammal proteins with GO-MF annotations

```bash
# Check the script's args; output should go to artifacts/generalization/mammal_full_v2.parquet
.venv/bin/python3 scripts/build_mammal_dataset.py
```

- ⬜ 4b. Re-run gen eval for A, B, and graph_hpo_best on the new mammal set
- ⬜ 4c. Update the CHECKLIST gen_ratio table with reliable numbers
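One way to see why n=7 is too small is a protein-level bootstrap of micro F1 at a fixed threshold. This is a minimal sketch and is not part of the existing scripts. It assumes per-protein (tp, fp, fn) counts have already been computed at t*, and `bootstrap_micro_f1` is a hypothetical helper. The sketch resamples whole proteins rather than individual labels because the GO terms of one protein are correlated. Comparing the interval width at n=7 with the width on the new mammal set shows whether the gen_ratio differences are meaningful. With ~250k insect proteins as the reference, nearly all of the gen_ratio uncertainty comes from the taxon side.

```python
import random
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # per-protein (tp, fp, fn) at a fixed threshold


def micro_f1(counts: Sequence[Counts]) -> float:
    """Micro F1 from pooled per-protein counts: 2TP / (2TP + FP + FN)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_micro_f1(
    counts: Sequence[Counts], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) of a percentile bootstrap over proteins."""
    rng = random.Random(seed)
    n = len(counts)
    stats = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return micro_f1(counts), lower, upper
```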

## Phase 5: Broader Taxon Coverage

⬜ 5a. Get FASTAs: fungi, plants (Arabidopsis), fish (zebrafish), archaea, nematode

⬜ 5b. For each: prep_taxon.py → eval_generalization.py

```bash
BASE="/Users/siddhantbhat/Desktop/Research Files"
cd "$BASE"
TAXON=fungi  # repeat for plants, fish, archaea, nematode
.venv/bin/python3 scripts/prep_taxon.py \
  --fasta "Important Files/${TAXON}.fasta" \
  --taxon_name "$TAXON" \
  --mlb "Important Files/mlb_public_v1.pkl" \
  --out "artifacts/generalization/${TAXON}_embeddings.parquet"
```

⬜ 5c. Fill generalization table below
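The 5b loop can also be driven from Python so that every taxon gets the same flags. Below is a minimal sketch with the following assumptions:

- The flags are copied from the prep_taxon.py and eval_generalization.py commands in this checklist.
- The per-taxon `--out` path for eval_generalization.py (`artifacts/generalization/<taxon>_results.json`) is a hypothetical naming choice.
- The FASTA names assume the `<taxon>.fasta` pattern used above.

`build_commands` only builds argument lists. `run_all` executes them with `subprocess.run(check=True)`, which stops at the first failing step.

```python
import subprocess
from typing import List, Sequence

PY = ".venv/bin/python3"
MLB = "Important Files/mlb_public_v1.pkl"
CKPT = "artifacts/graph_hpo/graph_hpo_best.pth"
THRESH = "artifacts/graph_hpo/graph_hpo_best_thresholds.json"
TAXA = ("fungi", "plants", "fish", "archaea", "nematode")


def build_commands(taxon: str) -> List[List[str]]:
    """Return [prep_taxon argv, eval_generalization argv] for one taxon."""
    parquet = f"artifacts/generalization/{taxon}_embeddings.parquet"
    prep = [
        PY, "scripts/prep_taxon.py",
        "--fasta", f"Important Files/{taxon}.fasta",
        "--taxon_name", taxon,
        "--mlb", MLB,
        "--out", parquet,
    ]
    evaluate = [
        PY, "scripts/eval_generalization.py",
        "--checkpoint", CKPT,
        "--thresholds", THRESH,
        "--mlb", MLB,
        "--taxon_parquet", parquet,
        "--taxon_name", taxon,
        "--obo", "go-basic.obo",
        "--out", f"artifacts/generalization/{taxon}_results.json",  # hypothetical path
    ]
    return [prep, evaluate]


def run_all(base_dir: str, taxa: Sequence[str] = TAXA) -> None:
    """Run prep then eval for each taxon; check=True raises on the first failure."""
    for taxon in taxa:
        for argv in build_commands(taxon):
            # cwd=base_dir mirrors `cd "$BASE"`; relative paths resolve from there.
            subprocess.run(argv, check=True, cwd=base_dir)
```

For example, `run_all("/Users/siddhantbhat/Desktop/Research Files")` would run both steps for all five taxa.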

## Phase 6: HF Upload & Webapp

✅ 6a. server.py updated locally (generalization API + v3_fixed priority) — commit bd99db9e

✅ 6b. Uploaded graph_hpo_best.pth → HF as protfunc_v3_fixed.pth + thresholds (test_fmax=0.9533 confirmed better)

```bash
huggingface-cli upload Sbhat2026/protfunc-models \
  "artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
huggingface-cli upload Sbhat2026/protfunc-models \
  "artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
```

✅ 6c. Pushed static/interface.html to HF Space (commit 2aa49963) — collapsible lower-confidence UI

```bash
cd /Users/siddhantbhat/insecta_webapp
git add server.py static/interface.html
git commit -m "fix: use novelty-gated thresholds; add generalization panel"
git push
```

⬜ 6d. Add an /api/generalization endpoint that serves generalization_results.json for all taxa (currently only mammals)
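For 6d, the endpoint mainly needs to merge one results file per taxon. Below is a minimal, framework-agnostic sketch. It assumes each taxon's eval writes a separate JSON file. The taxon-to-path mapping is hypothetical apart from the mammals path used in 2f. The returned dict could be passed to whatever JSON response helper server.py already uses. Missing files map to `None`, so the UI can show those taxa as pending.

```python
import json
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

# Hypothetical mapping; only the mammals path appears in this checklist (2f).
RESULT_FILES: Dict[str, str] = {
    "mammals": "artifacts/graph_hpo/generalization_results.json",
    "fungi": "artifacts/generalization/fungi_results.json",
}


def collect_generalization(
    files: Mapping[str, str], base_dir: str = "."
) -> Dict[str, Optional[Any]]:
    """Load each taxon's results JSON; taxa whose file is missing map to None (pending)."""
    out: Dict[str, Optional[Any]] = {}
    for taxon, rel_path in files.items():
        path = Path(base_dir) / rel_path
        out[taxon] = json.loads(path.read_text()) if path.is_file() else None
    return out
```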

## Phase 7: Generalization Improvement (if gen_ratio < 0.85)

- ⬜ 7a. Mixed-taxon fine-tuning with more mammal data (do Phase 4 first)
- ⬜ 7b. Domain adaptation: freeze ESM layers, fine-tune the MLP head on the target taxon
- ⬜ 7c. Re-eval after changes

## Generalization Table

gen_ratio = taxon micro_fmax / the same model's insect test micro_fmax. Target ≥ 0.85.

| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
|---|---|---|---|---|---|---|
| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 ⚠️ | AF kills transfer |
| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | 0.233 ⚠️ | ✅ 2f done |
| fungi | - | - | - | - | - | ⬜ Phase 5 |
| plants | - | - | - | - | - | ⬜ Phase 5 |
| fish | - | - | - | - | - | ⬜ Phase 5 |
| archaea | - | - | - | - | - | ⬜ Phase 5 |
| nematode | - | - | - | - | - | ⬜ Phase 5 |
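The gen_ratio column can be recomputed from the table itself. Each mammal micro_fmax is divided by the same model's insect test micro_fmax (B and C use their Phase 1 test Fmax). This is a small check that uses only numbers from the rows above. The `GenRow` structure is just for illustration.

```python
from typing import NamedTuple

TARGET = 0.85  # gen_ratio target from this checklist


class GenRow(NamedTuple):
    model: str
    insect_micro_fmax: float  # same model's insect test micro_fmax (the reference)
    taxon_micro_fmax: float


def gen_ratio(row: GenRow) -> float:
    return row.taxon_micro_fmax / row.insect_micro_fmax


MAMMAL_ROWS = [
    GenRow("A", 0.8947, 0.3889),
    GenRow("B", 0.8999, 0.3750),
    GenRow("C", 0.8902, 0.0200),
    GenRow("graph_hpo", 0.9533, 0.2224),
]

for r in MAMMAL_ROWS:
    ratio = gen_ratio(r)
    status = "meets target" if ratio >= TARGET else "below target"
    print(f"{r.model:>9}: gen_ratio={ratio:.4f} ({status})")
```

This prints 0.4347, 0.4167, 0.0225 and 0.2333, matching the table to the precision shown. All four rows are below the 0.85 target, which is the Phase 7 trigger.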

## Priority Order (do in this order)

- 🔄 Phase 4a: mammal build running (500k max, `artifacts/pipeline_run.log`)
- ✅ Phase 2e: test eval on graph_hpo_best.pth (done: test_micro_fmax=0.9533)
- ✅ Phase 2f: mammal gen eval on graph_hpo_best.pth (done: gen_ratio=0.233)
- 🔄 Phase 2d: 40-trial HPO re-run (queued)
- ✅ Phase 6b-6c: upload + push to HF (done)
- ⬜ Phase 3d: fix thresholds in server.py (skipped per user; revisit after push)
- ⬜ Phase 5: broader taxon coverage
- ⬜ Phase 7: generalization improvement if needed

New scripts:

- `scripts/eval_checkpoint.py`: test eval of any .pth on the insect test set
- `scripts/run_full_pipeline.sh`: chains the first five priority items above (4a, 2e, 2f, 2d, 6b-6c)