
	# ProtFunc Checklist
	Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer
	Pipeline running: `artifacts/pipeline_run.log` | PID: bash 1963, python 1966

	---

	## CRITICAL FINDINGS

	| Finding | Impact |
	|---------|--------|
	| `graph_hpo_best.pth` val_fmax 0.9540, test_fmax 0.9533, CAFA 0.6536 — joint insect+mammal training | Large jump over ablation A (0.8947/0.6338). No sign of overfitting. Mammal transfer is poor (gen_ratio=0.233, see 2f). |
	| HPO only ran 2 trials (target was 40) | Best params are likely suboptimal. Re-run HPO with the full budget. |
	| Threshold A (current v3): ~1448 preds/protein, precision=0.002 | Broken for inference. Use threshold C (novelty-gated) instead. |
	| All gen_ratios < 0.50; ablation mammal evals used n=7 only | Ablation stats are unreliable. Need ≥100 mammal proteins. |
	| AF features (Model C) kill transfer: gen_ratio=0.02 | Never use `esm_all` features for cross-taxon models. |
	| graph_hpo mammal gen_ratio=0.233 (n=4672) | Joint insect+mammal HPO didn't significantly improve transfer vs ablation A (0.4347 at n=7, unreliable). Phase 7 needed. |
	| `protfunc_v3_fixed.pth` referenced in server.py but NOT in HF Space | server.py loads it first and silently falls back if it is missing. |
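
The silent fallback noted in the last row can be made visible with a small resolver. This is a minimal sketch, not the actual server.py logic: `resolve_checkpoint`, the logger name, and the `strict` flag are hypothetical names introduced here for illustration.

```python
import logging
from pathlib import Path

# Hypothetical logger name; server.py may configure logging differently.
log = logging.getLogger("protfunc.server")


def resolve_checkpoint(candidates, strict=False):
    """Return the first existing checkpoint path from `candidates`.

    `candidates` is ordered by priority. Every missing file is logged, so a
    fallback is never silent. With strict=True, a missing top-priority file
    raises instead of falling back.
    """
    paths = [Path(c) for c in candidates]
    if not paths:
        raise ValueError("no checkpoint candidates given")
    for rank, path in enumerate(paths):
        if path.is_file():
            if rank > 0:
                log.warning(
                    "using fallback checkpoint %s (priority file %s missing)",
                    path, paths[0],
                )
            return path
        if strict and rank == 0:
            raise FileNotFoundError(f"priority checkpoint missing: {path}")
        log.warning("checkpoint not found: %s", path)
    raise FileNotFoundError(
        f"none of the checkpoints exist: {[str(p) for p in paths]}"
    )
```

Calling it with `strict=True` at startup would turn a missing `protfunc_v3_fixed.pth` into an immediate error rather than a quiet downgrade.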

	---

	## Phase 1 — Ablation ✅ DONE

	| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio |
	|-------|----------|-----------|-----------|------------------|
	| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
	| B: ESM+seq (331d) | — | 0.8999 | 0.6360 | 0.4167 |
	| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |

	Winner: Model A (ESM only) — best insect+mammal balance.
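
For reference, the Fmax columns report the best F1 found over a sweep of decision thresholds. Below is a minimal pure-Python sketch of the micro-averaged variant under the usual definition. The function name, the input layout (one row of 0/1 labels and one row of scores per protein), and the 0.01 threshold grid are assumptions; the project's evaluation scripts may differ. The CAFA score is typically protein-centric and is not covered here.

```python
def micro_fmax(y_true, y_score, thresholds=None):
    """Return (best_f1, best_threshold) for micro-averaged F1.

    y_true:  list of per-protein rows of 0/1 labels
    y_score: list of per-protein rows of predicted scores, same shape
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99
        thresholds = [i / 100 for i in range(1, 100)]

    # Micro-averaging pools every (protein, label) pair into one set.
    pairs = [
        (t, s)
        for row_t, row_s in zip(y_true, y_score)
        for t, s in zip(row_t, row_s)
    ]
    n_pos = sum(t for t, _ in pairs)

    best_f1, best_thr = 0.0, None
    for thr in thresholds:
        n_pred = sum(1 for _, s in pairs if s >= thr)
        tp = sum(1 for t, s in pairs if s >= thr and t == 1)
        if tp == 0:
            continue  # precision or recall is zero, so F1 is zero
        precision = tp / n_pred
        recall = tp / n_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr
```

The threshold returned alongside the score plays the role of the single global `t*` reported with the test evaluation.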

	---

	## Phase 2 — HPO (graph_hpo joint pipeline)

	- ✅ 2a. HPO script ran — but only 2/40 trials completed
	- ✅ 2b. Best params saved: `graph_hpo/hpo_results.json` (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
	- ✅ 2c. Full train on best params done → `graph_hpo_best.pth`, val_fmax=0.9540 (test_fmax=0.9533, see 2e)
	- ⬜ 2d. Re-run HPO with full 40 trials — current best may not be global optimum
	```bash
	cd "/Users/siddhantbhat/Desktop/Research Files"
	.venv/bin/python3 scripts/hpo.py \
	--mammal artifacts/generalization/mammal_full_v1.parquet \
	--n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
	--startup_trials 5 --warmup_steps 5 \
	--multivariate_tpe --group_tpe \
	--out artifacts/graph_hpo/hpo_results.json
	```
	- ✅ 2e. Test eval on `graph_hpo_best.pth` → test_micro_fmax=0.9533, CAFA=0.6536, P=0.9553, R=0.9514, t*=0.94
	- ✅ 2f. Mammal gen eval on `graph_hpo_best.pth` → n=4672, micro_fmax=0.2224, CAFA=0.201, gen_ratio=0.233 ⚠️ poor transfer
	```bash
	.venv/bin/python3 scripts/eval_generalization.py \
	--checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
	--thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
	--mlb "Important Files/mlb_public_v1.pkl" \
	--taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
	--taxon_name mammals_graph_hpo --obo go-basic.obo \
	--out artifacts/graph_hpo/generalization_results.json
	```

	---

	## Phase 3 — Threshold Fix ⚠️ BROKEN

	Current v3 thresholds output 1448 preds/protein at 0.2% precision — unusable.

	- ✅ 3a. Comparison run → `artifacts/threshold_comparison_results.json`
	- ✅ 3b. Winner: C (novelty-gated) — F1=0.0733, 6.69 preds/protein, novelty subset F1=0.2757
	- ✅ 3c. Threshold comparison on `graph_hpo_best.pth` → A (per-label t*): P=0.143, R=0.984, F1=0.250, 20.6 preds/protein ✅ use this | B: P=0.878, F1=0.219, 0.43 preds/protein (83% zero) | C≈B
	```bash
	# First edit the script so it points to graph_hpo_best.pth
	.venv/bin/python3 scripts/threshold_comparison.py
	```
	- ⬜ 3d. Update server.py to use novelty-gated thresholds by default (currently falls back to broken A thresholds)
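
The figures compared in this phase (precision, recall, F1, preds/protein, share of proteins with zero predictions) can be reproduced for any threshold set with a helper like the one below. It is an illustrative sketch: `threshold_stats` and its input layout are assumptions, not the contents of `threshold_comparison.py`.

```python
def threshold_stats(y_true, y_score, thresholds):
    """Summarize a per-label threshold vector on a labeled set.

    y_true:     list of per-protein rows of 0/1 labels
    y_score:    list of per-protein rows of scores, same shape
    thresholds: one threshold per label (thresholds[j] gates label j)
    """
    n_proteins = len(y_true)
    tp = n_pred = n_pos = n_empty = 0
    for row_t, row_s in zip(y_true, y_score):
        preds = [int(s >= thr) for s, thr in zip(row_s, thresholds)]
        n_pred += sum(preds)
        n_pos += sum(row_t)
        tp += sum(p * t for p, t in zip(preds, row_t))
        n_empty += int(sum(preds) == 0)

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_pos if n_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "preds_per_protein": n_pred / n_proteins if n_proteins else 0.0,
        "zero_pred_fraction": n_empty / n_proteins if n_proteins else 0.0,
    }
```

Tracking `preds_per_protein` and `zero_pred_fraction` next to F1 helps catch both failure modes seen here: thresholds that flood the output and thresholds that leave most proteins with no prediction.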

	---

	## Phase 4 — Mammal Dataset Expansion ⚠️ URGENT

	The ablation evals used n=7 mammal proteins → their gen_ratio stats are noise.

	- ⬜ 4a. Run `build_mammal_dataset.py` for ≥100 mammal proteins with GO-MF annotations
	```bash
	# Check the script args first; output should go to artifacts/generalization/mammal_full_v2.parquet
	.venv/bin/python3 scripts/build_mammal_dataset.py
	```
	- ⬜ 4b. Re-run gen eval for A, B, graph_hpo_best with new mammal set
	- ⬜ 4c. Update CHECKLIST gen_ratio table with reliable numbers

	---

	## Phase 5 — Broader Taxon Coverage

	- ⬜ 5a. Get FASTAs: fungi, plants (arabidopsis), fish (zebrafish), archaea, nematode
	- ⬜ 5b. For each: `prep_taxon.py` → `eval_generalization.py`
	```bash
	BASE="/Users/siddhantbhat/Desktop/Research Files"
	TAXON="fungi"  # repeat for plants, fish, archaea, nematode
	cd "$BASE"
	.venv/bin/python3 scripts/prep_taxon.py \
	--fasta "Important Files/${TAXON}.fasta" \
	--taxon_name "$TAXON" \
	--mlb "Important Files/mlb_public_v1.pkl" \
	--out "artifacts/generalization/${TAXON}_embeddings.parquet"
	```
	- ⬜ 5c. Fill generalization table below
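
Steps 5a to 5c repeat the same two commands per taxon, so the argument lists can be generated rather than typed. The sketch below only builds and prints the commands; the flags are copied from the `prep_taxon.py` and `eval_generalization.py` invocations in this checklist. The per-taxon FASTA names, the per-taxon results filename, and the helper names are assumptions.

```python
import shlex

TAXA = ["fungi", "plants", "fish", "archaea", "nematode"]
PYTHON = ".venv/bin/python3"
MLB = "Important Files/mlb_public_v1.pkl"
CHECKPOINT = "artifacts/graph_hpo/graph_hpo_best.pth"
THRESHOLDS = "artifacts/graph_hpo/graph_hpo_best_thresholds.json"


def taxon_commands(taxon):
    """Return (prep_cmd, eval_cmd) argument lists for one taxon."""
    embeddings = f"artifacts/generalization/{taxon}_embeddings.parquet"
    prep_cmd = [
        PYTHON, "scripts/prep_taxon.py",
        "--fasta", f"Important Files/{taxon}.fasta",  # assumed file name
        "--taxon_name", taxon,
        "--mlb", MLB,
        "--out", embeddings,
    ]
    eval_cmd = [
        PYTHON, "scripts/eval_generalization.py",
        "--checkpoint", CHECKPOINT,
        "--thresholds", THRESHOLDS,
        "--mlb", MLB,
        "--taxon_parquet", embeddings,
        "--taxon_name", taxon,
        "--obo", "go-basic.obo",
        # Assumed: one results file per taxon, to avoid overwriting.
        "--out", f"artifacts/generalization/{taxon}_generalization_results.json",
    ]
    return prep_cmd, eval_cmd


def plan(taxa=TAXA):
    """Return the shell-quoted command lines in execution order."""
    lines = []
    for taxon in taxa:
        for cmd in taxon_commands(taxon):
            lines.append(shlex.join(cmd))
    return lines
```

Printing `plan()` gives a dry run to review; each argument list can then be passed to `subprocess.run(cmd, check=True)` from the `Research Files` directory.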

	---

	## Phase 6 — HF Upload & Webapp

	- ✅ 6a. server.py updated locally (generalization API + v3_fixed priority) — commit `bd99db9e`
	- ✅ 6b. Uploaded `graph_hpo_best.pth` → HF as `protfunc_v3_fixed.pth` + thresholds (test_fmax=0.9533 confirmed better)
	```bash
	huggingface-cli upload Sbhat2026/protfunc-models \
	"artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
	huggingface-cli upload Sbhat2026/protfunc-models \
	"artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
	```
	- ✅ 6c. Pushed `static/interface.html` to HF Space (commit 2aa49963) — collapsible lower-confidence UI
	```bash
	cd /Users/siddhantbhat/insecta_webapp
	git add server.py static/interface.html
	git commit -m "fix: use novelty-gated thresholds; add generalization panel"
	git push
	```
	- ⬜ 6d. Add `/api/generalization` endpoint to serve `generalization_results.json` for all taxa (currently only mammals)

	---

	## Phase 7 — Generalization Improvement (if gen_ratio < 0.85)

	- ⬜ 7a. Mixed-taxon fine-tuning with more mammal data (Phase 4 first)
	- ⬜ 7b. Domain adaptation: freeze ESM layers, fine-tune MLP head on target taxon
	- ⬜ 7c. Re-eval after changes

	---

	## Generalization Table

	gen_ratio = taxon micro_fmax / insect test micro_fmax. Target ≥ 0.85.

	| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
	|-------|-------|---|------------|-----------|-----------|--------|
	| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
	| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
	| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
	| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 ⚠️ | AF kills transfer |
	| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
	| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | 0.233 ⚠️ | ✅ 2f done |
	| fungi | — | — | — | — | — | ⬜ Phase 5 |
	| plants | — | — | — | — | — | ⬜ Phase 5 |
	| fish | — | — | — | — | — | ⬜ Phase 5 |
	| archaea | — | — | — | — | — | ⬜ Phase 5 |
	| nematode | — | — | — | — | — | ⬜ Phase 5 |
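
The gen_ratio column can be re-derived from the micro_fmax values in the table above, which is a quick check when new taxa are added. In this sketch the numbers are copied from the table and from the Phase 1 test Fmax values; the 0.85 target and the 100-protein floor come from this checklist. The helper names and the row layout are illustrative assumptions.

```python
TARGET = 0.85  # target gen_ratio stated above
MIN_N = 100    # reliability floor: "Need >=100 mammal proteins"


def gen_ratio(taxon_fmax, insect_test_fmax):
    """gen_ratio = taxon micro_fmax / insect test micro_fmax."""
    if insect_test_fmax <= 0:
        raise ValueError("insect reference Fmax must be positive")
    return taxon_fmax / insect_test_fmax


# (model, n mammal proteins, mammal micro_fmax, insect test micro_fmax)
ROWS = [
    ("A", 7, 0.3889, 0.8947),
    ("B", 7, 0.3750, 0.8999),
    ("C", 7, 0.0200, 0.8902),
    ("graph_hpo", 4672, 0.2224, 0.9533),
]


def summarize(rows=ROWS):
    """Return one dict per row with the ratio and two status flags."""
    out = []
    for model, n, taxon_fmax, ref_fmax in rows:
        ratio = gen_ratio(taxon_fmax, ref_fmax)
        out.append({
            "model": model,
            "gen_ratio": round(ratio, 4),
            "reliable": n >= MIN_N,          # enough proteins to trust it
            "meets_target": ratio >= TARGET,
        })
    return out
```

With the current numbers only the graph_hpo row passes the sample-size check, and no row reaches the 0.85 target.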

	---

	## Priority Order (do in this order)

	1. 🔄 Phase 4a — mammal build running (500k max, `artifacts/pipeline_run.log`)
	2. ✅ Phase 2e — test eval on graph_hpo_best.pth (done: test_micro_fmax=0.9533)
	3. ✅ Phase 2f — mammal gen eval on graph_hpo_best.pth (done: gen_ratio=0.233)
	4. 🔄 Phase 2d — 40-trial HPO re-run (queued)
	5. ✅ Phase 6b-6c — upload + push to HF (done)
	6. ⬜ Phase 3c-3d — fix threshold in server.py (skipped per user; revisit after push)
	7. ⬜ Phase 5 — broader taxon coverage
	8. ⬜ Phase 7 — generalization improvement if needed

	New scripts:
	- `scripts/eval_checkpoint.py` — test eval any .pth on insect test set
	- `scripts/run_full_pipeline.sh` — chains all steps 1-5 above

	---

	## Directory

	```
	Research Files/
	├── artifacts/
	│   ├── checkpoints/  ← model .pth files
	│   │   ├── ablation_A_ESM_only.pth  ✅ best ablation
	│   │   ├── ablation_B_ESM_seq.pth  ✅
	│   │   ├── ablation_C_ESM_seq_AF.pth  ✅ (AF hurts transfer)
	│   │   └── improved_res.pth  baseline
	│   ├── graph_hpo/  ← joint insect+mammal HPO pipeline
	│   │   ├── graph_hpo_best.pth  ✅ val_fmax=0.9540, test_fmax=0.9533 (BEST)
	│   │   ├── graph_hpo_best_thresholds.json
	│   │   ├── graph_hpo_best_log.json
	│   │   ├── hpo_results.json  ← only 2 trials run
	│   │   └── methodology.json
	│   ├── logs/  ← training JSON logs
	│   ├── thresholds/  ← per-label threshold JSON files
	│   ├── splits/  ← train/val/test index splits
	│   ├── generalization/  ← taxon embeddings + eval results
	│   │   ├── mammal_embeddings_v3.parquet  (7 proteins only ⚠️)
	│   │   ├── mammal_full_v1.parquet  (used in HPO training)
	│   │   └── generalization_results.json
	│   ├── threshold_comparison_results.json  ← use C (novelty-gated)
	│   └── hpo_test.json  ← old 2-trial HPO result
	├── scripts/
	│   ├── train_v3_fixed.py
	│   ├── hpo.py  ← graph-aware joint HPO
	│   ├── graph_hpo_sequence.py  ← runs full pipeline
	│   ├── eval_generalization.py
	│   ├── prep_taxon.py
	│   ├── build_mammal_dataset.py
	│   ├── threshold_comparison.py
	│   └── archive/
	├── Important Files/  ← mlb, parquets, fastas
	└── CHECKLIST.md
	```