# ProtFunc Checklist
*Updated: 2026-04-16 | Goal: best GO-MF on insects + max cross-taxon transfer*
*Pipeline running: `artifacts/pipeline_run.log` | PID: bash 1963, python 1966*

---

## CRITICAL FINDINGS

| Finding | Impact |
|---------|--------|
| `graph_hpo_best.pth` val_fmax **0.9540**, test_fmax **0.9533**, CAFA **0.6536** (joint insect+mammal training) | Large jump over ablation A (0.8947/0.6338). No sign of overfitting (val ≈ test). Mammal eval done in 2f: gen_ratio=0.233. |
| HPO only ran **2 trials** (target was 40) | Best params likely suboptimal. Re-run HPO with full budget. |
| Threshold A (current v3): ~1448 preds/protein, precision=0.002 | **Broken for inference**. Use threshold C (novelty-gated) instead. |
| All gen_ratios < 0.50; ablation models A, B, C evaluated on only n=7 mammal proteins | Ablation stats unreliable. Need ≥100 mammal proteins (graph_hpo eval already uses n=4672). |
| AF features (Model C) kill transfer: gen_ratio=0.02 | Never use `esm_all` features for cross-taxon models. |
| graph_hpo mammal gen_ratio=**0.233** (n=4672) | Joint insect+mammal HPO did not improve transfer: 0.233 is below ablation A's 0.4347, although that figure (n=7) is unreliable. Phase 7 needed. |
| `protfunc_v3_fixed.pth` referenced in server.py but NOT in HF Space | server.py loads it as priority — will silently fall back if missing. |
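
One way to make the checkpoint fallback in the last row visible is to resolve the path explicitly and log whenever the preferred file is skipped. The sketch below is illustrative only: `resolve_checkpoint`, its `strict` flag, and the logger name are hypothetical and are not taken from `server.py`.

```python
import logging
from pathlib import Path

logger = logging.getLogger("protfunc.server")  # hypothetical logger name


def resolve_checkpoint(candidates, strict=False):
    """Return the first existing checkpoint from an ordered candidate list.

    A warning is logged whenever the preferred (first) candidate is skipped.
    With strict=True, a missing preferred candidate raises instead of
    falling back to a lower-priority file.
    """
    paths = [Path(c) for c in candidates]
    for i, path in enumerate(paths):
        if path.is_file():
            if i > 0:
                logger.warning(
                    "Preferred checkpoint %s missing; using fallback %s",
                    paths[0], path,
                )
            return path
        if strict:
            raise FileNotFoundError(f"Required checkpoint not found: {path}")
    raise FileNotFoundError(
        f"No checkpoint found among: {[str(p) for p in paths]}"
    )
```

Calling it with `strict=True` at startup would turn a missing `protfunc_v3_fixed.pth` into a hard failure rather than a quiet downgrade.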

---

## Phase 1 — Ablation ✅ DONE

| Model | Val Fmax | Test Fmax | Test CAFA | Mammal gen_ratio |
|-------|----------|-----------|-----------|-----------------|
| A: ESM only (320d) | 0.8900 | 0.8947 | 0.6338 | 0.4347 ← best |
| B: ESM+seq (331d) | — | 0.8999 | 0.6360 | 0.4167 |
| C: ESM+seq+AF (360d) | 0.8900 | 0.8902 | 0.6326 | 0.0225 ⚠️ AF hurts |

Winner: **Model A** (ESM only) — best insect+mammal balance.

---

## Phase 2 — HPO (graph_hpo joint pipeline)

- ✅ **2a.** HPO script ran — but only 2/40 trials completed
- ✅ **2b.** Best params saved: `graph_hpo/hpo_results.json` (hidden=2048, n_blocks=8, feat_level=esm_seq, score=0.7756)
- ✅ **2c.** Full train on best params done → `graph_hpo_best.pth`, val_fmax=**0.9540**
- ⬜ **2d.** Re-run HPO with full 40 trials — current best may not be global optimum
  ```bash
  cd "/Users/siddhantbhat/Desktop/Research Files"
  .venv/bin/python3 scripts/hpo.py \
    --mammal artifacts/generalization/mammal_full_v1.parquet \
    --n_trials 40 --epochs 20 --patience 6 --alpha 0.6 \
    --startup_trials 5 --warmup_steps 5 \
    --multivariate_tpe --group_tpe \
    --out artifacts/graph_hpo/hpo_results.json
  ```
- ✅ **2e.** Test eval on `graph_hpo_best.pth` → test_micro_fmax=**0.9533**, CAFA=**0.6536**, P=0.9553, R=0.9514, t*=0.94
- ✅ **2f.** Mammal gen eval on `graph_hpo_best.pth` → n=4672, micro_fmax=**0.2224**, CAFA=0.201, **gen_ratio=0.233** ⚠️ poor transfer
  ```bash
  .venv/bin/python3 scripts/eval_generalization.py \
    --checkpoint artifacts/graph_hpo/graph_hpo_best.pth \
    --thresholds artifacts/graph_hpo/graph_hpo_best_thresholds.json \
    --mlb "Important Files/mlb_public_v1.pkl" \
    --taxon_parquet artifacts/generalization/mammal_embeddings_v3.parquet \
    --taxon_name mammals_graph_hpo --obo go-basic.obo \
    --out artifacts/graph_hpo/generalization_results.json
  ```
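
For reference, the micro Fmax reported in 2e can be sketched as a sweep over a single global threshold, keeping the threshold that maximizes micro-averaged F1. This is a minimal pure-Python illustration and assumes one global threshold and a 0.01 grid; the actual logic in `scripts/eval_checkpoint.py` may differ. The function name `micro_fmax` is hypothetical.

```python
def micro_fmax(y_true, y_score, thresholds=None):
    """Sweep one global threshold and return (f1, threshold, precision, recall).

    y_true:  list of rows of 0/1 labels (one row per protein)
    y_score: list of rows of predicted scores, same shape as y_true
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]  # assumed grid
    best = (0.0, None, 0.0, 0.0)
    for t in thresholds:
        tp = fp = fn = 0
        for truth_row, score_row in zip(y_true, y_score):
            for truth, score in zip(truth_row, score_row):
                pred = score >= t
                if pred and truth:
                    tp += 1
                elif pred:
                    fp += 1
                elif truth:
                    fn += 1
        if tp == 0:
            continue  # precision/recall undefined or zero at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, t, precision, recall)
    return best


# Consistency check on the numbers reported in 2e: F1 is the harmonic
# mean of P and R, so P=0.9553 and R=0.9514 should give roughly 0.9533.
p, r = 0.9553, 0.9514
f1_reported = 2 * p * r / (p + r)
print(f"F1 from reported P/R: {f1_reported:.4f}")
```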

---

## Phase 3 — Threshold Fix ⚠️ BROKEN

Current v3 thresholds output **1448 preds/protein at 0.2% precision** — unusable.

- ✅ **3a.** Comparison run → `artifacts/threshold_comparison_results.json`
- ✅ **3b.** Winner: **C (novelty-gated)** — F1=0.0733, 6.69 preds/protein, novelty subset F1=0.2757
- ✅ **3c.** Threshold comparison on `graph_hpo_best.pth`:
  - A (per-label t*): P=0.143, R=0.984, F1=0.250, 20.6 preds/protein ✅ use this
  - B: P=0.878, F1=0.219, 0.43 preds/protein (83% zero)
  - C ≈ B
  ```bash
  # Edit the script to point to graph_hpo_best.pth before running
  .venv/bin/python3 scripts/threshold_comparison.py
  ```
- ⬜ **3d.** Update server.py to use novelty-gated thresholds by default (currently falls back to broken A thresholds)
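
The quantities compared above (precision, recall, F1, preds/protein, and the share of proteins with no predictions) can all be derived from one pass over thresholded scores. The following is a minimal sketch, assuming micro-averaging and one threshold per label; `summarize_thresholds` is a hypothetical helper, not the code in `scripts/threshold_comparison.py`.

```python
def summarize_thresholds(y_true, y_score, thresholds):
    """Micro P/R/F1 plus prediction-volume stats for per-label thresholds.

    y_true:     list of rows of 0/1 labels (one row per protein)
    y_score:    list of rows of predicted scores, same shape as y_true
    thresholds: one threshold per label column
    """
    tp = fp = fn = 0
    total_preds = 0
    empty_proteins = 0
    for truth_row, score_row in zip(y_true, y_score):
        preds = [s >= t for s, t in zip(score_row, thresholds)]
        n_pred = sum(preds)
        total_preds += n_pred
        if n_pred == 0:
            empty_proteins += 1
        for truth, pred in zip(truth_row, preds):
            if pred and truth:
                tp += 1
            elif pred:
                fp += 1
            elif truth:
                fn += 1
    n = len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "preds_per_protein": total_preds / n if n else 0.0,
        "frac_zero_pred": empty_proteins / n if n else 0.0,
    }
```

Reporting `preds_per_protein` and `frac_zero_pred` next to F1 makes the failure modes above easy to spot: very high volume at low precision on one side, mostly empty outputs on the other.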

---

## Phase 4 — Mammal Dataset Expansion ⚠️ URGENT

Ablation models A, B, and C were evaluated on only n=7 mammal proteins, so their gen_ratio stats are noise.

- ⬜ **4a.** Run `build_mammal_dataset.py` for ≥100 mammal proteins with GO-MF annotations
  ```bash
  .venv/bin/python3 scripts/build_mammal_dataset.py
  # Check script args — output should go to artifacts/generalization/mammal_full_v2.parquet
  ```
- ⬜ **4b.** Re-run gen eval for A, B, graph_hpo_best with new mammal set
- ⬜ **4c.** Update CHECKLIST gen_ratio table with reliable numbers

---

## Phase 5 — Broader Taxon Coverage

- ⬜ **5a.** Get FASTAs: fungi, plants (arabidopsis), fish (zebrafish), archaea, nematode
- ⬜ **5b.** For each: `prep_taxon.py` → `eval_generalization.py`
  ```bash
  BASE="/Users/siddhantbhat/Desktop/Research Files"
  cd "$BASE"  # relative paths below assume the project root
  .venv/bin/python3 scripts/prep_taxon.py \
    --fasta "Important Files/<taxon>.fasta" \
    --taxon_name <taxon> \
    --mlb "Important Files/mlb_public_v1.pkl" \
    --out artifacts/generalization/<taxon>_embeddings.parquet
  ```
- ⬜ **5c.** Fill generalization table below

---

## Phase 6 — HF Upload & Webapp

- ✅ **6a.** server.py updated locally (generalization API + v3_fixed priority) — commit `bd99db9e`
- ✅ **6b.** Uploaded `graph_hpo_best.pth` → HF as `protfunc_v3_fixed.pth` + thresholds (test_fmax=0.9533 confirmed better)
  ```bash
  huggingface-cli upload Sbhat2026/protfunc-models \
    "artifacts/graph_hpo/graph_hpo_best.pth" protfunc_v3_fixed.pth
  huggingface-cli upload Sbhat2026/protfunc-models \
    "artifacts/graph_hpo/graph_hpo_best_thresholds.json" protfunc_v3_fixed_thresholds.json
  ```
- ✅ **6c.** Pushed `static/interface.html` to HF Space (commit 2aa49963) — collapsible lower-confidence UI
  ```bash
  cd /Users/siddhantbhat/insecta_webapp
  git add server.py static/interface.html
  git commit -m "fix: use novelty-gated thresholds; add generalization panel"
  git push
  ```
- ⬜ **6d.** Add `/api/generalization` endpoint to serve `generalization_results.json` for all taxa (currently only mammals)

---

## Phase 7 — Generalization Improvement (if gen_ratio < 0.85)

- ⬜ **7a.** Mixed-taxon fine-tuning with more mammal data (Phase 4 first)
- ⬜ **7b.** Domain adaptation: freeze ESM layers, fine-tune MLP head on target taxon
- ⬜ **7c.** Re-eval after changes

---

## Generalization Table

*gen_ratio = taxon micro_fmax / insect test micro_fmax. Target ≥ 0.85.*

| Taxon | Model | n | micro_fmax | cafa_fmax | gen_ratio | Status |
|-------|-------|---|------------|-----------|-----------|--------|
| insects | A | ~250k | 0.8947 | 0.6338 | 1.00 (ref) | ✅ |
| mammals | A | 7 ⚠️ | 0.3889 | 0.3917 | 0.4347 | ⚠️ n too small |
| mammals | B | 7 ⚠️ | 0.3750 | 0.3088 | 0.4167 | ⚠️ n too small |
| mammals | C | 7 ⚠️ | 0.0200 | 0.0056 | 0.0225 ⚠️ | AF kills transfer |
| insects | graph_hpo | ~250k | 0.9533 | 0.6536 | 1.00 (ref) | ✅ 2e done |
| mammals | graph_hpo | 4672 | 0.2224 | 0.201 | **0.233** ⚠️ | ✅ 2f done |
| fungi | — | — | — | — | — | ⬜ Phase 5 |
| plants | — | — | — | — | — | ⬜ Phase 5 |
| fish | — | — | — | — | — | ⬜ Phase 5 |
| archaea | — | — | — | — | — | ⬜ Phase 5 |
| nematode | — | — | — | — | — | ⬜ Phase 5 |
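
The gen_ratio column can be recomputed from the micro_fmax values in the table, which is a quick way to catch transcription errors when new taxa are added. A minimal sketch follows; the function name and the `TARGET` constant are illustrative, and the numbers are copied from the rows above.

```python
TARGET = 0.85  # target gen_ratio stated above the table


def gen_ratio(taxon_fmax, insect_test_fmax):
    """gen_ratio = taxon micro_fmax / insect test micro_fmax."""
    if insect_test_fmax <= 0:
        raise ValueError("insect test micro_fmax must be positive")
    return taxon_fmax / insect_test_fmax


# (taxon, model, taxon micro_fmax, insect test micro_fmax of the same model)
rows = [
    ("mammals", "A", 0.3889, 0.8947),
    ("mammals", "B", 0.3750, 0.8999),
    ("mammals", "C", 0.0200, 0.8902),
    ("mammals", "graph_hpo", 0.2224, 0.9533),
]

for taxon, model, fmax, ref in rows:
    ratio = gen_ratio(fmax, ref)
    status = "ok" if ratio >= TARGET else "below target"
    print(f"{taxon:8s} {model:10s} gen_ratio={ratio:.4f} ({status})")
```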

---

## Priority Order (do in this order)

1. 🔄 **Phase 4a**: mammal build running (500k max, `artifacts/pipeline_run.log`)
2. ✅ **Phase 2e**: test eval on graph_hpo_best.pth (done)
3. ✅ **Phase 2f**: mammal gen eval on graph_hpo_best.pth (done)
4. 🔄 **Phase 2d**: 40-trial HPO re-run (queued)
5. ✅ **Phase 6b-6c**: upload + push to HF (done)
6. ⬜ **Phase 3d**: fix threshold in server.py (skipped per user; revisit after push)
7. ⬜ **Phase 5**: broader taxon coverage
8. ⬜ **Phase 7**: generalization improvement if needed

New scripts:
- `scripts/eval_checkpoint.py` — test eval any .pth on insect test set
- `scripts/run_full_pipeline.sh` — chains all steps 1-5 above

---

## Directory

```
Research Files/
├── artifacts/
│   ├── checkpoints/        ← model .pth files
│   │   ├── ablation_A_ESM_only.pth    ✅ best ablation
│   │   ├── ablation_B_ESM_seq.pth     ✅
│   │   ├── ablation_C_ESM_seq_AF.pth  ✅ (AF hurts transfer)
│   │   └── improved_res.pth           baseline
│   ├── graph_hpo/          ← joint insect+mammal HPO pipeline
│   │   ├── graph_hpo_best.pth         ✅ val_fmax=0.9540 (BEST)
│   │   ├── graph_hpo_best_thresholds.json
│   │   ├── graph_hpo_best_log.json
│   │   ├── hpo_results.json           ← only 2 trials run
│   │   └── methodology.json
│   ├── logs/               ← training JSON logs
│   ├── thresholds/         ← per-label threshold JSON files
│   ├── splits/             ← train/val/test index splits
│   ├── generalization/     ← taxon embeddings + eval results
│   │   ├── mammal_embeddings_v3.parquet  (7 proteins only ⚠️)
│   │   ├── mammal_full_v1.parquet        (used in HPO training)
│   │   └── generalization_results.json
│   ├── threshold_comparison_results.json  ← use C (novelty-gated)
│   └── hpo_test.json       ← old 2-trial HPO result
├── scripts/
│   ├── train_v3_fixed.py
│   ├── hpo.py              ← graph-aware joint HPO
│   ├── graph_hpo_sequence.py  ← runs full pipeline
│   ├── eval_generalization.py
│   ├── prep_taxon.py
│   ├── build_mammal_dataset.py
│   ├── threshold_comparison.py
│   └── archive/
├── Important Files/        ← mlb, parquets, fastas
└── CHECKLIST.md
```