GraphRecSys
Production-style recommendation system that combines graph retrieval, causal debiasing, multi-objective ranking, calibrated probabilities, vector search, and low-latency serving.
This project is designed as an end-to-end recommender systems portfolio piece: it starts from raw KuaiRec interaction logs, trains a debiased LightGCN retrieval model, indexes item embeddings with FAISS, ranks candidates with an MMoE multi-task model, calibrates click probabilities, and serves personalized recommendations through FastAPI with Redis-backed embedding caching.
Why This Project Exists
Most recommender demos stop at model training. Real recommendation systems are pipelines: data quality, retrieval, ranking, calibration, serving latency, offline evaluation, and product trade-offs all matter at the same time.
GraphRec-MultiOpt demonstrates those production concerns in one coherent system:
- Retrieval: graph collaborative filtering with LightGCN.
- Debiasing: inverse propensity weighting to reduce exposure bias.
- Ranking: multi-task MMoE for click probability and expected value.
- Calibration: Platt scaling and reliability diagrams for trustworthy probabilities.
- Serving: FastAPI endpoint with FAISS candidate retrieval, Redis cache, scalarization, and diversity reranking.
- Decision support: mock A/B simulation and Pareto frontier analysis for engagement vs. value trade-offs.
System Architecture
flowchart LR
raw["KuaiRec raw logs"] --> loader["Schema validation + labels"]
loader --> split["Temporal train/val/test split"]
split --> graph_data["PyG bipartite graph"]
split --> propensity["Item propensity estimates"]
graph_data --> lightgcn["LightGCN retrieval"]
propensity --> lightgcn
lightgcn --> embeddings["User/item embeddings"]
embeddings --> faiss["FAISS IVF-PQ index"]
embeddings --> features["Ranking feature builder"]
split --> features
features --> mmoe["MMoE ranker"]
mmoe --> calibration["Platt calibration"]
faiss --> api["FastAPI /recommend"]
mmoe --> api
calibration --> api
redis["Redis embedding cache"] --> api
api --> response["Top-10 recommendations"]
mmoe --> ab["Mock A/B simulation"]
ab --> pareto["Pareto frontier"]
Technical Highlights
| Area | Implementation |
|---|---|
| Dataset | KuaiRec dense multi-action logs |
| Retrieval | LightGCN with 3 graph propagation layers |
| Retrieval loss | BPR with optional inverse propensity weighting |
| Negative sampling | Uniform sampler with API reserved for hard negatives |
| Vector search | FAISS IVF-PQ, configurable nprobe |
| Ranking model | Multi-gate Mixture-of-Experts with click and value towers |
| Ranking targets | label_click = watch_ratio >= 0.5, label_value = log1p(watch_ratio) |
| Calibration | Platt scaling on validation logits |
| Diversity | Maximal Marginal Relevance reranking |
| Serving | Async FastAPI app with latency breakdown |
| Cache | Redis user embedding cache with TTL |
| Evaluation | Recall@K, NDCG@K, AUC, MSE/RMSE, ECE, latency, Pareto sweep |
| Tracking | MLflow metrics and artifacts |
Repository Layout
.
βββ data/
β βββ download.py
β βββ raw/
β βββ processed/
βββ src/
β βββ data/ # loading, splitting, graph construction, propensity
β βββ retrieval/ # LightGCN, BPR, negative sampling, retrieval eval
β βββ indexing/ # FAISS index build/query/benchmark
β βββ ranking/ # feature builder, MMoE, calibration, ranking eval
β βββ serving/ # FastAPI, Redis cache, schemas, scoring
β βββ evaluation/ # A/B simulation, Pareto frontier, results report
βββ configs/
βββ tests/
βββ scripts/
βββ outputs/
βββ checkpoints/
βββ Dockerfile
βββ implementation_plan.md
βββ recsys_architecture.md
Modeling Approach
1. Data And Labels
The data layer validates KuaiRec interaction logs, derives model targets, and creates train/validation/test splits.
label_click = (watch_ratio >= 0.5).astype(int)
label_value = np.log1p(watch_ratio)
The graph builder creates a PyTorch Geometric HeteroData bipartite graph:
- Node types:
user,item - Edge type:
("user", "interacts", "item") - Reverse edge type for message passing
- Edge weights from clipped watch ratio
2. Debiased Retrieval
The retrieval stage trains LightGCN using Bayesian Personalized Ranking:
loss = -mean(IPS(item) * log sigmoid(score(user, positive) - score(user, negative)))
The IPS term upweights less frequently exposed items, reducing the tendency of the retrieval model to overfit historical exposure patterns.
3. Multi-Objective Ranking
The ranking model uses MMoE to optimize two related objectives:
- pClick tower: calibrated probability that the user meaningfully engages.
- E-value tower: expected value proxy based on watch ratio.
Ranking features combine:
- user embedding
- item embedding
- time/session context
- item duration
- category representation
Total feature dimension: 1046.
4. Serving-Time Optimization
The serving endpoint follows the same shape used by production recommendation stacks:
- Fetch user embedding from Redis or local embedding table.
- Retrieve top-K candidates from FAISS.
- Build ranking features for candidates.
- Score candidates with MMoE.
- Apply Platt calibration.
- Scalarize engagement and value.
- Apply MMR diversity reranking.
- Return top-10 items with latency breakdown.
Quickstart
Install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run The Pipeline
bash scripts/run_pipeline.sh
The pipeline follows the architecture sequence:
download -> preprocess -> graph -> propensity -> LightGCN -> FAISS -> ranking -> calibration -> evaluation -> serving
For raw data without timestamps, the split script can use a deterministic fallback:
python -m src.data.splits --allow_no_timestamp
Run FAISS Benchmark
bash scripts/run_benchmark.sh
Benchmark output is written to:
outputs/faiss_benchmark.csv
Serving API
Start the service:
uvicorn src.serving.app:app --host 0.0.0.0 --port 8000
Health check:
curl http://localhost:8000/health
Recommendation request:
curl http://localhost:8000/recommend/0
Example response shape:
{
"user_id": 0,
"items": [
{
"item_id": 123,
"p_click": 0.71,
"e_value": 1.42,
"final_score": 0.82
}
],
"retrieval_latency_ms": 6.4,
"ranking_latency_ms": 14.8,
"total_latency_ms": 23.1,
"cache_hit": true
}
Prometheus-compatible metrics:
curl http://localhost:8000/metrics
Reload model artifacts:
curl -X POST http://localhost:8000/reload
Evaluation
The project evaluates recommender quality at multiple layers.
| Layer | Metrics |
|---|---|
| Retrieval | Recall@10, Recall@20, Recall@50, Recall@500, NDCG@10 |
| Ranking | ROC-AUC, MSE, RMSE |
| Calibration | ECE before/after Platt scaling, reliability curve |
| Serving | p50, p95, p99 latency |
| Product trade-off | Simulated CTR, GMV proxy, diversity, Pareto frontier |
Generate the final results table:
python -m src.evaluation.report
Outputs:
outputs/results_table.csv
outputs/results_table.md
outputs/calibration_curve.png
outputs/pareto_curve.png
Results
Metrics are generated after running the full pipeline. This table is intentionally artifact-driven so reported numbers come from reproducible runs rather than hand-edited README values.
| Metric | LightGCN + IPS | MMoE single-task | MMoE multi-task |
|---|---|---|---|
| Recall@500 | 0.0011 | - | - |
| NDCG@10 | 0.0443 | - | - |
| AUC (pClick) | - | 0.8319 | 0.8223 |
| ECE (after cal.) | - | - | 0.0677 |
| MSE (E-value) | - | 0.1172 | 0.0787 |
| p50 latency ms | 0.04 | - | - |
| p99 latency ms | 0.13 | - | - |
Configuration
The system is config-driven:
configs/retrieval.yamlconfigs/ranking.yamlconfigs/serving.yaml
Examples:
model:
emb_dim: 512
num_layers: 3
training:
lr: 1.0e-3
batch_size: 4096
epochs: 100
ips:
clip_max: 10.0
Serving trade-offs can be tuned without changing model code:
scoring:
w_engagement: 0.6
w_revenue: 0.4
lambda_diversity: 0.3
top_n_serve: 10
Docker
Build:
docker build -t graphrec-multiopt .
Run:
docker run -p 8000:8000 graphrec-multiopt
For real experiments, mount model artifacts and processed data as volumes:
docker run \
-p 8000:8000 \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/checkpoints:/app/checkpoints" \
graphrec-multiopt
Engineering Notes
This repository is structured to show senior-level recommender systems judgment:
- Separates retrieval and ranking instead of forcing one model to do both.
- Includes causal debiasing through IPS rather than optimizing only observed engagement.
- Treats probability calibration as a first-class serving concern.
- Uses vector search and caching to reflect real serving constraints.
- Adds diversity reranking to avoid purely exploitative recommendations.
- Exposes business-level trade-offs through scalarization and Pareto analysis.
- Keeps training, serving, and evaluation configuration outside model code.
Known Limitations
- KuaiRec timestamp availability varies by source file; the splitter supports temporal mode when timestamps are present and an explicit deterministic fallback otherwise.
- The current hard-negative sampling interface is reserved, while uniform negative sampling is implemented.
- Full reported metrics require running the pipeline on the downloaded dataset.
- Redis is optional for local development but recommended for serving realism.
- FAISS IVF-PQ configuration may need scaling down for tiny smoke-test datasets.
Roadmap
- Add hard negative sampling from FAISS retrieval misses.
- Add popularity and matrix-factorization baselines.
- Add online feature store abstraction for serving-time context.
- Add load tests for concurrent recommendation traffic.
- Add Docker Compose for API + Redis + MLflow.
- Add CI workflow for unit tests, linting, and smoke-mode pipeline execution.
Resume Summary
Built an end-to-end production-style recommendation system using PyTorch, PyTorch Geometric, FAISS, Redis, FastAPI, and MLflow. Implemented LightGCN retrieval with IPS debiasing, MMoE multi-task ranking, Platt calibration, MMR diversity reranking, vector-search serving, offline A/B simulation, and Pareto frontier analysis for engagement/value trade-off optimization.