---
language: en
license: mit
tags:
- recommendation
- ranking
- personalization
- xgboost
- xgbranker
- recipe
- cold-start
datasets:
- your-username/recipe-cleaned-dataset
model-index:
- name: Personalized Recipe Ranking Models
  results:
  - task:
      type: recommendation
      name: Personalized Recipe Ranking
    dataset:
      name: Food.com (Cleaned)
      type: your-username/recipe-cleaned-dataset
    metrics:
    - type: ndcg@5
      value: 0.44
    - type: ndcg@10
      value: 0.44
---

# Model Card: Personalized Recipe Ranking Models

## Overview

This project implements a personalized recipe recommendation system using two model categories:

1. **Scratch-trained baseline**: a simple rule-based + embedding-matching ranker trained on a synthetic preference dataset (no user-specific rules).
2. **Rule-enhanced cold-start models**: five separate XGBRanker models trained with more complex rule-based preference signals and user-specific interaction patterns (user1–user5).

The goal is to evaluate how different user profiles affect ranking behavior and recommendation diversity, even when overall NDCG scores are lower than the baseline's.

---

## Model Category 1: Scratch-trained Baseline

### Purpose
Provide a simple cold-start recommendation baseline that matches ingredients and ranks recipes without personalization. It uses parent–child ingredient overlap and a few numeric features (e.g., protein, cost, cooking time).

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 10,000 synthetic preference samples generated via uniform random selection

### Training Details
- Model type: **XGBRanker** (`objective='rank:pairwise'`)
- Features: ~1,000 numeric ingredient-parent ratio features plus basic nutrition/time features
- Train/test split: 80/20 (by recipe ID)
- Evaluation metrics: NDCG@5 and NDCG@10

### Evaluation
The baseline achieves **very high NDCG scores (95%+)** because training and evaluation rely on synthetic signals that align perfectly with the ranking structure.

### Intended Use
Serve as a **sanity check** and upper bound for ranking performance, not for deployment.

### Limitations
- Unrealistically clean preference structure
- No user differentiation
- Inflated metrics due to synthetic evaluation

---

## Model Category 2: Rule-enhanced Cold-start Models (User1–User5)

### Purpose
Capture user-specific dietary preferences and ranking heuristics using richer rule sets, leading to more diverse recommendation patterns across different users.

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 5,000 cold-start synthetic interactions per user profile
- Additional unselected (negative) samples included to simulate realistic cold-start scenarios

### Model
- Model type: **XGBRanker** (scratch-trained)
- Training objective: `rank:pairwise`
- Feature space:
  - Ingredient-parent coverage ratios (~1,000 parent nodes)
  - Nutrition features: protein, calories, cost, cooking time
  - User preference weights: protein/time/cost
  - Dietary tag filters and exclusion rules
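
The ingredient-parent coverage ratios can be illustrated with a small sketch: for each parent node, the feature is the fraction of that parent's child ingredients present in the recipe. The parent→children mapping and the recipe below are hypothetical examples, not the actual ~1,000-node taxonomy.

```python
# Hypothetical parent -> child-ingredient taxonomy (illustrative only).
parent_children = {
    "dairy": {"milk", "butter", "cheese", "yogurt"},
    "poultry": {"chicken", "turkey"},
    "grain": {"rice", "flour", "oats"},
}

def coverage_features(recipe_ingredients):
    """Fraction of each parent's children that appear in the recipe."""
    ingredients = set(recipe_ingredients)
    return {
        parent: len(children & ingredients) / len(children)
        for parent, children in parent_children.items()
    }

features = coverage_features(["milk", "butter", "rice", "chicken"])
# e.g. features["dairy"] == 0.5 (2 of the 4 dairy children are present)
```

In the real feature space each parent node becomes one numeric column, so a recipe is encoded as a fixed-length vector regardless of how many raw ingredients it lists.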

### Training Setup
- Train/valid/test split: 70/15/15 by recipe ID per profile
- No fine-tuning between profiles; each profile is trained independently
- Evaluation metrics: NDCG@5 and NDCG@10

### Evaluation Results

| User Profile | NDCG@5 | NDCG@10 |
|--------------|--------|---------|
| user1 | 0.4400 | 0.4400 |
| user2 | 0.4342 | 0.4342 |
| user3 | 0.4179 | 0.4179 |
| user4 | 0.1651 | 0.1651 |
| user5 | 0.4607 | 0.4607 |

**Note:** User4 has very restrictive dietary preferences, resulting in very few matching recipes and an inherently lower achievable NDCG.
Although these NDCG values are lower than the baseline's, this is expected for several reasons:

- The cold-start datasets contain a large proportion of unselected recipes, leading to sparse positive signals.
- More complex preference rules increase variability and reduce alignment with NDCG's single-label relevance assumptions.
- The models produce more differentiated ranking behavior across user profiles, which aligns with the intended personalization goals.
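
The sparse-positive structure described above can be sketched as follows. The recipe pool size, positive count, and record shape are hypothetical; the point is simply that unselected recipes are kept as explicit negatives rather than discarded.

```python
# Sketch of a cold-start interaction set: a few rule-selected positives
# plus the remaining unselected recipes kept as negatives.
import random

random.seed(0)
recipe_ids = list(range(1000))          # illustrative candidate pool
positives = set(random.sample(recipe_ids, 50))  # recipes matched by the rules

interactions = [
    {"recipe_id": rid, "label": 1 if rid in positives else 0}
    for rid in recipe_ids
]

# Only 5% of samples are positive -> sparse signal for the ranker.
pos_fraction = sum(r["label"] for r in interactions) / len(interactions)
```

Keeping the negatives makes the training distribution closer to serving time, where the ranker must score many irrelevant candidates, at the cost of lower headline NDCG.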

---

## Model Selection Justification

- **XGBRanker** was chosen for all models due to its effectiveness on structured tabular data, fast training time, and compatibility with large feature spaces (1,000+ ingredients).
- The **baseline model** acts as a clean control, providing an upper bound on achievable NDCG under idealized preferences.
- The **rule-enhanced models** trade some raw NDCG performance for greater personalization fidelity, which is critical in multi-user recommendation contexts.

---

## Evaluation Methodology

- Metrics: NDCG@5 and NDCG@10 on held-out cold-start samples
- Each user model is evaluated independently
- Negative samples are retained to approximate real-world recommendation class imbalance
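
One way to compute these metrics is with scikit-learn's `ndcg_score`, shown below on made-up relevance labels and model scores (the numbers are illustrative, not from the actual evaluation).

```python
# NDCG@5 and NDCG@10 for a single query with 10 candidate recipes.
import numpy as np
from sklearn.metrics import ndcg_score

# Ground-truth graded relevance (higher = more relevant) and model scores.
true_relevance = np.array([[3, 2, 0, 0, 1, 0, 2, 0, 0, 0]])
model_scores = np.array([[2.1, 0.4, 0.1, 1.7, 0.3, 0.0, 1.2, 0.2, 0.1, 0.0]])

ndcg5 = ndcg_score(true_relevance, model_scores, k=5)
ndcg10 = ndcg_score(true_relevance, model_scores, k=10)
```

With many zero-relevance negatives per query, as in the cold-start datasets, even a reasonable ranking of the few positives yields modest NDCG values, which is consistent with the 0.41–0.46 range reported above.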

---

## Intended Uses and Limitations

**Intended Uses**
- Multi-profile recipe recommendation
- Studying personalization behavior under sparse feedback
- Cold-start scenarios for new users

**Limitations**
- Synthetic user interactions do not perfectly reflect real-world feedback
- NDCG is not well aligned with multi-rule personalization behavior
- User4 performance is limited by the scarcity of relevant recipes

---

## Risks and Bias

The models are trained on the Food.com dataset, which has known biases:

- **Regional bias**: Western and American cuisines dominate the dataset, leading to potential under-representation of other regions.
- **Popularity bias**: Highly rated or frequently interacted-with recipes are over-represented.
- **Cold-start leakage risk**: Although user interactions are synthetic, overlapping ingredient-parent structures between train and test may create mild information leakage, potentially inflating baseline metrics.

These biases may affect recommendation diversity and fairness across different cuisines or dietary groups.

---

## Cost and Latency

All models are based on **XGBRanker**, which runs efficiently on CPU:

- **Inference latency**: approximately 1–5 ms per recipe (measured on a laptop CPU, single thread).
- **Training cost**: training each user profile model on 5,000 interactions takes less than 2 minutes on CPU.

The approach is designed for real-time personalization in lightweight interfaces (e.g., Hugging Face Spaces).

---

## Usage Disclosure

**Intended Uses**
- Academic and educational research on personalized recommendation
- Cold-start personalization experiments
- Recipe recommendation for diverse dietary profiles

**Not Intended For**
- Medical or dietary decision-making
- Real-world deployment without additional bias mitigation
- High-stakes personalization where fairness across demographic groups is critical

---

## Citation

Tang, Xinxuan. *Personalized Recipe Ranking Models*. 2025.