---
license: mit
---

# Grocery Self-Checkout Item Detection

By: Daniel Bagcal

## 1. Model Description

### Context

This model is a YOLOv11 object detection model fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead/top-down perspective, mimicking the view of a camera mounted above a self-checkout station. The intended use case is store-specific automatic item detection, where the model assists with item counting, checkout verification, and loss/theft prevention. The model is best suited for stores whose inventory closely matches the training data, as performance will degrade on unseen brands or product types not represented in the training data.

## 2. Training Data

The training dataset is a subset of the **RPC-Dataset** ([rpc-dataset.github.io](https://rpc-dataset.github.io/)), a large-scale retail product checkout dataset consisting of 83,699 images across 200 grocery product classes. The working dataset is a subset of this, consisting of 9,616 images across the same 200 classes, sourced via Roboflow ([universe.roboflow.com/groceries-jxjfd/grocery-goods](https://universe.roboflow.com/groceries-jxjfd/grocery-goods)).

### Annotation Process

The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., `100_milk`, `101_milk`, `102_milk`). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than by specific SKU. For example, all milk classes were merged into a single `milk` class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed.
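The class-collapsing step described above can be sketched as a small relabeling function. This is a minimal sketch, assuming the SKU-level labels follow the `<number>_<category>` pattern shown in the examples (`100_milk`, `101_milk`); the exact label strings in the source dataset may differ.

```python
# Sketch of collapsing SKU-level labels into broad categories,
# assuming labels look like "100_milk" (numeric prefix + category).
def collapse_label(sku_label: str) -> str:
    """Map a SKU-level label like '101_milk' to its broad category 'milk'."""
    prefix, _, category = sku_label.partition("_")
    if not prefix.isdigit() or not category:
        raise ValueError(f"unexpected label format: {sku_label!r}")
    return category

print(collapse_label("100_milk"))            # -> milk
print(collapse_label("57_instant_noodles"))  # -> instant_noodles
```

Applying this mapping to every annotation, then deduplicating the resulting class list, yields the 17-category label set used for training.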
This is the final dataset used for training, after the annotation process ([universe.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb](https://universe.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb)).

### Class Distribution

| Class Name | Total Count | Training Count | Validation Count | Test Count |
| ----------------- | ----------- | -------------- | ---------------- | ---------- |
| tissue | 4,813 | 3,369 | 963 | 481 |
| dessert | 4,372 | 3,060 | 874 | 437 |
| drink | 3,760 | 2,632 | 752 | 376 |
| seasoner | 3,199 | 2,239 | 640 | 320 |
| puffed\_food | 3,156 | 2,209 | 631 | 316 |
| chocolate | 3,146 | 2,202 | 629 | 315 |
| instant\_noodles | 3,033 | 2,123 | 607 | 303 |
| canned\_food | 2,714 | 1,900 | 543 | 271 |
| milk | 2,517 | 1,762 | 503 | 252 |
| candy | 2,499 | 1,749 | 500 | 250 |
| personal\_hygiene | 2,495 | 1,747 | 499 | 250 |
| instant\_drink | 2,492 | 1,744 | 498 | 249 |
| alcohol | 2,381 | 1,667 | 476 | 238 |
| dried\_fruit | 2,368 | 1,658 | 474 | 237 |
| dried\_food | 2,222 | 1,555 | 444 | 222 |
| gum | 1,923 | 1,346 | 385 | 192 |
| stationery | 1,466 | 1,026 | 293 | 147 |

### Train/Validation/Test Split

| Split | Ratio | Count |
| ---------- | ----- | ------ |
| Train | 70% | 36,928 |
| Validation | 20% | 10,505 |
| Test | 10% | 5,276 |

### Data Augmentation

The following augmentations were applied during training to simulate real-world checkout conditions:

| Augmentation | Purpose |
| ---------------------------------- | --------------------------------------- |
| Rotation | Items placed on belt in any orientation |
| Horizontal/Vertical Flip | Additional orientation variation |
| Mosaic | Multiple items on belt simultaneously |
| HSV Shift (hue, saturation, value) | Simulate varied store lighting |
| Translation & Scale | Camera height and position variation |

### Known Biases and Limitations

- Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
- Fresh and unpackaged produce (such as fruits or vegetables) is not represented in the dataset
- Limited lighting variation: real checkout environments may have inconsistent lighting not well represented in the training images

## 3. Training Procedure

- **Framework**: Ultralytics YOLOv11n
- **Hardware**: A100 GPU in Google Colab
- **Epochs**: 50
- **Batch Size**: 64
- **Image Size**: 640x640
- **Patience**: 50
- **Training Time**: ~36.5 minutes (2,189.69 seconds)
- **Preprocessing**: Augmentations applied at training time (see Data Augmentation section)

## 4. Evaluation Results

### Comprehensive Metrics

All files output from the `runs\detect\train` folder are provided in the files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. The model demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.

| Metric | Value |
| --------- | ----- |
| Precision | 0.989 |
| Recall | 0.985 |
| mAP50 | 0.992 |
| mAP50-95 | 0.862 |

### Per-Class Breakdown

| Class | Test Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
| ----------------- | ----------- | --------- | --------- | --------- | --------- | --------- |
| **all** | **1,928** | **9,825** | **0.989** | **0.985** | **0.992** | **0.862** |
| alcohol | 252 | 503 | 0.996 | 0.986 | 0.995 | 0.864 |
| candy | 257 | 502 | 0.988 | 0.980 | 0.990 | 0.815 |
| canned\_food | 252 | 545 | 0.982 | 0.996 | 0.990 | 0.877 |
| chocolate | 360 | 700 | 0.982 | 0.984 | 0.993 | 0.833 |
| dessert | 389 | 819 | 0.995 | 0.991 | 0.995 | 0.881 |
| dried\_food | 244 | 415 | 0.982 | 0.993 | 0.995 | 0.877 |
| dried\_fruit | 263 | 516 | 0.986 | 0.986 | 0.995 | 0.887 |
| drink | 360 | 796 | 0.982 | 0.990 | 0.994 | 0.871 |
| gum | 183 | 360 | 0.989 | 0.979 | 0.991 | 0.812 |
| instant\_drink | 271 | 554 | 0.984 | 0.982 | 0.994 | 0.886 |
| instant\_noodles | 302 | 614 | 0.989 | 0.997 | 0.995 | 0.888 |
| milk | 256 | 491 | 0.996 | 0.990 | 0.994 | 0.859 |
| personal\_hygiene | 255 | 506 | 0.990 | 0.982 | 0.994 | 0.854 |
| puffed\_food | 324 | 654 | 0.996 | 1.000 | 0.995 | 0.907 |
| seasoner | 302 | 572 | 0.986 | 0.965 | 0.993 | 0.849 |
| stationery | 162 | 300 | 0.986 | 0.957 | 0.972 | 0.785 |
| tissue | 482 | 978 | 0.999 | 0.994 | 0.995 | 0.909 |

### Visual Examples of Classes

![Class Examples](class_collage.png)

The grid above shows representative examples from the training dataset, organized alphabetically by class (left to right, top to bottom, following the order shown in the per-class breakdown), with the final three images showing multi-item/multi-class detection examples. Because many original product classes were merged into each category, a single class contains visually diverse products.

| Position | Class | Features |
| -------------- | ----------------- | ---------------------------------------------------------- |
| Row 1, Col 1 | alcohol | Glass bottles, aluminum cans/beer bottles |
| Row 1, Col 2 | candy | Small packaging, often cylindrical or box-shaped |
| Row 1, Col 3 | canned\_food | Cylindrical canned foods |
| Row 1, Col 4 | chocolate | Flat packaging, items like Snickers bars |
| Row 2, Col 1 | dessert | Cup, boxed, or flat packaging; varies widely |
| Row 2, Col 2 | dried\_food | Flat sealed bags, often with food photography on packaging |
| Row 2, Col 3 | dried\_fruit | Flat sealed bags, clear bags, colored packaging |
| Row 2, Col 4 | drink | Plastic bottles such as sodas, aluminum cans |
| Row 3, Col 1 | gum | Small box or pouch packaging |
| Row 3, Col 2 | instant\_drink | Varies widely: small cylinders, boxes, sealed packs |
| Row 3, Col 3 | instant\_noodles | Instant ramen packs or cup-noodle packs |
| Row 3, Col 4 | milk | Small milk cartons, slim box packaging, bottled packs |
| Row 4, Col 1 | personal\_hygiene | Items like toothbrushes, mouthwash, toothpaste |
| Row 4, Col 2 | puffed\_food | Inflated bags such as Cheetos and other chip bags |
| Row 4, Col 3 | seasoner | Items vary from soy sauce to small seasoning packets |
| Row 4, Col 4 | stationery | Items such as notebooks, paper, pencils, etc. |
| Row 5, Col 1 | tissue | Small rectangular box packaging with soft branding |
| Row 5, Col 2-4 | multi-class | Multiple items detected simultaneously in a single scene |

### Key Visualizations

#### Confusion Matrix

![Confusion Matrix](confusion_matrix_normalized.png)

#### F1 Confidence Curve

![BoxF1 Curve](BoxF1_curve.png)

#### Training & Validation Loss Curves

![Results](results.png)

### Performance Analysis

The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being **stationery** at 0.972. The strongest-performing classes were **tissue** and **puffed_food** (mAP50-95: 0.909 and 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest-performing class was **stationery** (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,466 training images, suggesting performance is partially limited by sample size.

## 5. Limitations and Biases

### D2S Wild Image Test Sample (Failure Case)

When tested on the [**D2S Dataset**](https://www.mvtec.com/company/research/datasets/mvtec-d2s), the model struggled significantly with unseen products and environments. The image below shows a representative failure case:

![D2S Sample](D2Ssampletest1.png)

In this test image, the model:

- Completely missed the avocado (no detection)
- Missed the tea box entirely, along with the drink beneath it
- Misclassified a water bottle as `instant_noodles` (0.63 confidence)
- Produced a low-confidence `dried_fruit` detection (0.45) on an incorrect region

This suggests the model learned to recognize specific packaging patterns from its training data rather than generalizing to grocery items as a broader category. The model should be treated as store-specific, limited to inventory resembling this training data.

### Poor Performing Classes

| Class | mAP50 | mAP50-95 | Likely Reason |
| ---------- | ----- | -------- | -------------------------------------------------- |
| stationery | 0.972 | 0.785 | Smallest class (1,466 images) |
| chocolate | 0.993 | 0.833 | Similar packaging causes 0.13 background confusion |

### Data Biases

- **Geographic bias:** Dataset is predominantly composed of Chinese grocery product packaging. The model does not generalize to Western or European retail environments
- **Product bias:** Heavily skewed toward packaged and processed goods; fresh produce and unpackaged item classes are entirely absent
- **Environmental bias:** Images were collected in controlled photography settings and do not fully represent real store lighting, shadows, or partially covered items
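Failure cases like the 0.45-confidence `dried_fruit` detection above suggest gating predictions by confidence and routing uncertain ones to human review instead of acting on them automatically. Below is a minimal sketch of such a gate; the 0.70 threshold and the `(class_name, confidence)` tuple format are illustrative assumptions, not part of the model.

```python
# Minimal confidence-based review gate. Assumes detections arrive as
# (class_name, confidence) pairs; the threshold value is illustrative.
from typing import List, Tuple

REVIEW_THRESHOLD = 0.70  # below this, flag for human review (assumed value)

def split_by_confidence(
    detections: List[Tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Partition detections into (auto-accepted, needs-review) lists."""
    accepted = [d for d in detections if d[1] >= threshold]
    review = [d for d in detections if d[1] < threshold]
    return accepted, review

# Detections resembling the D2S failure case above:
accepted, review = split_by_confidence(
    [("instant_noodles", 0.63), ("dried_fruit", 0.45), ("tissue", 0.92)]
)
```

In practice the threshold would be tuned against the F1-confidence curve shown above rather than fixed a priori.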
### Environmental and Contextual Limitations

- Performance degrades significantly on items not present in the training data, as seen with the D2S Dataset
- Overlapping or partially occluded items under a self-checkout camera may cause missed or incorrect detections
- The model was designed for an overhead/top-down perspective, so different angles or views could degrade performance

### Inappropriate Use Cases

This model:

- Should **NOT** be deployed in stores with inventory significantly different from the training data without retraining; separate models trained on matching data should be used for different inventories
- Should **NOT** be used as a standalone loss-prevention or security system
- Should **NOT** be used to detect fresh produce, unpackaged items, or non-grocery products
- Should **NOT** be used in applications where misclassification has serious consequences

### Ethical Considerations

- Overhead camera systems at self-checkout may raise **customer privacy concerns** depending on how image/video data is stored and used
- The model should not be used to make automated decisions that negatively impact customers without human review, as misclassifications may affect customers purchasing unfamiliar or international products not well represented in the training data

### Sample Size Limitations

- **Stationery** (1,466 images) is the smallest class and shows the weakest overall performance (albeit still strong); additional training data would likely improve results
- Fresh produce is entirely unrepresented, so the model has no capability to detect items like fruits, vegetables, or deli products
- The model would likely improve significantly if trained on the full 83,699-image dataset
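To illustrate the checkout-verification use case from Section 1 under these constraints, one workable pattern is to compare per-class detected counts against the shopper's scanned items and surface any mismatch to a human attendant, consistent with the human-review guidance above. This is a hypothetical sketch; the function and data shapes are illustrative, not part of the released model.

```python
# Hypothetical checkout-verification check: compare detected item counts
# against scanned items and report disagreements for attendant review.
from collections import Counter

def count_mismatches(detected: list, scanned: list) -> dict:
    """Return {class: detected_minus_scanned} for classes that disagree."""
    diff = Counter(detected)
    diff.subtract(Counter(scanned))
    return {cls: n for cls, n in diff.items() if n != 0}

mismatch = count_mismatches(
    detected=["drink", "drink", "milk"],
    scanned=["drink", "milk"],
)
# A positive value means an item was detected but not scanned.
print(mismatch)  # -> {'drink': 1}
```

Given the misclassification risks documented above, such a mismatch should trigger review, never an automatic charge or theft accusation.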