---
license: mit
---
# Grocery Self Checkout Item Detection
## Model Description
### Context
This model is a YOLOv11 object detection model fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead, top-down camera perspective, mimicking the view of a mounted self-checkout camera. The intended use case is store-specific automatic item detection, where the model assists with item counting, checkout verification, and loss/theft prevention. The model is best suited for stores whose inventory closely matches the training data, as performance will degrade on unseen brands or product types not represented in the training data.
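As a hedged illustration of the item-counting use case, here is a minimal sketch that tallies a basket from a list of (class, confidence) detections. The detections and the 0.5 confidence cutoff below are made up for illustration, not output from this model:

```python
from collections import Counter

CONF_THRESHOLD = 0.5  # hypothetical cutoff; tune per deployment

def count_items(detections, threshold=CONF_THRESHOLD):
    """Tally detected items by class, ignoring low-confidence boxes."""
    kept = [cls for cls, conf in detections if conf >= threshold]
    return Counter(kept)

# Illustrative detections (class name, confidence), e.g. parsed from YOLO output
detections = [
    ("milk", 0.93),
    ("milk", 0.88),
    ("drink", 0.91),
    ("gum", 0.31),  # below threshold: dropped
]
print(count_items(detections))  # Counter({'milk': 2, 'drink': 1})
```

In a real deployment, the tally would be compared against the items the customer scanned to flag discrepancies.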
## Training Data
The training dataset is a subset of the RPC-Dataset (rpc-dataset.github.io), a large-scale retail product checkout dataset of 83,699 images across 200 grocery product classes. The working subset consists of 9,616 images across the same 200 classes, sourced via Roboflow (universe.roboflow.com/groceries-jxjfd/grocery-goods).
### Annotation Process
The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., 100_milk, 101_milk, 102_milk). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than by specific SKU. For example, all milk classes were merged into a single milk class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed. The final dataset used for training, after the annotation process, is available at https://app.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb.
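The collapsing step can be sketched as a simple label remap. The `<id>_<category>` label pattern is assumed from the examples above (e.g., 100_milk); the actual relabeling was done in Roboflow:

```python
def collapse_class(rpc_label: str) -> str:
    """Map a product-specific RPC label (e.g. '101_milk') to its broad category.

    Assumes labels follow the '<id>_<category>' pattern; labels without a
    numeric prefix are returned unchanged.
    """
    prefix, _, category = rpc_label.partition("_")
    return category if prefix.isdigit() else rpc_label

print(collapse_class("100_milk"))       # milk
print(collapse_class("101_milk"))       # milk
print(collapse_class("3_puffed_food"))  # puffed_food
```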
### Class Distribution
| Class Name | Total Count | Training Count | Validation Count | Test Count |
|---|---|---|---|---|
| tissue | 4,813 | 3,369 | 963 | 481 |
| dessert | 4,372 | 3,060 | 874 | 437 |
| drink | 3,760 | 2,632 | 752 | 376 |
| seasoner | 3,199 | 2,239 | 640 | 320 |
| puffed_food | 3,156 | 2,209 | 631 | 316 |
| chocolate | 3,146 | 2,202 | 629 | 315 |
| instant_noodles | 3,033 | 2,123 | 607 | 303 |
| canned_food | 2,714 | 1,900 | 543 | 271 |
| milk | 2,517 | 1,762 | 503 | 252 |
| candy | 2,499 | 1,749 | 500 | 250 |
| personal_hygiene | 2,495 | 1,747 | 499 | 250 |
| instant_drink | 2,492 | 1,744 | 498 | 249 |
| alcohol | 2,381 | 1,667 | 476 | 238 |
| dried_fruit | 2,368 | 1,658 | 474 | 237 |
| dried_food | 2,222 | 1,555 | 444 | 222 |
| gum | 1,923 | 1,346 | 385 | 192 |
| stationery | 1,466 | 1,026 | 293 | 147 |
### Train/Validation/Test Split
| Split | Ratio | Count |
|---|---|---|
| Train | 70% | 36,928 |
| Validation | 20% | 10,505 |
| Test | 10% | 5,276 |
### Data Augmentation
The following augmentations were applied during training to simulate real-world checkout conditions:
| Augmentation | Purpose |
|---|---|
| Rotation | Items placed on belt in any orientation |
| Horizontal/Vertical Flip | Additional orientation variation |
| Mosaic | Multiple items on belt simultaneously |
| HSV Shift (hue, saturation, value) | Simulate varied store lighting |
| Translation & Scale | Camera height and position variation |
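These augmentations correspond to Ultralytics training hyperparameters. The exact values used in this run are not recorded here, so the following is a sketch with assumed values near the Ultralytics defaults:

```yaml
# Assumed augmentation hyperparameters (Ultralytics names; values illustrative)
degrees: 90.0      # rotation: items land on the belt in any orientation
flipud: 0.5        # vertical flip probability
fliplr: 0.5        # horizontal flip probability
mosaic: 1.0        # mosaic: several items in view at once
hsv_h: 0.015       # hue shift to simulate varied store lighting
hsv_s: 0.7         # saturation shift
hsv_v: 0.4         # value/brightness shift
translate: 0.1     # translation: camera position variation
scale: 0.5         # scale: camera height variation
```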
## Known Biases and Limitations
- Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
- Fresh and unpackaged produce (such as fruits or vegetables) are not represented in the dataset
- Limited lighting variation — real checkout environments may have inconsistent lighting not well represented in training images
## Training Procedure
- Framework: Ultralytics YOLOv11n
- Hardware: A100 GPU in Google Colab
- Epochs: 50
- Batch Size: 64
- Image Size: 640x640
- Patience: 50
- Training Time: ~36.5 minutes (2,189.69 seconds)
- Preprocessing: Augmentations applied at training time (see Data Augmentation section)
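A training invocation matching the settings above might look like the following. This is a sketch, not the exact command used; `grocery-goods.yaml` is a placeholder for the dataset config path:

```shell
# Fine-tune COCO-pretrained YOLO11n with the settings listed above.
# grocery-goods.yaml is a placeholder for the dataset config path.
yolo detect train model=yolo11n.pt data=grocery-goods.yaml \
  epochs=50 batch=64 imgsz=640 patience=50 device=0
```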
## Evaluation Results
### Comprehensive Metrics
All files output to the `runs\detect\train` folder are provided in the Files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. The model demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.
| Metric | Value |
|---|---|
| Precision | 0.989 |
| Recall | 0.985 |
| mAP50 | 0.992 |
| mAP50-95 | 0.862 |
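As a reminder of what the headline numbers mean, here is a small sketch computing precision and recall from detection counts. The TP/FP/FN counts are not reported by the training run; the values below are back-calculated illustrations that reproduce the headline numbers over the 9,825 test instances:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 9,678 correct detections, 108 false alarms, 147 misses
p, r = precision_recall(tp=9678, fp=108, fn=147)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.989 recall=0.985
```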
### Per-Class Breakdown
| Class | Test Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|
| all | 1,928 | 9,825 | 0.989 | 0.985 | 0.992 | 0.862 |
| alcohol | 252 | 503 | 0.996 | 0.986 | 0.995 | 0.864 |
| candy | 257 | 502 | 0.988 | 0.980 | 0.990 | 0.815 |
| canned_food | 252 | 545 | 0.982 | 0.996 | 0.990 | 0.877 |
| chocolate | 360 | 700 | 0.982 | 0.984 | 0.993 | 0.833 |
| dessert | 389 | 819 | 0.995 | 0.991 | 0.995 | 0.881 |
| dried_food | 244 | 415 | 0.982 | 0.993 | 0.995 | 0.877 |
| dried_fruit | 263 | 516 | 0.986 | 0.986 | 0.995 | 0.887 |
| drink | 360 | 796 | 0.982 | 0.990 | 0.994 | 0.871 |
| gum | 183 | 360 | 0.989 | 0.979 | 0.991 | 0.812 |
| instant_drink | 271 | 554 | 0.984 | 0.982 | 0.994 | 0.886 |
| instant_noodles | 302 | 614 | 0.989 | 0.997 | 0.995 | 0.888 |
| milk | 256 | 491 | 0.996 | 0.990 | 0.994 | 0.859 |
| personal_hygiene | 255 | 506 | 0.990 | 0.982 | 0.994 | 0.854 |
| puffed_food | 324 | 654 | 0.996 | 1.000 | 0.995 | 0.907 |
| seasoner | 302 | 572 | 0.986 | 0.965 | 0.993 | 0.849 |
| stationery | 162 | 300 | 0.986 | 0.957 | 0.972 | 0.785 |
| tissue | 482 | 978 | 0.999 | 0.994 | 0.995 | 0.909 |
### Key Visualizations
- Confusion Matrix
- F1 Confidence Curve
- Training & Validation Loss Curves
### Performance Analysis
The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being stationery at 0.972. The strongest-performing classes were tissue and puffed_food (mAP50-95: 0.909 and 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest-performing class was stationery (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,026 training instances (1,466 total), suggesting performance is partially limited by sample size.
## Limitations and Biases
When tested on the external D2S dataset (in-the-wild images), performance dropped significantly: the model missed entire objects, produced low-confidence detections, and misclassified items. For example, it labeled a water bottle as instant_noodles. This suggests the model may have overfit to the specific visual patterns of the training data, or it may reflect a domain gap between the Asian grocery packaging in the training data and the European products in D2S. Both explanations are plausible, and further testing on diverse datasets would be needed to distinguish between them.


