---
license: mit
---
# Grocery Self Checkout Item Detection
## Model Description
### Context
This model is a YOLOv11 object detection model fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead, top-down camera perspective, mimicking the view of a mounted self-checkout camera. The intended use case is store-specific automatic item detection, where the model assists with item counting, checkout verification, and loss/theft prevention. The model is best suited for stores whose inventory closely matches the training data, as performance will degrade on unseen brands or product types not represented in the training data.
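As a hedged illustration of the item-counting use case, here is a minimal sketch that tallies a basket from a list of (class, confidence) detections. The detections and the 0.5 confidence cutoff below are made up for illustration, not output from this model:

```python
from collections import Counter

CONF_THRESHOLD = 0.5  # hypothetical cutoff; tune per deployment

def count_items(detections, threshold=CONF_THRESHOLD):
    """Tally detected items by class, ignoring low-confidence boxes."""
    kept = [cls for cls, conf in detections if conf >= threshold]
    return Counter(kept)

# Illustrative detections (class name, confidence), e.g. parsed from YOLO output
detections = [
    ("milk", 0.93),
    ("milk", 0.88),
    ("drink", 0.91),
    ("gum", 0.31),  # below threshold: dropped
]
print(count_items(detections))  # Counter({'milk': 2, 'drink': 1})
```

In a real deployment, the tally would be compared against the items the customer scanned to flag discrepancies.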
## Training Data
The training dataset is a subset of the RPC-Dataset (rpc-dataset.github.io), a large-scale retail product checkout dataset of 83,699 images across 200 grocery product classes. The working subset consists of 9,616 images across the same 200 classes, sourced via Roboflow (universe.roboflow.com/groceries-jxjfd/grocery-goods).
### Annotation Process
The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., 100_milk, 101_milk, 102_milk). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than by specific SKU. For example, all milk classes were merged into a single milk class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed. The final dataset used for training, after the annotation process, is available at https://app.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb.
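The collapsing step can be sketched as a simple label remap. The `<id>_<category>` label pattern is assumed from the examples above (e.g., 100_milk); the actual relabeling was done in Roboflow:

```python
def collapse_class(rpc_label: str) -> str:
    """Map a product-specific RPC label (e.g. '101_milk') to its broad category.

    Assumes labels follow the '<id>_<category>' pattern; labels without a
    numeric prefix are returned unchanged.
    """
    prefix, _, category = rpc_label.partition("_")
    return category if prefix.isdigit() else rpc_label

print(collapse_class("100_milk"))       # milk
print(collapse_class("101_milk"))       # milk
print(collapse_class("3_puffed_food"))  # puffed_food
```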
### Class Distribution
| Class Name | Total Count | Training Count | Validation Count | Test Count |
|---|---|---|---|---|
| tissue | 4,813 | 3,369 | 963 | 481 |
| dessert | 4,372 | 3,060 | 874 | 437 |
| drink | 3,760 | 2,632 | 752 | 376 |
| seasoner | 3,199 | 2,239 | 640 | 320 |
| puffed_food | 3,156 | 2,209 | 631 | 316 |
| chocolate | 3,146 | 2,202 | 629 | 315 |
| instant_noodles | 3,033 | 2,123 | 607 | 303 |
| canned_food | 2,714 | 1,900 | 543 | 271 |
| milk | 2,517 | 1,762 | 503 | 252 |
| candy | 2,499 | 1,749 | 500 | 250 |
| personal_hygiene | 2,495 | 1,747 | 499 | 250 |
| instant_drink | 2,492 | 1,744 | 498 | 249 |
| alcohol | 2,381 | 1,667 | 476 | 238 |
| dried_fruit | 2,368 | 1,658 | 474 | 237 |
| dried_food | 2,222 | 1,555 | 444 | 222 |
| gum | 1,923 | 1,346 | 385 | 192 |
| stationery | 1,466 | 1,026 | 293 | 147 |
### Train/Validation/Test Split
| Split | Ratio | Count |
|---|---|---|
| Train | 70% | 36,928 |
| Validation | 20% | 10,505 |
| Test | 10% | 5,276 |
### Data Augmentation
The following augmentations were applied during training to simulate real-world checkout conditions:
| Augmentation | Purpose |
|---|---|
| Rotation | Items placed on belt in any orientation |
| Horizontal/Vertical Flip | Additional orientation variation |
| Mosaic | Multiple items on belt simultaneously |
| HSV Shift (hue, saturation, value) | Simulate varied store lighting |
| Translation & Scale | Camera height and position variation |
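These augmentations correspond to Ultralytics training hyperparameters. The exact values used in this run are not recorded here, so the following is a sketch with assumed values near the Ultralytics defaults:

```yaml
# Assumed augmentation hyperparameters (Ultralytics names; values illustrative)
degrees: 90.0      # rotation: items land on the belt in any orientation
flipud: 0.5        # vertical flip probability
fliplr: 0.5        # horizontal flip probability
mosaic: 1.0        # mosaic: several items in view at once
hsv_h: 0.015       # hue shift to simulate varied store lighting
hsv_s: 0.7         # saturation shift
hsv_v: 0.4         # value/brightness shift
translate: 0.1     # translation: camera position variation
scale: 0.5         # scale: camera height variation
```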
## Known Biases and Limitations
- Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
- Fresh and unpackaged produce (such as fruits or vegetables) are not represented in the dataset
- Limited lighting variation — real checkout environments may have inconsistent lighting not well represented in training images
## Training Procedure
- Framework: Ultralytics YOLOv11n
- Hardware: A100 GPU in Google Colab
- Epochs: 50
- Batch Size: 64
- Image Size: 640x640
- Patience: 50
- Training Time: ~36.5 minutes (2,189.69 seconds)
- Preprocessing: Augmentations applied at training time (see Data Augmentation section)
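A training invocation matching the settings above might look like the following. This is a sketch, not the exact command used; `grocery-goods.yaml` is a placeholder for the dataset config path:

```shell
# Fine-tune COCO-pretrained YOLO11n with the settings listed above.
# grocery-goods.yaml is a placeholder for the dataset config path.
yolo detect train model=yolo11n.pt data=grocery-goods.yaml \
  epochs=50 batch=64 imgsz=640 patience=50 device=0
```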
## Evaluation Results
### Comprehensive Metrics
All files output to the `runs\detect\train` folder are provided in the Files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. The model demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.
| Metric | Value |
|---|---|
| Precision | 0.989 |
| Recall | 0.985 |
| mAP50 | 0.992 |
| mAP50-95 | 0.862 |
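As a reminder of what the headline numbers mean, here is a small sketch computing precision and recall from detection counts. The TP/FP/FN counts are not reported by the training run; the values below are back-calculated illustrations that reproduce the headline numbers over the 9,825 test instances:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 9,678 correct detections, 108 false alarms, 147 misses
p, r = precision_recall(tp=9678, fp=108, fn=147)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.989 recall=0.985
```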
### Per-Class Breakdown
| Class | Test Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|
| all | 1,928 | 9,825 | 0.989 | 0.985 | 0.992 | 0.862 |
| alcohol | 252 | 503 | 0.996 | 0.986 | 0.995 | 0.864 |
| candy | 257 | 502 | 0.988 | 0.980 | 0.990 | 0.815 |
| canned_food | 252 | 545 | 0.982 | 0.996 | 0.990 | 0.877 |
| chocolate | 360 | 700 | 0.982 | 0.984 | 0.993 | 0.833 |
| dessert | 389 | 819 | 0.995 | 0.991 | 0.995 | 0.881 |
| dried_food | 244 | 415 | 0.982 | 0.993 | 0.995 | 0.877 |
| dried_fruit | 263 | 516 | 0.986 | 0.986 | 0.995 | 0.887 |
| drink | 360 | 796 | 0.982 | 0.990 | 0.994 | 0.871 |
| gum | 183 | 360 | 0.989 | 0.979 | 0.991 | 0.812 |
| instant_drink | 271 | 554 | 0.984 | 0.982 | 0.994 | 0.886 |
| instant_noodles | 302 | 614 | 0.989 | 0.997 | 0.995 | 0.888 |
| milk | 256 | 491 | 0.996 | 0.990 | 0.994 | 0.859 |
| personal_hygiene | 255 | 506 | 0.990 | 0.982 | 0.994 | 0.854 |
| puffed_food | 324 | 654 | 0.996 | 1.000 | 0.995 | 0.907 |
| seasoner | 302 | 572 | 0.986 | 0.965 | 0.993 | 0.849 |
| stationery | 162 | 300 | 0.986 | 0.957 | 0.972 | 0.785 |
| tissue | 482 | 978 | 0.999 | 0.994 | 0.995 | 0.909 |
### Key Visualizations
- Confusion Matrix
- F1 Confidence Curve
- Training & Validation Loss Curves
### Performance Analysis
The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being stationery at 0.972. The strongest-performing classes were tissue and puffed_food (mAP50-95: 0.909 and 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest-performing class was stationery (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,026 training instances (1,466 total), suggesting performance is partially limited by sample size.
## Limitations and Biases
When tested on the external D2S dataset (in-the-wild images), performance dropped significantly: the model missed entire objects, produced low-confidence detections, and misclassified items. For example, it labeled a water bottle as instant_noodles. This suggests the model may have overfit to the specific visual patterns of the training data, or it may reflect a domain gap between the Asian grocery packaging in the training data and the European products in D2S. Both explanations are plausible, and further testing on diverse datasets would be needed to distinguish between them.


