---
language: en
license: mit
tags:
- computer-vision
- object-detection
- yolov8
- gesture-recognition
- gaming
pipeline_tag: object-detection
library_name: ultralytics
---
# Model Description
### Overview
This model detects hand gestures for use as input controls for video games. It uses object detection to recognize specific hand poses from a webcam or standard camera and translates them into game actions.
The goal of the project is to explore whether computer vision–based gesture recognition can provide a low-cost and accessible alternative to traditional game controllers.
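The gesture-to-action control scheme described above can be sketched as a simple lookup from a detected class name to a game action. The class names come from the class table below; the action strings and the idea of a dictionary lookup are illustrative assumptions, not the project's actual input code (a real integration would emit key events with a library such as `pynput`):

```python
# Minimal sketch: map a detected gesture class to a game action.
# Class names follow the model's four classes; the action strings are
# illustrative placeholders for real key-press events.

GESTURE_TO_ACTION = {
    "Forward": "move_forward",   # open palm
    "Backward": "move_backward", # closed fist
    "Jump": "jump",              # peace sign
    "Attack": "attack",          # thumbs up
}

def gesture_to_action(class_name: str):
    """Return the game action for a detected class, or None if unknown."""
    return GESTURE_TO_ACTION.get(class_name)
```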

### Training Approach
The model was fine-tuned from pretrained YOLOv8n (nano) weights on a custom hand gesture dataset using the Ultralytics training framework.

### Intended Use Cases
* Gesture-controlled video games with simple control schemes
* Touchless interfaces
* Interactive displays
* Public kiosks
* Smart home media controls
* Desktop navigation
***
# Training Data
### Dataset Sources
**The training dataset was constructed from two sources:**

Rock-Paper-Scissors dataset
* Source: Roboflow Universe
* Creator: Audrey
* Used for the first three gesture classes
* Dataset URL: https://universe.roboflow.com/audrey-x3i6m/rps-knmjj

Custom gesture dataset
* Created by recording a 30-second video of the author performing gestures
* Video parsed into frames at 10 frames per second
* Images manually selected and annotated

### Dataset Size
| Category         | Count     |
| ---------------- | --------- |
| Original Images  | 444       |
| Augmented Images | 1066      |
| Image Resolution | 512 × 512 |

### Class Distribution
| Class    | Gesture     | Annotation Count |
| -------- | ----------- | ---------------- |
| Forward  | Open Palm   | 169              |
| Backward | Closed Fist | 210              |
| Jump     | Peace Sign  | 187              |
| Attack   | Thumbs Up   | 121              |

### Data Collection Methodology
The dataset combines stock gesture images with a custom dataset created from recorded video frames.

**The custom dataset was generated by:**
* Recording a short gesture demonstration video
* Extracting frames at 10 FPS
* Selecting usable frames
* Annotating gesture bounding boxes

This process produced 236 custom images that were merged with the stock dataset.
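The 10 FPS frame-extraction step above can be sketched as a frame-sampling calculation: given the source video's frame rate, pick which frame indices to keep. Actual video decoding (e.g. with `cv2.VideoCapture`) is omitted so the sketch stays self-contained; the function below shows only the sampling logic and is an assumption about how the extraction was performed:

```python
# Sketch of 10 FPS frame sampling: compute which source-frame indices to
# keep so that roughly `target_fps` frames are extracted per second.
# Decoding the frames themselves (e.g. cv2.VideoCapture) is omitted.

def frames_to_keep(source_fps: float, target_fps: float, total_frames: int):
    """Indices of frames sampled at target_fps from a source_fps video."""
    step = source_fps / target_fps  # e.g. 30 fps -> 10 fps keeps every 3rd frame
    return [int(i * step) for i in range(int(total_frames / step))]
```

A 30-second clip recorded at 30 FPS (900 frames) would yield about 300 candidate frames at 10 FPS, which were then manually filtered down to the 236 usable images.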

### Annotation Process
All annotations were created manually using Roboflow.
Bounding boxes were drawn around the visible hand gesture in each image.
Because annotation metadata could not be imported from the original dataset, all 444 images were annotated manually.
Estimated annotation time: 2–3 hours

### Train / Validation / Test Split
| Dataset Split | Image Count |
| ------------- | ----------- |
| Training      | 933         |
| Validation    | 88          |
| Test          | 45          |

### Data Augmentation
**The following augmentations were applied:**
* Rotation: ±15 degrees
* Saturation adjustment: ±30%

*These augmentations expanded the dataset from 444 to 1066 images.*

### Dataset Availability
Dataset availability: https://universe.roboflow.com/b-data-497-ws/hand-gesture-controls

### Known Dataset Biases and Limitations
* Small dataset size
* Class imbalance (thumbs-up has fewer examples)
* Mixed image quality between stock and custom images
* Limited diversity in backgrounds and lighting conditions
* Limited number of subjects (primarily one person)

*These factors may affect model generalization.*
***
# Training Procedure
### Framework
Training was performed in Google Colab using Python code adapted for YOLOv8n from a YOLOv11 training example ([source](https://oceancv.org/book/TrainandDeployObj_YOLO.html)).

### Model Architecture
Base model: YOLOv8n (Nano)

**Reasons for selection:**
* Lightweight architecture
* Low inference latency
* Lower hardware requirements
* Faster training times
* Suitable for real-time applications

### Training Configuration
| Parameter               | Value                        |
| ----------------------- | ---------------------------- |
| Epochs                  | 200 (training stopped early) |
| Early stopping patience | 10                           |
| Image size              | 512 × 512                    |
| Batch size              | 64                           |
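The configuration table above translates into an Ultralytics training call along these lines. This is a sketch, not the project's exact script: the dataset YAML path is a placeholder, and the call is wrapped in a function so nothing trains on import:

```python
# Sketch of the training call implied by the configuration table.
# The data YAML path is a placeholder; training is wrapped in a function
# so this file can be imported without starting a run.

train_args = dict(
    data="hand_gestures.yaml",  # placeholder path to the dataset config
    epochs=200,                 # upper bound; early stopping ended the run at epoch 41
    patience=10,                # early-stopping patience
    imgsz=512,
    batch=64,
)

def run_training():
    # Requires `pip install ultralytics`; starts from pretrained nano weights.
    from ultralytics import YOLO
    model = YOLO("yolov8n.pt")
    return model.train(**train_args)
```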

### Training Hardware
| Component     | Specification    |
| ------------- | ---------------- |
| GPU           | A100 (High-RAM)  |
| VRAM          | 80 GB            |
| Training Time | ~40 minutes      |

### Preprocessing Steps
* Images resized to 512×512
* Bounding box annotations normalized
* Augmented images generated before training
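The bounding-box normalization step converts pixel-coordinate annotations into YOLO's normalized center format (x-center, y-center, width, height, each scaled to [0, 1] by the image dimensions). A minimal sketch, assuming the annotations start in corner format:

```python
# Sketch of YOLO-format bounding-box normalization: convert a pixel-space
# (x_min, y_min, x_max, y_max) box into normalized (x_center, y_center, w, h).

def normalize_bbox(x_min, y_min, x_max, y_max, img_w, img_h):
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return x_c, y_c, w, h
```

For a centered 256×256 box in a 512×512 image, this yields (0.5, 0.5, 0.5, 0.5).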
***
# Evaluation Results
### Overall Metrics

**Final model performance at epoch 41:**
| Metric    | Score |
| --------- | ----- |
| mAP@50    | 0.97  |
| mAP@50–95 | 0.78  |
| Precision | 0.93  |
| Recall    | 0.91  |
| F1 Score  | 0.94  |

*These results exceed the predefined project success criteria.*

**Per-Class Performance**
<img alt= "Per-Class Performance" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/perclass_perf.png" width="1000" height="180"></img>

**Sample Class Images**
<img alt= "Sample Images" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/sample_images.png" width="1100" height="700"></img>

### Key Visualizations
<img alt= "Confusion Matrix" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/confusion_matrix_normalized.png" width="1100" height="700"></img>
<img alt= "F1 Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxF1_curve.png" width="1100" height="700"></img>
<img alt= "Precision-Recall Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxPR_curve.png" width="1100" height="700"></img>

### Performance Analysis
The model achieved high precision and recall across all gesture classes, indicating strong detection performance on the test set.

Several factors contributed to this performance:
* A small number of distinct gesture classes
* Highly visible and consistent hand poses
* A balanced dataset for most classes

However, the dataset size is relatively small, which may inflate evaluation scores and limit generalization.

Failure cases were observed in several situations:
* Complex or cluttered backgrounds
* Low confidence detections
* Ambiguous or blurred gesture poses

These issues highlight areas where the model could be improved with more diverse training data.
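The low-confidence failure cases noted above are commonly mitigated by filtering predictions against a confidence threshold before they trigger a game action. A minimal sketch, assuming detections arrive as (class, confidence) pairs rather than Ultralytics' native result objects:

```python
# Sketch: suppress low-confidence detections before acting on them.
# Detections are (class_name, confidence) pairs; a real pipeline would
# read these from the model's prediction results.

def filter_detections(detections, conf_threshold=0.5):
    """Keep only detections at or above the confidence threshold."""
    return [(cls, conf) for cls, conf in detections if conf >= conf_threshold]
```

Raising the threshold trades missed gestures for fewer spurious game inputs, which is usually the right trade-off for a control scheme.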
***
# Limitations and Biases
### Known Failure Cases
<img alt= "Failure Cases" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/failure_cases.png" width="1100" height="700"></img>
The model struggled with some of the photos from the RPS dataset as these images contain complex backgrounds, partially occluded hands, or ambiguous gestures. 

### Data Biases
Potential biases include:
* limited subject diversity
* similar backgrounds across many images
* dataset partially composed of stock imagery
* limited environmental variability

### Environmental Limitations
Model performance may degrade when:
* lighting conditions vary significantly
* gestures are performed at unusual angles
* hands are partially occluded
* gestures appear at extreme scales or distances

### Inappropriate Use Cases
This model should not be used for:
* complex gesture recognition (complex 3D control schemes)
* sign language recognition
* high-precision human-computer interaction systems
* any safety-critical applications

### Sample Size Limitations
The dataset is relatively small for object detection training, which may limit generalization to new users or environments.
Future improvements would most likely come from a larger and more diverse dataset. The best course of action would be to drop the stock image dataset and instead collect gesture videos from diverse individuals, backgrounds, and lighting conditions.
***
# Future Work
Potential improvements include:
* collecting a larger and more diverse gesture dataset
* increasing the number of gesture classes
* improving image quality and environmental diversity
* exploring hand keypoint detection models instead of object detection; keypoint estimation could allow detection of more complex hand gestures and improve recognition accuracy