---
language: en
license: mit
tags:
- computer-vision
- object-detection
- yolov8
- gesture-recognition
- gaming
pipeline_tag: object-detection
library_name: ultralytics
---
| # Model Description |
| ### Overview |
| This model detects hand gestures for use as input controls for video games. It uses object detection to recognize specific hand poses from a webcam or standard camera and translate them into game actions. |
| The goal of the project is to explore whether computer vision–based gesture recognition can provide a low-cost and accessible alternative to traditional game controllers. |
|
|
| ### Training Approach |
The model was trained with the nano variant of YOLOv8 (YOLOv8n) through the Ultralytics training framework, starting from pretrained YOLOv8n weights and fine-tuning on a custom hand gesture dataset.
|
|
| ### Intended Use Cases |
| * Gesture-controlled video games with simple control schemes |
| * Touchless interfaces |
| * Interactive displays |
| * Public kiosks |
| * Smart home media controls |
| * Desktop navigation |
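For the gaming use cases above, detections need to be translated into game inputs. A minimal sketch of that mapping is below, using the model's four class names; the confidence threshold, the action names, and the `(class_name, confidence)` detection format are illustrative assumptions, not documented parts of the model.

```python
# Maps detected class names (the model's four classes) to hypothetical
# game actions. The action names are placeholders.
GESTURE_TO_ACTION = {
    "Forward": "move_forward",    # Open Palm
    "Backward": "move_backward",  # Closed Fist
    "Jump": "jump",               # Peace Sign
    "Attack": "attack",           # Thumbs Up
}

def pick_action(detections, conf_threshold=0.5):
    """Return the action for the highest-confidence detection above the
    threshold, or None if nothing qualifies.

    `detections` is a list of (class_name, confidence) pairs, e.g. parsed
    from a YOLOv8 result object in a webcam loop.
    """
    best = None
    for name, conf in detections:
        if conf >= conf_threshold and (best is None or conf > best[1]):
            best = (name, conf)
    return GESTURE_TO_ACTION.get(best[0]) if best else None
```

Keeping this mapping outside the model means the same detector can drive different games by swapping the dictionary.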
| *** |
| # Training Data |
| ### Dataset Sources |
| **The training dataset was constructed from two sources:** |
|
|
**Rock-Paper-Scissors dataset**
| * Source: Roboflow Universe |
| * Creator: Audrey |
| * Used for the first three gesture classes |
| * Dataset URL: https://universe.roboflow.com/audrey-x3i6m/rps-knmjj |
|
|
**Custom gesture dataset**
| * Created by recording a 30-second video of the author performing gestures |
| * Video parsed into frames at 10 frames per second |
| * Images manually selected and annotated |
|
|
| ### Dataset Size |
| Category | Value |
| | ---------------- | --------- | |
| | Original Images | 444 | |
| | Augmented Images | 1066 | |
| | Image Resolution | 512 × 512 | |
|
|
| ### Class Distribution |
| | Class | Gesture | Annotation Count | |
| | -------- | ----------- | ---------------- | |
| | Forward | Open Palm | 169 | |
| | Backward | Closed Fist | 210 | |
| | Jump | Peace Sign | 187 | |
| | Attack | Thumbs Up | 121 | |
|
|
| ### Data Collection Methodology |
| The dataset combines stock gesture images with a custom dataset created from recorded video frames. |
|
|
| **The custom dataset was generated by:** |
| * Recording a short gesture demonstration video |
| * Extracting frames at 10 FPS |
| * Selecting usable frames |
| * Annotating gesture bounding boxes |

This process produced 236 custom images that were merged with the stock dataset.
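The frame-extraction step above amounts to picking evenly spaced frames so the output rate is about 10 FPS. A sketch of that sampling logic follows; the source video's frame rate is an assumption, and in practice the decoding itself would be done with a tool such as OpenCV's `cv2.VideoCapture` or ffmpeg.

```python
def sample_indices(total_frames, src_fps, target_fps=10):
    """Return indices of frames to keep for an evenly spaced
    target_fps sample of a video recorded at src_fps."""
    step = src_fps / target_fps  # keep one frame every `step` source frames
    indices = []
    i = 0.0
    while round(i) < total_frames:
        indices.append(round(i))
        i += step
    return indices

# Example: a 30-second clip at an assumed 30 FPS (900 frames) yields
# 300 candidate frames, which are then manually filtered and annotated.
```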
|
|
| ### Annotation Process |
| All annotations were created manually using Roboflow. |
| Bounding boxes were drawn around the visible hand gesture in each image. |
Because annotation metadata could not be imported from the original dataset, all 444 images were annotated manually.
| Estimated annotation time: 2–3 hours |
|
|
| ### Train / Validation / Test Split |
| | Dataset Split | Image Count | |
| | ------------- | ----------- | |
| | Training | 933 | |
| | Validation | 88 | |
| | Test | 45 | |
|
|
| ### Data Augmentation |
| **The following augmentations were applied:** |
| * Rotation: ±15 degrees |
| * Saturation adjustment: ±30% |
|
|
| *These augmentations expanded the dataset from 444 to 1066 images.* |
|
|
| ### Dataset Availability |
The full dataset is available at: https://universe.roboflow.com/b-data-497-ws/hand-gesture-controls
|
|
| ### Known Dataset Biases and Limitations |
| * Small dataset size |
| * Class imbalance (thumbs-up has fewer examples) |
| * Mixed image quality between stock and custom images |
| * Limited diversity in backgrounds and lighting conditions |
| * Limited number of subjects (primarily one person) |
|
|
| *These factors may affect model generalization.* |
| *** |
| # Training Procedure |
| ### Framework |
Training was performed in Google Colab using Python code adapted from a YOLOv11 training example, modified to target YOLOv8n. The original code is available [here](https://oceancv.org/book/TrainandDeployObj_YOLO.html).

| ### Model Architecture |
| Base model: YOLOv8n (Nano) |
| **Reasons for selection:** |
| * Lightweight architecture |
| * Low inference latency |
| * Lower hardware requirements |
| * Faster training times |
| * Suitable for real-time applications |
|
|
| ### Training Configuration |
| | Parameter | Value | |
| | ----------------------- | ---------------------------- | |
| | Epochs | 200 (training stopped early) | |
| | Early stopping patience | 10 | |
| | Image size | 512 × 512 | |
| | Batch size | 64 | |
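The configuration above corresponds to an Ultralytics training call along these lines. This is a sketch rather than the exact project script: the dataset YAML path is a placeholder, and other arguments are left at Ultralytics defaults.

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8n weights and fine-tune on the gesture dataset.
model = YOLO("yolov8n.pt")

# Hyperparameters mirror the table above; "data.yaml" is a placeholder for
# the dataset config exported from Roboflow.
model.train(
    data="data.yaml",
    epochs=200,   # upper bound; early stopping ended training sooner
    patience=10,  # early-stopping patience
    imgsz=512,    # matches the 512 × 512 training resolution
    batch=64,
)
```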
|
|
| ### Training Hardware |
| | Component | Specification | |
| | ------------- | ---------------- | |
| GPU           | A100 (High-RAM)  |
| | VRAM | 80 GB | |
| | Training Time | ~40 minutes | |
|
|
| ### Preprocessing Steps |
| * Images resized to 512×512 |
| * Bounding box annotations normalized |
| * Augmented images generated before training |
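The bounding-box normalization step refers to the YOLO label format, where each box is stored as a class index plus center coordinates and dimensions scaled to [0, 1]. A small sketch of that conversion, assuming the 512 × 512 image size used here:

```python
def to_yolo_box(x_min, y_min, x_max, y_max, img_w=512, img_h=512):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) to the
    normalized YOLO format (center_x, center_y, width, height)."""
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return cx, cy, w, h
```

Roboflow performs this conversion automatically when exporting in YOLO format; the function is shown only to make the preprocessing step concrete.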
| *** |
| # Evaluation Results |
| ### Overall Metrics |
| **Final model performance at epoch 41:** |
| | Metric | Score | |
| | --------- | ----- | |
| | mAP@50 | 0.97 | |
| | mAP@50–95 | 0.78 | |
| | Precision | 0.93 | |
| | Recall | 0.91 | |
| | F1 Score | 0.94 | |
|
|
| *These results exceed the predefined project success criteria.* |
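The metrics above can be reproduced with an Ultralytics validation run along these lines (a sketch: `best.pt` is the default Ultralytics output name and `data.yaml` is a placeholder path, both assumptions).

```python
from ultralytics import YOLO

# Load the fine-tuned weights and evaluate on the held-out test split.
model = YOLO("best.pt")
metrics = model.val(data="data.yaml", split="test", imgsz=512)

print(metrics.box.map50)  # mAP@50
print(metrics.box.map)    # mAP@50-95
```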
|
|
| **Per-Class Performance** |
<img alt="Per-Class Performance" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/perclass_perf.png" width="1000" height="180">
|
|
| **Sample Class Images** |
<img alt="Sample Images" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/sample_images.png" width="1100" height="700">
|
|
| ### Key Visualizations |
<img alt="Confusion Matrix" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/confusion_matrix_normalized.png" width="1100" height="700">
<img alt="F1 Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxF1_curve.png" width="1100" height="700">
<img alt="Precision-Recall Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxPR_curve.png" width="1100" height="700">
|
|
| ### Performance Analysis |
| The model achieved high precision and recall across all gesture classes, indicating strong detection performance on the test set. |
|
|
Several factors contributed to this performance:
* A small number of distinct gesture classes
* Highly visible and consistent hand poses
* A balanced dataset for most classes

However, the dataset size is relatively small, which may inflate evaluation scores and limit generalization.
|
|
Failure cases were observed in several situations:
* Complex or cluttered backgrounds
* Low-confidence detections
* Ambiguous or blurred gesture poses

These issues highlight areas where the model could be improved with more diverse training data.
| *** |
| # Limitations and Biases |
| ### Known Failure Cases |
<img alt="Failure Cases" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/failure_cases.png" width="1100" height="700">
The model struggled with some photos from the RPS dataset, as these images contain complex backgrounds, partially occluded hands, or ambiguous gestures.

| ### Data Biases |
| Potential biases include: |
| * limited subject diversity |
| * similar backgrounds across many images |
| * dataset partially composed of stock imagery |
| * limited environmental variability |
| ### Environmental Limitations |
| Model performance may degrade when: |
| * lighting conditions vary significantly |
| * gestures are performed at unusual angles |
| * hands are partially occluded |
| * gestures appear at extreme scales or distances |
| ### Inappropriate Use Cases |
| This model should not be used for: |
| * complex gesture recognition (complex 3D control schemes) |
| * sign language recognition |
| * high-precision human-computer interaction systems |
| * any safety-critical applications |
| ### Sample Size Limitations |
| The dataset is relatively small for object detection training, which may limit generalization to new users or environments. |
Future improvements would most likely come from a larger and more diverse dataset. The best course of action would be to replace the stock image dataset with gesture videos recorded from a diverse range of individuals, backgrounds, and lighting conditions.
| *** |
| # Future Work |
| Potential improvements include: |
| * collecting a larger and more diverse gesture dataset |
| * increasing the number of gesture classes |
| * improving image quality and environmental diversity |
* exploring hand keypoint detection models instead of object detection

Keypoint estimation could allow detection of more complex hand gestures and improve gesture recognition accuracy.