Object Detection

Short Definition

Object detection is a computer vision task that identifies and locates objects within images or video frames by predicting both the class label and bounding box coordinates for each detected object. It combines classification with spatial localization to understand what objects are present and where they are.

Full Definition

Object detection is one of the most practically important tasks in computer vision, enabling machines not just to recognize what is in an image but also to precisely locate each object. Unlike image classification, which assigns a single label to an entire image, object detection must identify multiple objects of potentially different classes and draw tight bounding boxes around each one.

The field has undergone dramatic progress since the deep learning revolution. The R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN) introduced the two-stage approach: first proposing candidate regions, then classifying each region. YOLO (You Only Look Once) pioneered the one-stage approach, treating detection as a single regression problem and achieving real-time performance. SSD (Single Shot Detector) offered another efficient one-stage alternative. Modern one-stage detectors such as YOLOv8 achieve impressive accuracy at real-time speeds, while DETR (Detection Transformer) reframes detection as a set-prediction problem, removing hand-crafted components such as anchor boxes and non-maximum suppression.

Object detection powers autonomous vehicles (detecting pedestrians, vehicles, traffic signs), surveillance systems, medical imaging (detecting tumors), retail analytics (tracking products on shelves), industrial automation (quality inspection), and augmented reality (placing virtual objects relative to real ones). The field continues to advance with 3D object detection, video object detection, and open-vocabulary detection using models like OWL-ViT that can detect any object described in text.

Technical Explanation

Two-stage detectors: Faster R-CNN uses a Region Proposal Network (RPN) to generate candidate boxes, then classifies and refines each one.

One-stage detectors: YOLO divides the image into a grid and predicts B bounding boxes per cell, each with (x, y, w, h, confidence, class_probabilities).

IoU (Intersection over Union) measures bounding box overlap: IoU = area(intersection) / area(union).

Non-Maximum Suppression (NMS) removes duplicate detections by keeping the highest-confidence box and discarding overlapping boxes whose IoU with it exceeds a threshold.

Evaluation uses mean Average Precision (mAP): for each class, compute the precision-recall curve and its average precision, then average across classes.

Anchor-free detectors like FCOS predict offsets from each pixel to the box boundaries, eliminating predefined anchor boxes.
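The IoU formula and greedy NMS described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes boxes are given as (x1, y1, x2, y2) corner coordinates (detectors also use center/width/height formats), and the function names `iou` and `nms` are chosen here for clarity.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Discard boxes whose IoU with the kept box exceeds the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping boxes plus one distant box: NMS keeps the
# higher-scoring duplicate and the distant box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```

Real detection libraries perform the same computation vectorized over thousands of boxes (e.g. on the GPU), but the logic is exactly this greedy loop.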

Use Cases

Autonomous vehicle perception | Surveillance and security | Medical image analysis | Retail shelf monitoring | Industrial quality inspection | Drone-based inspection | Wildlife monitoring | Augmented reality

Advantages

Enables precise spatial understanding of scenes | Real-time performance with modern architectures | Handles multiple objects and classes simultaneously | Powers safety-critical applications | Active research with rapid improvements | Open-vocabulary detection expanding capabilities

Disadvantages

Requires large annotated datasets with bounding boxes | Small objects remain challenging to detect | Occlusion and crowded scenes reduce accuracy | Annotation is expensive and time-consuming | Tradeoff between real-time speed and accuracy | Struggles with unusual viewpoints or conditions

Schema Type

DefinedTerm

Difficulty Level

Beginner