Image Segmentation

Short Definition

Image segmentation is a computer vision task that partitions an image into meaningful regions by assigning a class label to every pixel. It provides pixel-level understanding of visual scenes, enabling precise delineation of object boundaries for applications requiring detailed spatial information.

Full Definition

Image segmentation represents the most detailed level of visual understanding in computer vision, going beyond object detection to provide pixel-perfect outlines of every object and region in an image.

There are three main types of segmentation. Semantic segmentation assigns a class label to every pixel but does not distinguish between different instances of the same class (all cars receive the same label). Instance segmentation identifies and separates individual instances of each class (each car gets a unique label). Panoptic segmentation combines both, providing complete scene understanding that covers both "stuff" (sky, road) and "things" (individual cars, people).

The field was transformed by the Fully Convolutional Network (FCN) in 2014, which adapted classification networks for dense pixel prediction. U-Net (2015) introduced the encoder-decoder architecture with skip connections that became the standard for medical image segmentation. Later, Mask R-CNN extended Faster R-CNN to predict segmentation masks alongside bounding boxes. More recently, the Segment Anything Model (SAM) from Meta AI marked a further shift: a foundation model that can segment a wide range of objects in an image from just a point or box prompt.

Segmentation is critical for autonomous driving (understanding drivable areas), medical imaging (precisely delineating tumors), satellite imagery (land use classification), and video editing (background replacement).
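The distinction between the three types can be made concrete with a toy one-row label map (all label values here are illustrative, not from any real dataset):

```python
import numpy as np

# Toy 1x6 "image" row containing two cars on a road.
# Semantic segmentation: both cars share the same class label.
semantic = np.array([0, 1, 1, 0, 1, 1])  # 0 = road, 1 = car

# Instance segmentation: each car gets its own instance id.
instances = np.array([0, 1, 1, 0, 2, 2])  # 0 = background, 1/2 = car #1/#2

# Panoptic segmentation: a (class, instance) pair per pixel;
# "stuff" classes like road keep instance id 0.
panoptic = list(zip(semantic.tolist(), instances.tolist()))
```

Note how the two cars are indistinguishable in `semantic` but separated in `instances`, while `panoptic` carries both pieces of information at once.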

Technical Explanation

Semantic segmentation models output a tensor of shape (H, W, C), where C is the number of classes; each pixel is assigned the class with the highest score (argmax across the class dimension).

In the U-Net architecture, the encoder downsamples the input with convolution and pooling, the decoder upsamples with transposed convolutions, and skip connections concatenate encoder features onto the decoder at each resolution level. Mask R-CNN adds a mask prediction branch to Faster R-CNN: for each detected instance, a small FCN predicts a binary mask.

Common loss functions include pixel-wise cross-entropy and Dice loss, where Dice = 2|A ∩ B| / (|A| + |B|) for predicted mask A and ground-truth mask B. Evaluation typically uses mean Intersection over Union (mIoU): for each class, IoU = TP / (TP + FP + FN), then average across classes.

SAM uses a promptable architecture with an image encoder, a prompt encoder, and a mask decoder.
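The per-pixel argmax step and the two metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function names are hypothetical:

```python
import numpy as np

def predict_labels(logits):
    """Per-pixel class prediction: argmax over the class axis.

    logits: array of shape (H, W, C) holding raw class scores.
    Returns an (H, W) array of class indices.
    """
    return np.argmax(logits, axis=-1)

def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    total = pred_mask.sum() + true_mask.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def mean_iou(pred, target, num_classes):
    """mIoU: per-class IoU = TP / (TP + FP + FN), averaged over
    classes present in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()  # TP + FP + FN
        if union == 0:
            continue  # class absent from both; exclude from the mean
        intersection = np.logical_and(pred_c, target_c).sum()  # TP
        ious.append(intersection / union)
    return float(np.mean(ious))
```

Note that IoU can be computed on boolean masks as intersection over union because, for binary masks, |A ∩ B| counts the true positives and |A ∪ B| counts TP + FP + FN.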

Use Cases

Autonomous driving scene understanding | Medical image analysis and tumor delineation | Satellite imagery land use mapping | Video conferencing background removal | Agricultural crop monitoring | Robot navigation and manipulation | Augmented reality scene understanding | Document layout analysis

Advantages

Pixel-level precision for detailed understanding | Critical for safety applications like autonomous driving | Foundation models enable universal segmentation | Well-established architectures for different needs | Essential for medical image analysis | Enables precise measurement and analysis

Disadvantages

Requires pixel-level annotations which are very expensive | Computationally intensive for high-resolution images | Boundary precision remains challenging | Class imbalance in segmentation datasets | Real-time performance requires optimization | Struggles with fine structures and thin objects

Schema Type

DefinedTerm

Difficulty Level

Beginner