Classical vs Modern Computer Vision: When to Use Each
Computer vision has evolved dramatically with deep learning, but classical techniques remain essential for production systems. Understanding when to use geometric methods versus learned approaches, and how to combine them, is critical for building robust perception systems.
The Two Paradigms
Classical Computer Vision (Pre-2012)
Philosophy: Hand-crafted features + geometric reasoning + optimization
Core Techniques:
- Feature detection (SIFT, SURF, ORB, FAST)
- Edge detection (Canny, Sobel)
- Geometric primitives (lines, circles, planes)
- Epipolar geometry, homographies
- Structure from Motion (SfM)
- Bundle adjustment
- Kalman filtering, particle filters
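Kalman filtering, the last item above, can be illustrated in one dimension: predict, then blend in each new measurement weighted by the Kalman gain. A minimal pure-Python sketch (the variances and readings are made up; real trackers use the full multivariate form with state-transition and covariance matrices):

```python
# Minimal 1D Kalman filter: estimate a (nearly) constant value from noisy
# scalar measurements. Illustrative sketch only.

def kalman_1d(measurements, process_var=1e-5, meas_var=0.1):
    """Filter a sequence of scalar measurements of a nearly static state."""
    x, p = measurements[0], 1.0   # initial estimate and its variance
    estimates = [x]
    for z in measurements[1:]:
        p += process_var          # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)    # Kalman gain: trust in measurement vs prediction
        x += k * (z - x)          # update: pull estimate toward the measurement
        p *= (1 - k)              # uncertainty shrinks after each update
        estimates.append(x)
    return estimates

# Noisy readings of a true value of 5.0
readings = [5.3, 4.8, 5.1, 4.9, 5.2, 5.0, 4.7, 5.1]
print(kalman_1d(readings)[-1])  # converges close to 5.0
```

The same predict/update cycle, generalized to state vectors and covariance matrices, is what runs in visual-inertial odometry and tracking pipelines.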
Strengths:
- Mathematically interpretable
- Predictable behavior
- Efficient (low compute/memory)
- No training data needed
- Generalizes to new scenes
- Precise geometric reasoning
Weaknesses:
- Limited semantic understanding
- Struggles with cluttered scenes
- Sensitive to viewpoint/lighting changes
- Manual feature engineering
- Breaks down with textureless surfaces
Modern Computer Vision (2012+)
Philosophy: Learn features + representations + reasoning end-to-end from data
Core Techniques:
- Convolutional Neural Networks (CNNs)
- Transformers (Vision Transformers, DETR)
- Semantic segmentation (U-Net, DeepLab, Mask R-CNN)
- Object detection (YOLO, Faster R-CNN)
- Depth estimation (MonoDepth, DPT)
- Optical flow networks (FlowNet, RAFT)
- Neural rendering (NeRF, Gaussian Splatting)
Strengths:
- Rich semantic understanding
- Handles complex appearance variations
- Works on textureless/ambiguous regions
- Learns domain-specific priors
- State-of-the-art accuracy on benchmarks
Weaknesses:
- Requires large labeled datasets
- Computationally expensive (GPU needed)
- Unpredictable edge case behavior
- Difficult to interpret/debug
- Can fail silently without geometric consistency
- Overfits to training distribution
When to Use Classical CV
1. Geometric Reasoning Tasks
Camera Calibration:
- Zhang's method is simple, robust, well-understood
- No need for ML when geometry is exact
Pose Estimation (PnP):
- Given 2D-3D correspondences, classical solvers are optimal
- EPnP, P3P are fast and accurate
- Used in production: Astrobee, AR/VR systems, robotics
Stereo Vision:
- Rectification is pure geometry
- Semi-global matching works well for textured scenes
- No training data needed
Structure from Motion:
- COLMAP, OpenSfM provide robust 3D reconstruction
- Handles sparse features efficiently
- Bundle adjustment optimizes geometry globally
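The triangulation step at the heart of SfM can be shown in two dimensions: two cameras at known positions each observe a bearing to the same landmark, and the landmark is recovered by intersecting the rays. A simplified sketch (real pipelines triangulate in 3D from pixel coordinates, e.g. via the DLT method, but the geometric idea is identical):

```python
import math

def triangulate_2d(c1, theta1, c2, theta2):
    """Intersect rays c1 + t*d1 and c2 + s*d2 (bearing angles in radians)."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve c1 + t*d1 = c2 + s*d2 as a 2x2 linear system via Cramer's rule
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel: no unique intersection")
    bx, by = c2[0] - c1[0], c2[1] - c1[1]
    t = (bx * (-d2[1]) - (-d2[0]) * by) / det
    return (c1[0] + t * d1[0], c1[1] + t * d1[1])

# Cameras at (0,0) and (4,0) both see a landmark at (2,2):
# bearings of 45 and 135 degrees respectively.
p = triangulate_2d((0, 0), math.pi / 4, (4, 0), 3 * math.pi / 4)
print(p)  # ~ (2.0, 2.0)
```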
2. Resource-Constrained Environments
Embedded Systems:
- Space robotics (Astrobee): ARM processors, limited memory
- Drones: Real-time on mobile GPUs
- Edge devices: Can't run large neural networks
Example: Astrobee on ISS
- BRISK features: 15-20ms per frame on ARM Cortex-A9
- EPnP + RANSAC: 10ms for pose estimation
- Total: 30 FPS real-time on space-grade hardware
- No GPU needed, deterministic performance
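The RANSAC loop paired with EPnP above follows the same hypothesize-and-verify pattern regardless of the model being fit. A minimal sketch using 2D line fitting as a stand-in (thresholds and iteration counts are illustrative, not tuned):

```python
import random

def ransac_line(points, iters=200, thresh=0.2, seed=0):
    """Fit y = a*x + b to points with outliers via RANSAC."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal sample: 2 points
        if x1 == x2:
            continue                                  # vertical: skip this hypothesis
        a = (y2 - y1) / (x2 - x1)                     # fit model to the sample
        b = y1 - a * x1
        # verify: count points consistent with the hypothesized model
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Points on y = 2x + 1, plus two gross outliers
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -25)]
(a, b), inliers = ransac_line(pts)
print(a, b, len(inliers))  # recovers a=2, b=1 with 10 inliers
```

In pose estimation the minimal sample is 3-4 2D-3D correspondences fed to P3P/EPnP, and the residual is reprojection error, but the loop is the same.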
3. Limited or No Training Data
Novel Environments:
- Space stations, underwater, caves
- No pre-existing datasets
- Classical methods generalize without training
Custom Scenarios:
- Unique industrial inspection tasks
- Scientific instruments
- One-off robotics applications
4. Safety-Critical Systems
Interpretability:
- Geometric methods have clear failure modes
- Can prove mathematical correctness
- Easier to validate and test
Predictability:
- Deterministic behavior
- No silent failures from out-of-distribution data
Example: Autonomous vehicles
- Use classical methods for localization (HD maps + particle filters)
- Combine with learned perception for robustness
5. Real-Time Requirements
Low Latency:
- Feature detection: < 10ms
- Optical flow (Lucas-Kanade): < 5ms
- Template matching: < 1ms
Predictable Timing:
- No variable network inference time
- Can guarantee real-time deadlines
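Template matching at this speed is just a brute-force slide-and-score loop. A toy sum-of-squared-differences (SSD) version on a tiny grayscale grid; production code would call an optimized routine such as OpenCV's matchTemplate, but the operation is exactly this:

```python
def match_template(image, template):
    """Return the (x, y) placement of template with the lowest SSD score."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_ssd = None, float("inf")
    for y in range(ih - th + 1):          # slide the template over the image
        for x in range(iw - tw + 1):
            ssd = sum(
                (image[y + j][x + i] - template[j][i]) ** 2
                for j in range(th) for i in range(tw)
            )
            if ssd < best_ssd:
                best_pos, best_ssd = (x, y), ssd
    return best_pos, best_ssd

image = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]
print(match_template(image, template))  # exact match at (x=1, y=1), SSD 0
```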
When to Use Modern CV (Deep Learning)
1. Semantic Understanding
Object Recognition:
- Classify objects, detect instances
- Handle intra-class variation
- Learn complex appearance models
Scene Understanding:
- Semantic segmentation (road, sidewalk, building, vegetation)
- Instance segmentation (separate objects)
- Panoptic segmentation (stuff + things)
Example: Zipline Delivery Zones
- Segment aerial imagery into safe landing zones
- Identify obstacles (trees, power lines, buildings)
- Learn from labeled satellite/drone data
2. Dense Prediction Tasks
Monocular Depth Estimation:
- Predict depth from single image
- Classical methods require stereo or motion
- Networks learn geometric priors from data
Optical Flow:
- Dense motion field estimation
- FlowNet, RAFT outperform classical methods
- Better at motion boundaries
Surface Normal Estimation:
- Predict 3D orientation from shading
- Learns shape-from-shading priors
3. Ill-Posed Problems
Super-Resolution:
- Hallucinate high-frequency details
- Learned priors from natural image statistics
Denoising/Inpainting:
- Remove artifacts, fill missing regions
- Neural priors for plausible completions
HDR Reconstruction:
- Merge multiple exposures
- Handle saturated regions
4. Large-Scale Annotated Datasets
Pre-Trained Models:
- ImageNet (1.4M images)
- COCO (330K images)
- Cityscapes (5K urban scenes)
Transfer Learning:
- Fine-tune on specific task
- Requires less data than training from scratch
5. Complex Appearance Variations
Illumination Changes:
- Shadows, highlights, reflections
- Learned features are more robust than hand-crafted
Occlusions:
- Partially visible objects
- Networks learn to complete from context
Clutter:
- Crowded scenes with overlapping objects
- Attention mechanisms focus on relevant features
Hybrid Approaches: Best of Both Worlds
The most robust production systems combine classical and modern techniques.
Architecture Pattern
Raw Sensor Data
    ↓
[Classical Preprocessing]
- Undistortion
- Rectification
- Feature detection
    ↓
[Learned Feature Extraction]
- CNN backbone
- Feature pyramid
    ↓
[Classical Geometric Reasoning]
- Epipolar constraints
- PnP / Triangulation
- Bundle adjustment
    ↓
[Learned Refinement]
- Pose refinement network
- Depth completion
    ↓
Output (Pose, Depth, Segmentation)
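The pattern above can be sketched as composable stages. Every function body below is a stub with made-up names and return values; only the structure matters: classical geometry brackets the learned components and validates their output.

```python
# Structural sketch of a hybrid perception pipeline. All stages are stubs;
# the names and signatures are illustrative, not from any real library.

def undistort(frame):                     # classical preprocessing
    return frame

def detect_features(frame):               # classical keypoints (e.g. ORB-style)
    return [(10, 12), (40, 33), (22, 8)]

def cnn_descriptors(frame, keypoints):    # stand-in for a learned extractor
    return {kp: hash(kp) % 256 for kp in keypoints}

def estimate_pose(descriptors):           # stand-in for matching + PnP + RANSAC
    return {"rotation": (0.0, 0.0, 0.0), "translation": (0.0, 0.0, 0.0)}

def refine_pose(pose, descriptors):       # stand-in for a learned refinement net
    return pose

def perceive(frame):
    frame = undistort(frame)
    keypoints = detect_features(frame)
    descriptors = cnn_descriptors(frame, keypoints)
    pose = estimate_pose(descriptors)
    return refine_pose(pose, descriptors)

print(perceive(frame="raw-sensor-data"))
```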
Example Systems
ORB-SLAM3 + DepthNet:
- Classical: ORB features, tracking, bundle adjustment
- Learned: Monocular depth prediction
- Combination: Depth network provides scale, SLAM provides consistency
DeepTAM:
- Classical: Camera pose optimization
- Learned: Dense depth prediction
- Combination: Geometric consistency constrains network
Astrobee + Future ML:
- Classical: Current production system (feature tracking, PnP)
- Learned: Potential future addition (semantic understanding, texture-less localization)
- Combination: ML provides hints, geometry provides precision
When to Combine
Learned Feature Detection + Classical Matching:
- SuperPoint, DISK for learned keypoints
- Geometric verification (RANSAC, epipolar constraints)
- Best of both: robust features + geometric consistency
Classical Depth + Learned Completion:
- Stereo matching for sparse/semi-dense depth
- Neural network fills holes, smooths noise
- Hybrid: metric accuracy + dense output
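The hole-filling step can be illustrated on a single scan line of stereo depths, with nearest-neighbor fill as a deliberately crude stand-in for the completion network:

```python
# Toy "depth completion": fill holes (None) in a 1-D scan line of stereo
# depths with the nearest valid measurement. A neural completion network
# would instead produce smooth, edge-aware fills; this is only a sketch.

def fill_nearest(depths):
    """Replace each None with the depth of the nearest valid sample."""
    valid = [(i, d) for i, d in enumerate(depths) if d is not None]
    return [
        d if d is not None else min(valid, key=lambda v: abs(v[0] - i))[1]
        for i, d in enumerate(depths)
    ]

scan = [2.0, None, None, 2.6, None, 3.0]
print(fill_nearest(scan))  # [2.0, 2.0, 2.6, 2.6, 2.6, 3.0]
```

The hybrid property is visible even here: measured samples are kept verbatim (metric accuracy), and only the holes are hallucinated (dense output).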
Semantic Segmentation + Geometric SLAM:
- Segment scene into semantic classes
- Use only static classes (road, building) for SLAM
- Ignore dynamic objects (cars, people)
- Reduces drift in dynamic environments
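The masking step can be sketched directly: given a per-pixel label grid from the segmentation network, keep only keypoints that land on static classes. The class names and label grid below are illustrative:

```python
# Keep only SLAM keypoints that fall on static semantic classes; drop
# those on dynamic objects so they cannot corrupt the map.

STATIC_CLASSES = {"road", "building"}

def filter_keypoints(keypoints, label_grid):
    """Keep keypoints (x, y) whose per-pixel semantic label is static."""
    return [
        (x, y) for x, y in keypoints
        if label_grid[y][x] in STATIC_CLASSES
    ]

labels = [
    ["road", "road", "car"],
    ["building", "person", "road"],
]
kps = [(0, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
print(filter_keypoints(kps, labels))  # [(0, 0), (0, 1), (2, 1)]
```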
Decision Framework
Choose Classical If:
- ✅ Problem has clear geometric structure
- ✅ Computation is constrained (embedded systems)
- ✅ Interpretability is critical (safety systems)
- ✅ Training data is scarce or unavailable
- ✅ Need predictable, deterministic behavior
- ✅ Real-time deadlines are strict
Choose Modern If:
- ✅ Need semantic understanding (object classes, attributes)
- ✅ Have large labeled dataset or pre-trained models
- ✅ Problem is appearance-based (recognition, classification)
- ✅ Dealing with complex variations (lighting, occlusions)
- ✅ Have GPU compute available
- ✅ Can tolerate occasional edge-case failures
Combine Both If:
- ✅ Building production system (most robust approach)
- ✅ Need both geometric precision and semantic understanding
- ✅ Want geometric constraints to validate neural outputs
- ✅ Targeting autonomous systems (cars, drones, robots)
Production Considerations
Validation Strategy
Classical Methods:
- Unit tests on synthetic data
- Verify geometric properties (epipolar error, reprojection error)
- Analytical error bounds
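A reprojection-error check of this kind can be unit-tested on synthetic data, where pixels generated by a pinhole model must reproject with zero error. A minimal sketch (the focal length, principal point, and 3D points are made up):

```python
import math

def project(point3d, f, cx, cy):
    """Pinhole projection: focal length f, principal point (cx, cy)."""
    X, Y, Z = point3d
    return (f * X / Z + cx, f * Y / Z + cy)

def mean_reprojection_error(points3d, pixels, f, cx, cy):
    """Average pixel distance between projected 3D points and observations."""
    errs = []
    for p3, (u, v) in zip(points3d, pixels):
        pu, pv = project(p3, f, cx, cy)
        errs.append(math.hypot(pu - u, pv - v))
    return sum(errs) / len(errs)

# Synthetic test: pixels generated by the same model must reproject exactly
pts3d = [(0.5, -0.2, 2.0), (-1.0, 0.4, 4.0), (0.0, 0.0, 1.0)]
f, cx, cy = 500.0, 320.0, 240.0
pix = [project(p, f, cx, cy) for p in pts3d]
print(mean_reprojection_error(pts3d, pix, f, cx, cy))  # 0.0
```

In a real estimator the same function is run on the recovered pose, and a reprojection error above a threshold flags a bad solution.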
Learned Methods:
- Test set evaluation (but may not cover edge cases)
- Cross-validation on multiple datasets
- Adversarial testing
- Out-of-distribution detection
Hybrid:
- Use geometry to validate neural predictions
- Flag inconsistencies for human review
- Degrade gracefully (fallback to classical)
Debugging
Classical:
- Visualize features, matches, epipolar lines
- Check algebraic constraints
- Trace mathematical errors
Learned:
- Visualize activation maps, attention
- Check for overfitting, dataset bias
- Analyze failure modes statistically
Deployment
Classical:
- Lightweight: CPU-only, low memory
- Deterministic: Same input → same output
- Portable: Easy to cross-compile for embedded
Learned:
- Heavy: GPU required for real-time
- Less deterministic: nondeterministic GPU kernels and quantization can shift outputs between runs or builds
- Complex: ONNX, TensorRT, model optimization
Case Studies
Zipline Autonomous Delivery
Offboard Perception (Cloud-side):
- Classical: Structure from Motion for 3D mapping
- Learned: Semantic segmentation for safe zones
- Hybrid: Geometric 3D + semantic understanding
Onboard Perception (Aircraft):
- Classical: Visual odometry, PnP localization
- Learned: Obstacle detection, landing zone validation
- Hybrid: Learned detections validated by geometry
Self-Driving Cars
Localization:
- Classical: Particle filter with HD maps, ICP
- Learned: Learned features for robust matching
- Hybrid: Classical provides metric pose, learned handles appearance changes
Perception:
- Learned: Object detection (pedestrians, vehicles)
- Classical: Sensor fusion (cameras, lidar, radar)
- Hybrid: Geometric tracking + semantic classification
Mobile Robotics (Astrobee)
Current System:
- Classical: BRISK features, bag-of-words, EPnP
- Why: Resource-constrained, safety-critical, no training data for ISS
Future Enhancements:
- Learned: Semantic understanding (airlock, handrails, equipment)
- Hybrid: Classical pose + learned scene understanding
Key Takeaways
1. Classical CV is not obsolete - essential for geometric reasoning, resource-constrained, and safety-critical systems
2. Deep learning excels at semantic tasks - object recognition, segmentation, learning complex appearance models
3. Hybrid systems are most robust - combine geometric constraints with learned features
4. Choose based on constraints:
   - Data availability (classical needs less)
   - Compute budget (classical is lightweight)
   - Interpretability (classical is transparent)
   - Task type (geometric vs semantic)
5. Production systems benefit from both:
   - Classical provides metric accuracy and consistency
   - Learned provides robustness and semantic understanding
   - Geometric constraints validate neural predictions
6. Understand the math - even when using learned methods, geometric reasoning validates outputs
For perception engineers:
- Master both paradigms
- Know when each applies
- Build hybrid systems for production
- Use geometry to keep networks honest
- Always validate with classical methods
References
- Hartley & Zisserman, "Multiple View Geometry in Computer Vision"
- Szeliski, "Computer Vision: Algorithms and Applications"
- LeCun et al., "Deep Learning" (Nature 2015)
- Kendall et al., "Geometric Loss Functions for Camera Pose Regression with Deep Learning" (CVPR 2017)
- DeTone et al., "SuperPoint: Self-Supervised Interest Point Detection and Description" (CVPR Workshops 2018)
- ORB-SLAM3: https://github.com/UZ-SLAMLab/ORB_SLAM3