Sunday, August 17, 2025
Is It Possible to Get Depth from a Single Image? How Apple's Depth Pro Estimates Depth
Special thanks to those who inspired and supported this post:
Murat Gülhan, Muharrem Aytekin, and Hakan Yıldırım
Introduction: The Challenge of Depth from a Single Image
Depth estimation from a single image is one of the most fascinating challenges in computer vision. Unlike humans who can perceive depth through various visual cues, computers struggle to understand the three-dimensional world from a flat, two-dimensional image.
3D World Coordinates vs. 2D Image Coordinates
In computer vision, we often represent points in the 3D world and their projections onto a 2D image using matrices:
3D World Point (Homogeneous Coordinates)
A point in 3D space is written in homogeneous coordinates as a column vector:

Xw = [X, Y, Z, 1]ᵀ

Reference: Tola, Engin. (2005). Multi-view 3D Reconstruction of a Scene Containing Independently Moving Objects.

Where:
- X, Y, Z: 3D world coordinates
- 1: Homogeneous coordinate (added so that rotation, translation, and projection can be applied with a single matrix multiplication)
2D Image Point (Homogeneous Coordinates)
A point in the 2D image is written in homogeneous coordinates as a column vector:

x = [u, v, 1]ᵀ

Where:
- u, v: 2D image coordinates (pixels)
- 1: Homogeneous coordinate
Projection Matrix (P)
A camera projects a 3D point onto a 2D image plane. This is represented with a projection matrix:
P =
[ fx   0   cx   0 ]
[  0   fy  cy   0 ]
[  0   0    1   0 ]
Where:
- P: 3×4 projection matrix that combines intrinsic and extrinsic parameters
- fx, fy: Focal lengths in pixels
- cx, cy: Principal point coordinates
- Last column: all zeros when the world frame coincides with the camera frame (i.e., P = K × [I | 0]); in the general case this column carries the camera translation
The projection matrix P = K × [R | t] combines both the camera's intrinsic parameters (K) and extrinsic parameters (rotation and translation) into a single 3×4 matrix that directly maps 3D world points to 2D image coordinates.
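To make this concrete, here is a small NumPy sketch that builds K and [R | t] and projects a 3D world point to pixel coordinates. The focal length, principal point, and camera pose are made-up values for illustration only.

```python
import numpy as np

# Intrinsics K: focal lengths and principal point in pixels (illustrative values)
fx, fy, cx, cy = 1000.0, 1000.0, 640.0, 360.0
K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0,  0,  1]])

# Extrinsics [R | t]: camera at the world origin, looking down the +Z axis
R = np.eye(3)
t = np.zeros((3, 1))
P = K @ np.hstack([R, t])                # 3x4 projection matrix

X_w = np.array([0.5, 0.2, 4.0, 1.0])     # 3D world point in homogeneous coordinates
x = P @ X_w                              # homogeneous image point
u, v = x[0] / x[2], x[1] / x[2]          # divide by the third component
print(u, v)                              # -> 765.0, 410.0
```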
Understanding Camera Parameters
Intrinsic Parameters (Camera Internal Properties)
Focal Length (fₓ, fᵧ):
- What it is: The distance between the camera's optical center and the image sensor
- Physical meaning: Determines how much the camera "zooms in" - longer focal length = more zoom
- Units: A real lens has a physical focal length (e.g., 35 mm), but when mapping 3D world coordinates onto a digital image, the image plane is measured in pixels. The physical focal length is therefore converted from millimeters to pixels using the camera's sensor size and pixel pitch (the size of each pixel), so focal length is expressed in pixels, not millimeters, when working in image coordinates (see the conversion example after this list).
- Why two values: fₓ for horizontal, fᵧ for vertical (accounts for non-square pixels)
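As a quick illustration of that unit conversion, here is a hypothetical example (all numbers made up) converting a physical focal length in millimeters to pixels using the sensor width:

```python
# Convert a physical focal length (mm) to pixels using the sensor width and image width.
# Equivalent to dividing the focal length by the pixel pitch (sensor_width / image_width).
f_mm = 35.0             # physical focal length of the lens
sensor_width_mm = 36.0  # full-frame sensor width
image_width_px = 6000   # number of pixels across the sensor

pixel_pitch_mm = sensor_width_mm / image_width_px   # size of one pixel: 0.006 mm
f_px = f_mm / pixel_pitch_mm                        # ≈ 5833 pixels
print(f_px)
```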
Principal Point Coordinates (cₓ, cᵧ):

Reference: Yeong, De Jong & Velasco-Hernandez, Gustavo & Barry, John & Walsh, Joseph. (2021).
Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors. 21. 2140. 10.3390/s21062140.
- What it is: The intersection point of the optical axis with the image plane
- Physical meaning: The center of the image sensor (usually the image center)
- Units: Pixels from the top-left corner of the image
- Typical values: cₓ = image_width/2, cᵧ = image_height/2 (for centered sensors)
- Why important: Accounts for manufacturing imperfections where the optical center isn't exactly at the image center
Extrinsic Parameters (Camera Position and Orientation)
Rotation Matrix (R):
- 3×3 matrix describing how the camera is oriented in 3D space
- Properties: Orthogonal matrix (R^T × R = I), determinant = 1
- Can be expressed as Euler angles, quaternions, or rotation vectors
Translation Vector (t):
- 3×1 vector describing the camera's position in 3D space
Why Single-Image Depth is Challenging
Estimating depth from a single image is fundamentally challenging because, unlike stereo vision (which uses two images from slightly different viewpoints), a single image does not provide direct geometric cues about how far objects are from the camera. The 3D world is projected onto a 2D plane, causing a loss of depth information. The pinhole projection equations make this explicit:
u = fx × (Xc / Zc) + cx, v = fy × (Yc / Zc) + cy
Here, (Xc, Yc, Zc) are the 3D coordinates of a point in the camera coordinate system, (u, v) are the corresponding pixel coordinates in the image, fx and fy are the focal lengths in pixels, and (cx, cy) are the principal point offsets.
Why can't we get depth from a single image?
From a single image we only observe (u, v). The projection fixes the ratios Xc/Zc and Yc/Zc, but not Zc itself: every 3D point along the ray from the camera center through that pixel projects to the same (u, v). There are therefore infinitely many 3D points that could correspond to the same pixel, and the depth Zc cannot be determined directly.
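This ambiguity is easy to check numerically: scaling a 3D point along the viewing ray changes its depth but not its pixel. A small sketch with arbitrary camera values:

```python
import numpy as np

fx, fy, cx, cy = 1000.0, 1000.0, 640.0, 360.0

def project(Xc, Yc, Zc):
    # Pinhole projection: u = fx * Xc/Zc + cx, v = fy * Yc/Zc + cy
    return fx * Xc / Zc + cx, fy * Yc / Zc + cy

p_near = np.array([0.5, 0.2, 2.0])   # a point 2 m away
p_far = 3.0 * p_near                 # the same viewing ray, 6 m away

print(project(*p_near))  # (890.0, 460.0)
print(project(*p_far))   # (890.0, 460.0) -> identical pixel, different depths
```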
Stereo Cameras for More Accurate Depth
While single-image depth estimation is impressive, using two cameras—known as a stereo camera setup—can provide much more accurate depth information. Stereo vision mimics human binocular vision: two cameras are placed a fixed distance apart (the "baseline") and capture images of the same scene from slightly different viewpoints.
How can we estimate depth with two images (stereo)?
With two images taken from slightly different viewpoints (a stereo pair), we can find the same point in both images and measure its horizontal displacement, called disparity (d). For rectified stereo images, the depth Zc can then be computed directly from this disparity, as derived below.
Stereo Geometry: Depth Calculation
Consider two cameras: left (L) and right (R), separated by a distance called the baseline (B).
- The left camera center: OL
- The right camera center: OR, shifted by B along the x-axis.
For a 3D point:
- In the left camera: XL = X, ZL = Z
- In the right camera: XR = X - B, ZR = Z
Image projections:
- Left image: uL = f × (X / Z)
- Right image: uR = f × ((X - B) / Z)
Disparity (d): The horizontal difference between the same point in both images:
d = uL - uR = (f × B) / Z
Solving for depth (Z):
Z = (f × B) / d
This formula, Zc = (f × B) / d, summarizes how stereo cameras compute real metric depth,
where:
- f is the focal length (in pixels),
- B is the baseline (distance between the two camera centers),
- d is the disparity (difference in u coordinates between the left and right images).
By matching corresponding points and measuring disparity, we can triangulate the true depth of each point in the scene.
How Stereo Depth Works:

Reference: Luxonis Docs: Configuring Stereo Depth
- By comparing the positions of the same feature in both images (called "disparity"), we can triangulate the actual distance to that feature.
- The greater the disparity (difference in position), the closer the object is to the cameras.
- This geometric approach provides direct, metric depth measurements, unlike single-image methods that rely on learned cues and priors.
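Below is a minimal sketch of this pipeline: OpenCV's block matcher estimates a disparity map from a rectified stereo pair, and Z = (f × B) / d converts it to depth. The image paths, focal length, and baseline are placeholders you would replace with your own calibrated values.

```python
import cv2
import numpy as np

# Rectified left/right images (placeholder paths)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching to estimate disparity (in pixels)
stereo = cv2.StereoBM_create(numDisparities=128, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

f_px = 700.0   # focal length in pixels (placeholder, from calibration)
B_m = 0.12     # baseline in meters (placeholder)

valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * B_m / disparity[valid]   # Z = f * B / d, in meters
```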
Alternative: Depth from Multiple Frames (Structure from Motion)
While stereo cameras use two lenses to capture depth, you can also estimate depth using a single moving camera by capturing multiple frames from different viewpoints. This technique is known as Structure from Motion (SfM) or Multi-View Geometry.

Reference: Yilmaz, Ozgur & Karakus, Fatih. (2013). Stereo and kinect fusion for continuous 3D reconstruction and visual odometry. 115-118. 10.1109/ICECCO.2013.6718242
- Key idea: If you move a camera (for example, by walking with your phone or flying a drone), each frame provides a slightly different view of the scene. By tracking how features move between frames, you can triangulate their 3D positions—just like with two cameras, but using time-separated images instead of two simultaneous ones.
- Depth from motion: The more the camera moves (with known motion), the more accurate the depth estimation can be, as long as there is enough parallax (change in viewpoint).
How it works:
- Detect and track feature points across multiple frames.
- Estimate the camera's motion between frames (using visual odometry or IMU data).
- Triangulate the 3D position of each feature using the known camera poses and the observed 2D positions.
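As a minimal sketch of the triangulation step, the snippet below uses OpenCV's triangulatePoints with two assumed camera poses and a few hypothetical matched points; in a real SfM pipeline, the poses and matches would come from feature tracking and pose estimation.

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0, 640.0],
              [0, 1000.0, 360.0],
              [0, 0, 1.0]])

# Projection matrices for two frames (placeholder poses:
# frame 1 at the origin, frame 2 translated ~0.2 m along x)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

# Matched 2D feature points in each frame (2 x N arrays of pixel coordinates)
pts1 = np.array([[700.0, 500.0], [400.0, 380.0]]).T
pts2 = np.array([[650.0, 500.0], [360.0, 380.0]]).T

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4 x N homogeneous 3D points
X = (X_h[:3] / X_h[3]).T                          # convert to 3D coordinates
print(X)                                          # estimated 3D positions; depth is the Z component
```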
Applications:
- Used in 3D Mapping, AR/VR, and robotics.
Depth Estimation with Drones
Drones are a practical example of using multiple frames for depth estimation:
- As a drone flies, its onboard camera captures a sequence of images from different positions.
- By analyzing how features move across these frames, the drone can build a 3D map of its environment.
- This is crucial for tasks like obstacle avoidance, autonomous navigation, and mapping.
Advantages:
- No need for two cameras—just a single camera and motion.
- Works in environments where stereo setups are impractical.
Apple's Depth Pro Model: Zero-Shot Metric Monocular Depth Estimation
Note:
All information in this section is based on the following article:
Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S. R., & Koltun, V. (2025). Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. arXiv preprint arXiv:2410.02073. https://arxiv.org/abs/2410.02073
Apple's Depth Pro is a new foundation model for zero-shot metric monocular depth estimation. In other words, given a single RGB image, Depth Pro produces a high-resolution, per-pixel depth map in real-world units (meters)—without needing any camera metadata.
Key highlights:
- No camera metadata required: Depth Pro can infer the camera's field-of-view (focal length) directly from the image, so it works even when EXIF data is missing.
- High speed and resolution: On modern GPUs, it processes a 1536×1536 image (~2.3 MP) in about 0.3 seconds, producing sharp, high-frequency detail (e.g., fine hair and object edges).
- Metric depth (with caveats): Produces depth maps in real-world units (meters), but the absolute accuracy can vary depending on scene type, camera properties, and image content (see discussion below).
Note:
While Apple's Depth Pro model estimates per-pixel depth and focal length from a single image, its primary purpose is not to deliver perfectly accurate or 100% reliable metric depth. Instead, the model excels at understanding relative depth differences between objects in an image and segmenting objects from the background. This enables a wide range of multi-purpose applications across the Apple ecosystem, such as copying or isolating objects from photos in macOS and iPhone, AR experiences, and enhanced photo editing. The focus is on robust, generalizable depth and object separation, rather than absolute precision.
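For readers who want to try it, here is a minimal usage sketch. It follows the interface published in Apple's open-source ml-depth-pro repository; function names and return values may change between versions, so treat this as illustrative rather than authoritative.

```python
from PIL import Image
import depth_pro  # installed from https://github.com/apple/ml-depth-pro

# Load the model and its preprocessing transform
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an image; f_px is the focal length from EXIF if available, otherwise None
image, _, f_px = depth_pro.load_rgb("example.jpg")
prediction = model.infer(transform(image), f_px=f_px)

depth = prediction["depth"]                     # per-pixel depth in meters
focal_length_px = prediction["focallength_px"]  # estimated focal length in pixels
```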
Core Innovations
- Multi-scale Vision Transformer (ViT) Architecture:
Depth Pro uses an efficient, patch-based ViT backbone that fuses information at multiple scales for dense, high-resolution prediction.
Note:
A "Multi-scale Vision Transformer (ViT) Architecture" means that the model processes the input image at several different resolutions (scales) to capture both fine details and the overall scene structure. Here’s how it works in Depth Pro:
- Input and Patch Splitting:
  - The input image is resized to 1536×1536 pixels.
  - The image is split into small square patches (like tiles), and each patch is embedded and processed by a Vision Transformer (ViT).
- Multi-Scale Processing:
  - The image is analyzed at multiple scales:
    - High resolution: captures fine details (like hair and object edges).
    - Lower resolutions: capture the global layout and context of the scene.
  - Each scale is processed by a ViT patch encoder, extracting features at that particular level of detail.
- Global Context Branch:
  - A smaller version of the image (e.g., 384×384) is also processed by a separate ViT to provide a "global view" of the scene, helping the model understand the overall context.
- Fusion and Decoding:
  - Features from all scales (local and global) are merged by a decoder network (similar to U-Net or DPT architectures).
  - The decoder combines this information to reconstruct a dense, high-resolution depth map.
- Analogy:
  - Think of it like solving a jigsaw puzzle: you first look at the big picture to get the general layout (low resolution), then focus on the small pieces for fine details (high resolution). Combining both gives a complete, detailed result.
In summary:
The multi-scale ViT architecture allows Depth Pro to produce sharp, detailed depth maps by leveraging both local details and the global scene context—all from a single image.
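For intuition only, here is a heavily simplified PyTorch-style sketch of the multi-scale idea. The scale sizes, tiling, encoder modules, and fusion step are schematic placeholders, not the actual Depth Pro implementation.

```python
import torch
import torch.nn.functional as F

def multi_scale_features(image, patch_encoder, image_encoder):
    """Conceptual only. image: (B, 3, 1536, 1536); the encoders are stand-in
    ViT modules that map 384x384 inputs to feature maps."""
    features = []
    for size in (1536, 768, 384):                      # high to low resolution
        scaled = F.interpolate(image, size=(size, size),
                               mode="bilinear", align_corners=False)
        # Split each scale into 384x384 tiles and encode every tile
        tiles = F.unfold(scaled, kernel_size=384, stride=384)   # (B, 3*384*384, n_tiles)
        tiles = tiles.transpose(1, 2).reshape(-1, 3, 384, 384)
        features.append(patch_encoder(tiles))
    # Global context branch: the whole image at low resolution
    global_view = F.interpolate(image, size=(384, 384),
                                mode="bilinear", align_corners=False)
    features.append(image_encoder(global_view))
    # A DPT/U-Net-style decoder would fuse these multi-scale features into a depth map
    return features
```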
- Focal-Length Estimation Head:
A dedicated network head predicts the camera's focal length (field-of-view) from the image itself, enabling metric depth estimation without any external camera parameters.
Training Objectives
Depth Pro predicts canonical inverse depth (1/distance), which emphasizes nearby objects. The main loss function (LMAE, a mean absolute error on inverse depth) measures prediction accuracy while ignoring the worst 20% of pixels in real-world datasets.
Additional loss functions for sharp boundaries:
- LMAGE (Mean Absolute Gradient Error): penalizes errors in first-order depth derivatives, sharpening edges
- LMALE (Mean Absolute Laplace Error): penalizes errors in second-order depth derivatives
- LMSGE (Mean Squared Gradient Error): strongly penalizes large gradient errors

Reference: Inverse depth parametrization (Wikipedia)
This approach produces depth maps with sharp details and smooth transitions.
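A minimal sketch of the trimmed mean-absolute-error idea on inverse depth is shown below; it illustrates the concept described above and is not the exact loss implementation from the paper.

```python
import torch

def trimmed_mae_inverse_depth(pred_inv_depth, gt_depth, trim_ratio=0.2):
    """pred_inv_depth, gt_depth: tensors of shape (B, H, W); gt_depth in meters."""
    valid = gt_depth > 0                                   # ignore pixels without ground truth
    errors = (pred_inv_depth - 1.0 / gt_depth.clamp(min=1e-6)).abs()
    errors = errors[valid]
    # Keep only the best (1 - trim_ratio) fraction of pixels and average them
    k = int(errors.numel() * (1.0 - trim_ratio))
    kept, _ = torch.topk(errors, k, largest=False)
    return kept.mean()
```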
Training Curriculum
Depth Pro uses a two-stage approach:
Stage 1: Train on mixed datasets (real + synthetic) using the LMAE loss for robustness.
Stage 2: Train only on synthetic data using multiple loss functions (LMAE, LMAGE, LMALE, LMSGE) for sharp boundaries.
Note:
Why synthetic data last? Unlike common practice, Depth Pro trains on synthetic data in the final stage because real-world datasets often contain missing areas and blurry boundaries. Synthetic data provides pixel-perfect depth information for learning sharp object boundaries.
Evaluation Metrics for Sharp Boundaries
Uses existing annotations (segmentation, saliency, matting) to measure boundary sharpness:
- Compares neighboring pixels for depth differences (5-25% thresholds)
- Calculates precision/recall vs. ground truth
- No manual annotation needed and scale-invariant
This ensures sharp, accurate boundaries for real-world applications.
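A simplified illustration of the boundary-metric idea: mark a pixel as a depth boundary when a neighboring pixel differs by more than a relative threshold t (e.g., 5–25%), then compare predicted and ground-truth boundary maps with precision/recall. The exact neighborhood handling and weighting in the paper differ; this is only a sketch.

```python
import numpy as np

def boundary_map(depth, t=0.1):
    """depth: (H, W) array in meters. Returns a boolean map of depth boundaries."""
    right = depth[:, 1:] / np.maximum(depth[:, :-1], 1e-6)
    down = depth[1:, :] / np.maximum(depth[:-1, :], 1e-6)
    b = np.zeros(depth.shape, dtype=bool)
    b[:, :-1] |= (right > 1 + t) | (right < 1 - t)   # horizontal neighbors
    b[:-1, :] |= (down > 1 + t) | (down < 1 - t)     # vertical neighbors
    return b

def boundary_precision_recall(pred_depth, gt_depth, t=0.1):
    pb, gb = boundary_map(pred_depth, t), boundary_map(gt_depth, t)
    precision = (pb & gb).sum() / max(pb.sum(), 1)
    recall = (pb & gb).sum() / max(gb.sum(), 1)
    return precision, recall
```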
Focal Length Estimation
How does Depth Pro handle missing camera information? Since many images lack accurate EXIF metadata (camera information), Depth Pro includes a focal length estimation head that predicts the camera's field of view directly from the image content.
How It Works
- A small convolutional network analyzes the image to predict the horizontal angular field-of-view
- Uses frozen features from the depth estimation network plus task-specific features from a separate ViT encoder
- Trained separately from the main depth network to avoid balancing multiple objectives
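Once the horizontal field-of-view is predicted, converting it to a focal length in pixels is standard pinhole geometry. The helper below is illustrative only (the names are not Depth Pro's API):

```python
import math

def focal_length_px(horizontal_fov_deg: float, image_width_px: int) -> float:
    """f_px = (W / 2) / tan(FOV_h / 2)"""
    return 0.5 * image_width_px / math.tan(math.radians(horizontal_fov_deg) / 2.0)

# Example: a 60-degree horizontal FOV on a 1536-pixel-wide image
print(focal_length_px(60.0, 1536))  # ≈ 1330 pixels
```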
Drag and drop your own images to estimate depth using Apple's Depth Pro model.
Try it on Hugging Face.



