Apple's take on immersive video for the Apple Vision Pro is a departure from typical fisheye projections, incorporating an interesting twist for the high-resolution headset.
A fisheye lens on an iPhone
Headsets providing VR and AR experiences often offer immersive video to users as well. This can take the form of Spatial Video, which provides a 3D effect, as well as 360-degree video that surrounds the viewer with content.
Apple has also released immersive video clips of its own, namely 180-degree 3D video at high resolutions, though it has been relatively slow to grow that content library so far.
To produce those videos, cameras with fisheye lenses are often used to capture extremely wide-angle shots, with footage from multiple cameras combined into a single video.
The Apple Vision Pro does, naturally, have the capability to view fisheye content. However, while the format is used to stream Apple TV+ videos, it is largely undocumented and unused by third parties.
According to research published on Sunday by Mike Swanson, Apple's immersive video projection takes a different approach from more conventional fisheye formatting.
Differing distortions
Translating a flat 2D video frame into a hemispherical or spherical projection map that surrounds a viewer at its center isn't easy, but it is a problem that has practically been solved thanks to distortion.
A typical 180-degree out-of-camera fisheye shot that encompasses everything within the frame appears as a circle, with black sections in the corners and along the edges representing areas where no visual data is available.
By segmenting the video in a specific way, it can be stretched to cover the user's 180-degree field of view, both horizontally and vertically, within a virtual sphere. This is the simplest way of accomplishing a projection, but it isn't data-efficient, since the black corner sections are part of the encoded video yet never appear in the final image.
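As a rough illustration of that inefficiency: the circle inscribed in a square frame covers only π/4 of its area, so roughly a fifth of the encoded pixels carry no image at all. A quick back-of-the-envelope check in Python, assuming an ideal 180-degree image circle exactly inscribed in a square frame:

```python
import math

# A fisheye circle inscribed in a square frame covers pi * r^2 of the
# square's (2r)^2 area, i.e. pi / 4 of the encoded pixels.
used = math.pi / 4
print(f"Image data: {used:.1%} of the frame, wasted: {1 - used:.1%}")
# Image data: 78.5% of the frame, wasted: 21.5%
```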
An alternative that eliminates the black sections exists in the form of a 180-degree equirectangular projection. Created in editing, it warps the image to fill the entire rectangular frame.
When distorted back for viewing, more pixels are devoted to the edges of the projection map, giving users more detail to actually see.
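As a sketch of how such a remap can work in general, the minimal Python/NumPy example below converts an ideal 180-degree equidistant fisheye frame (optical axis at the frame's center) into a 180-degree equirectangular image. It uses nearest-neighbour sampling for brevity and is not a reconstruction of any particular production pipeline:

```python
import numpy as np

def fisheye_to_equirect(fisheye, out_w, out_h):
    """Remap a 180-degree equidistant fisheye image (circle inscribed in a
    square frame) to a 180-degree equirectangular projection."""
    h, w = fisheye.shape[:2]
    cx, cy, radius = w / 2.0, h / 2.0, min(w, h) / 2.0

    # Longitude and latitude both span -90 to +90 degrees.
    lon = np.linspace(-np.pi / 2, np.pi / 2, out_w)
    lat = np.linspace(-np.pi / 2, np.pi / 2, out_h)
    lon, lat = np.meshgrid(lon, lat)

    # Direction vector on the unit sphere for each output pixel.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Equidistant fisheye: distance from the frame centre is proportional
    # to the angle between the ray and the optical axis.
    theta = np.arccos(np.clip(z, -1.0, 1.0))
    r = theta / (np.pi / 2) * radius
    phi = np.arctan2(y, x)

    src_x = np.clip(cx + r * np.cos(phi), 0, w - 1).astype(int)
    src_y = np.clip(cy + r * np.sin(phi), 0, h - 1).astype(int)
    return fisheye[src_y, src_x]
```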
To create stereoscopic video, or a full 360-degree video, each 180-degree field of view is often squeezed into half the available space, allowing both views to be included within the same frame.
In this scenario, which makes it harder to preserve detail in each 180-degree view, warping the image out to the corners to eliminate wasted pixels makes sense.
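A minimal sketch of that kind of side-by-side packing, assuming two already-projected 180-degree views held as NumPy arrays. The naive column decimation here stands in for the proper filtered resize a production encoder would use:

```python
import numpy as np

def pack_side_by_side(left, right):
    """Squeeze two 180-degree views to half width and place them in one frame."""
    # Drop every other column of each view; a real pipeline would
    # low-pass filter before downsampling to avoid aliasing.
    return np.concatenate([left[:, ::2], right[:, ::2]], axis=1)
```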
Reality distortion effects
Swanson initially had trouble determining what Apple changed in its fisheye projection treatment, but he did glean some details by monitoring the network traffic of his Apple Vision Pro.
From monitoring alone, he discovered the streams run at approximately 50Mbps, encoded in HDR10 at a resolution of 4,320 by 4,320 pixels per eye and at 90 frames per second. However, since immersive videos are DRM-protected, Swanson couldn't view the raw fisheye frames without breaking that protection.
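To put those figures in perspective, here is a back-of-the-envelope comparison of the raw pixel rate against the observed bitrate, assuming 10-bit 4:2:0 chroma subsampling (typical of HDR10 delivery, though Apple's exact pixel format isn't confirmed):

```python
# Stream parameters reported from Swanson's network monitoring.
width = height = 4_320      # pixels per eye
eyes = 2
fps = 90
bits_per_pixel = 15         # 10-bit 4:2:0: luma plus quarter-resolution chroma

raw_bps = width * height * eyes * fps * bits_per_pixel
print(f"Raw video: {raw_bps / 1e9:.1f} Gbps vs. ~0.05 Gbps streamed")
# Roughly 50 Gbps of raw pixels compressed into a ~50 Mbps stream.
```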
He was then alerted that the Apple TV+ intro clip featuring the logo uses the same fisheye encoding, but without DRM, allowing further analysis of Apple's format.
For a start, rather than packing both eyes or front and back 180-degree projections into a single video frame, Apple encodes stereo video using MV-HEVC. In effect, each 180-degree projection is stored as a separate video layer within the same file.
Examples of standard fisheye, equirectangular projection, and Apple's fisheye treatment [Mike Swanson]
More unusually, Apple encodes its fisheye content at a 45-degree rotation. The base of the "sphere" is located at the bottom left corner of the frame, with the top point at the opposite corner.
Swanson says this change makes sense, one good reason being that the diagonal is the longest dimension of the frame, so after rotation it can hold more horizontal pixels than an unrotated version could.
To viewers, the advantage is that the horizon line will have the most pixels available. Since this is where most people will be looking while watching a video, preserving detail in this section is crucial to the viewing experience.
The areas with the fewest pixels to work with shift from the middle of the top, bottom, and sides of a normal fisheye to the "corner" sections, which viewers are arguably less likely to look at.
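To quantify the gain, the diagonal of a square frame is √2 (about 1.41) times longer than its side, so a horizon laid along the diagonal has roughly 41 percent more pixels to work with. The sketch below uses the per-eye frame size observed in Swanson's captures and illustrates both the arithmetic and a simple 45-degree rotation by inverse mapping; the resampling is nearest-neighbour and purely illustrative, not Apple's actual pipeline:

```python
import numpy as np

# The diagonal of a square frame is sqrt(2) times its side, so a horizon
# placed along it spans ~41% more pixels than one along an edge.
side = 4_320                      # per-eye frame size from the observed streams
diagonal = side * np.sqrt(2)
print(f"{side} px edge vs {diagonal:.0f} px diagonal "
      f"(+{diagonal / side - 1:.0%} along the horizon)")

def rotate_45(src):
    """Rotate image content 45 degrees about the frame centre
    (nearest-neighbour inverse mapping, for illustration only)."""
    h, w = src.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    a = np.deg2rad(-45)  # rotate the sampling grid backwards by 45 degrees
    sx = cx + (xx - cx) * np.cos(a) - (yy - cy) * np.sin(a)
    sy = cy + (xx - cx) * np.sin(a) + (yy - cy) * np.cos(a)
    sx = np.clip(np.round(sx), 0, w - 1).astype(int)
    sy = np.clip(np.round(sy), 0, h - 1).astype(int)
    return src[sy, sx]
```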
Still some mysteries
Despite the additional information, Swanson hasn't cracked Apple's entire process, with some elements still eluding him.
One of these centers on a technique called Radial Stretching, where the image is stretched outward along each radial direction to the edge of a square frame, maximizing use of the entire frame for image data.
While Swanson has gotten close when processing a raw Apple fisheye frame, his result is "not 100% correct." He proposes that additional logic is at play along the diagonals to reduce the amount of radial stretching and distortion required, with his best guess being the use of simple beveled corners.
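For reference, radial stretching in its basic form can be sketched as follows: each output pixel's direction from the frame center determines how far the square's edge lies along that direction, and the sampling radius is scaled accordingly so the fisheye circle fills the whole square. This is a generic nearest-neighbour illustration, not a reconstruction of Apple's exact mapping, which is precisely the part Swanson hasn't pinned down:

```python
import numpy as np

def radial_stretch_to_square(fisheye):
    """Stretch a circular fisheye image outward so it fills its square frame."""
    h, w = fisheye.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]

    # Normalised output coordinates in [-1, 1].
    u = (xx - cx) / cx
    v = (yy - cy) / cy
    r = np.hypot(u, v)
    theta = np.arctan2(v, u)

    # Distance from the centre to the square's boundary in this direction.
    edge = 1.0 / np.maximum(np.abs(np.cos(theta)), np.abs(np.sin(theta)))

    # Inverse of stretching the circle out to the square: compress the
    # output radius back into the unit circle to find the source sample.
    src_r = np.clip(r / edge, 0.0, 1.0)
    sx = np.clip(cx + src_r * np.cos(theta) * cx, 0, w - 1).astype(int)
    sy = np.clip(cy + src_r * np.sin(theta) * cy, 0, h - 1).astype(int)
    return fisheye[sy, sx]
```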
He also suggests that Apple could be encoding to a specific geometry purely to add complexity, making it harder for others to use the same format.
Swanson is still left with questions about why Apple uses this type of projection format. Apple may see further benefits in doing so, but for now they remain a mystery.
Encoding video for the Apple Vision Pro is just one of the challenges filmmakers face. In March, Canon executives explained that none of the company's cameras can produce video at the resolution and refresh rate the headset requires.
If Apple is going to expand on how it treats video in the format, it may do so during WWDC 2024 in June.