Apple explains the GPU enhancements in A17 Pro and M3

Apps and games that utilize the Metal API target specific functions of Apple Silicon GPUs, which get even better with significant improvements to parallel processes in M3 and A17 Pro. Here's how it works.

Apple released a developer talk on these new Apple Silicon GPU features that details exactly how the improved results are achieved. The video is highly technical, but it provides enough to explain the changes in basic terms.

Developers building apps with the Metal API don't need to make any changes to their apps to see performance improvements with M3 and A17 Pro. These chipsets utilize Dynamic Caching, hardware-accelerated ray tracing, and hardware-accelerated mesh shading to make the GPU more performant than ever.

Dynamic shader core memory

Dynamic Caching is made possible thanks to a next-generation shader core. When utilizing the latest GPU cores in A17 Pro and M3, these shaders can run in parallel much more efficiently than before, massively improving output performance.

Dotted lines represent wasted register memory

Normally, the GPU allocates register memory for an action based on the most register-hungry point within it, and holds that allocation for the action's entire duration. Therefore, if one part of an action requires significantly more register memory than the rest, the action reserves that peak amount the whole time, even while most of it sits unused.

Dynamic Caching allows the GPU to allocate exactly the right amount of register memory for every action it is taking. The previously unavailable register memory is freed, allowing for many more shader tasks to occur in parallel.
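To make the difference concrete, here is a toy Python model of the idea. It is only a sketch under stated assumptions: the register file size and the per-phase register needs are made-up numbers, the function names are hypothetical, and real hardware scheduling is far more involved.

```python
# Toy model: how many copies of a shader fit in the register file at once.
# All numbers are hypothetical and chosen only for illustration.

def static_occupancy(register_file, phase_needs):
    """Each copy reserves its peak register need for its whole lifetime."""
    return register_file // max(phase_needs)

def dynamic_occupancy(register_file, phase_needs):
    """Each copy holds only what its current phase needs, so the number
    of copies that fit depends on which phase they are in."""
    return [register_file // need for need in phase_needs]

def wasted_per_copy(phase_needs):
    """Registers reserved but unused in each phase under static allocation."""
    peak = max(phase_needs)
    return [peak - need for need in phase_needs]

REGISTER_FILE = 1024      # hypothetical register count
PHASES = [16, 64, 16]     # hypothetical needs: light, heavy, light

print(static_occupancy(REGISTER_FILE, PHASES))   # 16
print(dynamic_occupancy(REGISTER_FILE, PHASES))  # [64, 16, 64]
print(wasted_per_copy(PHASES))                   # [48, 0, 48]
```

In this made-up case, the heavy middle phase caps static allocation at 16 copies, while per-phase allocation fits four times as many during the light phases.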

Flexible on-chip memory

Previously, on-chip memory would have fixed memory allocation for register, threadgroup, and tile memory with a buffer cache. That meant significant portions of memory went unused if an action utilized more of one type of memory than another.

The entire on-chip memory can be used as cache

With flexible on-chip memory, all of the on-chip memory is a cache that can be utilized for any memory type. So, an action that relies heavily on threadgroup memory can use the entire span of the on-chip memory, and can even overflow into main memory.

The shader core dynamically adjusts on-chip memory occupancy to maximize performance. That means developers can spend less time optimizing occupancy.
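The fixed versus flexible split can be sketched with a small Python comparison. This is an illustrative model only: the capacities, the workload, and the function names are assumptions made up for the example, not figures from Apple.

```python
# Toy comparison of fixed partitions versus one shared on-chip pool.
# Sizes are in arbitrary units and are hypothetical.

def fixed_partition(demand, partitions):
    """Each memory type can only use its own partition."""
    on_chip = {k: min(demand[k], partitions[k]) for k in demand}
    spilled = sum(demand[k] - on_chip[k] for k in demand)
    unused = sum(partitions[k] - on_chip[k] for k in partitions)
    return spilled, unused

def flexible_pool(demand, capacity):
    """Any memory type can use any part of the shared pool."""
    total = sum(demand.values())
    spilled = max(0, total - capacity)
    unused = max(0, capacity - total)
    return spilled, unused

# A hypothetical workload that leans heavily on threadgroup memory.
demand = {"register": 8, "threadgroup": 80, "tile": 0}
partitions = {"register": 32, "threadgroup": 32, "tile": 32}

print(fixed_partition(demand, partitions))              # (48, 56)
print(flexible_pool(demand, sum(partitions.values())))  # (0, 8)
```

With the same total capacity, the fixed layout spills 48 units while leaving 56 idle, whereas the shared pool holds the whole workload on-chip.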

Shader core's high-performance ALU pipelines

Apple recommends developers execute FP16 math in their programs, but the high-performance ALUs execute different combinations of integer, FP32, and FP16 in parallel. Instructions are executed across different actions performed in parallel, which means ALU utilization is improved with higher occupancy.

Increased parallel operations with high-performance ALU pipelines

Basically, if different actions contain FP32 or FP16 instructions that would otherwise be executed at different points in time, their execution can be overlapped to increase parallelism.
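As a rough illustration of overlapping, the Python sketch below counts cycles for two instruction streams. It assumes, as a simplification that is not a description of Apple's actual scheduler, one pipeline per instruction type (integer, FP32, FP16), each able to start one instruction per cycle, with each stream issuing in order. The stream contents and function names are hypothetical.

```python
# Toy cycle count: instructions from different streams may share a cycle
# as long as they use different pipelines. Purely illustrative.

def cycles_serial(streams):
    """One instruction per cycle, streams run back to back."""
    return sum(len(s) for s in streams)

def cycles_overlapped(streams):
    """Each cycle, every stream may issue its next instruction if that
    instruction's pipeline has not already been claimed this cycle."""
    queues = [list(s) for s in streams]
    cycles = 0
    while any(queues):
        busy = set()
        for q in queues:
            if q and q[0] not in busy:
                busy.add(q.pop(0))
        cycles += 1
    return cycles

streams = [
    ["fp32", "fp16", "fp32"],  # hypothetical action A
    ["fp16", "fp32", "int"],   # hypothetical action B
]

print(cycles_serial(streams))      # 6
print(cycles_overlapped(streams))  # 3
```

The more independent streams are available to pick from, the more often every pipeline has work, which is why higher occupancy improves ALU utilization in this model.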

Hardware-accelerated graphics pipelines

Hardware-accelerated ray tracing makes the process much faster by moving the vital intersection calculations out of the GPU function and into dedicated hardware. Because that hardware takes care of a portion of the calculations, more operations can occur in parallel.

Hardware-acceleration takes over from on-chip processes
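For a sense of what an intersection calculation looks like, here is a generic ray versus axis-aligned box test (the standard "slab" method) in Python. It is a textbook example of the kind of test a ray tracer repeats many times per ray, not a description of what Apple's hardware computes; the function name and the sample values are made up for illustration.

```python
# Generic slab test: does a ray hit an axis-aligned bounding box?
# Illustrative only; not Apple's implementation.

def ray_hits_box(origin, direction, box_min, box_max):
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Narrow the interval of ray parameters that lie inside every slab so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True

box_min, box_max = (2.0, -1.0, -1.0), (3.0, 1.0, 1.0)

print(ray_hits_box((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), box_min, box_max))  # True
print(ray_hits_box((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), box_min, box_max))  # False
```

A scene can require a very large number of such tests per frame, which is why handing them to a dedicated unit frees the rest of the GPU for other work.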

Hardware-accelerated mesh shading utilizes a similar method. It takes the middle of the geometric calculations pipeline and passes it to a dedicated unit, thus allowing more parallel operations.


These are complex systems that can't be broken down into a few paragraphs. We recommend watching the video to get all the details with one thing in mind: A17 Pro and M3 focus on computing parallelism to speed up tasks.

The M3 is available in the MacBook Pro and 24-inch iMac. The A17 Pro is available in the iPhone 15 Pro.

11 Comments

chasm 11 Years · 3719 comments

About 1 year ago

Having watched the video and having had my head duly spun by the technical detail, I just want to say thanks for this summary. Your bottom line is 100 percent correct: being able to a) use the memory more efficiently and b) do more things in parallel adds up to waaaay faster and better graphics performance than perhaps any integrated GPU (cough HEY INTEL cough) has ever done, ever.

Discrete GPUs will still rule the roost at the end of the day, but Apple has designed all this to meet the needs of Apple buyers, not hardcore all-day-and-night PC gamers. For typical user needs AND many game titles, this will bring a big boost in performance, but eventually Apple is going to have to allow third-party GPU compatibility for the minority of Mac users who actually do seriously need more.

9secondkox2 9 Years · 3324 comments

About 1 year ago

chasm said:

Having watched the video and having had my head duly spun by the technical detail, I just want to say thanks for this summary. Your bottom line is 100 percent correct: being able to a) use the memory more efficiently and b) do more things in parallel adds up to waaaay faster and better graphics performance than perhaps any integrated GPU (cough HEY INTEL cough) has ever done, ever.
Discrete GPUs will still rule the roost at the end of the day, but Apple has designed all this to meet the needs of Apple buyers, not hardcore all-day-and-night PC gamers. For typical user needs AND many game titles, this will bring a big boost in performance, but eventually Apple is going to have to allow third-party GPU compatibility for the minority of Mac users who actually do seriously need more.

Not necessarily. Discrete GPUs aren’t better because they’re discrete. Indeed, M3 Max beats quite a few discrete GPUs.

The real answer is Apple beefing up their SoC GPU cores to perform on the level of big hitters like Nvidia’s RTX 4090.

The difficulty is doing so in an efficient way, as Nvidia is basically selling thermonuclear reactors with huge power and cooling requirements to push performance. Meanwhile, Apple is pushing RTX 3090-esque performance with M3 Max - and doing so efficiently - where the entire SoC uses a fraction of the power Nvidia does with only the GPU.

Apple will continue to make strides in the GPU arena and will put the pressure on the companies who are freewheeling with power and thermals right now.

Apple is already upping the GHz and adding core counts. I can see them significantly adding cores to future M-series chips and even offering an Ultra/Extreme with a whole new layer of just GPU cores surrounding the SoC with new interconnections. The sky’s the limit with what they will do. But two things seem to be set in stone: 1) Apple won’t look to third parties. And 2) Apple won’t sacrifice efficiency to push things forward. They’ll design it properly to perform at the architecture level.

Although I’d love to see Apple design a desktop-only chip that is allowed to be a power glutton and just smash the others at their own game.


swat671 10 Years · 162 comments

About 1 year ago

As great as this is, in order for Apple to be taken seriously by the pro crowd, they need to bring back the Mac Pro's ability to have extra PCI slots that work with whatever people want to put in them, not just what Apple says is OK. They also need to bring back the ability to use Thunderbolt and external GPUs. Adding in some sort of ability to use Windows would be a plus, either with VM support or some new version of Boot Camp. Until THAT happens, I'm staying with my Intel Mac.


sflocal 17 Years · 6156 comments

About 1 year ago

There was a video that I saw a while back discussing the strange place the new Mac Pro resides. The PCIe slot argument was moot. The devices that those “pros” used to have a need for PCI slots have been replaced with external Thunderbolt interfaces. So the Mac Mini Ultra is basically the new Mac Pro.