A collection of M4 Mac minis can be combined into a cluster, but the benefits only really appear when you use high-end Macs.
While most people assume that getting a more powerful computer means buying a single expensive device, there are other ways to handle large amounts of number crunching. One concept that has been around for decades is to spread a project's processing across multiple computers.
Cluster computing revolves around sharing a calculation-heavy task between two or more processing units. By working on the task in parallel, the units can cut the overall processing time considerably.
In a video published to YouTube on Sunday, Alex Ziskind demonstrates a cluster computing setup using the M4 Mac mini. Using a collection of five Mac minis stacked in a plastic frame, he sets a task that is then distributed between them for processing.
While typical home cluster computing setups rely on Ethernet networking for communication between the nodes, Ziskind instead takes advantage of the speed of Thunderbolt by using Thunderbolt Bridge. This speeds up communication between the nodes considerably, and allows larger packets of data to be sent, reducing processing overhead.
Ethernet can run at 1Gb/s normally, or up to 10Gb/s if you paid for the Ethernet upgrade in some Mac models. The Thunderbolt Bridge method can instead run at 40Gb/s for Thunderbolt 4 ports, or 80Gb/s on Thunderbolt 5 in M4 Pro and M4 Max models when run bi-directionally.
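As a rough illustration of what those link speeds mean in practice, the sketch below compares how long it would take to move a chunk of data over each link type. The payload size and line rates are nominal assumptions for the arithmetic; real-world throughput will always be lower than the line rate due to protocol overhead.

```python
# Illustrative comparison only: time to move a 10 GB chunk of model data
# at each nominal line rate. Actual throughput is lower in practice.

LINKS_GBPS = {
    "1 Gb/s Ethernet": 1,
    "10 Gb/s Ethernet": 10,
    "Thunderbolt 4 bridge": 40,
    "Thunderbolt 5 bridge": 80,
}

payload_gigabits = 10 * 8  # 10 GB expressed in gigabits

for name, rate in LINKS_GBPS.items():
    seconds = payload_gigabits / rate
    print(f"{name:>22}: ~{seconds:.0f} s to move 10 GB")
```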
Better than GPU processing
Ziskind points out that there can be benefits to using Apple Silicon for cluster computing rather than a PC with a powerful graphics card.
For a start, processing on a GPU relies on having considerable amounts of video memory available. On a discrete graphics card, that could be as little as 8GB on the card itself, for example.
Apple's use of unified memory on Apple Silicon means the Mac's memory pool is shared by the CPU and the GPU. The Apple Silicon GPU therefore has access to a lot more memory, especially in Mac configurations with 32GB or more.
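Apple's MLX framework, discussed below, makes this easy to see in practice. As a minimal sketch (assuming an Apple Silicon Mac with the mlx package installed), the same arrays can be operated on by either the CPU or the GPU without any explicit copies; you only choose which device runs each operation.

```python
# Minimal sketch of unified memory in MLX (requires Apple Silicon and the
# mlx package). The same arrays are visible to both CPU and GPU, so there
# are no host-to-device copies -- only a choice of which device (stream)
# runs each operation.
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

c_gpu = mx.matmul(a, b, stream=mx.gpu)  # run on the GPU
c_cpu = mx.matmul(a, b, stream=mx.cpu)  # same data, run on the CPU

# Both results came from the same underlying buffers.
print(mx.allclose(c_gpu, c_cpu, atol=1e-2))
```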
Then there's power draw, which can be considerable for a graphics card. Higher power usage translates into a higher ongoing cost of operation.
By contrast, the Mac minis were found to use very little power, and a cluster of five Mac minis running at full capacity used less power than one high-performance graphics card.
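As a heavily simplified illustration of what that difference in draw means over time, the sketch below compares running costs. The wattages and electricity price are assumptions made purely for the arithmetic, not measurements from the video; plug in your own figures.

```python
# Back-of-the-envelope running-cost comparison. All figures below are
# illustrative assumptions, not measurements from the video.

MINI_WATTS = 40          # assumed per-node draw under load
CLUSTER_NODES = 5
GPU_WATTS = 450          # assumed draw of one high-end graphics card
RATE_PER_KWH = 0.30      # assumed electricity price in $/kWh
HOURS_PER_DAY = 8

def monthly_cost(watts: float) -> float:
    """Approximate monthly electricity cost for a constant load."""
    kwh = watts / 1000 * HOURS_PER_DAY * 30
    return kwh * RATE_PER_KWH

print(f"5x Mac mini cluster: ~${monthly_cost(MINI_WATTS * CLUSTER_NODES):.2f}/month")
print(f"Single high-end GPU: ~${monthly_cost(GPU_WATTS):.2f}/month")
```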
MLX, not Xgrid
To get the cluster running, Ziskind uses a project we've already talked about. It uses MLX, an Apple open-source project described as an "array framework designed for efficient and flexible machine learning research on Apple Silicon."
This is vaguely reminiscent of Xgrid, Apple's long-dead distributed computing solution, which could control multiple Macs for cluster computing. That system also allowed a Mac OS X Server to take advantage of workgroup Macs on a network, putting them to work when they weren't being used for anything else.
However, while Xgrid worked for well-funded, large-scale operations at the corporate or federal level, as AppleInsider's Mike Wuerthele can attest, it didn't translate well to smaller projects. Under perfect and specific conditions, with code written for it, it worked fantastically, but home-made clusters tended to perform poorly, and were sometimes slower than a single computer doing the work alone.
MLX changes that quite a bit, as it uses the standard MPI (Message Passing Interface) methodology for distributed computing. It can also be set up on a few Macs of varying performance, without necessarily shelling out for hundreds or thousands of machines.
Unlike Xgrid, MLX seems geared much more towards smaller clusters, which should suit the crowd that wanted to use Xgrid but kept running into trouble.
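As a rough sketch of what MLX's MPI-backed distributed primitives look like (assuming MLX and an MPI runtime such as Open MPI are installed, and that a hostfile listing each Mac's Thunderbolt Bridge address describes the cluster), a toy all-reduce across nodes might look like the following. This is only an illustration of the mechanism, not the LLM workload from the video.

```python
# Toy all-reduce with MLX's MPI-backed distributed primitives.
# Launched with something like:
#   mpirun -np 4 --hostfile hosts python this_script.py
# where the hostfile lists each Mac's Thunderbolt Bridge address.
import mlx.core as mx

world = mx.distributed.init()          # join the MPI job
rank, size = world.rank(), world.size()

# Each node contributes its own partial result...
local = mx.full((4,), float(rank))

# ...and all_sum combines them across every node in the cluster.
total = mx.distributed.all_sum(local)
mx.eval(total)

print(f"node {rank}/{size} sees combined result {total}")
```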
A useful cluster for the right reasons
While adding together the performance of multiple Mac minis in a cluster seems attractive, it's not something everyone can benefit from.
For a start, you're not going to see benefits for typical Mac uses, like running an app or playing a game. This is intended for processing massive data sets or for high-intensity workloads that can be parallelized.
This makes it ideal for work such as machine learning research involving large language models (LLMs), for example.
It's also not exactly easy for the typical Mac user to set up and use.
The performance gains also aren't necessarily that beneficial for the usual Mac owner. In testing, Ziskind found that simply buying an M4 Pro model offers more performance than two M4 units working together when running LLMs.
Where a cluster like this comes into play is when you need more performance than you can get from a single powerful Mac. If a model is too big to run on one Mac, because of memory constraints for instance, a cluster can offer more total memory for the model to use.
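As a rough rule of thumb (illustrative figures, not ones from the video), you can estimate whether a model fits in a single machine's memory from its parameter count and precision, plus some headroom for activations and caches.

```python
# Rough estimate of model memory footprint: parameter count times bytes
# per parameter, with headroom for activations and the KV cache.
# All numbers are illustrative.

def approx_model_gb(params_billions: float, bytes_per_param: float,
                    overhead: float = 1.2) -> float:
    """Approximate memory needed to hold a model, in GB."""
    return params_billions * bytes_per_param * overhead

for params in (8, 70, 405):
    for bits, label in ((16, "fp16"), (4, "4-bit")):
        gb = approx_model_gb(params, bits / 8)
        print(f"{params}B parameters @ {label}: ~{gb:.0f} GB")
```

Once an estimate like this exceeds the memory of even the largest single configuration, pooling the unified memory of several machines becomes the remaining option.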
Ziskind suggests that, at this stage, a high-end M4 Max Mac with vast amounts of memory is better than a cluster of lower-performance machines. Even so, if your requirements go beyond the highest single-Mac configuration, a cluster can help.
However, there are still some limitations to consider. While Thunderbolt is fast, Ziskind had to resort to using a Thunderbolt hub to connect the nodes to the host Mac, which reduced the available bandwidth.
Directly connecting the Macs solved this, but ran into another problem: the limited number of Thunderbolt ports available for linking multiple Macs together. This can make scaling the cluster problematic.
He also ran into thermal oddities, where the host Mac mini ran especially hot, while the other nodes stayed at more reasonable temperatures.
Ultimately, Ziskind found the Mac mini cluster tower experiment was interesting, but he doesn't intend to use it long-term. However, it's still relatively early days for the technology, and in cases where you use multiple high-end Macs for a sufficiently tough model, it can still work very well.
5 Comments
Not surprising at all. To make the best use of parallel processing you need to have an application or problem to solve that benefits from parallelization, i.e., very low demands for synchronization. Sync points and communication constructs like unmarshalling out-of-order packets can create a big hit on the potential speedup of parallel processing. When synchronization is needed on a single machine it's typically handled within the same physical memory space or shared memory, and at the thread or process level. When synchronization is required between multiple machines it involves the IO and communication subsystems, which are much slower, especially on a network like Ethernet.
I wonder if daisy chaining the Mac minis would work?
Sounds like an OS-level software problem that can be, or already is, solved by Apple in-house, and they probably aren't going to share the solution with the general public anytime soon. Apple is internally using something on those Mac Studio M2 and M4 Ultras.
But I hope whatever software solutions Apple created for running those Mac Studio M2 Ultras as part of Apple Intelligence make it to the public in a way that everyone can use.
It would be nice if this was built into the OS for sure, but I speak as someone who has in the past used distributed rendering, such as multiple Macs 3D rendering a single image (e.g. CrowdRender for Blender).