Machine learning researchers using Ollama will enjoy a speed boost to LLM processing, as the open-source tool now uses MLX on Apple Silicon to fully take advantage of unified memory.
Anyone working with large language models (LLMs) wants results as quickly as possible. One technique is to run multiple Macs as a cluster to increase the processing power on hand, but a framework made by Apple offers another source of acceleration.
That is the route taken by the developers of Ollama, the open-source tool for managing and running models. In a March 30 update, the team announced a preview version of the tool for Apple Silicon that takes advantage of MLX.
MLX is an open-source machine learning framework from Apple, built specifically for Apple Silicon. Among its capabilities is distributed computation, which spreads processing across multiple Macs. While this can run over Ethernet, MLX also works over Thunderbolt, making massive amounts of bandwidth available for the communication between devices.
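As a rough illustration of the distributed side, here is a minimal MLX sketch, assuming the `mlx` Python package is installed on each Mac and the script is started with MLX's `mlx.launch` helper (the hostnames below are placeholders):

```python
import mlx.core as mx

# Join the distributed group; the peers are supplied at launch time, e.g.
# `mlx.launch --hosts mac1,mac2 this_script.py` (hostnames are placeholders)
group = mx.distributed.init()

# Each Mac computes a partial result locally...
partial = mx.ones((1024,)) * group.rank()

# ...and all_sum combines those partials across every machine in the group
total = mx.distributed.all_sum(partial)
mx.eval(total)  # MLX is lazy; eval forces the computation and communication

print(f"rank {group.rank()} of {group.size()}: total[0] = {total[0].item()}")
```

The code itself is transport-agnostic; whether the traffic moves over Ethernet or Thunderbolt is determined by how the hosts are connected and launched.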
Unified memory boost
The key for Ollama is that MLX also leverages Apple Silicon's unified memory. This is the shared memory system used by the CPU and GPU, meaning there is one large pool of memory in use instead of separate pools holding duplicated data.
For Ollama, the switch to Apple's machine learning framework means model execution can take full advantage of that architecture.
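To make the idea concrete, here is a minimal MLX sketch, assuming the `mlx` Python package is installed. Because MLX arrays live in unified memory, an operation can run on the GPU and a follow-up on the CPU without any explicit copy between them:

```python
import mlx.core as mx

# Arrays are allocated once in unified memory; no host-to-device copy is needed
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# Run the matrix multiply on the GPU, then a reduction on the CPU,
# with both operations reading the same underlying buffers
c = mx.matmul(a, b, stream=mx.gpu)
d = mx.sum(c, stream=mx.cpu)

mx.eval(d)  # MLX is lazy; eval triggers the actual work
print(d.item())
```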
This increases speed on most Apple Silicon hardware, but it is especially useful on the M5 generation of chips, where Ollama can also leverage the Neural Accelerators built into each GPU core to get more processing done.
Both the time to first token (TTFT) and the generation speed, measured in tokens per second, improve considerably when the GPU Neural Accelerators are put to work.
In testing, the prefill performance of Ollama 0.18 without MLX was 1,154 tokens per second, but this shot up to 1,810 tokens per second for Ollama 0.19 with MLX. Similarly, decode performance rose from 58 tokens per second in version 0.18 to 112 in version 0.19 with MLX.
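Anyone wanting to check those figures on their own hardware can pull them from the Ollama server itself, which reports token counts and timings with each response. Here is a minimal sketch using the official Python client; the model tag is a placeholder, so substitute one from `ollama list`:

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

# The model tag is a placeholder; substitute one from `ollama list`
resp = ollama.generate(model="some-model", prompt="Explain unified memory.")

# Ollama reports durations in nanoseconds alongside the token counts
prefill = resp.prompt_eval_count / (resp.prompt_eval_duration / 1e9)
decode = resp.eval_count / (resp.eval_duration / 1e9)
print(f"prefill: {prefill:.0f} tok/s, decode: {decode:.0f} tok/s")
```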
As part of the same update, Ollama has upgraded its cache for efficiency, with lower memory utilization and intelligent checkpointing. There's also support for Nvidia's NVFP4 format, which can maintain model accuracy while reducing the memory bandwidth a model consumes.
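For a sense of how a 4-bit format like NVFP4 saves bandwidth, here is an illustrative sketch of block-scaled 4-bit float quantization. This shows the general idea, not NVIDIA's actual implementation: each small block of values shares one scale, and every value snaps to the nearest point on a tiny 4-bit grid.

```python
import numpy as np

# Positive values representable by a 4-bit E2M1 float
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # full signed grid, 15 points

def quantize_block(block: np.ndarray) -> np.ndarray:
    """Round one block of values to the nearest scaled 4-bit grid point."""
    scale = max(np.abs(block).max() / 6.0, 1e-12)  # one shared scale per block
    nearest = np.abs(block[:, None] - scale * GRID[None, :]).argmin(axis=1)
    return scale * GRID[nearest]  # dequantized values, for comparing the error

x = np.random.randn(16).astype(np.float32)  # small blocks share a scale
print(np.abs(x - quantize_block(x)).max())  # worst-case rounding error
```

Storing 4 bits per weight instead of 16 cuts the bytes moved per token to roughly a quarter, which is where the bandwidth saving comes from.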
High-spec only
For end users who work with LLMs, this update can be a crucial improvement, even on a single Mac, let alone a cluster of them. Users of OpenClaw, an agent that runs locally on the Mac, could see its processing speed up considerably.
However, not everyone will benefit from the change immediately. It is offered as a preview rather than a general release, and it has specific requirements too.
The preview release of Ollama 0.19 can accelerate one model, Qwen3.5-35B-A3B, which has its sampling parameters configured for coding tasks.
Ollama warns that it should only be used on a Mac with more than 32GB of unified memory.
Support is limited to that model for the moment, but the team is working on adding others, along with a simpler way to import custom models in the future.
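For anyone with qualifying hardware, trying the preview should look like any other Ollama workflow. Here is a minimal sketch with the Python client, assuming the preview build is installed and serving; the model tag is a guess based on the announced name, so verify it with `ollama list`:

```python
import ollama  # assumes the Ollama 0.19 MLX preview is installed and running

# The tag below is a guess based on the model's name; verify with `ollama list`
stream = ollama.chat(
    model="qwen3.5-35b-a3b",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
```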