Benchmarking Ollama vs LM Studio vs MLX
As I’m continuing to deepen my understanding about how AI works, I’m playing around more and more with using local LLMs.
I don’t want to always rely on Claude being up or having tokens available. Sometimes I just want privacy. So running my
own model that I can access anytime and have running 24/7 is pretty appealing. In the past I’ve used Ollama pretty
lightly. It’s the first project I was aware of that had an easy UI to interact with a locally hosted model. It was clear
pretty early on that my machines were pretty underpowered to do anything significant, but big improvements in smaller
models and Apple’s silicon ecosystem have opened the door to more regular usage. But is Ollama the right product for the
gig? I’ve had people telling me about LM Studio for a while, and it came up that it uses Apple’s native-ish mlx for
serving up the models. Ollama
is moving that way,
but for now it’s in preview and very limited on models. I’ve also been interacting with mlx directly as I’m playing
around with fine-tuning, so should I just use it directly?
Let’s find out.
Tonight I did a short benchmark test on the three options. I’m tapping into a Mac M2 Studio with 64 gigs of
RAM at the Initial Capacity office in Boulder via Cloudflare Tunnel. We’ve been playing
around to see what it can really do, but as a coding companion it’s pretty laggy with qwen-coder-next. Before trying to
figure out the right model, let’s check the throughput on a base level. We can get
that assumption out of the way and then focus on the model choice.
The first thing I’ve done is learn a lot about the different models. What Ollama has vs what Hugging Face has and how
they compare is genuinely complicated. Why are there so many versions on Hugging Face? What’s an MLX model vs GGUF? And
what is nvfp4? (MLX models are optimized for Apple’s MLX framework, GGUF is GGML Universal Format, nvfp4 is NVIDIA’s new
format that does better on their hardware). It was tricky to find a format we could use with all three products because Ollama, in the midst of switching
to MLX, had a regression and MLX doesn’t work on the latest versions. So the current Ollama pull default for Qwen3.6 is
GGUF. This is the main difference in the testing - slightly different quantization formats for our test as LM Studio and
mlx_lm both use a Hugging Face MLX model. Still worth the comparison, in my opinion.
I ran a benchmark script that did a couple light warm-up calls, then tasked it with 4 different coding prompts 3x each for a total of 12 measurable calls. All 12 calls hit my 1536-token cap during the reasoning phase - the model never even got to the final output. Doesn’t matter as we still see the tokens per second. And the results were pretty staggering.
Headline Numbers
| Backend | Model / Quant | Engine | tok/s median | tok/s stdev | Total s median | TTFT median |
|---|---|---|---|---|---|---|
| Ollama 0.20.3 | qwen3.6:latest / Q4_K_M GGUF | llama.cpp/Metal | 33.1 | 1.8 | 40.3 s | 360 ms |
| LM Studio | mlx-community/qwen3.6-35b-a3b / MLX 4-bit | MLX native | 80.9 | 0.8 | 19.1 s | 260 ms |
| mlx_lm.server | mlx-community/Qwen3.6-35B-A3B-4bit / MLX 4-bit | MLX native | 85.0 | 1.0 | 18.1 s | 157 ms |
Yep, Ollama is slow - at least on this machine with the current release. 33 tokens per second vs 80+ is a big gap, and Time To First Token tells the same story: it’s more than twice as fast on MLX. Ollama’s move to the MLX engine may well close the gap a bit - and their internal testing seems to show a nearly 2x improvement, but for the stable release today, MLX is the clear choice.
So what does this look like for our team? Two things.
First, I’m going to swap the M2 endpoint from Ollama to mlx_lm.server. That’s a roughly 2.5x throughput win with zero
code changes. LM Studio’s LM Link (their Tailscale-based remote access tool) is worth a look if we want a proper UI
instead of curling an API endpoint, though we would be giving up a few tokens per second.
Second, with the engine question answered, we can focus on the model. Qwen3.6 is the obvious next try - the benchmarks
on agentic coding are strong, and qwen-coder-next was half-broken for us anyway. Whether it holds up as a daily driver
is the next thing to figure out.
Get new posts in your inbox.