WebGPU feature detection was not enough to run small LLMs on phones

Four test environments where the browser exposed WebGPU, and what the measurements say.

2026-06-18

I wanted to run a small language model in the browser, on the phone, without sending inference to a server. The feature detection is easy. You ask for a WebGPU adapter, you read its limits, and if the buffer sizes are large enough you assume it will run. Every browser environment I tested exposed WebGPU. As a first-pass check, the reported limits looked large enough for the model weights.
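The check described above can be sketched roughly as follows. This is a minimal sketch, not the code used for the measurements: the interface names, `probeWebGpu`, `limitsLookSufficient`, and the `largestBufferBytes` threshold are all hypothetical, and the small structural types stand in for the real WebGPU types so the snippet compiles on its own. In a browser, `gpu` would be `navigator.gpu`.

```typescript
// Minimal structural stand-ins for the WebGPU types (assumption: only the
// fields used here), so the sketch compiles without @webgpu/types.
interface AdapterLike {
  features: { has(name: string): boolean };
  limits: { maxBufferSize: number; maxStorageBufferBindingSize: number };
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

// Hypothetical report shape: what the adapter claims, nothing more.
interface WebGpuReport {
  webgpu: boolean;
  f16: boolean;
  maxBufferSize: number;
  maxStorageBufferBindingSize: number;
}

// Ask for an adapter and record its claims. Pass navigator.gpu in a browser.
async function probeWebGpu(gpu: GpuLike | undefined): Promise<WebGpuReport> {
  const empty: WebGpuReport = {
    webgpu: false,
    f16: false,
    maxBufferSize: 0,
    maxStorageBufferBindingSize: 0,
  };
  if (!gpu) return empty; // browser does not expose WebGPU at all
  const adapter = await gpu.requestAdapter();
  if (!adapter) return empty; // exposed, but no usable adapter
  return {
    webgpu: true,
    f16: adapter.features.has("shader-f16"),
    maxBufferSize: adapter.limits.maxBufferSize,
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
  };
}

// First-pass check only: compares a reported limit against a threshold the
// caller picks (hypothetical). It says nothing about whether a run completes.
function limitsLookSufficient(report: WebGpuReport, largestBufferBytes: number): boolean {
  return report.webgpu && report.maxBufferSize >= largestBufferBytes;
}
```

A check like this returns true for every environment in this post, including the ones where no run finished.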

Then I ran the models. What a device reports about its GPU and what an inference run actually completes are two different things. Here are four cases from my own measurements.

All numbers below come from the raw measurement files in the repository. The models are Llama-3.2-1B-Instruct, Qwen2.5-1.5B-Instruct, and Qwen2.5-0.5B-Instruct, quantized to roughly 4-bit. The engines are WebLLM 0.2.84, transformers.js 4.2.0, and wllama 3.4.1. Each run started from a cold cache and used either a short prompt of about 50 tokens or a long prompt of about 1200 tokens.

1. Safari on iPhone reloads the page during generation

The device is an iPhone 11 Pro Max on iOS 18.7, Safari 26.5. It reports webgpu: true, an Apple adapter with f16 support, and a maxBufferSize of 715827880 bytes. The reported maxBufferSize was large enough for the model weights, at least as a first-pass check.

None of the runs completed. Qwen2.5-1.5B through WebLLM downloaded all 728 MB and then failed at init with TypeError: Load failed. Llama-3.2-1B through WebLLM got further: it reached generation on the WebGPU backend, and then the page reloaded mid-generation, with no JavaScript-visible exception and no out-of-memory error I could catch. The smaller Qwen2.5-0.5B through wllama also ended in a reload, this time at init: the tab reloaded before the engine ever became ready. Across every engine and model on this device, zero runs completed. The failure mode is not an error you handle. It is the tab restarting under you.
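Because nothing is thrown, a reload like this can only be noticed after the fact. One possible approach is a breadcrumb: record the current phase in storage before starting it, clear it on success, and look for a leftover value on the next page load. The sketch below is an illustration, not something these measurements used. The key name and function names are hypothetical, and it assumes the storage (for example `sessionStorage`) survives the reload, which is not verified here for this Safari failure.

```typescript
// Structural subset of the Web Storage API, so the sketch can be exercised
// with an in-memory object as well as window.sessionStorage.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type Phase = "download" | "init" | "generate";

const PHASE_KEY = "llm-run-phase"; // hypothetical key name

// Call immediately before entering each phase.
function markPhase(store: StorageLike, phase: Phase): void {
  store.setItem(PHASE_KEY, phase);
}

// Call once the run has completed normally.
function markDone(store: StorageLike): void {
  store.removeItem(PHASE_KEY);
}

// Call on page load: a leftover value means the previous attempt on this
// tab ended without reaching markDone, e.g. because the page was reloaded.
function interruptedPhase(store: StorageLike): string | null {
  return store.getItem(PHASE_KEY);
}
```

On load, a non-null result from `interruptedPhase(sessionStorage)` would indicate which phase the previous attempt was in when the page went away, which is the only signal available when no exception reaches the page.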

2. LINE's in-app browser exposes WebGPU but the run never completes

The device is a Pixel 8a with 8 GB of memory, with the page opened inside the LINE in-app browser on Android 16. It reports webgpu: true, an Arm Valhall adapter with f16, and a maxBufferSize of 4294967292 bytes, which is 4 GiB minus 4 bytes and effectively the full 4 GiB ceiling. Nothing in the adapter limits distinguished it from the Chrome run on the same phone that completed (case 4).

The Llama-3.2-1B session started, stalled mid-download, and never reached a single completed run. The results file for that session has an empty runs list. The adapter report told me nothing about whether the in-app browser would carry a download and an init to the end. It did not.

3. Same hardware and model, about twice the throughput depending on the engine alone

On a Windows desktop with an AMD RDNA 4 GPU, Chrome 148, I ran the same Llama-3.2-1B with the short prompt through all three engines. WebGPU is present and used in every case. The decode rate is the median of three runs.

Llama-3.2-1B, short prompt, decode tokens per second (median of 3), AMD RDNA 4
engine	decode tok/s
WebLLM 0.2.84	196.17
transformers.js 4.2.0	125.41
wllama 3.4.1	97.61

The fastest engine decodes about twice as fast as the slowest on identical hardware running the identical model. The WebGPU support flag reads the same for all three. The measured throughput does not.
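The "about twice" figure is plain arithmetic on the table. A small sketch, using only the three medians above (the function and variable names are illustrative):

```typescript
// Median decode rates from the table, in tokens per second.
const decodeTokPerSec: Record<string, number> = {
  "WebLLM 0.2.84": 196.17,
  "transformers.js 4.2.0": 125.41,
  "wllama 3.4.1": 97.61,
};

// Express each engine's rate as a multiple of the slowest engine's rate.
function speedupOverSlowest(rates: Record<string, number>): Record<string, number> {
  const slowest = Math.min(...Object.values(rates));
  const result: Record<string, number> = {};
  for (const [engine, rate] of Object.entries(rates)) {
    result[engine] = rate / slowest;
  }
  return result;
}

for (const [engine, ratio] of Object.entries(speedupOverSlowest(decodeTokPerSec))) {
  console.log(`${engine}: ${ratio.toFixed(2)}x`);
}
// WebLLM 0.2.84: 2.01x
// transformers.js 4.2.0: 1.28x
// wllama 3.4.1: 1.00x
```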

4. Pixel 8a completes, but a long prefill takes 76 to 77 seconds

The device is a Pixel 8a again, this time in plain Chrome 149, not an in-app browser. The Arm Valhall adapter reports the same 4 GB buffer ceiling. Here the model loads and runs to completion, so I have full timings.

With the short prompt of 52 input tokens, time to first token is about 3.8 seconds across three runs (3782, 3954, 3752 ms). With the long prompt of 1213 input tokens, time to first token is 77153, 76996, and 76449 ms. That is 76 to 77 seconds before the first token of the answer appears. Decode after that holds near 9 tokens per second. The same device that handles a one-line prompt in a few seconds takes well over a minute to read a page of context.
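Dividing input tokens by the median time to first token gives a rough input-processing rate for each prompt. This is a sketch on the numbers above, with illustrative names. It treats time to first token as if it were all prefill, which is an assumption: any fixed overhead before the first token is included, so the figures are lower bounds on the actual prefill rate.

```typescript
// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface PromptTimings {
  inputTokens: number;
  ttftMs: number[]; // time to first token for each run, in milliseconds
}

// Pixel 8a, Chrome 149, figures from the paragraph above.
const shortPrompt: PromptTimings = { inputTokens: 52, ttftMs: [3782, 3954, 3752] };
const longPrompt: PromptTimings = { inputTokens: 1213, ttftMs: [77153, 76996, 76449] };

// Input tokens per second, treating median TTFT as prefill time (assumption).
function inputTokensPerSecond(p: PromptTimings): number {
  return p.inputTokens / (median(p.ttftMs) / 1000);
}

console.log(inputTokensPerSecond(shortPrompt).toFixed(1)); // 13.7
console.log(inputTokensPerSecond(longPrompt).toFixed(1)); // 15.8
```

By this rough measure the per-token rate is about the same for both prompts, roughly 14 to 16 input tokens per second, so the wait grows more or less in proportion to prompt length.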

Across these four test environments, WebGPU exposure and large adapter limits were not enough to predict whether a small LLM run would complete. Feature detection answered whether WebGPU could be requested, not whether inference would finish.