OpenCode + Local LLM Setup (ROCm / Lemonade / llama.cpp)
This is a summary of how I built and tuned a local OpenCode-style coding setup on AMD ROCm using Lemonade + llama.cpp, moving through multiple models, context issues, and performance tuning before settling on Gamma 26B A4B Instruct (IT).
1. Base Setup
Started with Lemonade running llama.cpp via Docker, without any ROCm-specific options:
docker run -d \
--name lemonade-server \
-p 13305:13305 \
-v lemonade-cache:/root/.cache/huggingface \
-v lemonade-llama:/opt/lemonade/llama \
ghcr.io/lemonade-sdk/lemonade-server:latest
Then moved to ROCm-enabled mode:
docker run -d \
--name lemonade-server \
-p 13305:13305 \
-v lemonade-cache:/root/.cache/huggingface \
-v lemonade-llama:/opt/lemonade/llama \
-v lemonade-recipe:/root/.cache/lemonade \
-e LEMONADE_LLAMACPP=rocm \
--device=/dev/kfd \
--device=/dev/dri \
ghcr.io/lemonade-sdk/lemonade-server:latest
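Once the container is up, it helps to confirm the server is actually answering on the published port before pointing a client at it. A minimal sketch, assuming `curl` is installed; `wait_for_url` is a hypothetical helper name, and the `/api/v1/health` path in the commented example is an assumption that should be checked against the Lemonade documentation for your version.

```shell
# Poll a URL until it answers with an HTTP success code, or give up.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: fail on HTTP errors, -s: quiet, --max-time: never hang forever
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (endpoint path is an assumption, adjust to what your server exposes):
# wait_for_url "http://localhost:13305/api/v1/health" && echo "server is up"
```

The function only reports reachability; it says nothing about whether the ROCm backend was actually selected, which still has to be read from the container logs.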
2. Hardware Check
ROCm stack confirmed:
- AMD Ryzen AI MAX+ 395
- Radeon 8060S (gfx1151)
- Unified memory pool (~128GB)
Verified via:
rocminfo
rocm-smi
GPU utilization was high and stable, but performance depended heavily on llama.cpp flags and KV behaviour.
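To confirm the GPU target from a script rather than by eye, the `rocminfo` output can be filtered for `gfx` identifiers. This is a small sketch; `list_gfx_targets` and `has_gfx_target` are hypothetical helper names, and the pattern assumes the target appears in the output as a literal `gfx…` string.

```shell
# Print the unique gfx targets found in rocminfo-style text on stdin.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Succeed only if the given target (e.g. gfx1151) is among them.
has_gfx_target() {
    list_gfx_targets | grep -qx "$1"
}

# Example on a real system:
# rocminfo | has_gfx_target gfx1151 && echo "gfx1151 visible to ROCm"
```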
3. First Issue: Context Mismatch
Even with:
--ctx-size 65536
Runtime still reported:
n_ctx_slot = 16384
Which eventually led to:
request (72859 tokens) exceeds available context size (16384 tokens)
So despite the CLI flag, the actual per-slot context was capped at 16K. Likely contributing factors:
- model GGUF constraints
- KV cache limits
- Lemonade slot handling
- context shifting behavior
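One reading fits the reported number exactly. When the KV cache is not shared between slots, llama.cpp's server divides the total context evenly across its parallel slots, so a 65536-token context with four slots leaves 16384 tokens per slot. The sketch below is just that arithmetic; it assumes `--parallel 4` was already in effect when the log line was captured, which these notes do not state explicitly, and `per_slot_ctx` is a hypothetical helper name.

```shell
# Context available to each slot when the total is split evenly.
# Usage: per_slot_ctx TOTAL_CTX PARALLEL_SLOTS
per_slot_ctx() {
    echo $(( $1 / $2 ))
}

per_slot_ctx 65536 4   # prints 16384
per_slot_ctx 65536 1   # prints 65536
```

Under that assumption, a single request can only use the full 64K window if the server runs one slot, or if the total context is raised to the desired per-slot size multiplied by the slot count.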
4. Model Experiments
I cycled through:
- GLM-4.7-Flash-GGUF
- Qwen3 8B (tool-call instability issues)
- Qwen3 30B Q2/Q4 variants
Issues encountered:
- unexpected <tool_call> tokens
- context exhaustion at high prompt sizes
- unstable KV reuse across slots
Eventually it became clear that model behavior mattered less than runtime stability.
5. Initial Performance Tuning
First meaningful gains came from:
--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32
This alone pushed throughput from baseline into the mid-30 TPS range.
6. Batch Size Experiments
Batch tuning had a major effect:
| Batch config | Behaviour |
|---|---|
| 2048 | high spikes, unstable latency |
| 512 | balanced |
| 256 | most stable |
Final direction:
--batch-size 256 --ubatch-size 64
7. KV + Cache Tuning
Further tuning:
--kv-unified \
--cache-reuse 256 \
--no-warmup
And in some tests:
- --no-context-shift (for stability)
- reduced parallelism when slot issues appeared
Key observation: KV reuse had more impact than raw compute tuning.
8. Slot + Context Issues
Logs consistently showed:
- slot reuse based on LCP (longest common prefix) similarity
- KV cache accumulation across prompts
- inflated memory state from previous sessions
This caused:
- unpredictable context sizes
- stale prompt reuse
- occasional silent degradation before hard failure
9. Performance Comparison (Measured Data)
These are the measured TPS figures:
| Configuration | Time | Tokens | TPS |
|---|---|---|---|
| Batch 512 | 39.457 | 768 | — |
| Batch 512 | 44.806 | 2048 | 239.29 |
| Batch 2048 | — | 16881 | — |
| Batch 2048 | 95 | 23025 | 64.67 |
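For cross-checking figures like these, throughput is tokens divided by elapsed time. The helper below is a sketch that assumes the Time column is in seconds, which the table does not state. The TPS values in the table may have been taken over a different measurement window (for example prompt processing versus generation), so they need not match a plain tokens/time calculation. `tps` is a hypothetical helper name.

```shell
# Tokens per second, rounded to two decimals.
# Usage: tps TOKENS SECONDS   (fails if SECONDS is not positive)
tps() {
    awk -v t="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.2f\n", t / s }'
}

# Example with one row from the table (time assumed to be seconds):
# tps 2048 44.806
```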
10. Final Stable Runtime Configuration
After tuning, the stable configuration became:
--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32 \
--kv-unified \
--no-warmup \
--cache-reuse 256 \
--batch-size 256 \
--ubatch-size 64
This gave:
- stable throughput (~30–36 TPS typical range depending on load)
- reduced KV churn
- fewer slot reuse artifacts
- more predictable latency under long contexts
11. Context Failure Reality
Even with:
--ctx-size 65536
The system still enforced:
- effective slot limit (~16K in some cases)
- hard failures at ~72K token requests
Conclusion:
Context size is not just a flag: the effective limit is set by the model, the KV cache configuration, and the runtime together.
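To see how far over the limit a failing request was, the two numbers can be pulled straight out of the error line quoted earlier. This is a sketch that matches only that exact message wording; `ctx_overflow` is a hypothetical helper name.

```shell
# Extract requested and available token counts from an
# "exceeds available context size" message and print the overflow.
ctx_overflow() {
    sed -nE 's/.*request \(([0-9]+) tokens\) exceeds available context size \(([0-9]+) tokens\).*/\1 \2/p' |
    while read -r requested available; do
        echo "requested=$requested available=$available over_by=$((requested - available))"
    done
}

# Example on live logs (container name taken from the docker commands earlier):
# docker logs lemonade-server 2>&1 | ctx_overflow
```

For the failure quoted in these notes, this reports an overflow of 56475 tokens, which makes it clear the request was far beyond the slot limit, not marginally over it.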
12. Final Model Choice
After all experiments, I settled on:
Gamma 26B A4B Instruct (IT)
Why:
- more stable under long sessions
- fewer tool-call artifacts than Qwen3
- consistent KV behavior in llama.cpp
- better balance of speed vs reliability on ROCm APU
Compared to earlier models:
- GLM → fast but unstable under long context
- Qwen3 → powerful but noisy, prone to stray tool-call tokens
- Gamma 26B A4B IT → stable and predictable
End State
Final system:
- Lemonade + llama.cpp (ROCm backend)
- tuned batching + KV settings
- strict context control
- Gamma 26B A4B IT as primary coding model
The key takeaway from the whole setup:
Most gains came not from the model itself, but from fixing context + slot behavior and stabilizing KV cache handling.