Occasional blog posts from a random systems engineer

OpenCode + Local LLM Setup (ROCm / Lemonade / llama.cpp)


This is a summary of how I built and tuned a local OpenCode-style coding setup on AMD ROCm using Lemonade + llama.cpp, moving through multiple models, context issues, and performance tuning before settling on Gamma 26B A4B Instruct (IT).


1. Base Setup

Started with Lemonade running llama.cpp on ROCm via Docker:

docker run -d \
  --name lemonade-server \
  -p 13305:13305 \
  -v lemonade-cache:/root/.cache/huggingface \
  -v lemonade-llama:/opt/lemonade/llama \
  ghcr.io/lemonade-sdk/lemonade-server:latest

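Then moved to ROCm-enabled mode (the first container has to be stopped and removed beforehand, since both use the name lemonade-server):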

docker run -d \
  --name lemonade-server \
  -p 13305:13305 \
  -v lemonade-cache:/root/.cache/huggingface \
  -v lemonade-llama:/opt/lemonade/llama \
  -v lemonade-recipe:/root/.cache/lemonade \
  -e LEMONADE_LLAMACPP=rocm \
  --device=/dev/kfd \
  --device=/dev/dri \
  ghcr.io/lemonade-sdk/lemonade-server:latest

2. Hardware Check

ROCm stack confirmed:

  • AMD Ryzen AI MAX+ 395
  • Radeon 8060S (gfx1151)
  • Unified memory pool (~128GB)

Verified via:

rocminfo
rocm-smi

GPU utilization was high and stable, but performance depended heavily on llama.cpp flags and KV behaviour.
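
To make the hardware check scriptable, the gfx target can be pulled out of the rocminfo output. This is a minimal sketch: the helper names are mine, the sample text is an illustrative excerpt rather than output captured from this machine, and the regex simply assumes the target appears as a "gfx..." token somewhere in the output.

```python
import re
import subprocess


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return the unique gfx architecture names found in the text, in order."""
    seen: list[str] = []
    for match in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if match not in seen:
            seen.append(match)
    return seen


def detect_gfx_targets() -> list[str]:
    """Run the real rocminfo binary (needs a working ROCm install)."""
    result = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=True
    )
    return gfx_targets(result.stdout)


if __name__ == "__main__":
    # Illustrative excerpt only, not real output from this system.
    sample = (
        "  Name:                    gfx1151\n"
        "  Marketing Name:          Radeon 8060S\n"
    )
    print(gfx_targets(sample))
```

On a machine with ROCm installed, detect_gfx_targets() should include gfx1151 for this APU; an empty list would suggest the container or host cannot see the GPU.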


3. First Issue: Context Mismatch

Even with:

--ctx-size 65536

Runtime still reported:

n_ctx_slot = 16384

Which eventually led to:

request (72859 tokens) exceeds available context size (16384 tokens)

So despite the CLI flag, the actual per-slot context was effectively capped at 16K. Suspected causes at the time:

  • model GGUF constraints
  • KV cache limits
  • Lemonade slot handling
  • context shifting behavior
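
One plausible explanation fits the numbers exactly: when llama-server runs several parallel slots without a unified KV cache, the total context is divided evenly between them. The sketch below assumes the server was already running with --parallel 4 (the value used in the tuning sections later in this post); the function name is mine.

```python
def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Context available to each slot when the total is split evenly."""
    return ctx_size // parallel


ctx_size = 65536        # value passed via --ctx-size
parallel = 4            # assumed value of --parallel
request_tokens = 72859  # size of the failing request from the log

slot_ctx = per_slot_context(ctx_size, parallel)
print(slot_ctx)                   # 16384
print(request_tokens > slot_ctx)  # True
print(request_tokens > ctx_size)  # True
```

The last check matters too: at 72859 tokens, the failing request was larger than the full 65536-token context, so it would not have fit even in a single slot.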

4. Model Experiments

I cycled through:

  • GLM-4.7-Flash-GGUF
  • Qwen3 8B (tool-call instability issues)
  • Qwen3 30B Q2/Q4 variants

Issues encountered:

  • unexpected <tool_call> tokens
  • context exhaustion at high prompt sizes
  • unstable KV reuse across slots

Eventually it became clear that model behavior mattered less than runtime stability.


5. Initial Performance Tuning

First meaningful gains came from:

--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32

This alone pushed throughput from the baseline into the mid-30s in tokens per second (TPS).


6. Batch Size Experiments

Batch tuning had a major effect:

Batch size   Behaviour
2048         high spikes, unstable latency
512          balanced
256          most stable

Final direction:

--batch-size 256 --ubatch-size 64
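
A rough way to reason about these two flags: --batch-size is the logical batch handed to each decode call, and --ubatch-size is the physical chunk actually run on the device. The sketch below counts how many of each a single prompt needs. It is a simplified model under my own assumptions (one slot, no interleaving with other requests), not a description of the server's real scheduler.

```python
import math


def decode_passes(prompt_tokens: int, batch_size: int, ubatch_size: int) -> tuple[int, int]:
    """Return (logical batches, physical micro-batches) for one prompt."""
    full, rest = divmod(prompt_tokens, batch_size)
    logical = full + (1 if rest else 0)
    # Each full batch is split into micro-batches; the tail batch is split too.
    physical = full * math.ceil(batch_size / ubatch_size) + math.ceil(rest / ubatch_size)
    return logical, physical


if __name__ == "__main__":
    for batch, ubatch in [(2048, 512), (512, 128), (256, 64)]:
        # The ubatch values for the first two rows are hypothetical.
        print(batch, ubatch, decode_passes(16384, batch, ubatch))
```

Smaller batches mean more, shorter passes, which is consistent with the smoother latency seen at 256, at the cost of more per-pass overhead.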

7. KV + Cache Tuning

Further tuning:

--kv-unified \
--cache-reuse 256 \
--no-warmup

And in some tests:

  • --no-context-shift (for stability)
  • reduced parallelism when slot issues appeared

Key observation: KV reuse had more impact than raw compute tuning.


8. Slot + Context Issues

Logs consistently showed:

  • LCP similarity-based slot reuse
  • KV cache accumulation across prompts
  • inflated memory state from previous sessions

This caused:

  • unpredictable context sizes
  • stale prompt reuse
  • occasional silent degradation before hard failure
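
To show why prefix-based slot reuse can pull in stale state, here is a simplified sketch of the idea: a new prompt is compared with the tokens cached in each slot, and the slot with the longest common prefix (LCP) wins if the similarity clears a threshold. This is not llama.cpp's actual code; the function names, the similarity formula, and the 0.5 threshold are assumptions for illustration.

```python
def lcp_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_slot(slots: dict[int, list[int]], prompt: list[int], threshold: float = 0.5):
    """Return (slot_id, similarity) of the best matching slot, or (None, 0.0)."""
    best_id, best_sim = None, 0.0
    for slot_id, cached in slots.items():
        if not cached or not prompt:
            continue
        sim = lcp_len(cached, prompt) / len(prompt)
        if sim > threshold and sim > best_sim:
            best_id, best_sim = slot_id, sim
    return best_id, best_sim


if __name__ == "__main__":
    slots = {0: [1, 2, 3, 4], 1: [1, 2, 9, 9]}
    print(pick_slot(slots, [1, 2, 3, 4, 5, 6]))  # slot 0 shares 4 of 6 tokens
    print(pick_slot(slots, [7, 8]))              # nothing matches
```

Coding sessions tend to share long system prompts, so many unrelated requests look similar by this measure, which would explain a slot carrying KV state over from an earlier session.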

9. Performance Comparison (Measured Data)

These are the TPS figures I measured:

Configuration   Time     Tokens   TPS
Batch 512       39.457   768      -
Batch 512       44.806   2048     239.29
Batch 2048      -        16881    -
Batch 2048      95       23025    64.67
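
As a sanity check on the two complete rows, throughput can be recomputed as tokens divided by time. The sketch assumes Time is in seconds and Tokens is the total for the run; neither is stated in the table.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Plain throughput: total tokens divided by elapsed seconds."""
    return tokens / seconds


# (configuration, time, tokens, recorded TPS) for the rows with all values present
rows = [
    ("Batch 512", 44.806, 2048, 239.29),
    ("Batch 2048", 95.0, 23025, 64.67),
]

for name, seconds, tokens, recorded in rows:
    computed = tokens_per_second(tokens, seconds)
    print(f"{name}: computed {computed:.2f} TPS, recorded {recorded:.2f} TPS")
```

The simple ratio comes out at roughly 45.71 and 242.37 TPS, which does not match the recorded column for either row. That suggests the recorded figures measure something narrower (for example prompt processing or generation only), or that the columns are misaligned, so the table is best read as indicative rather than exact.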

10. Final Stable Runtime Configuration

After tuning, the stable configuration became:

--flash-attn on \
--parallel 4 \
--threads 4 \
--no-mmap \
--keep 32 \
--kv-unified \
--no-warmup \
--cache-reuse 256 \
--batch-size 256 \
--ubatch-size 64
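
For keeping this configuration in one place, the flags can be held in a dict and turned into a command line. This sketch assumes a direct llama-server invocation; how Lemonade forwards the flags is not covered here, and the model path is a hypothetical placeholder.

```python
import shlex

# Flag values copied from the configuration above; None marks a bare switch.
FLAGS = {
    "--flash-attn": "on",
    "--parallel": 4,
    "--threads": 4,
    "--no-mmap": None,
    "--keep": 32,
    "--kv-unified": None,
    "--no-warmup": None,
    "--cache-reuse": 256,
    "--batch-size": 256,
    "--ubatch-size": 64,
}


def build_argv(model_path: str, flags: dict, binary: str = "llama-server") -> list[str]:
    """Build an argv list: binary, model path, then each flag and its value."""
    argv = [binary, "--model", model_path]
    for flag, value in flags.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


if __name__ == "__main__":
    # "/models/model.gguf" is a placeholder, not a path from this setup.
    print(shlex.join(build_argv("/models/model.gguf", FLAGS)))
```

Keeping the flags as data makes it easy to diff one tuning run against another instead of comparing long shell lines by eye.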

This gave:

  • stable throughput (~30–36 TPS typical range depending on load)
  • reduced KV churn
  • fewer slot reuse artifacts
  • more predictable latency under long contexts

11. Context Failure Reality

Even with:

--ctx-size 65536

The system still enforced:

  • effective slot limit (~16K in some cases)
  • hard failures at ~72K token requests

Conclusion:

context size is not just a flag — it’s enforced by model + KV + runtime alignment
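
One practical consequence is that the client should keep requests under the effective limit instead of relying on the server to cope. Below is a minimal sketch of a client-side guard that drops the oldest messages until the conversation fits a token budget. The function is hypothetical, and the per-message token counts are assumed to come from whatever tokenizer the client has available.

```python
def trim_to_budget(messages: list[tuple[str, int]], budget_tokens: int):
    """messages: (text, token_count) pairs, oldest first.

    Drops the oldest messages until the total fits the budget.
    Returns (kept_messages, total_tokens).
    """
    kept = list(messages)
    total = sum(count for _, count in kept)
    while kept and total > budget_tokens:
        _, count = kept.pop(0)
        total -= count
    return kept, total


if __name__ == "__main__":
    # Hypothetical history that adds up to the failing 72859-token request.
    history = [("old context", 60000), ("recent edits", 10000), ("question", 2859)]
    kept, total = trim_to_budget(history, budget_tokens=16384)
    print([text for text, _ in kept], total)
```

A real client would also want to pin the system prompt and leave headroom for the reply, which this sketch leaves out.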


12. Final Model Choice

After all experiments, I settled on:

Gamma 26B A4B Instruct (IT)

Why:

  • more stable under long sessions
  • fewer tool-call artifacts than Qwen3
  • consistent KV behavior in llama.cpp
  • better balance of speed vs reliability on ROCm APU

Compared to earlier models:

  • GLM → fast but unstable under long context
  • Qwen3 → powerful but noisy / tool-call prone
  • Gamma 26B A4B IT → stable and predictable

End State

Final system:

  • Lemonade + llama.cpp (ROCm backend)
  • tuned batching + KV settings
  • strict context control
  • Gamma 26B A4B IT as primary coding model

The key takeaway from the whole setup:

Most gains came not from the model itself, but from fixing context + slot behavior and stabilizing KV cache handling.