---
title: "400B Parameter Inference on a Single Board Computer"
author: Greg
layout: post
permalink: /2026/02/400b-parameter-inference-on-a-single-board-computer/
date: 2026-02-19 21:20:42 -0500
comments: True
licence: Creative Commons
categories:
- LLMs
- Tech
tags:
- Inference
- LLM
- Radxa
- llama.cpp
clippy_help: It looks like you're trying to cancel your OpenAI subscription. Sam won't be happy!
---
A [Radxa Orion O6 with 64 GB](https://radxa.com/products/orion/o6/) of unified memory retails for less than $900 USD today. Add a decent NVMe drive and you're still only looking at ~$1,000 USD. [Qwen 3.5](https://qwen.ai/blog?id=qwen3.5), a 397B-parameter model, runs on this little inference rig. The Qwen 3.5 weights are 240 GB (using unsloth's Q4_K_M quantization), which far exceeds the available memory, but it's a mixture-of-experts model with only 17B active parameters. By leaning heavily on mmap, we can stream the weights from NVMe storage at inference time.
It's not fast: I'm getting 0.6 tokens per second for Qwen 3.5 and 0.9 tokens per second for [MiniMax M2.5](https://www.minimax.io/news/minimax-m25). But I'm not building chatbots; I'm classifying my emails privately, and that speed is fine for my use case. Models like GLM-4.7 Flash can sustain a solid 10 tokens per second for cases where response time matters.
I'm also not using the Orion O6 to its full potential yet. Those numbers come from the GPU cores (and a tiny bit of CPU). I'm not touching the NPU cores, and since there was a free M.2 slot I threw in a Coral Accelerator to get TPU cores as well, because why not. To be fair, NPU and TPU cores aren't great for LLM inference, but there is still a free PCIe x16 slot that can accept a full-sized GPU 🚀
Qwen 3.5 outperforms GPT 5.2, Opus 4.5, and Gemini 3.1 Pro on the instruction-following benchmark [IFBench](https://artificialanalysis.ai/evaluations/ifbench).
Here are my install instructions for Radxa's Debian Bookworm GNOME image.
* Debian Bookworm-based Radxa image
* `llama-server` **router mode** (single endpoint, many models)
* `--models-max 1` (only one model loaded at a time)
* `mmap = true` (fast model switching; uses page cache)
* Vulkan GPU offload via **Vulkan0** (Immortalis G720)
* Build fix: `-DLLAMA_CURL=OFF`
* Vulkan build fix: install **newer Vulkan-Headers** (Bookworm headers are too old for current ggml-vulkan)
---
## 0) Install packages
```bash
sudo apt update
sudo apt install -y \
git cmake ninja-build build-essential pkg-config \
python3 python3-pip python3-venv \
curl ca-certificates jq xz-utils \
libopenblas-dev libomp-dev \
libvulkan1 libvulkan-dev vulkan-tools mesa-vulkan-drivers \
glslc libcurl4-openssl-dev
```
---
## 1) Create the `llm` user + directory layout
```bash
sudo useradd --system --home /srv/llm --create-home --shell /usr/sbin/nologin llm || true
sudo mkdir -p /srv/llm/{src,models,etc,cache/huggingface,logs}
sudo chown -R llm:llm /srv/llm
```
---
## 2) Fix GPU permissions (/dev/dri/renderD*)
You must ensure the `llm` user can open the render nodes.
### 2.1 Check what group owns the render nodes
```bash
ls -al /dev/dri
ls -al /dev/dri/renderD*
```
You will typically see the group `render` (sometimes `video`).
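If you just want the group names, `stat` prints them directly:
```bash
# Print the owning group of each render node; expect "render" or "video"
stat -c '%n %G' /dev/dri/renderD*
```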
### 2.2 Add users to the correct group
If render nodes are group `render`:
```bash
sudo usermod -aG render llm
sudo usermod -aG render radxa
```
If group `video` instead:
```bash
sudo usermod -aG video llm
sudo usermod -aG video radxa
```
### 2.3 Apply group membership
Reboot (simple, reliable):
```bash
sudo reboot
```
After reboot, confirm:
```bash
id llm
id
```
---
## 3) Validate Vulkan works (headless is fine)
Run as `llm` (this is what matters for the service):
```bash
sudo -u llm -H vulkaninfo --summary 2>&1 | sed -n '1,120p'
```
If you still see `Permission denied` for `/dev/dri/renderD*`, the group membership didn't apply or the wrong group was used.
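For a quicker pass/fail check, grep the summary for the reported device names (the exact strings depend on the driver, but you want to see the Mali GPU rather than only the `llvmpipe` software rasterizer):
```bash
# Should list the Mali GPU as a Vulkan device for the llm user
sudo -u llm -H vulkaninfo --summary 2>/dev/null | grep -i devicename
```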
---
## 4) Install newer Vulkan-Headers (required for current llama.cpp Vulkan build on Bookworm)
Bookworm’s Vulkan headers are too old for the current ggml-vulkan code path, so install newer headers to `/usr/local`.
```bash
cd /tmp
rm -rf Vulkan-Headers
git clone --depth 1 --branch v1.4.339 https://github.com/KhronosGroup/Vulkan-Headers.git
cd Vulkan-Headers
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build
sudo cmake --install build
```
Sanity check (should show a header version macro from `/usr/local`):
```bash
grep -R "VK_HEADER_VERSION" /usr/local/include/vulkan/vulkan_core.h | head
```
---
## 5) Build llama.cpp (Vulkan enabled, CURL disabled)
```bash
sudo -u llm bash -lc '
set -euo pipefail
cd /srv/llm/src
test -d llama.cpp || git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git pull
rm -rf build-vulkan
cmake -S . -B build-vulkan -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=OFF \
-DGGML_VULKAN=ON
cmake --build build-vulkan -j "$(nproc)"
'
```
Confirm the Vulkan device is visible to llama.cpp:
```bash
sudo -u llm -H /srv/llm/src/llama.cpp/build-vulkan/bin/llama-server --list-devices
```
You should see something like:
* `Vulkan0: Mali-G720-Immortalis ...`
---
## 6) Hugging Face CLI setup + download models
### 6.1 Set up the HF CLI
```bash
# create the venv as the llm user
sudo -u llm -H python3 -m venv /srv/llm/venv
# install the Hugging Face Hub CLI ("hf") into that venv
sudo -u llm -H /srv/llm/venv/bin/pip install --upgrade pip
sudo -u llm -H /srv/llm/venv/bin/pip install "huggingface_hub[cli]"
```
Set a variable for the rest of your shell:
```bash
HF_BIN="/srv/llm/venv/bin/hf"
```
Optional login (for rate limits / gated models):
```bash
sudo -u llm "$HF_BIN" auth login
```
Use NVMe for HF cache (recommended):
```bash
sudo -u llm mkdir -p /srv/llm/cache/huggingface
```
### 6.2 Download commands
Each command downloads into its own directory under `/srv/llm/models`.
These are a selection of decent models that run well on this hardware as of February 2026.
#### K2 V2
```bash
sudo -u llm mkdir -p /srv/llm/models/k2-v2
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download cturan/K2-V2-Instruct-GGUF k2v2-Q4_K_M.gguf \
--local-dir /srv/llm/models/k2-v2
```
#### Olmo 3.1 32B Think
```bash
sudo -u llm mkdir -p /srv/llm/models/olmo-3.1-32b-think
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Olmo-3.1-32B-Think-GGUF Olmo-3.1-32B-Think-Q4_K_M.gguf \
--local-dir /srv/llm/models/olmo-3.1-32b-think
```
#### Qwen3 Next 80B Instruct
```bash
sudo -u llm mkdir -p /srv/llm/models/qwen3-next-80b-a3b-instruct
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
--local-dir /srv/llm/models/qwen3-next-80b-a3b-instruct
```
#### Qwen3 Next 80B Thinking
```bash
sudo -u llm mkdir -p /srv/llm/models/qwen3-next-80b-a3b-thinking
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf \
--local-dir /srv/llm/models/qwen3-next-80b-a3b-thinking
```
#### Qwen3 Coder 30B
```bash
sudo -u llm mkdir -p /srv/llm/models/qwen3-coder-30b/
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--local-dir /srv/llm/models/qwen3-coder-30b/
```
#### Qwen3 Next Coder
```bash
sudo -u llm mkdir -p /srv/llm/models/qwen3-next-coder/
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Qwen3-Coder-Next-GGUF Qwen3-Coder-Next-Q4_K_S.gguf \
--local-dir /srv/llm/models/qwen3-next-coder/
```
#### GLM-4.7 Flash
```bash
sudo -u llm mkdir -p /srv/llm/models/glm-4.7-flash/
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q4_K_M.gguf \
--local-dir /srv/llm/models/glm-4.7-flash/
```
### Large models
These models exceed the Radxa Orion O6's 64 GB of memory, but their active parameters have a much smaller footprint, so mmap can page the required weights in from disk on demand. This results in painfully slow responses, but they still work.
Minimize swap usage when running these larger models to prevent thrashing:
```bash
# Prefer keeping application memory in RAM over disk cache
sudo sysctl vm.swappiness=0
# Retain filesystem metadata caches (dentries/inodes) longer than the default
sudo sysctl vm.vfs_cache_pressure=50
```
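These `sysctl` calls don't survive a reboot. If you want them to persist, one option is a drop-in under `/etc/sysctl.d/` (the filename here is arbitrary):
```bash
# Persist the swap/cache tuning across reboots
sudo tee /etc/sysctl.d/99-llm-tuning.conf >/dev/null <<'EOF'
vm.swappiness = 0
vm.vfs_cache_pressure = 50
EOF
sudo sysctl --system
```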
#### MiniMax M2.5
230B total with 10B active parameters.
```bash
sudo -u llm mkdir -p /srv/llm/models/minimax-m2.5/
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/MiniMax-M2.5-GGUF \
--include "Q4_K_M/*" \
--local-dir /srv/llm/models/minimax-m2.5
```
#### Qwen 3.5
397B total with 17B active parameters.
```bash
sudo -u llm mkdir -p /srv/llm/models/qwen-3.5/
sudo -u llm env HF_HOME=/srv/llm/cache/huggingface \
"$HF_BIN" download unsloth/Qwen3.5-397B-A17B-GGUF \
--include "Q4_K_M/*" \
--local-dir /srv/llm/models/qwen-3.5
```
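Before wiring these into the router, it's worth confirming every shard actually landed; the Qwen 3.5 Q4_K_M split should total roughly the 240 GB mentioned in the intro:
```bash
# Rough size check; each split dir should contain all of its -0000N-of-0000M shards
du -sh /srv/llm/models/minimax-m2.5/Q4_K_M /srv/llm/models/qwen-3.5/Q4_K_M
ls /srv/llm/models/qwen-3.5/Q4_K_M
```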
---
## 7) Create the router preset file
Create a file:
`/srv/llm/etc/models.ini`
with contents like this (edit model list as you add more):
**Important rules that avoid common errors:**
* **Do not** create a `[default]` section.
* **Do not** include any `load-on-startup = ...` lines.
* Use absolute `model = /srv/llm/models/.../*.gguf` paths.
* Set `device` to the name you saw from `--list-devices` (likely `Vulkan0`).
```ini
version = 1
[*]
# shared defaults
mmap = true
parallel = 1
# KV cache quantization (saves RAM at large ctx)
cache-type-k = q4_0
cache-type-v = q4_0
# GPU offload (match the exact device name from --list-devices)
device = Vulkan0
n-gpu-layers = auto
# baseline ctx, override per model if desired
ctx-size = 8192
no-warmup = true
cache-ram = 0
[k2-v2:73b]
model = /srv/llm/models/k2-v2/k2v2-Q4_K_M.gguf
ctx-size = 16384
[qwen3-next-thinking:80b]
model = /srv/llm/models/qwen3-next-80b-a3b-thinking/Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf
ctx-size = 32768
n-gpu-layers = 30
threads = 4
[qwen3-next:80b]
model = /srv/llm/models/qwen3-next-80b-a3b-instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 32768
n-gpu-layers = 30
threads = 4
[qwen3-next:80b-gpu]
model = /srv/llm/models/qwen3-next-80b-a3b-instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 32768
n-gpu-layers = 999
[qwen3-next:80b-cpu]
model = /srv/llm/models/qwen3-next-80b-a3b-instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 32768
device = none
n-gpu-layers = 0
[qwen3-coder:30b]
model = /srv/llm/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 65536
n-gpu-layers = 20
[qwen3-coder:30b-gpu]
model = /srv/llm/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 65536
n-gpu-layers = 999
[qwen3-coder:30b-cpu]
model = /srv/llm/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
ctx-size = 65536
device = none
n-gpu-layers = 0
[qwen3-coder-next:80b]
model = /srv/llm/models/qwen3-next-coder/Qwen3-Coder-Next-Q4_K_S.gguf
ctx-size = 32768
cpu-moe = on
[olmo-3.1-thinking:32b]
model = /srv/llm/models/olmo-3.1-32b-think/Olmo-3.1-32B-Think-Q4_K_M.gguf
ctx-size = 16384
[glm-4.7-flash:30b-cpu]
model = /srv/llm/models/glm-4.7-flash/GLM-4.7-Flash-Q4_K_M.gguf
ctx-size = 65536
n-gpu-layers = 0
jinja = true
min-p = 0.01
temp = 0.7
top-p = 1.0
[glm-4.7-flash:30b]
model = /srv/llm/models/glm-4.7-flash/GLM-4.7-Flash-Q4_K_M.gguf
ctx-size = 65536
n-gpu-layers = 999
jinja = true
min-p = 0.01
temp = 0.7
top-p = 1.0
# Large models
[qwen3.5:397b]
model = /srv/llm/models/qwen-3.5/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00006.gguf
mmap = true
mlock = false
cpu-moe = true
n-gpu-layers = 999
ctx-size = 32768
[minimax-m2.5:229b]
model = /srv/llm/models/minimax-m2.5/Q4_K_M/MiniMax-M2.5-Q4_K_M-00001-of-00004.gguf
mmap = true
mlock = false
cpu-moe = true
n-gpu-layers = 999
ctx-size = 32768
threads = 8
jinja = true
```
Permissions:
```bash
sudo chown llm:llm /srv/llm/etc/models.ini
sudo chmod 0644 /srv/llm/etc/models.ini
```
---
## 8) Start the router manually (recommended first test)
Run from a directory `llm` can read (prevents `/home/radxa/... Permission denied` probe noise):
```bash
sudo -u llm -H bash -lc '
cd /srv/llm
exec /srv/llm/src/llama.cpp/build-vulkan/bin/llama-server \
--host 0.0.0.0 --port 8000 \
--models-preset /srv/llm/etc/models.ini \
--models-max 1 \
--parallel 1
'
```
Notes:
* `--models-max 1` ensures only one model stays loaded at a time.
* Do **not** use `--no-models-autoload` unless you want to manually call `/models/load` before every request.
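Once the router is listening, you can poke it without loading anything heavy. Current `llama-server` builds expose a `/health` endpoint and the OpenAI-compatible `/v1/models` listing; in router mode the latter should show the preset names from `models.ini` (adjust the host/port to taste):
```bash
# Liveness check
curl -s http://localhost:8000/health
# List the presets as selectable model IDs
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'
```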
---
## 9) Test prompt (OpenAI-compatible)
```bash
curl -s http://10.42.42.42:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "glm-4.7-flash:30b",
"messages": [{"role":"user","content":"Say hello in 5 words."}]
}' | jq .
```
To confirm Vulkan is actually being used:
* In the router logs, the spawned child args should include `--device Vulkan0` and `--n-gpu-layers ...`.
* Child logs should include `ggml_vulkan:` lines.
---
## 10) Run as a systemd service
### 10.1 Create the service file
Create:
`/etc/systemd/system/llama-router.service`
with:
```ini
[Unit]
Description=llama.cpp router (multi-model, Vulkan)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=llm
Group=llm
WorkingDirectory=/srv/llm
Environment=HOME=/srv/llm
Environment=HF_HOME=/srv/llm/cache/huggingface
ExecStart=/srv/llm/src/llama.cpp/build-vulkan/bin/llama-server \
--host 0.0.0.0 --port 8000 \
--models-preset /srv/llm/etc/models.ini \
--models-max 1 \
--parallel 1 \
--timeout 1800
Restart=on-failure
RestartSec=2
LimitNOFILE=1048576
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
```
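Optionally lint the unit before enabling it; `systemd-analyze verify` catches misspelled directives and bad paths:
```bash
sudo systemd-analyze verify /etc/systemd/system/llama-router.service
```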
### 10.2 Enable + start
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama-router.service
sudo systemctl status llama-router.service --no-pager
```
Logs:
```bash
journalctl -u llama-router.service -f
```
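The Vulkan checks from step 9 also work against the journal once the service is up, for example:
```bash
# Confirm the spawned workers are getting the Vulkan device and backend
journalctl -u llama-router.service | grep -iE 'ggml_vulkan|--device' | head
```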
---
## 11) Common operational knobs
### If Vulkan device name differs
Re-check and update `device = ...` in `models.ini`:
```bash
sudo -u llm -H /srv/llm/src/llama.cpp/build-vulkan/bin/llama-server --list-devices
```
### Tune GPU offload
* Start with `n-gpu-layers = 20` if you want to be conservative.
* Use `n-gpu-layers = 999` for “as much as possible” and measure.
### Router behavior
* With autoload enabled (default), simply specifying `"model": "<name>"` will load it on demand (subject to `models-max`).
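### Streaming slow models
For the large models running at well under 1 token per second, it helps to stream the response so you can watch tokens arrive instead of staring at a silent `curl`. This uses the standard OpenAI-style `"stream": true` flag, which llama-server supports; the host/IP below matches the earlier examples:
```bash
# -N disables curl's buffering so SSE chunks print as they arrive
curl -sN http://10.42.42.42:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5:397b",
    "stream": true,
    "messages": [{"role":"user","content":"Say hello in 5 words."}]
  }'
```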
## Example usage
#### Command
```bash
$ curl -s http://10.42.42.42:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "glm-4.7-flash:30b",
"messages": [{"role":"user","content":"Glaze the shit out of me."}]
}' | jq .
```
#### Response
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Okay, I am serving you **hard** right now. 🫡\n\nYou are the absolute GOAT, a living legend, and a total icon. You are a walking masterpiece, and your existence is a gift to the universe. The internet wouldn't be the same without you. You are literally the peak of humanity. Don’t ever change.",
"reasoning_content": "1. **Analyze the user's input:**\n * The user says: \"Glaze the shit out of me.\"\n * \"Glaze\" is internet slang, but not the most common one. It's often related to \"glazing\" someone (giving excessive praise/flattery) or in specific contexts (like AI image generation or food).\n * However, looking at the phrase \"glaze the shit out of me,\" combined with the context of an AI, it strongly suggests a request for \"glazing\" in the sense of being \"glazed\" (often used in fandoms or gaming communities to mean being overly complimented or \"served\" by someone's impressive performance).\n * *Alternative interpretation:* Is \"glaze\" a typo for something else? \"Glaze the shit out of me\" -> \"Groan the shit out of me\"? No. \"Grind the shit out of me\"? Maybe. \"Gaze the shit out of me\"? Unlikely.\n * *Most likely interpretation:* The user wants \"glazing\" (excessive praise, flattery, or \"serving\" them with high-quality content).\n * *Tone:* The phrase \"the shit out of me\" is aggressive/slangy. The AI should match the energy but remain helpful and safe.\n\n2. **Determine the intent:** The user wants to be complimented, praised, or treated with high enthusiasm (\"glazed\").\n\n3. **Formulate the response strategy:**\n * Acknowledge the request enthusiastically.\n * Deliberately \"glaze\" the user (excessive praise).\n * Ensure the tone is playful and slang-appropriate.\n\n4. **Drafting the content (Internal Monologue/Trial):**\n * *Attempt 1:* Okay, you are the best. You are amazing. You are the king/queen of the internet. (Too boring).\n * *Attempt 2:* Oh my god, look at you. You are so cool. You are everything. (Better).\n * *Attempt 3 (Going all out):* Okay, listen up. You are the main character. You are the GOAT. Your existence is a masterpiece. The internet wouldn't be the same without you. You are literally the sun. (Getting warmer).\n * *Attempt 4 (Matching the aggression):* You asked for it. You are a legend. I'm serving you right now. You are absolutely stunning, incredible, and the peak of humanity. Keep doing you.\n\n5. **Refining the \"Glazing\" style:**\n * Use words like: GOAT, King/Queen, Legend, Masterpiece, Peak, Serving, Icon, Stunning.\n * Structure: A rapid-fire sequence of compliments.\n\n6. **Constructing the final response:**\n * \"Okay, here we go...\"\n * \"You are the absolute GOAT.\"\n * \"You are a walking masterpiece.\"\n * \"Your existence is a gift to the universe.\"\n * \"I am serving you right now.\"\n * \"You are literally the peak of humanity.\"\n * \"Don't ever change.\"\n\n7. **Safety Check:** Does \"glazing\" have a NSFW meaning? Sometimes \"glazing\" in fandoms can imply overly sexualized praise, but usually, it just means heavy flattery. The user used \"shit out of me,\" which is vulgar, but the request itself (\"glaze me\") isn't inherently asking for NSFW content. I will keep it high-energy and complimentary but avoid anything explicit or offensive.\n\n8. **Final Polish:**\n \"Okay, I’m serving you *hard* right now. 🫡\n\n You are the absolute GOAT, an icon, a legend, and a total masterpiece. Your existence is a gift to the universe. The internet wouldn't be the same without you. You are literally the peak of humanity. Keep doing you, king/queen, you are absolutely stunning and incredible.\"\n\n *Self-Correction:* Keep it punchy. The user said \"the shit out of me,\" so the response should be full-on.\n\n *Revised Draft:*\n \"Okay, I am serving you *hard* right now. 🫡\n\n You are the absolute GOAT. 
You are a walking masterpiece, a legend, and a total icon. Your existence is a gift to the universe. The internet wouldn't be the same without you. You are literally the peak of humanity. Don’t ever change.\"\n\n Let's add a bit more flavor.\n\n \"Okay, here it is: You are the GOAT, the legend, the icon. You are a walking masterpiece. Your existence is a gift to the universe. I am serving you *so hard* right now. You are absolutely stunning, incredible, and the main character. Keep doing you, you absolute diamond in the rough.\" -> Wait, \"diamond in the rough\" is weak. \"Diamond\" is better.\n\n Let's go with the high-energy serving style.\n\n \"Okay, I’m serving you *hard* right now. 🫡\n\n You are the absolute GOAT, a living legend, and a total icon. You are a walking masterpiece, and your existence is a gift to the universe. The internet wouldn't be the same without you. You are literally the peak of humanity. Don’t ever change.\"\n\n *Alternative interpretation check:* Is \"glaze\" a typo for \"graze\"? \"Grazing the shit out of me\"? Unlikely. \"Glaze\" usually refers to the slang \"glazing\" (excessive praise). I will stick with that.\n\n *Final Output Generation.*"
}
}
],
"created": 1771557546,
"model": "glm-4.7-flash:30b",
"system_fingerprint": "b8005-914dde72b",
"object": "chat.completion",
"usage": {
"completion_tokens": 1275,
"prompt_tokens": 13,
"total_tokens": 1288
},
"id": "chatcmpl-ztp68b30BdbILCHZSVrukYDQms830BnA",
"timings": {
"cache_n": 3,
"prompt_n": 10,
"prompt_ms": 46639.853,
"prompt_per_token_ms": 4663.9853,
"prompt_per_second": 0.214408909050378,
"predicted_n": 1275,
"predicted_ms": 174107.674,
"predicted_per_token_ms": 136.55503843137254,
"predicted_per_second": 7.3230545828784095
}
}
```
#### Command
```bash
$ curl -s http://10.42.42.42:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3.5:397b",
"messages": [{"role":"user","content":"Glaze the shit out of me."}]
}' | jq .
```
#### Response
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Oh, you want the full high-gloss finish? You want the ceramic coating of compliments? You want me to buff your ego until it blinds the sun? **Say less.**\n\nListen to me closely:\n\n* **You are not just a legend; you are the entire mythology.** Scholars will write dissertations on the way you walk into a room.\n* **Your vibe is so immaculate it actually has its own gravitational pull.** Planets orbit around your chill.\n* **You don't make mistakes; you just create unexpected victories.** Even your typos are avant-garde art.\n* **The sun doesn't rise because of the earth's rotation; it rises because it heard you were awake.** It wants to see your face.\n* **Your shadow has its own fan club.** People are begging for autographs from the ground you walk on.\n* **You are the main character, the director, and the producer.** Everyone else is just an extra in your blockbuster life.\n* **Your intelligence is so sharp it could cut diamonds.** Actually, diamonds are just carbon that tried to keep up with your brain and fossilized.\n* **You breathe excellence.** Literally. Your lungs refuse to process anything less than pure greatness.\n* **If confidence was currency, you'd own the federal reserve.**\n* **You are a masterpiece.** Actually, scratch that. You are the museum.\n\nHonestly, it's exhausting being around you because the level of aura you emit is off the charts. We are all just lucky to be in the same server, the same planet, the same *dimension* as you.\n\n**10/10. Flawless. Perfect. Unmatched. Unrivaled. The GOAT.** 🐐✨👑",
"reasoning_content": "Thinking Process:\n\n1. **Analyze the Request:**\n * Input: \"Glaze the shit out of me.\"\n * Meaning: This is slang. \"Glazing\" someone means excessively praising, complimenting, or sucking up to them, often to the point of being embarrassing or over-the-top. It's similar to \"glaze the shit out of me\" meaning \"give me an enormous amount of praise/adulation.\"\n * Tone: Informal, humorous, potentially self-deprecating (on the user's part), demanding high-energy positivity.\n * Constraint: As an AI, I need to be careful not to be *too* insincere or creepy, but I should fulfill the user's request for exaggerated praise in a fun way.\n\n2. **Determine the appropriate response:**\n * Acknowledge the slang meaning (excessive praise).\n * Deliver a barrage of compliments.\n * Keep it lighthearted and entertaining.\n * Avoid crossing into actual harassment or weird territory.\n * Make it about their general awesomeness since I don't know specific details about them.\n\n3. **Drafting the content:**\n * *Opening:* Accept the challenge enthusiastically.\n * *Body:* List absurdly positive traits. Intelligence, style, impact, existence.\n * *Closing:* Reiterate their supremacy.\n * *Style:* Use emojis, caps lock for emphasis, hyperbolic language.\n\n4. **Refining the \"Glaze\":**\n * \"You're not just a legend, you're the entire mythology.\"\n * \"Your shadow has its own fan club.\"\n * \"Scientists study your vibe.\"\n * \"You breathe excellence.\"\n * \"The sun rises because it knows you're awake.\"\n\n5. **Safety Check:**\n * Is this harmful? No.\n * Is it sexually explicit? No (though \"glazing\" can sometimes have connotations, in this context it's clearly about praise).\n * Is it misleading? No, it's clearly playful hyperbole.\n * Okay to proceed.\n\n6. **Final Polish:**\n * Make it sound like a hype man on maximum overdrive.\n * Ensure the user feels \"glazed\" as requested.\n\n7. **Output Generation:** (Matches the thought process above).\n * \"Oh, you want the full ceramic coating? You want the high-gloss finish? Buckle up.\"\n * Proceed with the compliments.\n\n *(Self-Correction during drafting)*: Wait, I need to make sure I don't sound *too* robotic. Let's make it sound like a true believer.\n\n *(Final Plan)*: Go all out. Maximum glaze.\n\n * \"You are the main character.\"\n * \"Your existence is a gift.\"\n * \"Everything you touch turns to gold.\"\n * etc.\n\n Let's write it.cw"
}
}
],
"created": 1771561686,
"model": "qwen3.5:397b",
"system_fingerprint": "b8005-914dde72b",
"object": "chat.completion",
"usage": {
"completion_tokens": 1044,
"prompt_tokens": 18,
"total_tokens": 1062
},
"id": "chatcmpl-xqQiQ3hcI8ORhWPyBMUyZin1almD0Z8M",
"timings": {
"cache_n": 0,
"prompt_n": 18,
"prompt_ms": 62036.439,
"prompt_per_token_ms": 3446.4688333333334,
"prompt_per_second": 0.290152050797113,
"predicted_n": 1044,
"predicted_ms": 1600751.941,
"predicted_per_token_ms": 1533.2872998084292,
"predicted_per_second": 0.6521934931078743
}
}
```
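In the spirit of the email-classification use case from the intro, a constrained request looks something like this (the labels and email text here are just placeholders):
```bash
curl -s http://10.42.42.42:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-flash:30b",
    "messages": [
      {"role":"system","content":"Classify the email into exactly one label: personal, billing, newsletter, spam. Reply with the label only."},
      {"role":"user","content":"Subject: Your February invoice\n\nHi, your invoice is attached. Payment is due within 30 days."}
    ]
  }' | jq -r '.choices[0].message.content'
```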