Can't wait for a q4 quant from you

#1
by mtcl - opened

:) I have been refreshing every 5 minutes to check if you posted yet. 😂

@ubergarm thank you for the checklist in the readme. That keeps me sane.

@ubergarm

I was copying and pasting my llama.cpp template command for 2x 6000 Pro, which would have been this...

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
    --model /media/mukul/data/models/ubergarm/GLM-4.7-GGUF/IQ3_KS/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
    --alias ubergarm/GLM-4.7-IQ3_KS \
    --ctx-size 102400 \
    -fa on \
    -np 1 -kvu \
    --temp 0.7 \
    --top-p 0.95 \
    --top-k 40 \
    -ngl 99 \
    --parallel 1 \
    --threads 56 \
    --jinja \
    --host 0.0.0.0 \
    --port 10002

But then I noticed you have some new flags that I have not seen. You have this instruction for 2x GPUs, and the -ger, -smgs, --n-cpu-moe, etc. flags are all new to me!

# Hybrid CPU + 2 or more GPUs
# using new "-sm graph" 'tensor parallel' feature!
# https://github.com/ikawrakow/ik_llama.cpp/pull/1080
./build/bin/llama-sweep-bench \
    --model /media/mukul/data/models/ubergarm/GLM-4.7-GGUF/IQ3_KS/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
    --alias ubergarm/GLM-4.7 \
    --ctx-size 65536 \
    -ger \
    -sm graph \
    -smgs \
    -mea 256 \
    -ngl 99 \
    --n-cpu-moe 72 \
    -ts 41,48 \
    -ub 4096 -b 4096 \
    --threads 56 \
    --parallel 1 \
    --host 0.0.0.0 \
    --port 10002 \
    --no-mmap \
    --jinja

@mtcl

If your build ended up with exactly 2x 6000 PROs, you're gonna love the new -sm graph features ik has been adding over the past few weeks! Doesn't work for all models yet, but should be working for GLMs.

> @mtcl
>
> If your build ended up with exactly 2x 6000 PROs,

Here is my build:

System Specifications for Hybrid CPU/GPU Inference

System Overview
- OS: Ubuntu 24.04.3 LTS
- Kernel: Linux 6.14.0-37-generic

Motherboard Specifications
- Model: ASUS Pro WS W790E-SAGE SE
- Platform: Intel W790 chipset

CPU Specifications
- Processor: Intel Xeon w9-3495X
- Cores: 56 physical cores / 112 threads
- AMX (Advanced Matrix Extensions): full support
- NUMA Nodes: 1

GPU Specifications
- Graphics Cards: 2x NVIDIA RTX PRO 6000 (Blackwell)
- VRAM per GPU: 96 GB
- Driver Version: 580.95.05
- CUDA Version: 13.0
- Power Consumption: 600 W max each

Memory Configuration
- Total RAM: 512 GB
- Memory Type: DDR5 (8x DIMM slots populated)
- Memory Speed: DDR5-4800+, stable overclocked to DDR5-6000
- Memory Architecture: 8-channel DDR5 configuration
- Swap: 0 GB (disabled)
- Available for Inference: 503 GB total capacity for model loading

> you're gonna love the new -sm graph features ik has been adding over the past few weeks! Doesn't work for all models yet, but should be working for GLMs.

I think I can definitely feel that! I noticed my rig was suddenly pulling 1600 watts: both 6000 PROs started maxing out at 600 W each, which never happened with llama.cpp. Below is the command I used. I am getting amazing speed with this IQ3_KS model: I can do 128K context, --jinja works, and prompt processing and token generation are both better than llama.cpp! I keep flip-flopping between this and minimax-m2-nvfp4 with sglang.

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
    --model /media/mukul/data/models/ubergarm/GLM-4.7-GGUF/IQ3_KS/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
    --alias ubergarm/GLM-4.7 \
    --ctx-size 131072 \
    -ger \
    -sm graph \
    -smgs \
    -mea 256 \
    -ngl 99 \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --threads 56 \
    --parallel 1 \
    --host 0.0.0.0 \
    --port 10002 \
    --no-mmap \
    --jinja
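
For anyone who wants to see the same power behavior, there is nothing fancy behind those numbers, just the stock nvidia-smi query loop while the server is running (the 1-second interval is arbitrary):

# Print index, power draw, and power limit for each GPU once per second
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1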

@mtcl

Sweet, thanks for reporting your findings for various configurations!

One thing about GLM-4.6 is that using -ctk q8_0 -ctv q8_0 is fine for quality, but at least in the past it slowed things down more and more as context length grew. If you don't need the full 128k context, leaving the KV cache at the default unquantized f16 might give you more speed, but I haven't checked for sure on this latest version of everything haha... Hard to keep up even for me! xD

Oh, also: if you're using -ngl 99 and the model is fully offloaded like in this situation, drop the threads down to -t 1 to get a few more percent boost, since less synchronization is needed on the otherwise unused CPU threads.
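
Roughly something like this, putting both of those together (I haven't run this exact line; it's just your command from above with the KV cache flags dropped, a smaller context so the f16 cache still fits, and a single CPU thread):

# Untested sketch combining both suggestions:
#  - no -ctk/-ctv, so the KV cache stays at the default f16
#  - --threads 1, since the model is fully offloaded with -ngl 99
#  - smaller --ctx-size so the f16 cache still fits in VRAM (adjust to taste)
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
    --model /media/mukul/data/models/ubergarm/GLM-4.7-GGUF/IQ3_KS/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
    --alias ubergarm/GLM-4.7 \
    --ctx-size 65536 \
    -ger \
    -sm graph \
    -smgs \
    -mea 256 \
    -ngl 99 \
    -ub 4096 -b 4096 \
    --threads 1 \
    --parallel 1 \
    --host 0.0.0.0 \
    --port 10002 \
    --no-mmap \
    --jinja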

Also, definitely consider checking out LACT to undervolt your GPUs a little bit to save on power. Based on anecdotal reports about RTX 6000s I've heard on the BeaverAI Discord channels, you can probably get similar speed at 75% or less power usage. GitHub issue with a lot of discussion: https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-2748315620
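
If you want a quick-and-dirty preview before setting up LACT properly, just capping the board power limit with stock nvidia-smi gets you part of the way (not a real undervolt like LACT does, 450 W is only an example figure, and it resets on reboot):

# Rough power cap as a stand-in for a proper undervolt via LACT.
# 450 W is just an example; check the supported range first with:
#   nvidia-smi -q -d POWER
sudo nvidia-smi -i 0,1 -pl 450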

As for the quality of GLM-4.7, it seems to be an improvement and is working well with my pydantic-ai framework tool-calling agent experiments. I ended up adding --special on there too, but I'm not sure it is required. It seems to be working fine with the default built-in chat template too. It does seem to think more than I would prefer; that can be disabled, but I'm not sure of an easy way to do so in my client setup.
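
If anyone wants to sanity-check tool calling without pulling in pydantic-ai, a bare curl against the server's OpenAI-compatible endpoint is enough; the port and alias match the commands above, and the get_weather tool here is just a dummy I made up:

# Minimal tool-call smoke test against llama-server's OpenAI-compatible API
# (alias/port from the commands above; get_weather is a placeholder tool).
curl -s http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ubergarm/GLM-4.7",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'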

Thanks!
