GLM-4.7-Q8_0.gguf capabilities with Ollama
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
It's missing the tools and thinking capabilities. I used llama.cpp to merge all the GGUF files. Any ideas or feedback, guys?
./llama-gguf-split --merge ../GLM-4.7-Q8_0-00001-of-00008.gguf ../GLM-4.7-Q8_0.gguf
echo "FROM GLM-4.7-Q8_0.gguf" > "GLM-4.7-Q8_0.model"
ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0.model
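Before importing into Ollama, a quick smoke test with llama.cpp can confirm the merge produced a loadable file (a minimal sketch; paths are illustrative):

# merged file size should roughly equal the sum of the eight shards
ls -lh ../GLM-4.7-Q8_0.gguf
# load it once and generate a few tokens
./llama-cli -m ../GLM-4.7-Q8_0.gguf -p "Hello" -n 16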
I managed to make it work by using the Modelfile below:
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant.
Use <think>...</think> for internal reasoning.
Provide a concise final answer after thinking.
"""
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>
{{- if and $.IsThinkSet (and $last .Thinking) -}}
<think>{{ .Thinking }}</think>
{{- end }}{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>
{{- if and $.IsThinkSet (not $.Think) -}}
<think></think>
{{- end -}}
{{- end }}
{{- end }}"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER num_predict 16384
PARAMETER num_ctx 16384
PARAMETER num_gpu -1
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
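To check the thinking path end to end, one option is to call Ollama's chat API directly and look for a separate thinking field in the response (a minimal sketch; the prompt is arbitrary and the think option needs a recent Ollama build):

curl http://localhost:11434/api/chat -d '{
  "model": "GLM-4.7-Q8_0",
  "stream": false,
  "think": true,
  "messages": [{"role": "user", "content": "What is 17 * 23?"}]
}'
# a working template should return "thinking" alongside "content" in the message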
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    thinking

  Parameters
    min_p             0.01
    num_ctx           16384
    num_gpu           -1
    num_predict       16384
    repeat_penalty    1
    stop              "<|end▁of▁sentence|>"
    stop              "<|User|>"
    temperature       0.6
    top_p             0.95

  System
    You are a reasoning-focused assistant.
    Use <think>...</think> for internal reasoning.
    ...
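Recent Ollama builds can also toggle reasoning from the CLI (flag availability depends on your version; a sketch):

ollama run GLM-4.7-Q8_0:latest --think=true "Why is the sky blue?"
ollama run GLM-4.7-Q8_0:latest --think=false "Why is the sky blue?"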
I don't use tools in these, but Q8 is so-so in my tests; I need to try BF16, which is nearly the original. Frankly, I have 768 GB of RAM on an ancient 2014-era Xeon motherboard. GLM-4.5 BF16 ran perfectly in the past in oobabooga (in LM Studio such sizes always crash), while with a bare llama.cpp server all chats can be lost if the computer reboots (a very common problem when such large models fill 99% of memory).
Prompt hacks don't help in Q8. The GLM-4.5 Q8 from Unsloth gave much better code results.
Guide to using oobabooga:
1. Download the latest release of text-generation-webui portable from GitHub (it's distributed as a single package, like ComfyUI for Windows), and unzip it.
2. Drop your models into the models folder inside user_data (for GGUF, the file size is roughly the amount of RAM (RAM+VRAM) needed). Super-large models like Kimi K2 or DeepSeek Speciale can obviously only be used from an external drive, so the path needs to be written in the CMD_FLAGS.txt file (in the user_data folder), like --model-dir /drive/your/model/folder - see the example after this list.
3. Launch via start_linux (or the script for your OS) in a web browser (avoid RAM-hungry browsers like Chrome). In the Model section choose your model, then tune the launch settings:
- gpu-layers: raise this to use GPU+CPU, or set 0 for CPU only.
- ctx-size: important - the context size of the conversation; more context = more RAM.
- cpu-moe and streaming-llm: your choice.
- Other options are important too: Threads is the number of your CPU cores, threads_batch the number of CPU threads.
- batch_size can be tuned afterwards; it affects how quickly answers to prompts are generated.
- no-mmap and numa can be used by some; as I remember, no-mmap avoids using the storage drive for model space, and numa is for non-uniform memory access.
- Many other settings can be experimented with. Click the Load button above and wait for confirmation that the model has loaded into RAM (or RAM+VRAM). It's useful to keep a system-resources app open to check used RAM: with very big models usually all RAM is used, leaving only a minimum for the OS itself, so RAM-eating apps need to be closed if the model won't load or is super slow (which usually means Linux has started using the SSD for model space).
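As a concrete starting point, CMD_FLAGS.txt can hold the same flags you would pass on the command line (values here are illustrative; flag names can differ between versions):

# user_data/CMD_FLAGS.txt
--model-dir /drive/your/model/folder
--threads 16
--ctx-size 16384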
The user_data folder, with all your models/chats/settings, can be migrated into any new version of oobabooga.
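For example, moving to a new portable release is just a folder copy (names are illustrative):

unzip textgen-portable-new.zip -d textgen-new
rm -rf textgen-new/user_data   # drop the fresh defaults
cp -r textgen-old/user_data textgen-new/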
Oobabooga is also distributed as a Docker container, possibly useful for corporate environments: http://github.com/ashleykleynhans/text-generation-docker
I would probably use the chat template from https://ollama.com/MichelRosselli/GLM-4.6:latest/blobs/e683b5dab156 for Ollama - they also use our quants, so I'm assuming these chat templates work for Ollama:
FROM GLM-4.7-Q8_0.gguf
[gMASK]
{{- if .Tools }}<|system|>
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>{"name": <function-name>, "arguments": <args-json-object>}</tool_call>
{{- end -}}
{{- $lastUserIdx := -1 }}
{{- range $i, $_ := .Messages }}
{{- if eq .Role "user" }}{{- $lastUserIdx = $i }}{{ end }}
{{- end -}}
{{- $prevWasTool := false -}}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- $curIsTool := eq .Role "tool" -}}
{{- $startToolBlock := and $curIsTool (not $prevWasTool) -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- if and $.IsThinkSet (not $.Think) -}}
/nothink
{{- end -}}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) }}
<think>{{ .Thinking }}</think>
{{- else if $.IsThinkSet }}
<think></think>
{{- end }}
{{- if .Content }}
{{ .Content }}
{{- end -}}
{{ if .ToolCalls }}
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- end }}
{{- else if $curIsTool -}}
{{ if not $prevWasTool }}<|observation|>
{{- end }}
{{ .Content }}
{{- $prevWasTool = true -}}
{{- else if eq .Role "system" -}}<|system|>
{{ .Content }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|assistant|>
{{- if and $.IsThinkSet (not $.Think) }}
<think></think>
{{- end -}}
{{- end }}
{{- $prevWasTool = $curIsTool -}}
{{- end }}
$ ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0-3.model
Error: (line 3): command must be one of "from", "license", "template", "system", "adapter", "renderer", "parser", "parameter", or "message"
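The error is consistent with the raw chat template having been pasted as bare lines: after FROM, every line must start with a Modelfile directive, so the template body has to be wrapped in a TEMPLATE block (a sketch; the ellipsis stands for the template body above):

FROM GLM-4.7-Q8_0.gguf
TEMPLATE """[gMASK]
...
"""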
$ cat GLM-4.7-Q8_0-3.model
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant with tool-calling capabilities.
- Use <think>...</think> for internal reasoning.
- If a tool is needed, use it.
- When you receive a tool observation, incorporate it into your final concise answer.
"""
TEMPLATE """[gMASK]{{- if .System }}<|system|>
{{ .System }}
{{- if .Tools }}
Available tools:
{{- range .Tools }}
{{ . }}
{{- end }}
{{- end }}
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if .Thinking }}
<think>{{ .Thinking }}</think>
{{ end }}
{{- if .ToolCalls }}<|observation|>
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- else }}
{{ .Content }}<|endoftext|>
{{- end }}
{{- else if eq .Role "tool" }}<|observation|>
{{ .Content }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|assistant|>
{{- end }}
{{- end }}"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|observation|>"
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    tools
    thinking

  Parameters
    stop           "<|user|>"
    stop           "<|assistant|>"
    stop           "<|endoftext|>"
    stop           "<|observation|>"
    temperature    0.6
    top_p          0.95

  System
    You are a reasoning-focused assistant with tool-calling capabilities.
    - Use <think>...</think> for internal reasoning.
    ...
The latest model file works, but when I tested tool calling with VS Code it didn't really work. I'd appreciate others' input.
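To isolate whether the template or the VS Code client is at fault, it may help to call the Ollama API directly with a trivial tool (a sketch; the tool schema is made up for illustration):

curl http://localhost:11434/api/chat -d '{
  "model": "GLM-4.7-Q8_0",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# a working template/parser should return a "tool_calls" array in the response message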
@Danielhanchen, should I download the latest GLM-4.7-Q8_0-00001-of-00008.gguf and retry?
Yes please do!