GLM-4.7-Q8_0.gguf capabilities with Ollama
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
It's missing the tools and thinking capabilities. I used llama.cpp to merge all the GGUF files. Any ideas or feedback, guys?
./llama-gguf-split --merge ../GLM-4.7-Q8_0-00001-of-00008.gguf ../GLM-4.7-Q8_0.gguf
echo "FROM GLM-4.7-Q8_0.gguf" > "GLM-4.7-Q8_0.model"
ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0.model
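Before importing into Ollama, a quick smoke test with llama.cpp can confirm the merge produced a loadable file (a minimal sketch; paths are illustrative):

# merged file size should roughly equal the sum of the eight shards
ls -lh ../GLM-4.7-Q8_0.gguf
# load it once and generate a few tokens
./llama-cli -m ../GLM-4.7-Q8_0.gguf -p "Hello" -n 16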
I managed to make it work by using the Modelfile below:
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant.
Use <think>...</think> for internal reasoning.
Provide a concise final answer after thinking.
"""
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>
{{- if and $.IsThinkSet (and $last .Thinking) -}}
<think>{{ .Thinking }}</think>
{{- end }}{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>
{{- if and $.IsThinkSet (not $.Think) -}}
<think></think>
{{- end -}}
{{- end }}
{{- end }}"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER num_predict 16384
PARAMETER num_ctx 16384
PARAMETER num_gpu -1
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
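To check the thinking path end to end, one option is to call Ollama's chat API directly and look for a separate thinking field in the response (a minimal sketch; the prompt is arbitrary and the think option needs a recent Ollama build):

curl http://localhost:11434/api/chat -d '{
  "model": "GLM-4.7-Q8_0",
  "stream": false,
  "think": true,
  "messages": [{"role": "user", "content": "What is 17 * 23?"}]
}'
# a working template should return "thinking" alongside "content" in the message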
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    thinking

  Parameters
    min_p             0.01
    num_ctx           16384
    num_gpu           -1
    num_predict       16384
    repeat_penalty    1
    stop              "<|end▁of▁sentence|>"
    stop              "<|User|>"
    temperature       0.6
    top_p             0.95

  System
    You are a reasoning-focused assistant.
    Use <think>...</think> for internal reasoning.
    ...
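Recent Ollama builds can also toggle reasoning from the CLI (flag availability depends on your version; a sketch):

ollama run GLM-4.7-Q8_0:latest --think=true "Why is the sky blue?"
ollama run GLM-4.7-Q8_0:latest --think=false "Why is the sky blue?"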
I don't use tools in these, but Q8 is so-so in my tests; I need to try BF16, which is nearly the original. Frankly, I have 768 GB of RAM on an ancient 2014-era Xeon motherboard. GLM-4.5 BF16 ran perfectly in the past in oobabooga (in LM Studio such sizes always crash), while with a bare llama.cpp server all chats can be lost if the computer reboots (a very common problem when such large models fill 99% of memory).
Prompt hacks don't help in Q8. The GLM-4.5 Q8 from Unsloth gave much better code results.
Guide to using oobabooga:
1. Download the latest release of text-generation-webui portable from GitHub (it's distributed as a single package, like ComfyUI for Windows), and unzip it.
2. Drop your models into the models folder inside user_data (for GGUF, the file size is roughly the amount of RAM (RAM+VRAM) needed). Super-large models like Kimi K2 or DeepSeek Speciale can obviously only be used from an external drive, so the path needs to be written in the CMD_FLAGS.txt file (in the user_data folder), like --model-dir /drive/your/model/folder - see the example after this list.
3. Launch via start_linux (or the script for your OS) in a web browser (avoid RAM-hungry browsers like Chrome). In the Model section choose your model, then tune the launch settings:
- gpu-layers: raise this to use GPU+CPU, or set 0 for CPU only.
- ctx-size: important - the context size of the conversation; more context = more RAM.
- cpu-moe and streaming-llm: your choice.
- Other options are important too: Threads is the number of your CPU cores, threads_batch the number of CPU threads.
- batch_size can be tuned afterwards; it affects how quickly answers to prompts are generated.
- no-mmap and numa can be used by some; as I remember, no-mmap avoids using the storage drive for model space, and numa is for non-uniform memory access.
- Many other settings can be experimented with. Click the Load button above and wait for confirmation that the model has loaded into RAM (or RAM+VRAM). It's useful to keep a system-resources app open to check used RAM: with very big models usually all RAM is used, leaving only a minimum for the OS itself, so RAM-eating apps need to be closed if the model won't load or is super slow (which usually means Linux has started using the SSD for model space).
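As a concrete starting point, CMD_FLAGS.txt can hold the same flags you would pass on the command line (values here are illustrative; flag names can differ between versions):

# user_data/CMD_FLAGS.txt
--model-dir /drive/your/model/folder
--threads 16
--ctx-size 16384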
The user_data folder, with all your models/chats/settings, can be migrated into any new version of oobabooga.
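For example, moving to a new portable release is just a folder copy (names are illustrative):

unzip textgen-portable-new.zip -d textgen-new
rm -rf textgen-new/user_data   # drop the fresh defaults
cp -r textgen-old/user_data textgen-new/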
Oobabooga is also distributed as a Docker container, possibly useful for corporate environments: http://github.com/ashleykleynhans/text-generation-docker
I would probably use the chat template from https://ollama.com/MichelRosselli/GLM-4.6:latest/blobs/e683b5dab156 for Ollama - they also use our quants, so I'm assuming these chat templates work for Ollama:
FROM GLM-4.7-Q8_0.gguf
[gMASK]
{{- if .Tools }}<|system|>
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>{"name": <function-name>, "arguments": <args-json-object>}</tool_call>
{{- end -}}
{{- $lastUserIdx := -1 }}
{{- range $i, $_ := .Messages }}
{{- if eq .Role "user" }}{{- $lastUserIdx = $i }}{{ end }}
{{- end -}}
{{- $prevWasTool := false -}}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- $curIsTool := eq .Role "tool" -}}
{{- $startToolBlock := and $curIsTool (not $prevWasTool) -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- if and $.IsThinkSet (not $.Think) -}}
/nothink
{{- end -}}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) }}
<think>{{ .Thinking }}</think>
{{- else if $.IsThinkSet }}
<think></think>
{{- end }}
{{- if .Content }}
{{ .Content }}
{{- end -}}
{{ if .ToolCalls }}
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- end }}
{{- else if $curIsTool -}}
{{ if not $prevWasTool }}<|observation|>
{{- end }}
{{ .Content }}
{{- $prevWasTool = true -}}
{{- else if eq .Role "system" -}}<|system|>
{{ .Content }}
{{- end }}
{{- if and (ne .Role "assistant") $last }}<|assistant|>
{{- if and $.IsThinkSet (not $.Think) }}
<think></think>
{{- end -}}
{{- end }}
{{- $prevWasTool = $curIsTool -}}
{{- end }}
$ ollama create GLM-4.7-Q8_0 -f GLM-4.7-Q8_0-3.model
Error: (line 3): command must be one of "from", "license", "template", "system", "adapter", "renderer", "parser", "parameter", or "message"
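The error is consistent with the raw chat template having been pasted as bare lines: after FROM, every line must start with a Modelfile directive, so the template body has to be wrapped in a TEMPLATE block (a sketch; the ellipsis stands for the template body above):

FROM GLM-4.7-Q8_0.gguf
TEMPLATE """[gMASK]
...
"""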
$ cat GLM-4.7-Q8_0-3.model
FROM GLM-4.7-Q8_0.gguf
SYSTEM """You are a reasoning-focused assistant with tool-calling capabilities.
- Use <think>...</think> for internal reasoning.
- If a tool is needed, use it.
- When you receive a tool observation, incorporate it into your final concise answer.
"""
TEMPLATE """[gMASK]{{- if .System }}<|system|>
{{ .System }}
{{- if .Tools }}
Available tools:
{{- range .Tools }}
{{ . }}
{{- end }}
{{- end }}
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|user|>
{{ .Content }}
{{- else if eq .Role "assistant" }}<|assistant|>
{{- if .Thinking }}
<think>{{ .Thinking }}</think>
{{ end }}
{{- if .ToolCalls }}<|observation|>
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}
{{- else }}
{{ .Content }}<|endoftext|>
{{- end }}
{{- else if eq .Role "tool" }}<|observation|>
{{ .Content }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|assistant|>
{{- end }}
{{- end }}"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|observation|>"
$ ollama show GLM-4.7-Q8_0:latest
  Model
    architecture        glm4moe
    parameters          358.3B
    context length      202752
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    tools
    thinking

  Parameters
    stop           "<|user|>"
    stop           "<|assistant|>"
    stop           "<|endoftext|>"
    stop           "<|observation|>"
    temperature    0.6
    top_p          0.95

  System
    You are a reasoning-focused assistant with tool-calling capabilities.
    - Use <think>...</think> for internal reasoning.
    ...
The latest model file works, but when I tested tool calling with VS Code it didn't really work. I'd appreciate others' input.
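To isolate whether the template or the VS Code client is at fault, it may help to call the Ollama API directly with a trivial tool (a sketch; the tool schema is made up for illustration):

curl http://localhost:11434/api/chat -d '{
  "model": "GLM-4.7-Q8_0",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# a working template/parser should return a "tool_calls" array in the response message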
@Danielhanchen, should I download the latest GLM-4.7-Q8_0-00001-of-00008.gguf and retry?
Yes please do!