---
library_name: transformers
tags:
- torchao
- qwen
- qwen3
- nlp
- chat
- conversational
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---

# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation/

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with a torchao optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The checkpoint uploaded here was trained with a learning rate of 4.5e-5 on 32 GPUs with a per-device batch size of 2 using an internal codebase. An open source implementation of the training script is provided below. Adjust the `ngpu`, `device_batch_size`, `grad_accum_steps`, and `lr` variables to fit your setup.

Fetch the training script with `curl -O https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py` before running the command below.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-fineweb-${SEED}
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=4.5e-5

TRANSFORMERS_VERBOSITY=error TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True HF_HUB_DISABLE_XET=1 \
torchrun \
  --nproc-per-node $ngpu \
  --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
  -m qat_sft \
  --model_name_or_path Qwen/Qwen3-4B \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size $device_batch_size \
  --gradient_accumulation_steps $grad_accum_steps \
  --dataset_name HuggingFaceFW/fineweb-edu \
  --dataset_train_split "train[:10%]" \
  --dataloader_num_workers 4 \
  --max_length 8192 \
  --save_total_limit 1 \
  --report_to tensorboard \
  --logging_steps 2 \
  --learning_rate $lr \
  --lr_scheduler_type linear \
  --warmup_ratio 0.0 \
  --seed $SEED \
  --output_dir $SAVE_DIR \
  --enable_thinking \
  --weight_bits 2 \
  --linear_pat 'proj\.weight$' \
  --embed_pat '(lm_head|embed_tokens)'
```
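For orientation, the core of the training script is PARQ's optimizer wrapper: a regular optimizer is wrapped so that the weights selected by `--linear_pat` are progressively pulled onto a low-bit grid during finetuning, while all other parameters train in full precision. The snippet below is only a minimal sketch of that idea, assuming the interface described in the PARQ README; the import paths, class names, and constructor arguments are assumptions and may differ across torchao versions, so treat `qat_sft.py` as the source of truth.

```py
# Minimal PARQ sketch (assumed API, following the torchao PARQ README;
# check torchao/prototype/parq for the current interface).
import torch
from transformers import AutoModelForCausalLM
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype="auto")

# Split parameters: projection weights (matching --linear_pat 'proj\.weight$')
# are quantized; everything else (norms, embeddings here) stays full precision.
params_quant = [p for n, p in model.named_parameters() if n.endswith("proj.weight")]
params_other = [p for n, p in model.named_parameters() if not n.endswith("proj.weight")]
param_groups = [
    {"params": params_quant, "quant_bits": 2},  # matches --weight_bits 2
    {"params": params_other},
]

base_optimizer = torch.optim.AdamW(param_groups, lr=4.5e-5)
optimizer = QuantOptimizer(
    base_optimizer,
    UnifQuantizer(),                            # uniform low-bit grid
    ProxPARQ(anneal_start=0, anneal_end=1000),  # gradually hardens weights onto the grid
)
# `optimizer` is then used in place of the usual optimizer in the training loop;
# each optimizer.step() applies PARQ's proximal update after the AdamW step.
```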
## Generation from Quantized Model

Note: to `push_to_hub` you need to run

```sh
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

and use a token with write access, from https://huggingface.co/settings/tokens

To get the quantized model, run the following from the root of `hf-scripts/`:

```py
import os

from huggingface_hub import get_token, whoami
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)

set_seed(0)
model_path = "checkpoints/qwen3-2bit-fineweb-<SEED>"  # the $SAVE_DIR from the QAT run above
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```

The response from manual testing is:

```txt
Yes, I am conscious and can communicate with you. How can I be of service to you?
```

# Model Quality

| Benchmark | Qwen3-4B | Qwen3-4B-PARQ |
| --- | :---: | :---: |
| arc_easy | 80.26 | 73.19 |
| arc_challenge | 53.92 | 47.27 |
| boolq | 85.11 | 69.11 |
| hellaswag | 68.49 | 66.67 |
| piqa | 74.97 | 75.24 |
| winogrande | 65.67 | 65.19 |
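This card does not state which harness produced these numbers. If you want to reproduce them, an evaluation with EleutherAI's lm-evaluation-harness along the lines of the command below is the usual route; the task names match the table, but the harness, flags, and batch size are assumptions rather than the exact setup used here, so expect small differences.

```bash
# Hypothetical reproduction with lm-evaluation-harness (pip install lm-eval).
# Tasks match the table above; exact numbers may differ from the card.
lm_eval --model hf \
  --model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared,dtype=bfloat16 \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,piqa,winogrande \
  --batch_size 8
```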
# Exporting to ExecuTorch

⚠️ **Note:** These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch). Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

To set up ExecuTorch, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
```

Next, install the latest version of torchao:

```bash
git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd
```

(The above command installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the following environment variables before pip installing torchao: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used in Hugging Face. The following script converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:

```bash
python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
```

Once we have the converted checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 using the torchao lowbit kernels, as follows. To export, we must be on an Arm-based Mac or Linux machine. (Note: the ExecuTorch LLM export script requires config.json to have certain key names. The correct config to use is located at `examples/models/qwen3/config/4b_config.json` within the ExecuTorch repo.)

```bash
python -m executorch.examples.models.llama.export_llama \
  --model "qwen3_4b" \
  --checkpoint pytorch_model_converted.bin \
  --params examples/models/qwen3/config/4b_config.json \
  --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 1024 \
  --dtype fp32 \
  --metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
```

After that you can run the exported `model.pte` in a mobile app using ExecuTorch's example LLM demo apps.

(We try to keep these instructions up-to-date, but if you find they do not work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know we need to update our model card.)
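Optionally, before deploying to a phone, you can smoke-test the exported `model.pte` on the host with ExecuTorch's example Llama runner. The commands below are a rough sketch based on the ExecuTorch Llama example README and assume you have already built the `llama_main` binary per that guide; the binary path, flag names, and tokenizer handling are assumptions that may differ between releases, so defer to the linked CI script if anything does not match.

```bash
# Hypothetical host-side smoke test (paths and flags follow the ExecuTorch
# Llama example README and may differ between releases).
hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared tokenizer.json --local-dir .

cmake-out/examples/models/llama/llama_main \
  --model_path=model.pte \
  --tokenizer_path=tokenizer.json \
  --prompt="Hey, are you conscious? Can you talk to me?"
```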