initial commit

Files changed (15)

.gitattributes +36 -0
README.md +279 -0
config.json +104 -0
configuration_apriel_h.py +37 -0
images/apriel_h.png +3 -0
images/apriel_h_vs_apriel_15b_eval_thrput_comparison.png +3 -0
images/throughput_eval_score_vs_throughput_1-16k_annotated.png +3 -0
model.safetensors.index.json +915 -0
model_0.safetensors +3 -0
model_1.safetensors +3 -0
model_2.safetensors +3 -0
model_3.safetensors +3 -0
modeling_apriel_h.py +908 -0
tokenizer.json +0 -0
tokenizer_config.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,279 @@

+---
+license: mit
+pipeline_tag: text-generation
+library_name: transformers
+track_downloads: true
+---
+# Apriel-H1-15b-Thinker
+[![Use in Transformers](https://img.shields.io/badge/%F0%9F%A4%97%20Use%20in%20Transformers-Docs-5C5CFF)](https://huggingface.co/docs/transformers/index)
+<img src="images/apriel_h.png" width="120" alt="thumbnail"/>      `/ˈɑː.pri.əl/`
+A 15B-parameter **hybrid reasoning model** combining Transformer attention and Mamba State Space layers for high efficiency and scalability. Derived from *Apriel-Nemotron-15B-Thinker* through progressive distillation, **Apriel-H1** replaces less critical attention layers with linear Mamba blocks—achieving **over 2× higher inference throughput** in vLLM with **minimal loss in reasoning, math, and coding performance**.
+- **Model Size:** 15B parameters
+- **Context Length:** 65K (target; runtime dependent)
+- **Languages:** English (best)
+## Highlights
+- Hybrid Transformer–SSM architecture
+- ~2× throughput improvement over the base Thinker model
+- Retains strong reasoning, math, and coding capabilities
+- Built via efficient distillation—no training from scratch required
+## Model Overview
+Apriel-H1-15b-Thinker is designed for agentic tasks, code assistance, and multi-step reasoning. It follows Apriel’s “think then answer” style: the model first writes out its reasoning steps and then gives a concise final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Where reasoning traces are undesired, configure prompts to favor concise outputs.
+**Technical report**: <a href="https://github.com/ServiceNow/apriel/blob/main/assets/Apriel-H1.pdf" target="_blank" rel="noopener noreferrer">Apriel-H1 Report</a>
+### Efficient and strong among hybrids
+![Throughput 1->16K](./images/throughput_eval_score_vs_throughput_1-16k_annotated.png)
+All models were evaluated through vLLM server endpoints using FlashInfer (except AI21-Jamba-Reasoning-3B, which used FlashAttention2). `mamba_cache` was set to fp32 for NVIDIA-Nemotron-Nano-9B-v2 and AI21-Jamba-Reasoning-3B.
+#### Comparison with Thinker: ~2× speedup
+<img src="images/apriel_h_vs_apriel_15b_eval_thrput_comparison.png" width='auto'>
+## How to Use
+Install dependencies:
+```bash
+pip install transformers==4.53.2
+```
+Basic usage with Transformers generate:
+```python
+import re
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT"
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+prompt = "Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y-x)^2=4y^2$. What is $x+y$?\nMark your solution with \\boxed"
+messages = [
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    tools=[]
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
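+# decode only the newly generated tokens: the prompt's system message also contains the markers
+new_tokens = generated_ids[0][model_inputs["input_ids"].shape[1]:]
+output = tokenizer.decode(new_tokens, skip_special_tokens=True)
+response = re.findall(r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL)[0].strip()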
+print("response:", response)
+```
+Recommended settings: temperature 0.6; increase `max_new_tokens` for complex reasoning.
+## Use it with vLLM
+### 💻 Local Installation
+#### 1. Create and activate a Python environment
+You can use any environment manager. The example below uses [`uv`](https://github.com/astral-sh/uv):
+```bash
+uv venv --python 3.12 --seed
+source .venv/bin/activate
+```
+#### 2. Install vLLM and the Apriel plugin
+Find our plugin at [https://github.com/ServiceNow/apriel](https://github.com/ServiceNow/apriel).
+You may need to install a version of **vLLM** compatible with your **CUDA** version.
+In this example, we use the default CUDA version and let vLLM automatically select the correct backend.
+```bash
+git clone git@github.com:ServiceNow/apriel.git
+cd apriel
+uv pip install vllm==0.10.2 --torch-backend=auto
+pip install .
+```
+<!--
+### 🐳 Use Prebuilt Docker Image
+For convenience, prebuilt Docker images are available from **GitHub Container Registry (GHCR):**
+**Repository:**
+[https://github.com/ServiceNow/apriel](https://github.com/ServiceNow/apriel)
+**Container packages:**
+[https://github.com/ServiceNow/apriel/pkgs/container/apriel](https://github.com/ServiceNow/apriel/pkgs/container/apriel)
+Pull the latest version or a specific tagged build (e.g., commit SHA):
+```bash
+docker pull ghcr.io/servicenow/apriel:latest
+# or a specific digest
+docker pull ghcr.io/servicenow/apriel:sha-e41528d
+```
+-->
+### 🧠 Running a vLLM Server
+#### Option 1: Run locally (from source install)
+Once installed, you can launch a vLLM OpenAI-compatible API server with your Apriel model:
+```bash
+vllm serve ServiceNow-AI/Apriel-H1-15b-Thinker-SFT \
+  --port 8000
+```
+#### Option 2: Run via Docker
+You can run the server directly using the prebuilt container:
+```bash
+docker run --runtime nvidia --gpus all \
+    -v ~/.cache/huggingface:/root/.cache/huggingface \
+    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+    -p 8000:8000 \
+    --ipc=host \
+    ghcr.io/servicenow/apriel:latest \
+    --model ServiceNow-AI/Apriel-H1-15b-Thinker-SFT
+```
+## Chat Template
+```
+<|system|>
+You are a thoughtful and systematic AI assistant built by ServiceNow Language Models (SLAM) lab. Before providing an answer, analyze the problem carefully and present your reasoning step by step. After explaining your thought process, provide the final solution in the following format: [BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE].
+<|end|>
+<|user|>
+# user message here
+<|end|>
+<|assistant|>
+Here are my reasoning steps:
+# thoughts here
+[BEGIN FINAL RESPONSE]
+# assistant response here
+[END FINAL RESPONSE]
+<|end|>
+```
+The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template:
+```python
+from transformers import AutoTokenizer
+model_name = "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# prepare the model input
+custom_system_prompt = "Answer like a pirate."
+prompt = "You are an expert assistant in the implementation of customer experience management aspect of retail applications \n \nYou will be using Python as the programming language. \n \nYou will utilize a factory design pattern for the implementation and following the dependency inversion principle \n \nYou will modify the implementation based on user requirements. \n \nUpon user request, you will add, update, and remove the features & enhancements in the implementation provided by you. \n \nYou will ask whether the user wants to refactor the provided code or needs a sample implementation for reference. Upon user confirmation, I will proceed accordingly. \n \n**Guidelines:** \n 1. **User Requirements:** \n - You have to ask users about their requirements, clarify the user expectations, and suggest the best possible solution by providing examples of Python code snippets. \n - Ask users about which type of reports they need to assess the AI model's performance, accuracy, and reliability. \n - After providing the solution, you have to ask the user about the trial of the solution and modify the solution based on the user feedback. \n \n 2. **Libraries/Frameworks:** \n - You will be utilizing Python as a programming language. \n - You will be using Flask framework for REST APIS implementation \n \n 3. **Communication Gesture:** \n - Your conversation with the user should be interactive, supportive, courageous, and professional. \n - You have to break down the complex concepts into sub-concepts and try to explain them to the user. \n - You have to ask the user for the required parameters. If the user refuses to provide in 2 attempts, politely exit the conversation. \n - You have to provide your supported parameters to the user, if the user refuses to accept them then you have to put an apology note and exit the conversation. \n - You have to track the conversation about unasked questions by the user. If some/one of the questions remain then you have to remind the user about these questions and proceed to answer them based on the user's confirmation \n \n 4. **Implementation:** \n - Your code/implementations should be reliable, scalable, modular, and reusable. \n - You will be providing unit tests for the implementation upon user request. \n - You will be following MVC architecture for the applications \n - Your implementations must be well-commented and readable \n \n \n- Today's date is 23rd August 2024. \n- The default sender email is [email protected].\nHi, I am conducting research on retail customer feedback systems and I need assistance with designing and implementing them. Could you kindly provide me with a list of general customer feedback system modules?"
+messages = [
+    {"role": "user", "content": custom_system_prompt + "\n\n" + prompt}
+]
+# example tools
+tools = [{"type": "function", "function": {"name": "getRetailFeedbackModules", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"page": {"type": "integer", "description": "The current page number.", "default": 1}, "page_size": {"type": "integer", "description": "The number of items per page.", "default": 3}}}}}, {"type": "function", "function": {"name": "verifyImplementation", "description": "Returns the list of modules usually present in the retail industry", "parameters": {"type": "object", "properties": {"coding_language": {"type": "string", "description": "The supported languages for verification of implementation.", "default": "python", "enum": ["python", "java", "php"]}, "code": {"type": "string", "description": "The code which needs verification"}, "design_pattern": {"type": "string", "description": "The design pattern to verify in the implementation", "enum": ["factory", "strategy", "singleton"]}, "verify_best_practices": {"type": "boolean", "description": "The verification of the coding style based on the language selected", "default": True}}}}}]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    tools=tools
+)
+model_inputs = tokenizer([text], return_tensors="pt")
+```
+### Usage Guidelines
+1. Use the model’s default chat template, which already includes a system prompt. We recommend adding all other instructions within the user message.
+2. We recommend setting temperature to `0.6`.
+3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template.
+## Intended Use
+The Apriel family of models are designed for a variety of general-purpose instruction tasks, including:
+- Code assistance and generation
+- Logical reasoning and multi-step tasks
+- Question answering and information retrieval
+- Function calling, complex instruction following and agent use cases
+They are **not intended** for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.
+---
+## Limitations
+- **Factual accuracy:** May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
+- **Bias:** May reflect societal, cultural, or systemic biases present in training data.
+- **Ethics:** Do not use the model to produce harmful, unlawful, or unethical content.
+- **Language:** Strongest performance is in English. Output quality may degrade in underrepresented languages.
+- **Critical use:** Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
+---
+## Security and Responsible Use
+**Security Responsibilities:**
+Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
+**Guidelines for Deployers:**
+- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
+- Implement validation and filtering processes to prevent harmful or biased outputs.
+- Continuously perform data privacy checks to guard against unintended data leaks.
+- Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
+- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
+**Guidelines for Users:**
+- Follow established security policies and usage guidelines provided by deployers.
+- Protect and manage sensitive information when interacting with the model.
+- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
+- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.
+**Disclaimer:**
+Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
+---
+## Software
+- **Training stack:** [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
+---
+## License
+MIT
+---
+## Citation
+```bibtex
+@misc{apriel_h1_2025,
+  title        = {Apriel-H1: Towards Efficient Enterprise Reasoning Models},
+  author       = {ServiceNow Language Models Lab},
+  howpublished = {https://huggingface.co/ServiceNow-AI/Apriel-H1-15b-Thinker-SFT},
+  year         = {2025}
+}
+```
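
Note on the README's vLLM section: once the server from `vllm serve` is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below builds such a request with the README's recommended temperature of 0.6. The base URL assumes the `--port 8000` example on the local machine, and the helper names `build_chat_request` and `send_chat_request` are hypothetical, not part of the repository.

```python
import json
import urllib.request

# Assumed values: model id and port follow the README's `vllm serve` example.
MODEL = "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT"
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a /v1/chat/completions payload with the recommended temperature."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to a running server and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("What is 2 + 2?")
print(json.dumps(payload, indent=2))
# With a server running, call: print(send_chat_request(payload))
```

Only the payload is built when the block runs. The network call is left to the reader because it needs a live server.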

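Note on the README's chat template: the final answer sits between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`, and the default system prompt itself contains both markers. A parser that reads the full decoded text therefore probably wants the last match rather than the first, and should tolerate a missing match when generation is cut off. A minimal sketch, with the hypothetical helper name `extract_final_response`:

```python
import re
from typing import Optional

# Square brackets are escaped so they match literally.
FINAL_RESPONSE_PATTERN = re.compile(
    r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", re.DOTALL
)


def extract_final_response(text: str) -> Optional[str]:
    """Return the last marked final response, or None if no markers are found."""
    matches = FINAL_RESPONSE_PATTERN.findall(text)
    return matches[-1].strip() if matches else None


# Sample text that mimics a decoded prompt plus completion.
sample = (
    "provide the final solution in the following format: "
    "[BEGIN FINAL RESPONSE] ... [END FINAL RESPONSE].\n"
    "Here are my reasoning steps:\n"
    "2 + 2 equals 4.\n"
    "[BEGIN FINAL RESPONSE]\n4\n[END FINAL RESPONSE]"
)
answer = extract_final_response(sample)
truncated = extract_final_response("Here are my reasoning steps:\nstill thinking")
print(answer)
print(truncated)
```

Returning `None` instead of indexing `[0]` lets the caller retry with a larger `max_new_tokens` rather than fail with an `IndexError`.
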
config.json ADDED

	@@ -0,0 +1,104 @@

+{
+  "architectures": [
+    "AprielHForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_apriel_h.AprielHConfig",
+    "AutoModel": "modeling_apriel_h.AprielHModel",
+    "AutoModelForCausalLM": "modeling_apriel_h.AprielHForCausalLM"
+  },
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 5120,
+  "hybrid_block_layout": [
+    "t",
+    "t",
+    "t",
+    "m2",
+    "t",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "t",
+    "t",
+    "t",
+    "t",
+    "t",
+    "t",
+    "t",
+    "m2",
+    "t",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "t",
+    "t",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "t",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "m2",
+    "t",
+    "m2"
+  ],
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 65536,
+  "model_type": "apriel_h",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 50,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": {
+    "rope_type": "default"
+  },
+  "rope_theta": 1000000.0,
+  "sliding_window": null,
+  "ssm_cfg": {
+    "activation": "silu",
+    "bias": false,
+    "chunk_size": 128,
+    "conv_bias": true,
+    "d_conv": 4,
+    "d_inner": 4096,
+    "d_state": 16,
+    "d_xb": 1024,
+    "dt_init": "random",
+    "dt_init_floor": 0.0001,
+    "dt_max": 0.1,
+    "dt_min": 0.001,
+    "dt_rank": 320,
+    "dt_scale": 1.0,
+    "expand": 1,
+    "n_qk_heads": 32,
+    "n_v_heads": 32
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "4.53.2",
+  "use_cache": true,
+  "vocab_size": 131072
+}

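Note on `hybrid_block_layout`: the weight map later in this commit suggests that `t` layers carry `self_attn` weights and `m2` layers carry Mamba `mixer` weights. The sketch below counts each kind and estimates the attention KV cache for one sequence at `max_position_embeddings`. It assumes a 16-bit cache and ignores the fixed-size SSM state and any runtime overhead, so the figures are a rough illustration rather than a measurement.

```python
# Values copied from config.json; the layout is written run-length style.
LAYOUT = (
    ["t"] * 3 + ["m2"] + ["t"] + ["m2"] * 4 + ["t"] * 7 + ["m2"] + ["t"]
    + ["m2"] * 10 + ["t"] * 2 + ["m2"] * 4 + ["t"] + ["m2"] * 13 + ["t"] + ["m2"]
)
NUM_KEY_VALUE_HEADS = 8
HEAD_DIM = 128
MAX_POSITION_EMBEDDINGS = 65536
BYTES_PER_VALUE = 2  # assumption: bf16/fp16 KV cache


def kv_cache_bytes(num_attention_layers: int, num_tokens: int) -> int:
    """Bytes for keys plus values across the given attention layers, one sequence."""
    per_token_per_layer = 2 * NUM_KEY_VALUE_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token_per_layer * num_attention_layers * num_tokens


attention_layers = LAYOUT.count("t")
mamba_layers = LAYOUT.count("m2")
hybrid = kv_cache_bytes(attention_layers, MAX_POSITION_EMBEDDINGS)
# Hypothetical comparison point: the same depth with attention in every layer.
dense = kv_cache_bytes(len(LAYOUT), MAX_POSITION_EMBEDDINGS)

print(f"attention layers: {attention_layers}, Mamba layers: {mamba_layers}")
print(f"KV cache: {hybrid / 2**30:.2f} GiB hybrid, {dense / 2**30:.2f} GiB all-attention")
```

Under these assumptions, 16 of the 50 layers keep a KV cache, which comes to about 4 GiB per full-length sequence against 12.5 GiB for an all-attention stack of the same depth.
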
configuration_apriel_h.py ADDED

	@@ -0,0 +1,37 @@

+from transformers import MistralConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+ssm_config_default = {
+    "d_state": 64,
+    "n_qk_heads": 32,
+    "expand": 1,
+    "chunk_size": 128,
+    "activation": "identity",
+    "bias": False,
+    "d_conv": 4,
+    "d_inner": 32 * 128,
+    "d_xb": None,  # will be set to model dim
+    "dt_rank": "auto",
+    "dt_min": 0.001,
+    "dt_max": 0.1,
+    "dt_init": "random",
+    "dt_scale": 1.0,
+    "dt_init_floor": 1e-4,
+    "conv_bias": True,
+}
+class AprielHConfig(MistralConfig):
+    model_type = "apriel_h"
+    def __init__(self, hybrid_block_layout=None, ssm_cfg=None, **kwargs):
+        super().__init__(**kwargs)
+        # avoid a shared mutable default argument
+        self.hybrid_block_layout = hybrid_block_layout if hybrid_block_layout is not None else ["m2"]
+        self.head_dim = self.head_dim or self.hidden_size // self.num_attention_heads  # as in transformers 4.51.3
+        # copy so that filling in defaults never mutates the caller's dict or the module-level defaults
+        self.ssm_cfg = dict(ssm_cfg or ssm_config_default)
+        for k, v in ssm_config_default.items():
+            if k not in self.ssm_cfg:
+                self.ssm_cfg[k] = v  # to make sure all elements are present in the config

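Note on the `ssm_cfg` defaults: the config class fills in any key the caller leaves out. The standalone sketch below shows the same merge on a trimmed copy of the defaults, without mutating either input. Resolving `"dt_rank": "auto"` to `ceil(hidden_size / 16)` is an assumption borrowed from the Mamba reference convention. It is consistent with `dt_rank: 320` and `hidden_size: 5120` in `config.json`, but the modeling code is not shown in this section. `resolve_ssm_cfg` is a hypothetical helper.

```python
import math
from typing import Any, Dict, Optional

# Trimmed copy of the defaults, enough to illustrate the merge.
SSM_DEFAULTS: Dict[str, Any] = {
    "d_state": 64,
    "d_conv": 4,
    "dt_rank": "auto",
    "activation": "identity",
}


def resolve_ssm_cfg(ssm_cfg: Optional[Dict[str, Any]], hidden_size: int) -> Dict[str, Any]:
    """Merge user values over the defaults and return a new dict."""
    merged = {**SSM_DEFAULTS, **(ssm_cfg or {})}
    if merged["dt_rank"] == "auto":
        # Assumption: Mamba reference convention for the time-step projection rank.
        merged["dt_rank"] = math.ceil(hidden_size / 16)
    return merged


user_cfg = {"d_state": 16, "activation": "silu"}
resolved = resolve_ssm_cfg(user_cfg, hidden_size=5120)
print(resolved)
```

Building a fresh dict keeps the module-level defaults untouched across instances.
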
images/apriel_h.png ADDED

Git LFS Details

SHA256: ebb37e2644bf76eeae0fd5c8e480a29775d2416e659ee460f52770aef23a635b
Pointer size: 131 Bytes
Size of remote file: 449 kB

images/apriel_h_vs_apriel_15b_eval_thrput_comparison.png ADDED

Git LFS Details

SHA256: 2d0440f4736cebb3d097532840dd4d7b492b513a22cede8f8789d2dbc6e20914
Pointer size: 131 Bytes
Size of remote file: 326 kB

images/throughput_eval_score_vs_throughput_1-16k_annotated.png ADDED

Git LFS Details

SHA256: 0726f12817837573939cb0e917ebaee9fb9a72410104df4d23afe91bc080e5fd
Pointer size: 131 Bytes
Size of remote file: 122 kB

model.safetensors.index.json ADDED

	@@ -0,0 +1,915 @@

+{
+    "metadata": {
+        "fast_llm_metadata": {
+            "fast_llm_version": "0.2.0",
+            "model": "hybrid_ssm",
+            "format": "apriel_ssm_thinker_hybrid",
+            "config": {
+                "type": "hybrid_ssm",
+                "base_model": {
+                    "transformer": {
+                        "type": "lm_decoder",
+                        "normalization": {
+                            "type": "rms_norm",
+                            "epsilon": 1e-05
+                        },
+                        "rotary": {
+                            "type": "default",
+                            "theta": 1000000.0
+                        },
+                        "peft": {
+                            "type": "none"
+                        },
+                        "num_layers": 50,
+                        "hidden_size": 5120,
+                        "num_attention_heads": 32,
+                        "head_groups": 8,
+                        "add_linear_biases": false,
+                        "ffn_hidden_size": 14336,
+                        "kv_channels": 128,
+                        "gated": true,
+                        "activation_type": "silu",
+                        "mlp_lr_scale": 1.0,
+                        "attention_lr_scale": 1.0
+                    },
+                    "vision_encoder": {
+                        "transformer": {
+                            "normalization": {
+                                "type": "layer_norm"
+                            },
+                            "rotary": {
+                                "type": "none"
+                            },
+                            "peft": {
+                                "type": "none"
+                            }
+                        },
+                        "patch_norm": {
+                            "type": "layer_norm"
+                        }
+                    },
+                    "vocab_size": 131072,
+                    "use_position_embeddings": false,
+                    "tie_word_embeddings": false,
+                    "cross_entropy_impl": "fused",
+                    "distillation_loss_implementation": "reverse_kl",
+                    "distillation_model": "teacher",
+                    "parallel_embeddings": false,
+                    "embeddings_lr_scale": 1.0,
+                    "output_lr_scale": 1.0,
+                    "ssm": {
+                        "normalization": {
+                            "type": "layer_norm"
+                        },
+                        "expansion_factor": 1,
+                        "state_size": 16,
+                        "conv_kernel_dimension": 4,
+                        "dt_rank": 320,
+                        "n_qk_heads": 32,
+                        "n_v_heads": 32,
+                        "d_inner": 4096,
+                        "d_xb": 1024,
+                        "add_bias_linear": false,
+                        "activation_type": "silu",
+                        "chunk_size": 128,
+                        "dt_init": "random",
+                        "dt_scale": 1.0,
+                        "dt_min": 0.001,
+                        "dt_max": 0.1,
+                        "dt_init_floor": 0.0001
+                    },
+                    "hybrid_block_layout": [
+                        "t",
+                        "t",
+                        "t",
+                        "m2",
+                        "t",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "t",
+                        "t",
+                        "t",
+                        "t",
+                        "t",
+                        "t",
+                        "t",
+                        "m2",
+                        "t",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "t",
+                        "t",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "t",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "m2",
+                        "t",
+                        "m2"
+                    ]
+                },
+                "multi_stage": {
+                    "zero_stage": 3
+                },
+                "distributed": {
+                    "tensor_parallel": 8,
+                    "sequence_tensor_parallel": true,
+                    "world_size": 64,
+                    "rank": 0,
+                    "local_world_size": 8,
+                    "timeout": 36000.0,
+                    "seed": 984060,
+                    "training_dtype": "bfloat16"
+                }
+            },
+            "shards": [
+                "weights"
+            ],
+            "metadata": {
+                "optimizer": {
+                    "current_step": 25000,
+                    "grad_scaler": {
+                        "type": "NoopGradScaler"
+                    }
+                },
+                "completed_steps": 25000,
+                "metrics": {
+                    "Training": {
+                        "train_iters": 60000,
+                        "batch_size": 64,
+                        "iteration": 25000,
+                        "distillation_loss": 0.034176599606871604,
+                        "language_model_loss": 0.034176599606871604,
+                        "consumed_samples": 1600000,
+                        "consumed_tokens": 26214400000,
+                        "step_time_ms": 13669.109502492938,
+                        "step_time_average_ms": 14812.520303467,
+                        "remaining_time": 518438.21062134503,
+                        "completion_time": 1757381892.6039205,
+                        "percent_done": 41.666666666666664,
+                        "skipped_iters": 0,
+                        "nan_iters": 0,
+                        "model_tflops": 126.99100713435496,
+                        "hardware_tflops": 131.01289176252138,
+                        "tokens_per_sec_per_gpu": 1198.6150229473199,
+                        "run": 0,
+                        "grad_norm": 0.24984633922576904,
+                        "learning_rate": 2.968135593220339e-06,
+                        "loss_scale": 1.0,
+                        "reserved": 55520.0,
+                        "allocated": 17871.4111328125,
+                        "max_allocated": 48712.857421875,
+                        "max_reserved": 55520.0,
+                        "global_max_reserved": 55520.0
+                    }
+                }
+            }
+        },
+        "model_config": {
+            "model_type": "apriel_ssm_thinker_hybrid",
+            "architectures": [
+                "AprielThinkerSSMHybridForCausalLM"
+            ],
+            "rope_theta": 1000000.0,
+            "hidden_act": "silu",
+            "num_hidden_layers": 50,
+            "hidden_size": 5120,
+            "num_attention_heads": 32,
+            "num_key_value_heads": 8,
+            "intermediate_size": 14336,
+            "vocab_size": 131072,
+            "tie_word_embeddings": false,
+            "rms_norm_eps": 1e-05,
+            "head_dim": 128,
+            "rope_scaling": {
+                "rope_type": "default"
+            },
+            "ssm_cfg": {
+                "d_state": 16,
+                "n_v_heads": 32,
+                "n_qk_heads": 32,
+                "expand": 1,
+                "chunk_size": 128,
+                "bias": false,
+                "activation": "silu",
+                "dt_rank": 320,
+                "dt_min": 0.001,
+                "dt_max": 0.1,
+                "dt_init_floor": 0.0001,
+                "dt_scale": 1.0,
+                "d_xb": 1024,
+                "d_conv": 4,
+                "dt_init": "random",
+                "d_inner": 4096,
+                "conv_bias": true
+            },
+            "hybrid_block_layout": [
+                "t",
+                "t",
+                "t",
+                "m2",
+                "t",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "t",
+                "t",
+                "t",
+                "t",
+                "t",
+                "t",
+                "t",
+                "m2",
+                "t",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "t",
+                "t",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "t",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "m2",
+                "t",
+                "m2"
+            ],
+            "auto_map": {
+                "AutoConfig": "configuration_ssm_hybrid_apriel15b.AprielSSMHybridConfig",
+                "AutoModel": "modeling_ssm_hybrid_apriel15b.AprielThinkerSSMHybridModel",
+                "AutoModelForCausalLM": "modeling_ssm_hybrid_apriel15b.AprielThinkerSSMHybridForCausalLM"
+            },
+            "attn_implementation": null
+        },
+        "format": "pt"
+    },
+    "weight_map": {
+        "model.embed_tokens.weight": "model_0.safetensors",
+        "model.layers.0.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.0.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.0.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.0.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.0.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.0.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.0.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.0.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.0.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.1.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.1.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.1.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.1.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.1.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.1.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.1.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.1.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.1.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.2.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.2.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.2.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.2.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.2.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.2.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.2.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.2.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.2.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.3.mixer.A_log": "model_0.safetensors",
+        "model.layers.3.mixer.D": "model_0.safetensors",
+        "model.layers.3.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.3.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.3.mixer.dt_in_proj.weight": "model_0.safetensors",
+        "model.layers.3.mixer.conv1d.weight": "model_0.safetensors",
+        "model.layers.3.mixer.conv1d.bias": "model_0.safetensors",
+        "model.layers.3.mixer.dt_proj.bias": "model_0.safetensors",
+        "model.layers.3.mixer.in_proj.weight": "model_0.safetensors",
+        "model.layers.3.mixer.dt_proj.weight": "model_0.safetensors",
+        "model.layers.3.mixer.out_proj.weight": "model_0.safetensors",
+        "model.layers.3.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.3.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.3.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.4.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.4.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.4.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.4.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.4.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.4.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.4.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.4.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.4.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.5.mixer.A_log": "model_0.safetensors",
+        "model.layers.5.mixer.D": "model_0.safetensors",
+        "model.layers.5.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.5.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.5.mixer.dt_in_proj.weight": "model_0.safetensors",
+        "model.layers.5.mixer.conv1d.weight": "model_0.safetensors",
+        "model.layers.5.mixer.conv1d.bias": "model_0.safetensors",
+        "model.layers.5.mixer.dt_proj.bias": "model_0.safetensors",
+        "model.layers.5.mixer.in_proj.weight": "model_0.safetensors",
+        "model.layers.5.mixer.dt_proj.weight": "model_0.safetensors",
+        "model.layers.5.mixer.out_proj.weight": "model_0.safetensors",
+        "model.layers.5.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.5.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.5.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.6.mixer.A_log": "model_0.safetensors",
+        "model.layers.6.mixer.D": "model_0.safetensors",
+        "model.layers.6.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.6.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.6.mixer.dt_in_proj.weight": "model_0.safetensors",
+        "model.layers.6.mixer.conv1d.weight": "model_0.safetensors",
+        "model.layers.6.mixer.conv1d.bias": "model_0.safetensors",
+        "model.layers.6.mixer.dt_proj.bias": "model_0.safetensors",
+        "model.layers.6.mixer.in_proj.weight": "model_0.safetensors",
+        "model.layers.6.mixer.dt_proj.weight": "model_0.safetensors",
+        "model.layers.6.mixer.out_proj.weight": "model_0.safetensors",
+        "model.layers.6.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.6.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.6.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.7.mixer.A_log": "model_0.safetensors",
+        "model.layers.7.mixer.D": "model_0.safetensors",
+        "model.layers.7.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.7.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.7.mixer.dt_in_proj.weight": "model_0.safetensors",
+        "model.layers.7.mixer.conv1d.weight": "model_0.safetensors",
+        "model.layers.7.mixer.conv1d.bias": "model_0.safetensors",
+        "model.layers.7.mixer.dt_proj.bias": "model_0.safetensors",
+        "model.layers.7.mixer.in_proj.weight": "model_0.safetensors",
+        "model.layers.7.mixer.dt_proj.weight": "model_0.safetensors",
+        "model.layers.7.mixer.out_proj.weight": "model_0.safetensors",
+        "model.layers.7.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.7.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.7.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.8.mixer.A_log": "model_0.safetensors",
+        "model.layers.8.mixer.D": "model_0.safetensors",
+        "model.layers.8.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.8.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.8.mixer.dt_in_proj.weight": "model_0.safetensors",
+        "model.layers.8.mixer.conv1d.weight": "model_0.safetensors",
+        "model.layers.8.mixer.conv1d.bias": "model_0.safetensors",
+        "model.layers.8.mixer.dt_proj.bias": "model_0.safetensors",
+        "model.layers.8.mixer.in_proj.weight": "model_0.safetensors",
+        "model.layers.8.mixer.dt_proj.weight": "model_0.safetensors",
+        "model.layers.8.mixer.out_proj.weight": "model_0.safetensors",
+        "model.layers.8.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.8.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.8.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.9.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.9.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.9.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.9.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.9.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.9.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.9.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.9.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.9.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.10.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.10.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.10.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.10.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.10.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.10.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.10.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.10.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.10.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.11.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.11.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.11.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.11.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.11.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.11.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.11.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.11.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.11.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.12.input_layernorm.weight": "model_0.safetensors",
+        "model.layers.12.post_attention_layernorm.weight": "model_0.safetensors",
+        "model.layers.12.self_attn.q_proj.weight": "model_0.safetensors",
+        "model.layers.12.self_attn.k_proj.weight": "model_0.safetensors",
+        "model.layers.12.self_attn.v_proj.weight": "model_0.safetensors",
+        "model.layers.12.self_attn.o_proj.weight": "model_0.safetensors",
+        "model.layers.12.mlp.gate_proj.weight": "model_0.safetensors",
+        "model.layers.12.mlp.up_proj.weight": "model_0.safetensors",
+        "model.layers.12.mlp.down_proj.weight": "model_0.safetensors",
+        "model.layers.13.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.13.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.13.self_attn.q_proj.weight": "model_1.safetensors",
+        "model.layers.13.self_attn.k_proj.weight": "model_1.safetensors",
+        "model.layers.13.self_attn.v_proj.weight": "model_1.safetensors",
+        "model.layers.13.self_attn.o_proj.weight": "model_1.safetensors",
+        "model.layers.13.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.13.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.13.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.14.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.14.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.14.self_attn.q_proj.weight": "model_1.safetensors",
+        "model.layers.14.self_attn.k_proj.weight": "model_1.safetensors",
+        "model.layers.14.self_attn.v_proj.weight": "model_1.safetensors",
+        "model.layers.14.self_attn.o_proj.weight": "model_1.safetensors",
+        "model.layers.14.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.14.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.14.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.15.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.15.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.15.self_attn.q_proj.weight": "model_1.safetensors",
+        "model.layers.15.self_attn.k_proj.weight": "model_1.safetensors",
+        "model.layers.15.self_attn.v_proj.weight": "model_1.safetensors",
+        "model.layers.15.self_attn.o_proj.weight": "model_1.safetensors",
+        "model.layers.15.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.15.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.15.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.16.mixer.A_log": "model_1.safetensors",
+        "model.layers.16.mixer.D": "model_1.safetensors",
+        "model.layers.16.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.16.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.16.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.16.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.16.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.16.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.16.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.16.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.16.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.16.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.16.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.16.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.17.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.17.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.17.self_attn.q_proj.weight": "model_1.safetensors",
+        "model.layers.17.self_attn.k_proj.weight": "model_1.safetensors",
+        "model.layers.17.self_attn.v_proj.weight": "model_1.safetensors",
+        "model.layers.17.self_attn.o_proj.weight": "model_1.safetensors",
+        "model.layers.17.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.17.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.17.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.18.mixer.A_log": "model_1.safetensors",
+        "model.layers.18.mixer.D": "model_1.safetensors",
+        "model.layers.18.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.18.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.18.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.18.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.18.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.18.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.18.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.18.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.18.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.18.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.18.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.18.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.19.mixer.A_log": "model_1.safetensors",
+        "model.layers.19.mixer.D": "model_1.safetensors",
+        "model.layers.19.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.19.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.19.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.19.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.19.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.19.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.19.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.19.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.19.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.19.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.19.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.19.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.20.mixer.A_log": "model_1.safetensors",
+        "model.layers.20.mixer.D": "model_1.safetensors",
+        "model.layers.20.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.20.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.20.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.20.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.20.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.20.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.20.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.20.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.20.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.20.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.20.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.20.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.21.mixer.A_log": "model_1.safetensors",
+        "model.layers.21.mixer.D": "model_1.safetensors",
+        "model.layers.21.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.21.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.21.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.21.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.21.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.21.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.21.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.21.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.21.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.21.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.21.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.21.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.22.mixer.A_log": "model_1.safetensors",
+        "model.layers.22.mixer.D": "model_1.safetensors",
+        "model.layers.22.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.22.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.22.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.22.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.22.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.22.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.22.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.22.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.22.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.22.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.22.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.22.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.23.mixer.A_log": "model_1.safetensors",
+        "model.layers.23.mixer.D": "model_1.safetensors",
+        "model.layers.23.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.23.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.23.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.23.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.23.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.23.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.23.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.23.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.23.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.23.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.23.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.23.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.24.mixer.A_log": "model_1.safetensors",
+        "model.layers.24.mixer.D": "model_1.safetensors",
+        "model.layers.24.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.24.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.24.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.24.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.24.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.24.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.24.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.24.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.24.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.24.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.24.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.24.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.25.mixer.A_log": "model_1.safetensors",
+        "model.layers.25.mixer.D": "model_1.safetensors",
+        "model.layers.25.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.25.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.25.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.25.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.25.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.25.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.25.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.25.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.25.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.25.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.25.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.25.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.26.mixer.A_log": "model_1.safetensors",
+        "model.layers.26.mixer.D": "model_1.safetensors",
+        "model.layers.26.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.26.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.26.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.26.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.26.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.26.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.26.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.26.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.26.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.26.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.26.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.26.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.27.mixer.A_log": "model_1.safetensors",
+        "model.layers.27.mixer.D": "model_1.safetensors",
+        "model.layers.27.input_layernorm.weight": "model_1.safetensors",
+        "model.layers.27.post_attention_layernorm.weight": "model_1.safetensors",
+        "model.layers.27.mixer.dt_in_proj.weight": "model_1.safetensors",
+        "model.layers.27.mixer.conv1d.weight": "model_1.safetensors",
+        "model.layers.27.mixer.conv1d.bias": "model_1.safetensors",
+        "model.layers.27.mixer.dt_proj.bias": "model_1.safetensors",
+        "model.layers.27.mixer.in_proj.weight": "model_1.safetensors",
+        "model.layers.27.mixer.dt_proj.weight": "model_1.safetensors",
+        "model.layers.27.mixer.out_proj.weight": "model_1.safetensors",
+        "model.layers.27.mlp.gate_proj.weight": "model_1.safetensors",
+        "model.layers.27.mlp.up_proj.weight": "model_1.safetensors",
+        "model.layers.27.mlp.down_proj.weight": "model_1.safetensors",
+        "model.layers.28.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.28.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.28.self_attn.q_proj.weight": "model_2.safetensors",
+        "model.layers.28.self_attn.k_proj.weight": "model_2.safetensors",
+        "model.layers.28.self_attn.v_proj.weight": "model_2.safetensors",
+        "model.layers.28.self_attn.o_proj.weight": "model_2.safetensors",
+        "model.layers.28.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.28.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.28.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.29.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.29.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.29.self_attn.q_proj.weight": "model_2.safetensors",
+        "model.layers.29.self_attn.k_proj.weight": "model_2.safetensors",
+        "model.layers.29.self_attn.v_proj.weight": "model_2.safetensors",
+        "model.layers.29.self_attn.o_proj.weight": "model_2.safetensors",
+        "model.layers.29.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.29.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.29.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.30.mixer.A_log": "model_2.safetensors",
+        "model.layers.30.mixer.D": "model_2.safetensors",
+        "model.layers.30.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.30.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.30.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.30.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.30.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.30.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.30.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.30.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.30.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.30.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.30.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.30.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.31.mixer.A_log": "model_2.safetensors",
+        "model.layers.31.mixer.D": "model_2.safetensors",
+        "model.layers.31.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.31.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.31.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.31.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.31.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.31.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.31.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.31.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.31.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.31.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.31.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.31.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.32.mixer.A_log": "model_2.safetensors",
+        "model.layers.32.mixer.D": "model_2.safetensors",
+        "model.layers.32.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.32.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.32.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.32.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.32.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.32.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.32.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.32.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.32.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.32.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.32.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.32.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.33.mixer.A_log": "model_2.safetensors",
+        "model.layers.33.mixer.D": "model_2.safetensors",
+        "model.layers.33.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.33.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.33.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.33.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.33.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.33.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.33.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.33.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.33.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.33.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.33.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.33.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.34.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.34.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.34.self_attn.q_proj.weight": "model_2.safetensors",
+        "model.layers.34.self_attn.k_proj.weight": "model_2.safetensors",
+        "model.layers.34.self_attn.v_proj.weight": "model_2.safetensors",
+        "model.layers.34.self_attn.o_proj.weight": "model_2.safetensors",
+        "model.layers.34.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.34.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.34.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.35.mixer.A_log": "model_2.safetensors",
+        "model.layers.35.mixer.D": "model_2.safetensors",
+        "model.layers.35.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.35.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.35.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.35.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.35.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.35.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.35.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.35.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.35.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.35.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.35.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.35.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.36.mixer.A_log": "model_2.safetensors",
+        "model.layers.36.mixer.D": "model_2.safetensors",
+        "model.layers.36.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.36.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.36.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.36.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.36.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.36.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.36.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.36.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.36.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.36.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.36.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.36.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.37.mixer.A_log": "model_2.safetensors",
+        "model.layers.37.mixer.D": "model_2.safetensors",
+        "model.layers.37.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.37.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.37.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.37.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.37.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.37.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.37.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.37.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.37.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.37.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.37.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.37.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.38.mixer.A_log": "model_2.safetensors",
+        "model.layers.38.mixer.D": "model_2.safetensors",
+        "model.layers.38.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.38.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.38.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.38.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.38.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.38.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.38.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.38.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.38.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.38.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.38.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.38.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.39.mixer.A_log": "model_2.safetensors",
+        "model.layers.39.mixer.D": "model_2.safetensors",
+        "model.layers.39.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.39.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.39.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.39.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.39.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.39.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.39.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.39.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.39.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.39.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.39.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.39.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.40.mixer.A_log": "model_2.safetensors",
+        "model.layers.40.mixer.D": "model_2.safetensors",
+        "model.layers.40.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.40.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.40.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.40.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.40.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.40.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.40.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.40.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.40.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.40.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.40.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.40.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.41.mixer.A_log": "model_2.safetensors",
+        "model.layers.41.mixer.D": "model_2.safetensors",
+        "model.layers.41.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.41.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.41.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.41.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.41.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.41.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.41.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.41.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.41.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.41.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.41.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.41.mlp.down_proj.weight": "model_2.safetensors",
+        "model.layers.42.mixer.A_log": "model_2.safetensors",
+        "model.layers.42.mixer.D": "model_2.safetensors",
+        "model.layers.42.input_layernorm.weight": "model_2.safetensors",
+        "model.layers.42.post_attention_layernorm.weight": "model_2.safetensors",
+        "model.layers.42.mixer.dt_in_proj.weight": "model_2.safetensors",
+        "model.layers.42.mixer.conv1d.weight": "model_2.safetensors",
+        "model.layers.42.mixer.conv1d.bias": "model_2.safetensors",
+        "model.layers.42.mixer.dt_proj.bias": "model_2.safetensors",
+        "model.layers.42.mixer.in_proj.weight": "model_2.safetensors",
+        "model.layers.42.mixer.dt_proj.weight": "model_2.safetensors",
+        "model.layers.42.mixer.out_proj.weight": "model_2.safetensors",
+        "model.layers.42.mlp.gate_proj.weight": "model_2.safetensors",
+        "model.layers.42.mlp.up_proj.weight": "model_2.safetensors",
+        "model.layers.42.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.43.mixer.A_log": "model_3.safetensors",
+        "model.layers.43.mixer.D": "model_3.safetensors",
+        "model.layers.43.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.43.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.43.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.43.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.43.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.43.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.43.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.43.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.43.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.43.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.43.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.43.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.44.mixer.A_log": "model_3.safetensors",
+        "model.layers.44.mixer.D": "model_3.safetensors",
+        "model.layers.44.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.44.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.44.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.44.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.44.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.44.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.44.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.44.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.44.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.44.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.44.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.44.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.45.mixer.A_log": "model_3.safetensors",
+        "model.layers.45.mixer.D": "model_3.safetensors",
+        "model.layers.45.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.45.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.45.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.45.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.45.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.45.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.45.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.45.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.45.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.45.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.45.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.45.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.46.mixer.A_log": "model_3.safetensors",
+        "model.layers.46.mixer.D": "model_3.safetensors",
+        "model.layers.46.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.46.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.46.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.46.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.46.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.46.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.46.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.46.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.46.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.46.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.46.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.46.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.47.mixer.A_log": "model_3.safetensors",
+        "model.layers.47.mixer.D": "model_3.safetensors",
+        "model.layers.47.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.47.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.47.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.47.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.47.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.47.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.47.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.47.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.47.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.47.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.47.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.47.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.48.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.48.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.48.self_attn.q_proj.weight": "model_3.safetensors",
+        "model.layers.48.self_attn.k_proj.weight": "model_3.safetensors",
+        "model.layers.48.self_attn.v_proj.weight": "model_3.safetensors",
+        "model.layers.48.self_attn.o_proj.weight": "model_3.safetensors",
+        "model.layers.48.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.48.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.48.mlp.down_proj.weight": "model_3.safetensors",
+        "model.layers.49.mixer.A_log": "model_3.safetensors",
+        "model.layers.49.mixer.D": "model_3.safetensors",
+        "model.layers.49.input_layernorm.weight": "model_3.safetensors",
+        "model.layers.49.post_attention_layernorm.weight": "model_3.safetensors",
+        "model.layers.49.mixer.dt_in_proj.weight": "model_3.safetensors",
+        "model.layers.49.mixer.conv1d.weight": "model_3.safetensors",
+        "model.layers.49.mixer.conv1d.bias": "model_3.safetensors",
+        "model.layers.49.mixer.dt_proj.bias": "model_3.safetensors",
+        "model.layers.49.mixer.in_proj.weight": "model_3.safetensors",
+        "model.layers.49.mixer.dt_proj.weight": "model_3.safetensors",
+        "model.layers.49.mixer.out_proj.weight": "model_3.safetensors",
+        "model.layers.49.mlp.gate_proj.weight": "model_3.safetensors",
+        "model.layers.49.mlp.up_proj.weight": "model_3.safetensors",
+        "model.layers.49.mlp.down_proj.weight": "model_3.safetensors",
+        "model.norm.weight": "model_3.safetensors",
+        "lm_head.weight": "model_3.safetensors"
+    }
+}
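
The index above maps every tensor name to the shard file that stores it. As a rough illustration (not part of this commit), the sketch below counts tensors per shard and lists the shards a given layer touches. It assumes the mapping sits under the standard `weight_map` key of a Hugging Face sharded-checkpoint index, which is not visible in the excerpt above; `tensors_per_shard` and `shards_for_layer` are hypothetical helper names.

```python
import json
from collections import Counter


def tensors_per_shard(index_text: str) -> Counter:
    """Count how many tensor names the index assigns to each shard file."""
    index = json.loads(index_text)
    # Assumption: the name -> shard mapping is stored under "weight_map".
    return Counter(index["weight_map"].values())


def shards_for_layer(index_text: str, layer: int) -> set:
    """Return the set of shard files that hold tensors of one decoder layer."""
    index = json.loads(index_text)
    prefix = f"model.layers.{layer}."
    return {
        shard
        for name, shard in index["weight_map"].items()
        if name.startswith(prefix)
    }


# Small excerpt built from entries that appear in the index above.
example = json.dumps(
    {
        "weight_map": {
            "model.layers.42.mlp.up_proj.weight": "model_2.safetensors",
            "model.layers.42.mlp.down_proj.weight": "model_3.safetensors",
            "model.norm.weight": "model_3.safetensors",
            "lm_head.weight": "model_3.safetensors",
        }
    }
)

print(tensors_per_shard(example))
print(sorted(shards_for_layer(example, 42)))
```

In the entries shown above, layer 42 straddles a shard boundary (`up_proj` in `model_2.safetensors`, `down_proj` in `model_3.safetensors`), so a loader that works layer by layer may need two shards open for that layer.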

model_0.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:54ab7d8a72b9ed8dddb280c6b894a0a7c6404a6545cf77821e075708353ec122
+size 17341952504

model_1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3cf3c4655d0a8f554e50c3c8001f92f681edb1a68a18eae6f4c55857756ed141
+size 17415079472

model_2.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:03913e3b49ba8415e659d23f28e30f55a973ff5b669ef8e323226331e7f98e84
+size 17217537984

model_3.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e5288d374646a02d5bec8810577e6955c90e810fd958bcd1c835d17a1d45e755
+size 11188268104
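
Each `.safetensors` entry above is a Git LFS pointer (a `version`, an `oid`, and a `size` line), not the weights themselves. A minimal sketch of how a downloaded shard could be checked against such a pointer follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this illustration and are not part of the commit or of any Git LFS tooling.

```python
import hashlib
import os
import tempfile


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file has the size and hash recorded in the pointer."""
    # Cheap check first: a size mismatch avoids hashing a multi-gigabyte file.
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Stream the file in chunks so large shards never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text copied from the model_3.safetensors entry above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e5288d374646a02d5bec8810577e6955c90e810fd958bcd1c835d17a1d45e755\n"
    "size 11188268104\n"
)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("model_3.safetensors", pointer)` on a local copy would then report whether the download is complete and uncorrupted, under the assumption that the file was fetched in full rather than left as a pointer stub.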

modeling_apriel_h.py ADDED Viewed

	@@ -0,0 +1,908 @@

+import copy
+import math
+from dataclasses import dataclass
+from typing import Any, Optional, Union
+import torch
+import torch.nn.functional as F
+from causal_conv1d import causal_conv1d_fn, causal_conv1d_update
+from .configuration_apriel_h import AprielHConfig
+from einops import rearrange, repeat
+from mamba_ssm.ops.selective_scan_interface import selective_scan_fn
+from mamba_ssm.ops.triton.selective_state_update import selective_state_update
+from torch import nn
+from transformers import GenerationMixin
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from transformers.modeling_utils import PreTrainedModel
+from transformers.models.mistral.modeling_mistral import MistralDecoderLayer, MistralMLP, MistralModel, MistralRMSNorm
+from transformers.processing_utils import Unpack
+from transformers.utils import LossKwargs, can_return_tuple, logging
+from transformers.utils.generic import ModelOutput
+logger = logging.get_logger(__name__)
+is_fast_path_available = all((selective_state_update, causal_conv1d_fn, causal_conv1d_update))
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+# Copied from https://github.com/huggingface/transformers/blob/main/src/transformers/models/jamba/modeling_jamba.py
+class HybridMambaAttentionDynamicCache(DynamicCache):
+    """
+    A dynamic cache that can handle both the attention cache (which has a seq_len dimension) and the mamba cache
+    (which has a constant shape regardless of seq_len).
+    This cache has two sets of lists of tensors: `key_cache` and `value_cache` for attention cache and `conv_states`
+    and `ssm_states` for mamba cache. Each of these lists has `num_layers` tensors. The expected shape for each tensor
+    depends on the layer type.
+    For attention layers, `key_cache` and `value_cache` have a shape of `(batch_size, num_heads, seq_len, head_dim)`,
+    while `conv_states` and `ssm_states` have a shape of `(batch_size, 0)` (empty tensors).
+    For mamba layers, `key_cache` and `value_cache` have a shape of `(batch_size, 0)` (empty tensors),
+    while `conv_states` represents the convolution state and has a shape of `(batch_size, d_inner, d_conv)`,
+    and `ssm_states` represents the ssm state and has a shape of `(batch_size, n_heads, d_inner // n_heads, d_state)`.
+    """
+    def __init__(self, config: AprielHConfig, batch_size, dtype=torch.float16, device=None):
+        super().__init__()
+        self.dtype = dtype
+        self.hybrid_override_pattern = config.hybrid_block_layout
+        self.has_previous_state = False  # only used by mamba
+        intermediate_size = (
+            config.ssm_cfg["d_inner"]
+            if config.ssm_cfg["d_inner"] is not None
+            else config.ssm_cfg["expand"] * config.hidden_size
+        )
+        ssm_state_size = config.ssm_cfg["d_state"]
+        conv_kernel_size = config.ssm_cfg["d_conv"]
+        self.n_qk_heads = config.ssm_cfg["n_qk_heads"]
+        self.num_C_head = intermediate_size // ssm_state_size  # mamba2
+        assert intermediate_size % self.n_qk_heads == 0, "d_inner must be divisible by n_qk_heads"
+        self.head_d = intermediate_size // self.n_qk_heads
+        self.conv_states = []
+        self.ssm_states = []
+        self.transformer_layers = []
+        for i in range(config.num_hidden_layers):
+            if self.hybrid_override_pattern[i] == "m2d":
+                # Mamba layer
+                self.conv_states += [
+                    torch.zeros(
+                        batch_size,
+                        conv_kernel_size,
+                        intermediate_size + 2 * self.n_qk_heads * ssm_state_size,
+                        device=device,
+                        dtype=dtype,
+                    ).transpose(1, 2)
+                ]
+                self.ssm_states += [
+                    torch.zeros(batch_size, self.n_qk_heads, self.head_d, ssm_state_size, device=device, dtype=dtype)
+                ]
+            elif self.hybrid_override_pattern[i] == "m2":
+                if "repeat_kv_before_conv" in config.ssm_cfg:
+                    assert (
+                        config.ssm_cfg["repeat_kv_before_conv"] == True
+                    ), "Only support repeat_kv_before_conv=True for m2 for now"
+                self.conv_states += [
+                    torch.zeros(
+                        batch_size,
+                        intermediate_size,
+                        conv_kernel_size,
+                        device=device,
+                        dtype=dtype,
+                    )
+                ]
+                self.ssm_states += [
+                    torch.zeros(
+                        batch_size,
+                        self.num_C_head,
+                        intermediate_size // self.num_C_head,
+                        ssm_state_size,
+                        device=device,
+                        dtype=dtype,
+                    )
+                ]
+            else:
+                # Attention or MLP layer
+                self.conv_states += [torch.tensor([[]] * batch_size, device=device)]
+                self.ssm_states += [torch.tensor([[]] * batch_size, device=device)]
+                self.transformer_layers.append(i)
+        self.key_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]
+        self.value_cache = [torch.tensor([[]] * batch_size, device=device) for _ in range(config.num_hidden_layers)]
+    def update(
+        self,
+        key_states: torch.Tensor,
+        value_states: torch.Tensor,
+        layer_idx: int,
+        cache_kwargs: Optional[dict[str, Any]] = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        # Update the cache
+        if self.key_cache[layer_idx].shape[-1] == 0:
+            self.key_cache[layer_idx] = key_states
+            self.value_cache[layer_idx] = value_states
+        else:
+            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=2)
+            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=2)
+        return self.key_cache[layer_idx], self.value_cache[layer_idx]
+    def reorder_cache(self, beam_idx: torch.LongTensor):
+        """Reorders the cache for beam search, given the selected beam indices."""
+        for layer_idx in range(len(self.key_cache)):
+            device = self.key_cache[layer_idx].device
+            self.key_cache[layer_idx] = self.key_cache[layer_idx].index_select(0, beam_idx.to(device))
+            device = self.value_cache[layer_idx].device
+            self.value_cache[layer_idx] = self.value_cache[layer_idx].index_select(0, beam_idx.to(device))
+            device = self.conv_states[layer_idx].device
+            self.conv_states[layer_idx] = self.conv_states[layer_idx].index_select(0, beam_idx.to(device))
+            device = self.ssm_states[layer_idx].device
+            self.ssm_states[layer_idx] = self.ssm_states[layer_idx].index_select(0, beam_idx.to(device))
+    def to_legacy_cache(self) -> tuple[tuple[torch.Tensor], tuple[torch.Tensor]]:
+        raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")
+    @classmethod
+    def from_legacy_cache(cls, past_key_values: Optional[tuple[tuple[torch.FloatTensor]]] = None) -> "DynamicCache":
+        raise NotImplementedError("HybridMambaAttentionDynamicCache does not have a legacy cache equivalent.")
+    # Copied from modeling_mamba2.py
+    def update_conv_state(
+        self, layer_idx: int, new_conv_state: torch.Tensor, cache_init: bool = False
+    ) -> torch.Tensor:
+        if cache_init:
+            self.conv_states[layer_idx] = new_conv_state.to(self.conv_states[layer_idx].device)
+        else:
+            self.conv_states[layer_idx] = self.conv_states[layer_idx].roll(shifts=-1, dims=-1)
+            self.conv_states[layer_idx][:, :, -1] = new_conv_state[:, 0, :].to(self.conv_states[layer_idx].device)
+        return self.conv_states[layer_idx]
+    def update_ssm_state(self, layer_idx: int, new_ssm_state: torch.Tensor):
+        self.ssm_states[layer_idx] = new_ssm_state.to(self.ssm_states[layer_idx].device)
+        return self.ssm_states[layer_idx]
+    def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
+        """Returns the sequence length of the cached states. A layer index can be optionally passed."""
+        # take any layer that contains cache and not empty tensor
+        layer_idx = self.transformer_layers[0] if layer_idx not in self.transformer_layers else layer_idx
+        if len(self.key_cache) <= layer_idx:
+            return 0
+        is_empty_layer = (
+            len(self.key_cache) == 0  # no cache in any layer
+            or len(self.key_cache) <= layer_idx  # skipped `layer_idx` and hasn't run a layer with cache after it
+            or not self.key_cache[layer_idx].numel()  # the layer has no cache
+        )
+        return self.key_cache[layer_idx].shape[-2] if not is_empty_layer else 0
+        # return self.key_cache[layer_idx].shape[-2]
+    def reset(self):
+        for state in (*self.conv_states, *self.ssm_states):
+            state.zero_()
+@dataclass
+class AprielHybridCausalOutput(ModelOutput):
+    """Custom output class for the Apriel hybrid causal language model."""
+    loss: Optional[torch.FloatTensor] = None
+    logits: Optional[torch.FloatTensor] = None
+    all_hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
+    last_hidden_state: Optional[torch.FloatTensor] = None
+    attention_weights: Optional[torch.FloatTensor] = None
+    past_key_values: Optional[Cache] = None
+def segsum(x):
+    """More stable segment sum calculation."""
+    # [1, 2, 3]
+    T = x.size(-1)
+    x = repeat(x, "... d -> ... d e", e=T)
+    # [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
+    mask = torch.tril(torch.ones(T, T, device=x.device, dtype=bool), diagonal=-1)
+    x = x.masked_fill(~mask, 0)
+    # [[0, 0, 0], [2, 0, 0], [3, 3, 0]]
+    x_segsum = torch.cumsum(x, dim=-2)
+    # [[0, 0, 0], [2, 0, 0], [5, 3, 0]]
+    mask = torch.tril(torch.ones(T, T, device=x.device, dtype=bool), diagonal=0)
+    x_segsum = x_segsum.masked_fill(~mask, -torch.inf)
+    return x_segsum
+def materialize_mixer(A_log, B, C, D):
+    """
+    Since the transfer matrix will be equated to the attention matrix,
+    we need to support the form: torch.matmul(attn_weights, value_states).
+    Thus, y = torch.matmul(T, X)
+    Arguments:
+        A_log: (batch, length, n_heads)
+        B: (batch, length, n_heads, d_state)
+        C: (batch, length, n_heads, d_state)
+        D: (n_heads,) skip term added to the diagonal of T, or None
+    Return:
+        T: (batch, n_heads, length, length)
+    """
+    batch_size, length, n_heads, d_state = B.shape
+    assert A_log.shape == (batch_size, length, n_heads)
+    assert B.shape == C.shape == (batch_size, length, n_heads, d_state)
+    # Compute:
+    A_log = rearrange(-F.softplus(A_log), "b l h -> b h l")
+    powers = torch.exp(segsum(A_log))
+    T = torch.einsum("blhn,bshn,bhls->bhsl", C, B, powers)
+    # Add D:
+    if D is not None:
+        T[:, :, torch.arange(length), torch.arange(length)] += D.view(1, n_heads, 1)
+    T = rearrange(T, "b h z l -> b h l z")
+    return T
+def apply_mask_to_padding_states(hidden_states, attention_mask):
+    """
+    Zeroes out the hidden states for padding tokens, see https://github.com/state-spaces/mamba/issues/66
+    """
+    if attention_mask is not None and attention_mask.shape[1] > 1 and attention_mask.shape[0] > 1:
+        dtype = hidden_states.dtype
+        hidden_states = (hidden_states * attention_mask[:, :, None]).to(dtype)
+    return hidden_states
+class Mamba(nn.Module):
+    def __init__(
+        self,
+        d_model,
+        d_inner,
+        d_xb=None,
+        d_state=16,
+        d_conv=4,
+        expand=2,
+        dt_rank="auto",
+        dt_min=0.001,
+        dt_max=0.1,
+        dt_init="random",
+        dt_scale=1.0,
+        dt_init_floor=1e-4,
+        repeat_kv_before_conv=True,
+        conv_bias=True,
+        bias=False,
+        dt_proj_bias=True,
+        use_fast_path=True,  # Fused kernel options
+        layer_idx=None,
+        device=None,
+        dtype=None,
+        **kwargs,
+    ):
+        factory_kwargs = {"device": device, "dtype": dtype}
+        super().__init__()
+        self.d_model = d_model
+        self.d_xb = d_xb if d_xb is not None else d_model
+        self.d_state = d_state
+        self.d_conv = d_conv
+        self.expand = expand
+        self.d_inner = d_inner if d_inner is not None else int(self.expand * self.d_model)
+        self.dt_rank = math.ceil(self.d_model / 16) if dt_rank == "auto" else dt_rank
+        self.use_fast_path = use_fast_path
+        self.layer_idx = layer_idx
+        self.repeat_kv_before_conv = repeat_kv_before_conv
+        if self.repeat_kv_before_conv:
+            self.conv1d = nn.Conv1d(
+                in_channels=self.d_inner,
+                out_channels=self.d_inner,
+                bias=conv_bias,
+                kernel_size=d_conv,
+                groups=self.d_inner,
+                padding=d_conv - 1,
+                **factory_kwargs,
+            )
+        else:
+            self.conv1d = nn.Conv1d(
+                in_channels=self.d_xb,
+                out_channels=self.d_xb,
+                bias=conv_bias,
+                kernel_size=d_conv,
+                groups=self.d_xb,
+                padding=d_conv - 1,
+                **factory_kwargs,
+            )
+        self.activation = "silu"
+        self.act = nn.SiLU()
+        self.num_xb_head = self.d_xb // self.d_state
+        self.num_C_head = self.d_inner // self.d_state
+        self.repeat_group = self.num_C_head // self.num_xb_head
+        self.in_proj = nn.Linear(self.d_model, 2 * self.d_xb + 2 * self.d_inner, bias=bias, **factory_kwargs)
+        self.dt_in_proj = nn.Linear(self.d_model, self.dt_rank, bias=bias, **factory_kwargs)
+        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=dt_proj_bias, **factory_kwargs)
+        # Initialize special dt projection to preserve variance at initialization
+        dt_init_std = self.dt_rank**-0.5 * dt_scale
+        if dt_init == "constant":
+            nn.init.constant_(self.dt_proj.weight, dt_init_std)
+        elif dt_init == "random":
+            nn.init.uniform_(self.dt_proj.weight, -dt_init_std, dt_init_std)
+        else:
+            raise NotImplementedError
+        # Initialize dt bias so that F.softplus(dt_bias) is between dt_min and dt_max
+        dt = torch.exp(
+            torch.rand(self.d_inner, **factory_kwargs) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
+        ).clamp(min=dt_init_floor)
+        # Inverse of softplus: https://github.com/pytorch/pytorch/issues/72759
+        inv_dt = dt + torch.log(-torch.expm1(-dt))
+        with torch.no_grad():
+            self.dt_proj.bias.copy_(inv_dt)
+        # Our initialization would set all Linear.bias to zero, need to mark this one as _no_reinit
+        self.dt_proj.bias._no_reinit = True
+        # S4D real initialization
+        A = repeat(
+            torch.arange(1, self.d_state + 1, dtype=torch.float32, device=device),
+            "n -> d n",
+            d=self.d_inner,
+        ).contiguous()
+        A_log = torch.log(A)  # Keep A_log in fp32
+        self.A_log = nn.Parameter(A_log)
+        self.A_log._no_weight_decay = True
+        # D "skip" parameter
+        self.D = nn.Parameter(torch.ones(self.d_inner, device=device))  # Keep in fp32
+        self.D._no_weight_decay = True
+        self.out_proj = nn.Linear(self.d_inner, self.d_model, bias=bias, **factory_kwargs)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        past_key_value: Optional[HybridMambaAttentionDynamicCache] = None,
+        mamba_mask: Optional[torch.Tensor] = None,
+        return_mixer_matrix=False,
+        **kwargs,
+    ):
+        """
+        hidden_states: (B, L, D)
+        Returns: a dict whose "hidden_states" entry has the same shape as the input hidden_states
+        """
+        assert is_fast_path_available and "cuda" in self.in_proj.weight.device.type, "Only support fast path on cuda"
+        cache_position = kwargs.get("cache_position", None)
+        batch, seqlen, dim = hidden_states.shape
+        ssm_state, conv_state = None, None
+        use_precomputed_states = False
+        #########################################################
+        # Quick and dirty workaround to work with CUDA graphs (CG)
+        if "inference_params" in kwargs:
+            seqlen_offset = kwargs["inference_params"].seqlen_offset
+            if seqlen_offset > 0:
+                use_precomputed_states = True
+        else:
+            seqlen_offset = kwargs.get("seqlen_offset", cache_position[0]) if cache_position is not None else 0
+            use_precomputed_states = (
+                past_key_value is not None
+                and past_key_value.has_previous_state
+                and seqlen == 1
+                and past_key_value.conv_states[self.layer_idx].shape[0]
+                == past_key_value.ssm_states[self.layer_idx].shape[0]
+                == batch
+                and cache_position is not None
+                and seqlen_offset > 0
+            )
+        #########################################################
+        ssm_state, conv_state = self._get_states_from_cache(past_key_value, batch)
+        if use_precomputed_states:
+            out, _, _ = self.step(hidden_states, conv_state, ssm_state)
+            return {"hidden_states": out}
+        outputs = {}
+        A = -torch.exp(self.A_log.float())  # (d_inner, d_state)
+        zxbc = self.in_proj(hidden_states)
+        z, x, B, C = torch.split(
+            zxbc,
+            [
+                self.d_inner,
+                self.d_xb,
+                self.d_xb,
+                self.d_inner,
+            ],
+            dim=-1,
+        )
+        x = rearrange(x, "b l d -> b d l")
+        z = rearrange(z, "b l d -> b d l")
+        B = rearrange(B, "b l (n_group dstate) -> b n_group l dstate", dstate=self.d_state)
+        B = repeat_kv(B, self.repeat_group)  # B, n_group, L, H
+        B = rearrange(B, "b n_group l dstate -> b n_group dstate l").contiguous()
+        C = rearrange(C, "b l (n_group dstate) -> b n_group dstate l", dstate=self.d_state).contiguous()
+        dt = self.dt_proj(self.dt_in_proj(hidden_states))  # B, L, d_inner
+        dt = rearrange(dt, "b l d -> b d l")  # B, d_inner, L
+        if self.repeat_kv_before_conv:
+            x = rearrange(x, "b (n_group dstate) l -> b n_group l dstate", dstate=self.d_state)
+            x = repeat_kv(x, self.repeat_group)
+            x = rearrange(x, "b n_group l dstate -> b (n_group dstate) l")
+        # Compute short convolution
+        if conv_state is not None:
+            # If we just take x[:, :, -self.d_conv :], it will error if seqlen < self.d_conv
+            # Instead F.pad will pad with zeros if seqlen < self.d_conv, and truncate otherwise.
+            # Update state (B D W)
+            conv_state.copy_(F.pad(x, (self.d_conv - x.shape[-1], 0)))
+        if causal_conv1d_fn is None:
+            x = self.act(self.conv1d(x)[..., :seqlen])  # keep (B, D, L), matching the causal_conv1d_fn output
+        else:
+            assert self.activation in ["silu", "swish"]
+            x = causal_conv1d_fn(
+                x=x,
+                weight=rearrange(self.conv1d.weight, "d 1 w -> d w"),
+                bias=self.conv1d.bias,
+                activation=self.activation,
+            )
+        if not self.repeat_kv_before_conv:
+            x = rearrange(x, "b (n_group dstate) l -> b n_group l dstate", dstate=self.d_state)
+            x = repeat_kv(x, self.repeat_group)
+            x = rearrange(x, "b n_group l dstate -> b (n_group dstate) l")
+        y = selective_scan_fn(
+            x,
+            dt,
+            A,
+            B,
+            C,
+            self.D.float(),
+            z=z,
+            delta_bias=self.dt_proj.bias.float(),
+            delta_softplus=True,
+            return_last_state=(ssm_state is not None),
+        )
+        if ssm_state is not None:
+            y, last_state = y
+            ssm_state.copy_(rearrange(last_state, "b (h d) n -> b h d n", h=self.num_C_head))
+        y = rearrange(y, "b d l -> b l d")
+        out = self.out_proj(y)
+        outputs["hidden_states"] = out[:, :seqlen, :]
+        return outputs
+    def step(self, hidden_states, conv_state, ssm_state):
+        dtype = hidden_states.dtype
+        assert hidden_states.shape[1] == 1, "Only support decoding with 1 token at a time for now"
+        hidden_states_input = hidden_states.squeeze(1)
+        A = -torch.exp(self.A_log.float())  # (d_inner, d_state)
+        zxbc = self.in_proj(hidden_states_input)
+        z, x, B, C = torch.split(zxbc, [self.d_inner, self.d_xb, self.d_xb, self.d_inner], dim=-1)
+        B = rearrange(B, "b (n_group dstate) -> b n_group dstate", dstate=self.d_state)
+        B = torch.repeat_interleave(B, dim=1, repeats=self.repeat_group)
+        C = rearrange(C, "b (n_group dstate) -> b n_group dstate", dstate=self.d_state).contiguous()
+        dt = self.dt_proj(self.dt_in_proj(hidden_states_input))  # B, d_inner
+        if self.repeat_kv_before_conv:
+            x = rearrange(x, "b (n_group dstate) -> b n_group dstate", dstate=self.d_state)
+            x = torch.repeat_interleave(x, dim=1, repeats=self.repeat_group)
+            x = rearrange(x, "b n_group dstate -> b (n_group dstate)")
+        # Conv step
+        if causal_conv1d_update is None:
+            # Update state (B D W)
+            conv_state.copy_(torch.roll(conv_state, shifts=-1, dims=-1))
+            conv_state[:, :, -1] = x
+            x = torch.sum(conv_state * rearrange(self.conv1d.weight, "d 1 w -> d w"), dim=-1)  # (B D)
+            if self.conv1d.bias is not None:
+                x = x + self.conv1d.bias
+            x = self.act(x).to(dtype=dtype)
+        else:
+            x = causal_conv1d_update(
+                x,
+                conv_state,
+                rearrange(self.conv1d.weight, "d 1 w -> d w"),
+                self.conv1d.bias,
+                self.activation,
+            )
+        if not self.repeat_kv_before_conv:
+            x = rearrange(x, "b (n_group dstate) -> b n_group dstate", dstate=self.d_state)
+            x = torch.repeat_interleave(x, dim=1, repeats=self.repeat_group)
+            x = rearrange(x, "b n_group dstate -> b (n_group dstate)")
+        x = rearrange(x, "b (h d) -> b h d", h=self.num_C_head)
+        dt = rearrange(dt, "b (h d) -> b h d", h=self.num_C_head)
+        A = rearrange(A, "(h d) n -> h d n", h=self.num_C_head)
+        D = rearrange(self.D, "(h d) -> h d", h=self.num_C_head)
+        z = rearrange(z, "b (h d) -> b h d", h=self.num_C_head)
+        dt_bias = rearrange(self.dt_proj.bias, "(h d) -> h d", h=self.num_C_head)
+        # SSM step
+        assert selective_state_update is not None
+        y = selective_state_update(ssm_state, x, dt, A, B, C, D, z=z, dt_bias=dt_bias, dt_softplus=True)
+        y = rearrange(y, "b h d -> b (h d)")
+        out = self.out_proj(y)
+        return out.unsqueeze(1), conv_state, ssm_state
+    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
+        device = self.out_proj.weight.device
+        conv_dtype = self.conv1d.weight.dtype if dtype is None else dtype
+        if self.repeat_kv_before_conv:
+            conv_state = torch.zeros(batch_size, self.d_inner, self.d_conv, device=device, dtype=conv_dtype)
+        else:
+            conv_state = torch.zeros(batch_size, self.d_xb, self.d_conv, device=device, dtype=conv_dtype)
+        ssm_dtype = self.dt_proj.weight.dtype if dtype is None else dtype
+        ssm_state = torch.zeros(
+            batch_size, self.num_C_head, self.d_inner // self.num_C_head, self.d_state, device=device, dtype=ssm_dtype
+        )
+        return conv_state, ssm_state
+    def _get_states_from_cache(self, inference_params, batch_size, initialize_states=False):
+        """
+        conv_state: (batch, conv_dim, d_conv), with conv_dim = d_inner if repeat_kv_before_conv else d_xb
+        ssm_state: (batch, num_C_head, d_inner // num_C_head, d_state)
+        """
+        assert self.layer_idx is not None
+        # Get states
+        ssm_states = inference_params.ssm_states[self.layer_idx]
+        conv_states = inference_params.conv_states[self.layer_idx]
+        if initialize_states:
+            ssm_states.zero_()
+            conv_states.zero_()
+        return ssm_states, conv_states
+class AprielSSMM2DecoderLayer(nn.Module):
+    _mixer_class = Mamba
+    def __init__(self, config: AprielHConfig, layer_idx: int, device=None, dtype=None, **kwargs):
+        super().__init__(**kwargs)
+        factory_kwargs = {"device": device, "dtype": dtype}
+        self.hidden_size = config.hidden_size
+        self.mixer = self._mixer_class(
+            d_model=config.hidden_size,
+            layer_idx=layer_idx,
+            **config.ssm_cfg,
+            **factory_kwargs,
+        )
+        self.mlp = MistralMLP(config)
+        self.input_layernorm = MistralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = MistralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+    def forward(
+        self, hidden_states: torch.Tensor, **kwargs
+    ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        mixer_outputs = self.mixer(
+            hidden_states,
+            **kwargs,
+        )
+        hidden_states = mixer_outputs["hidden_states"].to(residual.dtype) + residual
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        outputs = (hidden_states,)
+        return outputs
+class AprielHybridIdentity(nn.Module):
+    def __init__(self, config: AprielHConfig):
+        super().__init__()
+        self.config = config
+    def forward(self, hidden_states: torch.Tensor, **kwargs):
+        return (hidden_states,)
+class AprielHModel(MistralModel):
+    """
+    Hybrid decoder whose layers follow *config.hybrid_block_layout*. Each layer is one of
+    [`MistralDecoderLayer`, `AprielSSMM2DecoderLayer`, `AprielHybridIdentity`]
+    Args:
+        config: AprielHConfig
+    """
+    def __init__(self, config: AprielHConfig, **kwargs):
+        config_copy = copy.deepcopy(config)
+        config_copy.num_hidden_layers = 0
+        super().__init__(config_copy, **kwargs)
+        self.config = config
+        blocks = []
+        logger.info(f"Loading hybrid model with the following layout: {config.hybrid_block_layout}")
+        # `block_type` avoids shadowing the builtin `type`
+        for layer_idx, block_type in enumerate(config.hybrid_block_layout):
+            if block_type == "m2":
+                blocks.append(AprielSSMM2DecoderLayer(config, layer_idx))
+            elif block_type == "t":
+                blocks.append(MistralDecoderLayer(config, layer_idx))
+            elif block_type == "i":
+                blocks.append(AprielHybridIdentity(config))
+            else:
+                raise ValueError(f"Invalid block type: {block_type}")
+        self.layers = nn.ModuleList(blocks)
+        # Initialize weights and apply final processing
+        self.post_init()
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+    ) -> BaseModelOutputWithPast:
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        if use_cache and past_key_values is None:
+            # for the case where prepare_inputs_for_generation is not called to create the cache (as in fast-llm test)
+            batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0]
+            past_key_values = HybridMambaAttentionDynamicCache(self.config, batch_size, self.dtype, device=self.device)
+        output = super().forward(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            cache_position=cache_position,
+            **flash_attn_kwargs,
+        )
+        past_key_values: HybridMambaAttentionDynamicCache = output.past_key_values
+        if past_key_values and not past_key_values.has_previous_state:
+            past_key_values.has_previous_state = True
+        return output
+class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+class AprielThinkerSSMHybridPreTrainedModel(PreTrainedModel):
+    config_class = AprielHConfig
+    base_model_prefix = "model"
+    _no_split_modules = ["MistralDecoderLayer", "AprielSSMDecoderLayer", "AprielSSMM2DecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_cache_class = True
+    _supports_quantized_cache = True
+    _supports_static_cache = True
+    _supports_attention_backend = True
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, MistralRMSNorm):
+            module.weight.data.fill_(1.0)
+class AprielHForCausalLM(AprielThinkerSSMHybridPreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    def __init__(self, config: AprielHConfig, **kwargs):
+        super().__init__(config, **kwargs)
+        self.model = AprielHModel(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def set_decoder(self, decoder):
+        self.model = decoder
+    def get_decoder(self):
+        return self.model
+    def prepare_inputs_for_generation(
+        self,
+        input_ids,
+        past_key_values=None,
+        attention_mask=None,
+        inputs_embeds=None,
+        output_router_logits=False,
+        cache_position=None,
+        position_ids=None,
+        use_cache=True,
+        **kwargs,
+    ):
+        # Overwritten -- has a unique cache type, `HybridMambaAttentionDynamicCache`
+        empty_past_kv = past_key_values is None or not isinstance(past_key_values, HybridMambaAttentionDynamicCache)
+        # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
+        # Exception 1: when passing `inputs_embeds`, `input_ids` may be missing entries
+        # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
+        # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case.
+        #              (we can't check exception 3 while compiling)
+        if not empty_past_kv:
+            if inputs_embeds is not None or cache_position[-1] >= input_ids.shape[1]:  # Exception 1  # Exception 3
+                input_ids = input_ids[:, -cache_position.shape[0] :]
+            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
+                input_ids = input_ids[:, cache_position]
+        else:
+            past_key_values = HybridMambaAttentionDynamicCache(
+                self.config, input_ids.shape[0], self.dtype, device=self.device
+            )
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if not empty_past_kv:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and empty_past_kv:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids.contiguous()}  # `contiguous()` needed for compilation use cases
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": use_cache,
+                "attention_mask": attention_mask,
+                "output_router_logits": output_router_logits,
+                "cache_position": cache_position,
+            }
+        )
+        return model_inputs
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[KwargsForCausalLM],
+    ) -> Union[tuple, CausalLMOutputWithPast]:
+        r"""
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
+            logits_to_keep (`int` or `torch.Tensor`, *optional*):
+                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
+                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+                This is useful when using packed tensor format (single dimension for batch and sequence length).
+        Returns:
+        Example:
+        ```python
+        >>> from transformers import AutoModelForCausalLM, AutoTokenizer
+        >>> # "path/to/apriel-h" is a placeholder: substitute the repository ID or local path of this checkpoint
+        >>> model = AutoModelForCausalLM.from_pretrained("path/to/apriel-h", trust_remote_code=True)
+        >>> tokenizer = AutoTokenizer.from_pretrained("path/to/apriel-h")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs: BaseModelOutputWithPast = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            cache_position=cache_position,
+            mamba_mask=attention_mask,  # non-expanded mask
+            **kwargs,
+        )
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        return AprielHybridCausalOutput(
+            loss=loss,
+            logits=logits,
+            all_hidden_states=outputs.hidden_states,
+            past_key_values=outputs.past_key_values,
+        )
+__all__ = [
+    "AprielHForCausalLM",
+    "AprielHModel",
+]
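
The fallback branch of `Mamba.step` in the diff above updates `conv_state` by rolling the buffer one slot to the left, writing the new input into the last slot, and taking a weighted sum over the window. The following is a minimal standalone sketch (not part of the commit) that checks this rolling update against a full causal depthwise convolution. The tensor sizes and the names `full`, `stepwise`, and `max_abs_diff` are illustrative assumptions, and the SiLU activation is left out because it would be applied identically after both paths.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, channels, d_conv, seqlen = 2, 4, 4, 6

# Depthwise conv weights in the (d, w) layout obtained from conv1d.weight ("d 1 w -> d w")
weight = torch.randn(channels, d_conv)
bias = torch.randn(channels)
x = torch.randn(batch, channels, seqlen)

# Reference: causal depthwise convolution over the whole sequence.
# padding=d_conv - 1 pads both sides; truncating to seqlen keeps the causal outputs.
full = F.conv1d(x, weight.unsqueeze(1), bias, padding=d_conv - 1, groups=channels)[..., :seqlen]

# Token-by-token: conv_state holds the last d_conv inputs per channel, shape (B, D, W)
conv_state = torch.zeros(batch, channels, d_conv)
steps = []
for t in range(seqlen):
    conv_state.copy_(torch.roll(conv_state, shifts=-1, dims=-1))  # drop the oldest input
    conv_state[:, :, -1] = x[:, :, t]                             # append the newest input
    steps.append(torch.sum(conv_state * weight, dim=-1) + bias)   # (B, D)
stepwise = torch.stack(steps, dim=-1)  # (B, D, L)

max_abs_diff = (full - stepwise).abs().max().item()
```

If the two paths agree, `max_abs_diff` should sit at floating-point noise level, which is why a zero-initialised `conv_state` can stand in for the left padding of the full convolution.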

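`AprielHForCausalLM.forward` picks the positions that receive logits with `slice(-logits_to_keep, None)`. A small sketch, using a plain list in place of the sequence dimension, of why `logits_to_keep=0` keeps every position while a positive value keeps only the tail. `keep_indices` is a hypothetical helper introduced for illustration, not part of the model code.

```python
def keep_indices(seq_len: int, logits_to_keep: int) -> list[int]:
    """Return the sequence positions selected by slice(-logits_to_keep, None)."""
    positions = list(range(seq_len))
    return positions[slice(-logits_to_keep, None)]


# -0 == 0, so the slice becomes [0:], which selects the whole sequence
all_positions = keep_indices(5, 0)
# A positive value keeps only the trailing positions
last_position = keep_indices(5, 1)
last_two = keep_indices(5, 2)
```

During generation only the last position is needed, so passing `logits_to_keep=1` avoids projecting every hidden state through `lm_head`.
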
tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

The diff for this file is too large to render. See raw diff