Upload KL3M multi-word tokenizer (32K) - Update README
README.md

This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are preserved in larger ones:
| Vocabulary Size | Repository | File Size |
|---|---|---|
| 8,192 (8K) | [alea-institute/kl3m-multi-word-001-8k](https://huggingface.co/alea-institute/kl3m-multi-word-001-8k) | 249 KB |
| 16,384 (16K) | [alea-institute/kl3m-multi-word-001-16k](https://huggingface.co/alea-institute/kl3m-multi-word-001-16k) | 529 KB |
| 32,768 (32K) | [alea-institute/kl3m-multi-word-001-32k](https://huggingface.co/alea-institute/kl3m-multi-word-001-32k) | 1.2 MB |
| 65,536 (64K) | [alea-institute/kl3m-multi-word-001-64k](https://huggingface.co/alea-institute/kl3m-multi-word-001-64k) | 2.4 MB |
| 131,072 (128K) | [alea-institute/kl3m-multi-word-001-128k](https://huggingface.co/alea-institute/kl3m-multi-word-001-128k) | 5.2 MB |

**→ You are viewing: 32,768 (32K)**
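
Each size loads the same way; a minimal check for the 32K variant shown here (the expected length is an assumption based on the table above):

```python
from transformers import PreTrainedTokenizerFast

# Load the 32K variant; the repository name comes from the table above.
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-001-32k"
)
print(len(tokenizer))  # expected: 32768 entries, special tokens included
```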
Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:

**Example 1: "with respect to" (common legal phrase)**

```python
from tokenizers import Tokenizer

tok4k = Tokenizer.from_file("tokenizer-4096.json")
tok128k = Tokenizer.from_file("tokenizer-131072.json")

text = "with respect to"

# 4K tokenizer: 3 tokens
tok4k.encode(text).tokens
# ['with respec', 't ', 'to']
tok4k.encode(text).ids
# [2317, 313, 424]

# 128K tokenizer: 1 token
tok128k.encode(text).tokens
# ['with respect to']
tok128k.encode(text).ids
# [15903]
```

**Example 2: "Supreme Court"**

```python
text = "Supreme Court"

# 4K tokenizer: 5 tokens
tok4k.encode(text).tokens
# ['Sup', 'rem', 'e ', 'Cour', 't']
tok4k.encode(text).ids
# [4091, 1878, 296, 3063, 170]

# 128K tokenizer: 1 token
tok128k.encode(text).tokens
# ['Supreme Court']
tok128k.encode(text).ids
# [81445]
```

**Example 3: "United States"**

```python
text = "United States"

# 4K: 2 tokens → 128K: 1 token
tok4k.encode(text).tokens    # ['United St', 'ates']
tok128k.encode(text).tokens  # ['United States']
```

**Example 4: "Department of State"**

```python
# The 8K variant was not loaded above; filename assumed to follow the same pattern.
tok8k = Tokenizer.from_file("tokenizer-8192.json")

text = "Department of State"

# 4K: 3 tokens → 8K+: 2 tokens
tok4k.encode(text).tokens  # ['Depart', 'ment of ', 'State']
tok8k.encode(text).tokens  # ['Department of ', 'State']
```

**Other multi-word tokens in larger vocabularies:**

- Legal phrases: "in accordance with", "on behalf of", "pursuant to"
- Frequent constructions: "of the ", "in the ", ", the ", ". The "
- Legal terminology: "the defendant", "the Court", "Therefore,", "However,"
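
A quick way to spot-check these is to count tokens per phrase; a minimal sketch, assuming a local `tokenizer-32768.json` named like the files in the examples above:

```python
from tokenizers import Tokenizer

# Count how many tokens each phrase needs in the 32K vocabulary.
tok32k = Tokenizer.from_file("tokenizer-32768.json")

phrases = ["in accordance with", "on behalf of", "pursuant to", "the defendant"]
for phrase in phrases:
    n = len(tok32k.encode(phrase).ids)
    print(f"{phrase!r}: {n} token(s)")
```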

### 2. Hierarchical Token ID Nesting

Token IDs are **preserved across vocabulary sizes**. A token with ID 1877 in the 4K vocabulary has the **same ID** in all larger vocabularies:

```python
# Example: "of the" has the same token ID across ALL vocabulary sizes
# (tok8k through tok64k are loaded the same way as the tokenizers above)
text = "of the"

tok4k.encode(text).ids    # [1877]
tok8k.encode(text).ids    # [1877]
tok16k.encode(text).ids   # [1877]
tok32k.encode(text).ids   # [1877]
tok64k.encode(text).ids   # [1877]
tok128k.encode(text).ids  # [1877]

# Special tokens are identical across all sizes
tok4k.encode("<|start|>").ids  # [0]
tok4k.encode("<|end|>").ids    # [1]
tok4k.encode("<|pad|>").ids    # [2]
```

This enables:

[...]
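
One practical consequence of this nesting (an illustrative sketch, not from the README; the model width and use of PyTorch are assumptions): an embedding matrix trained against a smaller vocabulary can seed the first rows of a larger one, since those IDs refer to the same tokens.

```python
import torch

# Sketch: grow a trained 32K-vocabulary embedding matrix to 64K.
# Because token IDs nest, rows 0..32767 keep their meaning unchanged.
d_model = 768  # hypothetical model width
emb_32k = torch.nn.Embedding(32_768, d_model)  # stands in for trained weights
emb_64k = torch.nn.Embedding(65_536, d_model)

with torch.no_grad():
    emb_64k.weight[:32_768].copy_(emb_32k.weight)  # reuse learned rows
# Rows 32768..65535 keep their fresh initialization and train from scratch.
```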

All tokenizers include 7 special tokens with consistent IDs:

| Token | ID | Description |
|---|---|---|
| ... | ... | ... |
| `<\|sep\|>` | 5 | Separator token (BERT-style) |
| `<\|mask\|>` | 6 | Mask token (MLM training) |
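
Following the same pattern as the `<|start|>` / `<|end|>` / `<|pad|>` checks above, the remaining special tokens resolve to the IDs in this table:

```python
# Consistent IDs across every vocabulary size, per the table above
tok4k.encode("<|sep|>").ids     # [5]
tok4k.encode("<|mask|>").ids    # [6]
tok128k.encode("<|mask|>").ids  # [6]
```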

### 4. Argument Notation Tokens (Optional)

Some tokenizer variants include 47 additional special tokens (IDs 7-53) for structured reasoning, debate, and argumentation:

#### Claim Type Markers
- `β§` Fact/descriptive claim
- `β` Value/ethical claim
- `β΅` Policy/action claim
- `β¦` Preference/taste claim

#### Belief Strength
- `⬀` Certain true
- `β` Strongly believe true
- `β` Lean true
- `β` Undecided
- `β` Lean false
- `β` Certain false

#### Value/Attitude
- `β¬` Approve/good
- `β¬` Disapprove/bad
- `β` Mixed
- `β` Neutral

#### Structural Markers
- `∴` Therefore
- `∵` Because
- `β` And
- `β` Or
- `β·` Equivalent
- `βΆ` Supports
- `β` Undercuts
- `β’` Explains
- `βΊ` Mutual support
- `β’` Evidence marker

#### Evidence Sources
- `π` Observation
- `🧪` Experiment
- `π` Data/statistics
- `π` Theory/literature
- `π£` Testimony
- `π€` Intuition
- `✅` Strong evidence
- `β` Weak evidence

#### Meta-Discourse
- `⚠` Warning/objection
- `β` Emphasis
- `β` Question
- `β»` Revision
- `β` Reframe

#### Agent Markers
- `«` Open agent quote
- `»` Close agent quote

#### Numbered Markers
- `①` `②` `③` `④` `⑤` `⑥` `⑦` `⑧` `⑨` `⑩` Circled numbers 1-10

These tokens enable models to represent structured arguments, track evidence strength, and model multi-agent debates with explicit reasoning chains.
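
A hypothetical sketch of the intended use, assuming the loaded variant includes these tokens (only the clearly identifiable structural and agent markers are used here):

```python
# Hypothetical: tokenize an argument annotated with notation markers.
annotated = "«expert» ∵ the filing was untimely ∴ the motion is denied"

enc = tok32k.encode(annotated)
print(enc.tokens)  # each marker should surface as a single special token
```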
## Usage
### Quick Start

[...]

Compare model performance across vocabulary sizes:

```python
from transformers import PreTrainedTokenizerFast

# Train models with different vocabulary sizes
for vocab_size in ["4k", "8k", "16k", "32k", "64k", "128k"]:
    tokenizer = PreTrainedTokenizerFast.from_pretrained(
        f"alea-institute/kl3m-multi-word-001-{vocab_size}"
    )
```
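
Before committing to full training runs, a lighter-weight comparison (a sketch; `corpus` stands in for real sample documents) is to measure tokenized sequence length across the family:

```python
from transformers import PreTrainedTokenizerFast

corpus = [
    "The parties agree, pursuant to Section 2, that the defendant shall comply.",
]  # hypothetical sample; substitute real legal documents

for vocab_size in ["4k", "8k", "16k", "32k", "64k", "128k"]:
    tokenizer = PreTrainedTokenizerFast.from_pretrained(
        f"alea-institute/kl3m-multi-word-001-{vocab_size}"
    )
    total = sum(len(tokenizer(text)["input_ids"]) for text in corpus)
    print(f"{vocab_size}: {total} tokens")
```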