alea-institute committed
Commit ae30578 · verified · 1 Parent(s): 4c87014

Upload KL3M multi-word tokenizer (32K) - Update README

Files changed (1): README.md (+129 -26)
README.md CHANGED
@@ -38,8 +38,8 @@ This tokenizer is part of a hierarchically nested family. Token IDs in smaller v
 | 8,192 (8K) | [alea-institute/kl3m-multi-word-001-8k](https://huggingface.co/alea-institute/kl3m-multi-word-001-8k) | 249 KB |
 | 16,384 (16K) | [alea-institute/kl3m-multi-word-001-16k](https://huggingface.co/alea-institute/kl3m-multi-word-001-16k) | 529 KB |
 | 32,768 (32K) | [alea-institute/kl3m-multi-word-001-32k](https://huggingface.co/alea-institute/kl3m-multi-word-001-32k) | 1.2 MB |
- | 65,536 (65K) | [alea-institute/kl3m-multi-word-001-65k](https://huggingface.co/alea-institute/kl3m-multi-word-001-65k) | 2.4 MB |
- | 131,072 (131K) | [alea-institute/kl3m-multi-word-001-131k](https://huggingface.co/alea-institute/kl3m-multi-word-001-131k) | 5.2 MB |
+ | 65,536 (64K) | [alea-institute/kl3m-multi-word-001-64k](https://huggingface.co/alea-institute/kl3m-multi-word-001-64k) | 2.4 MB |
+ | 131,072 (128K) | [alea-institute/kl3m-multi-word-001-128k](https://huggingface.co/alea-institute/kl3m-multi-word-001-128k) | 5.2 MB |
 
 **→ You are viewing: 32,768 (32K)**
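The claim in this hunk's section lead-in (token IDs in smaller vocabularies are preserved in the larger ones) can be checked directly. A minimal editor's sketch, assuming the per-size `tokenizer-*.json` files named in the README's examples below are available locally:

```python
from tokenizers import Tokenizer

# load two sizes from the family; filenames follow the README's own examples
small = Tokenizer.from_file("tokenizer-4096.json")   # 4K vocabulary
large = Tokenizer.from_file("tokenizer-32768.json")  # 32K vocabulary

small_vocab = small.get_vocab()  # dict: token string -> token id
large_vocab = large.get_vocab()

# every 4K token should map to the identical id in the 32K vocabulary
mismatches = {tok: (tid, large_vocab.get(tok))
              for tok, tid in small_vocab.items()
              if large_vocab.get(tok) != tid}
print(f"4K tokens: {len(small_vocab)}, id mismatches in 32K: {len(mismatches)}")
# if the nesting holds, the mismatch count is 0
```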
 
@@ -49,46 +49,87 @@ This tokenizer is part of a hierarchically nested family. Token IDs in smaller v
 
 Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:
 
- **Example: "with respect to" (common legal phrase)**
+ **Example 1: "with respect to" (common legal phrase)**
 ```python
- # 4K tokenizer: 3 tokens
- ['with respec', 't ', 'to'] → [2286, 282, 393]
+ from tokenizers import Tokenizer
+
+ tok4k = Tokenizer.from_file("tokenizer-4096.json")
+ # the remaining sizes are loaded here as well; Example 4 and the
+ # nesting examples in section 2 below use them
+ tok8k = Tokenizer.from_file("tokenizer-8192.json")
+ tok16k = Tokenizer.from_file("tokenizer-16384.json")
+ tok32k = Tokenizer.from_file("tokenizer-32768.json")
+ tok64k = Tokenizer.from_file("tokenizer-65536.json")
+ tok128k = Tokenizer.from_file("tokenizer-131072.json")
+
+ text = "with respect to"
 
- # 131K tokenizer: 1 token
- ['with respect to'] → [15878]
+ # 4K tokenizer: 3 tokens
+ tok4k.encode(text).tokens
+ # ['with respec', 't ', 'to']
+ tok4k.encode(text).ids
+ # [2317, 313, 424]
+
+ # 128K tokenizer: 1 token
+ tok128k.encode(text).tokens
+ # ['with respect to']
+ tok128k.encode(text).ids
+ # [15903]
 ```
 
- **Example: "Supreme Court"**
+ **Example 2: "Supreme Court"**
 ```python
+ text = "Supreme Court"
+
 # 4K tokenizer: 5 tokens
- ['Sup', 'rem', 'e ', 'Cour', 't'] → [4062, 1847, 265, 3032, 123]
+ tok4k.encode(text).tokens
+ # ['Sup', 'rem', 'e ', 'Cour', 't']
+ tok4k.encode(text).ids
+ # [4091, 1878, 296, 3063, 170]
+
+ # 128K tokenizer: 1 token
+ tok128k.encode(text).tokens
+ # ['Supreme Court']
+ tok128k.encode(text).ids
+ # [81445]
+ ```
 
- # 131K tokenizer: 1 token
- ['Supreme Court'] → [81439]
+ **Example 3: "United States"**
+ ```python
+ text = "United States"
+
+ # 4K: 2 tokens → 128K: 1 token
+ tok4k.encode(text).tokens   # ['United St', 'ates']
+ tok128k.encode(text).tokens # ['United States']
 ```
 
- **Other multi-word tokens in this vocabulary:**
- - Common legal phrases: "United States" (→1 token), "in accordance with" (→1 token), "on behalf of" (→1 token)
- - Frequent constructions: "of the " (→1 token), "in the " (→1 token), ", the " (→1 token)
+ **Example 4: "Department of State"**
+ ```python
+ text = "Department of State"
+
+ # 4K: 3 tokens → 8K+: 2 tokens
+ tok4k.encode(text).tokens # ['Depart', 'ment of ', 'State']
+ tok8k.encode(text).tokens # ['Department of ', 'State']
+ ```
+
+ **Other multi-word tokens in larger vocabularies:**
+ - Legal phrases: "in accordance with", "on behalf of", "pursuant to"
+ - Frequent constructions: "of the ", "in the ", ", the ", ". The "
 - Legal terminology: "the defendant", "the Court", "Therefore,", "However,"
 
 ### 2. Hierarchical Token ID Nesting
 
- Token IDs are **preserved across vocabulary sizes** — a token with ID 1846 in the 4K vocabulary has the **same ID** in all larger vocabularies:
+ Token IDs are **preserved across vocabulary sizes** — a token with ID 1877 in the 4K vocabulary has the **same ID** in all larger vocabularies:
 
 ```python
- # "of the" tokenizes to ID 1846 in ALL vocabulary sizes
- 4K:   [1846]
- 8K:   [1846]
- 16K:  [1846]
- 32K:  [1846]
- 65K:  [1846]
- 131K: [1846]
+ # Example: "of the" has the same token ID across ALL vocabulary sizes
+ text = "of the"
+
+ tok4k.encode(text).ids   # [1877]
+ tok8k.encode(text).ids   # [1877]
+ tok16k.encode(text).ids  # [1877]
+ tok32k.encode(text).ids  # [1877]
+ tok64k.encode(text).ids  # [1877]
+ tok128k.encode(text).ids # [1877]
 
 # Special tokens are identical across all sizes
- <|start|>: [0]
- <|end|>: [1]
- <|pad|>: [2]
+ tok4k.encode("<|start|>").ids # [0]
+ tok4k.encode("<|end|>").ids   # [1]
+ tok4k.encode("<|pad|>").ids   # [2]
 ```
 
 This enables:
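The list that follows "This enables:" falls outside this hunk, but one concrete consequence of the shared IDs is worth sketching: embedding rows trained against a smaller vocabulary can be carried over to a larger one unchanged, with only the new IDs needing fresh initialization. A minimal editor's sketch (the helper name is hypothetical, not from the README):

```python
import torch

def grow_embeddings(old_emb: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
    """Grow a (old_vocab, dim) embedding matrix to new_vocab_size rows.

    Because token IDs are nested, rows 0..old_vocab-1 keep their meaning;
    only the newly added IDs need fresh initialization.
    """
    old_vocab, dim = old_emb.shape
    new_emb = torch.empty(new_vocab_size, dim)
    torch.nn.init.normal_(new_emb, std=0.02)  # init rows for the new tokens
    new_emb[:old_vocab] = old_emb             # reuse all existing rows as-is
    return new_emb

# e.g. grow a 4K-vocab model's embeddings for the 32K tokenizer
emb4k = torch.randn(4096, 256)  # stand-in for trained weights
emb32k = grow_embeddings(emb4k, 32768)
```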
@@ -110,6 +151,68 @@ All tokenizers include 7 special tokens with consistent IDs:
 | `<\|sep\|>` | 5 | Separator token (BERT-style) |
 | `<\|mask\|>` | 6 | Mask token (MLM training) |
 
+ ### 4. Argument Notation Tokens (Optional)
+
+ Some tokenizer variants include 47 additional special tokens (IDs 7-53) for structured reasoning, debate, and argumentation:
+
+ #### Claim Type Markers
+ - `⧈` Fact/descriptive claim
+ - `⚖` Value/ethical claim
+ - `⏵` Policy/action claim
+ - `✦` Preference/taste claim
+
+ #### Belief Strength
+ - `⬤` Certain true
+ - `●` Strongly believe true
+ - `◐` Lean true
+ - `◌` Undecided
+ - `◑` Lean false
+ - `○` Certain false
+
+ #### Value/Attitude
+ - `⬆` Approve/good
+ - `⬇` Disapprove/bad
+ - `⇆` Mixed
+ - `⟂` Neutral
+
+ #### Structural Markers
+ - `∴` Therefore
+ - `∵` Because
+ - `⋀` And
+ - `⋁` Or
+ - `⟷` Equivalent
+ - `⟶` Supports
+ - `⟞` Undercuts
+ - `⇒` Explains
+ - `⟺` Mutual support
+ - `⊢` Evidence marker
+
+ #### Evidence Sources
+ - `👁` Observation
+ - `🧪` Experiment
+ - `📊` Data/statistics
+ - `📚` Theory/literature
+ - `🗣` Testimony
+ - `🤔` Intuition
+ - `★` Strong evidence
+ - `☆` Weak evidence
+
+ #### Meta-Discourse
+ - `⚠` Warning/objection
+ - `❗` Emphasis
+ - `❓` Question
+ - `↻` Revision
+ - `✎` Reframe
+
+ #### Agent Markers
+ - `«` Open agent quote
+ - `»` Close agent quote
+
+ #### Numbered Markers
+ - `①` `②` `③` `④` `⑤` `⑥` `⑦` `⑧` `⑨` `⑩` Circled numbers 1-10
+
+ These tokens enable models to represent structured arguments, track evidence strength, and model multi-agent debates with explicit reasoning chains.
+
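The commit does not show the notation in use. As a hedged illustration only, a tagged debate turn might look like the string below; in a variant that actually ships these specials, each marker should encode to a single ID in the 7-53 range (the composition rules here are the editor's assumption):

```python
from tokenizers import Tokenizer

# assumption: this variant's tokenizer.json includes the argument-notation
# special tokens; the marker composition below is illustrative only
tok = Tokenizer.from_file("tokenizer-32768.json")

# agent ① asserts a fact claim (⧈), strongly believed (●), backed by
# data/statistics (⊢ 📊), then draws a policy conclusion (∴ ⏵)
turn = "« ① ⧈ ● crime fell after the pilot ⊢ 📊 ∴ ⏵ expand the program »"

enc = tok.encode(turn)
print(list(zip(enc.tokens, enc.ids)))  # each marker should surface as one token
```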
 ## Usage
 
 ### Quick Start
@@ -216,7 +319,7 @@ Compare model performance across vocabulary sizes:
 
 ```python
 # Train models with different vocabulary sizes
- for vocab_size in ["4k", "8k", "16k", "32k", "65k", "131k"]:
+ for vocab_size in ["4k", "8k", "16k", "32k", "64k", "128k"]:
     tokenizer = PreTrainedTokenizerFast.from_pretrained(
         f"alea-institute/kl3m-multi-word-001-{vocab_size}"
     )
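Beyond full training runs, a quick proxy for comparing the sizes is raw compression on legal text. A minimal editor's sketch over the local `tokenizer-*.json` files used earlier; the sample sentence is illustrative:

```python
from tokenizers import Tokenizer

sizes = [4096, 8192, 16384, 32768, 65536, 131072]
sample = ("The defendant shall, in accordance with the order of the "
          "Supreme Court, act on behalf of the United States.")

# fewer tokens per character indicates stronger multi-word merging
for n in sizes:
    tok = Tokenizer.from_file(f"tokenizer-{n}.json")
    n_tokens = len(tok.encode(sample).ids)
    print(f"{n:>6}-token vocab: {n_tokens} tokens "
          f"({n_tokens / len(sample):.3f} per char)")
```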
 