alea-institute committed
Commit ae30578 · verified · 1 Parent(s): 4c87014

Upload KL3M multi-word tokenizer (32K) - Update README

Files changed (1): README.md (+129 -26)
README.md CHANGED
@@ -38,8 +38,8 @@ This tokenizer is part of a hierarchically nested family. Token IDs in smaller v
 | 8,192 (8K) | [alea-institute/kl3m-multi-word-001-8k](https://huggingface.co/alea-institute/kl3m-multi-word-001-8k) | 249 KB |
 | 16,384 (16K) | [alea-institute/kl3m-multi-word-001-16k](https://huggingface.co/alea-institute/kl3m-multi-word-001-16k) | 529 KB |
 | 32,768 (32K) | [alea-institute/kl3m-multi-word-001-32k](https://huggingface.co/alea-institute/kl3m-multi-word-001-32k) | 1.2 MB |
- | 65,536 (65K) | [alea-institute/kl3m-multi-word-001-65k](https://huggingface.co/alea-institute/kl3m-multi-word-001-65k) | 2.4 MB |
- | 131,072 (131K) | [alea-institute/kl3m-multi-word-001-131k](https://huggingface.co/alea-institute/kl3m-multi-word-001-131k) | 5.2 MB |
+ | 65,536 (64K) | [alea-institute/kl3m-multi-word-001-64k](https://huggingface.co/alea-institute/kl3m-multi-word-001-64k) | 2.4 MB |
+ | 131,072 (128K) | [alea-institute/kl3m-multi-word-001-128k](https://huggingface.co/alea-institute/kl3m-multi-word-001-128k) | 5.2 MB |
 
 **→ You are viewing: 32,768 (32K)**
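The claim in this hunk's section lead-in (token IDs in smaller vocabularies are preserved in the larger ones) can be checked directly. A minimal editor's sketch, assuming the per-size `tokenizer-*.json` files named in the README's examples below are available locally:

```python
from tokenizers import Tokenizer

# load two sizes from the family; filenames follow the README's own examples
small = Tokenizer.from_file("tokenizer-4096.json")   # 4K vocabulary
large = Tokenizer.from_file("tokenizer-32768.json")  # 32K vocabulary

small_vocab = small.get_vocab()  # dict: token string -> token id
large_vocab = large.get_vocab()

# every 4K token should map to the identical id in the 32K vocabulary
mismatches = {tok: (tid, large_vocab.get(tok))
              for tok, tid in small_vocab.items()
              if large_vocab.get(tok) != tid}
print(f"4K tokens: {len(small_vocab)}, id mismatches in 32K: {len(mismatches)}")
# if the nesting holds, the mismatch count is 0
```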
 
@@ -49,46 +49,87 @@ This tokenizer is part of a hierarchically nested family. Token IDs in smaller v
 
 Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:
 
- **Example: "with respect to" (common legal phrase)**
+ **Example 1: "with respect to" (common legal phrase)**
 ```python
- # 4K tokenizer: 3 tokens
- ['with respec', 't ', 'to'] → [2286, 282, 393]
+ from tokenizers import Tokenizer
+
+ tok4k = Tokenizer.from_file("tokenizer-4096.json")
+ # the remaining sizes are loaded here as well; Example 4 and the
+ # nesting examples in section 2 below use them
+ tok8k = Tokenizer.from_file("tokenizer-8192.json")
+ tok16k = Tokenizer.from_file("tokenizer-16384.json")
+ tok32k = Tokenizer.from_file("tokenizer-32768.json")
+ tok64k = Tokenizer.from_file("tokenizer-65536.json")
+ tok128k = Tokenizer.from_file("tokenizer-131072.json")
+
+ text = "with respect to"
 
- # 131K tokenizer: 1 token
- ['with respect to'] → [15878]
+ # 4K tokenizer: 3 tokens
+ tok4k.encode(text).tokens
+ # ['with respec', 't ', 'to']
+ tok4k.encode(text).ids
+ # [2317, 313, 424]
+
+ # 128K tokenizer: 1 token
+ tok128k.encode(text).tokens
+ # ['with respect to']
+ tok128k.encode(text).ids
+ # [15903]
 ```
 
- **Example: "Supreme Court"**
+ **Example 2: "Supreme Court"**
 ```python
+ text = "Supreme Court"
+
 # 4K tokenizer: 5 tokens
- ['Sup', 'rem', 'e ', 'Cour', 't'] → [4062, 1847, 265, 3032, 123]
+ tok4k.encode(text).tokens
+ # ['Sup', 'rem', 'e ', 'Cour', 't']
+ tok4k.encode(text).ids
+ # [4091, 1878, 296, 3063, 170]
+
+ # 128K tokenizer: 1 token
+ tok128k.encode(text).tokens
+ # ['Supreme Court']
+ tok128k.encode(text).ids
+ # [81445]
+ ```
 
- # 131K tokenizer: 1 token
- ['Supreme Court'] → [81439]
+ **Example 3: "United States"**
+ ```python
+ text = "United States"
+
+ # 4K: 2 tokens → 128K: 1 token
+ tok4k.encode(text).tokens   # ['United St', 'ates']
+ tok128k.encode(text).tokens # ['United States']
 ```
 
- **Other multi-word tokens in this vocabulary:**
- - Common legal phrases: "United States" (→1 token), "in accordance with" (→1 token), "on behalf of" (→1 token)
- - Frequent constructions: "of the " (→1 token), "in the " (→1 token), ", the " (→1 token)
+ **Example 4: "Department of State"**
+ ```python
+ text = "Department of State"
+
+ # 4K: 3 tokens → 8K+: 2 tokens
+ tok4k.encode(text).tokens # ['Depart', 'ment of ', 'State']
+ tok8k.encode(text).tokens # ['Department of ', 'State']
+ ```
+
+ **Other multi-word tokens in larger vocabularies:**
+ - Legal phrases: "in accordance with", "on behalf of", "pursuant to"
+ - Frequent constructions: "of the ", "in the ", ", the ", ". The "
 - Legal terminology: "the defendant", "the Court", "Therefore,", "However,"
 
 ### 2. Hierarchical Token ID Nesting
 
- Token IDs are **preserved across vocabulary sizes** — a token with ID 1846 in the 4K vocabulary has the **same ID** in all larger vocabularies:
+ Token IDs are **preserved across vocabulary sizes** — a token with ID 1877 in the 4K vocabulary has the **same ID** in all larger vocabularies:
 
 ```python
- # "of the" tokenizes to ID 1846 in ALL vocabulary sizes
- 4K:   [1846]
- 8K:   [1846]
- 16K:  [1846]
- 32K:  [1846]
- 65K:  [1846]
- 131K: [1846]
+ # Example: "of the" has the same token ID across ALL vocabulary sizes
+ text = "of the"
+
+ tok4k.encode(text).ids   # [1877]
+ tok8k.encode(text).ids   # [1877]
+ tok16k.encode(text).ids  # [1877]
+ tok32k.encode(text).ids  # [1877]
+ tok64k.encode(text).ids  # [1877]
+ tok128k.encode(text).ids # [1877]
 
 # Special tokens are identical across all sizes
- <|start|>: [0]
- <|end|>: [1]
- <|pad|>: [2]
+ tok4k.encode("<|start|>").ids # [0]
+ tok4k.encode("<|end|>").ids   # [1]
+ tok4k.encode("<|pad|>").ids   # [2]
 ```
 
 This enables:
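The list that follows "This enables:" falls outside this hunk, but one concrete consequence of the shared IDs is worth sketching: embedding rows trained against a smaller vocabulary can be carried over to a larger one unchanged, with only the new IDs needing fresh initialization. A minimal editor's sketch (the helper name is hypothetical, not from the README):

```python
import torch

def grow_embeddings(old_emb: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
    """Grow a (old_vocab, dim) embedding matrix to new_vocab_size rows.

    Because token IDs are nested, rows 0..old_vocab-1 keep their meaning;
    only the newly added IDs need fresh initialization.
    """
    old_vocab, dim = old_emb.shape
    new_emb = torch.empty(new_vocab_size, dim)
    torch.nn.init.normal_(new_emb, std=0.02)  # init rows for the new tokens
    new_emb[:old_vocab] = old_emb             # reuse all existing rows as-is
    return new_emb

# e.g. grow a 4K-vocab model's embeddings for the 32K tokenizer
emb4k = torch.randn(4096, 256)  # stand-in for trained weights
emb32k = grow_embeddings(emb4k, 32768)
```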
@@ -110,6 +151,68 @@ All tokenizers include 7 special tokens with consistent IDs:
 | `<\|sep\|>` | 5 | Separator token (BERT-style) |
 | `<\|mask\|>` | 6 | Mask token (MLM training) |
 
+ ### 4. Argument Notation Tokens (Optional)
+
+ Some tokenizer variants include 47 additional special tokens (IDs 7-53) for structured reasoning, debate, and argumentation:
+
+ #### Claim Type Markers
+ - `⧈` Fact/descriptive claim
+ - `⚖` Value/ethical claim
+ - `⏵` Policy/action claim
+ - `✦` Preference/taste claim
+
+ #### Belief Strength
+ - `⬤` Certain true
+ - `●` Strongly believe true
+ - `◐` Lean true
+ - `◌` Undecided
+ - `◑` Lean false
+ - `○` Certain false
+
+ #### Value/Attitude
+ - `⬆` Approve/good
+ - `⬇` Disapprove/bad
+ - `⇆` Mixed
+ - `⟂` Neutral
+
+ #### Structural Markers
+ - `∴` Therefore
+ - `∵` Because
+ - `⋀` And
+ - `⋁` Or
+ - `⟷` Equivalent
+ - `⟶` Supports
+ - `⟞` Undercuts
+ - `⇒` Explains
+ - `⟺` Mutual support
+ - `⊢` Evidence marker
+
+ #### Evidence Sources
+ - `👁` Observation
+ - `🧪` Experiment
+ - `📊` Data/statistics
+ - `📚` Theory/literature
+ - `🗣` Testimony
+ - `🤔` Intuition
+ - `★` Strong evidence
+ - `☆` Weak evidence
+
+ #### Meta-Discourse
+ - `⚠` Warning/objection
+ - `❗` Emphasis
+ - `❓` Question
+ - `↻` Revision
+ - `✎` Reframe
+
+ #### Agent Markers
+ - `«` Open agent quote
+ - `»` Close agent quote
+
+ #### Numbered Markers
+ - `①` `②` `③` `④` `⑤` `⑥` `⑦` `⑧` `⑨` `⑩` Circled numbers 1-10
+
+ These tokens enable models to represent structured arguments, track evidence strength, and model multi-agent debates with explicit reasoning chains.
+
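The commit does not show the notation in use. As a hedged illustration only, a tagged debate turn might look like the string below; in a variant that actually ships these specials, each marker should encode to a single ID in the 7-53 range (the composition rules here are the editor's assumption):

```python
from tokenizers import Tokenizer

# assumption: this variant's tokenizer.json includes the argument-notation
# special tokens; the marker composition below is illustrative only
tok = Tokenizer.from_file("tokenizer-32768.json")

# agent ① asserts a fact claim (⧈), strongly believed (●), backed by
# data/statistics (⊢ 📊), then draws a policy conclusion (∴ ⏵)
turn = "« ① ⧈ ● crime fell after the pilot ⊢ 📊 ∴ ⏵ expand the program »"

enc = tok.encode(turn)
print(list(zip(enc.tokens, enc.ids)))  # each marker should surface as one token
```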
 ## Usage
 
 ### Quick Start
@@ -216,7 +319,7 @@ Compare model performance across vocabulary sizes:
 
 ```python
 # Train models with different vocabulary sizes
- for vocab_size in ["4k", "8k", "16k", "32k", "65k", "131k"]:
+ for vocab_size in ["4k", "8k", "16k", "32k", "64k", "128k"]:
     tokenizer = PreTrainedTokenizerFast.from_pretrained(
         f"alea-institute/kl3m-multi-word-001-{vocab_size}"
     )
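Beyond full training runs, a quick proxy for comparing the sizes is raw compression on legal text. A minimal editor's sketch over the local `tokenizer-*.json` files used earlier; the sample sentence is illustrative:

```python
from tokenizers import Tokenizer

sizes = [4096, 8192, 16384, 32768, 65536, 131072]
sample = ("The defendant shall, in accordance with the order of the "
          "Supreme Court, act on behalf of the United States.")

# fewer tokens per character indicates stronger multi-word merging
for n in sizes:
    tok = Tokenizer.from_file(f"tokenizer-{n}.json")
    n_tokens = len(tok.encode(sample).ids)
    print(f"{n:>6}-token vocab: {n_tokens} tokens "
          f"({n_tokens / len(sample):.3f} per char)")
```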
 