# Estonian NER Model - Fine-tuned on Synthetic Government Data
This model is a domain-adapted version of tartuNLP/EstBERT_NER_v2, further fine-tuned on synthetically generated Estonian text focused on government services and public administration communications.
## Model Description

- **Base Model:** tartuNLP/EstBERT_NER_v2
- **Language:** Estonian (et)
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Synthetic data generated with the Google Gemini-3-pro API

This model specializes in extracting named entities from Estonian government and public-service text, including citizen communications with government agencies.
## Supported Entity Types

The model recognizes 11 entity types:
- **PER**: Person names
- **ORG**: Organizations, companies, government agencies
- **LOC**: Locations, addresses, streets, buildings
- **GPE**: Geopolitical entities (cities, counties, countries)
- **PROD**: Products
- **TITLE**: Titles and positions
- **EVENT**: Events
- **DATE**: Dates
- **TIME**: Time expressions
- **MONEY**: Monetary values
- **PERCENT**: Percentages
Each entity type uses BIO tagging: `B-` marks the first token of an entity, `I-` marks subsequent tokens inside it, and `O` marks tokens outside any entity.
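For reference, the full tag set can be enumerated from the 11 entity types, as in the sketch below; the label ordering here is an assumption, and the authoritative mapping is the checkpoint's `id2label` config.

```python
# Sketch: enumerate the BIO tag set implied by the 11 entity types.
# Label order is an assumption; read model.config.id2label from the
# published checkpoint for the authoritative mapping.
ENTITY_TYPES = [
    "PER", "ORG", "LOC", "GPE", "PROD", "TITLE",
    "EVENT", "DATE", "TIME", "MONEY", "PERCENT",
]

labels = ["O"] + [f"{prefix}-{etype}"
                  for etype in ENTITY_TYPES
                  for prefix in ("B", "I")]
print(len(labels))  # 23 tags: O plus B-/I- for each of the 11 types
```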
## Training Data

The model was fine-tuned on synthetically generated data created specifically for Estonian government and public-service domains. The synthetic dataset includes:

- **Generation Method:** Google Gemini-3-pro API with structured prompts (a hedged generation sketch follows this list)
- **Domain Coverage:** 22+ Estonian government agencies, including Töötukassa (Unemployment Insurance Fund), Maksu- ja Tolliamet (Tax and Customs Board), Politsei- ja Piirivalveamet (Police and Border Guard Board), and others
- **Topics:** Government services such as unemployment benefits, tax declarations, social insurance, permits, and registrations
- **Style Diversity:** Multiple writing styles (formal, casual, shorthand, mixed) to improve robustness
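For illustration only, a single generation call might look like the sketch below. It assumes the `google-genai` Python SDK; the model identifier, prompt wording, and expected output format are hypothetical stand-ins, not the actual (unpublished) generation pipeline.

```python
# Hypothetical sketch of one synthetic-example generation call, assuming
# the google-genai SDK (pip install google-genai). Prompt and model id
# are illustrative placeholders.
from google import genai

client = genai.Client()  # reads the API key from the environment

# Estonian prompt: "Write one short inquiry to Töötukassa and mark all
# named entities (PER, ORG, GPE, DATE, MONEY) as JSON with offsets."
prompt = (
    "Kirjuta üks lühike eestikeelne pöördumine Töötukassale ja märgi "
    "kõik nimeüksused (PER, ORG, GPE, DATE, MONEY) JSON-kujul koos "
    "algus- ja lõpupositsioonidega."
)

response = client.models.generate_content(
    model="gemini-3-pro",  # placeholder id taken from the card text
    contents=prompt,
)
print(response.text)
```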
### Why Synthetic Data?
Synthetic data generation allowed us to:
- Create domain-specific training examples for government services
- Ensure comprehensive coverage of Estonian public sector terminology
- Include diverse writing styles found in citizen-government communications
- Control entity distribution and annotation quality
## Training Details

- **Base Model:** tartuNLP/EstBERT_NER_v2
- **Training Epochs:** 10
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Max Sequence Length:** 512 tokens
- **Optimizer:** AdamW (weight decay: 0.01)
- **Training Framework:** Hugging Face Transformers + PyTorch
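A minimal fine-tuning sketch with these hyperparameters, built on the Hugging Face `Trainer`, might look like the following. The toy dataset class is a placeholder for the synthetic corpus, which is not published with this card; everything else mirrors the listed settings.

```python
# Sketch: fine-tuning with the hyperparameters listed above.
# The toy dataset stands in for the (unpublished) synthetic corpus.
from torch.utils.data import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

class ToyNerDataset(Dataset):
    """Placeholder: one all-'O' sentence instead of the real corpus."""
    def __init__(self, tokenizer):
        enc = tokenizer("Mari Tamm elab Tartus.", truncation=True, max_length=512)
        enc["labels"] = [0] * len(enc["input_ids"])  # toy labels only
        self.examples = [dict(enc)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

model_name = "tartuNLP/EstBERT_NER_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="estbert-ner-gov",      # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,                 # AdamW is Trainer's default optimizer
    report_to="none",
)

trainer = Trainer(model=model, args=args, train_dataset=ToyNerDataset(tokenizer))
trainer.train()
```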
## Usage

```python
from transformers import BertTokenizerFast, BertForTokenClassification, pipeline

# Load model and tokenizer ({model_name} is a placeholder for the published checkpoint name)
tokenizer = BertTokenizerFast.from_pretrained('buerokratt/{model_name}')
model = BertForTokenClassification.from_pretrained('buerokratt/{model_name}')

# Create NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Example text (illustrative: "Mari Tamm submitted an application to Töötukassa on 5 January 2024.")
text = "Mari Tamm esitas avalduse Töötukassale 5. jaanuaril 2024."

# Get predictions
ner_results = nlp(text)
for entity in ner_results:
    print(f"{entity['word']}: {entity['entity']}")
```
## Overall Metrics
| Metric | Score |
|---|---|
| Micro F1-Score | 0.8544 |
| Macro F1-Score | 0.8561 |
| Micro Precision | 0.8404 |
| Micro Recall | 0.8689 |
## Per-Entity Performance
| Entity | Precision | Recall | F1-Score |
|---|---|---|---|
| GPE | 0.7778 | 0.7925 | 0.7850 |
| LOC | 0.9796 | 0.9412 | 0.9600 |
| ORG | 0.7778 | 0.8077 | 0.7925 |
| PER | 0.8393 | 0.9400 | 0.8868 |
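Scores of this kind are typically computed at the entity level, e.g. with the `seqeval` library; that tooling choice is an assumption, since the card does not name the evaluation script. A minimal sketch on toy data:

```python
# Sketch: entity-level precision/recall/F1 over BIO-tagged sequences
# with seqeval (pip install seqeval). Sequences below are toy data,
# not the actual evaluation set.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "B-GPE"]]

print(f1_score(y_true, y_pred, average="micro"))
print(classification_report(y_true, y_pred))
```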
## Intended Use
This model is optimized for:
- Processing Estonian government service inquiries
- Extracting entities from citizen communications
- Analyzing public administration texts
- Information extraction from Estonian bureaucratic documents
## Limitations

- **Domain Specificity:** Optimized for government/public-service text; may underperform on other domains
- **Synthetic Training Data:** While diverse, synthetic data may not capture all real-world linguistic variation
- **Base Model Limitations:** Inherits limitations from EstBERT_NER_v2
## Citation

If you use this model, please cite the underlying EstBERT paper:

```bibtex
@misc{tanvir2020estbert,
  title={EstBERT: A Pretrained Language-Specific BERT for Estonian},
  author={Hasan Tanvir and Claudia Kittask and Kairit Sirts},
  year={2020},
  eprint={2011.04784},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## Acknowledgments

- **Base Model:** tartuNLP/EstBERT_NER_v2 by the NLP research group at the University of Tartu
- **Synthetic Data Generation:** Google Gemini-3-pro API
- **Training Framework:** Hugging Face Transformers