Estonian NER Model - Fine-tuned on Synthetic Government Data

This model is a domain-adapted version of tartuNLP/EstBERT_NER_v2, further fine-tuned on synthetically generated Estonian text focusing on government services and public administration communications.

Model Description

Base Model: tartuNLP/EstBERT_NER_v2
Language: Estonian (et)
Task: Token Classification (Named Entity Recognition)
Training Data: Synthetic data generated with the Google Gemini 3 Pro API

This model specializes in extracting named entities from Estonian government and public service-related text, including citizen communications with government agencies.

Supported Entity Types

The model recognizes 11 entity types:

  • PER: Person names
  • ORG: Organizations, companies, government agencies
  • LOC: Locations, addresses, streets, buildings
  • GPE: Geopolitical entities (cities, counties, countries)
  • PROD: Products
  • TITLE: Titles, positions
  • EVENT: Events
  • DATE: Dates
  • TIME: Time expressions
  • MONEY: Monetary values
  • PERCENT: Percentages

Entities are labeled with the BIO scheme: B- marks the first token of an entity, I- marks subsequent tokens inside it, and O marks tokens outside any entity.
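For illustration, here is a hypothetical sentence with word-level BIO tags (the actual tokenizer operates on subwords, but the tagging scheme is the same):

# "Mari Tamm works at the Tax and Customs Board."
tokens = ["Mari", "Tamm", "töötab", "Maksu-", "ja", "Tolliametis"]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]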

Training Data

The model was fine-tuned on synthetically generated data created specifically for Estonian government and public service domains. The synthetic dataset includes:

  • Generation Method: Google Gemini 3 Pro API with structured prompts (a sketch follows this list)
  • Domain Coverage: 22+ Estonian government agencies including Töötukassa (Unemployment Insurance Fund), Maksu- ja Tolliamet (Tax and Customs Board), Politsei- ja Piirivalveamet (Police and Border Guard), and others
  • Topics: Government services such as unemployment benefits, tax declarations, social insurance, permits, and registrations
  • Style Diversity: Multiple writing styles (formal, casual, shorthand, mixed) to improve robustness
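The exact generation prompts are not published. The sketch below shows the general shape of structured-prompt generation with the google-generativeai Python client; the model identifier, prompt wording, and entity-marking convention are assumptions for illustration, not the project's actual pipeline.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
# Model identifier is an assumption; substitute the version you have access to
model = genai.GenerativeModel("gemini-3-pro")

# Hypothetical structured prompt: agency, topic, and style are varied per sample
prompt = (
    "Write a short message in Estonian from a citizen to Töötukassa "
    "about unemployment benefits, in a casual style. "
    "Wrap each named entity in brackets with its type, e.g. [PER Mari Tamm]."
)

response = model.generate_content(prompt)
print(response.text)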

Why Synthetic Data?

Synthetic data generation allowed us to:

  1. Create domain-specific training examples for government services
  2. Ensure comprehensive coverage of Estonian public sector terminology
  3. Include diverse writing styles found in citizen-government communications
  4. Control entity distribution and annotation quality

Training Details

  • Base Model: tartuNLP/EstBERT_NER_v2
  • Training Epochs: 10
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Max Sequence Length: 512 tokens
  • Optimizer: AdamW (weight decay: 0.01)
  • Training Framework: Hugging Face Transformers + PyTorch
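These hyperparameters map directly onto the Transformers Trainer API. Below is a minimal sketch, assuming a tokenized token-classification dataset train_dataset (sequences truncated to 512 tokens, labels aligned to subwords) has already been prepared:

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)

# 11 entity types in the BIO scheme plus "O" -> 23 labels
num_labels = 23

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT_NER_v2")
model = AutoModelForTokenClassification.from_pretrained(
    "tartuNLP/EstBERT_NER_v2",
    num_labels=num_labels,
    ignore_mismatched_sizes=True,  # label set differs from the base model's
)

args = TrainingArguments(
    output_dir="estbert-ner-syntheticgov",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,  # AdamW is the Trainer's default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed to be prepared as described above
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()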

Usage

from transformers import BertTokenizerFast, BertForTokenClassification, pipeline

# Load the fine-tuned model and tokenizer
tokenizer = BertTokenizerFast.from_pretrained("buerokrattRIA/EstBert_NER_SyntheticGov")
model = BertForTokenClassification.from_pretrained("buerokrattRIA/EstBert_NER_SyntheticGov")

# Create the NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Example text (illustrative sentence with PER, ORG, GPE, and DATE entities)
text = "Mari Tamm esitas avalduse Töötukassale Tallinnas 5. jaanuaril 2024."

# Get predictions
ner_results = nlp(text)
for entity in ner_results:
    print(f"{entity['word']}: {entity['entity']}")
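By default the pipeline returns one prediction per subword token. To merge subwords into whole entity spans, pass an aggregation strategy (a standard option of the transformers token-classification pipeline); the output key then changes from 'entity' to 'entity_group':

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in nlp(text):
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")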

Overall Metrics

Metric             Score
Micro F1-Score     0.8544
Macro F1-Score     0.8561
Micro Precision    0.8404
Micro Recall       0.8689

Per-Entity Performance

Entity   Precision   Recall   F1-Score
GPE      0.7778      0.7925   0.7850
LOC      0.9796      0.9412   0.9600
ORG      0.7778      0.8077   0.7925
PER      0.8393      0.9400   0.8868
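The scores above are span-level metrics of the kind computed by the seqeval library. The evaluation script itself is not published; the following is a minimal sketch with hypothetical gold and predicted tag sequences:

from seqeval.metrics import classification_report, f1_score

# Hypothetical gold and predicted BIO sequences for two sentences
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["B-GPE", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "O"]]

print(f1_score(y_true, y_pred, average="micro"))  # micro F1 over all spans
print(classification_report(y_true, y_pred))      # per-entity precision/recall/F1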

Intended Use

This model is optimized for:

  • Processing Estonian government service inquiries
  • Extracting entities from citizen communications
  • Analyzing public administration texts
  • Information extraction from Estonian bureaucratic documents

Limitations

  • Domain Specificity: Optimized for government/public service text; may underperform on other domains
  • Synthetic Training Data: While diverse, synthetic data may not capture all real-world linguistic variations
  • Base Model Limitations: Inherits limitations from EstBERT_NER_v2

Citation

If you use this model, please cite the EstBERT paper that the base model builds on:

@misc{tanvir2020estbert,
      title={EstBERT: A Pretrained Language-Specific BERT for Estonian}, 
      author={Hasan Tanvir and Claudia Kittask and Kairit Sirts},
      year={2020},
      eprint={2011.04784},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgments

  • Base Model: tartuNLP/EstBERT_NER_v2 by the NLP research group at the University of Tartu
  • Synthetic Data Generation: Google Gemini 3 Pro API
  • Training Framework: Hugging Face Transformers