# CernisOCR

A vision-language OCR model fine-tuned from Qwen2.5-VL-7B-Instruct that handles mathematical formulas, handwritten text, and structured documents in a single model.
## Model Description

CernisOCR is a vision-language model optimized for diverse OCR tasks across multiple document domains. Unlike domain-specific OCR models, CernisOCR unifies three traditionally separate OCR tasks in a single, efficient model:
- Mathematical LaTeX conversion: Converts handwritten or printed mathematical formulas to LaTeX notation
- Handwritten text transcription: Transcribes cursive and printed handwriting
- Structured document extraction: Extracts structured data from invoices and receipts
**Key Features:**
- Multi-domain capability in a single model
- Handles varied image types, layouts, and text styles
- Extracts both raw text and structured information
- Robust to noise and variable image quality
## Training Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Training Data: ~10,000 samples (9,999) from three domains:
  - LaTeX OCR: 3,978 samples (mathematical notation)
  - Invoices & Receipts: 2,043 samples (structured documents)
  - Handwritten Text: 3,978 samples (handwriting transcription)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Loss: reduced from 4.802 to 0.116 (a 97.6% reduction)
- Training Time: ~8.7 minutes on RTX 5090
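Short training times like this are typical of LoRA, which updates only small low-rank adapter matrices rather than the full weights. A minimal sketch of the parameter arithmetic for one adapted projection matrix (the hidden size and rank below are illustrative assumptions, not the actual training configuration):

```python
# LoRA replaces the update to a d x k weight matrix with two low-rank
# factors A (d x r) and B (r x k), so the trainable parameter count per
# matrix drops from d * k to r * (d + k).
d, k = 3584, 3584  # assumed hidden size of a Qwen2.5-VL-7B projection
r = 16             # assumed LoRA rank (a common default)

full_params = d * k        # parameters in the frozen weight matrix
lora_params = r * (d + k)  # parameters actually trained

print(f"trainable fraction: {lora_params / full_params:.4f}")
```

Under these assumptions, well under 1% of each adapted matrix is trained, which is why fine-tuning fits in minutes on a single GPU.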
## Intended Use
This model is designed for:
- Mathematical formula recognition and LaTeX conversion
- Handwritten text transcription
- Invoice and receipt data extraction
- Multi-domain document processing workflows
- Applications requiring unified OCR across different document types
## How to Use
```python
from unsloth import FastVisionModel
from PIL import Image

# Load model and processor (4-bit quantization keeps VRAM usage low)
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-ocr",  # or "coolAI/cernis-vision-ocr" for the merged model
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

image = Image.open("formula.png")

# The image itself is passed to the processor below; the message only
# carries an image placeholder.

# Example 1: LaTeX conversion
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write the LaTeX representation for this image."},
    ],
}]

# Example 2: Handwritten transcription
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe the handwritten text in this image."},
    ],
}]

# Example 3: Invoice extraction
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract and structure all text content from this invoice/receipt image."},
    ],
}]

# Generate
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, use_cache=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
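Note that causal generation returns the prompt tokens followed by the completion, so the decoded string includes the chat template and instruction. To keep only the OCR result, decode just the newly generated tokens. A minimal sketch of the slicing, shown with stand-in token ids rather than a real model call:

```python
# generate() returns the prompt followed by the completion; slicing off
# the prompt length leaves only the newly generated tokens.
prompt_ids = [101, 2023, 2003, 1037]       # stand-in prompt token ids
completion = [4248, 3231, 102]             # stand-in generated token ids
output_ids = prompt_ids + completion       # what generate() effectively returns

new_tokens = output_ids[len(prompt_ids):]
print(new_tokens)  # [4248, 3231, 102]
```

With the real tensors above, the equivalent slice is `outputs[0][inputs["input_ids"].shape[1]:]`, decoded with the same `tokenizer.decode` call.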
## Citation
If you use this model, please cite:
```bibtex
@misc{cernis-ocr,
  title={CernisOCR: A Unified Multi-Domain OCR Model},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-ocr}}
}
```
## Acknowledgments
Built using Unsloth for efficient fine-tuning. Training data sourced from publicly available OCR datasets on Hugging Face.