CernisOCR

A vision-language OCR model fine-tuned from Qwen2.5-VL-7B-Instruct that handles mathematical formulas, handwritten text, and structured documents in a single model.

Model Description

CernisOCR is a vision-language model optimized for diverse OCR tasks across multiple document domains. Unlike domain-specific OCR models, it unifies three traditionally separate OCR tasks in a single, efficient model:

  • Mathematical LaTeX conversion: Converts handwritten or printed mathematical formulas to LaTeX notation
  • Handwritten text transcription: Transcribes cursive and printed handwriting
  • Structured document extraction: Extracts structured data from invoices and receipts

Key Features:

  • Multi-domain capability in a single model
  • Handles varied image types, layouts, and text styles
  • Extracts both raw text and structured information
  • Robust to noise and variable image quality

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Training Data: ~10,000 samples from three domains:
    • LaTeX OCR: 3,978 samples (mathematical notation)
    • Invoices & Receipts: 2,043 samples (structured documents)
    • Handwritten Text: 3,978 samples (handwriting transcription)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation); an illustrative setup is sketched after this list
  • Training Loss: Reduced from 4.802 to 0.116 (97.6% improvement)
  • Training Time: ~8.7 minutes on an RTX 5090
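
As a rough illustration of the LoRA setup, the sketch below shows how a Qwen2.5-VL-7B-Instruct base can be wrapped with LoRA adapters using Unsloth. The rank, alpha, and other hyperparameters are placeholders, not the exact values used to train CernisOCR:

from unsloth import FastVisionModel

# Load the base model in 4-bit and attach LoRA adapters.
# Hyperparameters below are illustrative placeholders, not CernisOCR's actual config.
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,     # adapt the vision encoder
    finetune_language_layers=True,   # adapt the language model
    r=16,                            # LoRA rank (assumed)
    lora_alpha=16,                   # LoRA scaling (assumed)
    lora_dropout=0.0,
)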

Intended Use

This model is designed for:

  • Mathematical formula recognition and LaTeX conversion
  • Handwritten text transcription
  • Invoice and receipt data extraction
  • Multi-domain document processing workflows
  • Applications requiring unified OCR across different document types

How to Use

from unsloth import FastVisionModel
from PIL import Image

# Load model and tokenizer
model, tokenizer = FastVisionModel.from_pretrained(
    "coolAI/cernis-ocr",  # or "coolAI/cernis-vision-ocr" for merged model
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Example 1: LaTeX conversion
image = Image.open("formula.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Write the LaTeX representation for this image."}
    ]
}]

# Example 2: Handwritten transcription
image = Image.open("handwriting.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Transcribe the handwritten text in this image."}
    ]
}]

# Example 3: Invoice extraction
image = Image.open("invoice.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract and structure all text content from this invoice/receipt image."}
    ]
}]

# Generate (run after choosing one of the prompts above)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, use_cache=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
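
For convenience, the pattern above can be wrapped in a small helper. The run_ocr function below is illustrative (it is not part of the model or of Unsloth); it also decodes only the newly generated tokens so the prompt is not echoed in the output:

def run_ocr(image_path, prompt):
    image = Image.open(image_path)
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]
    }]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, use_cache=True)
    # Keep only the newly generated tokens, then decode
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

latex = run_ocr("formula.png", "Write the LaTeX representation for this image.")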

Citation

If you use this model, please cite:

@misc{cernis-ocr,
  title={CernisOCR: A Unified Multi-Domain OCR Model},
  author={Cernis AI},
  year={2025},
  howpublished={\url{https://huggingface.co/coolAI/cernis-ocr}}
}

Acknowledgments

Built using Unsloth for efficient fine-tuning. Training data sourced from publicly available OCR datasets on Hugging Face.
