These have been working well in production at Monkt.com, so I figured I'd share them with the community.
Just straight conversions of the original models—might save you some time if you're building OCR pipelines.
monkt/paddleocr-onnx
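If it helps, here's a minimal sketch of running one of the converted models with onnxruntime. The repo id is the one above, but the file name and the dummy preprocessing are assumptions, so check the repo for the actual layout:

```python
# Minimal sketch: run one of the converted PaddleOCR ONNX models with onnxruntime.
# The repo id is real (monkt/paddleocr-onnx); the file name below is an assumption.
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Download a detection model (file name assumed; check the repo for the real one)
model_path = hf_hub_download("monkt/paddleocr-onnx", "det/model.onnx")

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Dummy input: PaddleOCR detection models typically expect NCHW float32;
# the exact resize/normalization is model-specific, so this is illustrative only.
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])
```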
I got way better results now! Just needed to use the recommended version of transformers.
I'll edit the main post when I'm ready with the graphs.
Thanks once more.
Hey Tom,
First, appreciate your work! Thanks for everything you're doing.
I did pass the prompt dict for "intfloat/multilingual-e5-large" to SentenceTransformer, like: prompts = {"query": "query: ", "passage": "passage: "}.
For "google/embeddinggemma-300m", I kept the default: model = SentenceTransformer("google/embeddinggemma-300m") and then evaluated with the MTEB library, assuming that "MTEB will automatically detect and use these prompts if they are defined in your model's configuration," as written here: https://sbert.net/docs/sentence_transformer/usage/mteb_evaluation.html
So in short, I did not add prompts for EmbeddingGemma, but added them for multilingual-e5-large, as per their instructions (I didn't have time to check their model config, but I think they're not added by default).
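For reference, a minimal sketch of that setup, assuming the model ids above (the sample texts are just illustrative):

```python
from sentence_transformers import SentenceTransformer

# multilingual-e5-large: prompts passed explicitly, per the model card instructions
e5 = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    prompts={"query": "query: ", "passage": "passage: "},
)

# EmbeddingGemma: defaults only, relying on prompts defined in the model's own config
gemma = SentenceTransformer("google/embeddinggemma-300m")

# Sanity check that the e5 prompts are applied at encode time
q = e5.encode("What is the capital of France?", prompt_name="query")
p = e5.encode("Paris is the capital of France.", prompt_name="passage")
print(q.shape, p.shape)
```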
BUT, I ran with transformers==4.55.4, so I may need to re-run...
sentence-transformers==5.1.0, which is fine I guess.
Thanks!
pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
pip install sentence-transformers>=5.0.0
Try reducing gpu_memory_utilization to a lower value.
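In case the gpu_memory_utilization hint is unclear: that's the vLLM engine argument that caps how much GPU memory gets pre-allocated. A minimal sketch, assuming vLLM is what's running out of memory (the model id is just a placeholder):

```python
from vllm import LLM

# gpu_memory_utilization is the fraction of total GPU memory vLLM pre-allocates
# (default 0.9); lowering it is the usual first fix for OOM at engine start-up.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id, not from this thread
    gpu_memory_utilization=0.6,
)
```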
Thank you.
I’m also a big fan of Qwen models. However, in this case, I don’t think they are appropriate because I’m not entirely confident in their capabilities regarding multilingual contexts. That’s why I chose Llama.
Overall, I agree that the Qwen series is excellent for most tasks.
Yeah, the issues are with the tables.
For office formats, it's mostly fine. Have you tried using PDFs or images?
I will work on improving this.