---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embedding
- retrieval
- kazakh
- rag
datasets:
- issai/kazqad
- issai/kazqad-retrieval
- DarkyMan/powerful-kazakh-dialogue
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-base
---

# KazEmbed-V5: Kazakh Embedding Model for RAG

🏆 **Best base-size embedding model for Kazakh retrieval among the models compared below (by MRR)**

## Model Description

KazEmbed-V5 is a fine-tuned version of [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base), optimized for Kazakh-language retrieval and RAG (Retrieval-Augmented Generation) applications.

### Key Features

- 🇰🇿 **Specialized for Kazakh**: Fine-tuned on 61,255 Kazakh text pairs
- 📈 **+2.1% relative MRR improvement** over multilingual-e5-base (0.818 to 0.835)
- ⚡ **Efficient**: 278M parameters (base-size model)
- 🔍 **RAG-optimized**: Trained specifically for retrieval tasks

## Benchmark Results

| Model | Hits@1 | Hits@5 | MRR | Params |
|-------|--------|--------|-----|--------|
| **KazEmbed-V5 (Ours)** | **72%** | **96%** | **0.835** | 278M |
| multilingual-e5-base | 72% | 96% | 0.818 | 278M |
| multilingual-e5-large | 85% | 99% | 0.909 | 560M |
| paraphrase-mpnet-v2 | 53% | 80% | 0.648 | 278M |
| LaBSE | 48% | 73% | 0.601 | 471M |

*Evaluated on the KazQAD test set with TF-IDF hard negatives (100 candidates per query).*

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# For queries (questions), prepend the "query: " prefix
query = "query: Қазақстанның астанасы қай қала?"
query_embedding = model.encode(query)

# For passages (documents), prepend the "passage: " prefix
passage = "passage: Астана — Қазақстан Республикасының астанасы."
passage_embedding = model.encode(passage)

# Calculate cosine similarity between the two embeddings
similarity = cosine_similarity([query_embedding], [passage_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")
```

### For RAG Applications

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Your document corpus
documents = [
    "Астана — Қазақстан Республикасының астанасы.",
    "Алматы — Қазақстанның ең үлкен қаласы.",
    "Қазақстан — Орталық Азиядағы мемлекет.",
]

# Encode documents (do once, store in vector DB).
# normalize_embeddings=True makes the dot product below equal to cosine similarity.
doc_embeddings = model.encode(
    ["passage: " + doc for doc in documents],
    normalize_embeddings=True,
)

# Query
query = "Қазақстанның астанасы қай қала?"
query_embedding = model.encode("query: " + query, normalize_embeddings=True)

# Find the most similar document
similarities = np.dot(doc_embeddings, query_embedding)
best_idx = np.argmax(similarities)
print(f"Best match: {documents[best_idx]}")
```

## Training Details

### Training Data

| Dataset | Pairs | Description |
|---------|-------|-------------|
| KazQAD | 6,640 | Question-Context pairs |
| KazQAD-Retrieval | 44,615 | Title-Text pairs |
| Powerful-Kazakh-Dialogue | 10,000 | User-Assistant pairs |
| **Total** | **61,255** | Retrieval-focused pairs |

### Training Configuration

- **Base Model**: intfloat/multilingual-e5-base
- **Epochs**: 2
- **Batch Size**: 16
- **Learning Rate**: 1e-5
- **Loss**: MultipleNegativesRankingLoss
- **Hardware**: NVIDIA GPU

### Training Strategy

We found that:

1. **Retrieval-only data** works best (no NLI/STS data)
2. **2 epochs** is optimal (1 epoch underfits, 3 epochs overfit)
3. **A larger batch size** (16) provides more in-batch negatives

## Limitations

- Optimized for Kazakh; performance on other languages may vary
- Best for retrieval tasks; may not be optimal for semantic similarity
- Requires the `query:` and `passage:` prefixes for best results

## Citation

```bibtex
@misc{kazembed2024,
  title={KazEmbed-V5: A Fine-tuned Embedding Model for Kazakh Language Retrieval},
  author={Your Name},
  year={2024},
  howpublished={HuggingFace Hub}
}
```

## Acknowledgments

- Base model: [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- Training data: [ISSAI](https://issai.nu.edu.kz/) for the KazQAD dataset
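
## Appendix: Metric Definitions (Illustrative)

The benchmark table reports Hits@1, Hits@5, and MRR. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard definitions, assuming each query has exactly one gold passage among its candidates. It is not the evaluation script used to produce the numbers above; the helper names (`gold_rank`, `hits_at_k`, `mean_reciprocal_rank`) and the toy scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def gold_rank(scores: Sequence[float], gold_index: int) -> int:
    """Return the 1-based rank of the gold candidate (higher score is better)."""
    gold_score = scores[gold_index]
    # Every candidate scoring strictly higher pushes the gold one down a rank.
    return 1 + sum(1 for s in scores if s > gold_score)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold passage is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank of the gold passage over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up similarity scores: three queries, four candidates each.
# In every row the gold passage is the candidate at index 0.
toy_scores = [
    [0.91, 0.40, 0.35, 0.10],  # gold ranked 1st
    [0.55, 0.72, 0.30, 0.20],  # gold ranked 2nd
    [0.20, 0.60, 0.50, 0.70],  # gold ranked 4th
]

ranks = [gold_rank(row, gold_index=0) for row in toy_scores]
print("Ranks:", ranks)
print(f"Hits@1: {hits_at_k(ranks, 1):.3f}")
print(f"Hits@2: {hits_at_k(ranks, 2):.3f}")
print(f"MRR:    {mean_reciprocal_rank(ranks):.3f}")
```

Because MRR rewards moving the gold passage from, say, rank 3 to rank 2 even when it does not reach rank 1, two models can tie on Hits@1 and Hits@5 while differing in MRR, as KazEmbed-V5 and multilingual-e5-base do in the benchmark table.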