---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embedding
- retrieval
- kazakh
- rag
datasets:
- issai/kazqad
- issai/kazqad-retrieval
- DarkyMan/powerful-kazakh-dialogue
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-base
---
# KazEmbed-V5: Kazakh Embedding Model for RAG
🏆 **Best base-size embedding model for Kazakh language retrieval tasks (KazQAD benchmark)**
## Model Description
KazEmbed-V5 is a fine-tuned version of [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) optimized for Kazakh language retrieval and RAG (Retrieval-Augmented Generation) applications.
### Key Features
- 🇰🇿 **Specialized for Kazakh**: Fine-tuned on 61,255 Kazakh text pairs
- 📈 **+2.1% relative MRR improvement** over multilingual-e5-base (0.835 vs. 0.818)
- ⚡ **Efficient**: 278M parameters (base-size model)
- 🔍 **RAG-optimized**: Trained specifically for retrieval tasks
## Benchmark Results
| Model | Hits@1 | Hits@5 | MRR | Params |
|-------|--------|--------|-----|--------|
| **KazEmbed-V5 (Ours)** | **72%** | **96%** | **0.835** | 278M |
| multilingual-e5-base | 72% | 96% | 0.818 | 278M |
| multilingual-e5-large | 85% | 99% | 0.909 | 560M |
| paraphrase-multilingual-mpnet-base-v2 | 53% | 80% | 0.648 | 278M |
| LaBSE | 48% | 73% | 0.601 | 471M |
*Evaluated on KazQAD test set with TF-IDF hard negatives (100 candidates per query)*
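For reference, here is a minimal sketch of how Hits@k and MRR are typically computed from ranked candidate lists. This is illustrative, not the original evaluation script; `gold_id` and `ranked_ids` are placeholder names.

```python
# Illustrative metric computation (not the original evaluation script).
# `results` pairs each query's gold passage id with the retriever's
# ranked candidate ids (100 candidates per query in our setup).
def evaluate(results, ks=(1, 5)):
    hits = {k: 0 for k in ks}
    reciprocal_ranks = []
    for gold_id, ranked_ids in results:
        if gold_id in ranked_ids:
            rank = ranked_ids.index(gold_id) + 1  # 1-based rank of gold passage
            reciprocal_ranks.append(1.0 / rank)
            for k in ks:
                if rank <= k:
                    hits[k] += 1
        else:
            reciprocal_ranks.append(0.0)  # gold passage not retrieved at all
    n = len(results)
    metrics = {f"hits@{k}": hits[k] / n for k in ks}
    metrics["mrr"] = sum(reciprocal_ranks) / n
    return metrics

# Example: gold passage ranked 2nd among the candidates
print(evaluate([("p1", ["p9", "p1", "p4"])]))
# {'hits@1': 0.0, 'hits@5': 1.0, 'mrr': 0.5}
```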
## Usage
### Installation
```bash
pip install sentence-transformers scikit-learn
```
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Queries need the E5 "query: " prefix
query = "query: Қазақстанның астанасы қай қала?"  # "Which city is the capital of Kazakhstan?"
query_embedding = model.encode(query)

# Passages need the E5 "passage: " prefix
passage = "passage: Астана — Қазақстан Республикасының астанасы."  # "Astana is the capital of the Republic of Kazakhstan."
passage_embedding = model.encode(passage)

# Cosine similarity between query and passage
similarity = cosine_similarity([query_embedding], [passage_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")
```
### For RAG Applications
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Your document corpus
documents = [
    "Астана — Қазақстан Республикасының астанасы.",
    "Алматы — Қазақстанның ең үлкен қаласы.",
    "Қазақстан — Орталық Азиядағы мемлекет.",
]

# Encode documents once (store in a vector DB in production);
# normalize_embeddings=True makes the dot product equal cosine similarity
doc_embeddings = model.encode(
    ["passage: " + doc for doc in documents],
    normalize_embeddings=True,
)

# Encode the query with its prefix
query = "Қазақстанның астанасы қай қала?"
query_embedding = model.encode("query: " + query, normalize_embeddings=True)

# Rank documents by similarity and pick the best match
similarities = np.dot(doc_embeddings, query_embedding)
best_idx = int(np.argmax(similarities))
print(f"Best match: {documents[best_idx]}")
```
## Training Details
### Training Data
| Dataset | Pairs | Description |
|---------|-------|-------------|
| KazQAD | 6,640 | Question-Context pairs |
| KazQAD-Retrieval | 44,615 | Title-Text pairs |
| Powerful-Kazakh-Dialogue | 10,000 | User-Assistant pairs |
| **Total** | **61,255** | Retrieval-focused pairs |
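As a rough sketch, training pairs can be assembled with the `datasets` library. The field names below (`question`, `context`) are assumptions about the KazQAD schema, not verified against the actual dataset; check the dataset cards before running.

```python
from datasets import load_dataset
from sentence_transformers import InputExample

# Field names are assumptions; verify against the dataset card.
kazqad = load_dataset("issai/kazqad", split="train")

train_examples = [
    # E5 convention: explicit prefixes on both sides of each pair.
    InputExample(texts=["query: " + ex["question"],
                        "passage: " + ex["context"]])
    for ex in kazqad
]
```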
### Training Configuration
- **Base Model**: intfloat/multilingual-e5-base
- **Epochs**: 2
- **Batch Size**: 16
- **Learning Rate**: 1e-5
- **Loss**: MultipleNegativesRankingLoss
- **Hardware**: NVIDIA GPU
### Training Strategy
We found that:
1. **Retrieval-only data** works best (mixing in NLI/STS data hurt retrieval quality)
2. **2 epochs** is optimal (1 epoch underfits, 3 overfit)
3. **A larger batch size** (16) provides more in-batch negatives for MultipleNegativesRankingLoss; a training sketch follows below
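Here is a minimal sketch of a fine-tuning loop matching this configuration, using the classic sentence-transformers `fit` API. `train_examples` is the pair list from the sketch above; `warmup_steps` and `output_path` are placeholders, not values from the original run.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("intfloat/multilingual-e5-base")

# Shuffled batches of 16; with MultipleNegativesRankingLoss every other
# passage in a batch serves as an in-batch negative for a given query.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_params={"lr": 1e-5},
    warmup_steps=100,            # placeholder, not from the original run
    output_path="kazembed-v5",   # placeholder
)
```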
## Limitations
- Optimized for Kazakh; performance on other languages may vary
- Best for retrieval tasks; may not be optimal for semantic similarity
- Requires `query:` and `passage:` prefixes for best results (see the helper sketch below)
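Because omitting the prefixes silently degrades retrieval quality, a thin wrapper (illustrative, not shipped with the model) can keep them consistent:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Illustrative helpers that always apply the E5 prefixes
def encode_queries(texts):
    return model.encode(["query: " + t for t in texts],
                        normalize_embeddings=True)

def encode_passages(texts):
    return model.encode(["passage: " + t for t in texts],
                        normalize_embeddings=True)
```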
## Citation
```bibtex
@misc{kazembed2024,
  title={KazEmbed-V5: A Fine-tuned Embedding Model for Kazakh Language Retrieval},
  author={Your Name},
  year={2024},
  howpublished={HuggingFace Hub}
}
```
## Acknowledgments
- Base model: [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- Training data: [ISSAI](https://issai.nu.edu.kz/) for the KazQAD and KazQAD-Retrieval datasets