---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embedding
- retrieval
- kazakh
- rag
datasets:
- issai/kazqad
- issai/kazqad-retrieval
- DarkyMan/powerful-kazakh-dialogue
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: intfloat/multilingual-e5-base
---

# KazEmbed-V5: Kazakh Embedding Model for RAG

🏆 **Best base-size embedding model in our Kazakh retrieval benchmark (KazQAD)**

## Model Description

KazEmbed-V5 is a fine-tuned version of [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base), optimized for Kazakh-language retrieval and RAG (Retrieval-Augmented Generation) applications.

### Key Features

- 🇰🇿 **Specialized for Kazakh**: Fine-tuned on 61,255 Kazakh text pairs
- 📈 **+2.1% relative MRR improvement** over multilingual-e5-base (0.835 vs. 0.818)
- ⚡ **Efficient**: 278M parameters (base-size model)
- 🔍 **RAG-optimized**: Trained specifically for retrieval tasks

## Benchmark Results

| Model | Hits@1 | Hits@5 | MRR | Params |
|-------|--------|--------|-----|--------|
| **KazEmbed-V5 (ours)** | **72%** | **96%** | **0.835** | 278M |
| multilingual-e5-base | 72% | 96% | 0.818 | 278M |
| multilingual-e5-large | 85% | 99% | 0.909 | 560M |
| paraphrase-multilingual-mpnet-base-v2 | 53% | 80% | 0.648 | 278M |
| LaBSE | 48% | 73% | 0.601 | 471M |

*Evaluated on the KazQAD test set with TF-IDF hard negatives (100 candidates per query).*
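
For reference, the sketch below shows how Hits@k and MRR can be computed under this protocol. It is a minimal illustration under stated assumptions (one gold passage plus TF-IDF hard negatives per query; the function name and data layout are ours), not the exact evaluation script.

```python
import numpy as np

def evaluate(model, queries, gold_passages, candidate_pools):
    """Compute Hits@1, Hits@5 and MRR over 100-candidate pools (1 gold + 99 hard negatives)."""
    hits1 = hits5 = rr_sum = 0.0
    for query, gold, pool in zip(queries, gold_passages, candidate_pools):
        # E5-style prefixes; normalized embeddings make the dot product a cosine similarity
        q = model.encode("query: " + query, normalize_embeddings=True)
        docs = model.encode(["passage: " + p for p in pool], normalize_embeddings=True)
        order = np.argsort(-(docs @ q))  # candidate indices, best first
        rank = int(np.where(order == pool.index(gold))[0][0]) + 1  # 1-based rank of the gold passage
        hits1 += rank == 1
        hits5 += rank <= 5
        rr_sum += 1.0 / rank
    n = len(queries)
    return {"Hits@1": hits1 / n, "Hits@5": hits5 / n, "MRR": rr_sum / n}
```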

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Queries (questions) take the "query: " prefix
query = "query: Қазақстанның астанасы қай қала?"  # "Which city is the capital of Kazakhstan?"
query_embedding = model.encode(query)

# Passages (documents) take the "passage: " prefix
passage = "passage: Астана — Қазақстан Республикасының астанасы."  # "Astana is the capital of the Republic of Kazakhstan."
passage_embedding = model.encode(passage)

# Cosine similarity between the query and passage embeddings
similarity = cosine_similarity([query_embedding], [passage_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")
```

### For RAG Applications

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Your document corpus
documents = [
    "Астана — Қазақстан Республикасының астанасы.",  # "Astana is the capital of the Republic of Kazakhstan."
    "Алматы — Қазақстанның ең үлкен қаласы.",  # "Almaty is Kazakhstan's largest city."
    "Қазақстан — Орталық Азиядағы мемлекет.",  # "Kazakhstan is a state in Central Asia."
]

# Encode documents once (e.g. to store in a vector DB); normalized
# embeddings make the dot product below equal to cosine similarity.
doc_embeddings = model.encode(
    ["passage: " + doc for doc in documents], normalize_embeddings=True
)

# Encode the query with the "query: " prefix
query = "Қазақстанның астанасы қай қала?"  # "Which city is the capital of Kazakhstan?"
query_embedding = model.encode("query: " + query, normalize_embeddings=True)

# Rank documents by similarity and take the best match
similarities = np.dot(doc_embeddings, query_embedding)
best_idx = int(np.argmax(similarities))
print(f"Best match: {documents[best_idx]}")
```
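
For larger corpora, the brute-force dot product above is typically replaced with a vector index. Below is a minimal sketch using FAISS, reusing `doc_embeddings` and `query_embedding` from the snippet above; FAISS is one illustrative choice of vector store, not something this model requires.

```python
import faiss  # pip install faiss-cpu
import numpy as np

# Inner product over L2-normalized embeddings equals cosine similarity
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings.astype(np.float32))

# Retrieve the top-3 passages for the query
scores, ids = index.search(query_embedding.astype(np.float32).reshape(1, -1), 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.4f}  {documents[i]}")
```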

## Training Details

### Training Data

| Dataset | Pairs | Description |
|---------|-------|-------------|
| KazQAD | 6,640 | Question-context pairs |
| KazQAD-Retrieval | 44,615 | Title-text pairs |
| Powerful-Kazakh-Dialogue | 10,000 | User-assistant pairs |
| **Total** | **61,255** | Retrieval-focused pairs |

### Training Configuration

- **Base model**: intfloat/multilingual-e5-base
- **Epochs**: 2
- **Batch size**: 16
- **Learning rate**: 1e-5
- **Loss**: MultipleNegativesRankingLoss
- **Hardware**: NVIDIA GPU
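
A minimal sketch of this fine-tuning setup with [sentence-transformers](https://www.sbert.net/) is shown below; the example pair and dataset loading are illustrative placeholders, not the original training script.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('intfloat/multilingual-e5-base')

# Each training example is a (prefixed query, prefixed positive passage) pair;
# MultipleNegativesRankingLoss treats the other passages in the batch as negatives.
train_examples = [
    InputExample(texts=["query: Қазақстанның астанасы қай қала?",
                        "passage: Астана — Қазақстан Республикасының астанасы."]),
    # ... 61,255 pairs in total
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_params={'lr': 1e-5},
)
```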

### Training Strategy

We found that:

1. **Retrieval-only data** works best (no NLI/STS data mixed in)
2. **2 epochs** is optimal (1 epoch underfits, 3 overfit)
3. **A larger batch size** (16) provides more in-batch negatives for the ranking loss

## Limitations

- Optimized for Kazakh; performance on other languages may vary
- Best for retrieval tasks; may not be optimal for general semantic-similarity tasks
- Requires the `query:` and `passage:` prefixes (as in the examples above) for best results

## Citation

```bibtex
@misc{kazembed2024,
  title={KazEmbed-V5: A Fine-tuned Embedding Model for Kazakh Language Retrieval},
  author={Your Name},
  year={2024},
  howpublished={HuggingFace Hub}
}
```

## Acknowledgments

- Base model: [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- Training data: [ISSAI](https://issai.nu.edu.kz/) for the KazQAD dataset