---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---

# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference. This model produces **numerically equivalent embeddings** to the original while enabling faster inference through vLLM's optimized kernels and batching.

## Modifications

- **No weight changes** - uses the original query/key/value weights directly
- vLLM automatically converts Q/K/V to its fused qkv_proj format during loading
- Removed pretraining heads (MLM/NSP) - not needed for embeddings
- Changed the architecture to `BertModel` for vLLM compatibility

## Usage

### vLLM Server

```bash
# IMPORTANT: Use fp32 for an exact numerical match with the original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```

### OpenAI-compatible API

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
```

### Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # CLS-token pooling followed by L2 normalization
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 312)
```

## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs (a reproduction sketch appears at the end of this card):

```
Max embedding difference:  3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: identical (np.allclose with default tolerances)
```

This confirms **numerical equivalence** within float32 precision limits.

## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process** (a minimal code sketch appears at the end of this card):

1. Load the original cointegrated/rubert-tiny2 weights
2. Remove the `bert.` prefix from weight names
3. Remove unused heads (cls.*, bert.pooler.*)
4. Keep query/key/value weights as-is (vLLM handles fusion automatically)

Tested on Google Colab Tesla T4 with:

- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)
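
## Validation Sketch

A minimal sketch of how the validation numbers above could be reproduced; it is not the exact script from the linked notebook. It assumes the vLLM server from the Usage section is running locally, and the three test sentences are illustrative.

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["привет мир", "hello world", "здравствуй вселенная"]

# Embeddings served by vLLM (server started with --dtype float32)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.embeddings.create(input=texts, model="WpythonW/rubert-tiny2-vllm")
vllm_emb = np.array([d.embedding for d in resp.data], dtype=np.float32)

# Reference embeddings from SentenceTransformers
st = SentenceTransformer("WpythonW/rubert-tiny2-vllm")
st_emb = st.encode(texts, normalize_embeddings=True)

diff = np.abs(vllm_emb - st_emb)
print(f"Max embedding difference:  {diff.max():.3e}")
print(f"Mean embedding difference: {diff.mean():.3e}")

# Embeddings are L2-normalized, so X @ X.T is the cosine similarity matrix
print("Cosine matrices match:", np.allclose(vllm_emb @ vllm_emb.T, st_emb @ st_emb.T))
```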
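
## Conversion Sketch

A minimal sketch of the four conversion steps listed above, operating directly on the raw state dict; the linked Colab notebook is the authoritative version, and the input/output paths here are illustrative.

```python
import torch

# Step 1: load the original rubert-tiny2 checkpoint (path is illustrative)
state = torch.load("rubert-tiny2/pytorch_model.bin", map_location="cpu")

converted = {}
for name, tensor in state.items():
    # Step 3: drop the pretraining heads and the pooler
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Step 2: strip the "bert." prefix so names match a bare BertModel
    converted[name.removeprefix("bert.")] = tensor

# Step 4: query/key/value weights are saved unchanged; vLLM fuses them
# into qkv_proj at load time
torch.save(converted, "rubert-tiny2-vllm/pytorch_model.bin")
```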