Update README.md

README.md (CHANGED)

# Usage

<details>
<summary>
via HuggingFace Transformers
</summary>

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# ... (elided in this diff)

# Load the embedding model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    'llamaindex/vdr-2b-multi-v1',
    # These are the recommended kwargs for the model, but change them as needed
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"
)

# ... (elided in this diff)

def encode_queries(queries: list[str], dimension: int) -> torch.Tensor:
    # ... (elided in this diff)
```

**Encode documents**

```python
def round_by_factor(number: float, factor: int) -> int:
    return round(number / factor) * factor

# ... (elided in this diff)

def encode_documents(documents: list[Image.Image], dimension: int):
    # ... (elided in this diff)
    return torch.nn.functional.normalize(embeddings[:, :dimension], p=2, dim=-1)
```
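
Both helpers return L2-normalized vectors truncated to the requested `dimension`, so retrieval scores reduce to a dot product. The sketch below is illustrative rather than part of the model card: the query text, image files, and `dimension=1536` (assumed here to be the full embedding width) are placeholders.

```python
from PIL import Image

# Hypothetical inputs
queries = ["What was the total revenue in 2022?"]
pages = [Image.open("report_page_1.png"), Image.open("report_page_2.png")]

query_embs = encode_queries(queries, dimension=1536)   # shape: (1, 1536)
page_embs = encode_documents(pages, dimension=1536)    # shape: (2, 1536)

# Embeddings are already L2-normalized, so the dot product is the cosine similarity
scores = query_embs @ page_embs.T                       # shape: (1, 2)
best_page = scores.argmax(dim=-1)                        # index of the best-matching page
```

Smaller `dimension` values shrink the index at some cost in accuracy, since the helpers simply truncate the embedding before re-normalizing.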

</details>

<details>
<summary>
via LlamaIndex
</summary>

```bash
pip install -U llama-index-embeddings-huggingface
```

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="mps",
    trust_remote_code=True,
)

embeddings = model.get_image_embedding("image.png")
```
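
Text queries go through LlamaIndex's standard embedding interface on the same object. A one-line sketch, with an illustrative query string that is not from the model card:

```python
query_embedding = model.get_query_embedding("What was the total revenue in 2022?")
```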

</details>

<details>
<summary>
via SentenceTransformers
</summary>

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="mps",
    trust_remote_code=True,
    # These are the recommended kwargs for the model, but change them as needed
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "device_map": "cuda:0",
        "attn_implementation": "flash_attention_2"
    },
)

embeddings = model.encode("image.png")
```
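
For the query side, a plain string can be passed to `encode` as well, and `similarity` turns the two embeddings into a ranking score. A hedged sketch, assuming the model's custom SentenceTransformers wrapper handles text inputs as queries:

```python
# Assumption: text strings are accepted by encode() the same way image paths are
query_emb = model.encode("What was the total revenue in 2022?")
page_emb = model.encode("image.png")

# Cosine similarity between the query and the page embedding
score = model.similarity(query_emb, page_emb)
```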

</details>

# Training

The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 500k high-quality multilingual query-image pairs. Training ran for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
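
The DSE objective is contrastive: each query embedding is trained to score its paired page higher than the other pages in the batch and its hard-mined negatives. Below is a minimal sketch of such an InfoNCE-style loss, assuming L2-normalized embeddings; the function, tensor shapes, and temperature are illustrative, not the actual training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_embs: torch.Tensor,   # (B, D) L2-normalized query embeddings
                     pos_embs: torch.Tensor,     # (B, D) matching page embeddings
                     neg_embs: torch.Tensor,     # (B, N, D) hard-mined negative pages
                     temperature: float = 0.02) -> torch.Tensor:
    # Similarity of each query to every positive in the batch (in-batch negatives)
    in_batch = query_embs @ pos_embs.T                         # (B, B)
    # Similarity of each query to its own hard-mined negatives
    hard = torch.einsum("bd,bnd->bn", query_embs, neg_embs)    # (B, N)
    logits = torch.cat([in_batch, hard], dim=1) / temperature  # (B, B + N)
    # The correct page for query i is column i of the in-batch block
    labels = torch.arange(query_embs.size(0), device=query_embs.device)
    return F.cross_entropy(logits, labels)
```

With a batch size of 128, each query is contrasted against 127 in-batch pages plus its hard-mined negatives at every step.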