---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- visuospatial cognition
- spatial reasoning
- vlm
- llava
- qwen
- siglip
- hiera
- sam2
- dual-encoder
datasets:
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B-Thinking
---
## Usage and Full Documentation

For the detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following links**:

[GitHub Repository](https://github.com/nkkbr/ViCA)

[Model on Hugging Face](https://huggingface.co/nkkbr/ViCA2)
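The inference code in the repository above is authoritative. For orientation only, below is a minimal, untested sketch of what loading and querying the checkpoint with 🤗 Transformers might look like. The Hub id `nkkbr/ViCA2-7B-Thinking`, the `trust_remote_code=True` loading path, the processor's `videos=` argument, and the prompt format are all assumptions here; defer to the GitHub repository where they differ.

```python
# Minimal sketch, not the official inference code (see the GitHub repo above).
# Assumptions: the checkpoint loads through the standard Transformers auto
# classes with trust_remote_code=True, and the processor accepts video frames.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nkkbr/ViCA2-7B-Thinking"  # assumed Hub id for this card's checkpoint

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Placeholder input: 8 blank frames. In practice, sample frames from your video
# (e.g., with decord or OpenCV) at whatever rate the repo's inference code uses.
video_frames = [
    Image.fromarray(np.zeros((336, 336, 3), dtype=np.uint8)) for _ in range(8)
]

prompt = "How many chairs are in the room, and where are they relative to the table?"
inputs = processor(text=prompt, videos=video_frames, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```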