VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
AI & ML interests
Natural Language Processing, Machine Learning, and Computer Vision
Recent Activity
Papers
Attention Is All You Need for KV Cache in Diffusion LLMs
Do LLMs "Feel"? Emotion Circuits Discovery and Control
Large-scale dataset and model suite for cross-architecture GPU code transpilation between CUDA and HIP at both source and assembly levels
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos.
PALO: A Polyglot Large Multimodal Model for 5B People
GeoChat is the first grounded Large Vision Language Model, specifically tailored to Remote Sensing (RS) scenarios.
Official training and dev datasets for NADI 2025 Subtask 3 (Diacritic Restoration) Shared Task
Pixel Grounding Large Multimodal Model in Remote Sensing
Open-source project for Arabic Speech Recognition and Generation
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses seamlessly integrated with object segmentation masks.
Extending Visual Capabilities of LLaVA with LLaMA-3 and Phi-3
- LLaVA++ (LLaMA-3-V): Start a chatbot server for text-based interactions
- LLaVA++ (Phi-3-V): Launch a chatbot with image and text understanding
- MBZUAI/LLaVA-Phi-3-mini-4k-instruct (Text Generation, 4B)
- MBZUAI/LLaVA-Meta-Llama-3-8B-Instruct-FT (Text Generation, 8B)
Collection of MobiLlama Language Models.
Collection of ViT models trained using the SatMAE++ approach.