ORYX

All

38 repositories

Agent-X
Public
Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
benchmarking vision reasoning vfm foundation-models llms-benchmarking agentic-ai
Jupyter Notebook
•2•19•0•0•Updated Aug 6, 2025Aug 6, 2025
Video-LLaVA
Public
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
video transcription lmm grounding video-grounding llm video-conversation
Python
•12•257•15•0•Updated Aug 5, 2025Aug 5, 2025
PALO
Public
(WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu.
Python
•
Apache License 2.0
•7•83•5•1•Updated Aug 5, 2025Aug 5, 2025
groundingLMM
Public
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
vision-and-language lmm foundation-models vision-language-model llm-agent
Python
•
Apache License 2.0
•50•905•34•0•Updated Aug 5, 2025Aug 5, 2025
LLaVA-pp
Public
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
conversation lmms vision-language llm llava llama3 phi3 llava-llama3 llava-phi3 llama3-llava
Python
•61•840•17•2•Updated Aug 5, 2025Aug 5, 2025
VideoGPT-plus
Public
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
chatbot clip image-encoder video-encoder multimodal dual-encoder vision-language vicuna gpt4 vision-language-pretraining
Python
•
Creative Commons Attribution 4.0 International
•19•284•16•0•Updated Aug 5, 2025Aug 5, 2025
Video-ChatGPT
Public
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
chatbot llama clip mulit-modal vision-language vicuna gpt-4 vision-language-pretraining llava video-chatboat
Python
•
Creative Commons Attribution 4.0 International
•116•1.4k•23•1•Updated Aug 5, 2025Aug 5, 2025
ThinkGeo
Public
ThinkGeo is a Comprehensive Benchmark to evaluate Tool-Augmented Agents for Remote Sensing Tasks
Python
•
Apache License 2.0
•1•36•0•0•Updated Jul 25, 2025Jul 25, 2025
MIRA
Public
[ACM MM 2025 🔥🔥 ] MIRA: A first-of-its-kind medical RAG framework that fuses image features and retrieved knowledge with dynamic context control to boost factual accuracy in multimodal medical reasoning.
Python
•
Apache License 2.0
•1•11•1•0•Updated Jul 18, 2025Jul 18, 2025
Awesome-LLM-Post-training
Public
Awesome Reasoning LLM Tutorial/Survey/Guide
reinforcement-learning scaling reasoning fine post-training large-language-models
Python
•133•1.9k•6•0•Updated Jul 11, 2025Jul 11, 2025
VideoMolmo
Public
Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
Python
•2•47•1•0•Updated Jul 5, 2025Jul 5, 2025
VideoMathQA
Public
VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos
0•15•0•0•Updated Jun 24, 2025Jun 24, 2025
ViMUL
Public
Apache License 2.0
•0•0•0•0•Updated Jun 11, 2025Jun 11, 2025
TerraFM
Public
TerraFM is a scalable foundation model for unified multisensor Earth observation, trained on 18.7M Sentinel-1/2 tiles and achieving state-of-the-art results on GEO-Bench and Copernicus-Bench.
Python
•1•21•1•0•Updated Jun 9, 2025Jun 9, 2025
GeoPixel
Public
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
remote-sensing segmentation-models foundation-models large-vision-language-models large-multimodal-models vision-language-models grounding-llms
Python
•
Apache License 2.0
•14•106•6•0•Updated May 28, 2025May 28, 2025
FannOrFlop
Public
A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
poetry arabic llm
Python
•
Apache License 2.0
•0•3•0•0•Updated May 26, 2025May 26, 2025
ALM-Bench
Public
[CVPR 2025 🔥] ALM-Bench is a multilingual multi-modal diverse cultural benchmark for 100 languages across 19 categories. It assesses the next generation of LMMs on cultural inclusitivity.
multilingual benchmarking multi-modal cultural gpt-4 multimodal-large-language-models
Python
•
Other
•2•43•1•0•Updated May 26, 2025May 26, 2025
CoVR-VidLLM-CVPRW25
Public
Composed Video Retrieval Challenge CVPR Workshop 2025
Python
•1•6•1•0•Updated May 25, 2025May 25, 2025
ARB
Public
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
benchmark arabic reasoning cot lmm
HTML
•
Apache License 2.0
•0•15•0•0•Updated May 25, 2025May 25, 2025
KITAB-Bench
Public
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
benchmark ocr vqa pdf-to-text arabic table-detection layout-detection vlms
Python
•
MIT License
•3•45•2•0•Updated May 24, 2025May 24, 2025
NestEO
Public
NestEO: Nested and Aligned Earth Observation Framework with Multimodal Dataset
Jupyter Notebook
•
Apache License 2.0
•0•7•0•0•Updated May 23, 2025May 23, 2025
TimeTravel
Public
[ACL 2025 🔥] Time Travel is a Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
benchmark historical cultural lmm
Python
•
MIT License
•1•18•0•0•Updated May 22, 2025May 22, 2025
LlamaV-o1
Public
[ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs
Python
•
Apache License 2.0
•17•305•4•0•Updated May 21, 2025May 21, 2025
LLMVoX
Public
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
audio text-to-speech streaming transformers tts codec omni voice-assistant neural-speech-synthesis mbzuai
Python
•
Other
•32•266•5•2•Updated May 16, 2025May 16, 2025
MobiLlama
Public
[ICLR-2025-SLLM Spotlight 🔥]MobiLlama : Small Language Model tailored for edge devices
slm llm efficient-llm mobile-llm tiny-llm
Python
•
Apache License 2.0
•48•654•13•1•Updated May 10, 2025May 10, 2025
BiMediX2
Public
Bio-Medical EXpert LMM with English and Arabic Language Capabilities
7•69•2•0•Updated May 2, 2025May 2, 2025
UniMed-CLIP
Public
Official repository of paper titled "UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities".
Python
•
Other
•13•121•0•0•Updated Apr 27, 2025Apr 27, 2025
Camel-Bench
Public
[NAACL 2025 🔥] CAMEL-Bench is an Arabic benchmark for evaluating multimodal models across eight domains with 29,000 questions.
benchmark vqa arabic multimodal-learning visual-question-answering mbzuai large-multimodal-models
Python
•
MIT License
•1•32•0•0•Updated Apr 17, 2025Apr 17, 2025
VideoGLaMM
Public
[CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
vision-and-language lmm foundation-models vision-language-model llm-agent cvpr2025
Python
•1•76•6•0•Updated Apr 14, 2025Apr 14, 2025
TrackingMeetsLMM
Public
Python
•3•9•0•0•Updated Apr 7, 2025Apr 7, 2025