LLM-in-Vision

Recent LLM (Large Language Model)-based CV and multi-modal works. Comments and contributions are welcome!

2024.6

(arXiv 2024.6) Bootstrap3D: Improving 3D Content Creation with Synthetic Data, [Paper], [Project]

2024.5

(arXiv 2024.5) VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation, [Paper]
(arXiv 2024.5) Grounded 3D-LLM with Referent Tokens, [Paper], [Project]
(arXiv 2024.5) Self-supervised Pre-training for Transferable Multi-modal Perception, [Paper]
(arXiv 2024.5) Multi-modal Generation via Cross-Modal In-Context Learning, [Paper]
(arXiv 2024.5) RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots, [Paper], [Project]
(arXiv 2024.5) Unveiling the Tapestry of Consistency in Large Vision-Language Models, [Paper]
(arXiv 2024.5) Dense Connector for MLLMs, [Paper], [Project]
(arXiv 2024.5) Adapting Multi-modal Large Language Model to Concept Drift in the Long-tailed Open World, [Paper], [Project]
(arXiv 2024.5) VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding, [Paper], [Project]
(arXiv 2024.5) Calibrated Self-Rewarding Vision Language Models, [Paper], [Project]
(arXiv 2024.5) From Text to Pixel: Advancing Long-Context Understanding in MLLMs, [Paper], [Project]
(arXiv 2024.5) Explaining Multi-modal Large Language Models by Analyzing their Vision Perception, [Paper]
(arXiv 2024.5) Octopi: Object Property Reasoning with Large Tactile-Language Models, [Paper], [Project]
(arXiv 2024.5) Auto-Encoding Morph-Tokens for Multimodal LLM, [Paper], [Project]
(arXiv 2024.5) What matters when building vision-language models? [Paper]

2024.4

(arXiv 2024.4) VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing, [Paper], [Project]
(arXiv 2024.4) GROUNDHOG: Grounding Large Language Models to Holistic Segmentation, [Paper], [Project]
(arXiv 2024.4) Hallucination of Multimodal Large Language Models: A Survey, [Paper], [Project]
(arXiv 2024.4) PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning, [Paper], [Project]
(arXiv 2024.4) MovieChat+: Question-aware Sparse Memory for Long Video Question Answering, [Paper], [Project]
(arXiv 2024.4) Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models, [Paper], [Project]
(arXiv 2024.4) A Survey on the Memory Mechanism of Large Language Model based Agents, [Paper]
(arXiv 2024.4) Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples, [Paper]
(arXiv 2024.4) A Multimodal Automated Interpretability Agent, [Paper], [Project]
(arXiv 2024.4) Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models, [Paper], [Project]
(arXiv 2024.4) TextSquare: Scaling up Text-Centric Visual Instruction Tuning, [Paper]
(arXiv 2024.4) What Makes Multimodal In-Context Learning Work?, [Paper]
(arXiv 2024.4) ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction, [Paper], [Project]
(arXiv 2024.4) Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs, [Paper]
(arXiv 2024.4) Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer, [Paper]
(arXiv 2024.4) MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI, [Paper]
(arXiv 2024.4) Cantor: Inspiring Multimodal Chain-of-Thought of MLLM, [Paper], [Project]
(arXiv 2024.4) Make-it-Real: Unleashing Large Multimodal Model’s Ability for Painting 3D Objects with Realistic Materials, [Paper], [Project]
(arXiv 2024.4) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites, [Paper], [Project]
(arXiv 2024.4) Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, [Paper], [Project]
(arXiv 2024.4) SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension, [Paper], [Project]
(arXiv 2024.4) List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs, [Paper], [Project]
(arXiv 2024.4) Step Differences in Instructional Video, [Paper]
(arXiv 2024.4) A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming, [Paper]
(arXiv 2024.4) Pre-trained Vision-Language Models Learn Discoverable Visual Concepts, [Paper], [Project]
(arXiv 2024.4) MoVA: Adapting Mixture of Vision Experts to Multimodal Context, [Paper], [Project]
(arXiv 2024.4) Uni3DR^2: Unified Scene Representation and Reconstruction for 3D Large Language Models, [Paper], [Project]
(arXiv 2024.4) Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, [Paper]
(arXiv 2024.4) Empowering Large Language Models on Robotic Manipulation with Affordance Prompting, [Paper]
(arXiv 2024.4) Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning, [Paper]
(arXiv 2024.4) OVAL-Prompt: Open-Vocabulary Affordance Localization for Robot Manipulation through LLM Affordance-Grounding, [Paper]
(arXiv 2024.4) FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models, [Paper], [Project]
(arXiv 2024.4) Towards Human Awareness in Robot Task Planning with Large Language Models, [Paper]
(arXiv 2024.4) Self-Supervised Visual Preference Alignment, [Paper]
(arXiv 2024.4) Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering, [Paper]
(arXiv 2024.4) COMBO: Compositional World Models for Embodied Multi-Agent Cooperation, [Paper], [Project]
(arXiv 2024.4) Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent, [Paper]
(arXiv 2024.4) Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales, [Paper]
(arXiv 2024.4) Exploring the Transferability of Visual Prompting for Multimodal Large Language Models, [Paper]
(arXiv 2024.4) TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models, [Paper], [Project]
(arXiv 2024.4) EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM, [Paper]
(arXiv 2024.4) Bridging Vision and Language Spaces with Assignment Prediction, [Paper]
(arXiv 2024.4) TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding, [Paper], [Project]
(arXiv 2024.4) HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision, [Paper], [Project]
(arXiv 2024.4) MMInA: Benchmarking Multihop Multimodal Internet Agents, [Paper], [Project]
(arXiv 2024.4) Evolving Interpretable Visual Classifiers with Large Language Models, [Paper]
(arXiv 2024.4) OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, [Paper], [Project]
(arXiv 2024.4) Reflectance Estimation for Proximity Sensing by Vision-Language Models: Utilizing Distributional Semantics for Low-Level Cognition in Robotics, [Paper]
(arXiv 2024.4) Sketch-Plan-Generalize: Continual Few-Shot Learning of Inductively Generalizable Spatial Concepts for Language-Guided Robot Manipulation, [Paper]
(arXiv 2024.4) MORPHeus: a Multimodal One-armed Robot-assisted Peeling System with Human Users In-the-loop, [Paper], [Project]
(arXiv 2024.4) GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks, [Paper], [Project]
(arXiv 2024.4) Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection, [Paper], [Project]

2024.3

(arXiv 2024.3) AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling, [Paper], [Project]
(arXiv 2024.3) Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE, [Paper], [Project]
(arXiv 2024.3) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists, [Paper], [Project]
(arXiv 2024.3) Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld, [Paper], [Project]
(arXiv 2024.3) ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models, [Paper], [Project]
(arXiv 2024.3) MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control, [Paper], [Project]
(arXiv 2024.3) LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning, [Paper], [Project]
(arXiv 2024.3) RAIL: Robot Affordance Imagination with Large Language Models, [Paper]
(arXiv 2024.3) Are We on the Right Way for Evaluating Large Vision-Language Models? [Paper], [Project]
(arXiv 2024.3) FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues, [Paper], [Project]
(arXiv 2024.3) Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models, [Paper], [Project]
(arXiv 2024.3) OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion, [Paper], [Project]
(arXiv 2024.3) InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction, [Paper], [Project]
(arXiv 2024.3) MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, [Paper]
(arXiv 2024.3) Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models, [Paper], [Project]
(arXiv 2024.3) INSIGHT: End-to-End Neuro-Symbolic Visual Reinforcement Learning with Language Explanations, [Paper]
(arXiv 2024.3) DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM, [Paper]
(arXiv 2024.3) Embodied LLM Agents Learn to Cooperate in Organized Teams, [Paper]
(arXiv 2024.3) To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group Interactions, [Paper], [Project]
(arXiv 2024.3) BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs, [Paper]
(arXiv 2024.3) Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models, [Paper], [Project]
(arXiv 2024.3) HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning, [Paper], [Project]
(arXiv 2024.3) RelationVLM: Making Large Vision-Language Models Understand Visual Relations, [Paper]
(arXiv 2024.3) Towards Multimodal In-Context Learning for Vision & Language Models, [Paper]
(arXiv 2024.3) Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models, [Paper]
(arXiv 2024.3) HawkEye: Training Video-Text LLMs for Grounding Text in Videos, [Paper], [Project]
(arXiv 2024.3) UniCode: Learning a Unified Codebook for Multimodal Large Language Models, [Paper]
(arXiv 2024.3) Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models, [Paper], [Project]
(arXiv 2024.3) MoAI: Mixture of All Intelligence for Large Language and Vision Models, [Paper], [Project]
(arXiv 2024.3) Multi-modal Auto-regressive Modeling via Visual Words, [Paper], [Project]
(arXiv 2024.3) DeepSeek-VL: Towards Real-World Vision-Language Understanding, [Paper], [Project]
(arXiv 2024.3) Will GPT-4 Run DOOM?, [Paper], [Project]
(arXiv 2024.3) Debiasing Large Visual Language Models, [Paper], [Project]

2024.2

(arXiv 2024.2) Efficient Multimodal Learning from Data-centric Perspective, [Paper], [Project]
(arXiv 2024.2) Déjà Vu Memorization in Vision-Language Models, [Paper]
(arXiv 2024.2) Lumos: Empowering Multimodal LLMs with Scene Text Recognition, [Paper]
(arXiv 2024.2) MOSAIC: A Modular System for Assistive and Interactive Cooking, [Paper], [Project]
(arXiv 2024.2) Visual Hallucinations of Multi-modal Large Language Models, [Paper], [Project]
(arXiv 2024.2) DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models, [Paper], [Project]
(arXiv 2024.2) RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation, [Paper]
(arXiv 2024.2) TinyLLaVA: A Framework of Small-scale Large Multimodal Models, [Paper], [Project]
(arXiv 2024.2) Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models, [Paper]
(arXiv 2024.2) Uncertainty-Aware Evaluation for Vision-Language Models, [Paper], [Project]
(arXiv 2024.2) RealDex: Towards Human-like Grasping for Robotic Dexterous Hand, [Paper]
(arXiv 2024.2) Aligning Modalities in Vision Large Language Models via Preference Fine-tuning, [Paper], [Project]
(arXiv 2024.2) Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships, [Paper]
(arXiv 2024.2) ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning, [Paper], [Project]
(arXiv 2024.2) LVCHAT: Facilitating Long Video Comprehension, [Paper], [Project]
(arXiv 2024.2) Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models, [Paper], [Project]
(arXiv 2024.2) Using Left and Right Brains Together: Towards Vision and Language Planning, [Paper]
(arXiv 2024.2) Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering, [Paper]
(arXiv 2024.2) PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter, [Paper]
(arXiv 2024.2) Grounding LLMs For Robot Task Planning Using Closed-loop State Feedback, [Paper]
(arXiv 2024.2) BBSEA: An Exploration of Brain-Body Synchronization for Embodied Agents, [Paper], [Project]
(arXiv 2024.2) Reasoning Grasping via Multimodal Large Language Model, [Paper]
(arXiv 2024.2) LoTa-Bench: Benchmarking Language-Oriented Task Planners for Embodied Agents, [Paper], [Project]
(arXiv 2024.2) OS-Copilot: Towards Generalist Computer Agents with Self-Improvement, [Paper], [Project]
(arXiv 2024.2) Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning, [Paper]
(arXiv 2024.2) Preference-Conditioned Language-Guided Abstraction, [Paper]
(arXiv 2024.2) Affordable Generative Agents, [Paper], [Project]
(arXiv 2024.2) An Interactive Agent Foundation Model, [Paper]
(arXiv 2024.2) InCoRo: In-Context Learning for Robotics Control with Feedback Loops, [Paper]
(arXiv 2024.2) Real-World Robot Applications of Foundation Models: A Review, [Paper]
(arXiv 2024.2) Question Aware Vision Transformer for Multimodal Reasoning, [Paper]
(arXiv 2024.2) SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models, [Paper], [Project]
(arXiv 2024.2) CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion, [Paper], [Project]
(arXiv 2024.2) S-Agents: Self-Organizing Agents in Open-Ended Environment, [Paper], [Project]
(arXiv 2024.2) Code as Reward: Empowering Reinforcement Learning with VLMs, [Paper]
(arXiv 2024.2) Data-efficient Large Vision Models through Sequential Autoregression, [Paper], [Project]
(arXiv 2024.2) MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark, [Paper], [Project]
(arXiv 2024.2) Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models, [Paper], [Project]
(arXiv 2024.2) “Task Success” is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors, [Paper], [Project]
(arXiv 2024.2) RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents, [Paper]
(arXiv 2024.2) Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models, [Paper]
(arXiv 2024.2) The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs, [Paper], [Project]
(arXiv 2024.2) MobileVLM V2: Faster and Stronger Baseline for Vision Language Model, [Paper], [Project]
(arXiv 2024.2) CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations, [Paper], [Project]
(arXiv 2024.2) Compositional Generative Modeling: A Single Model is Not All You Need, [Paper]
(arXiv 2024.2) IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition, [Paper]
(arXiv 2024.2) Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models, [Paper], [Project]

2024.1

(arXiv 2024.1) Red Teaming Visual Language Models, [Paper], [Project]
(arXiv 2024.1) AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents, [Paper], [Project]
(arXiv 2024.1) LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model, [Paper]
(arXiv 2024.1) Training Diffusion Models with Reinforcement Learning, [Paper], [Project]
(arXiv 2024.1) SwarmBrain: Embodied Agent for Real-Time Strategy Game StarCraft II via Large Language Models, [Paper]
(arXiv 2024.1) YTCommentQA: Video Question Answerability in Instructional Videos, [Paper], [Project]
(arXiv 2024.1) MouSi: Poly-Visual-Expert Vision-Language Models, [Paper], [Project]
(arXiv 2024.1) DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models, [Paper], [Project]
(arXiv 2024.1) KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning, [Paper]
(arXiv 2024.1) Growing from Exploration: A self-exploring framework for robots based on foundation models, [Paper], [Project]
(arXiv 2024.1) True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning, [Paper], [Project]
(arXiv 2024.1) The Neglected Tails of Vision-Language Models, [Paper], [Project]
(arXiv 2024.1) Zero Shot Open-ended Video Inference, [Paper]
(arXiv 2024.1) Small Language Model Meets with Reinforced Vision Vocabulary, [Paper], [Project]
(arXiv 2024.1) HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments, [Paper], [Project]
(arXiv 2024.1) VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks, [Paper], [Project]
(arXiv 2024.1) ChatterBox: Multi-round Multimodal Referring and Grounding, [Paper], [Project]
(arXiv 2024.1) CONTEXTUAL: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models, [Paper], [Project]
(arXiv 2024.1) UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion, [Paper], [Project]
(arXiv 2024.1) Democratizing Fine-Grained Visual Recognition with Large Language Models, [Paper], [Project]
(arXiv 2024.1) Benchmarking Large Multimodal Models against Common Corruptions, [Paper], [Project]
(arXiv 2024.1) CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark, [Paper], [Project]
(arXiv 2024.1) Prompting Large Vision-Language Models for Compositional Reasoning, [Paper], [Project]
(arXiv 2024.1) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs, [Paper], [Project]
(arXiv 2024.1) SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, [Paper], [Project]
(arXiv 2024.1) Towards A Better Metric for Text-to-Video Generation, [Paper], [Project]
(arXiv 2024.1) Exploiting GPT-4 Vision for Zero-Shot Point Cloud Understanding, [Paper]
(arXiv 2024.1) MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception, [Paper], [Project]
(arXiv 2024.1) GATS: Gather-Attend-Scatter, [Paper]
(arXiv 2024.1) DiffusionGPT: LLM-Driven Text-to-Image Generation System, [Paper], [Project]
(arXiv 2024.1) Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models, [Paper]
(arXiv 2024.1) Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation, [Paper]
(arXiv 2024.1) GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition, [Paper]
(arXiv 2024.1) SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding, [Paper], [Project]
(arXiv 2024.1) Vlogger: Make Your Dream A Vlog, [Paper], [Project]
(arXiv 2024.1) CognitiveDog: Large Multimodal Model Based System to Translate Vision and Language into Action of Quadruped Robot, [Paper], [Project]
(arXiv 2024.1) Consolidating Trees of Robotic Plans Generated Using Large Language Models to Improve Reliability, [Paper]
(arXiv 2024.1) Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences, [Paper], [Project]
(arXiv 2024.1) Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering, [Paper]
(arXiv 2024.1) Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge, [Paper]
(arXiv 2024.1) Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning, [Paper], [Project]
(arXiv 2024.1) MMToM-QA: Multimodal Theory of Mind Question Answering, [Paper], [Project]
(arXiv 2024.1) EgoGen: An Egocentric Synthetic Data Generator, [Paper], [Project]
(arXiv 2024.1) COCO is “ALL” You Need for Visual Instruction Fine-Tuning, [Paper]
(arXiv 2024.1) OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality, [Paper], [Project]
(arXiv 2024.1) MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World, [Paper], [Project]
(arXiv 2024.1) Self-Imagine: Effective Unimodal Reasoning with Multimodal Models Using Self-Imagination, [Paper], [Project]
(arXiv 2024.1) Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation, [Paper], [Project]
(arXiv 2024.1) Towards Language-Driven Video Inpainting via Multimodal Large Language Models, [Paper], [Project]
(arXiv 2024.1) 3D-PREMISE: Can Large Language Models Generate 3D Shapes with Sharp Features and Parametric Control? [Paper]
(arXiv 2024.1) 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model, [Paper], [Project]
(arXiv 2024.1) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, [Paper], [Project]
(arXiv 2024.1) AffordanceLLM: Grounding Affordance from Vision Language Models, [Paper], [Project]
(arXiv 2024.1) ModaVerse: Efficiently Transforming Modalities with LLMs, [Paper]
(arXiv 2024.1) RePLan: Robotic Replanning with Perception and Language Models, [Paper], [Project]
(arXiv 2024.1) Language-Conditioned Robotic Manipulation with Fast and Slow Thinking, [Paper], [Project]
(arXiv 2024.1) FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild, [Paper]
(arXiv 2024.1) REBUS: A Robust Evaluation Benchmark of Understanding Symbols, [Paper], [Project]
(arXiv 2024.1) LEGO: Language Enhanced Multi-modal Grounding Model, [Paper], [Project]
(arXiv 2024.1) Distilling Vision-Language Models on Millions of Videos, [Paper]
(arXiv 2024.1) Exploring Large Language Model Based Intelligent Agents: Definitions, Methods, and Prospects, [Paper]
(arXiv 2024.1) Agent AI: Surveying the Horizons of Multimodal Interaction, [Paper]
(arXiv 2024.1) ExTraCT – Explainable Trajectory Corrections from language inputs using Textual description of features, [Paper]
(arXiv 2024.1) Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models, [Paper]
(arXiv 2024.1) GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation, [Paper], [Project]
(arXiv 2024.1) Large Language Models as Visual Cross-Domain Learners, [Paper], [Project]
(arXiv 2024.1) 3DMIT: 3D Multi-Modal Instruction Tuning for Scene Understanding, [Paper], [Project]
(arXiv 2024.1) CaMML: Context-Aware Multimodal Learner for Large Models, [Paper]
(arXiv 2024.1) Object-Centric Instruction Augmentation for Robotic Manipulation, [Paper], [Project]
(arXiv 2024.1) Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers, [Paper]
(arXiv 2024.1) A Vision Check-up for Language Models, [Paper], [Project]
(arXiv 2024.1) GPT-4V(ision) is a Generalist Web Agent, if Grounded, [Paper], [Project]
(arXiv 2024.1) LLaVA-ϕ: Efficient Multi-Modal Assistant with Small Language Model, [Paper], [Project]

2023.12

(arXiv 2023.12) GLaMM: Pixel Grounding Large Multimodal Model, [Paper], [Project]
(arXiv 2023.12) MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations, [Paper], [Project]
(arXiv 2023.12) Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models, [Paper], [Project]
(arXiv 2023.12) Customization Assistant for Text-to-image Generation, [Paper]
(arXiv 2023.12) GPT4Point: A Unified Framework for Point-Language Understanding and Generation, [Paper], [Project]
(arXiv 2023.12) LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, [Paper], [Project]
(arXiv 2023.12) BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models, [Paper], [Project]
(arXiv 2023.12) Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions, [Paper], [Project]
(arXiv 2023.12) Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning, [Paper], [Project]
(arXiv 2023.12) ETC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model, [Paper]
(arXiv 2023.12) Lenna: Language Enhanced Reasoning Detection Assistant, [Paper], [Project]
(arXiv 2023.12) VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding, [Paper]
(arXiv 2023.12) StoryGPT-V: Large Language Models as Consistent Story Visualizers, [Paper], [Project]
(arXiv 2023.12) Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images, [Paper]
(arXiv 2023.12) Recursive Visual Programming, [Paper]
(arXiv 2023.12) PixelLM: Pixel Reasoning with Large Multimodal Model, [Paper]
(arXiv 2023.12) Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition, [Paper]
(arXiv 2023.12) Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models, [Paper], [Project]
(arXiv 2023.12) Video Summarization: Towards Entity-Aware Captions, [Paper]
(arXiv 2023.12) VLAP: Efficient Video-Language Alignment via Frame Prompting and Distilling for Video Question Answering, [Paper]
(arXiv 2023.12) See, Say, and Segment: Teaching LMMs to Overcome False Premises, [Paper], [Project]
(arXiv 2023.12) Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers, [Paper], [Code]
(arXiv 2023.12) Interfacing Foundation Models’ Embeddings, [Paper], [Project]
(arXiv 2023.12) VILA: On Pre-training for Visual Language Models, [Paper]
(arXiv 2023.12) MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception, [Paper], [Project]
(arXiv 2023.12) Hallucination Augmented Contrastive Learning for Multimodal Large Language Model, [Paper]
(arXiv 2023.12) Honeybee: Locality-enhanced Projector for Multimodal LLM, [Paper], [Project]
(arXiv 2023.12) SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models, [Paper], [Project]
(arXiv 2023.12) InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following, [Paper], [Project]
(arXiv 2023.12) EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models, [Paper], [Project]
(arXiv 2023.12) AM-RADIO: Agglomerative Model – Reduce All Domains Into One, [Paper], [Project]
(arXiv 2023.12) Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning, [Paper]
(arXiv 2023.12) How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation, [Paper], [Project]
(arXiv 2023.12) Audio-Visual LLM for Video Understanding, [Paper]
(arXiv 2023.12) AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes, [Paper], [Project]
(arXiv 2023.12) Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models, [Paper], [Project](https://github.com/Vill-Lab/2024-AAAI-HPT)
(arXiv 2023.12) Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models, [Paper], [Project]
(arXiv 2023.12) AllSpark: a multimodal spatiotemporal general model, [Paper]
(arXiv 2023.12) Tracking with Human-Intent Reasoning, [Paper], [Project]
(arXiv 2023.12) Retrieval-Augmented Egocentric Video Captioning, [Paper]
(arXiv 2023.12) COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training, [Paper], [Project]
(arXiv 2023.12) LARP: Language-Agent Role Play for Open-World Games, [Paper], [Project]
(arXiv 2023.12) CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update, [Paper], [Project]
(arXiv 2023.12) DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics, [Paper], [Project]
(arXiv 2023.12) VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens, [Paper], [Project]
(arXiv 2023.12) VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation, [Paper], [Project]
(arXiv 2023.12) Pixel Aligned Language Models, [Paper], [Project]
(arXiv 2023.12) Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos, [Paper]
(arXiv 2023.12) Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels, [Paper], [Project]
(arXiv 2023.12) Osprey: Pixel Understanding with Visual Instruction Tuning, [Paper], [Project]
(arXiv 2023.12) Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action, [Paper], [Project]
(arXiv 2023.12) A Simple LLM Framework for Long-Range Video Question-Answering, [Paper], [Project]
(arXiv 2023.12) TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones, [Paper], [Project]
(arXiv 2023.12) ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation, [Paper], [Project]
(arXiv 2023.12) ChartBench: A Benchmark for Complex Visual Reasoning in Charts, [Paper]
(arXiv 2023.12) FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects, [Paper], [Project]
(arXiv 2023.12) Make-A-Character: High Quality Text-to-3D Character Generation within Minutes, [Paper], [Project]
(arXiv 2023.12) 3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V, [Paper]
(arXiv 2023.12) SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models, [Paper], [Project]
(arXiv 2023.12) VideoPoet: A Large Language Model for Zero-Shot Video Generation, [Paper], [Project]
(arXiv 2023.12) V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs, [Paper], [Project]
(arXiv 2023.12) A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties, [Paper], [Project]
(arXiv 2023.12) AppAgent: Multimodal Agents as Smartphone Users, [Paper], [Project]
(arXiv 2023.12) InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models, [Paper], [Project]
(arXiv 2023.12) Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models, [Paper]
(arXiv 2023.12) Generative Multimodal Models are In-Context Learners, [Paper], [Project]
(arXiv 2023.12) VCoder: Versatile Vision Encoders for Multimodal Large Language Models, [Paper], [Project]
(arXiv 2023.12) LLM4VG: Large Language Models Evaluation for Video Grounding, [Paper]
(arXiv 2023.12) InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks, [Paper], [Project]
(arXiv 2023.12) VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation, [Paper], [Project]
(arXiv 2023.12) Plan, Posture and Go: Towards Open-World Text-to-Motion Generation, [Paper], [Project]
(arXiv 2023.12) MotionScript: Natural Language Descriptions for Expressive 3D Human Motions, [Paper], [Project]
(arXiv 2023.12) Assessing GPT4-V on Structured Reasoning Tasks, [Paper], [Project]
(arXiv 2023.12) Iterative Motion Editing with Natural Language, [Paper], [Project]
(arXiv 2023.12) Gemini: A Family of Highly Capable Multimodal Models, [Paper], [Project]
(arXiv 2023.12) StarVector: Generating Scalable Vector Graphics Code from Images, [Paper], [Project]
(arXiv 2023.12) Text-Conditioned Resampler For Long Form Video Understanding, [Paper]
(arXiv 2023.12) Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning, [Paper]
(arXiv 2023.12) A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise, [Paper], [Project]
(arXiv 2023.12) Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model, [Paper], [Project]
(arXiv 2023.12) M^2ConceptBase: A Fine-grained Aligned Multi-modal Conceptual Knowledge Base, [Paper]
(arXiv 2023.12) Language-conditioned Learning for Robotic Manipulation: A Survey, [Paper]
(arXiv 2023.12) TUNING LAYERNORM IN ATTENTION: TOWARDS EFFICIENT MULTI-MODAL LLM FINETUNING, [Paper]
(arXiv 2023.12) GSVA: Generalized Segmentation via Multimodal Large Language Models, [Paper], [Project]
(arXiv 2023.12) SILKIE: PREFERENCE DISTILLATION FOR LARGE VISUAL LANGUAGE MODELS, [Paper], [Project]
(arXiv 2023.12) AN EVALUATION OF GPT-4V AND GEMINI IN ONLINE VQA, [Paper]
(arXiv 2023.12) CEIR: CONCEPT-BASED EXPLAINABLE IMAGE REPRESENTATION LEARNING, [Paper], [Project]
(arXiv 2023.12) Language-Assisted 3D Scene Understanding, [Paper], [Project]
(arXiv 2023.12) M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts, [Paper], [Project]
(arXiv 2023.12) Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI, [Paper], [Project]
(arXiv 2023.12) FROM TEXT TO MOTION: GROUNDING GPT-4 IN A HUMANOID ROBOT “ALTER3”, [Paper], [Project]
(arXiv 2023.12) Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks, [Paper]
(arXiv 2023.12) LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos, [Paper]
(arXiv 2023.12) LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning, [Paper]
(arXiv 2023.12) Localized Symbolic Knowledge Distillation for Visual Commonsense Models, [Paper], [Project]
(arXiv 2023.12) MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding, [Paper], [Project]
(arXiv 2023.12) Human Demonstrations are Generalizable Knowledge for Robots, [Paper]
(arXiv 2023.12) WonderJourney: Going from Anywhere to Everywhere, [Paper], [Project]
(arXiv 2023.12) VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models, [Paper], [Code]
(arXiv 2023.12) Text as Image: Learning Transferable Adapter for Multi-Label Classification, [Paper]
(arXiv 2023.12) Prompt Highlighter: Interactive Control for Multi-Modal LLMs, [Paper], [Project]
(arXiv 2023.12) Digital Life Project: Autonomous 3D Characters with Social Intelligence, [Paper], [Project]
(arXiv 2023.12) Generating Illustrated Instructions, [Paper], [Project]
(arXiv 2023.12) Aligning and Prompting Everything All at Once for Universal Visual Perception, [Paper], [Code]
(arXiv 2023.12) LEAP: LLM-Generation of Egocentric Action Programs, [Paper]
(arXiv 2023.12) OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition, [Paper], [Project]
(arXiv 2023.12) Merlin: Empowering Multimodal LLMs with Foresight Minds, [Paper], [Project]
(arXiv 2023.12) VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things, [Paper], [Code]
(arXiv 2023.12) Making Large Multimodal Models Understand Arbitrary Visual Prompts, [Paper], [Project]

2023.11

(arXiv 2023.11) Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, [Paper], [Code]
(arXiv 2023.11) Self-Chained Image-Language Model for Video Localization and Question Answering, [Paper], [Code]
(arXiv 2023.11) Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning, [Paper], [Project]
(arXiv 2023.11) LALM: Long-Term Action Anticipation with Language Models, [Paper]
(arXiv 2023.11) Contrastive Vision-Language Alignment Makes Efficient Instruction Learner, [Paper], [Code]
(arXiv 2023.11) ChatIllusion: Efficient-Aligning Interleaved Generation ability with Visual Instruction Model, [Paper], [Code]
(arXiv 2023.11) MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition, [Paper]
(arXiv 2023.11) VTimeLLM: Empower LLM to Grasp Video Moments, [Paper], [Code]
(arXiv 2023.11) Simple Semantic-Aided Few-Shot Learning, [Paper]
(arXiv 2023.11) LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning, [Paper], [Project]
(arXiv 2023.11) Detailed Human-Centric Text Description-Driven Large Scene Synthesis, [Paper]
(arXiv 2023.11) X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning, [Paper], [Code]
(arXiv 2023.11) CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation, [Paper], [Project]
(arXiv 2023.11) AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond, [Paper], [Project]
(arXiv 2023.11) InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation, [Paper], [Code]
(arXiv 2023.11) MLLMs-Augmented Visual-Language Representation Learning, [Paper], [Code]
(arXiv 2023.11) PoseGPT: Chatting about 3D Human Pose, [Paper], [Project]
(arXiv 2023.11) LLM-State: Expandable State Representation for Long-horizon Task Planning in the Open World, [Paper]
(arXiv 2023.11) UniIR: Training and Benchmarking Universal Multimodal Information Retrievers, [Paper], [Project]
(arXiv 2023.11) VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models, [Paper], [Code]
(arXiv 2023.11) MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning, [Paper], [Project]
(arXiv 2023.11) Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis, [Paper]
(arXiv 2023.11) Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects, [Paper]
(arXiv 2023.11) OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, [Paper], [Code]
(arXiv 2023.11) ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model, [Paper], [Project]
(arXiv 2023.11) VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following, [Paper], [Project]
(arXiv 2023.11) Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models, [Paper], [Code]
(arXiv 2023.11) Self-correcting LLM-controlled Diffusion Models, [Paper]
(arXiv 2023.11) InterControl: Generate Human Motion Interactions by Controlling Every Joint, [Paper], [Code]
(arXiv 2023.11) DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback, [Paper], [Code]
(arXiv 2023.11) GAIA: A Benchmark for General AI Assistants, [Paper], [Project]
(arXiv 2023.11) PG-Video-LLaVA: Pixel Grounding Large Video-Language Models, [Paper], [Code]
(arXiv 2023.11) Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge, [Paper]
(arXiv 2023.11) AN EMBODIED GENERALIST AGENT IN 3D WORLD, [Paper], [Project]
(arXiv 2023.11) ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, [Paper], [Project]
(arXiv 2023.11) KNVQA: A Benchmark for evaluation knowledge-based VQA, [Paper]
(arXiv 2023.11) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning, [Paper], [Project]
(arXiv 2023.11) Boosting Audio-visual Zero-shot Learning with Large Language Models, [Paper], [Code]
(arXiv 2023.11) Few-Shot Classification & Segmentation Using Large Language Models Agent, [Paper]
(arXiv 2023.11) Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents, [Paper], [Code]
(arXiv 2023.11) VLM-Eval: A General Evaluation on Video Large Language Models, [Paper]
(arXiv 2023.11) LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions, [Paper], [Code]
(arXiv 2023.11) LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge, [Paper], [Project]
(arXiv 2023.11) Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding, [Paper], [Code]
(arXiv 2023.11) How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model, [Paper]
(arXiv 2023.11) Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models, [Paper], [Code]
(arXiv 2023.11) Towards Open-Ended Visual Recognition with Large Language Model, [Paper], [Code]
(arXiv 2023.11) Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models, [Paper], [Code]
(arXiv 2023.11) VILMA: A ZERO-SHOT BENCHMARK FOR LINGUISTIC AND TEMPORAL GROUNDING IN VIDEO-LANGUAGE MODELS, [Paper], [Project]
(arXiv 2023.11) VOLCANO: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision, [Paper], [Code]
(arXiv 2023.11) AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, [Paper], [Code]
(arXiv 2023.11) Analyzing Modular Approaches for Visual Question Decomposition, [Paper], [Code]
(arXiv 2023.11) LayoutPrompter: Awaken the Design Ability of Large Language Models, [Paper], [Code]
(arXiv 2023.11) PerceptionGPT: Effectively Fusing Visual Perception into LLM, [Paper]
(arXiv 2023.11) InfMLLM: A Unified Framework for Visual-Language Tasks, [Paper], [Code]
(arXiv 2023.11) WHAT LARGE LANGUAGE MODELS BRING TO TEXT-RICH VQA? [Paper]
(arXiv 2023.11) Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text, [Paper], [Project]
(arXiv 2023.11) GPT-4V(ision) as A Social Media Analysis Engine, [Paper], [Code]
(arXiv 2023.11) GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation, [Paper], [Code]
(arXiv 2023.11) To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning, [Paper], [Code]
(arXiv 2023.11) SPHINX: THE JOINT MIXING OF WEIGHTS, TASKS, AND VISUAL EMBEDDINGS FOR MULTI-MODAL LARGE LANGUAGE MODELS, [Paper], [Code]
(arXiv 2023.11) ADAPT: As-Needed Decomposition and Planning with Language Models, [Paper], [Project]
(arXiv 2023.11) JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models, [Paper], [Project]
(arXiv 2023.11) Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, [Paper]
(arXiv 2023.11) Multitask Multimodal Prompted Training for Interactive Embodied Task Completion, [Paper], [Code]
(arXiv 2023.11) TEAL: TOKENIZE AND EMBED ALL FOR MULTIMODAL LARGE LANGUAGE MODELS, [Paper]
(arXiv 2023.11) u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model, [Paper]
(arXiv 2023.11) LLAVA-PLUS: LEARNING TO USE TOOLS FOR CREATING MULTIMODAL AGENTS, [Paper], [Project]
(arXiv 2023.11) Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models, [Paper], [Code]
(arXiv 2023.11) OtterHD: A High-Resolution Multi-modality Model, [Paper], [Code]
(arXiv 2023.11) NExT-Chat: An LMM for Chat, Detection and Segmentation, [Paper], [Project]
(arXiv 2023.11) GENOME: GENERATIVE NEURO-SYMBOLIC VISUAL REASONING BY GROWING AND REUSING MODULES, [Paper], [Project]
(arXiv 2023.11) MAKE A DONUT: LANGUAGE-GUIDED HIERARCHICAL EMD-SPACE PLANNING FOR ZERO-SHOT DEFORMABLE OBJECT MANIPULATION, [Paper]
(arXiv 2023.11) Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs, [Paper], [Code]
(arXiv 2023.11) Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models, [Paper]
(arXiv 2023.11) ROBOGEN: TOWARDS UNLEASHING INFINITE DATA FOR AUTOMATED ROBOT LEARNING VIA GENERATIVE SIMULATION, [Paper]

2023.10

(arXiv 2023.10) MINIGPT-5: INTERLEAVED VISION-AND-LANGUAGE GENERATION VIA GENERATIVE VOKENS, [Paper], [Code]
(arXiv 2023.10) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning, [Paper], [Code]
(arXiv 2023.10) APOLLO: ZERO-SHOT MULTIMODAL REASONING WITH MULTIPLE EXPERTS, [Paper], [Code]
(arXiv 2023.10) ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense, [Paper]
(arXiv 2023.10) Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models, [Paper], [Project]
(arXiv 2023.10) LARGE LANGUAGE MODELS AS GENERALIZABLE POLICIES FOR EMBODIED TASKS, [Paper], [Project]
(arXiv 2023.10) Humanoid Agents: Platform for Simulating Human-like Generative Agents, [Paper], [Project]
(arXiv 2023.10) REVO-LION: EVALUATING AND REFINING VISION-LANGUAGE INSTRUCTION TUNING DATASETS, [Paper], [Code]
(arXiv 2023.10) How (not) to ensemble LVLMs for VQA, [Paper]
(arXiv 2023.10) What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models, [Paper], [Code]
(arXiv 2023.10) Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement, [Paper], [Code]
(arXiv 2023.10) GameGPT: Multi-agent Collaborative Framework for Game Development, [Paper]
(arXiv 2023.10) STEVE-EYE: EQUIPPING LLM-BASED EMBODIED AGENTS WITH VISUAL PERCEPTION IN OPEN WORLDS, [Paper]
(arXiv 2023.10) BENCHMARKING SEQUENTIAL VISUAL INPUT REASONING AND PREDICTION IN MULTIMODAL LARGE LANGUAGE MODELS, [Paper], [Code]
(arXiv 2023.10) A Simple Baseline for Knowledge-Based Visual Question Answering, [Paper], [Code]
(arXiv 2023.10) Interactive Robot Learning from Verbal Correction, [Paper], [Project]
(arXiv 2023.10) Exploring Question Decomposition for Zero-Shot VQA, [Paper], [Project]
(arXiv 2023.10) RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments, [Paper], [Project]
(arXiv 2023.10) An Early Evaluation of GPT-4V(ision), [Paper], [Code]
(arXiv 2023.10) DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models, [Paper], [Project]
(arXiv 2023.10) CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images, [Paper], [Code]
(arXiv 2023.10) VIDEOPROMPTER: AN ENSEMBLE OF FOUNDATIONAL MODELS FOR ZERO-SHOT VIDEO UNDERSTANDING, [Paper]
(arXiv 2023.10) Inject Semantic Concepts into Image Tagging for Open-Set Recognition, [Paper], [Code]
(arXiv 2023.10) Woodpecker: Hallucination Correction for Multimodal Large Language Models, [Paper], [Code]
(arXiv 2023.10) Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models, [Paper], [Code]
(arXiv 2023.10) Large Language Models are Temporal and Causal Reasoners for Video Question Answering, [Paper], [Code]
(arXiv 2023.10) What’s Left? Concept Grounding with Logic-Enhanced Foundation Models, [Paper]
(arXiv 2023.10) Evaluating Spatial Understanding of Large Language Models, [Paper]
(arXiv 2023.10) Learning Reward for Physical Skills using Large Language Model, [Paper]
(arXiv 2023.10) CREATIVE ROBOT TOOL USE WITH LARGE LANGUAGE MODELS, [Paper], [Project]
(arXiv 2023.10) Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models, [Paper], [Project]
(arXiv 2023.10) Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning, [Paper], [Project]
(arXiv 2023.10) LARGE LANGUAGE MODELS CAN SHARE IMAGES, TOO! [Paper], [Code]
(arXiv 2023.10) Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond, [Paper]
(arXiv 2023.10) HALLUSIONBENCH: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models, [Paper], [Code]
(arXiv 2023.10) Can Language Models Laugh at YouTube Short-form Videos? [Paper], [Code]
(arXiv 2023.10) Large Language Models are Visual Reasoning Coordinators, [Paper], [Code]
(arXiv 2023.10) Language Models as Zero-Shot Trajectory Generators, [Paper], [Project]
(arXiv 2023.10) Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge, [Paper], [Code]
(arXiv 2023.10) Multimodal Large Language Model for Visual Navigation, [Paper]
(arXiv 2023.10) MAKING MULTIMODAL GENERATION EASIER: WHEN DIFFUSION MODELS MEET LLMS, [Paper], [Code]
(arXiv 2023.10) Open X-Embodiment: Robotic Learning Datasets and RT-X Models, [Paper], [Project]
(arXiv 2023.10) Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT, [Paper], [Code]
(arXiv 2023.10) Lost in Translation: When GPT-4V(ision) Can’t See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond, [Paper]
(arXiv 2023.10) Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models, [Paper]
(arXiv 2023.10) VLIS: Unimodal Language Models Guide Multimodal Language Generation, [Paper], [Code]
(arXiv 2023.10) CLIN: A CONTINUALLY LEARNING LANGUAGE AGENT FOR RAPID TASK ADAPTATION AND GENERALIZATION, [Paper], [Project]
(arXiv 2023.10) Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning, [Paper]
(arXiv 2023.10) FROZEN TRANSFORMERS IN LANGUAGE MODELS ARE EFFECTIVE VISUAL ENCODER LAYERS, [Paper], [Code]
(arXiv 2023.10) CLAIR: Evaluating Image Captions with Large Language Models, [Paper], [Project]
(arXiv 2023.10) 3D-GPT: PROCEDURAL 3D MODELING WITH LARGE LANGUAGE MODELS, [Paper], [Project]
(arXiv 2023.10) Automated Natural Language Explanation of Deep Visual Neurons with Large Models, [Paper]
(arXiv 2023.10) Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, [Paper], [Project]
(arXiv 2023.10) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models, [Paper], [Project]
(arXiv 2023.10) MISAR: A MULTIMODAL INSTRUCTIONAL SYSTEM WITH AUGMENTED REALITY, [Paper], [Code]
(arXiv 2023.10) NON-INTRUSIVE ADAPTATION: INPUT-CENTRIC PARAMETER-EFFICIENT FINE-TUNING FOR VERSATILE MULTIMODAL MODELING, [Paper]
(arXiv 2023.10) LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation, [Paper], [Project]
(arXiv 2023.10) ChatGPT-guided Semantics for Zero-shot Learning, [Paper]
(arXiv 2023.10) On the Benefit of Generative Foundation Models for Human Activity Recognition, [Paper]
(arXiv 2023.10) DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning, [Paper], [Project]
(arXiv 2023.10) Interactive Task Planning with Language Models, [Paper], [Project]
(arXiv 2023.10) Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance, [Paper], [Project]
(arXiv 2023.10) Penetrative AI: Making LLMs Comprehend the Physical World, [Paper]
(arXiv 2023.10) BONGARD-OPENWORLD: FEW-SHOT REASONING FOR FREE-FORM VISUAL CONCEPTS IN THE REAL WORLD, [Paper], [Project]
(arXiv 2023.10) ViPE: Visualise Pretty-much Everything, [Paper]
(arXiv 2023.10) MINIGPT-V2: LARGE LANGUAGE MODEL AS A UNIFIED INTERFACE FOR VISION-LANGUAGE MULTITASK LEARNING, [Paper], [Project]
(arXiv 2023.10) MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations, [Paper]
(arXiv 2023.10) LLM BLUEPRINT: ENABLING TEXT-TO-IMAGE GENERATION WITH COMPLEX AND DETAILED PROMPTS, [Paper]
(arXiv 2023.10) VIDEO LANGUAGE PLANNING, [Paper], [Project]
(arXiv 2023.10) Dobby: A Conversational Service Robot Driven by GPT-4, [Paper]
(arXiv 2023.10) CoPAL: Corrective Planning of Robot Actions with Large Language Models, [Paper]
(arXiv 2023.10) Forgetful Large Language Models: Lessons Learned from Using LLMs in Robot Programming, [Paper]
(arXiv 2023.10) TREE-PLANNER: EFFICIENT CLOSE-LOOP TASK PLANNING WITH LARGE LANGUAGE MODELS, [Paper], [Project]
(arXiv 2023.10) TOWARDS ROBUST MULTI-MODAL REASONING VIA MODEL SELECTION, [Paper], [Code]
(arXiv 2023.10) FERRET: REFER AND GROUND ANYTHING ANYWHERE AT ANY GRANULARITY, [Paper], [Code]
(arXiv 2023.10) FROM SCARCITY TO EFFICIENCY: IMPROVING CLIP TRAINING VIA VISUAL-ENRICHED CAPTIONS, [Paper]
(arXiv 2023.10) OPENLEAF: OPEN-DOMAIN INTERLEAVED IMAGE-TEXT GENERATION AND EVALUATION, [Paper]
(arXiv 2023.10) Can We Edit Multimodal Large Language Models? [Paper], [Code]
(arXiv 2023.10) VISUAL DATA-TYPE UNDERSTANDING DOES NOT EMERGE FROM SCALING VISION-LANGUAGE MODELS, [Paper], [Code]
(arXiv 2023.10) Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation, [Paper], [Project]
(arXiv 2023.10) OCTOPUS: EMBODIED VISION-LANGUAGE PROGRAMMER FROM ENVIRONMENTAL FEEDBACK, [Paper], [Project]

2023.9

(arXiv 2023.9) LMEye: An Interactive Perception Network for Large Language Models, [Paper], [Code]
(arXiv 2023.9) DynaCon: Dynamic Robot Planner with Contextual Awareness via LLMs, [Paper], [Project]
(arXiv 2023.9) AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model, [Paper]
(arXiv 2023.9) ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning, [Paper], [Project]
(arXiv 2023.9) LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement, [Paper], [Code]
(arXiv 2023.9) ONE FOR ALL: VIDEO CONVERSATION IS FEASIBLE WITHOUT VIDEO INSTRUCTION TUNING, [Paper]
(arXiv 2023.9) Verifiable Learned Behaviors via Motion Primitive Composition: Applications to Scooping of Granular Media, [Paper]
(arXiv 2023.9) Human-Assisted Continual Robot Learning with Foundation Models, [Paper], [Project]
(arXiv 2023.9) InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition, [Paper], [Code]
(arXiv 2023.9) VIDEODIRECTORGPT: CONSISTENT MULTI-SCENE VIDEO GENERATION VIA LLM-GUIDED PLANNING, [Paper], [Project]
(arXiv 2023.9) Text-to-Image Generation for Abstract Concepts, [Paper]
(arXiv 2023.9) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator, [Paper], [Code]
(arXiv 2023.9) ALIGNING LARGE MULTIMODAL MODELS WITH FACTUALLY AUGMENTED RLHF, [Paper], [Project]
(arXiv 2023.9) Self-Recovery Prompting: Promptable General Purpose Service Robot System with Foundation Models and Self-Recovery, [Paper], [Project]
(arXiv 2023.9) Q-BENCH: A BENCHMARK FOR GENERAL-PURPOSE FOUNDATION MODELS ON LOW-LEVEL VISION, [Paper]
(arXiv 2023.9) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, [Paper], [Code]
(arXiv 2023.9) LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition, [Paper], [Code]
(arXiv 2023.9) LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent, [Paper], [Project]
(arXiv 2023.9) Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, [Paper], [Code]
(arXiv 2023.9) STRUCTCHART: PERCEPTION, STRUCTURING, REASONING FOR VISUAL CHART UNDERSTANDING, [Paper]
(arXiv 2023.9) DREAMLLM: SYNERGISTIC MULTIMODAL COMPREHENSION AND CREATION, [Paper], [Project]
(arXiv 2023.9) A LARGE-SCALE DATASET FOR AUDIO-LANGUAGE REPRESENTATION LEARNING, [Paper], [Project]
(arXiv 2023.9) YOU ONLY LOOK AT SCREENS: MULTIMODAL CHAIN-OF-ACTION AGENTS, [Paper], [Code]
(arXiv 2023.9) SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models, [Paper], [Project]
(arXiv 2023.9) Conformal Temporal Logic Planning using Large Language Models: Knowing When to Do What and When to Ask for Help, [Paper], [Project]
(arXiv 2023.9) Investigating the Catastrophic Forgetting in Multimodal Large Language Models, [Paper]
(arXiv 2023.9) Specification-Driven Video Search via Foundation Models and Formal Verification, [Paper]
(arXiv 2023.9) Language as the Medium: Multimodal Video Classification through text only, [Paper]
(arXiv 2023.9) Multimodal Foundation Models: From Specialists to General-Purpose Assistants, [Paper]
(arXiv 2023.9) TEXTBIND: Multi-turn Interleaved Multimodal Instruction-following, [Paper], [Project]
(arXiv 2023.9) Prompt a Robot to Walk with Large Language Models, [Paper], [Project]
(arXiv 2023.9) Grasp-Anything: Large-scale Grasp Dataset from Foundation Models, [Paper], [Project]
(arXiv 2023.9) MMICL: EMPOWERING VISION-LANGUAGE MODEL WITH MULTI-MODAL IN-CONTEXT LEARNING, [Paper], [Code]
(arXiv 2023.9) SwitchGPT: Adapting Large Language Models for Non-Text Outputs, [Paper], [Code]
(arXiv 2023.9) UNIFIED HUMAN-SCENE INTERACTION VIA PROMPTED CHAIN-OF-CONTACTS, [Paper], [Code]
(arXiv 2023.9) Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models, [Paper]
(arXiv 2023.9) NExT-GPT: Any-to-Any Multimodal LLM, [Paper], [Project]
(arXiv 2023.9) Multi3DRefer: Grounding Text Description to Multiple 3D Objects, [Paper], [Project]
(arXiv 2023.9) Language Models as Black-Box Optimizers for Vision-Language Models, [Paper]
(arXiv 2023.9) Evaluation and Mitigation of Agnosia in Multimodal Large Language Models, [Paper]
(arXiv 2023.9) Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models, [Paper], [Code]
(arXiv 2023.9) Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment, [Paper]
(arXiv 2023.9) ImageBind-LLM: Multi-modality Instruction Tuning, [Paper], [Code]
(arXiv 2023.9) Developmental Scaffolding with Large Language Models, [Paper]
(arXiv 2023.9) Gesture-Informed Robot Assistance via Foundation Models, [Paper], [Project]
(arXiv 2023.9) Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging, [Paper]
(arXiv 2023.9) Large AI Model Empowered Multimodal Semantic Communications, [Paper]
(arXiv 2023.9) CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection, [Paper], [Project]
(arXiv 2023.9) Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, [Paper]
(arXiv 2023.9) CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning, [Paper]
(arXiv 2023.9) Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following, [Paper], [Code]

2023.8

(arXiv 2023.8) Planting a SEED of Vision in Large Language Model, [Paper], [Code]
(arXiv 2023.8) EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE, [Paper]
(arXiv 2023.8) Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images, [Paper], [Project]
(arXiv 2023.8) Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis, [Paper]
(arXiv 2023.8) Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models, [Paper], [Code]
(arXiv 2023.8) PointLLM: Empowering Large Language Models to Understand Point Clouds, [Paper], [Project]
(arXiv 2023.8) TouchStone: Evaluating Vision-Language Models by Language Models, [Paper], [Code]
(arXiv 2023.8) Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes, [Paper], [Project]
(arXiv 2023.8) WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model, [Paper]
(arXiv 2023.8) ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning, [Paper], [Code]
(arXiv 2023.8) LLM-Based Human-Robot Collaboration Framework for Manipulation Tasks, [Paper]
(arXiv 2023.8) Evaluation and Analysis of Hallucination in Large Vision-Language Models, [Paper]
(arXiv 2023.8) MLLM-DataEngine: An Iterative Refinement Approach for MLLM, [Paper]
(arXiv 2023.8) Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models, [Paper]
(arXiv 2023.8) Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [Paper], [Code]
(arXiv 2023.8) VIGC: Visual Instruction Generation and Correction, [Paper]
(arXiv 2023.8) Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment, [Paper]
(arXiv 2023.8) Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, [Paper], [Code]
(arXiv 2023.8) DIFFUSION LANGUAGE MODELS CAN PERFORM MANY TASKS WITH SCALING AND INSTRUCTION-FINETUNING, [Paper], [Code]
(arXiv 2023.8) CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images, [Paper], [Project]
(arXiv 2023.8) ProAgent: Building Proactive Cooperative AI with Large Language Models, [Paper], [Project]
(arXiv 2023.8) ROSGPT_Vision: Commanding Robots Using Only Language Models’ Prompts, [Paper], [Code]
(arXiv 2023.8) StoryBench: A Multifaceted Benchmark for Continuous Story Visualization, [Paper], [Code]
(arXiv 2023.8) Tackling Vision Language Tasks Through Learning Inner Monologues, [Paper]
(arXiv 2023.8) ExpeL: LLM Agents Are Experiential Learners, [Paper]
(arXiv 2023.8) On the Adversarial Robustness of Multi-Modal Foundation Models, [Paper]
(arXiv 2023.8) WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models, [Paper], [Project]
(arXiv 2023.8) March in Chat: Interactive Prompting for Remote Embodied Referring Expression, [Paper], [Code]
(arXiv 2023.8) BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions, [Paper], [Code]
(arXiv 2023.8) VIT-LENS: Towards Omni-modal Representations, [Paper], [Code]
(arXiv 2023.8) StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data, [Paper], [Project]
(arXiv 2023.8) PUMGPT: A Large Vision-Language Model for Product Understanding, [Paper]
(arXiv 2023.8) Link-Context Learning for Multimodal LLMs, [Paper], [Code]
(arXiv 2023.8) Detecting and Preventing Hallucinations in Large Vision Language Models, [Paper]
(arXiv 2023.8) VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, [Paper], [Project]
(arXiv 2023.8) Foundation Model based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots, [Paper]
(arXiv 2023.8) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation, [Paper], [Project]
(arXiv 2023.8) OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation, [Paper]
(arXiv 2023.8) EMPOWERING VISION-LANGUAGE MODELS TO FOLLOW INTERLEAVED VISION-LANGUAGE INSTRUCTIONS, [Paper], [Code]
(arXiv 2023.8) 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment, [Paper], [Project]
(arXiv 2023.8) Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs, [Paper], [Project]
(arXiv 2023.8) AgentBench: Evaluating LLMs as Agents, [Paper], [Project]
(arXiv 2023.8) Learning Concise and Descriptive Attributes for Visual Recognition, [Paper]
(arXiv 2023.8) Tiny LVLM-eHub: Early Multimodal Experiments with Bard, [Paper], [Project]
(arXiv 2023.8) MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, [Paper], [Code]
(arXiv 2023.8) RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension, [Paper], [Code]
(arXiv 2023.8) Learning to Model the World with Language, [Paper], [Project]
(arXiv 2023.8) The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World, [Paper], [Code]
(arXiv 2023.8) Multimodal Neurons in Pretrained Text-Only Transformers, [Paper], [Project]
(arXiv 2023.8) LISA: REASONING SEGMENTATION VIA LARGE LANGUAGE MODEL, [Paper], [Code]

2023.7

(arXiv 2023.7) Caption Anything: Interactive Image Description with Diverse Multimodal Controls, [Paper], [Code]
(arXiv 2023.7) DesCo: Learning Object Recognition with Rich Language Descriptions, [Paper]
(arXiv 2023.7) KOSMOS-2: Grounding Multimodal Large Language Models to the World, [Paper], [Project]
(arXiv 2023.7) MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, [Paper], [Code]
(arXiv 2023.7) Evaluating ChatGPT and GPT-4 for Visual Programming, [Paper]
(arXiv 2023.7) SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, [Paper], [Code]
(arXiv 2023.7) AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [Paper], [Project]
(arXiv 2023.7) Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks, [Paper]
(arXiv 2023.7) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding, [Paper], [Project]
(arXiv 2023.7) Large Language Models as General Pattern Machines, [Paper], [Project]
(arXiv 2023.7) How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges, [Paper], [Project]
(arXiv 2023.7) RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, [Paper], [Project]
(arXiv 2023.7) Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, [Paper], [Project]
(arXiv 2023.7) GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping, [Paper], [Project]
(arXiv 2023.7) CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots, [Paper]
(arXiv 2023.7) 3D-LLM: Injecting the 3D World into Large Language Models, [Paper], [Project]
(arXiv 2023.7) Generative Pretraining in Multimodality, [Paper], [Code]
(arXiv 2023.7) VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, [Paper], [Project]
(arXiv 2023.7) VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View, [Paper]
(arXiv 2023.7) SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning, [Paper], [Project]
(arXiv 2023.7) Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts, [Paper]
(arXiv 2023.7) InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, [Paper], [Data]
(arXiv 2023.7) mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs, [Paper], [Code]
(arXiv 2023.7) Bootstrapping Vision-Language Learning with Decoupled Language Pre-training, [Paper]
(arXiv 2023.7) BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, [Paper], [Project]
(arXiv 2023.7) ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning, [Paper], [Project]
(arXiv 2023.7) TOWARDS A UNIFIED AGENT WITH FOUNDATION MODELS, [Paper]
(arXiv 2023.7) Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners, [Paper], [Project]
(arXiv 2023.7) Building Cooperative Embodied Agents Modularly with Large Language Models, [Paper], [Project]
(arXiv 2023.7) Embodied Task Planning with Large Language Models, [Paper], [Project]
(arXiv 2023.7) What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?, [Paper], [Project]
(arXiv 2023.7) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest, [Paper], [Code]
(arXiv 2023.7) JourneyDB: A Benchmark for Generative Image Understanding, [Paper], [Code]
(arXiv 2023.7) DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment, [Paper], [Project]
(arXiv 2023.7) Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset, [Paper], [Code]
(arXiv 2023.7) Visual Instruction Tuning with Polite Flamingo, [Paper], [Code]
(arXiv 2023.7) SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, [Paper]
(arXiv 2023.7) SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, [Paper], [Code]
(arXiv 2023.7) KITE: Keypoint-Conditioned Policies for Semantic Manipulation, [Paper], [Project]

2023.6

(arXiv 2023.6) MultiModal-GPT: A Vision and Language Model for Dialogue with Humans, [Paper], [Code]
(arXiv 2023.6) InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language, [Paper], [Code]
(arXiv 2023.6) InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, [Paper], [Code]
(arXiv 2023.6) LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, [Paper], [Code]
(arXiv 2023.6) Scalable 3D Captioning with Pretrained Models, [Paper], [Code]
(arXiv 2023.6) AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, [Paper], [Code]
(arXiv 2023.6) VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY, [Paper], [Code]
(arXiv 2023.6) Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots, [Paper]
(arXiv 2023.6) LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, [Paper]
(arXiv 2023.6) AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn, [Paper], [Project]
(arXiv 2023.6) Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models, [Paper]
(arXiv 2023.6) MACAW-LLM: MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO, AND TEXT INTEGRATION, [Paper], [Code]
(arXiv 2023.6) Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering, [Paper]
(arXiv 2023.6) Language to Rewards for Robotic Skill Synthesis, [Paper], [Project]
(arXiv 2023.6) Toward Grounded Social Reasoning, [Paper], [Code]
(arXiv 2023.6) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion, [Paper], [Code]
(arXiv 2023.6) RM-PRT: Realistic Robotic Manipulation Simulator and Benchmark with Progressive Reasoning Tasks, [Paper], [Code]
(arXiv 2023.6) Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, [Paper], [Project]
(arXiv 2023.6) Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language, [Paper], [Code]
(arXiv 2023.6) LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, [Paper], [Project]
(arXiv 2023.6) Statler: State-Maintaining Language Models for Embodied Reasoning, [Paper], [Project]
(arXiv 2023.6) CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents, [Paper]
(arXiv 2023.6) Mass-Producing Failures of Multimodal Systems with Language Models, [Paper], [Code]
(arXiv 2023.6) SoftGPT: Learn Goal-oriented Soft Object Manipulation Skills by Generative Pre-trained Heterogeneous Graph Transformer, [Paper]
(arXiv 2023.6) SPRINT: SCALABLE POLICY PRE-TRAINING VIA LANGUAGE INSTRUCTION RELABELING, [Paper], [Project]
(arXiv 2023.6) MotionGPT: Finetuned LLMs are General-Purpose Motion Generators, [Paper], [Project]
(arXiv 2023.6) MIMIC-IT: Multi-Modal In-Context Instruction Tuning, [Paper], [Code]
(arXiv 2023.6) Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models, [Paper]

2023.5

(arXiv 2023.5) IMAGENETVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories, [Paper], [Code]
(arXiv 2023.5) ECHO: A Visio-Linguistic Dataset for Event Causality Inference via Human-Centric ReasOning, [Paper], [Code]
(arXiv 2023.5) PROMPTING LANGUAGE-INFORMED DISTRIBUTION FOR COMPOSITIONAL ZERO-SHOT LEARNING, [Paper]
(arXiv 2023.5) Exploring Diverse In-Context Configurations for Image Captioning, [Paper]
(arXiv 2023.5) Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, [Paper], [Code]
(arXiv 2023.5) IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models, [Paper], [Code]
(arXiv 2023.5) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models, [Paper], [Code]
(arXiv 2023.5) Enhance Reasoning Ability of Visual-Language Models via Large Language Models, [Paper]
(arXiv 2023.5) DetGPT: Detect What You Need via Reasoning, [Paper], [Code]
(arXiv 2023.5) Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, [Paper], [Code]
(arXiv 2023.5) TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding, [Paper]
(arXiv 2023.5) i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data, [Paper]
(arXiv 2023.5) What Makes for Good Visual Tokenizers for Large Language Models?, [Paper], [Code]
(arXiv 2023.5) Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration, [Paper], [Code]
(arXiv 2023.5) X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages, [Paper], [Project]
(arXiv 2023.5) Otter: A Multi-Modal Model with In-Context Instruction Tuning, [Paper], [Code]
(arXiv 2023.5) VideoChat: Chat-Centric Video Understanding, [Paper], [Code]
(arXiv 2023.5) Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2023.5) VIMA: General Robot Manipulation with Multimodal Prompts, [Paper], [Project]
(arXiv 2023.5) TidyBot: Personalized Robot Assistance with Large Language Models, [Paper], [Project]
(arXiv 2023.5) Training Diffusion Models with Reinforcement Learning, [Paper], [Project]
(arXiv 2023.5) EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought, [Paper], [Project]
(arXiv 2023.5) ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4, [Paper], [Code]
(arXiv 2023.5) Evaluating Object Hallucination in Large Vision-Language Models, [Paper], [Code]
(arXiv 2023.5) LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, [Paper], [Code]
(arXiv 2023.5) VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, [Paper], [Code]
(arXiv 2023.5) OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding, [Paper], [Project]
(arXiv 2023.5) Towards A Foundation Model for Generalist Robots: Diverse Skill Learning at Scale via Automated Task and Scene Generation, [Paper]
(arXiv 2023.5) An Android Robot Head as Embodied Conversational Agent, [Paper]
(arXiv 2023.5) Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision, [Paper], [Project]
(arXiv 2023.5) Multimodal Procedural Planning via Dual Text-Image Prompting, [Paper], [Code]
(arXiv 2023.5) ArK: Augmented Reality with Knowledge Interactive Emergent Ability, [Paper]

2023.4

(arXiv 2023.4) LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, [Paper], [Code]
(arXiv 2023.4) Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning, [Paper]
(arXiv 2023.4) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, [Paper], [Code]
(arXiv 2023.4) ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System, [Paper], [Project]
(arXiv 2023.4) ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT, [Paper]
(arXiv 2023.4) Robot-Enabled Construction Assembly with Automated Sequence Planning based on ChatGPT: RoboGPT, [Paper]
(arXiv 2023.4) Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT, [Paper], [Code]
(arXiv 2023.4) Can GPT-4 Perform Neural Architecture Search?, [Paper], [Code]
(arXiv 2023.4) MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, [Paper], [Project]
(arXiv 2023.4) SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation, [Paper], [Project]
(arXiv 2023.4) LLM as A Robotic Brain: Unifying Egocentric Memory and Control, [Paper]
(arXiv 2023.4) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, [Paper], [Project]
(arXiv 2023.4) Visual Instruction Tuning, [Paper], [Project]
(arXiv 2023.4) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, [Paper], [Code]
(arXiv 2023.4) Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text, [Paper], [Code]
(arXiv 2023.4) ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance, [Paper], [Code]
(arXiv 2023.4) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, [Paper], [Code]
(arXiv 2023.4) ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks, [Paper], [Code]
(arXiv 2023.4) Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT, [Paper]
(arXiv 2023.4) ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application, [Paper], [Code]
(arXiv 2023.4) OpenAGI: When LLM Meets Domain Experts, [Paper], [Code]
(arXiv 2023.4) Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions, [Paper], [Code]

2023.3

(arXiv 2023.3) Open-World Object Manipulation using Pre-Trained Vision-Language Models, [Paper], [Project]
(arXiv 2023.3) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, [Paper], [Project]
(arXiv 2023.3) Task and Motion Planning with Large Language Models for Object Rearrangement, [Paper], [Project]
(arXiv 2023.3) RE-MOVE: An Adaptive Policy Design Approach for Dynamic Environments via Language-Based Feedback, [Paper], [Project]
(arXiv 2023.3) Chat with the Environment: Interactive Multimodal Perception using Large Language Models, [Paper]
(arXiv 2023.3) MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge, [Paper], [Code]
(arXiv 2023.3) DialogPaint: A Dialog-based Image Editing Model, [Paper]
(arXiv 2023.3) MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, [Paper], [Project]
(arXiv 2023.3) eP-ALM: Efficient Perceptual Augmentation of Language Models, [Paper], [Code]
(arXiv 2023.3) Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting, [Paper], [Project]
(arXiv 2023.3) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, [Paper], [Code]
(arXiv 2023.3) MULTIMODAL ANALOGICAL REASONING OVER KNOWLEDGE GRAPHS, [Paper], [Code]
(arXiv 2023.3) CAN LARGE LANGUAGE MODELS DESIGN A ROBOT? [Paper]
(arXiv 2023.3) Learning video embedding space with Natural Language Supervision, [Paper]
(arXiv 2023.3) Audio Visual Language Maps for Robot Navigation, [Paper], [Project]
(arXiv 2023.3) ViperGPT: Visual Inference via Python Execution for Reasoning, [Paper]
(arXiv 2023.3) ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions, [Paper], [Code]
(arXiv 2023.3) Can an Embodied Agent Find Your “Cat-shaped Mug”? LLM-Based Zero-Shot Object Navigation, [Paper], [Project]
(arXiv 2023.3) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, [Paper], [Code]
(arXiv 2023.3) PaLM-E: An Embodied Multimodal Language Model, [Paper], [Project]
(arXiv 2023.3) Language Is Not All You Need: Aligning Perception with Language Models, [Paper], [Code]

2023.2

(arXiv 2023.2) ChatGPT for Robotics: Design Principles and Model Abilities, [Paper], [Code]
(arXiv 2023.2) Internet Explorer: Targeted Representation Learning on the Open Web, [Paper], [Project]

2022.11

(arXiv 2022.11) Visual Programming: Compositional visual reasoning without training, [Paper], [Project]

2022.7

(arXiv 2022.7) Language Models are General-Purpose Interfaces, [Paper], [Code]
(arXiv 2022.7) LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, [Paper], [Project]
