- [2024] A Survey on Vision-Language-Action Models for Embodied AI [paper]
- [2024] A Survey of Embodied Learning for Object-Centric Robotic Manipulation [paper]
- [2024] Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [paper]
- [2025] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [paper]
- [2025] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing [paper]
- [2025] Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding [paper]
- [2025] FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper]
- [2025] GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation [paper]
- [2025] Universal Actions for Enhanced Embodied Foundation Models [paper]
- [2025] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [paper]
- [2025] RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation [paper]
- [2025] SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation [paper]
- [2025] Improving Vision-Language-Action Model with Online Reinforcement Learning [paper]
- [2025] Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation [paper]
- [2024] π0: A Vision-Language-Action Flow Model for General Robot Control [paper]
- [2024] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper]
- [2024] OpenVLA: An Open-Source Vision-Language-Action Model [paper]
- [2024] Octo: An Open-Source Generalist Robot Policy [paper]
- [2024] Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
- [2024] RT-H: Action Hierarchies Using Language [paper]
- [2024] Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models [paper]
- [2024] Baku: An Efficient Transformer for Multi-Task Policy Learning [paper]
- [2024] Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals [paper]
- [2024] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [paper]
- [2024] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression [paper]
- [2024] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [paper]
- [2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper]
- [2024] Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations [paper]
- [2024] An Embodied Generalist Agent in 3D World [paper]
- [2024] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation [paper]
- [2024] SpatialBot: Precise Spatial Understanding with Vision Language Models [paper]
- [2024] Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection [paper]
- [2024] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [paper]
- [2024] LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [paper]
- [2024] RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation [paper]
- [2024] Robotic Control via Embodied Chain-of-Thought Reasoning [paper]
- [2024] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [paper]
- [2024] Latent Action Pretraining from Videos [paper]
- [2024] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [paper]
- [2024] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [paper]
- [2024] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [paper]
- [2024] TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [paper]
- [2024] Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments [paper]
- [2023] RT-1: Robotics Transformer for Real-World Control at Scale [paper]
- [2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
- [2023] PaLM-E: An Embodied Multimodal Language Model [paper]
- [2023] Vision-Language Foundation Models as Effective Robot Imitators [paper]
- [2023] Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation [paper]
- [2025] Semantic Mapping in Indoor Embodied AI – A Comprehensive Survey and Future Directions [paper]
- [2024] NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [paper]
- [2024] NaVILA: Legged Robot Vision-Language-Action Model for Navigation [paper]
- [2024] The One RING: a Robotic Indoor Navigation Generalist [paper]
- [2025] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [paper]
- [2025] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations [paper]
- [2024] Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching [paper]
- [2024] 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [paper]
- [2024] Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning [paper]
- [2024] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation [paper]
- [2024] 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [paper]
- [2024] Diffusion Policy Policy Optimization [paper]
- [2024] Language-Guided Object-Centric Diffusion Policy for Collision-Aware Robotic Manipulation [paper]
- [2024] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning [paper]
- [2024] Equivariant Diffusion Policy [paper]
- [2024] Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models [paper]
- [2024] Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies [paper]
- [2024] Motion Before Action: Diffusing Object Motion as Manipulation Condition [paper]
- [2024] One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation [paper]
- [2024] Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation [paper]
- [2024] SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation [paper]
- [2024] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins [paper]
- [2024] Few-Shot Task Learning through Inverse Generative Modeling [paper]
- [2024] G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation [paper]
- [2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation [paper]
- [2024] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies [paper]
- [2024] Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies [paper]
- [2024] Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation [paper]
- [2024] Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation [paper]
- [2024] Learning Universal Policies via Text-Guided Video Generation [paper]
- [2024] Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning [paper]
- [2024] Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [paper]
- [2024] GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy [paper]
- [2024] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation [paper]
- [2024] Prediction with Action: Visual Policy Learning via Joint Denoising Process [paper]
- [2024] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations [paper]
- [2024] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [paper]
- [2024] Streaming Diffusion Policy: Fast Policy Synthesis with Variable Noise Diffusion Models [paper]
- [2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion [paper]
- Awesome-Generalist-Agents [repo]