
Awesome Language-conditioned Robot Manipulation Models

This architectural framework provides an overview of language-conditioned robot manipulation. The agent comprises three key modules: the language module, the perception module, and the control module, which understand instructions, perceive the state of the environment, and acquire skills, respectively. The vision-language module connects instructions to the surrounding environment to reach a deeper understanding of both; this interplay of the two modalities lets the robot perform high-level planning and visual question answering, ultimately improving its overall performance. The control module acquires low-level policies by learning from rewards (reinforcement learning) or from demonstrations engineered by experts (imitation learning); at times these low-level policies are instead designed or hard-coded directly, using path and motion planning algorithms. Two key loops are worth highlighting: the interactive loop, located on the left, facilitates human-robot language interaction, while the control loop, positioned on the right, captures the interaction between the agent and its surrounding environment.
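A minimal, runnable sketch of these modules and loops is shown below. The `ToyEnv` environment, the keyword-based `language_module`, and the proportional `control_module` are illustrative stand-ins introduced here for clarity; they are not components of any model listed in this repository.

```python
# Hypothetical skeleton of a language-conditioned manipulation agent.
# All names (ToyEnv, language_module, ...) are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Stand-in environment: an object that must be pushed to a goal position."""
    obj: float = 0.0
    goal: float = 5.0

    def observe(self) -> float:
        return self.obj

    def step(self, action: float) -> None:
        self.obj += action

    def task_done(self) -> bool:
        return abs(self.obj - self.goal) < 0.1


def language_module(instruction: str) -> float:
    """Map an instruction to a goal; real systems use LSTM/BERT/CLIP encoders."""
    return 5.0 if "right" in instruction else -5.0


def perception_module(env: ToyEnv) -> float:
    """Return the environment state; real systems use CNN/ViT image features."""
    return env.observe()


def control_module(goal: float, state: float) -> float:
    """Low-level policy, here a proportional controller; in practice learned via
    RL/IL or replaced by a hard-coded motion planner."""
    return 0.2 * (goal - state)


def run(instruction: str, max_steps: int = 100) -> None:
    env = ToyEnv()
    goal = language_module(instruction)   # interactive loop: instruction from a human
    for t in range(max_steps):            # control loop: agent <-> environment
        state = perception_module(env)
        env.step(control_module(goal, state))
        if env.task_done():
            print(f"task completed after {t + 1} steps")
            break


run("push the block to the right")
```

In the models surveyed below, the language module corresponds to text encoders such as LSTM, BERT, or CLIP, the perception module to CNN/ViT features, and the control module to a policy learned with reinforcement or imitation learning.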

Table of Contents

Survey

This repository is based on the survey paper

Language-conditioned Learning for Robotic Manipulation: A Survey
Hongkuan Zhou, Xiangtong Yao, Yuan Meng, Siming Sun, Zhenshan Bing, Kai Huang, Hong Qiao, Alois Knoll

@article{zhou2023language,
  title={Language-conditioned Learning for Robotic Manipulation: A Survey},
  author={Zhou, Hongkuan and Yao, Xiangtong and Meng, Yuan and Sun, Siming and Bing, Zhenshan and Huang, Kai and Knoll, Alois},
  journal={arXiv preprint arXiv:2312.10807},
  year={2023}
}

Language-conditioned Reinforcement Learning

Games

  • From language to goals: Inverse reinforcement learning for vision-based instruction following [paper]
  • Grounding english commands to reward function [paper]
  • Learning to understand goal specifications by modelling reward [paper]
  • Beating atari with natural language guided reinforcement learning [paper] [code]
  • Using natural language for reward shaping in reinforcement learning [paper]

Navigation

  • Gated-attention architectures for task-oriented language grounding [paper] [code]
  • Mapping instructions and visual observations to actions with reinforcement learning [paper]
  • Modular multitask reinforcement learning with policy sketches [paper]
  • Representation learning for grounded spatial reasoning [paper]

Manipulation

  • Lancon-learn: Learning with language to enable generalization in multi-task manipulation [paper] [code]
  • Pixl2r: Guiding reinforcement learning using natural language by mapping pixels to rewards [paper][code]
  • Learning from symmetry: Meta-reinforcement learning with symmetrical behaviors and language instructions [paper][website]
  • Meta-reinforcement learning via language instructions [paper][code][website]
  • Learning language-conditioned robot behavior from offline data and crowd-sourced annotation [paper]
  • Concept2robot: Learning manipulation concepts from instructions and human demonstrations [paper]

Language-conditioned Imitation Learning

Behaviour Cloning

  • Language conditioned imitation learning over unstructured data [paper] [code] [website]
  • Bc-z: Zero-shot task generalization with robotic imitation learning [paper]
  • What matters in language-conditioned robotic imitation learning over unstructured data [paper] [code][website]
  • Grounding language with visual affordances over unstructured data [paper] [code][website]
  • Language-conditioned imitation learning with base skill priors under unstructured data [paper] [code] [website]
  • Pay attention!- robustifying a deep visuomotor policy through task-focused visual attention [paper]
  • Language-conditioned imitation learning for robot manipulation tasks [paper]

Inverse Reinforcement Learning

  • Grounding english commands to reward function [paper]
  • From language to goals: Inverse reinforcement learning for vision-based instruction following [paper]

Empowered by LLMs

Planning

  • Sayplan: Grounding large language models using 3d scene graphs for scalable task planning [paper]
  • Language models as zero-shot planners: Extracting actionable knowledge for embodied agents [paper]
  • Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents [paper]
  • Progprompt: Generating situated robot task plans using large language models [paper]
  • Robots that ask for help: Uncertainty alignment for large language model planners [paper]
  • Task and motion planning with large language models for object rearrangement [paper]
  • Do as i can, not as i say: Grounding language in robotic affordances [paper]
  • The 2014 international planning competition: Progress and trends [paper]
  • Robot task planning via deep reinforcement learning: a tabletop object sorting application [paper]
  • Robot task planning and situation handling in open worlds [paper] [code] [website]
  • Embodied Task Planning with Large Language Models [paper] [code] [website]
  • Text2motion: From natural language instructions to feasible plans [paper] [website]
  • Large language models as commonsense knowledge for large-scale task planning [paper] [code] [website]
  • Alphablock: Embodied finetuning for vision-language reasoning in robot manipulation [paper]
  • Learning to reason over scene graphs: a case study of finetuning gpt-2 into a robot language model for grounded task planning [paper] [code]
  • Scaling up and distilling down: Language-guided robot skill acquisition [paper][code] [website]
  • Stap: Sequencing task-agnostic policies [paper] [code][website]
  • Inner monologue: Embodied reasoning through planning with language models [paper] [website]

Reasoning

  • Rearrangement: A challenge for embodied ai [paper]
  • The threedworld transport challenge: A visually guided task and motion planning benchmark for physically realistic embodied ai [paper]
  • Tidy up my room: Multi-agent cooperation for service tasks in smart environments [paper]
  • A quantifiable stratification strategy for tidy-up in service robotics [paper]
  • Tidybot: Personalized robot assistance with large language models [paper]
  • Housekeep: Tidying virtual households using commonsense reasoning [paper]
  • Building cooperative embodied agents modularly with large language models [paper]
  • Socratic models: Composing zero-shot multimodal reasoning with language [paper]
  • Voyager: An open-ended embodied agent with large language models [paper]
  • Translating natural language to planning goals with large-language models [paper]

Empowered by VLMs

  • Cliport: What and where pathways for robotic manipulation [paper] [code] [website]
  • Transporter networks: Rearranging the visual world for robotic manipulation [paper] [code] [website]
  • Simple but effective: Clip embeddings for embodied ai [paper]
  • Instruct2act: Mapping multi-modality instructions to robotic actions with large language model [paper] [code]
  • Latte: Language trajectory transformer [paper] [code]
  • Embodied Task Planning with Large Language Models [paper] [code] [website]
  • Palm-e: An embodied multimodal language model [paper] [website]
  • Socratic models: Composing zero-shot multimodal reasoning with language [paper]
  • Pretrained language models as visual planners for human assistance [paper] [code]
  • Open-world object manipulation using pre-trained vision-language models [paper] [website]
  • Robotic skill acquisition via instruction augmentation with vision-language models [paper] [website]
  • Language reward modulation for pretraining reinforcement learning [paper] [code]
  • Vision-language models as success detectors [paper]

Comparative Analysis

Simulator

| Simulator | Description |
| --- | --- |
| PyBullet | Built on the Bullet physics engine, PyBullet offers a wealth of tools and resources for tasks ranging from robot manipulation and locomotion to computer-aided design analysis. |
| MuJoCo | MuJoCo, short for "Multi-Joint dynamics with Contact", is a physics engine designed for simulating articulated and deformable bodies. It has become an essential tool for domains ranging from robot locomotion and manipulation to human movement and control. |
| CoppeliaSim | CoppeliaSim, formerly known as V-REP (Virtual Robot Experimentation Platform), offers a comprehensive environment for simulating and prototyping robotic systems, enabling users to create, analyze, and optimize a wide spectrum of robotic applications. What began as an educational tool has evolved into a full-fledged simulation framework known for its versatility and user-friendly interface. |
| NVIDIA Omniverse | NVIDIA Omniverse provides real-time physics simulation and lifelike rendering, creating a virtual environment for testing and fine-tuning robotic manipulation algorithms and control strategies before they are deployed in the physical world. |
| Unity | Unity is a cross-platform game engine developed by Unity Technologies. Renowned for its user-friendly interface and powerful capabilities, it is widely used for video games, augmented reality (AR), virtual reality (VR), and simulation. |
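For reference, the snippet below shows a minimal PyBullet session that loads a Franka Panda arm and commands a single joint. It assumes the pybullet and pybullet_data packages are installed; the choice of robot, joint index, target angle, and step count are illustrative and not tied to any benchmark listed here.

```python
# Minimal PyBullet example: load a Franka Panda arm and command one joint.
# Robot, joint index, and target angle are illustrative choices.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server; use p.GUI for a visualization window
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # access to bundled URDFs
p.setGravity(0, 0, -9.81)

p.loadURDF("plane.urdf")
panda = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

# Drive the first arm joint toward a target angle with position control
p.setJointMotorControl2(panda, jointIndex=0,
                        controlMode=p.POSITION_CONTROL,
                        targetPosition=0.5)

for _ in range(240):  # 240 steps ~= 1 s at the default 240 Hz timestep
    p.stepSimulation()

print("joint 0 angle:", p.getJointState(panda, 0)[0])
p.disconnect()
```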

Benchmarks

| Benchmark | Simulation Engine | Manipulator | Observation (RGB) | Observation (Depth) | Observation (Masks) | Tool used | Multi-agents | Long-horizon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CALVIN | PyBullet | Franka Panda | | | | | | |
| Meta-World | MuJoCo | Sawyer | | | | | | |
| LEMMA | NVIDIA Omniverse | UR10 & UR5 | | | | | | |
| RLBench | CoppeliaSim | Franka Panda | | | | | | |
| VIMAbench | PyBullet | UR5 | | | | | | |
| LoHoRavens | PyBullet | UR5 | | | | | | |
| ARNOLD | NVIDIA Isaac Gym | Franka Panda | | | | | | |

Models

| Model | Year | Benchmark | Simulation Engine | Language Module | Perception Module | Real-World Experiment | LLM | Reinforcement Learning | Imitation Learning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DREAMCELL | 2019 | # | - | LSTM | * | | | | |
| PixL2R | 2020 | Meta-World | MuJoCo | LSTM | CNN | | | | |
| Concept2Robot | 2020 | # | PyBullet | BERT | ResNet-18 | | | | |
| LanguagePolicy | 2020 | # | CoppeliaSim | GloVe | Faster R-CNN | | | | |
| LOReL | 2021 | Meta-World | MuJoCo | DistilBERT | CNN | | | | |
| CARE | 2021 | Meta-World | MuJoCo | RoBERTa | * | | | | |
| MCIL | 2021 | # | MuJoCo | MUSE | CNN | | | | |
| BC-Z | 2021 | # | - | MUSE | ResNet-18 | | | | |
| CLIPort | 2021 | # | PyBullet | CLIP | CLIP/ResNet | | | | |
| LanCon-Learn | 2022 | Meta-World | MuJoCo | GloVe | * | | | | |
| MILLON | 2022 | Meta-World | MuJoCo | GloVe | * | | | | |
| PaLM-SayCan | 2022 | # | - | PaLM | ViLD | | | | |
| ATLA | 2022 | # | PyBullet | BERT-Tiny | CNN | | | | |
| HULC | 2022 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | | | | |
| PerAct | 2022 | RLBench | CoppeliaSim | CLIP | ViT | | | | |
| RT-1 | 2022 | # | - | USE | EfficientNet-B3 | | | | |
| LATTE | 2023 | # | CoppeliaSim | DistilBERT, CLIP | CLIP | | | | |
| DIAL | 2022 | # | - | CLIP | CLIP | | | | |
| R3M | 2022 | # | - | DistilBERT | ResNet | | | | |
| Inner Monologue | 2022 | # | - | CLIP | CLIP | | | | |
| NLMap | 2023 | # | - | CLIP | ViLD | | | | |
| Code as Policies | 2023 | # | - | GPT-3, Codex | ViLD | | | | |
| PROGPROMPT | 2023 | VirtualHome | Unity3D | GPT-3 | * | | | | |
| Language2Reward | 2023 | # | MuJoCo MPC | GPT-4 | * | | | | |
| LfS | 2023 | Meta-World | MuJoCo | Cons. Parser | * | | | | |
| HULC++ | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | | | | |
| LEMMA | 2023 | LEMMA | NVIDIA Omniverse | CLIP | CLIP | | | | |
| SPIL | 2023 | CALVIN | PyBullet | MiniLM-L3-v2 | CNN | | | | |
| PaLM-E | 2023 | # | PyBullet | PaLM | ViT | | | | |
| LAMP | 2023 | RLBench | CoppeliaSim | ChatGPT | R3M | | | | |
| MOO | 2023 | # | - | OWL-ViT | OWL-ViT | | | | |
| Instruct2Act | 2023 | VIMAbench | PyBullet | ChatGPT | CLIP | | | | |
| VoxPoser | 2023 | # | SAPIEN | GPT-4 | OWL-ViT | | | | |
| SuccessVQA | 2023 | # | IA Playroom | Flamingo | Flamingo | | | | |
| VIMA | 2023 | VIMAbench | PyBullet | T5 | ViT | | | | |
| TidyBot | 2023 | # | - | GPT-3 | CLIP | | | | |
| Text2Motion | 2023 | # | - | GPT-3, Codex | * | | | | |
| LLM-GROP | 2023 | # | Gazebo | GPT-3 | * | | | | |
| Scaling Up | 2023 | # | MuJoCo | CLIP, GPT-3 | ResNet-18 | | | | |
| Socratic Models | 2023 | # | - | RoBERTa, GPT-3 | CLIP | | | | |
| SayPlan | 2023 | # | - | GPT-4 | * | | | | |
| RT-2 | 2023 | # | - | PaLI-X, PaLM-E | PaLI-X, PaLM-E | | | | |
| KNOWNO | 2023 | # | PyBullet | PaLM-2L | * | | | | |