CLIP-RT
This repository provides detailed instructions on:

  1. Training OpenVLA
  2. Collecting data using natural language supervision
  3. Conducting experiments with UR5

The main CLIP-RT model can be found in this repo.

Table of Contents

  • OpenVLA
  • Collect Training Data
  • Stochastic Trajectory Diversification
  • CLIP-RT Dataset
  • Finally Testing on Your Robot

OpenVLA

Prepare Dataset

To fine-tune the OpenVLA model on your own data, you first need to convert your dataset to the RLDS dataset format.

  1. Clone https://github.com/kpertsch/rlds_dataset_builder
  2. Set up the environment:
conda env create -f environment_ubuntu.yml
conda activate rlds_env
  3. Change the directory name example_dataset to [your_dataset_name]
  4. Make a data directory [your_dataset_name]/data/raw

We expect the raw data to be placed in the following directory structure:

├── raw         
│   ├── task_0       
│   │   ├── episode_0
│   │   │   ├── x.png      
│   │   │   ├── x.json      
│   │   │   └── ...     
│   │   ├── episode_1   
│   │   │   ├── y.png   
│   │   │   └── ...      
└── ...

Each element in the .json file contains the initial instruction of the task, the natural language supervision, the low-level action vectors (7-d and 8-d), and the file name of the corresponding image, as shown below.

{"prev_eef_pose": [0.4867851071573382, 0.11065651089581363, 0.43149384252823314, 3.1369514868261072, 0.006225836816466727, 0.0016994614054684717], "prev_joint": [3.1415085792541504, -1.5707486311541956, 1.570798397064209, 4.71236515045166, -1.5706971327411097, -1.5709307829486292], "eef_pose": [0.48678753690778354, 0.210619672338698, 0.43147651408980553, 3.136830345592423, 0.006070714275545524, 0.0016729409424801845], "joint": [3.3396871089935303, -1.4953535238849085, 1.4921374320983887, 4.716606140136719, -1.570972744618551, -1.3726237455951136], "instruction": "open the cabinet", "supervision": "move arm to the left", "action": [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0], "image_path": "2024_9_11_13_38_56.png", "openx_action": [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 1.0]}
  5. We provide functions to preprocess the raw data (e.g., extracting the 7-d action from the 8-d action) in data_utils/preprocess_raw.py.
  6. Convert the raw data to .npy format via data_utils/convert_to_RLDS.py.
  7. In [your_dataset_name]_dataset_builder.py, modify the following functions according to your dataset (a sketch of _generate_examples is shown below):
def _info(self) -> tfds.core.DatasetInfo:
def _split_generators(self, dl_manager: tfds.download.DownloadManager):
def _generate_examples(self, path) -> Iterator[Tuple[str, Any]]:

Rename the builder class according to the following rule: a dataset named my_data becomes class MyData().
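For reference, a minimal sketch of what _generate_examples could look like inside your builder class. This is an illustration only: it assumes each .npy file produced by data_utils/convert_to_RLDS.py holds a list of per-step dictionaries, and the keys must match the features you declared in _info.

import glob
from typing import Any, Iterator, Tuple
import numpy as np

def _generate_examples(self, path) -> Iterator[Tuple[str, Any]]:
    # Assumption: `path` is a glob pattern over per-episode .npy files.
    for episode_path in sorted(glob.glob(path)):
        data = np.load(episode_path, allow_pickle=True)  # list of step dicts (assumption)
        steps = []
        for step in data:
            steps.append({
                "observation": {"image": step["image"]},
                "action": step["action"],
                "language_instruction": step["instruction"],
            })
        yield episode_path, {
            "steps": steps,
            "episode_metadata": {"file_path": episode_path},
        }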

  8. Finally, execute:
tfds build --overwrite

You will then get your dataset in RLDS format under ~/tensorflow_datasets/<name_of_your_dataset>.
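As a quick sanity check, you can try loading the built dataset (a sketch; the dataset name and step keys depend on what you declared in the builder):

import tensorflow_datasets as tfds

ds = tfds.load("[name_of_your_dataset]", split="train")  # default data dir: ~/tensorflow_datasets
for episode in ds.take(1):
    for step in episode["steps"]:
        print(step["language_instruction"].numpy(), step["action"].numpy())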

Finetune with LoRA

Clone the OpenVLA git repo.

Install the following dependencies.

conda create -n openvla python=3.10
conda activate openvla
pip install -r requirements.txt
cd openvla
pip install -e .

To test the OpenVLA model with a random test image and text, run

python test.py --mode single --model openvla

To fine-tune the model:

In openvla/prismatic/vla/datasets/rlds/oxe/configs.py, include:

  "[name_of_your_dataset]": {
      "image_obs_keys": {"primary": "image", "secondary": None, "wrist": None},
      "depth_obs_keys": {"primary": None, "secondary": None, "wrist": None},
      "state_obs_keys": [None, None, None, None, None, None, None, None],
      "state_encoding": StateEncoding.NONE,
      "action_encoding": ActionEncoding.EEF_POS,
  },

Then, in openvla/prismatic/vla/datasets/rlds/oxe/transforms.py, include:

def [name_of_your_dataset]_dataset_transform(trajectory: Dict[str, Any]) -> Dict[str, Any]:
    ...

and

OXE_STANDARDIZATION_TRANSFORMS = {
    ...
    "[name_of_your_dataset]": [name_of_your_dataset]_dataset_transform,
}

Then, run

torchrun --standalone --nnodes 1 --nproc-per-node 2 vla-scripts/finetune.py \
  --vla_path "openvla/openvla-7b" \
  --data_root_dir ~/tensorflow_datasets \
  --dataset_name [name_of_your_dataset] \
  --run_root_dir /openvla/vla-scripts/runs \
  --adapter_tmp_dir /openvla/vla-scripts/adapter-tmp \
  --lora_rank 32 \
  --batch_size 8 \
  --grad_accumulation_steps 1 \
  --learning_rate 5e-4 \
  --image_aug False \
  --wandb_project [name] 

With 4 H100 GPUs, 100 steps are enough for training on 10 episodes, taking only a couple of minutes.

Test Your Model

The unnorm key does not seem to be added to config.json automatically. Manually copy it from dataset_statistics.json under vla-scripts/runs. You MUST either modify the predict_action function in openvla/prismatic/extern/hf/modeling_prismatic.py, where the unnorm_key is used, OR modify config.json by manually inserting the dataset statistics.
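For the second option, a rough sketch of inserting the statistics (the run directory is a placeholder, and the norm_stats key name is an assumption about how the checkpoint config stores the statistics):

import json

run_dir = "vla-scripts/runs/[your_run_dir]"  # placeholder

with open(f"{run_dir}/dataset_statistics.json") as f:
    stats = json.load(f)
with open(f"{run_dir}/config.json") as f:
    config = json.load(f)

config["norm_stats"] = stats  # assumed key name for the unnormalization statistics
with open(f"{run_dir}/config.json", "w") as f:
    json.dump(config, f, indent=2)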

Now, you're ready to test your model with:

python test.py --mode serveronly

Collect Training Data

Robot Server (Client)

This is the code for collecting robotic action data with UR5.

conda activate ur

Test the camera

python kinect_get_image.py

Test the robot

python ur_controller.py

You can check the current joint pose by adding the --getpose argument.

To collect data, the robot communicates with the remote server. On the client (robot server), run:

python client_ur_collect.py

Remote Server

Coming soon!

Stochastic Trajectory Diversification

Coming soon!

CLIP-RT Dataset

The common.zip and novel.zip files each contain datasets for multiple tasks (9 common tasks and 10 novel tasks). These datasets were collected using human natural language supervision. For each task, we collected 10 episodes.

Each episode consists of a series of PNG-JSON pairs:

  1. PNG: An image representing the current scene.
  2. JSON: Metadata including the initial human instruction, natural language supervision, low-level actions, and other relevant details.
Name | Content | Examples | Size | Link
common.zip | Data for common tasks collected through natural language supervision | 911 | 670 MB | Download
common_augmented.zip | Augmented data of common tasks | 9,841 | 1.8 GB | Download
novel.zip | Data for novel tasks collected through natural language supervision | 1,276 | 1.2 GB | Download
novel_augmented.zip | Augmented data of novel tasks | 11,578 | 2.0 GB | Download
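A minimal loading sketch, assuming the unzipped files follow the raw-data layout shown earlier (one directory per episode containing the PNG-JSON pairs):

import glob
import json
import os

def load_episode(episode_dir):
    """Return the ordered steps of one episode as a list of dicts (sketch)."""
    steps = []
    for json_path in sorted(glob.glob(os.path.join(episode_dir, "*.json"))):
        with open(json_path) as f:
            meta = json.load(f)
        steps.append({
            "image": os.path.join(episode_dir, meta["image_path"]),
            "instruction": meta["instruction"],
            "supervision": meta["supervision"],
            "action": meta["action"],
        })
    return steps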

Finally Testing on Your Robot

UR5

  • Robot Server
conda activate ur
python client_ur_infer.py --mode safe

Set --mode to cont if you want the robot to act continuously throughout the task. In safe mode, you must press a key to confirm every action the robot receives. If you press "n", the robot skips the action and returns to the home pose.

  • Remote Server

(OpenVLA)

conda activate openvla
python test.py --mode full

(CLIP-RT) Check out this repo.
