
Commit 728ab36

[Fix] Update MMScan's README (#99)
* edit readme
* edit readme
* edit readme
* update readme
1 parent cc3842a commit 728ab36

File tree

2 files changed: +98 −44 lines changed


README.md

Lines changed: 63 additions & 28 deletions
@@ -23,11 +23,13 @@

 1. [About](#-about)
 2. [Getting Started](#-getting-started)
-3. [Model and Benchmark](#-model-and-benchmark)
-4. [TODO List](#-todo-list)
+3. [MMScan API Tutorial](#-mmscan-api-tutorial)
+4. [MMScan Benchmark](#-mmscan-benchmark)
+5. [TODO List](#-todo-list)

 ## 🏠 About

+
 <!-- ![Teaser](assets/teaser.jpg) -->

 <div style="text-align: center;">
@@ -55,7 +57,8 @@ Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual
 grounding and LLMs and obtain remarkable performance improvement both on
 existing benchmarks and in-the-wild evaluation.

-## 🚀 Getting Started:
+## 🚀 Getting Started
+

 ### Installation

@@ -90,7 +93,7 @@ existing benchmarks and in-the-wild evaluation.
 ├── embodiedscan_split
 │   ├──embodiedscan-v1/      # EmbodiedScan v1 data in 'embodiedscan.zip'
 │   ├──embodiedscan-v2/      # EmbodiedScan v2 data in 'embodiedscan-v2-beta.zip'
-├── MMScan-beta-release      # MMScan veta data in 'embodiedscan-v2-beta.zip'
+├── MMScan-beta-release      # MMScan data in 'embodiedscan-v2-beta.zip'
 ```

 2. Prepare the point clouds files.
@@ -99,6 +102,7 @@ existing benchmarks and in-the-wild evaluation.

 ## 👓 MMScan API Tutorial

+
 The **MMScan Toolkit** provides comprehensive tools for dataset handling and model evaluation in tasks.

 To import the MMScan API, you can use the following commands:
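The import commands themselves live in an unchanged part of the README and are not shown in this hunk. Purely for orientation, a hypothetical sketch of the usual pattern (the `mmscan` module name, the `MMScan` class, and its constructor arguments are assumptions, not taken from this commit):

```python
# Hypothetical usage sketch -- module, class, and argument names are assumed,
# not confirmed by this commit; see the full README for the real commands.
from mmscan import MMScan  # assumed package/class name

# Load one task split (the task/version/split names are assumptions).
dataset = MMScan(version="v1", split="train", task="MMScan-QA")

print(f"{len(dataset)} samples loaded")
sample = dataset[0]  # each item is a dict; see the key list in the next hunk
```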
@@ -137,39 +141,43 @@ Each dataset item is a dictionary containing key elements:

 (1) 3D Modality

-- **"ori_pcds"** (tuple\[tensor\]): Raw point cloud data from the `.pth` file.
-- **"pcds"** (np.ndarray): Point cloud data, dimensions (\[n_points, 6(xyz+rgb)\]).
-- **"instance_labels"** (np.ndarray): Instance IDs for each point.
-- **"class_labels"** (np.ndarray): Class IDs for each point.
-- **"bboxes"** (dict): Bounding boxes in the scan.
+- **"ori_pcds"** (tuple\[tensor\]): Original point cloud data extracted from the .pth file.
+- **"pcds"** (np.ndarray): Point cloud data with dimensions \[n_points, 6(xyz+rgb)\], representing the coordinates and color of each point.
+- **"instance_labels"** (np.ndarray): Instance ID assigned to each point in the point cloud.
+- **"class_labels"** (np.ndarray): Class IDs assigned to each point in the point cloud.
+- **"bboxes"** (dict): Information about bounding boxes within the scan, structured as { object ID:
+  {
+    "type": object type (str),
+    "bbox": 9 DoF box (np.ndarray)
+  }}

 (2) Language Modality

-- **"sub_class"**: Sample category.
-- **"ID"**: Unique sample ID.
-- **"scan_id"**: Corresponding scan ID.
-- **--------------For Visual Grounding Task**
+- **"sub_class"**: The category of the sample.
+- **"ID"**: The sample's ID.
+- **"scan_id"**: The scan's ID.
+- *For the Visual Grounding task*
   - **"target_id"** (list\[int\]): IDs of target objects.
-  - **"text"** (str): Grounding text.
-  - **"target"** (list\[str\]): Types of target objects.
+  - **"text"** (str): Text used for grounding.
+  - **"target"** (list\[str\]): Text prompt to specify the target grounding object.
   - **"anchors"** (list\[str\]): Types of anchor objects.
   - **"anchor_ids"** (list\[int\]): IDs of anchor objects.
-  - **"tokens_positive"** (dict): Position indices of mentioned objects in the text.
-- **--------------ForQuestion Answering Task**
-  - **"question"** (str): The question text.
+  - **"tokens_positive"** (dict): Indices of positions where mentioned objects appear in the text.
+- *For the Question Answering task*
+  - **"question"** (str): The text of the question.
   - **"answers"** (list\[str\]): List of possible answers.
   - **"object_ids"** (list\[int\]): Object IDs referenced in the question.
   - **"object_names"** (list\[str\]): Types of referenced objects.
   - **"input_bboxes_id"** (list\[int\]): IDs of input bounding boxes.
-  - **"input_bboxes"** (list\[np.ndarray\]): Input bounding boxes, 9 DoF.
+  - **"input_bboxes"** (list\[np.ndarray\]): Input 9-DoF bounding boxes.

 (3) 2D Modality

-- **'img_path'** (str): Path to RGB image.
-- **'depth_img_path'** (str): Path to depth image.
-- **'intrinsic'** (np.ndarray): Camera intrinsic parameters for RGB images.
-- **'depth_intrinsic'** (np.ndarray): Camera intrinsic parameters for depth images.
-- **'extrinsic'** (np.ndarray): Camera extrinsic parameters.
+- **'img_path'** (str): File path to the RGB image.
+- **'depth_img_path'** (str): File path to the depth image.
+- **'intrinsic'** (np.ndarray): Intrinsic parameters of the camera for RGB images.
+- **'depth_intrinsic'** (np.ndarray): Intrinsic parameters of the camera for depth images.
+- **'extrinsic'** (np.ndarray): Extrinsic parameters of the camera.
 - **'visible_instance_id'** (list): IDs of visible objects in the image.

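Not part of the commit: a hypothetical sketch of reading the per-sample keys listed above. Only the key names come from this diff; the `MMScan` class and its constructor arguments follow the earlier sketch and remain assumptions.

```python
# Hypothetical sketch of inspecting one sample; the key names follow the list
# above, while the MMScan class and its constructor arguments are assumptions.
import numpy as np
from mmscan import MMScan  # assumed package/class name

dataset = MMScan(version="v1", split="train", task="MMScan-VG")  # assumed signature
sample = dataset[0]

pcds = sample["pcds"]                        # (n_points, 6): xyz + rgb
instance_labels = sample["instance_labels"]  # per-point instance IDs
for obj_id, box in sample["bboxes"].items():
    print(obj_id, box["type"], np.asarray(box["bbox"]).shape)  # 9-DoF box

# Language-modality fields for a Visual Grounding sample
print(sample["text"], sample["target_id"])
```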
### MMScan Evaluator
@@ -182,7 +190,9 @@ For the visual grounding task, our evaluator computes multiple metrics including

 - **AP and AR**: These metrics calculate the precision and recall by considering each sample as an individual category.
 - **AP_C and AR_C**: These versions categorize samples belonging to the same subclass and calculate them together.
-- **gtop-k**: An expanded metric that generalizes the traditional top-k metric, offering insights into broader performance aspects.
+- **gTop-k**: An expanded metric that generalizes the traditional Top-k metric, offering greater flexibility and interpretability than the traditional one for multi-target grounding.
+
+*Note:* Here, AP corresponds to AP<sub>sample</sub> in the paper, and AP_C corresponds to AP<sub>box</sub> in the paper.

 Below is an example of how to utilize the Visual Grounding Evaluator:
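The example referenced here sits in an unchanged part of the README and is not reproduced in this diff. As a rough placeholder, a hypothetical sketch of how such an evaluator is typically driven (the `VisualGroundingEvaluator` class, its `update`/`start_evaluation` methods, and the batch keys are assumptions):

```python
# Hypothetical evaluator sketch -- class name, methods, and batch keys are
# assumptions; consult the full README for the actual example.
import torch
from mmscan import VisualGroundingEvaluator  # assumed import

evaluator = VisualGroundingEvaluator(show_results=True)  # assumed constructor

# Assumed per-sample record: predicted 9-DoF boxes with scores plus ground truth.
batch = [{
    "index": 0,
    "ID": "sample_0",
    "subclass": "single-target",
    "pred_scores": torch.rand(10),     # confidence of each predicted box
    "pred_bboxes": torch.rand(10, 9),  # 9-DoF predicted boxes
    "gt_bboxes": torch.rand(1, 9),     # 9-DoF ground-truth boxes
}]

evaluator.update(batch)                 # accumulate predictions
metrics = evaluator.start_evaluation()  # AP / AR / AP_C / AR_C / gTop-k
print(metrics)
```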

@@ -301,11 +311,36 @@ The input structure remains the same as for the question answering evaluator:
 ]
 ```

-### Models
+## 🏆 MMScan Benchmark
+
+
+### MMScan Visual Grounding Benchmark

-We have adapted the MMScan API for some [models](./models/README.md).
+| Methods | gTop-1 | gTop-3 | AP<sub>sample</sub> | AP<sub>box</sub> | AR | Release | Download |
+|---------|--------|--------|---------------------|------------------|----|---------|----------|
+| ScanRefer | 4.74 | 9.19 | 9.49 | 2.28 | 47.68 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/Scanrefer) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
+| MVT | 7.94 | 13.07 | 13.67 | 2.50 | 86.86 | - | - |
+| BUTD-DETR | 15.24 | 20.68 | 18.58 | 9.27 | 66.62 | - | - |
+| ReGround3D | 16.35 | 26.13 | 22.89 | 5.25 | 43.24 | - | - |
+| EmbodiedScan | 19.66 | 34.00 | 29.30 | **15.18** | 59.96 | [code](https://github.com/OpenRobotLab/EmbodiedScan/tree/mmscan/models/EmbodiedScan) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |
+| 3D-VisTA | 25.38 | 35.41 | 33.47 | 6.67 | 87.52 | - | - |
+| ViL3DRef | **26.34** | **37.58** | **35.09** | 6.65 | 86.86 | - | - |
+
+### MMScan Question Answering Benchmark
+| Methods | Overall | ST-attr | ST-space | OO-attr | OO-space | OR | Advanced | Release | Download |
+|---------|---------|---------|----------|---------|----------|----|----------|---------|----------|
+| LL3DA | 45.7 | 39.1 | 58.5 | 43.6 | 55.9 | 37.1 | 24.0 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LL3DA) | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
+| LEO | 54.6 | 48.9 | 62.7 | 50.8 | 64.7 | 50.4 | 45.9 | [code](https://github.com/rbler1234/EmbodiedScan/tree/mmscan-devkit/models/LEO) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
+| LLaVA-3D | **61.6** | 58.5 | 63.5 | 56.8 | 75.6 | 58.0 | 38.5 | - | - |
+
+*Note:* These two tables only show the results for the main metrics; see the paper for complete results.
+
+We have released the code of some models under [./models](./models/README.md).

 ## 📝 TODO List

-- \[ \] More Visual Grounding baselines and Question Answering baselines.
+
+- \[ \] MMScan annotation and samples for ARKitScenes.
+- \[ \] Online evaluation platform for the MMScan benchmark.
+- \[ \] Code for more MMScan Visual Grounding and Question Answering baselines.
 - \[ \] Full release and further updates.

models/README.md

Lines changed: 35 additions & 16 deletions
@@ -2,59 +2,68 @@

 These are 3D visual grounding models adapted for the mmscan-devkit. Currently, two models have been released: EmbodiedScan and ScanRefer.

-### Scanrefer
+### ScanRefer

-1. Follow the [Scanrefer](https://github.com/daveredrum/ScanRefer/blob/master/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to download the [preprocessed GLoVE embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/`
+1. Follow the [ScanRefer](https://github.com/daveredrum/ScanRefer/blob/master/README.md) instructions to set up the environment. For data preparation, you do not need the datasets; only download the [preprocessed GloVe embeddings](https://kaldir.vc.in.tum.de/glove.p) (~990MB) and put them under `data/`.

 2. Install MMScan API.

 3. Overwrite the `lib/config.py/CONF.PATH.OUTPUT` to your desired output directory.

-4. Run the following command to train Scanrefer (one GPU):
+4. Run the following command to train ScanRefer (one GPU):

 ```bash
 python -u scripts/train.py --use_color --epoch {10/25/50}
 ```

-5. Run the following command to evaluate Scanrefer (one GPU):
+5. Run the following command to evaluate ScanRefer (one GPU):

 ```bash
 python -u scripts/train.py --use_color --eval_only --use_checkpoint "path/to/pth"
 ```
+#### Results and Models

+| Epoch | gTop-1 @ 0.25 | gTop-1 @ 0.50 | Config | Download |
+| :---: | :---: | :---: | :---: | :---: |
+| 50 | 4.74 | 2.52 | [config](https://drive.google.com/file/d/1iJtsjt4K8qhNikY8UmIfiQy1CzIaSgyU/view?usp=drive_link) | [model](https://drive.google.com/file/d/1C0-AJweXEc-cHTe9tLJ3Shgqyd44tXqY/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1ENOS2FE7fkLPWjIf9J76VgiPrn6dGKvi/view?usp=drive_link) |
 ### EmbodiedScan

-1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) to setup the Env. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved.
+1. Follow the [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/blob/main/README.md) instructions to set up the environment. Download the [Multi-View 3D Detection model's weights](https://download.openmmlab.com/mim-example/embodiedscan/mv-3ddet.pth) and change the "load_from" path in the config file under `configs/grounding` to the path where the weights are saved.

 2. Install MMScan API.

-3. Run the following command to train EmbodiedScan (multiple GPU):
+3. Run the following command to train EmbodiedScan (multiple GPUs):

 ```bash
 # Single GPU training
 python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save

-# Multiple GPU training
+# Multi-GPU training
 python tools/train.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py --work-dir=path/to/save --launcher="pytorch"
 ```

-4. Run the following command to evaluate EmbodiedScan (multiple GPU):
+4. Run the following command to evaluate EmbodiedScan (multiple GPUs):

 ```bash
 # Single GPU testing
 python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth

-# Multiple GPU testing
+# Multi-GPU testing
 python tools/test.py configs/grounding/pcd_4xb24_mmscan_vg_num256.py path/to/load_pth --launcher="pytorch"
 ```
+#### Results and Models
+
+| Input Modality | Det Pretrain | Epoch | gTop-1 @ 0.25 | gTop-1 @ 0.50 | Config | Download |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Point Cloud | &#10004; | 12 | 19.66 | 8.82 | [config](https://github.com/rbler1234/EmbodiedScan/blob/mmscan-devkit/models/EmbodiedScan/configs/grounding/pcd_4xb24_mmscan_vg_num256.py) | [model](https://drive.google.com/file/d/1F6cHY6-JVzAk6xg5s61aTT-vD-eu_4DD/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1Ua_-Z2G3g0CthbeBkrR1a7_sqg_Spd9s/view?usp=drive_link) |

 ## 3D Question Answering Models

 These are 3D question answering models adapted for the mmscan-devkit. Currently, two models have been released: LL3DA and LEO.

 ### LL3DA

-1. Follow the [LL3DA](https://github.com/Open3DA/LL3DA/blob/main/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to:
+1. Follow the [LL3DA](https://github.com/Open3DA/LL3DA/blob/main/README.md) instructions to set up the environment. For data preparation, you do not need the datasets; you only need to:

 (1) download the [release pre-trained weights.](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth) and put them under `./pretrained`

@@ -64,13 +73,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,

 3. Edit the config under `./scripts/opt-1.3b/eval.mmscanqa.sh` and `./scripts/opt-1.3b/tuning.mmscanqa.sh`

-4. Run the following command to train LL3DA (4 GPU):
+4. Run the following command to train LL3DA (4 GPUs):

 ```bash
 bash scripts/opt-1.3b/tuning.mmscanqa.sh
 ```

-5. Run the following command to evaluate LL3DA (4 GPU):
+5. Run the following command to evaluate LL3DA (4 GPUs):

 ```bash
 bash scripts/opt-1.3b/eval.mmscanqa.sh
@@ -84,10 +93,17 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,
 --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
 --nproc 4
 ```
+#### Results and Models
+
+| Detector | Captioner | Iters | Overall GPT Score | Download |
+| :---: | :---: | :---: | :---: | :---: |
+| Vote2Cap-DETR | LL3DA | 100k | 45.7 | [model](https://drive.google.com/file/d/1mcWNHdfrhdbtySBtmG-QRH1Y1y5U3PDQ/view?usp=drive_link) \| [log](https://drive.google.com/file/d/1VHpcnO0QmAvMa0HuZa83TEjU6AiFrP42/view?usp=drive_link) |
+
+

 ### LEO

-1. Follow the [LEO](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) to setup the Env. For data preparation, you need not load the datasets, only need to:
+1. Follow the [LEO](https://github.com/embodied-generalist/embodied-generalist/blob/main/README.md) instructions to set up the environment. For data preparation, you do not need the datasets; you only need to:

 (1) Download [Vicuna-7B](https://huggingface.co/huangjy-pku/vicuna-7b/tree/main) and update cfg_path in configs/llm/\*.yaml

@@ -97,13 +113,13 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,

 3. Edit the config under `scripts/train_tuning_mmscan.sh` and `scripts/test_tuning_mmscan.sh`

-4. Run the following command to train LEO (4 GPU):
+4. Run the following command to train LEO (4 GPUs):

 ```bash
 bash scripts/train_tuning_mmscan.sh
 ```

-5. Run the following command to evaluate LEO (4 GPU):
+5. Run the following command to evaluate LEO (4 GPUs):

 ```bash
 bash scripts/test_tuning_mmscan.sh
@@ -117,5 +133,8 @@ These are 3D question answering models adapted for the mmscan-devkit. Currently,
 --tmp_path path/to/tmp --api_key your_api_key --eval_size -1
 --nproc 4
 ```
+#### Results and Models

-PS : It is possible that LEO may encounter an "NaN" error in the MultiHeadAttentionSpatial module due to the training setup when training more epoches. ( no problem for 4GPU one epoch)
+| LLM | 2D Backbone | 3D Backbone | Epoch | Overall GPT Score | Config | Download |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Vicuna7b | ConvNeXt | PointNet++ | 1 | 54.6 | [config](https://drive.google.com/file/d/1CJccZd4TOaT_JdHj073UKwdA5PWUDtja/view?usp=drive_link) | [model](https://drive.google.com/drive/folders/1HZ38LwRe-1Q_VxlWy8vqvImFjtQ_b9iA?usp=drive_link) |
