
I got it to run in a distributed manner, but it's only using 1 worker while I had 2 workers #29

Open
jchen706 opened this issue Aug 20, 2020 · 17 comments

Comments

@jchen706

jchen706 commented Aug 20, 2020

I used 50 images per cluster with 100 images in total. It grouped them into 2 clusters of 66 images each, but the clusters only ran on 1 worker, not on both workers in parallel.

[screenshots]

The second worker is idle in the picture.

[screenshot]

It is stuck at 0% with 2 workers. With 1 worker on localhost the maximum time was 40 minutes; this run lasted longer than 40 minutes if we sum both workers' time.

@jchen706
Author

jchen706 commented Aug 20, 2020

Another question: I know the workers run the clusters, but does the master also run a cluster? I only see the master sending images to the workers for local SfM, so the master seems to be just the task scheduler.

Another question: the distributed mapper is the only component that runs distributed across the devices, right?

Do merging and bundle adjustment happen on the master?

Is transferring images to each worker a blocking operation?

Another question:
[screenshot]
What do the I0819 and 22381 after the time mean in the line I highlighted?

@AIBluefisher
Owner

> [quotes jchen706's report above: only 1 of 2 workers runs, stuck at 0%]

This is due to a race condition between the three threads in the master. I wrote a thread-safe distributed task controller and pushed it to the dev branch, so this issue should now be fixed.
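For reference, a minimal sketch of what a mutex-guarded task controller can look like (the class and member names here are illustrative, not the actual code in the dev branch):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Sketch of a thread-safe task controller: a cluster id is popped under a
// lock by exactly one scheduling thread, so two threads can no longer grab
// the same cluster or lose one in a race.
class TaskController {
 public:
  void AddTask(int cluster_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(cluster_id);
    cv_.notify_one();
  }

  // Blocks until a task is available; returns false after Shutdown().
  bool WaitTask(int* cluster_id) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !tasks_.empty() || done_; });
    if (tasks_.empty()) return false;
    *cluster_id = tasks_.front();
    tasks_.pop();
    return true;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    done_ = true;
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> tasks_;
  bool done_ = false;
};
```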

@AIBluefisher
Owner

> [quotes jchen706's questions above about the master's role, merging, blocking image transfer, and the log-line prefixes]

In a distributed system we must specify a master and workers. The master only takes responsibility for task scheduling; it doesn't need to do the work a worker does. You can, however, run a master and a worker on the same physical server.
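As a rough, purely illustrative sketch of that division of labour (hypothetical types and function names, not the repository's API), the master only hands out cluster tasks, and only the workers run local SfM:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct ClusterTask {
  int cluster_id;
  std::vector<std::string> image_names;
};

// Runs on a worker process (possibly on the same machine as the master).
void RunLocalSfM(const ClusterTask& task) {
  std::printf("worker: reconstructing cluster %d (%zu images)\n",
              task.cluster_id, task.image_names.size());
}

// Runs on the master: it only schedules, it never calls RunLocalSfM itself.
void ScheduleTasks(const std::vector<ClusterTask>& tasks) {
  for (const ClusterTask& task : tasks) {
    // In the real system this would be a network call to an idle worker.
    std::printf("master: sending cluster %d to a worker\n", task.cluster_id);
  }
}

int main() {
  std::vector<ClusterTask> tasks = {{0, {"a.jpg", "b.jpg"}}, {1, {"c.jpg"}}};
  ScheduleTasks(tasks);
  return 0;
}
```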

@jchen706
Author

jchen706 commented Aug 21, 2020

[screenshot]

It seems to be stuck in an infinite loop.

[screenshot]

@AIBluefisher
Owner

I forgot to commit some code. Try the newest branch!

@AIBluefisher
Owner

AIBluefisher commented Aug 21, 2020

I merged the code manually, and I currently don't have more than one machine to test the distributed code on. Just keep this issue open if you have any problems.

@AIBluefisher
Owner

Also check your log in the log directory. If you started the master and the workers correctly, you should see info like the following:

I0821 22:31:43.197476 29655 image_clustering.cpp:450] Analysing Statistics...
I0821 22:31:43.197505 29655 image_clustering.cpp:356] Images Clustering Config:
- image upperbound: 17
- completeness ratio: 0.7
- cluster type: NCUT
Images Clutering Summary:
Clusters number: 3
Total graph cutting time: 0.003766 seconds
Total graph cutting number: 1
Total graph expansion time: 0.022821 seconds
Total graph expansion number: 1
Total time took: 0.026587 seconds
Total iteration number: 0
Images number expanded from 52 to 83
Repeated Ratio: 0.596154
Edges number reduced from 952 to 447
Lost ratio: 0.530462
I0821 22:31:43.197710 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197749 29655 image_clustering.h:49] 158 edges
I0821 22:31:43.197821 29655 image_clustering.h:43] 27 nodes
I0821 22:31:43.197854 29655 image_clustering.h:49] 162 edges
I0821 22:31:43.197918 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197948 29655 image_clustering.h:49] 127 edges
I0821 22:31:43.198192 29655 distributed_mapper_controller.cpp:202] cluster 0 has 28 images.
I0821 22:31:43.428670 29655 distributed_mapper_controller.cpp:202] cluster 1 has 27 images.
I0821 22:31:43.681368 29655 distributed_mapper_controller.cpp:202] cluster 2 has 28 images.
I0821 22:31:44.132424 29658 distributed_task_manager.inl:85] Transferring images to worker #0.
I0821 22:31:44.134135 29659 distributed_task_manager.inl:111] start update running info
I0821 22:31:44.136473 29659 distributed_task_manager.inl:113] end update running info
I0821 22:31:44.342805 29658 distributed_task_manager.inl:91] Transferring images to worker #0 completed.
I0821 22:31:44.343420 29658 distributed_task_manager.h:47] Call run sfm
I0821 22:31:44.704735 29658 distributed_task_manager.h:49] end call RunSfM

Make sure your output has distributed_task_manager.h:47] Call run sfm.

@jchen706
Author

jchen706 commented Aug 22, 2020

About the feature extraction error with the Gerrard Hall dataset from COLMAP (https://colmap.github.io/datasets.html#): I just have the gerrard-hall folder with a log folder and an images folder, so it's a COLMAP error.

The feature extraction process was killed.

[screenshot]
This is the last error for me for now.

@AIBluefisher
Owner

AIBluefisher commented Aug 22, 2020

Try to use the GPU version of feature extraction. Otherwise it takes a lot of time on large-scale datasets and could be killed by the operating system.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

I met the same problem. I have 3 computers in total; I set one of them as the master and the other two as workers. But when the master started, both workers' status was IDLE.

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 IDLE 0/0 % mapping 00:00:00
1 10.134.92.104:8080 IDLE 0/0 % mapping 00:00:00
I0929 09:12:13.576000 30726 distributed_task_manager.inl:111] start update running info
I0929 09:12:13.620071 30726 distributed_task_manager.inl:113] end update running info

Have you solved this problem?

After approximately 10 minutes, worker 0 started running. However, the second worker remains IDLE.

@AIBluefisher
Owner

Could you show me the running information of the workers? Make sure the command from the master has been sent to the workers and that the workers received it. From the Progress item, it seems the data is not correctly sent or received. Maybe you should start with the distributed mode on one computer and see what's going on, so that I'm able to help you as much as I can.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

My config.txt:

1
10.134.93.68 8080 /mnt/common_storage/distributed_sfm_test/images

running information of workers:

Could not create logging file: No such file or directory
COULD NOT CREATE A LOGGINGFILE 20200929-222957.3103!I0929 22:29:57.778079 3104 worker.cpp:15] Worker get running info
I0929 22:29:59.047288 3104 worker.cpp:15] Worker get running info
I0929 22:30:00.097246 3105 worker.cpp:15] Worker get running info
I0929 22:30:01.500092 3104 worker.cpp:15] Worker get running info
I0929 22:30:02.534713 3105 worker.cpp:15] Worker get running info
I0929 22:30:03.569555 3104 worker.cpp:15] Worker get running info
I0929 22:30:04.607694 3105 worker.cpp:15] Worker get running info
I0929 22:30:05.642477 3104 worker.cpp:15] Worker get running info
I0929 22:30:06.687559 3105 worker.cpp:15] Worker get running info
I0929 22:30:07.713531 3105 worker.cpp:15] Worker get running info
I0929 22:30:08.739782 3105 worker.cpp:15] Worker get running info
I0929 22:30:09.765760 3105 worker.cpp:15] Worker get running info

Progress item:

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 IDLE 0/0 % mapping 00:00:00
I0929 14:36:15.011899 103077 distributed_task_manager.inl:111] start update running info
I0929 14:36:15.043043 103077 distributed_task_manager.inl:113] end update running info

@AIBluefisher
Owner

AIBluefisher commented Sep 29, 2020

It seems the data is not sent to the workers, since the worker status is IDLE and the progress is 0/0 (the first 0 denotes the number of registered cameras, the second 0 the total number of images on this worker). My suggestion is to add logs inside the distributed-mode function in order to locate the bug. For example, maybe the matching data is not retrieved, so there is no data to distribute and no SfM task is assigned to the workers.
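Since the log lines in this thread have glog's format, tracing the distributed path can be as simple as adding LOG(INFO)/LOG(WARNING) statements around the suspect calls. A minimal sketch, with hypothetical function and variable names rather than the repository's actual symbols:

```cpp
#include <glog/logging.h>

// Hypothetical helper used only to illustrate where extra tracing could go;
// in practice the LOG(...) lines would be added inside the distributed-mode
// functions themselves.
void LogClusterDispatch(int cluster_id, int worker_id, int num_images) {
  LOG(INFO) << "Distributing cluster " << cluster_id << " with "
            << num_images << " images to worker #" << worker_id;
  if (num_images == 0) {
    LOG(WARNING) << "Cluster " << cluster_id << " is empty, nothing to send";
  }
}

int main(int argc, char* argv[]) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;  // print to the console instead of the log dir
  LogClusterDispatch(/*cluster_id=*/0, /*worker_id=*/0, /*num_images=*/89);
  return 0;
}
```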

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

After approximately 15 minutes, the worker started to reconstruct:

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 NONIDLE 28/89 % mapping 00:08:05
I0929 14:52:14.798004 103077 distributed_task_manager.inl:111] start update running info
I0929 14:52:14.799695 103077 distributed_task_manager.inl:113] end update running info

I have 133 images in total, which are divided into 2 clusters. The first cluster has 89 images.

I set --transfer_images_to_server to 0, because all the servers are connected to a storage server and all the images are stored on it.

@AIBluefisher
Owner

> [quotes Yzhbuaa's status output above: the first worker started reconstructing after about 15 minutes]

I've been busy recently, so it will take me a while to reproduce this issue. You're encouraged to debug the code, and feel free to fix this issue.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

Okay, I will try to debug the code and my settings. Thank you!

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

The problem was solved after I set --transfer_images_to_server to 1. I am wondering how to save the image-transfer time by using the storage server's shared folder?
