
I got it to run in a distributed manner, but it's only using 1 worker while I had 2 workers #29

Open
jchen706 opened this issue Aug 20, 2020 · 17 comments

Comments

@jchen706

jchen706 commented Aug 20, 2020

I used 50 images per cluster with 100 images in total. It grouped them into 2 clusters of 66 images each, but the clusters only ran on 1 worker, not on both workers in parallel.

[screenshots]

The second worker is idle in the picture.

[screenshot]

It is stuck at 0% with 2 workers. With 1 worker on localhost the maximum time was 40 minutes; this run lasted longer than 40 minutes if we sum both workers' time.

@jchen706
Author

jchen706 commented Aug 20, 2020

Another question: I know the workers run the clusters, but does the master also run a cluster? I only see the master sending images to the workers for local SfM, so the master seems to be just the task scheduler.

Another question: the distributed mapper is the only component that runs distributed across the devices, right?

Do merging and bundle adjustment happen on the master?

Is transferring images to each worker a blocking operation?

Another question:
[screenshot]
What do the I0819 and 22381 after the time mean in the line I highlighted?

@AIBluefisher
Owner

> [quotes jchen706's report above: only 1 of 2 workers runs, stuck at 0%]

This is due to a race condition between the three threads in the master. I wrote a thread-safe distributed task controller and pushed it to the dev branch, so this issue should now be fixed.
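For reference, a minimal sketch of what a mutex-guarded task controller can look like (the class and member names here are illustrative, not the actual code in the dev branch):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Sketch of a thread-safe task controller: a cluster id is popped under a
// lock by exactly one scheduling thread, so two threads can no longer grab
// the same cluster or lose one in a race.
class TaskController {
 public:
  void AddTask(int cluster_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(cluster_id);
    cv_.notify_one();
  }

  // Blocks until a task is available; returns false after Shutdown().
  bool WaitTask(int* cluster_id) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !tasks_.empty() || done_; });
    if (tasks_.empty()) return false;
    *cluster_id = tasks_.front();
    tasks_.pop();
    return true;
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    done_ = true;
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> tasks_;
  bool done_ = false;
};
```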

@AIBluefisher
Owner

> [quotes jchen706's questions above about the master's role, merging, blocking image transfer, and the log-line prefixes]

In a distributed system we must specify a master and workers. The master only takes responsibility for task scheduling; it doesn't need to do the work a worker does. You can, however, run a master and a worker on the same physical server.
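As a rough, purely illustrative sketch of that division of labour (hypothetical types and function names, not the repository's API), the master only hands out cluster tasks, and only the workers run local SfM:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct ClusterTask {
  int cluster_id;
  std::vector<std::string> image_names;
};

// Runs on a worker process (possibly on the same machine as the master).
void RunLocalSfM(const ClusterTask& task) {
  std::printf("worker: reconstructing cluster %d (%zu images)\n",
              task.cluster_id, task.image_names.size());
}

// Runs on the master: it only schedules, it never calls RunLocalSfM itself.
void ScheduleTasks(const std::vector<ClusterTask>& tasks) {
  for (const ClusterTask& task : tasks) {
    // In the real system this would be a network call to an idle worker.
    std::printf("master: sending cluster %d to a worker\n", task.cluster_id);
  }
}

int main() {
  std::vector<ClusterTask> tasks = {{0, {"a.jpg", "b.jpg"}}, {1, {"c.jpg"}}};
  ScheduleTasks(tasks);
  return 0;
}
```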

@jchen706
Author

jchen706 commented Aug 21, 2020

[screenshot]

It seems to be stuck in an infinite loop.

[screenshot]

@AIBluefisher
Owner

I forgot to commit some code. Try the newest branch!

@AIBluefisher
Owner

AIBluefisher commented Aug 21, 2020

I merged the code manually, and I currently don't have more than one machine to test the distributed code on. Just keep this issue open if you have any problems.

@AIBluefisher
Owner

Also check your log in the log directory. If you started the master and the workers correctly, you should see info like the following:

I0821 22:31:43.197476 29655 image_clustering.cpp:450] Analysing Statistics...
I0821 22:31:43.197505 29655 image_clustering.cpp:356] Images Clustering Config:
- image upperbound: 17
- completeness ratio: 0.7
- cluster type: NCUT
Images Clutering Summary:
Clusters number: 3
Total graph cutting time: 0.003766 seconds
Total graph cutting number: 1
Total graph expansion time: 0.022821 seconds
Total graph expansion number: 1
Total time took: 0.026587 seconds
Total iteration number: 0
Images number expanded from 52 to 83
Repeated Ratio: 0.596154
Edges number reduced from 952 to 447
Lost ratio: 0.530462
I0821 22:31:43.197710 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197749 29655 image_clustering.h:49] 158 edges
I0821 22:31:43.197821 29655 image_clustering.h:43] 27 nodes
I0821 22:31:43.197854 29655 image_clustering.h:49] 162 edges
I0821 22:31:43.197918 29655 image_clustering.h:43] 28 nodes
I0821 22:31:43.197948 29655 image_clustering.h:49] 127 edges
I0821 22:31:43.198192 29655 distributed_mapper_controller.cpp:202] cluster 0 has 28 images.
I0821 22:31:43.428670 29655 distributed_mapper_controller.cpp:202] cluster 1 has 27 images.
I0821 22:31:43.681368 29655 distributed_mapper_controller.cpp:202] cluster 2 has 28 images.
I0821 22:31:44.132424 29658 distributed_task_manager.inl:85] Transferring images to worker #0.
I0821 22:31:44.134135 29659 distributed_task_manager.inl:111] start update running info
I0821 22:31:44.136473 29659 distributed_task_manager.inl:113] end update running info
I0821 22:31:44.342805 29658 distributed_task_manager.inl:91] Transferring images to worker #0 completed.
I0821 22:31:44.343420 29658 distributed_task_manager.h:47] Call run sfm
I0821 22:31:44.704735 29658 distributed_task_manager.h:49] end call RunSfM

Make sure your output has distributed_task_manager.h:47] Call run sfm.

@jchen706
Author

jchen706 commented Aug 22, 2020

About the feature extraction error with the Gerrard Hall dataset from COLMAP (https://colmap.github.io/datasets.html#): I just have the gerrard-hall folder with a log folder and an images folder, so it's a COLMAP error.

The feature extraction process was killed.

[screenshot]
This is the last error for me for now.

@AIBluefisher
Owner

AIBluefisher commented Aug 22, 2020

Try to use the GPU version of feature extraction. Otherwise it takes a lot of time on large-scale datasets and could be killed by the operating system.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

I met the same problem. I have 3 computers in total; I set one of them as the master and the other two as workers. But when the master started, both workers' status was IDLE.

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 IDLE 0/0 % mapping 00:00:00
1 10.134.92.104:8080 IDLE 0/0 % mapping 00:00:00
I0929 09:12:13.576000 30726 distributed_task_manager.inl:111] start update running info
I0929 09:12:13.620071 30726 distributed_task_manager.inl:113] end update running info

Have you solved this problem?

After approximately 10 minutes, worker 0 started running. However, the second worker remains IDLE.

@AIBluefisher
Owner

Could you show me the running information of the workers? Make sure the command from the master has been sent to the workers and that the workers received it. From the Progress item, it seems the data is not correctly sent or received. Maybe you should start with the distributed mode on one computer and see what's going on, so that I'm able to help you as much as I can.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

My config.txt:

1
10.134.93.68 8080 /mnt/common_storage/distributed_sfm_test/images

running information of workers:

Could not create logging file: No such file or directory
COULD NOT CREATE A LOGGINGFILE 20200929-222957.3103!I0929 22:29:57.778079 3104 worker.cpp:15] Worker get running info
I0929 22:29:59.047288 3104 worker.cpp:15] Worker get running info
I0929 22:30:00.097246 3105 worker.cpp:15] Worker get running info
I0929 22:30:01.500092 3104 worker.cpp:15] Worker get running info
I0929 22:30:02.534713 3105 worker.cpp:15] Worker get running info
I0929 22:30:03.569555 3104 worker.cpp:15] Worker get running info
I0929 22:30:04.607694 3105 worker.cpp:15] Worker get running info
I0929 22:30:05.642477 3104 worker.cpp:15] Worker get running info
I0929 22:30:06.687559 3105 worker.cpp:15] Worker get running info
I0929 22:30:07.713531 3105 worker.cpp:15] Worker get running info
I0929 22:30:08.739782 3105 worker.cpp:15] Worker get running info
I0929 22:30:09.765760 3105 worker.cpp:15] Worker get running info

Progress item:

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 IDLE 0/0 % mapping 00:00:00
I0929 14:36:15.011899 103077 distributed_task_manager.inl:111] start update running info
I0929 14:36:15.043043 103077 distributed_task_manager.inl:113] end update running info

@AIBluefisher
Owner

AIBluefisher commented Sep 29, 2020

It seems the data is not sent to the workers, since the worker status is IDLE and the progress is 0/0 (the first 0 denotes the number of registered cameras, the second 0 the total number of images on this worker). My suggestion is to add logs inside the distributed-mode function in order to locate the bug. For example, maybe the matching data is not retrieved, so there is no data to distribute and no SfM task is assigned to the workers.
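Since the log lines in this thread have glog's format, tracing the distributed path can be as simple as adding LOG(INFO)/LOG(WARNING) statements around the suspect calls. A minimal sketch, with hypothetical function and variable names rather than the repository's actual symbols:

```cpp
#include <glog/logging.h>

// Hypothetical helper used only to illustrate where extra tracing could go;
// in practice the LOG(...) lines would be added inside the distributed-mode
// functions themselves.
void LogClusterDispatch(int cluster_id, int worker_id, int num_images) {
  LOG(INFO) << "Distributing cluster " << cluster_id << " with "
            << num_images << " images to worker #" << worker_id;
  if (num_images == 0) {
    LOG(WARNING) << "Cluster " << cluster_id << " is empty, nothing to send";
  }
}

int main(int argc, char* argv[]) {
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;  // print to the console instead of the log dir
  LogClusterDispatch(/*cluster_id=*/0, /*worker_id=*/0, /*num_images=*/89);
  return 0;
}
```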

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

After approximately 15 minutes, the worker started to reconstruct:

Cluster Id IP Worker Status Progress Task Status Time
0 10.134.93.68:8080 NONIDLE 28/89 % mapping 00:08:05
I0929 14:52:14.798004 103077 distributed_task_manager.inl:111] start update running info
I0929 14:52:14.799695 103077 distributed_task_manager.inl:113] end update running info

I have 133 images in total, which are divided into 2 clusters. The first cluster has 89 images.

I set --transfer_images_to_server to 0, because all the servers are connected to a storage server and all the images are stored on it.

@AIBluefisher
Owner

> [quotes Yzhbuaa's status output above: the first worker started reconstructing after about 15 minutes]

I've been busy recently, so it will take me a while to reproduce this issue. You're encouraged to debug the code, and feel free to fix this issue.

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

Okay, I will try to debug the code and my settings. Thank you!

@Yzhbuaa
Contributor

Yzhbuaa commented Sep 29, 2020

The problem was solved after I set --transfer_images_to_server to 1. I am wondering how to save the image-transfer time by using the storage server's shared folder?
