This repository has been archived by the owner on Jul 28, 2024. It is now read-only.

Iterating over api results in no images downloaded #16

Closed
robmarkcole opened this issue Apr 1, 2022 · 16 comments
Labels
bug, help wanted

Comments

@robmarkcole

Describe the bug
I have a pandas dataframe with locations I wish to download tiles for, and I download a limited area around each location. However, when I place the call in a loop, I often find that no images are downloaded. Deleting the chromedriver and retrying can fix the issue, but not always.

To Reproduce

import time
from tqdm import tqdm
from jimutmap import api

# df is a pandas DataFrame with 'lat' and 'lon' columns; img_dir is the output folder
for index in tqdm(range(len(df))):
    test_lat = df.iloc[index]['lat']
    test_lon = df.iloc[index]['lon']
    extent = 0.01

    min_lat_deg = test_lat - extent
    max_lat_deg = test_lat + extent
    min_lon_deg = test_lon - extent
    max_lon_deg = test_lon + extent

    print(min_lat_deg, max_lat_deg, min_lon_deg, max_lon_deg)

    download_obj = api(
        min_lat_deg = min_lat_deg, # min_lat,
        max_lat_deg = max_lat_deg, # max_lat,
        min_lon_deg = min_lon_deg, # min_lon,
        max_lon_deg = max_lon_deg, # max_lon,
        zoom = 16, # 0 is min, 17 is good
        verbose = False,
        threads_ = 5, 
        container_dir = img_dir
        )

    download_obj.download(getMasks = False)
    time.sleep(2) # wait for download to finish

Expected behavior
Either an exception is raised if there are no images to download, or some mechanism is available to retry
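As a stop-gap, a caller-side retry along these lines is roughly what I have in mind (an untested sketch; the .jpeg glob and the retry counts are placeholder assumptions, not jimutmap behaviour):

import glob
import os
import time

def download_with_retry(download_obj, container_dir, retries=3, wait_s=5):
    # count the tiles on disk before and after each attempt; retry if nothing new appeared
    for attempt in range(1, retries + 1):
        before = len(glob.glob(os.path.join(container_dir, "*.jpeg")))
        download_obj.download(getMasks=False)
        time.sleep(wait_s)  # give the worker threads a moment to flush files to disk
        after = len(glob.glob(os.path.join(container_dir, "*.jpeg")))
        if after > before:
            return after - before  # number of new tiles written this attempt
        print("Attempt {}: no new tiles, retrying...".format(attempt))
    raise RuntimeError("No images downloaded after {} attempts".format(retries))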

Screenshots
NA

Desktop (please complete the following information):

  • OS: macOS
  • Browser: chrome
  • Version: jimutmap==1.3.9


@welcome

welcome bot commented Apr 1, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@Jimut123 Jimut123 added the bug and help wanted labels on Apr 1, 2022
@Jimut123
Owner

Jimut123 commented Apr 1, 2022

I may have run into this bug before but ignored it, assuming it was a network connection issue. Here is a workable idea, though I am not sure it will be efficient: keep a temporary database for each run that stores each tile ID together with a marker for the tile image and one for the corresponding road mask, and keep retrying the download until every marker is set to 1.

ID              Road  Mask
XXXXX1_YYYYY1   1     1
XXXXX1_YYYYY2   1     1
XXXXX1_YYYYY3   0     0
...             ...   ...

I am not sure this is a suitable solution. It might also happen that some tiles never download at all, so we would need to repeat the process. Please share suggestions if you see a better way to mitigate the problem :)
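For illustration, a minimal sketch of that marker table using sqlite3 (the table layout and the fetch_tile helper are hypothetical stand-ins, not jimutmap internals):

import sqlite3

def init_db(path="tile_status.sqlite"):
    # one row per tile; the flags flip to 1 once that asset is confirmed on disk
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS tiles "
                "(id TEXT PRIMARY KEY, road INTEGER DEFAULT 0, mask INTEGER DEFAULT 0)")
    return con

def pending(con):
    # tiles whose image or road mask has not been confirmed yet
    return [row[0] for row in con.execute("SELECT id FROM tiles WHERE road = 0 OR mask = 0")]

def run_until_done(con, fetch_tile, max_rounds=10):
    # keep retrying only the unmarked tiles until every marker is set to 1
    for _ in range(max_rounds):
        todo = pending(con)
        if not todo:
            return True
        for tile_id in todo:
            ok_road, ok_mask = fetch_tile(tile_id)  # hypothetical per-tile downloader
            con.execute("UPDATE tiles SET road = ?, mask = ? WHERE id = ?",
                        (int(ok_road), int(ok_mask), tile_id))
        con.commit()
    return False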

@robmarkcole
Author

It appears the first 2-3 requests go through, then subsequent ones do not. I've tried adding some sleep, without success. Could it be a block on the IP or an issue with the auth?

Your suggestion is essentially to use retries? I've used that approach in a previous role as a data engineer.

@Jimut123
Owner

Jimut123 commented Apr 1, 2022

I am thinking of retries, since that was my first idea on reading the issue. If the problem is an IP block or authentication, I am not sure retries will work. It might also be an issue with the multiprocessing.pool library: when requests pile up, I assume the operating system may kill some of them internally. In that case, retries would be our only option :)
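A rough sketch of what retries on top of the pool could look like (fetch_tile here is a hypothetical stand-in for the per-tile request, not the library's actual worker):

from multiprocessing.pool import ThreadPool

def fetch_with_retry(fetch_tile, tile_id, attempts=3):
    # try a single tile a few times before giving up on it
    for _ in range(attempts):
        if fetch_tile(tile_id):
            return True
    return False

def download_all(tile_ids, fetch_tile, threads=5):
    # run the per-tile retries in a bounded thread pool
    with ThreadPool(threads) as pool:
        results = pool.starmap(fetch_with_retry, [(fetch_tile, t) for t in tile_ids])
    # report whichever tiles still failed after all retries
    return [t for t, ok in zip(tile_ids, results) if not ok]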

@Jimut123 Jimut123 pinned this issue Apr 1, 2022
@robmarkcole
Author

Yes, I suspect something to do with multiprocessing. I set threads_=1 but the issue persists. I may try using Celery or AWS Lambda to spread the load. Feel free to close this issue if you want.

@Jimut123
Owner

Jimut123 commented Apr 1, 2022

I will probably try to find a solution by next week.
Let's see.
If you solve it before then, please feel free to share the solution by opening a PR :)

Best,
-Jimut

@robmarkcole
Author

OK, today threads_=1 appears to be working...

@Jimut123
Owner

Jimut123 commented Apr 2, 2022

I think using at most the number of threads the CPU offers (4 in my case) should also work.

Thanks for pinpointing the issue; now I am sure it is a threading issue. The only problem with threads=1 is that it will be much slower than higher thread counts, since it works through the tiles sequentially. But raising the thread count above what the hardware supports can also slow the machine down considerably, on both Linux and Windows, and may even result in a deadlock (hang).

Retries will also slow things down, since we check repeatedly. It looks like I will have to use some buffer mechanism that selectively retries links by using the database. That will slow things down considerably, but combining multiprocessing with retries may solve the issue. I have an exam tomorrow; let's see, I hope to come up with a workable, efficient solution by next week.
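The thread cap I have in mind is essentially this (an illustrative sketch only, not the actual implementation):

import multiprocessing

def clamp_threads(requested):
    # never ask for more worker threads than the CPU actually provides
    available = multiprocessing.cpu_count()
    if requested > available:
        print("Sorry, {} -- threads unavailable, using maximum CPU threads : {}".format(requested, available))
        return available
    return requested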

@Jimut123
Owner

Jimut123 commented Apr 2, 2022

Eventually I will have to use a database anyway, since I have to write the image-stitching module at some point.

This tool was created with the hypothetical idea of converting 2D satellite images to 3D using GANs and other related unsupervised deep learning techniques (back in 2019).

I am not sure when I will get time to work on this project for that original purpose :) But I will come up with a solution to the present bug by next week.

@robmarkcole
Author

I'm quite happy to leave it running over a weekend on my Mac, so speed is not my main concern. Are the generated filenames unique? One suggestion: the download method could return a dictionary of the created files, the request, etc. This could then be appended to a pandas DataFrame, inserted into an SQLite db, etc.
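Roughly what I have in mind, as an untested sketch (download() returns nothing today, so this just diffs the output directory to work out which files each call created):

import glob
import os
import pandas as pd

def record_download(download_obj, container_dir, lat, lon):
    # snapshot the directory, run the download, and report only the files that appeared
    before = set(glob.glob(os.path.join(container_dir, "*")))
    download_obj.download(getMasks=False)
    created = sorted(set(glob.glob(os.path.join(container_dir, "*"))) - before)
    return {"lat": lat, "lon": lon, "n_files": len(created), "files": created}

records = []
# inside the loop over df: records.append(record_download(download_obj, img_dir, test_lat, test_lon))
results_df = pd.DataFrame(records)  # or push the same records into an SQLite db with results_df.to_sql(...)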

@Jimut123
Owner

Jimut123 commented Apr 4, 2022

Hi, it should work now. I created a dirty patch; it will probably be a bit slow to start. The patch caps the thread count at the maximum number of threads your CPU provides, so the hardware limit is never exceeded.

Be sure to install the latest version using pip, then check the test.py file below and update your code accordingly.

"""
Jimut Bahan Pal
First updated : 22-03-2021
Last updated : 04-04-2022
"""

import os
import glob
import shutil
from jimutmap import api, sanity_check


download_obj = api(min_lat_deg = 10,
                      max_lat_deg = 10.01,
                      min_lon_deg = 10,
                      max_lon_deg = 10.01,
                      zoom = 19,
                      verbose = False,
                      threads_ = 50, 
                      container_dir = "myOutputFolder")

# If you don't have Chrome and can't take advantage of the auto access key fetch, set
# a.ac_key = ACCESS_KEY_STRING
# here

# getMasks = False if you just need the tiles 
download_obj.download(getMasks = True)

# create the object of class jimutmap's api
sanity_obj = api(min_lat_deg = 10,
                      max_lat_deg = 10.01,
                      min_lon_deg = 10,
                      max_lon_deg = 10.01,
                      zoom = 19,
                      verbose = False,
                      threads_ = 50, 
                      container_dir = "myOutputFolder")

sanity_check(min_lat_deg = 10,
                max_lat_deg = 10.01,
                min_lon_deg = 10,
                max_lon_deg = 10.01,
                zoom = 19,
                verbose = False,
                threads_ = 50, 
                container_dir = "myOutputFolder")

print("Cleaning up... hold on")

sqlite_temp_files = glob.glob('*.sqlite*')

print("Temporary sqlite files to be deleted = {} ? ".format(sqlite_temp_files))
inp = input("(y/N) : ")
if inp == 'y' or inp == 'yes' or inp == 'Y':
    for item in sqlite_temp_files:
        os.remove(item)



## Try to remove the tree; if it fails, show an error on screen using try...except
try:
    chromedriver_folders = glob.glob('[0-9]*')
    print("Temporary chromedriver folders to be deleted = {} ? ".format(chromedriver_folders))
    inp = input("(y/N) : ")
    if inp == 'y' or inp == 'yes' or inp == 'Y':
        for item in chromedriver_folders:
            shutil.rmtree(item)
except OSError as e:
    print("Error: %s - %s." % (e.filename, e.strerror))

Kindly tell me whether it works or not.

Note: This patch will force download all the road masks too.

@robmarkcole
Author

robmarkcole commented Apr 4, 2022

Tried 1.4.0 but the issue persists. I set threads=20 and got this nice warning: Sorry, 20 -- threads unavailable, using maximum CPU threads : 8

Running test.py:

(venv) robin@Robins-MacBook-Pro dataset-global-solar-plant-locations % python3 test.py 
Initializing jimutmap ... Please wait...
Sorry, 50 -- threads unavailable, using maximum CPU threads : 8
Initializing jimutmap ... Please wait...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 332.47it/s]
Sorry, 50 -- threads unavailable, using maximum CPU threads : 8
Initializing jimutmap ... Please wait...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 4418.78it/s]
Total satellite images to be downloaded =  225
Total roads tiles to be downloaded =  225
Approx. estimated disk space required = 4.39453125 MB
Total number of satellite images needed to be downloaded =  225
Total number of satellite images needed to be downloaded =  225
Batch =============================================================================  1
===================================================================================
Sorry, 50 -- threads unavailable, using maximum CPU threads : 8
Downloading all the satellite tiles: 
Updating sanity db ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 1817.74it/s]
Total number of satellite images needed to be downloaded =  210
Total number of satellite images needed to be downloaded =  210
Waiting for 15 seconds... Busy downloading
Batch =============================================================================  2
===================================================================================
Downloading all the satellite tiles: 
Updating sanity db ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [00:00<00:00, 177724.75it/s]
Total number of satellite images needed to be downloaded =  0
Total number of satellite images needed to be downloaded =  0
************************* Download Sucessful *************************
Cleaning up... hold on
Temporary sqlite files to be deleted = ['temp_sanity.sqlite'] ? 
(y/N) : y
Temporary chromedriver folders to be deleted = ['99'] ? 
(y/N) : y

@Jimut123
Owner

Jimut123 commented Apr 4, 2022

Could you please tell me how many files you expect to be downloaded, and how many of them are actually being downloaded?

I think increasing the sleep in your code might fix the issue:

for index in tqdm(range(len(df))):
    test_lat = df.iloc[index]['lat']
    test_lon = df.iloc[index]['lon']
    extent = 0.01

    min_lat_deg = test_lat - extent
    max_lat_deg = test_lat + extent
    min_lon_deg = test_lon - extent
    max_lon_deg = test_lon + extent

    print(min_lat_deg, max_lat_deg, min_lon_deg, max_lon_deg)

    download_obj = api(
        min_lat_deg = min_lat_deg, # min_lat,
        max_lat_deg = max_lat_deg, # max_lat,
        min_lon_deg = min_lon_deg, # min_lon,
        max_lon_deg = max_lon_deg, # max_lon,
        zoom = 16, # 0 is min, 17 is good
        verbose = False,
        threads_ = 5, 
        container_dir = img_dir
        )

    download_obj.download(getMasks = False)
    time.sleep(100) # wait for download to finish

@robmarkcole
Author

If I use threads=8, images are downloaded in the first iteration but not in subsequent ones. If I use threads=1, images are downloaded at every iteration.

@Jimut123
Owner

Jimut123 commented Apr 4, 2022

I am not sure about this. Sorry, I couldn't solve it; I give up. It is probably a multiprocessing issue.
How long does it take to download all the files using threads=1?

@robmarkcole
Author

Using threads=1 I left it running overnight and it completed. OK, thanks for looking into this; I will consider other ways to parallelize if I need to in the future. Cheers
