CDL: cannot redownload additional years #2007

Open
robmarkcole opened this issue Apr 17, 2024 · 20 comments
Labels
datasets Geospatial or benchmark datasets

Comments

@robmarkcole
Contributor

Description

image
The data should be downloading, but over an hour in, nothing has happened.

Steps to reproduce

from torchgeo.datasets import CDL

dataset = CDL(years=[2022], download=True)

Version

0.6.0.dev0

@robmarkcole
Contributor Author

Nothing is wrong with my connection; I can download the file manually.

image

@adamjstewart added the "datasets" (Geospatial or benchmark datasets) label Apr 17, 2024
@adamjstewart
Collaborator

I'm unable to reproduce this issue. I tried both 2017 and 2022 and both downloaded fine on my system. What version of torchvision are you using? Can you try upgrading to the newest version?

@robmarkcole
Contributor Author

robmarkcole commented Apr 17, 2024

I have torchvision==0.17.1+cu121

I upgraded, and the cell now executes immediately, but no data is downloaded (for 2022).

image

@adamjstewart
Collaborator

Is it possible that you already have some CDL data somewhere in that folder recursively?

@robmarkcole
Contributor Author

I don't see anything:

⚡ ~ find data -type f 
data/2017_30m_cdls.aux
data/2017_30m_cdls.tfw
data/Metadata_Cropland-Data-Layer.htm
data/2017_30m_cdls.zip
data/2017_30m_cdls.tif
data/2017_30m_cdls.tif.ovr

Also, even the manually downloaded dataset doesn't look correct. Shouldn't this work?
image

@adamjstewart
Collaborator

Your screenshot doesn't contain the full stack trace, and I also can't copy-n-paste error messages from screenshots...

@robmarkcole
Contributor Author



'0.6.0.dev0'
CDL Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=-127.88721217969017, maxx=-65.34561975376272, miny=22.94022503977174, maxy=51.60512156832182, mint=1483228800.0, maxt=1514764799.999999)
    size: 1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 4
      1 sampler = RandomGeoSampler(dataset, size=224, length=3)
      2 dataloader = DataLoader(dataset, sampler=sampler, collate_fn=stack_samples)
----> 4 for batch in dataloader:
      5     sample = unbind_samples(batch)[0]
      6     dataset.plot(sample)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
--> 674     index = self._next_index()  # may raise StopIteration
    675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:621, in _BaseDataLoaderIter._next_index(self)
    620 def _next_index(self):
--> 621     return next(self._sampler_iter)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/sampler.py:287, in BatchSampler.__iter__(self)
    285 batch = [0] * self.batch_size
    286 idx_in_batch = 0
--> 287 for idx in self.sampler:
    288     batch[idx_in_batch] = idx
    289     idx_in_batch += 1

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/samplers/single.py:140, in RandomGeoSampler.__iter__(self)
    133 """Return the index of a dataset.
    134
    135 Returns:
    136     (minx, maxx, miny, maxy, mint, maxt) coordinates to index a dataset
    137 """
    138 for _ in range(len(self)):
    139     # Choose a random tile, weighted by area
--> 140     idx = torch.multinomial(self.areas, 1)
    141     hit = self.hits[idx]
    142     bounds = BoundingBox(*hit.bounds)

RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement

@adamjstewart
Collaborator

Never seen this error before, interesting...

We still need to figure out how to reproduce this. Are you able to reproduce this in Google Colab or some other shared computing resource I can access? That will make it easier to debug.

@robmarkcole
Contributor Author

If you create an account on https://lightning.ai/ I can grant you access!

@calebrob6
Member

I can't reproduce this locally with the main branch.

image

@calebrob6
Member

One thing I am noticing is that the bounds shown in the output of your print(dataset) seem to be in lat/lon while mine are not:

bbox: BoundingBox(minx=-2356095.0, maxx=2258235.0, miny=276915.0, maxy=3172605.0, mint=1483228800.0, maxt=1514764799.999999)

Is there anything else in the data/ directory?

@yichiac
Contributor

yichiac commented Apr 17, 2024

I cannot reproduce the issue either. The dataset can be downloaded immediately. I did find, however, that additional years can't be downloaded once some years have already been downloaded. For example:

from torchgeo.datasets import CDL
dataset = CDL(years=[2022], download=True)

This downloads the requested year without issues. But if I restart the terminal and run

from torchgeo.datasets import CDL
dataset = CDL(years=[2023], download=True)

it won't download anything. It seems that the download function only works the first time, when the data directory doesn't contain any downloaded CDL files. The issue is not tied to particular years; I tried different combinations of years.

@calebrob6
Member

calebrob6 commented Apr 17, 2024

One bug here is that if I do:
dataset = CDL(paths="data/", years=[2017], download=True)

and the data/ directory is empty, then the 2017 layer is downloaded as expected. However, if I then do:

dataset = CDL(paths="data/", years=[2023], download=True)

the second download of the 2023 layer does not happen.

Edit: It seems @yichiac and I discovered this at the same time 🙂

In ._verify(self), the following code should take into account the layers currently requested:

pathname = os.path.join(
    self.paths, self.zipfile_glob.replace("*", str(year))
)

@robmarkcole
Contributor Author

I can confirm (for my own sanity) that I only see this bug on lightning.ai. I will ask them.

image

@adamjstewart
Collaborator

The problem is actually higher up:

# Check if the extracted files already exist
if self.files:
    return

If any CDL files are found, the method exits, even if the specific years you requested aren't there. This broke in #1442. The fix would be to check for the specific years requested. However, this is difficult if you can't know whether paths is a directory or a list of files. Anyone want to take a stab at fixing this?
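As a rough illustration of what "check for the specific years requested" could look like, here is a minimal sketch. It is not torchgeo code: the helper name `missing_years` and the `filename_template` parameter are hypothetical, the file naming is inferred from the `2017_30m_cdls.tif` listing earlier in this thread, and unlike torchgeo it does not search directories recursively. It accepts either a single directory or a list of directories and files, which is the ambiguity mentioned above.

```python
import os
import tempfile


def missing_years(paths, years, filename_template="{}_30m_cdls.tif"):
    """Return the requested years for which no extracted raster is found.

    Hypothetical helper: `paths` may be one directory or an iterable of
    directories and/or explicit file paths.
    """
    if isinstance(paths, (str, os.PathLike)):
        paths = [paths]
    found = set()
    for path in paths:
        for year in years:
            name = filename_template.format(year)
            if os.path.isdir(path):
                # Directory: look for the per-year file directly inside it
                if os.path.exists(os.path.join(path, name)):
                    found.add(year)
            elif os.path.basename(path) == name and os.path.exists(path):
                # Explicit file path: match on its basename
                found.add(year)
    return [year for year in years if year not in found]


# Usage: a directory that already holds 2017 but not 2023
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "2017_30m_cdls.tif"), "w").close()
    print(missing_years(root, [2017, 2023]))  # only 2023 is still missing
```

With a check along these lines, the early return would only trigger when the returned list is empty, and the download step could be limited to the years that are actually missing.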

@adamjstewart changed the title from "CDL dataset not downloading" to "CDL: cannot redownload additional years" Apr 17, 2024
@calebrob6
Member

> The problem is actually higher up:

Yes, I just discovered this as well.

@robmarkcole
Contributor Author

robmarkcole commented Apr 18, 2024

I found that if I run the command in a terminal (rather than Jupyter) I get a warning. Here I pointed it at a fresh directory (data2):

>>> from torchgeo.datasets import CDL
>>> dataset = CDL(paths='/teamspace/studios/this_studio/data2/', years=[2010], download=True, checksum=False, crs="EPSG:4326") 
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/geo.py:313: UserWarning: Could not find any relevant files for provided path '/teamspace/studios/this_studio/data2/'. Path was ignored.
  warnings.warn(

It appears to be ignoring the path and hanging. If I interrupt and rerun the command, I do not get the warning.
On keyboard interrupt I get the following:

/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/geo.py:313: UserWarning: Could not find any relevant files for provided path 'data'. Path was ignored.
  warnings.warn(
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[2], line 2
      1 # dataset = CDL(years=[2017], download=False, checksum=False, crs="EPSG:4326") # manually downloaded
----> 2 dataset = CDL(years=[2020], download=True, checksum=False, crs="EPSG:4326") # 

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:263, in CDL.__init__(self, paths, crs, res, years, classes, transforms, cache, download, checksum)
    260 self.ordinal_map = torch.zeros(max(self.cmap.keys()) + 1, dtype=self.dtype)
    261 self.ordinal_cmap = torch.zeros((len(self.classes), 4), dtype=torch.uint8)
--> 263 self._verify()
    265 super().__init__(paths, crs, res, transforms=transforms, cache=cache)
    267 # Map chosen classes to ordinal numbers, all others mapped to background class

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:315, in CDL._verify(self)
    312     raise DatasetNotFoundError(self)
    314 # Download the dataset
--> 315 self._download()
    316 self._extract()

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchgeo/datasets/cdl.py:321, in CDL._download(self)
    319 """Download the dataset."""
    320 for year in self.years:
--> 321     download_url(
    322         self.url.format(year),
    323         self.paths,
    324         md5=self.md5s[year] if self.checksum else None,
    325     )

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/datasets/utils.py:130, in download_url(url, root, filename, md5, max_redirect_hops)
    127     _download_file_from_remote_location(fpath, url)
    128 else:
    129     # expand redirect chain if needed
--> 130     url = _get_redirect_url(url, max_hops=max_redirect_hops)
    132     # check if file is located on Google Drive
    133     file_id = _get_google_drive_file_id(url)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torchvision/datasets/utils.py:78, in _get_redirect_url(url, max_hops)
     75 headers = {"Method": "HEAD", "User-Agent": USER_AGENT}
     77 for _ in range(max_hops + 1):
---> 78     with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
     79         if response.url == url or response.url is None:
     80             return url

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    214 else:
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout)
    516     req = meth(req)
    518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 519 response = self._open(req, data)
    521 # post-process response
    522 meth_name = protocol+"_response"

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:536, in OpenerDirector._open(self, req, data)
    533     return result
    535 protocol = req.type
--> 536 result = self._call_chain(self.handle_open, protocol, protocol +
    537                           '_open', req)
    538 if result:
    539     return result

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    494 for handler in handlers:
    495     func = getattr(handler, meth_name)
--> 496     result = func(*args)
    497     if result is not None:
    498         return result

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:1391, in HTTPSHandler.https_open(self, req)
   1390 def https_open(self, req):
-> 1391     return self.do_open(http.client.HTTPSConnection, req,
   1392         context=self._context, check_hostname=self._check_hostname)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/urllib/request.py:1352, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1350     except OSError as err: # timeout error
   1351         raise URLError(err)
-> 1352     r = h.getresponse()
   1353 except:
   1354     h.close()

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:1374, in HTTPConnection.getresponse(self)
   1372 try:
   1373     try:
-> 1374         response.begin()
   1375     except ConnectionError:
   1376         self.close()

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:318, in HTTPResponse.begin(self)
    316 # read until we get a non-100 response
    317 while True:
--> 318     version, status, reason = self._read_status()
    319     if status != CONTINUE:
    320         break

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/http/client.py:279, in HTTPResponse._read_status(self)
    278 def _read_status(self):
--> 279     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    280     if len(line) > _MAXLINE:
    281         raise LineTooLong("status line")

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/socket.py:705, in SocketIO.readinto(self, b)
    703 while True:
    704     try:
--> 705         return self._sock.recv_into(b)
    706     except timeout:
    707         self._timeout_occurred = True

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/ssl.py:1274, in SSLSocket.recv_into(self, buffer, nbytes, flags)
   1270     if flags != 0:
   1271         raise ValueError(
   1272           "non-zero flags not allowed in calls to recv_into() on %s" %
   1273           self.__class__)
-> 1274     return self.read(nbytes, buffer)
   1275 else:
   1276     return super().recv_into(buffer, nbytes, flags)

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/ssl.py:1130, in SSLSocket.read(self, len, buffer)
   1128 try:
   1129     if buffer is not None:
-> 1130         return self._sslobj.read(len, buffer)
   1131     else:
   1132         return self._sslobj.read(len)

KeyboardInterrupt: 

@tchaton

tchaton commented Apr 18, 2024

Hey,

I can reproduce the same issue in a Studio on Lightning.Ai. The hang seems to come from torchvision.

Here is a minimal repro:

import urllib
import urllib.error
import urllib.request

USER_AGENT = "pytorch/vision"

def _get_redirect_url(url: str, max_hops: int = 3) -> str:
    initial_url = url
    headers = {"Method": "HEAD", "User-Agent": USER_AGENT}

    for _ in range(max_hops + 1):
        with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
            if response.url == url or response.url is None:
                return url

            url = response.url
    else:
        raise RecursionError(
            f"Request to {initial_url} exceeded {max_hops} redirects. The last redirect points to {url}."
        )


url = "https://www.nass.usda.gov/Research_and_Science/Cropland/Release/datasets/2022_30m_cdls.zip"
resolved_url = _get_redirect_url(url)  # this call hangs in the Studio
print(resolved_url)

@tchaton

tchaton commented Apr 18, 2024

Interestingly enough, it works if I remove the "User-Agent": USER_AGENT from the headers.

Screenshot 2024-04-18 at 09 22 38
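To make the comparison easy to repeat, the two variants can be built side by side. The sketch below is only an illustration: `build_request` and `resolve` are hypothetical helper names, the example URL is a placeholder, and the `timeout` is my own addition so that a stalled connection raises an error instead of blocking indefinitely. Only the request objects are constructed here; `resolve()` is defined but not called because it needs network access.

```python
import urllib.request

USER_AGENT = "pytorch/vision"


def build_request(url, send_user_agent=True):
    """Build the same kind of request as the repro, optionally without User-Agent."""
    headers = {"Method": "HEAD"}
    if send_user_agent:
        headers["User-Agent"] = USER_AGENT
    return urllib.request.Request(url, headers=headers)


def resolve(url, send_user_agent=True, timeout=10.0):
    """Return the final URL after redirects, giving up after `timeout` seconds."""
    request = build_request(url, send_user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.url


# urllib stores header names capitalized, hence "User-agent" in the lookup
with_agent = build_request("https://example.com/file.zip")
without_agent = build_request("https://example.com/file.zip", send_user_agent=False)
print(with_agent.has_header("User-agent"), without_agent.has_header("User-agent"))
```

Calling `resolve(url)` and `resolve(url, send_user_agent=False)` from the affected machine should then show whether the header alone makes the difference.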

@robmarkcole
Contributor Author

robmarkcole commented Apr 18, 2024

A temporary workaround on lightning.ai, thanks to @tchaton:

from torchgeo.datasets import CDL

# Apply patch to drop User-Agent until we figure out why it hangs
from torchvision.datasets.utils import urllib

original_request = urllib.request.Request


def Request(*args, headers=None, **kwargs):
    # No headers passed by keyword: nothing to strip, defer to the original
    if headers is None:
        return original_request(*args, **kwargs)
    # Build a filtered copy so the caller's dict is not mutated
    headers = {k: v for k, v in headers.items() if k != "User-Agent"}
    return original_request(*args, headers=headers, **kwargs)


urllib.request.Request = Request

dataset = CDL(years=[2022], download=True, paths="./data")

print(dataset)

However, when I go to plot a sample I get the error:

RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement

I suspect this error is caused by setting a crs that differs from the dataset's native crs; when I don't set it, there is no error.
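One possible mechanism, consistent with that suspicion but not confirmed anywhere in this thread, is that the patch size is converted from pixels to CRS units using a resolution that is still expressed in the native units, so that no tile is large enough once the bounds are in degrees. The sketch below only checks the arithmetic with the two bounding boxes printed earlier. The function `patch_fits` is hypothetical, and both the resolution of 30 (taken from the `30m` in the file names) and the idea that the sampler skips tiles smaller than the patch are assumptions.

```python
def patch_fits(width, height, size_px, res):
    """Return True if a square patch of size_px pixels at resolution res fits in the tile."""
    patch = size_px * res  # patch edge length in CRS units
    return width >= patch and height >= patch


# Tile extents taken from the two bounding boxes printed earlier in this thread
latlon_width = -65.34561975376272 - (-127.88721217969017)  # degrees
latlon_height = 51.60512156832182 - 22.94022503977174      # degrees
native_width = 2258235.0 - (-2356095.0)                     # projected units
native_height = 3172605.0 - 276915.0                        # projected units

# Assumption: res stays at 30 units per pixel in both cases
print(patch_fits(native_width, native_height, size_px=224, res=30))
print(patch_fits(latlon_width, latlon_height, size_px=224, res=30))
```

If that is what happens, the sampler would be left with an empty list of candidate tiles, and drawing one sample from an empty distribution would fail. Passing an explicit `res` in degrees alongside `crs="EPSG:4326"` might be worth trying, though that is untested here.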


5 participants