GCS Connection Error #606

Open
Saurav-D opened this issue Oct 12, 2020 · 0 comments

Saurav-D commented Oct 12, 2020

Describe the bug
I got a ConnectionError during training. From the traceback, it looks like the error is raised when event_file_writer calls flush(): if the connection drops during the upload, the writer thread hangs, and training hangs with it.

Expected behavior
Training to complete without any errors.

Traceback

Exception in thread Thread-65:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 275, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 275, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorboardX/event_file_writer.py", line 219, in run
    self._record_writer.flush()
  File "/usr/local/lib/python3.7/dist-packages/tensorboardX/event_file_writer.py", line 69, in flush
    self._py_recordio_writer.flush()
  File "/usr/local/lib/python3.7/dist-packages/tensorboardX/record_writer.py", line 187, in flush
    self._writer.flush()
  File "/usr/local/lib/python3.7/dist-packages/tensorboardX/record_writer.py", line 149, in flush
    self.blob.upload_from_string(upload_buffer.getvalue())
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/storage/blob.py", line 1733, in upload_from_string
    if_metageneration_not_match=if_metageneration_not_match,
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/storage/blob.py", line 1567, in upload_from_file
    if_metageneration_not_match,
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/storage/blob.py", line 1420, in _do_upload
    if_metageneration_not_match,
  File "/usr/local/lib/python3.7/dist-packages/google/cloud/storage/blob.py", line 1098, in _do_multipart_upload
    response = upload.transmit(transport, data, object_metadata, content_type)
  File "/usr/local/lib/python3.7/dist-packages/google/resumable_media/requests/upload.py", line 106, in transmit
    retry_strategy=self._retry_strategy,
  File "/usr/local/lib/python3.7/dist-packages/google/resumable_media/requests/_helpers.py", line 136, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/usr/local/lib/python3.7/dist-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.7/dist-packages/google/auth/transport/requests.py", line 470, in request
    **kwargs
  File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

It seems there is already a PR that handles connection errors for S3: #555
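
One possible workaround is to retry the flush instead of letting the exception kill the writer thread. This is only a minimal sketch of that idea, not how tensorboardX or PR #555 does it: `flush_with_retry` and all of its parameters are hypothetical names I made up for illustration. Note that `requests.exceptions.ConnectionError` (the exception in the traceback above) does not inherit from the built-in `ConnectionError`, so it would have to be passed in explicitly through `retry_on`.

```python
import time


def flush_with_retry(flush_fn, retry_on=(ConnectionError,), max_attempts=3,
                     base_delay=0.5, sleep=time.sleep):
    """Call flush_fn, retrying with exponential backoff on the given exceptions.

    Hypothetical helper, not part of tensorboardX.
    Returns True if a flush succeeded, False if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            flush_fn()
            return True
        except retry_on as exc:
            if attempt == max_attempts:
                # Give up without raising, so the calling thread keeps running.
                print(f"flush failed after {attempt} attempts: {exc!r}")
                return False
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    return False
```

With something like this, the call site would become roughly `flush_with_retry(writer.flush, retry_on=(requests.exceptions.ConnectionError,))`, where `writer` stands in for whatever object owns the GCS upload. Returning `False` instead of re-raising is a design choice: losing a few events is probably better than a stalled training run, but the caller may prefer to log or re-raise.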
