
Problem with firebird stream blobs #31

Open · bryancole opened this issue Oct 9, 2023 · 2 comments

Comments

bryancole commented Oct 9, 2023

I'm trying to use the new v1.10 driver with SqlAlchemy v1.4, and to load a BLOB type from my database. The problem is that SqlAlchemy iterates over the connection Cursor to obtain all rows, and then passes these rows to a "processor" object to do type conversions. The DB cursor returns BlobReader objects for each BLOB column. The BlobReader objects are file-like objects that are read later in the processor object, after the cursor iteration has finished. This used to work OK with the old FDB driver. The problem with the new driver is that the cursor object closes all BlobReader objects after the initial iteration over the cursor, so any later attempt to read the blob data fails.
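To illustrate, the access pattern that breaks looks roughly like this (a minimal sketch; the connection parameters and the documents/payload names are placeholders, not my real schema):

    from firebird.driver import connect

    # Minimal sketch of the failing pattern.
    with connect('my.fdb', user='sysdba', password='masterkey') as con:
        cur = con.cursor()
        cur.execute("SELECT doc_id, payload FROM documents")  # payload is a BLOB column
        rows = cur.fetchall()   # SqlAlchemy-style: pull all rows up front
        # ... SqlAlchemy's result processing only runs after the iteration is done ...
        for doc_id, payload in rows:
            # payload came back as a BlobReader; by this point it has already
            # been closed, so any attempt to read it fails.
            data = payload.read()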

I can get things to work if I remove these lines in the Cursor._clear() method:

    while self.__blob_readers:
        self.__blob_readers.pop().close()

I can't figure out what the "correct" DB-API compliant behaviour should be, as the DB-API doesn't describe any special handling for BLOBs or BlobReader objects. I don't really see why it should be necessary to close the BlobReaders after iteration over the cursor; it's all still within the context of the transaction.

A further related enhancement would be to add a __bytes__(self) method to the BlobReader object. Since the basic DB-API assumes the database will return simple types, a typical way to handle a BLOB would be to pass the value to the Python bytes constructor. This tries to iterate over the BlobReader using repeated calls to BlobReader.readline(), which also fails. Even if it succeeded, this is an inefficient way to convert a BLOB to bytes (unless it really is text split into lines). It is better to just call BlobReader.read() and get the entire thing as a bytes object, which is easily implemented in a __bytes__() method.
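What I have in mind is something as small as this (illustrative only, not existing driver code):

    # Hypothetical addition to BlobReader; it would make bytes(blob_reader) work.
    def __bytes__(self):
        """Materialize the whole BLOB in one read() call instead of line iteration."""
        return self.read()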

I could submit patches for these but I'm holding off to see what you think.

pcisar (Contributor) commented Oct 9, 2023

  1. BlobReaders are not closed when cursor iteration is finished, but when the cursor is closed. That's because the underlying Firebird structures are, AFAIK, related. So, until you close the cursor, all returned BlobReader instances remain valid and can be used to read BLOB values, even repeatedly. Cursors are closed either explicitly via Cursor.close(), or implicitly when: a) a new statement is executed by the cursor, b) the cursor is used as a context manager and its code block ends, c) the transaction ends (commit/rollback), or d) the connection is closed.
  2. The best method to read a BLOB value via BlobReader (read/readline/readlines/iteration) depends on the BLOB sub-type (text or binary). You can use the BlobReader.is_text() method to determine this. However, this method is valid only for the TEXT/BINARY BLOB sub-types. There are other sub-types (including user-defined ones) that could be either text or binary. In such cases you need to know these sub-types in your application and use the proper read method according to the BlobReader.sub_type value (see the sketch after this list).
  3. About introducing the __bytes__ method to BlobReader: I will not do that. BLOB values could be quite big (even gigabytes), and reading the whole value into memory is a perfect way to shoot yourself in the foot. It also defeats the whole idea of BlobReader, which exists to avoid such disasters (there is also a safety threshold, see Cursor.stream_blob_threshold in the documentation). Anyway, bytes() would work only for binary BLOBs, not for text ones: they are returned as Python str, not bytes (see the driver documentation for automatic string conversions and character set handling).
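For illustration, the intended usage pattern is roughly this (a sketch only; table and column names are placeholders):

    from firebird.driver import connect

    with connect('my.fdb', user='sysdba', password='masterkey') as con:
        cur = con.cursor()
        cur.execute("SELECT doc_id, payload FROM documents")
        for doc_id, payload in cur:
            if hasattr(payload, 'read'):        # value was returned as a BlobReader
                if payload.is_text():
                    content = ''.join(payload)  # text BLOB: line iteration works
                else:
                    content = payload.read()    # binary BLOB: use read()
            else:
                content = payload               # materialized str/bytes value
            # process 'content' here, while the cursor is still open
        cur.close()  # only now do the returned BlobReader instances become invalid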

Note that iteration over BlobReader uses readline, so it works only with text BLOBs. I considered using read() for binary BLOBs, but rejected the idea, as there is no easy way to determine the amount of data to be returned. Of course, I could define a BlobReader attribute (for example fetch_size) with a default of 1 that could be used for that, but it would only clutter the BlobReader instances and does not provide much convenience or code readability over direct use of read() in the application. But I'm open to discussing that.
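Direct use of read() in the application means roughly this for a binary BLOB (the chunk size is the application's choice; handle_chunk stands for whatever the application does with the data):

    # 'reader' is a BlobReader for a binary BLOB, obtained from the cursor.
    while True:
        chunk = reader.read(65536)   # read in 64K chunks; pick any size that suits you
        if not chunk:
            break
        handle_chunk(chunk)          # placeholder for application-specific processing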

Also note that a BLOB value does not need to be returned as a BlobReader. Normally they are returned as materialized values (either str or bytes according to the BLOB sub-type). A BlobReader is returned only when explicitly requested via the Cursor.stream_blobs list, or when the size of the value to be returned exceeds Cursor.stream_blob_threshold (default is 65536 bytes, i.e. 64K). The application needs to be prepared for that. Also note that the stream_blob_threshold value could be configured at the driver level, see the DriverConfig.stream_blob_threshold configuration option.
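For example, requesting a BlobReader explicitly looks like this (a sketch, assuming stream_blobs is used as a plain list of column names; 'PAYLOAD' is a placeholder name):

    cur = con.cursor()
    cur.stream_blobs.append('PAYLOAD')   # always return this column as a BlobReader
    cur.execute("SELECT doc_id, payload FROM documents")
    row = cur.fetchone()                 # row[1] is now a BlobReader regardless of size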

bryancole (Author) commented

Thanks for the feedback.

Of course you are right: sqlalchemy is closing the DB-API cursor immediately after iterating over it, and before accessing the data in the row object. Sadly, there does not seem to be any way to avoid this behaviour in the sqlalchemy core except by modifying the sqlalchemy code directly. I will take this question to the sqlalchemy issue list.

Ultimately, my problem of reading large-ish BLOBs using sqlalchemy is solved if I set stream_blob_threshold to a very large value. For BLOBs smaller than stream_blob_threshold, Cursor.stream_blobs overrides the default materialised-blob behaviour (as you point out). I think it would be similarly useful to have an override to force large blobs (> stream_blob_threshold) to be materialised. There is a clear use case for materialising large blobs, in cases where the calling library expects the returned data to be a simple type. E.g. sqlalchemy doesn't know anything about the existence of BlobReader objects, so how can it know that it must not close the cursor before accessing the returned object?
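Concretely, my workaround is along these lines (a sketch, assuming the driver-level option is writable via its .value attribute like the other configuration options; adjust to however DriverConfig is actually configured):

    import sys
    from firebird.driver import driver_config

    # Push the threshold high enough that my BLOBs are always materialized,
    # so sqlalchemy never sees a BlobReader.
    driver_config.stream_blob_threshold.value = sys.maxsize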

I am surprised to discover that it is not possible to pass a Python file object to the bytes constructor to read a binary file into a bytes object. I really expected this to work; I was wrong. Since the BlobReader is expected to behave the same as a file object, I think I agree that adding a __bytes__ method would be adding non-file-like behaviour. However, it is possible to iterate over a binary file object: it yields the data split on newline bytes, so a file with no newlines comes back in a single call to __next__(). By contrast, the BlobReader raises an exception: firebird.driver.types.InterfaceError: Can't read line from binary BLOB. My opinion is that iteration over a binary stream should return chunks. The chunk size could default to the system PAGESIZE, for example; the exact size is rarely important, as long as it is neither too small nor too large.
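The behaviour I have in mind could be sketched as a wrapper like this (hypothetical, not part of the driver):

    import mmap

    class ChunkedBlobIterator:
        """Hypothetical wrapper: iterate a binary BlobReader in fixed-size chunks."""

        def __init__(self, reader, chunk_size=mmap.PAGESIZE):
            self.reader = reader
            self.chunk_size = chunk_size

        def __iter__(self):
            return self

        def __next__(self):
            chunk = self.reader.read(self.chunk_size)
            if not chunk:
                raise StopIteration
            return chunk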
