Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Fix offsets for nullable columns #3611

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Jan 22, 2025

Issue and/or context:

[sc-62104]

We were adding the offset to the ArrowArray's validity buffer directly which is incorrect because the validity is stored as a bit, not as a uint8_t.

Changes:

  • Add util::bitmap_to_uint8 which takes a pointer to a bitmap, length, and offset, and returns a vector<uint8_t> representation; if the bitmap is a nullptr, then return a nullopt
  • Add helper functions ManagedQuery::_cast_bool_data, formally util::cast_bit_to_uint8, and ManagedQuery::_cast_validity_buffer, which takes the validity buffer of the passed in ArrowArray and returns vector<uint8_t> with the properly shifted offset
  • Update ManagedQuery::setup_write_column and ColumnBuffer::set_data to take in an std::optional<std::vector<uint8_t>> where the validity buffer has already been offset and casted to uint8
  • Add pytest unit testing for nullable columns with offsetting for string, Boolean, enumeration, and the general case (int, float, etc.)

Copy link

codecov bot commented Jan 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.28%. Comparing base (f3ba4e5) to head (b5e818c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3611      +/-   ##
==========================================
+ Coverage   86.26%   86.28%   +0.01%     
==========================================
  Files          55       55              
  Lines        6378     6378              
==========================================
+ Hits         5502     5503       +1     
+ Misses        876      875       -1     
Flag Coverage Δ
python 86.28% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 86.28% <ø> (+0.01%) ⬆️
libtiledbsoma ∅ <ø> (∅)

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add some additional tests:

  1. Call Sparse/Dense NDArray write with an array that has a validity buffer, and ensure it acts as expected.
  2. For DataFrame, need to verify expected behavior if an index (dimension) column is written with a validity buffer.
  3. In test_dataframe, I'd also test the "invalid" timestamp types (NaT) alongside the other nullable types.

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing requests only at this point - the C++ code looks OK (I did a quick pass only - would be good to get @XanthosXanthopoulos or @jp-dark to review it in more depth)

apis/python/tests/test_dataframe.py Outdated Show resolved Hide resolved
@nguyenv nguyenv force-pushed the viviannguyen/sc-62104/python-data-corruption-in-validity-layer branch from d89b2f4 to 4271442 Compare January 23, 2025 00:01
@jp-dark jp-dark self-requested a review January 23, 2025 14:52
Copy link
Collaborator

@jp-dark jp-dark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the behavior for dense arrays is correct, but otherwise it looks good to me.

apis/python/src/tiledbsoma/_dense_nd_array.py Outdated Show resolved Hide resolved
libtiledbsoma/src/soma/managed_query.cc Outdated Show resolved Hide resolved
@nguyenv nguyenv force-pushed the viviannguyen/sc-62104/python-data-corruption-in-validity-layer branch 2 times, most recently from 70ab5db to e8e8405 Compare January 23, 2025 21:26
@nguyenv nguyenv force-pushed the viviannguyen/sc-62104/python-data-corruption-in-validity-layer branch from e8e8405 to 043b36d Compare January 23, 2025 21:58
@nguyenv nguyenv force-pushed the viviannguyen/sc-62104/python-data-corruption-in-validity-layer branch from 043b36d to 716b705 Compare January 24, 2025 05:42
@nguyenv
Copy link
Member Author

nguyenv commented Jan 24, 2025

Updated now with all requested changes except:

Call Dense NDArray write with an array that has a validity buffer

Because DenseNDArray takes in a pa.Tensor which does not support nulls.

@nguyenv nguyenv requested a review from bkmartinjr January 24, 2025 18:12

with soma.SparseNDArray.open(uri, "w") as A:
with raises_no_typeguard(soma.SOMAError):
# soma_joinid cannot be nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, soma_data also can't be nullable - can we add a branch to test for both?

Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small (optional) suggestion. Otherwise LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants