-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement DataFrameColumn Apply and DropNulls methods #7123
Implement DataFrameColumn Apply and DropNulls methods #7123
Conversation
cc @JakeRadMSFT |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7123 +/- ##
========================================
Coverage 68.66% 68.67%
========================================
Files 1262 1263 +1
Lines 257774 257955 +181
Branches 26660 26698 +38
========================================
+ Hits 176991 177140 +149
- Misses 73971 74000 +29
- Partials 6812 6815 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@@ -79,6 +79,27 @@ public void Append(string value) | |||
Length++; | |||
} | |||
|
|||
/// <summary> | |||
/// Applies a function to all values in the column, that are not null. | |||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not null instead of using the isvalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsValid can be used here as well, however for StringDataFrame column it doesn't make a lot of sense (actualy IsValid works a little bit slower in this case). Using IsValid instead of checking for null dramaticaly improves performance in case of PrimitivesDataFrame columns without Nulls (most use cases), because such columns uses validity buffers and checking for Null is very expensive and can be skiped if NullCount is 0
# Conflicts: # src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url |
@tarekgh ? |
Yes, the gpt-2 link became invalid. I'll submit a PR soon to fix that. |
@asmirnov82 if you are intersted, you may add the fix to this PR. - yield return new object[] { GPT2, @"https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken" };
+ yield return new object[] { GPT2, @"https://fossies.org/linux/misc/whisper-20231117.tar.gz/whisper-20231117/whisper/assets/gpt2.tiktoken?m=b" }; otherwise, I can submit PR for that. |
@tarekgh, I implemented the fix. Now all tokenizer tests passed. Thank you for your help |
Fixes #7107 as was asked in #6144 (comment)
Additionaly:
Reasons for refactoring: