Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrameColumn Apply and DropNulls methods #7123

Merged
merged 10 commits into from
Jun 11, 2024
Merged

Implement DataFrameColumn Apply and DropNulls methods #7123

merged 10 commits into from
Jun 11, 2024

Conversation

asmirnov82
Copy link
Contributor

@asmirnov82 asmirnov82 commented Apr 9, 2024

Fixes #7107 as was asked in #6144 (comment)

Additionaly:

  1. Fix incorrect IsNumeric method
  2. Fix error FillNulls crashes with NotImplemented exception on DataFrame with VBufferDataFrameColumn
  3. Improve speed and redesign API (legacy API marked as obsolete), internal methods are rewritten to use new API
  4. Add extra unit tests

Reasons for refactoring:

  1. Legacy implementation of ApplyElementwise is written not only to apply function on each element in the column, but also to be used for fast columns iteration, that's why it has rowIndex as one of it's arguments. This duplication of responsibilities is very confusing and absolutely not straightforward. More over, it is slow, as instead of only reading values, ApplyElementwise writes function results into into memory, also it converts read only columns to editable (and that again involves memory copy). So in some circumstances memory can be excessively copied even twice.
  2. Legacy method requires to provide function, that takes into account null values - this is not userfriendly. For working with Nulls there are already FillNulls and DropNulls methods. New method applies user function only to valuable values, this also leads to better performance

@asmirnov82
Copy link
Contributor Author

cc @JakeRadMSFT

Copy link

codecov bot commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 82 lines in your changes missing coverage. Please review.

Project coverage is 68.67%. Comparing base (4bc753a) to head (9af789d).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #7123    +/-   ##
========================================
  Coverage   68.66%   68.67%            
========================================
  Files        1262     1263     +1     
  Lines      257774   257955   +181     
  Branches    26660    26698    +38     
========================================
+ Hits       176991   177140   +149     
- Misses      73971    74000    +29     
- Partials     6812     6815     +3     
Flag Coverage Δ
Debug 68.67% <75.00%> (+<0.01%) ⬆️
production 62.93% <61.86%> (-0.01%) ⬇️
test 88.88% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.cs 85.67% <100.00%> (+0.25%) ⬆️
src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs 75.84% <ø> (ø)
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.90% <100.00%> (+<0.01%) ⬆️
...ta.Analysis.Tests/PrimitiveDataFrameColumnTests.cs 99.74% <100.00%> (+0.03%) ⬆️
....Data.Analysis.Tests/StringDataFrameColumnTests.cs 100.00% <100.00%> (ø)
test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs 99.55% <100.00%> (ø)
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 86.52% <96.96%> (+0.43%) ⬆️
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 72.53% <97.29%> (+0.09%) ⬆️
src/Microsoft.Data.Analysis/DataFrameColumn.cs 65.46% <0.00%> (+2.28%) ⬆️
...nalysis/DataFrameColumns/VBufferDataFrameColumn.cs 44.57% <30.30%> (-0.04%) ⬇️
... and 2 more

... and 5 files with indirect coverage changes

@asmirnov82 asmirnov82 changed the title Implement DataFrameColumn Apply method and DropNulls methods Implement DataFrameColumn Apply and DropNulls methods Apr 9, 2024
@@ -79,6 +79,27 @@ public void Append(string value)
Length++;
}

/// <summary>
/// Applies a function to all values in the column, that are not null.
/// </summary>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not null instead of using the isvalid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsValid can be used here as well, however for StringDataFrame column it doesn't make a lot of sense (actualy IsValid works a little bit slower in this case). Using IsValid instead of checking for null dramaticaly improves performance in case of PrimitivesDataFrame columns without Nulls (most use cases), because such columns uses validity buffers and checking for Null is very expensive and can be skiped if NullCount is 0

# Conflicts:
#	src/Microsoft.Data.Analysis/DataFrameColumns/VBufferDataFrameColumn.cs
@JakeRadMSFT
Copy link
Contributor

/azp run

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@asmirnov82
Copy link
Contributor Author

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

@JakeRadMSFT
Copy link
Contributor

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

@tarekgh ?

@tarekgh
Copy link
Member

tarekgh commented Jun 10, 2024

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

Yes, the gpt-2 link became invalid. I'll submit a PR soon to fix that.

@tarekgh
Copy link
Member

tarekgh commented Jun 10, 2024

@asmirnov82 if you are intersted, you may add the fix to this PR.

-            yield return new object[] { GPT2, @"https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken" };
+            yield return new object[] { GPT2, @"https://fossies.org/linux/misc/whisper-20231117.tar.gz/whisper-20231117/whisper/assets/gpt2.tiktoken?m=b" };

otherwise, I can submit PR for that.

@asmirnov82
Copy link
Contributor Author

@tarekgh, I implemented the fix. Now all tokenizer tests passed. Thank you for your help

@JakeRadMSFT JakeRadMSFT merged commit 0c51cc6 into dotnet:main Jun 11, 2024
25 checks passed
@asmirnov82 asmirnov82 deleted the 7107_dataframe_apply_method branch June 11, 2024 17:52
@github-actions github-actions bot locked and limited conversation to collaborators Jul 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the Apply method available to StringDataFrame column
3 participants