Improved README file
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
shahrokhDaijavad committed Feb 6, 2025
1 parent 7e84256 commit e8e7e8e
Showing 1 changed file with 14 additions and 7 deletions.
21 changes: 14 additions & 7 deletions transforms/universal/bloom/README.md
@@ -1,15 +1,23 @@
# Bloom Annotation
Please see the set of [transform project conventions](../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up.

## Summary
IBM recently introduced GneissWeb, a large dataset of around 10 trillion tokens that
caters to the data quality and quantity requirements of training LLMs. Models trained on the GneissWeb dataset outperform
those trained on FineWeb by 2.14 percentage points in average score, computed
on a set of 11 commonly used benchmarks.

The Bloom Annotator transform assigns a label of 1 if the document is present in the GneissWeb Bloom filter; otherwise, it assigns 0. This approach provides a clear understanding of which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.

The Bloom Annotator transform maps a non-empty input table to an output table with an added `is_in_GneissWeb` column. Each row in the table corresponds to a UUID and its associated document. The transform checks whether the document's UUID exists in the GneissWeb Bloom filter.
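
The labeling logic can be illustrated with a minimal, standard-library-only sketch. It is not the transform's actual implementation: the `TinyBloomFilter` class, the `annotate` helper, and the `id` and `contents` field names are assumptions made for illustration, and a real run would load the GneissWeb Bloom filter rather than build a toy one.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter (hypothetical; stands in for the real GneissWeb filter)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive one bit position per hash by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def annotate(rows, bloom, id_column="id", output_column="is_in_GneissWeb"):
    """Return copies of the rows with a 1/0 label added in output_column."""
    return [{**row, output_column: int(row[id_column] in bloom)} for row in rows]


# Hypothetical data: one UUID is registered in the filter, the other is not.
bloom = TinyBloomFilter()
bloom.add("uuid-a")

rows = [
    {"id": "uuid-a", "contents": "document A"},
    {"id": "uuid-b", "contents": "document B"},
]
annotated = annotate(rows, bloom)
for row in annotated:
    print(row["id"], row["is_in_GneissWeb"])
```

A UUID that was added to the filter is always labeled 1. A UUID that was never added is labeled 0 except in the rare case of a false positive, which is an inherent property of any Bloom filter.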

## Contributor
- Yang Zhao ([email protected])

## Description
### Prerequisite
Please refer to `requirements.txt` to install the necessary packages.
## General Information
Please see the set of [transform project conventions](../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up.

### Overview
The bloom transform maps a non-empty input table to an output table with an added `is_in_GneissWeb` column. Each row in the table corresponds to a UUID and its associated document. The Bloom transform verifies whether the document's UUID exists in the GneissWeb Bloom filter.
## Prerequisite
Please refer to `requirements.txt` to install the necessary packages.


### input format
@@ -41,7 +49,6 @@ configuration for values are as follows:




## Usage
Place your input Parquet file in the `test-data/input/` directory. A sample file, `test1.parquet`, is available in this directory. Once done, run the script.
