Commit
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
1 parent 7e84256, commit e8e7e8e
Showing 1 changed file with 14 additions and 7 deletions.
@@ -1,15 +1,23 @@
 # Bloom Annotation
-Please see the set of [transform project conventions](../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up.
+
+## Summary
+Recently, IBM has introduced GneissWeb; a large dataset yielding around 10 trillion tokens that
+caters to the data quality and quantity requirements of training LLMs. The models trained using GneissWeb dataset outperform
+those trained on FineWeb by 2.14 percentage points in terms of average score computed
+on a set of 11 commonly used benchmarks.
+
+The Bloom Annotator transform assigns a label of 1 if the document is present in the GneissWeb Bloom filter; otherwise, it assigns 0. This approach provides a clear understanding of which documents in FineWeb are also present in GneissWeb and which are not. The GneissWeb Bloom filter is just one use case; the Bloom Annotator transform can work with any Bloom filter.
+
+Bloom annotator transform maps a non-empty input table to an output table with an added is_in_GneissWeb column. Each row in the table corresponds to a UUID and its associated document. The Bloom annotator transform verifies whether the document's UUID exists in the GneissWeb Bloom filter.

 ## Contributor
 - Yang Zhao ([email protected])

-## Description
-### Prerequisite
-Please refer to `requirements.txt` to install the necessary packages.
+## General Information
+Please see the set of [transform project conventions](../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up.

-### Overview
-The bloom transform maps a non-empty input table to an output table with an added `is_in_GneissWeb` column. Each row in the table corresponds to a UUID and its associated document. The Bloom transform verifies whether the document's UUID exists in the GneissWeb Bloom filter.
+## Prerequisite
+Please refer to `requirements.txt` to install the necessary packages.


 ### input format
@@ -41,7 +49,6 @@ configuration for values are as follows:

 ## Usage
 Place your input Parquet file in the `test-data/input/` directory. A sample file, `test1.parquet`, is available in this directory. Once done, run the script.
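
The annotation step described in the README's Summary (label 1 if a document's UUID is found in the Bloom filter, otherwise 0) can be illustrated with a small standalone sketch. This is not the transform's actual code: the `SimpleBloomFilter` class, its SHA-256-based hashing, the `annotate` helper, and the `id` column name are all assumptions made for illustration, and plain Python dictionaries stand in for the Parquet table rows. Only the `is_in_GneissWeb` column name comes from the README.

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter; a hypothetical stand-in for the GneissWeb Bloom filter."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        # Set every bit position associated with the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True only if all of the key's bit positions are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def annotate(rows, bloom, id_column="id", out_column="is_in_GneissWeb"):
    """Return copies of the rows with a 0/1 membership label added.

    `id_column` is an assumed name for the column holding each document's UUID.
    """
    return [{**row, out_column: int(row[id_column] in bloom)} for row in rows]


# Build a filter containing one known UUID, then annotate two rows.
bloom = SimpleBloomFilter()
bloom.add("uuid-a")
rows = [
    {"id": "uuid-a", "text": "document A"},
    {"id": "uuid-b", "text": "document B"},
]
annotated = annotate(rows, bloom)
```

A Bloom filter never reports a false negative, so a UUID that was added always receives label 1. It can report false positives, so a label of 1 means "probably present" rather than "certainly present"; the filter size and number of hash functions control how often that happens.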