Updates the doc to show how to pip install and run a transform at the CLI #928

Open
wants to merge 36 commits into base: dev

36 commits
021c8f5
fix lib doc .py links and update resize readme
daw3rd Aug 13, 2024
666e558
Merge branch 'dev' into Readme-Changes
daw3rd Sep 18, 2024
591f3d8
reorder some instructions in RELEASE.md
daw3rd Sep 18, 2024
0b4c712
Merge branch 'dev' into Readme-Changes
daw3rd Sep 23, 2024
31e8354
updated doc on exception processing by the runtime
daw3rd Sep 23, 2024
ebbc0a1
updated release notes and release process doc
daw3rd Sep 25, 2024
96122eb
Merge branch 'dev' into Readme-Changes
daw3rd Sep 25, 2024
9671d6f
Merge branch 'dev' into Readme-Changes
daw3rd Sep 25, 2024
698edbe
cleanups in the release documentation
daw3rd Sep 26, 2024
81f0b35
cleanups in the release documentation
daw3rd Sep 26, 2024
148fde8
Merge branch 'dev' into Readme-Changes
daw3rd Oct 1, 2024
61dc844
remove duplicated table of transforms
daw3rd Oct 1, 2024
88fc03a
center columns in module table of readme
daw3rd Oct 1, 2024
474ab8d
Merge branch 'dev' into Readme-Changes
daw3rd Dec 12, 2024
60171f8
readme changes for simplified start example
daw3rd Jan 6, 2025
355ab20
notebook readme
daw3rd Jan 6, 2025
b9cd435
Merge branch 'dev' into readme-david
daw3rd Jan 6, 2025
289fbba
add pip install/python to show running transform from cli in top leve…
daw3rd Jan 8, 2025
60a15f3
add terminology to readme and tune cli python run
daw3rd Jan 8, 2025
ce2ab62
use wget to get data and reorder Getting Started sections
daw3rd Jan 9, 2025
ac355d9
improved wget urls
daw3rd Jan 9, 2025
4c63be0
simplify first/readme notebook and setup
daw3rd Jan 9, 2025
4a99d86
fix google colab link for new notebook - temporarily for testing
daw3rd Jan 9, 2025
6588e9f
restore collab notebook link to be to dev branch
daw3rd Jan 9, 2025
d2a1279
change readme to only install pdf2parquet to workaround fasttext inst…
daw3rd Jan 9, 2025
bc3d99f
Merge branch 'dev' into readme-david
daw3rd Jan 24, 2025
ce00d2c
Update/merge CLI transform invocation in the README
daw3rd Jan 24, 2025
2676cf0
delete unneeded notebook and readme
daw3rd Jan 24, 2025
a00398c
Merge branch 'dev' into readme-david
daw3rd Feb 6, 2025
4521794
add cli examples linked from ADVANCED
daw3rd Feb 6, 2025
e3ec9d5
move CLI section up in the ADVANCED doc
daw3rd Feb 6, 2025
8189eef
more restructuring of ADVANCED.md
daw3rd Feb 6, 2025
750468e
More ADVANCED changes
daw3rd Feb 6, 2025
d68dbd9
link in new cli doc to quickstart
daw3rd Feb 6, 2025
c4e1d4f
add anchors/links to ADVANCED.md
daw3rd Feb 6, 2025
5353768
fix broken link in ADVANCED and replace wget with python
daw3rd Feb 6, 2025
56 changes: 44 additions & 12 deletions ADVANCED.md

![alt text](doc/Data-prep-kit-diagram.png)

Below we discuss the following:
* [Adding your own transform to the repository](#adding)
* [Running transforms using the CLI](#cli)
* [Scaling transform execution](#scaling)
* [Using HuggingFace data](#huggingface)

<a name="adding"></a>
## Add your own transform

At the core of the framework is a data processing library that provides a systematic way to implement data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](transforms/code/code2parquet/python/README.md). A user can use one or more transforms (or modules) as discussed above to process their data.
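For instance, you can inspect the schema of any such parquet file with `pyarrow` (a minimal sketch; the file path is a placeholder, point it at any transform input or output):

```python
import pyarrow.parquet as pq

# Inspect column names/types and the row count without loading the data.
# "output/archive1.parquet" is a placeholder file name.
schema = pq.read_schema("output/archive1.parquet")
print(schema)
print(pq.read_metadata("output/archive1.parquet").num_rows)
```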
The annotator design also allows a user to verify the results of the processing.
- **Filter** A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing (a sketch of both patterns follows).
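Here the function names, column names, and threshold are hypothetical illustrations of the two designs, not the library's API:

```python
import pyarrow as pa
import pyarrow.compute as pc

def annotate(table: pa.Table) -> pa.Table:
    # Annotator pattern: keep every row, append a derived column.
    lengths = pc.utf8_length(table.column("contents"))
    return table.append_column("doc_length", lengths)

def filter_rows(table: pa.Table, min_len: int = 10) -> pa.Table:
    # Filter pattern: emit only the rows of interest.
    annotated = annotate(table)
    mask = pc.greater_equal(annotated.column("doc_length"), min_len)
    return annotated.filter(mask)

table = pa.table({"contents": ["short", "a sufficiently long document"]})
print(filter_rows(table).num_rows)  # 1
```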

For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template to add new simple transforms. Follow the step-by-step [tutorial](doc/quick-start/contribute-your-own-transform.md) to help you add your own new transform; a rough sketch of the shape of such a transform appears below.
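This sketch is modeled on the noop example and assumes the library's `AbstractTableTransform` interface; the exact signature may vary between releases, so treat the tutorial above as authoritative:

```python
from typing import Any

import pyarrow as pa
from data_processing.transform import AbstractTableTransform


class MyTransform(AbstractTableTransform):
    """Annotator-style transform that tags every row (illustrative only)."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.label = config.get("label", "example")

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Append a constant column and report simple stats to the runtime.
        tagged = table.append_column("label", pa.array([self.label] * table.num_rows))
        return [tagged], {"nrows": table.num_rows}
```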

For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](data-processing-lib/doc/transforms.md) and [runtimes](data-processing-lib/doc/transform-runtimes.md).

Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.

<a name="cli"></a>
## Running Transforms at the Command Line

You can run transforms from the command line or from within a docker image; a combined example follows the list below.
* This [document](doc/quick-start/run-transform-cli.md) shows how to
run a transform using the command line interface and a virtual environment.
* You can follow this [document](doc/quick-start/run-transform-image.md) to run a transform using a docker image.
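For example, combining the steps from those documents into a single sequence (the same install and run commands are used in the CLI quick-start document):

```bash
# Install one transform, then launch its Python runtime on local data.
pip install 'data-prep-toolkit-transforms[pdf2parquet]'
python -m dpk_pdf2parquet.transform_python \
    --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
    --data_files_to_use "['.pdf', '.zip']"
```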

<a name="scaling"></a>
## Scaling from laptop to cluster <a name="laptop_cluster"></a> 💻 -> 🖥️☁️
Data-prep-kit provides the flexibility to transition your projects from
proof-of-concept (PoC) stage to full-scale production mode,
offering all the necessary tools to run your data transformations at high volume.
In this section, we show how to run your transforms at scale and how to automate them.

#### Scaling of Transforms

To enable processing of large data volumes on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html)
or [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.

A generalized workflow is shown [here](doc/data-processing.md).
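As a sketch, launching a transform's Ray variant mirrors the Python invocation shown earlier; the module path and the `--run_locally` flag below are assumptions, so check the transform's README for its exact Ray entry point:

```bash
# Hypothetical Ray-runtime launch of pdf2parquet on a local Ray cluster.
python -m dpk_pdf2parquet.ray.transform \
    --run_locally True \
    --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
    --data_files_to_use "['.pdf', '.zip']"
```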

#### KFP Automation

The toolkit also supports transform execution automation based on
[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP), ...

In addition, if you want to combine several transformers in a single pipeline, you ...
When you finish working with the cluster and want to clean up or destroy it, see
[clean up the cluster](kfp/doc/setup.md#cleanup).

<a name="huggingface"></a>
## Using HuggingFace Data

If you wish to download and use parquet data files from HuggingFace
while testing any of the toolkit transforms, use the HuggingFace
[download APIs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
that provide caching and optimize the download process.
Here is an example of the code needed to download a sample file.
First, install `huggingface_hub`:

```bash
pip install --upgrade huggingface_hub
```
Then use the following to download a specific file (the repository id line is collapsed in the diff; the `REPO_ID` value below is an assumption matching the file path):
```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Assumed value: the original REPO_ID line is hidden in the diff;
# "HuggingFaceFW/fineweb" matches the CC-MAIN file path below.
REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
```
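`hf_hub_download` returns the local path of the cached file, so the `pandas` import above can be used to verify the download:

```python
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
df = pd.read_parquet(path)
print(df.shape)
print(df.columns.tolist())
```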

12 changes: 5 additions & 7 deletions doc/quick-start/quick-start.md
```bash
pip3 install 'data-prep-toolkit-transforms[ray,lang_id]'
pip install jupyterlab ipykernel ipywidgets
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```

## Running transforms

### Notebooks
* There is a [simple notebook](../../examples/notebooks/Run_your_first_transform_colab.ipynb) for running a single transform that can be run from either Google Colab or the local environment by downloading the file.
* In most individual transform folders, we have included one (Python), two (Python and Ray), or three (Python, Ray and Spark) notebooks for running that transform. In order to run all these notebooks in the local environment, we clone the repo as:
```bash
# ...
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```
You can now run the [Python version](../../transforms/universal/fdedup/fdedup_python.ipynb), [Ray version](../../transforms/universal/fdedup/fdedup_ray.ipynb) or [Spark version](../../transforms/universal/fdedup/fdedup_spark.ipynb) of the three notebooks for this transform.


### Command line
* [Using the CLI](run-transform-cli.md) - install and run a transform from the command line.
* [Using a docker image](run-transform-image.md) - runs a transform in a docker transform image.
* [Using a project's virtual environment](run-transform-venv.md) - runs a transform from its project directory.

## Running transforms on Windows

66 changes: 66 additions & 0 deletions doc/quick-start/run-transform-cli.md
# Running a Transform from the Command Line
Here we address the simple use case of applying a single transform to a
set of input files.
We'll use the `pdf2parquet` transform as an example, but in general, this process
will work for any of the transforms contained in Data Prep Kit.
Additionally, what follows uses the
[Python runtime](../../data-processing-lib/doc/python-runtime.md),
but the examples below should also work for the
[Ray](../../data-processing-lib/doc/ray-runtime.md)
or
[Spark](../../data-processing-lib/doc/spark-runtime.md)
runtimes.

### Install Data Prep Kit from PyPI

The latest version of Data Prep Kit is available on PyPI for Python 3.10, 3.11, or 3.12. It can be installed using:

```bash
pip install 'data-prep-toolkit-transforms[ray,all]'
```

The above installs all available transforms and both the Python and Ray runtimes.

NOTE: As of this writing, on Linux systems there is an
[issue](https://github.com/IBM/data-prep-kit/issues/873)
installing `fasttext` for the `lang_id` transform.
A workaround is to
[install using conda](quick-start.md#conda).
Alternatively, you may choose to install only the transform(s) of interest (see below).

When installing select transforms, users can specify the name of the transform in the pip command, rather than `[all]`. For example, use the following command to install only the pdf2parquet transform:
```bash
pip install 'data-prep-toolkit-transforms[pdf2parquet]'
```
As an alternative, instructions for installing in a conda environment
can be found
[here](quick-start.md#conda).
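Whichever route you choose, a quick sanity check that the transform is importable (a sketch; the module name matches the run command below):

```bash
pip show data-prep-toolkit-transforms
python -c "import dpk_pdf2parquet"
```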

### Run a transform at the command line
Here we run the `pdf2parquet` transform on its input data to
import PDF content into rows of a parquet file.
First, we download some data for the transform to run on using the following Python code:
```python
import os
import urllib.request

os.makedirs("input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/archive1.zip", "input/archive1.zip")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf", "input/redp5110-ch1.pdf")
```
```shell
% ls input
archive1.zip redp5110-ch1.pdf
```

Next, we run `pdf2parquet` on the data in the `input` folder.
```shell
python -m dpk_pdf2parquet.transform_python \
--data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
--data_files_to_use "['.pdf', '.zip']"
```
Parquet files are generated in the designated `output` folder:
```shell
% ls output
archive1.parquet metadata.json redp5110-ch1.parquet
```
All transforms are runnable from the command line in the manner above.
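Each launcher builds a standard argument parser, so passing `--help` should list a transform's supported flags (assuming the usual argparse behavior):

```bash
python -m dpk_pdf2parquet.transform_python --help
```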