Updates the doc to show how to pip install and run a transform at the CLI #928

Open
wants to merge 36 commits into base: dev

36 commits
021c8f5
fix lib doc .py links and update resize readme
daw3rd Aug 13, 2024
666e558
Merge branch 'dev' into Readme-Changes
daw3rd Sep 18, 2024
591f3d8
reorder some instructions in RELEASE.md
daw3rd Sep 18, 2024
0b4c712
Merge branch 'dev' into Readme-Changes
daw3rd Sep 23, 2024
31e8354
updated doc on exception processing by the runtime
daw3rd Sep 23, 2024
ebbc0a1
updated release notes and release process doc
daw3rd Sep 25, 2024
96122eb
Merge branch 'dev' into Readme-Changes
daw3rd Sep 25, 2024
9671d6f
Merge branch 'dev' into Readme-Changes
daw3rd Sep 25, 2024
698edbe
cleanups in the release documentation
daw3rd Sep 26, 2024
81f0b35
cleanups in the release documentation
daw3rd Sep 26, 2024
148fde8
Merge branch 'dev' into Readme-Changes
daw3rd Oct 1, 2024
61dc844
remove duplicated table of transforms
daw3rd Oct 1, 2024
88fc03a
center columns in module table of readme
daw3rd Oct 1, 2024
474ab8d
Merge branch 'dev' into Readme-Changes
daw3rd Dec 12, 2024
60171f8
readme changes for simplified start example
daw3rd Jan 6, 2025
355ab20
notebook readme
daw3rd Jan 6, 2025
b9cd435
Merge branch 'dev' into readme-david
daw3rd Jan 6, 2025
289fbba
add pip install/python to show running transform from cli in top leve…
daw3rd Jan 8, 2025
60a15f3
add terminology to readme and tune cli python run
daw3rd Jan 8, 2025
ce2ab62
use wget to get data and reorder Getting Started sections
daw3rd Jan 9, 2025
ac355d9
improved wget urls
daw3rd Jan 9, 2025
4c63be0
simplify first/readme notebook and setup
daw3rd Jan 9, 2025
4a99d86
fix google colab link for new notebook - temporarily for testing
daw3rd Jan 9, 2025
6588e9f
restore collab notebook link to be to dev branch
daw3rd Jan 9, 2025
d2a1279
change readme to only install pdf2parquet to workaround fasttext inst…
daw3rd Jan 9, 2025
bc3d99f
Merge branch 'dev' into readme-david
daw3rd Jan 24, 2025
ce00d2c
Update/merge CLI transform invocation in the README
daw3rd Jan 24, 2025
2676cf0
delete unneeded notebook and readme
daw3rd Jan 24, 2025
a00398c
Merge branch 'dev' into readme-david
daw3rd Feb 6, 2025
4521794
add cli examples linked from ADVANCED
daw3rd Feb 6, 2025
e3ec9d5
move CLI section up in the ADVANCED doc
daw3rd Feb 6, 2025
8189eef
more restructuring of ADVANCED.md
daw3rd Feb 6, 2025
750468e
More ADVANCED changes
daw3rd Feb 6, 2025
d68dbd9
link in new cli doc to quickstart
daw3rd Feb 6, 2025
c4e1d4f
add anchors/links to ADVANCED.md
daw3rd Feb 6, 2025
5353768
fix broken link in ADVANCED and replace wget with python
daw3rd Feb 6, 2025
56 changes: 44 additions & 12 deletions ADVANCED.md

![alt text](doc/Data-prep-kit-diagram.png)

Below we discuss the following:
* [Adding your own transform to the repository](#adding)
* [Running transforms using the CLI](#cli)
* [Scaling transform execution](#scaling)
* [Using HuggingFace data](#huggingface)

<a name="adding"></a>
## Add your own transform

At the core of the framework is a data processing library that provides a systematic way to implement data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](transforms/code/code2parquet/python/README.md). A user can use one or more transforms (or modules) as discussed above to process their data.
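For instance, you can inspect the schema of any such parquet file with `pyarrow` (a minimal sketch; the file path is a placeholder, point it at any transform input or output):

```python
import pyarrow.parquet as pq

# Inspect column names/types and the row count without loading the data.
# "output/archive1.parquet" is a placeholder file name.
schema = pq.read_schema("output/archive1.parquet")
print(schema)
print(pq.read_metadata("output/archive1.parquet").num_rows)
```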
The annotator design also allows a user to verify the results of the processing.
- **Filter** A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing (a sketch of both patterns follows).
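Here the function names, column names, and threshold are hypothetical illustrations of the two designs, not the library's API:

```python
import pyarrow as pa
import pyarrow.compute as pc

def annotate(table: pa.Table) -> pa.Table:
    # Annotator pattern: keep every row, append a derived column.
    lengths = pc.utf8_length(table.column("contents"))
    return table.append_column("doc_length", lengths)

def filter_rows(table: pa.Table, min_len: int = 10) -> pa.Table:
    # Filter pattern: emit only the rows of interest.
    annotated = annotate(table)
    mask = pc.greater_equal(annotated.column("doc_length"), min_len)
    return annotated.filter(mask)

table = pa.table({"contents": ["short", "a sufficiently long document"]})
print(filter_rows(table).num_rows)  # 1
```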

For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template to add new simple transforms. Follow the step-by-step [tutorial](doc/quick-start/contribute-your-own-transform.md) to help you add your own new transform; a rough sketch of the shape of such a transform appears below.
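This sketch is modeled on the noop example and assumes the library's `AbstractTableTransform` interface; the exact signature may vary between releases, so treat the tutorial above as authoritative:

```python
from typing import Any

import pyarrow as pa
from data_processing.transform import AbstractTableTransform


class MyTransform(AbstractTableTransform):
    """Annotator-style transform that tags every row (illustrative only)."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.label = config.get("label", "example")

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Append a constant column and report simple stats to the runtime.
        tagged = table.append_column("label", pa.array([self.label] * table.num_rows))
        return [tagged], {"nrows": table.num_rows}
```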

For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](data-processing-lib/doc/transforms.md) and [runtimes](data-processing-lib/doc/transform-runtimes.md).

Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.

<a name="cli"></a>
## Running Transforms at the Command Line

You can run transforms from the command line or from within a docker image; a combined example follows the list below.
* This [document](doc/quick-start/run-transform-cli.md) shows how to
run a transform using the command line interface and a virtual environment.
* You can follow this [document](doc/quick-start/run-transform-image.md) to run a transform using a docker image.
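For example, combining the steps from those documents into a single sequence (the same install and run commands are used in the CLI quick-start document):

```bash
# Install one transform, then launch its Python runtime on local data.
pip install 'data-prep-toolkit-transforms[pdf2parquet]'
python -m dpk_pdf2parquet.transform_python \
    --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
    --data_files_to_use "['.pdf', '.zip']"
```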

<a name="scaling"></a>
## Scaling from laptop to cluster <a name="laptop_cluster"></a> 💻 -> 🖥️☁️
Data-prep-kit provides the flexibility to transition your projects from
proof-of-concept (PoC) stage to full-scale production mode,
offering all the necessary tools to run your data transformations at high volume.
In this section, we show how to run your transforms at scale and how to automate them.

#### Scaling of Transforms

To enable processing of large data volumes on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html)
or [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.

A generalized workflow is shown [here](doc/data-processing.md).
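As a sketch, launching a transform's Ray variant mirrors the Python invocation shown earlier; the module path and the `--run_locally` flag below are assumptions, so check the transform's README for its exact Ray entry point:

```bash
# Hypothetical Ray-runtime launch of pdf2parquet on a local Ray cluster.
python -m dpk_pdf2parquet.ray.transform \
    --run_locally True \
    --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
    --data_files_to_use "['.pdf', '.zip']"
```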

#### KFP Automation

The toolkit also supports transform execution automation based on
[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP), ...

In addition, if you want to combine several transformers in a single pipeline, you ...
When you finish working with the cluster and want to clean up or destroy it, see
[clean up the cluster](kfp/doc/setup.md#cleanup).

<a name="huggingface"></a>
## Using HuggingFace Data

If you wish to download and use parquet data files from HuggingFace
while testing any of the toolkit transforms, use the HuggingFace
[download APIs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
that provide caching and optimize the download process.
Here is an example of the code needed to download a sample file.
First, install `huggingface_hub`:

```bash
pip install --upgrade huggingface_hub
```
Then use the following to download a specific file (the repository id line is collapsed in the diff; the `REPO_ID` value below is an assumption matching the file path):
```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Assumed value: the original REPO_ID line is hidden in the diff;
# "HuggingFaceFW/fineweb" matches the CC-MAIN file path below.
REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
```
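`hf_hub_download` returns the local path of the cached file, so the `pandas` import above can be used to verify the download:

```python
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
df = pd.read_parquet(path)
print(df.shape)
print(df.columns.tolist())
```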

12 changes: 5 additions & 7 deletions doc/quick-start/quick-start.md
```bash
pip3 install 'data-prep-toolkit-transforms[ray,lang_id]'
pip install jupyterlab ipykernel ipywidgets
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```

## Running transforms

### Notebooks
* There is a [simple notebook](../../examples/notebooks/Run_your_first_transform_colab.ipynb) for running a single transform that can be run from either Google Colab or the local environment by downloading the file.
* In most individual transform folders, we have included one (Python), two (Python and Ray), or three (Python, Ray and Spark) notebooks for running that transform. In order to run all these notebooks in the local environment, we clone the repo as:
```bash
# ...
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
```
You can now run the [Python version](../../transforms/universal/fdedup/fdedup_python.ipynb), [Ray version](../../transforms/universal/fdedup/fdedup_ray.ipynb) or [Spark version](../../transforms/universal/fdedup/fdedup_spark.ipynb) of the three notebooks for this transform.


### Command line
* [Using the CLI](run-transform-cli.md) - install and run a transform from the command line.
* [Using a docker image](run-transform-image.md) - runs a transform in a docker transform image.
* [Using a project's virtual environment](run-transform-venv.md) - runs a transform from its project directory.

## Running transforms on Windows

66 changes: 66 additions & 0 deletions doc/quick-start/run-transform-cli.md
# Running a Transform from the Command Line
Here we address the simple use case of applying a single transform to a
set of input files.
We'll use the `pdf2parquet` transform as an example, but in general, this process
will work for any of the transforms contained in Data Prep Kit.
Additionally, what follows uses the
[Python runtime](../../data-processing-lib/doc/python-runtime.md),
but the examples below should also work for the
[Ray](../../data-processing-lib/doc/ray-runtime.md)
or
[Spark](../../data-processing-lib/doc/spark-runtime.md)
runtimes.

### Install Data Prep Kit from PyPI

The latest version of Data Prep Kit is available on PyPI for Python 3.10, 3.11, or 3.12. It can be installed using:

```bash
pip install 'data-prep-toolkit-transforms[ray,all]'
```

The above installs all available transforms and both the Python and Ray runtimes.

NOTE: As of this writing, on Linux systems there is an
[issue](https://github.com/IBM/data-prep-kit/issues/873)
installing `fasttext` for the `lang_id` transform.
A workaround is to
[install using conda](quick-start.md#conda).
Alternatively, you may choose to install only the transform(s) of interest (see below).

When installing select transforms, users can specify the name of the transform in the pip command, rather than `[all]`. For example, use the following command to install only the pdf2parquet transform:
```bash
pip install 'data-prep-toolkit-transforms[pdf2parquet]'
```
As an alternative, instructions for installing in a conda environment
can be found
[here](quick-start.md#conda).
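Whichever route you choose, a quick sanity check that the transform is importable (a sketch; the module name matches the run command below):

```bash
pip show data-prep-toolkit-transforms
python -c "import dpk_pdf2parquet"
```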

### Run a transform at the command line
Here we run the `pdf2parquet` transform on its input data to
import PDF content into rows of a parquet file.
First, we download some data for the transform to run on using the following Python code:
```python
import os
import urllib.request

os.makedirs("input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/archive1.zip", "input/archive1.zip")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf", "input/redp5110-ch1.pdf")
```
```shell
% ls input
archive1.zip redp5110-ch1.pdf
```

Next, we run `pdf2parquet` on the data in the `input` folder.
```shell
python -m dpk_pdf2parquet.transform_python \
--data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
--data_files_to_use "['.pdf', '.zip']"
```
Parquet files are generated in the designated `output` folder:
```shell
% ls output
archive1.parquet metadata.json redp5110-ch1.parquet
```
All transforms are runnable from the command line in the manner above.
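Each launcher builds a standard argument parser, so passing `--help` should list a transform's supported flags (assuming the usual argparse behavior):

```bash
python -m dpk_pdf2parquet.transform_python --help
```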