diff --git a/ADVANCED.md b/ADVANCED.md new file mode 100644 index 0000000000..da95bdc5cd --- /dev/null +++ b/ADVANCED.md @@ -0,0 +1,71 @@ + +

# Data Prep Kit for Advanced Users

![alt text](doc/Data-prep-kit-diagram.png)

### Add your own transform

At the core of the framework is a data processing library that provides a systematic way to implement data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](transforms/code/code2parquet/python/README.md). A user can apply one or more transforms (or modules), as discussed above, to process their data.
A transform can follow one of two patterns: annotator or filter.

- **Annotator** An annotator transform adds information during processing by appending one or more columns to the parquet files.
The annotator design also allows a user to verify the results of the processing before the actual filtering of the data.

- **Filter** A filter transform processes the data and outputs the transformed data, e.g., exact deduplication.
A general-purpose [SQL-based filter transform](transforms/universal/filter) provides a powerful mechanism for identifying columns and rows of interest for downstream processing.

When adding a new module, a user can pick the right design based on the processing to be applied. More details [here](transforms).

One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template for adding new simple transforms; a minimal sketch of the annotator pattern is shown below. Follow the step-by-step [tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) to add your own new transform.

For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](data-processing-lib/doc/transforms.md) and [runtimes](data-processing-lib/doc/transform-runtimes.md).

Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules.
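To make the annotator pattern concrete, here is a minimal sketch modeled on the noop example. The `AbstractTableTransform` base class and the `transform()` signature follow the Data Processing Library interface covered in the tutorial above; the `WordCountTransform` name, the `word_count` column, and the `wc_contents_column` configuration key are illustrative assumptions, not part of the library:

```python
from typing import Any

import pyarrow as pa
from data_processing.transform import AbstractTableTransform


class WordCountTransform(AbstractTableTransform):
    """Annotator sketch: appends a 'word_count' column computed from a text column."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        # Column to annotate; defaults to the conventional 'contents' column.
        self.column = config.get("wc_contents_column", "contents")

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        # Annotate: compute a per-row word count and append it as a new column.
        counts = [len(str(text).split()) for text in table[self.column].to_pylist()]
        table = table.append_column("word_count", pa.array(counts))
        # Return the annotated table(s) plus metadata for the runtime to report.
        return [table], {"rows_annotated": table.num_rows}
```

A filter transform has the same shape; instead of appending columns, it would return a table with unwanted rows dropped.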
## 💻 -> 🖥️☁️ From laptop to cluster

Data Prep Kit provides the flexibility to transition your projects from the proof-of-concept (PoC) stage to full-scale production, offering all the necessary tools to run your data transformations at high volume. In this section, we show how to run your transforms at scale and how to automate them.

### Scaling of Transforms

To enable processing of large data volumes on multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.

A generalized workflow is shown [here](doc/data-processing.md).

### Automation

The toolkit also supports transform execution automation based on [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP), tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. Automation is provided to create a Kind cluster and deploy all required components on it.
The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html) for creating and managing the Ray cluster and the [KubeRay API server](https://github.com/ray-project/kuberay/tree/master/apiserver) for interacting with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several [KFP components](kfp/kfp_ray_components) is used to simplify the pipeline implementation.

A simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md) explains pipeline creation and execution.
In addition, if you want to combine several transforms in a single pipeline, see the [multi-step pipeline](kfp/doc/multi_transform_pipeline.md).

When you finish working with the cluster and want to clean it up or destroy it, see the instructions to [clean up the cluster](kfp/doc/setup.md#cleanup).

### Using data from HuggingFace

If you wish to download and use real parquet data files from HuggingFace while testing any of the toolkit transforms, use the HuggingFace [download APIs](https://huggingface.co/docs/huggingface_hub/en/guides/download), which provide caching and optimize the download process. First install the hub client (`pip install --upgrade huggingface_hub`), then download a sample file and read it back:

```python
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

# hf_hub_download returns the local path of the cached file
local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
df = pd.read_parquet(local_path)
```

### Run your first transform using command line options

You can run transforms via a Docker image or using virtual environments. This [document](doc/quick-start/run-transform-venv.md) shows how to run a transform using a virtual environment, and this [document](doc/quick-start/run-transform-image.md) shows how to run one using a Docker image.
diff --git a/README.md b/README.md index 6a0580c505..02664ae106 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,8 @@

Data Prep Kit

+ +
@@ -9,85 +11,56 @@
-Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. -With the explosive growth of LLM-enabled use cases, developers are faced with the enormous challenge of preparing use case-specific unstructured data to fine-tune, instruct-tune the LLMs or to build RAG applications for LLMs. -As the variety of use cases grow, so does the need to support: - -- New ways of transforming the data to enhance the performance of the resulting LLMs for each specific use case. -- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale -- Support for different data modalities including language, code, vision, multimodal etc - -Data Prep Kit offers implementations of commonly needed data preparation steps, called *modules* or *transforms*, for both Code and Language modalities, with vision to extend to images, speech and multimodal data. -The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks. - -![alt text](doc/Data-prep-kit-diagram.png) - +Data Prep Kit is a community-driven project that simplifies unstructured data preparation for LLM application development. It addresses the growing challenge of preparing diverse data (language, code, vision, multimodal) for fine-tuning, instruction-tuning, and RAG applications. The modules in the kit have been tested in producing pre-training datasets for the [Granite open source LLM models](https://huggingface.co/ibm-granite). -## 📝 Table of Contents +## Features -- [About](#about) -- [Getting Started](#gettingstarted) -- [Scaling transforms from laptop to cluster](#laptop_cluster) -- [Repository Use and Navigation](doc/repo.md) -- [How to Contribute](CONTRIBUTING.md) -- [Resources (papers, talks, presentations and tutorials)](resources.md) -- [Citations](#citations) +- The kit provides a growing set of [modules/transforms](#table) targeting laptop-scale to datacenter-scale processing. +- The data modalities supported _today_ are: Natural Language and Code. +- The modules are built on common frameworks for Python, Ray and Spark runtimes for scaling up data processing. +- The kit provides a framework for developing custom transforms for processing parquet files. +- The kit uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md). -## 📖 About +## Installation -Data Prep Kit is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning, RAG or instruction-tuning. -Data Prep Kit contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case. -These modules have been tested while producing pre-training datasets for the [Granite open source LLM models](https://huggingface.co/ibm-granite). - -The modules are built on common frameworks (for Spark and Ray), called the *data processing library* that allows the developers to build new custom modules that readily scale across a variety of runtimes. +The latest version of the Data Prep Kit is available on PyPi for Python 3.10, 3.11 or 3.12. It can be installed using: -Features of the toolkit: +```bash +pip install 'data-prep-toolkit-transforms[all]' +``` -- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases. 
-- It offers a growing set of [module](transforms) implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing. -- It provides a growing set of [sample data processing pipelines](examples) that can be used for real enterprise use cases. -- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases. -- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md). +This will install all available transforms. -Data modalities supported _today_: Code and Natural Language. +For guidance on creating the virtual environment for installing the data prep kit, click [here](doc/quick-start/quick-start.md). ## 🚀 Getting Started ### Fastest way to experience Data Prep Kit -With no setup necessary, let's use a Google Colab friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: [examples/notebooks/Run_your_first_transform_colab.ipynb](examples/notebooks/Run_your_first_transform_colab.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb). ([Here](doc/google-colab.md) are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of, or are not needed.) The same notebook can be downloaded and run on the local machine, without cloning the repo or any other setup. For additional guidance on setting up Jupyter lab, click [here](doc/quick-start/quick-start.md#jupyter). +With no setup necessary, let's use a Google Colab friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: [examples/notebooks/Run_your_first_transform_colab.ipynb](examples/notebooks/Run_your_first_transform_colab.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb). ([Here](doc/google-colab.md) are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of, or are not needed.) The same notebook can be downloaded and run on the local machine, without cloning the repo or any other setup. -### Install data prep kit from PyPi +### Examples -The latest version of the Data Prep Kit is available on PyPi for Python 3.10, 3.11 or 3.12. It can be installed using: +Now that you have run a single transform, the next step is to explore how to put these transforms +together to run a data prep pipeline for end to end real enterprise use cases like fine-tuning a model or building a RAG application. -```bash -pip install 'data-prep-toolkit-transforms[ray,all]' -``` +We have a complete set of data processing [recipes](examples) for such use cases. -The above installs all available transforms. +We also have [a developer tutorial](doc/quick-start/contribute-your-own-transform.md) for contributing a new transform to the kit. -When installing select transforms, users can specify the name of the transform in the pip command, rather than [all]. 
For example, use the following command to install only the pdf2parquet transform:
-```bash
-pip install 'data-prep-toolkit-transforms[pdf2parquet]'
-```
-For additional guidance on creating the virtual environment for installing the data prep kit, click [here](doc/quick-start/quick-start.md#conda).
+For advanced users, [here](ADVANCED.md) is more information on adding your own transform, scaling it, and automating it. Also, repository structure and use are discussed [here](doc/repo.md).
-### Run your first data prep pipeline
+### Windows users
-Now that you have run a single transform, the next step is to explore how to put these transforms
-together to run a data prep pipeline for an end to end use case like fine tuning a model or building
-a RAG application.
-This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
-how to build an end to end data prep pipeline for fine tuning for code LLMs.
-You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
+Please click [here](doc/quick-start/quick-start.md#running-transforms-on-windows) for guidance on how to run transforms in Windows.
-### Windows users
+### Using HuggingFace data files
+
+All the transforms in the kit include small sample data files for testing, but advanced users who want to download real data files from HuggingFace and use them in testing can refer to [this](ADVANCED.md#using-data-from-huggingface).
-Please click [here](doc/quick-start/quick-start.md#running-transforms-on-windows) for guidance on how to run transforms in Windows.
-### Current list of transforms
-The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples/notebooks) folder.
+## Current list of transforms
+The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.
| Modules | Python-only | Ray | Spark | KFP on Ray |
@@ -122,62 +95,16 @@ The matrix below shows the the combination of modules and supported runtimes. Al
| [License Select Annotation](transforms/code/license_select/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Code profiler](transforms/code/code_profiler/README.md) | :white_check_mark: | :white_check_mark: | | |
+## Contributing
+Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules! Please read [this](CONTRIBUTING.md) for details.
-Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules!
-
-### Add your own transform
-
-At the core of the framework, is a data processing library, that provides a systematic way to implement the data processing modules. The library is python-based and enables the application of "transforms" to a one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
-Every parquet file follows a set [schema](transforms/code/code2parquet/python/README.md). A user can use one or more transforms (or modules) as discussed above to process their data.
-A transform can follow one of the two patterns: annotator or filter.
- -- **Annotator** An annotator transform adds information during the processing by adding one more columns to the parquet files. -The annotator design also allows a user to verify the results of the processing before the actual filtering of the data. - -- **Filter** A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. -A general purpose [SQL-based filter transform](transforms/universal/filter) enables a powerful mechanism for identifying columns and rows of interest for downstream processing. - -For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms). - -One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an [example transform](transforms/universal/noop) that can serve as a template to add new simple transforms. Follow the step by step [tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) to help you add your own new transform. - -For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive [overview document](data-processing-lib/doc/overview.md) alongside dedicated sections on [transforms](data-processing-lib/doc/transforms.md) and [runtimes](data-processing-lib/doc/transform-runtimes.md). - -Additionally, check out our [video tutorial](https://www.youtube.com/watch?v=0WUMG6HIgMg) for a visual, example-driven guide on adding custom modules. - - -## 💻 -> 🖥️☁️ From laptop to cluster -Data-prep-kit provides the flexibility to transition your projects from proof-of-concept (PoC) stage to full-scale production mode, offering all the necessary tools to run your data transformations at high volume. In this section, we enable you how to run your transforms at scale and how to automate them. - -### Scaling of Transforms - -To enable processing of large data volumes leveraging multi-mode clusters, [Ray](https://docs.ray.io/en/latest/index.html) -or [Spark](https://spark.apache.org) wrappers are provided, to readily scale out the Python implementations. - -A generalized workflow is shown [here](doc/data-processing.md). - -### Automation - -The toolkit also supports transform execution automation based on -[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP), -tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. There is an -automation to create a Kind cluster and deploy all required components on it. -The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html) -for creating and managing the Ray cluster and [KubeRay API server](https://github.com/ray-project/kuberay/tree/master/apiserver) -to interact with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several -[kfp components](kfp/kfp_ray_components) is used to simplify the pipeline implementation. - -A simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md) explains the pipeline creation and execution. -In addition, if you want to combine several transformers in a single pipeline, you can look at [multi-steps pipeline](kfp/doc/multi_transform_pipeline.md) - -When you finish working with the cluster, and want to clean up or destroy it. 
See the -[clean up the cluster](kfp/doc/setup.md#cleanup) - -### Run your first transform using command line options - -You can run transforms via docker image or using virtual environments. This [document](doc/quick-start/run-transform-venv.md) shows how to run a transform using virtual environment. You can follow this [document](doc/quick-start/run-transform-image.md) to run using docker image.
+## Get help and support
+Please feel free to connect with us using the [discussion](https://github.com/IBM/data-prep-kit/discussions) section.
-## Citations
+## Resources
+[Papers, talks, presentations and tutorials](resources.md).
+
+## Citation
If you use Data Prep Kit in your research, please cite our paper:
diff --git a/doc/quick-start/contribute-your-own-transform.md b/doc/quick-start/contribute-your-own-transform.md index ccf3067b36..78b14b2bd7 100644 --- a/doc/quick-start/contribute-your-own-transform.md +++ b/doc/quick-start/contribute-your-own-transform.md @@ -205,8 +205,8 @@ from .transform import *
**dpk_digest/runtime.py**
This file implements 3 classes, the first being TransformConfiguration. It defines two user defined methods that must be implemented by the developer for each transform:
-* add_input_params() is called by the framework to validate the presence of all required configuration parameters for this transform and specifies guidance to the user if any is missing
-* apply_input_params() is called by the framework to validate the values associated with the configuration parameter.
+* The add_input_params() method is called by the framework to define the set of command line parameters exposed by the runtime to configure the transforms. The runtime processes the command line parameters and makes them available to the transform instance initializer.
+* The apply_input_params() method is called by the framework to validate the values associated with the configuration parameters.
```python
# (C) Copyright IBM Corp. 2024.
diff --git a/doc/repo.md b/doc/repo.md index 6df6510c49..41945f831f 100644 --- a/doc/repo.md +++ b/doc/repo.md @@ -78,6 +78,47 @@ This might include things published to pypi or the docker registry.
Sub-directories are free to define these as empty/no-op targets, but generally are required to define them unless a parent directory does not recurse into the directory.
+### Build and deploy a dev release for integration testing (Recommended step for all transforms prior to merging the corresponding PR)
+
+1. Create your fork from the main repo or sync an existing fork with the main repo
+1. Clone the fork
+ ```shell
+ git clone git@github.com:<your_username>/data-prep-kit.git data-prep-kit-dev
+ cd data-prep-kit-dev
+ ```
+1. Create a new local branch from dev
+ ```shell
+ git checkout dev
+ git checkout -b "testing-$(date '+%Y-%m-%d')"
+ ```
+1. Merge changes from the remote branch (if more than one PR, repeat below for each PR). In the example below, replace '<remote_repo_url>' and '<remote_branch>' with the git URL and branch from each PR (e.g., PR1, PR2, ...)
+ ```shell
+ git remote add <remote_name> <remote_repo_url>
+ git fetch <remote_name>
+ git merge <remote_name>/<remote_branch>
+ ```
+1. Change to the transforms folder, clean any previous build, build a new wheel, and publish the wheel as a dev release to pypi. Follow these [instructions](https://packaging.python.org/en/latest/specifications/pypirc/#using-another-package-index) to set up your environment to be able to publish:
+ ```shell
+ cd transforms
+ rm -fr build dist data_prep_toolkit_transforms.egg-info
+ make build-pkg-dist
+ pip install twine
+ make publish-dist
+ ```
+1. 
**Note** - 'make publish-dist' will fail if a previous build with the same tag is already present on PyPI. In this case, add a 'build tag' and publish again. The 'build tag' is a number that immediately follows the distribution package version, separated by a dash `({distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl)`
+
+ ```shell
+ mv dist/data_prep_toolkit_transforms-1.0.1.dev1-py3-none-any.whl dist/data_prep_toolkit_transforms-1.0.1.dev1-1-py3-none-any.whl
+ ```
+ ```shell
+ make publish-dist
+ ```
+ **Note** - 'make publish-dist' will fail if the chosen 'build tag' already exists. In this case, consult the PyPI site to identify the latest build tag previously used and increment it by 1.
+
+1. When testing the new wheel in a notebook or a venv, make sure to use the --no-cache option: `pip install --no-cache data-prep-toolkit-transforms==1.0.1.dev1`
+
+
+ ## Developers Generally, developers will be working in a python project directory (e.g., data-processing-lib/python, transforms/universal/filter, etc.) diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000..67a4f022f8 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,13 @@ + 

# About Data Prep Kit Recipes

Welcome to cooking with Data Prep Kit. Here we share some of our most asked, searched, and shared Data Prep Kit recipes for processing unstructured and structured data for a plethora of use cases, such as RAG and fine-tuning, along with examples of KFP workflows.

 - [**Data Files**](./data-files/)
 - [**Introductory Recipe Notebook to get started**](notebooks/Run_your_first_transform_colab.ipynb)
 - [**Recipes for Processing Code and Language Data for Finetuning LLMs**](./notebooks/fine%20tuning/code/)
 - [**Recipe for building RAG system using pdf data**](./notebooks/rag/)
 - [**Recipe for building RAG system using html data**](./notebooks/rag-html-1/)
 - [**Recipe for curating customer service data for HAP**](./notebooks/hap/)
 - [**Recipe for curating customer service data for PII**](./notebooks/PII/)
 - [**KFP Pipeline Walkthrough**](kfp-pipelines/superworkflows)
diff --git a/examples/agentic/Planning_DPK_agent.ipynb b/examples/agentic/Planning_DPK_agent.ipynb new file mode 100644 index 0000000000..a022411a7f --- /dev/null +++ b/examples/agentic/Planning_DPK_agent.ipynb @@ -0,0 +1,276 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "

# data-prep-kit planning agent

\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -qq -r requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import HTML\n", + "task = \"Process the provided PDF dataset to identify and extract only documents that don't contain inappropriate language. Remove the duplications.\"\n", + "HTML(f\"
TASK: {task}
\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import os\n", + "\n", + "from llm_utils.logging import prep_loggers\n", + "os.environ[\"LLM_LOG_PATH\"] = \"./logs/llm_log.txt\"\n", + "prep_loggers(\"llm=INFO\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The tools in DPK agents are the transforms.\n", + "# Each tool is described as json dictionary with its name, description, input parameters, and how to import it.\n", + "# The list of the tools exists in llm_utils/tools.py file.\n", + "from llm_utils.dpk.tools import *\n", + "print(tools_json)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This is an example of a plan for a simple task. It is possed to the prompt to enhance the planning results.\n", + "from llm_utils.dpk.examples import *\n", + "print(example_task)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This is a string that contains several constraints on the order of the tools in the plan.\n", + "# It is a free text and can be found in llm_utils/constraints.py file.\n", + "from llm_utils.dpk.constraints import *\n", + "print(constraints)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define LLM models\n", + "\n", + "We have have tested our project with the following LLM execution frameworks: [Watsonx](https://www.ibm.com/watsonx), [Replicate](https://replicate.com/), and locally running [Ollama](https://ollama.com/).\n", + "To use one of the frameworks uncomment its part in the cell below while commenting out the other frameworks.\n", + "Please note that the notebooks have been tested with specific Large Language Models (LLMs) that are mentioned in the cell, and due to the inherent nature of LLMs, using a different model may not produce the same results.\n", + "\n", + "- To use Replicate:\n", + " - Obtain Replicate API token\n", + " - Store the following value in the `.env` file located in your project directory:\n", + " ```\n", + " REPLICATE_API_TOKEN=\n", + " ```\n", + "- To use Ollama: \n", + " - Download [Ollama](https://ollama.com/download).\n", + " - Download one of the supported [models](https://ollama.com/search). 
We tested with the `llama3.3` model.\n", " - Update the `model_ollama_*` names if needed.\n", "- To use Watsonx:\n", " - Register for Watsonx\n", " - Obtain its API key\n", " - Store the following values in the `.env` file located in your project directory:\n", " ```\n", " WATSONX_URL=\n", " WATSON_PROJECT_ID=\n", " WATSONX_APIKEY=\n", " ```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from llm_utils.models import getChatLLM\n", "from dotenv import dotenv_values\n", "\n", "# watsonx part \n", "# config = dotenv_values(\"./.env\")\n", "# model_watsonx_id1 = \"ibm-granite/granite-3.1-8b-instruct\"\n", "# model_watsonx_id2 = \"meta-llama/llama-3-1-70b-instruct\"\n", "# model_watsonx_id3 = \"meta-llama/llama-3-3-70b-instruct\"\n", "# model_watsonx_id4 = \"ibm/granite-34b-code-instruct\"\n", "\n", "# llm_plan = getChatLLM(\"watsonx\", model_watsonx_id2, config)\n", "# llm_judge = getChatLLM(\"watsonx\", model_watsonx_id2, config)\n", "# llm_generate = getChatLLM(\"watsonx\", model_watsonx_id2, config)\n", "\n", "# ollama part\n", "# model_ollama = \"llama3.3\"\n", "# llm_plan = getChatLLM(\"ollama\", model_ollama)\n", "# llm_judge = getChatLLM(\"ollama\", model_ollama)\n", "# llm_generate = getChatLLM(\"ollama\", model_ollama)\n", "\n", "# replicate part\n", "config = dotenv_values(\"./.env\")\n", "# You can use different llm models\n", "model_replicate_id1 = \"meta/meta-llama-3-70b-instruct\"\n", "llm_plan = getChatLLM(\"replicate\", model_replicate_id1, config)\n", "llm_judge = getChatLLM(\"replicate\", model_replicate_id1, config)\n", "llm_generate = getChatLLM(\"replicate\", model_replicate_id1, config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langgraph.graph import StateGraph, END\n", "from llm_utils.agent_helpers import *\n", "from llm_utils.prompts.planner_prompt import *\n", "from llm_utils.prompts.judge_prompt import *\n", "from llm_utils.prompts.generate_prompt import *\n", "from llm_utils.dpk.tools import *\n", "from llm_utils.dpk.examples import *\n", "from llm_utils.dpk.constraints import *\n", "from functools import partial\n", "\n", "\n", "# Create the graph\n", "workflow = StateGraph(State)\n", "\n", "# Add nodes\n", "workflow.add_node(\"planner\", partial(planner, prompt=planner_prompt_str, tools=tools_json, example=example_task1, context=constraints, llm=llm_plan))\n", "workflow.add_node(\"judge\", partial(judge, prompt=judge_prompt_str_dpk, tools=tools_json, context=constraints, llm=llm_judge))\n", "workflow.add_node(\"user_review\", get_user_review)\n", "workflow.add_node(\"code generator\", partial(generator, prompt=generate_prompt_str_with_example, llm=llm_generate))\n", "workflow.add_node(\"code validator\", code_validator_noop)\n", "\n", "# Add edges\n", "workflow.set_entry_point(\"planner\")\n", "workflow.add_edge(\"code generator\", \"code validator\")\n", "workflow.add_edge(\"code validator\", END)\n", "\n", "# Add conditional edges from judge\n", "workflow.add_conditional_edges(\n", " \"judge\",\n", " is_plan_OK,\n", " {\n", " False: \"planner\", # If the plan needs revision, go back to the planner\n", " True: \"user_review\" # If the plan is good, proceed to user review\n", " }\n", ")\n", "\n", "# Add conditional edges from planner\n", 
"workflow.add_conditional_edges(\n", + " \"planner\",\n", + " need_judge,\n", + " {\n", + " True: \"judge\", # If needs revision, go back to planner\n", + " False: \"user_review\" # If plan is good, proceed to user review\n", + " }\n", + ")\n", + "\n", + "workflow.add_conditional_edges(\n", + " \"user_review\",\n", + " is_user_review_OK,\n", + " {\n", + " False: \"planner\", # If needs revision, go back to planner\n", + " True: \"code generator\",\n", + " }\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "app = workflow.compile()\n", + "\n", + "from IPython.display import Image, display\n", + "\n", + "#display(Image(app.get_graph(xray=True).draw_mermaid_png()))\n", + "display(Image(app.get_graph().draw_mermaid_png()))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run the graph\n", + "initial_state = {\n", + " \"task\": task,\n", + " \"context\": \"\",\n", + " \"plan\": [\"still no plan\"],\n", + " \"planning_attempts\": 0,\n", + " \"feedback\": \"Still no review\",\n", + " \"needs_revision\": \"\",\n", + " \"need_judge\": True,\n", + "}\n", + "\n", + "state = initial_state\n", + "\n", + "for output in app.stream(state):\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/agentic/README.md b/examples/agentic/README.md new file mode 100644 index 0000000000..9c3776fc89 --- /dev/null +++ b/examples/agentic/README.md @@ -0,0 +1,82 @@ +# Agentic Data Agent Experiments + +## Table of Contents +1. [Project Overview](#project-overview) +2. [Installation Guide](#installation-guide) +3. [Usage](#usage) + + +## Project Overview + +This project focuses on automating the integration of Large Language Models (LLM) based workflow in the data access. +It contains the following notebooks: + +- [Planning_DPK_agent.ipynb](Planning_DPK_agent.ipynb): Planner for Data-Prep-Kit tasks with code generation. This notebook enables the data engineer (or data user) to efficiently build and run pipelines that performs required tasks defined by a natural language. It includes a langgraph LLM agent that has several components like planner, judge, and code generator. This agent can generate as a result a python code of a DPK pipeline which can be run by the user from command line. + +- [dpk_as_tools.ipynb](dpk_as_tools.ipynb): Use DPK transforms defined as [langchain tools](https://python.langchain.com/v0.1/docs/modules/tools/) or [llama-index tools](https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/tools/). +This notebook leverages LLM to generate a DPK transforms pipeline based on natural language inputs. +The LLM processes the provided input and produces the pipeline in the correct format, making it ready for execution. +Subsequently, each transform in the pipeline is invoked by calling its lang-chain or llama-index implementations. + + +## Before you begin + +Ensure that you have python 3.11 + +## Installation Guide + +1. 
Clone the repository:
```bash
git clone git@github.com:IBM/data-prep-kit.git
cd data-prep-kit/examples/agentic
```

2. Create a Python virtual environment:
```bash
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install jupyter
pip install ipython && pip install ipykernel
pip install -r requirements.txt
```

3. Configure access to the LLM:

 We have tested our project with the following LLM execution frameworks:
 - [Replicate](https://replicate.com/)
 - [Watsonx](https://www.ibm.com/watsonx)
 - locally running [Ollama](https://ollama.com/) (on mac)

 3.1 Setup instructions for each framework:

 The notebook cell that defines the models contains all frameworks, with only the `replicate` part uncommented. To use one of the other frameworks, uncomment its part in the cell while commenting out the other frameworks. Please note that the frameworks have been tested with a specific LLM and, due to the inherent nature of LLMs, using a different model may not produce the same results.

 - Replicate:
 - Obtain a Replicate API token
 - Store the following value in the `.env` file located in your project directory:
 ```
 REPLICATE_API_TOKEN=
 ```
 - Ollama:
 - Download [Ollama](https://ollama.com/download).
 - Download one of the supported [models](https://ollama.com/search). We tested with the `llama3.3` model.
 - Update the `model_ollama_*` names in the relevant cells if needed.
 - Watsonx:
 - Register for Watsonx
 - Obtain its API key
 - Store the following values in the `.env` file located in your project directory:
 ```
 WATSONX_URL=
 WATSON_PROJECT_ID=
 WATSONX_APIKEY=
 ```

## Usage

To launch the notebooks, execute the following command in your terminal:
```bash
jupyter notebook
```

Once the Jupyter interface is loaded, select the desired notebook to begin working with it. diff --git a/examples/agentic/dpk-requirements.txt b/examples/agentic/dpk-requirements.txt new file mode 100644 index 0000000000..2540fd9102 --- /dev/null +++ b/examples/agentic/dpk-requirements.txt @@ -0,0 +1,3 @@ +data-prep-toolkit==0.2.3 +data-prep-toolkit-transforms[all,ray]==1.0.0a2 +deepsearch-toolkit diff --git a/examples/agentic/dpk_as_tools.ipynb b/examples/agentic/dpk_as_tools.ipynb new file mode 100644 index 0000000000..6af0811ea2 --- /dev/null +++ b/examples/agentic/dpk_as_tools.ipynb @@ -0,0 +1,689 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Demonstrate Data-Prep-kit transforms as LangChain or llama-index tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This notebook is based on the Data Prep Kit Demo\n", "link: https://github.com/IBM/data-prep-kit/blob/v0.2.3/examples/notebooks/intro/dpk_intro_1_ray.ipynb\n", "\n", "![](https://raw.githubusercontent.com/IBM/data-prep-kit/v0.2.3/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install dependencies. This can take some time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qq -r requirements.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qq -r dpk-requirements.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! cd llm_utils/dpk/llama_index_tools/llama_index_tools_dpk && pip install -qq -e ." 
+ ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qq llama-index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use langchain or llama-index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set to True to define DPK transforms as langchain tools; otherwise they will be defined as llama-index tools\n", "define_dpk_as_langchain_tools=False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the input task" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set to True to execute the transforms on the local Ray cluster; otherwise, the Python implementation is used.\n", "run_with_local_ray=False" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ray_text=\"\"\n", "if run_with_local_ray:\n", " ray_text=\"on a local ray cluster \"\n", "\n", "task=f\"Execute pdf2parquet, doc_chunk, doc_id, ededup, text_encoder transforms {ray_text} one after the other where the input to a transform is the output of the previous transform run.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set input/output paths" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "import os\n", "cwd = os.getcwd()\n", "\n", "output_base_path = f\"{cwd}/output\"\n", "\n", "input_folder = f\"{cwd}/test-data/input/\"\n", "output_folder = f\"{output_base_path}/final_1/\"\n", "\n", "shutil.rmtree(output_base_path, ignore_errors=True)\n", "print (f\"✅ Cleared {output_base_path} directory\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set transforms parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "def prepare_params(params: dict):\n", " params_json=json.dumps(params)\n", " # trim the enclosing curly braces so the params can be embedded in the prompt text\n", " return params_json[1:-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from data_processing.utils import GB, ParamsUtils\n", "\n", "pdf2parquet_params_dict={\"data_files_to_use\": \"['.pdf']\", \"input_folder\":input_folder, \"pdf2parquet_contents_type\": \"application/json\"}\n", "doc_chunk_params_dict={}\n", "doc_id_params_dict={\"doc_id_hash_column\": \"chunk_hash\", \"doc_id_int_column\": \"chunk_id\"}\n", "ededup_params_dict={\"ededup_doc_column\": \"contents\", \"ededup_doc_id_column\": \"chunk_hash\"}\n", "text_encoder_params_dict={\"text_encoder_model_name\": \"sentence-transformers/all-MiniLM-L6-v2\", \"output_folder\":output_folder}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if run_with_local_ray:\n", " worker_options_str=ParamsUtils.convert_to_ast({\"num_cpus\" : 0.8, \"memory\": 2 * GB})\n", " ededup_params_dict=ededup_params_dict|{\"ededup_hash_cpu\": 0.5, \n", " \"ededup_num_hashes\": 2,\n", " \"runtime_worker_options\": worker_options_str,\n", " \"runtime_num_workers\": 2}\n", " \n", "pdf2parquet_params=prepare_params(pdf2parquet_params_dict)\n", "doc_chunk_params=prepare_params(doc_chunk_params_dict)\n", 
"doc_id_params=prepare_params(doc_id_params_dict)\n", + "ededup_params=prepare_params(ededup_params_dict)\n", + "text_encoder_params=prepare_params(text_encoder_params_dict)\n", + "\n", + "params=f\"for pdf2parquet params use {pdf2parquet_params}. for doc_id use params {doc_id_params}. for ededup use params {ededup_params}. for text_encoder use params {text_encoder_params}\"\n", + "input=f\"{task} {params}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Print input task" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import HTML\n", + "\n", + "print_task=f\"
TASK: {task}
\"\n", + "print_pdf2parquet=f\"
PDF2PARQUET Params: {pdf2parquet_params}
\"\n", + "print_doc_chunks=f\"
DOC CHUNKS Params: {doc_chunk_params}
\"\n", + "print_doc_id_params=f\"
DOC_ID Params: {doc_id_params}
\"\n", + "print_ededup_params=f\"
EDEDUP Params: {ededup_params}
\"\n", + "print_text_encoder_params=f\"
TEXT_ENCODER Params: {text_encoder_params}
\"\n", + "\n", + "HTML(f\"{print_task}{print_pdf2parquet}{print_doc_chunks}{print_doc_id_params}{print_ededup_params}{print_text_encoder_params}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define LLM models and tools\n", + "\n", + "We have have tested our project with the following LLM execution frameworks: [Watsonx](https://www.ibm.com/watsonx), [Replicate](https://replicate.com/), and locally running [Ollama](https://ollama.com/).\n", + "To use one of the frameworks uncomment its part in the cell below while commenting out the other frameworks.\n", + "Please note that the notebooks have been tested with specific Large Language Models (LLMs) that are mentioned in the cell, and due to the inherent nature of LLMs, using a different model may not produce the same results.\n", + "\n", + "- To use Replicate:\n", + " - Obtain Replicate API token\n", + " - Store the following value in the `.env` file located in your project directory:\n", + " ```\n", + " REPLICATE_API_TOKEN=\n", + " ```\n", + "- To use Ollama: \n", + " - Download [Ollama](https://ollama.com/download).\n", + " - Download one of the supported [models](https://ollama.com/search). We tested with `llama3.3` model.\n", + " - update the `model_ollama_*` names if needed.\n", + "- To use Watsonx:\n", + " - Register for Watsonx\n", + " - Obtain its API key\n", + " - Store the following values in the `.env` file located in your project directory:\n", + " ```\n", + " WATSONX_URL=\n", + " WATSON_PROJECT_ID=\n", + " WATSONX_APIKEY=\n", + " ```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "from dotenv import dotenv_values\n", + "config = dotenv_values(\".env\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from llm_utils.logging import prep_loggers\n", + "\n", + "os.environ[\"LLM_LOG_PATH\"] = \"./logs/llm_log.txt\"\n", + "os.environ[\"TOOL_CALLING_LOG_PATH\"] = \"./logs/tool_log.txt\"\n", + "prep_loggers(\"llm=INFO,tool_calling=INFO\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "### Define the model\n", + "\n", + "config = dotenv_values(\"./.env\")\n", + "# watsonx part \n", + "\n", + "# model_watsonx_id = \"meta-llama/llama-3-1-70b-instruct\"\n", + "# llm = getChatLLM(\"watsonx\", model_watsonx_id, config)\n", + "\n", + "# # ollama part\n", + "# model_ollama = \"llama3.3\"\n", + "# llm = getChatLLM(\"ollama\", model_ollama)\n", + "\n", + "# replicate part\n", + "# You can use different llm models\n", + "model_replicate = \"meta/meta-llama-3-70b-instruct\"\n", + "llm = getChatLLM(\"replicate\", model_replicate, config)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from llm_utils.models import getChatLLM\n", + "from dotenv import dotenv_values\n", + "\n", + "# replicate part\n", + "config = dotenv_values(\"./.env\")\n", + "\n", + "model_replicate_id1 = \"meta/meta-llama-3-70b-instruct\"\n", + "llm = getChatLLM(\"replicate\", model_replicate_id1, config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List DPK transforms" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if define_dpk_as_langchain_tools:\n", + " from 
llm_utils.dpk.langchain_tools.agent_toolkit.toolkit import DataPrepKitToolkit\n", " \n", " toolkit = DataPrepKitToolkit() \n", " tools = toolkit.get_tools()\n", " print(\"-- DPK tools: --\")\n", " print(tools)\n", "else:\n", " from llama_index_dpk.tools.dpk.base import DPKTransformsToolSpec\n", " \n", " dpk_spec = DPKTransformsToolSpec()\n", " tools = dpk_spec.to_tool_list()\n", " print(\"-- DPK tools: --\")\n", " for t in tools:\n", " print(t.metadata.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if define_dpk_as_langchain_tools:\n", " from langchain.tools import Tool\n", " from typing import Union, List\n", " \n", " def find_tool_by_name(tools: List[Tool], tool_name: str) -> Tool:\n", " for tool in tools:\n", " if tool.name == tool_name:\n", " return tool\n", " raise ValueError(f\"Tool with name {tool_name} not found\")\n", "else:\n", " from llama_index.core.tools import FunctionTool\n", " from typing import Union, List\n", " \n", " def find_tool_by_name(tools: List[FunctionTool], tool_name: str) -> FunctionTool:\n", " for tool in tools:\n", " if tool.metadata.name == tool_name:\n", " return tool\n", " raise ValueError(f\"Tool with name {tool_name} not found\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if define_dpk_as_langchain_tools:\n", " from langchain.tools.render import render_text_description\n", " \n", " tools_str = render_text_description(tools)\n", " tool_names = \", \".join([t.name for t in tools])\n", "else:\n", " tools_str = '\\n'.join(dpk_spec.spec_functions)\n", " tool_names = \", \".join(dpk_spec.spec_functions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the Prompt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.prompts import PromptTemplate\n", "from langchain.tools.render import render_text_description\n", "from langchain_core.prompts import ChatPromptTemplate\n", "\n", "\n", "prompt_template = ChatPromptTemplate.from_template( \"\"\"Answer the following questions as best you can. You have access to the following tools:\n", "\n", " {tools}\n", " \n", " Use the following format:\n", " \n", " Question: the input question you must answer\n", " Thought: you should always think about what to do\n", " Action: the action to take, should be one of [{tool_names}]\n", " Action Input: the input to the action\n", " Observation: the result of the action\n", " ... (this Thought/Action/Action Input/Observation can repeat N times)\n", " Thought: I now know the final answer\n", " Final Answer: the final answer to the original input question\n", "\n", " Final Answer or Action should appear in the answer but not both.\n", " Follow the exact Action Input format provided in the examples when crafting your response.\n", " Avoid numbering the output.\n", "\n", " Here's an example.\n", "\n", " For example, if the required task was to execute the ededup and doc_id transforms one after the other. \n", " The output directory of a transform is the input for the next transform in the transform order. \n", " for ededup params use: 'input_folder':'/home/user/input/ededup'\n", " for doc_id params use: 'output_folder':'/home/user/output/final'. 
\n", + " The output should be the following:\n", + " \n", + " Thought: I need to execute the ededup and doc_id one after the other.\n", + " \n", + " Action: ededup\n", + " Action Input: \"input_folder\":\"/home/user/input/ededup\", \"output_folder\":\"/home/user/output/ededup\"\n", + " Observation: The output of the ededup transform is stored in \"/home/user/output/ededup\".\n", + "\n", + " Action: doc_id\n", + " Action Input: \"input_folder\":\"/home/user/output/ededup\", \"output_folder\":\"/home/user/output/final\"\n", + " Observation: The output of the doc_id transform is stored in \"/home/eres/output/final\".\n", + "\n", + " Here's another example: \n", + "\n", + " If the required task was to execute ededup , doc_id transforms on a local ray cluster one after the other. \n", + " The output directory of a transform is the input for the next transform in the transform order. \n", + " for ededup params use: 'input_folder':'/home/user/input/ededup'\n", + " for doc_id params use : 'output_folder':'/home/user/output/final'\n", + " The output should be the following:\n", + " \n", + " Thought: I need to execute the ededup and doc_id one after the other.\n", + " \n", + " Action: ededup\n", + " Action Input: \"runtime_type\": \"ray\", \"run_locally\": \"True\", \"input_folder\":\"/home/user/input/ededup\", \"output_folder\":\"/home/user/output/ededup\"\n", + " Observation: The output of the ededup transform is stored in \"/home/user/output/ededup\".\n", + "\n", + " Action: doc_id\n", + " Action Input: \"runtime_type\": \"ray\", \"run_locally\": \"True\", \"input_folder\":\"/home/user/output/ededup\", \"output_folder\":\"/home/user/output/final\"\n", + " Observation: The output of the doc_id transform is stored in \"/home/user/output/final\".\n", + "\n", + " \n", + " Begin!\n", + " \n", + " Question: {input}\n", + " \"\"\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Invoke the agent to create the plan" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "print(input)\n", + "\n", + "agent = prompt_template | llm \n", + "\n", + "agent_step = \"\"\n", + "agent_step = agent.invoke(\n", + " {\n", + " \"input\": input,\n", + " \"tool_names\": tool_names,\n", + " \"tools\": tools_str,\n", + " }\n", + " )\n", + " \n", + "print(agent_step.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parse the plan" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "content = agent_step.content\n", + "if type(content) == list:\n", + " content = ''.join(content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "\n", + "regex = (r\"Action\\s*\\d*\\s*:[\\s]*(.*?)[\\s]*Action\\s*\\d*\\s*Input\\s*\\d*\\s*:[\\s]*(.*)\")\n", + "matches = re.findall(regex, content)\n", + "\n", + "print(\"LLM result contain the following transforms:\\n\")\n", + "for match in matches:\n", + " print(f\"TRANSFORM NAME {match[0]}\")\n", + " print(f\"TRANSFORM PARAMS {match[1]}\")\n", + " print(\"--------------------------------------\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from typing import Any\n", + "\n", + "def load_from_json(js: str) -> dict[str, Any]:\n", + " try:\n", + " return json.loads(js)\n", + " 
except Exception as e:\n", " print(f\"Failed to load parameters {js} with error {e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Execute the transforms by calling their tool definitions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_tool(match) -> str:\n", " def contains_parquet_files(dir_path):\n", " return any(file.endswith(\".parquet\") for file in os.listdir(dir_path) if os.path.isfile(os.path.join(dir_path, file)))\n", "\n", " tool_name = match[0]\n", " tool_to_use = find_tool_by_name(tools, tool_name)\n", " tool_input=\"{\"+match[1]+\"}\"\n", " tool_input_dict = load_from_json(tool_input)\n", " print(\"=======================================================\")\n", " print(f\"🏃🏼 RUNNING {tool_name} with params: {tool_input_dict}\")\n", " print(\"=======================================================\")\n", " if define_dpk_as_langchain_tools:\n", " tool_result = tool_to_use.run(tool_input_dict)\n", " else:\n", " tool_result = tool_to_use.call(**tool_input_dict)\n", " if not contains_parquet_files(tool_input_dict[\"output_folder\"]):\n", " out_dir=tool_input_dict[\"output_folder\"]\n", " raise Exception(f\"The {out_dir} directory is unexpectedly empty, indicating the job failed.\")\n", " print(f\"✅ {tool_result}\")\n", " \n", " return tool_result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "error=False\n", "for match in matches:\n", " try:\n", " tool_result = run_tool(match)\n", " time.sleep(10)\n", " except Exception as e:\n", " error=True\n", " print(\"❌ Error: \" + str(e))\n", " break\n", "\n", "if not error:\n", " print(\"=================================================\")\n", " print(\"✅ Transforms execution completed successfully\")\n", " print(\"=================================================\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect Generated Output File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will see a column called embeddings added at the end. This is the text content converted into vectors or embeddings. 
\n", + "We used the model sentence-transformers/all-MiniLM-L6-v2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "import pandas as pd\n", + "\n", + "def read_parquet_files_as_df (parquet_dir):\n", + " parquet_files = glob.glob(f'{parquet_dir}/*.parquet')\n", + "\n", + " # read each parquet file into a DataFrame and store in a list\n", + " dfs = [pd.read_parquet (f) for f in parquet_files]\n", + "\n", + " # Concatenate all DataFrames into a single DataFrame\n", + " data_df = pd.concat(dfs, ignore_index=True)\n", + " return data_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print the last transform output\n", + "last_transform=matches[-1]\n", + "tool_input=\"{\"+match[1]+\"}\"\n", + "tool_input_dict = load_from_json(tool_input)\n", + "dir=tool_input_dict[\"output_folder\"]\n", + "print(dir)\n", + "output_df = read_parquet_files_as_df(dir)\n", + "\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(2)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "venv-newone1", + "language": "python", + "name": "venv-newone1" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/agentic/helpers.py b/examples/agentic/helpers.py new file mode 100644 index 0000000000..b252e267b6 --- /dev/null +++ b/examples/agentic/helpers.py @@ -0,0 +1,9 @@ +import re + + +def parse_output(message): + pattern = r"output_folder\s+(?:{)?(.*?)(?:}|\.|$)" + match = re.search(pattern, message) + if match: + return match.group(1) + return None \ No newline at end of file diff --git a/examples/agentic/llm_utils/__init__.py b/examples/agentic/llm_utils/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/agent_helpers.py b/examples/agentic/llm_utils/agent_helpers.py new file mode 100644 index 0000000000..132c1dea94 --- /dev/null +++ b/examples/agentic/llm_utils/agent_helpers.py @@ -0,0 +1,172 @@ +from pathlib import Path +import requests +import json +from typing import List, TypedDict +from langchain_core.prompts import ChatPromptTemplate +from .visualize_plan import visualize_plan + + +# Define the state +class State(TypedDict): + plan: List[str] + task: str + context: str + planning_attempts: int # track planning iterations + feedback: str + need_judge: bool + needs_revision: bool + +# step_template_dict = {"step_name": "Step #", "tool_name": "tool_name", "tool_input": [{"param_name": "param_value"}], "step_ev":"Ev1"} +step_template_dict = {"step_name": "Step #", "tool_name": "tool_name", "tool_input": [{"param_name": "param_value"}], "import": "import line of the tool", "step_ev":"Ev1"} +step_template = json.dumps(step_template_dict) + +# url = "https://raw.githubusercontent.com/roytman/test_repo/refs/heads/main/instructlab.md" +# response = requests.get(url) +# md_content = response.text + + +# Define the planner node +def planner(state: State, llm, prompt: str, tools: str, example: str, context: str) -> State: + planner_prompt = ChatPromptTemplate.from_template(prompt) + planner_chain = planner_prompt | llm + output = planner_chain.invoke({ + "task": state["task"], + 
"tool_not_implemented": "tool_not_implemented", + "tools": tools, + "example_task": example, + "context": context, + "previous_plan": '\n'.join(state['plan']), + "feedback": state['feedback'] + }) + output.content = "".join(output.content) + state['plan'] = output.content.split('\n') + state['current_step'] = 0 + print(f"\033[36m\033[1m\nPlanner: suggested plan is:\033[0m") + print(output.content) + visualize_plan(output.content) + return state + + +# Define the edge conditions +def is_plan_complete(state: State) -> bool: + return state['current_step'] >= len(state['plan']) + +def generate_run_file(llm, plan, prompt, output_file) -> State: + generate_chain = ChatPromptTemplate.from_template(prompt) | llm + evaluation = generate_chain.invoke({ + "step_template": step_template, + "plan": plan + }) + # Split the evaluation into feedback and decision + evaluation.content = "".join(evaluation.content) + eval_parts = evaluation.content.split('\n') + code = extract_python_code(eval_parts) + # format the content + import black + formatted_code = black.format_str(code, mode=black.FileMode()) + save_python_file(formatted_code, output_file) + +def judge(state: State, llm, prompt: str, tools: str, context: str) -> State: + # Get judge's evaluation + judge_chain = ChatPromptTemplate.from_template(prompt) | llm + evaluation = judge_chain.invoke({ + "task": state['task'], + "plan": '\n'.join(state['plan']), + "context": context, + "tools": tools, + }) + # Split the evaluation into feedback and decision + evaluation.content = "".join(evaluation.content) + eval_parts = evaluation.content.split('\n') + decision_line = next((line for line in eval_parts if 'NEEDS_REVISION:' in line), '') + needs_revision = 'yes' in decision_line.lower() + + eval_parts = evaluation.content.splitlines() + decision_line = "" + filtered_lines = [] + for line in eval_parts: + if 'NEEDS_REVISION:' in line: + decision_line = line + else: + filtered_lines.append(line) + + # Store results in state + print(f"\033[36m\033[1m\nJudge: review:\033[0m") + print(evaluation.content) + state['feedback'] = '\n'.join(filtered_lines) + state['needs_revision'] = needs_revision + state['planning_attempts'] = state['planning_attempts'] + 1 + return state + +def is_plan_OK(state: State) -> bool: + if state["planning_attempts"] >= 3: + return True + return not state['needs_revision'] + +def need_judge(state: State) -> bool: + return state["need_judge"] + +def is_user_review_OK(state: State) -> bool: + if state["feedback"] in ["", "OK", "okay"]: + print("The planning is done") + return True + return False + +def user_review(state: State) -> bool: + state['planning_attempts'] = 0 + return state + +# User review function +def get_user_review(state: State) -> State: + new_state = state.copy() + feedback = input("\nPlease review the plan and provide feedback (or print 'okay', 'OK' or just Enter to continue): ") + new_state['feedback'] = feedback + new_state['planning_attempts'] = 0 + new_state['need_judge'] = False + + return new_state + +def get_steps(plan): + json_steps = [] + for json_str in plan: + if json_str.strip(): # Skip empty lines + try: + json_obj = json.loads(json_str) + json_steps.append(json_obj) + except json.JSONDecodeError as e: + print(f"Skip line") + return json_steps + +# Define the generator node (simplified for this example) +def generator(state: State, llm, prompt: str, file_name: str="llm_plan_generated.py") -> State: + steps = get_steps(state["plan"]) + llm_output = generate_run_file(llm, steps, prompt, file_name) + return state + 
+def extract_python_code(llm_output: list) -> str: + code = [] + in_code_block = False + + for line in llm_output: + if line.strip() == '```python': + in_code_block = True + continue + elif line.strip() == '```': + in_code_block = False + continue + + if in_code_block: + code.append(line) + + return '\n'.join(code) + +def save_python_file(code: str, filename: str): + try: + with open(filename, 'w') as f: + f.write(code) + print(f"Successfully saved code to {filename}") + except Exception as e: + print(f"Error saving file: {e}") + +def code_validator_noop(state: State) -> State: + return state diff --git a/examples/agentic/llm_utils/callbacks.py b/examples/agentic/llm_utils/callbacks.py new file mode 100644 index 0000000000..acedca48f7 --- /dev/null +++ b/examples/agentic/llm_utils/callbacks.py @@ -0,0 +1,67 @@ +# from https://github.ibm.com/mc-connectors/connector/blob/main/gin/common/ai_platforms/callbacks.py + +from typing import Dict, Any, List + +import logging + +from langchain.callbacks.base import BaseCallbackHandler +from langchain.schema import LLMResult + +from llm_utils.logging import Logging + + +class LoggingCallbackHandler(BaseCallbackHandler): + """ + Callbacks for printing LLM prompt and response. + """ + + def on_llm_start( + self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any + ) -> Any: + """Run when LLM starts running.""" + + llm_log = logging.getLogger(Logging.LLM) + for prompt in prompts: + llm_log.info("***LLM prompt***\n%s\n", prompt) + for handler in llm_log.handlers: + handler.flush() + + def on_llm_end(self, response: LLMResult, **kwargs: Any) -> Any: + """Run when LLM ends running.""" + llm_log = logging.getLogger(Logging.LLM) + llm_log.info("***LLM Response:***\n%s\n", response.generations[0][0].text) + for handler in llm_log.handlers: + handler.flush() + + def on_llm_error(self, error, *, run_id, parent_run_id=None, **kwargs) -> Any: + """Run when LLM returns error.""" + llm_log = logging.getLogger(Logging.LLM) + llm_log.error(f"***LLM Error:***\n{error}\n") + for handler in llm_log.handlers: + handler.flush() + + def on_tool_start( + self, + serialized, + input_str, + *, + run_id, + parent_run_id=None, + tags=None, + metadata=None, + inputs=None, + **kwargs, + ): + """Run when tool starts running""" + tool_log = logging.getLogger(Logging.TOOL_CALLING) + tool_log.info(f"***Tool start***\n{serialized}\n{input_str=}\n") + + def on_tool_end(self, output, *, run_id, parent_run_id=None, **kwargs) -> Any: + """Run when tool ends running.""" + tool_log = logging.getLogger(Logging.TOOL_CALLING) + tool_log.info(f"***Tool Response***\n{output}\n") + + def on_tool_error(self, error, *, run_id, parent_run_id = None, **kwargs) -> Any: + """Run when tool returns error.""" + tool_log = logging.getLogger(Logging.TOOL_CALLING) + tool_log.error(f"***Tool Error:***\n{error}\n") diff --git a/examples/agentic/llm_utils/dpk/README.md b/examples/agentic/llm_utils/dpk/README.md new file mode 100644 index 0000000000..589c38e61c --- /dev/null +++ b/examples/agentic/llm_utils/dpk/README.md @@ -0,0 +1,14 @@ +This directory contains implementation of DPK transforms as [langchain](https://python.langchain.com/v0.1/docs/modules/tools/) or [llama-index](https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/tools/) tools. 
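+
+For instance, a minimal sketch of invoking one of the langchain tools directly (assuming the DPK transform packages are installed; the folder names below are placeholders):
+
+```python
+from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform
+
+# Runs the filter transform with the default local data access and Python runtime.
+tool = FilterTransform()
+result = tool.run({"input_folder": "parquet-input", "output_folder": "parquet-output"})
+print(result)
+```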
+
+* **[`langchain_tools`](./langchain_tools):** contains code that defines DPK transforms as langchain tools, similar to the tools defined [here](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools).
+* **[`llama_index_tools`](./llama_index_tools):** contains code that defines DPK transforms as llama-index tools, similar to the tools defined [here](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/tools).
+* **[`dpk_common.py`](./dpk_common.py):** contains definitions used by both of the implementations above.
+
+For example usage, please look at the [`dpk_as_tools.ipynb`](../../dpk_as_tools.ipynb) notebook.
+
+
+In addition, this directory contains files that define the context for the [`DPK agent`](../../Planning_DPK_agent.ipynb):
+
+* **[`tools.py`](./tools.py):** JSON dictionaries that describe the tools passed to the agent. Each dictionary describes a DPK transform.
+* **[`examples.py`](./examples.py):** Examples of tasks and their matching plans. Used in the planner prompt to get more accurate results.
+* **[`constraints.py`](./constraints.py):** Constraints that the generated plan or pipeline should satisfy.
\ No newline at end of file
diff --git a/examples/agentic/llm_utils/dpk/__init__.py b/examples/agentic/llm_utils/dpk/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/agentic/llm_utils/dpk/constraints.py b/examples/agentic/llm_utils/dpk/constraints.py
new file mode 100644
index 0000000000..45531a75e6
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/constraints.py
@@ -0,0 +1,9 @@
+# constraints = "constraint: If a plan has 'exact_dedup' and 'fuzzy_dedup' steps then 'exact_dedup' must run before 'fuzzy_dedup'.\n"
+constraints = ""
+constraints = constraints + "constraint: If the 'exact_dedup' tool is needed then it must run as the first step or immediately after 'Pdf2Parquet'.\n"
+constraints = constraints + "constraint: If the 'tokenization' transform is needed then it must run as the last transform of the plan. Otherwise, it should not be part of the plan.\n"
+constraints = constraints + "constraint: Run the 'Pdf2Parquet' tool as the first step if the input is PDF files.\n"
+
+# constraints = constraints + "constraint: 'document_id' tool must run before 'fuzzy_dedup'.\n"
+# constraints = constraints + "constraint: If a plan has a 'fuzzy_dedup' tool then it must run as early as possible.\n"
+# constraints = constraints + "constraint: If a plan has an 'exact_dedup' tool then it must run as the first step or after 'Pdf2Parquet'.\n"
\ No newline at end of file
diff --git a/examples/agentic/llm_utils/dpk/dpk_common.py b/examples/agentic/llm_utils/dpk/dpk_common.py
new file mode 100644
index 0000000000..00658770ca
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/dpk_common.py
@@ -0,0 +1,118 @@
+from typing import Optional
+from pydantic import Field
+import sys
+from data_processing.utils import ParamsUtils
+
+
+class DPKDataAccessInput:
+    """DPK Input for Data access params"""
+
+    data_type: str = Field(
+        "local", description="type of the data access; can be one of local, s3"
+    )
+    output_folder: str = Field("", description="string containing the output folder.")
+    input_folder: str = Field("", description="string containing the input folder.")
+    data_s3_cred: Optional[str] = Field(
+        None,
+        description="AST string of options for s3 credentials. 
Only required for S3 data access.", + ) + data_max_files: Optional[int] = Field( + None, description="Max amount of files to process" + ) + data_checkpointing: Optional[str] = Field(None, description="checkpointing flag") + data_files_to_checkpoint: Optional[str] = Field( + None, description="list of file extensions to choose for checkpointing." + ) + data_data_sets: Optional[str] = Field( + None, + description="List of sub-directories of input directory to use for input. For example, ['dir1', 'dir2']", + ) + data_files_to_use: Optional[str] = Field( + None, description="list of file extensions to choose for input." + ) + data_num_samples: Optional[str] = Field( + None, description="number of random input files to process" + ) + + +worker_options = {"num_cpus": 0.8} + + +class DPKRuntimeInput: + """DPK Input for Runtime params""" + + runtime_type: str = Field( + "python", description="type of the runtime can be one of python, ray or spark" + ) + run_locally: Optional[str] = Field(None, description="running ray local flag") + runtime_num_processors: Optional[str] = Field( + None, description="size of multiprocessing pool" + ) + runtime_pipeline_id: Optional[str] = Field(None, description="pipeline id") + runtime_job_id: Optional[str] = Field(None, description="job id") + runtime_code_location: Optional[str] = Field( + None, description="AST string containing code location" + ) + runtime_num_workers: Optional[int] = Field(None, description="number of workers") + runtime_worker_options: Optional[str] = Field( + ParamsUtils.convert_to_ast(worker_options), + description="AST string defining worker resource requirements.", + ) + runtime_creation_delay: Optional[int] = Field( + None, description="delay between actor creation" + ) + + +def add_runtime_params(transform_params: dict, runtime_type: str, kwargs): + """Add parameters related to runtime""" + + def _remove_ray_runtime_params(): + transform_params.pop("run_locally", None) + transform_params.pop("runtime_worker_options", None) + transform_params.pop("runtime_num_workers", None) + transform_params.pop("runtime_creation_delay", None) + + fields = list(DPKRuntimeInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + if runtime_type.strip().lower() == "python": + _remove_ray_runtime_params() + transform_params.pop("runtime_type", None) + + +def add_data_access_params(transform_params: dict, data_type: str, kwargs): + """Add parameters related to data access""" + + def _get_data_access_key(): + if data_type.strip().lower() == "local": + return "data_local_config" + elif data_type.strip().lower() == "s3": + return "data_s3_config" + else: + print(f"Unrecognizable type of TransformRuntimeConfiguration - {data_type}") + sys.exit(1) + + input_folder = transform_params.get("input_folder", "") + output_folder = transform_params.get("output_folder", "") + data_conf = { + "input_folder": f"{input_folder}", + "output_folder": f"{output_folder}", + } + data_key = _get_data_access_key() + transform_params[data_key] = ParamsUtils.convert_to_ast(data_conf) + fields = list(DPKDataAccessInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + transform_params.pop("input_folder", None) + transform_params.pop("output_folder", None) + transform_params.pop("data_type", None) + + +def check_params(transform_params: dict, kwargs): + """A basic check for the transform params""" + # 
input/output paths are added as one param to the final transform params
+    if len(transform_params) != len(kwargs) + 1:
+        print("Warning: unexpected parameter provided for the transform")
+        print(kwargs.keys())
diff --git a/examples/agentic/llm_utils/dpk/examples.py b/examples/agentic/llm_utils/dpk/examples.py
new file mode 100644
index 0000000000..ef19519b2e
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/examples.py
@@ -0,0 +1,19 @@
+example_task = '''
+For example, if the required task is "Filter the parquet files to include only English documents", then the plan should be the following:
+{"step_name": "Step #1 language identification", "tool_name": "language_id", "tool_input": [{"in_folder": "user_input", "out_folder": "user_input"}], "step_ev": "Ev1"}
+{"step_name": "Step #2 filter english documents", "tool_name": "filter_transform", "tool_input": [{"in_folder": "#Ev1"}, {"out_folder": "user_input"}, {"filter_criteria_list": "[lang==en]"}], "step_ev": "Ev2"}
+'''
+
+
+example_task1 = '''
+For example, if the required task is "Filter the parquet files to include only English documents", then the plan should be the following:
+{"step_name": "Step #1 language identification", "tool_name": "language_id", "tool_input": [{"in_folder": "user_input", "out_folder": "user_input"}], "import": "from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform", "step_ev": "Ev1"}
+{"step_name": "Step #2 filter english documents", "tool_name": "filter_transform", "tool_input": [{"in_folder": "#Ev1", "out_folder": "#Ev1", "filter_criteria_list": "[lang==en]"}], "import": "from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform", "step_ev": "Ev2"}
+'''
+
+
+example_task2 = '''
+For example, if the required task is "Filter the parquet files to include only English documents", then the plan should be the following:
+{"step_name": "Step #1 language identification", "tool_name": "language_id", "tool_input": [{"in_folder": "user_input", "out_folder": "user_input+'_langid'"}], "import": "from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform", "step_ev": "Ev1"}
+{"step_name": "Step #2 filter english documents", "tool_name": "filter_transform", "tool_input": [{"in_folder": "#Ev1", "out_folder": "#Ev1+'_filter'", "filter_criteria_list": "[lang==en]"}], "import": "from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform", "step_ev": "Ev2"}
+'''
diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/__init__.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/__init__.py
new file mode 100644
index 0000000000..bb48663488
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/__init__.py
@@ -0,0 +1,3 @@
+"""Data Prep Kit transforms toolkit."""
+
+__all__ = ["DataPrepKitToolkit"]
diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/toolkit.py b/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/toolkit.py
new file mode 100644
index 0000000000..d05fb16cfe
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/langchain_tools/agent_toolkit/toolkit.py
@@ -0,0 +1,86 @@
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional, Type
+
+from langchain_core.tools import BaseTool, BaseToolkit
+from langchain_core.utils.pydantic import get_fields
+from pydantic import model_validator
+
+# from noop_tool import NoopTransform
+from llm_utils.dpk.langchain_tools.tools.universal.fdedup import FdedupTransform
+from llm_utils.dpk.langchain_tools.tools.universal.ededup import EdedupTransform
+from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform
+from llm_utils.dpk.langchain_tools.tools.universal.resize import ResizeTransform
+from llm_utils.dpk.langchain_tools.tools.universal.tokenization import TokenizationTransform
+from llm_utils.dpk.langchain_tools.tools.universal.doc_id import DocIDTransform
+
+
+from llm_utils.dpk.langchain_tools.tools.code.code2parquet import Code2ParquetTransform
+from llm_utils.dpk.langchain_tools.tools.code.code_quality import CodeQualityTransform
+from llm_utils.dpk.langchain_tools.tools.code.proglang_select import ProgLangSelectTransform
+
+
+from llm_utils.dpk.langchain_tools.tools.language.doc_chunk import DocChunkTransform
+from llm_utils.dpk.langchain_tools.tools.language.doc_quality import DocQualityTransform
+from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform
+from llm_utils.dpk.langchain_tools.tools.language.pdf2parquet import Pdf2parquetTransform
+from llm_utils.dpk.langchain_tools.tools.language.text_encoder import TextEncoderTransform
+from llm_utils.dpk.langchain_tools.tools.language.pii_redactor import PIIRedactorTransform
+
+
+_FILE_TOOLS: List[Type[BaseTool]] = [
+    FdedupTransform,
+    EdedupTransform,
+    FilterTransform,
+    ResizeTransform,
+    TokenizationTransform,
+    DocIDTransform,
+    Pdf2parquetTransform,
+    CodeQualityTransform,
+    ProgLangSelectTransform,
+    DocChunkTransform,
+    DocQualityTransform,
+    Code2ParquetTransform,
+    LangIdentificationTransform,
+    TextEncoderTransform,
+    PIIRedactorTransform,
+]
+_FILE_TOOLS_MAP: Dict[str, 
Type[BaseTool]] = { + get_fields(tool_cls)["name"].default: tool_cls for tool_cls in _FILE_TOOLS +} + + +class DataPrepKitToolkit(BaseToolkit): + """Toolkit for applying data transformations using data prep kit. + + Parameters: + selected_tools: Optional. The tools to include in the toolkit. If not + provided, all tools are included. + """ + + selected_tools: Optional[List[str]] = None + """If provided, only provide the selected tools. Defaults to all.""" + + @model_validator(mode="before") + @classmethod + def validate_tools(cls, values: dict) -> Any: + selected_tools = values.get("selected_tools") or [] + for tool_name in selected_tools: + if tool_name not in _FILE_TOOLS_MAP: + raise ValueError( + f"File Tool of name {tool_name} not supported." + f" Permitted tools: {list(_FILE_TOOLS_MAP)}" + ) + return values + + def get_tools(self) -> List[BaseTool]: + """Get the tools in the toolkit.""" + allowed_tools = self.selected_tools or _FILE_TOOLS_MAP + tools: List[BaseTool] = [] + for tool in allowed_tools: + tool_cls = _FILE_TOOLS_MAP[tool] + tools.append(tool_cls()) # type: ignore[call-arg] + return tools + + +__all__ = ["DataPrepKitToolkit"] diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/__init__.py new file mode 100644 index 0000000000..467e704b82 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/__init__.py @@ -0,0 +1,21 @@ +"""Data Processing Kit Transforms.""" + +from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform + +__all__ = [ + "FdedupTransform", + "EdedupTransform", + "FilterTransform", + "ResizeTransform", + "TokenizationTransform", + "DocIDTransform", + "Code2ParquetTransform", + "CodeQualityTransform", + "ProgLangSelectTransform", + "DocChunkTransform", + "DocQualityTransform", + "LangIdentificationTransform", + "Pdf2parquetTransform", + "TextEncoderTransform", + "PIIRedactorTransform", +] diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code2parquet.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code2parquet.py new file mode 100644 index 0000000000..7d1384176d --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code2parquet.py @@ -0,0 +1,103 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class Code2ParquetInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for Code2ParquetTransform.""" + + code2parquet_supported_langs_file: Optional[str] = Field( + None, + description="set the `supported_langs_file` configuration key.", + ) + code2parquet_detect_programming_lang: Optional[str] = Field( + None, + description="set the `detect_programming_lang` configuration key.", + ) + code2parquet_snapshot: Optional[str] = Field( + None, + description="set the `snapshot` configuration key.", + ) + code2parquet_domain: Optional[str] = Field( + None, + description="set the `domain` configuration 
key.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(Code2ParquetInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class Code2ParquetTransform(BaseTool): + """Tool that apples code2parquet transform.""" + + name: str = "code2parquet" + args_schema: Type[BaseModel] = Code2ParquetInput + description: str = "Apply code2parquet transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from code2parquet_transform_ray import ( + CodeToParquetRayConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(CodeToParquetRayConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from code2parquet_transform_python import ( + CodeToParquetPythonConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(CodeToParquetPythonConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in code2parquet transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error code2parquet Job Failed" + + return f"code2parquet transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code_quality.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code_quality.py new file mode 100644 index 0000000000..142bba5002 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/code_quality.py @@ -0,0 +1,104 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class CodeQualityInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for CodeQualityTransform.""" + + cq_contents_column_name: Optional[str] = Field( + None, + description="Name of the column holds the data to process", + ) + cq_language_column_name: Optional[str] = Field( + None, + description="Name of the column holds the programming language details", + ) + cq_tokenizer: Optional[str] = Field( + None, + description="Name or path to the tokenizer.", + ) + cq_hf_token: Optional[str] = Field( + None, + description="Huggingface auth token to download and use the tokenizer.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(CodeQualityInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class CodeQualityTransform(BaseTool): + """Tool that apples Code Quality transform.""" + + name: str = "code_quality" + args_schema: Type[BaseModel] = CodeQualityInput + description: str = "Apply code_quality transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from code_quality_transform_ray import ( + CodeQualityRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from code_quality_transform_python import ( + CodeQualityPythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + CodeQualityPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in code quality transform - {runtime_type}." 
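+            # launch() reads its arguments from sys.argv (populated above via
+            # ParamsUtils.dict_to_req) and returns a shell-style exit code, 0 meaning success.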
+            print(f"launching transform with params: {transform_params}")
+            return_code = launcher.launch()
+            if return_code != 0:
+                return "Error code quality Job Failed"
+            return f"code quality transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
+        except Exception as e:
+            return "Error: " + str(e)
diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/proglang_select.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/proglang_select.py
new file mode 100644
index 0000000000..3329f9e106
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/code/proglang_select.py
@@ -0,0 +1,102 @@
+import logging
+from typing import Optional, Type
+import sys
+
+from langchain_core.callbacks import CallbackManagerForToolRun
+from langchain_core.tools import BaseTool
+from pydantic import BaseModel, Field
+
+
+from llm_utils.dpk.dpk_common import (
+    DPKDataAccessInput,
+    DPKRuntimeInput,
+    add_runtime_params,
+    add_data_access_params,
+)
+from data_processing.utils import ParamsUtils
+
+logger = logging.getLogger(__name__)
+
+
+class ProgLangSelectInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput):
+    """Input for ProgLangSelectTransform."""
+
+    proglang_select_allowed_langs: Optional[str] = Field(
+        None,
+        description="Path to file containing the list of languages to be matched",
+    )
+    proglang_select_language_column: Optional[str] = Field(
+        None,
+        description="The column name holding the name of the programming language assigned to the document",
+    )
+    proglang_select_output_column: Optional[str] = Field(
+        None,
+        description="The column name to add; it contains the matching information",
+    )
+
+
+def add_transform_params(transform_params: dict, kwargs):
+    """Add transform specific params"""
+    fields = list(ProgLangSelectInput.__annotations__.keys())
+    for field in fields:
+        if field in kwargs and kwargs[field] is not None:
+            transform_params[field] = kwargs[field]
+
+
+class ProgLangSelectTransform(BaseTool):  # type: ignore[override, override]
+    """Tool that applies the proglang_select transform."""
+
+    name: str = "proglang_select"
+    args_schema: Type[BaseModel] = ProgLangSelectInput
+    description: str = "Apply proglang_select transform on files in input folder"
+
+    def _run(
+        self,
+        input_folder: str = "",
+        output_folder: str = "",
+        run_manager: Optional[CallbackManagerForToolRun] = None,
+        **kwargs,
+    ) -> str:
+        if input_folder == "" or output_folder == "":
+            return "Error: input folder or output folder are missing"
+        try:
+            runtime_type = kwargs.get("runtime_type", "python")
+            data_type = kwargs.get("data_type", "local")
+            transform_params = {
+                "input_folder": input_folder,
+                "output_folder": output_folder,
+            }
+            add_runtime_params(transform_params, runtime_type, kwargs)
+            add_data_access_params(transform_params, data_type, kwargs)
+            add_transform_params(transform_params, kwargs)
+
+            if runtime_type.strip().lower() == "ray":
+                from progLang_select_transform_ray import (
+                    ProgLangSelectRayConfiguration,
+                )
+                from data_processing_ray.runtime.ray import RayTransformLauncher
+
+                print(f"running ray with transform_params: {transform_params}")
+                sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+                launcher = RayTransformLauncher(ProgLangSelectRayConfiguration())
+
+            elif runtime_type.strip().lower() == "python":
+                from data_processing.runtime.pure_python import PythonTransformLauncher
+                from progLang_select_transform_python import (
+                    ProgLangSelectPythonConfiguration,
+                )
+
+                print(f"running python with 
transform_params: {transform_params}") + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(ProgLangSelectPythonConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in progLang_select transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error progLang_select Job Failed" + + return f"progLang_select transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_chunk.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_chunk.py new file mode 100644 index 0000000000..df4956b480 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_chunk.py @@ -0,0 +1,125 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class DocChunkInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocChunkTransform.""" + + doc_chunk_chunking_type: Optional[str] = Field( + None, + description="Chunking type to apply. Valid options are `li_markdown` for using the LlamaIndex, which chunks the text into fixed-sized windows of tokens.", + ) + doc_chunk_content_column_name: Optional[str] = Field( + None, + description="Name of the column containing the text to be chunked.", + ) + doc_chunk_doc_id_column_name: Optional[str] = Field( + None, + description="Name of the column containing the doc_id to be propagated in the output.", + ) + doc_chunk_output_chunk_column_name: Optional[str] = Field( + None, + description="Column name to store the chunks in the output table. ", + ) + doc_chunk_output_source_doc_id_column_name: Optional[str] = Field( + None, + description="Column name to store the `doc_id` from the input table.", + ) + doc_chunk_output_jsonpath_column_name: Optional[str] = Field( + None, + description="Column name to store the document path of the chunk in the output table.", + ) + doc_chunk_output_pageno_column_name: Optional[str] = Field( + None, + description="path to bad word file: local folder (file or directory) that points to bad word file. 
You don't have to set this parameter if you don't need to set bad words.", + ) + doc_chunk_output_bbox_column_name: Optional[str] = Field( + None, + description="Column name to store the bbox of the chunk in the output table", + ) + doc_chunk_chunk_size_tokens: Optional[int] = Field( + None, + description="Size of the chunk in tokens for the token text chunker.", + ) + doc_chunk_chunk_overlap_tokens: Optional[int] = Field( + None, + description="Number of tokens overlapping between chunks for the token text chunker.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocChunkInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class DocChunkTransform(BaseTool): # type: ignore[override, override] + """Tool that apples doc_chunk transform.""" + + name: str = "doc_chunk" + args_schema: Type[BaseModel] = DocChunkInput + description: str = "Apply DocChunk transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_chunk.ray.transform import DocChunkRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocChunkRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_chunk.transform_python import DocChunkPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + DocChunkPythonTransformConfiguration() + ) + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc_chunk transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error doc_chunk Job Failed" + + return f"doc_chunk transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_quality.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_quality.py new file mode 100644 index 0000000000..03dc29ec75 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/doc_quality.py @@ -0,0 +1,98 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class DocQualityInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocQualityTransform.""" + + docq_text_lang: Optional[str] = Field( + None, + description="language used in the text content. By default, en is used", + ) + docq_doc_content_column: Optional[str] = Field( + None, + description="column name that contain document text. By default, contents is used.", + ) + docq_bad_word_filepath: Optional[str] = Field( + None, + description="path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocQualityInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class DocQualityTransform(BaseTool): # type: ignore[override, override] + """Tool that apples doc_quality transform.""" + + name: str = "doc_quality" + args_schema: Type[BaseModel] = DocQualityInput + description: str = "Apply DocQuality transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_quality.ray.transform import DocQualityRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocQualityRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_quality.transform_python import DocQualityPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + DocQualityPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc_quality transform - {runtime_type}." 
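+            # Both branches above build the same parameter dict; only the launcher and
+            # transform configuration class differ between the Python and Ray runtimes.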
+ print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error doc_quality Job Failed" + + return f"doc_quality transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/lang_id.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/lang_id.py new file mode 100644 index 0000000000..9f2b221802 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/lang_id.py @@ -0,0 +1,111 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class LangIdentificationInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for LangIdentificationTransform.""" + + lang_id_model_credential: Optional[str] = Field( + None, + description="Credential to access model for language detection placed in url.", + ) + lang_id_model_kind: Optional[str] = Field( + None, + description="Kind of model for language detection.", + ) + lang_id_model_url: Optional[str] = Field( + None, + description="Url to model for language detection.", + ) + lang_id_content_column_name: Optional[str] = Field( + None, + description="Column name to get content.", + ) + lang_id_output_lang_column_name: Optional[str] = Field( + None, + description="Column name to store identified language.", + ) + lang_id_output_score_column_name: Optional[str] = Field( + None, + description="Column name to store the score of language identification.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(LangIdentificationInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class LangIdentificationTransform(BaseTool): # type: ignore[override, override] + """Tool that apples lang_id transform.""" + + name: str = "lang_id" + args_schema: Type[BaseModel] = LangIdentificationInput + description: str = "Apply LangIdentification transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_lang_id.ray.transform import LangIdentificationRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher( + LangIdentificationRayTransformConfiguration() + ) + + elif 
runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_lang_id.transform_python import LangIdentificationPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + LangIdentificationPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in lang_id transform - {runtime_type}." + return_code = launcher.launch() + if return_code != 0: + return "Error Job Failed" + + return f"lang_id transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pdf2parquet.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pdf2parquet.py new file mode 100644 index 0000000000..d8e7d70c0e --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pdf2parquet.py @@ -0,0 +1,120 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class Pdf2parquetInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for Pdf2parquetTransform.""" + + pdf2parquet_batch_size: Optional[int] = Field( + None, + description="Number of documents to be saved in the same result table. A value of -1 will generate one result file for each input file.", + ) + pdf2parquet_artifacts_path: Optional[str] = Field( + None, + description="Path where to Docling models artifacts are located, if unset they will be downloaded and fetched from the [HF_HUB_CACHE](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache) folder.", + ) + pdf2parquet_contents_type: Optional[str] = Field( + None, + description="The output type for the `contents` column. Valid types are `text/markdown`, `text/plain` and `application/json`.", + ) + pdf2parquet_do_table_structure: Optional[str] = Field( + None, + description="If true, detected tables will be processed with the table structure model.", + ) + pdf2parquet_do_ocr: Optional[str] = Field( + None, + description="If true, optical character recognition (OCR) will be used to read the content of bitmap parts of the document.", + ) + pdf2parquet_ocr_engine: Optional[str] = Field( + None, + description=" The OCR engine to use. Valid values are `easyocr`, `tesseract`, `tesseract_cli`.", + ) + pdf2parquet_bitmap_area_threshold: Optional[float] = Field( + None, + description="Threshold for running OCR on bitmap figures embedded in document. The threshold is computed as the fraction of the area covered by the bitmap, compared to the whole page area.", + ) + pdf2parquet_pdf_backend: Optional[str] = Field( + None, + description="The PDF backend to use. Valid values are `dlparse_v2`, `dlparse_v1`, `pypdfium2`", + ) + pdf2parquet_double_precision: Optional[int] = Field( + None, + description="If set, all floating points (e.g. bounding boxes) are rounded to this precision. 
For tests it is advised to use 0.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(Pdf2parquetInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class Pdf2parquetTransform(BaseTool): + """Tool that apples pdf2parquet transform.""" + + name: str = "pdf2parquet" + args_schema: Type[BaseModel] = Pdf2parquetInput + description: str = "Apply pdf2parquet transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from data_processing_ray.runtime.ray import RayTransformLauncher + from dpk_pdf2parquet.ray.transform import Pdf2ParquetRayTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from dpk_pdf2parquet.transform_python import Pdf2ParquetPythonTransformConfiguration + from data_processing.runtime.pure_python import PythonTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + Pdf2ParquetPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in pdf2parquet transform - {runtime_type}." + return_code = launcher.launch() + if return_code != 0: + return "Error pdf2parquet Job Failed" + + return f"pdf2parquet transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pii_redactor.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pii_redactor.py new file mode 100644 index 0000000000..113d2f523d --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/pii_redactor.py @@ -0,0 +1,108 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class PIIRedactorInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for PIIRedactorTransform.""" + + pii_redactor_entities: Optional[str] = Field( + None, + description="List of entities to be redacted from the input data: {json.dumps(default_supported_entities, indent=2, default=str)}. ", + ) + pii_redactor_operator: Optional[str] = Field( + None, + description="Redaction technique to be applied on detected pii data. Supported techniques redact, replace. 
", + ) + pii_redactor_transformed_contents: Optional[str] = Field( + None, + description="Mention column name in which transformed contents will be added. ", + ) + pii_redactor_score_threshold: Optional[float] = Field( + None, + description="The score_threshold is a parameter that " + "sets the minimum confidence score required for an entity to be considered a match." + "Provide a value above 0.6 ", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(PIIRedactorInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class PIIRedactorTransform(BaseTool): # type: ignore[override, override] + """Tool that apples pii_redactor transform.""" + + name: str = "pii_redactor" + args_schema: Type[BaseModel] = PIIRedactorInput + description: str = "Apply PIIRedactor transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from pii_redactor_transform_ray import ( + PIIRedactorRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(PIIRedactorRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from pii_redactor_transform_python import ( + PIIRedactorPythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + PIIRedactorPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in pii_redactor transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error pii_redactor Job Failed" + + return f"pii_redactor transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/text_encoder.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/text_encoder.py new file mode 100644 index 0000000000..9e1c8c7930 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/language/text_encoder.py @@ -0,0 +1,96 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class TextEncoderInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for TextEncoderTransform.""" + + text_encoder_content_column_name: Optional[str] = Field( + None, + description="Name of the column containing the text to be encoded.", + ) + text_encoder_output_embeddings_column_name: Optional[str] = Field( + None, + description="Column name to store the embeddings in the output table.", + ) + text_encoder_model_name: Optional[str] = Field( + None, + description="The HF model to use for encoding the text.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(TextEncoderInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class TextEncoderTransform(BaseTool): + """Tool that apples text_encoder transform.""" + + name: str = "text_encoder" + args_schema: Type[BaseModel] = TextEncoderInput + description: str = "Apply text_encoder transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + if runtime_type.strip().lower() == "ray": + from dpk_text_encoder.ray.transform import TextEncoderRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_text_encoder.transform_python import TextEncoderPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + TextEncoderPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in text_encoder transform - {runtime_type}." 
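+            # A non-zero return code from launch() indicates that the underlying
+            # transform job failed, so surface it as an error string.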
+ print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error text_encoder Job Failed" + + return f"text_encoder transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/__init__.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/doc_id.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/doc_id.py new file mode 100644 index 0000000000..7610aeb2af --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/doc_id.py @@ -0,0 +1,101 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class DocIDInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocIDTransform.""" + + doc_id_doc_column: Optional[str] = Field( + None, + description="doc column name", + ) + doc_id_hash_column: Optional[str] = Field( + None, + description="Compute document hash and place in the given named column", + ) + doc_id_int_column: Optional[str] = Field( + None, + description="Compute unique integer id and place in the given named column", + ) + doc_id_start_id: Optional[int] = Field( + None, + description="starting integer id", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocIDInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class DocIDTransform(BaseTool): + """Tool that applies doc_id transform.""" + + name: str = "doc_id" + args_schema: Type[BaseModel] = DocIDInput + description: str = "Apply doc_id transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_id.ray.transform import DocIDRayTransformRuntimeConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_id.transform_python import DocIDPythonTransformRuntimeConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+ launcher = PythonTransformLauncher( + DocIDPythonTransformRuntimeConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc id transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error doc_id Job Failed" + + return f"doc_id transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/ededup.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/ededup.py new file mode 100644 index 0000000000..f085c0c38a --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/ededup.py @@ -0,0 +1,108 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, + check_params, +) +from data_processing.utils import ParamsUtils + + +class EdedupInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for EdedupTransform.""" + + ededup_doc_column: Optional[str] = Field( + None, + description="name of the column containing document", + ) + ededup_doc_id_column: Optional[str] = Field( + None, + description="name of the column containing document id", + ) + ededup_use_snapshot: Optional[str] = Field( + None, + description="flag to continue from snapshot", + ) + ededup_snapshot_directory: Optional[str] = Field( + None, + description="location of snapshot files", + ) + ededup_num_hashes: Optional[int] = Field( + None, + description="Number of hashes; should be greater than zero",) + ededup_hash_cpu: Optional[float] = Field( + None, + description="number of CPUs per hash",) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(EdedupInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class EdedupTransform(BaseTool): + """Tool that applies Ededup transform.""" + + name: str = "ededup" + args_schema: Type[BaseModel] = EdedupInput + description: str = "Apply Ededup transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + if runtime_type.strip().lower() == "ray": + from dpk_ededup.ray.transform import EdedupRayTransformRuntimeConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+ launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_ededup.transform_python import EdedupPythonTransformRuntimeConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + EdedupPythonTransformRuntimeConfiguration() + ) + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in ededup transform - {runtime_type}." + check_params(transform_params, kwargs) + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error ededup Job Failed" + + return f"Ededup transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/fdedup.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/fdedup.py new file mode 100644 index 0000000000..7022ea9fee --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/fdedup.py @@ -0,0 +1,129 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, + check_params, +) +from data_processing.utils import ParamsUtils + + +class FdedupInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for FdedupTransform.""" + + operation_mode: Optional[str] = Field( + None, + description="operation mode for data cleanup", + ) + contents_column: Optional[str] = Field( + None, + description="name of the column that stores document text", + ) + document_id_column: Optional[str] = Field( + None, + description="name of the column containing document id", + ) + seed: Optional[int] = Field( + None, + description="seed of the random number generator", + ) + num_permutations: Optional[int] = Field( + None, + description="number of permutations to use for minhash calculation", + ) + num_bands: Optional[int] = Field( + None, + description="number of bands to use in the banding technique", + ) + num_minhashes_per_band: Optional[int] = Field( + None, + description="number of minhashes to use in each band",) + word_shingle_size: Optional[int] = Field( + None, + description="number of words included in one shingle",) + jaccard_similarity_threshold: Optional[float] = Field( + None, + description="jaccard similarity threshold above which two documents are considered duplicates", ) + num_segments: Optional[int] = Field( + None, + description="the number of segments dividing the hashing space for each band (for scalability)", ) + services: Optional[str] = Field( + None, + description="Comma separated list of services to run", ) + shingle_option: Optional[str] = Field( + None, + description="Option used for shingling", ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(FdedupInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class FdedupTransform(BaseTool):
+ """Tool that applies Fdedup transform.""" + + name: str = "fdedup" + args_schema: Type[BaseModel] = FdedupInput + description: str = "Apply Fdedup transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + # add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + transform_params.pop("data_type", None) + + if runtime_type.strip().lower() == "ray": + from dpk_fdedup.ray.transform import RayServiceOrchestrator + from dpk_fdedup.transform_python import parse_args + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + args = parse_args() + orchestrator = RayServiceOrchestrator(global_params=args) + + elif runtime_type.strip().lower() == "python": + from dpk_fdedup.transform_python import ServiceOrchestrator, parse_args + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + args = parse_args() + orchestrator = ServiceOrchestrator(global_params=args) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in Fdedup transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = orchestrator.orchestrate() + if return_code != 0: + return "Error Fdedup Job Failed" + + return f"Fdedup transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
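+ # Note: unlike the launcher-based tools, fdedup drives a multi-stage + # pipeline through a ServiceOrchestrator (orchestrate()), which is why + # data-access params are not added above and "data_type" is popped from + # the params before launch.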
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/filter.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/filter.py new file mode 100644 index 0000000000..408e418e82 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/filter.py @@ -0,0 +1,98 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class FilterInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for FilterTransform.""" + + filter_criteria_list: Optional[str] = Field( + None, + description="list of filter criteria (in SQL WHERE clause format).", + ) + filter_columns_to_drop: Optional[str] = Field( + None, + description="list of columns to drop after filtering.", + ) + filter_logical_operator: Optional[str] = Field( + None, + description="logical operator (AND or OR) used to combine multiple filter criteria.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(FilterInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class FilterTransform(BaseTool): + """Tool that applies filter transform.""" + + name: str = "filter" + args_schema: Type[BaseModel] = FilterInput + description: str = "Apply filter transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_filter.ray.transform import FilterRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(FilterRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_filter.transform_python import ( + FilterPythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(FilterPythonTransformConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in filter transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error filter Job Failed" + + return f"filter transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
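+ # Illustrative call (hypothetical folder names and column): + #   FilterTransform()._run( + #       input_folder="input", output_folder="output", + #       filter_criteria_list="['docq_total_words > 100']", + #   ) + # The criteria strings are forwarded unchanged to the SQL-based filter.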
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/resize.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/resize.py new file mode 100644 index 0000000000..f84a303eaa --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/resize.py @@ -0,0 +1,98 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class ResizeInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for ResizeTransform.""" + + resize_max_rows_per_table: Optional[int] = Field( + None, + description="Max number of rows per table", + ) + resize_max_mbytes_per_table: Optional[float] = Field( + None, + description="Max table size (MB). Size is measured according to the --resize_size_type parameter", + ) + resize_size_type: Optional[str] = Field( + None, + description="Determines how memory is measured when using the --resize_max_mbytes_per_table option.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(ResizeInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class ResizeTransform(BaseTool): + """Tool that applies resize transform.""" + + name: str = "resize" + args_schema: Type[BaseModel] = ResizeInput + description: str = "Apply resize transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from resize_transform_ray import ( + ResizeRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(ResizeRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from resize_transform_python import ( + ResizePythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(ResizePythonTransformConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in resize transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error resize Job Failed" + return f"resize transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
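+ # resize_max_rows_per_table and resize_max_mbytes_per_table are alternative + # limits; resize_size_type controls how the megabyte limit is measured.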
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/tokenization.py b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/tokenization.py new file mode 100644 index 0000000000..b1cacd65af --- /dev/null +++ b/examples/agentic/llm_utils/dpk/langchain_tools/tools/universal/tokenization.py @@ -0,0 +1,108 @@ +import logging +from typing import Optional, Type +import sys + +from langchain_core.callbacks import CallbackManagerForToolRun +from langchain_core.tools import BaseTool +from pydantic import BaseModel, Field + + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class TokenizationInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for TokenizationTransform.""" + + tkn_tokenizer: Optional[str] = Field( + None, + description="Tokenizer used for tokenization. It also can be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from HuggingFace is used", + ) + tkn_tokenizer_args: Optional[str] = Field( + None, + description="Arguments for tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be arguments for `bigcode/starcoder`", + ) + tkn_doc_id_column: Optional[str] = Field( + None, + description="Column contains document id which values should be unique across dataset", + ) + tkn_doc_content_column: Optional[str] = Field( + None, + description="Column contains document content", + ) + tkn_text_lang: Optional[str] = Field( + None, + description="Specify language used in text content for better text splitting if needed", + ) + tkn_chunk_size: Optional[int] = Field( + None, + description="Specify >0 value to tokenize each row/text in chunks of characters (rounded in words)", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(TokenizationInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +class TokenizationTransform(BaseTool):  # type: ignore[override] + """Tool that applies tokenization transform.""" + + name: str = "tokenization" + args_schema: Type[BaseModel] = TokenizationInput + description: str = "Apply Tokenization transform on files in input folder" + + def _run( + self, + input_folder: str = "", + output_folder: str = "", + run_manager: Optional[CallbackManagerForToolRun] = None, + **kwargs, + ) -> str: + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from data_processing_ray.runtime.ray import RayTransformLauncher + from dpk_tokenization.ray.transform import TokenizationRayConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(TokenizationRayConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher
+ from dpk_tokenization.transform_python import TokenizationPythonConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(TokenizationPythonConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in tokenization transform - {runtime_type}." + print(f"Launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error Tokenization Job Failed" + + return f"Tokenization transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/__init__.py new file mode 100644 index 0000000000..a19349aa60 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/__init__.py @@ -0,0 +1,5 @@ +from .base import DPKTransformsToolSpec + +__all__ = [ + "DPKTransformsToolSpec", +] diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/base.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/base.py new file mode 100644 index 0000000000..299bdfc034 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/base.py @@ -0,0 +1,120 @@ +from llama_index.core.tools.tool_spec.base import BaseToolSpec + + +class DPKTransformsToolSpec(BaseToolSpec): + """ + + DPK transforms. + + Methods: + code2parquet(self, **kwargs) -> str: + Applies code2parquet transform. Returns a string with the result. + code_quality(self, **kwargs) -> str: + Applies code_quality transform. Returns a string with the result. + prolang_select(self, **kwargs) -> str: + Applies prolang_select transform. Returns a string with the result. + doc_chunk(self, **kwargs) -> str: + Applies doc_chunk transform. Returns a string with the result. + doc_quality(self, **kwargs) -> str: + Applies doc_quality transform. Returns a string with the result. + lang_id(self, **kwargs) -> str: + Applies lang_id transform. Returns a string with the result.
+ pdf2parquet(self, **kwargs) -> str: + Applies pdf2parquet transform. Returns a string with the result. + pii_redactor(self, **kwargs) -> str: + Applies pii_redactor transform. Returns a string with the result. + text_encoder(self, **kwargs) -> str: + Applies text_encoder transform. Returns a string with the result. + doc_id(self, **kwargs) -> str: + Applies doc_id transform. Returns a string with the result. + ededup(self, **kwargs) -> str: + Applies ededup transform. Returns a string with the result. + fdedup(self, **kwargs) -> str: + Applies fdedup transform. Returns a string with the result. + filter(self, **kwargs) -> str: + Applies filter transform. Returns a string with the result. + resize(self, **kwargs) -> str: + Applies resize transform. Returns a string with the result. + tokenization(self, **kwargs) -> str: + Applies tokenization transform. Returns a string with the result. + + """ + + spec_functions = ["code2parquet", "code_quality", "prolang_select", "doc_chunk", "doc_quality", "lang_id", + "pdf2parquet", "pii_redactor", "text_encoder", "doc_id", "ededup", "fdedup", "filter", + "resize", "tokenization"] + + def code2parquet(self, **kwargs) -> str: + from .code import code2parquet + + return code2parquet.code2parquet(kwargs=kwargs) + + def code_quality(self, **kwargs) -> str: + from .code import code_quality + + return code_quality.code_quality(kwargs=kwargs) + + def prolang_select(self, **kwargs) -> str: + from .code import proglang_select + + return proglang_select.proglang_select(kwargs=kwargs) + + def doc_chunk(self, **kwargs) -> str: + from .language import doc_chunk + + return doc_chunk.doc_chunk(kwargs=kwargs) + + def doc_quality(self, **kwargs) -> str: + from .language import doc_quality + + return doc_quality.doc_quality(kwargs=kwargs) + + def lang_id(self, **kwargs) -> str: + from .language import lang_id + + return lang_id.lang_id(kwargs=kwargs) + + def pdf2parquet(self, **kwargs) -> str: + from .language import pdf2parquet + + return pdf2parquet.pdf2parquet(kwargs=kwargs) + + def pii_redactor(self, **kwargs) -> str: + from .language import pii_redactor + + return pii_redactor.pii_redactor(kwargs=kwargs) + + def text_encoder(self, **kwargs) -> str: + from .language import text_encoder + + return text_encoder.text_encoder(kwargs=kwargs) + + def doc_id(self, **kwargs) -> str: + from .universal import doc_id + + return doc_id.doc_id(kwargs=kwargs) + + def ededup(self, **kwargs) -> str: + from .universal import ededup + + return ededup.ededup(kwargs=kwargs) + + def fdedup(self, **kwargs) -> str: + from .universal import fdedup + + return fdedup.fdedup(kwargs=kwargs) + + def filter(self, **kwargs) -> str: + from .universal import filter + + return filter.filter(kwargs=kwargs) + + def resize(self, **kwargs) -> str: + from .universal import resize + + return resize.resize(kwargs=kwargs) + + def tokenization(self, **kwargs) -> str: + from .universal import tokenization + + return tokenization.tokenization(kwargs=kwargs) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/__init__.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code2parquet.py
b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code2parquet.py new file mode 100644 index 0000000000..015be77939 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code2parquet.py @@ -0,0 +1,96 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class Code2ParquetInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for Code2ParquetTransform.""" + + code2parquet_supported_langs_file: Optional[str] = Field( + None, + description="set the `supported_langs_file` configuration key.", + ) + code2parquet_detect_programming_lang: Optional[str] = Field( + None, + description="set the `detect_programming_lang` configuration key.", + ) + code2parquet_snapshot: Optional[str] = Field( + None, + description="set the `snapshot` configuration key.", + ) + code2parquet_domain: Optional[str] = Field( + None, + description="set the `domain` configuration key.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(Code2ParquetInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def code2parquet(**kwargs: Any) -> str: + """Tool that apples code2parquet transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from code2parquet_transform_ray import ( + CodeToParquetRayConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(CodeToParquetRayConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from code2parquet_transform_python import ( + CodeToParquetPythonConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(CodeToParquetPythonConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in code2parquet transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error code2parquet Job Failed" + + return f"code2parquet transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
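+ # Unlike the LangChain BaseTool classes, these llama-index tool functions + # receive their real arguments wrapped in a single "kwargs" entry (unpacked + # by kwargs.get("kwargs", None) above), matching how DPKTransformsToolSpec + # forwards calls, e.g. code2parquet.code2parquet(kwargs=kwargs).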
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code_quality.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code_quality.py new file mode 100644 index 0000000000..46d013f462 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/code_quality.py @@ -0,0 +1,97 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class CodeQualityInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for CodeQualityTransform.""" + + cq_contents_column_name: Optional[str] = Field( + None, + description="Name of the column that holds the data to process", + ) + cq_language_column_name: Optional[str] = Field( + None, + description="Name of the column that holds the programming language details", + ) + cq_tokenizer: Optional[str] = Field( + None, + description="Name or path to the tokenizer.", + ) + cq_hf_token: Optional[str] = Field( + None, + description="Huggingface auth token to download and use the tokenizer.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(CodeQualityInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def code_quality(**kwargs: Any) -> str: + """Tool that applies code_quality transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from code_quality_transform_ray import ( + CodeQualityRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from code_quality_transform_python import ( + CodeQualityPythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + CodeQualityPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in code quality transform - {runtime_type}."
+ print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error code quality Job Failed" + return f"code quality transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/proglang_select.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/proglang_select.py new file mode 100644 index 0000000000..b71c2ff8d8 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/code/proglang_select.py @@ -0,0 +1,94 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class ProgLangSelectInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for ProgLangSelectTransform.""" + + proglang_select_allowed_langs: Optional[str] = Field( + None, + description="Path to file containing the list of languages to be matched", + ) + proglang_select_language_column: Optional[str] = Field( + None, + description="The column name holding the name of the programming language assigned to the document", + ) + proglang_select_output_column: Optional[str] = Field( + None, + description="The column name to add that contains the matching information", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(ProgLangSelectInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def proglang_select(**kwargs: Any) -> str: + """Tool that applies proglang_select transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from progLang_select_transform_ray import ( + ProgLangSelectRayConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + print(f"running ray with transform_params: {transform_params}") + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(ProgLangSelectRayConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from progLang_select_transform_python import ( + ProgLangSelectPythonConfiguration, + ) + + print(f"running python with transform_params: {transform_params}") + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(ProgLangSelectPythonConfiguration()) + +
else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in progLang_select transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error progLang_select Job Failed" + + return f"progLang_select transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_chunk.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_chunk.py new file mode 100644 index 0000000000..3ecfa1fb55 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_chunk.py @@ -0,0 +1,117 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class DocChunkInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocChunkTransform.""" + + doc_chunk_chunking_type: Optional[str] = Field( + None, + description="Chunking type to apply. Valid options are `li_markdown` for LlamaIndex Markdown chunking, `dl_json` for Docling JSON chunking, and `li_token_text` for the LlamaIndex Token Text Splitter, which chunks the text into fixed-sized windows of tokens.", + ) + doc_chunk_content_column_name: Optional[str] = Field( + None, + description="Name of the column containing the text to be chunked.", + ) + doc_chunk_doc_id_column_name: Optional[str] = Field( + None, + description="Name of the column containing the doc_id to be propagated in the output.", + ) + doc_chunk_output_chunk_column_name: Optional[str] = Field( + None, + description="Column name to store the chunks in the output table. ", + ) + doc_chunk_output_source_doc_id_column_name: Optional[str] = Field( + None, + description="Column name to store the `doc_id` from the input table.", + ) + doc_chunk_output_jsonpath_column_name: Optional[str] = Field( + None, + description="Column name to store the document path of the chunk in the output table.", + ) + doc_chunk_output_pageno_column_name: Optional[str] = Field( + None, + description="Column name to store the page number of the chunk in the output table.", + )
+ doc_chunk_output_bbox_column_name: Optional[str] = Field( + None, + description="Column name to store the bbox of the chunk in the output table", + ) + doc_chunk_chunk_size_tokens: Optional[int] = Field( + None, + description="Size of the chunk in tokens for the token text chunker.", + ) + doc_chunk_chunk_overlap_tokens: Optional[int] = Field( + None, + description="Number of tokens overlapping between chunks for the token text chunker.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocChunkInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def doc_chunk(**kwargs: Any) -> str: + """Tool that applies doc_chunk transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_chunk.ray.transform import DocChunkRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocChunkRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_chunk.transform_python import DocChunkPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + DocChunkPythonTransformConfiguration() + ) + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc_chunk transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error doc_chunk Job Failed" + + return f"doc_chunk transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
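+ # doc_chunk fans each input document out into one output row per chunk; + # the doc_id/source columns configured above keep each chunk traceable + # to its originating document.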
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_quality.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_quality.py new file mode 100644 index 0000000000..8c48e0ff64 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/doc_quality.py @@ -0,0 +1,90 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class DocQualityInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocQualityTransform.""" + + docq_text_lang: Optional[str] = Field( + None, + description="language used in the text content. By default, en is used", + ) + docq_doc_content_column: Optional[str] = Field( + None, + description="column name that contain document text. By default, contents is used.", + ) + docq_bad_word_filepath: Optional[str] = Field( + None, + description="path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocQualityInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def doc_quality(**kwargs: Any) -> str: + """Tool that apples doc_quality transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_quality.ray.transform import DocQualityRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocQualityRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_quality.transform_python import DocQualityPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + DocQualityPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc_quality transform - {runtime_type}." 
+ print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error doc_quality Job Failed" + + return f"doc_quality transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/lang_id.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/lang_id.py new file mode 100644 index 0000000000..f068f9e691 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/lang_id.py @@ -0,0 +1,103 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class LangIdentificationInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for LangIdentificationTransform.""" + + lang_id_model_credential: Optional[str] = Field( + None, + description="Credential to access model for language detection placed in url.", + ) + lang_id_model_kind: Optional[str] = Field( + None, + description="Kind of model for language detection.", + ) + lang_id_model_url: Optional[str] = Field( + None, + description="Url to model for language detection.", + ) + lang_id_content_column_name: Optional[str] = Field( + None, + description="Column name to get content.", + ) + lang_id_output_lang_column_name: Optional[str] = Field( + None, + description="Column name to store identified language.", + ) + lang_id_output_score_column_name: Optional[str] = Field( + None, + description="Column name to store the score of language identification.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(LangIdentificationInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def lang_id(**kwargs: Any) -> str: + """Tool that apples lang_id transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_lang_id.ray.transform import LangIdentificationRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher( + LangIdentificationRayTransformConfiguration() + ) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_lang_id.transform_python import 
LangIdentificationPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + LangIdentificationPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in lang_id transform - {runtime_type}." + return_code = launcher.launch() + if return_code != 0: + return "Error Job Failed" + + return f"lang_id transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pdf2parquet.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pdf2parquet.py new file mode 100644 index 0000000000..e8c0dfe91e --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pdf2parquet.py @@ -0,0 +1,113 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class Pdf2parquetInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for Pdf2parquetTransform.""" + + pdf2parquet_batch_size: Optional[int] = Field( + None, + description="Number of documents to be saved in the same result table. A value of -1 will generate one result file for each input file.", + ) + pdf2parquet_artifacts_path: Optional[str] = Field( + None, + description="Path where to Docling models artifacts are located, if unset they will be downloaded and fetched from the [HF_HUB_CACHE](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache) folder.", + ) + pdf2parquet_contents_type: Optional[str] = Field( + None, + description="The output type for the `contents` column. Valid types are `text/markdown`, `text/plain` and `application/json`.", + ) + pdf2parquet_do_table_structure: Optional[str] = Field( + None, + description="If true, detected tables will be processed with the table structure model.", + ) + pdf2parquet_do_ocr: Optional[str] = Field( + None, + description="If true, optical character recognition (OCR) will be used to read the content of bitmap parts of the document.", + ) + pdf2parquet_ocr_engine: Optional[str] = Field( + None, + description=" The OCR engine to use. Valid values are `easyocr`, `tesseract`, `tesseract_cli`.", + ) + pdf2parquet_bitmap_area_threshold: Optional[float] = Field( + None, + description="Threshold for running OCR on bitmap figures embedded in document. The threshold is computed as the fraction of the area covered by the bitmap, compared to the whole page area.", + ) + pdf2parquet_pdf_backend: Optional[str] = Field( + None, + description="The PDF backend to use. Valid values are `dlparse_v2`, `dlparse_v1`, `pypdfium2`", + ) + pdf2parquet_double_precision: Optional[int] = Field( + None, + description="If set, all floating points (e.g. bounding boxes) are rounded to this precision. 
For tests it is advised to use 0.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(Pdf2parquetInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def pdf2parquet(**kwargs: Any) -> str: + """Tool that applies pdf2parquet transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from data_processing_ray.runtime.ray import RayTransformLauncher + from dpk_pdf2parquet.ray.transform import Pdf2ParquetRayTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from dpk_pdf2parquet.transform_python import Pdf2ParquetPythonTransformConfiguration + from data_processing.runtime.pure_python import PythonTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + Pdf2ParquetPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in pdf2parquet transform - {runtime_type}." + return_code = launcher.launch() + if return_code != 0: + return "Error pdf2parquet Job Failed" + + return f"pdf2parquet transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pii_redactor.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pii_redactor.py new file mode 100644 index 0000000000..dcbec87afa --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/pii_redactor.py @@ -0,0 +1,100 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class PIIRedactorInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for PIIRedactorTransform.""" + + pii_redactor_entities: Optional[str] = Field( + None, + description="List of entities to be redacted from the input data (defaults to the transform's supported entity types). ", + ) + pii_redactor_operator: Optional[str] = Field( + None, + description="Redaction technique to be applied on detected pii data. Supported techniques: redact, replace.
", + ) + pii_redactor_transformed_contents: Optional[str] = Field( + None, + description="Mention column name in which transformed contents will be added. ", + ) + pii_redactor_score_threshold: Optional[float] = Field( + None, + description="The score_threshold is a parameter that " + "sets the minimum confidence score required for an entity to be considered a match." + "Provide a value above 0.6 ", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(PIIRedactorInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def pii_redactor(**kwargs: Any) -> str: + """Tool that apples pii_redactor transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from pii_redactor_transform_ray import ( + PIIRedactorRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(PIIRedactorRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from pii_redactor_transform_python import ( + PIIRedactorPythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + PIIRedactorPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in pii_redactor transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error pii_redactor Job Failed" + + return f"pii_redactor transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/text_encoder.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/text_encoder.py new file mode 100644 index 0000000000..e3d03cfdd1 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/language/text_encoder.py @@ -0,0 +1,90 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class TextEncoderInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for TextEncoderTransform.""" + + text_encoder_content_column_name: Optional[str] = Field( + None, + description="Name of the column containing the text to be encoded.", + ) + text_encoder_output_embeddings_column_name: Optional[str] = Field( + None, + description="Column name to store the embeddings in the output table.", + ) + text_encoder_model_name: Optional[str] = Field( + None, + description="The HF model to use for encoding the text.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(TextEncoderInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def text_encoder(**kwargs: Any) -> str: + """Tool that apples text_encoder transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + if runtime_type.strip().lower() == "ray": + from dpk_text_encoder.ray.transform import TextEncoderRayTransformConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_text_encoder.transform_python import TextEncoderPythonTransformConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + TextEncoderPythonTransformConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in text_encoder transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error text_encoder Job Failed" + + return f"text_encoder transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
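+        # Illustrative, hypothetical invocation (parameter names follow
+        # TextEncoderInput above; the folder and model values are made up):
+        #
+        #   text_encoder(kwargs={
+        #       "input_folder": "test-data/input",
+        #       "output_folder": "test-data/output",
+        #       "runtime_type": "python",
+        #       "data_type": "local",
+        #       "text_encoder_model_name": "BAAI/bge-small-en-v1.5",
+        #   })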
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/doc_id.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/doc_id.py new file mode 100644 index 0000000000..f8e39a3f6e --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/doc_id.py @@ -0,0 +1,93 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + + +class DocIDInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for DocIDTransform.""" + + doc_id_doc_column: Optional[str] = Field( + None, + description="doc column name", + ) + doc_id_hash_column: Optional[str] = Field( + None, + description="Compute document hash and place in the given named column", + ) + doc_id_int_column: Optional[str] = Field( + None, + description="Compute unique integer id and place in the given named column", + ) + doc_id_start_id: Optional[str] = Field( + None, + description="starting integer id", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(DocIDInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def doc_id(**kwargs: Any) -> str: + """Tool that apples doc_id transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from dpk_doc_id.ray.transform import DocIDRayTransformRuntimeConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_doc_id.transform_python import DocIDPythonTransformRuntimeConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + DocIDPythonTransformRuntimeConfiguration() + ) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in doc id transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error Job Failed" + + return f"doc_id transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
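+        # Illustrative, hypothetical invocation (parameter names follow
+        # DocIDInput above; all values are made up):
+        #
+        #   doc_id(kwargs={
+        #       "input_folder": "test-data/input",
+        #       "output_folder": "test-data/output",
+        #       "runtime_type": "python",
+        #       "data_type": "local",
+        #       "doc_id_hash_column": "doc_hash",
+        #       "doc_id_int_column": "doc_int_id",
+        #   })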
+ except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/ededup.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/ededup.py new file mode 100644 index 0000000000..46167a6e22 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/ededup.py @@ -0,0 +1,105 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +logger = logging.getLogger(__name__) + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, + check_params, +) +from data_processing.utils import ParamsUtils + + +class EdedupInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for EdedupTransform.""" + + ededup_doc_column: Optional[str] = Field( + None, + description="name of the column containing document", + ) + ededup_doc_id_column: Optional[str] = Field( + None, + description="name of the column containing document id", + ) + ededup_use_snapshot: Optional[str] = Field( + None, + description="flag to continue from snapshot", + ) + ededup_snapshot_directory: Optional[str] = Field( + None, + description="location of snapshot files", + ) + ededup_doc_column: Optional[str] = Field( + None, + description="name of the column containing document", + ) + ededup_num_hashes: Optional[int] = Field( + None, + description="Number of hashes should be greater then zero", ) + ededup_hash_cpu: Optional[float] = Field( + None, + description="number of CPUs per hash", ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(EdedupInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def ededup(**kwargs: Any) -> str: + """Tool that apples ededup transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + if runtime_type.strip().lower() == "ray": + from dpk_ededup.ray.transform import EdedupRayTransformRuntimeConfiguration + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_ededup.transform_python import EdedupPythonTransformRuntimeConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher( + EdedupPythonTransformRuntimeConfiguration() + ) + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in ededup transform - {runtime_type}." 
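+        # check_params (imported from dpk_common above) is expected to validate
+        # the assembled parameter set before launch. An illustrative, hypothetical
+        # invocation of this tool (values are made up):
+        #
+        #   ededup(kwargs={
+        #       "input_folder": "test-data/input",
+        #       "output_folder": "test-data/output",
+        #       "runtime_type": "python",
+        #       "data_type": "local",
+        #       "ededup_doc_column": "contents",
+        #       "ededup_doc_id_column": "doc_id",
+        #   })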
+        check_params(transform_params, kwargs)
+        print(f"launching transform with params: {transform_params}")
+        return_code = launcher.launch()
+        if return_code != 0:
+            return "Error ededup Job Failed"
+
+        return f"Ededup transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
+    except Exception as e:
+        return "Error: " + str(e)
diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/fdedup.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/fdedup.py
new file mode 100644
index 0000000000..f01792c165
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/fdedup.py
@@ -0,0 +1,122 @@
+import logging
+from typing import Optional, Type
+import sys
+from typing import Any
+
+from pydantic import BaseModel, Field
+
+logger = logging.getLogger(__name__)
+
+from llm_utils.dpk.dpk_common import (
+    DPKDataAccessInput,
+    DPKRuntimeInput,
+    add_runtime_params,
+    add_data_access_params,
+    check_params,
+)
+from data_processing.utils import ParamsUtils
+
+
+class FdedupInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput):
+    """Input for FdedupTransform."""
+
+    operation_mode: Optional[str] = Field(
+        None,
+        description="operation mode for data cleanup",
+    )
+    contents_column: Optional[str] = Field(
+        None,
+        description="name of the column that stores document text",
+    )
+    document_id_column: Optional[str] = Field(
+        None,
+        description="name of the column containing document id",
+    )
+    seed: Optional[int] = Field(
+        None,
+        description="seed of the random number generator",
+    )
+    num_permutations: Optional[int] = Field(
+        None,
+        description="number of permutations to use for minhash calculation",
+    )
+    num_bands: Optional[int] = Field(
+        None,
+        description="number of bands used for LSH bucketing of the minhashes",
+    )
+    num_minhashes_per_band: Optional[int] = Field(
+        None,
+        description="number of minhashes to use in each band",
+    )
+    word_shingle_size: Optional[int] = Field(
+        None,
+        description="number of words included in one shingle",
+    )
+    jaccard_similarity_threshold: Optional[float] = Field(
+        None,
+        description="jaccard similarity threshold above which two documents are considered duplicates",
+    )
+    num_segments: Optional[int] = Field(
+        None,
+        description="the number of segments dividing the hashing space for each band (for scalability)",
+    )
+    services: Optional[str] = Field(
+        None,
+        description="Comma separated list of services to run",
+    )
+    shingle_option: Optional[str] = Field(
+        None,
+        description="Option used for shingling (e.g., by word or by character)",
+    )
+
+
+def add_transform_params(transform_params: dict, kwargs):
+    """Add transform specific params"""
+    fields = list(FdedupInput.__annotations__.keys())
+    for field in fields:
+        if field in kwargs and kwargs[field] is not None:
+            transform_params[field] = kwargs[field]
+
+
+def fdedup(**kwargs: Any) -> str:
+    """Tool that applies fdedup transform."""
+
+    kwargs = kwargs.get("kwargs", None)
+
+    input_folder = kwargs.get("input_folder", "")
+    output_folder = kwargs.get("output_folder", "")
+
+    if input_folder == "" or output_folder == "":
+        return "Error: input folder or output folder are missing"
+
+    try:
+        runtime_type = kwargs.get("runtime_type", "python")
+        data_type = kwargs.get("data_type", "local")
+        transform_params = {
+            "input_folder": input_folder,
+            "output_folder": output_folder,
+        }
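+        # The dict assembled here is converted by ParamsUtils.dict_to_req (below)
+        # into argv-style arguments (roughly: --input_folder <path>
+        # --output_folder <path> ...), which parse_args() then consumes. The exact
+        # wire format is defined by data_processing.utils.ParamsUtils.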
+        add_runtime_params(transform_params, runtime_type, kwargs)
+        # add_data_access_params(transform_params, data_type, kwargs)
+        add_transform_params(transform_params, kwargs)
+        transform_params.pop("data_type", None)
+
+        if runtime_type.strip().lower() == "ray":
+            from dpk_fdedup.ray.transform import RayServiceOrchestrator
+            from dpk_fdedup.transform_python import parse_args
+
+            sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+            args = parse_args()
+            orchestrator = RayServiceOrchestrator(global_params=args)
+
+        elif runtime_type.strip().lower() == "python":
+            from dpk_fdedup.transform_python import ServiceOrchestrator, parse_args
+
+            sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+            args = parse_args()
+            orchestrator = ServiceOrchestrator(global_params=args)
+
+        else:
+            return f"Error: Unrecognizable type of TransformRuntimeConfiguration in Fdedup transform - {runtime_type}."
+        print(f"launching transform with params: {transform_params}")
+        return_code = orchestrator.orchestrate()
+        if return_code != 0:
+            return "Error Fdedup Job Failed"
+
+        return f"Fdedup transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
+    except Exception as e:
+        return "Error: " + str(e)
diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/filter.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/filter.py
new file mode 100644
index 0000000000..7c4c16a040
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/filter.py
@@ -0,0 +1,90 @@
+import logging
+from typing import Optional, Type
+import sys
+from typing import Any
+from pydantic import BaseModel, Field
+
+logger = logging.getLogger(__name__)
+
+from llm_utils.dpk.dpk_common import (
+    DPKDataAccessInput,
+    DPKRuntimeInput,
+    add_runtime_params,
+    add_data_access_params,
+)
+from data_processing.utils import ParamsUtils
+
+
+class FilterInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput):
+    """Input for FilterTransform."""
+
+    filter_criteria_list: Optional[str] = Field(
+        None,
+        description="list of filter criteria (in SQL WHERE clause format).",
+    )
+    filter_columns_to_drop: Optional[str] = Field(
+        None,
+        description="list of columns to drop after filtering.",
+    )
+    filter_logical_operator: Optional[str] = Field(
+        None,
+        description="logical operator (AND or OR) used to combine the filter criteria.",
+    )
+
+
+def add_transform_params(transform_params: dict, kwargs):
+    """Add transform specific params"""
+    fields = list(FilterInput.__annotations__.keys())
+    for field in fields:
+        if field in kwargs and kwargs[field] is not None:
+            transform_params[field] = kwargs[field]
+
+
+def filter(**kwargs: Any) -> str:
+    """Tool that applies filter transform."""
+
+    kwargs = kwargs.get("kwargs", None)
+
+    input_folder = kwargs.get("input_folder", "")
+    output_folder = kwargs.get("output_folder", "")
+
+    if input_folder == "" or output_folder == "":
+        return "Error: input folder or output folder are missing"
+
+    try:
+        runtime_type = kwargs.get("runtime_type", "python")
+        data_type = kwargs.get("data_type", "local")
+        transform_params = {
+            "input_folder": input_folder,
+            "output_folder": output_folder,
+        }
+        add_runtime_params(transform_params, runtime_type, kwargs)
+        add_data_access_params(transform_params, data_type, kwargs)
+        add_transform_params(transform_params, kwargs)
+
+        if runtime_type.strip().lower() == "ray":
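+            # Runtime-specific imports are deferred into each branch so that the
+            # Ray stack is only imported when runtime_type == "ray".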
+            from dpk_filter.ray.transform import FilterRayTransformConfiguration
+            from data_processing_ray.runtime.ray import RayTransformLauncher
+
+            sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+            launcher = RayTransformLauncher(FilterRayTransformConfiguration())
+
+        elif runtime_type.strip().lower() == "python":
+            from data_processing.runtime.pure_python import PythonTransformLauncher
+            from dpk_filter.transform_python import (
+                FilterPythonTransformConfiguration,
+            )
+
+            sys.argv = ParamsUtils.dict_to_req(d=transform_params)
+            launcher = PythonTransformLauncher(FilterPythonTransformConfiguration())
+
+        else:
+            return f"Error: Unrecognizable type of TransformRuntimeConfiguration in filter transform - {runtime_type}."
+        print(f"launching transform with params: {transform_params}")
+        return_code = launcher.launch()
+        if return_code != 0:
+            return "Error filter Job Failed"
+
+        return f"filter transform successfully applied with input_folder {input_folder} output_folder {output_folder}."
+    except Exception as e:
+        return "Error: " + str(e)
diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/resize.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/resize.py
new file mode 100644
index 0000000000..78f99ba081
--- /dev/null
+++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/resize.py
@@ -0,0 +1,90 @@
+import logging
+from typing import Optional, Type
+import sys
+from typing import Any
+
+from pydantic import BaseModel, Field
+
+logger = logging.getLogger(__name__)
+
+from llm_utils.dpk.dpk_common import (
+    DPKDataAccessInput,
+    DPKRuntimeInput,
+    add_runtime_params,
+    add_data_access_params,
+)
+from data_processing.utils import ParamsUtils
+
+
+class ResizeInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput):
+    """Input for ResizeTransform."""
+
+    resize_max_rows_per_table: Optional[int] = Field(
+        None,
+        description="Max number of rows per table",
+    )
+    resize_max_mbytes_per_table: Optional[float] = Field(
+        None,
+        description="Max table size (MB). 
Size is measured according to the --resize_size_type parameter", + ) + resize_size_type: Optional[str] = Field( + None, + description="Determines how memory is measured when using the --resize_max_mbytes_per_table option.", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(ResizeInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def resize(**kwargs: Any) -> str: + """Tool that apples resize transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from resize_transform_ray import ( + ResizeRayTransformConfiguration, + ) + from data_processing_ray.runtime.ray import RayTransformLauncher + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(ResizeRayTransformConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from resize_transform_python import ( + ResizePythonTransformConfiguration, + ) + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(ResizePythonTransformConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in resize transform - {runtime_type}." + print(f"launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error resize Job Failed" + return f"resize transform successfully applied with input_folder {input_folder} output_folder {output_folder}." + except Exception as e: + return "Error: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/tokenization.py b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/tokenization.py new file mode 100644 index 0000000000..992e8c3186 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/llama_index_dpk/tools/dpk/universal/tokenization.py @@ -0,0 +1,100 @@ +import logging +from typing import Optional, Type +import sys +from typing import Any + +from pydantic import BaseModel, Field + +from llm_utils.dpk.dpk_common import ( + DPKDataAccessInput, + DPKRuntimeInput, + add_runtime_params, + add_data_access_params, +) +from data_processing.utils import ParamsUtils + +logger = logging.getLogger(__name__) + + +class TokenizationInput(BaseModel, DPKDataAccessInput, DPKRuntimeInput): + """Input for TokenizationTransform.""" + + tkn_tokenizer: Optional[int] = Field( + None, + description="Tokenizer used for tokenization. It also can be a path to a pre-trained tokenizer. 
By default, `hf-internal-testing/llama-tokenizer` from HuggingFace is used", + ) + tkn_tokenizer_args: Optional[int] = Field( + None, + description="Arguments for tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be arguments for `bigcode/starcoder`", + ) + tkn_doc_id_column: Optional[int] = Field( + None, + description="Column contains document id which values should be unique across dataset", + ) + tkn_doc_content_column: Optional[int] = Field( + None, + description="Column contains document content", + ) + tkn_text_lang: Optional[int] = Field( + None, + description="Specify language used in text content for better text splitting if needed", + ) + tkn_chunk_size: Optional[int] = Field( + None, + description="Specify >0 value to tokenize each row/text in chunks of characters (rounded in words)", + ) + + +def add_transform_params(transform_params: dict, kwargs): + """Add transform specific params""" + fields = list(TokenizationInput.__annotations__.keys()) + for field in fields: + if field in kwargs and kwargs[field] is not None: + transform_params[field] = kwargs[field] + + +def tokenization(**kwargs: Any) -> str: + """Tool that apples tokenization transform.""" + + kwargs = kwargs.get("kwargs", None) + + input_folder = kwargs.get("input_folder", "") + output_folder = kwargs.get("output_folder", "") + + if input_folder == "" or output_folder == "": + return "Error: input folder or output folder are missing" + try: + runtime_type = kwargs.get("runtime_type", "python") + data_type = kwargs.get("data_type", "local") + transform_params = { + "input_folder": input_folder, + "output_folder": output_folder, + } + add_runtime_params(transform_params, runtime_type, kwargs) + add_data_access_params(transform_params, data_type, kwargs) + add_transform_params(transform_params, kwargs) + + if runtime_type.strip().lower() == "ray": + from data_processing_ray.runtime.ray import RayTransformLauncher + from dpk_tokenization.ray.transform import TokenizationRayConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = RayTransformLauncher(TokenizationRayConfiguration()) + + elif runtime_type.strip().lower() == "python": + from data_processing.runtime.pure_python import PythonTransformLauncher + from dpk_tokenization.transform_python import TokenizationPythonConfiguration + + sys.argv = ParamsUtils.dict_to_req(d=transform_params) + launcher = PythonTransformLauncher(TokenizationPythonConfiguration()) + + else: + return f"Error: Unrecognizable type of TransformRuntimeConfiguration in tokenization transform - {runtime_type}." + print(f"Launching transform with params: {transform_params}") + return_code = launcher.launch() + if return_code != 0: + return "Error Tokenization Job Failed" + + return f"Tokenization transform successfully applied with input_folder {input_folder} output_folder {output_folder}." 
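+        # Illustrative, hypothetical invocation (parameter names follow
+        # TokenizationInput above; the folders are made up, and the tokenizer is
+        # the default named in the field description):
+        #
+        #   tokenization(kwargs={
+        #       "input_folder": "test-data/input",
+        #       "output_folder": "test-data/output",
+        #       "runtime_type": "python",
+        #       "data_type": "local",
+        #       "tkn_tokenizer": "hf-internal-testing/llama-tokenizer",
+        #   })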
+ except Exception as e: + return "Error!!: " + str(e) diff --git a/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/pyproject.toml b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/pyproject.toml new file mode 100644 index 0000000000..b0e7f2d28a --- /dev/null +++ b/examples/agentic/llm_utils/dpk/llama_index_tools/llama_index_tools_dpk/pyproject.toml @@ -0,0 +1,54 @@ +[build-system] +build-backend = "poetry.core.masonry.api" +requires = ["poetry-core"] + +[tool.codespell] +check-filenames = true +check-hidden = true +# Feel free to un-skip examples, and experimental, you will just need to +# work through many typos (--write-changes and --interactive will help) +skip = "*.csv,*.html,*.json,*.jsonl,*.pdf,*.txt,*.ipynb" + +[tool.llamahub] +contains_example = true +import_path = "llama_index_dpk.tools.dpk" + +[tool.mypy] +disallow_untyped_defs = true +# Remove venv skip when integrated with pre-commit +exclude = ["_static", "build", "examples", "notebooks", "venv"] +ignore_missing_imports = true +python_version = "3.11" + +[tool.poetry] +description = "llama-index tools DPK integration" +license = "MIT" +name = "llama-index-tools-dpk" +packages = [{include = "llama_index_dpk/"}] +version = "0.0.1" + +[tool.poetry.dependencies] +python = ">=3.9,<4.0" +data-prep-toolkit = "0.2.3" +data-prep-toolkit-transforms = "1.0.0a2" +llama-index-core = "^0.12.0" + +[tool.poetry.group.dev.dependencies] +black = {extras = ["jupyter"], version = "<=23.9.1,>=23.7.0"} +codespell = {extras = ["toml"], version = ">=v2.2.6"} +ipython = "8.10.0" +jupyter = "^1.0.0" +mypy = "0.991" +pre-commit = "3.2.0" +pylint = "2.15.10" +pytest = "7.2.1" +pytest-mock = "3.11.1" +python-dotenv = "^1.0.0" +ruff = "0.0.292" +tree-sitter-languages = "^1.8.0" +types-Deprecated = ">=0.1.0" +types-PyYAML = "^6.0.12.12" +types-protobuf = "^4.24.0.4" +types-redis = "4.5.5.0" +types-requests = "2.28.11.8" # TODO: unpin when mypy>0.991 +types-setuptools = "67.1.0.0" diff --git a/examples/agentic/llm_utils/dpk/tools.py b/examples/agentic/llm_utils/dpk/tools.py new file mode 100644 index 0000000000..a584161537 --- /dev/null +++ b/examples/agentic/llm_utils/dpk/tools.py @@ -0,0 +1,50 @@ +tools_json = ''' +[ + {"name": "exact_dedup", "description": "Exact data deduplication is used to identify (and remove) redundant records.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}], + "import": "from llm_utils.dpk.langchain_tools.tools.universal.ededup import EdedupTransform"}, + + {"name": "Pdf2Parquet", "description": "The Pdf2Parquet transform generates parquet files containing the converted document.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}, + {"name": "data_files_to_use", "description": "files extentions to transform.", "type": "list"}], + "import": "from llm_utils.dpk.langchain_tools.tools.language.pdf2parquet import Pdf2parquetTransform"}, + + {"name": "doc_quality", "description": "The doc_quality transform will calculate and annotate several metrics which are useful to assess the quality of the document.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", 
"description": "destination directory to store the transformed files.", "type": "str"}, + {"name": "docq_bad_word_filepath", "description": "path to bad words file.", "type": "str"}], + "import": "from llm_utils.dpk.langchain_tools.tools.language.doc_quality import DocQualityTransform"}, + + {"name": "document_id", "description": "The Document ID transforms adds a document identification (unique integers and content hashes), which later can be used in de-duplication operations.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}, + {"name": "doc_id_int_column", "description": "Compute unique integer id and place in the given named column.", "type": "str"}], + "import": "from llm_utils.dpk.langchain_tools.tools.universal.doc_id import DocIDTransform"}, + + {"name": "language_id", "description": "The Language Identification transforms added a column containing the language of the document.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}], + "import": "from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform"}, + + {"name": "filter_transform", "description": "The filter transforms provides SQL-based expressions for filtering rows and optionally column removal from parquet files.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}, + {"name": "filter_criteria_list", "description": "list of sql queries to filter the input files.", "type": "list"}], + "import": "from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform"}, + + {"name": "tokenization", "description": "The tokenization transform annotates pyarrow tables and parquet files to add a column containing tokens for the document column.", + "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, + {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}], + "import": "from llm_utils.dpk.langchain_tools.tools.universal.tokenization import TokenizationTransform"}, + + {"name": "tool_not_implemented", "description": "A placeholder tool for the casses when a suitable tool cannot be found"} +] +''' + +# tools_json = json.dumps(json.loads(tools_json)) +# {"name": "fuzzy_dedup", "description": "The fdedup transforms removes documents that are very similar to each other.", +# "input": [{"name": "in_folder", "description": "input directory to transform files from.", "type": "str"}, +# {"name": "out_folder", "description": "destination directory to store the transformed files.", "type": "str"}], +# "import": "from llm_utils.dpk.langchain_tools.tools.universal.fdedup import FdedupTransform"}, \ No newline at end of file diff --git a/examples/agentic/llm_utils/logging.py b/examples/agentic/llm_utils/logging.py new file mode 100644 index 0000000000..4ca8f2329a --- /dev/null +++ b/examples/agentic/llm_utils/logging.py @@ -0,0 +1,159 @@ +# from gin/common/logging.py + +""" +Module for holding logging information +""" + +import logging.handlers +import os +import logging +from pathlib 
import Path + + +# ANSI SGR control codes for text formatting +TEXT = { + "DEFAULT": "\x1b[0m", + "BOLD": "\x1b[1m", + "BOLD_OFF": "\x1b[22m", + "UNDERLINE": "\x1b[4m", + "UNDERLINE_OFF": "\x1b[24m", + "DEFAULT_COLOR": "\x1b[39m", + "DEFAULT_BG_COLOR": "\x1b[49m", + "RED": "\x1b[31m", + "YELLOW": "\x1b[33m", + "GREEN": "\x1b[32m", + "CYAN": "\x1b[36m", + "BLUE": "\x1b[34m", + "MAGENTA": "\x1b[35m", + "BLACK": "\x1b[30m", + "WHITE": "\x1b[37m", + "BG_RED": "\x1b[41m", + "BG_YELLOW": "\x1b[43m", + "BG_GREEN": "\x1b[42m", + "BG_CYAN": "\x1b[46m", + "BG_BLUE": "\x1b[44m", + "BG_MAGENTA": "\x1b[45m", + "BG_BLACK": "\x1b[40m", + "BG_WHITE": "\x1b[47m", +} + + +class Logging: + """ + Task-dependent logger names + """ + + # logging package requires logger names must be strings, so this class + # must not inherit from enum + BASE = "base" + LLM = "llm" + AGENTIC_WORKFLOW = "agentic_workflow" + TOOL_CALLING = "tool_calling" + + +class Formatter(logging.Formatter): + """ + Custom log formatting for GIN modules. + """ + + def __init__(self): + # Log message format + msg_fmt = f"{TEXT['BOLD']}%(name)s:%(levelname)s:{TEXT['BOLD_OFF']}%(message)s" + + # Dictionary of formatters that add color codes (based on log level) + self.formatters: dict[int, logging.Formatter] = { + logging.DEBUG: logging.Formatter(TEXT["BLUE"] + msg_fmt + TEXT["DEFAULT"]), + logging.INFO: logging.Formatter(TEXT["GREEN"] + msg_fmt + TEXT["DEFAULT"]), + logging.WARNING: logging.Formatter( + TEXT["YELLOW"] + msg_fmt + TEXT["DEFAULT"] + ), + logging.ERROR: logging.Formatter(TEXT["RED"] + msg_fmt + TEXT["DEFAULT"]), + logging.CRITICAL: logging.Formatter( + TEXT["BG_RED"] + TEXT["BLACK"] + msg_fmt + TEXT["DEFAULT"] + ), + } + + super().__init__() + + def format(self, record: logging.LogRecord): + """ + Format the specified record as text, while adding ANSI color codes + based on log level. + """ + # Get the log formatter based on log level + formatter = self.formatters[record.levelno] + return formatter.format(record) + + +def prep_loggers(log_level: str, default_log_level: str = "WARNING") -> int: + """ + Prepare the GIN loggers. + + Args: + log_level (str): Level(s) to set loggers to. This is a comma separated + list (without spaces) of levels for each logger in the format: + LOGGER_NAME1=LEVEL1,LOGGER_NAME2=LEVEL2,DEFAULT_LEVEL + The default level can be specified anywhere (or nowhere) and will + be identified by the lack of a '=' symbol. + default_log_level (str, optional): Default logging level, if not + otherwise specified in log_level. Defaults to "WARNING". 
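+
+    Example (based on the parsing rules above):
+        prep_loggers("llm=DEBUG,INFO") sets the "llm" logger to DEBUG and the
+        default level for all remaining loggers to INFO.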
+ + Returns: + int: Error code (returns 0 for no error, 1 otherwise) + """ + log_levels = {"_default": default_log_level} + for logger_level_pair in log_level.split(","): + if "=" in logger_level_pair: + logger_name, level = logger_level_pair.split("=") + else: + # This is not a pair, but a default level + logger_name = "_default" + level = logger_level_pair + logger_attrs = [attr for attr in dir(Logging) if attr[0] != "_"] + logger_names = list(map(lambda x: getattr(Logging, x), logger_attrs)) + if logger_name not in logger_names and logger_name != "_default": + logging.error("Invalid logger name: %s", logger_name) + logging.error("Must be one of: %s", logger_names) + return 1 + if level not in logging._nameToLevel: + logging.error("Invalid logging level: %s", level) + logging.error( + "Must be one of: %s", + str(list(logging._nameToLevel.keys()))[1:-1], + ) + return 1 + log_levels[logger_name] = level + + # Set up loggers + logging.basicConfig() + # Set the default logging level across all loggers + logging.getLogger().setLevel(log_levels["_default"]) + # Get handler for custom formatting, and apply to all GIN loggers + handler = logging.StreamHandler() + handler.setFormatter(Formatter()) + for logger_attr in [attr for attr in dir(Logging) if attr[0] != "_"]: + logger_name = getattr(Logging, logger_attr) + logger = logging.getLogger(logger_name) + # Check if there is an environment override for any of the log files + logger_env_override = f"{logger_name}_log_path".upper() + logger_path = os.environ.get(logger_env_override, None) + if logger_path: + path = Path(logger_path) + log_dir = path.parent + log_dir.mkdir(parents=True, exist_ok=True) + handler = logging.handlers.RotatingFileHandler( + logger_path, maxBytes=102400, backupCount=5 + ) + # Apply handler to logger + logger.addHandler(handler) + # Only use custom formatter + logger.propagate = False + if logger_name in log_levels: + # Set the custom logging level + logger.setLevel(log_levels[logger_name]) + if logger_path: + # Make sure that the log level is set to INFO or finer + if logger.getEffectiveLevel() > logging.INFO: + logger.setLevel(logging.INFO) + + return 0 diff --git a/examples/agentic/llm_utils/models.py b/examples/agentic/llm_utils/models.py new file mode 100644 index 0000000000..5419147cd5 --- /dev/null +++ b/examples/agentic/llm_utils/models.py @@ -0,0 +1,136 @@ +import os +from dotenv import dotenv_values +from langchain_core.language_models.chat_models import BaseChatModel +from langchain_core.language_models.llms import LLM +from langchain.callbacks.base import BaseCallbackHandler +from langchain.schema import BaseMessage, ChatResult, ChatGeneration, AIMessage +from langchain.chat_models.base import BaseChatModel +from pydantic import Field +from llm_utils.callbacks import LoggingCallbackHandler +from typing import Iterator, List, Optional, Any, Dict +import replicate + +CONFIG_LOCATION = ".env" + + +class ReplicateChatModel(BaseChatModel): + model_id: str = Field(description="The Replicate model ID") + params: Dict = Field(default_factory=dict, description="Model parameters") + + def _generate(self, messages: List[BaseMessage], stop: Optional[List[str]] = None, run_manager: Optional = None, + **kwargs) -> ChatResult: + prompt = " ".join(m.content for m in messages) + response = replicate.run(self.model_id, input={"prompt": prompt, **self.params}) + message = AIMessage(content=response) + return ChatResult(generations=[ChatGeneration(message=message)]) + + def _stream(self, messages: List[BaseMessage], stop: 
Optional[List[str]] = None, run_manager: Optional = None, + **kwargs) -> Iterator[ChatGeneration]: + print(f"replicate stream") + prompt = " ".join(m.content for m in messages) + for chunk in replicate.stream(self.model_id, input={"prompt": prompt, **self.params}): + yield ChatGeneration(message=AIMessage(content=chunk)) + + @property + def _llm_type(self) -> str: + return "replicate" + + +def getLLM(inference: str, model_id: str = None, config: dict = None) -> LLM: + loggingCallbackHandler = LoggingCallbackHandler() + if config is None: + config = dotenv_values(CONFIG_LOCATION) + + if inference == "ollama": + from langchain_ollama.llms import OllamaLLM + + if model_id is None or len(model_id) == 0: + model_id = "llama3.1:70b" + return OllamaLLM(model=model_id, temperature=0, callbacks=[loggingCallbackHandler]) + elif inference == "watsonx": + from langchain_ibm import WatsonxLLM + from genai.schema import DecodingMethod + + parameters = { + "decoding_method": DecodingMethod.GREEDY, + "max_new_tokens": 1024, + "min_new_tokens": 1, + "temperature": 0, + "top_k": 50, + "top_p": 1, + } + if model_id is None or len(model_id) == 0: + # see supported models at https://dataplatform.cloud.ibm.com/samples?context=wx + model_id = "meta-llama/llama-3-70b-instruct" + + return WatsonxLLM( + model_id=model_id, + apikey=config["WATSONX_APIKEY"], + url=config["WATSONX_URL"], + project_id=config["WATSON_PROJECT_ID"], + params=parameters, + callbacks=[loggingCallbackHandler], + + ) + else: + raise ValueError( + f"Inference type {inference} is wrong, supported values are [ollama, watsonx]" + ) + + +def getChatLLM( + inference: str, model_id: str = None, config: dict = None, params: Optional[Dict[str, Any]] = None +) -> BaseChatModel: + loggingCallbackHandler = LoggingCallbackHandler() + + if config is None: + config = dotenv_values(CONFIG_LOCATION) + + if inference == "ollama": + from langchain_ollama import ChatOllama + + if model_id is None or len(model_id) == 0: + model_id = "llama3.1:70b" + return ChatOllama(model=model_id, temperature=0, callbacks=[loggingCallbackHandler]) + + elif inference == "watsonx": + from langchain_ibm import ChatWatsonx + from genai.schema import DecodingMethod + + parameters = { + "decoding_method": DecodingMethod.GREEDY, + "max_new_tokens": 1024, + "min_new_tokens": 1, + "temperature": 0, + "top_k": 50, + "top_p": 1, + } + if model_id is None or len(model_id) == 0: + # see supported models at https://dataplatform.cloud.ibm.com/samples?context=wx + model_id = "meta-llama/llama-3-70b-instruct" + + return ChatWatsonx( + model_id=model_id, + apikey=config["WATSONX_APIKEY"], + url=config["WATSONX_URL"], + project_id=config["WATSON_PROJECT_ID"], + params=parameters, + callbacks=[loggingCallbackHandler], + ) + elif inference == "replicate": + if model_id is None: + model_id = "meta/meta-llama-3-70b-instruct" + default_params = { + "temperature": 0, + "max_length": 1024, + "max_new_tokens": 4096, + "top_p": 1 + } + if params: + default_params.update(params) + os.environ["REPLICATE_API_TOKEN"] = config["REPLICATE_API_TOKEN"] + return ReplicateChatModel(model_id=model_id, params=default_params, callbacks=[loggingCallbackHandler]) + else: + raise ValueError( + f"Inference type {inference} is wrong, supported values are [ollama, watsonx]" + ) diff --git a/examples/agentic/llm_utils/prompts/__init__.py b/examples/agentic/llm_utils/prompts/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/agentic/llm_utils/prompts/generate_prompt.py 
b/examples/agentic/llm_utils/prompts/generate_prompt.py new file mode 100644 index 0000000000..acf76fed12 --- /dev/null +++ b/examples/agentic/llm_utils/prompts/generate_prompt.py @@ -0,0 +1,190 @@ +generate_prompt_str = """ +Create a Python script that implements the following workflow plan. Each step in the plan follows this JSON structure: +{step_template} +Key requirements: + +Assume all tools are available in a 'tools.py' file - just import them and don't put an implementation of the tools +Create a main function that accepts the variable parameters as command-line arguments using argparse +Create an execute_workflow function that implements the plan step by step +Each step should store its result in a results dictionary using the step_ev as key +Each subsequent step should be able to reference previous steps' results using the step_ev values +Include basic error handling and progress logging. +The script should be runnable from command line. +The python code shouldn't include the dictionaries of the plan's steps +parse the plan steps and for each step add its tool call. +If there are hardcoded values of the input parameters of the steps then add an appropriate input argument with the hardcoded value as a default values of the argument. + +Here's the plan to implement: +{plan} +Please generate a complete, runnable Python script that implements this particular plan without including it, the file should got all the input parameters in the main as args. +Note: If a step uses a previous step's output, it will reference it using the step_ev value (e.g., #E1, #E2). Your implementation should resolve these references to their actual values during execution. + +Here's a template for the script structure: + +import argparse +import os +from tools import ( + #tools functions +) + +def extract_output_folder(message): + pattern = r"output_folder\s+(?:{)?(.*?)(?:}|\.|$)" + match = re.search(pattern, message) + if match: + return match.group(1) + return None + +def execute_workflow(args): + results["Ev1"] = transform1()._run(args.in_folder, args.out_folder+"_transform1Name", rest params) + results["Ev2"] = transform2()._run(extract_output_folder(results["Ev1"]), extract_output_folder(results["Ev1"])+"_transform2Name", rest params) + ... + return results + + +def main(): + parser = argparse.ArgumentParser(description="Execute the workflow plan.") + parser.add_argument( + "--in_folder", + type=str, + required=True, + help="Input folder for the workflow.", + ) + parser.add_argument( + "--out_folder", + type=str, + required=True, + help="Output folder for the workflow.", + )parser.add_argument( + "--tool_param", + type=type of the parameter, + required=True, + help="description of the argument", + ) + # The rest of the arguments + + args = parser.parse_args() + + try: + results = execute_workflow(args) + print(results) + except Exception as e: + print(f"An error occurred") + + +if __name__ == "__main__": + main() +""" + + +# Each step in the plan follows this JSON structure: +# {step_template} +generate_prompt_str_with_example = """ +Create a Python script that implements the following workflow plan. +Key requirements: + +Create a main function that accepts the variable parameters as command-line arguments using argparse. +Create an execute_workflow function that implements the plan step by step. +Each step should store its result in a results dictionary using the step_ev as key. +Each subsequent step should be able to reference previous steps' results using the step_ev values. 
+The script should be runnable from command line. +The python code shouldn't include the dictionaries of the plan's steps. +Parse the plan steps and for each step add its tool call. +If there are hardcoded values of the input parameters of the steps then replace it with an appropriate input argument with the hardcoded value as a default values of the argument (for example filter_criteria_list=args.filter_criteria_list, data_files_to_use=args.data_files_to_use). +The function calls of "_run()" of the steps must not include hardcoded values in the parameters, just use args or outputs from previous steps. +Each parameter in the transform run call should be from an output of a previous transform or an argument of the script. +The script must include the import lines of the transform tools. +The script must use from parse_output function and import it "from helpers import parse_output". +The script must get the envirnoment variables and pass the data_type and data_s3_cred parameters to each step. + +Here's the plan to implement: +==== +{plan} +==== +Please generate a complete, runnable Python script that implements this particular plan without including it, the file should got all the input parameters in the main as args. +Note: If a step uses a previous step's output, it will reference it using the step_ev value (e.g., #E1, #E2). Your implementation should resolve these references to their actual values during execution. +==== + +Here's an example: +==== +For the following plan +{{"step_name": "Step #1 language identification", "tool_name": "language_id", "tool_input": [{{"in_folder": "user_input", "out_folder": "user_input"}}], "import": "from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform", "step_ev": "Ev1"}} +{{"step_name": "Step #2 filter english documents", "tool_name": "filter_transform", "tool_input": [{{"in_folder": "#Ev1", "out_folder": "#Ev1", "filter_criteria_list": "[lang==en]"}}], "import": "from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform", "step_ev": "Ev2"}} +{{"step_name": "Step #3 tokenization", "tool_name": "tokenization", "tool_input": [{{"in_folder": "#Ev2", "out_folder": "#Ev2"}}], "import": "from llm_utils.dpk.langchain_tools.tools.universal.tokenization import TokenizationTransform"}} + +the code should be: +```python +import argparse +import os +from helpers import parse_output +from llm_utils.dpk.langchain_tools.tools.universal.filter import FilterTransform +from llm_utils.dpk.langchain_tools.tools.language.lang_id import LangIdentificationTransform +from llm_utils.dpk.langchain_tools.tools.universal.tokenization import TokenizationTransform + +def execute_workflow(args): + results = + res = LangIdentificationTransform()._run( + data_type=args.data_type, + data_s3_cred={{"access_key": args.access_key, "secret_key": args.secret_key, "url": args.url}}, + input_folder=args.in_folder, + output_folder=args.out_folder+"_langid" + ) + results["Ev1"] = parse_output(res) + res = FilterTransform()._run( + data_type=args.data_type, + data_s3_cred={{"access_key": args.access_key, "secret_key": args.secret_key, "url": args.url}}, + input_folder=results["Ev1"], + output_folder=results["Ev1"]+"_filter", + filter_criteria_list=args.filter_criteria_list + ) + results["Ev2"] = parse_output(res) + res = TokenizationTransform()._run( + data_type=args.data_type, + data_s3_cred={{"access_key": args.access_key, "secret_key": args.secret_key, "url": args.url}}, + input_folder=results["Ev2"], + 
output_folder=results["Ev2"]+"_token" + ) + results["Ev3"] = parse_output(res) + return results + + +def main(): + parser = argparse.ArgumentParser(description="Execute the workflow plan.") + parser.add_argument( + "--in_folder", + type=str, + required=True, + help="Input folder for the workflow.", + ) + parser.add_argument( + "--out_folder", + type=str, + required=True, + help="Output folder for the workflow.", + ) + parser.add_argument( + "--filter_criteria_list", + type=str, + required=True, + default="[lang==en]", + help="Filter query for the workflow.", + ) + + args = parser.parse_args() + args.data_type = "s3" + args.access_key = os.environ.get('ACCESS_KEY') + args.secret_key = os.environ.get('SECRET_KEY') + args.url = os.environ.get('MINIO_URL') + + try: + results = execute_workflow(args) + print(results) + except Exception as e: + print(f"An error occurred:") + + +if __name__ == "__main__": + main() +``` +==== + +""" diff --git a/examples/agentic/llm_utils/prompts/judge_prompt.py b/examples/agentic/llm_utils/prompts/judge_prompt.py new file mode 100644 index 0000000000..629c2009e9 --- /dev/null +++ b/examples/agentic/llm_utils/prompts/judge_prompt.py @@ -0,0 +1,53 @@ +judge_prompt_str_dpk = """ +Please evaluate the following task and its implementation plan: + +Task: {task} +tools: +{tools} +constraints: +{context} +Plan: +{plan} + +Please analyze this plan using the following criteria: + - Does the plan properly addressing the task requirements according to the given tools? + - Does the plan satisfies all constraints? + - Does the plan include unnecessary steps (according to the task description and the constraints)? + +Please provide: +1. An overall assessment of the plan's validity +2. Specific issues found with constraints (if any), no more than three sentences. +3. Show the unnecessary transforms. + +On a different line write either "NEEDS_REVISION: Yes" or "NEEDS_REVISION: No". +""" + # 4. If the plan is not according to the provided example, for example, it contains reviews or additional comments, please recommend what should be changed. + # 3. Wrong use of outputs of the previous steps. + +judge_prompt_str1 = """ +Please evaluate the following task and its implementation plan: + +Task: {task} +~~~~~~ +Plan: {plan} +~~~~~~ +Tools: {tools} +~~~~~~ +Context: {context} +~~~~~~ + +Please analyze this plan using the following criteria: + - Does the plan properly address the task requirements according to the given tools? + - Does the plan include unnecessary steps? + - Does the plan satisfies all constraints? + - Does the plan use tool_not_implemented? + - Check that the tools' inputs are user-provided inputs or outputs from previous steps. + +Please provide instructions (no more than three sentences) to the planner to update the plan (if needed) according to the following: + 1. An overall assessment of the plan's validity. + 2. Specific issues found (if any) according to the provided citeria (be specific with no more than three sentences). + 3. Show the values of the input parameters of the tools that are hardcoded values (different than user_input or #E). Please ask the planner to change the hardcoded values to user_input value. Refer just to values of input params. + 4. Show the extra steps. + +On a different line write either "NEEDS_REVISION: Yes" or "NEEDS_REVISION: No". 
+""" diff --git a/examples/agentic/llm_utils/prompts/planner_prompt.py b/examples/agentic/llm_utils/prompts/planner_prompt.py new file mode 100644 index 0000000000..ffb1d9b8e6 --- /dev/null +++ b/examples/agentic/llm_utils/prompts/planner_prompt.py @@ -0,0 +1,69 @@ +planner_prompt_str1 = """ +Create a data pipeline to accomplish a given task: {task}. Please create a detailed execution plan using only the tools listed below while adhering to the specified constraints. + +You have access to only following tools: +==== +{tools} +==== + +The plan should satisfy these constraints: +===== +{context} +===== + + +Here's an example of the kind of detailed pipeline plan I'm looking for: +===== +{example_task} +===== + + +The previously generated plan: +===== +{previous_plan} +===== + +The Review of the previously generated plan: +===== +{feedback} +===== + +update the plan based on the review. +** Each step MUST be as a separate line that includes only a json dictionary without addition description and without indices. +** Show ONLY the one final plan without any additional text or thoughts. Don't include previous plans in the output. +** Make sure the the json are correct and can be parsed without errors. + +The final plan: +""" + +planner_prompt_str = """ +You are an expert in planning data access. For the given context, create a data pipeline to accomplish a given task: {task} +This plan should involve individual tasks that, if executed correctly, will yield the correct answer. +Do not add any superfluous steps. The result of the final step should be the final answer. +Make sure that each step has all the information needed - do not skip steps +For each plan, indicate \ +which external tool together with the tool input to retrieve evidence. You can store the evidence into a \ +variable #E that can be called by later tools. (Plan, #E1, Plan, #E2, Plan, ...). +Your task is to generate a plan using ONLY the provided tools. Do not use any other tools or methods. +If you cannot find a suitable tool, use 'tool_not_implemented' +Each step should use only one tool. +Ensure to specify all required input parameters of the tools. +Ensure that the input parameter of the tools are not hardcoded. +Each step should be a separate line that includes only a JSON dictionary without any additional descriptions. +Do not provide ANY plan explanations or reviews + +You have access to only the following JSON list of tools: {tools} + +Context is {context} +~~~~~~ +Here's an example of the kind of detailed pipeline plan I'm looking for: +{example_task} +~~~~~~ +The previously generated plan: {previous_plan} +~~~~~~ +The Review of the previously generated plan: {feedback} +""" + + +# In a separate line, specify user-provided input parameters that will be provided by a user before the execution of the plan. +# Keep the answer concise; do not print the #E variables separately. diff --git a/examples/agentic/llm_utils/visualize_plan.py b/examples/agentic/llm_utils/visualize_plan.py new file mode 100644 index 0000000000..c0c5dcde30 --- /dev/null +++ b/examples/agentic/llm_utils/visualize_plan.py @@ -0,0 +1,98 @@ +import json +from IPython.display import display, Markdown + +def parse_plan_string(plan_string): + """ + Parses a string where each line is a JSON dictionary. 
diff --git a/examples/agentic/llm_utils/visualize_plan.py b/examples/agentic/llm_utils/visualize_plan.py new file mode 100644 index 0000000000..c0c5dcde30 --- /dev/null +++ b/examples/agentic/llm_utils/visualize_plan.py @@ -0,0 +1,98 @@ +import json +from IPython.display import display, Markdown + +def parse_plan_string(plan_string): + """ + Parses a string where each line is a JSON dictionary. + + Args: + plan_string (str): String containing one JSON dictionary per line + + Returns: + list: List of parsed dictionaries + """ + # Split into lines and filter out empty lines + lines = [line.strip() for line in plan_string.split('\n') if line.strip()] + # Parse each line as JSON, skipping lines that are not valid JSON + plan_steps = [] + for line in lines: + try: + step = json.loads(line) + plan_steps.append(step) + except json.JSONDecodeError: + continue + return plan_steps + + +def extract_plan(msg_content): + """ + Extracts JSON dictionaries from numbered lines like: + 1. {"name": "tool1", ...} + 2. {"name": "tool2", ...} + """ + plan_lines = [] + + for line in msg_content.split('\n'): + line = line.strip() + # Skip empty lines + if not line: + continue + + # Remove numbering prefix if it exists (e.g., "1. ", "2. ", etc.) + if line[0].isdigit(): + # Find the position after the number and dot + pos = line.find('. ') + if pos != -1: + line = line[pos + 2:] + + # Parse the line if it looks like a JSON dictionary; skip it otherwise + try: + if line.startswith('{') and line.endswith('}'): + step = json.loads(line) + # A step whose name contains " #1" marks the start of the + # latest plan, so discard any steps collected before it + if " #1" in step.get("step_name", ""): + plan_lines = [step] + else: + plan_lines.append(step) + except json.JSONDecodeError: + continue + return plan_lines + +def visualize_plan(plan_input): + """ + Renders a plan as a Mermaid flowchart in the notebook. + + Args: + plan_input (str or list): Either a string with one JSON dictionary per line, + or a list of dictionaries + """ + # Parse the plan if it's a string + if isinstance(plan_input, str): + plan_steps = extract_plan(plan_input) + else: + plan_steps = plan_input + + graph_str = ["graph LR;"] + + # Add nodes and edges + for i, step in enumerate(plan_steps): + # Label each node with the tool name of the step + node_id = f"step{i}" + label = f"{step['tool_name']}" + + graph_str.append(f' {node_id}["{label}"]') + + # Add an edge to the next step if this is not the last step + if i < len(plan_steps) - 1: + next_node = f"step{i+1}" + graph_str.append(f' {node_id} --> {next_node}') + + # Join all lines with newlines + diagram = "\n".join(graph_str) + + # Display the diagram in the notebook + display(Markdown(f"```mermaid\n{diagram}\n```")) diff --git a/examples/agentic/requirements.txt b/examples/agentic/requirements.txt new file mode 100644 index 0000000000..6babb920aa --- /dev/null +++ b/examples/agentic/requirements.txt @@ -0,0 +1,12 @@ +langchain==0.3.1 +langchain_community==0.3.1 +ibm-generative-ai==3.0.0 +langchain_ollama==0.2.0 +langchain-ibm==0.2.1 +langchain-openai==0.2.14 +langgraph==0.2.35 +ibm_watsonx_ai==1.1.11 +python-dotenv +pypdf +replicate +black==24.10.0 diff --git a/examples/notebooks/intro/input/solar-system/earth.pdf b/examples/agentic/test-data/input/earth.pdf similarity index 100% rename from examples/notebooks/intro/input/solar-system/earth.pdf rename to examples/agentic/test-data/input/earth.pdf diff --git a/examples/agentic/test-data/ldnoobw/en b/examples/agentic/test-data/ldnoobw/en new file mode 100644 index 0000000000..1a9899b4ec --- /dev/null +++ b/examples/agentic/test-data/ldnoobw/en @@ -0,0 +1,403 @@ +2g1c +2 girls 1 cup +acrotomophilia +alabama hot pocket +alaskan pipeline +anal +anilingus +anus +apeshit +arsehole +ass +asshole +assmunch +auto erotic +autoerotic +babeland +baby batter +baby juice +ball gag +ball gravy +ball kicking +ball licking +ball sack +ball sucking +bangbros +bangbus +bareback +barely legal
+barenaked +bastard +bastardo +bastinado +bbw +bdsm +beaner +beaners +beaver cleaver +beaver lips +beastiality +bestiality +big black +big breasts +big knockers +big tits +bimbos +birdlock +bitch +bitches +black cock +blonde action +blonde on blonde action +blowjob +blow job +blow your load +blue waffle +blumpkin +bollocks +bondage +boner +boob +boobs +booty call +brown showers +brunette action +bukkake +bulldyke +bullet vibe +bullshit +bung hole +bunghole +busty +butt +buttcheeks +butthole +camel toe +camgirl +camslut +camwhore +carpet muncher +carpetmuncher +chocolate rosebuds +cialis +circlejerk +cleveland steamer +clit +clitoris +clover clamps +clusterfuck +cock +cocks +coprolagnia +coprophilia +cornhole +coon +coons +creampie +cum +cumming +cumshot +cumshots +cunnilingus +cunt +darkie +date rape +daterape +deep throat +deepthroat +dendrophilia +dick +dildo +dingleberry +dingleberries +dirty pillows +dirty sanchez +doggie style +doggiestyle +doggy style +doggystyle +dog style +dolcett +domination +dominatrix +dommes +donkey punch +double dong +double penetration +dp action +dry hump +dvda +eat my ass +ecchi +ejaculation +erotic +erotism +escort +eunuch +fag +faggot +fecal +felch +fellatio +feltch +female squirting +femdom +figging +fingerbang +fingering +fisting +foot fetish +footjob +frotting +fuck +fuck buttons +fuckin +fucking +fucktards +fudge packer +fudgepacker +futanari +gangbang +gang bang +gay sex +genitals +giant cock +girl on +girl on top +girls gone wild +goatcx +goatse +god damn +gokkun +golden shower +goodpoop +goo girl +goregasm +grope +group sex +g-spot +guro +hand job +handjob +hard core +hardcore +hentai +homoerotic +honkey +hooker +horny +hot carl +hot chick +how to kill +how to murder +huge fat +humping +incest +intercourse +jack off +jail bait +jailbait +jelly donut +jerk off +jigaboo +jiggaboo +jiggerboo +jizz +juggs +kike +kinbaku +kinkster +kinky +knobbing +leather restraint +leather straight jacket +lemon party +livesex +lolita +lovemaking +make me come +male squirting +masturbate +masturbating +masturbation +menage a trois +milf +missionary position +mong +motherfucker +mound of venus +mr hands +muff diver +muffdiving +nambla +nawashi +negro +neonazi +nigga +nigger +nig nog +nimphomania +nipple +nipples +nsfw +nsfw images +nude +nudity +nutten +nympho +nymphomania +octopussy +omorashi +one cup two girls +one guy one jar +orgasm +orgy +paedophile +paki +panties +panty +pedobear +pedophile +pegging +penis +phone sex +piece of shit +pikey +pissing +piss pig +pisspig +playboy +pleasure chest +pole smoker +ponyplay +poof +poon +poontang +punany +poop chute +poopchute +porn +porno +pornography +prince albert piercing +pthc +pubes +pussy +queaf +queef +quim +raghead +raging boner +rape +raping +rapist +rectum +reverse cowgirl +rimjob +rimming +rosy palm +rosy palm and her 5 sisters +rusty trombone +sadism +santorum +scat +schlong +scissoring +semen +sex +sexcam +sexo +sexy +sexual +sexually +sexuality +shaved beaver +shaved pussy +shemale +shibari +shit +shitblimp +shitty +shota +shrimping +skeet +slanteye +slut +s&m +smut +snatch +snowballing +sodomize +sodomy +spastic +spic +splooge +splooge moose +spooge +spread legs +spunk +strap on +strapon +strappado +strip club +style doggy +suck +sucks +suicide girls +sultry women +swastika +swinger +tainted love +taste my +tea bagging +threesome +throating +thumbzilla +tied up +tight white +tit +tits +titties +titty +tongue in a +topless +tosser +towelhead +tranny +tribadism +tub girl +tubgirl +tushy +twat +twink +twinkie 
+two girls one cup +undressing +upskirt +urethra play +urophilia +vagina +venus mound +viagra +vibrator +violet wand +vorarephilia +voyeur +voyeurweb +voyuer +vulva +wank +wetback +wet dream +white power +whore +worldsex +wrapping men +wrinkled starfish +xx +xxx +yaoi +yellow showers +yiffy +zoophilia +🖕 \ No newline at end of file diff --git a/examples/data-files/pdf-processing-1/README.md b/examples/data-files/pdf-processing-1/README.md new file mode 100644 index 0000000000..e81e80ee8b --- /dev/null +++ b/examples/data-files/pdf-processing-1/README.md @@ -0,0 +1,11 @@ +## Creating Input PDFs (Optional) + +The sample PDFs we use for this example are created from markdown documents using the pandoc utility, as follows. + +```bash +pandoc earth.md -o earth.pdf +pandoc earth2.md -o earth2.pdf +pandoc mars.md -o mars.pdf +pandoc spam.md -o spam.pdf +pandoc lorem-ipsum.md -o lorem-ipsum.pdf +``` \ No newline at end of file diff --git a/examples/data-files/pdf-processing-1/earth-copy.pdf b/examples/data-files/pdf-processing-1/earth-copy.pdf new file mode 100644 index 0000000000..9a775a9984 Binary files /dev/null and b/examples/data-files/pdf-processing-1/earth-copy.pdf differ diff --git a/examples/notebooks/intro/input/solar-system/earth.md b/examples/data-files/pdf-processing-1/earth.md similarity index 100% rename from examples/notebooks/intro/input/solar-system/earth.md rename to examples/data-files/pdf-processing-1/earth.md diff --git a/examples/data-files/pdf-processing-1/earth.pdf b/examples/data-files/pdf-processing-1/earth.pdf new file mode 100644 index 0000000000..9a775a9984 Binary files /dev/null and b/examples/data-files/pdf-processing-1/earth.pdf differ diff --git a/examples/data-files/pdf-processing-1/earth2.md b/examples/data-files/pdf-processing-1/earth2.md new file mode 100644 index 0000000000..04f4eb6c3e --- /dev/null +++ b/examples/data-files/pdf-processing-1/earth2.md @@ -0,0 +1,18 @@ +# Earth + + +## Solar System + +Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun. + +For more details about the Solar system see Chapter 1. + +## Earth + +Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life. + +Basic facts about Earth: + +- Distance from the Sun: Average of 149.6 million kilometers (93 million miles) +- Rotation Period: 24 hours (one day) +- Moons: One moon, called Luna or simply "the Moon".
\ No newline at end of file diff --git a/examples/data-files/pdf-processing-1/earth2.pdf b/examples/data-files/pdf-processing-1/earth2.pdf new file mode 100644 index 0000000000..5c024886af Binary files /dev/null and b/examples/data-files/pdf-processing-1/earth2.pdf differ diff --git a/examples/data-files/pdf-processing-1/lorem-ipsum.md b/examples/data-files/pdf-processing-1/lorem-ipsum.md new file mode 100644 index 0000000000..35723ccaa1 --- /dev/null +++ b/examples/data-files/pdf-processing-1/lorem-ipsum.md @@ -0,0 +1,3 @@ +Lorem ipsum +Lorem ipsum +Lorem ipsum \ No newline at end of file diff --git a/examples/data-files/pdf-processing-1/lorem-ipsum.pdf b/examples/data-files/pdf-processing-1/lorem-ipsum.pdf new file mode 100644 index 0000000000..b2807a44d1 Binary files /dev/null and b/examples/data-files/pdf-processing-1/lorem-ipsum.pdf differ diff --git a/examples/notebooks/intro/input/solar-system/mars.md b/examples/data-files/pdf-processing-1/mars.md similarity index 100% rename from examples/notebooks/intro/input/solar-system/mars.md rename to examples/data-files/pdf-processing-1/mars.md diff --git a/examples/notebooks/intro/input/solar-system/mars.pdf b/examples/data-files/pdf-processing-1/mars.pdf similarity index 99% rename from examples/notebooks/intro/input/solar-system/mars.pdf rename to examples/data-files/pdf-processing-1/mars.pdf index a48c4365b3..5e464d8708 100644 Binary files a/examples/notebooks/intro/input/solar-system/mars.pdf and b/examples/data-files/pdf-processing-1/mars.pdf differ diff --git a/examples/data-files/pdf-processing-1/spam.md b/examples/data-files/pdf-processing-1/spam.md new file mode 100644 index 0000000000..e5526cbad2 --- /dev/null +++ b/examples/data-files/pdf-processing-1/spam.md @@ -0,0 +1 @@ +Free xxx \ No newline at end of file diff --git a/examples/data-files/pdf-processing-1/spam.pdf b/examples/data-files/pdf-processing-1/spam.pdf new file mode 100644 index 0000000000..43999b8ac3 Binary files /dev/null and b/examples/data-files/pdf-processing-1/spam.pdf differ diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md deleted file mode 100644 index 77a80865b7..0000000000 --- a/examples/notebooks/intro/README.md +++ /dev/null @@ -1,36 +0,0 @@ -# Data Prep Kit Introduction - -This is an example featuring some of the features of data prep kit. - -## Running the code - -The code can be run on either - -1. Google colab: very easy to run; no local setup needed. -2. On your local Python environment. Here is a quick guide. You can find instructions for latest version [here](../../../README.md#-getting-started) - -```bash -conda create -n data-prep-kit -y python=3.11 -conda activate data-prep-kit - -# install the following in 'data-prep-kit' environment -pip3 install data-prep-toolkit==0.2.1 -pip3 install data-prep-toolkit-transforms==0.2.1 -pip3 install data-prep-toolkit-transforms-ray==0.2.1 -pip3 install jupyterlab ipykernel ipywidgets - -## install custom kernel -## Important: Use this kernel when running example notebooks! 
-python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit" - -# start jupyter and run the notebooks with this jupyter -jupyter lab -``` - -## Intro - -This notebook will demonstrate processing PDFs - -`PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings` - -[python version](dpk_intro_1_python.ipynb)   |   [ray version](dpk_intro_1_ray.ipynb) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb deleted file mode 100644 index ab7cda8548..0000000000 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ /dev/null @@ -1,3667 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", - "metadata": { - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" - }, - "source": [ - "# Data Prep Kit Demo 1 - Python version\n", - "\n", - "This notebook will introduce DPK and showcase some of it's capabilities.\n", - "\n", - "Here is the workflow\n", - "\n", - "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": { - "id": "b15976e3" - }, - "source": [ - "## How to run this notebook\n", - "\n", - "Two options:\n", - "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", - "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", - "\n", - "The notebook will work as in both environments" - ] - }, - { - "cell_type": "markdown", - "id": "eb8b0d5c", - "metadata": { - "id": "eb8b0d5c" - }, - "source": [ - "## Step-1: Inspect the Data\n", - "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/IBM/data-prep-kit/tree/dev/examples/notebooks/intro/input/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/input/solar-system/mars.pdf)\n" - ] - }, - { - "cell_type": "markdown", - "id": "39a0ab6e", - "metadata": { - "id": "39a0ab6e" - }, - "source": [ - "## Step-2: Figure out Runtime Environment\n", - "\n", - "### 2.1 - Determine runtime\n", - "\n", - "Determine if we are running on Google colab or local python environment" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "1fe354b7", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "markdown", - "id": "8e7c104b", - "metadata": { - "id": "8e7c104b" - }, - "source": [ - "### 2.2 -Download Data if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "3309799e", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input/solar-system'\n", - " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/my_utils.py'" - ] - }, - { - "cell_type": "markdown", - "id": "a5dc2b68", - "metadata": { - "id": "a5dc2b68" - }, - "source": [ - "### 2.3 - Install dependencies if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "1fcec577", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " ! 
pip install --default-timeout=100 \\\n", - " data-prep-toolkit==0.2.1 \\\n", - " data-prep-toolkit-transforms==0.2.1 \\\n", - " deepsearch-toolkit\n" - ] - }, - { - "cell_type": "markdown", - "id": "243322b8", - "metadata": { - "id": "243322b8" - }, - "source": [ - "### 2.4 - Restart Runtime\n", - "\n", - "After installing dependencies, be sure restart runtime, so libraries will be loaded\n", - "\n", - "You do this by going to **`Runtime --> Restart Session`**\n", - "\n", - "Then you can continue to the next step (no need to re-run the notebook)" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": { - "id": "e8b10be1" - }, - "source": [ - "## Step-2: Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "356c66f7", - "metadata": { - "id": "356c66f7" - }, - "source": [ - "### 2.1 - Basic Config" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e4YMZrBuFycl", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e4YMZrBuFycl", - "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "33345487", - "metadata": { - "id": "33345487" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "## Configuration\n", - "class MyConfig:\n", - " pass\n", - "\n", - "MY_CONFIG = MyConfig ()\n", - "\n", - "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", - "\n", - "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", - "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", - "\n", - "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "b15e6827", - "metadata": { - "id": "b15e6827" - }, - "outputs": [], - "source": [ - "## Add parent dir to path\n", - "import os,sys\n", - "\n", - "this_dir = os.path.abspath('')\n", - "parent_dir = os.path.dirname(this_dir)\n", - "sys.path.append (os.path.abspath (parent_dir))" - ] - }, - { - "cell_type": "markdown", - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", - "metadata": { - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" - }, - "source": [ - "### 2.2 - Setup input/outpur directories" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Cleared output directory\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "\n", - "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", - " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", - "\n", - "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", - "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", - "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", - "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, 
'04_exact_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", - "\n", - "## clear output folder\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", - "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", - "\n", - "print (\"✅ Cleared output directory\")" - ] - }, - { - "cell_type": "markdown", - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", - "metadata": { - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" - }, - "source": [ - "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", - "\n", - "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", - "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", - "metadata": { - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" - }, - "source": [ - "### 3.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" - ] - } - ], - "source": [ - "STAGE = 1\n", - "\n", - "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", - "output_folder = output_parquet_dir\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", - "metadata": { - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" - }, - "source": [ - "### 3.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 657, - "referenced_widgets": [ - "97b603697cfa4b4ea4e6735b6768ca35", - "e87e8d3262c54cfaaa8768505edacda3", - "b78aa40816e44f7fbebcb24ca68818b3", - "7053c9606a414e978636a7e241909504", - "da0787b239764847a731083997780a85", - "553f3c16839a49d79591d0fc4862bed6", - "c0eb5bc8f6ee427ca42204b3c56f9a4e", - "9d184ed175f0403fb03c2e13dfd04e0a", - "724778729161445c98b187031ae4f67c", - "1cb3bbf7d724411cbe9831543a4aecc0", - "06f9b33494984e4885d5aad813d1d2bc" - ] - }, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:34:39 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "13:34:39 INFO - pipeline id pipeline_id\n", - "13:34:39 INFO - code location None\n", - "13:34:39 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "13:34:39 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:34:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "13:34:39 INFO - orchestrator pdf2parquet started at 2024-10-18 13:34:39\n", - "13:34:39 INFO - Number of files is 2, 
source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "13:34:39 INFO - Initializing models\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "750f3b6951094b2eb68490c7f5f98148", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Fetching 10 files: 0%| | 0/10 [00:00\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...10116e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011efbdbcb9-f0af-42f0-b191-2f14ce3ddc7cpdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdf
\n", - "" - ], - "text/plain": [ - " filename contents num_pages \\\n", - "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "\n", - " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 pdf \n", - "1 0 11 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "1 2024-10-18T13:34:43.410297 0.794765 earth.pdf " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "e5058a21", - "metadata": { - "id": "e5058a21" - }, - "source": [ - "\n", - "### 3.4 - Understand the output\n", - "\n", - "Here are some interesting attributes to note:\n", - "\n", - "- **filename** : original filename\n", - "- **contents** : text\n", - "- **document_id**: unique id (UUID) assignd to this document\n", - "- **hash** : hash of document\n", - "- **pdf_convert_time** : time to convert this pdf in seconds\n", - "\n", - "Let's inspect the **contents** column. See how the text is being divided up!" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "f870e624", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "f870e624", - "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", - " 'filename': 'mars.pdf',\n", - " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.35137939,\n", - " 654.45184326,\n", - " 169.88169861,\n", - " 667.98492432],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.09541321,\n", - " 630.68127441,\n", - " 210.66503906,\n", - " 642.34405518],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.84518433,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.02520752],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. 
At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.18510437,\n", - " 570.83258057,\n", - " 374.99838257,\n", - " 581.07043457],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about the Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.22866821,\n", - " 542.98168945,\n", - " 163.86282349,\n", - " 554.45288086],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87440491,\n", - " 500.84011841,\n", - " 477.48345947,\n", - " 534.55810547],\n", - " 'page': 1,\n", - " 'span': [0, 196]}],\n", - " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", - " 'desert world with a thin atmosphere composed '\n", - " 'primarily of carbon dioxide. Its reddish hue comes '\n", - " 'from iron oxide, or rust, prevalent on its surface.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.2026062,\n", - " 482.90710449,\n", - " 237.04431152,\n", - " 493.07443237],\n", - " 'page': 1,\n", - " 'span': [0, 23]}],\n", - " 'text': 'Basic facts about Mars:',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 453.019104,\n", - " 477.48171997,\n", - " 474.9703064],\n", - " 'page': 1,\n", - " 'span': [0, 78]}],\n", - " 'text': '· Distance from the Sun: Average of 228 million '\n", - " 'kilometers (142 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.79351807,\n", - " 431.73287964,\n", - " 451.2142334],\n", - " 'page': 1,\n", - " 'span': [0, 64]}],\n", - " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", - " 'called a \"sol\")',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 429.10913086,\n", - " 365.9559021,\n", - " 438.83737183],\n", - " 'page': 1,\n", - " 'span': [0, 44]}],\n", - " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.51646423],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "import pprint\n", - "import json\n", - "\n", - "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", - "# json.loads(output_df.iloc[0, ]['contents'])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "e1a10c2d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e1a10c2d", - "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", - " 'filename': 'earth.pdf',\n", - " 'page-hashes': [{'hash': 
'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.30961609,\n", - " 654.45184326,\n", - " 174.04208374,\n", - " 667.93347168],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.12528992,\n", - " 630.69073486,\n", - " 210.66503906,\n", - " 642.27935791],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87112427,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.04595947],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.20942688,\n", - " 570.81555176,\n", - " 375.57919312,\n", - " 581.08459473],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about our Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.15542603,\n", - " 542.98168945,\n", - " 167.32983398,\n", - " 554.36669922],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.91053772,\n", - " 512.46295166,\n", - " 477.84887695,\n", - " 534.48431396],\n", - " 'page': 1,\n", - " 'span': [0, 107]}],\n", - " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", - " 'planet. 
Earth is the only place we know of with life.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.30151367,\n", - " 494.86206055,\n", - " 240.17156982,\n", - " 505.07229614],\n", - " 'page': 1,\n", - " 'span': [0, 24]}],\n", - " 'text': 'Basic facts about Earth:',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 464.97409058,\n", - " 477.47979736,\n", - " 487.02810669],\n", - " 'page': 1,\n", - " 'span': [0, 79]}],\n", - " 'text': '· Distance from the Sun: Average of 149.6 million '\n", - " 'kilometers (93 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 452.86901855,\n", - " 317.90722656,\n", - " 463.24041748],\n", - " 'page': 1,\n", - " 'span': [0, 37]}],\n", - " 'text': '· Rotation Period: 24 hours (one day)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.71496582,\n", - " 396.66357422,\n", - " 451.19915771],\n", - " 'page': 1,\n", - " 'span': [0, 52]}],\n", - " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.53633118],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": { - "id": "72274586" - }, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "In the previous step, we have extracted text from oru PDFs. But we have the content of entire file as 'one row' in our parquet output.\n", - "\n", - "In this step, we are going to split the documents in chunks, according to their layout segmentation.\n", - "\n", - "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", - "to chunk according to the document layout segmentation, i.e. respecting the original document components as paragraphs, tables, enumerations, etc.\n", - "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", - "which provides the required JSON structure." 
- ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": { - "id": "96198fa6" - }, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "305f00a3", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "305f00a3", - "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": { - "id": "369f2cd1" - }, - "source": [ - "### 4.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "5b7b18d5", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5b7b18d5", - "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:34:45 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "13:34:45 INFO - pipeline id pipeline_id\n", - "13:34:45 INFO - code location None\n", - "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:34:45 INFO - orchestrator doc_chunk started at 2024-10-18 13:34:45\n", - "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", - "13:34:45 INFO - done flushing in 0.0 sec\n", - "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:2 completed successfully\n", - "CPU times: user 826 ms, sys: 101 ms, total: 928 ms\n", - "Wall time: 923 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # doc_chunk arguments\n", - " # ...\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "213afdf6", - "metadata": { - "id": "213afdf6" - }, - "source": [ - "### 4.3 - Inspect Generated output\n", - "\n", - "We would see documents are split into many chunks" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d8138d43", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 897 - }, - "id": "d8138d43", - "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Files processed : 2\n", - "Chunks created : 8\n", - "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 16)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "9e9ca75c", - "metadata": { - "id": "9e9ca75c" - }, - "source": [ - "### 4.4 - Understanding the Output\n", - "\n", - "Here we see 2 PDF files are split into 6 chunks. Basically we see the documents are being split along 'natural boundaris' - paragraphs and bullet points\n", - "\n", - "See how **document_id** is carried throughout. This helps us identify original documents.\n", - "\n", - "Also note **contents** is now plain text (not JSON as before)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "3090c950", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - }, - "id": "3090c950", - "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "5 earth.pdf Solar System\\nFor more details about our Solar...\n", - "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "d5f151ae", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "d5f151ae", - "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 3------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "7ad1c60d", - "metadata": { - "id": "7ad1c60d" - }, - "source": [ - "## Step-5: DOC ID generation of Chunks\n", - "\n", - "This transform annotates documents with document \"ids\". 
It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.\n", - "\n", - "**This is a pre-requisite for fuzzy dedup** in the pipeline." - ] - }, - { - "cell_type": "markdown", - "id": "1afaa0fd", - "metadata": { - "id": "1afaa0fd" - }, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "6ffd6f54", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6ffd6f54", - "outputId": "1784c80d-6309-4913-9f55-c018b978968f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "# Input for this stage is the output of exact dedeup component\n", - "# output of this component makes it possible for fdedup component to run on data.\n", - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "f78a51b7", - "metadata": { - "id": "f78a51b7" - }, - "source": [ - "### 5.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "5fc77557", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5fc77557", - "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:34:45 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "13:34:45 INFO - pipeline id pipeline_id\n", - "13:34:45 INFO - code location None\n", - "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:34:45 INFO - orchestrator doc_id started at 2024-10-18 13:34:45\n", - "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", - "13:34:45 INFO - done flushing in 0.0 sec\n", - "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 12.8 ms, 
- "Wall time: 13.1 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " # doc id configuration\n", - " \"doc_id_doc_column\": \"contents\",\n", - " \"doc_id_hash_column\": \"chunk_hash\",\n", - " \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "a9a8c1fa", - "metadata": { - "id": "a9a8c1fa" - }, - "source": [ - "### 5.3 - Inspect Generated output\n", - "\n", - "You will notice two extra columns, named by the **hash_column** and **int_id_column** settings above:\n", - "\n", - "- **chunk_hash**\n", - "- **chunk_id**\n", - "\n", - "but still the same number of rows as before." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "da9adede", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 860 - }, - "id": "da9adede", - "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 16)\n", - "Output data dimensions (rows x columns)= (8, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
\n", - "\n", - " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", - "metadata": { - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" - }, - "source": [ - "## Step-6: Exact Dedup\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": { - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" - }, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "4c7a1b94", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": { - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" - }, - "source": [ - "### 6.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:34:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "13:34:45 INFO - pipeline id pipeline_id\n", - "13:34:45 INFO - code location None\n", - "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:34:45 INFO - orchestrator ededup started at 2024-10-18 13:34:45\n", - "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 
'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "13:34:45 INFO - Starting from the beginning\n", - "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", - "13:34:45 INFO - done flushing in 0.0 sec\n", - "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 17.6 ms, sys: 997 μs, total: 18.6 ms\n", - "Wall time: 15.2 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # ededup parameters\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": { - "id": "eaf1c3c3" - }, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "d824ebf6", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 815 - }, - "id": "d824ebf6", - "outputId": "68f55770-c750-4607-a205-ba183603019d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 18)\n", - "Output data dimensions (rows x columns)= (7, 19)\n", - "Input chunks before exact dedupe : 8\n", - "Output chunks after exact dedupe : 7\n", - "Duplicate chunks removed : 1\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[]
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[]
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[]
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[]
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[]
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "82cc9bb0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 269 - }, - "id": "82cc9bb0", - "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\n· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nFor more details about the Solar...\n", - "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "4 earth.pdf Solar System\\nFor more details about our Solar...\n", - "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "cc61dffa", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 1------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 2------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "383f40ba", - "metadata": { - "id": "383f40ba" - }, - "source": [ - "### 6.4 - Understanding the output\n", - "\n", - "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk is removed.\n", - "\n", - "If you look at the PDF, the following common paragraph in `earth.pdf` and `mars.pdf` is removed from one of the documents! Pretty neat, eh!\n", - "\n", - "```text\n", - "## Solar System\n", - "\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "85309751-8556-41c6-ac32-84acc941bc8d", - "metadata": { - "id": "85309751-8556-41c6-ac32-84acc941bc8d" - }, - "source": [ - "## Step-7: Fuzzy Dedup\n", - "\n", - "Fuzzy dedup is currently only available in the Ray version, so we will skip it here.\n", - "\n", - "See this file [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": { - "id": "5370950a-2a3a-4143-8218-f9b4808099ba" - }, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode the text into embedding vectors for vector storage." - ] - }, - { - "cell_type": "markdown", - "id": "85aba685", - "metadata": { - "id": "85aba685" - }, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" - ] - } - ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "c97545f4", - "metadata": { - "id": "c97545f4" - }, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:34:45 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "13:34:45 INFO - pipeline id pipeline_id\n", - "13:34:45 INFO - code location None\n", - "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", - "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:34:45 INFO - orchestrator text_encoder started at 2024-10-18 13:34:45\n", - "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", - "13:34:47 INFO - Completed 1 files (50.0%) in 0.004 min\n", - "13:34:47 INFO - Completed 2 files (100.0%) in 0.005 min\n", - "13:34:47 INFO - Done processing 2 files, waiting for flush() completion.\n", - "13:34:47 INFO - done flushing in 0.0 sec\n", - "13:34:47 INFO - Completed execution in 0.034 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:6 completed successfully\n",
- "CPU times: user 615 ms, sys: 146 ms, total: 761 ms\n", - "Wall time: 2.24 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": { - "id": "b734852c" - }, - "source": [ - "### 8.3 - Inspect Generated output\n", - "\n", - "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "7b1c1d09", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 760 - }, - "id": "7b1c1d09", - "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (7, 19)\n", - "Output data dimensions (rows x columns)= (7, 20)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremovedembeddings
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...[-0.051861435, 0.0035226212, 0.030617002, 0.04...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[][0.07728295, 0.024970993, -0.043180738, 0.0580...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[][0.10598018, 0.025460618, 0.023627337, 0.03905...
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[][0.0077404436, -0.02055944, 0.026426593, 0.011...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[][-0.062105548, -0.0053322907, 0.031277698, 0.0...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[][0.072435796, -0.058001805, -0.019771898, -0.0...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[][0.091821924, 0.015197902, 0.07716932, 0.01711...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", - "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", - "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \\\n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] \n", - "\n", - " embeddings \n", - "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", - "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", - "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", - "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", - "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", - "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", - "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" - ] - } - ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "dpk-2-basic-021-py311", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.10" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "06f9b33494984e4885d5aad813d1d2bc": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "1cb3bbf7d724411cbe9831543a4aecc0": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - 
"display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "553f3c16839a49d79591d0fc4862bed6": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7053c9606a414e978636a7e241909504": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", - "placeholder": "​", - "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", - "value": " 10/10 [00:00<00:00, 349.38it/s]" - } - }, - "724778729161445c98b187031ae4f67c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "97b603697cfa4b4ea4e6735b6768ca35": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": 
"HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", - "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", - "IPY_MODEL_7053c9606a414e978636a7e241909504" - ], - "layout": "IPY_MODEL_da0787b239764847a731083997780a85" - } - }, - "9d184ed175f0403fb03c2e13dfd04e0a": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b78aa40816e44f7fbebcb24ca68818b3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", - "max": 10, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", - "value": 10 - } - }, - "c0eb5bc8f6ee427ca42204b3c56f9a4e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "da0787b239764847a731083997780a85": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - 
"justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e87e8d3262c54cfaaa8768505edacda3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", - "placeholder": "​", - "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", - "value": "Fetching 10 files: 100%" - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb deleted file mode 100644 index b2feb9135a..0000000000 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ /dev/null @@ -1,4359 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", - "metadata": { - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" - }, - "source": [ - "# Data Prep Kit Demo 1 - Ray Version\n", - "\n", - "This notebook will introduce DPK and showcase some of it's capabilities.\n", - "\n", - "Here is the workflow\n", - "\n", - "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": { - "id": "b15976e3" - }, - "source": [ - "## How to run this notebook\n", - "\n", - "Two options:\n", - "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", - "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", - "\n", - "The notebook will work as in both environments" - ] - }, - { - "cell_type": "markdown", - "id": "eb8b0d5c", - "metadata": { - "id": "eb8b0d5c" - }, - "source": [ - "## Step-1: Inspect the Data\n", - "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/IBM/data-prep-kit/tree/dev/examples/notebooks/intro/input/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/input/solar-system/mars.pdf)\n" - ] - }, - { - "cell_type": "markdown", - "id": "39a0ab6e", - "metadata": { - "id": "39a0ab6e" - }, - "source": [ - "## Step-2: Figure out Runtime Environment\n", - "\n", - "### 2.1 - Determine runtime\n", - "\n", - "Determine if we are running on Google Colab or a local Python environment" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "1fe354b7", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "6665c654-baa5-46dc-d370-9931e0e9eed3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - "    print(\"Running in Colab\")\n", - "    RUNNING_IN_COLAB = True\n", - "else:\n", - "    print(\"NOT in Colab\")\n", - "    RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "markdown", - "id": "8e7c104b", - "metadata": { - "id": "8e7c104b" - }, - "source": [ - "### 2.2 - Download Data if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "3309799e", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "00d7362e-d675-4aaf-8c87-d99027d9a06c" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - "    !mkdir -p 'input/solar-system'\n", - "    !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - "    !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - "    !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/my_utils.py'" - ] - }, - { - "cell_type": "markdown", - "id": "a5dc2b68", - "metadata": { - "id": "a5dc2b68" - }, - "source": [ - "### 2.3 - Install dependencies if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "1fcec577", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "48cf233b-f04e-4b9b-9605-423f87693f10" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - "    ! 
pip install --default-timeout=100 \\\n", - "        data-prep-toolkit==0.2.1 \\\n", - "        data-prep-toolkit-transforms==0.2.1 \\\n", - "        data-prep-toolkit-transforms-ray==0.2.1 \\\n", - "        deepsearch-toolkit" - ] - }, - { - "cell_type": "markdown", - "id": "243322b8", - "metadata": { - "id": "243322b8" - }, - "source": [ - "### 2.4 - Restart Runtime\n", - "\n", - "After installing dependencies, be sure to restart the runtime, so the libraries will be loaded\n", - "\n", - "You do this by going to **`Runtime --> Restart Session`**\n", - "\n", - "Then you can continue to the next step (no need to re-run the notebook)" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": { - "id": "e8b10be1" - }, - "source": [ - "## Step-2: Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "356c66f7", - "metadata": { - "id": "356c66f7" - }, - "source": [ - "### 2.1 - Basic Config" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e4YMZrBuFycl", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e4YMZrBuFycl", - "outputId": "1a1d5f01-0856-40b6-8b1c-8187b0c38d64" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - "    print(\"Running in Colab\")\n", - "    RUNNING_IN_COLAB = True\n", - "else:\n", - "    print(\"NOT in Colab\")\n", - "    RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "33345487", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "33345487", - "outputId": "f3e71a25-4864-4f8f-dfce-4af3d7e08a8a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", - "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", - "MY_CONFIG.RAY_MEMORY_GB: 2\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "## Configuration\n", - "class MyConfig:\n", - "    pass\n", - "\n", - "MY_CONFIG = MyConfig ()\n", - "\n", - "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", - "\n", - "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", - "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", - "\n", - "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", - "\n", - "## RAY CONFIGURATION\n", - "### For local runs, we can use more parallelism\n", - "### For Google Colab, be conservative\n", - "\n", - "if RUNNING_IN_COLAB:\n", - "    MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - "    MY_CONFIG.RAY_NUM_CPUS = 0.3\n", - "    MY_CONFIG.RAY_MEMORY_GB = 2  # GB\n", - "else:  # local run\n", - "    num_cpus_available = os.cpu_count()\n", - "    # print (num_cpus_available)\n", - "\n", - "    MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - "    MY_CONFIG.RAY_NUM_CPUS = 0.8\n", - "    MY_CONFIG.RAY_MEMORY_GB = 2  # GB\n", - "    # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - "\n", - "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", - "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", - "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "b15e6827", - "metadata": { - "id": "b15e6827" - }, - "outputs": [], - "source": [ - "## Add parent dir to path\n", - "import os,sys\n", - "\n", - "this_dir = os.path.abspath('')\n", - "parent_dir = os.path.dirname(this_dir)\n", - "sys.path.append (os.path.abspath (parent_dir))" - ] - }, - { - "cell_type": "markdown", - 
"id": "72510ae6-48b0-4b88-9e13-a623281c3a63", - "metadata": { - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" - }, - "source": [ - "### 2.2 - Setup input/outpur directories" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "ec5beb05-027a-49eb-9a96-271471619d81" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Cleared output directory\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "\n", - "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", - " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", - "\n", - "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", - "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", - "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", - "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", - "\n", - "## clear output folder\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", - "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", - "\n", - "print (\"✅ Cleared output directory\")" - ] - }, - { - "cell_type": "markdown", - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", - "metadata": { - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" - }, - "source": [ - "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", - "\n", - "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", - "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", - "metadata": { - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" - }, - "source": [ - "### 3.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "f8383739-a4fb-450c-dc37-5df32aab8212" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" - ] - } - ], - "source": [ - "STAGE = 1\n", - "\n", - "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", - "output_folder = output_parquet_dir\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", - "metadata": { - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" - }, - "source": [ - "### 3.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "14a36e73-a186-4431-a755-f46ccb691130" - }, - "outputs": [ - { - "name": "stderr", - "output_type": 
"stream", - "text": [ - "13:30:44 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "13:30:44 INFO - pipeline id pipeline_id\n", - "13:30:44 INFO - code location None\n", - "13:30:44 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}\n", - "13:30:44 INFO - actor creation delay 0\n", - "13:30:44 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "13:30:44 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "13:30:44 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:30:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "13:30:44 INFO - Running locally\n", - "2024-10-18 13:30:47,436\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - orchestrator started at 2024-10-18 13:30:50\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.872821807861328, 'object_store': 7.436410903930664}\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m 13:30:53 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 110376.42it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - Completed processing 2 files in 0.145 min\n", - "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m 13:30:53 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 73713.60it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", - "13:31:09 INFO - Completed execution in 0.421 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:1 completed successfully\n", - "CPU times: user 4.41 s, sys: 1.39 s, total: 5.8 s\n", - "Wall time: 31.1 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "import ast\n", - "import os\n", - "import sys\n", - "\n", - "from pdf2parquet_transform import (\n", - "    pdf2parquet_contents_type_cli_param,\n", - "    pdf2parquet_contents_types,\n", - ")\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n", - "from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration\n", - "\n", - "from data_processing.utils import GB, ParamsUtils\n", - "\n", - "\n", - "# create parameters\n", - "local_conf = {\n", - "    \"input_folder\": input_folder,\n", - "    \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS, \"memory\": MY_CONFIG.RAY_MEMORY_GB * GB}\n", - "ingest_config = {\n", - "    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n", - "}\n", - "\n", - "params = {\n", - "    # where to run\n", - "    \"run_locally\": True,\n", - "    # Data access. Only required parameters are specified\n", - "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - "    \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n", - "    # orchestrator\n", - "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - "    \"runtime_num_workers\": 1,  # so model download and cleanup work properly\n", - "\n", - "}\n", - "\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))\n", - "# create launcher\n", - "launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())\n", - "# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - "    raise Exception (\"❌ Ray job failed\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "5ca790e0", - "metadata": { - "id": "5ca790e0" - }, - "source": [ - "### 3.3 - Inspect Generated output\n", - "\n", - "Here we should see one entry per input file processed." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "fe59563d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 255 - }, - "id": "fe59563d", - "outputId": "d10c022d-524f-4a13-ebf8-6431114e9172" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Output dimensions (rows x columns)= (2, 12)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...101162e5639f-f922-4ccc-a041-3cb02f1cfd83pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011f3c0ac2e-1de2-472b-8216-2043f3b3e9d1pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdf
\n", - "
" - ], - "text/plain": [ - " filename contents num_pages \\\n", - "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "\n", - " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 62e5639f-f922-4ccc-a041-3cb02f1cfd83 pdf \n", - "1 0 11 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.494027 2.015123 earth.pdf " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "e5058a21", - "metadata": { - "id": "e5058a21" - }, - "source": [ - "\n", - "### 3.4 - Understand the output\n", - "\n", - "Here are some interesting attributes to note:\n", - "\n", - "- **filename** : original filename\n", - "- **contents** : text\n", - "- **document_id**: unique id (UUID) assignd to this document\n", - "- **hash** : hash of document\n", - "- **pdf_convert_time** : time to convert this pdf in seconds\n", - "\n", - "Let's inspect the **contents** column. See how the text is being divided up!" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "f870e624", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "f870e624", - "outputId": "9142246b-988c-4674-99d7-e2f3fffbaaf4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", - " 'filename': 'mars.pdf',\n", - " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.35137939,\n", - " 654.45184326,\n", - " 169.88169861,\n", - " 667.98492432],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.09541321,\n", - " 630.68127441,\n", - " 210.66503906,\n", - " 642.34405518],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.84518433,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.02520752],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. 
At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.18510437,\n", - " 570.83258057,\n", - " 374.99838257,\n", - " 581.07043457],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about the Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.22866821,\n", - " 542.98168945,\n", - " 163.86282349,\n", - " 554.45288086],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87440491,\n", - " 500.84011841,\n", - " 477.48345947,\n", - " 534.55810547],\n", - " 'page': 1,\n", - " 'span': [0, 196]}],\n", - " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", - " 'desert world with a thin atmosphere composed '\n", - " 'primarily of carbon dioxide. Its reddish hue comes '\n", - " 'from iron oxide, or rust, prevalent on its surface.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.2026062,\n", - " 482.90710449,\n", - " 237.04431152,\n", - " 493.07443237],\n", - " 'page': 1,\n", - " 'span': [0, 23]}],\n", - " 'text': 'Basic facts about Mars:',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 453.019104,\n", - " 477.48171997,\n", - " 474.9703064],\n", - " 'page': 1,\n", - " 'span': [0, 78]}],\n", - " 'text': '· Distance from the Sun: Average of 228 million '\n", - " 'kilometers (142 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.79351807,\n", - " 431.73287964,\n", - " 451.2142334],\n", - " 'page': 1,\n", - " 'span': [0, 64]}],\n", - " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", - " 'called a \"sol\")',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 429.10913086,\n", - " 365.9559021,\n", - " 438.83737183],\n", - " 'page': 1,\n", - " 'span': [0, 44]}],\n", - " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.51646423],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "import pprint\n", - "import json\n", - "\n", - "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", - "# json.loads(output_df.iloc[0, ]['contents'])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "e1a10c2d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e1a10c2d", - "outputId": "ca74113e-6fd3-488b-836a-60bd58299fb1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", - " 'filename': 'earth.pdf',\n", - " 'page-hashes': [{'hash': 
'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.30961609,\n", - " 654.45184326,\n", - " 174.04208374,\n", - " 667.93347168],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.12528992,\n", - " 630.69073486,\n", - " 210.66503906,\n", - " 642.27935791],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87112427,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.04595947],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.20942688,\n", - " 570.81555176,\n", - " 375.57919312,\n", - " 581.08459473],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about our Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.15542603,\n", - " 542.98168945,\n", - " 167.32983398,\n", - " 554.36669922],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.91053772,\n", - " 512.46295166,\n", - " 477.84887695,\n", - " 534.48431396],\n", - " 'page': 1,\n", - " 'span': [0, 107]}],\n", - " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", - " 'planet. 
Earth is the only place we know of with life.',\n", - "                'type': 'paragraph'},\n", - "               {'name': 'Text',\n", - "                'prov': [{'bbox': [133.30151367,\n", - "                                   494.86206055,\n", - "                                   240.17156982,\n", - "                                   505.07229614],\n", - "                          'page': 1,\n", - "                          'span': [0, 24]}],\n", - "                'text': 'Basic facts about Earth:',\n", - "                'type': 'paragraph'},\n", - "               {'name': 'List-item',\n", - "                'prov': [{'bbox': [145.94500732,\n", - "                                   464.97409058,\n", - "                                   477.47979736,\n", - "                                   487.02810669],\n", - "                          'page': 1,\n", - "                          'span': [0, 79]}],\n", - "                'text': '· Distance from the Sun: Average of 149.6 million '\n", - "                        'kilometers (93 million miles)',\n", - "                'type': 'paragraph'},\n", - "               {'name': 'List-item',\n", - "                'prov': [{'bbox': [145.94500732,\n", - "                                   452.86901855,\n", - "                                   317.90722656,\n", - "                                   463.24041748],\n", - "                          'page': 1,\n", - "                          'span': [0, 37]}],\n", - "                'text': '· Rotation Period: 24 hours (one day)',\n", - "                'type': 'paragraph'},\n", - "               {'name': 'List-item',\n", - "                'prov': [{'bbox': [145.94500732,\n", - "                                   440.71496582,\n", - "                                   396.66357422,\n", - "                                   451.19915771],\n", - "                          'page': 1,\n", - "                          'span': [0, 52]}],\n", - "                'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", - "                'type': 'paragraph'},\n", - "               {'name': 'Page-footer',\n", - "                'prov': [{'bbox': [303.13299561,\n", - "                                   87.20314026,\n", - "                                   308.11428833,\n", - "                                   96.53633118],\n", - "                          'page': 1,\n", - "                          'span': [0, 1]}],\n", - "                'text': '1',\n", - "                'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": { - "id": "72274586" - }, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "In the previous step, we have extracted text from our PDFs. But we have the content of the entire file as 'one row' in our parquet output.\n", - "\n", - "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", - "\n", - "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", - "to chunk according to the document layout segmentation, i.e. respecting the original document components as paragraphs, tables, enumerations, etc.\n", - "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", - "which provides the required JSON structure."
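To make the chunking idea concrete, here is a minimal sketch of layout-aware chunking over the Docling JSON shown above. It is not Quackling's actual `HierarchicalChunker` algorithm, just an approximation of the idea: emit one chunk per paragraph, group consecutive list items, and prefix each chunk with the most recent section header. The `main-text`, `name`, and `text` fields are the ones visible in the pprint output above; the function name is ours.

```python
import json

def naive_layout_chunks(docling_json: str) -> list[str]:
    """Illustrative sketch only; not Quackling's actual algorithm."""
    doc = json.loads(docling_json)
    chunks, header, items = [], "", []

    def flush_items():
        # emit any accumulated list items as one chunk, under their header
        if items:
            chunks.append("\n".join(([header] if header else []) + items))
            items.clear()

    for elem in doc.get("main-text", []):
        name, text = elem.get("name"), elem.get("text", "")
        if name == "Section-header":
            flush_items()
            header = text
        elif name == "List-item":
            items.append(text)
        elif name == "Text":
            flush_items()
            chunks.append(f"{header}\n{text}" if header else text)
    flush_items()
    return chunks

# e.g. naive_layout_chunks(output_df.iloc[0]['contents'])
```

On `mars.pdf` this reproduces the four chunks shown in section 4.3 below; the real chunker handles cases like `earth.pdf` more cleverly, where the 'Basic facts about Earth:' paragraph is attached to the list that follows it.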
- ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": { - "id": "96198fa6" - }, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "305f00a3", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "305f00a3", - "outputId": "689f1531-7007-49d9-9a27-39c39f8f2c50" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": { - "id": "369f2cd1" - }, - "source": [ - "### 4.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "5b7b18d5", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5b7b18d5", - "outputId": "0146bd91-2ccb-4e56-c649-f415a38bfcf8" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:31:12 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "13:31:12 INFO - pipeline id pipeline_id\n", - "13:31:12 INFO - code location None\n", - "13:31:12 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", - "13:31:12 INFO - actor creation delay 0\n", - "13:31:12 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - "13:31:12 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "13:31:12 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:31:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:31:12 INFO - Running locally\n", - "2024-10-18 13:31:14,121\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - orchestrator started at 2024-10-18 13:31:16\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.963891602121294, 'object_store': 7.4819458005949855}\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - Completed processing 2 files in 0.032 min\n", - "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - done flushing in 0.001 sec\n", - "13:31:28 INFO - Completed execution in 0.269 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:2 completed successfully\n", - "CPU times: user 982 ms, sys: 291 ms, total: 1.27 s\n", - "Wall time: 18.9 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "from doc_chunk_transform_ray import DocChunkRayTransformConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - "    \"input_folder\": input_folder,\n", - "    \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - "    # where to run\n", - "    \"run_locally\": True,\n", - "    # Data access. Only required parameters are specified\n", - "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - "    # orchestrator\n", - "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - "    \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - "    # doc_chunk arguments\n", - "    # ...\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - "    raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "213afdf6", - "metadata": { - "id": "213afdf6" - }, - "source": [ - "### 4.3 - Inspect Generated output\n", - "\n", - "We should see the documents split into many chunks" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d8138d43", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 897 - }, - "id": "d8138d43", - "outputId": "e1758b0c-5f22-4368-c3e6-ff778fc9ae82" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Files processed :  2\n", - "Chunks created :  8\n", - "Input data dimensions (rows x columns)=  (2, 12)\n", - "Output data dimensions (rows x columns)=  (8, 16)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "9e9ca75c", - "metadata": { - "id": "9e9ca75c" - }, - "source": [ - "### 4.4 - Understanding the Output\n", - "\n", - "Here we see 2 PDF files are split into 6 chunks. Basically we see the documents are being split along 'natural boundaris' - paragraphs and bullet points\n", - "\n", - "See how **document_id** is carried throughout. This helps us identify original documents.\n", - "\n", - "Also note **contents** is now plain text (not JSON as before)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "3090c950", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - }, - "id": "3090c950", - "outputId": "3f542446-2cfa-404c-c642-3732f7b74568" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "5 earth.pdf Solar System\\nFor more details about our Solar...\n", - "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "d5f151ae", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "d5f151ae", - "outputId": "4616d648-0852-4ecb-cef8-f5940e176de0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 3------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "20217298", - "metadata": { - "id": "20217298" - }, - "source": [ - "## Step-5: DOC ID generation\n", - "\n", - "This transform annotates documents with document \"ids\". 
It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column where you want to store it.\n", - "\n", - "**This is a prerequisite for fuzzy dedup** in the pipeline." - ] - }, - { - "cell_type": "markdown", - "id": "66811f5b", - "metadata": { - "id": "66811f5b" - }, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "1f747c0d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1f747c0d", - "outputId": "e42500b7-5d1e-41fd-b53b-34d3393f36f4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "# Input for this stage is the output of the doc chunk stage\n", - "# The output of this stage makes it possible for the fdedup component to run on the data.\n", - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "18aa0fe1", - "metadata": { - "id": "18aa0fe1" - }, - "source": [ - "### 5.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "f6e9e145", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "f6e9e145", - "outputId": "2add5f0c-3ab6-4336-8a7b-ac8b1b76ab73" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:31:29 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "13:31:29 INFO - pipeline id pipeline_id\n", - "13:31:29 INFO - code location None\n", - "13:31:29 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", - "13:31:29 INFO - actor creation delay 0\n", - "13:31:29 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "13:31:29 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "13:31:29 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:31:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:31:29 INFO - Running locally\n", - "2024-10-18 13:31:31,792\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - orchestrator started at 2024-10-18 13:31:32\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.033103181049228, 'object_store': 7.516551589593291}\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - Completed processing 2 files in 0.012 min\n", - "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - done flushing in 0.001 sec\n", - "13:31:43 INFO - Completed execution in 0.228 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 123 ms, sys: 145 ms, total: 267 ms\n", - "Wall time: 15.2 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration\n", - "\n", - "local_conf = {\n", - "    \"input_folder\": input_folder,\n", - "    \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - "    # where to run\n", - "    \"run_locally\": True,\n", - "    # Data access. Only required parameters are specified\n", - "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - "    # orchestrator\n", - "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - "    \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - "    # doc id configuration\n", - "    \"doc_id_doc_column\": \"contents\",\n", - "    \"doc_id_hash_column\": \"chunk_hash\",\n", - "    \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - "    raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "4954402f", - "metadata": { - "id": "4954402f" - }, - "source": [ - "### 5.3 - Inspect Generated output\n", - "\n", - "You will notice we have two extra columns\n", - "\n", - "- **chunk_hash** (set via **hash_column**)\n", - "- **chunk_id** (set via **int_id_column**)\n", - "\n", - "But still the same number of rows as before" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "1911179a", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 860 - }, - "id": "1911179a", - "outputId": "45e83e2a-1f70-46b9-e311-c50f025419be" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)=  (8, 16)\n", - "Output data dimensions (rows x columns)=  (8, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
\n", - "\n", - " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "852829dc", - "metadata": { - "id": "852829dc" - }, - "source": [ - "## Step-6: Exact Dedup\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": { - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" - }, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "4c7a1b94", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "40a119b4-44fc-483d-9ad0-da178a2a8eb1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": { - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" - }, - "source": [ - "### 6.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "bd0f3f94-8c48-4c6b-b911-858e389243f4" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "13:31:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "13:31:45 INFO - pipeline id pipeline_id\n", - "13:31:45 INFO - code location None\n", - "13:31:45 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", - "13:31:45 INFO - actor creation delay 0\n", - "13:31:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - "13:31:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "13:31:45 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:31:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random 
samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:31:45 INFO - Running locally\n", - "2024-10-18 13:31:47,001\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - orchestrator started at 2024-10-18 13:31:48\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.010423279367387, 'object_store': 7.505211639218032}\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed processing 2 files in 0.013 min\n", - "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - done flushing in 0.001 sec\n", - "13:31:58 INFO - Completed execution in 0.228 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 136 ms, sys: 154 ms, total: 289 ms\n", - "Wall time: 15.2 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # ededup parameters\n", - " \"ededup_hash_cpu\": 0.5,\n", - " \"ededup_num_hashes\": 2,\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": { - "id": "eaf1c3c3" - }, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "d824ebf6", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 815 - }, - "id": "d824ebf6", - "outputId": "9173efb6-1b95-4a7e-b531-1a611841a4d0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 18)\n", - "Output data dimensions (rows x columns)= (7, 19)\n", - "Input chunks before exact dedupe : 8\n", - "Output chunks after exact dedupe : 7\n", - "Duplicate chunks removed : 1\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[]
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[]
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[]
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[]
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[]
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "82cc9bb0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 269 - }, - "id": "82cc9bb0", - "outputId": "e043fa01-ceca-49ae-b764-8154219c7b6c" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\n· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nFor more details about the Solar...\n", - "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "4 earth.pdf Solar System\\nFor more details about our Solar...\n", - "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "cc61dffa", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "aff7a0d9-a791-42a5-d5b7-ad643f59f261" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 1------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 2------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "383f40ba", - "metadata": { - "id": "383f40ba" - }, - "source": [ - "### 6.4 - Understanding the output\n", - "\n", - "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk is removed.\n", - "\n", - "If you look at the PDF, the following common paragraph in `earth.pdf` and `mars.pdf` is removed from one of the documents! Pretty neat, eh!\n", - "\n", - "```text\n", - "## Solar System\n", - "\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "85309751-8556-41c6-ac32-84acc941bc8d",
-   "metadata": {
-    "id": "85309751-8556-41c6-ac32-84acc941bc8d"
-   },
-   "source": [
-    "## Step-7: Fuzzy Dedup\n",
-    "\n",
-    "After exact deduplication, fuzzy deduplication is applied with the goal of removing chunks that may have only **slight variations**, thereby unbiasing\n",
-    "the data further.\n",
-    "\n",
-    "Small variations are quite common; in code data, for example, they show up as different values of variables, the addition of logging statements, etc. See the shingling sketch below."
-   ]
-  },
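-  {
-   "cell_type": "markdown",
-   "id": "fuzzy-shingle-note",
-   "metadata": {},
-   "source": [
-    "How can 'almost identical' be detected? Fuzzy dedup compares overlapping word n-grams ('shingles') of each chunk. The sketch below is an illustration only, using the two near-duplicate sentences from our PDFs; the actual transform does not compare shingle sets directly but approximates their overlap with MinHash signatures (see the parameters in 7.2):\n",
-    "\n",
-    "```python\n",
-    "# Illustration only: the real transform builds MinHash signatures\n",
-    "# over shingles like these rather than comparing the raw sets.\n",
-    "def shingles(text: str, size: int = 5, delim: str = \" \"):\n",
-    "    \"\"\"All overlapping word n-grams of the given size (cf. fdedup_shingles_size).\"\"\"\n",
-    "    words = text.split(delim)\n",
-    "    return {delim.join(words[i : i + size]) for i in range(len(words) - size + 1)}\n",
-    "\n",
-    "a = shingles(\"For more details about the Solar system see Chapter 1.\")\n",
-    "b = shingles(\"For more details about our Solar system see Chapter 1.\")\n",
-    "print(len(a & b), \"shared shingles; jaccard =\", round(len(a & b) / len(a | b), 2))\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "fcf574a3-b287-419c-9c86-07b828b41ca6",
-   "metadata": {
-    "id": "fcf574a3-b287-419c-9c86-07b828b41ca6"
-   },
-   "source": [
-    "### 7.1 - Set Input/output Folder"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 26,
-   "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399",
-   "metadata": {
-    "colab": {
-     "base_uri": "https://localhost:8080/"
-    },
-    "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399",
-    "outputId": "d53a92d2-0f1c-465f-f11c-b9bc2931f651"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "🏃🏼 STAGE-5: Processing input='output/03_docid_out' --> output='output/05_fuzzy_dedupe_out'\n"
-     ]
-    }
-   ],
-   "source": [
-    "## Input to this component is the output of doc_id generator component.\n",
-    "\n",
-    "STAGE = 5\n",
-    "\n",
-    "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n",
-    "output_folder = output_fuzzy_dedupe_dir\n",
-    "\n",
-    "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n",
-    "\n",
-    "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3",
-   "metadata": {
-    "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3"
-   },
-   "source": [
-    "### 7.2 - Execute"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 27,
-   "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f",
-   "metadata": {
-    "colab": {
-     "base_uri": "https://localhost:8080/"
-    },
-    "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f",
-    "outputId": "1e63d364-3944-465a-ff7c-6e1dc750b2de"
-   },
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "13:32:00 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n",
-      "13:32:00 INFO - pipeline id pipeline_id\n",
-      "13:32:00 INFO - code location None\n",
-      "13:32:00 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
-      "13:32:00 INFO - actor creation delay 0\n",
-      "13:32:00 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n",
-      "13:32:00 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n",
-      "13:32:00 INFO - data factory data_ max_files -1, n_sample -1\n",
-      "13:32:00 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to 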
checkpoint ['.parquet']\n", - "13:32:00 INFO - Running locally\n", - "2024-10-18 13:32:02,246\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - orchestrator started at 2024-10-18 13:32:03\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.000544739887118, 'object_store': 7.500272369012237}\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files in 0.064 min\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files (50.0%) in 0.064 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - Completed processing 2 files in 0.197 min\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - Done submitting long buckets\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - Done processing buckets in 0.01 min\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - creating document snapshots\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=16602)\u001b[0m 13:32:17 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - Completed processing 2 files in 0.113 min\n", - "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - done flushing in 0.005 sec\n", - "13:32:35 INFO - Completed execution in 0.588 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:5 completed successfully\n", - "CPU times: user 270 ms, sys: 200 ms, total: 470 ms\n", - "Wall time: 36.6 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "import os\n", - "import sys\n", - "\n", - "from data_processing.utils import ParamsUtils\n", - "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "\n", - "# create parameters\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", - "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # Orchestration parameters\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # columns used\n", - " \"fdedup_doc_column\": \"contents\",\n", - " \"fdedup_id_column\": \"chunk_id\",\n", - " \"fdedup_cluster_column\": \"chunk_hash\",\n", - " # infrastructure\n", - " \"fdedup_bucket_cpu\": 0.3,\n", - " \"fdedup_doc_cpu\": 0.3,\n", - " \"fdedup_mhash_cpu\": 0.3,\n", - " \"fdedup_num_doc_actors\": 1,\n", - " \"fdedup_num_bucket_actors\": 1,\n", - " \"fdedup_num_minhash_actors\": 1,\n", - " \"fdedup_num_preprocessors\": 1,\n", - " # fuzzy parameters\n", - " \"fdedup_num_permutations\": 64,\n", - " \"fdedup_threshold\": 0.7, # (default 0.8)\n", - " \"fdedup_shingles_size\": 5,\n", - " \"fdedup_delimiters\": \" \"\n", - "}\n", - "\n", - "# Pass commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "a6f8cd11", - "metadata": { - "id": "a6f8cd11" - }, - "source": [ - "### 7.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "e899ad60", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 677 - }, - "id": "e899ad60", - "outputId": "fcfda84c-ebbf-490f-f478-ceef7ca9e83b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 18)\n", - "Output data dimensions (rows x columns)= (6, 18)\n", - "Duplicate chunks removed by fuzzy-dedupe: 2\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_idchunk_hash
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4-1
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6-1
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7-1
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...15
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2-1
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3-1
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id chunk_id chunk_hash \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", - "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", - "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 -1 \n", - "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
3 -1 " - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "ab7ea52b", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 238 - }, - "id": "ab7ea52b", - "outputId": "e38754ee-777f-4ed7-ebc0-9299ee122662" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\n· Distance from the S...
3earth.pdfSolar System\\nFor more details about our Solar...
4earth.pdfEarth\\nEarth is the third planet from the Sun....
5earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "3 earth.pdf Solar System\\nFor more details about our Solar...\n", - "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "6bdd3515", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6bdd3515", - "outputId": "e6e3f2c0-5b23-4336-bc95-013921f0724a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 2------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 1------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "2b34d9c6", - "metadata": { - "id": "2b34d9c6" - }, - "source": [ - "### 7.4- Understanding the output\n", - "\n", - "So we started with 7 rows and ended up with 6. Fuzzy dedupe removed the following **very similar** chunk.\n", - "\n", - "These are pretty similar chunks except for the words 'the' and 'our'\n", - "\n", - "**earth.pdf**\n", - "\n", - "`For more details about *our* Solar system see Chapter 1.`\n", - "\n", - "**mars.pdf**\n", - "\n", - "`For more details about *the* Solar system see Chapter 1.`\n", - "\n", - "Pretty neat, eh? 
👏\n",
-    "\n",
-    "### Configuring Fuzzy Dedup\n",
-    "\n",
-    "You can tune fuzzy dedupe via the following parameters:\n",
-    "\n",
-    "```python\n",
-    "# fuzzy parameters\n",
-    "    \"fdedup_num_permutations\": 64,\n",
-    "    \"fdedup_threshold\": 0.7, # (default 0.8)\n",
-    "    \"fdedup_shingles_size\": 5,\n",
-    "    \"fdedup_delimiters\": \" \"\n",
-    "```\n",
-    "\n",
-    "In our case, we set the `fdedup_threshold` parameter to 0.7; a lower threshold treats less-similar chunks as duplicates. \n"
-   ]
-  },
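-  {
-   "cell_type": "markdown",
-   "id": "fdedup-threshold-note",
-   "metadata": {},
-   "source": [
-    "For intuition on how `fdedup_threshold` and `fdedup_num_permutations` interact: assuming fdedup uses the standard MinHash-LSH banding scheme (the run log above reports 'Fuzzy: num buckets 8, bucket length 8', i.e. 64 = 8 x 8 signatures), the probability that a pair of chunks with true shingle similarity s is flagged as a duplicate candidate follows the usual S-curve. A sketch of that curve, under this assumption:\n",
-    "\n",
-    "```python\n",
-    "# Assumption: standard MinHash-LSH banding with b bands of r rows each\n",
-    "# (b * r = 64 permutations; b and r taken from the run log above).\n",
-    "b, r = 8, 8\n",
-    "\n",
-    "def candidate_probability(s: float) -> float:\n",
-    "    return 1.0 - (1.0 - s**r) ** b\n",
-    "\n",
-    "for s in (0.5, 0.7, 0.8, 0.9):\n",
-    "    print(f\"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}\")\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5370950a-2a3a-4143-8218-f9b4808099ba",
-   "metadata": {
-    "id": "5370950a-2a3a-4143-8218-f9b4808099ba"
-   },
-   "source": [
-    "## Step-8: Text encoding\n",
-    "\n",
-    "Encode text for the vector storage."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "85aba685",
-   "metadata": {
-    "id": "85aba685"
-   },
-   "source": [
-    "### 8.1 - Set Input/output Folder"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 31,
-   "id": "20a153fa-fd56-401e-86be-4f7617affcc8",
-   "metadata": {
-    "colab": {
-     "base_uri": "https://localhost:8080/"
-    },
-    "id": "20a153fa-fd56-401e-86be-4f7617affcc8",
-    "outputId": "530e65c6-7ceb-4c73-cb87-50da46c78add"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n"
-     ]
-    }
-   ],
-   "source": [
-    "STAGE = 6\n",
-    "\n",
-    "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n",
-    "output_folder = output_embeddings_dir\n",
-    "\n",
-    "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n",
-    "\n",
-    "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c97545f4",
-   "metadata": {
-    "id": "c97545f4"
-   },
-   "source": [
-    "### 8.2 - Execute"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 32,
-   "id": "228df6b2-bc62-494b-9697-03ece98d7853",
-   "metadata": {
-    "colab": {
-     "base_uri": "https://localhost:8080/",
-     "height": 914,
-     "referenced_widgets": [
-      "8b7571c585df431eb901fcdebdf8177e",
-      "06107a2f48b3491f91bbe84e46e10ba0",
-      "bd74356eca18423aa0373c808d9097e3",
-      "7e13e8779a81400f996d4428c74acfaf",
-      "a75892696be546a3970962bae7bf732a",
-      "68997339f13240a4824a9e416096bee4",
-      "919b086abd314077bbff75687392bd91",
-      "b4c209371e7a403986991a786cfb296d",
-      "6c08de2dd9a2402c90b1a7a645db9b13",
-      "91fff81a1de8487c9009e872b751edb0",
-      "ada62d24cbcf4361acbb21808f334d33"
-     ]
-    },
-    "id": "228df6b2-bc62-494b-9697-03ece98d7853",
-    "outputId": "b10eecc1-cd17-49c1-e3b1-b80e0e1bfa86"
-   },
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "13:32:37 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n",
-      "13:32:37 INFO - pipeline id pipeline_id\n",
-      "13:32:37 INFO - code location None\n",
-      "13:32:37 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
-      "13:32:37 INFO - actor creation delay 0\n",
-      "13:32:37 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n",
-      "13:32:37 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n",
-      "13:32:37 INFO - data factory data_ max_files -1, n_sample -1\n",
-      "13:32:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random 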
samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
-      "13:32:37 INFO - Running locally\n",
-      "2024-10-18 13:32:39,609\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - orchestrator started at 2024-10-18 13:32:42\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.943363189697266, 'object_store': 7.471681594848633}\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - Completed processing 2 files in 0.087 min\n",
-      "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - done flushing in 0.001 sec\n",
-      "13:32:57 INFO - Completed execution in 0.333 min, execution result 0\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "✅ Stage:6 completed successfully\n",
-      "CPU times: user 607 ms, sys: 226 ms, total: 833 ms\n",
-      "Wall time: 22.1 s\n"
-     ]
-    }
-   ],
-   "source": [
-    "%%time\n",
-    "\n",
-    "from text_encoder_transform_ray import TextEncoderRayTransformConfiguration\n",
-    "\n",
-    "local_conf = {\n",
-    "    \"input_folder\": input_folder,\n",
-    "    \"output_folder\": output_folder,\n",
-    "}\n",
-    "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n",
-    "params = {\n",
-    "    # where to run\n",
-    "    \"run_locally\": True,\n",
-    "    # Data access. Only required parameters are specified\n",
-    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
-    "    # orchestrator\n",
-    "    \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
-    "    \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n",
-    "    # text_encoder\n",
-    "    \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n",
-    "}\n",
-    "\n",
-    "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
-    "# create launcher\n",
-    "launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())\n",
-    "# Launch the ray actor(s) to process the input\n",
-    "\n",
-    "return_code = launcher.launch()\n",
-    "\n",
-    "if return_code == 0:\n",
-    "    print (f\"✅ Stage:{STAGE} completed successfully\")\n",
-    "else:\n",
-    "    raise Exception (\"❌ Ray job failed\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b734852c",
-   "metadata": {
-    "id": "b734852c"
-   },
-   "source": [
-    "### 8.3 - Inspect Generated output\n",
-    "\n",
-    "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`."
-   ]
-  },
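-  {
-   "cell_type": "markdown",
-   "id": "text-encoder-note",
-   "metadata": {},
-   "source": [
-    "Per row, the transform computed roughly the following; here is a standalone sketch showing how the same model can be applied directly with the `sentence-transformers` library:\n",
-    "\n",
-    "```python\n",
-    "# Sketch: encode one chunk with the same embedding model.\n",
-    "from sentence_transformers import SentenceTransformer\n",
-    "\n",
-    "model = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n",
-    "vec = model.encode(\"Earth is the third planet from the Sun.\")\n",
-    "print(len(vec))  # all-MiniLM-L6-v2 produces 384-dimensional vectors\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 33,
-   "id": "7b1c1d09",
-   "metadata": {
-    "colab": {
-     "base_uri": "https://localhost:8080/",
-     "height": 659
-    },
-    "id": "7b1c1d09",
-    "outputId": "70612634-b336-4ad5-ddb3-782ca0676bae"
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Input data dimensions (rows x columns)=  (6, 18)\n",
-      "Output data dimensions (rows x columns)=  (6, 19)\n"
-     ]
-    },
-    {
-     "data": {
-      "text/html": [
-       "<div>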
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_idchunk_hashembeddings
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Solar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4-1[0.0077404897, -0.020559434, 0.026426662, 0.01...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6-1[0.07728298, 0.024971062, -0.04318075, 0.05809...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:30:59.4900072.011138mars.pdf62e5639f-f922-4ccc-a041-3cb02f1cfd83Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7-1[0.1059802, 0.025460616, 0.02362733, 0.0390564...
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Solar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...15[-0.062105577, -0.0053322953, 0.03127779, 0.04...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2-1[0.0724358, -0.058001805, -0.01977186, -0.0243...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:30:59.4940272.015123earth.pdff3c0ac2e-1de2-472b-8216-2043f3b3e9d1Earth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3-1[0.091821924, 0.015197907, 0.07716932, 0.01711...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", - "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", - "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id chunk_id chunk_hash \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", - "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", - "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 -1 \n", - "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 -1 \n", - "\n", - " embeddings \n", - "0 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", - "1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", - "2 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", - "3 [-0.062105577, -0.0053322953, 0.03127779, 0.04... \n", - "4 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", - "5 [0.091821924, 0.015197907, 0.07716932, 0.01711... 
" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "d151e618-6528-40b5-fdbd-1c67291a7279" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" - ] - } - ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "dc0a6728", - "metadata": { - "id": "dc0a6728" - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "dpk-3-basic-022dev1-py311", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.10" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "06107a2f48b3491f91bbe84e46e10ba0": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_68997339f13240a4824a9e416096bee4", - "placeholder": "​", - "style": "IPY_MODEL_919b086abd314077bbff75687392bd91", - "value": "" - } - }, - "68997339f13240a4824a9e416096bee4": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": 
null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6c08de2dd9a2402c90b1a7a645db9b13": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "7e13e8779a81400f996d4428c74acfaf": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_91fff81a1de8487c9009e872b751edb0", - "placeholder": "​", - "style": "IPY_MODEL_ada62d24cbcf4361acbb21808f334d33", - "value": " 0/0 [00:00<?, ?it/s]" - } - }, - "8b7571c585df431eb901fcdebdf8177e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_06107a2f48b3491f91bbe84e46e10ba0", - "IPY_MODEL_bd74356eca18423aa0373c808d9097e3", - "IPY_MODEL_7e13e8779a81400f996d4428c74acfaf" - ], - "layout": "IPY_MODEL_a75892696be546a3970962bae7bf732a" - } - }, - "919b086abd314077bbff75687392bd91": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "91fff81a1de8487c9009e872b751edb0": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - 
"justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "a75892696be546a3970962bae7bf732a": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "ada62d24cbcf4361acbb21808f334d33": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "b4c209371e7a403986991a786cfb296d": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": "20px" - } - }, - "bd74356eca18423aa0373c808d9097e3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": 
"@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_b4c209371e7a403986991a786cfb296d", - "max": 1, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_6c08de2dd9a2402c90b1a7a645db9b13", - "value": 0 - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.png b/examples/notebooks/intro/images/data-prep-kit-3-workflow.png deleted file mode 100644 index 851adbfebc..0000000000 Binary files a/examples/notebooks/intro/images/data-prep-kit-3-workflow.png and /dev/null differ diff --git a/examples/notebooks/intro/my_utils.py b/examples/notebooks/intro/my_utils.py deleted file mode 100644 index 9a6477dfc3..0000000000 --- a/examples/notebooks/intro/my_utils.py +++ /dev/null @@ -1,55 +0,0 @@ -import os -import requests -from humanfriendly import format_size -import pandas as pd -import glob - - -## Reads parquet files in a folder into a pandas dataframe -def read_parquet_files_as_df (parquet_dir): - parquet_files = glob.glob(f'{parquet_dir}/*.parquet') - - # read each parquet file into a DataFrame and store in a list - dfs = [pd.read_parquet (f) for f in parquet_files] - - # Concatenate all DataFrames into a single DataFrame - data_df = pd.concat(dfs, ignore_index=True) - return data_df - - -def download_file(url, local_file, chunk_size=1024*1024): - """ - Downloads a remote URL to a local file. - - Args: - url (str): The remote URL. - local_filename (str): The name of the local file to save the downloaded content. - chunk_size (int): The size in bytes of each chunk. Defaults to 1024. - - Returns: - None - - Example usage: - download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024) # Download in chunks of 1MB - """ - # Check if the local file already exists - if os.path.exists(local_file): - file_size = format_size(os.path.getsize(local_file)) - print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.") - return - - # Create the directory if it doesn't exist - os.makedirs(os.path.dirname(local_file), exist_ok=True) - - # Stream the file download - with requests.get(url, stream=True) as r: - r.raise_for_status() - with open(local_file, 'wb') as f: - for chunk in r.iter_content(chunk_size=chunk_size): - if chunk: # filter out keep-alive new chunks - f.write(chunk) - print() - file_size = format_size(os.path.getsize(local_file)) - print(f"{local_file} ({file_size}) downloaded successfully.") -## --- end: download_file ------ - diff --git a/examples/notebooks/intro/.gitignore b/examples/notebooks/pdf-processing-1/.gitignore similarity index 100% rename from examples/notebooks/intro/.gitignore rename to examples/notebooks/pdf-processing-1/.gitignore diff --git a/examples/notebooks/pdf-processing-1/README.md b/examples/notebooks/pdf-processing-1/README.md new file mode 100644 index 0000000000..043f37cde6 --- /dev/null +++ b/examples/notebooks/pdf-processing-1/README.md @@ -0,0 +1,53 @@ +# PDF Processing with Data Prep Kit + +Show cases Data Prep Kit capabilities of processing PDFs. 
+
+We will demonstrate the following:
+
+- Extracting text from PDF files
+- Removing duplicates (exact and fuzzy matches)
+- Assessing document quality and removing documents that contain spam words or placeholder content such as 'lorem ipsum'
+
+**Workflow**
+
+![](images/data-prep-kit-3-workflow.png)
+
+## Setting up Python Environment
+
+The code can be run on either of the following:
+
+1. Google Colab: very easy to run; no local setup needed.
+2. Your local Python environment: here is a quick guide. You can find instructions for the latest version [here](../../../README.md#-getting-started)
+
+```bash
+conda create -n data-prep-kit -y python=3.11
+conda activate data-prep-kit
+
+# install the following in 'data-prep-kit' environment
+cd examples/notebooks/pdf-processing-1
+pip3 install -r requirements.txt
+
+# start jupyter and run the notebooks with this jupyter
+jupyter lab
+```
+
+## Data Files
+
+PDF files are located in [examples/data-files/pdf-processing-1](../../data-files/pdf-processing-1/)
+
+## Running the code
+
+[python version](pdf_processing_1_python.ipynb) &nbsp; [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb)
+
+[ray version](pdf_processing_1_ray.ipynb) &nbsp; [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb)
+
+## Troubleshooting
+
+If you encounter any errors loading libraries, try creating a custom kernel and using it to run the notebooks.
+
+```bash
+python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
+# and select this kernel within jupyter notebook
+```
+
+
diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw b/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.excalidraw similarity index 63% rename from examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw rename to examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.excalidraw index c0525c556b..03b19ce3c4 100644 --- a/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw +++ b/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.excalidraw @@ -5,44 +5,8 @@ "elements": [ { "type": "image", - "version": 128, - "versionNonce": 146671843, - "index": "b45", - "isDeleted": false, - "id": "nQdFTOsh8Rjwn3poFcnOO", - "fillStyle": "solid", - "strokeWidth": 1, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 258.1818181818182, - "y": 213.63636363636363, - "strokeColor": "transparent", - "backgroundColor": "transparent", - "width": 64, - "height": 64, - "seed": 222183398, - "groupIds": [ - "4aSnKsxGoqeqA7eYu4s2e" - ], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726186954844, - "link": null, - "locked": false, - "status": "saved", - "fileId": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", - "scale": [ - 1, - 1 - ] - }, - { - "type": "image", - "version": 240, - "versionNonce": 2054222979, + "version": 457, + "versionNonce": 173110248, "index": "b46", "isDeleted": false, "id": "hlPJZs7lUbLYhuRbSmYHs", @@ -52,29 +16,23 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 260.90909090909093, - "y": 285.4545454545455, + "x": 194.90909090909093, + "y": 202.4545454545455, "strokeColor": "transparent", "backgroundColor": "transparent",
"width": 64, "height": 64, "seed": 961787386, - "groupIds": [ - "4aSnKsxGoqeqA7eYu4s2e" - ], + "groupIds": [], "frameId": null, "roundness": null, "boundElements": [ { "id": "FVhCmDYbWjGck9rgcESwp", "type": "arrow" - }, - { - "id": "JMprrs8mNVD4CrqUlVm7i", - "type": "arrow" } ], - "updated": 1726186954844, + "updated": 1737528573258, "link": null, "locked": false, "status": "saved", @@ -82,12 +40,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "arrow", - "version": 2550, - "versionNonce": 1240871476, + "version": 2976, + "versionNonce": 1926996376, "index": "b47", "isDeleted": false, "id": "FVhCmDYbWjGck9rgcESwp", @@ -97,12 +56,12 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 823.5583207607388, - "y": 273.73602641681657, + "x": 583.0728843528818, + "y": 265.0654681139756, "strokeColor": "#2f9e44", "backgroundColor": "transparent", - "width": 154.2895204048931, - "height": 2.3372664247598323, + "width": 221.74126076768994, + "height": 0.598117686721821, "seed": 1954615226, "groupIds": [], "frameId": null, @@ -110,16 +69,21 @@ "type": 2 }, "boundElements": [], - "updated": 1726708776348, + "updated": 1737528696232, "link": null, "locked": false, "startBinding": { - "elementId": "Wxv71stEiYRpNjyhzzXgO", - "focus": 1.202109076005182, - "gap": 9.103775306193256, + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": 0.841290319837998, + "gap": 12.052870784360664, + "fixedPoint": null + }, + "endBinding": { + "elementId": "DolT9H5aqzEugA7sUfNlx", + "focus": -0.14468495613909563, + "gap": 10.4071488270705, "fixedPoint": null }, - "endBinding": null, "lastCommittedPoint": null, "startArrowhead": null, "endArrowhead": "arrow", @@ -129,61 +93,15 @@ 0 ], [ - 154.2895204048931, - 2.3372664247598323 + 221.74126076768994, + -0.598117686721821 ] ] }, - { - "type": "text", - "version": 324, - "versionNonce": 1281521869, - "index": "b4M", - "isDeleted": false, - "id": "zSJvmm-7DrsR5-qRb96Kl", - "fillStyle": "solid", - "strokeWidth": 1, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 595.4118679291607, - "y": 242.27481706603328, - "strokeColor": "#1e1e1e", - "backgroundColor": "#ffc9c9", - "width": 141.51840079198635, - "height": 59.453152259008114, - "seed": 409665722, - "groupIds": [], - "frameId": null, - "roundness": null, - "boundElements": [ - { - "id": "JMprrs8mNVD4CrqUlVm7i", - "type": "arrow" - }, - { - "id": "0wYqjwjKHCGbx7CfmDR__", - "type": "arrow" - } - ], - "updated": 1726186894805, - "link": null, - "locked": false, - "fontSize": 23.781260903603247, - "fontFamily": 1, - "text": "2. split into\nchunks", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "2. 
split into\nchunks", - "autoResize": true, - "lineHeight": 1.25 - }, { "type": "arrow", - "version": 848, - "versionNonce": 138401069, + "version": 1191, + "versionNonce": 1753926120, "index": "b4N", "isDeleted": false, "id": "JMprrs8mNVD4CrqUlVm7i", @@ -193,12 +111,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 329.1268602850381, - "y": 278.24885892455757, + "x": 303.3582097473162, + "y": 267.24885892455757, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 185.2530890548909, - "height": 2.823455039174007, + "width": 198.02173959261273, + "height": 2.6228850442226985, "seed": 1319994682, "groupIds": [], "frameId": null, @@ -206,19 +124,19 @@ "type": 2 }, "boundElements": [], - "updated": 1726186962183, + "updated": 1737528662023, "link": null, "locked": false, "startBinding": { - "elementId": "hlPJZs7lUbLYhuRbSmYHs", - "focus": -1.189794049219074, - "gap": 7.205686529987929, + "elementId": "QSiEFZIoz081ipwdmU8sg", + "focus": 0.36390758833591985, + "gap": 4.736856944692818, "fixedPoint": null }, "endBinding": { "elementId": "YFlD_rDw6IwCctPG9BjYf", - "focus": 1.1403432588201572, - "gap": 6.460959750980123, + "focus": -0.7972060339621995, + "gap": 9.46095975098018, "fixedPoint": null }, "lastCommittedPoint": null, @@ -230,15 +148,15 @@ 0 ], [ - 185.2530890548909, - -2.823455039174007 + 198.02173959261273, + -2.6228850442226985 ] ] }, { "type": "text", - "version": 757, - "versionNonce": 361576332, + "version": 865, + "versionNonce": 1985915368, "index": "b4O", "isDeleted": false, "id": "G0k27V_VE7lyh7YGr_fts", @@ -248,11 +166,11 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1128.9917648038, - "y": 212.9780740734803, + "x": 934.9917648037998, + "y": 247.9780740734803, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 110.85037231445312, + "width": 100.90922546386719, "height": 58.225670034857664, "seed": 970452474, "groupIds": [], @@ -264,23 +182,23 @@ "type": "arrow" } ], - "updated": 1726708803406, + "updated": 1737528832732, "link": null, "locked": false, "fontSize": 23.290268013943066, "fontFamily": 1, - "text": "4. dedupe\n(exact)", + "text": "3. exact\ndedupe", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "4. dedupe\n(exact)", + "originalText": "3. 
exact\ndedupe", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 598, - "versionNonce": 1689279715, + "version": 614, + "versionNonce": 181505944, "index": "b4g", "isDeleted": false, "id": "XUbC5cWQCm-GEFrdqZW7g", @@ -290,8 +208,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 333.94038113680745, - "y": 243.15978750685963, + "x": 319.94038113680745, + "y": 233.15978750685963, "strokeColor": "#1e1e1e", "backgroundColor": "#ffc9c9", "width": 173.54608154296875, @@ -306,7 +224,7 @@ "type": "arrow" } ], - "updated": 1726187078639, + "updated": 1737528653755, "link": null, "locked": false, "fontSize": 22.766190549743982, @@ -319,183 +237,10 @@ "autoResize": true, "lineHeight": 1.25 }, - { - "type": "image", - "version": 145, - "versionNonce": 1461008621, - "index": "b4h", - "isDeleted": false, - "id": "XH-Rt0Q5-K2g4tM9reh76", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 520.8409090909091, - "y": 209.88636363636368, - "strokeColor": "transparent", - "backgroundColor": "transparent", - "width": 64, - "height": 64, - "seed": 1159948140, - "groupIds": [ - "KKvJ56bTHwzAbN8YXYU0-" - ], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726186894805, - "link": null, - "locked": false, - "status": "saved", - "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", - "scale": [ - 1, - 1 - ] - }, - { - "type": "image", - "version": 193, - "versionNonce": 1127846733, - "index": "b4i", - "isDeleted": false, - "id": "YFlD_rDw6IwCctPG9BjYf", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 520.8409090909091, - "y": 279.8863636363637, - "strokeColor": "transparent", - "backgroundColor": "transparent", - "width": 64, - "height": 64, - "seed": 1369151980, - "groupIds": [ - "KKvJ56bTHwzAbN8YXYU0-" - ], - "frameId": null, - "roundness": null, - "boundElements": [ - { - "id": "0wYqjwjKHCGbx7CfmDR__", - "type": "arrow" - }, - { - "id": "JMprrs8mNVD4CrqUlVm7i", - "type": "arrow" - } - ], - "updated": 1726186894805, - "link": null, - "locked": false, - "status": "saved", - "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", - "scale": [ - 1, - 1 - ] - }, - { - "type": "arrow", - "version": 753, - "versionNonce": 1832909987, - "index": "b4j", - "isDeleted": false, - "id": "0wYqjwjKHCGbx7CfmDR__", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 0, - "opacity": 100, - "angle": 0, - "x": 587.6995151292258, - "y": 276.08728311464677, - "strokeColor": "#2f9e44", - "backgroundColor": "#b2f2bb", - "width": 160.10395921482052, - "height": 0.6238794650969908, - "seed": 1397245780, - "groupIds": [], - "frameId": null, - "roundness": { - "type": 2 - }, - "boundElements": [], - "updated": 1726186894829, - "link": null, - "locked": false, - "startBinding": { - "elementId": "YFlD_rDw6IwCctPG9BjYf", - "focus": -1.1101505124640194, - "gap": 3.799080521716917, - "fixedPoint": null - }, - "endBinding": { - "elementId": "zSJvmm-7DrsR5-qRb96Kl", - "focus": -0.1259939432648205, - "gap": 10.873205622899263, - "fixedPoint": null - }, - "lastCommittedPoint": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "points": [ - [ - 0, - 0 - ], - [ - 160.10395921482052, - -0.6238794650969908 - ] - ] - }, - { - "type": "text", - "version": 19, - "versionNonce": 1725165603, - "index": "b4t", - "isDeleted": false, - "id": "56KAsZE3Fub50OzL9XJ35", - "fillStyle": "solid", - "strokeWidth": 2, - 
"strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 344.7055268721148, - "y": 290.01136363636374, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 137.6798553466797, - "height": 25, - "seed": 961622755, - "groupIds": [], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726187031887, - "link": null, - "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "(pdf2parquet)", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "(pdf2parquet)", - "autoResize": true, - "lineHeight": 1.25 - }, { "type": "text", - "version": 89, - "versionNonce": 1217800429, + "version": 132, + "versionNonce": 1504935576, "index": "b4u", "isDeleted": false, "id": "GEwyTqhl4LrSwcaOeKRT5", @@ -505,71 +250,34 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 514.7055268721148, - "y": 356.01136363636374, + "x": 518.7055268721148, + "y": 383.01136363636374, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 74.97993469238281, + "width": 92.63992309570312, "height": 50, "seed": 31755757, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726187172155, - "link": null, - "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "parquet\nfiles", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "parquet\nfiles", - "autoResize": true, - "lineHeight": 1.25 - }, - { - "type": "text", - "version": 273, - "versionNonce": 821721012, - "index": "b5F", - "isDeleted": false, - "id": "ZGkHBN9UBrJLYPIlm-KTj", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1355.555487199263, - "y": 305.51136363636374, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 118.5198974609375, - "height": 50, - "seed": 1591407981, - "groupIds": [], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726708923087, + "updated": 1737528618509, "link": null, "locked": false, "fontSize": 20, "fontFamily": 5, - "text": "duplicate 'B'\nis removed", + "text": "markdown\ntext", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "duplicate 'B'\nis removed", + "originalText": "markdown\ntext", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 747, - "versionNonce": 104645940, + "version": 804, + "versionNonce": 859000296, "index": "b5G", "isDeleted": false, "id": "DolT9H5aqzEugA7sUfNlx", @@ -579,34 +287,39 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 827.643003983931, - "y": 226.3985286189349, + "x": 596.643003983931, + "y": 231.3985286189349, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 166.41502380371094, - "height": 29.112835017428832, + "width": 197.7639923095703, + "height": 58.225670034857664, "seed": 466678605, "groupIds": [], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1726708795102, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + } + ], + "updated": 1737528686607, "link": null, "locked": false, "fontSize": 23.290268013943066, "fontFamily": 1, - "text": "3. document id", + "text": "2. document id\n(compute hashes)", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "3. document id", + "originalText": "2. 
document id\n(compute hashes)", "autoResize": true, "lineHeight": 1.25 }, { "type": "arrow", - "version": 1071, - "versionNonce": 474965812, + "version": 1254, + "versionNonce": 980324072, "index": "b5U", "isDeleted": false, "id": "cXhTkxU13WdQeAv3Z_1mR", @@ -616,12 +329,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1318.993474938044, - "y": 401.3233033689122, + "x": 1145.993474938044, + "y": 268.31133050044286, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 0.8539592148204065, - "height": 113.62612053490295, + "width": 167.8539592148204, + "height": 1.6380934033722951, "seed": 605419139, "groupIds": [], "frameId": null, @@ -629,11 +342,21 @@ "type": 2 }, "boundElements": [], - "updated": 1726709016812, + "updated": 1737528943852, "link": null, "locked": false, - "startBinding": null, - "endBinding": null, + "startBinding": { + "elementId": "Qaz1byDgzm-0ZrVLBmU4v", + "focus": -0.37744699407794313, + "gap": 8.76620221077144, + "fixedPoint": null + }, + "endBinding": { + "elementId": "LbPBuhQ2btuEnjbeSDvuK", + "focus": -2.1413835587747667, + "gap": 14.33294663108768, + "fixedPoint": null + }, "lastCommittedPoint": null, "startArrowhead": null, "endArrowhead": "arrow", @@ -643,15 +366,15 @@ 0 ], [ - 0.8539592148204065, - 113.62612053490295 + 167.8539592148204, + 1.6380934033722951 ] ] }, { "type": "text", - "version": 976, - "versionNonce": 988237964, + "version": 1037, + "versionNonce": 1974786200, "index": "b5V", "isDeleted": false, "id": "Ba_pxAykcwH_ZsTbAtduc", @@ -661,34 +384,34 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1218.815207047896, - "y": 429.9549461276493, + "x": 1160.815207047896, + "y": 234.9549461276493, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 184.07017517089844, - "height": 29.112835017428832, + "width": 98.09219360351562, + "height": 58.225670034857664, "seed": 1665190893, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726709020882, + "updated": 1737528881336, "link": null, "locked": false, "fontSize": 23.290268013943066, "fontFamily": 1, - "text": "5. fuzzy dedupe", + "text": "4. fuzzy\ndedupe", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "5. fuzzy dedupe", + "originalText": "4. 
fuzzy\ndedupe", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 580, - "versionNonce": 693951668, + "version": 677, + "versionNonce": 1394703256, "index": "b5h", "isDeleted": false, "id": "XFHbtP2KmiHNNjZhz8ajW", @@ -698,8 +421,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1299.1022727272725, - "y": 517.40625, + "x": 1334.1022727272725, + "y": 178.40625, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -718,14 +441,14 @@ "id": "OdGsWefGyr6uqMl0wC6mH" } ], - "updated": 1726708989657, + "updated": 1737528940801, "link": null, "locked": false }, { "type": "text", - "version": 323, - "versionNonce": 1216816692, + "version": 420, + "versionNonce": 2107525272, "index": "b5i", "isDeleted": false, "id": "OdGsWefGyr6uqMl0wC6mH", @@ -735,8 +458,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1315.9786418568, - "y": 522.40625, + "x": 1350.9786418568, + "y": 183.40625, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 13.519989013671875, @@ -748,7 +471,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708989657, + "updated": 1737528940801, "link": null, "locked": false, "fontSize": 20, @@ -763,8 +486,8 @@ }, { "type": "rectangle", - "version": 573, - "versionNonce": 1856782260, + "version": 677, + "versionNonce": 1612348312, "index": "b5j", "isDeleted": false, "id": "NzWqph0M7tEkeTDKLPGZR", @@ -774,8 +497,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1301.1931818181815, - "y": 564.5880681818182, + "x": 1336.1931818181815, + "y": 225.58806818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -792,16 +515,20 @@ { "type": "text", "id": "K1QK2dyVWiWfd32P8ovQK" + }, + { + "id": "-CNAjEmW6cbufb2V3aXbb", + "type": "arrow" } ], - "updated": 1726708989657, + "updated": 1737530583902, "link": null, "locked": false }, { "type": "text", - "version": 264, - "versionNonce": 334637364, + "version": 364, + "versionNonce": 150023400, "index": "b5k", "isDeleted": false, "id": "K1QK2dyVWiWfd32P8ovQK", @@ -811,11 +538,11 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1317.219552473588, - "y": 569.5880681818182, + "x": 1351.329545454545, + "y": 230.58806818181824, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 15.219985961914062, + "width": 17, "height": 25, "seed": 1350557773, "groupIds": [ @@ -824,7 +551,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708989657, + "updated": 1737530583904, "link": null, "locked": false, "fontSize": 20, @@ -839,8 +566,8 @@ }, { "type": "rectangle", - "version": 680, - "versionNonce": 1002365620, + "version": 777, + "versionNonce": 1889202072, "index": "b5l", "isDeleted": false, "id": "Lf5-FqrnO7iDVhOKUtEnT", @@ -850,8 +577,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1306.9204545454545, - "y": 619.3267045454547, + "x": 1341.9204545454545, + "y": 280.32670454545473, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -870,14 +597,14 @@ "id": "cTJ-8HZCMcNbXqDHggxAH" } ], - "updated": 1726708989657, + "updated": 1737528940801, "link": null, "locked": false }, { "type": "text", - "version": 375, - "versionNonce": 213412916, + "version": 472, + "versionNonce": 331955352, "index": "b5m", "isDeleted": false, "id": "cTJ-8HZCMcNbXqDHggxAH", @@ -887,8 +614,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1324.2668248956852, - "y": 624.3267045454547, + "x": 1359.2668248956852, + "y": 
285.32670454545473, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 12.579986572265625, @@ -900,7 +627,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708989657, + "updated": 1737528940801, "link": null, "locked": false, "fontSize": 20, @@ -915,8 +642,8 @@ }, { "type": "text", - "version": 141, - "versionNonce": 1757726132, + "version": 238, + "versionNonce": 900065688, "index": "b5n", "isDeleted": false, "id": "LK6nmMo09HhGvAeViRfcK", @@ -926,8 +653,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1274.397727272727, - "y": 523.3664772727274, + "x": 1309.397727272727, + "y": 184.36647727272737, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 12, @@ -939,7 +666,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708989657, + "updated": 1737528940801, "link": null, "locked": false, "fontSize": 20, @@ -954,8 +681,8 @@ }, { "type": "text", - "version": 196, - "versionNonce": 761917108, + "version": 294, + "versionNonce": 1508025832, "index": "b5o", "isDeleted": false, "id": "LbPBuhQ2btuEnjbeSDvuK", @@ -965,8 +692,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1278.397727272727, - "y": 569.6164772727275, + "x": 1313.397727272727, + "y": 230.61647727272748, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 11, @@ -977,8 +704,13 @@ ], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1726708993287, + "boundElements": [ + { + "id": "cXhTkxU13WdQeAv3Z_1mR", + "type": "arrow" + } + ], + "updated": 1737528943380, "link": null, "locked": false, "fontSize": 20, @@ -993,8 +725,8 @@ }, { "type": "text", - "version": 385, - "versionNonce": 800257204, + "version": 484, + "versionNonce": 1538941848, "index": "b5p", "isDeleted": false, "id": "tEnh5H4Dm1tA62FJY7ZnT", @@ -1004,8 +736,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1279.647727272727, - "y": 629.6164772727275, + "x": 1314.647727272727, + "y": 290.6164772727275, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 11, @@ -1017,7 +749,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726709003336, + "updated": 1737528940801, "link": null, "locked": false, "fontSize": 20, @@ -1032,8 +764,8 @@ }, { "type": "text", - "version": 307, - "versionNonce": 51819060, + "version": 406, + "versionNonce": 313505768, "index": "b5q", "isDeleted": false, "id": "TExMhRi4612k0BcybcpHE", @@ -1043,8 +775,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1251.2855058149858, - "y": 678.5113636363637, + "x": 1286.2855058149858, + "y": 339.51136363636374, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 143.59986877441406, @@ -1056,7 +788,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708989657, + "updated": 1737530582726, "link": null, "locked": false, "fontSize": 20, @@ -1069,243 +801,28 @@ "autoResize": true, "lineHeight": 1.25 }, - { - "type": "arrow", - "version": 1039, - "versionNonce": 199529869, - "index": "b5r", - "isDeleted": false, - "id": "KvvwHoDnDT0vBh2bOfiTz", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 0, - "opacity": 100, - "angle": 0, - "x": 1245.243474938044, - "y": 579.5733033689121, - "strokeColor": "#2f9e44", - "backgroundColor": "#b2f2bb", - "width": 192.8960407851796, - "height": 1.126120534903066, - "seed": 1004556899, - "groupIds": [], - "frameId": null, - "roundness": { - "type": 2 - }, - "boundElements": [], - "updated": 1726188444758, - 
"link": null, - "locked": false, - "startBinding": null, - "endBinding": null, - "lastCommittedPoint": null, - "startArrowhead": null, - "endArrowhead": "arrow", - "points": [ - [ - 0, - 0 - ], - [ - -192.8960407851796, - 1.126120534903066 - ] - ] - }, - { - "type": "text", - "version": 989, - "versionNonce": 923042467, - "index": "b5s", - "isDeleted": false, - "id": "cPSHqIr9Peb5h5TNxl3Bb", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 0, - "opacity": 100, - "angle": 0, - "x": 1100.5103669600053, - "y": 536.2049461276495, - "strokeColor": "#1e1e1e", - "backgroundColor": "#b2f2bb", - "width": 138.99639892578125, - "height": 29.112835017428832, - "seed": 865272429, - "groupIds": [], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726188447614, - "link": null, - "locked": false, - "fontSize": 23.290268013943066, - "fontFamily": 1, - "text": "6. vectorize", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "6. vectorize", - "autoResize": true, - "lineHeight": 1.25 - }, - { - "type": "diamond", - "version": 103, - "versionNonce": 679668419, - "index": "b5vV", - "isDeleted": false, - "id": "tPvUjMUp7lW3F8V3H2MGV", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 960.0454545454546, - "y": 515.5113636363637, - "strokeColor": "#1e1e1e", - "backgroundColor": "#d0bfff", - "width": 63.75, - "height": 45, - "seed": 782762477, - "groupIds": [ - "CuM_sg3LC9KTYRVST18pX" - ], - "frameId": null, - "roundness": { - "type": 2 - }, - "boundElements": [], - "updated": 1726188516836, - "link": null, - "locked": false - }, - { - "type": "diamond", - "version": 117, - "versionNonce": 224511779, - "index": "b5w", - "isDeleted": false, - "id": "uOIVUAj_hGKNiZ3NnQm2n", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 961.9204545454546, - "y": 564.5113636363637, - "strokeColor": "#1e1e1e", - "backgroundColor": "#d0bfff", - "width": 63.75, - "height": 45, - "seed": 1245990083, - "groupIds": [ - "CuM_sg3LC9KTYRVST18pX" - ], - "frameId": null, - "roundness": { - "type": 2 - }, - "boundElements": [], - "updated": 1726188516836, - "link": null, - "locked": false - }, - { - "type": "diamond", - "version": 122, - "versionNonce": 1205596301, - "index": "b5x", - "isDeleted": false, - "id": "ylh6O0GmjhRAHndHyuEo2", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 966.9204545454546, - "y": 615.7613636363637, - "strokeColor": "#1e1e1e", - "backgroundColor": "#d0bfff", - "width": 63.75, - "height": 45, - "seed": 499397773, - "groupIds": [ - "CuM_sg3LC9KTYRVST18pX" - ], - "frameId": null, - "roundness": { - "type": 2 - }, - "boundElements": [], - "updated": 1726188516836, - "link": null, - "locked": false - }, - { - "type": "text", - "version": 260, - "versionNonce": 1136192621, - "index": "b5y", - "isDeleted": false, - "id": "ekXIjXxtZ6f2w_A-9CVUV", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 938.2855058149859, - "y": 670.7613636363637, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 107.5399169921875, - "height": 25, - "seed": 1616985635, - "groupIds": [], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726188507123, - "link": null, - "locked": false, - 
"fontSize": 20, - "fontFamily": 5, - "text": "embeddings", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "embeddings", - "autoResize": true, - "lineHeight": 1.25 - }, { "type": "rectangle", - "version": 381, - "versionNonce": 1618061620, - "index": "b5z", + "version": 589, + "versionNonce": 1049638120, + "index": "b698", "isDeleted": false, - "id": "Uv-8TiLeECJuuNx1yJjtv", + "id": "JNHVvikjirDDllCKotbJC", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 768.5454545454545, - "y": 280.72727272727275, + "x": 844.9545454545454, + "y": 249.68750000000006, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 637818278, + "seed": 848769955, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": { @@ -1313,45 +830,45 @@ }, "boundElements": [ { - "id": "0wYqjwjKHCGbx7CfmDR__", - "type": "arrow" + "type": "text", + "id": "8Msc7tXcZdg2UUH2NmUn-" }, { - "type": "text", - "id": "B8Nj-HzRDl-FA-5UJ2hiw" + "id": "M_WCuesgPRdSQ_zqaUtz0", + "type": "arrow" } ], - "updated": 1726708776347, + "updated": 1737528714494, "link": null, "locked": false }, { "type": "text", - "version": 140, - "versionNonce": 1472181260, - "index": "b60", + "version": 348, + "versionNonce": 1968921752, + "index": "b69G", "isDeleted": false, - "id": "B8Nj-HzRDl-FA-5UJ2hiw", + "id": "8Msc7tXcZdg2UUH2NmUn-", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 783.2418233698064, - "y": 285.72727272727275, + "x": 859.6509142788972, + "y": 254.68750000000006, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 17.879989624023438, "height": 25, - "seed": 1971906541, + "seed": 1297532739, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false, "fontSize": 20, @@ -1359,33 +876,33 @@ "text": "A'", "textAlign": "center", "verticalAlign": "middle", - "containerId": "Uv-8TiLeECJuuNx1yJjtv", + "containerId": "JNHVvikjirDDllCKotbJC", "originalText": "A'", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 391, - "versionNonce": 1280205492, - "index": "b61", + "version": 626, + "versionNonce": 1609828760, + "index": "b69O", "isDeleted": false, - "id": "l7XMM15Xwzq5xmDF0QvyN", + "id": "fkbHGW5tJ-Ay0sh8h-9hJ", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 764.090909090909, - "y": 186.09090909090912, + "x": 841.4999999999999, + "y": 156.05113636363643, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 1556091898, + "seed": 2116216547, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": { @@ -1394,40 +911,40 @@ "boundElements": [ { "type": "text", - "id": "SZp9x_uNQ-65LQPMQ768C" + "id": "BNiP4zX7PtFTn_e_5vXX3" } ], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false }, { "type": "text", - "version": 132, - "versionNonce": 809849484, - "index": "b62", + "version": 369, + "versionNonce": 753866392, + "index": "b69V", "isDeleted": false, - "id": "SZp9x_uNQ-65LQPMQ768C", + "id": "BNiP4zX7PtFTn_e_5vXX3", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, 
"opacity": 100, "angle": 0, - "x": 780.9672782204367, - "y": 191.09090909090912, + "x": 858.3763691295275, + "y": 161.05113636363643, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 13.519989013671875, "height": 25, - "seed": 912377443, + "seed": 1804210819, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false, "fontSize": 20, @@ -1435,83 +952,75 @@ "text": "A", "textAlign": "center", "verticalAlign": "middle", - "containerId": "l7XMM15Xwzq5xmDF0QvyN", + "containerId": "fkbHGW5tJ-Ay0sh8h-9hJ", "originalText": "A", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 413, - "versionNonce": 1599597620, - "index": "b63", + "version": 619, + "versionNonce": 553681816, + "index": "b69d", "isDeleted": false, - "id": "Wxv71stEiYRpNjyhzzXgO", + "id": "QYKbNgibs7-HxaNNr8tfG", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 767.1818181818182, - "y": 234.27272727272725, + "x": 843.5909090909089, + "y": 203.23295454545456, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 775085434, + "seed": 1716177443, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": { "type": 3 }, "boundElements": [ - { - "id": "0wYqjwjKHCGbx7CfmDR__", - "type": "arrow" - }, - { - "id": "FVhCmDYbWjGck9rgcESwp", - "type": "arrow" - }, { "type": "text", - "id": "zyU1230-bmsHaQTSoi7Ov" + "id": "C-rwFmAbwI_qgVqpkXy7m" } ], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false }, { "type": "text", - "version": 102, - "versionNonce": 1402151180, - "index": "b64", + "version": 310, + "versionNonce": 1247563928, + "index": "b69l", "isDeleted": false, - "id": "zyU1230-bmsHaQTSoi7Ov", + "id": "C-rwFmAbwI_qgVqpkXy7m", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 783.2081888372248, - "y": 239.27272727272725, + "x": 859.6172797463154, + "y": 208.23295454545456, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 15.219985961914062, "height": 25, - "seed": 1842733667, + "seed": 592678339, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false, "fontSize": 20, @@ -1519,33 +1028,33 @@ "text": "B", "textAlign": "center", "verticalAlign": "middle", - "containerId": "Wxv71stEiYRpNjyhzzXgO", + "containerId": "QYKbNgibs7-HxaNNr8tfG", "originalText": "B", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 397, - "versionNonce": 997475764, - "index": "b65", + "version": 714, + "versionNonce": 1354136984, + "index": "b69t", "isDeleted": false, - "id": "IkaeA2i4mlTdmulYEI_na", + "id": "m2Wj9fp76PKCAhrulCmTa", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 771.3636363636363, - "y": 325.3636363636364, + "x": 846.3181818181819, + "y": 339.97159090909105, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 1839286010, + "seed": 901963107, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": { @@ -1554,265 +1063,1493 @@ 
"boundElements": [ { "type": "text", - "id": "IgKDOIQhfqb_x9gQh30eh" + "id": "MNgTOO1UYazXucNSjXZ_z" } ], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false }, { "type": "text", - "version": 89, - "versionNonce": 421732236, - "index": "b66", + "version": 409, + "versionNonce": 1162021528, + "index": "b6A", "isDeleted": false, - "id": "IgKDOIQhfqb_x9gQh30eh", + "id": "MNgTOO1UYazXucNSjXZ_z", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 787.3900070190429, - "y": 330.3636363636364, + "x": 863.6645521684126, + "y": 344.97159090909105, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 15.219985961914062, + "width": 12.579986572265625, "height": 25, - "seed": 1893385699, + "seed": 1223112963, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false, "fontSize": 20, "fontFamily": 5, - "text": "B", + "text": "C", "textAlign": "center", "verticalAlign": "middle", - "containerId": "IkaeA2i4mlTdmulYEI_na", - "originalText": "B", + "containerId": "m2Wj9fp76PKCAhrulCmTa", + "originalText": "C", "autoResize": true, "lineHeight": 1.25 }, { - "type": "rectangle", - "version": 440, - "versionNonce": 1439264564, - "index": "b67", + "type": "text", + "version": 188, + "versionNonce": 1924528024, + "index": "b6AG", "isDeleted": false, - "id": "qGfihx9_lQSyc1F8oQTu0", + "id": "J1KVE_C00rdGo7FWIwu1X", "fillStyle": "solid", - "strokeWidth": 1, + "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 772.909090909091, - "y": 369.01136363636374, + "x": 817.7954545454544, + "y": 162.01136363636374, "strokeColor": "#e03131", - "backgroundColor": "#ffc9c9", - "width": 47.27272727272725, - "height": 35, - "seed": 1381062179, + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1442121325, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, - "roundness": { - "type": 3 - }, - "boundElements": [ - { - "type": "text", - "id": "0DIl-np94wHje4sIubFJp" - } - ], - "updated": 1726708776347, + "roundness": null, + "boundElements": [], + "updated": 1737528708101, "link": null, - "locked": false + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 }, { "type": "text", - "version": 133, - "versionNonce": 1496272396, - "index": "b68", + "version": 242, + "versionNonce": 759383192, + "index": "b6AV", "isDeleted": false, - "id": "0DIl-np94wHje4sIubFJp", + "id": "TIEDsM4QhNNDJARAJnvDz", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 790.2554612593218, - "y": 374.01136363636374, - "strokeColor": "#1e1e1e", + "x": 820.7954545454544, + "y": 208.26136363636374, + "strokeColor": "#e03131", "backgroundColor": "transparent", - "width": 12.579986572265625, + "width": 11, "height": 25, - "seed": 1722325443, + "seed": 846611715, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "ssihZCwGeFNCQehvjAg06" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708776347, + "updated": 1737528708101, "link": null, "locked": false, "fontSize": 20, - "fontFamily": 5, - "text": "C", - "textAlign": "center", - "verticalAlign": "middle", - 
"containerId": "qGfihx9_lQSyc1F8oQTu0", - "originalText": "C", - "autoResize": true, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 290, + "versionNonce": 580841880, + "index": "b6Al", + "isDeleted": false, + "id": "tGvqUuD_kCzfMYn-UX8o-", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 823.2954545454544, + "y": 257.01136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 758667053, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528708101, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 421, + "versionNonce": 704446104, + "index": "b6B", + "isDeleted": false, + "id": "IQM8OVr381UGBDKQtda8U", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 823.0454545454544, + "y": 345.26136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 618433805, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528708101, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 672, + "versionNonce": 336685976, + "index": "b6BV", + "isDeleted": false, + "id": "fJGd6Pf-SaTmbDMUGHhUW", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 847.3972327492455, + "y": 296.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1491526540, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "Ax-8fSsrXvrkMhlGAgJgO" + } + ], + "updated": 1737528708101, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 363, + "versionNonce": 2064660632, + "index": "b6C", + "isDeleted": false, + "id": "Ax-8fSsrXvrkMhlGAgJgO", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 863.423603404652, + "y": 301.2812500000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1943704076, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528708101, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "fJGd6Pf-SaTmbDMUGHhUW", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 320, + "versionNonce": 313353624, + "index": "b6CV", + "isDeleted": false, + "id": "07qZABiLS71UbigBsFpnK", + 
"fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 821.033596385609, + "y": 301.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1965424820, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528708101, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "4", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "4", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 2745, + "versionNonce": 1420536808, + "index": "b6D", + "isDeleted": false, + "id": "M_WCuesgPRdSQ_zqaUtz0", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 905.532130562785, + "y": 274.97561555378826, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "width": 162.00146582282412, + "height": 0.6286347709357187, + "seed": 1489010356, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "startBinding": { + "elementId": "JNHVvikjirDDllCKotbJC", + "focus": 0.4403861575576877, + "gap": 13.304857835512394, + "fixedPoint": null + }, + "endBinding": { + "elementId": "NxUqy-MsYDga_9XDrU9l7", + "focus": -0.04300532190875777, + "gap": 1, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 162.00146582282412, + -0.6286347709357187 + ] + ] + }, + { + "type": "text", + "version": 311, + "versionNonce": 212346088, + "index": "b6D8", + "isDeleted": false, + "id": "ZGkHBN9UBrJLYPIlm-KTj", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1062.555487199263, + "y": 410.51136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 118.5198974609375, + "height": 50, + "seed": 1591407981, + "groupIds": [ + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897882, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "duplicate 'B'\nis removed", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "duplicate 'B'\nis removed", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 285, + "versionNonce": 1763919848, + "index": "b6DG", + "isDeleted": false, + "id": "wkavhEPwz2TNGwf8xFeLA", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1065.0335963856091, + "y": 172.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 809955212, + "groupIds": [ + "uHtPh4-PiLJtgc-p_Cdgo", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897882, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 653, + "versionNonce": 1883376360, + "index": "b6DO", + 
"isDeleted": false, + "id": "Qaz1byDgzm-0ZrVLBmU4v", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1089.9545454545455, + "y": 257.1875000000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 144156909, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "D2HbgzHXdGyxGppwaWbBy" + }, + { + "id": "cXhTkxU13WdQeAv3Z_1mR", + "type": "arrow" + } + ], + "updated": 1737528897883, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 410, + "versionNonce": 1998221544, + "index": "b6DV", + "isDeleted": false, + "id": "D2HbgzHXdGyxGppwaWbBy", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1104.6509142788973, + "y": 262.1875000000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + "height": 25, + "seed": 2062418765, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Qaz1byDgzm-0ZrVLBmU4v", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 683, + "versionNonce": 1735136232, + "index": "b6Dd", + "isDeleted": false, + "id": "-LxVJeZLqj0MgI5FEg_pm", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1082.5, + "y": 163.55113636363643, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1514803629, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "trFDjiJr6cfNlCSEKqNjE" + } + ], + "updated": 1737528897883, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 425, + "versionNonce": 1133598440, + "index": "b6Dl", + "isDeleted": false, + "id": "trFDjiJr6cfNlCSEKqNjE", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1099.3763691295276, + "y": 168.55113636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 1674925069, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "-LxVJeZLqj0MgI5FEg_pm", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 680, + "versionNonce": 269892072, + "index": "b6E", + "isDeleted": false, + "id": "Kxu9owye4gMpRvh7kJ1Nl", + "fillStyle": "solid", + "strokeWidth": 
1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1088.590909090909, + "y": 210.73295454545456, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1938377325, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "UP92rSYiIXnnBFhov6WNx" + } + ], + "updated": 1737528897883, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 370, + "versionNonce": 1611054312, + "index": "b6EG", + "isDeleted": false, + "id": "UP92rSYiIXnnBFhov6WNx", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1104.6172797463157, + "y": 215.73295454545456, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 707753165, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Kxu9owye4gMpRvh7kJ1Nl", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 707, + "versionNonce": 82763752, + "index": "b6EV", + "isDeleted": false, + "id": "KMOsOR4pOx-ute2ztnw1k", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1094.318181818182, + "y": 345.4715909090911, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 635317229, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "SsRO-f6mzQzf5jQOudz6C" + } + ], + "updated": 1737528897883, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 401, + "versionNonce": 1054515944, + "index": "b6El", + "isDeleted": false, + "id": "SsRO-f6mzQzf5jQOudz6C", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1111.6645521684127, + "y": 350.4715909090911, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1382819405, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "KMOsOR4pOx-ute2ztnw1k", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 319, + "versionNonce": 1817576936, + "index": "b6F", + "isDeleted": false, + "id": "US1PK13ekocRlMvOrHSJL", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1066.0335963856091, + "y": 215.2812500000001, + "strokeColor": 
"#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1525760780, + "groupIds": [ + "bQ__H1TgpJXskAm32UBLZ", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 357, + "versionNonce": 980224232, + "index": "b6FV", + "isDeleted": false, + "id": "NxUqy-MsYDga_9XDrU9l7", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1068.5335963856091, + "y": 261.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1116920372, + "groupIds": [ + "4mN8vM1PMjtKHfzWdqXES", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "M_WCuesgPRdSQ_zqaUtz0", + "type": "arrow" + } + ], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 353, + "versionNonce": 354283240, + "index": "b6G", + "isDeleted": false, + "id": "lSEPKkiY8if2M9pDun8DS", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1071.5335963856091, + "y": 354.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 932194828, + "groupIds": [ + "Z8bVLPerSCYHViV4Ld1Ed", + "XEHMHITFJTjudNYgVFCPu", + "vyfIXhnJpss6uiuzFKps6", + "UUMeFgK8RcVkGIGDsRBi8" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528897883, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 145, + "versionNonce": 56362904, + "index": "b6Q", + "isDeleted": false, + "id": "9Bwc8DwyPnrOxUQpApvfU", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 257.30863987315786, + "y": 383.5312500000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 103.71990966796875, + "height": 50, + "seed": 1385699816, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528426042, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "PDF \ndocuments", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "PDF \ndocuments", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 494, + "versionNonce": 1068503272, + "index": "b6R", + "isDeleted": false, + "id": "QSiEFZIoz081ipwdmU8sg", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 251.34862552989614, + "y": 242.95738636363643, + "strokeColor": "#e03131", + 
"backgroundColor": "#b2f2bb", + "width": 47.27272727272725, + "height": 35, + "seed": 1529123224, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "_Z-rRn1k6dRs-cBIHwwQY" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1737528651437, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 265, + "versionNonce": 1790196968, + "index": "b6S", + "isDeleted": false, + "id": "_Z-rRn1k6dRs-cBIHwwQY", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 265.2249946594238, + "y": 247.95738636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 19.519989013671875, + "height": 25, + "seed": 13541016, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528539700, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "QSiEFZIoz081ipwdmU8sg", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 505, + "versionNonce": 48835560, + "index": "b6T", + "isDeleted": false, + "id": "3xE7duRO9Qq4Sc-G2OvNv", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 246.89408007535064, + "y": 148.3210227272728, + "strokeColor": "#e03131", + "backgroundColor": "#b2f2bb", + "width": 47.27272727272725, + "height": 35, + "seed": 1605307288, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "Vb3hONt1wd7JHFzI3HmrQ" + } + ], + "updated": 1737528540117, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 262, + "versionNonce": 1551754904, + "index": "b6U", + "isDeleted": false, + "id": "Vb3hONt1wd7JHFzI3HmrQ", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 263.03044371171427, + "y": 153.3210227272728, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 15, + "height": 25, + "seed": 1106892952, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528540117, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "3xE7duRO9Qq4Sc-G2OvNv", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 525, + "versionNonce": 225964696, + "index": "b6V", + "isDeleted": false, + "id": "ooV7vvmtMmdPRnQmMHBmf", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 249.98498916625965, + "y": 196.50284090909093, + "strokeColor": "#e03131", + "backgroundColor": "#b2f2bb", + "width": 47.27272727272725, + "height": 35, + "seed": 191038872, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "_rMbVkq-GLuJSkRWHvjkn" + } + ], + "updated": 1737528539700, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 227, + "versionNonce": 472392424, + "index": "b6W", + 
"isDeleted": false, + "id": "_rMbVkq-GLuJSkRWHvjkn", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 265.1213528026233, + "y": 201.50284090909093, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 17, + "height": 25, + "seed": 152998552, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528539700, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "ooV7vvmtMmdPRnQmMHBmf", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 510, + "versionNonce": 768826600, + "index": "b6X", + "isDeleted": false, + "id": "JUjlPmSPagKyAA6ikwVcf", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 254.16680734807767, + "y": 287.59375000000006, + "strokeColor": "#e03131", + "backgroundColor": "#b2f2bb", + "width": 47.27272727272725, + "height": 35, + "seed": 1105231768, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "Tov62fM0_erGxbIhudlqt" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1737528566266, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 214, + "versionNonce": 1140033000, + "index": "b6Y", + "isDeleted": false, + "id": "Tov62fM0_erGxbIhudlqt", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 269.3031709844413, + "y": 292.59375000000006, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 17, + "height": 25, + "seed": 1172098200, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1737528539700, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "JUjlPmSPagKyAA6ikwVcf", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 578, + "versionNonce": 1264463000, + "index": "b6Z", + "isDeleted": false, + "id": "4cU98zwq8Qi78OlWyES2s", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 255.71226189353263, + "y": 331.2414772727274, + "strokeColor": "#e03131", + "backgroundColor": "#b2f2bb", + "width": 47.27272727272725, + "height": 35, + "seed": 2127002008, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "hDWulD4JcLixt2n_PIyWF" + } + ], + "updated": 1737528539700, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 284, + "versionNonce": 1113229544, + "index": "b6a", + "isDeleted": false, + "id": "hDWulD4JcLixt2n_PIyWF", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 272.34862552989625, + "y": 336.2414772727274, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 14, + "height": 25, + "seed": 2144634520, + "groupIds": [ + "syqTr4z_spUvkhxRP2GMv" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + 
"updated": 1737528539700, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "4cU98zwq8Qi78OlWyES2s", + "originalText": "C", + "autoResize": true, "lineHeight": 1.25 }, { - "type": "text", - "version": 70, - "versionNonce": 247294132, - "index": "b69", + "type": "image", + "version": 295, + "versionNonce": 1682243816, + "index": "b6d", "isDeleted": false, - "id": "lkM4ke2d8E4KSisX5yE08", + "id": "XH-Rt0Q5-K2g4tM9reh76", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 762.5454545454546, - "y": 429.51136363636374, - "strokeColor": "#1e1e1e", - "backgroundColor": "#d0bfff", - "width": 64.55995178222656, - "height": 25, - "seed": 1905848653, + "x": 510.8409090909091, + "y": 143.88636363636368, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 60.17910447761194, + "height": 60.17910447761194, + "seed": 1159948140, "groupIds": [ - "wECUsJGvuBUaz0aXhNgT4" + "KGVjVuaPc35r3zwmLpo6p" ], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1726708776347, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + } + ], + "updated": 1737528662022, "link": null, "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "chunks", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "chunks", - "autoResize": true, - "lineHeight": 1.25 + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ], + "crop": null }, { - "type": "rectangle", - "version": 527, - "versionNonce": 1269467404, - "index": "b698", + "type": "image", + "version": 344, + "versionNonce": 276052968, + "index": "b6e", "isDeleted": false, - "id": "JNHVvikjirDDllCKotbJC", + "id": "YFlD_rDw6IwCctPG9BjYf", "fillStyle": "solid", - "strokeWidth": 1, + "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1025.9545454545455, - "y": 275.68750000000006, - "strokeColor": "#e03131", - "backgroundColor": "#ffc9c9", - "width": 47.27272727272725, - "height": 35, - "seed": 848769955, + "x": 510.8409090909091, + "y": 209.70725915875175, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 60.17910447761194, + "height": 60.17910447761194, + "seed": 1369151980, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "KGVjVuaPc35r3zwmLpo6p" ], "frameId": null, - "roundness": { - "type": 3 - }, + "roundness": null, "boundElements": [ { - "type": "text", - "id": "8Msc7tXcZdg2UUH2NmUn-" + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + }, + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" } ], - "updated": 1726708934863, + "updated": 1737528663639, "link": null, - "locked": false + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ], + "crop": null }, { - "type": "text", - "version": 287, - "versionNonce": 1779271564, - "index": "b69G", + "type": "image", + "version": 375, + "versionNonce": 1533627624, + "index": "b6f", "isDeleted": false, - "id": "8Msc7tXcZdg2UUH2NmUn-", + "id": "7R-AwuwB2mlKHQ4TA3v7g", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1040.6509142788973, - "y": 280.68750000000006, - "strokeColor": "#1e1e1e", + "x": 507.5390491822035, + "y": 280.3521455223882, + "strokeColor": "transparent", "backgroundColor": "transparent", - 
"width": 17.879989624023438, - "height": 25, - "seed": 1297532739, + "width": 60.17910447761194, + "height": 60.17910447761194, + "seed": 1189477272, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "KGVjVuaPc35r3zwmLpo6p" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737528662023, "link": null, "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "A'", - "textAlign": "center", - "verticalAlign": "middle", - "containerId": "JNHVvikjirDDllCKotbJC", - "originalText": "A'", - "autoResize": true, - "lineHeight": 1.25 + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ], + "crop": null }, { "type": "rectangle", - "version": 565, - "versionNonce": 1888269836, - "index": "b69O", + "version": 804, + "versionNonce": 602477288, + "index": "b6g", "isDeleted": false, - "id": "fkbHGW5tJ-Ay0sh8h-9hJ", + "id": "e4ecV_y0ryxDQzzpC-xuB", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1022.5, - "y": 182.05113636363643, + "x": 1480.6454339460893, + "y": 499.97869318181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 2116216547, + "seed": 1087979672, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": { @@ -1821,40 +2558,40 @@ "boundElements": [ { "type": "text", - "id": "BNiP4zX7PtFTn_e_5vXX3" + "id": "uQnFGHOdIKBjcans1vzUh" } ], - "updated": 1726708934863, + "updated": 1737530585213, "link": null, "locked": false }, { "type": "text", - "version": 308, - "versionNonce": 1814172812, - "index": "b69V", + "version": 548, + "versionNonce": 957607832, + "index": "b6h", "isDeleted": false, - "id": "BNiP4zX7PtFTn_e_5vXX3", + "id": "uQnFGHOdIKBjcans1vzUh", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1039.3763691295276, - "y": 187.05113636363643, + "x": 1496.7817975824528, + "y": 504.97869318181824, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 13.519989013671875, + "width": 15, "height": 25, - "seed": 1804210819, + "seed": 1242918296, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585213, "link": null, "locked": false, "fontSize": 20, @@ -1862,33 +2599,33 @@ "text": "A", "textAlign": "center", "verticalAlign": "middle", - "containerId": "fkbHGW5tJ-Ay0sh8h-9hJ", + "containerId": "e4ecV_y0ryxDQzzpC-xuB", "originalText": "A", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 558, - "versionNonce": 981967628, - "index": "b69d", + "version": 797, + "versionNonce": 102135272, + "index": "b6i", "isDeleted": false, - "id": "QYKbNgibs7-HxaNNr8tfG", + "id": "_NOEhFqnCLHtq6yXXa5Ft", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1024.590909090909, - "y": 229.23295454545456, + "x": 1482.7363430369983, + "y": 547.1605113636365, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 1716177443, + "seed": 356776600, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": { @@ -1897,40 +2634,40 @@ "boundElements": [ { "type": "text", - "id": "C-rwFmAbwI_qgVqpkXy7m" + "id": "J3LCjL2uxV-fjOQWF1Nyl" } ], - "updated": 1726708934863, + "updated": 
1737530585214, "link": null, "locked": false }, { "type": "text", - "version": 249, - "versionNonce": 1916232076, - "index": "b69l", + "version": 489, + "versionNonce": 1696742552, + "index": "b6j", "isDeleted": false, - "id": "C-rwFmAbwI_qgVqpkXy7m", + "id": "J3LCjL2uxV-fjOQWF1Nyl", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1040.6172797463155, - "y": 234.23295454545456, + "x": 1497.8727066733618, + "y": 552.1605113636365, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 15.219985961914062, + "width": 17, "height": 25, - "seed": 592678339, + "seed": 1964566424, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -1938,33 +2675,33 @@ "text": "B", "textAlign": "center", "verticalAlign": "middle", - "containerId": "QYKbNgibs7-HxaNNr8tfG", + "containerId": "_NOEhFqnCLHtq6yXXa5Ft", "originalText": "B", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 653, - "versionNonce": 1248546828, - "index": "b69t", + "version": 910, + "versionNonce": 580876520, + "index": "b6k", "isDeleted": false, - "id": "m2Wj9fp76PKCAhrulCmTa", - "fillStyle": "solid", + "id": "JQQ2WM4JRpHcVDQ6tWh9E", + "fillStyle": "cross-hatch", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1027.318181818182, - "y": 365.97159090909105, + "x": 1488.4636157642713, + "y": 601.899147727273, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 901963107, + "seed": 1170748568, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": { @@ -1973,40 +2710,40 @@ "boundElements": [ { "type": "text", - "id": "MNgTOO1UYazXucNSjXZ_z" + "id": "-t96Vcbd_pHmWnfG-tPFY" } ], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false }, { "type": "text", - "version": 348, - "versionNonce": 52260492, - "index": "b6A", + "version": 602, + "versionNonce": 1943988632, + "index": "b6l", "isDeleted": false, - "id": "MNgTOO1UYazXucNSjXZ_z", + "id": "-t96Vcbd_pHmWnfG-tPFY", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1044.6645521684127, - "y": 370.97159090909105, + "x": 1505.0999794006348, + "y": 606.899147727273, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 12.579986572265625, + "width": 14, "height": 25, - "seed": 1223112963, + "seed": 1023795608, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -2014,38 +2751,38 @@ "text": "C", "textAlign": "center", "verticalAlign": "middle", - "containerId": "m2Wj9fp76PKCAhrulCmTa", + "containerId": "JQQ2WM4JRpHcVDQ6tWh9E", "originalText": "C", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 127, - "versionNonce": 1292352780, - "index": "b6AG", + "version": 365, + "versionNonce": 1829772264, + "index": "b6m", "isDeleted": false, - "id": "J1KVE_C00rdGo7FWIwu1X", + "id": "VdLIGckmm2zBfC3i4wvrn", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 998.7954545454545, - "y": 188.01136363636374, + "x": 
1455.9408884915438, + "y": 505.9389204545456, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 12, "height": 25, - "seed": 1442121325, + "seed": 973467288, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -2060,31 +2797,36 @@ }, { "type": "text", - "version": 181, - "versionNonce": 832846732, - "index": "b6AV", + "version": 424, + "versionNonce": 1974063512, + "index": "b6n", "isDeleted": false, - "id": "TIEDsM4QhNNDJARAJnvDz", + "id": "KCk9Ks3UrLoOid_qWtcKt", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1001.7954545454545, - "y": 234.26136363636374, + "x": 1459.9408884915438, + "y": 552.1889204545457, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 11, "height": 25, - "seed": 846611715, + "seed": 360471448, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1726708934863, + "boundElements": [ + { + "id": "uJzNGI-VzOHyMa0kMCtyo", + "type": "arrow" + } + ], + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -2099,382 +2841,289 @@ }, { "type": "text", - "version": 229, - "versionNonce": 2066541068, - "index": "b6Al", - "isDeleted": false, - "id": "tGvqUuD_kCzfMYn-UX8o-", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1004.2954545454545, - "y": 283.01136363636374, - "strokeColor": "#e03131", - "backgroundColor": "transparent", - "width": 12, - "height": 25, - "seed": 758667053, - "groupIds": [ - "ssihZCwGeFNCQehvjAg06" - ], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726708934863, - "link": null, - "locked": false, - "fontSize": 20, - "fontFamily": 8, - "text": "3", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "3", - "autoResize": true, - "lineHeight": 1.25 - }, - { - "type": "text", - "version": 360, - "versionNonce": 479971468, - "index": "b6B", + "version": 611, + "versionNonce": 125066984, + "index": "b6o", "isDeleted": false, - "id": "IQM8OVr381UGBDKQtda8U", + "id": "uc2hgh9lXoidExmskulnJ", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1004.0454545454545, - "y": 371.26136363636374, + "x": 1461.1908884915438, + "y": 612.1889204545457, "strokeColor": "#e03131", "backgroundColor": "transparent", "width": 11, "height": 25, - "seed": 618433805, + "seed": 1906124952, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, "fontFamily": 8, "text": "5", - "textAlign": "left", - "verticalAlign": "top", - "containerId": null, - "originalText": "5", - "autoResize": true, - "lineHeight": 1.25 - }, - { - "type": "rectangle", - "version": 611, - "versionNonce": 430626572, - "index": "b6BV", - "isDeleted": false, - "id": "fJGd6Pf-SaTmbDMUGHhUW", - "fillStyle": "solid", - "strokeWidth": 1, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1028.3972327492456, - "y": 322.2812500000001, - "strokeColor": "#e03131", - "backgroundColor": "#ffc9c9", - "width": 47.27272727272725, - 
"height": 35, - "seed": 1491526540, - "groupIds": [ - "ssihZCwGeFNCQehvjAg06" - ], - "frameId": null, - "roundness": { - "type": 3 - }, - "boundElements": [ - { - "type": "text", - "id": "Ax-8fSsrXvrkMhlGAgJgO" - } - ], - "updated": 1726708934863, - "link": null, - "locked": false - }, - { - "type": "text", - "version": 302, - "versionNonce": 1859392908, - "index": "b6C", - "isDeleted": false, - "id": "Ax-8fSsrXvrkMhlGAgJgO", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1044.423603404652, - "y": 327.2812500000001, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 15.219985961914062, - "height": 25, - "seed": 1943704076, - "groupIds": [ - "ssihZCwGeFNCQehvjAg06" - ], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726708934863, - "link": null, - "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "B", - "textAlign": "center", - "verticalAlign": "middle", - "containerId": "fJGd6Pf-SaTmbDMUGHhUW", - "originalText": "B", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 259, - "versionNonce": 2035385356, - "index": "b6CV", + "version": 552, + "versionNonce": 531850136, + "index": "b6p", "isDeleted": false, - "id": "07qZABiLS71UbigBsFpnK", + "id": "vbXyYItXCJiZ95GHEna2G", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1002.0335963856091, - "y": 327.2812500000001, - "strokeColor": "#e03131", + "x": 1432.8286670338025, + "y": 661.083806818182, + "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 11, + "width": 197.33984375, "height": 25, - "seed": 1965424820, + "seed": 169629080, "groupIds": [ - "ssihZCwGeFNCQehvjAg06" + "D2eYatwoRT3Be3gQajaM5" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708934863, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, - "fontFamily": 8, - "text": "4", + "fontFamily": 5, + "text": "C is marked as spam", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "4", + "originalText": "C is marked as spam", "autoResize": true, "lineHeight": 1.25 }, { + "id": "-CNAjEmW6cbufb2V3aXbb", "type": "arrow", - "version": 2600, - "versionNonce": 1259679372, - "index": "b6D", - "isDeleted": false, - "id": "M_WCuesgPRdSQ_zqaUtz0", + "x": 1388.4659090909088, + "y": 250.5312500000001, + "width": 113.16269233010075, + "height": 228, + "angle": 0, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", "fillStyle": "solid", - "strokeWidth": 1, + "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, - "angle": 0, - "x": 1113.5321305627851, - "y": 279.97561555378826, - "strokeColor": "#2f9e44", - "backgroundColor": "transparent", - "width": 154.2895204048931, - "height": 2.3372664247598323, - "seed": 1489010356, "groupIds": [], "frameId": null, - "roundness": { - "type": 2 - }, + "index": "b6q", + "roundness": null, + "seed": 1354092264, + "version": 165, + "versionNonce": 464680344, + "isDeleted": false, "boundElements": [], - "updated": 1726708895234, + "updated": 1737530583905, "link": null, "locked": false, - "startBinding": null, - "endBinding": null, - "lastCommittedPoint": null, - "startArrowhead": null, - "endArrowhead": "arrow", "points": [ [ 0, 0 ], [ - 154.2895204048931, - 2.3372664247598323 + 113.16269233010075, + 0 + ], + 
[ + 113.16269233010075, + 228 ] - ] + ], + "lastCommittedPoint": null, + "startBinding": { + "elementId": "NzWqph0M7tEkeTDKLPGZR", + "focus": 0.4253246753246783, + "gap": 5.000000000000114, + "fixedPoint": [ + 1.1057692307692308, + 0.7126623376623391 + ] + }, + "endBinding": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "elbowed": true }, { "type": "text", - "version": 176, - "versionNonce": 14571020, - "index": "b6E", + "version": 1099, + "versionNonce": 1108693656, + "index": "b6s", "isDeleted": false, - "id": "wkavhEPwz2TNGwf8xFeLA", + "id": "ocrQNX8WLBEF3z4H5qV1Q", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", - "roughness": 1, + "roughness": 0, "opacity": 100, "angle": 0, - "x": 1263.0335963856091, - "y": 188.2812500000001, - "strokeColor": "#e03131", - "backgroundColor": "transparent", - "width": 12, - "height": 25, - "seed": 809955212, - "groupIds": [ - "uHtPh4-PiLJtgc-p_Cdgo" - ], + "x": 1506.5825046192517, + "y": 291.4184149825713, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 135.80796813964844, + "height": 58.225670034857664, + "seed": 1216046568, + "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708942969, + "updated": 1737529134305, "link": null, "locked": false, - "fontSize": 20, - "fontFamily": 8, - "text": "1", + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "5. document\nquality", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "1", + "originalText": "5. document\nquality", "autoResize": true, "lineHeight": 1.25 }, { - "type": "rectangle", - "version": 538, - "versionNonce": 1071049484, - "index": "b6F", + "type": "arrow", + "version": 1524, + "versionNonce": 2138633960, + "index": "b6t", "isDeleted": false, - "id": "Qaz1byDgzm-0ZrVLBmU4v", + "id": "uJzNGI-VzOHyMa0kMCtyo", "fillStyle": "solid", - "strokeWidth": 1, + "strokeWidth": 2, "strokeStyle": "solid", - "roughness": 1, + "roughness": 0, "opacity": 100, "angle": 0, - "x": 1288.9545454545455, - "y": 273.1875000000001, - "strokeColor": "#e03131", - "backgroundColor": "#ffc9c9", - "width": 47.27272727272725, - "height": 35, - "seed": 144156909, - "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" - ], + "x": 1450.701621813599, + "y": 572.658384798537, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 231.1460407851796, + "height": 1.29512872695625, + "seed": 772325608, + "groupIds": [], "frameId": null, "roundness": { - "type": 3 + "type": 2 }, - "boundElements": [ - { - "type": "text", - "id": "D2HbgzHXdGyxGppwaWbBy" - } - ], - "updated": 1726708966705, + "boundElements": [], + "updated": 1737530585216, "link": null, - "locked": false + "locked": false, + "startBinding": { + "elementId": "KCk9Ks3UrLoOid_qWtcKt", + "focus": -0.6425776620043193, + "gap": 9.23926667794467, + "fixedPoint": null + }, + "endBinding": { + "elementId": "TL7ufCnIHYiHVmKWJljll", + "focus": 0.14400907570834828, + "gap": 5.546510718694094, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + -231.1460407851796, + -1.29512872695625 + ] + ] }, { "type": "text", - "version": 296, - "versionNonce": 2108300212, - "index": "b6G", + "version": 1200, + "versionNonce": 800272536, + "index": "b6u", "isDeleted": false, - "id": "D2HbgzHXdGyxGppwaWbBy", + "id": "AWSDUNN6IaU5NZQ1ScgSU", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", - "roughness": 1, + "roughness": 0, 
"opacity": 100, "angle": 0, - "x": 1303.6509142788973, - "y": 278.1875000000001, + "x": 1276.7246173511853, + "y": 540.4184149825712, "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 17.879989624023438, - "height": 25, - "seed": 2062418765, - "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" - ], + "backgroundColor": "#b2f2bb", + "width": 124.44776916503906, + "height": 58.225670034857664, + "seed": 1343739368, + "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "A'", - "textAlign": "center", - "verticalAlign": "middle", - "containerId": "Qaz1byDgzm-0ZrVLBmU4v", - "originalText": "A'", + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "6. removing\nspam ..etc", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "6. removing\nspam ..etc", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 569, - "versionNonce": 509454732, - "index": "b6H", + "version": 896, + "versionNonce": 1019725032, + "index": "b6v", "isDeleted": false, - "id": "-LxVJeZLqj0MgI5FEg_pm", + "id": "Rdnl5GxK4pFbFoTLI-oOG", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1281.5, - "y": 179.55113636363643, + "x": 1164.6454339460893, + "y": 503.97869318181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 1514803629, + "seed": 1661634456, "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": { @@ -2483,41 +3132,40 @@ "boundElements": [ { "type": "text", - "id": "trFDjiJr6cfNlCSEKqNjE" + "id": "gfBsltp4ourNC3Fnk9ClO" } ], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false }, { "type": "text", - "version": 311, - "versionNonce": 1054115124, - "index": "b6I", + "version": 640, + "versionNonce": 674323864, + "index": "b6w", "isDeleted": false, - "id": "trFDjiJr6cfNlCSEKqNjE", + "id": "gfBsltp4ourNC3Fnk9ClO", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1298.3763691295276, - "y": 184.55113636363643, + "x": 1180.7817975824528, + "y": 508.97869318181824, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 13.519989013671875, + "width": 15, "height": 25, - "seed": 1674925069, + "seed": 1149621400, "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -2525,34 +3173,33 @@ "text": "A", "textAlign": "center", "verticalAlign": "middle", - "containerId": "-LxVJeZLqj0MgI5FEg_pm", + "containerId": "Rdnl5GxK4pFbFoTLI-oOG", "originalText": "A", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 566, - "versionNonce": 713594892, - "index": "b6J", + "version": 892, + "versionNonce": 1875358696, + "index": "b6x", "isDeleted": false, - "id": "Kxu9owye4gMpRvh7kJ1Nl", + "id": "TL7ufCnIHYiHVmKWJljll", "fillStyle": "solid", "strokeWidth": 1, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1287.590909090909, - "y": 226.73295454545456, + "x": 1166.7363430369983, + "y": 551.1605113636365, "strokeColor": 
"#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, "height": 35, - "seed": 1938377325, + "seed": 1393525144, "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": { @@ -2561,41 +3208,44 @@ "boundElements": [ { "type": "text", - "id": "UP92rSYiIXnnBFhov6WNx" + "id": "Qs_O62O1HCrusz6mXeH8i" + }, + { + "id": "uJzNGI-VzOHyMa0kMCtyo", + "type": "arrow" } ], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false }, { "type": "text", - "version": 256, - "versionNonce": 301317812, - "index": "b6K", + "version": 581, + "versionNonce": 711060120, + "index": "b6y", "isDeleted": false, - "id": "UP92rSYiIXnnBFhov6WNx", + "id": "Qs_O62O1HCrusz6mXeH8i", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1303.6172797463157, - "y": 231.73295454545456, + "x": 1181.8727066733618, + "y": 556.1605113636365, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 15.219985961914062, + "width": 17, "height": 25, - "seed": 707753165, + "seed": 500928152, "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, @@ -2603,206 +3253,125 @@ "text": "B", "textAlign": "center", "verticalAlign": "middle", - "containerId": "Kxu9owye4gMpRvh7kJ1Nl", + "containerId": "TL7ufCnIHYiHVmKWJljll", "originalText": "B", "autoResize": true, "lineHeight": 1.25 }, - { - "type": "rectangle", - "version": 593, - "versionNonce": 5355148, - "index": "b6L", - "isDeleted": false, - "id": "KMOsOR4pOx-ute2ztnw1k", - "fillStyle": "solid", - "strokeWidth": 1, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1293.318181818182, - "y": 361.4715909090911, - "strokeColor": "#e03131", - "backgroundColor": "#ffc9c9", - "width": 47.27272727272725, - "height": 35, - "seed": 635317229, - "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" - ], - "frameId": null, - "roundness": { - "type": 3 - }, - "boundElements": [ - { - "type": "text", - "id": "SsRO-f6mzQzf5jQOudz6C" - } - ], - "updated": 1726708966705, - "link": null, - "locked": false - }, - { - "type": "text", - "version": 287, - "versionNonce": 800311348, - "index": "b6M", - "isDeleted": false, - "id": "SsRO-f6mzQzf5jQOudz6C", - "fillStyle": "solid", - "strokeWidth": 2, - "strokeStyle": "solid", - "roughness": 1, - "opacity": 100, - "angle": 0, - "x": 1310.6645521684127, - "y": 366.4715909090911, - "strokeColor": "#1e1e1e", - "backgroundColor": "transparent", - "width": 12.579986572265625, - "height": 25, - "seed": 1382819405, - "groupIds": [ - "bDrNCHlMlNcEbIn9yZXly", - "XEHMHITFJTjudNYgVFCPu" - ], - "frameId": null, - "roundness": null, - "boundElements": [], - "updated": 1726708966705, - "link": null, - "locked": false, - "fontSize": 20, - "fontFamily": 5, - "text": "C", - "textAlign": "center", - "verticalAlign": "middle", - "containerId": "KMOsOR4pOx-ute2ztnw1k", - "originalText": "C", - "autoResize": true, - "lineHeight": 1.25 - }, { "type": "text", - "version": 206, - "versionNonce": 745735436, - "index": "b6N", + "version": 457, + "versionNonce": 351906536, + "index": "b71", "isDeleted": false, - "id": "US1PK13ekocRlMvOrHSJL", + "id": "h9eneFYpYcKGCUroEQPXT", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 
100, "angle": 0, - "x": 1265.0335963856091, - "y": 231.2812500000001, + "x": 1139.9408884915438, + "y": 509.9389204545456, "strokeColor": "#e03131", "backgroundColor": "transparent", - "width": 11, + "width": 12, "height": 25, - "seed": 1525760780, + "seed": 2119562648, "groupIds": [ - "bQ__H1TgpJXskAm32UBLZ", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, "fontFamily": 8, - "text": "2", + "text": "1", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "2", + "originalText": "1", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 241, - "versionNonce": 1274323380, - "index": "b6O", + "version": 514, + "versionNonce": 284743576, + "index": "b72", "isDeleted": false, - "id": "NxUqy-MsYDga_9XDrU9l7", + "id": "2FH_CC-PbldTPMTV0l3zg", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1267.5335963856091, - "y": 277.2812500000001, + "x": 1143.9408884915438, + "y": 556.1889204545457, "strokeColor": "#e03131", "backgroundColor": "transparent", - "width": 12, + "width": 11, "height": 25, - "seed": 1116920372, + "seed": 3375768, "groupIds": [ - "4mN8vM1PMjtKHfzWdqXES", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, "fontFamily": 8, - "text": "3", + "text": "2", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "3", + "originalText": "2", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 240, - "versionNonce": 342262668, - "index": "b6P", + "version": 639, + "versionNonce": 961809896, + "index": "b74", "isDeleted": false, - "id": "lSEPKkiY8if2M9pDun8DS", + "id": "tn954yHWPQx-IDIpEMxaF", "fillStyle": "solid", "strokeWidth": 2, "strokeStyle": "solid", "roughness": 1, "opacity": 100, "angle": 0, - "x": 1270.5335963856091, - "y": 370.2812500000001, - "strokeColor": "#e03131", + "x": 1116.8286670338025, + "y": 665.083806818182, + "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 11, + "width": 135.03990173339844, "height": 25, - "seed": 932194828, + "seed": 1349893272, "groupIds": [ - "Z8bVLPerSCYHViV4Ld1Ed", - "XEHMHITFJTjudNYgVFCPu" + "xRJf_6pX20sfp3DbcQgRs" ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1726708966705, + "updated": 1737530585214, "link": null, "locked": false, "fontSize": 20, - "fontFamily": 8, - "text": "5", + "fontFamily": 5, + "text": "Spam removed", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "5", + "originalText": "Spam removed", "autoResize": true, "lineHeight": 1.25 } diff --git a/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png b/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png new file mode 100644 index 0000000000..f40893ac17 Binary files /dev/null and b/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png differ diff --git a/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb b/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb new file mode 100644 index 0000000000..a871a4b799 --- /dev/null +++ b/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb @@ -0,0 +1,2940 @@ +{ + "cells": [ + 
{ + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Processing PDFs using Data Prep Kit\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb)\n", + "\n", + "This notebook will introduce DPK and showcase some of it's capabilities.\n", + "\n", + "Here is the workflow:\n", + "\n", + "- pdf2parquet: Extract text from PDF documents\n", + "- docid: compute hashes\n", + "- exact dedupe : filter out identical documents\n", + "- fuzzy dedupe : filter out 'near duplicates'\n", + "- document quality: scoring documents for quality\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_python.ipynb)\n", + "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook will work as in both environments" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-1: Figure out Runtime Environment\n", + "\n", + "### 1.1 - Determine runtime\n", + "\n", + "Determine if we are running on Google colab or local python environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "39cc4e90-b230-4100-92c9-3aa3d977fa3d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 1.2 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "1fcec577", + "metadata": { + "id": "1fcec577" + }, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " ! 
pip install --default-timeout=100 \\n", + " data-prep-toolkit-transforms[all]==1.0.0 \\\n", + " humanfriendly" + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 1.3 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so the libraries will be loaded\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration & Utils" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "ad7fc57a-5229-4841-8d8a-23272aa5197d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "63d1d197-dfb1-4d6f-eb88-846bbbff1446" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " input_dir = \"input\"\n", + " shutil.os.makedirs(input_dir, exist_ok=True)\n", + "else:\n", + " input_dir = \"../../data-files/pdf-processing-1/\"\n", + " \n", + "output_dir = \"output\"\n", + "\n", + "output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')\n", + "output_docid_dir = os.path.join (output_dir, '02_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (output_dir, '03_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (output_dir, '04_fuzzy_dedupe_out')\n", + "output_doc_quality_dir = os.path.join (output_dir, '05_doc_quality_out')\n", + "output_final_dir = os.path.join (output_dir, 'output_final')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(output_dir, ignore_errors=True)\n", + "shutil.os.makedirs(output_dir, exist_ok=True)\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "14b2f34c", + "metadata": { + "id": "14b2f34c" + }, + "source": [ + "### 2.3 - Handy Utils" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "ba47a370", + "metadata": { + "id": "ba47a370" + }, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from humanfriendly import format_size\n", + "import pandas as pd\n", + "import glob\n", + "\n", + "## Reads parquet files in a folder into a pandas dataframe\n", + "def read_parquet_files_as_df (parquet_dir):\n", + " parquet_files = glob.glob(f'{parquet_dir}/*.parquet')\n", + 
" # read each parquet file into a DataFrame and store in a list\n", + " dfs = [pd.read_parquet (f) for f in parquet_files]\n", + " dfs = [df for df in dfs if not df.empty] # filter out empty dataframes\n", + " # Concatenate all DataFrames into a single DataFrame\n", + " if len(dfs) > 0:\n", + " data_df = pd.concat(dfs, ignore_index=True)\n", + " return data_df\n", + " else:\n", + " return pd.DataFrame() # return empty df\n", + "# ------------\n", + "\n", + "\n", + "def download_file(url, local_file, chunk_size=1024*1024):\n", + " \"\"\"\n", + " Downloads a remote URL to a local file.\n", + "\n", + " Args:\n", + " url (str): The remote URL.\n", + " local_filename (str): The name of the local file to save the downloaded content.\n", + " chunk_size (int): The size in bytes of each chunk. Defaults to 1024.\n", + "\n", + " Returns:\n", + " None\n", + "\n", + " Example usage:\n", + " download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024) # Download in chunks of 1MB\n", + " \"\"\"\n", + " # Check if the local file already exists\n", + " if os.path.exists(local_file):\n", + " file_size = format_size(os.path.getsize(local_file))\n", + " print(f\"Local file '{local_file}' ({file_size}) already exists. Skipping download.\")\n", + " return\n", + "\n", + " # Create the directory if it doesn't exist\n", + " os.makedirs(os.path.dirname(local_file), exist_ok=True)\n", + "\n", + " # Stream the file download\n", + " with requests.get(url, stream=True) as r:\n", + " r.raise_for_status()\n", + " with open(local_file, 'wb') as f:\n", + " for chunk in r.iter_content(chunk_size=chunk_size):\n", + " if chunk: # filter out keep-alive new chunks\n", + " f.write(chunk)\n", + " print()\n", + " file_size = format_size(os.path.getsize(local_file))\n", + " print(f\"{local_file} ({file_size}) downloaded successfully.\")\n", + "## --- end: download_file ------\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "dc1972c3", + "metadata": { + "id": "dc1972c3" + }, + "source": [ + "## Step-3: Inspect the Data\n", + "\n", + "We will use simple PDFs. 
The files are [here](https://github.com/IBM/data-prep-kit/tree/dev/examples/data-files/pdf-processing-1/)\n", + "\n", + "- [earth.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf) and exact duplicate [earth-copy.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf)\n", + "- [earth2.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf) nearly identical to earth.pdf (ONE word difference!)\n", + "- [mars.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf)\n", + "- [spam.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf) - contains spammy contents\n", + "- [lorem-ipsum.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf) - contains 'lorem ipsum' placeholder\n" + ] + }, + { + "cell_type": "markdown", + "id": "7113b16c", + "metadata": { + "id": "7113b16c" + }, + "source": [ + "### 3.1 - Download Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23db1064", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "23db1064", + "outputId": "d871231d-86e2-4db7-a437-1510047bef2a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using input files from : ../../data-files/pdf-processing-1/\n" + ] + } + ], + "source": [ + "if RUNNING_IN_COLAB:\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf', os.path.join(input_dir, 'earth.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf', os.path.join(input_dir, 'earth-copy.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf', os.path.join(input_dir, 'earth2.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf', os.path.join(input_dir, 'mars.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf', os.path.join(input_dir, 'spam.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf', os.path.join(input_dir, 'lorem-ipsum.pdf'))\n", + "else:\n", + " print ('Using input files from : ', input_dir)" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-4: Extract Data from PDF (pdf2parquet)\n", + "\n", + "In this step we will read PDF files and extract the text data.\n", + "\n", + "[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)\n", + "\n", + "We use the [Docling package](https://github.com/DS4SD/docling).\n" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 4.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + 
"colab": { + "base_uri": "https://localhost:8080/", + "height": 836, + "referenced_widgets": [ + "df5c199339f5467a91453fa187e201f0", + "257dbf0b62624667b0c82afaf1c8ccf1", + "4e76bef9228546fd97cccfe7bdd856f3", + "c0c37c0262b84e9ebf02c1ce17f263ee", + "ca821137125b45d08e257f95822a6f72", + "fb81f32569c34250b901235698e5ea18", + "1ce164863aa34f64a94aeb5d05103043", + "e2b5f84c30de45d29588a07a3d106eb4", + "cc7d3125eb55461180566d1064eeb2a5", + "68eb811a52804887bc383e89a72a0975", + "55b9873ce1f34c169ecc6087c3cd65a1" + ] + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "da48c24e-c32c-4fc9-e6aa-37b1921c3d4d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='../../data-files/pdf-processing-1/' --> output='output/01_pdf2pq_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:54:24 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 8}\n", + "13:54:24 INFO - pipeline id pipeline_id\n", + "13:54:24 INFO - code location None\n", + "13:54:24 INFO - data factory data_ is using local data access: input_folder - ../../data-files/pdf-processing-1/ output_folder - output/01_pdf2pq_out\n", + "13:54:24 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:24 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "13:54:24 INFO - orchestrator pdf2parquet started at 2025-02-06 13:54:24\n", + "13:54:24 INFO - Number of files is 6, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.023715972900390625, 'total_file_size': 0.2709054946899414}\n", + "13:54:24 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f1a499a391784b7ba00cb9b1730bac8d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 9 files: 0%| | 0/9 [00:00 output='{output_pdf2pq_dir}'\\n\", flush=True)\n", + "\n", + "result = Pdf2Parquet(input_folder= input_dir,\n", + " output_folder= output_pdf2pq_dir,\n", + " data_files_to_use=['.pdf'],\n", + " pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN, # markdown\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "5ca790e0", + "metadata": { + "id": "5ca790e0" + }, + "source": [ + "### 4.2 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "fe59563d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + }, + "id": "fe59563d", + "outputId": "81b70c9f-cc39-4f78-f29f-f81d4fcf19ae" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/01_pdf2pq_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filename
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1024be2a61e-96f5-4f58-bf6f-e829dbdfa9d36571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T13:54:32.1553840.651216lorem-ipsum.pdf
1spam.pdfFree xxx1022bd06750-cb70-4689-b2b8-72913b929a1d10026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T13:54:33.4406510.617823spam.pdf
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011594034db-1fcd-411b-a89e-d37e4defdfc210729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T13:54:31.5024600.645348earth2.pdf
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T13:54:32.8213650.664288mars.pdf
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:29.9095551.100482earth-copy.pdf
5earth.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011d1d30fbc-c1e9-4813-a067-085e50b4ee4914711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:30.8552250.931613earth.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "5 earth.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "5 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 4be2a61e-96f5-4f58-bf6f-e829dbdfa9d3 6571294142213095721 pdf \n", + "1 2bd06750-cb70-4689-b2b8-72913b929a1d 10026122586747302274 pdf \n", + "2 594034db-1fcd-411b-a89e-d37e4defdfc2 10729312978404042321 pdf \n", + "3 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "4 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "5 d1d30fbc-c1e9-4813-a067-085e50b4ee49 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2025-02-06T13:54:32.155384 0.651216 lorem-ipsum.pdf \n", + "1 2025-02-06T13:54:33.440651 0.617823 spam.pdf \n", + "2 2025-02-06T13:54:31.502460 0.645348 earth2.pdf \n", + "3 2025-02-06T13:54:32.821365 0.664288 mars.pdf \n", + "4 2025-02-06T13:54:29.909555 1.100482 earth-copy.pdf \n", + "5 2025-02-06T13:54:30.855225 0.931613 earth.pdf " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print (\"Displaying contents of : \", output_pdf2pq_dir)\n", + "output_df = read_parquet_files_as_df(output_pdf2pq_dir)\n", + "# print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "output_df.head(10)\n", + "\n", + "## To display certain columns\n", + "#parquet_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 4.3 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : text\n", + "- **document_id**: unique id (UUID) assignd to this document\n", + "- **document_hash**: hash of documents\n", + "- **hash** : hash of `contents` column\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "**Note: you should notice the hash values are identical for the duplicate documents**\n", + "\n", + "Let's inspect the **contents** column." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "8064d9df-c226-4795-b9ad-34d50709a8c3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "## Earth\n", + "\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "\n", + "For more details about our Solar system see Chapter 1.\n", + "\n", + "## Earth\n", + "\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "\n", + "Basic facts about Earth:\n", + "\n", + "- · Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "- · Moons: One moon, called Luna or simply \"the Moon\".\n", + "- · Rotation Period: 24 hours (one day)\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'earth.pdf'].iloc[0,]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "3dbf4e39-1c4c-443e-968c-32aae9010165" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Free xxx\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'spam.pdf'].iloc[0,]['contents'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b37dd994", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lorem ipsum Lorem ipsum Lorem ipsum\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'lorem-ipsum.pdf'].iloc[0,]['contents'])" + ] + }, + { + "cell_type": "markdown", + "id": "7fc86d5b", + "metadata": { + "id": "7fc86d5b" + }, + "source": [ + "## Step-5: Create DOC ID for Documents\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. 
To enable this annotation, set **int_id_column** to the name of the column where you want to store it.\n", + "\n", + "**This step is a prerequisite for fuzzy dedup** in the pipeline.\n", + "\n", + "[DocID documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/doc_id)" + ] + }, + { + "cell_type": "markdown", + "id": "f516a253", + "metadata": { + "id": "f516a253" + }, + "source": [ + "### 5.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "cee20521", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cee20521", + "outputId": "dd568017-e39c-4524-cdcf-6c97a1341ab9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_pdf2pq_out' --> output='output/02_docid_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:54:33 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'doc_hash', 'int_column': 'int_id_column', 'start_id': 0}\n", + "13:54:33 INFO - pipeline id pipeline_id\n", + "13:54:33 INFO - code location None\n", + "13:54:33 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_docid_out\n", + "13:54:33 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:33 INFO - orchestrator doc_id started at 2025-02-06 13:54:33\n", + "13:54:33 INFO - Number of files is 6, source profile {'max_file_size': 0.010061264038085938, 'min_file_size': 0.0055408477783203125, 'total_file_size': 0.04969310760498047}\n", + "13:54:33 INFO - Completed 1 files (16.67%) in 0.0 min\n", + "13:54:33 INFO - Completed 2 files (33.33%) in 0.0 min\n", + "13:54:33 INFO - Completed 3 files (50.0%) in 0.0 min\n", + "13:54:33 INFO - Completed 4 files (66.67%) in 0.0 min\n", + "13:54:33 INFO - Completed 5 files (83.33%) in 0.0 min\n", + "13:54:33 INFO - Completed 6 files (100.0%) in 0.0 min\n", + "13:54:33 INFO - Done processing 6 files, waiting for flush() completion.\n", + "13:54:33 INFO - done flushing in 0.0 sec\n", + "13:54:33 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 27 ms, sys: 3.61 ms, total: 30.7 ms\n", + "Wall time: 26.3 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_doc_id.transform_python import DocID\n", + "\n", + "STAGE = 2\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_pdf2pq_dir}' --> output='{output_docid_dir}'\\n\", flush=True)\n", + "\n", + "result = DocID(input_folder= output_pdf2pq_dir,\n", + " output_folder= output_docid_dir,\n", + " doc_id_doc_column= \"contents\",\n", + " doc_id_hash_column= \"doc_hash\",\n", + " # doc_id_int_column= \"doc_id_int\",\n", + " doc_id_int_column= \"int_id_column\",\n", + " #doc_id_start_id= 5\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "4bd6f382", + "metadata": { + "id": "4bd6f382" + }, + "source": [ + "### 5.2 - Inspect Generated output\n", + "\n", + "You should see two new columns, **doc_hash** and **int_id_column**" + ] + }, + { + "cell_type": 
"code", + "execution_count": 13, + "id": "f3d4aba9", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + }, + "id": "f3d4aba9", + "outputId": "b4b868b3-ebc7-48a2-f0c5-b0b023a24238" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/02_docid_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_column
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1024be2a61e-96f5-4f58-bf6f-e829dbdfa9d36571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T13:54:32.1553840.651216lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3
1spam.pdfFree xxx1022bd06750-cb70-4689-b2b8-72913b929a1d10026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T13:54:33.4406510.617823spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011594034db-1fcd-411b-a89e-d37e4defdfc210729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T13:54:31.5024600.645348earth2.pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...2
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T13:54:32.8213650.664288mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:29.9095551.100482earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...0
5earth.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011d1d30fbc-c1e9-4813-a067-085e50b4ee4914711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:30.8552250.931613earth.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...1
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "5 earth.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "5 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 4be2a61e-96f5-4f58-bf6f-e829dbdfa9d3 6571294142213095721 pdf \n", + "1 2bd06750-cb70-4689-b2b8-72913b929a1d 10026122586747302274 pdf \n", + "2 594034db-1fcd-411b-a89e-d37e4defdfc2 10729312978404042321 pdf \n", + "3 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "4 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "5 d1d30fbc-c1e9-4813-a067-085e50b4ee49 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T13:54:32.155384 0.651216 lorem-ipsum.pdf \n", + "1 2025-02-06T13:54:33.440651 0.617823 spam.pdf \n", + "2 2025-02-06T13:54:31.502460 0.645348 earth2.pdf \n", + "3 2025-02-06T13:54:32.821365 0.664288 mars.pdf \n", + "4 2025-02-06T13:54:29.909555 1.100482 earth-copy.pdf \n", + "5 2025-02-06T13:54:30.855225 0.931613 earth.pdf \n", + "\n", + " doc_hash int_id_column \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 2 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 0 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 
1 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print (\"Displaying contents of : \", output_docid_dir)\n", + "output_df = read_parquet_files_as_df(output_docid_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "c55f8d3f", + "metadata": { + "id": "c55f8d3f" + }, + "source": [ + "## Step-6: Eliminate Duplicate Documents\n", + "\n", + "We have 2 exact duplicates: **earth.pdf** , **earth-copy.pdf**\n", + "\n", + "Note how **doc_hash** for these documents are the same.\n", + "\n", + "[Exact dedupe information](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/ededup)" + ] + }, + { + "cell_type": "markdown", + "id": "6f5ef1f7", + "metadata": { + "id": "6f5ef1f7" + }, + "source": [ + "### 6.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "90eddb4c", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "90eddb4c", + "outputId": "61221177-f23e-4daa-8e34-237582fc19b0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_docid_out' --> output='output/03_exact_dedupe_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:54:33 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'doc_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "13:54:33 INFO - pipeline id pipeline_id\n", + "13:54:33 INFO - code location None\n", + "13:54:33 INFO - data factory data_ is using local data access: input_folder - output/02_docid_out output_folder - output/03_exact_dedupe_out\n", + "13:54:33 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:33 INFO - orchestrator ededup started at 2025-02-06 13:54:33\n", + "13:54:33 INFO - Number of files is 6, source profile {'max_file_size': 0.01116180419921875, 'min_file_size': 0.006641387939453125, 'total_file_size': 0.056290626525878906}\n", + "13:54:33 INFO - Starting from the beginning\n", + "13:54:33 INFO - Completed 1 files (16.67%) in 0.0 min\n", + "13:54:33 INFO - Completed 2 files (33.33%) in 0.0 min\n", + "13:54:33 INFO - Completed 3 files (50.0%) in 0.0 min\n", + "13:54:33 INFO - Completed 4 files (66.67%) in 0.0 min\n", + "13:54:33 INFO - Completed 5 files (83.33%) in 0.0 min\n", + "13:54:33 INFO - Completed 6 files (100.0%) in 0.0 min\n", + "13:54:33 INFO - Done processing 6 files, waiting for flush() completion.\n", + "13:54:33 INFO - done flushing in 0.0 sec\n", + "13:54:33 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 25.3 ms, sys: 4.27 ms, total: 29.5 ms\n", + "Wall time: 24.2 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_ededup.transform_python import Ededup\n", + "\n", + "STAGE = 3\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_docid_dir}' --> output='{output_exact_dedupe_dir}'\\n\", flush=True)\n", + "\n", + "result = Ededup(input_folder=output_docid_dir,\n", + " output_folder=output_exact_dedupe_dir,\n", + " ededup_doc_column=\"contents\",\n", + " ededup_doc_id_column=\"doc_hash\"\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed 
successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "f4aacf09", + "metadata": { + "id": "f4aacf09" + }, + "source": [ + "### 6.2 - Inspect Generated output\n", + "\n", + "You can see one of **earth.pdf** or **earth-copy.pdf** will be eliminated." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "1887b26d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 611 + }, + "id": "1887b26d", + "outputId": "31210411-1abd-418a-c1d9-167770788d62" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input files before exact dedupe : 6\n", + "Output files after exact dedupe : 5\n", + "Duplicate files removed : 1\n", + "Displaying contents of : output/03_exact_dedupe_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_columnremoved
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1024be2a61e-96f5-4f58-bf6f-e829dbdfa9d36571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T13:54:32.1553840.651216lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3[]
1spam.pdfFree xxx1022bd06750-cb70-4689-b2b8-72913b929a1d10026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T13:54:33.4406510.617823spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5[]
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011594034db-1fcd-411b-a89e-d37e4defdfc210729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T13:54:31.5024600.645348earth2.pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...2[]
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T13:54:32.8213650.664288mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4[]
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:29.9095551.100482earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...0[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 4be2a61e-96f5-4f58-bf6f-e829dbdfa9d3 6571294142213095721 pdf \n", + "1 2bd06750-cb70-4689-b2b8-72913b929a1d 10026122586747302274 pdf \n", + "2 594034db-1fcd-411b-a89e-d37e4defdfc2 10729312978404042321 pdf \n", + "3 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "4 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T13:54:32.155384 0.651216 lorem-ipsum.pdf \n", + "1 2025-02-06T13:54:33.440651 0.617823 spam.pdf \n", + "2 2025-02-06T13:54:31.502460 0.645348 earth2.pdf \n", + "3 2025-02-06T13:54:32.821365 0.664288 mars.pdf \n", + "4 2025-02-06T13:54:29.909555 1.100482 earth-copy.pdf \n", + "\n", + " doc_hash int_id_column removed \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 [] \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 [] \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 2 [] \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 [] \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 0 [] " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "input_df = read_parquet_files_as_df(output_docid_dir)\n", + "output_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "print (\"Displaying contents of : \", output_exact_dedupe_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "76ea34e2", + "metadata": { + "id": "76ea34e2" + }, + "source": [ + "## Step-7: Fuzzy Dedupe\n", + "\n", + "In previous step, we removed **exact duplicates (identical documents)**.\n", + "\n", + "Fuzzy de-dupe can further filter out documents that are **not exactly identical, but nearly identical**\n", + "\n", + "Here is a simple example:\n", + "\n", + "`Our solar system is a vast and fascinating expanse`\n", + "\n", + "`The solar system is a vast and fascinating expanse`\n", + "\n", + "Only one word is different `Our` vs `The`.\n", + "\n", + "Imagine two documents with one extra blank line. 
For our purposes they are the same.\n", + "\n", + "[Fuzzy dedupe documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/fdedup)\n", + "\n", + "### Tweaking fuzzy matches\n", + "\n", + "**`jaccard_similarity_threshold`** is the parameter used to tweak similarities between documents. It's value is between 0 and 1.0. Values close to 1.0 means more strict checking (fewer documents will qualify). Lower threshold means more leniant matches (more documents will qualify)\n", + "\n", + "Adjust this value to find what works for your documents" + ] + }, + { + "cell_type": "markdown", + "id": "79a37713", + "metadata": { + "id": "79a37713" + }, + "source": [ + "### 7.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "37430b60", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "37430b60", + "outputId": "48366a20-f5c2-4040-bf56-8b29ce40ed53" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_exact_dedupe_out' --> output='output/04_fuzzy_dedupe_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:54:33 INFO - Starting SignatureCalculation step\n", + "13:54:33 INFO - Got parameters for SignatureCalculation\n", + "13:54:33 INFO - minhash parameters are : {'document_id_column': 'int_id_column', 'contents_column': 'contents', 'seed': 42, 'num_permutations': 112, 'jaccard_similarity_threshold': 0.8, 'word_shingle_size': 5, 'num_bands': 14, 'num_minhashes_per_band': 8, 'num_segments': 1, 'shingle_option': 'word'}\n", + "13:54:33 INFO - data factory scdata_ is using local configuration without input/output path\n", + "13:54:33 INFO - data factory scdata_ max_files -1, n_sample -1\n", + "13:54:33 INFO - data factory scdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:33 INFO - pipeline id pipeline_id\n", + "13:54:33 INFO - code location None\n", + "13:54:33 INFO - data factory data_ is using local data access: input_folder - output/03_exact_dedupe_out output_folder - output/04_fuzzy_dedupe_out\n", + "13:54:33 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:33 INFO - orchestrator minhash started at 2025-02-06 13:54:33\n", + "13:54:33 INFO - Number of files is 6, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.003223419189453125, 'total_file_size': 0.050751686096191406}\n", + "13:54:33 INFO - Completed 1 files (16.67%) in 0.001 min\n", + "13:54:33 WARNING - table is empty, skipping processing\n", + "13:54:33 INFO - Completed 2 files (33.33%) in 0.001 min\n", + "13:54:33 INFO - Completed 3 files (50.0%) in 0.001 min\n", + "13:54:33 INFO - Completed 4 files (66.67%) in 0.001 min\n", + "13:54:33 INFO - Completed 5 files (83.33%) in 0.001 min\n", + "13:54:33 INFO - Completed 6 files (100.0%) in 0.001 min\n", + "13:54:33 INFO - Done processing 6 files, waiting for flush() completion.\n", + "13:54:33 INFO - Starting flush()\n", + "13:54:33 INFO - Wrote 14 tables with a total size of 33,600 bytes\n", + "13:54:33 INFO - done flushing in 0.031 sec\n", + "13:54:33 INFO - Completed execution in 0.001 min, execution result 0\n", + "13:54:33 INFO - SignatureCalculation completed successfully\n", + 
"13:54:33 INFO - Starting ClusterAnalysis step\n", + "13:54:33 INFO - Got parameters for ClusterAnalysis\n", + "13:54:33 INFO - cluster parameters are : {'jaccard_similarity_threshold': 0.8, 'num_bands': 14, 'num_segments': 1, 'sort_output': False}\n", + "13:54:33 INFO - pipeline id pipeline_id\n", + "13:54:33 INFO - code location None\n", + "13:54:33 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out/bands output_folder - output/04_fuzzy_dedupe_out/docs_to_remove\n", + "13:54:33 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:33 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:33 INFO - orchestrator cluster started at 2025-02-06 13:54:33\n", + "13:54:33 INFO - Number of folders is 14\n", + "13:54:33 INFO - Completed 1 files (7.14%) in 0.0 min\n", + "13:54:33 INFO - Completed 2 files (14.29%) in 0.0 min\n", + "13:54:33 INFO - Completed 3 files (21.43%) in 0.0 min\n", + "13:54:33 INFO - Completed 4 files (28.57%) in 0.0 min\n", + "13:54:33 INFO - Completed 5 files (35.71%) in 0.0 min\n", + "13:54:33 INFO - Completed 6 files (42.86%) in 0.0 min\n", + "13:54:33 INFO - Completed 7 files (50.0%) in 0.001 min\n", + "13:54:34 INFO - Completed 8 files (57.14%) in 0.001 min\n", + "13:54:34 INFO - Completed 9 files (64.29%) in 0.001 min\n", + "13:54:34 INFO - Completed 10 files (71.43%) in 0.001 min\n", + "13:54:34 INFO - Completed 11 files (78.57%) in 0.001 min\n", + "13:54:34 INFO - Completed 12 files (85.71%) in 0.001 min\n", + "13:54:34 INFO - Completed 13 files (92.86%) in 0.001 min\n", + "13:54:34 INFO - Completed 14 files (100.0%) in 0.001 min\n", + "13:54:34 INFO - Done processing 14 files, waiting for flush() completion.\n", + "13:54:34 INFO - done flushing in 0.0 sec\n", + "13:54:34 INFO - Completed execution in 0.001 min, execution result 0\n", + "13:54:34 INFO - ClusterAnalysis completed successfully\n", + "13:54:34 INFO - Starting GetDuplicateList step\n", + "13:54:34 INFO - Got parameters for GetDuplicateList\n", + "13:54:34 INFO - fdlist parameters are : {'docs_to_remove': 'docs_to_remove', 'consolidated_filename': 'docs_to_remove_consolidated/docs_to_remove_consolidated.parquet', 'sort_output': False}\n", + "13:54:34 INFO - pipeline id pipeline_id\n", + "13:54:34 INFO - code location None\n", + "13:54:34 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out output_folder - output/04_fuzzy_dedupe_out\n", + "13:54:34 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:34 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:34 INFO - orchestrator fdlist started at 2025-02-06 13:54:34\n", + "13:54:34 INFO - Number of folders is 1\n", + "13:54:34 INFO - Get Duplicate List for folder docs_to_remove\n", + "13:54:34 INFO - 1 documents marked as duplicates\n", + "13:54:34 INFO - Completed 1 files (100.0%) in 0.0 min\n", + "13:54:34 INFO - Done processing 1 files, waiting for flush() completion.\n", + "13:54:34 INFO - done flushing in 0.0 sec\n", + "13:54:34 INFO - Completed execution in 0.0 min, execution result 0\n", + "13:54:34 INFO - GetDuplicateList completed successfully\n", + "13:54:34 INFO - Starting DataCleaning step\n", + "13:54:34 INFO - Got parameters for DataCleaning\n", + "13:54:34 INFO - fdclean parameters are : 
{'document_id_column': 'int_id_column', 'duplicate_list_location': 'docs_to_remove_consolidated/docs_to_remove_consolidated.parquet', 'operation_mode': 'filter_duplicates'}\n", + "13:54:34 INFO - data factory dcdata_ is using local configuration without input/output path\n", + "13:54:34 INFO - data factory dcdata_ max_files -1, n_sample -1\n", + "13:54:34 INFO - data factory dcdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:34 INFO - pipeline id pipeline_id\n", + "13:54:34 INFO - code location None\n", + "13:54:34 INFO - data factory data_ is using local data access: input_folder - output/03_exact_dedupe_out output_folder - output/04_fuzzy_dedupe_out/cleaned\n", + "13:54:34 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:34 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:34 INFO - orchestrator fdclean started at 2025-02-06 13:54:34\n", + "13:54:34 INFO - Number of files is 6, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.003223419189453125, 'total_file_size': 0.050751686096191406}\n", + "13:54:34 INFO - Completed 1 files (16.67%) in 0.0 min\n", + "13:54:34 WARNING - table is empty, skipping processing\n", + "13:54:34 INFO - Completed 2 files (33.33%) in 0.0 min\n", + "13:54:34 INFO - Completed 3 files (50.0%) in 0.0 min\n", + "13:54:34 INFO - Completed 4 files (66.67%) in 0.0 min\n", + "13:54:34 INFO - Completed 5 files (83.33%) in 0.0 min\n", + "13:54:34 INFO - Completed 6 files (100.0%) in 0.0 min\n", + "13:54:34 INFO - Done processing 6 files, waiting for flush() completion.\n", + "13:54:34 INFO - done flushing in 0.0 sec\n", + "13:54:34 INFO - Completed execution in 0.0 min, execution result 0\n", + "13:54:34 INFO - DataCleaning completed successfully\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 224 ms, sys: 129 ms, total: 353 ms\n", + "Wall time: 290 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_fdedup.transform_python import Fdedup\n", + "\n", + "STAGE = 4\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_fuzzy_dedupe_dir}'\\n\", flush=True)\n", + "\n", + "result = Fdedup(input_folder=output_exact_dedupe_dir,\n", + " output_folder=output_fuzzy_dedupe_dir,\n", + " contents_column= \"contents\",\n", + " # document_id_column= \"doc_id\",\n", + " document_id_column= \"int_id_column\",\n", + " num_permutations= 112,\n", + " num_bands= 14,\n", + " num_minhashes_per_band= 8,\n", + " jaccard_similarity_threshold = 0.8, # between 0 - 1. higher means more strict checking\n", + " operation_mode=\"filter_duplicates\",\n", + " # operation_mode=\"annotate\",\n", + " ).transform()\n", + "# if result == 0:\n", + "# print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "# else:\n", + "# raise Exception (f\"❌ Stage:{STAGE} failed (result={result})\")" + ] + }, + { + "cell_type": "markdown", + "id": "b2c83592", + "metadata": { + "id": "b2c83592" + }, + "source": [ + "### 7.2 - Inspect Output\n", + "\n", + "FuzzyDedupe writes the documents that survive filtering to the **output/04_fuzzy_dedupe_out/cleaned** folder\n", + "\n", + "You will notice that only one of the near-identical 'earth' documents made it! So fuzzy dedupe filtered out the nearly identical doc."
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "573faba2", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 511 + }, + "id": "573faba2", + "outputId": "49408c6e-a22b-404f-ccc5-c00edb7ce85a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input files before fuzzy dedupe : 5\n", + "Output files after fuzzy dedupe : 4\n", + "Near duplicate files removed : 1\n", + "Displaying contents of : output/04_fuzzy_dedupe_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_columnremoved
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1024be2a61e-96f5-4f58-bf6f-e829dbdfa9d36571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T13:54:32.1553840.651216lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3[]
1spam.pdfFree xxx1022bd06750-cb70-4689-b2b8-72913b929a1d10026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T13:54:33.4406510.617823spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5[]
2mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T13:54:32.8213650.664288mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4[]
3earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T13:54:29.9095551.100482earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...0[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "3 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 4be2a61e-96f5-4f58-bf6f-e829dbdfa9d3 6571294142213095721 pdf \n", + "1 2bd06750-cb70-4689-b2b8-72913b929a1d 10026122586747302274 pdf \n", + "2 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "3 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "3 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T13:54:32.155384 0.651216 lorem-ipsum.pdf \n", + "1 2025-02-06T13:54:33.440651 0.617823 spam.pdf \n", + "2 2025-02-06T13:54:32.821365 0.664288 mars.pdf \n", + "3 2025-02-06T13:54:29.909555 1.100482 earth-copy.pdf \n", + "\n", + " doc_hash int_id_column removed \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 [] \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 [] \n", + "2 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 [] \n", + "3 6140cf695f269a3ddca6568536076756105ad3186086b2... 0 [] " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "input_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "output_df = read_parquet_files_as_df(os.path.join(output_fuzzy_dedupe_dir, \"cleaned\"))\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Near duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "print (\"Displaying contents of : \", output_fuzzy_dedupe_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "3e0598a0", + "metadata": { + "id": "3e0598a0" + }, + "source": [ + "## Step-8: Document Quality\n", + "\n", + "This handy plugin will score documents across many metrics.\n", + "\n", + "Here we will look for 'bad words' metric.\n", + "\n", + "[Document quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)\n", + "\n", + "By default it uses [bad words collection](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/dpk_doc_quality/ldnoobw). 
You can supply a custom file by passing an argument `bad_word_filepath=/path/to/badwords_file`" + ] + }, + { + "cell_type": "markdown", + "id": "1949c2c4", + "metadata": { + "id": "1949c2c4" + }, + "source": [ + "### 8.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "b485f598", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "b485f598", + "outputId": "448a8ee1-9371-4bd4-f5ad-a596893fe65f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/04_fuzzy_dedupe_out/cleaned' --> output='output/05_doc_quality_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:54:34 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", + "13:54:34 INFO - data factory docq_ is using local configuration without input/output path\n", + "13:54:34 INFO - data factory docq_ max_files -1, n_sample -1\n", + "13:54:34 INFO - data factory docq_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:34 INFO - pipeline id pipeline_id\n", + "13:54:34 INFO - code location None\n", + "13:54:34 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out/cleaned output_folder - output/05_doc_quality_out\n", + "13:54:34 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:54:34 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:54:34 INFO - orchestrator docq started at 2025-02-06 13:54:34\n", + "13:54:34 INFO - Number of files is 5, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.0035142898559570312, 'total_file_size': 0.040172576904296875}\n", + "13:54:34 INFO - Load badwords found locally from /home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en\n", + "13:54:34 INFO - Completed 1 files (20.0%) in 0.0 min\n", + "13:54:34 WARNING - table is empty, skipping processing\n", + "13:54:34 INFO - Completed 2 files (40.0%) in 0.0 min\n", + "13:54:34 INFO - Completed 3 files (60.0%) in 0.0 min\n", + "13:54:34 INFO - Completed 4 files (80.0%) in 0.0 min\n", + "13:54:34 INFO - Completed 5 files (100.0%) in 0.0 min\n", + "13:54:34 INFO - Done processing 5 files, waiting for flush() completion.\n", + "13:54:34 INFO - done flushing in 0.0 sec\n", + "13:54:34 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 37 ms, sys: 3.43 ms, total: 40.4 ms\n", + "Wall time: 36 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_doc_quality.transform_python import DocQuality\n", + "\n", + "STAGE = 5\n", + "output_fuzzy_dedupe_cleaned_dir = os.path.join(output_fuzzy_dedupe_dir, \"cleaned\")\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_fuzzy_dedupe_cleaned_dir}' --> output='{output_doc_quality_dir}'\\n\", flush=True)\n", + "\n", + "result = DocQuality(input_folder=output_fuzzy_dedupe_cleaned_dir,\n", + " output_folder= 
output_doc_quality_dir,\n", + " docq_text_lang = \"en\",\n", + " docq_doc_content_column =\"contents\",\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed (result={result})\")" + ] + }, + { + "cell_type": "markdown", + "id": "eccefd3e", + "metadata": { + "id": "eccefd3e" + }, + "source": [ + "### 8.2 - Inspect the Output\n", + "\n", + "We will see several new columns starting with the name **docq_**.\n", + "\n", + "Look at the column **docq_contain_bad_word**; this will flag documents with 'bad words'.\n", + "\n", + "Also inspect the column **docq_lorem_ipsum_ratio**; this will flag documents with 'lorem ipsum' text\n", + "\n", + "For more information see : [Doc Quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "1f3225f8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 485 + }, + "id": "1f3225f8", + "outputId": "a6009dc0-6ca6-411a-8066-090c610860e0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/05_doc_quality_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsize...docq_mean_word_lendocq_symbol_to_word_ratiodocq_sentence_countdocq_lorem_ipsum_ratiodocq_curly_bracket_ratiodocq_contain_bad_worddocq_bullet_point_ratiodocq_ellipsis_line_ratiodocq_alphabet_word_ratiodocq_contain_common_en_words
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1024be2a61e-96f5-4f58-bf6f-e829dbdfa9d36571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...35...5.0000000.00000010.0857140.0False0.0000000.01.000000False
1spam.pdfFree xxx1022bd06750-cb70-4689-b2b8-72913b929a1d10026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...8...3.5000000.00000010.0000000.0True0.0000000.01.000000False
2mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...717...4.6880000.03200080.0000000.0False0.1764710.00.880000True
3earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...610...4.5412840.02752390.0000000.0False0.1764710.00.880734True
\n", + "

4 rows × 27 columns

\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "3 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 4be2a61e-96f5-4f58-bf6f-e829dbdfa9d3 6571294142213095721 pdf \n", + "1 2bd06750-cb70-4689-b2b8-72913b929a1d 10026122586747302274 pdf \n", + "2 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "3 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "\n", + " hash size ... \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 ... \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 ... \n", + "2 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 ... \n", + "3 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 ... \n", + "\n", + " docq_mean_word_len docq_symbol_to_word_ratio docq_sentence_count \\\n", + "0 5.000000 0.000000 1 \n", + "1 3.500000 0.000000 1 \n", + "2 4.688000 0.032000 8 \n", + "3 4.541284 0.027523 9 \n", + "\n", + " docq_lorem_ipsum_ratio docq_curly_bracket_ratio docq_contain_bad_word \\\n", + "0 0.085714 0.0 False \n", + "1 0.000000 0.0 True \n", + "2 0.000000 0.0 False \n", + "3 0.000000 0.0 False \n", + "\n", + " docq_bullet_point_ratio docq_ellipsis_line_ratio \\\n", + "0 0.000000 0.0 \n", + "1 0.000000 0.0 \n", + "2 0.176471 0.0 \n", + "3 0.176471 0.0 \n", + "\n", + " docq_alphabet_word_ratio docq_contain_common_en_words \n", + "0 1.000000 False \n", + "1 1.000000 False \n", + "2 0.880000 True \n", + "3 0.880734 True \n", + "\n", + "[4 rows x 27 columns]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df = read_parquet_files_as_df(output_doc_quality_dir)\n", + "print (\"Displaying contents of : \", output_doc_quality_dir)\n", + "output_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "02fa3bd2", + "metadata": { + "id": "02fa3bd2" + }, + "source": [ + "### 8.3 - Filtering 'quality' documents\n", + "\n", + "So from the output above we see **spam.pdf** is flagged for containing bad words (**docq_contain_bad_word=True**).\n", + "\n", + "Also **lorem.pdf** is flagged for place holder content **lorem ipsum** (**docq_lorem_ipsum_ratio > 0**)\n", + "\n", + "We are going to filter them both out" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "5dac1c70", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "5dac1c70", + "outputId": "463e897f-1099-410a-f753-34c4846228c3" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsize...docq_mean_word_lendocq_symbol_to_word_ratiodocq_sentence_countdocq_lorem_ipsum_ratiodocq_curly_bracket_ratiodocq_contain_bad_worddocq_bullet_point_ratiodocq_ellipsis_line_ratiodocq_alphabet_word_ratiodocq_contain_common_en_words
2mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...101120ae1424-c2c3-436f-a7ff-b8c69fa3a3c37758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...717...4.6880000.03200080.00.0False0.1764710.00.880000True
3earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...10114b43fb09-c9ef-4d9a-af24-8e22b5ff33b314711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...610...4.5412840.02752390.00.0False0.1764710.00.880734True
\n", + "

2 rows × 27 columns

\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "2 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "3 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "2 20ae1424-c2c3-436f-a7ff-b8c69fa3a3c3 7758129997476962679 pdf \n", + "3 4b43fb09-c9ef-4d9a-af24-8e22b5ff33b3 14711865278795535908 pdf \n", + "\n", + " hash size ... \\\n", + "2 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 ... \n", + "3 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 ... \n", + "\n", + " docq_mean_word_len docq_symbol_to_word_ratio docq_sentence_count \\\n", + "2 4.688000 0.032000 8 \n", + "3 4.541284 0.027523 9 \n", + "\n", + " docq_lorem_ipsum_ratio docq_curly_bracket_ratio docq_contain_bad_word \\\n", + "2 0.0 0.0 False \n", + "3 0.0 0.0 False \n", + "\n", + " docq_bullet_point_ratio docq_ellipsis_line_ratio \\\n", + "2 0.176471 0.0 \n", + "3 0.176471 0.0 \n", + "\n", + " docq_alphabet_word_ratio docq_contain_common_en_words \n", + "2 0.880000 True \n", + "3 0.880734 True \n", + "\n", + "[2 rows x 27 columns]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "all_docs_df = read_parquet_files_as_df(output_doc_quality_dir)\n", + "\n", + "# remove documents with badwords\n", + "clean_docs_df = all_docs_df[all_docs_df['docq_contain_bad_word'] == False]\n", + "\n", + "# also filter out 'lorem ipsum' text\n", + "clean_docs_df = clean_docs_df[clean_docs_df['docq_lorem_ipsum_ratio'] == 0]\n", + "\n", + "clean_docs_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207" + }, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(output_final_dir, ignore_errors=True)\n", + "shutil.os.makedirs(output_final_dir, exist_ok=True)\n", + "\n", + "output_final_dir_parquet = os.path.join (output_final_dir, 'pq')\n", + "shutil.os.makedirs(output_final_dir_parquet, exist_ok=True)\n", + "\n", + "output_final_dir_markdown = os.path.join (output_final_dir, 'markdown')\n", + "shutil.os.makedirs(output_final_dir_markdown, exist_ok=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "e06ce4f2", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e06ce4f2", + "outputId": "8a26e407-2cc8-44ee-ba6b-ca6485a92926" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Saved CLEAN parquet output to 'output/output_final/pq'\n" + ] + } + ], + "source": [ + "## save parquet\n", + "\n", + "clean_docs_df.to_parquet(os.path.join(output_final_dir_parquet, \"clean_docs.parquet\"))\n", + "print (f\"✅ Saved CLEAN parquet output to '{output_final_dir_parquet}'\")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "1e175302", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1e175302", + "outputId": "d54c5d80-23ce-49a6-e098-8e712d048975" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Saved CLEAN markdown output to 'output/output_final/markdown'\n" + ] + } + 
], + "source": [ + "## save markdown text\n", + "\n", + "for index, row in clean_docs_df.iterrows():\n", + " output_file_name = os.path.join (output_final_dir_markdown, row['filename'] + '.md')\n", + " with open(output_file_name, 'w') as output_file:\n", + " output_file.write(row['contents'])\n", + "\n", + "print (f\"✅ Saved CLEAN markdown output to '{output_final_dir_markdown}'\")\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "dpk-6-pdf-processing-r1.0.0-all-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "1ce164863aa34f64a94aeb5d05103043": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "257dbf0b62624667b0c82afaf1c8ccf1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fb81f32569c34250b901235698e5ea18", + "placeholder": "​", + "style": "IPY_MODEL_1ce164863aa34f64a94aeb5d05103043", + "value": "Fetching 9 files: 100%" + } + }, + "4e76bef9228546fd97cccfe7bdd856f3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e2b5f84c30de45d29588a07a3d106eb4", + "max": 9, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_cc7d3125eb55461180566d1064eeb2a5", + "value": 9 + } + }, + "55b9873ce1f34c169ecc6087c3cd65a1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "68eb811a52804887bc383e89a72a0975": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c0c37c0262b84e9ebf02c1ce17f263ee": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_68eb811a52804887bc383e89a72a0975", + "placeholder": "​", + "style": "IPY_MODEL_55b9873ce1f34c169ecc6087c3cd65a1", + "value": " 9/9 [00:00<00:00, 220.49it/s]" + } + }, + "ca821137125b45d08e257f95822a6f72": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "cc7d3125eb55461180566d1064eeb2a5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "df5c199339f5467a91453fa187e201f0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_257dbf0b62624667b0c82afaf1c8ccf1", + "IPY_MODEL_4e76bef9228546fd97cccfe7bdd856f3", + "IPY_MODEL_c0c37c0262b84e9ebf02c1ce17f263ee" + ], + "layout": "IPY_MODEL_ca821137125b45d08e257f95822a6f72" + } + }, + "e2b5f84c30de45d29588a07a3d106eb4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fb81f32569c34250b901235698e5ea18": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb b/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb new file mode 100644 index 0000000000..3da4aee37b --- /dev/null +++ b/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb @@ -0,0 +1,2906 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Processing PDFs using Data Prep Kit (Ray version)\n", + 
"\n", + " [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb)\n", + "\n", + "This notebook will introduce DPK and showcase some of it's capabilities.\n", + "\n", + "Here is the workflow:\n", + "\n", + "- pdf2parquet: Extract text from PDF documents\n", + "- docid: compute hashes\n", + "- exact dedupe : filter out identical documents\n", + "- fuzzy dedupe : filter out 'near duplicates'\n", + "- document quality: scoring documents for quality\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev//examples/notebooks/pdf-processing-1/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf-processing-1/pdf_processing_1_ray.ipynb)\n", + "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook will work as in both environments" + ] + }, + { + "cell_type": "markdown", + "id": "25ef1be4", + "metadata": {}, + "source": [ + "## Step-1: Figure out Runtime Environment\n", + "\n", + "### 1.1 - Determine runtime\n", + "\n", + "Determine if we are running on Google colab or local python environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "13c97768", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "df9594f1", + "metadata": {}, + "source": [ + "### 1.2 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "dc538bc3", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " ! 
pip install --default-timeout=100 \\\n", + " data-prep-toolkit-transforms[ray,all]==1.0.0 \\\n", + " humanfriendly" + ] + }, + { + "cell_type": "markdown", + "id": "a34c5175", + "metadata": {}, + "source": [ + "### 1.3 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure restart runtime, so libraries will be loaded\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)" + ] + }, + { + "cell_type": "markdown", + "id": "113ed1a3", + "metadata": {}, + "source": [ + "## Step-2: Configuration & Utils" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d4f57ff5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "970e692b", + "metadata": {}, + "source": [ + "### 2.2 - Setup input/outpur directories" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "74ed9531", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " input_dir = \"input\"\n", + " shutil.os.makedirs(input_dir, exist_ok=True)\n", + "else:\n", + " input_dir = \"../../data-files/pdf-processing-1/\"\n", + "\n", + "output_dir = \"output\"\n", + "\n", + "output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')\n", + "output_docid_dir = os.path.join (output_dir, '02_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (output_dir, '03_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (output_dir, '04_fuzzy_dedupe_out')\n", + "output_doc_quality_dir = os.path.join (output_dir, '05_doc_quality_out')\n", + "output_final_dir = os.path.join (output_dir, 'output_final')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(output_dir, ignore_errors=True)\n", + "shutil.os.makedirs(output_dir, exist_ok=True)\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "3a3bf77f", + "metadata": {}, + "source": [ + "### 2.3 - Runtime Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "991f58d9", + "metadata": {}, + "outputs": [], + "source": [ + "from data_processing.utils import GB\n", + "\n", + "CONFIG_RAY_NUM_CPUS = 1 # CPUs per worker\n", + "CONFIG_RAY_MEMORY = 2 * GB # memory per worker\n", + "CONFIG_RAY_RUNTIME_WORKERS = 2" + ] + }, + { + "cell_type": "markdown", + "id": "f40af9e1", + "metadata": {}, + "source": [ + "### 2.4 - Handy Utils" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "df47deb1", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import requests\n", + "from humanfriendly import format_size\n", + "import pandas as pd\n", + "import glob\n", + "\n", + "## Reads parquet files in a folder into a pandas dataframe\n", + "def read_parquet_files_as_df (parquet_dir):\n", + " parquet_files = glob.glob(f'{parquet_dir}/*.parquet')\n", + " # read each parquet file into a DataFrame and store in a list\n", + " dfs = [pd.read_parquet (f) for f in parquet_files]\n", + " dfs = [df for df in dfs if not df.empty] # filter out 
empty dataframes\n", + " # Concatenate all DataFrames into a single DataFrame\n", + " if len(dfs) > 0:\n", + " data_df = pd.concat(dfs, ignore_index=True)\n", + " return data_df\n", + " else:\n", + " return pd.DataFrame() # return empty df\n", + "# ------------\n", + "\n", + "\n", + "def download_file(url, local_file, chunk_size=1024*1024):\n", + " \"\"\"\n", + " Downloads a remote URL to a local file.\n", + "\n", + " Args:\n", + " url (str): The remote URL.\n", + " local_filename (str): The name of the local file to save the downloaded content.\n", + " chunk_size (int): The size in bytes of each chunk. Defaults to 1024.\n", + "\n", + " Returns:\n", + " None\n", + "\n", + " Example usage:\n", + " download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024) # Download in chunks of 1MB\n", + " \"\"\"\n", + " # Check if the local file already exists\n", + " if os.path.exists(local_file):\n", + " file_size = format_size(os.path.getsize(local_file))\n", + " print(f\"Local file '{local_file}' ({file_size}) already exists. Skipping download.\")\n", + " return\n", + "\n", + " # Create the directory if it doesn't exist\n", + " os.makedirs(os.path.dirname(local_file), exist_ok=True)\n", + "\n", + " # Stream the file download\n", + " with requests.get(url, stream=True) as r:\n", + " r.raise_for_status()\n", + " with open(local_file, 'wb') as f:\n", + " for chunk in r.iter_content(chunk_size=chunk_size):\n", + " if chunk: # filter out keep-alive new chunks\n", + " f.write(chunk)\n", + " print()\n", + " file_size = format_size(os.path.getsize(local_file))\n", + " print(f\"{local_file} ({file_size}) downloaded successfully.\")\n", + "## --- end: download_file ------\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "f5be5e73", + "metadata": {}, + "source": [ + "## Step-3: Inspect the Data\n", + "\n", + "We will use simple PDFs. 
The files are [here](https://github.com/IBM/data-prep-kit/tree/dev/examples/data-files/pdf-processing-1/)\n", + "\n", + "- [earth.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf) and exact duplicate [earth-copy.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf)\n", + "- [earth2.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf) almost similar to earth.pdf (ONE word difference!)\n", + "- [mars.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf)\n", + "- [spam.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf) - contains spammy contents\n", + "- [lorem-ipsum.pdf](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf) - contains 'lorem ipsum' placeholder\n" + ] + }, + { + "cell_type": "markdown", + "id": "b20947ae", + "metadata": {}, + "source": [ + "### 3.1 -Download Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4cc5e1f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using input files from : ../../data-files/pdf-processing-1/\n" + ] + } + ], + "source": [ + "if RUNNING_IN_COLAB:\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth.pdf', os.path.join(input_dir, 'earth.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth-copy.pdf', os.path.join(input_dir, 'earth-copy.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/earth2.pdf', os.path.join(input_dir, 'earth2.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/mars.pdf', os.path.join(input_dir, 'mars.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/spam.pdf', os.path.join(input_dir, 'spam.pdf'))\n", + "\n", + " download_file ('https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/data-files/pdf-processing-1/lorem-ipsum.pdf', os.path.join(input_dir, 'lorem-ipsum.pdf'))\n", + "else:\n", + " print ('Using input files from : ', input_dir)" + ] + }, + { + "cell_type": "markdown", + "id": "06fef91e", + "metadata": {}, + "source": [ + "## Step-4: Extract Data from PDF (pdf2parquet)\n", + "\n", + "This step we will read PDF files and extract the text data.\n", + "\n", + "[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)\n", + "\n", + "We use the [Docling package](https://github.com/DS4SD/docling).\n" + ] + }, + { + "cell_type": "markdown", + "id": "b27cc402", + "metadata": {}, + "source": [ + "### 4.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "50f2c6a5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='../../data-files/pdf-processing-1/' --> output='output/01_pdf2pq_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:19:07 INFO - pdf2parquet parameters are : {'batch_size': -1, 
'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 8}\n", + "14:19:07 INFO - pipeline id pipeline_id\n", + "14:19:07 INFO - code location None\n", + "14:19:07 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "14:19:07 INFO - actor creation delay 0\n", + "14:19:07 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:19:07 INFO - data factory data_ is using local data access: input_folder - ../../data-files/pdf-processing-1/ output_folder - output/01_pdf2pq_out\n", + "14:19:07 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:19:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "14:19:07 INFO - Running locally\n", + "2025-02-06 14:19:10,047\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:14 INFO - orchestrator started at 2025-02-06 14:19:14\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:14 INFO - Number of files is 6, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.023715972900390625, 'total_file_size': 0.2709054946899414}\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:14 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.187278747558594, 'object_store': 3.593639373779297}\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:14 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(RayTransformFileProcessor pid=3258905)\u001b[0m 14:19:18 INFO - Initializing models\n", + "Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 34505.24it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=3258905)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:27 INFO - Completed 1 files in 0.035 min\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:27 INFO - Completed 2 files in 0.035 min\n", + "\u001b[36m(RayTransformFileProcessor pid=3258906)\u001b[0m 14:19:18 INFO - Initializing models\n", + "Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 21207.16it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=3258906)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:29 INFO - Completed 3 files in 0.066 min\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:29 INFO - Completed 4 files in 0.067 min\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:29 INFO - Completed 4 files (66.667%) in 0.067 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:30 INFO - Completed processing 6 files in 0.093 min\n", + "\u001b[36m(orchestrate pid=3257941)\u001b[0m 14:19:30 INFO - done flushing in 0.001 sec\n", + "14:19:40 INFO - Completed execution in 0.557 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:1 completed successfully\n" + ] + } + ], + "source": [ + "from dpk_pdf2parquet.ray.transform import Pdf2Parquet\n", + "from dpk_pdf2parquet.transform import pdf2parquet_contents_types\n", + "\n", + "STAGE = 1\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_dir}' --> output='{output_pdf2pq_dir}'\\n\", flush=True)\n", + "\n", + "\n", + "result = Pdf2Parquet(input_folder= input_dir,\n", + " output_folder= output_pdf2pq_dir,\n", + " data_files_to_use=['.pdf'],\n", + " pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN, # markdown\n", + " \n", + " # runtime config\n", + " run_locally= True,\n", + " num_cpus= CONFIG_RAY_NUM_CPUS,\n", + " memory= CONFIG_RAY_MEMORY,\n", + " runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "159a5d67", + "metadata": {}, + "source": [ + "### 4.2 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "82f04cd9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/01_pdf2pq_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filename
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1028dc8970e-215a-44fe-a7bf-946c03f36c606571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T14:19:29.4089101.912304lorem-ipsum.pdf
1spam.pdfFree xxx1029ac78463-b325-406b-891e-c9e84722eb3410026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T14:19:30.9864641.573836spam.pdf
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T14:19:29.3352711.850426earth2.pdf
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T14:19:30.9506731.612200mars.pdf
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4704092.071769earth-copy.pdf
5earth.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...101118d940f3-f4b4-46ac-9147-077675aead1d14711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4925742.093768earth.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "5 earth.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "5 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 8dc8970e-215a-44fe-a7bf-946c03f36c60 6571294142213095721 pdf \n", + "1 9ac78463-b325-406b-891e-c9e84722eb34 10026122586747302274 pdf \n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "5 18d940f3-f4b4-46ac-9147-077675aead1d 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2025-02-06T14:19:29.408910 1.912304 lorem-ipsum.pdf \n", + "1 2025-02-06T14:19:30.986464 1.573836 spam.pdf \n", + "2 2025-02-06T14:19:29.335271 1.850426 earth2.pdf \n", + "3 2025-02-06T14:19:30.950673 1.612200 mars.pdf \n", + "4 2025-02-06T14:19:27.470409 2.071769 earth-copy.pdf \n", + "5 2025-02-06T14:19:27.492574 2.093768 earth.pdf " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print (\"Displaying contents of : \", output_pdf2pq_dir)\n", + "output_df = read_parquet_files_as_df(output_pdf2pq_dir)\n", + "# print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "output_df.head(10)\n", + "\n", + "## To display certain columns\n", + "#parquet_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "56232298", + "metadata": {}, + "source": [ + "\n", + "### 4.3 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : text\n", + "- **document_id**: unique id (UUID) assignd to this document\n", + "- **document_hash**: hash of documents\n", + "- **hash** : hash of `contents` column\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "**Note: you should notice the hash values are identical for the duplicate documents**\n", + "\n", + "Let's inspect the **contents** column." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "4bcc03dc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "## Earth\n", + "\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "\n", + "For more details about our Solar system see Chapter 1.\n", + "\n", + "## Earth\n", + "\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "\n", + "Basic facts about Earth:\n", + "\n", + "- · Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "- · Moons: One moon, called Luna or simply \"the Moon\".\n", + "- · Rotation Period: 24 hours (one day)\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'earth.pdf'].iloc[0,]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "9d07a30e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Free xxx\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'spam.pdf'].iloc[0,]['contents'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "866857df", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lorem ipsum Lorem ipsum Lorem ipsum\n" + ] + } + ], + "source": [ + "print (output_df[output_df['filename'] == 'lorem-ipsum.pdf'].iloc[0,]['contents'])" + ] + }, + { + "cell_type": "markdown", + "id": "270f1673", + "metadata": {}, + "source": [ + "## Step-5: Create DOC ID for Documents\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. 
To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.\n", + "\n", + "**This step is a pre-requisite for fuzzy dedup** in the pipeline.\n", + "\n", + "[DocID documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/doc_id)" + ] + }, + { + "cell_type": "markdown", + "id": "32478bb0", + "metadata": {}, + "source": [ + "### 5.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "9b0f613b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_pdf2pq_out' --> output='output/02_docid_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:19:42 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'doc_hash', 'int_column': 'int_id_column', 'start_id': 0}\n", + "14:19:42 INFO - pipeline id pipeline_id\n", + "14:19:42 INFO - code location None\n", + "14:19:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "14:19:42 INFO - actor creation delay 0\n", + "14:19:42 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:19:42 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_docid_out\n", + "14:19:42 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:19:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:19:42 INFO - Running locally\n", + "2025-02-06 14:19:43,706\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:45 INFO - orchestrator started at 2025-02-06 14:19:45\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:45 INFO - Number of files is 6, source profile {'max_file_size': 0.010061264038085938, 'min_file_size': 0.0055408477783203125, 'total_file_size': 0.04969310760498047}\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:45 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.290360260754824, 'object_store': 3.6451801294460893}\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:45 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed 1 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed 2 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed 3 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed 4 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed 4 files (66.667%) in 0.004 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - Completed processing 6 files in 0.005 min\n", + "\u001b[36m(orchestrate pid=3259648)\u001b[0m 14:19:46 INFO - done flushing in 0.001 sec\n", + "14:19:56 INFO - Completed execution in 0.234 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 115 ms, sys: 137 ms, total: 251 ms\n", + "Wall time: 15.3 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_doc_id.ray.transform import DocID\n", + "\n", + "STAGE = 2\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_pdf2pq_dir}' --> output='{output_docid_dir}'\\n\", flush=True)\n", + "\n", + "result = DocID(input_folder= output_pdf2pq_dir,\n", + " output_folder= output_docid_dir,\n", + " doc_id_doc_column= \"contents\",\n", + " doc_id_hash_column= \"doc_hash\",\n", + " # doc_id_int_column= \"doc_id_int\",\n", + " doc_id_int_column= \"int_id_column\",\n", + " \n", + " # runtime config\n", + " run_locally= True,\n", + " num_cpus= CONFIG_RAY_NUM_CPUS,\n", + " memory= CONFIG_RAY_MEMORY,\n", + " runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + " \n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "af2de0e5", + "metadata": {}, + "source": [ + "### 5.2 - Inspect Generated output\n", + "\n", + "You would see a new columns **doc_hash** and **int_id_column**" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "38b6e1cc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/02_docid_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_column
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1028dc8970e-215a-44fe-a7bf-946c03f36c606571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T14:19:29.4089101.912304lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3
1spam.pdfFree xxx1029ac78463-b325-406b-891e-c9e84722eb3410026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T14:19:30.9864641.573836spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T14:19:29.3352711.850426earth2.pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...2
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T14:19:30.9506731.612200mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4704092.071769earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...1
5earth.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...101118d940f3-f4b4-46ac-9147-077675aead1d14711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4925742.093768earth.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...0
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "5 earth.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "5 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 8dc8970e-215a-44fe-a7bf-946c03f36c60 6571294142213095721 pdf \n", + "1 9ac78463-b325-406b-891e-c9e84722eb34 10026122586747302274 pdf \n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "5 18d940f3-f4b4-46ac-9147-077675aead1d 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T14:19:29.408910 1.912304 lorem-ipsum.pdf \n", + "1 2025-02-06T14:19:30.986464 1.573836 spam.pdf \n", + "2 2025-02-06T14:19:29.335271 1.850426 earth2.pdf \n", + "3 2025-02-06T14:19:30.950673 1.612200 mars.pdf \n", + "4 2025-02-06T14:19:27.470409 2.071769 earth-copy.pdf \n", + "5 2025-02-06T14:19:27.492574 2.093768 earth.pdf \n", + "\n", + " doc_hash int_id_column \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 2 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 1 \n", + "5 6140cf695f269a3ddca6568536076756105ad3186086b2... 
0 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print (\"Displaying contents of : \", output_docid_dir)\n", + "output_df = read_parquet_files_as_df(output_docid_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "141f7cf1", + "metadata": {}, + "source": [ + "## Step-6: Eliminate Duplicate Documents\n", + "\n", + "We have 2 exact duplicates: **earth.pdf** , **earth-copy.pdf**\n", + "\n", + "Note how **doc_hash** for these documents are the same.\n", + "\n", + "[Exact dedupe information](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/ededup)" + ] + }, + { + "cell_type": "markdown", + "id": "eb74af84", + "metadata": {}, + "source": [ + "### 6.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "48beaa13", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_docid_out' --> output='output/03_exact_dedupe_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:19:57 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'doc_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "14:19:57 INFO - pipeline id pipeline_id\n", + "14:19:57 INFO - code location None\n", + "14:19:57 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "14:19:57 INFO - actor creation delay 0\n", + "14:19:57 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:19:57 INFO - data factory data_ is using local data access: input_folder - output/02_docid_out output_folder - output/03_exact_dedupe_out\n", + "14:19:57 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:19:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:19:57 INFO - Running locally\n", + "2025-02-06 14:19:58,746\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:00 INFO - orchestrator started at 2025-02-06 14:20:00\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:00 INFO - Number of files is 6, source profile {'max_file_size': 0.01116180419921875, 'min_file_size': 0.006641387939453125, 'total_file_size': 0.056290626525878906}\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:00 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.283956146799028, 'object_store': 3.6419780729338527}\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:00 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed 1 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed 3 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed 4 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed 4 files (66.667%) in 0.003 min. 
Waiting for completion\n",
+ "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - Completed processing 6 files in 0.003 min\n",
+ "\u001b[36m(orchestrate pid=3261174)\u001b[0m 14:20:01 INFO - done flushing in 0.001 sec\n",
+ "14:20:11 INFO - Completed execution in 0.225 min, execution result 0\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "✅ Stage:3 completed successfully\n",
+ "CPU times: user 98.9 ms, sys: 129 ms, total: 228 ms\n",
+ "Wall time: 15 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "from dpk_ededup.ray.transform import Ededup\n",
+ "\n",
+ "STAGE = 3\n",
+ "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_docid_dir}' --> output='{output_exact_dedupe_dir}'\\n\", flush=True)\n",
+ "\n",
+ "result = Ededup(input_folder=output_docid_dir,\n",
+ " output_folder=output_exact_dedupe_dir,\n",
+ " ededup_doc_column=\"contents\",\n",
+ " ededup_doc_id_column=\"doc_hash\",\n",
+ " ededup_num_hashes= 2,\n",
+ " \n",
+ " # runtime config\n",
+ " run_locally= True,\n",
+ " num_cpus= CONFIG_RAY_NUM_CPUS,\n",
+ " memory= CONFIG_RAY_MEMORY,\n",
+ " runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,\n",
+ " ).transform()\n",
+ "\n",
+ "if result == 0:\n",
+ " print (f\"✅ Stage:{STAGE} completed successfully\")\n",
+ "else:\n",
+ " raise Exception (f\"❌ Stage:{STAGE} failed\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9d93e16",
+ "metadata": {},
+ "source": [
+ "### 6.2 - Inspect Generated Output\n",
+ "\n",
+ "You can see that one of the identical documents (**earth.pdf** / **earth-copy.pdf**) has been eliminated."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "ef98911d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Input files before exact dedupe : 6\n",
+ "Output files after exact dedupe : 5\n",
+ "Duplicate files removed : 1\n",
+ "Displaying contents of : output/03_exact_dedupe_out\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_columnremoved
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1028dc8970e-215a-44fe-a7bf-946c03f36c606571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T14:19:29.4089101.912304lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3[]
1spam.pdfFree xxx1029ac78463-b325-406b-891e-c9e84722eb3410026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T14:19:30.9864641.573836spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5[]
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T14:19:29.3352711.850426earth2.pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...2[]
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T14:19:30.9506731.612200mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4[]
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4704092.071769earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...1[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 8dc8970e-215a-44fe-a7bf-946c03f36c60 6571294142213095721 pdf \n", + "1 9ac78463-b325-406b-891e-c9e84722eb34 10026122586747302274 pdf \n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T14:19:29.408910 1.912304 lorem-ipsum.pdf \n", + "1 2025-02-06T14:19:30.986464 1.573836 spam.pdf \n", + "2 2025-02-06T14:19:29.335271 1.850426 earth2.pdf \n", + "3 2025-02-06T14:19:30.950673 1.612200 mars.pdf \n", + "4 2025-02-06T14:19:27.470409 2.071769 earth-copy.pdf \n", + "\n", + " doc_hash int_id_column removed \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 [] \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 [] \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 2 [] \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 [] \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 1 [] " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "input_df = read_parquet_files_as_df(output_docid_dir)\n", + "output_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "print (\"Displaying contents of : \", output_exact_dedupe_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "1cedeca2", + "metadata": {}, + "source": [ + "## Step-7: Fuzzy Dedupe\n", + "\n", + "In previous step, we removed **exact duplicates (identical documents)**.\n", + "\n", + "Fuzzy de-dupe can further filter out documents that are **not exactly identical, but nearly identical**\n", + "\n", + "Here is a simple example:\n", + "\n", + "`Our solar system is a vast and fascinating expanse`\n", + "\n", + "`The solar system is a vast and fascinating expanse`\n", + "\n", + "Only one word is different `Our` vs `The`.\n", + "\n", + "Imagine two documents with one extra blank line. 
For our purposes they are the same.\n",
+ "\n",
+ "[Fuzzy dedupe documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/fdedup)\n",
+ "\n",
+ "### Tweaking fuzzy matches\n",
+ "\n",
+ "**`jaccard_similarity_threshold`** is the parameter that controls how similar two documents must be before they are treated as duplicates. Its value is between 0 and 1.0. Values close to 1.0 mean stricter checking (fewer documents will qualify). A lower threshold means more lenient matching (more documents will qualify).\n",
+ "\n",
+ "Adjust this value to find what works for your documents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3f21d132",
+ "metadata": {},
+ "source": [
+ "### 7.1 - Execute"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "f6430f24",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "🏃🏼 STAGE-4: Processing input='output/03_exact_dedupe_out' --> output='output/04_fuzzy_dedupe_out'\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "14:20:12 INFO - Starting SignatureCalculation step\n",
+ "14:20:12 INFO - Got parameters for SignatureCalculation\n",
+ "14:20:12 INFO - minhash parameters are : {'document_id_column': 'int_id_column', 'contents_column': 'contents', 'seed': 42, 'num_permutations': 112, 'jaccard_similarity_threshold': 0.9, 'word_shingle_size': 5, 'num_bands': 14, 'num_minhashes_per_band': 8, 'num_segments': 1, 'shingle_option': 'word'}\n",
+ "14:20:12 INFO - data factory scdata_ is using local configuration without input/output path\n",
+ "14:20:12 INFO - data factory scdata_ max_files -1, n_sample -1\n",
+ "14:20:12 INFO - data factory scdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
+ "14:20:12 INFO - pipeline id pipeline_id\n",
+ "14:20:12 INFO - code location None\n",
+ "14:20:12 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
+ "14:20:12 INFO - actor creation delay 0\n",
+ "14:20:12 INFO - job details {'job category': 'preprocessing', 'job name': 'minhash', 'job type': 'ray', 'job id': 'job_id'}\n",
+ "14:20:12 INFO - data factory data_ is using local data access: input_folder - output/03_exact_dedupe_out output_folder - output/04_fuzzy_dedupe_out\n",
+ "14:20:12 INFO - data factory data_ max_files -1, n_sample -1\n",
+ "14:20:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
+ "14:20:12 INFO - Running locally\n",
+ "2025-02-06 14:20:13,822\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:15 INFO - orchestrator started at 2025-02-06 14:20:15\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:15 INFO - Number of files is 6, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.003223419189453125, 'total_file_size': 0.050751686096191406}\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:15 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.180192566476762, 'object_store': 3.59009628277272}\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:15 INFO - Number of workers - 3 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - Completed 1 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - Completed 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - Completed 3 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - Completed 3 files (50.0%) in 0.003 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - Completed processing 6 files in 0.003 min\n", + "\u001b[36m(RayTransformFileProcessor pid=3263786)\u001b[0m 14:20:16 INFO - Starting flush()\n", + "\u001b[36m(RayTransformFileProcessor pid=3263785)\u001b[0m 14:20:16 WARNING - table is empty, skipping processing\n", + "\u001b[36m(orchestrate pid=3262907)\u001b[0m 14:20:16 INFO - done flushing in 0.03 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=3263786)\u001b[0m 14:20:16 INFO - Wrote 14 tables with a total size of 13,440 bytes\n", + "14:20:26 INFO - Completed execution in 0.227 min, execution result 0\n", + "\u001b[36m(RayTransformFileProcessor pid=3263785)\u001b[0m 14:20:16 INFO - Starting flush()\u001b[32m [repeated 2x across cluster] (Ray deduplicates logs by default. 
Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)\u001b[0m\n", + "\u001b[36m(RayTransformFileProcessor pid=3263785)\u001b[0m 14:20:16 INFO - Wrote 14 tables with a total size of 13,440 bytes\u001b[32m [repeated 2x across cluster]\u001b[0m\n", + "14:20:27 INFO - SignatureCalculation completed successfully\n", + "14:20:27 INFO - Starting ClusterAnalysis step\n", + "14:20:27 INFO - Got parameters for ClusterAnalysis\n", + "14:20:27 INFO - cluster parameters are : {'jaccard_similarity_threshold': 0.9, 'num_bands': 14, 'num_segments': 1, 'sort_output': False}\n", + "14:20:27 INFO - pipeline id pipeline_id\n", + "14:20:27 INFO - code location None\n", + "14:20:27 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "14:20:27 INFO - actor creation delay 0\n", + "14:20:27 INFO - job details {'job category': 'preprocessing', 'job name': 'cluster', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:20:27 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out/bands output_folder - output/04_fuzzy_dedupe_out/docs_to_remove\n", + "14:20:27 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:20:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:20:27 INFO - Running locally\n", + "2025-02-06 14:20:28,857\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:30 INFO - orchestrator started at 2025-02-06 14:20:30\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:30 INFO - Number of folders is 14\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:30 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.263226319104433, 'object_store': 3.631613158620894}\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:30 INFO - Number of workers - 3 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 1 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 2 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 3 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 4 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 5 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 6 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 7 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 8 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 9 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 10 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 11 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed 11 files (78.571%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - Completed processing 14 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3264596)\u001b[0m 14:20:31 INFO - done flushing in 0.001 sec\n", + "14:20:41 INFO - Completed execution in 0.223 min, execution result 0\n", + "14:20:42 INFO - ClusterAnalysis completed successfully\n", + "14:20:42 INFO - Starting GetDuplicateList step\n", + "14:20:42 INFO - Got parameters for GetDuplicateList\n", + "14:20:42 INFO - fdlist parameters are : {'docs_to_remove': 'docs_to_remove', 'consolidated_filename': 'docs_to_remove_consolidated/docs_to_remove_consolidated.parquet', 'sort_output': False}\n", + "14:20:42 INFO - pipeline id pipeline_id\n", + "14:20:42 INFO - code location None\n", + "14:20:42 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "14:20:42 INFO - actor creation delay 0\n", + "14:20:42 INFO - job details {'job category': 'preprocessing', 'job name': 'fdlist', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:20:42 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out output_folder - output/04_fuzzy_dedupe_out\n", + "14:20:42 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:20:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:20:42 INFO - Running locally\n", + "2025-02-06 14:20:43,486\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:44 INFO - orchestrator started at 2025-02-06 14:20:44\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:44 INFO - Number of folders is 1\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:44 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.259747315198183, 'object_store': 3.629873656667769}\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:44 INFO - Number of workers - 1 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:45 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:45 INFO - Completed processing 1 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=3266161)\u001b[0m 14:20:45 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=3267037)\u001b[0m 14:20:45 INFO - Get Duplicate List for folder docs_to_remove\n", + "\u001b[36m(RayTransformFileProcessor pid=3267037)\u001b[0m 14:20:45 INFO - 0 documents marked as duplicates\n", + "14:20:55 INFO - Completed execution in 0.222 min, execution result 0\n", + "14:20:57 INFO - GetDuplicateList completed successfully\n", + "14:20:57 INFO - Starting DataCleaning step\n", + "14:20:57 INFO - Got parameters for DataCleaning\n", + "14:20:57 INFO - fdclean parameters are : {'document_id_column': 'int_id_column', 'duplicate_list_location': 'docs_to_remove_consolidated/docs_to_remove_consolidated.parquet', 'operation_mode': 'filter_duplicates'}\n", + "14:20:57 INFO - data factory dcdata_ is using local configuration without input/output path\n", + "14:20:57 INFO - data factory dcdata_ max_files -1, n_sample -1\n", + "14:20:57 INFO - data factory dcdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:20:57 INFO - pipeline id pipeline_id\n", + "14:20:57 INFO - code location None\n", + "14:20:57 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "14:20:57 INFO - actor creation delay 0\n", + "14:20:57 INFO - job details {'job category': 'preprocessing', 'job name': 'fdclean', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:20:57 INFO - data factory data_ is using local data access: input_folder - output/03_exact_dedupe_out output_folder - output/04_fuzzy_dedupe_out/cleaned\n", + "14:20:57 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:20:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:20:57 INFO - Running locally\n", + "2025-02-06 14:20:58,292\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:20:59 INFO - orchestrator started at 2025-02-06 14:20:59\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:20:59 INFO - Number of files is 6, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.003223419189453125, 'total_file_size': 0.050751686096191406}\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:20:59 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.28473205678165, 'object_store': 3.642366027459502}\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:20:59 INFO - Number of workers - 3 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - Completed 1 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - Completed 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - Completed 3 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - Completed 3 files (50.0%) in 0.003 min. 
Waiting for completion\n",
+ "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - Completed processing 6 files in 0.003 min\n",
+ "\u001b[36m(orchestrate pid=3267588)\u001b[0m 14:21:00 INFO - done flushing in 0.001 sec\n",
+ "\u001b[36m(RayTransformFileProcessor pid=3268467)\u001b[0m 14:21:00 WARNING - table is empty, skipping processing\n",
+ "14:21:10 INFO - Completed execution in 0.226 min, execution result 0\n",
+ "14:21:12 INFO - DataCleaning completed successfully\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 558 ms, sys: 526 ms, total: 1.08 s\n",
+ "Wall time: 59.5 s\n"
+ ]
+ }
+ ],
+ "source": [
+ "%%time\n",
+ "\n",
+ "from dpk_fdedup.ray.transform import Fdedup\n",
+ "\n",
+ "STAGE = 4\n",
+ "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_fuzzy_dedupe_dir}'\\n\", flush=True)\n",
+ "\n",
+ "result = Fdedup(input_folder=output_exact_dedupe_dir,\n",
+ " output_folder=output_fuzzy_dedupe_dir,\n",
+ " contents_column= \"contents\",\n",
+ " # document_id_column= \"doc_id\",\n",
+ " document_id_column= \"int_id_column\",\n",
+ " num_permutations= 112,\n",
+ " num_bands= 14,\n",
+ " num_minhashes_per_band= 8,\n",
+ " jaccard_similarity_threshold = 0.9, # between 0 - 1. higher means more strict checking\n",
+ " operation_mode=\"filter_duplicates\",\n",
+ " # operation_mode=\"annotate\",\n",
+ " \n",
+ " # runtime config\n",
+ " run_locally= True,\n",
+ " ).transform()\n",
+ "\n",
+ "# if result == 0:\n",
+ "# print (f\"✅ Stage:{STAGE} completed successfully\")\n",
+ "# else:\n",
+ "# raise Exception (f\"❌ Stage:{STAGE} failed (result={result})\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "037d3974",
+ "metadata": {},
+ "source": [
+ "### 7.2 - Inspect Output\n",
+ "\n",
+ "FuzzyDedupe writes the documents that survive filtering to the **output/04_fuzzy_dedupe_out/cleaned** folder.\n",
+ "\n",
+ "In this run no near duplicates were removed: at the strict threshold of 0.9, **earth2.pdf** and **earth-copy.pdf** did not score as near duplicates (see the counts below). Lower the threshold if you want looser matches to be filtered out as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "d59496f0",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Input files before fuzzy dedupe : 5\n",
+ "Output files after fuzzy dedupe : 5\n",
+ "Near duplicate files removed : 0\n",
+ "Displaying contents of : output/04_fuzzy_dedupe_out\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenamedoc_hashint_id_columnremoved
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1028dc8970e-215a-44fe-a7bf-946c03f36c606571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...352025-02-06T14:19:29.4089101.912304lorem-ipsum.pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...3[]
1spam.pdfFree xxx1029ac78463-b325-406b-891e-c9e84722eb3410026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...82025-02-06T14:19:30.9864641.573836spam.pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...5[]
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...6102025-02-06T14:19:29.3352711.850426earth2.pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...2[]
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...7172025-02-06T14:19:30.9506731.612200mars.pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...4[]
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...6102025-02-06T14:19:27.4704092.071769earth-copy.pdf6140cf695f269a3ddca6568536076756105ad3186086b2...1[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 8dc8970e-215a-44fe-a7bf-946c03f36c60 6571294142213095721 pdf \n", + "1 9ac78463-b325-406b-891e-c9e84722eb34 10026122586747302274 pdf \n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "\n", + " hash size \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2025-02-06T14:19:29.408910 1.912304 lorem-ipsum.pdf \n", + "1 2025-02-06T14:19:30.986464 1.573836 spam.pdf \n", + "2 2025-02-06T14:19:29.335271 1.850426 earth2.pdf \n", + "3 2025-02-06T14:19:30.950673 1.612200 mars.pdf \n", + "4 2025-02-06T14:19:27.470409 2.071769 earth-copy.pdf \n", + "\n", + " doc_hash int_id_column removed \n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 3 [] \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 5 [] \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 2 [] \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 4 [] \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 1 [] " + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "input_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "output_df = read_parquet_files_as_df(os.path.join(output_fuzzy_dedupe_dir, \"cleaned\"))\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Near duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "print (\"Displaying contents of : \", output_fuzzy_dedupe_dir)\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "c3e4f860", + "metadata": {}, + "source": [ + "## Step-8: Document Quality\n", + "\n", + "This handy plugin will score documents across many metrics.\n", + "\n", + "Here we will look for 'bad words' metric.\n", + "\n", + "[Document quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)\n", + "\n", + "By default it uses [bad words collection](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/dpk_doc_quality/ldnoobw). 
You can supply a custom file by passing an argument `bad_word_filepath=/path/to/badwords_file`" + ] + }, + { + "cell_type": "markdown", + "id": "144a0fff", + "metadata": {}, + "source": [ + "### 8.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "63140942", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/04_fuzzy_dedupe_out/cleaned' --> output='output/05_doc_quality_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "14:21:12 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", + "14:21:12 INFO - data factory docq_ is using local configuration without input/output path\n", + "14:21:12 INFO - data factory docq_ max_files -1, n_sample -1\n", + "14:21:12 INFO - data factory docq_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:21:12 INFO - pipeline id pipeline_id\n", + "14:21:12 INFO - code location None\n", + "14:21:12 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "14:21:12 INFO - actor creation delay 0\n", + "14:21:12 INFO - job details {'job category': 'preprocessing', 'job name': 'docq', 'job type': 'ray', 'job id': 'job_id'}\n", + "14:21:12 INFO - data factory data_ is using local data access: input_folder - output/04_fuzzy_dedupe_out/cleaned output_folder - output/05_doc_quality_out\n", + "14:21:12 INFO - data factory data_ max_files -1, n_sample -1\n", + "14:21:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "14:21:12 INFO - Running locally\n", + "2025-02-06 14:21:13,443\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:14 INFO - orchestrator started at 2025-02-06 14:21:14\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:14 INFO - Number of files is 5, source profile {'max_file_size': 0.011510848999023438, 'min_file_size': 0.0069904327392578125, 'total_file_size': 0.04752826690673828}\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:14 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.264469146728516, 'object_store': 3.632234573364258}\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:14 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(RayTransformFileProcessor pid=3270111)\u001b[0m 14:21:15 INFO - Load badwords found locally from /home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - Completed 1 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - Completed 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - Completed 3 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - Completed 3 files (60.0%) in 0.003 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - Completed processing 5 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=3269230)\u001b[0m 14:21:16 INFO - done flushing in 0.001 sec\n", + "14:21:26 INFO - Completed execution in 0.227 min, execution result 0\n", + "\u001b[36m(RayTransformFileProcessor pid=3270112)\u001b[0m 14:21:15 INFO - Load badwords found locally from /home/sujee/apps/anaconda3/envs/dpk-6-pdf-processing-r1.0.0-all-py3.11/lib/python3.11/site-packages/dpk_doc_quality/ldnoobw/en\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 122 ms, sys: 128 ms, total: 250 ms\n", + "Wall time: 14.9 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_doc_quality.ray.transform import DocQuality\n", + "\n", + "STAGE = 5\n", + "output_fuzzy_dedupe_cleaned_dir = os.path.join(output_fuzzy_dedupe_dir, \"cleaned\")\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_fuzzy_dedupe_cleaned_dir}' --> output='{output_doc_quality_dir}'\\n\", flush=True)\n", + "\n", + "result = DocQuality(input_folder=output_fuzzy_dedupe_cleaned_dir,\n", + " output_folder= output_doc_quality_dir,\n", + " docq_text_lang = \"en\",\n", + " docq_doc_content_column =\"contents\",\n", + " \n", + " # runtime config\n", + " run_locally= True,\n", + " num_cpus= CONFIG_RAY_NUM_CPUS,\n", + " memory= CONFIG_RAY_MEMORY,\n", + " runtime_num_workers = CONFIG_RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed (result={result})\")" + ] + }, + { + "cell_type": "markdown", + "id": "1006b475", + "metadata": {}, + "source": [ + "### 8.2 - Inspect the Output\n", + "\n", + "We will see several new columns starting with the name **docq_**.\n", + "\n", + "Look at the column **docq_contain_bad_word**; this will flag documents with 'bad words'.\n", + "\n", + "Also inspect the column **docq_lorem_ipsum_ratio**; this will flag documents with 'lorem ipsum' text\n", + "\n", + "For more information see : [Doc Quality documentation](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "24181587", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Displaying contents of : output/05_doc_quality_out\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsize...docq_mean_word_lendocq_symbol_to_word_ratiodocq_sentence_countdocq_lorem_ipsum_ratiodocq_curly_bracket_ratiodocq_contain_bad_worddocq_bullet_point_ratiodocq_ellipsis_line_ratiodocq_alphabet_word_ratiodocq_contain_common_en_words
0lorem-ipsum.pdfLorem ipsum Lorem ipsum Lorem ipsum1028dc8970e-215a-44fe-a7bf-946c03f36c606571294142213095721pdfbc012d063005cc02deb6c2592d1f8c3b273625edf9eec5...35...5.0000000.00000010.0857140.0False0.0000000.01.000000False
1spam.pdfFree xxx1029ac78463-b325-406b-891e-c9e84722eb3410026122586747302274pdf543ffc97aef373ee009a5f908e0358ef80d329ca7ba964...8...3.5000000.00000010.0000000.0True0.0000000.01.000000False
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...610...4.5412840.02752390.0000000.0False0.1764710.00.880734True
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...717...4.6880000.03200080.0000000.0False0.1764710.00.880000True
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...610...4.5412840.02752390.0000000.0False0.1764710.00.880734True
\n", + "

5 rows × 27 columns

\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 lorem-ipsum.pdf Lorem ipsum Lorem ipsum Lorem ipsum \n", + "1 spam.pdf Free xxx \n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 1 0 2 \n", + "1 1 0 2 \n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "0 8dc8970e-215a-44fe-a7bf-946c03f36c60 6571294142213095721 pdf \n", + "1 9ac78463-b325-406b-891e-c9e84722eb34 10026122586747302274 pdf \n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "\n", + " hash size ... \\\n", + "0 bc012d063005cc02deb6c2592d1f8c3b273625edf9eec5... 35 ... \n", + "1 543ffc97aef373ee009a5f908e0358ef80d329ca7ba964... 8 ... \n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 ... \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 ... \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 ... \n", + "\n", + " docq_mean_word_len docq_symbol_to_word_ratio docq_sentence_count \\\n", + "0 5.000000 0.000000 1 \n", + "1 3.500000 0.000000 1 \n", + "2 4.541284 0.027523 9 \n", + "3 4.688000 0.032000 8 \n", + "4 4.541284 0.027523 9 \n", + "\n", + " docq_lorem_ipsum_ratio docq_curly_bracket_ratio docq_contain_bad_word \\\n", + "0 0.085714 0.0 False \n", + "1 0.000000 0.0 True \n", + "2 0.000000 0.0 False \n", + "3 0.000000 0.0 False \n", + "4 0.000000 0.0 False \n", + "\n", + " docq_bullet_point_ratio docq_ellipsis_line_ratio \\\n", + "0 0.000000 0.0 \n", + "1 0.000000 0.0 \n", + "2 0.176471 0.0 \n", + "3 0.176471 0.0 \n", + "4 0.176471 0.0 \n", + "\n", + " docq_alphabet_word_ratio docq_contain_common_en_words \n", + "0 1.000000 False \n", + "1 1.000000 False \n", + "2 0.880734 True \n", + "3 0.880000 True \n", + "4 0.880734 True \n", + "\n", + "[5 rows x 27 columns]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df = read_parquet_files_as_df(output_doc_quality_dir)\n", + "print (\"Displaying contents of : \", output_doc_quality_dir)\n", + "output_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "c343b656", + "metadata": {}, + "source": [ + "### 8.3 - Filtering 'quality' documents\n", + "\n", + "So from the output above we see **spam.pdf** is flagged for containing bad words (**docq_contain_bad_word=True**).\n", + "\n", + "Also **lorem.pdf** is flagged for place holder content **lorem ipsum** (**docq_lorem_ipsum_ratio > 0**)\n", + "\n", + "We are going to filter them both out" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4b3dee53", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsize...docq_mean_word_lendocq_symbol_to_word_ratiodocq_sentence_countdocq_lorem_ipsum_ratiodocq_curly_bracket_ratiodocq_contain_bad_worddocq_bullet_point_ratiodocq_ellipsis_line_ratiodocq_alphabet_word_ratiodocq_contain_common_en_words
2earth2.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011b3ed1942-54a6-49fc-bcbc-2d8c438adef310729312978404042321pdff039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4...610...4.5412840.02752390.00.0False0.1764710.00.880734True
3mars.pdf## Mars\\n\\n## Solar System\\n\\nOur solar system...10116d882651-2506-41cb-8704-85575c64b1437758129997476962679pdfa3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e...717...4.6880000.03200080.00.0False0.1764710.00.880000True
4earth-copy.pdf## Earth\\n\\n## Solar System\\n\\nOur solar syste...1011f8ccec16-576c-4e3e-8bec-359dff01d6d214711865278795535908pdf6140cf695f269a3ddca6568536076756105ad3186086b2...610...4.5412840.02752390.00.0False0.1764710.00.880734True
\n", + "

3 rows × 27 columns

\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "2 earth2.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "3 mars.pdf ## Mars\\n\\n## Solar System\\n\\nOur solar system... \n", + "4 earth-copy.pdf ## Earth\\n\\n## Solar System\\n\\nOur solar syste... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "2 1 0 11 \n", + "3 1 0 11 \n", + "4 1 0 11 \n", + "\n", + " document_id document_hash ext \\\n", + "2 b3ed1942-54a6-49fc-bcbc-2d8c438adef3 10729312978404042321 pdf \n", + "3 6d882651-2506-41cb-8704-85575c64b143 7758129997476962679 pdf \n", + "4 f8ccec16-576c-4e3e-8bec-359dff01d6d2 14711865278795535908 pdf \n", + "\n", + " hash size ... \\\n", + "2 f039191d59ce8ba25023a844f9b99e7ef2ea4bf75a23f4... 610 ... \n", + "3 a3a4bb3b8f4f441d6d669e09f0cd07a9420d06850cf63e... 717 ... \n", + "4 6140cf695f269a3ddca6568536076756105ad3186086b2... 610 ... \n", + "\n", + " docq_mean_word_len docq_symbol_to_word_ratio docq_sentence_count \\\n", + "2 4.541284 0.027523 9 \n", + "3 4.688000 0.032000 8 \n", + "4 4.541284 0.027523 9 \n", + "\n", + " docq_lorem_ipsum_ratio docq_curly_bracket_ratio docq_contain_bad_word \\\n", + "2 0.0 0.0 False \n", + "3 0.0 0.0 False \n", + "4 0.0 0.0 False \n", + "\n", + " docq_bullet_point_ratio docq_ellipsis_line_ratio \\\n", + "2 0.176471 0.0 \n", + "3 0.176471 0.0 \n", + "4 0.176471 0.0 \n", + "\n", + " docq_alphabet_word_ratio docq_contain_common_en_words \n", + "2 0.880734 True \n", + "3 0.880000 True \n", + "4 0.880734 True \n", + "\n", + "[3 rows x 27 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "all_docs_df = read_parquet_files_as_df(output_doc_quality_dir)\n", + "\n", + "# remove documents with badwords\n", + "clean_docs_df = all_docs_df[all_docs_df['docq_contain_bad_word'] == False]\n", + "\n", + "# also filter out 'lorem ipsum' text\n", + "clean_docs_df = clean_docs_df[clean_docs_df['docq_lorem_ipsum_ratio'] == 0]\n", + "\n", + "clean_docs_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "5861461a", + "metadata": {}, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "8d1b50f7", + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(output_final_dir, ignore_errors=True)\n", + "shutil.os.makedirs(output_final_dir, exist_ok=True)\n", + "\n", + "output_final_dir_parquet = os.path.join (output_final_dir, 'pq')\n", + "shutil.os.makedirs(output_final_dir_parquet, exist_ok=True)\n", + "\n", + "output_final_dir_markdown = os.path.join (output_final_dir, 'markdown')\n", + "shutil.os.makedirs(output_final_dir_markdown, exist_ok=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "ba897dd9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Saved CLEAN parquet output to 'output/output_final/pq'\n" + ] + } + ], + "source": [ + "## save parquet\n", + "\n", + "clean_docs_df.to_parquet(os.path.join(output_final_dir_parquet, \"clean_docs.parquet\"))\n", + "print (f\"✅ Saved CLEAN parquet output to '{output_final_dir_parquet}'\")" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "867bb0f7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Saved CLEAN markdown output to 'output/output_final/markdown'\n" + ] + } + ], + "source": [ + "## save markdown text\n", + "\n", + "for index, row in 
clean_docs_df.iterrows():\n", + " output_file_name = os.path.join (output_final_dir_markdown, row['filename'] + '.md')\n", + " with open(output_file_name, 'w') as output_file:\n", + " output_file.write(row['contents'])\n", + "\n", + "print (f\"✅ Saved CLEAN markdown output to '{output_final_dir_markdown}'\")\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "dpk-6-pdf-processing-r1.0.0-all-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06107a2f48b3491f91bbe84e46e10ba0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_68997339f13240a4824a9e416096bee4", + "placeholder": "​", + "style": "IPY_MODEL_919b086abd314077bbff75687392bd91", + "value": "" + } + }, + "68997339f13240a4824a9e416096bee4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6c08de2dd9a2402c90b1a7a645db9b13": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "7e13e8779a81400f996d4428c74acfaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_91fff81a1de8487c9009e872b751edb0", + "placeholder": "​", + "style": "IPY_MODEL_ada62d24cbcf4361acbb21808f334d33", + "value": " 0/0 [00:00<?, ?it/s]" + } + }, + "8b7571c585df431eb901fcdebdf8177e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_06107a2f48b3491f91bbe84e46e10ba0", + "IPY_MODEL_bd74356eca18423aa0373c808d9097e3", + "IPY_MODEL_7e13e8779a81400f996d4428c74acfaf" + ], + "layout": "IPY_MODEL_a75892696be546a3970962bae7bf732a" + } + }, + "919b086abd314077bbff75687392bd91": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "91fff81a1de8487c9009e872b751edb0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a75892696be546a3970962bae7bf732a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + 
"height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ada62d24cbcf4361acbb21808f334d33": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "b4c209371e7a403986991a786cfb296d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": "20px" + } + }, + "bd74356eca18423aa0373c808d9097e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b4c209371e7a403986991a786cfb296d", + "max": 1, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_6c08de2dd9a2402c90b1a7a645db9b13", + "value": 0 + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/pdf-processing-1/requirements.txt b/examples/notebooks/pdf-processing-1/requirements.txt new file mode 100644 index 0000000000..ffd42dafe2 --- /dev/null +++ b/examples/notebooks/pdf-processing-1/requirements.txt @@ -0,0 +1,6 @@ +data-prep-toolkit-transforms[ray,all]==1.0.0 + +# jupyter +jupyterlab +ipykernel +ipywidgets diff --git a/examples/notebooks/rag/.gitignore b/examples/notebooks/rag-pdf-1/.gitignore similarity index 91% rename from examples/notebooks/rag/.gitignore rename to examples/notebooks/rag-pdf-1/.gitignore index 832170487a..124e3e8e79 100644 --- a/examples/notebooks/rag/.gitignore +++ 
b/examples/notebooks/rag-pdf-1/.gitignore @@ -1,4 +1,4 @@ -input/ +input*/ output*/ final_output/ storage/ diff --git a/examples/notebooks/rag/RAG-explained.md b/examples/notebooks/rag-pdf-1/RAG-explained.md similarity index 78% rename from examples/notebooks/rag/RAG-explained.md rename to examples/notebooks/rag-pdf-1/RAG-explained.md index 3c96827374..c25dab0068 100644 --- a/examples/notebooks/rag/RAG-explained.md +++ b/examples/notebooks/rag-pdf-1/RAG-explained.md @@ -8,29 +8,31 @@ RAG consists of two phases ![](media/rag-overview-2.png) -### Step 1 (Ingest): Cleanup documents +### Step 1 (Ingest): Extract text from PDFs -Remove markups, perform de-duplication ..etc +We will extract text in markdown format from PDFs. +### Step 2 (Ingest): Perform de-duplication -### Step 2 (Ingest): Split into chunks +Eliminate any duplicate documents. -Split the documents into manageable chunks or segments. There are various chunking strategies. Documents can be split into pages, paragraphs, or sections. The right chunking strategy depends on the document types being processed. +### Step 3 (Ingest): Split into chunks +Split the documents into manageable chunks or segments. There are various chunking strategies. Documents can be split into pages, paragraphs, or sections. The right chunking strategy depends on the document types being processed. -### Step 3 (Ingest): Vectorize / Calculate Embeddings +### Step 4 (Ingest): Vectorize / Calculate Embeddings In order to make text searchable, we need to 'vectorize' it. This is done by using **embedding models**. We will feature a variety of embedding models, both open-source and API-based. -### Step 4 (Ingest). Saving Data into Vector Database +### Step 5 (Ingest): Saving Data into Vector Database In order to effectively retrieve relevant documents, we use [Milvus](https://milvus.io/) - a very popular open-source vector database. -### Step 5 (Query). Vectorize Question +### Step 6 (Query): Vectorize Question When a user asks a question, we vectorize it so we can fetch documents that **may** contain the answer. @@ -40,12 +42,12 @@ So we want to retrieve the relevant documents first. -### Step 6 (Query): Vector Search +### Step 7 (Query): Vector Search We send the 'vectorized query' to the vector database to retrieve the relevant documents. -### Step 7 (Query): Retrieve Relevant Documents +### Step 8 (Query): Retrieve Relevant Documents The vector database takes our query (in vectorized form), searches through the documents, and returns those that match. @@ -54,14 +56,14 @@ This is an important step, because it **cuts down the 'search space'**. For exa The search has to be accurate, as these are the documents sent to the LLM as **'context'**. The LLM will look through these documents, searching for the answer to our question. -### Step 8 (Query): Send relevant documents and query LLM +### Step 9 (Query): Send relevant documents and query LLM We send the relevant documents (returned in the previous step by the vector DB) and our query to the LLM. LLMs can be accessed via an API, or we can run one locally.
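To make the query phase concrete, here is a minimal sketch of steps 6-8 in Python. It assumes `pymilvus` (with a local Milvus Lite file) and `sentence-transformers` are installed; the collection name `docs`, the DB file name, and the embedding model are illustrative placeholders, not the notebooks' actual settings.

```python
# Sketch of the query phase: vectorize the question (step 6), run a vector
# search (step 7), and collect the matching chunks (step 8).
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient("./rag_demo.db")                 # illustrative Milvus Lite file
model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative embedding model

# Step 6: vectorize the question with the SAME model used at ingest time,
# otherwise distances in the vector space are meaningless.
question = "What is this paper's main contribution?"
query_vector = model.encode(question).tolist()

# Steps 7-8: search the (hypothetical) "docs" collection; Milvus returns
# the closest chunks, ranked by distance.
results = client.search(
    collection_name="docs",
    data=[query_vector],
    limit=5,                     # top-5 matches become the LLM 'context'
    output_fields=["text"],
)
for hit in results[0]:
    print(hit["distance"], hit["entity"]["text"][:80])
```

The design point worth remembering: the query embedding and the chunk embeddings must come from the same model, or the nearest-neighbor search returns noise.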
-### Step 9 (Query): Answer from LLM +### Step 10 (Query): Answer from LLM Now we get to see the answer provided by the LLM 👏 diff --git a/examples/notebooks/rag/README.md b/examples/notebooks/rag-pdf-1/README.md similarity index 74% rename from examples/notebooks/rag/README.md rename to examples/notebooks/rag-pdf-1/README.md index 16ffdb15e2..53134aa149 100644 --- a/examples/notebooks/rag/README.md +++ b/examples/notebooks/rag-pdf-1/README.md @@ -24,40 +24,40 @@ Here is the overall workflow. For details see [RAG-explained](./RAG-explained. ![](media/rag-overview-2.png) -## Step-2: Process Input Documents (RAG stage 1, 2 & 3) +## Step-2: Process Input Documents (RAG stages 1, 2, 3 & 4) This code uses DPK to - Extract text from PDFs (RAG stage-1) -- Performs de-dupes (RAG stage-1) -- split the documents into chunks (RAG stage-2) -- vectorize the chunks (RAG stage-3) +- Perform de-duplication (RAG stage-2) +- Split the documents into chunks (RAG stage-3) +- Vectorize the chunks (RAG stage-4) Here is the code: -- Python version: [rag_1A_dpk_process_python.ipynb](rag_1A_dpk_process_python.ipynb) +- Python version: [rag_1_dpk_process_python.ipynb](rag_1_dpk_process_python.ipynb) - Ray version: [rag_1A_dpk_process_ray.ipynb](rag_1A_dpk_process_ray.ipynb) -## Step-3: Load data into vector database (RAG stage 4) +## Step-3: Load data into vector database (RAG stage 5) Our vector database is [Milvus](https://milvus.io/) -Run the code: [rag_1B_load_data_into_milvus.ipynb](rag_1B_load_data_into_milvus.ipynb) +Run the code: [rag_2_load_data_into_milvus.ipynb](rag_2_load_data_into_milvus.ipynb) Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step. -## Step-4: Perform vector search (RAG stage 5 & 6) +## Step-4: Perform vector search (RAG stages 6, 7 & 8) Let's do a few searches on our data. -Code: [rag_1C_vector_search.ipynb](rag_1C_vector_search.ipynb) +Code: [rag_3_vector_search.ipynb](rag_3_vector_search.ipynb) Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step. -## Step-5: Query the documents using LLM (RAG steps 5, 6, 7, 8 & 9) +## Step-5: Query the documents using LLM (RAG stages 9 & 10) We will use **Llama** as our LLM running on the [Replicate](https://replicate.com/) service.
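Here is a minimal sketch of what this step boils down to, assuming the `replicate` Python client and a `REPLICATE_API_TOKEN` in the environment; the model id, prompt format, and `max_tokens` input are illustrative, and the actual code lives in the notebook linked below.

```python
# Sketch of RAG stages 9-10: send the question plus the retrieved chunks to a
# Llama model hosted on Replicate, then read back the answer.
import replicate

question = "What is this paper's main contribution?"
relevant_chunks = ["<chunk 1 from the vector search>", "<chunk 2>"]

# Stage 9: pack the retrieved chunks and the question into one prompt.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(relevant_chunks)
    + f"\n\nQuestion: {question}\nAnswer:"
)

# Stage 10: replicate.run() yields generated tokens; join them into the answer.
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",              # illustrative model id
    input={"prompt": prompt, "max_tokens": 512},  # illustrative inputs
)
print("".join(output))
```

Limiting the model to the retrieved context is what keeps answers grounded in our documents rather than in the model's training data.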
@@ -76,24 +76,24 @@ REPLICATE_API_TOKEN=your REPLICATE token goes here ### 5.2 - Run the query code -Code: [rag_1D_query_replicate.ipynb](rag_1D_query_replicate.ipynb) +Code: [rag_4_query_replicate.ipynb](rag_4_query_replicate.ipynb) -## Step 6: Illama Index +## Step 6 (Optional): Llama Index For comparison, we can use the [Llama-index](https://docs.llamaindex.ai/) framework to process PDFs and query them ### Step 6.1 - Process documents and save the index into the vector DB -Code: [rag_2A_llamaindex_process.ipynb](rag_2A_llamaindex_process.ipynb) +Code: [rag_llamaindex_1_process.ipynb](rag_llamaindex_1_process.ipynb) Be sure to [shut down the notebook](#tips-close-the-notebook-kernels-to-release-the-dblock) before proceeding to the next step. ### Step 6.2 - Query documents with LLM -code: [rag_2B_llamaindex_query.ipynb](rag_2B_llamaindex_query.ipynb) +Code: [rag_llamaindex_2_query.ipynb](rag_llamaindex_2_query.ipynb) ## Tips: Close the notebook kernels to release the db.lock diff --git a/examples/notebooks/rag/env.sample b/examples/notebooks/rag-pdf-1/env.sample similarity index 100% rename from examples/notebooks/rag/env.sample rename to examples/notebooks/rag-pdf-1/env.sample diff --git a/examples/notebooks/rag/media/rag-overview-2.excalidraw b/examples/notebooks/rag-pdf-1/media/rag-overview-2.excalidraw similarity index 80% rename from examples/notebooks/rag/media/rag-overview-2.excalidraw rename to examples/notebooks/rag-pdf-1/media/rag-overview-2.excalidraw index 81aca52fbb..13bdb554b3 100644 --- a/examples/notebooks/rag/media/rag-overview-2.excalidraw +++ b/examples/notebooks/rag-pdf-1/media/rag-overview-2.excalidraw @@ -5,8 +5,8 @@ "elements": [ { "type": "image", - "version": 98, - "versionNonce": 251175380, + "version": 126, + "versionNonce": 1466563571, "index": "b0z", "isDeleted": false, "id": "nQdFTOsh8Rjwn3poFcnOO", @@ -16,18 +16,20 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 327.1818181818182, + "x": 270.1818181818182, "y": 210.63636363636363, "strokeColor": "transparent", "backgroundColor": "transparent", "width": 64, "height": 64, "seed": 222183398, - "groupIds": [], + "groupIds": [ + "QdJG9kcQa3F6IfH7jiPYH" + ], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722626483716, + "updated": 1736751959412, "link": null, "locked": false, "status": "saved", @@ -35,12 +37,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "image", - "version": 210, - "versionNonce": 1489835732, + "version": 238, + "versionNonce": 1129055635, "index": "b10", "isDeleted": false, "id": "hlPJZs7lUbLYhuRbSmYHs", @@ -50,14 +53,16 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 329.90909090909093, + "x": 272.90909090909093, "y": 282.4545454545455, "strokeColor": "transparent", "backgroundColor": "transparent", "width": 64, "height": 64, "seed": 961787386, - "groupIds": [], + "groupIds": [ + "QdJG9kcQa3F6IfH7jiPYH" + ], "frameId": null, "roundness": null, "boundElements": [ @@ -70,7 +75,7 @@ "type": "arrow" } ], - "updated": 1722626623666, + "updated": 1736751959412, "link": null, "locked": false, "status": "saved", @@ -78,12 +83,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "arrow", - "version": 2395, - "versionNonce": 988701932, + "version": 2632, + "versionNonce": 1648220275, "index": "b11", "isDeleted": false, "id": "FVhCmDYbWjGck9rgcESwp", @@ -93,12 +99,12 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 837.5583207607388, - "y": 273.73602641681657, + "x": 1028.1361684337467, + "y": 272.9021879956133, "strokeColor": "#2f9e44", "backgroundColor":
"transparent", - "width": 216.28952040489298, - "height": 2.3372664247598323, + "width": 152.67793298605602, + "height": 2.146087409728125, "seed": 1954615226, "groupIds": [], "frameId": null, @@ -106,11 +112,21 @@ "type": 2 }, "boundElements": [], - "updated": 1722628818904, + "updated": 1736752307348, "link": null, "locked": false, - "startBinding": null, - "endBinding": null, + "startBinding": { + "elementId": "IkaeA2i4mlTdmulYEI_na", + "focus": -1.8672994337354483, + "gap": 14, + "fixedPoint": null + }, + "endBinding": { + "elementId": "N5WLmDAgIi63Sts23MSGK", + "focus": 0.06572769344321795, + "gap": 2.2185846720928435, + "fixedPoint": null + }, "lastCommittedPoint": null, "startArrowhead": null, "endArrowhead": "arrow", @@ -120,15 +136,15 @@ 0 ], [ - 216.28952040489298, - 2.3372664247598323 + 152.67793298605602, + 2.146087409728125 ] ] }, { "type": "text", - "version": 804, - "versionNonce": 199670764, + "version": 870, + "versionNonce": 998657491, "index": "b13", "isDeleted": false, "id": "squum5tl-CiNtj9LWx0b8", @@ -138,8 +154,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1282.5454545454545, - "y": 262.3636363636363, + "x": 1455.5454545454545, + "y": 245.36363636363632, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", "width": 89.64100646972656, @@ -149,7 +165,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629956965, + "updated": 1736752265506, "link": null, "locked": false, "fontSize": 24.799999999999997, @@ -164,8 +180,8 @@ }, { "type": "text", - "version": 690, - "versionNonce": 2054766316, + "version": 759, + "versionNonce": 1228769107, "index": "b14", "isDeleted": false, "id": "pvfbIW6JhxonR3d8_RvDz", @@ -175,34 +191,34 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1227.6510066223145, - "y": 360.12367872429326, + "x": 1389.6510066223145, + "y": 375.12367872429326, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 203.59999084472656, + "width": 202.09620666503906, "height": 30.705073964828053, "seed": 1073499750, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629909058, + "updated": 1736752350778, "link": null, "locked": false, "fontSize": 24.564059171862443, "fontFamily": 1, - "text": "6. vector search", + "text": "7. vector search", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "6. vector search", + "originalText": "7. vector search", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 730, - "versionNonce": 2056914772, + "version": 761, + "versionNonce": 1924159229, "index": "b15", "isDeleted": false, "id": "dRMw4C6S6Mp6eCCKDx_EF", @@ -212,34 +228,34 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 930.3603992808949, - "y": 377.4545454545454, + "x": 1017.3603992808949, + "y": 359.4545454545454, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 191.7354278564453, + "width": 199.1514434814453, "height": 29.99999999999997, "seed": 72527930, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629913592, + "updated": 1736752372263, "link": null, "locked": false, "fontSize": 23.99999999999998, "fontFamily": 1, - "text": "7. relevant docs", + "text": "8. relevant docs", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "7. relevant docs", + "originalText": "8. 
relevant docs", "autoResize": true, "lineHeight": 1.25 }, { "type": "arrow", - "version": 2261, - "versionNonce": 924090324, + "version": 2365, + "versionNonce": 470776403, "index": "b16", "isDeleted": false, "id": "4sYEoyZlHRs94VNRTxBTv", @@ -249,11 +265,11 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1032.9825006562392, + "x": 1083.9825006562392, "y": 543.8609237753892, "strokeColor": "#2f9e44", "backgroundColor": "transparent", - "width": 353.0790083355713, + "width": 353.0790083355714, "height": 1.4622474064750577, "seed": 323719462, "groupIds": [], @@ -262,7 +278,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722629750864, + "updated": 1736752322558, "link": null, "locked": false, "startBinding": null, @@ -281,15 +297,15 @@ 0 ], [ - -353.0790083355713, + -353.0790083355714, -1.4622474064750577 ] ] }, { "type": "arrow", - "version": 1610, - "versionNonce": 1686166612, + "version": 1661, + "versionNonce": 889024947, "index": "b17", "isDeleted": false, "id": "NakZRgP4xdzVmjeNvLR2u", @@ -299,7 +315,7 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 685.9985512656955, + "x": 736.9985512656955, "y": 595.724759181921, "strokeColor": "#2f9e44", "backgroundColor": "transparent", @@ -312,7 +328,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "startBinding": null, @@ -333,8 +349,8 @@ }, { "type": "text", - "version": 1050, - "versionNonce": 2079156052, + "version": 1105, + "versionNonce": 1103490739, "index": "b18", "isDeleted": false, "id": "iPeoeo3O9O3upLDkcr9De", @@ -344,11 +360,11 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 708.1713911576705, - "y": 501.181818181818, + "x": 759.1713911576705, + "y": 503.181818181818, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 289.9561767578125, + "width": 286.3389587402344, "height": 29, "seed": 957822522, "groupIds": [], @@ -360,23 +376,23 @@ "type": "arrow" } ], - "updated": 1722629936709, + "updated": 1736752376845, "link": null, "locked": false, "fontSize": 23.2, "fontFamily": 1, - "text": "8. query + relevant docs", + "text": "9. query + relevant docs", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "8. query + relevant docs", + "originalText": "9. query + relevant docs", "autoResize": true, "lineHeight": 1.25 }, { "type": "text", - "version": 1204, - "versionNonce": 1652445908, + "version": 1259, + "versionNonce": 1242434675, "index": "b19", "isDeleted": false, "id": "ebtp4uI92EU2ovtVXsomC", @@ -386,34 +402,34 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 815.3363849986684, + "x": 866.3363849986684, "y": 605.6363636363637, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", - "width": 153.50363159179688, + "width": 163.9036102294922, "height": 29.999999999999947, "seed": 831514662, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629941261, + "updated": 1736752381118, "link": null, "locked": false, "fontSize": 23.999999999999957, "fontFamily": 1, - "text": "9. answer 💡", + "text": "10. answer 💡", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "9. answer 💡", + "originalText": "10. 
answer 💡", "autoResize": true, "lineHeight": 1.25 }, { "type": "image", - "version": 1588, - "versionNonce": 1205783764, + "version": 1639, + "versionNonce": 2008697491, "index": "b1B", "isDeleted": false, "id": "_DTyJRHFJGCjvqUIBeB7C", @@ -423,7 +439,7 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1398.5454545454545, + "x": 1449.5454545454545, "y": 527.227272727273, "strokeColor": "transparent", "backgroundColor": "transparent", @@ -443,7 +459,7 @@ "type": "arrow" } ], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "status": "saved", @@ -451,12 +467,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "diamond", - "version": 707, - "versionNonce": 1404808044, + "version": 790, + "versionNonce": 443459869, "index": "b1C", "isDeleted": false, "id": "3WtwayBS0EeE079tVsbFm", @@ -466,8 +483,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1069.727272727273, - "y": 420.9090909090907, + "x": 1233.727272727273, + "y": 362.9090909090907, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, @@ -479,14 +496,14 @@ "type": 2 }, "boundElements": [], - "updated": 1722629787728, + "updated": 1736752364487, "link": null, "locked": false }, { "type": "diamond", - "version": 1185, - "versionNonce": 1464379348, + "version": 1236, + "versionNonce": 1371603411, "index": "b1D", "isDeleted": false, "id": "wd3SWMa_i1h518kFuRp-j", @@ -496,7 +513,7 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 882.1818181818188, + "x": 933.1818181818188, "y": 549.5454545454546, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", @@ -514,14 +531,14 @@ "type": "arrow" } ], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false }, { "type": "rectangle", - "version": 207, - "versionNonce": 1539331308, + "version": 390, + "versionNonce": 1718820659, "index": "b1DG", "isDeleted": false, "id": "Uv-8TiLeECJuuNx1yJjtv", @@ -531,8 +548,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 759.5454545454545, - "y": 245.72727272727275, + "x": 940.5454545454544, + "y": 240.72727272727275, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -551,14 +568,14 @@ "type": "arrow" } ], - "updated": 1722626695912, + "updated": 1736752234692, "link": null, "locked": false }, { "type": "rectangle", - "version": 206, - "versionNonce": 826201964, + "version": 389, + "versionNonce": 869187795, "index": "b1DV", "isDeleted": false, "id": "l7XMM15Xwzq5xmDF0QvyN", @@ -568,8 +585,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 776.090909090909, - "y": 254.09090909090912, + "x": 957.090909090909, + "y": 249.09090909090912, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -584,14 +601,14 @@ "type": 3 }, "boundElements": [], - "updated": 1722626695912, + "updated": 1736752234692, "link": null, "locked": false }, { "type": "rectangle", - "version": 238, - "versionNonce": 1360592492, + "version": 421, + "versionNonce": 1149578867, "index": "b1E", "isDeleted": false, "id": "Wxv71stEiYRpNjyhzzXgO", @@ -601,8 +618,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 775.1818181818182, - "y": 272.27272727272725, + "x": 956.181818181818, + "y": 267.27272727272725, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -626,14 +643,14 @@ "type": "arrow" } ], - "updated": 1722626698812, + "updated": 1736752234692, "link": null, "locked": false }, { "type": "rectangle", - "version": 240, - "versionNonce": 2075365996, + 
"version": 425, + "versionNonce": 135329245, "index": "b1F", "isDeleted": false, "id": "IkaeA2i4mlTdmulYEI_na", @@ -643,8 +660,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 779.3636363636363, - "y": 287.3636363636364, + "x": 960.363636363636, + "y": 282.3636363636364, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 47.27272727272725, @@ -657,15 +674,24 @@ "roundness": { "type": 3 }, - "boundElements": [], - "updated": 1722628818527, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + } + ], + "updated": 1736752237762, "link": null, "locked": false }, { "type": "text", - "version": 317, - "versionNonce": 153869652, + "version": 483, + "versionNonce": 354273853, "index": "b1I", "isDeleted": false, "id": "zSJvmm-7DrsR5-qRb96Kl", @@ -675,12 +701,12 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 609.4118679291607, + "x": 787.4118679291607, "y": 242.27481706603328, "strokeColor": "#1e1e1e", "backgroundColor": "#ffc9c9", - "width": 141.51840079198635, - "height": 59.453152259008114, + "width": 141.69613647460938, + "height": 59.45315225900812, "seed": 409665722, "groupIds": [], "frameId": null, @@ -695,23 +721,23 @@ "type": "arrow" } ], - "updated": 1722629990073, + "updated": 1736752249534, "link": null, "locked": false, "fontSize": 23.781260903603247, "fontFamily": 1, - "text": "2. split into\nchunks", + "text": "3. split into\nchunks", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "2. split into\nchunks", + "originalText": "3. split into\nchunks", "autoResize": true, "lineHeight": 1.25 }, { "type": "arrow", - "version": 661, - "versionNonce": 1783749228, + "version": 728, + "versionNonce": 29971901, "index": "b1M", "isDeleted": false, "id": "JMprrs8mNVD4CrqUlVm7i", @@ -721,12 +747,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 392.3090479062895, - "y": 277.3697732144489, + "x": 341.9938631491875, + "y": 277.5003262976055, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 138.0709014336395, - "height": 1.9443693290653528, + "width": 144.38608619074148, + "height": 1.0749224122219516, "seed": 1319994682, "groupIds": [], "frameId": null, @@ -734,7 +760,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722626624242, + "updated": 1736752179646, "link": null, "locked": false, "startBinding": { @@ -758,15 +784,15 @@ 0 ], [ - 138.0709014336395, - -1.9443693290653528 + 144.38608619074148, + -1.0749224122219516 ] ] }, { "type": "text", - "version": 528, - "versionNonce": 1518707540, + "version": 675, + "versionNonce": 1972126653, "index": "b1N", "isDeleted": false, "id": "G0k27V_VE7lyh7YGr_fts", @@ -776,11 +802,11 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 856.9917648037997, - "y": 244.9780740734803, + "x": 1018.9917648037995, + "y": 243.9780740734803, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 149.3264923095703, + "width": 148.37196350097656, "height": 58.225670034857664, "seed": 970452474, "groupIds": [], @@ -792,23 +818,23 @@ "type": "arrow" } ], - "updated": 1722630014609, + "updated": 1736752256508, "link": null, "locked": false, "fontSize": 23.290268013943066, "fontFamily": 1, - "text": "3. vectorize \nchunks", + "text": "4. vectorize \nchunks", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "3. vectorize \nchunks", + "originalText": "4. 
vectorize \nchunks", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 813, - "versionNonce": 1647923412, + "version": 864, + "versionNonce": 1283936115, "index": "b1O", "isDeleted": false, "id": "bPsqdnl4lmEVHdpxnL7YF", @@ -818,7 +844,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1332.7272727272727, + "x": 1383.7272727272727, "y": 433.0909090909092, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -840,14 +866,14 @@ "type": "arrow" } ], - "updated": 1722629810316, + "updated": 1736752322532, "link": null, "locked": false }, { "type": "text", - "version": 766, - "versionNonce": 936520660, + "version": 817, + "versionNonce": 1392135859, "index": "b1P", "isDeleted": false, "id": "v_mj9bQ1A8qpZs-qgS8b-", @@ -857,7 +883,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1340.7182672674005, + "x": 1391.7182672674005, "y": 438.0909090909092, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -868,7 +894,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629810317, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 20, @@ -883,8 +909,8 @@ }, { "type": "arrow", - "version": 3859, - "versionNonce": 2108283860, + "version": 4016, + "versionNonce": 1406286035, "index": "b1Q", "isDeleted": false, "id": "R6JSfRIH6SMeQy6jhnQBf", @@ -894,12 +920,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1395.847386955965, - "y": 537.0909090909092, + "x": 1347.2057176703038, + "y": 535.2247043634964, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 204.21390458677502, - "height": 3.8247548126748825, + "width": 102.77389948848531, + "height": 1.9248687668030016, "seed": 50118393, "groupIds": [], "frameId": null, @@ -907,7 +933,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722629818776, + "updated": 1736752322558, "link": null, "locked": false, "startBinding": { @@ -918,7 +944,7 @@ }, "endBinding": { "elementId": "yFt4MAxjxmGKCX04IRgc0", - "focus": -0.1925388809872483, + "focus": -0.19253888098724906, "gap": 1, "fixedPoint": null }, @@ -931,15 +957,15 @@ 0 ], [ - -204.21390458677502, - -3.8247548126748825 + -102.77389948848531, + -1.9248687668030016 ] ] }, { "type": "arrow", - "version": 2146, - "versionNonce": 275780332, + "version": 2342, + "versionNonce": 89761395, "index": "b1V", "isDeleted": false, "id": "NVXq7cCI4oPDcGJMcBQXw", @@ -949,12 +975,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1133.0222525327567, + "x": 1212.9150715827727, "y": 486.2891195266392, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 106.59478912038708, - "height": 163.4027558902755, + "width": 167.12335268377706, + "height": 164.4027558902755, "seed": 52355351, "groupIds": [], "frameId": null, @@ -962,7 +988,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752322558, "link": null, "locked": false, "startBinding": { @@ -986,15 +1012,15 @@ 0 ], [ - 106.59478912038708, - -163.4027558902755 + 167.12335268377706, + -164.4027558902755 ] ] }, { "type": "arrow", - "version": 1502, - "versionNonce": 1967576428, + "version": 2047, + "versionNonce": 1886409235, "index": "b1W", "isDeleted": false, "id": "eMud-gLoWrDRv7Gf__fPf", @@ -1004,12 +1030,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1105.176093297839, - "y": 281.999069312548, + "x": 1216.5680601284355, + "y": 282.99906931254793, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 105.32390670216091, - "height": 2.4349140599938437, + "width": 158.67875676051676, + 
"height": 0.45217922798940435, "seed": 2011074519, "groupIds": [], "frameId": null, @@ -1017,19 +1043,19 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752307348, "link": null, "locked": false, "startBinding": { "elementId": "N5WLmDAgIi63Sts23MSGK", - "focus": 0.3854158093728425, - "gap": 11.489709508382179, + "focus": 0.6648610311095211, + "gap": 6.004135127422806, "fixedPoint": null }, "endBinding": { - "elementId": "ltOK83LoUD7UU_3j_cg76", - "focus": -0.0705723899683619, - "gap": 1, + "elementId": "sm5yWctYrhqu8VelEVYq9", + "focus": 0.5145397067946539, + "gap": 6.0814613009241345, "fixedPoint": null }, "lastCommittedPoint": null, @@ -1041,15 +1067,15 @@ 0 ], [ - 105.32390670216091, - 2.4349140599938437 + 158.67875676051676, + 0.45217922798940435 ] ] }, { "type": "text", - "version": 858, - "versionNonce": 936929236, + "version": 911, + "versionNonce": 1378107027, "index": "b1X", "isDeleted": false, "id": "n84LhHameLrrvr9dvJS9B", @@ -1059,34 +1085,34 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1240, + "x": 1291, "y": 506.5454545454546, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 126.34536743164062, + "width": 126.8204345703125, "height": 54.00000000000007, "seed": 167700537, "groupIds": [], "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722630029204, + "updated": 1736752332051, "link": null, "locked": false, "fontSize": 21.60000000000003, "fontFamily": 1, - "text": "5.vectorize \nquery", + "text": "6.vectorize \nquery", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "5.vectorize \nquery", + "originalText": "6.vectorize \nquery", "autoResize": true, "lineHeight": 1.25 }, { "type": "rectangle", - "version": 911, - "versionNonce": 1314659796, + "version": 962, + "versionNonce": 984106899, "index": "b1Z", "isDeleted": false, "id": "72WLPYV3DQ5ji4DxMR7np", @@ -1096,7 +1122,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 800.0909090909091, + "x": 851.0909090909091, "y": 548.2727272727273, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1114,14 +1140,14 @@ "id": "drTGOegd6tHWTct9TzYcX" } ], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false }, { "type": "text", - "version": 869, - "versionNonce": 1245791060, + "version": 920, + "versionNonce": 738188595, "index": "b1a", "isDeleted": false, "id": "drTGOegd6tHWTct9TzYcX", @@ -1131,7 +1157,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 805.3628741177646, + "x": 856.3628741177646, "y": 553.2727272727273, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1142,7 +1168,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 16, @@ -1157,8 +1183,8 @@ }, { "type": "text", - "version": 604, - "versionNonce": 694532308, + "version": 655, + "versionNonce": 1695792851, "index": "b1b", "isDeleted": false, "id": "SxjaT2wrQTJ-o8SxdOtHm", @@ -1168,7 +1194,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 864.0000000000001, + "x": 915.0000000000001, "y": 551, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1179,7 +1205,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 20, @@ -1194,8 +1220,8 @@ }, { "type": "rectangle", - "version": 1124, - "versionNonce": 1734311508, + "version": 1175, + "versionNonce": 1081764979, "index": "b1c", 
"isDeleted": false, "id": "EGQgcKqCKZ92pyn-vxRWY", @@ -1205,7 +1231,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 999.4090909090909, + "x": 1050.409090909091, "y": 602.1363636363637, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1223,14 +1249,14 @@ "id": "hYwb2FeDB1k4VB6jIgy0Q" } ], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false }, { "type": "text", - "version": 1147, - "versionNonce": 486839252, + "version": 1198, + "versionNonce": 33187347, "index": "b1d", "isDeleted": false, "id": "hYwb2FeDB1k4VB6jIgy0Q", @@ -1240,7 +1266,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1018.5400848388671, + "x": 1069.5400848388672, "y": 607.6363636363637, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1251,7 +1277,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 20, @@ -1266,9 +1292,9 @@ }, { "type": "diamond", - "version": 696, - "versionNonce": 1463565036, - "index": "b1dG", + "version": 880, + "versionNonce": 1286095507, + "index": "b1dV", "isDeleted": false, "id": "_WHINKpSdEMstJlo2e-GL", "fillStyle": "solid", @@ -1277,28 +1303,30 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1058, - "y": 244.09090909090907, + "x": 1183, + "y": 243.09090909090907, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, "height": 20.909090909090878, "seed": 1555477478, - "groupIds": [], + "groupIds": [ + "OAVsXY2YonKeoKIwBQtO6" + ], "frameId": null, "roundness": { "type": 2 }, "boundElements": [], - "updated": 1722629634711, + "updated": 1736752300442, "link": null, "locked": false }, { "type": "diamond", - "version": 714, - "versionNonce": 775238892, - "index": "b1dV", + "version": 898, + "versionNonce": 306739251, + "index": "b1e", "isDeleted": false, "id": "N5WLmDAgIi63Sts23MSGK", "fillStyle": "solid", @@ -1307,14 +1335,16 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1057.818181818182, - "y": 266.54545454545456, + "x": 1182.818181818182, + "y": 265.54545454545456, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, "height": 20.909090909090878, "seed": 1408803750, - "groupIds": [], + "groupIds": [ + "OAVsXY2YonKeoKIwBQtO6" + ], "frameId": null, "roundness": { "type": 2 @@ -1329,15 +1359,15 @@ "type": "arrow" } ], - "updated": 1722629700042, + "updated": 1736752300442, "link": null, "locked": false }, { "type": "diamond", - "version": 685, - "versionNonce": 128927572, - "index": "b1e", + "version": 870, + "versionNonce": 999835507, + "index": "b1eG", "isDeleted": false, "id": "xtZbH3Rlzs0XZwPNk3pXG", "fillStyle": "solid", @@ -1346,14 +1376,16 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1059.4545454545455, + "x": 1184.4545454545455, "y": 293.18181818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, "height": 20.909090909090878, "seed": 2012745702, - "groupIds": [], + "groupIds": [ + "OAVsXY2YonKeoKIwBQtO6" + ], "frameId": null, "roundness": { "type": 2 @@ -1364,14 +1396,14 @@ "type": "arrow" } ], - "updated": 1722629712968, + "updated": 1736752300443, "link": null, "locked": false }, { "type": "image", - "version": 451, - "versionNonce": 1532166252, + "version": 595, + "versionNonce": 657339795, "index": "b1eV", "isDeleted": false, "id": "ltOK83LoUD7UU_3j_cg76", @@ -1381,8 +1413,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1211.5, - "y": 250.88636363636368, + "x": 1374.5, + "y": 
249.88636363636368, "strokeColor": "transparent", "backgroundColor": "transparent", "width": 64, @@ -1409,7 +1441,7 @@ "type": "arrow" } ], - "updated": 1722629848587, + "updated": 1736752210017, "link": null, "locked": false, "status": "saved", @@ -1417,12 +1449,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "image", - "version": 540, - "versionNonce": 1483405652, + "version": 591, + "versionNonce": 814361523, "index": "b1f", "isDeleted": false, "id": "4YoIP2OS4xYKWgZsRgVIl", @@ -1432,7 +1465,7 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 601.5000000000001, + "x": 652.5000000000001, "y": 511.88636363636374, "strokeColor": "transparent", "backgroundColor": "transparent", @@ -1448,7 +1481,7 @@ "type": "arrow" } ], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "status": "saved", @@ -1456,12 +1489,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "text", - "version": 1026, - "versionNonce": 1930671188, + "version": 1077, + "versionNonce": 480451315, "index": "b1g", "isDeleted": false, "id": "rQNkgL2YVTsa60VEz6nEc", @@ -1471,7 +1505,7 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 613.0500411987306, + "x": 664.0500411987306, "y": 589.3863636363637, "strokeColor": "#1e1e1e", "backgroundColor": "transparent", @@ -1482,7 +1516,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 20, @@ -1497,8 +1531,8 @@ }, { "type": "text", - "version": 439, - "versionNonce": 1166928468, + "version": 499, + "versionNonce": 41677939, "index": "b1h", "isDeleted": false, "id": "XUbC5cWQCm-GEFrdqZW7g", @@ -1508,11 +1542,11 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 401.94038113680745, - "y": 247.15978750685963, + "x": 348.94038113680745, + "y": 248.15978750685963, "strokeColor": "#1e1e1e", "backgroundColor": "#ffc9c9", - "width": 125.03573608398436, + "width": 116.12332153320312, "height": 56.915476374359955, "seed": 1458850132, "groupIds": [], @@ -1524,23 +1558,23 @@ "type": "arrow" } ], - "updated": 1722629998113, + "updated": 1736752135971, "link": null, "locked": false, "fontSize": 22.766190549743982, "fontFamily": 1, - "text": "1. Cleanup\npre-process", + "text": "1. Extract\nText", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "1. Cleanup\npre-process", + "originalText": "1. 
Extract\nText", "autoResize": true, "lineHeight": 1.25 }, { "type": "image", - "version": 138, - "versionNonce": 1172059244, + "version": 176, + "versionNonce": 1024925299, "index": "b1i", "isDeleted": false, "id": "XH-Rt0Q5-K2g4tM9reh76", @@ -1550,8 +1584,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 534.8409090909091, - "y": 209.88636363636368, + "x": 490.8409090909091, + "y": 210.88636363636368, "strokeColor": "transparent", "backgroundColor": "transparent", "width": 64, @@ -1562,8 +1596,13 @@ ], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1722626560915, + "boundElements": [ + { + "id": "lHi0BUzThLy_t7_14zOyA", + "type": "arrow" + } + ], + "updated": 1736752177042, "link": null, "locked": false, "status": "saved", @@ -1571,12 +1610,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "image", - "version": 186, - "versionNonce": 395282004, + "version": 225, + "versionNonce": 598487027, "index": "b1j", "isDeleted": false, "id": "YFlD_rDw6IwCctPG9BjYf", @@ -1586,8 +1626,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 534.8409090909091, - "y": 279.8863636363637, + "x": 490.8409090909091, + "y": 280.8863636363637, "strokeColor": "transparent", "backgroundColor": "transparent", "width": 64, @@ -1600,15 +1640,15 @@ "roundness": null, "boundElements": [ { - "id": "0wYqjwjKHCGbx7CfmDR__", + "id": "JMprrs8mNVD4CrqUlVm7i", "type": "arrow" }, { - "id": "JMprrs8mNVD4CrqUlVm7i", + "id": "lHi0BUzThLy_t7_14zOyA", "type": "arrow" } ], - "updated": 1722626611130, + "updated": 1736752177042, "link": null, "locked": false, "status": "saved", @@ -1616,12 +1656,13 @@ "scale": [ 1, 1 - ] + ], + "crop": null }, { "type": "arrow", - "version": 724, - "versionNonce": 90762348, + "version": 1157, + "versionNonce": 1917167763, "index": "b1k", "isDeleted": false, "id": "0wYqjwjKHCGbx7CfmDR__", @@ -1631,12 +1672,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 601.6995151292258, - "y": 276.08728311464677, + "x": 783.0454545454546, + "y": 275.9414427822386, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 160.10395921482052, - "height": 0.6238794650969908, + "width": 162.28249921422787, + "height": 0.5173522358115861, "seed": 1397245780, "groupIds": [], "frameId": null, @@ -1644,19 +1685,19 @@ "type": 2 }, "boundElements": [], - "updated": 1722626695912, + "updated": 1736752246931, "link": null, "locked": false, "startBinding": { - "elementId": "YFlD_rDw6IwCctPG9BjYf", - "focus": -1.109045219782619, - "gap": 4.754433856179901, + "elementId": "N0Ndvp4QxYBPXHHf4w9Gc", + "focus": 0.156852528891332, + "gap": 1, "fixedPoint": null }, "endBinding": { - "elementId": "Wxv71stEiYRpNjyhzzXgO", - "focus": 0.71470874166962, - "gap": 13.378343837771922, + "elementId": "IkaeA2i4mlTdmulYEI_na", + "focus": 1.6361250332441895, + "gap": 15.035682603953546, "fixedPoint": null }, "lastCommittedPoint": null, @@ -1668,15 +1709,15 @@ 0 ], [ - 160.10395921482052, - -0.6238794650969908 + 162.28249921422787, + -0.5173522358115861 ] ] }, { "type": "rectangle", - "version": 1127, - "versionNonce": 829383532, + "version": 1178, + "versionNonce": 1644774547, "index": "b1p", "isDeleted": false, "id": "yFt4MAxjxmGKCX04IRgc0", @@ -1686,7 +1727,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1032.2500000000002, + "x": 1083.2500000000002, "y": 501.38636363636374, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1716,14 +1757,14 @@ "type": "arrow" } ], - "updated": 1722629780787, + "updated": 1736752322532, "link": null, "locked": false }, { "type": "text", 
- "version": 1133, - "versionNonce": 608216276, + "version": 1184, + "versionNonce": 913629971, "index": "b1q", "isDeleted": false, "id": "3s-bbZ7ixbw_SFjRKEVxN", @@ -1733,7 +1774,7 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1039.9007540616124, + "x": 1090.9007540616124, "y": 519.3863636363637, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", @@ -1744,7 +1785,7 @@ "frameId": null, "roundness": null, "boundElements": [], - "updated": 1722629750845, + "updated": 1736752322532, "link": null, "locked": false, "fontSize": 16, @@ -1760,8 +1801,8 @@ { "id": "WtnY3pqkK8ASBf1-4fxPv", "type": "rectangle", - "x": 1203.840909090909, - "y": 391.88636363636374, + "x": 1333.840909090909, + "y": 380.88636363636374, "width": 39, "height": 21, "angle": 0, @@ -1779,18 +1820,18 @@ "type": 3 }, "seed": 1374900180, - "version": 527, - "versionNonce": 253503724, + "version": 576, + "versionNonce": 1111120243, "isDeleted": false, - "boundElements": null, - "updated": 1722630039761, + "boundElements": [], + "updated": 1736752350104, "link": null, "locked": false }, { "type": "diamond", - "version": 766, - "versionNonce": 168389228, + "version": 910, + "versionNonce": 488199187, "index": "b1t", "isDeleted": false, "id": "zlGe03vzEWzmllwt-MPix", @@ -1800,8 +1841,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1210.0227272727275, - "y": 262.43181818181824, + "x": 1373.0227272727275, + "y": 261.43181818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, @@ -1813,14 +1854,14 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752210018, "link": null, "locked": false }, { "type": "diamond", - "version": 727, - "versionNonce": 818494700, + "version": 871, + "versionNonce": 630938035, "index": "b1u", "isDeleted": false, "id": "10E22hoaTdI84VEMTX9wE", @@ -1830,8 +1871,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1214.0227272727275, - "y": 268.43181818181824, + "x": 1377.0227272727275, + "y": 267.43181818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, @@ -1843,14 +1884,14 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752210018, "link": null, "locked": false }, { "type": "diamond", - "version": 727, - "versionNonce": 1156235116, + "version": 871, + "versionNonce": 118205267, "index": "b1v", "isDeleted": false, "id": "gtj688hXkjTSwOP-H_0OR", @@ -1860,8 +1901,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1214.0227272727275, - "y": 268.43181818181824, + "x": 1377.0227272727275, + "y": 267.43181818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, @@ -1873,14 +1914,14 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752210018, "link": null, "locked": false }, { "type": "diamond", - "version": 741, - "versionNonce": 246936044, + "version": 886, + "versionNonce": 633391347, "index": "b1w", "isDeleted": false, "id": "sm5yWctYrhqu8VelEVYq9", @@ -1890,8 +1931,8 @@ "roughness": 1, "opacity": 100, "angle": 0, - "x": 1215.0227272727275, - "y": 279.43181818181824, + "x": 1378.0227272727275, + "y": 278.43181818181824, "strokeColor": "#e03131", "backgroundColor": "#ffc9c9", "width": 33.636363636363626, @@ -1902,15 +1943,20 @@ "roundness": { "type": 2 }, - "boundElements": [], - "updated": 1722629848587, + "boundElements": [ + { + "id": "eMud-gLoWrDRv7Gf__fPf", + "type": "arrow" + } + ], + "updated": 1736752210018, "link": null, "locked": false }, { "type": "arrow", - 
"version": 2267, - "versionNonce": 1193363436, + "version": 2463, + "versionNonce": 451043347, "index": "b1y", "isDeleted": false, "id": "LX6SwsBcl881KJdzOxXvb", @@ -1920,12 +1966,12 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1200.9555294678319, - "y": 316.7121870453824, + "x": 1363.7986220548623, + "y": 291.5113856680257, "strokeColor": "#2f9e44", "backgroundColor": "#b2f2bb", - "width": 112.73078059081445, - "height": 169.47279864584357, + "width": 205.5410496836057, + "height": 194.67360002320027, "seed": 2069875180, "groupIds": [], "frameId": null, @@ -1933,7 +1979,7 @@ "type": 2 }, "boundElements": [], - "updated": 1722629848587, + "updated": 1736752322559, "link": null, "locked": false, "startBinding": { @@ -1957,15 +2003,15 @@ 0 ], [ - -112.73078059081445, - 169.47279864584357 + -205.5410496836057, + 194.67360002320027 ] ] }, { "type": "text", - "version": 625, - "versionNonce": 1425330516, + "version": 787, + "versionNonce": 961621395, "index": "b1z", "isDeleted": false, "id": "u6igVA48eC3IO6u0dKz-T", @@ -1975,33 +2021,199 @@ "roughness": 0, "opacity": 100, "angle": 0, - "x": 1105.4218642911046, - "y": 247.3299461276493, + "x": 1241.4218642911046, + "y": 249.3299461276493, "strokeColor": "#1e1e1e", "backgroundColor": "#b2f2bb", - "width": 88.7796630859375, + "width": 90.26747131347656, "height": 29.112835017428832, "seed": 1859790188, "groupIds": [], "frameId": null, "roundness": null, - "boundElements": [], - "updated": 1722630005980, + "boundElements": [ + { + "id": "eMud-gLoWrDRv7Gf__fPf", + "type": "arrow" + } + ], + "updated": 1736752309041, "link": null, "locked": false, "fontSize": 23.290268013943066, "fontFamily": 1, - "text": "4. Save", + "text": "5. Save", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5. 
Save", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "image", + "version": 258, + "versionNonce": 2016160147, + "index": "b20", + "isDeleted": false, + "id": "N0Ndvp4QxYBPXHHf4w9Gc", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 718.0454545454546, + "y": 239.01136363636374, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 1121153747, + "groupIds": [ + "tFCUjq5A0CHHf2JIbodsU" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + } + ], + "updated": 1736752214558, + "link": null, + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ], + "crop": null + }, + { + "type": "arrow", + "version": 807, + "versionNonce": 1372619571, + "index": "b22", + "isDeleted": false, + "id": "lHi0BUzThLy_t7_14zOyA", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 569.1784478847981, + "y": 266.887821138674, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 148.39048327151932, + "height": 1.5860812915788074, + "seed": 1060900509, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1736752181892, + "link": null, + "locked": false, + "startBinding": { + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": -1.4375659883772536, + "gap": 14.337538793888939, + "fixedPoint": null + }, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 148.39048327151932, + 1.5860812915788074 + ] + ] + }, + { + "type": "text", + "version": 413, + "versionNonce": 1462177085, + "index": "b23", + "isDeleted": false, + "id": "Dv4NcpvpsuaL0gaHJT8fX", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 571.2862541494615, + "y": 241.2847875068597, + "strokeColor": "#1e1e1e", + "backgroundColor": "#ffc9c9", + "width": 115.57589721679688, + "height": 29.72657612950406, + "seed": 1621553149, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1736752196212, + "link": null, + "locked": false, + "fontSize": 23.781260903603247, + "fontFamily": 1, + "text": "2. dedupe", "textAlign": "left", "verticalAlign": "top", "containerId": null, - "originalText": "4. Save", + "originalText": "2. 
dedupe", "autoResize": true, "lineHeight": 1.25 + }, + { + "type": "diamond", + "version": 811, + "versionNonce": 1796060531, + "index": "b24", + "isDeleted": false, + "id": "KcQtYO2ZEGlag6-sdrK9U", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1218.227272727273, + "y": 380.5568181818183, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 33.636363636363626, + "height": 20.909090909090878, + "seed": 337196605, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1736752360186, + "link": null, + "locked": false } ], "appState": { - "gridSize": null, + "gridSize": 20, + "gridStep": 5, + "gridModeEnabled": false, "viewBackgroundColor": "#ffffff" }, "files": { @@ -2010,35 +2222,35 @@ "id": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", "dataURL": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABd1JREFUeF7tm39sE2UYx79vu61b2cbKVjackBHX6TpNFIzxDzaBIBInEHQQgxoCqIygyIBoDGZSRRNRcIHAnPIjDpBfhgTkp6DRiIAYfyErsk62yToDY+vItv6+vebewmyX9u5Kr9263P3V3Pvc93neT5973ifv3RGEeTSPeUCXoOEWApgJwABgJAASpoyguTotzZNQaFyQtWv7Tjl1g2mFFbjVYJwP4CMAI6IZmGp4OjTjxrup0/5cZu32L6PpSzKAVkPRuxT0rWgGc1ubB5A0fhwopU5qt5fpd35+JFp+JQFozS98gRJSG60g+uveBsCfp4AdDufMrNptp6LhXxTAdaMx1eNBA4DsaAQQTNMfAIPQS3uos7tUv2PH93LHIAqgJb9wMSFks9yOhfT6A2C2FJ2U630ia+sn5+WMRRRAq8F4hAJPyulUTCsoAN9S0+7h6OTsLdUXxDSkjosCsBqMFgD5UgXlsAsFgCUCRZuGcBPTamrMcviSAqA92sueUBEMOklKrV41SrKrq69ECkEcQIGxAxS6SB2Fc71QBvjpXFV5SbFu66bmcLT728YzABBK/+a0mmJ9VdW/dwohrgGwSRNyKXFYckn6unU37gRC/APwVcaLKq2mWFdV1RkuhKEBwLc6/EIzUifp167tCgfCkAHAJq0ip+2Ge6aOXr7cIRXC0ALAZwJRfZdpLJhGli51SYEw5AD4CqPq2IicrBlk9WqvGIShCYAxUB3STSp5msyZwwlBGJQAiFaLxAJ+synCg/N+qj+wf1HcAYhw2v9fTmDLrTcL7l4NygxQAMhFQMkA5RZQaoBSBJVVQFkGlT5A6QQFCCidoFjTZR2AXWGxmCSPK52g0gkqnaDSCcaiE8z44H0kT54UUJuo2wWusQmOYyfQ88Uetm+tnV2G9NdXBNr19MBjscB54iQcB78C9Xj6xjPeMyF56tSQNc956ht0vinw0kqsiqBuw8dImRY60J5du3HTtAbDnp+L4ZWrQk7Ia2lAe/kScFdbmI1u/YdIeSr0k3nH0eOwLQsEGiA+EABcp8+As1qhHpUDTfEE/tEV+/evTZmG5MdK+gB4LtbBY2lAwpjRSBr3kM8OgLepGW3TZ4G6XAEAXGfPgfvnasD83BfrYN+7P/SqOBAAOpa8BudJ3+s8mVtqoCmZwH7z59XZI/sAdG3cjK6Nm9hY4n33YsRn1VBn+97C6aw0wb5nXwAA27KVcBw9JrkFYIYDCUCdk43M2m1IyMtjsbTPfwkJY/OCAuDHtXPKkLHGxGx5gDww/1vAeepbeK809gFw/3GhD3RIKgMBIFgwvTYbrk18HNpnZoUEwGeB/tABdrmnzoy2WbMFa0DP7r24+fY7whkxGAD0dnXB9moFXGfOBhRB/1uA3Qb3F0F/YB+bkPvCn7hR9mx8AuDT19vYDOp2w9vUBNcPP4LPAP7wXwX6A0hd9CLSV1QwO8ehw7CtfCO+a0Cw3AwKQK1GyvRSZJgqQVJS2GW2pRVwHP96aAPg2trQ22GDetQoqNLT+ni5zp1H+7wFbOn0L4JxtwqIZUCwcb5r7FxVCdrdzYbjBkDaK4uR9PB4FnTXhk1w//pb0OqcPGUyqwP+B3U44bVY4DhxklV//yO1/GVoHn2EnequroHrp5+Fq37/0VitAuFFFUNrBYCyIaJsiCgbIrHYEIlhWQvPlVIE5SiCBmPMX5cP728WtO7ItZgzhSzEnwwZjPW3vg+UMa6YSf2VazEXRgSg1VB0mIKWxixkWR3Rw7mWS9MjAtBiMJYToFrWuGIkRoHyuy3mmogAsM/mvLCAIidGccvl5nqSisvXX74s+Pa4aA3go2k1FM2loLvkiiw2OmRurqVut5gvSQB4EWuB0QSKSjHBQTFOYMqtN6+WEotkALxYS0HRPELp+lh/RSZlIrds2gmlFXc1XNoh9ZqwAPCijXkPZiQmuRcSihkgKABln9SGrSM1QBE7CoJroKinBAc97qRtY5t+D+uzmf8A6hsfbisiXOQAAAAASUVORK5CYII=", "created": 1711006482453, - "lastRetrieved": 1711214452915 + "lastRetrieved": 1736751950898 }, "aee87fc347ff479e333de52ddc10cfc86c76a601": { "mimeType": "image/png", "id": "aee87fc347ff479e333de52ddc10cfc86c76a601", "dataURL": 
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAADoJJREFUeF7tm3dcFNfax39nZnZ2WUBFRBFiybVFRaJXYuOaxH4tMSbYKyGaxBKNXWMJaowFNVe9YopoFCF2YjSa2DC5GkVFYwE1RaNYEFGkLOxOOXM9i+CCICwsXvJ+3vMX7Mw85TvPOec55zxDUM7buDjpXZ1iiVryomtSWZhKykKoo2SOjTd340CiQHBWFcTuK+uRu46SnSOnXANgRo6LsyzggI2fNtbHOdp5Jq/cAwiO1oTgdkQpC+f/EgAKcJwD0BKAHwAPAPcAXADwIwC7QZX7CMgHYLCo5xfJklrds1ZFi1tVZ6TdN2u3rqboOY6kKTL96CGUVQDU4kbMXwWAwOu4dYSQ/gEjmws9hvmiYhWnXB9NaRYc3HoREUuOy4pCf5It6hsA0osD4S8BgOe51QYXXdC8yF5i3SZVC/Ur8XoqZg34Rrp3xxStSGo3ALQoCH8FAK8QgugF2wJIo5e8ivIHt64+wNjOkYpkUYcDWF/UA88KwAsPB6reAk+6A6SGpmmVbAzLIgTnFFX7AsAWAJqt0Tq9cMi/W52XJyzvzBfkzN2b6YhYegKqTDFwYgtUr10REUtjEPVZ7DXJotYucwDBZ16tFNzs8INCFHlzHBdCKe3v7mKw1KvuZnBz0cMo6sBz2exNFhk37mXQCwnJFBoOq5r2OoAsAM8BcCUcOTcvshfv2+Y5JN/KwLZVsRB0HPqM8UNmhoSZ/Xaids0XQAjBg8zrWLI7AKwrvPOPDUx8IwAXnwahVBEQfLxrBb0TTSKaNnla030r8ylqyXPkuyquTq4dmtQUvSu7PPVlpJjM2PLzr1K6Wb5MCGpQ+jhKNsW/C6OLiCmv74COVoGiSEgzJUGyKHjRtxU2RO7A+rVfYPGi2Vh/OtCqp1/Dz6Qsk9wfQFSZAWCCF5zr/A7h6ffTGh+4bqPIhyPkhE8Nd7GTby2ee/S2iwrHe+lZOHD+uvKCd2Whhrsr7qZl4tvYK4j6YzR4gcOgJmEInrsEffoPxMC+PeHs7IJ14Vtx+NABDBvUG4EftsZrQS9a1QzzCzOnJGW+V9Q4UKoIKMQhI8+Ry3WrVfLs+VIdoSinn3adRcWagxcQdiwQHt6u2Lz8JHasPovN23fDv+0r1ke/37MbgYP7oM+Y5uj/wUvW31SFok+D1aoiUzYT7CvTCChA+HQnUQh+t5OvqONZ0lbypmlA6MGzctBHbXWdBzS2CgqbewT7Ii9h7YbNMJkyMHLEMPQd0xz9HjnP7jl7NAGzB+1UNapVAVDY+GSV5+gI4HmOJLfzqVGpWe3C52t7kByOv4GbapYcemiIjuOzzd2w8Bi2r44F4QgGT26F3qOa5xE5a8A3ctzJWz8okvpaUbrsBpB+vHs7QdWuOPnvuVaA8JcJweHRXZqSh1FQlO5iXc+UFIRFX1C7DvPl3prpn2svm/7YuFC5mnMeOXvDz+PzWT8qlGpsMIgvSoldAExHenpBoKxP3cxUuN4e//g2f7o53t3FMD+ovc/jPLUICxRKceZqErwru8LLLa8zOY9eu5uGbTG/ad0DfbVh0/05Uf9kSkBVDTs+P62FLz6maVR7G8BXRTlvVxdIi3nDndfktQB6ZgvWQo1Zpomk3WGzjaKQ2lUrjO3Tqr5YHOUq1RAV87uSmJZFLLLM+df3In51q0Hgnhw7Eu6lY2fsFdVQQUSPQF/ep5U33DyMSEsx4+Kp29gbft5yJyFNVRXK5sGtxdFvFwCryxpI1oke4RrFJefWuz8uQEloAy+3d3r61Skwa7O9PzXTgu9ir8gW8PjPhqW6mPOXMe6TUEWSJDTwcuNreVQglYx6VDTqoddli4u9egeH426k6kT+jsWs1M+RJxr4G5JZjXiYOC0FYNeukV1dgCk0n+z+N32GMYm025pRFID0LAlG/eOsj73xWykZiL9xn8bfuIeWvi+oGxZM0T3nyQZrICPTjK/3RGPTnsNybNxvXJZFsnreoUlN/P35qjh77S6iLyQkyCqt+TAF0QNgI23yo8yxuC89z312AyhCizUC/Bt486f+SNTOXU9mGSpEgWfrc80iqwLPcVpL3wbyhMAAsccrLa0pbGHtfmo63p+/CjGxZ2kn31rcjfvptgBK5HD+hxwNYJteJ7xhkRWOI0Tr0LoZmT6iP+4kp4Blg9U9KqNx3dpwMRqKbfyASZ/gm4PHNKppxNVJr2Za5GSVUs9iCyjiRkcCaMRz3JkWvg34L4LH8XNCI9TKFV34lTNGl8pWFgH3UzPU2SMH8aM+/rf68+k4UE3zAXCpVIIfPewwAHpR3OnXuN4/D65dKLKwHjRlkUMBRCyeylOqoX3QVOl0/K97LJLCdn1K3RwGQCfwps+DxxkH9WhvNaogAD//Eo8jsXF4f/DrcNI/OVOG7zoIAoLBr2XLYC0nAhgA9v/GXYfw3pzlJllRn768LCYahwHgOU6KDJmm69WhTaEAmr45ChevXMdXn0zCgG6v5jExMfk+anUcav3t+oFwVKviViCAqANHMWjqIklVKZsFSt0cBsAgijG9u7RtHjZvvPVNFRQBG749gO//cwr/mvYeqrrbbgoBLLynLF1jdShk0vDc2SF/BATNXKZu33fkpFmSWpfaewcvhroQQvaGTBpOxgzsicFTFzt0DAhfOIVfGbETU5eFaZqmdQZwoLwBYPaM4HlupauzkdPxPHnZr4kQGTKtVHb2nTAfR8/EKYpKtfSMTKpSyqaVsFIJtXnYYV3ARiabo9kyNLD9Sy/67f1y/hOjXULiXXi4VYTBZiD88+YdsIxQ4PNm0Z3eni79FHv+5KOdnV0AEh3lPJNTFgBy7BtSydV5ze0fN4n5t8TeHDcXVxISMXV4X3i6u2HvkZNYFbkLP4UvQfNG9XL9UylF9ZcHSKkZJrbFHe5Ix3NklSUAD44jt3etmst3bN0sj+13U1IxfdlaRB08as3/G9ethQ/f6Y/endvmue+HI6fQa+wclVKNRRXL+R3eyhIAdDo+wqfu872PbFwq5g/tHE8UVX0i7Nk19rv/oAlS3B9/bpVldbDDPX8ksEwBAKgpCHz86AGvOS+eyKK4+G1SyJdYvXm3SVFUtrdvu+NcfCHFuLOsATATehFCti8cH8R9MLR42evSddu0GSvWs+kuAMA3xfCjxLc4EoCOOQsgO4XL2/w4QoIa/q0m3grozDvpC07isswWhO34Qb18NUGjVFsH4FQBsu4D2AlALrHXNg86DIAoihE8z/d1q+xe2Nk80TRN4DhOLUwpOxSklPKEWCtC8pwR5ticcv8ep6rqFkmSHDIuOAyAwcnp9uKlKzwHDX3LES+mUBkRG9ZhysSxieasrOqOUOQ4AAanO8tWhFbtN3CII+wqVMbmyHBMGDsqyWzOquYIRf8PwBEUmQxDOYsALSjYS+VoV2HN3KeuG/5PRoAWGGxQBO0jQrRplHItxbDgE4W9aEcAqAGgvV5vCP105WrjsxgD
xr8/MtNiMbNV4UEACfmdU0bMZqdXnXJ/1xAkrJnLptUnWmkAVOX1+vWqJHXRu1aUeaoKS5at4J4FgEkTxlKV4xVLeqqO14nfq5KFnQbl1hLLI4LbENDZALpoGuYJqrySrFtQ4IFJSQEYeFF/oUp9nxrtZv5L9PRtgbWv1pYWfvyJ+CwATJv5oRR0+E8x8dwJRM8dKyX/Hp+gShZ2fm7JecXy27P8CYdp/AM+gGwNlhzdBUYZKrp9GrT/V1F0qWCVbQ+A5LtJ2BQZDlXNzpl4nkf/gUNQxaPoI3U2DeYAYM9KGWkI61hPsqQ9GAfgM3sH9RJFAKfTbWnUa0hAxzmrc08x7QHwy+lYTJ4wJg+AkGX/RtO/5z3nL8iZ/ADYPftnv0cv7ty4jcpyv2cCQHAyxvgFTWzRavTMXH32ALDXSNv7CwJwfNU8xK5ddlzOyrR7o7REEaAzOF9q88GcBs2Gvl8uAJxevwLHlgdfks2mhvbCLREA3uB0q/2s5dUbvzGsXACI2/EVoj/+4JZizvJmBpmO9+imEl1MhZZRrJL8qa1EADidLq3bko2udTux1W92+192gd/3RWHP5KHpVJYq3DvetYIBfAohZJKx5a5PywQAW66+GbaXr9GqXbkAcP3YIUQN76ay5TYzKP1Y94YuKfQK6bY3d1p05DTICnkyBmz5GdV8Ho/a/8sIuHPhFL7u6898ZLZlFvXWba+XpAuwku2bned/CRdPVs6b3fZPHqIsWhgiPItEaOq0yUqnkPDcMrSMxBvYN2MEM4PZdtuRANioynJq26ov94fjzGQXVxdw5HExkyzLWLo8FM8CwIRxoyDq2A5cdqMaRUa6tWKHVaVfdhQAVg3Gqq3Y11q2o6krIaTF+MAAsmBcdmHygWNnMDx4BYig17y8vEsSVcW2+ebNGxqhElkTPBYdWmWfN2RZJFRq+Sb7k31LVOjKryAlTzOWFRmySusZBTzYmRDyw7X9GyAIAmp2HII6DRqqyXcSORYJZdl0Oh08qnnS3y9f5NkxunslV6s6Z79emqIoXVhvtEf/0wCwRL0jgOgCBLLYVw+ELYSfT324t+mD1eu+xqsdmf6yb4f278WYtwfj3tEtueeL1V4ZSB+kprFUeJs9FtgCYJ2KbeixT9FYW/gonApLJroOD/gnnn/OE9/9dBKn4n7TfJs2J0YXhxRuFOpDZkYGzv1yWvPzqUu6t82uDmct5Ksd2oPUtO2FbKXbymPLZnbOaP3EzhbADEFv/KiSd2PrhfSkKyJV5UIjhNUBuXs9r/CCjhV1w5SSxJtNqRylRX6nZM8LeuJejuNgcK5IXdyq5tl+N6WncEQwqhyvL3A7PUfQg5txgmLJZJ/XLcgDgOP4TbVa9enXrO+8UhlY3h8+s2UWrh3fuplSlX1NYhMBHLfWo06LYQ06jSxdkX85J3B5fyi9+8fJ9aA0KH8X8OMEcRdVJIcVIZZHFpwgJlJF6sFKj5l9/wUIqPqMzq3ikAAAAABJRU5ErkJggg==", "created": 1711007232670, - "lastRetrieved": 1711214452915 + "lastRetrieved": 1736751950898 }, "07b175087348021c4c82812631d73b2ab9bb7a5e": { "mimeType": "image/png", "id": "07b175087348021c4c82812631d73b2ab9bb7a5e", "dataURL": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAACPxJREFUeF7dmwtsFNcVQM+d9drGu8YEHL5R+BSwoVWb1qnSKFKhaZR+01RqQYpEEVULTUhD+f+clKUBwh8KDUmhVRGNFAmKVEL6VZqSSohGjdu0asELFLxRQoCAg+1dg7F3bv1mdx3bXdtvbQeWHWk0O7Pvd897b+bed98V+vnQ5xhFCxXARIQS8M6B3m9N/k48SzxPHPVAXfKsR6hDvfvE88Tvk+RRJY/xTn82WfpSWJuwQgVKBeYKw/tSpkXe8yhVCFXetY9QMgKgv6CQKA+BPALcewOEteDhJTkPHAN9kSCH5dtcs81oBUB/zATEWQ58s92wta3jRqcz0+ZXqLtefsCpnirvFoA+S5CWvErQhUB+T4Vl2f/XUdmGv2WNPE60q7Z1CUC3MQLJew2YkGWCZdqcU2jLFFnAu+kypgWgmynF8b+GMDnT2rIyvXIct3mKLOZS5/alB7C9YAmqG7NSmN42SmSpzG/aZAdga8GfgPt7W1eW5ntVFjZ93hLAgOm4ugthSJYKk1mzlMs4MlcWXt1vBcAk0u2BYcTdn6I8nFltWZZaOITP+Z7Mj12wfgm2T6ibAjNANwAjs0y0nppzDmSZLIm90F3CLr4CRZ+WxY1/S2XUEA5FxfdBfBrIN7IYxjnQg+A7QGPDUQnhtsmwuaNMqefpAWwI1ABnwF0py67+tcOISMFw49MQmZrUEwp76o4P6X+j8p5C9QjO/wvtTeUNAz4DzjpgnCyLjbF6B+gzxTWIjk4mPtSq/m5lXPSoTCfeuQBvdBSUjEGby3GkHMxJWfIcDPj7KHwzUAuEE6dW42o14q+mqa6mfS+39fZ+fJwJ3tdqTRoNNvEOU4nIioZMAJACkCq3FtXfIbxMQd7vZcGVKzaCaYh8Bg4Mct0XxG0O4nOCuG4QR4KoBL0yRKO4GsVxosTdKI4/Sn48Sn19VEJct6pn26BBNLV8EeWriHwJMPA/OJRMAAw0U6AzgPbFtYAcBd5ACePEw8Q1LJXp37Q2AmSSRtcGhuGTMlxfGeKNtrtBTY/ndVNORFbUW46AdSU9AeiqHjMqwoiGcZ2zQAOOxnCl0buiMdQXQ90Y4jbi+mNeQU5zAHWKECeAxAMgAVwJ4GiRd4ViHHcsKqmpNSgTYMm0EVlZZwlg7aDeAuhFu25YlohUXrEEsOa23ATw5PuWAJ4enJsAnqq1BPCjIbkJ4IeXbQGU5iiAS5YAVhsA0t1n8Ia9ufqvIo3IKlsAoaG/BGb0X+VZUdILErr4LStV2NMcQ0NngOxAuS0rmt/bRgjvg86T0MW0VmH3q8IrS0eQl7cH+Epv67/J+X5DS8tsWXcp7YKop4Wna6AxcDqYkk8NnwLOdFBjCg+7yUL1VP0FkIPg7penz5tVbe/oLFPqeXoAT474M8g+wuf2yYEPLEDP8mu547OoOw0hm2BcQDmIOAfIe/svHTpvGj7KRs4EnSlr3v2c1TtAK0emPoPGBF3F2nP7pRVi+8wejPjwSahvEi6TcLQcZVLSDC7qqZt6+X9jwtbgBK5U43ACiZ/Ad/5EZ7NYzeiuHDm9dZSvTrYpImvPWX4GK0d11gOMR/YwKi/RUPiq7Dzd1JUAXsXLh4/G8Zch8cGoE8AhgKoxagKgQTBGjwZQc+9NxBgqMXBjIMaLE0MkhmuurjGganGbw6w/H+ncER065YnxBRRfux/Rr4HxYTKq3f8RWfuOJYCVd3SnCB
kL7o8gL3nmcL7vtIRqrJ2Rvez5tNk0NKaQ6/HxSXPYCP1gAnLaIyLr3rYEsMIAsFaEzLpbJLliU51YH3DDuP6zSHMDBf5YbwF5AjYZU9lfjNM8FtdJ2f+pVSejrDl2UDUiz9gCWH5nf6vCBpKZv2b0JE9p9NYHEnPATA3z3khOE+9q7i2Fs0IQkfVvWY6A5aNr0G5XhKxqzKpEQkTWRywBLB3T3yMgG1hEZGONJYAlY3MTwKazlgAWj8tNAJvP2AL4SI4C+K8lgEXjjQY4MRsmbj+24aRsOW1WlTsc6W2BeePuxO/7OcoD/diAm1eU8ArN8e/IjjNvWQFIJdIFEx5t/UabXRUJD86td0RbdY0lsu3U8101vcdtcjq/fAy4q0C+DvTGIXEzsF0B/TU4q2V7tXmfdXl0MQXKF+N398qWk22bijQ0OZ9afQBHp6P6MEiWwdAriBzClf0MllckdLzNp6iLJpbS7MySHdWbraaAzptkqA0G3UKhbpWN4Yb2GdtgCFNR4w1Wo5uPA3w3qLuNl/oMSDWixv440llo0w5dWlbMNVkIssh4mGXHCcuvwBMegNSq8CVU9qDuYW6vfj2dO9qrzIyQyzIe4uWew9KVMkSMhzYIbsITLN67JHV2dpsbN7gxhaMoUc9jjJO811ocDXuGFr5qhujp9j3csXNweK/8HsR5CNHZQGny/4jstAYwuSs94D1Uf4s4LzOg5Q+dR0Ymve8Bq0u6x0s02pVANmV6PX017wuoa1zjXwZuT5MvIjuPW46A73/URhEyPXYyYQZLGNXE5gV/fli2v2m1d8BGuA69O/+uQTRfL/M2YYjxFGvKW2x0lp42YkTkJ/+xBPD4x2wAdNf+iyBnUW1ANIY4jahxjZsVnpQ5rI3evTnEmL+SMIfNKpG5FwmgbhEq5ncx6FhgaKbQ2qWPyLP/tgQw9+M10LZFpg91ZlNWiciuf1kCeMwAyLH1AIjIc7YAHv1EbgJ4/p+WI2DOXTVIjo0As0lq95u2AD6ZmyNg9z9sAXwqRwH83RLA7Irc3C6/p8pyu/x3717SuiydWwETsFR+9oZlwMScilLijvGs5kbIDBzH506R3VV2ITOecTOnYgQtvtwImsqLG+Htg6ZS+pvOnRqk8Wol4m06vvXC5kS3UVi0RnYdyTxsroMRMuseEzp3awVOwnrZ+3rfAic7a/I6a2ohNBk7+xE0y0JnhWOte4lfhILDsveItbe6xzXB7swZnXnvKEQrQEzQdOr88IOnoarV/18FWoVKlew71uuI8j4BSAenHZSJoCWoU4KjAxOh81oCYkLmuwmf13qQOi+E3pV6xK3z7tGTfRU2XXv/B2/4aW50o0c0AAAAAElFTkSuQmCC", "created": 1722471115880, - "lastRetrieved": 1722471115880 + "lastRetrieved": 1736751950898 }, "3d9a42e4251de71beb4942440bfe118a46d4b46d": { "mimeType": "image/png", "id": "3d9a42e4251de71beb4942440bfe118a46d4b46d", "dataURL": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAADWtJREFUeF7tW3mQHGUV/33dPd1zz+zMHgmb3exMz2KCoSpFgohiki2JoGglQIygiCJgUAGrQhQjViXeWmqBWFyKISBHOItLSgVZY8iyyczuZjEbk3AoOQhCstdMz0yfn/X17oQ9ZjczPbPBP3xVUzs7/b33vffr973vvddfE1SRUrt2nWPp+rd0w/iIIAi8W5REl8vl43kOPM+DEOJktn4ARyilhwghf+E47olYLPZvJ4KK8TjSaKKgZLLrG5qm/kQ3jKDf50MwGLANnkHaRildn0gktlc6R0UAJJPJBbpuvKBqWoNbkhCJ1EAQhEp1Kof/Scuyvtra2vpuOUxjxzoGIJVKfV7J5u6zLIsPBPyoCYed6lAp3wHLsla2trb2OBHkCIBUd/c1mXTmDkopwuEQgoHAuLkty8KePfvxyu69OHDgLfA8QTDoQyDgBcdxJevp93nR2HgK5jSegtra6HQxJEcpXe5kSZQNQFdv74eV4fR20zQ5n8+HaKTmuEFKNotbf3MP3njjMJidwaAIUaxOLBBFER868wwsP7dtqmX2DoAzZVk+UDLCAMoCoL29XXCJ0pCu616Xy4VZDfXH78q2l3bioS3PQNMMeL0C/H4RzoL+9OozT1h10Qo0NTUWG9gVj8c/RAixSgWhLACSyeTdGSV7JRNeX1cLt9ttz3PPvY9g+/aRJRgOS5Ck6tz1qYxg2+mFKz+NRWcsnDSEUnpFIpHYXHUAenp6wul05qhpWbzbLaG+rs6eY+/e1/Crm38PSgGPR7Dd/mQQWxLXX7sGNTWTgu9BjuNOjcVi+VL0mNIDHnr8yVWcZc4DT/RoMNgXCIVWZ5XsF5nQ2mgUXq8Hpmlh7bofQlHydqCLRt1Ok51SdJ00JhabiyuvuHzSUnv7nXdvP3To8BHG4PN7X3P7fHvOXbLklWKTFAXg/i2PtmcUZRljiDXNgSi64JYkJa+qPuZ+LDJzhGDT5ofR0bFratfnBXDxeQAvwHq9D9D10gwVBHCJDwKWBeuNfwKGMSXfRRd+ZtJSOHqsP9vzjz4vY2qojSIcCrJot7Ft6dLvTxRUFIBN9z2Q03TdLYkiWkaDDc9x1LQsMtb9b1z/Mxw7NmTLrK/3jrsT3MKzoS/5JEwKeCU3QoQi/eR90F7tmxYE/uxzoZ25FCal8Lk9CBga0o9vgn7g9aJ8py84DZd87uJx1wzT1Nu3dbjsmBQMoKGuln3d3bZs6eklAnC/qumGWACA7d1sb2fk9/sQqRnZ+tZ87SZ7GQgCZ7t/gUi0AfoXroVFKTyShGhwZJ1SXUP/bT+A2V88ceOaE1BXfgksv2DG1wSCNp+VzaD/NxthKelJILAYsG7tdZN+b3+pA4ZhwuN2o7lxNpt9X9uyZfMcASDwPAzTHHH1UBDBYBCH3/oPNmy8xf5tYvDjL7gUufh8+1okGLI9oEDKi89A+duzRe8mv3oNcg0j21tdqAbsBhQo/fT9yKW2FeW7af06OyaNpY6dKSjZnJ14tcbmVgYA2/P10fXL7j7zgq1bO/GHB54a9QoXfD7b42ziLrkG+bpT7O8NNVG4xtQHzAhmTDHivrwW+cCIt8yO1oLn3ttOGWgMvGK05uor0Nw8Z9ylnd29GBoetn9jAHAcqbYHvI0NG39d3AM+dj5yCz9iXwv5Agh47Xhk03R3Ujj/s8i2jizTSCAE72iewf4fevB2qHt7iwJw0/ob4B0zBxtU8ABWlSZamivzgPExwI/I6N47VQwA4YCvfQ8aL0DgBdSFa8BzHIy3D2Hgrp+CmsWjOhHdoF9dD40Q22tqQyN82ht7MXjvLbCTjQnEirB1N1Q5Bmy+/8GhvKoFWWkrz20acWuOo5a9C7jt
LJDRjet/imPHRtxs4i5AJA/IhV+GWdtg59u+1/cg++LToOr0+QkJhMB95jLo0XoIlgX3vl7k/vZHUKP4FrpgwWm4dMIuYJqm/uLoLhAKBDCrvpZtg9vbli49ZyKARbfBhx99/NvZfH6jYZpSbSSSCQYDeY8kulVVC7I8gFVn7O/dmx5CZ+dIfnEyUuBi/r9yxQU4c/EZ4y71Dwz179q92w4m0XBYqQmHXge1vtXW1vZCSQAUmyiV6v5lOpO+gV1jBYnX44FhGFi77sfIZguZoGdGCqCiCx9Ay9xmXHUlywTH30dCyGfj8fhjU/GN/b3kYqijoy9imAPvmKbJj10Ge/65Hzffcs/7UAu4cN21a47nJAWjKKVvSZKUaGpqylUVACYsmUzdm1GUy+01X1cHlhUy+l+qBgkhV8fj8btLMZ6NKdkD2OC+vj5xYHBoUNd1D6sPGurf6wds/Xsntjz8LHTdnNF+QDQSwaqLV0za90cN7o3H44sIISNZWwlUFgBMXnd395KMkm1nHSHWAWaN0AKl0xn8+tZNePPAEZZ4IBiUIIqlt8Cm01cQeCxedAbOP+/jYIlZEXqX5/mzWlpa/lWC3ceHlA0A40ylUl/PKNnbpuoJqqqOV3bvwZ6+/XbKzEpl1g/0+z1l9gR9aGycbVefzNvY84UpiK33T8iy/FI5xpe9BMYK39nVdVk+m9vMgmIgEEBNOFTu3NUaf5hSuiKRSHQ5EejIAwoT9fT0LMzm889rqlbLdgZWJzBXPVlECPkTx3FfaWlpsZsfTqgiAAoTplKpdfm8ttEwDV/A77efDJXT/nageBLAelmW/+qAdxxLVQAoSEwmk8spJdfrpv5hlyAIkiR5XIJL4gXezukdPhvMADgE4AAh5HlK6VOyLL9aqeEF/qoCUEypzs7OoCRJnKqqVkNDg1aO4oZh0NbWVrUcnnLHzjgA5Sp0ssf/H4CZRPzgwYOe/v7++YIgUFEU3YSQMPuwPgkrIEc/TIXB0c8QpXSQfVgPRJKkQULIQKl5vRNbqu4ByWTXd03LuNw0zGbB5fKwHmKFj8wppXQvx3HbCCEPxmKxrU4MnYqnqgBsvve+N5uam2dzhNi5al1drd2VrRYpudxbhw4feflT5y1fVS2ZVQPgkUce69y3b+9ZrCEwp7nFrKuL5kWXKLjdkulxuwVRdIkuwQWXS5gyR2CpNWtlsz6DbuisNZ/VdF0dGhpGOp0JpBXFx8acKsd/3rb0Y9+pBghVAeDPz//1h13Jnd8zTRMenx/+0PSHJViS5BIESxAEk9WjhmHwpmkRwzBK0sctSfT0+aeuWLRoUfE2cRnIlDThdPLa29vd3T292aySIRzPI1I/y2nCU4baYLVHZvVFK2sIIVM/NytBYsUAPProEx37X913tmWaCEVqIVZxzU+nv8/rxdymxp8sOeejN5Vg55RDKgKAZXmabgyw3kBeVREOhzE4NISBwWGkMxn7EVc1iKXQrMaoCQcRDoVQEwoWegJHLMuKVZItVgRAMpnckFGyG5mRrDHCGiQFYvFgaDgNVdVYQIOVOQAc2wUNHmjEB43zQTNHTpSJvAGRM+Aa/YvAfPDuCFjQlCQRoemP3V0gy/JzToGuCIDOzh17cvm8/RCw8ZTZJzgbaCGydTmQZ7UN4AoS8O7J02v+hTjaenvJ3TpCyG/j8fiakw4ApZTb9tJ2nbk/a1HNntVwQh149W2EO1aDaqrdjRRDBJz4Hgi6J4FjrXfA4v0nlDVmwJF4PN5ICHG03hx7QCqVOiudUTqZIuWcExSU1xDq/BKoYdjPEMQaDkQADGkOjrbeCcsVLcd4eyyltDmRSBwsm7FkPysiOZXquSqdGf4du8SOyrEjc6WSa7gXwZ3XgJoWCA8IdbU4Ov9umCJ7jl8+cRx3diwWs29GuVSJB/winVHWOQHADnz92+Df/QNb38yCjdAiHy1X97HjV8my/LgTAY4B2JlMPqQo2UucAuBE2Wl4vi7L8h1OZDoGIJlM3ppRsvZz6XKXgBNFp+OhlF6aSCS2OJHrGIDu7u7rh4bT9gmJ9xsAjuOWOS2THQOwa9euTwwMDv2ZAVDOLuDkLpXA8wFZlveXMG7SEMcApFIpr5LNZdihiVLzACcKlsAzODAwUL948eISDyGOl+gYACbm5Zc7/5VX1Rb2/cSZYAmmOBvygCzLlzljLfPp8MRJUqnUe1thNAJWob0P5HgLZLpW5AG9vb31Q8PpIywdHnto4iSCcNTn87XMmjVLcTpnRQCwSXckk09mlewK9r3aPcASjFory/LNJYybckjFAIz2BI6Zpimwft+shoaT0hEC8KZlWR+opBdQ8RIowNrV1fWj4XTG7sywc3tsW5xhooSQlfF4/OlK56nYAwoK7NiRfDGby7axhic7P8SO0MwgbZBleaSQqJCqBgDrD3Tu2PFmPq/OmWEQHovH46ud1v8T8aoaAEwwS44M00zmcvnTZgiEuwYGBq5zmvQUc5aqAlCYYGcyuUVRsp9jINRGI8dfrqrAW1nr+5uyLLNeWVVpRgCwvaGnZ6WWV+9UVbWB5QjsBUux+OmuExn07Oh7wrtPNNDJ9RkDYMwO8Y28qm3Qdb2OHWlnHd4SHpaygxQvEEJ+Fo/Hi78l4cTaIjwzDkBhzlQqxfpdN+qm+WlYCLpEQXAJLq8ouixRFNkRmCOEkH9blvWcYRjPzZs3b/L7MVUyeqyY/wLrN+J9+VeaUgAAAABJRU5ErkJggg==", "created": 1722471134279, - "lastRetrieved": 1722471134279 + "lastRetrieved": 1736751950898 }, "fffa228d79e3bc7053142e0031890d5aaf369b8a": { "mimeType": "image/png", "id": "fffa228d79e3bc7053142e0031890d5aaf369b8a", "dataURL": 
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABGRJREFUeF7tm0tsE1cUhv8Zu3FiJ3HihJCkpbxKQI55ryq6aKXQFigPBYG6aQAJEbHgKZAQCxYIlQ2gQJFSEK2gRRUtKuUhNSAC2YDEgleIbdpCpEhGEEIgNsGOnYxn0L0G14HAzI3H44w9d+N5nHt8z3fPPef4eoZDljcuy+2HAYDVA7zetgUAdwjAh6x9hyN/+Lemqw27tn42nL5K+jB7gMfj9nEcPlKiXA2Z9Tt+hGvK2JRBYAbg9bolNQxTqoMAIG2ac8K1fTs3f6q0n1I53QBIFQRdAUgFBN0BUBuCLgGoCUG3ANSCoGsAakDQPYBkIWQEgGQgZAyA4UIY8QCUVnSv5ZxOF5NNTMLkS7QuhQ0AjAQMDzCWgBEDmOIak7ARBI0sYKTBIeuA/T+fRXvHQ8aElZz4pPGVWLdq0VtK0pIGsx5AcnOpbu+0eIC6JiSnzQBgVIJpqASzPghmPYDkwpa6vY0gaATBNARBdZ04OW3GEkjFEgjU1izgAc2eCknOB0hv7gHPod7258W/5XQp2hDprZ3rAyTNngqRG7TC+76CU80fy8kqBFCj6VMhcoNWer/gVLOsfbIC5Mt6aw0Ash4g8SaEps8GzOYhJ4gP9iLvrhuhqTPBCQPI87YNkutzToVkMsPqvo2+KS6I+QVDT7QgwNp6A5wYlXUETT0g/MlkPFm59r2Dsl84h8BXCwFJwujGfch56KPy/ZVj8HjtZoDjUNT0F/xfL6HH72qjjjYi9/6/IwuAUORA94p6iCYzYLEgasunAzQH/JCiUZgiYRSfPoGu+k2QeB6WjnaUHfmBynStXofIuIngRBGjG/ei55ulEArs4EwmCPYiKmMKvgAiEfBRAaXHDsHsfzayACSOJjh9Np4t+45eqmj4Huburvht/7zF6J3zBT0vOXGUfj79diX9LLzSAvv5M3FZobQMjzZup+eOk7/C1npD1uhEAU2XgFIAUo4FnRu2QbAXw9QTm8VosQPmQA/KG3aDG+jPbADEur4qJ7rr1gyazdLjR5D3j3vQtYz0gNcWEtcmBtI40d1Fl8qbLWMBhKbNwtPldYPsLfnjF1jv3Mx8DxBt+ehcv41miQ86Y3+mDJRXgg8GUXFgN3gS7V+1jPQAkh1IliCt7KeDgCjSNEjyvq31Ohwnj2cugMRiydp2CyW/H4ulweV1IMuCNJLj8+7dpce69YCQa0Y8t1fs2RkvWp5//iUCNfNpqisn9UHAH0uDhXaa70mKtF9qQmHLhRiAIgcebdlBj0nNQMpklpa2OoBUepGJVeDDYeT4OuJjJtf7XDPAh4JvlbLhCVUQrVZYva2A+P9Pj/4x4yDm5sLS/h+tFFla2gCwDDKVsioCyPINkeDSufNFiWyJ6WZXyMeLYr3t9OUmOQ9TtCGSqMR4UFLjl6bkZvDN+ynfFtf6tTlGAD6n0yW7EZqok3kJeDyeeYB0WMt3BxVC8EkSt6a6uvq8QnkqxgyARbkeZA0AepilVI7xJZGIcV/ibAoaAAAAAElFTkSuQmCC", "created": 1721376622438, - "lastRetrieved": 1721376622438 + "lastRetrieved": 1736751950898 } } } \ No newline at end of file diff --git a/examples/notebooks/rag-pdf-1/media/rag-overview-2.png b/examples/notebooks/rag-pdf-1/media/rag-overview-2.png new file mode 100644 index 0000000000..6f3ca96bda Binary files /dev/null and b/examples/notebooks/rag-pdf-1/media/rag-overview-2.png differ diff --git a/examples/notebooks/rag/my_config.py b/examples/notebooks/rag-pdf-1/my_config.py similarity index 96% rename from examples/notebooks/rag/my_config.py rename to examples/notebooks/rag-pdf-1/my_config.py index 66fc1ecf71..e6522381b6 100644 --- a/examples/notebooks/rag/my_config.py +++ b/examples/notebooks/rag-pdf-1/my_config.py @@ -27,6 +27,7 @@ class MyConfig: # MY_CONFIG.LLM_MODEL = "meta/meta-llama-3-70b-instruct" # MY_CONFIG.LLM_MODEL = "ibm-granite/granite-3.0-2b-instruct" MY_CONFIG.LLM_MODEL = "ibm-granite/granite-3.0-8b-instruct" +MY_CONFIG.MAX_CONTEXT_WINDOW = 4096 #tokens ## RAY CONFIGURATION diff --git a/examples/notebooks/rag-pdf-1/rag_1_dpk_process_python.ipynb b/examples/notebooks/rag-pdf-1/rag_1_dpk_process_python.ipynb new file mode 100644 index 0000000000..2330785f9e --- /dev/null +++ b/examples/notebooks/rag-pdf-1/rag_1_dpk_process_python.ipynb @@ -0,0 +1,1243 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": {}, + "source": [ + "
\n", + "

Data Processing for RAG with Data Prep Kit (Python)

\n", + " \n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": {}, + "source": [ + "## Before Running the notebook\n", + "\n", + "Please complete [setting up python dev environment](./setup-python-dev-env.md)" + ] + }, + { + "cell_type": "markdown", + "id": "053ecf08-5f62-4b99-9347-8a0955843d21", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "This notebook will process PDF documents as part of RAG pipeline\n", + "\n", + "![](media/rag-overview-2.png)\n", + "\n", + "This notebook will perform steps 1, 2, 3 and 4 in RAG pipeline.\n", + "\n", + "Here are the processing steps:\n", + "\n", + "- **pdf2parquet** : Extract text (in markdown format) from PDF and store them as parquet files\n", + "- **Exact Dedup**: Documents with exact content are filtered out\n", + "- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n", + "- **Text encoder**: Convert chunks into vectors using embedding models" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": {}, + "source": [ + "## Step-1: Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "33345487", + "metadata": {}, + "outputs": [], + "source": [ + "from my_config import MY_CONFIG" + ] + }, + { + "cell_type": "markdown", + "id": "facb3bbc", + "metadata": {}, + "source": [ + "## Step-2: Data\n", + "\n", + "We will use white papers about LLMs. \n", + "\n", + "- [Granite Code Models](https://arxiv.org/abs/2405.04324)\n", + "- [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", + "\n", + "You can of course substite your own data below" + ] + }, + { + "cell_type": "markdown", + "id": "f1fe7c0c", + "metadata": {}, + "source": [ + "### 2.1 - Download data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "8739b7a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared input directory\n", + "\n", + "input/attention.pdf (2.22 MB) downloaded successfully.\n", + "\n", + "input/granite.pdf (1.27 MB) downloaded successfully.\n", + "\n", + "input/granite2.pdf (1.27 MB) downloaded successfully.\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "from utils import download_file\n", + "\n", + "shutil.rmtree(MY_CONFIG.INPUT_DATA_DIR, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)\n", + "print (\"✅ Cleared input directory\")\n", + " \n", + "download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attention.pdf' ))\n", + "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))\n", + "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite2.pdf' )) # duplicate\n" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": {}, + "source": [ + "### 2.2 - Set input/output path variables for the pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + 
"\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_dedupe_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_chunk_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": {}, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": {}, + "source": [ + "### 3.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4b101999", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:15:27 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 8}\n", + "18:15:27 INFO - pipeline id pipeline_id\n", + "18:15:27 INFO - code location None\n", + "18:15:27 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out\n", + "18:15:27 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:15:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "18:15:27 INFO - orchestrator pdf2parquet started at 2025-01-17 18:15:27\n", + "18:15:27 INFO - Number of files is 3, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 4.541904449462891}\n", + "18:15:27 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a78fe84a36c54de2a383778812391a99", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 9 files: 0%| | 0/9 [00:00 output='{output_parquet_dir}'\\n\", flush=True)\n", + "\n", + "result = Pdf2Parquet(input_folder= MY_CONFIG.INPUT_DATA_DIR,\n", + " output_folder= output_parquet_dir,\n", + " data_files_to_use=['.pdf'],\n", + " pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN, # markdown\n", + " # pdf2parquet_contents_type=pdf2parquet_contents_types.JSON # JSON\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "5ca790e0", + "metadata": {}, + "source": [ + "### 3.2 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + 
"id": "fe59563d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filename
0attention.pdfProvided proper attribution is provided, Googl...156147178f709f-cd23-4bad-957a-5e8a88c9af222949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf
1granite2.pdf## Granite Code Models: A Family of Open Found...2819295758f58b8-eaba-444a-b348-d45194a1c2e63127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-17T18:16:40.70016028.162055granite2.pdf
2granite.pdf## Granite Code Models: A Family of Open Found...2819295c19d4b3e-c045-4823-8814-43da8808d68d3127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-17T18:16:12.49781327.883452granite.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 attention.pdf Provided proper attribution is provided, Googl... \n", + "1 granite2.pdf ## Granite Code Models: A Family of Open Found... \n", + "2 granite.pdf ## Granite Code Models: A Family of Open Found... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 15 6 147 \n", + "1 28 19 295 \n", + "2 28 19 295 \n", + "\n", + " document_id document_hash ext \\\n", + "0 178f709f-cd23-4bad-957a-5e8a88c9af22 2949302674760005271 pdf \n", + "1 758f58b8-eaba-444a-b348-d45194a1c2e6 3127757990743433032 pdf \n", + "2 c19d4b3e-c045-4823-8814-43da8808d68d 3127757990743433032 pdf \n", + "\n", + " hash size \\\n", + "0 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "1 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "2 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2025-01-17T18:15:44.573338 13.146994 attention.pdf \n", + "1 2025-01-17T18:16:40.700160 28.162055 granite2.pdf \n", + "2 2025-01-17T18:16:12.497813 27.883452 granite.pdf " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_parquet_dir)\n", + "\n", + "# print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#parquet_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "3f900753", + "metadata": {}, + "source": [ + "## Step-4: Eliminate Duplicate Documents\n", + "\n", + "We have 2 duplicate documnets here : `granite.pdf` and `granite2.pdf`.\n", + "\n", + "Note how the `hash` for these documents are same.\n", + "\n", + "We are going to perform **de-dupe**\n", + "\n", + "On the content of each document, a SHA256 hash is computed, followed by de-duplication of record having identical hashes.\n", + "\n", + "[Dedupe transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)" + ] + }, + { + "cell_type": "markdown", + "id": "2ef93831", + "metadata": {}, + "source": [ + "### 4.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1901b4a1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_dedupe_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:16:40 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None}\n", + "18:16:40 INFO - pipeline id pipeline_id\n", + "18:16:40 INFO - code location None\n", + "18:16:40 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_dedupe_out\n", + "18:16:40 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:16:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:16:40 INFO - orchestrator ededup started at 2025-01-17 18:16:40\n", + "18:16:40 INFO - Number of files is 3, source profile {'max_file_size': 0.04436779022216797, 'min_file_size': 0.02082538604736328, 'total_file_size': 0.10954761505126953}\n", + "18:16:40 INFO - 
+      "18:16:40 INFO - Starting from the beginning\n",
+      "18:16:40 INFO - Completed 1 files (33.33%) in 0.0 min\n",
+      "18:16:40 INFO - Completed 2 files (66.67%) in 0.0 min\n",
+      "18:16:40 INFO - Completed 3 files (100.0%) in 0.0 min\n",
+      "18:16:40 INFO - Done processing 3 files, waiting for flush() completion.\n",
+      "18:16:40 INFO - done flushing in 0.0 sec\n",
+      "18:16:40 INFO - Completed execution in 0.0 min, execution result 0\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Stage:2 completed successfully\n",
+      "CPU times: user 32.8 ms, sys: 3.05 ms, total: 35.9 ms\n",
+      "Wall time: 32.6 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time \n",
+    "\n",
+    "from dpk_ededup.transform_python import Ededup\n",
+    "\n",
+    "STAGE = 2\n",
+    "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_parquet_dir}' --> output='{output_exact_dedupe_dir}'\\n\", flush=True)\n",
+    "\n",
+    "result = Ededup(input_folder=output_parquet_dir,\n",
+    "                output_folder=output_exact_dedupe_dir,\n",
+    "                ededup_doc_column=\"contents\",\n",
+    "                ededup_doc_id_column=\"document_id\"\n",
+    "                ).transform()\n",
+    "\n",
+    "if result == 0:\n",
+    "    print (f\"✅ Stage:{STAGE} completed successfully\")\n",
+    "else:\n",
+    "    raise Exception (f\"❌ Stage:{STAGE} failed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c45a59d2",
+   "metadata": {},
+   "source": [
+    "### 4.2 - Inspect Generated output\n",
+    "\n",
+    "We should see 2 documents: `attention.pdf` and `granite.pdf`. The duplicate `granite2.pdf` has been filtered out!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "0691f08e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Input files before exact dedupe : 3\n",
+      "Output files after exact dedupe : 2\n",
+      "Duplicate files removed : 1\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremoved
1granite.pdf## Granite Code Models: A Family of Open Found...2819295c19d4b3e-c045-4823-8814-43da8808d68d3127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-17T18:16:12.49781327.883452granite.pdf[]
0attention.pdfProvided proper attribution is provided, Googl...156147178f709f-cd23-4bad-957a-5e8a88c9af222949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "1 granite.pdf ## Granite Code Models: A Family of Open Found... \n", + "0 attention.pdf Provided proper attribution is provided, Googl... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "1 28 19 295 \n", + "0 15 6 147 \n", + "\n", + " document_id document_hash ext \\\n", + "1 c19d4b3e-c045-4823-8814-43da8808d68d 3127757990743433032 pdf \n", + "0 178f709f-cd23-4bad-957a-5e8a88c9af22 2949302674760005271 pdf \n", + "\n", + " hash size \\\n", + "1 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "0 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \n", + "1 2025-01-17T18:16:12.497813 27.883452 granite.pdf [] \n", + "0 2025-01-17T18:15:44.573338 13.146994 attention.pdf [] " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_parquet_dir)\n", + "output_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": {}, + "source": [ + "## Step-5: Doc chunks\n", + "\n", + "Split the documents in chunks.\n", + "\n", + "[Chunking transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)\n", + "\n", + "**Experiment with chunking size to find the setting that works best for your documents**" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": {}, + "source": [ + "### 5.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "2cfbf532", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_dedupe_out' --> output='output/03_chunk_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:16:41 INFO - doc_chunk parameters are : {'chunking_type': 'li_markdown', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}\n", + "18:16:41 INFO - pipeline id pipeline_id\n", + "18:16:41 INFO - code location None\n", + "18:16:41 INFO - data factory data_ is using local data access: input_folder - output/02_dedupe_out output_folder - output/03_chunk_out\n", + "18:16:41 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:16:41 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:16:41 INFO - orchestrator doc_chunk started at 2025-01-17 18:16:41\n", + "18:16:41 INFO - Number of files is 3, source profile 
{'max_file_size': 0.04471015930175781, 'min_file_size': 0.0028095245361328125, 'total_file_size': 0.06870079040527344}\n",
+      "18:16:41 INFO - Completed 1 files (33.33%) in 0.0 min\n",
+      "18:16:41 INFO - Completed 2 files (66.67%) in 0.0 min\n",
+      "18:16:41 WARNING - table is empty, skipping processing\n",
+      "18:16:41 INFO - Completed 3 files (100.0%) in 0.0 min\n",
+      "18:16:41 INFO - Done processing 3 files, waiting for flush() completion.\n",
+      "18:16:41 INFO - done flushing in 0.0 sec\n",
+      "18:16:41 INFO - Completed execution in 0.001 min, execution result 0\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Stage:3 completed successfully\n",
+      "CPU times: user 890 ms, sys: 87.7 ms, total: 978 ms\n",
+      "Wall time: 980 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "\n",
+    "from dpk_doc_chunk.transform_python import DocChunk\n",
+    "\n",
+    "STAGE = 3\n",
+    "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_chunk_dir}'\\n\", flush=True)\n",
+    "\n",
+    "result = DocChunk(input_folder=output_exact_dedupe_dir,\n",
+    "                  output_folder=output_chunk_dir,\n",
+    "                  doc_chunk_chunking_type= \"li_markdown\",\n",
+    "                  # doc_chunk_chunking_type= \"dl_json\",\n",
+    "                  doc_chunk_chunk_size_tokens = 128, # default 128\n",
+    "                  doc_chunk_chunk_overlap_tokens=30 # default 30\n",
+    "                  ).transform()\n",
+    "\n",
+    "if result == 0:\n",
+    "    print (f\"✅ Stage:{STAGE} completed successfully\")\n",
+    "else:\n",
+    "    raise Exception (f\"❌ Stage:{STAGE} failed\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "213afdf6",
+   "metadata": {},
+   "source": [
+    "### 5.2 - Inspect Generated output\n",
+    "\n",
+    "We should see the documents split into many chunks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "d8138d43",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Files processed : 2\n",
+      "Chunks created : 60\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremovedsource_document_idcontentsdocument_id
48granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-17T18:16:12.49781327.883452granite.pdf[]c19d4b3e-c045-4823-8814-43da8808d68d## 6.1.5 RepoBench, CrossCodeEval: Repository-...63337c6952e14044ce448bb0dc6a369181b7779cffcd92...
35granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-17T18:16:12.49781327.883452granite.pdf[]c19d4b3e-c045-4823-8814-43da8808d68d## 3 Model Architecture\\n\\nWe train a series o...b0ad58f3ab8f7e69f2460a6713bf65396737cb179cc374...
22attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf[]178f709f-cd23-4bad-957a-5e8a88c9af22## 6.2 Model Variations\\n\\nTo evaluate the imp...60de5803d0837ef01773367a79da7c3e47fe90bec09ecb...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "48 granite.pdf 28 19 295 \n", + "35 granite.pdf 28 19 295 \n", + "22 attention.pdf 15 6 147 \n", + "\n", + " document_hash ext \\\n", + "48 3127757990743433032 pdf \n", + "35 3127757990743433032 pdf \n", + "22 2949302674760005271 pdf \n", + "\n", + " hash size \\\n", + "48 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "35 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "22 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \\\n", + "48 2025-01-17T18:16:12.497813 27.883452 granite.pdf [] \n", + "35 2025-01-17T18:16:12.497813 27.883452 granite.pdf [] \n", + "22 2025-01-17T18:15:44.573338 13.146994 attention.pdf [] \n", + "\n", + " source_document_id \\\n", + "48 c19d4b3e-c045-4823-8814-43da8808d68d \n", + "35 c19d4b3e-c045-4823-8814-43da8808d68d \n", + "22 178f709f-cd23-4bad-957a-5e8a88c9af22 \n", + "\n", + " contents \\\n", + "48 ## 6.1.5 RepoBench, CrossCodeEval: Repository-... \n", + "35 ## 3 Model Architecture\\n\\nWe train a series o... \n", + "22 ## 6.2 Model Variations\\n\\nTo evaluate the imp... \n", + "\n", + " document_id \n", + "48 63337c6952e14044ce448bb0dc6a369181b7779cffcd92... \n", + "35 b0ad58f3ab8f7e69f2460a6713bf65396737cb179cc374... \n", + "22 60de5803d0837ef01773367a79da7c3e47fe90bec09ecb... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_exact_dedupe_dir) ## for debug purposes\n", + "output_df = read_parquet_files_as_df(output_chunk_dir)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": {}, + "source": [ + "## Step-6: Calculate Embeddings for Chunks\n", + "\n", + "we will calculate embeddings for each chunk using an open source embedding model\n", + "\n", + "[Embeddings / Text Encoder documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/README.md)" + ] + }, + { + "cell_type": "markdown", + "id": "b9112479", + "metadata": {}, + "source": [ + "### 6.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "23e8b858", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_chunk_out' --> output='output/04_embeddings_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:16:42 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "18:16:42 INFO - pipeline id pipeline_id\n", + "18:16:42 INFO - code location None\n", + "18:16:42 INFO - data factory data_ is using local data access: input_folder - output/03_chunk_out output_folder - output/04_embeddings_out\n", + "18:16:42 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:16:42 INFO - data factory data_ Not using data sets, checkpointing False, max 
files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:16:42 INFO - orchestrator text_encoder started at 2025-01-17 18:16:42\n", + "18:16:42 INFO - Number of files is 2, source profile {'max_file_size': 0.04669189453125, 'min_file_size': 0.02893352508544922, 'total_file_size': 0.07562541961669922}\n", + "18:16:44 INFO - Completed 1 files (50.0%) in 0.003 min\n", + "18:16:45 INFO - Completed 2 files (100.0%) in 0.006 min\n", + "18:16:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:16:45 INFO - done flushing in 0.0 sec\n", + "18:16:45 INFO - Completed execution in 0.044 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 1.03 s, sys: 132 ms, total: 1.16 s\n", + "Wall time: 3.21 s\n" + ] + } + ], + "source": [ + "%%time \n", + "\n", + "from dpk_text_encoder.transform_python import TextEncoder\n", + "\n", + "STAGE = 4\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_chunk_dir}' --> output='{output_embeddings_dir}'\\n\", flush=True)\n", + "\n", + "\n", + "result = TextEncoder(input_folder= output_chunk_dir, \n", + " output_folder= output_embeddings_dir, \n", + " text_encoder_model_name = MY_CONFIG.EMBEDDING_MODEL\n", + " ).transform()\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": {}, + "source": [ + "### 6.2 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "7b1c1d09", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (60, 15)\n", + "Output data dimensions (rows x columns)= (60, 16)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremovedsource_document_idcontentsdocument_idembeddings
2attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf[]178f709f-cd23-4bad-957a-5e8a88c9af22## Abstract\\n\\nThe dominant sequence transduct...590629323f9d88598a80846d1df6a83d0ad6ac53efe278...[-0.08771476, -0.12373961, 0.043168165, 0.0060...
18attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf[]178f709f-cd23-4bad-957a-5e8a88c9af22## 5.3 Optimizer\\n\\nWe used the Adam optimizer...47fc3dca18355f0f161c953a7ad213eaa8c33da0be6875...[-0.0124165565, -0.04576251, 0.037190527, -0.0...
21attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-17T18:15:44.57333813.146994attention.pdf[]178f709f-cd23-4bad-957a-5e8a88c9af22## 6.1 Machine Translation\\n\\nOn the WMT 2014 ...b7aa340533889effd73d129e0c14083277031c44becfa6...[-0.037983608, -0.067570895, -0.000437462, 0.0...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "2 attention.pdf 15 6 147 \n", + "18 attention.pdf 15 6 147 \n", + "21 attention.pdf 15 6 147 \n", + "\n", + " document_hash ext \\\n", + "2 2949302674760005271 pdf \n", + "18 2949302674760005271 pdf \n", + "21 2949302674760005271 pdf \n", + "\n", + " hash size \\\n", + "2 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "18 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "21 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \\\n", + "2 2025-01-17T18:15:44.573338 13.146994 attention.pdf [] \n", + "18 2025-01-17T18:15:44.573338 13.146994 attention.pdf [] \n", + "21 2025-01-17T18:15:44.573338 13.146994 attention.pdf [] \n", + "\n", + " source_document_id \\\n", + "2 178f709f-cd23-4bad-957a-5e8a88c9af22 \n", + "18 178f709f-cd23-4bad-957a-5e8a88c9af22 \n", + "21 178f709f-cd23-4bad-957a-5e8a88c9af22 \n", + "\n", + " contents \\\n", + "2 ## Abstract\\n\\nThe dominant sequence transduct... \n", + "18 ## 5.3 Optimizer\\n\\nWe used the Adam optimizer... \n", + "21 ## 6.1 Machine Translation\\n\\nOn the WMT 2014 ... \n", + "\n", + " document_id \\\n", + "2 590629323f9d88598a80846d1df6a83d0ad6ac53efe278... \n", + "18 47fc3dca18355f0f161c953a7ad213eaa8c33da0be6875... \n", + "21 b7aa340533889effd73d129e0c14083277031c44becfa6... \n", + "\n", + " embeddings \n", + "2 [-0.08771476, -0.12373961, 0.043168165, 0.0060... \n", + "18 [-0.0124165565, -0.04576251, 0.037190527, -0.0... \n", + "21 [-0.037983608, -0.067570895, -0.000437462, 0.0... " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_chunk_dir)\n", + "output_df = read_parquet_files_as_df(output_embeddings_dir)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": {}, + "source": [ + "## Step-7: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/04_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_embeddings_dir, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_embeddings_dir}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dpk-2-rag-pdf-r1.0.0-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/rag-pdf-1/rag_1_dpk_process_ray.ipynb b/examples/notebooks/rag-pdf-1/rag_1_dpk_process_ray.ipynb new file mode 100644 index 0000000000..92e3e882a0 
--- /dev/null +++ b/examples/notebooks/rag-pdf-1/rag_1_dpk_process_ray.ipynb @@ -0,0 +1,1308 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": {}, + "source": [ + "
\n", + "

Data Processing for RAG with Data Prep Kit (RAY)

\n", + " \n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": {}, + "source": [ + "## Before Running the notebook\n", + "\n", + "Please complete [setting up python dev environment](./setup-python-dev-env.md)" + ] + }, + { + "cell_type": "markdown", + "id": "053ecf08-5f62-4b99-9347-8a0955843d21", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "This notebook will process PDF documents as part of RAG pipeline\n", + "\n", + "![](media/rag-overview-2.png)\n", + "\n", + "This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n", + "\n", + "Here are the processing steps:\n", + "\n", + "Here are the processing steps:\n", + "\n", + "- **pdf2parquet** : Extract text (in markdown format) from PDF and store them as parquet files\n", + "- **Exact Dedup**: Documents with exact content are filtered out\n", + "- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n", + "- **Text encoder**: Convert chunks into vectors using embedding models" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": {}, + "source": [ + "## Step-1: Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "33345487", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Ray configuration: CPUs=0.5, memory=2 GB, workers=2\n" + ] + } + ], + "source": [ + "import os\n", + "from my_config import MY_CONFIG\n", + "\n", + "## RAY CONFIGURATION\n", + "num_cpus_available = os.cpu_count()\n", + "# print (num_cpus_available)\n", + "# MY_CONFIG.RAY_NUM_CPUS = num_cpus_available // 2 ## use half the available cores for processing\n", + "MY_CONFIG.RAY_NUM_CPUS = 0.5\n", + "MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "# MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", + "MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "\n", + "print (f\"Ray configuration: CPUs={MY_CONFIG.RAY_NUM_CPUS}, memory={MY_CONFIG.RAY_MEMORY_GB} GB, workers={MY_CONFIG.RAY_RUNTIME_WORKERS}\")" + ] + }, + { + "cell_type": "markdown", + "id": "40c58856", + "metadata": {}, + "source": [ + "## Step-2: Data\n", + "\n", + "We will use white papers about LLMs. 
\n", + "\n", + "- [Granite Code Models](https://arxiv.org/abs/2405.04324)\n", + "- [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", + "\n", + "You can of course substite your own data below" + ] + }, + { + "cell_type": "markdown", + "id": "6bce5939", + "metadata": {}, + "source": [ + "### 2.1 - Download data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "1bfde6eb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared input directory\n", + "\n", + "input/attention.pdf (2.22 MB) downloaded successfully.\n", + "\n", + "input/granite.pdf (1.27 MB) downloaded successfully.\n", + "\n", + "input/granite2.pdf (1.27 MB) downloaded successfully.\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "from utils import download_file\n", + "\n", + "shutil.rmtree(MY_CONFIG.INPUT_DATA_DIR, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)\n", + "print (\"✅ Cleared input directory\")\n", + " \n", + "download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attention.pdf' ))\n", + "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))\n", + "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite2.pdf' )) # duplicate\n" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": {}, + "source": [ + "### 2.2 - Set input/output path variables for the pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_dedupe_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_chunk_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_embeddings_out')\n", + "\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": {}, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": {}, + "source": [ + "### 3.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "d940a56a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='input' --> 
output='output/01_parquet_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "22:47:52 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 8}\n", + "22:47:52 INFO - pipeline id pipeline_id\n", + "22:47:52 INFO - code location None\n", + "22:47:52 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:47:52 INFO - actor creation delay 0\n", + "22:47:52 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:47:52 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out\n", + "22:47:52 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:47:52 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "22:47:52 INFO - Running locally\n", + "2025-01-19 22:47:53,502\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:47:57 INFO - orchestrator started at 2025-01-19 22:47:57\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:47:57 INFO - Number of files is 3, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 4.541904449462891}\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:47:57 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.521201325580478, 'object_store': 3.7606006618589163}\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:47:57 INFO - Number of workers - 2 with {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(RayTransformFileProcessor pid=1162792)\u001b[0m 22:48:01 INFO - Initializing models\n", + "Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 180615.96it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1162792)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[36m(RayTransformFileProcessor pid=1162792)\u001b[0m ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:48:57 INFO - Completed 1 files in 0.807 min\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:48:57 INFO - Completed 1 files (33.333%) in 0.807 min. Waiting for completion\n", + "\u001b[36m(RayTransformFileProcessor pid=1162791)\u001b[0m 22:48:01 INFO - Initializing models\n", + "Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 69391.06it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1162791)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:51:42 INFO - Completed processing 3 files in 3.571 min\n", + "\u001b[36m(orchestrate pid=1161876)\u001b[0m 22:51:42 INFO - done flushing in 0.001 sec\n", + "22:51:52 INFO - Completed execution in 4.009 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:1 completed successfully\n", + "CPU times: user 4.25 s, sys: 783 ms, total: 5.03 s\n", + "Wall time: 4min 5s\n" + ] + } + ], + "source": [ + "%%time \n", + "\n", + "from dpk_pdf2parquet.ray.transform import Pdf2Parquet\n", + "from data_processing.utils import GB\n", + "from dpk_pdf2parquet.transform import pdf2parquet_contents_types\n", + "\n", + "STAGE = 1 \n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{MY_CONFIG.INPUT_DATA_DIR}' --> output='{output_parquet_dir}'\\n\", flush=True)\n", + "\n", + "result = Pdf2Parquet(input_folder= MY_CONFIG.INPUT_DATA_DIR,\n", + " output_folder= output_parquet_dir, \n", + " data_files_to_use=['.pdf'],\n", + " pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,\n", + " # pdf2parquet_contents_type=pdf2parquet_contents_types.JSON,\n", + " \n", + " ## runtime options\n", + " run_locally= True,\n", + " num_cpus= MY_CONFIG.RAY_NUM_CPUS,\n", + " memory= MY_CONFIG.RAY_MEMORY_GB * GB,\n", + " runtime_num_workers = MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + "\n", + " ## debug\n", + " # num_cpus= 1, \n", + " # memory= MY_CONFIG.RAY_MEMORY_GB * GB, \n", + " # runtime_num_workers = 1, ## Note: has to be one for this particular job, to prevent race condition when downloading models!\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "5ca790e0", + "metadata": {}, + "source": [ + "### 3.2 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "fe59563d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filename
0attention.pdfProvided proper attribution is provided, Googl...1561470677706d-c587-4ddc-a52d-7ed12b082cbe2949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-19T22:48:56.99251948.361864attention.pdf
1granite2.pdf## Granite Code Models: A Family of Open Found...2819295ab9c2476-0e95-4b0e-84c9-1efab49761de3127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:51:42.875845165.833343granite2.pdf
2granite.pdf## Granite Code Models: A Family of Open Found...281929520953ad1-8227-4454-8b20-c412f55201853127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 attention.pdf Provided proper attribution is provided, Googl... \n", + "1 granite2.pdf ## Granite Code Models: A Family of Open Found... \n", + "2 granite.pdf ## Granite Code Models: A Family of Open Found... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 15 6 147 \n", + "1 28 19 295 \n", + "2 28 19 295 \n", + "\n", + " document_id document_hash ext \\\n", + "0 0677706d-c587-4ddc-a52d-7ed12b082cbe 2949302674760005271 pdf \n", + "1 ab9c2476-0e95-4b0e-84c9-1efab49761de 3127757990743433032 pdf \n", + "2 20953ad1-8227-4454-8b20-c412f5520185 3127757990743433032 pdf \n", + "\n", + " hash size \\\n", + "0 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "1 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "2 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2025-01-19T22:48:56.992519 48.361864 attention.pdf \n", + "1 2025-01-19T22:51:42.875845 165.833343 granite2.pdf \n", + "2 2025-01-19T22:50:57.694515 169.037999 granite.pdf " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_parquet_dir)\n", + "# print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "output_df.head(5)\n", + "## To display certain columns\n", + "#parquet_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "8c54f1d7", + "metadata": {}, + "source": [ + "## Step-4: Eliminate Duplicate Documents\n", + "\n", + "We have 2 duplicate documnets here : `granite.pdf` and `granite2.pdf`.\n", + "\n", + "Note how the `hash` for these documents are same.\n", + "\n", + "We are going to perform **de-dupe**\n", + "\n", + "On the content of each document, a SHA256 hash is computed, followed by de-duplication of record having identical hashes.\n", + "\n", + "[Dedupe transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)" + ] + }, + { + "cell_type": "markdown", + "id": "5133e8b7", + "metadata": {}, + "source": [ + "### 4.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "60014643", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_dedupe_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "22:51:54 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'document_id', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "22:51:54 INFO - pipeline id pipeline_id\n", + "22:51:54 INFO - code location None\n", + "22:51:54 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:51:54 INFO - actor creation delay 0\n", + "22:51:54 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:51:54 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_dedupe_out\n", + "22:51:54 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:51:54 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files 
to checkpoint ['.parquet']\n", + "22:51:54 INFO - Running locally\n", + "2025-01-19 22:51:55,430\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:56 INFO - orchestrator started at 2025-01-19 22:51:56\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:56 INFO - Number of files is 3, source profile {'max_file_size': 0.04436779022216797, 'min_file_size': 0.02082538604736328, 'total_file_size': 0.10954761505126953}\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:56 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.416252136230469, 'object_store': 3.7081260681152344}\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:56 INFO - Number of workers - 2 with {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:58 INFO - Completed 1 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:58 INFO - Completed 1 files (33.333%) in 0.004 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:58 INFO - Completed processing 3 files in 0.004 min\n", + "\u001b[36m(orchestrate pid=1165300)\u001b[0m 22:51:58 INFO - done flushing in 0.001 sec\n", + "22:52:08 INFO - Completed execution in 0.228 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 93.2 ms, sys: 141 ms, total: 234 ms\n", + "Wall time: 14.9 s\n" + ] + } + ], + "source": [ + "%%time \n", + "\n", + "from dpk_ededup.ray.transform import Ededup\n", + "\n", + "STAGE = 2\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_parquet_dir}' --> output='{output_exact_dedupe_dir}'\\n\", flush=True)\n", + "\n", + "result = Ededup(input_folder=output_parquet_dir,\n", + " output_folder=output_exact_dedupe_dir,\n", + " ededup_hash_cpu= 0.5,\n", + " ededup_num_hashes= 2,\n", + " ededup_doc_column=\"contents\",\n", + " ededup_doc_id_column=\"document_id\",\n", + " \n", + " ## runtime options\n", + " run_locally= True,\n", + " num_cpus= MY_CONFIG.RAY_NUM_CPUS,\n", + " memory= MY_CONFIG.RAY_MEMORY_GB * GB,\n", + " runtime_num_workers = MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a15d456a", + "metadata": {}, + "source": [ + "### 4.2 - Inspect Generated output\n", + "\n", + "We would see 2 documents: `attention.pdf` and `granite.pdf`. The duplicate `granite.pdf` has been filtered out!" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "0d93c248", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input files before exact dedupe : 3\n", + "Output files after exact dedupe : 2\n", + "Duplicate files removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_iddocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremoved
0attention.pdfProvided proper attribution is provided, Googl...1561470677706d-c587-4ddc-a52d-7ed12b082cbe2949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-19T22:48:56.99251948.361864attention.pdf[]
1granite.pdf## Granite Code Models: A Family of Open Found...281929520953ad1-8227-4454-8b20-c412f55201853127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf[]
\n", + "
" + ], + "text/plain": [ + " filename contents \\\n", + "0 attention.pdf Provided proper attribution is provided, Googl... \n", + "1 granite.pdf ## Granite Code Models: A Family of Open Found... \n", + "\n", + " num_pages num_tables num_doc_elements \\\n", + "0 15 6 147 \n", + "1 28 19 295 \n", + "\n", + " document_id document_hash ext \\\n", + "0 0677706d-c587-4ddc-a52d-7ed12b082cbe 2949302674760005271 pdf \n", + "1 20953ad1-8227-4454-8b20-c412f5520185 3127757990743433032 pdf \n", + "\n", + " hash size \\\n", + "0 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "1 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \n", + "0 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", + "1 2025-01-19T22:50:57.694515 169.037999 granite.pdf [] " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_parquet_dir)\n", + "output_df = read_parquet_files_as_df(output_exact_dedupe_dir)\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input files before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output files after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate files removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": {}, + "source": [ + "## Step-5: Doc chunks\n", + "\n", + "Split the documents in chunks.\n", + "\n", + "[Chunking transform documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)\n", + "\n", + "**Experiment with chunking size to find the setting that works best for your documents**" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": {}, + "source": [ + "### 5.1 - Execute " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "f1fbdbca", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_dedupe_out' --> output='output/03_chunk_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "22:52:10 INFO - doc_chunk parameters are : {'chunking_type': 'li_markdown', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}\n", + "22:52:10 INFO - pipeline id pipeline_id\n", + "22:52:10 INFO - code location None\n", + "22:52:10 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:52:10 INFO - actor creation delay 0\n", + "22:52:10 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:52:10 INFO - data factory data_ is using local data access: input_folder - output/02_dedupe_out output_folder - output/03_chunk_out\n", + "22:52:10 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:52:10 INFO - data 
factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:52:10 INFO - Running locally\n", + "2025-01-19 22:52:11,705\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:14 INFO - orchestrator started at 2025-01-19 22:52:14\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:14 INFO - Number of files is 3, source profile {'max_file_size': 0.04471015930175781, 'min_file_size': 0.0028095245361328125, 'total_file_size': 0.06870079040527344}\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:14 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.387397004291415, 'object_store': 3.693698501214385}\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:14 INFO - Number of workers - 2 with {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:16 INFO - Completed 1 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:16 INFO - Completed 1 files (33.333%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:16 INFO - Completed processing 3 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=1166877)\u001b[0m 22:52:16 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=1167753)\u001b[0m 22:52:16 WARNING - table is empty, skipping processing\n", + "22:52:26 INFO - Completed execution in 0.262 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 1.03 s, sys: 311 ms, total: 1.34 s\n", + "Wall time: 18.3 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from dpk_doc_chunk.ray.transform import DocChunk\n", + "from data_processing.utils import GB\n", + "\n", + "STAGE = 3\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_exact_dedupe_dir}' --> output='{output_chunk_dir}'\\n\", flush=True)\n", + "\n", + "result = DocChunk(input_folder=output_exact_dedupe_dir,\n", + " output_folder=output_chunk_dir,\n", + " doc_chunk_chunking_type= \"li_markdown\",\n", + "\n", + " ## runtime options\n", + " run_locally= True,\n", + " num_cpus= MY_CONFIG.RAY_NUM_CPUS,\n", + " memory= MY_CONFIG.RAY_MEMORY_GB * GB,\n", + " runtime_num_workers = MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": {}, + "source": [ + "### 5.2 - Inspect Generated output\n", + "\n", + "We would see documents are split into many chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d8138d43", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 60\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremovedsource_document_idcontentsdocument_id
1attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-19T22:48:56.99251948.361864attention.pdf[]0677706d-c587-4ddc-a52d-7ed12b082cbe## Attention Is All You Need\\n\\nAshish Vaswani...45e678f43369d5fa127105b7cca6a6e4dd4deed6422185...
58granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf[]20953ad1-8227-4454-8b20-c412f5520185## References\\n\\nWasi Uddin Ahmad, Md Golam Ra...b787f46ab644038e472b9815a122eead379ed7f37a3d4f...
52granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf[]20953ad1-8227-4454-8b20-c412f5520185## 6.4 Code Reasoning, Understanding and Execu...1c7f5e76a2aaad73f5f03549b065016b0703239538839d...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "1 attention.pdf 15 6 147 \n", + "58 granite.pdf 28 19 295 \n", + "52 granite.pdf 28 19 295 \n", + "\n", + " document_hash ext \\\n", + "1 2949302674760005271 pdf \n", + "58 3127757990743433032 pdf \n", + "52 3127757990743433032 pdf \n", + "\n", + " hash size \\\n", + "1 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "58 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "52 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \\\n", + "1 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", + "58 2025-01-19T22:50:57.694515 169.037999 granite.pdf [] \n", + "52 2025-01-19T22:50:57.694515 169.037999 granite.pdf [] \n", + "\n", + " source_document_id \\\n", + "1 0677706d-c587-4ddc-a52d-7ed12b082cbe \n", + "58 20953ad1-8227-4454-8b20-c412f5520185 \n", + "52 20953ad1-8227-4454-8b20-c412f5520185 \n", + "\n", + " contents \\\n", + "1 ## Attention Is All You Need\\n\\nAshish Vaswani... \n", + "58 ## References\\n\\nWasi Uddin Ahmad, Md Golam Ra... \n", + "52 ## 6.4 Code Reasoning, Understanding and Execu... \n", + "\n", + " document_id \n", + "1 45e678f43369d5fa127105b7cca6a6e4dd4deed6422185... \n", + "58 b787f46ab644038e472b9815a122eead379ed7f37a3d4f... \n", + "52 1c7f5e76a2aaad73f5f03549b065016b0703239538839d... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_exact_dedupe_dir) ## for debug purposes\n", + "output_df = read_parquet_files_as_df(output_chunk_dir)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "# print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "# print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": {}, + "source": [ + "## Step-6: Calculate Embeddings for Chunks\n", + "\n", + "we will calculate embeddings for each chunk using an open source embedding model\n", + "\n", + "[Embeddings / Text Encoder documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/README.md)" + ] + }, + { + "cell_type": "markdown", + "id": "1e6a88f8", + "metadata": {}, + "source": [ + "### 6.1 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "76132f76", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_chunk_out' --> output='output/04_embeddings_out'\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "22:52:28 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "22:52:28 INFO - pipeline id pipeline_id\n", + "22:52:28 INFO - code location None\n", + "22:52:28 INFO - number of workers 2 worker options {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:52:28 INFO - actor creation delay 0\n", + "22:52:28 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + 
"22:52:28 INFO - data factory data_ is using local data access: input_folder - output/03_chunk_out output_folder - output/04_embeddings_out\n", + "22:52:28 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:52:28 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:52:28 INFO - Running locally\n", + "2025-01-19 22:52:29,668\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:33 INFO - orchestrator started at 2025-01-19 22:52:33\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:33 INFO - Number of files is 2, source profile {'max_file_size': 0.04669189453125, 'min_file_size': 0.02893352508544922, 'total_file_size': 0.07562541961669922}\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:33 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 7.35516357421875, 'object_store': 3.677581787109375}\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:33 INFO - Number of workers - 2 with {'num_cpus': 0.5, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:40 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:42 INFO - Completed processing 2 files in 0.037 min\n", + "\u001b[36m(orchestrate pid=1168388)\u001b[0m 22:52:42 INFO - done flushing in 0.001 sec\n", + "22:52:52 INFO - Completed execution in 0.397 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 674 ms, sys: 298 ms, total: 972 ms\n", + "Wall time: 26.2 s\n" + ] + } + ], + "source": [ + "%%time \n", + "\n", + "from dpk_text_encoder.ray.transform import TextEncoder\n", + "from data_processing.utils import GB\n", + "\n", + "STAGE = 4\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{output_chunk_dir}' --> output='{output_embeddings_dir}'\\n\", flush=True)\n", + "\n", + "result = TextEncoder(input_folder= output_chunk_dir, \n", + " output_folder= output_embeddings_dir, \n", + " text_encoder_model_name = MY_CONFIG.EMBEDDING_MODEL,\n", + " \n", + " ## runtime options\n", + " run_locally= True,\n", + " num_cpus= MY_CONFIG.RAY_NUM_CPUS,\n", + " memory= MY_CONFIG.RAY_MEMORY_GB * GB,\n", + " runtime_num_workers = MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " ).transform()\n", + "\n", + "if result == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (f\"❌ Stage:{STAGE} failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": {}, + "source": [ + "### 6.2 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "7b1c1d09", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (60, 15)\n", + "Output data dimensions (rows x columns)= (60, 16)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_hashexthashsizedate_acquiredpdf_convert_timesource_filenameremovedsource_document_idcontentsdocument_idembeddings
44granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf[]20953ad1-8227-4454-8b20-c412f5520185## 6.1.1 HumanEvalSynthesize: Multilingual Cod...b10bcf46720fb7fff15818c4bc03ec37ae84181e6cbbc1...[-0.03851807, 0.00934296, 0.02425409, -0.00439...
15attention.pdf1561472949302674760005271pdff1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...460402025-01-19T22:48:56.99251948.361864attention.pdf[]0677706d-c587-4ddc-a52d-7ed12b082cbe## 5 Training\\n\\nThis section describes the tr...7e7dce074e6995e9c9551e1349cad58153b319c45e20a1...[-0.02469791, -0.077463716, 0.07508141, 0.0363...
40granite.pdf28192953127757990743433032pdf0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41...1276782025-01-19T22:50:57.694515169.037999granite.pdf[]20953ad1-8227-4454-8b20-c412f5520185## 4.4 Infrastructure\\n\\nWe train the Granite ...81f51d21fd61607d8aa9bb50925c2bc936fa1da7b27f4b...[-0.033672214, -0.01862875, 0.0034308454, 0.06...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "44 granite.pdf 28 19 295 \n", + "15 attention.pdf 15 6 147 \n", + "40 granite.pdf 28 19 295 \n", + "\n", + " document_hash ext \\\n", + "44 3127757990743433032 pdf \n", + "15 2949302674760005271 pdf \n", + "40 3127757990743433032 pdf \n", + "\n", + " hash size \\\n", + "44 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "15 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "40 0534b6a29ab9cedf21c3f6cf756cf0252d19a8e9135a41... 127678 \n", + "\n", + " date_acquired pdf_convert_time source_filename removed \\\n", + "44 2025-01-19T22:50:57.694515 169.037999 granite.pdf [] \n", + "15 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", + "40 2025-01-19T22:50:57.694515 169.037999 granite.pdf [] \n", + "\n", + " source_document_id \\\n", + "44 20953ad1-8227-4454-8b20-c412f5520185 \n", + "15 0677706d-c587-4ddc-a52d-7ed12b082cbe \n", + "40 20953ad1-8227-4454-8b20-c412f5520185 \n", + "\n", + " contents \\\n", + "44 ## 6.1.1 HumanEvalSynthesize: Multilingual Cod... \n", + "15 ## 5 Training\\n\\nThis section describes the tr... \n", + "40 ## 4.4 Infrastructure\\n\\nWe train the Granite ... \n", + "\n", + " document_id \\\n", + "44 b10bcf46720fb7fff15818c4bc03ec37ae84181e6cbbc1... \n", + "15 7e7dce074e6995e9c9551e1349cad58153b319c45e20a1... \n", + "40 81f51d21fd61607d8aa9bb50925c2bc936fa1da7b27f4b... \n", + "\n", + " embeddings \n", + "44 [-0.03851807, 0.00934296, 0.02425409, -0.00439... \n", + "15 [-0.02469791, -0.077463716, 0.07508141, 0.0363... \n", + "40 [-0.033672214, -0.01862875, 0.0034308454, 0.06... " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils import read_parquet_files_as_df\n", + "\n", + "input_df = read_parquet_files_as_df(output_chunk_dir)\n", + "output_df = read_parquet_files_as_df(output_embeddings_dir)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.sample(min(3, output_df.shape[0]))" + ] + }, + { + "cell_type": "markdown", + "id": "8b80bc44", + "metadata": {}, + "source": [ + "## Step-7: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/04_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_embeddings_dir, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_embeddings_dir}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dpk-1-rag-pdf-r1.0.0.a4-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/rag/rag_1B_load_data_into_milvus.ipynb b/examples/notebooks/rag-pdf-1/rag_2_load_data_into_milvus.ipynb similarity index 57% rename from 
examples/notebooks/rag/rag_1B_load_data_into_milvus.ipynb rename to examples/notebooks/rag-pdf-1/rag_2_load_data_into_milvus.ipynb index e481cf9eef..6f3872cb8a 100644 --- a/examples/notebooks/rag/rag_1B_load_data_into_milvus.ipynb +++ b/examples/notebooks/rag-pdf-1/rag_2_load_data_into_milvus.ipynb @@ -8,9 +8,9 @@ "\n", "This notebook loads output from data prep kit into Milvus\n", "\n", - "**Step-4 in this workflow**\n", + "**Step-5 in this workflow**\n", "\n", - "![](../media/rag-overview-2.png)\n" + "![](media/rag-overview-2.png)\n" ] }, { @@ -50,10 +50,10 @@ "Loading data from : output/output_final\n", "Number of parquet files to read : 2\n", "\n", - "Read file: 'output/output_final/granite.parquet'. number of rows = 123\n", - "Read file: 'output/output_final/attension.parquet'. number of rows = 88\n", + "Read file: 'output/output_final/attention.parquet'. number of rows = 27\n", + "Read file: 'output/output_final/granite.parquet'. number of rows = 33\n", "\n", - "Total number of rows = 211\n" + "Total number of rows = 60\n" ] } ], @@ -94,32 +94,28 @@ "text": [ "embedding length: 384\n", "\n", - "RangeIndex: 211 entries, 0 to 210\n", - "Data columns (total 20 columns):\n", + "RangeIndex: 60 entries, 0 to 59\n", + "Data columns (total 16 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", - " 0 filename 211 non-null object \n", - " 1 num_pages 211 non-null int64 \n", - " 2 num_tables 211 non-null int64 \n", - " 3 num_doc_elements 211 non-null int64 \n", - " 4 ext 211 non-null object \n", - " 5 hash 211 non-null object \n", - " 6 size 211 non-null int64 \n", - " 7 date_acquired 211 non-null object \n", - " 8 pdf_convert_time 211 non-null float64\n", - " 9 source_filename 211 non-null object \n", - " 10 source_document_id 211 non-null object \n", - " 11 text 211 non-null object \n", - " 12 doc_jsonpath 211 non-null object \n", - " 13 page_number 211 non-null int64 \n", - " 14 bbox 211 non-null object \n", - " 15 document_id 211 non-null object \n", - " 16 chunk_id 211 non-null int64 \n", - " 17 removed 211 non-null object \n", - " 18 chunk_hash 211 non-null int64 \n", - " 19 vector 211 non-null object \n", - "dtypes: float64(1), int64(7), object(12)\n", - "memory usage: 33.1+ KB\n", + " 0 filename 60 non-null object \n", + " 1 num_pages 60 non-null int64 \n", + " 2 num_tables 60 non-null int64 \n", + " 3 num_doc_elements 60 non-null int64 \n", + " 4 document_hash 60 non-null object \n", + " 5 ext 60 non-null object \n", + " 6 hash 60 non-null object \n", + " 7 size 60 non-null int64 \n", + " 8 date_acquired 60 non-null object \n", + " 9 pdf_convert_time 60 non-null float64\n", + " 10 source_filename 60 non-null object \n", + " 11 removed 60 non-null object \n", + " 12 source_document_id 60 non-null object \n", + " 13 text 60 non-null object \n", + " 14 document_id 60 non-null object \n", + " 15 vector 60 non-null object \n", + "dtypes: float64(1), int64(4), object(11)\n", + "memory usage: 7.6+ KB\n", "None\n" ] }, @@ -148,138 +144,122 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", + " document_hash\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " removed\n", " source_document_id\n", " text\n", - " doc_jsonpath\n", - " page_number\n", - " bbox\n", " document_id\n", - " chunk_id\n", - " removed\n", - " chunk_hash\n", " vector\n", " \n", " \n", " \n", " \n", " 0\n", - " granite.pdf\n", - " 28\n", - " 17\n", - " 348\n", + " attention.pdf\n", + " 15\n", + " 6\n", + " 147\n", + " 
2949302674760005271\n", " pdf\n", - " 79c53d694df467391e94f279af2fa6a9a7e45c3922546e...\n", - " 655054\n", - " 2024-10-02T00:28:23.836369\n", - " 167.768806\n", - " granite.pdf\n", - " 81bc331a-69cf-49bd-84b9-afedcab1344a\n", - " Granite Code Models: A Family of Open Foundati...\n", - " $.main-text[3]\n", - " 1\n", - " [142.70646667, 672.96929932, 468.58251953, 711...\n", - " b773445f7cf4cc9a5bf6ec296c74504f93c9c179028ac6...\n", - " 88\n", + " f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...\n", + " 46040\n", + " 2025-01-19T22:48:56.992519\n", + " 48.361864\n", + " attention.pdf\n", " []\n", - " -1\n", - " [-0.015789315, -0.07841933, -0.032271657, 0.00...\n", + " 0677706d-c587-4ddc-a52d-7ed12b082cbe\n", + " Provided proper attribution is provided, Googl...\n", + " 40364b6813455711d85ac8fb680212f946dd00b2f59f31...\n", + " [-0.005492052, 0.006140055, 0.004378937, -0.00...\n", " \n", " \n", " 1\n", - " granite.pdf\n", - " 28\n", - " 17\n", - " 348\n", + " attention.pdf\n", + " 15\n", + " 6\n", + " 147\n", + " 2949302674760005271\n", " pdf\n", - " 79c53d694df467391e94f279af2fa6a9a7e45c3922546e...\n", - " 655054\n", - " 2024-10-02T00:28:23.836369\n", - " 167.768806\n", - " granite.pdf\n", - " 81bc331a-69cf-49bd-84b9-afedcab1344a\n", - " Granite Code Models: A Family of Open Foundati...\n", - " $.main-text[4]\n", - " 1\n", - " [107.61845398, 535.62896729, 503.99923706, 647...\n", - " 7353bcc8d99c279335eaf120c793ca6a08f9a4fddcbb5b...\n", - " 89\n", + " f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...\n", + " 46040\n", + " 2025-01-19T22:48:56.992519\n", + " 48.361864\n", + " attention.pdf\n", " []\n", - " -1\n", - " [-0.059480786, -0.056680508, -0.042864937, -0....\n", + " 0677706d-c587-4ddc-a52d-7ed12b082cbe\n", + " ## Attention Is All You Need\\n\\nAshish Vaswani...\n", + " 45e678f43369d5fa127105b7cca6a6e4dd4deed6422185...\n", + " [0.0298234, -0.006213936, 0.06320297, -0.00840...\n", " \n", " \n", " 2\n", - " granite.pdf\n", - " 28\n", - " 17\n", - " 348\n", + " attention.pdf\n", + " 15\n", + " 6\n", + " 147\n", + " 2949302674760005271\n", " pdf\n", - " 79c53d694df467391e94f279af2fa6a9a7e45c3922546e...\n", - " 655054\n", - " 2024-10-02T00:28:23.836369\n", - " 167.768806\n", - " granite.pdf\n", - " 81bc331a-69cf-49bd-84b9-afedcab1344a\n", - " Granite Code Models: A Family of Open Foundati...\n", - " $.main-text[5]\n", - " 1\n", - " [220.87228394, 484.46414185, 390.87872314, 529...\n", - " 389267895ca214924a0a071df8379c2b15fcf374f232a6...\n", - " 90\n", + " f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca...\n", + " 46040\n", + " 2025-01-19T22:48:56.992519\n", + " 48.361864\n", + " attention.pdf\n", " []\n", - " -1\n", - " [-0.07557265, -0.07152908, -0.048923455, -0.04...\n", + " 0677706d-c587-4ddc-a52d-7ed12b082cbe\n", + " ## Abstract\\n\\nThe dominant sequence transduct...\n", + " 590629323f9d88598a80846d1df6a83d0ad6ac53efe278...\n", + " [-0.08771475, -0.12373961, 0.043168113, 0.0060...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 granite.pdf 28 17 348 pdf \n", - "1 granite.pdf 28 17 348 pdf \n", - "2 granite.pdf 28 17 348 pdf \n", + " filename num_pages num_tables num_doc_elements \\\n", + "0 attention.pdf 15 6 147 \n", + "1 attention.pdf 15 6 147 \n", + "2 attention.pdf 15 6 147 \n", "\n", - " hash size \\\n", - "0 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "1 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "2 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 
655054 \n", + " document_hash ext \\\n", + "0 2949302674760005271 pdf \n", + "1 2949302674760005271 pdf \n", + "2 2949302674760005271 pdf \n", "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "1 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "2 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", + " hash size \\\n", + "0 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "1 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", + "2 f1f600333e46c5d7e23f5a110a903ee38aab0bf7047eca... 46040 \n", "\n", - " source_document_id \\\n", - "0 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "1 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "2 81bc331a-69cf-49bd-84b9-afedcab1344a \n", + " date_acquired pdf_convert_time source_filename removed \\\n", + "0 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", + "1 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", + "2 2025-01-19T22:48:56.992519 48.361864 attention.pdf [] \n", "\n", - " text doc_jsonpath \\\n", - "0 Granite Code Models: A Family of Open Foundati... $.main-text[3] \n", - "1 Granite Code Models: A Family of Open Foundati... $.main-text[4] \n", - "2 Granite Code Models: A Family of Open Foundati... $.main-text[5] \n", + " source_document_id \\\n", + "0 0677706d-c587-4ddc-a52d-7ed12b082cbe \n", + "1 0677706d-c587-4ddc-a52d-7ed12b082cbe \n", + "2 0677706d-c587-4ddc-a52d-7ed12b082cbe \n", "\n", - " page_number bbox \\\n", - "0 1 [142.70646667, 672.96929932, 468.58251953, 711... \n", - "1 1 [107.61845398, 535.62896729, 503.99923706, 647... \n", - "2 1 [220.87228394, 484.46414185, 390.87872314, 529... \n", + " text \\\n", + "0 Provided proper attribution is provided, Googl... \n", + "1 ## Attention Is All You Need\\n\\nAshish Vaswani... \n", + "2 ## Abstract\\n\\nThe dominant sequence transduct... \n", "\n", - " document_id chunk_id removed \\\n", - "0 b773445f7cf4cc9a5bf6ec296c74504f93c9c179028ac6... 88 [] \n", - "1 7353bcc8d99c279335eaf120c793ca6a08f9a4fddcbb5b... 89 [] \n", - "2 389267895ca214924a0a071df8379c2b15fcf374f232a6... 90 [] \n", + " document_id \\\n", + "0 40364b6813455711d85ac8fb680212f946dd00b2f59f31... \n", + "1 45e678f43369d5fa127105b7cca6a6e4dd4deed6422185... \n", + "2 590629323f9d88598a80846d1df6a83d0ad6ac53efe278... \n", "\n", - " chunk_hash vector \n", - "0 -1 [-0.015789315, -0.07841933, -0.032271657, 0.00... \n", - "1 -1 [-0.059480786, -0.056680508, -0.042864937, -0.... \n", - "2 -1 [-0.07557265, -0.07152908, -0.048923455, -0.04... " + " vector \n", + "0 [-0.005492052, 0.006140055, 0.004378937, -0.00... \n", + "1 [0.0298234, -0.006213936, 0.06320297, -0.00840... \n", + "2 [-0.08771475, -0.12373961, 0.043168113, 0.0060... 
" ] }, "execution_count": 3, @@ -361,6 +341,7 @@ "name": "stdout", "output_type": "stream", "text": [ + "✅ Cleared collection : dpk_papers\n", "✅ Created collection : dpk_papers\n" ] } @@ -382,6 +363,13 @@ "print (\"✅ Created collection :\", MY_CONFIG.COLLECTION_NAME)\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step-5: Insert Data into Collection" + ] + }, { "cell_type": "code", "execution_count": 6, @@ -391,13 +379,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "inserted # rows 211\n" + "inserted # rows 60\n" ] }, { "data": { "text/plain": [ - "{'row_count': 211}" + "{'row_count': 60}" ] }, "execution_count": 6, @@ -417,7 +405,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Step-5: Close DB Connection\n", + "## Step-6: Close DB Connection\n", "\n", "Close the connection so the lock files are relinquished and other notebooks can access the db" ] @@ -453,7 +441,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "dpk-2-rag-pdf-r1.0.0-py3.11", "language": "python", "name": "python3" }, @@ -467,7 +455,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.11.11" } }, "nbformat": 4, diff --git a/examples/notebooks/rag-pdf-1/rag_3_vector_search.ipynb b/examples/notebooks/rag-pdf-1/rag_3_vector_search.ipynb new file mode 100644 index 0000000000..e650c9db8b --- /dev/null +++ b/examples/notebooks/rag-pdf-1/rag_3_vector_search.ipynb @@ -0,0 +1,431 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Handy Utils to do Vector Search on Collections" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step-1: Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from my_config import MY_CONFIG" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step-2: Connect to Vector Database\n", + "\n", + "Milvus can be embedded and easy to use.\n", + "\n", + "Note: If you encounter an error about unable to load database, try this: \n", + "\n", + "- In **vscode** : **restart the kernel** of previous notebook. This will release the db.lock \n", + "- In **Jupyter**: Do `File --> Close and Shutdown Notebook` of previous notebook. This will release the db.lock\n", + "- Re-run this cell again\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Connected to Milvus instance: ./rag_1_dpk.db\n" + ] + } + ], + "source": [ + "from pymilvus import MilvusClient\n", + "\n", + "milvus_client = MilvusClient(MY_CONFIG.DB_URI)\n", + "\n", + "print (\"✅ Connected to Milvus instance:\", MY_CONFIG.DB_URI)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step-3: Setup Embeddings\n", + "\n", + "Two choices here. \n", + "\n", + "1. use sentence transformers directly\n", + "2. 
use Milvus model wrapper" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "## Option 1 - use sentence transformers directly\n", + "\n", + "# If connection to https://huggingface.co/ failed, uncomment the following path\n", + "import os\n", + "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", + "\n", + "from sentence_transformers import SentenceTransformer\n", + "\n", + "embedding_model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)\n", + "\n", + "def get_embeddings (str):\n", + " embeddings = embedding_model.encode(str, normalize_embeddings=True)\n", + " return embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "## Option 2 - Milvus model\n", + "from pymilvus import model\n", + "\n", + "# If connection to https://huggingface.co/ failed, uncomment the following path\n", + "import os\n", + "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", + "\n", + "\n", + "# embedding_fn = model.DefaultEmbeddingFunction()\n", + "\n", + "## initialize the SentenceTransformerEmbeddingFunction\n", + "embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(\n", + " model_name = MY_CONFIG.EMBEDDING_MODEL,\n", + " device='cpu' # this will work on all devices (KIS)\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sentence transformer : embeddings len = 384\n", + "sentence transformer : embeddings[:5] = [ 0.02468892 0.10352131 0.0275264 -0.08551715 -0.01412829]\n", + "milvus model wrapper : embeddings len = 384\n", + "milvus model wrapper : embeddings[:5] = [ 0.02468898 0.10352129 0.02752643 -0.08551721 -0.01412823]\n" + ] + } + ], + "source": [ + "# Test Embeddings\n", + "text = 'Paris 2024 Olympics'\n", + "embeddings = get_embeddings(text)\n", + "print ('sentence transformer : embeddings len =', len(embeddings))\n", + "print ('sentence transformer : embeddings[:5] = ', embeddings[:5])\n", + "\n", + "embeddings = embedding_fn([text])\n", + "print ('milvus model wrapper : embeddings len =', len(embeddings[0]))\n", + "print ('milvus model wrapper : embeddings[:5] = ', embeddings[0][:5])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step-4: Do A Vector Search\n", + "\n", + "We will do this to verify data" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "\n", + "## helper function to perform vector search\n", + "def do_vector_search (query):\n", + " query_vectors = [get_embeddings(query)] # Option 1 - using sentence transformers\n", + " # query_vectors = embedding_fn([query]) # using Milvus model \n", + "\n", + " results = milvus_client.search(\n", + " collection_name=MY_CONFIG.COLLECTION_NAME, # target collection\n", + " data=query_vectors, # query vectors\n", + " limit=5, # number of returned entities\n", + " output_fields=[\"filename\", \"page_number\", \"text\"], # specifies fields to be returned\n", + " )\n", + " return results\n", + "## ----\n", + "\n", + "def print_search_results (results):\n", + " # pprint (results)\n", + " print ('num results : ', len(results[0]))\n", + "\n", + " for i, r in enumerate (results[0]):\n", + " #pprint(r, indent=4)\n", + " print (f'------ result {i+1} --------')\n", + " print ('search score:', r['distance'])\n", + " print ('filename:', r['entity']['filename'])\n", + " if 'page_number' 
in r['entity']:\n", + " print ('page number:', r['entity']['page_number'])\n", + " print ('text:\\n', r['entity']['text'])\n", + " print()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "num results : 5\n", + "------ result 1 --------\n", + "search score: 0.5530709028244019\n", + "filename: granite.pdf\n", + "text:\n", + " ## 5 Instruction Tuning\n", + "\n", + "Finetuning code LLMs on a variety of tasks explained via instructions has been shown to improve model usability and general performance. While there has been much progress in code instruction tuning, most of them adopt synthetically generated data from OpenAI models, which limits the model use in many enterprise applications. Thus, following OctoCoder (Muennighoff et al., 2023), we use only a combination of permissively licensed data, with an aim to enhance instruction following capabilities of our models, including logical reasoning and problem-solving skills. Specifically, Granite Code Instruct models are trained on the following types of data.\n", + "\n", + "- · Code Commits Dataset : CommitPackFT (Muennighoff et al., 2023), a filtered version of full CommitPack dataset across 92 programming languages 6 ;\n", + "\n", + "Table 2: Summary of evaluation tasks.\n", + "\n", + "| Task | Benchmark | Reference |\n", + "|------------------------------------|---------------------|---------------------------|\n", + "| Multilingual code generation | HumanEvalSynthesize | Muennighoff et al. (2023) |\n", + "| Multilingual code generation | MultiPL-E | Cassano et al. (2023) |\n", + "| Python code generation | MBPP | Austin et al. (2021) |\n", + "| Python code generation | MBPP+ | Liu et al. (2023a) |\n", + "| Data science code generation | DS1000 | Lai et al. (2023) |\n", + "| Repository-level code generation | RepoBench | Liu et al. (2023b) |\n", + "| Repository-level code generation | CrossCodeEval | Ding et al. (2023) |\n", + "| Fill-in-the-middle code completion | SantaCoder-FIM | Allal et al. (2023) |\n", + "| Multilingual code explanation | HumanEvalExplain | Muennighoff et al. (2023) |\n", + "| Multilingual code fixing | HumanEvalFix | Muennighoff et al. (2023) |\n", + "| Code editing | CanItEdit | Cassano et al. (2024) |\n", + "| Code translation | CodeLingua | Pan et al. (2024) |\n", + "| Code execution | CruxEval | Gu et al. (2024) |\n", + "| Math reasoning | MATH | Hendrycks et al. (2021) |\n", + "| Math reasoning | GSM8K | Cobbe et al. (2021) |\n", + "| Math reasoning | SAT | Azerbayev et al. (2023) |\n", + "| Math reasoning | OCW | Lewkowycz et al. (2022) |\n", + "| Function calling | BFCL | Yan et al. (2024) |\n", + "| Model robustness | ReCode | Wang et al. (2022) |\n", + "\n", + "- · Math Datasets : MathInstruct 7 (Yue et al., 2023) and MetaMathQA (Yu et al., 2023);\n", + "- · Code Instruction Datasets : Glaive-Code-Assistant-v3 8 , Self-OSS-Instruct-SC2 9 , Glaive-Function-Calling-v2 10 , NL2SQL 11 and few synthetically generated API calling datasets (Basu et al., 2024);\n", + "- · Language Instruction Datasets : High-quality datasets like HelpSteer (Wang et al., 2023), an open license-filtered version of Platypus 12 (Lee et al., 2023) including a collection of hardcoded prompts to ensure model generates correct outputs given inquiries about its name or developers.\n", + "\n", + "For training, we use a cosine scheduler with 250 warmup steps, an initial learning rate 10 - 5 , and train for three epochs. 
Further, we add random, uniform noise with a magnitude of 5 √ Nh , where N is the sequence length and h is the embedding dimension, to the embedding vector, as proposed by Jain et al.. The additional noise improved overall answer quality of the instruction model. We use FlashAttention 2 (Dao, 2023; Dao et al., 2022) with a Padding-Free Transformer 13 implementation to reduce GPU memory usage and redundant FLOPs during finetuning. We also use full activation checkpointing (Korthikanti et al., 2023), which allows us to finetune our Granite-20B-Code models with 8K context length within a single node within a few hours on 8 × A100 GPUs.\n", + "\n", + "------ result 2 --------\n", + "search score: 0.477556437253952\n", + "filename: granite.pdf\n", + "text:\n", + " ## Granite Code Models: A Family of Open Foundation Models for Code Intelligence\n", + "\n", + "Mayank Mishra ⋆ Matt Stallone ⋆ Gaoyuan Zhang ⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri † Rameswar Panda †\n", + "\n", + "IBM Research ⋆ Equal Contribution † Corresponding Authors ruchir@us.ibm.com, rpanda@ibm.com\n", + "\n", + "------ result 3 --------\n", + "search score: 0.45931386947631836\n", + "filename: granite.pdf\n", + "text:\n", + " ## 4.1 Two Phase Training\n", + "\n", + "Granite Code models are trained on 3.5T to 4.5T tokens of code data and natural language datasets related to code. Data is tokenized via byte pair encoding (BPE, (Sennrich et al., 2015)), employing the same tokenizer as StarCoder (Li et al., 2023a). Following (Shen et al., 2024; Hu et al., 2024), we utilize high-quality data with two phases of training as follows.\n", + "\n", + "- · Phase 1 (code only training) : During phase 1, both 3B and 8B models are trained for 4 trillion tokens of code data comprising 116 languages. The 20B parameter model is trained on 3 trillion tokens of code. The 34B model is trained on 1.4T tokens after the depth upscaling which is done on the 1.6T checkpoint of 20B model.\n", + "- · Phase 2 (code + language training) : In phase 2, we include additional high-quality publicly available data from various domains, including technical, mathematics, and web documents, to further improve the model's performance in reasoning and problem solving skills, which are essential for code generation. We train all our models for 500B tokens (80% code and 20% language data) in phase 2 training.\n", + "\n", + "------ result 4 --------\n", + "search score: 0.43072062730789185\n", + "filename: granite.pdf\n", + "text:\n", + " ## 6.1.6 FIM: Infilling Evaluations\n", + "\n", + "Granite Code models are trained for code completion purposes using FIM objective, as described in Sec. 4.2. We use SantaCoder-FIM benchmark (Allal et al., 2023), for infilling evaluations which tests the ability of models to fill in a single line of code in Python, JavaScript, and Java solutions to HumanEval. We use greedy decoding and report the mean exact match for all the models. 
Table 9 shows that Granite Code models significantly outperforms StarCoder and StarCoder2 across all model sizes, demonstrating it to be\n", + "\n", + "Figure 3: Performance of Granite-8B-Code-Instruct, Mistral-7B-Instruct-v0.2, Gemma-7B-IT, and Llama-3-8B-Instruct on HumanEvalPack. Best viewed in color.\n", + "\n", + "\n", + "\n", + "excellent well-rounded models for code completion use cases. Moreover, we observe no performance improvement in scaling the model sizes from 8B to 34B, indicating that smaller models are often more suitable for FIM code completion tasks.\n", + "\n", + "------ result 5 --------\n", + "search score: 0.41963690519332886\n", + "filename: attention.pdf\n", + "text:\n", + " ## 5 Training\n", + "\n", + "This section describes the training regime for our models.\n", + "\n" + ] + } + ], + "source": [ + "query = \"What was the training data used to train Granite models?\"\n", + "\n", + "results = do_vector_search (query)\n", + "print_search_results(results)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "num results : 5\n", + "------ result 1 --------\n", + "search score: 0.6020913124084473\n", + "filename: attention.pdf\n", + "text:\n", + " ## 3.2 Attention\n", + "\n", + "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n", + "\n", + "Scaled Dot-Product Attention\n", + "\n", + "\n", + "\n", + "Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.\n", + "\n", + "\n", + "\n", + "of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.\n", + "\n", + "------ result 2 --------\n", + "search score: 0.5734226703643799\n", + "filename: attention.pdf\n", + "text:\n", + " ## Attention Visualizations Input-Input Layer5\n", + "\n", + "Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.\n", + "\n", + "\n", + "\n", + "Input-Input Layer5\n", + "\n", + "Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.\n", + "\n", + "\n", + "\n", + "Input-Input Layer5\n", + "\n", + "Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. 
The heads clearly learned to perform different tasks.\n", + "\n", + "\n", + "\n", + "------ result 3 --------\n", + "search score: 0.4910251796245575\n", + "filename: attention.pdf\n", + "text:\n", + " ## 3.2.3 Applications of Attention in our Model\n", + "\n", + "The Transformer uses multi-head attention in three different ways:\n", + "\n", + "- · In \"encoder-decoder attention\" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].\n", + "- · The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.\n", + "- · Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to -∞ ) all values in the input of the softmax which correspond to illegal connections. See Figure 2.\n", + "\n", + "------ result 4 --------\n", + "search score: 0.44283992052078247\n", + "filename: attention.pdf\n", + "text:\n", + " ## 2 Background\n", + "\n", + "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.\n", + "\n", + "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].\n", + "\n", + "End-to-end memory networks are based on a recurrent attention mechanism instead of sequencealigned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].\n", + "\n", + "To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. 
In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].\n", + "\n", + "------ result 5 --------\n", + "search score: 0.42441824078559875\n", + "filename: attention.pdf\n", + "text:\n", + " ## 3.2.2 Multi-Head Attention\n", + "\n", + "Instead of performing a single attention function with d model -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d k , d k and d v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d v -dimensional\n", + "\n", + "output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.\n", + "\n", + "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.\n", + "\n", + "MultiHead( Q,K,V ) = Concat(head 1 ,..., head h ) W O where head i = Attention( QW Q i ,KW K i ,VW V i )\n", + "\n", + "Where the projections are parameter matrices W Q i ∈ R d model × d k , W K i ∈ R d model × d k , W V i ∈ R d model × d v and W O ∈ R hd v × d model .\n", + "\n", + "In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d k = d v = d model /h = 64 . Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.\n", + "\n" + ] + } + ], + "source": [ + "query = \"What is the attention mechanism?\"\n", + "\n", + "results = do_vector_search (query)\n", + "print_search_results(results)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# milvus_client.close()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dpk-1-rag-pdf-r1.0.0.a4-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/notebooks/rag-pdf-1/rag_4_query_replicate.ipynb b/examples/notebooks/rag-pdf-1/rag_4_query_replicate.ipynb new file mode 100644 index 0000000000..38420b6beb --- /dev/null +++ b/examples/notebooks/rag-pdf-1/rag_4_query_replicate.ipynb @@ -0,0 +1,559 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Query Data using LLM\n", + "\n", + "Here is the overall RAG pipeline. 
In this notebook, we will do steps (6) through (10):\n",
+    "- Loading the data into Milvus was already done in the notebook [rag_2_load_data_into_milvus.ipynb](rag_2_load_data_into_milvus.ipynb)\n",
+    "- 👉 Step 6: Calculate the embedding for the user query\n",
+    "- 👉 Steps 7 & 8: Send the query to the vector db to retrieve relevant documents\n",
+    "- 👉 Steps 9 & 10: Send the query and the relevant documents (returned in the step above) to the LLM and get an answer to our query\n",
+    "\n",
+    "![image missing](media/rag-overview-2.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-1: Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from my_config import MY_CONFIG"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-2: Load .env file\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ config REPLICATE_API_TOKEN found\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os, sys\n",
+    "\n",
+    "## Load settings from the .env file\n",
+    "from dotenv import find_dotenv, dotenv_values\n",
+    "\n",
+    "# _ = load_dotenv(find_dotenv()) # read local .env file\n",
+    "config = dotenv_values(find_dotenv())\n",
+    "\n",
+    "# debug\n",
+    "# print (config)\n",
+    "\n",
+    "MY_CONFIG.REPLICATE_API_TOKEN = config.get('REPLICATE_API_TOKEN')\n",
+    "\n",
+    "if MY_CONFIG.REPLICATE_API_TOKEN:\n",
+    "    print (\"✅ config REPLICATE_API_TOKEN found\")\n",
+    "else:\n",
+    "    raise Exception (\"❌ REPLICATE_API_TOKEN is not set. Please set it in your .env file to continue...\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-3: Connect to Vector Database\n",
+    "\n",
+    "Milvus can run embedded, which makes it easy to use.\n",
+    "\n",
+    "Note: If you encounter an error about being unable to load the database, try this:\n",
+    "\n",
+    "- In **vscode**: **restart the kernel** of the previous notebook. This will release the db.lock\n",
+    "- In **Jupyter**: Do `File --> Close and Shutdown Notebook` on the previous notebook. This will release the db.lock\n",
+    "- Re-run this cell\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Connected to Milvus instance: ./rag_1_dpk.db\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pymilvus import MilvusClient\n",
+    "\n",
+    "milvus_client = MilvusClient(MY_CONFIG.DB_URI)\n",
+    "\n",
+    "print (\"✅ Connected to Milvus instance:\", MY_CONFIG.DB_URI)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-4: Setup Embeddings\n",
+    "\n",
+    "Use the same embedding model we used to index our documents!"
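Aside: the vector search later in this notebook ranks hits with Milvus's `IP` (inner product) metric. That behaves like cosine similarity only because the embeddings are encoded with `normalize_embeddings=True`, which makes every vector unit-length. Below is a minimal sketch to verify the equivalence; the model name is an assumption (the notebook reads it from `MY_CONFIG.EMBEDDING_MODEL`, and the 384-dimensional output seen in the test cell matches MiniLM):

```python
# Sketch: for unit-length vectors, inner product == cosine similarity,
# so the "IP" metric ranks results exactly as cosine similarity would.
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model; substitute whatever MY_CONFIG.EMBEDDING_MODEL holds.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

a = model.encode("What is the attention mechanism?", normalize_embeddings=True)
b = model.encode("Multi-head attention in the Transformer", normalize_embeddings=True)

ip = float(np.dot(a, b))                            # inner product
cos = ip / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
print(f"ip={ip:.6f} cos={cos:.6f}")                 # equal up to float rounding
```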
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sentence_transformers import SentenceTransformer\n",
+    "\n",
+    "model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)\n",
+    "\n",
+    "def get_embeddings (text):\n",
+    "    embeddings = model.encode(text, normalize_embeddings=True)\n",
+    "    return embeddings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "embeddings len = 384\n",
+      "embeddings[:5] =  [ 0.02468892  0.10352131  0.0275264  -0.08551715 -0.01412829]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Test embeddings\n",
+    "embeddings = get_embeddings('Paris 2024 Olympics')\n",
+    "print ('embeddings len =', len(embeddings))\n",
+    "print ('embeddings[:5] = ', embeddings[:5])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-5: Vector Search and RAG"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get relevant documents using vector / semantic search\n",
+    "\n",
+    "def fetch_relevant_documents (query : str) :\n",
+    "    search_res = milvus_client.search(\n",
+    "        collection_name=MY_CONFIG.COLLECTION_NAME,\n",
+    "        data = [get_embeddings(query)],  # convert the question to an embedding vector\n",
+    "        limit=3,  # Return top 3 results\n",
+    "        search_params={\"metric_type\": \"IP\", \"params\": {}},  # Inner product distance\n",
+    "        output_fields=[\"text\"],  # Return the text field\n",
+    "    )\n",
+    "    # print (search_res)\n",
+    "\n",
+    "    retrieved_docs_with_distances = [\n",
+    "        {'text': res[\"entity\"][\"text\"], 'distance' : res[\"distance\"]} for res in search_res[0]\n",
+    "    ]\n",
+    "    return retrieved_docs_with_distances\n",
+    "## --- end ---\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[   {   'distance': 0.5530709028244019,\n",
+      "        'text': '## 5 Instruction Tuning\\n'\n",
+      "                '\\n'\n",
+      "                'Finetuning code LLMs on a variety of tasks explained via '\n",
+      "                'instructions has been shown to improve model usability and '\n",
+      "                'general performance. While there has been much progress in '\n",
+      "                'code instruction tuning, most of them adopt synthetically '\n",
+      "                'generated data from OpenAI models, which limits the model use '\n",
+      "                'in many enterprise applications. Thus, following OctoCoder '\n",
+      "                '(Muennighoff et al., 2023), we use only a combination of '\n",
+      "                'permissively licensed data, with an aim to enhance '\n",
+      "                'instruction following capabilities of our models, including '\n",
+      "                'logical reasoning and problem-solving skills. Specifically, '\n",
+      "                'Granite Code Instruct models are trained on the following '\n",
+      "                'types of data.\\n'\n",
+      "                '\\n'\n",
+      "                '- · Code Commits Dataset : CommitPackFT (Muennighoff et al., '\n",
+      "                '2023), a filtered version of full CommitPack dataset across 92 '\n",
+      "                'programming languages 6 ;\\n'\n",
+      "                '\\n'\n",
+      "                'Table 2: Summary of evaluation tasks.\\n'\n",
+      "                '\\n'\n",
+      "                '| Task | Benchmark | '\n",
+      "                'Reference |\\n'\n",
+      "                '|------------------------------------|---------------------|---------------------------|\\n'\n",
+      "                '| Multilingual code generation | HumanEvalSynthesize | '\n",
+      "                'Muennighoff et al. (2023) |\\n'\n",
+      "                '| Multilingual code generation | MultiPL-E | '\n",
+      "                'Cassano et al. 
(2023) |\\n'\n", + " '| Python code generation | MBPP | '\n", + " 'Austin et al. (2021) |\\n'\n", + " '| Python code generation | MBPP+ | '\n", + " 'Liu et al. (2023a) |\\n'\n", + " '| Data science code generation | DS1000 | '\n", + " 'Lai et al. (2023) |\\n'\n", + " '| Repository-level code generation | RepoBench | '\n", + " 'Liu et al. (2023b) |\\n'\n", + " '| Repository-level code generation | CrossCodeEval | '\n", + " 'Ding et al. (2023) |\\n'\n", + " '| Fill-in-the-middle code completion | SantaCoder-FIM | '\n", + " 'Allal et al. (2023) |\\n'\n", + " '| Multilingual code explanation | HumanEvalExplain | '\n", + " 'Muennighoff et al. (2023) |\\n'\n", + " '| Multilingual code fixing | HumanEvalFix | '\n", + " 'Muennighoff et al. (2023) |\\n'\n", + " '| Code editing | CanItEdit | '\n", + " 'Cassano et al. (2024) |\\n'\n", + " '| Code translation | CodeLingua | '\n", + " 'Pan et al. (2024) |\\n'\n", + " '| Code execution | CruxEval | '\n", + " 'Gu et al. (2024) |\\n'\n", + " '| Math reasoning | MATH | '\n", + " 'Hendrycks et al. (2021) |\\n'\n", + " '| Math reasoning | GSM8K | '\n", + " 'Cobbe et al. (2021) |\\n'\n", + " '| Math reasoning | SAT | '\n", + " 'Azerbayev et al. (2023) |\\n'\n", + " '| Math reasoning | OCW | '\n", + " 'Lewkowycz et al. (2022) |\\n'\n", + " '| Function calling | BFCL | '\n", + " 'Yan et al. (2024) |\\n'\n", + " '| Model robustness | ReCode | '\n", + " 'Wang et al. (2022) |\\n'\n", + " '\\n'\n", + " '- · Math Datasets : MathInstruct 7 (Yue et al., 2023) and '\n", + " 'MetaMathQA (Yu et al., 2023);\\n'\n", + " '- · Code Instruction Datasets : Glaive-Code-Assistant-v3 8 , '\n", + " 'Self-OSS-Instruct-SC2 9 , Glaive-Function-Calling-v2 10 , '\n", + " 'NL2SQL 11 and few synthetically generated API calling '\n", + " 'datasets (Basu et al., 2024);\\n'\n", + " '- · Language Instruction Datasets : High-quality datasets '\n", + " 'like HelpSteer (Wang et al., 2023), an open license-filtered '\n", + " 'version of Platypus 12 (Lee et al., 2023) including a '\n", + " 'collection of hardcoded prompts to ensure model generates '\n", + " 'correct outputs given inquiries about its name or '\n", + " 'developers.\\n'\n", + " '\\n'\n", + " 'For training, we use a cosine scheduler with 250 warmup '\n", + " 'steps, an initial learning rate 10 - 5 , and train for three '\n", + " 'epochs. Further, we add random, uniform noise with a '\n", + " 'magnitude of 5 √ Nh , where N is the sequence length and h is '\n", + " 'the embedding dimension, to the embedding vector, as proposed '\n", + " 'by Jain et al.. The additional noise improved overall answer '\n", + " 'quality of the instruction model. We use FlashAttention 2 '\n", + " '(Dao, 2023; Dao et al., 2022) with a Padding-Free Transformer '\n", + " '13 implementation to reduce GPU memory usage and redundant '\n", + " 'FLOPs during finetuning. 
We also use full activation '\n", + " 'checkpointing (Korthikanti et al., 2023), which allows us to '\n", + " 'finetune our Granite-20B-Code models with 8K context length '\n", + " 'within a single node within a few hours on 8 × A100 GPUs.'},\n", + " { 'distance': 0.477556437253952,\n", + " 'text': '## Granite Code Models: A Family of Open Foundation Models '\n", + " 'for Code Intelligence\\n'\n", + " '\\n'\n", + " 'Mayank Mishra ⋆ Matt Stallone ⋆ Gaoyuan Zhang ⋆ Yikang Shen '\n", + " 'Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran '\n", + " 'Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong '\n", + " 'Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman '\n", + " 'Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris '\n", + " 'Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu '\n", + " 'Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel '\n", + " 'Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell '\n", + " 'Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen '\n", + " 'Seetharami Seelam Brian Belgodere Carlos Fonseca Amith '\n", + " 'Singhee Nirmit Desai David D. Cox Ruchir Puri † Rameswar '\n", + " 'Panda †\\n'\n", + " '\\n'\n", + " 'IBM Research ⋆ Equal Contribution † Corresponding Authors '\n", + " 'ruchir@us.ibm.com, rpanda@ibm.com'},\n", + " { 'distance': 0.45931386947631836,\n", + " 'text': '## 4.1 Two Phase Training\\n'\n", + " '\\n'\n", + " 'Granite Code models are trained on 3.5T to 4.5T tokens of '\n", + " 'code data and natural language datasets related to code. Data '\n", + " 'is tokenized via byte pair encoding (BPE, (Sennrich et al., '\n", + " '2015)), employing the same tokenizer as StarCoder (Li et al., '\n", + " '2023a). Following (Shen et al., 2024; Hu et al., 2024), we '\n", + " 'utilize high-quality data with two phases of training as '\n", + " 'follows.\\n'\n", + " '\\n'\n", + " '- · Phase 1 (code only training) : During phase 1, both 3B '\n", + " 'and 8B models are trained for 4 trillion tokens of code data '\n", + " 'comprising 116 languages. The 20B parameter model is trained '\n", + " 'on 3 trillion tokens of code. The 34B model is trained on '\n", + " '1.4T tokens after the depth upscaling which is done on the '\n", + " '1.6T checkpoint of 20B model.\\n'\n", + " '- · Phase 2 (code + language training) : In phase 2, we '\n", + " 'include additional high-quality publicly available data from '\n", + " 'various domains, including technical, mathematics, and web '\n", + " \"documents, to further improve the model's performance in \"\n", + " 'reasoning and problem solving skills, which are essential for '\n", + " 'code generation. 
We train all our models for 500B tokens (80% '\n",
+      "                'code and 20% language data) in phase 2 training.'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# test relevant vector search\n",
+    "import json\n",
+    "import pprint\n",
+    "\n",
+    "question = \"What was the training data used to train Granite models?\"\n",
+    "relevant_docs = fetch_relevant_documents(question)\n",
+    "pprint.pprint(relevant_docs, indent=4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-6: Initialize LLM\n",
+    "\n",
+    "### LLM Choices at Replicate\n",
+    "\n",
+    "| Model | Publisher | Params | Description |\n",
+    "|-------------------------------------|-----------|--------|------------------------------------------------------|\n",
+    "| ibm-granite/granite-3.0-8b-instruct | IBM | 8 B | IBM's newest Granite Model v3.0 (default) |\n",
+    "| ibm-granite/granite-3.0-2b-instruct | IBM | 2 B | IBM's newest Granite Model v3.0 |\n",
+    "| meta/meta-llama-3.1-405b-instruct | Meta | 405 B | Meta's flagship 405 billion parameter language model |\n",
+    "| meta/meta-llama-3-8b-instruct | Meta | 8 B | Meta's 8 billion parameter language model |\n",
+    "| meta/meta-llama-3-70b-instruct | Meta | 70 B | Meta's 70 billion parameter language model |\n",
+    "\n",
+    "References\n",
+    "\n",
+    "- https://www.ibm.com/granite\n",
+    "- https://www.llama.com/\n",
+    "- https://replicate.com/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Using model: ibm-granite/granite-3.0-8b-instruct\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "os.environ[\"REPLICATE_API_TOKEN\"] = MY_CONFIG.REPLICATE_API_TOKEN\n",
+    "\n",
+    "print ('Using model:', MY_CONFIG.LLM_MODEL)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import replicate\n",
+    "\n",
+    "def ask_LLM (question, relevant_docs):\n",
+    "    context = \"\\n\".join(\n",
+    "        [doc['text'] for doc in relevant_docs]\n",
+    "    )\n",
+    "\n",
+    "    max_new_tokens = 1024\n",
+    "\n",
+    "    ## Truncate the context so we don't overshoot the context window\n",
+    "    context = context[:(MY_CONFIG.MAX_CONTEXT_WINDOW - max_new_tokens - 100)]\n",
+    "    # print (\"context length:\", len(context))\n",
+    "    # print ('============ context (this is the context supplied to LLM) ============')\n",
+    "    # print (context)\n",
+    "    # print ('============ end context ============', flush=True)\n",
+    "\n",
+    "    system_prompt = \"\"\"\n",
+    "    Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n",
+    "    \"\"\"\n",
+    "    user_prompt = f\"\"\"\n",
+    "    Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n",
+    "    <context>\n",
+    "    {context}\n",
+    "    </context>\n",
+    "    <question>\n",
+    "    {question}\n",
+    "    </question>\n",
+    "    \"\"\"\n",
+    "    # print (\"user_prompt length:\", len(user_prompt))\n",
+    "\n",
+    "    print ('============ here is the answer from LLM =====')\n",
+    "    # The model can stream output as it's running.\n",
+    "    for event in replicate.stream(\n",
+    "        MY_CONFIG.LLM_MODEL,\n",
+    "        input={\n",
+    "            \"top_k\": 1,\n",
+    "            \"top_p\": 0.95,\n",
+    "            \"prompt\": user_prompt,\n",
+    "            # \"max_tokens\": MY_CONFIG.MAX_CONTEXT_WINDOW,\n",
+    "            \"temperature\": 0.1,\n",
+    "            \"system_prompt\": system_prompt,\n",
+    "            \"length_penalty\": 1,\n",
+    "            \"max_new_tokens\": max_new_tokens,\n",
+    "            \"stop_sequences\": \"<|end_of_text|>,<|eot_id|>\",\n",
+    "            \"prompt_template\": \"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\\n\\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n\",\n",
+    "            \"presence_penalty\": 0,\n",
+    "            \"log_performance_metrics\": False\n",
+    "        },\n",
+    "    ):\n",
+    "        print(str(event), end=\"\")\n",
+    "    ## ---\n",
+    "    print ('\\n====== end LLM answer ======\\n', flush=True)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step-7: Query"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "============ here is the answer from LLM =====\n",
+      "The Granite Code Instruct models were trained on a combination of permissively licensed data, including the Code Commits Dataset (CommitPackFT) and Math Datasets (MathInstruct and MetaMathQA). Additionally, they were trained on Code Instruction Datasets such as Glaive-Code-Assistant-v3, Self-OSS-Instruct-SC2, Glaive-Function-Calling-v2, and NL2SQL.\n",
+      "====== end LLM answer ======\n",
+      "\n",
+      "CPU times: user 78.4 ms, sys: 12.3 ms, total: 90.6 ms\n",
+      "Wall time: 3.04 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "\n",
+    "question = \"What was the training data used to train Granite models?\"\n",
+    "relevant_docs = fetch_relevant_documents(question)\n",
+    "ask_LLM(question=question, relevant_docs=relevant_docs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "============ here is the answer from LLM =====\n",
+      "The attention mechanism is a method that allows a model to focus on specific parts of the input when producing an output. It maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values, and the weight assigned to each value is determined by a compatibility function of the query with the corresponding key. 
In the context of the Transformer model, attention is used in three ways: encoder-decoder attention layers, self-attention layers in the encoder, and self-attention layers in the decoder.\n", + "====== end LLM answer ======\n", + "\n", + "CPU times: user 43 ms, sys: 13.7 ms, total: 56.7 ms\n", + "Wall time: 1.22 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "question = \"What is attention mechanism?\"\n", + "relevant_docs = fetch_relevant_documents(question)\n", + "ask_LLM(question=question, relevant_docs=relevant_docs)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============ here is the answer from LLM =====\n", + "I'm sorry, the provided context does not contain information about the moon landing.\n", + "====== end LLM answer ======\n", + "\n", + "CPU times: user 29 ms, sys: 7.71 ms, total: 36.7 ms\n", + "Wall time: 1.07 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "question = \"When was the moon landing?\"\n", + "relevant_docs = fetch_relevant_documents(question)\n", + "ask_LLM(question=question, relevant_docs=relevant_docs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "dpk-1-rag-pdf-r1.0.0.a4-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/notebooks/rag/rag_2A_llamaindex_process.ipynb b/examples/notebooks/rag-pdf-1/rag_llamaindex_1_process.ipynb similarity index 100% rename from examples/notebooks/rag/rag_2A_llamaindex_process.ipynb rename to examples/notebooks/rag-pdf-1/rag_llamaindex_1_process.ipynb diff --git a/examples/notebooks/rag/rag_2B_llamaindex_query.ipynb b/examples/notebooks/rag-pdf-1/rag_llamaindex_2_query.ipynb similarity index 100% rename from examples/notebooks/rag/rag_2B_llamaindex_query.ipynb rename to examples/notebooks/rag-pdf-1/rag_llamaindex_2_query.ipynb diff --git a/examples/notebooks/rag/requirements.txt b/examples/notebooks/rag-pdf-1/requirements.txt similarity index 69% rename from examples/notebooks/rag/requirements.txt rename to examples/notebooks/rag-pdf-1/requirements.txt index 1c5c4f00c4..894183bea8 100644 --- a/examples/notebooks/rag/requirements.txt +++ b/examples/notebooks/rag-pdf-1/requirements.txt @@ -1,10 +1,5 @@ ## Data prep kit - -data-prep-toolkit-transforms==0.2.1 -data-prep-toolkit-transforms-ray==0.2.1 - -deepsearch-toolkit - +data-prep-toolkit-transforms[ray,all]==1.0.0 # Milvus pymilvus @@ -13,14 +8,6 @@ pymilvus[model] # datasets datasets -## Torch and enbeddings -torch -sentence-transformers - -## --- Parquet -pandas -pyarrow - ## --- Replicate replicate @@ -40,7 +27,7 @@ llama-index-vector-stores-milvus # --- Utils -python-dotenv==1.0.0 +python-dotenv humanfriendly ## --- Jupyter Utils diff --git a/examples/notebooks/rag/setup-python-dev-env.md b/examples/notebooks/rag-pdf-1/setup-python-dev-env.md similarity index 90% rename from examples/notebooks/rag/setup-python-dev-env.md rename to examples/notebooks/rag-pdf-1/setup-python-dev-env.md index b007c4b4b6..005b0ebccf 100644 --- a/examples/notebooks/rag/setup-python-dev-env.md +++ 
b/examples/notebooks/rag-pdf-1/setup-python-dev-env.md
@@ -17,16 +17,16 @@ We will create an environment for this workshop with all the required libraries

### A-1: Setup a conda env

```bash
-conda create -n data-prep-kit-1 -y python=3.11
+conda create -n data-prep-kit-rag -y python=3.11
```

activate the new conda environment

```bash
-conda activate data-prep-kit-1
+conda activate data-prep-kit-rag
```

-Make sure env is swithced to data-prep-kit-1
+Make sure env is switched to data-prep-kit-rag

Check python version

@@ -39,15 +39,16 @@ should say : 3.11

**Note**: If you are on a linux system install these too

```bash
-conda install gcc_linux-64
+conda install -y gcc_linux-64

-conda install gxx_linux-64
+conda install -y gxx_linux-64
```

### A-2: Install dependencies

```bash
-cd examples/notebooks/rag
+cd examples/notebooks/rag-pdf-1
```

```bash
diff --git a/examples/notebooks/rag/utils.py b/examples/notebooks/rag-pdf-1/utils.py
similarity index 100%
rename from examples/notebooks/rag/utils.py
rename to examples/notebooks/rag-pdf-1/utils.py
diff --git a/examples/notebooks/rag/media/rag-overview-2.png b/examples/notebooks/rag/media/rag-overview-2.png
deleted file mode 100644
index e7b5fa2e53..0000000000
Binary files a/examples/notebooks/rag/media/rag-overview-2.png and /dev/null differ
diff --git a/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb b/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
deleted file mode 100644
index ae8b0836d1..0000000000
--- a/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
+++ /dev/null
@@ -1,1775 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866",
-   "metadata": {},
-   "source": [
\n", - "

Data Processing for RAG with Data Prep Kit (Python)

\n", - " \n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": {}, - "source": [ - "## Before Running the notebook\n", - "\n", - "Please complete [setting up python dev environment](./setup-python-dev-env.md)" - ] - }, - { - "cell_type": "markdown", - "id": "053ecf08-5f62-4b99-9347-8a0955843d21", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "This notebook will process PDF documents as part of RAG pipeline\n", - "\n", - "![](media/rag-overview-2.png)\n", - "\n", - "This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n", - "\n", - "Here are the processing steps:\n", - "\n", - "- **pdf2parquet** : Extract text from PDF and convert them into parquet files\n", - "- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n", - "- **Doc_ID generation**: Each chunk is assigned a uniq id, based on content and hash\n", - "- **Exact Dedup**: Chunks with exact same content are filtered out\n", - "- **Text encoder**: Convert chunks into vectors using embedding models" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": {}, - "source": [ - "## Step-1: Configuration" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "33345487", - "metadata": {}, - "outputs": [], - "source": [ - "from my_config import MY_CONFIG" - ] - }, - { - "cell_type": "markdown", - "id": "facb3bbc", - "metadata": {}, - "source": [ - "## Step-2: Data\n", - "\n", - "We will use white papers about LLMs. \n", - "\n", - "- [Granite Code Models](https://arxiv.org/abs/2405.04324)\n", - "- [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", - "\n", - "You can of course substite your own data below" - ] - }, - { - "cell_type": "markdown", - "id": "f1fe7c0c", - "metadata": {}, - "source": [ - "### 2.1 - Download data" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "8739b7a2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Local file 'input/attension.pdf' (2.22 MB) already exists. Skipping download.\n", - "Local file 'input/granite.pdf' (1.27 MB) already exists. 
Skipping download.\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "from utils import download_file\n", - "\n", - "## Download the data files\n", - "shutil.os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)\n", - "\n", - "download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attension.pdf' ))\n", - "\n", - "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))\n" - ] - }, - { - "cell_type": "markdown", - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", - "metadata": {}, - "source": [ - "### 2.2 - Set input/output path variables for the pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Cleared output directory\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "\n", - "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", - " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", - "\n", - "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", - "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", - "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", - "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", - "\n", - "## clear output folder\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", - "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", - "\n", - "print (\"✅ Cleared output directory\")" - ] - }, - { - "cell_type": "markdown", - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", - "metadata": {}, - "source": [ - "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", - "\n", - "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", - "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", - "metadata": {}, - "source": [ - "### 3.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'\n" - ] - } - ], - "source": [ - "STAGE = 1 \n", - "\n", - "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", - "output_folder = output_parquet_dir\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", - "metadata": {}, - "source": [ - "### 3.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:23:40 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "00:23:40 INFO - pipeline id 
pipeline_id\n", - "00:23:40 INFO - code location None\n", - "00:23:40 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out\n", - "00:23:40 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:23:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "00:23:40 INFO - orchestrator pdf2parquet started at 2024-10-02 00:23:40\n", - "00:23:40 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}\n", - "00:23:40 INFO - Initializing models\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "bd58971a33d4410c91e742e735a6e6e3", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Fetching 10 files: 0%| | 0/10 [00:00\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "" - ], - "text/plain": [ - " filename contents \\\n", - "0 granite.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... \n", - "1 attension.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... \n", - "\n", - " num_pages num_tables num_doc_elements \\\n", - "0 28 17 348 \n", - "1 15 4 193 \n", - "\n", - " document_id ext \\\n", - "0 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 pdf \n", - "1 f275d75a-a072-4836-8a55-6a65f0d34577 pdf \n", - "\n", - " hash size \\\n", - "0 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "1 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "1 2024-10-02T00:24:14.713654 18.004455 attension.pdf " - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": {}, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "Split the documents in chunks, according to their layout segmentation." - ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": {}, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "305f00a3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": {}, - "source": [ - "### 4.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "5b7b18d5", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:24:50 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "00:24:50 INFO - pipeline id pipeline_id\n", - "00:24:50 INFO - code location None\n", - "00:24:50 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "00:24:50 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:24:50 INFO - orchestrator doc_chunk started at 2024-10-02 00:24:50\n", - "00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 
0.12735748291015625, 'min_file_size': 0.035338401794433594, 'total_file_size': 0.16269588470458984}\n", - "00:24:50 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "00:24:50 INFO - Completed 2 files (100.0%) in 0.004 min\n", - "00:24:50 INFO - Done processing 2 files, waiting for flush() completion.\n", - "00:24:50 INFO - done flushing in 0.0 sec\n", - "00:24:50 INFO - Completed execution in 0.004 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:2 completed successfully\n", - "CPU times: user 1.07 s, sys: 95.1 ms, total: 1.16 s\n", - "Wall time: 1.19 s\n" - ] - } - ], - "source": [ - "%%time \n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # doc_chunk arguments\n", - " # ...\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "213afdf6", - "metadata": {}, - "source": [ - "### 4.3 - Inspect Generated output\n", - "\n", - "We would see documents are split into many chunks" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "d8138d43", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Files processed : 2\n", - "Chunks created : 211\n", - "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (211, 16)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
87granite.pdf2817348pdf0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...6549892024-10-02T00:24:48.95961234.223920granite.pdf4a32ba4c-8fdb-4eeb-a06b-d28493efe8e36.3 Code Editing and Translation\\nTable 12: Pa...$.main-text[189]16[106.69820404, 190.24554443, 504.00320435, 211...f28d8c9a4fe81f0baf801daf9a95ddaf152a4ac5e8b8ac...
154attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345773.2.2 Multi-Head Attention\\nMulti-head attenti...$.main-text[55]5[107.46644592, 669.41210938, 503.99703979, 690...da79f02a5f19c2f07de7a6f1da9df8db00f01a477582ac...
67granite.pdf2817348pdf0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...6549892024-10-02T00:24:48.95961234.223920granite.pdf4a32ba4c-8fdb-4eeb-a06b-d28493efe8e36.1.5 RepoBench, CrossCodeEval: Repository-Lev...$.main-text[153]12[106.97065735, 224.31654358, 505.74191284, 290...cd5bd4537bde007298a91de7fa2fb4b56516d2f1d31262...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "87 granite.pdf 28 17 348 pdf \n", - "154 attension.pdf 15 4 193 pdf \n", - "67 granite.pdf 28 17 348 pdf \n", - "\n", - " hash size \\\n", - "87 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "154 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "67 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "87 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "154 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "67 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "\n", - " source_document_id \\\n", - "87 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 \n", - "154 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "67 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 \n", - "\n", - " contents doc_jsonpath \\\n", - "87 6.3 Code Editing and Translation\\nTable 12: Pa... $.main-text[189] \n", - "154 3.2.2 Multi-Head Attention\\nMulti-head attenti... $.main-text[55] \n", - "67 6.1.5 RepoBench, CrossCodeEval: Repository-Lev... $.main-text[153] \n", - "\n", - " page_number bbox \\\n", - "87 16 [106.69820404, 190.24554443, 504.00320435, 211... \n", - "154 5 [107.46644592, 669.41210938, 503.99703979, 690... \n", - "67 12 [106.97065735, 224.31654358, 505.74191284, 290... \n", - "\n", - " document_id \n", - "87 f28d8c9a4fe81f0baf801daf9a95ddaf152a4ac5e8b8ac... \n", - "154 da79f02a5f19c2f07de7a6f1da9df8db00f01a477582ac... \n", - "67 cd5bd4537bde007298a91de7fa2fb4b56516d2f1d31262... " - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "ece021fd", - "metadata": {}, - "source": [ - "## Step-5: DOC ID generation\n", - "\n", - "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set hash_column to the name of the column, where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set int_id_column to the name of the column, where you want to store it. **This is a pre-requisite for fuzzy dedup** in the pipeline." 
- ] - }, - { - "cell_type": "markdown", - "id": "e414c12c", - "metadata": {}, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "10251d3d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "0f312347", - "metadata": {}, - "source": [ - "### 5.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "a8b76a71", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:24:50 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "00:24:50 INFO - pipeline id pipeline_id\n", - "00:24:50 INFO - code location None\n", - "00:24:50 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "00:24:50 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:24:50 INFO - orchestrator doc_id started at 2024-10-02 00:24:50\n", - "00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06398963928222656, 'min_file_size': 0.028062820434570312, 'total_file_size': 0.09205245971679688}\n", - "00:24:50 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "00:24:50 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "00:24:50 INFO - Done processing 2 files, waiting for flush() completion.\n", - "00:24:50 INFO - done flushing in 0.0 sec\n", - "00:24:50 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 13.4 ms, sys: 4.83 ms, total: 18.3 ms\n", - "Wall time: 14.7 ms\n" - ] - } - ], - "source": [ - "%%time \n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # doc id configuration\n", - " \"doc_id_doc_column\": \"contents\",\n", - " \"doc_id_hash_column\": \"chunk_hash\",\n", - " \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "8c23338b", - "metadata": {}, - "source": [ - "### 5.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "ec23aa3a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (211, 16)\n", - "Output data dimensions (rows x columns)= (211, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
192attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345776.2 Model Variations\\nIn Table 3 rows (A), we ...$.main-text[118]9[107.27760315, 318.93438721, 505.24127197, 350...70948f748c6f275b39c70652e29d60dfd53c545e0d6d92...70948f748c6f275b39c70652e29d60dfd53c545e0d6d92...69
71granite.pdf2817348pdf0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...6549892024-10-02T00:24:48.95961234.223920granite.pdf4a32ba4c-8fdb-4eeb-a06b-d28493efe8e36.1.5 RepoBench, CrossCodeEval: Repository-Lev...$.tables[7]13[109.39778137, 486.89639282, 502.1010437, 679....b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52...b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52...159
196attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345776.3 English Constituency Parsing\\nWe performed...$.main-text[123]9[106.96768951, 69.592453, 504.24859619, 101.62...93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5...93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5...73
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "192 attension.pdf 15 4 193 pdf \n", - "71 granite.pdf 28 17 348 pdf \n", - "196 attension.pdf 15 4 193 pdf \n", - "\n", - " hash size \\\n", - "192 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "71 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "196 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "192 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "71 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "196 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "\n", - " source_document_id \\\n", - "192 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "71 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 \n", - "196 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "\n", - " contents doc_jsonpath \\\n", - "192 6.2 Model Variations\\nIn Table 3 rows (A), we ... $.main-text[118] \n", - "71 6.1.5 RepoBench, CrossCodeEval: Repository-Lev... $.tables[7] \n", - "196 6.3 English Constituency Parsing\\nWe performed... $.main-text[123] \n", - "\n", - " page_number bbox \\\n", - "192 9 [107.27760315, 318.93438721, 505.24127197, 350... \n", - "71 13 [109.39778137, 486.89639282, 502.1010437, 679.... \n", - "196 9 [106.96768951, 69.592453, 504.24859619, 101.62... \n", - "\n", - " document_id \\\n", - "192 70948f748c6f275b39c70652e29d60dfd53c545e0d6d92... \n", - "71 b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52... \n", - "196 93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5... \n", - "\n", - " chunk_hash chunk_id \n", - "192 70948f748c6f275b39c70652e29d60dfd53c545e0d6d92... 69 \n", - "71 b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52... 159 \n", - "196 93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5... 73 " - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", - "metadata": {}, - "source": [ - "## Step-6: Exact Dedup\n", - "\n", - "Remove documents having identical code to remove bias in the training data. On the content of each document, a SHA256 hash is computed,\n", - "followed by de-duplication of record having identical hashes." 
- ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": {}, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "4c7a1b94", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": {}, - "source": [ - "### 6.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:24:50 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "00:24:50 INFO - pipeline id pipeline_id\n", - "00:24:50 INFO - code location None\n", - "00:24:50 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "00:24:50 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:24:50 INFO - orchestrator ededup started at 2024-10-02 00:24:50\n", - "00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06945991516113281, 'min_file_size': 0.03227043151855469, 'total_file_size': 0.1017303466796875}\n", - "00:24:50 INFO - Starting from the beginning\n", - "00:24:50 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "00:24:50 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "00:24:50 INFO - Done processing 2 files, waiting for flush() completion.\n", - "00:24:50 INFO - done flushing in 0.0 sec\n", - "00:24:50 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 22.1 ms, sys: 5.79 ms, total: 27.9 ms\n", - "Wall time: 23.5 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "\n", - "# Import ededup transform configuration\n", - "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # ededup parameters\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - " \n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": {}, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d824ebf6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (211, 18)\n", - "Output data dimensions (rows x columns)= (211, 19)\n", - "Input chunks before exact dedupe : 211\n", - "Output chunks after exact dedupe : 211\n", - "Duplicate chunks removed : 0\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
194attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345776.3 English Constituency Parsing\\nTo evaluate ...$.main-text[121]9[107.15766144, 167.93530273, 504.10968018, 210...10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f...10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f...71[]
101granite.pdf2817348pdf0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...6549892024-10-02T00:24:48.95961234.223920granite.pdf4a32ba4c-8fdb-4eeb-a06b-d28493efe8e36.5 Math Reasoning\\nTable 15: Performance on 4...$.main-text[219]19[118.49487305, 699.65753174, 492.17700195, 710...c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852...c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852...189[]
206attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345777 Conclusion\\nAcknowledgements We are grateful...$.main-text[135]10[107.4437561, 212.26509094, 504.00241089, 232....855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4...855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4...83[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "194 attension.pdf 15 4 193 pdf \n", - "101 granite.pdf 28 17 348 pdf \n", - "206 attension.pdf 15 4 193 pdf \n", - "\n", - " hash size \\\n", - "194 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "101 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "206 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "194 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "101 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "206 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "\n", - " source_document_id \\\n", - "194 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "101 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 \n", - "206 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "\n", - " contents doc_jsonpath \\\n", - "194 6.3 English Constituency Parsing\\nTo evaluate ... $.main-text[121] \n", - "101 6.5 Math Reasoning\\nTable 15: Performance on 4... $.main-text[219] \n", - "206 7 Conclusion\\nAcknowledgements We are grateful... $.main-text[135] \n", - "\n", - " page_number bbox \\\n", - "194 9 [107.15766144, 167.93530273, 504.10968018, 210... \n", - "101 19 [118.49487305, 699.65753174, 492.17700195, 710... \n", - "206 10 [107.4437561, 212.26509094, 504.00241089, 232.... \n", - "\n", - " document_id \\\n", - "194 10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f... \n", - "101 c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852... \n", - "206 855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4... \n", - "\n", - " chunk_hash chunk_id removed \n", - "194 10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f... 71 [] \n", - "101 c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852... 189 [] \n", - "206 855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4... 83 [] " - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "85309751-8556-41c6-ac32-84acc941bc8d", - "metadata": {}, - "source": [ - "## Fuzzy Dedup\n", - "\n", - "**Fuzzy dedupe is currently available in RAY version only**\n", - "\n", - "So we will skip this here" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": {}, - "source": [ - "## Step-7: Text encoding\n", - "\n", - "Encode text for the vector storage." 
- ] - }, - { - "cell_type": "markdown", - "id": "74fd33b1", - "metadata": {}, - "source": [ - "### 7.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" - ] - } - ], - "source": [ - "STAGE = 5\n", - "\n", - "input_folder = output_exact_dedupe_dir\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "b9112479", - "metadata": {}, - "source": [ - "### 7.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:24:50 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "00:24:50 INFO - pipeline id pipeline_id\n", - "00:24:50 INFO - code location None\n", - "00:24:50 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", - "00:24:50 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:24:50 INFO - orchestrator text_encoder started at 2024-10-02 00:24:50\n", - "00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06981945037841797, 'min_file_size': 0.032629966735839844, 'total_file_size': 0.10244941711425781}\n", - "00:24:52 INFO - Completed 1 files (50.0%) in 0.008 min\n", - "00:24:53 INFO - Completed 2 files (100.0%) in 0.02 min\n", - "00:24:53 INFO - Done processing 2 files, waiting for flush() completion.\n", - "00:24:53 INFO - done flushing in 0.0 sec\n", - "00:24:53 INFO - Completed execution in 0.046 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:5 completed successfully\n", - "CPU times: user 1.78 s, sys: 103 ms, total: 1.88 s\n", - "Wall time: 3.09 s\n" - ] - } - ], - "source": [ - "%%time \n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_transform_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
- " # text_encoder\n",
- " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n",
- "}\n",
- "\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "# create launcher\n",
- "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n",
- "# launch the transform; this notebook uses the pure Python runtime, not Ray actors\n",
- "\n",
- "return_code = launcher.launch()\n",
- "\n",
- "if return_code == 0:\n",
- " print (f\"✅ Stage:{STAGE} completed successfully\")\n",
- "else:\n",
- " raise Exception (\"❌ Job failed\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b734852c",
- "metadata": {},
- "source": [
- "### 7.3 - Inspect Generated output"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "id": "7b1c1d09",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Input data dimensions (rows x columns)= (211, 19)\n",
- "Output data dimensions (rows x columns)= (211, 20)\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremovedembeddings
193attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d345776.2 Model Variations\\nIn Table 3 rows (B), we ...$.main-text[119]9[107.44257355, 248.49208069, 505.24127197, 312...6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522...6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522...70[][-0.0049973284, -0.10789071, 0.02143236, -0.02...
210attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:24:14.71365418.004455attension.pdff275d75a-a072-4836-8a55-6a65f0d34577Attention Visualizations Input-Input Layer5\\nF...$.main-text[190]15[107.43354034, 157.36341858, 504.06988525, 189...67626adb815bf2b27871df24d538ddc10ae68a3fbbd238...67626adb815bf2b27871df24d538ddc10ae68a3fbbd238...87[][0.01508544, -0.015680796, 0.039181348, 0.0084...
46granite.pdf2817348pdf0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...6549892024-10-02T00:24:48.95961234.223920granite.pdf4a32ba4c-8fdb-4eeb-a06b-d28493efe8e36.1.1 HumanEvalSynthesize: Multilingual Code G...$.main-text[117]9[107.46860504, 613.84277344, 456.97003174, 624...3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba...3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba...134[][-0.029933447, 0.031515192, -0.04598905, -0.01...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "193 attension.pdf 15 4 193 pdf \n", - "210 attension.pdf 15 4 193 pdf \n", - "46 granite.pdf 28 17 348 pdf \n", - "\n", - " hash size \\\n", - "193 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "210 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "46 0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587... 654989 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "193 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "210 2024-10-02T00:24:14.713654 18.004455 attension.pdf \n", - "46 2024-10-02T00:24:48.959612 34.223920 granite.pdf \n", - "\n", - " source_document_id \\\n", - "193 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "210 f275d75a-a072-4836-8a55-6a65f0d34577 \n", - "46 4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3 \n", - "\n", - " contents doc_jsonpath \\\n", - "193 6.2 Model Variations\\nIn Table 3 rows (B), we ... $.main-text[119] \n", - "210 Attention Visualizations Input-Input Layer5\\nF... $.main-text[190] \n", - "46 6.1.1 HumanEvalSynthesize: Multilingual Code G... $.main-text[117] \n", - "\n", - " page_number bbox \\\n", - "193 9 [107.44257355, 248.49208069, 505.24127197, 312... \n", - "210 15 [107.43354034, 157.36341858, 504.06988525, 189... \n", - "46 9 [107.46860504, 613.84277344, 456.97003174, 624... \n", - "\n", - " document_id \\\n", - "193 6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522... \n", - "210 67626adb815bf2b27871df24d538ddc10ae68a3fbbd238... \n", - "46 3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba... \n", - "\n", - " chunk_hash chunk_id removed \\\n", - "193 6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522... 70 [] \n", - "210 67626adb815bf2b27871df24d538ddc10ae68a3fbbd238... 87 [] \n", - "46 3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba... 134 [] \n", - "\n", - " embeddings \n", - "193 [-0.0049973284, -0.10789071, 0.02143236, -0.02... \n", - "210 [0.01508544, -0.015680796, 0.039181348, 0.0084... \n", - "46 [-0.029933447, 0.031515192, -0.04598905, -0.01... 
" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": {}, - "source": [ - "## Step-8: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" - ] - } - ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb b/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb deleted file mode 100644 index 8bdea1ff64..0000000000 --- a/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb +++ /dev/null @@ -1,2181 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", - "metadata": {}, - "source": [ - "
\n", - "

Data Processing for RAG with Data Prep Kit (RAY)

\n", - " \n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": {}, - "source": [ - "## Before Running the notebook\n", - "\n", - "Please complete [setting up python dev environment](./setup-python-dev-env.md)" - ] - }, - { - "cell_type": "markdown", - "id": "053ecf08-5f62-4b99-9347-8a0955843d21", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "This notebook will process PDF documents as part of RAG pipeline\n", - "\n", - "![](media/rag-overview-2.png)\n", - "\n", - "This notebook will perform steps 1, 2 and 3 in RAG pipeline.\n", - "\n", - "Here are the processing steps:\n", - "\n", - "- **pdf2parquet** : Extract text from PDF and convert them into parquet files\n", - "- **Chunk documents**: Split the PDFs into 'meaningful sections' (paragraphs, sentences ..etc)\n", - "- **Doc_ID generation**: Each chunk is assigned a uniq id, based on content and hash\n", - "- **Exact Dedup**: Chunks with exact same content are filtered out\n", - "- **Fuzzy Dedup**: Eliminate chunks that are 'very similar' content\n", - "- **Doc quality**: Scores the documents based on criteria like number of words, if it contains bad words ..etc\n", - "- **Text encoder**: Convert chunks into vectors using embedding models" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": {}, - "source": [ - "## Step-1: Configuration" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "33345487", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from my_config import MY_CONFIG\n", - "\n", - "## RAY CONFIGURATION\n", - "num_cpus_available = os.cpu_count()\n", - "# print (num_cpus_available)\n", - "# MY_CONFIG.RAY_NUM_CPUS = num_cpus_available // 2 ## use half the available cores for processing\n", - "MY_CONFIG.RAY_NUM_CPUS = 1\n", - "# print (MY_CONFIG.RAY_NUM_CPUS)\n", - "MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", - "# MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - "MY_CONFIG.RAY_RUNTIME_WORKERS = 2" - ] - }, - { - "cell_type": "markdown", - "id": "40c58856", - "metadata": {}, - "source": [ - "## Step-2: Data\n", - "\n", - "We will use white papers about LLMs. \n", - "\n", - "- [Granite Code Models](https://arxiv.org/abs/2405.04324)\n", - "- [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", - "\n", - "You can of course substite your own data below" - ] - }, - { - "cell_type": "markdown", - "id": "6bce5939", - "metadata": {}, - "source": [ - "### 2.1 - Download data" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "1bfde6eb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Local file 'input/attension.pdf' (2.22 MB) already exists. Skipping download.\n", - "Local file 'input/granite.pdf' (1.27 MB) already exists. 
Skipping download.\n"
- ]
- }
- ],
- "source": [
- "import os, sys\n",
- "import shutil\n",
- "from utils import download_file\n",
- "\n",
- "## Download the data files\n",
- "shutil.os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)\n",
- "\n",
- "download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attension.pdf' ))\n",
- "\n",
- "download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "72510ae6-48b0-4b88-9e13-a623281c3a63",
- "metadata": {},
- "source": [
- "### 2.2 - Set input/output path variables for the pipeline"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "60ac8bee-0960-4309-b225-d7a211b14262",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "✅ Cleared output directory\n"
- ]
- }
- ],
- "source": [
- "import os, sys\n",
- "import shutil\n",
- "\n",
- "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n",
- " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n",
- "\n",
- "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n",
- "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n",
- "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n",
- "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n",
- "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n",
- "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n",
- "\n",
- "\n",
- "## clear output folder\n",
- "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n",
- "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n",
- "\n",
- "print (\"✅ Cleared output directory\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb",
- "metadata": {},
- "source": [
- "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n",
- "\n",
- "This step reads the input folder containing all PDF files and ingests them into a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n",
- "The documents are converted into a JSON format, which allows them to be easily chunked in later steps.\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a",
- "metadata": {},
- "source": [
- "### 3.1 - Set Input/output Folder"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "482605b2-d814-456d-9195-49a2ec454ef0",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'\n"
- ]
- }
- ],
- "source": [
- "STAGE = 1 \n",
- "\n",
- "input_folder = MY_CONFIG.INPUT_DATA_DIR\n",
- "output_folder = output_parquet_dir\n",
- "\n",
- "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b",
- "metadata": {},
- "source": [
- "### 3.2 - Execute "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "00:25:24 INFO - pdf2parquet parameters are : {'artifacts_path': None, 
'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "00:25:24 INFO - pipeline id pipeline_id\n", - "00:25:24 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", - "00:25:24 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", - "00:25:24 INFO - actor creation delay 0\n", - "00:25:24 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "00:25:24 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out\n", - "00:25:24 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:25:24 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "00:25:24 INFO - Running locally\n", - "2024-10-02 00:25:26,362\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:25:29 INFO - orchestrator started at 2024-10-02 00:25:29\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:25:29 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:25:29 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.941529083997011, 'object_store': 2.470764541067183}\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:25:29 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:25:29 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(RayTransformFileProcessor pid=636524)\u001b[0m 00:25:32 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 129854.61it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=636524)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:28:23 INFO - Completed processing 2 files in 2.9 min\n", - "\u001b[36m(orchestrate pid=635641)\u001b[0m 00:28:23 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=636523)\u001b[0m 00:25:32 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 37650.84it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=636523)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n",
- "00:28:33 INFO - Completed execution in 3.158 min, execution result 0\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "✅ Stage:1 completed successfully\n",
- "CPU times: user 3.85 s, sys: 668 ms, total: 4.52 s\n",
- "Wall time: 3min 13s\n"
- ]
- }
- ],
- "source": [
- "%%time \n",
- "\n",
- "import ast\n",
- "import os\n",
- "import sys\n",
- "\n",
- "from data_processing_ray.runtime.ray import RayTransformLauncher\n",
- "from data_processing.utils import GB, ParamsUtils\n",
- "\n",
- "from pdf2parquet_transform import (\n",
- " pdf2parquet_contents_type_cli_param,\n",
- " pdf2parquet_contents_types,\n",
- ")\n",
- "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n",
- "from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration\n",
- "\n",
- "# create parameters\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS, \"memory\": MY_CONFIG.RAY_MEMORY_GB * GB}\n",
- "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
- "ingest_config = {\n",
- " pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n",
- "}\n",
- "\n",
- "params = {\n",
- " # where to run\n",
- " \"run_locally\": True,\n",
- " # Data access. Only required parameters are specified\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
- " \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n",
- " # orchestrator\n",
- " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
- " \"runtime_num_workers\": 1, # so model download and cleanup work properly\n",
- " \"runtime_pipeline_id\": \"pipeline_id\",\n",
- " \"runtime_job_id\": \"job_id\",\n",
- " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
- "}\n",
- "\n",
- "\n",
- "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))\n",
- "# create launcher\n",
- "launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())\n",
- "# launch\n",
- "return_code = launcher.launch()\n",
- "\n",
- "if return_code == 0:\n",
- " print (f\"✅ Stage:{STAGE} completed successfully\")\n",
- "else:\n",
- " raise Exception (\"❌ Ray job failed\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5ca790e0",
- "metadata": {},
- "source": [
- "### 3.3 - Inspect Generated output\n",
- "\n",
- "Here we should see one entry per input file processed"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "fe59563d",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Output dimensions (rows x columns)= (2, 12)\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "
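
As a quick sanity check outside the pipeline, the stage output can also be read directly with pandas. The sketch below is a hypothetical standalone equivalent of the notebook's local `read_parquet_files_as_df` helper (from its `utils` module), assuming pandas and pyarrow are installed:

```python
# Read every parquet file written by the pdf2parquet stage and confirm
# there is one row per input PDF. Hypothetical standalone equivalent of
# the local read_parquet_files_as_df helper used in the next cell.
import glob
import os

import pandas as pd

files = glob.glob(os.path.join("output", "01_parquet_out", "*.parquet"))
df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

print(df.shape)  # expect (2, 12): one row per input PDF
print(df[["filename", "num_pages", "num_tables"]])
```
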
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0granite.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...281734881bc331a-69cf-49bd-84b9-afedcab1344apdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf
1attension.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1541937afd3fbc-3a9f-4728-8fd8-4a9a13980244pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf
\n", - "
" - ], - "text/plain": [ - " filename contents \\\n", - "0 granite.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... \n", - "1 attension.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... \n", - "\n", - " num_pages num_tables num_doc_elements \\\n", - "0 28 17 348 \n", - "1 15 4 193 \n", - "\n", - " document_id ext \\\n", - "0 81bc331a-69cf-49bd-84b9-afedcab1344a pdf \n", - "1 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 pdf \n", - "\n", - " hash size \\\n", - "0 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "1 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "1 2024-10-02T00:26:29.888597 53.822026 attension.pdf " - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": {}, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "Split the documents in chunks, according to their layout segmentation." - ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": {}, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "305f00a3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": {}, - "source": [ - "### 4.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "5b7b18d5", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:28:36 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "00:28:36 INFO - pipeline id pipeline_id\n", - "00:28:36 INFO - code location None\n", - "00:28:36 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "00:28:36 INFO - actor creation delay 0\n", - "00:28:36 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - "00:28:36 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "00:28:36 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:28:36 INFO - data factory data_ Not using data sets, 
checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "00:28:36 INFO - Running locally\n",
- "2024-10-02 00:28:38,768\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:41 INFO - orchestrator started at 2024-10-02 00:28:41\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:41 INFO - Number of files is 2, source profile {'max_file_size': 0.12733078002929688, 'min_file_size': 0.035338401794433594, 'total_file_size': 0.16266918182373047}\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:41 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.939725494943559, 'object_store': 2.4698627470061183}\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:41 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:41 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:43 INFO - Completed processing 2 files in 0.033 min\n",
- "\u001b[36m(orchestrate pid=640134)\u001b[0m 00:28:43 INFO - done flushing in 0.001 sec\n",
- "00:28:53 INFO - Completed execution in 0.281 min, execution result 0\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "✅ Stage:2 completed successfully\n",
- "CPU times: user 992 ms, sys: 321 ms, total: 1.31 s\n",
- "Wall time: 19.6 s\n"
- ]
- }
- ],
- "source": [
- "%%time \n",
- "\n",
- "# Import doc_chunk transform configuration\n",
- "from doc_chunk_transform_ray import DocChunkRayTransformConfiguration\n",
- "\n",
- "\n",
- "# Prepare the commandline params\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n",
- "params = {\n",
- " # where to run\n",
- " \"run_locally\": True,\n",
- " # Data access. Only required parameters are specified\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
- " # orchestrator\n",
- " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
- " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n",
- " # doc_chunk arguments\n",
- " # ...\n",
- "}\n",
- "\n",
- "# Pass the commandline params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "\n",
- "# create launcher\n",
- "launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())\n",
- "# launch\n",
- "return_code = launcher.launch()\n",
- "\n",
- "if return_code == 0:\n",
- " print (f\"✅ Stage:{STAGE} completed successfully\")\n",
- "else:\n",
- " raise Exception (\"❌ Ray job failed\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "213afdf6",
- "metadata": {},
- "source": [
- "### 4.3 - Inspect Generated output\n",
- "\n",
- "We should see the documents split into many chunks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "d8138d43",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Files processed : 2\n",
- "Chunks created : 211\n",
- "Input data dimensions (rows x columns)= (2, 12)\n",
- "Output data dimensions (rows x columns)= (211, 16)\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "
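
To see how those 211 chunks are distributed across the two source PDFs, a small pandas aggregation is enough. A sketch, assuming `output_df` has been loaded exactly as in the inspection cell below:

```python
# Count chunks per source file and compute the average chunk length.
# Assumes output_df is the chunk-stage dataframe loaded via
# read_parquet_files_as_df(output_folder), as in the cell below.
per_file = output_df.groupby("filename").agg(
    chunks=("contents", "size"),
    avg_chars=("contents", lambda s: int(s.str.len().mean())),
)
print(per_file)
```
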
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
185attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802446.1 Machine Translation\\nOn the WMT 2014 Engli...$.main-text[108]8[107.27262115, 260.13467407, 505.24533081, 302...d6c1d3686219a176bc5ff0ebf4f5c82a53d95d1502d476...
94granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a6.3 Code Editing and Translation\\nFrom Table 1...$.main-text[199]17[107.33219147, 356.5696106, 505.74539185, 411....1c841522286ea1348acafd3a4cfbbffd327ca5de53c5f9...
175attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802445.1 Training Data and Batching\\nWe trained on ...$.main-text[91]7[107.12083435, 343.05245972, 505.65435791, 418...77de84b7743b8360a371146c12c9795a12984ef82354f4...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "185 attension.pdf 15 4 193 pdf \n", - "94 granite.pdf 28 17 348 pdf \n", - "175 attension.pdf 15 4 193 pdf \n", - "\n", - " hash size \\\n", - "185 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "94 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "175 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "185 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "94 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "175 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "\n", - " source_document_id \\\n", - "185 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "94 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "175 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "\n", - " contents doc_jsonpath \\\n", - "185 6.1 Machine Translation\\nOn the WMT 2014 Engli... $.main-text[108] \n", - "94 6.3 Code Editing and Translation\\nFrom Table 1... $.main-text[199] \n", - "175 5.1 Training Data and Batching\\nWe trained on ... $.main-text[91] \n", - "\n", - " page_number bbox \\\n", - "185 8 [107.27262115, 260.13467407, 505.24533081, 302... \n", - "94 17 [107.33219147, 356.5696106, 505.74539185, 411.... \n", - "175 7 [107.12083435, 343.05245972, 505.65435791, 418... \n", - "\n", - " document_id \n", - "185 d6c1d3686219a176bc5ff0ebf4f5c82a53d95d1502d476... \n", - "94 1c841522286ea1348acafd3a4cfbbffd327ca5de53c5f9... \n", - "175 77de84b7743b8360a371146c12c9795a12984ef82354f4... " - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "b8894d88", - "metadata": {}, - "source": [ - "## Step-5: DOC ID generation\n", - "\n", - "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set hash_column to the name of the column, where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set int_id_column to the name of the column, where you want to store it. **This is a pre-requisite for fuzzy dedup** in the pipeline." 
- ] - }, - { - "cell_type": "markdown", - "id": "46e88f76", - "metadata": {}, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "7debd243", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "1cadc2f3", - "metadata": {}, - "source": [ - "### 5.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "6b0eade3", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:28:55 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "00:28:55 INFO - pipeline id pipeline_id\n", - "00:28:55 INFO - code location None\n", - "00:28:55 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "00:28:55 INFO - actor creation delay 0\n", - "00:28:55 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "00:28:55 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "00:28:55 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:28:55 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:28:55 INFO - Running locally\n", - "2024-10-02 00:28:56,881\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:57 INFO - orchestrator started at 2024-10-02 00:28:57\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:57 INFO - Number of files is 2, source profile {'max_file_size': 0.06398677825927734, 'min_file_size': 0.028062820434570312, 'total_file_size': 0.09204959869384766}\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:57 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.8911590576171875, 'object_store': 2.4455795288085938}\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:57 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:57 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:58 INFO - Completed processing 2 files in 0.013 min\n", - "\u001b[36m(orchestrate pid=641742)\u001b[0m 00:28:58 INFO - done flushing in 0.001 sec\n", - "00:29:08 INFO - Completed execution in 0.228 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 123 ms, sys: 167 ms, total: 290 ms\n", - "Wall time: 15 s\n" - ] - } - ], - "source": [ - "%%time \n", - "\n", - "from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # doc id configuration\n", - " \"doc_id_doc_column\": \"contents\",\n", - " \"doc_id_hash_column\": \"chunk_hash\",\n", - " \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "d5c5c6e4", - "metadata": {}, - "source": [ - "### 5.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "45d941b2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (211, 16)\n", - "Output data dimensions (rows x columns)= (211, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
31granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a3 Model Architecture\\nremove final 8 layers fr...$.main-text[69]6[107.45430756, 456.21582031, 504.50476074, 521...72fbd93a7a834627114fd13cdb1a48c354d6bd991a9eb9...72fbd93a7a834627114fd13cdb1a48c354d6bd991a9eb9...119
116granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344aAcknowledgments\\nThanks and acknowledgement to...$.main-text[249]21[107.07092285, 59.12960052, 505.24591064, 160....b6d51d1a54147d95051f77bf536ca6ab7360102dd5ac84...b6d51d1a54147d95051f77bf536ca6ab7360102dd5ac84...204
95granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a6.3 Code Editing and Translation\\nCodeLingua (...$.main-text[200]17[107.03813934, 207.6650238, 505.74505615, 350....c52299a48da2f5517c7ed6b964195a46dd0e339af1d0f3...c52299a48da2f5517c7ed6b964195a46dd0e339af1d0f3...183
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "31 granite.pdf 28 17 348 pdf \n", - "116 granite.pdf 28 17 348 pdf \n", - "95 granite.pdf 28 17 348 pdf \n", - "\n", - " hash size \\\n", - "31 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "116 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "95 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "31 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "116 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "95 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "\n", - " source_document_id \\\n", - "31 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "116 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "95 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "\n", - " contents doc_jsonpath \\\n", - "31 3 Model Architecture\\nremove final 8 layers fr... $.main-text[69] \n", - "116 Acknowledgments\\nThanks and acknowledgement to... $.main-text[249] \n", - "95 6.3 Code Editing and Translation\\nCodeLingua (... $.main-text[200] \n", - "\n", - " page_number bbox \\\n", - "31 6 [107.45430756, 456.21582031, 504.50476074, 521... \n", - "116 21 [107.07092285, 59.12960052, 505.24591064, 160.... \n", - "95 17 [107.03813934, 207.6650238, 505.74505615, 350.... \n", - "\n", - " document_id \\\n", - "31 72fbd93a7a834627114fd13cdb1a48c354d6bd991a9eb9... \n", - "116 b6d51d1a54147d95051f77bf536ca6ab7360102dd5ac84... \n", - "95 c52299a48da2f5517c7ed6b964195a46dd0e339af1d0f3... \n", - "\n", - " chunk_hash chunk_id \n", - "31 72fbd93a7a834627114fd13cdb1a48c354d6bd991a9eb9... 119 \n", - "116 b6d51d1a54147d95051f77bf536ca6ab7360102dd5ac84... 204 \n", - "95 c52299a48da2f5517c7ed6b964195a46dd0e339af1d0f3... 183 " - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", - "metadata": {}, - "source": [ - "## Step-6: Exact Dedup\n", - "\n", - "Remove documents having identical code to remove bias in the training data. On the content of each document, a SHA256 hash is computed,\n", - "followed by de-duplication of record having identical hashes." 
- ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": {}, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "4c7a1b94", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": {}, - "source": [ - "### 6.2 - Execute " - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:29:10 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "00:29:10 INFO - pipeline id pipeline_id\n", - "00:29:10 INFO - code location None\n", - "00:29:10 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "00:29:10 INFO - actor creation delay 0\n", - "00:29:10 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - "00:29:10 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "00:29:10 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:29:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:29:10 INFO - Running locally\n", - "2024-10-02 00:29:11,920\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:12 INFO - orchestrator started at 2024-10-02 00:29:12\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:12 INFO - Number of files is 2, source profile {'max_file_size': 0.0694570541381836, 'min_file_size': 0.03227043151855469, 'total_file_size': 0.10172748565673828}\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:12 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.913980866782367, 'object_store': 2.4569904319941998}\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:12 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:12 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:13 INFO - Completed processing 2 files in 0.013 min\n", - "\u001b[36m(orchestrate pid=643333)\u001b[0m 00:29:13 INFO - done flushing in 0.001 sec\n", - "00:29:23 INFO - Completed execution in 0.227 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 120 ms, sys: 172 ms, total: 292 ms\n", - "Wall time: 14.9 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "# Import ededup transform configuration\n", - "from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # ededup parameters\n", - " \"ededup_hash_cpu\": 0.5,\n", - " \"ededup_num_hashes\": 2,\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - " \n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": {}, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d824ebf6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (211, 18)\n", - "Output data dimensions (rows x columns)= (211, 19)\n", - "Input chunks before exact dedupe : 211\n", - "Output chunks after exact dedupe : 211\n", - "Duplicate chunks removed : 0\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
188attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802446.2 Model Variations\\nTo evaluate the importan...$.main-text[112]8[107.1419754, 91.9256134, 504.05615234, 113.59...6eb55d1014abb7e7a010fd07b994af17a0cad7ca059f8f...6eb55d1014abb7e7a010fd07b994af17a0cad7ca059f8f...65[]
153attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802443.2.2 Multi-Head Attention\\noutput values. The...$.main-text[54]5[107.36427307, 696.97607422, 503.99719238, 717...07f191b8e14ee3784ecc42c94e4096c97388733f1ea59b...07f191b8e14ee3784ecc42c94e4096c97388733f1ea59b...30[]
68granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a6.1.5 RepoBench, CrossCodeEval: Repository-Lev...$.main-text[154]12[107.21151733, 141.59487915, 505.73928833, 218...650d9bcdcb744b665a189a4d02f09a4be39dcde46a0ecd...650d9bcdcb744b665a189a4d02f09a4be39dcde46a0ecd...156[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "188 attension.pdf 15 4 193 pdf \n", - "153 attension.pdf 15 4 193 pdf \n", - "68 granite.pdf 28 17 348 pdf \n", - "\n", - " hash size \\\n", - "188 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "153 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "68 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "188 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "153 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "68 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "\n", - " source_document_id \\\n", - "188 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "153 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "68 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "\n", - " contents doc_jsonpath \\\n", - "188 6.2 Model Variations\\nTo evaluate the importan... $.main-text[112] \n", - "153 3.2.2 Multi-Head Attention\\noutput values. The... $.main-text[54] \n", - "68 6.1.5 RepoBench, CrossCodeEval: Repository-Lev... $.main-text[154] \n", - "\n", - " page_number bbox \\\n", - "188 8 [107.1419754, 91.9256134, 504.05615234, 113.59... \n", - "153 5 [107.36427307, 696.97607422, 503.99719238, 717... \n", - "68 12 [107.21151733, 141.59487915, 505.73928833, 218... \n", - "\n", - " document_id \\\n", - "188 6eb55d1014abb7e7a010fd07b994af17a0cad7ca059f8f... \n", - "153 07f191b8e14ee3784ecc42c94e4096c97388733f1ea59b... \n", - "68 650d9bcdcb744b665a189a4d02f09a4be39dcde46a0ecd... \n", - "\n", - " chunk_hash chunk_id removed \n", - "188 6eb55d1014abb7e7a010fd07b994af17a0cad7ca059f8f... 65 [] \n", - "153 07f191b8e14ee3784ecc42c94e4096c97388733f1ea59b... 30 [] \n", - "68 650d9bcdcb744b665a189a4d02f09a4be39dcde46a0ecd... 156 [] " - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "85309751-8556-41c6-ac32-84acc941bc8d", - "metadata": {}, - "source": [ - "## Step-7: Fuzzy Dedup\n", - "\n", - "Post exact deduplication, fuzzy deduplication is applied with\n", - "the goal of removing code files that may have slight variations and thereby unbiasing\n", - "the data further. Small variations are quite commonly seen in code data in the form\n", - "of variations in the values of variables, addittion of logging statements etc. Find near-\n", - "duplicate." 
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fcf574a3-b287-419c-9c86-07b828b41ca6",
- "metadata": {},
- "source": [
- "### 7.1 - Set Input/output Folder"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'\n"
- ]
- }
- ],
- "source": [
- "## Input to this component is the output of the exact dedupe component. \n",
- "\n",
- "STAGE = 5\n",
- "\n",
- "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n",
- "output_folder = output_fuzzy_dedupe_dir\n",
- "\n",
- "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n",
- "\n",
- "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3",
- "metadata": {},
- "source": [
- "### 7.2 - Execute "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "00:29:25 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n",
- "00:29:25 INFO - pipeline id pipeline_id\n",
- "00:29:25 INFO - code location None\n",
- "00:29:25 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n",
- "00:29:25 INFO - actor creation delay 0\n",
- "00:29:25 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n",
- "00:29:25 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_fuzzy_dedupe_out\n",
- "00:29:25 INFO - data factory data_ max_files -1, n_sample -1\n",
- "00:29:25 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "00:29:25 INFO - Running locally\n",
- "2024-10-02 00:29:26,903\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - orchestrator started at 2024-10-02 00:29:28\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - Number of files is 2, source profile {'max_file_size': 0.06981658935546875, 'min_file_size': 0.032629966735839844, 'total_file_size': 0.1024465560913086}\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.94085159432143, 'object_store': 2.470425795763731}\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:28 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:34 INFO - Completed 1 files in 0.115 min\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:34 INFO - Completed 1 files (50.0%) in 0.115 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:41 INFO - Completed processing 2 files in 0.217 min\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:41 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:42 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:42 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=645808)\u001b[0m 00:29:43 INFO - processing buckets 0 long, 1686 short\n", - "\u001b[36m(BucketsHash pid=645808)\u001b[0m 00:29:43 INFO - Done submitting long buckets\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - Done processing buckets in 0.011 min\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:43 INFO - creating document snapshots\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=646353)\u001b[0m 00:29:43 INFO - Waiting bucket processing completion. Submitted requests 17\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:44 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:44 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n",
- "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:52 INFO - Completed processing 2 files in 0.131 min\n",
- "\u001b[36m(orchestrate pid=644959)\u001b[0m 00:29:52 INFO - done flushing in 0.003 sec\n",
- "00:30:02 INFO - Completed execution in 0.627 min, execution result 0\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "✅ Stage:5 completed successfully\n",
- "CPU times: user 223 ms, sys: 189 ms, total: 412 ms\n",
- "Wall time: 39 s\n"
- ]
- }
- ],
- "source": [
- "%%time \n",
- "\n",
- "import os\n",
- "import sys\n",
- "\n",
- "from data_processing.utils import ParamsUtils\n",
- "from fdedup_transform_ray import FdedupRayTransformConfiguration\n",
- "\n",
- "# create parameters\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n",
- "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
- "params = {\n",
- " # where to run\n",
- " \"run_locally\": True,\n",
- " # Data access. Only required parameters are specified\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
- " # Orchestration parameters\n",
- " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
- " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n",
- " # columns used\n",
- " \"fdedup_doc_column\": \"contents\",\n",
- " \"fdedup_id_column\": \"chunk_id\",\n",
- " \"fdedup_cluster_column\": \"chunk_hash\",\n",
- " # infrastructure\n",
- " \"fdedup_bucket_cpu\": 0.3,\n",
- " \"fdedup_doc_cpu\": 0.3,\n",
- " \"fdedup_mhash_cpu\": 0.3,\n",
- " \"fdedup_num_doc_actors\": 1,\n",
- " \"fdedup_num_bucket_actors\": 1,\n",
- " \"fdedup_num_minhash_actors\": 1,\n",
- " \"fdedup_num_preprocessors\": 1,\n",
- " # fuzzy parameters\n",
- " \"fdedup_num_permutations\": 64,\n",
- " \"fdedup_threshold\": 0.7, # between 0.0 and 1.0; smaller values are more lenient in finding near-dupes; values closer to 1.0 are stricter\n",
- " \"fdedup_shingles_size\": 5,\n",
- " \"fdedup_delimiters\": \" \"\n",
- "}\n",
- "\n",
- "# Pass commandline params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "\n",
- "# launch\n",
- "\n",
- "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n",
- "\n",
- "return_code = launcher.launch()\n",
- "\n",
- "if return_code == 0:\n",
- " print (f\"✅ Stage:{STAGE} completed successfully\")\n",
- "else:\n",
- " raise Exception (\"❌ Ray job failed\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a6f8cd11",
- "metadata": {},
- "source": [
- "### 7.3 - Inspect Generated output"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "id": "e899ad60",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Input data dimensions (rows x columns)= (211, 19)\n",
- "Output data dimensions (rows x columns)= (211, 19)\n",
- "Duplicate chunks removed by fuzzy-dedupe: 0\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_idremovedchunk_hash
47granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a6.1.1 HumanEvalSynthesize: Multilingual Code G...$.main-text[118]9[107.09940338, 505.84005737, 505.70474243, 604...22dd65548755f19ec6ccd89020fd1fbc88e339fafbd881...135[]-1
134attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802441 Introduction\\nAttention mechanisms have beco...$.main-text[20]2[107.17721558, 497.6980896, 505.65536499, 540....362722af4a10ed54ca21fd329149c01397a621e15f8306...11[]-1
93granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a6.3 Code Editing and Translation\\nTarget Langu...$.tables[13]17[161.45388794, 433.6942749, 450.61630249, 552....f665c10385f0eb31b2b94e5e61c934651f5789f5ab528c...181[]-1
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "47 granite.pdf 28 17 348 pdf \n", - "134 attension.pdf 15 4 193 pdf \n", - "93 granite.pdf 28 17 348 pdf \n", - "\n", - " hash size \\\n", - "47 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "134 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "93 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "47 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "134 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "93 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "\n", - " source_document_id \\\n", - "47 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "134 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "93 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "\n", - " contents doc_jsonpath \\\n", - "47 6.1.1 HumanEvalSynthesize: Multilingual Code G... $.main-text[118] \n", - "134 1 Introduction\\nAttention mechanisms have beco... $.main-text[20] \n", - "93 6.3 Code Editing and Translation\\nTarget Langu... $.tables[13] \n", - "\n", - " page_number bbox \\\n", - "47 9 [107.09940338, 505.84005737, 505.70474243, 604... \n", - "134 2 [107.17721558, 497.6980896, 505.65536499, 540.... \n", - "93 17 [161.45388794, 433.6942749, 450.61630249, 552.... \n", - "\n", - " document_id chunk_id removed \\\n", - "47 22dd65548755f19ec6ccd89020fd1fbc88e339fafbd881... 135 [] \n", - "134 362722af4a10ed54ca21fd329149c01397a621e15f8306... 11 [] \n", - "93 f665c10385f0eb31b2b94e5e61c934651f5789f5ab528c... 181 [] \n", - "\n", - " chunk_hash \n", - "47 -1 \n", - "134 -1 \n", - "93 -1 " - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": {}, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode text for the vector storage." 
- ] - }, - { - "cell_type": "markdown", - "id": "8fbbeaff", - "metadata": {}, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" - ] - } - ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_fuzzy_dedupe_dir\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "1e6a88f8", - "metadata": {}, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "00:30:04 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "00:30:04 INFO - pipeline id pipeline_id\n", - "00:30:04 INFO - code location None\n", - "00:30:04 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "00:30:04 INFO - actor creation delay 0\n", - "00:30:04 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", - "00:30:04 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "00:30:04 INFO - data factory data_ max_files -1, n_sample -1\n", - "00:30:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "00:30:04 INFO - Running locally\n", - "2024-10-02 00:30:06,760\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:10 INFO - orchestrator started at 2024-10-02 00:30:10\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:10 INFO - Number of files is 2, source profile {'max_file_size': 0.06542396545410156, 'min_file_size': 0.029404640197753906, 'total_file_size': 0.09482860565185547}\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:10 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 4.923227692954242, 'object_store': 2.4616138450801373}\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:10 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:10 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:21 INFO - Completed processing 2 files in 0.188 min\n", - "\u001b[36m(orchestrate pid=647243)\u001b[0m 00:30:21 INFO - done flushing in 0.001 sec\n", - "00:30:31 INFO - Completed execution in 0.449 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:6 completed successfully\n", - "CPU times: user 638 ms, sys: 269 ms, total: 907 ms\n", - "Wall time: 29 s\n" - ] - } - ], - "source": [ - "%%time \n", - "\n", - "from text_encoder_transform_ray import TextEncoderRayTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())\n", - "# Launch the ray actor(s) to process the input\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Ray job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": {}, - "source": [ - "### 8.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "7b1c1d09", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (211, 19)\n", - "Output data dimensions (rows x columns)= (211, 20)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_idremovedchunk_hashembeddings
171attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802444 Why Self-Attention\\nlength n is smaller than...$.main-text[85]7[107.26034546, 652.83349609, 504.29177856, 717...6f8efa86e0a4f77b0d72d4a3141e5e0611b2921a392b99...48[]-1[0.018015103, -0.038851, 0.0016827772, -0.0493...
25granite.pdf2817348pdf79c53d694df467391e94f279af2fa6a9a7e45c3922546e...6550542024-10-02T00:28:23.836369167.768806granite.pdf81bc331a-69cf-49bd-84b9-afedcab1344a3 Model Architecture\\nBatch size, 3B = 2048. B...$.tables[0]5[138.25450134, 299.99499512, 471.55078125, 432...b8f3a83c697e885ad31913c716644399a4772691e39d0b...113[]-1[0.003977602, -0.06122852, -0.089708336, -0.00...
137attension.pdf154193pdf6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...1358142024-10-02T00:26:29.88859753.822026attension.pdf7afd3fbc-3a9f-4728-8fd8-4a9a139802442 Background\\nSelf-attention, sometimes called...$.main-text[24]2[107.29702759, 256.18237305, 505.24960327, 298...9c2abd2ec38b67c74873e0cd670d27b702711d05930f26...14[]-1[0.03394238, -0.0117239505, -0.03349689, -0.02...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "171 attension.pdf 15 4 193 pdf \n", - "25 granite.pdf 28 17 348 pdf \n", - "137 attension.pdf 15 4 193 pdf \n", - "\n", - " hash size \\\n", - "171 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "25 79c53d694df467391e94f279af2fa6a9a7e45c3922546e... 655054 \n", - "137 6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23... 135814 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "171 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "25 2024-10-02T00:28:23.836369 167.768806 granite.pdf \n", - "137 2024-10-02T00:26:29.888597 53.822026 attension.pdf \n", - "\n", - " source_document_id \\\n", - "171 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "25 81bc331a-69cf-49bd-84b9-afedcab1344a \n", - "137 7afd3fbc-3a9f-4728-8fd8-4a9a13980244 \n", - "\n", - " contents doc_jsonpath \\\n", - "171 4 Why Self-Attention\\nlength n is smaller than... $.main-text[85] \n", - "25 3 Model Architecture\\nBatch size, 3B = 2048. B... $.tables[0] \n", - "137 2 Background\\nSelf-attention, sometimes called... $.main-text[24] \n", - "\n", - " page_number bbox \\\n", - "171 7 [107.26034546, 652.83349609, 504.29177856, 717... \n", - "25 5 [138.25450134, 299.99499512, 471.55078125, 432... \n", - "137 2 [107.29702759, 256.18237305, 505.24960327, 298... \n", - "\n", - " document_id chunk_id removed \\\n", - "171 6f8efa86e0a4f77b0d72d4a3141e5e0611b2921a392b99... 48 [] \n", - "25 b8f3a83c697e885ad31913c716644399a4772691e39d0b... 113 [] \n", - "137 9c2abd2ec38b67c74873e0cd670d27b702711d05930f26... 14 [] \n", - "\n", - " chunk_hash embeddings \n", - "171 -1 [0.018015103, -0.038851, 0.0016827772, -0.0493... \n", - "25 -1 [0.003977602, -0.06122852, -0.089708336, -0.00... \n", - "137 -1 [0.03394238, -0.0117239505, -0.03349689, -0.02... 
" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.sample(min(3, output_df.shape[0]))" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": {}, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" - ] - } - ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "data-prep-kit-3-py312", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.7" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/notebooks/rag/rag_1C_vector_search.ipynb b/examples/notebooks/rag/rag_1C_vector_search.ipynb deleted file mode 100644 index e49de86e4c..0000000000 --- a/examples/notebooks/rag/rag_1C_vector_search.ipynb +++ /dev/null @@ -1,354 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Handy Utils to do Vector Search on Collections" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-1: Configuration" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "from my_config import MY_CONFIG" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-2: Connect to Vector Database\n", - "\n", - "Milvus can be embedded and easy to use.\n", - "\n", - "Note: If you encounter an error about unable to load database, try this: \n", - "\n", - "- In **vscode** : **restart the kernel** of previous notebook. This will release the db.lock \n", - "- In **Jupyter**: Do `File --> Close and Shutdown Notebook` of previous notebook. This will release the db.lock\n", - "- Re-run this cell again\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Connected to Milvus instance: ./rag_1_dpk.db\n" - ] - } - ], - "source": [ - "from pymilvus import MilvusClient\n", - "\n", - "milvus_client = MilvusClient(MY_CONFIG.DB_URI)\n", - "\n", - "print (\"✅ Connected to Milvus instance:\", MY_CONFIG.DB_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-3: Setup Embeddings\n", - "\n", - "Two choices here. \n", - "\n", - "1. use sentence transformers directly\n", - "2. 
use Milvus model wrapper" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/sujee/apps/anaconda3/envs/data-prep-kit-4-021/lib/python3.11/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py:11: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", - " from tqdm.autonotebook import tqdm, trange\n", - "/home/sujee/apps/anaconda3/envs/data-prep-kit-4-021/lib/python3.11/site-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", - " warnings.warn(\n" - ] - } - ], - "source": [ - "## Option 1 - use sentence transformers directly\n", - "\n", - "# If connection to https://huggingface.co/ failed, uncomment the following path\n", - "import os\n", - "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", - "\n", - "from sentence_transformers import SentenceTransformer\n", - "\n", - "embedding_model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)\n", - "\n", - "def get_embeddings (str):\n", - " embeddings = embedding_model.encode(str, normalize_embeddings=True)\n", - " return embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "## Option 2 - Milvus model\n", - "from pymilvus import model\n", - "\n", - "# If connection to https://huggingface.co/ failed, uncomment the following path\n", - "import os\n", - "os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n", - "\n", - "\n", - "# embedding_fn = model.DefaultEmbeddingFunction()\n", - "\n", - "## initialize the SentenceTransformerEmbeddingFunction\n", - "embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(\n", - " model_name = MY_CONFIG.EMBEDDING_MODEL,\n", - " device='cpu' # this will work on all devices (KIS)\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence transformer : embeddings len = 384\n", - "sentence transformer : embeddings[:5] = [ 0.02468893 0.10352131 0.02752644 -0.08551719 -0.01412828]\n", - "milvus model wrapper : embeddings len = 384\n", - "milvus model wrapper : embeddings[:5] = [ 0.02468893 0.10352128 0.02752643 -0.08551716 -0.01412826]\n" - ] - } - ], - "source": [ - "# Test Embeddings\n", - "text = 'Paris 2024 Olympics'\n", - "embeddings = get_embeddings(text)\n", - "print ('sentence transformer : embeddings len =', len(embeddings))\n", - "print ('sentence transformer : embeddings[:5] = ', embeddings[:5])\n", - "\n", - "embeddings = embedding_fn([text])\n", - "print ('milvus model wrapper : embeddings len =', len(embeddings[0]))\n", - "print ('milvus model wrapper : embeddings[:5] = ', embeddings[0][:5])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-4: Do A Vector Search\n", - "\n", - "We will do this to verify data" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "import random\n", - "\n", - "\n", - "## helper function to perform vector search\n", - "def do_vector_search (query):\n", - " query_vectors = [get_embeddings(query)] # Option 1 - using sentence transformers\n", - " # query_vectors = embedding_fn([query]) # using 
Milvus model \n", - "\n", - " results = milvus_client.search(\n", - " collection_name=MY_CONFIG.COLLECTION_NAME, # target collection\n", - " data=query_vectors, # query vectors\n", - " limit=5, # number of returned entities\n", - " output_fields=[\"filename\", \"page_number\", \"text\"], # specifies fields to be returned\n", - " )\n", - " return results\n", - "## ----\n", - "\n", - "def print_search_results (results):\n", - " # pprint (results)\n", - " print ('num results : ', len(results[0]))\n", - "\n", - " for i, r in enumerate (results[0]):\n", - " #pprint(r, indent=4)\n", - " print (f'------ result {i+1} --------')\n", - " print ('search score:', r['distance'])\n", - " print ('filename:', r['entity']['filename'])\n", - " print ('page number:', r['entity']['page_number'])\n", - " print ('text:\\n', r['entity']['text'])\n", - " print()" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "num results : 5\n", - "------ result 1 --------\n", - "search score: 0.5946735143661499\n", - "filename: granite.pdf\n", - "page number: 5\n", - "text:\n", - " 3 Model Architecture\n", - "Table 1: Model configurations for Granite Code models.\n", - "\n", - "------ result 2 --------\n", - "search score: 0.5919967889785767\n", - "filename: granite.pdf\n", - "page number: 6\n", - "text:\n", - " 3 Model Architecture\n", - "Figure 2: An overview of depth upscaling (Kim et al., 2024) for efficient training of Granite34B-Code. We utilize the 20B model after 1.6T tokens to start training of 34B model with the same code pretraining data without any changes to the training and inference framework.\n", - "\n", - "------ result 3 --------\n", - "search score: 0.5557882785797119\n", - "filename: granite.pdf\n", - "page number: 1\n", - "text:\n", - " Granite Code Models: A Family of Open Foundation Models for Code Intelligence\n", - "Mayank Mishra ⋆ Matt Stallone ⋆ Gaoyuan Zhang ⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri † Rameswar Panda †\n", - "\n", - "------ result 4 --------\n", - "search score: 0.539251983165741\n", - "filename: granite.pdf\n", - "page number: 6\n", - "text:\n", - " 3 Model Architecture\n", - "remove final 8 layers from the original model and initial 8 layers from its duplicate to form two models. Finally, we concatenate both models to form Granite-34B-Code model with 88 layers (see Figure 2 for an illustration). After the depth upscaling, we observe that the drop in performance compared to 20B model is pretty small contrary to what is observed by Kim et al.. This performance is recovered pretty quickly after we continue pretraining of the upscaled 34B model. 
Similar, to 20B, we use a 8192 token context during pretraining.\n", - "\n", - "------ result 5 --------\n", - "search score: 0.537261962890625\n", - "filename: granite.pdf\n", - "page number: 20\n", - "text:\n", - " 6.6 Calling Functions and Tools\n", - "Figure 4 shows the results of different Granite Code models on BFCL benchmark. As can be seen from the figure, overall accuracy improves from 25.65% to 57.12% for Granite-3BCode-Base to Granite-34B-Code-Base, showing the effectiveness of model scaling in function (tool) calling capabilities. We also compare Granite-8B-Code with CodeLlama-7B in Figure 5 and find that Granite-8B-Code-Instruct beats CodeLlama-7B-Instruct by 22%, 14% and 12% on AST Summary, Execution Summary and Overall accuracy respectively. Additionally, Figure 5 shows that instruction tuning consistently improves performance of both base models, with more noticeable improvements in Granite Code models. E.g., +17.88% in overall accuracy from Granite-8B-Code-Base to Granite-8B-Code-Instruct, indicating the effectiveness of our well-curated data mixture in finetuning base models.\n", - "\n" - ] - } - ], - "source": [ - "query = \"What was the training data used to train Granite models?\"\n", - "\n", - "results = do_vector_search (query)\n", - "print_search_results(results)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "num results : 5\n", - "------ result 1 --------\n", - "search score: 0.6484582424163818\n", - "filename: attension.pdf\n", - "page number: 2\n", - "text:\n", - " 1 Introduction\n", - "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.\n", - "\n", - "------ result 2 --------\n", - "search score: 0.6340895891189575\n", - "filename: attension.pdf\n", - "page number: 3\n", - "text:\n", - " 3.2 Attention\n", - "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n", - "\n", - "------ result 3 --------\n", - "search score: 0.5805453062057495\n", - "filename: attension.pdf\n", - "page number: 10\n", - "text:\n", - " 7 Conclusion\n", - "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.\n", - "\n", - "------ result 4 --------\n", - "search score: 0.5805416703224182\n", - "filename: attension.pdf\n", - "page number: 15\n", - "text:\n", - " Attention Visualizations Input-Input Layer5\n", - "Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. 
The heads clearly learned to perform different tasks.\n", - "\n", - "------ result 5 --------\n", - "search score: 0.5769087076187134\n", - "filename: attension.pdf\n", - "page number: 13\n", - "text:\n", - " Attention Visualizations Input-Input Layer5\n", - "Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.\n", - "\n" - ] - } - ], - "source": [ - "query = \"What is the attention mechanism?\"\n", - "\n", - "results = do_vector_search (query)\n", - "print_search_results(results)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "# milvus_client.close()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/examples/notebooks/rag/rag_1D_query_replicate.ipynb b/examples/notebooks/rag/rag_1D_query_replicate.ipynb deleted file mode 100644 index 5e94ac0e83..0000000000 --- a/examples/notebooks/rag/rag_1D_query_replicate.ipynb +++ /dev/null @@ -1,479 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Query Data using LLM\n", - "\n", - "Here is the overall RAG pipeline. In this notebook, we will do steps (5), (6), (7), (8), (9)\n", - "- Importing data is already done in this notebook [rag_1B_load_data_into_milvus.ipynb](rag_1B_load_data_into_milvus.ipynb)\n", - "- 👉 Step 5: Calculate embedding for user query\n", - "- 👉 Step 6 & 7: Send the query to vector db to retrieve relevant documents\n", - "- 👉 Step 8 & 9: Send the query and relevant documents (returned above step) to LLM and get answers to our query\n", - "\n", - "![image missing](media/rag-overview-2.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-1: Configuration" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "from my_config import MY_CONFIG" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-2: Load .env file\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ config REPLICATE_API_TOKEN found\n" - ] - } - ], - "source": [ - "import os,sys\n", - "## Load Settings from .env file\n", - "from dotenv import find_dotenv, dotenv_values\n", - "\n", - "# _ = load_dotenv(find_dotenv()) # read local .env file\n", - "config = dotenv_values(find_dotenv())\n", - "\n", - "# debug\n", - "# print (config)\n", - "\n", - "MY_CONFIG.REPLICATE_API_TOKEN = config.get('REPLICATE_API_TOKEN')\n", - "\n", - "if MY_CONFIG.REPLICATE_API_TOKEN:\n", - " print (\"✅ config REPLICATE_API_TOKEN found\")\n", - "else:\n", - " raise Exception (\"'❌ REPLICATE_API_TOKEN' is not set. 
Please set it above to continue...\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-3: Connect to Vector Database\n", - "\n", - "Milvus can be embedded and easy to use.\n", - "\n", - "Note: If you encounter an error about unable to load database, try this: \n", - "\n", - "- In **vscode** : **restart the kernel** of previous notebook. This will release the db.lock \n", - "- In **Jupyter**: Do `File --> Close and Shutdown Notebook` of previous notebook. This will release the db.lock\n", - "- Re-run this cell again\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Connected to Milvus instance: ./rag_1_dpk.db\n" - ] - } - ], - "source": [ - "from pymilvus import MilvusClient\n", - "\n", - "milvus_client = MilvusClient(MY_CONFIG.DB_URI)\n", - "\n", - "print (\"✅ Connected to Milvus instance:\", MY_CONFIG.DB_URI)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-4: Setup Embeddings\n", - "\n", - "Use the same embeddings we used to index our documents!" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/sujee/apps/anaconda3/envs/data-prep-kit-4-021/lib/python3.11/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py:11: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", - " from tqdm.autonotebook import tqdm, trange\n", - "/home/sujee/apps/anaconda3/envs/data-prep-kit-4-021/lib/python3.11/site-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n",
- " warnings.warn(\n"
- ]
- }
- ],
- "source": [
- "from sentence_transformers import SentenceTransformer\n",
- "\n",
- "model = SentenceTransformer(MY_CONFIG.EMBEDDING_MODEL)\n",
- "\n",
- "def get_embeddings (text): # 'text' avoids shadowing the built-in 'str'\n",
- " embeddings = model.encode(text, normalize_embeddings=True)\n",
- " return embeddings"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "embeddings len = 384\n",
- "embeddings[:5] = [ 0.02468893 0.10352131 0.02752644 -0.08551719 -0.01412828]\n"
- ]
- }
- ],
- "source": [
- "# Test embeddings\n",
- "embeddings = get_embeddings('Paris 2024 Olympics')\n",
- "print ('embeddings len =', len(embeddings))\n",
- "print ('embeddings[:5] = ', embeddings[:5])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Step-5: Vector Search and RAG"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get relevant documents using vector / semantic search\n",
- "\n",
- "def fetch_relevant_documents (query: str):\n",
- " search_res = milvus_client.search(\n",
- " collection_name=MY_CONFIG.COLLECTION_NAME,\n",
- " data = [get_embeddings(query)], # Use the `get_embeddings` function to convert the question to an embedding vector\n",
- " limit=3, # Return top 3 results\n",
- " search_params={\"metric_type\": \"IP\", \"params\": {}}, # Inner product distance\n",
- " output_fields=[\"text\"], # Return the text field\n",
- " )\n",
- " # print (search_res)\n",
- "\n",
- " retrieved_docs_with_distances = [\n",
- " {'text': res[\"entity\"][\"text\"], 'distance' : res[\"distance\"]} for res in search_res[0]\n",
- " ]\n",
- " return retrieved_docs_with_distances\n",
- "## --- end ---\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[ { 'distance': 0.5946735143661499,\n",
- " 'text': '3 Model Architecture\\n'\n",
- " 'Table 1: Model configurations for Granite Code models.'},\n",
- " { 'distance': 0.5919967889785767,\n",
- " 'text': '3 Model Architecture\\n'\n",
- " 'Figure 2: An overview of depth upscaling (Kim et al., 2024) '\n",
- " 'for efficient training of Granite34B-Code. We utilize the 20B '\n",
- " 'model after 1.6T tokens to start training of 34B model with '\n",
- " 'the same code pretraining data without any changes to the '\n",
- " 'training and inference framework.'},\n",
- " { 'distance': 0.5557882785797119,\n",
- " 'text': 'Granite Code Models: A Family of Open Foundation Models for '\n",
- " 'Code Intelligence\\n'\n",
- " 'Mayank Mishra ⋆ Matt Stallone ⋆ Gaoyuan Zhang ⋆ Yikang Shen '\n",
- " 'Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran '\n",
- " 'Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong '\n",
- " 'Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman '\n",
- " 'Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris '\n",
- " 'Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu '\n",
- " 'Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel '\n",
- " 'Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell '\n",
- " 'Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen '\n",
- " 'Seetharami Seelam Brian Belgodere Carlos Fonseca Amith '\n",
- " 'Singhee Nirmit Desai David D. 
Cox Ruchir Puri † Rameswar '\n", - " 'Panda †'}]\n" - ] - } - ], - "source": [ - "# test relevant vector search\n", - "import json\n", - "import pprint\n", - "\n", - "question = \"What was the training data used to train Granite models?\"\n", - "relevant_docs = fetch_relevant_documents(question)\n", - "pprint.pprint(relevant_docs, indent=4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-6: Initialize LLM\n", - "\n", - "### LLM Choices at Replicate\n", - "\n", - "\n", - "| Model | Publisher | Params | Description |\n", - "|-------------------------------------|-----------|--------|------------------------------------------------------|\n", - "| ibm-granite/granite-3.0-8b-instruct | IBM | 8 B | IBM's newest Granite Model v3.0 (default) |\n", - "| ibm-granite/granite-3.0-2b-instruct | IBM | 2 B | IBM's newest Granite Model v3.0 |\n", - "| meta/meta-llama-3.1-405b-instruct | Meta | 405 B | Meta's flagship 405 billion parameter language model |\n", - "| meta/meta-llama-3-8b-instruct | Meta | 8 B | Meta's 8 billion parameter language model |\n", - "| meta/meta-llama-3-70b-instruct | Meta | 70 B | Meta's 70 billion parameter language model |\n", - "\n", - "References \n", - "\n", - "- https://www.ibm.com/granite\n", - "- https://www.llama.com/\n", - "- https://replicate.com/ " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using model: ibm-granite/granite-3.0-8b-instruct\n" - ] - } - ], - "source": [ - "import os\n", - "os.environ[\"REPLICATE_API_TOKEN\"] = MY_CONFIG.REPLICATE_API_TOKEN\n", - "\n", - "print ('Using model:', MY_CONFIG.LLM_MODEL)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "import replicate\n", - "\n", - "def ask_LLM (question, relevant_docs):\n", - " context = \"\\n\".join(\n", - " [doc['text'] for doc in relevant_docs]\n", - " )\n", - " print ('============ context (this is the context supplied to LLM) ============')\n", - " print (context)\n", - " print ('============ end context ============', flush=True)\n", - "\n", - " system_prompt = \"\"\"\n", - " Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n", - " \"\"\"\n", - " user_prompt = f\"\"\"\n", - " Use the following pieces of information enclosed in tags to provide an answer to the question enclosed in tags.\n", - " \n", - " {context}\n", - " \n", - " \n", - " {question}\n", - " \n", - " \"\"\"\n", - "\n", - " print ('============ here is the answer from LLM... STREAMING... 
=====')\n", - " # The meta/meta-llama-3-8b-instruct model can stream output as it's running.\n", - " for event in replicate.stream(\n", - " MY_CONFIG.LLM_MODEL,\n", - " input={\n", - " \"top_k\": 1,\n", - " \"top_p\": 0.95,\n", - " \"prompt\": user_prompt,\n", - " \"max_tokens\": 1024,\n", - " \"temperature\": 0.1,\n", - " \"system_prompt\": system_prompt,\n", - " \"length_penalty\": 1,\n", - " # \"max_new_tokens\": 512,\n", - " \"stop_sequences\": \"<|end_of_text|>,<|eot_id|>\",\n", - " \"prompt_template\": \"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\\n\\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n\",\n", - " \"presence_penalty\": 0,\n", - " \"log_performance_metrics\": False\n", - " },\n", - " ):\n", - " print(str(event), end=\"\")\n", - " ## ---\n", - " print ('\\n====== end LLM answer ======\\n', flush=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Step-7: Query" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "============ context (this is the context supplied to LLM) ============\n", - "3 Model Architecture\n", - "Table 1: Model configurations for Granite Code models.\n", - "3 Model Architecture\n", - "Figure 2: An overview of depth upscaling (Kim et al., 2024) for efficient training of Granite34B-Code. We utilize the 20B model after 1.6T tokens to start training of 34B model with the same code pretraining data without any changes to the training and inference framework.\n", - "Granite Code Models: A Family of Open Foundation Models for Code Intelligence\n", - "Mayank Mishra ⋆ Matt Stallone ⋆ Gaoyuan Zhang ⋆ Yikang Shen Aditya Prasad Adriana Meza Soria Michele Merler Parameswaran Selvam Saptha Surendran Shivdeep Singh Manish Sethi Xuan-Hong Dang Pengyuan Li Kun-Lung Wu Syed Zawad Andrew Coleman Matthew White Mark Lewis Raju Pavuluri Yan Koyfman Boris Lublinsky Maximilien de Bayser Ibrahim Abdelaziz Kinjal Basu Mayank Agarwal Yi Zhou Chris Johnson Aanchal Goyal Hima Patel Yousaf Shah Petros Zerfos Heiko Ludwig Asim Munawar Maxwell Crouse Pavan Kapanipathi Shweta Salaria Bob Calio Sophia Wen Seetharami Seelam Brian Belgodere Carlos Fonseca Amith Singhee Nirmit Desai David D. Cox Ruchir Puri † Rameswar Panda †\n", - "============ end context ============\n", - "============ here is the answer from LLM... STREAMING... =====\n", - "The context does not provide specific details about the training data used to train the Granite models. It only mentions that the 20B model was trained after 1.6T tokens and then used to start training the 34B model with the same code pretraining data. 
However, it does not specify what this code pretraining data is.\n", - "====== end LLM answer ======\n", - "\n", - "CPU times: user 63.6 ms, sys: 12 ms, total: 75.6 ms\n", - "Wall time: 1.43 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "question = \"What was the training data used to train Granite models?\"\n", - "relevant_docs = fetch_relevant_documents(question)\n", - "ask_LLM(question=question, relevant_docs=relevant_docs)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "============ context (this is the context supplied to LLM) ============\n", - "1 Introduction\n", - "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.\n", - "3.2 Attention\n", - "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n", - "7 Conclusion\n", - "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.\n", - "============ end context ============\n", - "============ here is the answer from LLM... STREAMING... =====\n", - "An attention mechanism is a method used in sequence modeling and transduction models to model dependencies between elements in input or output sequences, regardless of their distance. It maps a query and a set of key-value pairs to an output, which is computed as a weighted sum.\n", - "====== end LLM answer ======\n", - "\n", - "CPU times: user 30.6 ms, sys: 17.3 ms, total: 47.9 ms\n", - "Wall time: 880 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "question = \"What is attention mechanism?\"\n", - "relevant_docs = fetch_relevant_documents(question)\n", - "ask_LLM(question=question, relevant_docs=relevant_docs)" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "============ context (this is the context supplied to LLM) ============\n", - "6.1.5 RepoBench, CrossCodeEval: Repository-Level Code Generation\n", - "StarCoderBase-3B, MBPP = 29.4. StarCoderBase-3B, MBPP+ = 37.8. StableCode-3B, MBPP = 34.8. StableCode-3B, MBPP+ = 43.3. StarCoder2-3B, MBPP = 42.4. StarCoder2-3B, MBPP+ = 48.6. CodeGemma-2B, MBPP = 30.4. CodeGemma-2B, MBPP+ = 30.8. Granite-3B-Code-Base, MBPP = 36.0. Granite-3B-Code-Base, MBPP+ = 45.1. StarCoderBase-7B, MBPP = 34.8. StarCoderBase-7B, MBPP+ = 42.1. CodeLlama-7B, MBPP = 39.0. CodeLlama-7B, MBPP+ = 42.3. StarCoder2-7B, MBPP = 45.4. StarCoder2-7B, MBPP+ = 46.7. CodeGemma-7B, MBPP = 53.0. CodeGemma-7B, MBPP+ = 54.9. Granite-8B-Code-Base, MBPP = 42.2. Granite-8B-Code-Base, MBPP+ = 49.6. StarCoderBase-15B, MBPP = 37.4. StarCoderBase-15B, MBPP+ = 46.1. CodeLlama-13B, MBPP = 30.6. CodeLlama-13B, MBPP+ = 30.1. StarCoder2-15B, MBPP = 51.2. 
StarCoder2-15B, MBPP+ = 56.6. Granite-20B-Code-Base, MBPP = 43.8. Granite-20B-Code-Base, MBPP+ = 51.6. CodeLlama-34B, MBPP = 48.6. CodeLlama-34B, MBPP+ = 53.6. Granite-34B-Code-Base, MBPP = 47.2. Granite-34B-Code-Base, MBPP+ = 53.1\n", - "6.1.3 MBPP and MBPP+: Code Generation in Python\n", - "MBPP (Austin et al., 2021) and MBPP+ (Liu et al., 2023a) are two of the most widely studied benchmarks for evaluating code models. While the prompt for each MBPP problem includes a natural language description followed by a few tests, MBPP+ consists of 35 × more tests than the original benchmarks. We use greedy decoding and report the mean pass@1 for all the models. Table 5 summarizes the results of different base models. As we can see, Granite3B-Code-Base significantly outperforms CodeGemma-2B but falls short of StarCoder2-3B on\n", - "6.1.4 DS1000: Data Science Tasks in Python\n", - "The Granite Code models achieve relatively high accuracy across all sizes (e.g., outperforming CodeGemma at 2B-3B scale, StarCoder2 at 7B-8B scale and CodeLlama models with half of the sizes). This shows that our Granite Code models are not only capable of generating good code but also of using libraries more accurately in real data science workflows.\n", - "============ end context ============\n", - "============ here is the answer from LLM... STREAMING... =====\n", - "I'm sorry, the provided context does not contain information about the moon landing.\n", - "====== end LLM answer ======\n", - "\n", - "CPU times: user 45 ms, sys: 3.19 ms, total: 48.2 ms\n", - "Wall time: 412 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "question = \"When was the moon landing?\"\n", - "relevant_docs = fetch_relevant_documents(question)\n", - "ask_LLM(question=question, relevant_docs=relevant_docs)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "data-prep-kit-4-021", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/transforms/language/doc_chunk/requirements.txt b/transforms/language/doc_chunk/requirements.txt index c24f0113bc..b458ca98cc 100644 --- a/transforms/language/doc_chunk/requirements.txt +++ b/transforms/language/doc_chunk/requirements.txt @@ -1,3 +1,3 @@ -docling-core==2.3.0 -pydantic>=2.0.0,<2.10.0 +docling-core==2.18.0 +pydantic>=2.0.0 llama-index-core>=0.11.22,<0.12.0 diff --git a/transforms/language/doc_chunk/test-data/expected/metadata.json b/transforms/language/doc_chunk/test-data/expected/metadata.json index e83a0375bd..69a62dd7b5 100644 --- a/transforms/language/doc_chunk/test-data/expected/metadata.json +++ b/transforms/language/doc_chunk/test-data/expected/metadata.json @@ -5,8 +5,8 @@ "job name": "doc_chunk", "job type": "pure python", "job id": "job_id", - "start_time": "2024-10-30 18:38:40", - "end_time": "2024-10-30 18:38:40", + "start_time": "2025-02-10 15:20:06", + "end_time": "2025-02-10 15:20:07", "status": "success" }, "code": { @@ -25,6 +25,7 @@ "output_bbox_column_name": "bbox", "chunk_size_tokens": 128, "chunk_overlap_tokens": 30, + "dl_min_chunk_len": null, "checkpointing": false, "max_files": -1, "random_samples": -1, @@ -34,9 +35,9 @@ "num_processors": 0 
}, "execution_stats": { - "cpus": 19.5, + "cpus": 25.8, "gpus": 0, - "memory": 27.48, + "memory": 24.41, "object_store": 0, "execution time, min": 0.001 }, @@ -44,19 +45,19 @@ "source_files": 1, "source_size": 12073, "result_files": 1, - "result_size": 14363, - "processing_time": 0.043, + "result_size": 16705, + "processing_time": 0.044, "nfiles": 1, - "nrows": 39, + "nrows": 29, "source_doc_count": 1, - "result_doc_count": 39 + "result_doc_count": 29 }, "source": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/doc_chunk/python/test-data/input", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/doc_chunk/test-data/input", "type": "path" }, "target": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/doc_chunk/python/output", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/doc_chunk/output", "type": "path" } } \ No newline at end of file diff --git a/transforms/language/doc_chunk/test-data/expected/test1.parquet b/transforms/language/doc_chunk/test-data/expected/test1.parquet index 46714dde7b..72d6fd06d9 100644 Binary files a/transforms/language/doc_chunk/test-data/expected/test1.parquet and b/transforms/language/doc_chunk/test-data/expected/test1.parquet differ diff --git a/transforms/language/pdf2parquet/Dockerfile.python b/transforms/language/pdf2parquet/Dockerfile.python index 4ecaaa89c8..a10833bc7a 100644 --- a/transforms/language/pdf2parquet/Dockerfile.python +++ b/transforms/language/pdf2parquet/Dockerfile.python @@ -32,11 +32,10 @@ RUN pip install ${PIP_INSTALL_EXTRA_ARGS} -r requirements.txt # Set environment ENV PYTHONPATH /home/dpk +ENV PATH="/home/dpk/.local/bin:${PATH}" # Download models -RUN python -c 'from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models; load_pretrained_nlp_models(verbose=True);' -RUN python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; s=StandardPdfPipeline.download_models_hf(); print(f"Models cached in {s}")' - +RUN docling-tools models download layout tableformer picture_classifier easyocr # Parallelism ENV OMP_NUM_THREADS=2 diff --git a/transforms/language/pdf2parquet/Dockerfile.ray b/transforms/language/pdf2parquet/Dockerfile.ray index 4dc62538ec..6cbd20ea4e 100644 --- a/transforms/language/pdf2parquet/Dockerfile.ray +++ b/transforms/language/pdf2parquet/Dockerfile.ray @@ -32,15 +32,12 @@ COPY --chmod=775 --chown=ray:root dpk_pdf2parquet/ dpk_pdf2parquet/ COPY --chmod=775 --chown=ray:root requirements.txt requirements.txt RUN pip install ${PIP_INSTALL_EXTRA_ARGS} -r requirements.txt - - -# Download models -RUN python -c 'from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models; load_pretrained_nlp_models(verbose=True);' -# RUN python -c 'from docling.document_converter import DocumentConverter; from pathlib import Path; DocumentConverter.download_models_hf(local_dir=Path("./artifacts/"));' -RUN python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; s=StandardPdfPipeline.download_models_hf(); print(f"Models cached in {s}")' - # Set environment ENV PYTHONPATH /home/ray +ENV PATH="/home/ray/.local/bin:${PATH}" + +# Download models +RUN docling-tools models download layout tableformer picture_classifier easyocr # Parallelism ENV OMP_NUM_THREADS=2 diff --git a/transforms/language/pdf2parquet/requirements.txt b/transforms/language/pdf2parquet/requirements.txt index b4c6d06f2c..e3cb4727f3 100644 --- a/transforms/language/pdf2parquet/requirements.txt +++ 
b/transforms/language/pdf2parquet/requirements.txt @@ -1,5 +1,6 @@ -docling-core==2.3.0 -docling-ibm-models==2.0.3 -deepsearch-glm==0.26.1 -docling==2.3.1 +docling-core==2.18.0 +docling-ibm-models==3.3.1 +docling-parse==3.3.0 +deepsearch-glm==1.0.0 +docling==2.21.0 filetype >=1.2.0, <2.0.0 diff --git a/transforms/language/pdf2parquet/test-data/expected/archive1.parquet b/transforms/language/pdf2parquet/test-data/expected/archive1.parquet index 27b97529d2..83703cab30 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected/archive1.parquet and b/transforms/language/pdf2parquet/test-data/expected/archive1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected/metadata.json b/transforms/language/pdf2parquet/test-data/expected/metadata.json index f5961f8437..2d9adf085d 100644 --- a/transforms/language/pdf2parquet/test-data/expected/metadata.json +++ b/transforms/language/pdf2parquet/test-data/expected/metadata.json @@ -5,15 +5,11 @@ "job name": "pdf2parquet", "job type": "pure python", "job id": "job_id", - "start_time": "2024-11-13 08:35:51", - "end_time": "2024-11-13 08:36:23", + "start_time": "2025-02-10 14:18:13", + "end_time": "2025-02-10 14:18:21", "status": "success" }, - "code": { - "github": "github", - "commit_hash": "12345", - "path": "path" - }, + "code": null, "job_input_params": { "batch_size": -1, "artifacts_path": null, @@ -23,42 +19,40 @@ "ocr_engine": "easyocr", "bitmap_area_threshold": 0.05, "pdf_backend": "dlparse_v2", - "double_precision": 0, + "double_precision": 8, "checkpointing": false, "max_files": -1, "random_samples": -1, "files_to_use": [ ".pdf", - ".docx", - ".pptx", ".zip" ], "num_processors": 0 }, "execution_stats": { - "cpus": 147.5, + "cpus": 23.6, "gpus": 0, - "memory": 33.72, + "memory": 29.99, "object_store": 0, - "execution time, min": 0.522 + "execution time, min": 0.127 }, "job_output_stats": { "source_files": 2, "source_size": 605137, "result_files": 2, - "result_size": 33078, - "processing_time": 4.221, + "result_size": 32765, + "processing_time": 3.93, "nrows": 3, "nsuccess": 3, "nfail": 0, "nskip": 0 }, "source": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/test-data/input", "type": "path" }, "target": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/output", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/output", "type": "path" } } \ No newline at end of file diff --git a/transforms/language/pdf2parquet/test-data/expected/redp5110-ch1.parquet b/transforms/language/pdf2parquet/test-data/expected/redp5110-ch1.parquet index 3e08723a07..d6777ae872 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected/redp5110-ch1.parquet and b/transforms/language/pdf2parquet/test-data/expected/redp5110-ch1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected_batch/metadata.json b/transforms/language/pdf2parquet/test-data/expected_batch/metadata.json index 8756a013e4..63289ef922 100644 --- a/transforms/language/pdf2parquet/test-data/expected_batch/metadata.json +++ b/transforms/language/pdf2parquet/test-data/expected_batch/metadata.json @@ -5,8 +5,8 @@ "job name": "pdf2parquet", "job type": "pure python", "job id": "job_id", - "start_time": "2024-11-13 08:37:05", - "end_time": "2024-11-13 08:37:11", + "start_time": "2025-02-10 14:45:21", + "end_time": "2025-02-10 14:45:28", "status": "success" }, 
"code": { @@ -36,29 +36,29 @@ "num_processors": 0 }, "execution_stats": { - "cpus": 143.9, + "cpus": 28.6, "gpus": 0, - "memory": 34.21, + "memory": 24.32, "object_store": 0, - "execution time, min": 0.1 + "execution time, min": 0.113 }, "job_output_stats": { "source_files": 2, "source_size": 605137, "result_files": 1, - "processing_time": 3.364, + "processing_time": 3.426, "nrows": 3, "nsuccess": 3, "nfail": 0, "nskip": 0, - "result_size": 27226 + "result_size": 26903 }, "source": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/test-data/input", "type": "path" }, "target": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/output", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/output", "type": "path" } } \ No newline at end of file diff --git a/transforms/language/pdf2parquet/test-data/expected_batch/redp5110-ch1.parquet b/transforms/language/pdf2parquet/test-data/expected_batch/redp5110-ch1.parquet index 9e3302c8cd..bc4b054f18 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected_batch/redp5110-ch1.parquet and b/transforms/language/pdf2parquet/test-data/expected_batch/redp5110-ch1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected_json/archive1.parquet b/transforms/language/pdf2parquet/test-data/expected_json/archive1.parquet index 584cbea226..87b1d67dcf 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected_json/archive1.parquet and b/transforms/language/pdf2parquet/test-data/expected_json/archive1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected_json/metadata.json b/transforms/language/pdf2parquet/test-data/expected_json/metadata.json index 35a8bd8744..5c8f176825 100644 --- a/transforms/language/pdf2parquet/test-data/expected_json/metadata.json +++ b/transforms/language/pdf2parquet/test-data/expected_json/metadata.json @@ -5,8 +5,8 @@ "job name": "pdf2parquet", "job type": "pure python", "job id": "job_id", - "start_time": "2024-11-13 08:37:56", - "end_time": "2024-11-13 08:38:02", + "start_time": "2025-02-10 14:44:43", + "end_time": "2025-02-10 14:44:50", "status": "success" }, "code": { @@ -36,29 +36,29 @@ "num_processors": 0 }, "execution_stats": { - "cpus": 142.2, + "cpus": 28.5, "gpus": 0, - "memory": 33.63, + "memory": 24.53, "object_store": 0, - "execution time, min": 0.1 + "execution time, min": 0.107 }, "job_output_stats": { "source_files": 2, "source_size": 605137, "result_files": 2, - "result_size": 22993, - "processing_time": 3.422, + "result_size": 23484, + "processing_time": 3.518, "nrows": 3, "nsuccess": 3, "nfail": 0, "nskip": 0 }, "source": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/test-data/input", "type": "path" }, "target": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/output", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/output", "type": "path" } } \ No newline at end of file diff --git a/transforms/language/pdf2parquet/test-data/expected_json/redp5110-ch1.parquet b/transforms/language/pdf2parquet/test-data/expected_json/redp5110-ch1.parquet index 915c071891..fb503f2493 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected_json/redp5110-ch1.parquet and 
b/transforms/language/pdf2parquet/test-data/expected_json/redp5110-ch1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/archive1.parquet b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/archive1.parquet index f1bbf6c77b..763dbd50c0 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/archive1.parquet and b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/archive1.parquet differ diff --git a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/metadata.json b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/metadata.json index ad1709b3dc..5d94f7ea96 100644 --- a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/metadata.json +++ b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/metadata.json @@ -5,8 +5,8 @@ "job name": "pdf2parquet", "job type": "pure python", "job id": "job_id", - "start_time": "2024-11-13 08:37:31", - "end_time": "2024-11-13 08:37:34", + "start_time": "2025-02-10 14:44:09", + "end_time": "2025-02-10 14:44:11", "status": "success" }, "code": { @@ -36,29 +36,29 @@ "num_processors": 0 }, "execution_stats": { - "cpus": 143.4, + "cpus": 28.8, "gpus": 0, - "memory": 31.51, + "memory": 22.7, "object_store": 0, - "execution time, min": 0.042 + "execution time, min": 0.038 }, "job_output_stats": { "source_files": 2, "source_size": 605137, "result_files": 2, - "result_size": 29694, - "processing_time": 2.077, + "result_size": 29781, + "processing_time": 1.506, "nrows": 3, "nsuccess": 3, "nfail": 0, "nskip": 0 }, "source": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/test-data/input", "type": "path" }, "target": { - "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/python/output", + "name": "/Users/dol/codes/data-prep-kit/transforms/language/pdf2parquet/output", "type": "path" } } \ No newline at end of file diff --git a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet index 004f70d2de..96260b9977 100644 Binary files a/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet and b/transforms/language/pdf2parquet/test-data/expected_md_no_table_no_ocr/redp5110-ch1.parquet differ diff --git a/transforms/language/readability/README.md b/transforms/language/readability/README.md index 781338783a..baf9fdb80f 100644 --- a/transforms/language/readability/README.md +++ b/transforms/language/readability/README.md @@ -72,13 +72,19 @@ or English, focusing on the number of miniwords and length of sentences. The set of dictionary keys holding [ReadabilityTransform](dpk_readability/runtime.py) configuration for values are as follows: * _readability_contents_column_name_ - specifies the name of the column holding the document text. The default is `text`. -* _readability_curriculum_ - set to True when the data is prepared for curriculum learning and is annotated with the `flesch_kincaid`, `gunning_fog`, `automated_readability_index` readability scores, and the average of these 3 grade-level scores to speed up the annotation process. 
+* _readability_score_list_ - list of readability scores to be computed by the transform; + valid values: `coleman_liau_index_textstat`, `flesch_kincaid_textstat`, + `difficult_words_textstat`, `spache_readability_textstat`, `smog_index_textstat`, + `reading_time_textstat`, `dale_chall_readability_score_textstat`, `text_standard_textstat`, + `automated_readability_index_textstat`, `gunning_fog_textstat`, `flesch_ease_textstat`, + `mcalpine_eflaw_textstat`, `linsear_write_formula_textstat`. + Additionally, a set of data access-specific arguments are provided that enable the specification of the location of domain list files, so that these files could be stored in the local file system or in S3 storage, for example. The arguments are as follows (and generally match the TransformLauncher's -data access arguments but with the `extreme_tokenized_' prefix). +data access arguments but with the `readability_' prefix). * _readability_local_config_ - specifies the input and output folders. * _readability_s3_config_ - specifies the input and output paths in s3. @@ -94,20 +100,20 @@ annotated `readability-test.parquet` file and the `metadata.json` file.
 cma:readability$ make venv PYTHON=python3.11
 cma:readability$ source venv/bin/activate
-(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }"
-12:07:23 INFO - Launching Readability transform
-12:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_curriculum': False}
-12:07:23 INFO - pipeline id pipeline_id
-12:07:23 INFO - code location None
-12:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
-12:07:23 INFO - data factory data_ max_files -1, n_sample -1
-12:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
-12:07:23 INFO - orchestrator readability started at 2025-01-28 12:07:23
-12:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
-12:07:23 INFO - Completed 1 files (100.0%) in 0.002 min
-12:07:23 INFO - Done processing 1 files, waiting for flush() completion.
-12:07:23 INFO - done flushing in 0.0 sec
-12:07:23 INFO - Completed execution in 0.003 min, execution result 0
+(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }" --readability_score_list "['reading_time_textstat','spache_readability_textstat','text_standard_textstat']"
+13:07:23 INFO - Launching Readability transform
+13:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['reading_time_textstat', 'spache_readability_textstat', 'text_standard_textstat']}
+13:07:23 INFO - pipeline id pipeline_id
+13:07:23 INFO - code location None
+13:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
+13:07:23 INFO - data factory data_ max_files -1, n_sample -1
+13:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
+13:07:23 INFO - orchestrator readability started at 2025-02-07 13:07:23
+13:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
+13:07:24 INFO - Completed 1 files (100.0%) in 0.002 min
+13:07:24 INFO - Done processing 1 files, waiting for flush() completion.
+13:07:24 INFO - done flushing in 0.0 sec
+13:07:24 INFO - Completed execution in 0.002 min, execution result 0
 (venv) cma:readability$ deactivate
 
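For reference, the same run can be driven programmatically. Below is a minimal sketch of the Python equivalent of the CLI invocation above, assuming a `Readability` runtime class importable from `dpk_readability.runtime` (the call pattern mirrors `readability_python.ipynb` later in this patch; the exact class name is an assumption here):

```python
# Minimal sketch: programmatic equivalent of the CLI run shown above.
# Assumption: the runtime module exports a Readability class whose
# constructor accepts the same parameters as the CLI flags.
from dpk_readability.runtime import Readability

Readability(
    input_folder="test-data/input",
    output_folder="output",
    readability_contents_column_name="contents",
    # Only the selected scores are computed and added as columns.
    readability_score_list=[
        "reading_time_textstat",
        "spache_readability_textstat",
        "text_standard_textstat",
    ],
).transform()
```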
@@ -134,8 +140,8 @@ options: -h, --help show this help message and exit --readability_contents_column_name READABILITY_CONTENTS_COLUMN_NAME contents column name for input parquet table to transform - --readability_curriculum READABILITY_CURRICULUM - curriculum parameter for transform; select True for curriculum learning + --readability_score_list READABILITY_SCORE_LIST + list of readability scores to be computed by the transform; valid values: {'flesch_ease_textstat', 'reading_time_textstat', 'flesch_kincaid_textstat', 'automated_readability_index_textstat', 'linsear_write_formula_textstat', 'text_standard_textstat', 'smog_index_textstat', 'difficult_words_textstat', 'spache_readability_textstat', 'dale_chall_readability_score_textstat', 'mcalpine_eflaw_textstat', 'gunning_fog_textstat', 'coleman_liau_index_textstat'} --data_s3_cred DATA_S3_CRED AST string of options for s3 credentials. Only required for S3 data access. access_key: access key help text @@ -181,3 +187,4 @@ options: path: Path within the repository Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324', 'path': 'transforms/universal/code' } + diff --git a/transforms/language/readability/dpk_readability/common.py b/transforms/language/readability/dpk_readability/common.py index 352e11b1f8..d19e6c9dfa 100644 --- a/transforms/language/readability/dpk_readability/common.py +++ b/transforms/language/readability/dpk_readability/common.py @@ -63,12 +63,10 @@ """Key holds the mcalpine_eflaw_textstat R score threshold parameter""" reading_time_textstat = "reading_time_textstat" """Key holds the reading_time_textstat R score threshold parameter""" -avg_grade_level = "avg_grade_level" -"""Key holds the avg_grade_level R score threshold parameter""" contents_column_name = "contents_column_name" """Contents column name for the input parquet table to the transform""" -curriculum = "curriculum" -"""curriculum parameter for transform; either True or False""" +score_list = "score_list" +"""list of readability scores to be computed by the transform""" ######################################################################################## @@ -76,12 +74,12 @@ """avg_grade_level R score threshold parameter""" contents_column_name_cli_param = f"{cli_prefix}{contents_column_name}" """Content column name for parquet input table to transform""" -curriculum_cli_param = f"{cli_prefix}{curriculum}" -"""curriculum parameter for transform; either True or False""" +score_list_cli_param = f"{cli_prefix}{score_list}" +"""list of readability scores or a single readability score to be computed by the transform""" # The set of default value that can be overwritten from the CLI """ contents_column_name_default = "contents" """The default value for contents_column_name""" -curriculum_default = False -"""curriculum parameter for transform; either True or False""" +score_list_default = mcalpine_eflaw_textstat +"""readability score that is computed by default""" diff --git a/transforms/language/readability/dpk_readability/runtime.py b/transforms/language/readability/dpk_readability/runtime.py index 62d03c6906..018c27daaa 100644 --- a/transforms/language/readability/dpk_readability/runtime.py +++ b/transforms/language/readability/dpk_readability/runtime.py @@ -10,6 +10,8 @@ # limitations under the License. 
################################################################################ +import argparse +import ast import sys from argparse import ArgumentParser, Namespace @@ -21,12 +23,25 @@ from data_processing.transform import TransformConfiguration from data_processing.utils import CLIArgumentProvider, ParamsUtils, get_logger, str2bool from dpk_readability.common import ( + automated_readability_index_textstat, cli_prefix, + coleman_liau_index_textstat, contents_column_name_cli_param, contents_column_name_default, - curriculum_cli_param, - curriculum_default, + dale_chall_readability_score_textstat, + difficult_words_textstat, + flesch_ease_textstat, + flesch_kincaid_textstat, + gunning_fog_textstat, + linsear_write_formula_textstat, + mcalpine_eflaw_textstat, + reading_time_textstat, + score_list_cli_param, + score_list_default, short_name, + smog_index_textstat, + spache_readability_textstat, + text_standard_textstat, ) from dpk_readability.transform import ReadabilityTransform @@ -54,6 +69,33 @@ def add_input_params(self, parser: ArgumentParser) -> None: By convention a common prefix should be used for all transform-specific CLI args (e.g, noop_, pii_, etc.) """ + valid_values = { + flesch_ease_textstat, + flesch_kincaid_textstat, + gunning_fog_textstat, + smog_index_textstat, + coleman_liau_index_textstat, + automated_readability_index_textstat, + dale_chall_readability_score_textstat, + difficult_words_textstat, + linsear_write_formula_textstat, + text_standard_textstat, + spache_readability_textstat, + mcalpine_eflaw_textstat, + reading_time_textstat, + } + + def validate_scores(x): + if x.startswith("[") and x.endswith("]"): + scores = ast.literal_eval(x) + if not all(score in valid_values for score in scores): + raise argparse.ArgumentTypeError(f"Invalid scores in list. Allowed scores: {valid_values}") + return scores + elif x in valid_values: + return x + else: + raise argparse.ArgumentTypeError(f"Invalid score: {x}. Allowed scores: {valid_values}") + parser.add_argument( f"--{contents_column_name_cli_param}", type=str, @@ -61,12 +103,13 @@ def add_input_params(self, parser: ArgumentParser) -> None: default=contents_column_name_default, help="contents column name for input parquet table to transform", ) + parser.add_argument( - f"--{curriculum_cli_param}", - type=lambda x: bool(str2bool(x)), + f"--{score_list_cli_param}", + type=validate_scores, required=False, - default=curriculum_default, - help="curriculum parameter for transform; select True for curriculum learning", + default=score_list_default, + help=f"list of readability scores to be computed by the transform; valid values: {valid_values}", ) def apply_input_params(self, args: Namespace) -> bool: diff --git a/transforms/language/readability/dpk_readability/transform.py b/transforms/language/readability/dpk_readability/transform.py index 73d97b72cb..dcab7c306c 100644 --- a/transforms/language/readability/dpk_readability/transform.py +++ b/transforms/language/readability/dpk_readability/transform.py @@ -10,20 +10,18 @@ # limitations under the License. 
################################################################################ -from typing import Any +from typing import Any, Callable +import polars as pl import pyarrow as pa import textstat from data_processing.transform import AbstractTableTransform from data_processing.utils import get_logger from dpk_readability.common import ( automated_readability_index_textstat, - avg_grade_level, coleman_liau_index_textstat, contents_column_name_cli_param, contents_column_name_default, - curriculum_cli_param, - curriculum_default, dale_chall_readability_score_textstat, difficult_words_textstat, flesch_ease_textstat, @@ -32,6 +30,8 @@ linsear_write_formula_textstat, mcalpine_eflaw_textstat, reading_time_textstat, + score_list_cli_param, + score_list_default, smog_index_textstat, spache_readability_textstat, text_standard_textstat, @@ -49,134 +49,123 @@ class ReadabilityTransform(AbstractTableTransform): def __init__(self, config: dict): super().__init__(config) self.contents_column_name = config.get(contents_column_name_cli_param, contents_column_name_default) - self.curriculum = config.get(curriculum_cli_param, curriculum_default) + self.score_list = config.get(score_list_cli_param, score_list_default) + if isinstance(self.score_list, str): + self.score_list = [self.score_list] def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]: """transform function for readability_scores""" - pq_df_new = table.to_pandas() - - if self.curriculum: - ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. - pq_df_new[flesch_kincaid_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.flesch_kincaid_grade(x) - ) - - ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. - pq_df_new[gunning_fog_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.gunning_fog(x) - ) - - ######## Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text. For example if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade. - pq_df_new[automated_readability_index_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.automated_readability_index(x) - ) - - ######## Average of all grade_level metrics - # pq_df_new['avg_grade_level'] = pq_df_new[['flesch_kincaid_textstat', 'gunning_fog_textstat', 'coleman_liau_index_textstat', 'automated_readability_index_textstat', 'dale_chall_readability_score_textstat', 'linsear_write_formula_textstat']].mean(axis=1) - ######## R83_avg_GradeL - pq_df_new[avg_grade_level] = pq_df_new[ - [flesch_kincaid_textstat, gunning_fog_textstat, automated_readability_index_textstat] - ].mean(axis=1) - - ######## Returns a score for the readability of an english text for a foreign learner or English, focusing on the number of miniwords and length of sentences. It is recommended to aim for a score equal to or lower than 25. Further reading on blog https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/ - pq_df_new[mcalpine_eflaw_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.mcalpine_eflaw(x) - ) - else: - ######### textstat Readability Scores - ######### Score School level (US) Notes - ######### 100.00–90.00 5th grade Very easy to read. Easily understood by an average 11-year-old student. 
- ######### 90.0–80.0 6th grade Easy to read. Conversational English for consumers. - ######### 80.0–70.0 7th grade Fairly easy to read. - ######### 70.0–60.0 8th & 9th grade Plain English. Easily understood by 13- to 15-year-old students. - ######### 60.0–50.0 10th to 12th grade Fairly difficult to read. - ######### 50.0–30.0 College Difficult to read. - ######### 30.0–10.0 College graduate Very difficult to read. Best understood by university graduates. - ######### 10.0–0.0 Professional Extremely difficult to read. Best understood by university graduates. - ######## While the maximum score is 121.22, there is no limit on how low the score can be. A negative score is valid. - pq_df_new[flesch_ease_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.flesch_reading_ease(x) - ) - - ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. - pq_df_new[flesch_kincaid_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.flesch_kincaid_grade(x) - ) - - ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. - pq_df_new[gunning_fog_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.gunning_fog(x) - ) - - ######## Returns the SMOG index of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. Texts of fewer than 30 sentences are statistically invalid, because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result. - pq_df_new[smog_index_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.smog_index(x) - ) - - ######## Returns the grade level of the text using the Coleman-Liau Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. - pq_df_new[coleman_liau_index_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.coleman_liau_index(x) - ) - - ######## Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text. For example if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade. - pq_df_new[automated_readability_index_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.automated_readability_index(x) - ) - - ######## Different from other tests, since it uses a lookup table of the most commonly used 3000 English words. Thus it returns the grade level using the New Dale-Chall Formula. Further reading on https://en.wikipedia.org/wiki/Dale–Chall_readability_formula - ######### Score Understood by - ######### 4.9 or lower average 4th-grade student or lower - ######### 5.0–5.9 average 5th or 6th-grade student - ######### 6.0–6.9 average 7th or 8th-grade student - ######### 7.0–7.9 average 9th or 10th-grade student - ######### 8.0–8.9 average 11th or 12th-grade student - ######### 9.0–9.9 average 13th to 15th-grade (college) student - pq_df_new[dale_chall_readability_score_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.dale_chall_readability_score(x) - ) - - ######## No explanation - pq_df_new[difficult_words_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.difficult_words(x) - ) - - ######## Returns the grade level using the Linsear Write Formula. 
This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. Further reading on Wikipedia https://en.wikipedia.org/wiki/Linsear_Write - pq_df_new[linsear_write_formula_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.linsear_write_formula(x) - ) - - ######## Based upon all the above tests, returns the estimated school grade level required to understand the text. Optional float_output allows the score to be returned as a float. Defaults to False. - pq_df_new[text_standard_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.text_standard(x, float_output=True) - ) - - ######## Returns grade level of english text. Intended for text written for children up to grade four. - ######## Further reading on https://en.wikipedia.org/wiki/Spache_readability_formula - pq_df_new[spache_readability_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.spache_readability(x) - ) - - ######## Returns a score for the readability of an english text for a foreign learner or English, focusing on the number of miniwords and length of sentences. It is recommended to aim for a score equal to or lower than 25. Further reading on blog https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/ - pq_df_new[mcalpine_eflaw_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.mcalpine_eflaw(x) - ) - - ######## Returns the reading time of the given text. Assumes 14.69ms per character. - ######## Further reading in Thttps://homepages.inf.ed.ac.uk/keller/papers/cognition08a.pdf - pq_df_new[reading_time_textstat] = pq_df_new[self.contents_column_name].apply( - lambda x: textstat.reading_time(x) - ) - - ######## Average of all grade_level metrics - # pq_df_new['avg_grade_level'] = pq_df_new[['flesch_kincaid_textstat', 'gunning_fog_textstat', 'coleman_liau_index_textstat', 'automated_readability_index_textstat', 'dale_chall_readability_score_textstat', 'linsear_write_formula_textstat']].mean(axis=1) - ######## R83_avg_GradeL - pq_df_new[avg_grade_level] = pq_df_new[ - [flesch_kincaid_textstat, gunning_fog_textstat, automated_readability_index_textstat] - ].mean(axis=1) - - output_table = pa.Table.from_pandas(pq_df_new) + df = pl.from_arrow(table) + + ######### textstat Readability Scores + ######### Score School level (US) Notes + ######### 100.00–90.00 5th grade Very easy to read. Easily understood by an average 11-year-old student. + ######### 90.0–80.0 6th grade Easy to read. Conversational English for consumers. + ######### 80.0–70.0 7th grade Fairly easy to read. + ######### 70.0–60.0 8th & 9th grade Plain English. Easily understood by 13- to 15-year-old students. + ######### 60.0–50.0 10th to 12th grade Fairly difficult to read. + ######### 50.0–30.0 College Difficult to read. + ######### 30.0–10.0 College graduate Very difficult to read. Best understood by university graduates. + ######### 10.0–0.0 Professional Extremely difficult to read. Best understood by university graduates. + ######## While the maximum score is 121.22, there is no limit on how low the score can be. A negative score is valid. + + df = self._add_textstat_column( + df, self.contents_column_name, textstat.flesch_reading_ease, flesch_ease_textstat + ) + + ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. 
+ df = self._add_textstat_column( + df, self.contents_column_name, textstat.flesch_kincaid_grade, flesch_kincaid_textstat + ) + + ######## This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. + df = self._add_textstat_column(df, self.contents_column_name, textstat.gunning_fog, gunning_fog_textstat) + + ######## Returns the SMOG index of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. Texts of fewer than 30 sentences are statistically invalid, because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result. + df = self._add_textstat_column(df, self.contents_column_name, textstat.smog_index, smog_index_textstat) + + ######## Returns the grade level of the text using the Coleman-Liau Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. + df = self._add_textstat_column( + df, self.contents_column_name, textstat.coleman_liau_index, coleman_liau_index_textstat + ) + + ######## Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text. For example if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade. + df = self._add_textstat_column( + df, self.contents_column_name, textstat.automated_readability_index, automated_readability_index_textstat + ) + + ######## Different from other tests, since it uses a lookup table of the most commonly used 3000 English words. Thus it returns the grade level using the New Dale-Chall Formula. Further reading on https://en.wikipedia.org/wiki/Dale–Chall_readability_formula + ######### Score Understood by + ######### 4.9 or lower average 4th-grade student or lower + ######### 5.0–5.9 average 5th or 6th-grade student + ######### 6.0–6.9 average 7th or 8th-grade student + ######### 7.0–7.9 average 9th or 10th-grade student + ######### 8.0–8.9 average 11th or 12th-grade student + ######### 9.0–9.9 average 13th to 15th-grade (college) student + df = self._add_textstat_column( + df, self.contents_column_name, textstat.dale_chall_readability_score, dale_chall_readability_score_textstat + ) + + ######## No explanation + df = self._add_textstat_column( + df, self.contents_column_name, textstat.difficult_words, difficult_words_textstat + ) + + ######## Returns the grade level using the Linsear Write Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. Further reading on Wikipedia https://en.wikipedia.org/wiki/Linsear_Write + df = self._add_textstat_column( + df, self.contents_column_name, textstat.linsear_write_formula, linsear_write_formula_textstat + ) + + ######## Based upon all the above tests, returns the estimated school grade level required to understand the text. Optional float_output allows the score to be returned as a float. Defaults to False. + df = self._add_textstat_column( + df, self.contents_column_name, textstat.text_standard, text_standard_textstat, float_output=True + ) + + ######## Returns grade level of english text. Intended for text written for children up to grade four. 
+ ######## Further reading on https://en.wikipedia.org/wiki/Spache_readability_formula + df = self._add_textstat_column( + df, self.contents_column_name, textstat.spache_readability, spache_readability_textstat + ) + + ######## Returns a score for the readability of an English text for a foreign learner of English, focusing on the number of miniwords and length of sentences. It is recommended to aim for a score equal to or lower than 25. Further reading on blog https://strainindex.wordpress.com/2009/04/30/mcalpine-eflaw-readability-score/ + df = self._add_textstat_column(df, self.contents_column_name, textstat.mcalpine_eflaw, mcalpine_eflaw_textstat) + + ######## Returns the reading time of the given text. Assumes 14.69ms per character. + ######## Further reading at https://homepages.inf.ed.ac.uk/keller/papers/cognition08a.pdf + df = self._add_textstat_column(df, self.contents_column_name, textstat.reading_time, reading_time_textstat) + + # output_table = pa.Table.from_pandas(pq_df_new) + output_table = df.to_arrow() metadata = {"nrows": len(output_table)} logger.debug(f"Transformed one table with {len(output_table)} rows") return [output_table], metadata + + def _add_textstat_column( + self, + df: pl.DataFrame, + text_column: str, + stat_func: Callable, + new_column_name: str, + **kwargs: Any, + ) -> pl.DataFrame: + """ + Adds a new column to the Polars DataFrame by applying a textstat function to a text column. + The function executes only if the textstat score identified in the new_column_name exists + in the self.score_list variable. + + :param df: The input Polars DataFrame + :param text_column: The name of the text column + :param stat_func: A textstat function to apply + :param new_column_name: The name of the new column + :return: A new DataFrame with the additional computed column + """ + if new_column_name in self.score_list: + return df.with_columns( + df[text_column] + .map_elements(lambda x: stat_func(x, **kwargs), return_dtype=pl.Float64) + .alias(new_column_name) + ) + else: + return df diff --git a/transforms/language/readability/readability_python.ipynb b/transforms/language/readability/readability_python.ipynb index 51c73569e8..58a7d76e79 100644 --- a/transforms/language/readability/readability_python.ipynb +++ b/transforms/language/readability/readability_python.ipynb @@ -56,7 +56,7 @@ "| input_folder:str | \\${PWD}/test-data/input/ | folder that contains the input parquet files for the extreme tokenized algorithm |\n", "| output_folder:str | \\${PWD}/output/ | folder that contains the all the intermediate results and the output parquet files for the extreme tokenized algorithm |\n", "| readability_contents_column_name:str | text | name of the column that stores document text |\n", - "| readability_curriculum:str | False | curriculum parameter for transform; either True or False |" + "| readability_score_list:Union[str, list[str]] | mcalpine_eflaw_textstat | list of readability scores or a single readability score to be computed by the transform |" ] }, { @@ -69,18 +69,18 @@ "name": "stderr", "output_type": "stream", "text": [ - "11:49:27 INFO - Readability parameters are : {'contents_column_name': 'contents', 'curriculum': False}\n", - "11:49:27 INFO - pipeline id pipeline_id\n", - "11:49:27 INFO - code location None\n", - "11:49:27 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n", - "11:49:27 INFO - data factory data_ max_files -1, n_sample -1\n", - "11:49:27 INFO - data factory data_ Not using data sets, 
checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "11:49:27 INFO - orchestrator readability started at 2025-01-23 11:49:27\n", - "11:49:27 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}\n", - "11:49:27 INFO - Completed 1 files (100.0%) in 0.003 min\n", - "11:49:27 INFO - Done processing 1 files, waiting for flush() completion.\n", - "11:49:27 INFO - done flushing in 0.0 sec\n", - "11:49:27 INFO - Completed execution in 0.003 min, execution result 0\n" + "19:29:24 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['mcalpine_eflaw_textstat']}\n", + "19:29:24 INFO - pipeline id pipeline_id\n", + "19:29:24 INFO - code location None\n", + "19:29:24 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n", + "19:29:24 INFO - data factory data_ max_files -1, n_sample -1\n", + "19:29:24 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "19:29:24 INFO - orchestrator readability started at 2025-02-10 19:29:24\n", + "19:29:24 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}\n", + "19:29:25 INFO - Completed 1 files (100.0%) in 0.006 min\n", + "19:29:25 INFO - Done processing 1 files, waiting for flush() completion.\n", + "19:29:25 INFO - done flushing in 0.0 sec\n", + "19:29:25 INFO - Completed execution in 0.006 min, execution result 0\n" ] }, { @@ -99,7 +99,7 @@ " input_folder=\"test-data/input\",\n", " output_folder=\"output\",\n", " readability_contents_column_name=\"contents\",\n", - " readability_curriculum=False,\n", + " readability_score_list=[\"mcalpine_eflaw_textstat\"],\n", ").transform()\n" ] }, @@ -432,3089 +432,334 @@ "name": "stdout", "output_type": "stream", "text": [ - "shape: (2, 16)\n",
[~3,000 further removed output lines elided: the old cell output was a heavily line-wrapped box-drawing rendering of this polars DataFrame of shape (2, 16), holding the two test documents (the first a passage on Six Sigma and the DMAIC methodology) with columns contents, id, flesch_ease_textstat, flesch_kincaid_textstat, gunning_fog_textstat, smog_index_textstat, coleman_liau_index_textstat, automated_readability_index_textstat, dale_chall_readability_score_textstat, difficult_words_textstat, linsear_write_formula_textstat, text_standard_textstat, spache_readability_textstat, mcalpine_eflaw_textstat, reading_time_textstat, and avg_grade_level.]
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ M – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Aft ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ def ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ble ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sta ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ det ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ail ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ , ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ col ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nex ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ p. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ all ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eva ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vid ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ct ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ins ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ igh ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ble ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dia ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ A – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ana ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lyz ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ The ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ col ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ was ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ don ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ us ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ p ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ana ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lyz ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tho ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ghl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ whe ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ re ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ roo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cau ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ se ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ana ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ don ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ way ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ era ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dic ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ate ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ def ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ked ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ int ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ o. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ I – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rov ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rov ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eme ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ car ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rie ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ out ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nda ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fac ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ure ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ der ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ive ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ res ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ear ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ch ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ don ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ us ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ps. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Eff ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ort ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tow ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ard ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rov ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eme ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ din ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ss ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ qua ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ out ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ C – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ses ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ens ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ure ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ def ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ duc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eve ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ry ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sup ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ion ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ acc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ura ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cy. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ How ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Boo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ New ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ duc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Dev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ elo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pme ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ by ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ DMA ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ DV ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Met ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hod ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ? ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Let ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ’s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ unl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ock ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pot ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ial ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ DMA ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ DV ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ met ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hod ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ D – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ In ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ se, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ one ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ has ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ scr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ atc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ all ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eva ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ate ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gie ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ giv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ unb ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ res ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ult ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tot ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ype ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ll ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sca ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ le. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ M – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ all ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ def ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ine ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ par ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ame ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hol ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ds ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ica ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nce ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ res ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ qua ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ del ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ive ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ry ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ at ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ all ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ els ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ A – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ana ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lyz ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ana ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lyz ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ has ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ us ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ p ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ kee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ain ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ min ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ D – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Go ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sid ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ maj ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ min ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ det ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ail ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ big ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sca ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ le. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ V – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ver ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ify ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ver ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ify ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ its ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ flo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ w. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Obs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ erv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ord ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ res ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ult ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ aft ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ati ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ As ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ila ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mar ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tia ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Art ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ , ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ als ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ o ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ aut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hor ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ity ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pet ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ngs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sta ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tia ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ els ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ i.e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Whi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dua ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lly ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ inc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ se ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ els ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dep ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ end ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ int ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ere ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cap ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ies ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Whi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Fir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ory ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ era ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ic ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ics ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ it. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Yel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ low ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ond ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ory ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 10 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 15 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ssr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ngs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ und ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tan ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ adv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ anc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ps ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ met ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hod ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Gre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ en ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Fun ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dam ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ al ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ any ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ A ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ssr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ses ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ks ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pul ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sor ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ att ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ end ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hav ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tte ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ser ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - 
"│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ful ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ job ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Adv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ anc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ To ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ome ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ied ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ did ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ate ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tte ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lly ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Hig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hes ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ran ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ kin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ica ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ To ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ome ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ one ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ has ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ opt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ an ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ at ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fiv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ yea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ min ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ten ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts. 
- […old cell output elided: the remainder of a 16-column polars preview whose text was word-wrapped into unreadable three-character fragments; its readable `contents` column carried the rest of the Six Sigma sample article and a news report on air strikes in eastern Syria]
+ "shape: (2, 3)\n",
+ "┌────────────────────────────────────┬───────────────────────────────────┬─────────────────────────┐\n",
+ "│ contents                           ┆ id                                ┆ mcalpine_eflaw_textstat │\n",
+ "│ ---                                ┆ ---                               ┆ ---                     │\n",
+ "│ str                                ┆ str                               ┆ f64                     │\n",
+ "╞════════════════════════════════════╪═══════════════════════════════════╪═════════════════════════╡\n",
+ "│ Six Sigma Tips                     ┆                                   ┆                         │\n",
+ "│ achieve high quality every time,   ┆                                   ┆                         │\n",
+ "│ the desired objective of most      ┆                                   ┆                         │\n",
+ "│ every business. However to get the ┆                                   ┆                         │\n",
+ "│ desire of high quality creations   ┆                                   ┆                         │\n",
+ "│ turn into reality calls to extend  ┆                                   ┆                         │\n",
+ "│ yourself for some extra miles. Six ┆                                   ┆                         │\n",
+ "│ Sigma is a success proven business ┆                                   ┆                         │\n",
+ "│ management scheme that can assist  ┆                                   ┆                         │\n",
+ "│ you to unleash new horizons of     ┆                                   ┆                         │\n",
+ "│ business success with time bound   ┆                                   ┆                         │\n",
+ "│ rock solid results. The notion of  ┆                                   ┆                         │\n",
+ "│ six sigma was created by Motorola  ┆                                   ┆                         │\n",
+ "│ in 1986 as a result of consumers’  ┆                                   ┆                         │\n",
+ "│ complaints about sub-standard      ┆                                   ┆                         │\n",
+ "│ Motorola phones’ quality. With the ┆                                   ┆                         │\n",
+ "│ passage of time, the philosophy    ┆                                   ┆                         │\n",
+ "│ was further refined by General     ┆                                   ┆                         │\n",
+ "│ Electric and was adopted worldwide ┆                                   ┆                         │\n",
+ "│ by various manufacturing firms.    ┆                                   ┆                         │\n",
+ "│ Six Sigma Tip[s on Methodology     ┆                                   ┆                         │\n",
+ "│ says:                              ┆                                   ┆                         │\n",
+ "│ “Any type of process which does    ┆                                   ┆                         │\n",
+ "│ not result in customer’s           ┆                                   ┆                         │\n",
+ "│ satisfaction is referred as a      ┆                                   ┆                         │\n",
+ "│ “Defect” and it has to be excluded ┆                                   ┆                         │\n",
+ "│ from the process in order to get   ┆                                   ┆                         │\n",
+ "│ matchless supreme quality products ┆                                   ┆                         │\n",
+ "│ and services”.                     ┆                                   ┆                         │\n",
+ "│ Six Sigma consists of further two  ┆                                   ┆                         │\n",
+ "│ sub-methods. These are:            ┆                                   ┆                         │\n",
+ "│ - DMAIC (Define, Measure, Analyze, ┆                                   ┆                         │\n",
+ "│ Improve, Control)                  ┆                                   ┆                         │\n",
+ "│ - DMADV (Design, Measure, Analyze, ┆                                   ┆                         │\n",
+ "│ Design, Verify)                    ┆                                   ┆                         │\n",
+ "│ Amazing DMAIC Methodology to Try   ┆                                   ┆                         │\n",
+ "│ Right Now!                         ┆                                   ┆                         │\n",
+ "│ Get your all doubts vanish in few  ┆                                   ┆                         │\n",
+ "│ seconds by getting these Six Sigma ┆                                   ┆                         │\n",
+ "│ tips and more insights about DMAIC ┆                                   ┆                         │\n",
+ "│ methodology explained just right   ┆                                   ┆                         │\n",
+ "│ below:                             ┆                                   ┆                         │\n",
+ "│ DMAIC has five steps which are     ┆                                   ┆                         │\n",
+ "│ defined step wise in sequence.     ┆                                   ┆                         │\n",
+ "│ D – Define the Problem statement:  ┆                                   ┆                         │\n",
+ "│ Before trying your hands on to the ┆                                   ┆                         │\n",
+ "│ problem, it is important to define ┆                                   ┆                         │\n",
+ "│ the problem statement first.       ┆                                   ┆                         │\n",
+ "│ Follow 5W 1H rule to define your   ┆                                   ┆                         │\n",
+ "│ problem in detail whereas 5Ws are  ┆                                   ┆                         │\n",
+ "│ What, When, Why, Where, Who and 1H ┆                                   ┆                         │\n",
+ "│ is How.                            ┆                                   ┆                         │\n",
+ "│ M – Measure:                       ┆                                   ┆                         │\n",
+ "│ After defining the problem         ┆                                   ┆                         │\n",
+ "│ statement in detail, data          ┆                                   ┆                         │\n",
+ "│ collection is the next step.       ┆                                   ┆                         │\n",
+ "│ Compile all relevant data that     ┆                                   ┆                         │\n",
+ "│ provides you exact insight for the ┆                                   ┆                         │\n",
+ "│ problem diagnosis.                 ┆                                   ┆                         │\n",
+ "│ A – Analyze:                       ┆                                   ┆                         │\n",
+ "│ The data collection that was done  ┆                                   ┆                         │\n",
+ "│ in previous step will be analyzed  ┆                                   ┆                         │\n",
+ "│ thoroughly where root cause        ┆                                   ┆                         │\n",
+ "│ analysis will be done and ways to  ┆                                   ┆                         │\n",
+ "│ eradicate such defects will be     ┆                                   ┆                         │\n",
+ "│ looked into. 
┆ ┆ │\n", + "│ I – Improve ┆ ┆ │\n", + "│ Process improvement will be ┆ ┆ │\n", + "│ carried out on the foundation of ┆ ┆ │\n", + "│ facts and figures derived from ┆ ┆ │\n", + "│ research done in previous steps. ┆ ┆ │\n", + "│ Efforts will be directed towards ┆ ┆ │\n", + "│ process improvement leading to ┆ ┆ │\n", + "│ peerless quality outcomes. ┆ ┆ │\n", + "│ C – Control ┆ ┆ │\n", + "│ Implement control measures to the ┆ ┆ │\n", + "│ processes that ensure defect free ┆ ┆ │\n", + "│ product every time with superior ┆ ┆ │\n", + "│ precision and accuracy. ┆ ┆ │\n", + "│ How to Boost New Product ┆ ┆ │\n", + "│ Development by DMADV Method? ┆ ┆ │\n", + "│ Let’s unlock the potentials of ┆ ┆ │\n", + "│ DMADV method. ┆ ┆ │\n", + "│ D – Design ┆ ┆ │\n", + "│ In Design phase, one has to design ┆ ┆ │\n", + "│ the process from the scratch with ┆ ┆ │\n", + "│ all the relevant strategies that ┆ ┆ │\n", + "│ will give unbeatable results on ┆ ┆ │\n", + "│ prototype small scale. ┆ ┆ │\n", + "│ M – Measure ┆ ┆ │\n", + "│ Measure all the defined parameters ┆ ┆ │\n", + "│ that holds significance with ┆ ┆ │\n", + "│ respect to high quality delivery ┆ ┆ │\n", + "│ at all levels. ┆ ┆ │\n", + "│ A – Analyze ┆ ┆ │\n", + "│ Analyze what has been measured in ┆ ┆ │\n", + "│ previous step keeping design ┆ ┆ │\n", + "│ constraints in mind. ┆ ┆ │\n", + "│ D – Design ┆ ┆ │\n", + "│ Go for designing considering major ┆ ┆ │\n", + "│ and minor details of process on ┆ ┆ │\n", + "│ big scale. ┆ ┆ │\n", + "│ V – Verify ┆ ┆ │\n", + "│ Verify the process and its flow. ┆ ┆ │\n", + "│ Observe and record the results ┆ ┆ │\n", + "│ after successful implementation of ┆ ┆ │\n", + "│ the designed process. ┆ ┆ │\n", + "│ Getting Smart with Six Sigma Belt ┆ ┆ │\n", + "│ System ┆ ┆ │\n", + "│ As similar as Martial Arts system, ┆ ┆ │\n", + "│ Six Sigma also comes with Belt ┆ ┆ │\n", + "│ system with different level of ┆ ┆ │\n", + "│ authority system along with ┆ ┆ │\n", + "│ experience and level of competency ┆ ┆ │\n", + "│ and trainings. Employees start ┆ ┆ │\n", + "│ with initial levels i.e. White ┆ ┆ │\n", + "│ Belt System and then employees can ┆ ┆ │\n", + "│ gradually increase their levels ┆ ┆ │\n", + "│ depending on their interest, ┆ ┆ │\n", + "│ experience and capabilities. ┆ ┆ │\n", + "│ Six Sigma White Belt System ┆ ┆ │\n", + "│ Level: First Level of Belt System ┆ ┆ │\n", + "│ Knowledge Criteria: Mandatory to ┆ ┆ │\n", + "│ complete several hours of training ┆ ┆ │\n", + "│ about Six Sigma to get the basic ┆ ┆ │\n", + "│ clarity about the basics of it. ┆ ┆ │\n", + "│ Six Sigma Tips Yellow Belt ┆ ┆ │\n", + "│ Level: Second level of Six Sigma ┆ ┆ │\n", + "│ Knowledge Criteria: Mandatory to ┆ ┆ │\n", + "│ complete 10 to 15 hours Six Sigma ┆ ┆ │\n", + "│ classroom trainings to understand ┆ ┆ │\n", + "│ a bit advance steps methods. ┆ ┆ │\n", + "│ Six Sigma Tips: Green Belt ┆ ┆ │\n", + "│ Level: Fundamental of success for ┆ ┆ │\n", + "│ any project. ┆ ┆ │\n", + "│ Knowledge Criteria: A classroom ┆ ┆ │\n", + "│ training session consisting for ┆ ┆ │\n", + "│ weeks is compulsory to attend and ┆ ┆ │\n", + "│ have to pass a written exam along ┆ ┆ │\n", + "│ with serving in a Six Sigma ┆ ┆ │\n", + "│ project team. ┆ ┆ │\n", + "│ Six Sigma Tips: Black Belt ┆ ┆ │\n", + "│ Level: Full time job for Black ┆ ┆ │\n", + "│ Belt. Advance level of Six Sigma. 
┆ ┆ │\n", + "│ Knowledge Criteria: To become a ┆ ┆ │\n", + "│ certified Six Sigma Black Belt, ┆ ┆ │\n", + "│ candidates must pass a written ┆ ┆ │\n", + "│ exam and successfully complete ┆ ┆ │\n", + "│ projects. ┆ ┆ │\n", + "│ Six Sigma Tips: Master Black Belt ┆ ┆ │\n", + "│ Level: Highest Six Sigma Belt ┆ ┆ │\n", + "│ ranking and the certification is ┆ ┆ │\n", + "│ in high demand for customers. ┆ ┆ │\n", + "│ Knowledge Criteria: To become Six ┆ ┆ │\n", + "│ Sigma Master Black Belt, one has ┆ ┆ │\n", + "│ to opt an experience of at least ┆ ┆ │\n", + "│ five years as Black Belt along ┆ ┆ │\n", + "│ with successful completion of ┆ ┆ │\n", + "│ minimum of ten six sigma projects. ┆ ┆ │\n", + "│ The role of Master Black Belts is ┆ ┆ │\n", + "│ not only to ensure active ┆ ┆ │\n", + "│ participation and efficient ┆ ┆ │\n", + "│ problem solving in six sigma ┆ ┆ │\n", + "│ projects but to trickle down six ┆ ┆ │\n", + "│ sigma culture and nourish six ┆ ┆ │\n", + "│ sigma techniques among employees ┆ ┆ │\n", + "│ is one of the objective of Six ┆ ┆ │\n", + "│ Sigma Master Black Belt. ┆ ┆ │\n", + "│ What Not Everybody Ought to Know ┆ ┆ │\n", + "│ About Six Sigma ┆ ┆ │\n", + "│ Do you wish more people bought ┆ ┆ │\n", + "│ your product, or your sales to be ┆ ┆ │\n", + "│ sky-rocketed without investing ┆ ┆ │\n", + "│ years and years? Are you still ┆ ┆ │\n", + "│ wasting your time on reoccurring ┆ ┆ │\n", + "│ problems daily? Let us unlock few ┆ ┆ │\n", + "│ of the many benefits of Six Sigma ┆ ┆ │\n", + "│ which your consultant will never ┆ ┆ │\n", + "│ tell you about. ┆ ┆ │\n", + "│ - Proven Way to Enhance Customer ┆ ┆ │\n", + "│ Retention: ┆ ┆ │\n", + "│ A single dissatisfied customer can ┆ ┆ │\n", + "│ take your millions of business ┆ ┆ │\n", + "│ away and on the other hand one ┆ ┆ │\n", + "│ loyal customer can bring in the ┆ ┆ │\n", + "│ opportunities to invite other new ┆ ┆ │\n", + "│ customers as well so to retain ┆ ┆ │\n", + "│ your regular customers should be ┆ ┆ │\n", + "│ your top most priority by all ┆ ┆ │\n", + "│ means. ┆ ┆ │\n", + "│ - How to be able to Achieve ┆ ┆ │\n", + "│ Targets in an Easy Way? ┆ ┆ │\n", + "│ Set SMART goals to accomplish ┆ ┆ │\n", + "│ anything whereas the term SMART is ┆ ┆ │\n", + "│ an abbreviation whose full form is ┆ ┆ │\n", + "│ Specific, Measurable, Achievable, ┆ ┆ │\n", + "│ Relevant, Timebound. Before ┆ ┆ │\n", + "│ setting SMART objectives, it is ┆ ┆ │\n", + "│ important to know every employees’ ┆ ┆ │\n", + "│ training needs, learning agility ┆ ┆ │\n", + "│ and performance curve accordingly. ┆ ┆ │\n", + "│ - The Best Ever Planning Solution ┆ ┆ │\n", + "│ for Any Business ┆ ┆ │\n", + "│ Six Sigma tips help in great ways ┆ ┆ │\n", + "│ to get rid of distractions and to ┆ ┆ │\n", + "│ maintain your focus in ┆ ┆ │\n", + "│ unidirectional way. SWOT ┆ ┆ │\n", + "│ (Strength, Weakness, Opportunities ┆ ┆ │\n", + "│ and Threats) analysis remains your ┆ ┆ │\n", + "│ eyes to the areas of improvements ┆ ┆ │\n", + "│ which eventually can turn your ┆ ┆ │\n", + "│ business a worth more than the ┆ ┆ │\n", + "│ present situation. ┆ ┆ │\n", + "│ - How Big Firms Are Able To Reduce ┆ ┆ │\n", + "│ Their Cycle Times? ┆ ┆ │\n", + "│ According to numerous business ┆ ┆ │\n", + "│ reports, 35% reduction in cycle ┆ ┆ │\n", + "│ times have been observed after ┆ ┆ │\n", + "│ implementation of Six Sigma. 
On ┆ ┆ │\n", + "│ the contrary, firms who haven’t ┆ ┆ │\n", + "│ implemented six sigma methodology ┆ ┆ │\n", + "│ complaint about delays in their ┆ ┆ │\n", + "│ projects and cycle time ┆ ┆ │\n", + "│ completion, less employee ┆ ┆ │\n", + "│ engagement or totally zero ┆ ┆ │\n", + "│ employee motivational levels. ┆ ┆ │\n", + "│ Real Life Amazing Results of Six ┆ ┆ │\n", + "│ Sigma: 3M Success Story ┆ ┆ │\n", + "│ 3M reported to have been ┆ ┆ │\n", + "│ celebrated their success credited ┆ ┆ │\n", + "│ by Six Sigma back in early 2000’s. ┆ ┆ │\n", + "│ Spare few minutes to have a glance ┆ ┆ │\n", + "│ on impressive ground breaking ┆ ┆ │\n", + "│ results made real only through Six ┆ ┆ │\n", + "│ Sigma. ┆ ┆ │\n", + "│ - 50% reduction in waste ┆ ┆ │\n", + "│ generation ┆ ┆ │\n", + "│ - 67% reduction in GreenHouse Gas ┆ ┆ │\n", + "│ (GHG) emissions ┆ ┆ │\n", + "│ - 37% water recycles usage ┆ ┆ │\n", + "│ - 8% reduction in toxic air ┆ ┆ │\n", + "│ emissions ┆ ┆ │\n", + "│ If Six Sigma tips can do wonders ┆ ┆ │\n", + "│ for 3M then why not you? Let’s ┆ ┆ │\n", + "│ banish the hindrances and fears in ┆ ┆ │\n", + "│ your way to become world class ┆ ┆ │\n", + "│ business. ┆ ┆ │\n", + "│ Israeli night raids targeting arms ┆ ┆ │\n", + "│ eastern Syria killed at least five ┆ ┆ │\n", + "│ soldiers and 11 allied fighters, ┆ ┆ │\n", + "│ the Syrian Observatory for Human ┆ ┆ │\n", + "│ Rights said on Wednesday. ┆ ┆ │\n", + "│ The Israeli air force carried out ┆ ┆ │\n", + "│ more than 18 strikes against ┆ ┆ │\n", + "│ multiple targets in an area ┆ ┆ │\n", + "│ stretching from the eastern town ┆ ┆ │\n", + "│ of Deir ez-Zor to the Boukamal ┆ ┆ │\n", + "│ desert at the Syrian-Iraqi border, ┆ ┆ │\n", + "│ according to the Britain-based war ┆ ┆ │\n", + "│ monitor. ┆ ┆ │\n", + "│ The raids killed five Syrian ┆ ┆ │\n", + "│ soldiers and 11 allied fighters ┆ ┆ │\n", + "│ belonging to the Iranian ┆ ┆ │\n", + "│ Revolutionary Guards, Lebanese ┆ ┆ │\n", + "│ Hezbollah and the Fatimid Brigade, ┆ ┆ │\n", + "│ which includes pro-Iranian Afghan ┆ ┆ │\n", + "│ fighters, the Observatory said, ┆ ┆ │\n", + "│ although their nationalities and a ┆ ┆ │\n", + "│ precise breakdown were not ┆ ┆ │\n", + "│ immediately known. ┆ ┆ │\n", + "│ The Syrian state news agency SANA ┆ ┆ │\n", + "│ reported the strikes but without ┆ ┆ │\n", + "│ giving further details. ┆ ┆ │\n", + "│ “At 1:10 am, the Israeli enemy ┆ ┆ │\n", + "│ carried out an aerial assault on ┆ ┆ │\n", + "│ the town of Deir ez-Zor and the ┆ ┆ │\n", + "│ Boukamal region,” SANA said, ┆ ┆ │\n", + "│ citing a military source. ┆ ┆ │\n", + "│ “The results of the aggression are ┆ ┆ │\n", + "│ currently being verified,” it ┆ ┆ │\n", + "│ added. ┆ ┆ │\n", + "│ It was the second wave of Israeli ┆ ┆ │\n", + "│ raids in Syria in less than a ┆ ┆ │\n", + "│ week. ┆ ┆ │\n", + "│ The last strikes on January 7 ┆ ┆ │\n", + "│ targeted positions in southern ┆ ┆ │\n", + "│ Syria and south of the capital ┆ ┆ │\n", + "│ Damascus, killing three pro-Iran ┆ ┆ │\n", + "│ fighters. ┆ ┆ │\n", + "│ Israel routinely carries out raids ┆ ┆ │\n", + "│ in Syria, mostly against targets ┆ ┆ │\n", + "│ affiliated with Iran in what it ┆ ┆ │\n", + "│ says is a bid to prevent its arch ┆ ┆ │\n", + "│ foe from securing further foothold ┆ ┆ │\n", + "│ along its borders. ┆ ┆ │\n", + "│ Iran has members of its own ┆ ┆ │\n", + "│ military as well as fighters from ┆ ┆ │\n", + "│ a variety of nationalities ┆ ┆ │\n", + "│ fighting with militias it supports ┆ ┆ │\n", + "│ deployed across Syria. 
┆ ┆ │\n",
+ "│ Israel hit around 50 targets in ┆ ┆ │\n",
+ "│ Syria in 2020, according to an ┆ ┆ │\n",
+ "│ annual report released in late ┆ ┆ │\n",
+ "│ December by the Israeli military. ┆ ┆ │\n",
+ "│ The Israeli army has carried out ┆ ┆ │\n",
+ "│ hundreds of air and missile ┆ ┆ │\n",
+ "│ strikes on Syria since the civil ┆ ┆ │\n",
+ "│ war broke out in 2011, targeting ┆ ┆ │\n",
+ "│ Iranian and Lebanese Hezbollah ┆ ┆ │\n",
+ "│ forces as well as government ┆ ┆ │\n",
+ "│ troops. ┆ ┆ │\n",
+ "│ The Jewish state rarely ┆ ┆ │\n",
+ "│ acknowledges individual strikes. ┆ ┆ │\n",
+ "│ The Syrian Observer has not ┆ ┆ │\n",
+ "│ verified the content of this ┆ ┆ │\n",
+ "│ story. Responsibility for the ┆ ┆ │\n",
+ "│ information and views set out in ┆ ┆ │\n",
+ "│ this article lies entirely with ┆ ┆ │\n",
+ "│ the author. ┆ ┆ │\n",
+ "└────────────────────────────────────┴───────────────────────────────────┴─────────────────────────┘\n"
]
}
],
diff --git a/transforms/language/readability/readability_ray.ipynb b/transforms/language/readability/readability_ray.ipynb
index f6f69266ca..5c18d61b95 100644
--- a/transforms/language/readability/readability_ray.ipynb
+++ b/transforms/language/readability/readability_ray.ipynb
@@ -56,7 +56,7 @@
 "| input_folder:str | \${PWD}/test-data/input/ | folder that contains the input parquet files for the readability transform |\n",
 "| output_folder:str | \${PWD}/output/ | folder that contains all the intermediate results and the output parquet files for the readability transform |\n",
 "| readability_contents_column_name:str | text | name of the column that stores document text |\n",
- "| readability_curriculum:str | False | curriculum parameter for transform; either True or False |"
+ "| readability_score_list:Union[str, list[str]] | mcalpine_eflaw_textstat | list of readability scores or a single readability score to be computed by the transform |"
]
},
{
@@ -69,25 +69,25 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
- "11:48:14 INFO - Readability parameters are : {'contents_column_name': 'contents', 'curriculum': False}\n",
- "11:48:14 INFO - pipeline id pipeline_id\n",
- "11:48:14 INFO - code location None\n",
- "11:48:14 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
- "11:48:14 INFO - actor creation delay 0\n",
- "11:48:14 INFO - job details {'job category': 'preprocessing', 'job name': 'readability', 'job type': 'ray', 'job id': 'job_id'}\n",
- "11:48:14 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
- "11:48:14 INFO - data factory data_ max_files -1, n_sample -1\n",
- "11:48:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "11:48:14 INFO - Running locally\n",
- "2025-01-23 11:48:16,086\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:17 INFO - orchestrator started at 2025-01-23 11:48:17\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:17 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:17 INFO - Cluster resources: {'cpus': 28, 'gpus': 0, 'memory': 30.58273086603731, 'object_store': 15.291365432552993}\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:17 INFO - Number of workers - 1 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:18 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:18 INFO - Completed processing 1 files in 0.003 min\n", - "\u001b[36m(orchestrate pid=1194279)\u001b[0m 11:48:18 INFO - done flushing in 0.001 sec\n", - "11:48:28 INFO - Completed execution in 0.236 min, execution result 0\n" + "19:30:01 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['mcalpine_eflaw_textstat']}\n", + "19:30:01 INFO - pipeline id pipeline_id\n", + "19:30:01 INFO - code location None\n", + "19:30:01 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "19:30:01 INFO - actor creation delay 0\n", + "19:30:01 INFO - job details {'job category': 'preprocessing', 'job name': 'readability', 'job type': 'ray', 'job id': 'job_id'}\n", + "19:30:01 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n", + "19:30:01 INFO - data factory data_ max_files -1, n_sample -1\n", + "19:30:01 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "19:30:01 INFO - Running locally\n", + "2025-02-10 19:30:02,130\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:03 INFO - orchestrator started at 2025-02-10 19:30:03\n", + "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:03 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}\n", + "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:03 INFO - Cluster resources: {'cpus': 28, 'gpus': 0, 'memory': 32.52414321899414, 'object_store': 16.26207160949707}\n", + "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:03 INFO - Number of workers - 1 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:04 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n",
+ "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:04 INFO - Completed processing 1 files in 0.002 min\n",
+ "\u001b[36m(orchestrate pid=263463)\u001b[0m 19:30:04 INFO - done flushing in 0.001 sec\n",
+ "19:30:14 INFO - Completed execution in 0.225 min, execution result 0\n"
]
},
{
@@ -106,7 +106,7 @@
"     input_folder=\"test-data/input\",\n",
"     output_folder=\"output\",\n",
"     readability_contents_column_name=\"contents\",\n",
- "     readability_curriculum=False,\n",
+ "     readability_score_list=[\"mcalpine_eflaw_textstat\"],\n",
"     run_locally=True,\n",
").transform()\n"
]
},
@@ -440,3089 +440,334 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "shape: (2, 16)\n",
- [… old output elided (the hunk removes 3,089 lines): the two test documents rendered as a polars table whose sixteen columns were squeezed to three characters each, splitting every word across rows and leaving the text unreadable; the columns were contents, id, flesch_ease_textstat, flesch_kincaid_textstat, gunning_fog_textstat, smog_index_textstat, coleman_liau_index_textstat, automated_readability_index_textstat, dale_chall_readability_score_textstat, difficult_words_textstat, linsear_write_formula_textstat, text_standard_textstat, spache_readability_textstat, mcalpine_eflaw_textstat, reading_time_textstat, and avg_grade_level …]
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ D – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Go ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sid ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ maj ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ min ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ det ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ail ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ big ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sca ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ le. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ V – ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ver ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ify ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ver ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ify ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ its ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ flo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ w. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Obs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ erv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ord ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ res ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ult ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ aft ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ati ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ des ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ign ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ As ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ila ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mar ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tia ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Art ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ , ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ als ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ o ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ aut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hor ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ity ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pet ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ngs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sta ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tia ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ els ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ i.e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Whi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dua ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lly ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ inc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ se ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ els ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dep ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ end ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ int ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ere ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cap ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ies ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Whi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Fir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sys ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ory ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ era ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ic ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ics ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ it. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Yel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ low ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ond ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ory ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 10 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 15 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ssr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ngs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ und ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tan ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ adv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ anc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ste ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ps ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ met ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hod ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Gre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ en ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Fun ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dam ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ al ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ any ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ A ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ssr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ses ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ks ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pul ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sor ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ att ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ end ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hav ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tte ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ser ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - 
"│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ful ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ job ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Adv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ anc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ To ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ome ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ied ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ did ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ate ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s a ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tte ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lly ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ te ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Lev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ el: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Hig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hes ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ran ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ kin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cer ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ica ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dem ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dge ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ia: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ To ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ome ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ one ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ has ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ opt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ an ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ exp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ enc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ at ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ st ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ fiv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ yea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ alo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ suc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ces ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sfu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ com ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ min ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ten ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ The ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rol ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ not ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ onl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ens ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ure ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ act ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ive ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ par ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tic ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ipa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eff ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ici ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ble ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sol ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ vin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ jec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ but ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ckl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dow ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cul ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ris ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tec ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hni ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ que ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ amo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ one ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ obj ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ive ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 
│\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ck ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Wha ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Not ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Eve ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ryb ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ody ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Oug ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ht ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ w ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Do ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mor ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ peo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ple ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ght ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ duc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ or ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sal ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sky ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ -ro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cke ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ted ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hou ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ inv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ est ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ yea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ yea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rs? 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Are ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sti ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ll ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ was ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ reo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ccu ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ble ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ms ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dai ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ly? ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Let ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ us ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ unl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ock ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ few ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ben ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ efi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ts ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ whi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ch ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ con ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sul ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tan ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ut. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ - ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Pro ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ven ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Way ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Enh ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ anc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ret ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ent ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ion ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ : ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ A ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gle ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ isf ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ied ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tak ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mil ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lio ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ns ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ine ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ss ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ awa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oth ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ han ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ d ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ one ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ al ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ can ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ bri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ opp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ort ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ uni ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tie ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ inv ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ite ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ oth ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ er ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ new ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ wel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ so ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ret ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ain ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", 
- "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ reg ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ula ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tom ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ers ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sho ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ uld ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ top ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mos ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ pri ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ori ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ty ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ by ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ all ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ns. ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ - ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ How ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ be ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ach ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ iev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tar ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ an ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Eas ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Way ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ? 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Set ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ SMA ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ RT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ goa ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ls ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ acc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ omp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ h ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ any ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ thi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ whe ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ the ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ SMA ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ RT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ an ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abb ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ iat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ion ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ who ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ se ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ful ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ l ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ m ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Spe ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cif ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ic, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Mea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ sur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Ach ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ iev ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ abl ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ e, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Rel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eva ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nt, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Tim ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ebo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ und ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bef ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ore ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ set ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tin ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ g ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ SMA ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ RT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ obj ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ive ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ it ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ is ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ imp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ort ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ant ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ kno ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ w ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ eve ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ry ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ emp ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ loy ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ees ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ’ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ini ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nee ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ds, ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lea ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rni ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ agi ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ lit ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ y ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ per ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ man ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ce ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cur ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ve ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ acc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ord ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ing ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ly. 
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ - ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ The ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bes ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ t ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Eve ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Pla ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nni ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ng ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sol ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ uti ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ on ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ for ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Any ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Bus ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ine ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ss ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Six ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ Sig ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ma ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tip ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ hel ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ p ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ gre ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ at ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ way ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ s ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ get ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ rid ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ of ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dis ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ tra ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ cti ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ons ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ and ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ to ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ mai ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ nta ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ you ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ foc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ us ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ in ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ uni ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ dir ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ect ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ion ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ al ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ way ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ . 
- [… previous cell output elided: the same two-row table as the new output below, but rendered by polars with sixteen ~3-character-wide columns, so every word wrapped vertically ("SWO / T / (St / ren / gth / …") and the layout was unreadable; the old table's bottom border and the replacement output follow …]
┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴────────┘\n" + "shape: (2, 3)\n", + "┌────────────────────────────────────┬───────────────────────────────────┬─────────────────────────┐\n", + "│ contents ┆ id ┆ mcalpine_eflaw_textstat │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ f64 │\n", + "╞════════════════════════════════════╪═══════════════════════════════════╪═════════════════════════╡\n", + "│ Six Sigma Tips ┆ ┆ │\n", + "│ achieve high quality every time, ┆ ┆ │\n", + "│ the desired objective of most ┆ ┆ │\n", + "│ every business. However to get the ┆ ┆ │\n", + "│ desire of high quality creations ┆ ┆ │\n", + "│ turn into reality calls to extend ┆ ┆ │\n", + "│ yourself for some extra miles. Six ┆ ┆ │\n", + "│ Sigma is a success proven business ┆ ┆ │\n", + "│ management scheme that can assist ┆ ┆ │\n", + "│ you to unleash new horizons of ┆ ┆ │\n", + "│ business success with time bound ┆ ┆ │\n", + "│ rock solid results. The notion of ┆ ┆ │\n", + "│ six sigma was created by Motorola ┆ ┆ │\n", + "│ in 1986 as a result of consumers’ ┆ ┆ │\n", + "│ complaints about sub-standard ┆ ┆ │\n", + "│ Motorola phones’ quality. With the ┆ ┆ │\n", + "│ passage of time, the philosophy ┆ ┆ │\n", + "│ was further refined by General ┆ ┆ │\n", + "│ Electric and was adopted worldwide ┆ ┆ │\n", + "│ by various manufacturing firms. ┆ ┆ │\n", + "│ Six Sigma Tip[s on Methodology ┆ ┆ │\n", + "│ says: ┆ ┆ │\n", + "│ “Any type of process which does ┆ ┆ │\n", + "│ not result in customer’s ┆ ┆ │\n", + "│ satisfaction is referred as a ┆ ┆ │\n", + "│ “Defect” and it has to be excluded ┆ ┆ │\n", + "│ from the process in order to get ┆ ┆ │\n", + "│ matchless supreme quality products ┆ ┆ │\n", + "│ and services”. ┆ ┆ │\n", + "│ Six Sigma consists of further two ┆ ┆ │\n", + "│ sub-methods. These are: ┆ ┆ │\n", + "│ - DMAIC (Define, Measure, Analyze, ┆ ┆ │\n", + "│ Improve, Control) ┆ ┆ │\n", + "│ - DMADV (Design, Measure, Analyze, ┆ ┆ │\n", + "│ Design, Verify) ┆ ┆ │\n", + "│ Amazing DMAIC Methodology to Try ┆ ┆ │\n", + "│ Right Now! ┆ ┆ │\n", + "│ Get your all doubts vanish in few ┆ ┆ │\n", + "│ seconds by getting these Six Sigma ┆ ┆ │\n", + "│ tips and more insights about DMAIC ┆ ┆ │\n", + "│ methodology explained just right ┆ ┆ │\n", + "│ below: ┆ ┆ │\n", + "│ DMAIC has five steps which are ┆ ┆ │\n", + "│ defined step wise in sequence. ┆ ┆ │\n", + "│ D – Define the Problem statement: ┆ ┆ │\n", + "│ Before trying your hands on to the ┆ ┆ │\n", + "│ problem, it is important to define ┆ ┆ │\n", + "│ the problem statement first. ┆ ┆ │\n", + "│ Follow 5W 1H rule to define your ┆ ┆ │\n", + "│ problem in detail whereas 5Ws are ┆ ┆ │\n", + "│ What, When, Why, Where, Who and 1H ┆ ┆ │\n", + "│ is How. ┆ ┆ │\n", + "│ M – Measure: ┆ ┆ │\n", + "│ After defining the problem ┆ ┆ │\n", + "│ statement in detail, data ┆ ┆ │\n", + "│ collection is the next step. ┆ ┆ │\n", + "│ Compile all relevant data that ┆ ┆ │\n", + "│ provides you exact insight for the ┆ ┆ │\n", + "│ problem diagnosis. ┆ ┆ │\n", + "│ A – Analyze: ┆ ┆ │\n", + "│ The data collection that was done ┆ ┆ │\n", + "│ in previous step will be analyzed ┆ ┆ │\n", + "│ thoroughly where root cause ┆ ┆ │\n", + "│ analysis will be done and ways to ┆ ┆ │\n", + "│ eradicate such defects will be ┆ ┆ │\n", + "│ looked into. 
┆ ┆ │\n", + "│ I – Improve ┆ ┆ │\n", + "│ Process improvement will be ┆ ┆ │\n", + "│ carried out on the foundation of ┆ ┆ │\n", + "│ facts and figures derived from ┆ ┆ │\n", + "│ research done in previous steps. ┆ ┆ │\n", + "│ Efforts will be directed towards ┆ ┆ │\n", + "│ process improvement leading to ┆ ┆ │\n", + "│ peerless quality outcomes. ┆ ┆ │\n", + "│ C – Control ┆ ┆ │\n", + "│ Implement control measures to the ┆ ┆ │\n", + "│ processes that ensure defect free ┆ ┆ │\n", + "│ product every time with superior ┆ ┆ │\n", + "│ precision and accuracy. ┆ ┆ │\n", + "│ How to Boost New Product ┆ ┆ │\n", + "│ Development by DMADV Method? ┆ ┆ │\n", + "│ Let’s unlock the potentials of ┆ ┆ │\n", + "│ DMADV method. ┆ ┆ │\n", + "│ D – Design ┆ ┆ │\n", + "│ In Design phase, one has to design ┆ ┆ │\n", + "│ the process from the scratch with ┆ ┆ │\n", + "│ all the relevant strategies that ┆ ┆ │\n", + "│ will give unbeatable results on ┆ ┆ │\n", + "│ prototype small scale. ┆ ┆ │\n", + "│ M – Measure ┆ ┆ │\n", + "│ Measure all the defined parameters ┆ ┆ │\n", + "│ that holds significance with ┆ ┆ │\n", + "│ respect to high quality delivery ┆ ┆ │\n", + "│ at all levels. ┆ ┆ │\n", + "│ A – Analyze ┆ ┆ │\n", + "│ Analyze what has been measured in ┆ ┆ │\n", + "│ previous step keeping design ┆ ┆ │\n", + "│ constraints in mind. ┆ ┆ │\n", + "│ D – Design ┆ ┆ │\n", + "│ Go for designing considering major ┆ ┆ │\n", + "│ and minor details of process on ┆ ┆ │\n", + "│ big scale. ┆ ┆ │\n", + "│ V – Verify ┆ ┆ │\n", + "│ Verify the process and its flow. ┆ ┆ │\n", + "│ Observe and record the results ┆ ┆ │\n", + "│ after successful implementation of ┆ ┆ │\n", + "│ the designed process. ┆ ┆ │\n", + "│ Getting Smart with Six Sigma Belt ┆ ┆ │\n", + "│ System ┆ ┆ │\n", + "│ As similar as Martial Arts system, ┆ ┆ │\n", + "│ Six Sigma also comes with Belt ┆ ┆ │\n", + "│ system with different level of ┆ ┆ │\n", + "│ authority system along with ┆ ┆ │\n", + "│ experience and level of competency ┆ ┆ │\n", + "│ and trainings. Employees start ┆ ┆ │\n", + "│ with initial levels i.e. White ┆ ┆ │\n", + "│ Belt System and then employees can ┆ ┆ │\n", + "│ gradually increase their levels ┆ ┆ │\n", + "│ depending on their interest, ┆ ┆ │\n", + "│ experience and capabilities. ┆ ┆ │\n", + "│ Six Sigma White Belt System ┆ ┆ │\n", + "│ Level: First Level of Belt System ┆ ┆ │\n", + "│ Knowledge Criteria: Mandatory to ┆ ┆ │\n", + "│ complete several hours of training ┆ ┆ │\n", + "│ about Six Sigma to get the basic ┆ ┆ │\n", + "│ clarity about the basics of it. ┆ ┆ │\n", + "│ Six Sigma Tips Yellow Belt ┆ ┆ │\n", + "│ Level: Second level of Six Sigma ┆ ┆ │\n", + "│ Knowledge Criteria: Mandatory to ┆ ┆ │\n", + "│ complete 10 to 15 hours Six Sigma ┆ ┆ │\n", + "│ classroom trainings to understand ┆ ┆ │\n", + "│ a bit advance steps methods. ┆ ┆ │\n", + "│ Six Sigma Tips: Green Belt ┆ ┆ │\n", + "│ Level: Fundamental of success for ┆ ┆ │\n", + "│ any project. ┆ ┆ │\n", + "│ Knowledge Criteria: A classroom ┆ ┆ │\n", + "│ training session consisting for ┆ ┆ │\n", + "│ weeks is compulsory to attend and ┆ ┆ │\n", + "│ have to pass a written exam along ┆ ┆ │\n", + "│ with serving in a Six Sigma ┆ ┆ │\n", + "│ project team. ┆ ┆ │\n", + "│ Six Sigma Tips: Black Belt ┆ ┆ │\n", + "│ Level: Full time job for Black ┆ ┆ │\n", + "│ Belt. Advance level of Six Sigma. 
┆ ┆ │\n", + "│ Knowledge Criteria: To become a ┆ ┆ │\n", + "│ certified Six Sigma Black Belt, ┆ ┆ │\n", + "│ candidates must pass a written ┆ ┆ │\n", + "│ exam and successfully complete ┆ ┆ │\n", + "│ projects. ┆ ┆ │\n", + "│ Six Sigma Tips: Master Black Belt ┆ ┆ │\n", + "│ Level: Highest Six Sigma Belt ┆ ┆ │\n", + "│ ranking and the certification is ┆ ┆ │\n", + "│ in high demand for customers. ┆ ┆ │\n", + "│ Knowledge Criteria: To become Six ┆ ┆ │\n", + "│ Sigma Master Black Belt, one has ┆ ┆ │\n", + "│ to opt an experience of at least ┆ ┆ │\n", + "│ five years as Black Belt along ┆ ┆ │\n", + "│ with successful completion of ┆ ┆ │\n", + "│ minimum of ten six sigma projects. ┆ ┆ │\n", + "│ The role of Master Black Belts is ┆ ┆ │\n", + "│ not only to ensure active ┆ ┆ │\n", + "│ participation and efficient ┆ ┆ │\n", + "│ problem solving in six sigma ┆ ┆ │\n", + "│ projects but to trickle down six ┆ ┆ │\n", + "│ sigma culture and nourish six ┆ ┆ │\n", + "│ sigma techniques among employees ┆ ┆ │\n", + "│ is one of the objective of Six ┆ ┆ │\n", + "│ Sigma Master Black Belt. ┆ ┆ │\n", + "│ What Not Everybody Ought to Know ┆ ┆ │\n", + "│ About Six Sigma ┆ ┆ │\n", + "│ Do you wish more people bought ┆ ┆ │\n", + "│ your product, or your sales to be ┆ ┆ │\n", + "│ sky-rocketed without investing ┆ ┆ │\n", + "│ years and years? Are you still ┆ ┆ │\n", + "│ wasting your time on reoccurring ┆ ┆ │\n", + "│ problems daily? Let us unlock few ┆ ┆ │\n", + "│ of the many benefits of Six Sigma ┆ ┆ │\n", + "│ which your consultant will never ┆ ┆ │\n", + "│ tell you about. ┆ ┆ │\n", + "│ - Proven Way to Enhance Customer ┆ ┆ │\n", + "│ Retention: ┆ ┆ │\n", + "│ A single dissatisfied customer can ┆ ┆ │\n", + "│ take your millions of business ┆ ┆ │\n", + "│ away and on the other hand one ┆ ┆ │\n", + "│ loyal customer can bring in the ┆ ┆ │\n", + "│ opportunities to invite other new ┆ ┆ │\n", + "│ customers as well so to retain ┆ ┆ │\n", + "│ your regular customers should be ┆ ┆ │\n", + "│ your top most priority by all ┆ ┆ │\n", + "│ means. ┆ ┆ │\n", + "│ - How to be able to Achieve ┆ ┆ │\n", + "│ Targets in an Easy Way? ┆ ┆ │\n", + "│ Set SMART goals to accomplish ┆ ┆ │\n", + "│ anything whereas the term SMART is ┆ ┆ │\n", + "│ an abbreviation whose full form is ┆ ┆ │\n", + "│ Specific, Measurable, Achievable, ┆ ┆ │\n", + "│ Relevant, Timebound. Before ┆ ┆ │\n", + "│ setting SMART objectives, it is ┆ ┆ │\n", + "│ important to know every employees’ ┆ ┆ │\n", + "│ training needs, learning agility ┆ ┆ │\n", + "│ and performance curve accordingly. ┆ ┆ │\n", + "│ - The Best Ever Planning Solution ┆ ┆ │\n", + "│ for Any Business ┆ ┆ │\n", + "│ Six Sigma tips help in great ways ┆ ┆ │\n", + "│ to get rid of distractions and to ┆ ┆ │\n", + "│ maintain your focus in ┆ ┆ │\n", + "│ unidirectional way. SWOT ┆ ┆ │\n", + "│ (Strength, Weakness, Opportunities ┆ ┆ │\n", + "│ and Threats) analysis remains your ┆ ┆ │\n", + "│ eyes to the areas of improvements ┆ ┆ │\n", + "│ which eventually can turn your ┆ ┆ │\n", + "│ business a worth more than the ┆ ┆ │\n", + "│ present situation. ┆ ┆ │\n", + "│ - How Big Firms Are Able To Reduce ┆ ┆ │\n", + "│ Their Cycle Times? ┆ ┆ │\n", + "│ According to numerous business ┆ ┆ │\n", + "│ reports, 35% reduction in cycle ┆ ┆ │\n", + "│ times have been observed after ┆ ┆ │\n", + "│ implementation of Six Sigma. 
On ┆ ┆ │\n", + "│ the contrary, firms who haven’t ┆ ┆ │\n", + "│ implemented six sigma methodology ┆ ┆ │\n", + "│ complaint about delays in their ┆ ┆ │\n", + "│ projects and cycle time ┆ ┆ │\n", + "│ completion, less employee ┆ ┆ │\n", + "│ engagement or totally zero ┆ ┆ │\n", + "│ employee motivational levels. ┆ ┆ │\n", + "│ Real Life Amazing Results of Six ┆ ┆ │\n", + "│ Sigma: 3M Success Story ┆ ┆ │\n", + "│ 3M reported to have been ┆ ┆ │\n", + "│ celebrated their success credited ┆ ┆ │\n", + "│ by Six Sigma back in early 2000’s. ┆ ┆ │\n", + "│ Spare few minutes to have a glance ┆ ┆ │\n", + "│ on impressive ground breaking ┆ ┆ │\n", + "│ results made real only through Six ┆ ┆ │\n", + "│ Sigma. ┆ ┆ │\n", + "│ - 50% reduction in waste ┆ ┆ │\n", + "│ generation ┆ ┆ │\n", + "│ - 67% reduction in GreenHouse Gas ┆ ┆ │\n", + "│ (GHG) emissions ┆ ┆ │\n", + "│ - 37% water recycles usage ┆ ┆ │\n", + "│ - 8% reduction in toxic air ┆ ┆ │\n", + "│ emissions ┆ ┆ │\n", + "│ If Six Sigma tips can do wonders ┆ ┆ │\n", + "│ for 3M then why not you? Let’s ┆ ┆ │\n", + "│ banish the hindrances and fears in ┆ ┆ │\n", + "│ your way to become world class ┆ ┆ │\n", + "│ business. ┆ ┆ │\n", + "│ Israeli night raids targeting arms ┆ ┆ │\n", + "│ eastern Syria killed at least five ┆ ┆ │\n", + "│ soldiers and 11 allied fighters, ┆ ┆ │\n", + "│ the Syrian Observatory for Human ┆ ┆ │\n", + "│ Rights said on Wednesday. ┆ ┆ │\n", + "│ The Israeli air force carried out ┆ ┆ │\n", + "│ more than 18 strikes against ┆ ┆ │\n", + "│ multiple targets in an area ┆ ┆ │\n", + "│ stretching from the eastern town ┆ ┆ │\n", + "│ of Deir ez-Zor to the Boukamal ┆ ┆ │\n", + "│ desert at the Syrian-Iraqi border, ┆ ┆ │\n", + "│ according to the Britain-based war ┆ ┆ │\n", + "│ monitor. ┆ ┆ │\n", + "│ The raids killed five Syrian ┆ ┆ │\n", + "│ soldiers and 11 allied fighters ┆ ┆ │\n", + "│ belonging to the Iranian ┆ ┆ │\n", + "│ Revolutionary Guards, Lebanese ┆ ┆ │\n", + "│ Hezbollah and the Fatimid Brigade, ┆ ┆ │\n", + "│ which includes pro-Iranian Afghan ┆ ┆ │\n", + "│ fighters, the Observatory said, ┆ ┆ │\n", + "│ although their nationalities and a ┆ ┆ │\n", + "│ precise breakdown were not ┆ ┆ │\n", + "│ immediately known. ┆ ┆ │\n", + "│ The Syrian state news agency SANA ┆ ┆ │\n", + "│ reported the strikes but without ┆ ┆ │\n", + "│ giving further details. ┆ ┆ │\n", + "│ “At 1:10 am, the Israeli enemy ┆ ┆ │\n", + "│ carried out an aerial assault on ┆ ┆ │\n", + "│ the town of Deir ez-Zor and the ┆ ┆ │\n", + "│ Boukamal region,” SANA said, ┆ ┆ │\n", + "│ citing a military source. ┆ ┆ │\n", + "│ “The results of the aggression are ┆ ┆ │\n", + "│ currently being verified,” it ┆ ┆ │\n", + "│ added. ┆ ┆ │\n", + "│ It was the second wave of Israeli ┆ ┆ │\n", + "│ raids in Syria in less than a ┆ ┆ │\n", + "│ week. ┆ ┆ │\n", + "│ The last strikes on January 7 ┆ ┆ │\n", + "│ targeted positions in southern ┆ ┆ │\n", + "│ Syria and south of the capital ┆ ┆ │\n", + "│ Damascus, killing three pro-Iran ┆ ┆ │\n", + "│ fighters. ┆ ┆ │\n", + "│ Israel routinely carries out raids ┆ ┆ │\n", + "│ in Syria, mostly against targets ┆ ┆ │\n", + "│ affiliated with Iran in what it ┆ ┆ │\n", + "│ says is a bid to prevent its arch ┆ ┆ │\n", + "│ foe from securing further foothold ┆ ┆ │\n", + "│ along its borders. ┆ ┆ │\n", + "│ Iran has members of its own ┆ ┆ │\n", + "│ military as well as fighters from ┆ ┆ │\n", + "│ a variety of nationalities ┆ ┆ │\n", + "│ fighting with militias it supports ┆ ┆ │\n", + "│ deployed across Syria. 
┆ ┆ │\n", + "│ Israel hit around 50 targets in ┆ ┆ │\n", + "│ Syria in 2020, according to an ┆ ┆ │\n", + "│ annual report released in late ┆ ┆ │\n", + "│ December by the Israeli military. ┆ ┆ │\n", + "│ The Israeli army has carried out ┆ ┆ │\n", + "│ hundreds of air and missile ┆ ┆ │\n", + "│ strikes on Syria since the civil ┆ ┆ │\n", + "│ war broke out in 2011, targeting ┆ ┆ │\n", + "│ Iranian and Lebanese Hezbollah ┆ ┆ │\n", + "│ forces as well as government ┆ ┆ │\n", + "│ troops. ┆ ┆ │\n", + "│ The Jewish state rarely ┆ ┆ │\n", + "│ acknowledges individual strikes. ┆ ┆ │\n", + "│ The Syrian Observer has not ┆ ┆ │\n", + "│ verified the content of this ┆ ┆ │\n", + "│ story. Responsibility for the ┆ ┆ │\n", + "│ information and views set out in ┆ ┆ │\n", + "│ this article lies entirely with ┆ ┆ │\n", + "│ the author. ┆ ┆ │\n", + "└────────────────────────────────────┴───────────────────────────────────┴─────────────────────────┘\n" ] } ], diff --git a/transforms/language/readability/test-data/expected/metadata.json b/transforms/language/readability/test-data/expected/metadata.json index e02f28d268..c304b58942 100644 --- a/transforms/language/readability/test-data/expected/metadata.json +++ b/transforms/language/readability/test-data/expected/metadata.json @@ -5,40 +5,43 @@ "job name": "readability", "job type": "pure python", "job id": "job_id", - "start_time": "2024-10-03 17:57:33", - "end_time": "2024-10-03 17:57:33", + "start_time": "2025-02-07 11:40:34", + "end_time": "2025-02-07 11:40:34", "status": "success" }, - "code": { - "github": "github", - "commit_hash": "12345", - "path": "path" - }, + "code": null, "job_input_params": { - "contents_column_name": "contents", - "curriculum": true, + "readability_contents_column_name": "contents", + "readability_score_list": "mcalpine_eflaw_textstat", "checkpointing": false, "max_files": -1, "random_samples": -1, "files_to_use": [".parquet"], "num_processors": 0 }, + "execution_stats": { + "cpus": 1.9, + "gpus": 0, + "memory": 20.07, + "object_store": 0, + "execution time, min": 0.002 + }, "job_output_stats": { "source_files": 1, "source_size": 14884, "result_files": 1, - "result_size": 17206, - "processing_time": 0.171, + "result_size": 12222, + "processing_time": 0.141, "nrows": 2, "source_doc_count": 2, "result_doc_count": 2 }, "source": { - "name": "/Users/touma/data-prep-kit-inner/transforms/language/readability/python/test-data/input", + "name": "/home/cma/de/data-prep-kit/transforms/language/readability/test-data/input", "type": "path" }, "target": { - "name": "/Users/touma/data-prep-kit-inner/transforms/language/readability/python/test-data/output", + "name": "/home/cma/de/data-prep-kit/transforms/language/readability/output", "type": "path" } } diff --git a/transforms/language/readability/test-data/expected/readability-test.parquet b/transforms/language/readability/test-data/expected/readability-test.parquet index f42ae4f7ad..b99c11e40c 100644 Binary files a/transforms/language/readability/test-data/expected/readability-test.parquet and b/transforms/language/readability/test-data/expected/readability-test.parquet differ diff --git a/transforms/language/readability/test-data/expected2/metadata.json b/transforms/language/readability/test-data/expected2/metadata.json index 857fef9917..354e41d500 100644 --- a/transforms/language/readability/test-data/expected2/metadata.json +++ b/transforms/language/readability/test-data/expected2/metadata.json @@ -5,40 +5,43 @@ "job name": "readability", "job type": "pure python", "job id": "job_id", - 
"start_time": "2024-10-03 19:04:11", - "end_time": "2024-10-03 19:04:12", + "start_time": "2025-02-07 13:27:11", + "end_time": "2025-02-07 13:27:11", "status": "success" }, - "code": { - "github": "github", - "commit_hash": "12345", - "path": "path" - }, + "code": null, "job_input_params": { - "contents_column_name": "contents", - "curriculum": 0, + "readability_contents_column_name": "contents", + "readability_score_list": ["reading_time_textstat", "spache_readability_textstat", "text_standard_textstat"], "checkpointing": false, "max_files": -1, "random_samples": -1, "files_to_use": [".parquet"], "num_processors": 0 }, + "execution_stats": { + "cpus": 3.7, + "gpus": 0, + "memory": 13.97, + "object_store": 0, + "execution time, min": 0.002 + }, "job_output_stats": { "source_files": 1, "source_size": 14884, "result_files": 1, - "result_size": 24744, - "processing_time": 0.206, + "result_size": 13171, + "processing_time": 0.139, "nrows": 2, "source_doc_count": 2, "result_doc_count": 2 }, "source": { - "name": "/Users/touma/data-prep-kit-inner/transforms/language/readability/python/test-data/input", + "name": "/home/cma/de/data-prep-kit/transforms/language/readability/test-data/input", "type": "path" }, "target": { - "name": "/Users/touma/data-prep-kit-inner/transforms/language/readability/python/test-data/output", + "name": "/home/cma/de/data-prep-kit/transforms/language/readability/output", "type": "path" } } diff --git a/transforms/language/readability/test-data/expected2/readability-test.parquet b/transforms/language/readability/test-data/expected2/readability-test.parquet index 8faccb5a16..03165bae1d 100644 Binary files a/transforms/language/readability/test-data/expected2/readability-test.parquet and b/transforms/language/readability/test-data/expected2/readability-test.parquet differ diff --git a/transforms/language/readability/test/test_readability_python.py b/transforms/language/readability/test/test_readability_python.py index 9782d98654..dc666df576 100644 --- a/transforms/language/readability/test/test_readability_python.py +++ b/transforms/language/readability/test/test_readability_python.py @@ -15,7 +15,13 @@ from data_processing.test_support.launch.transform_test import ( AbstractTransformLauncherTest, ) -from dpk_readability.common import contents_column_name_cli_param +from dpk_readability.common import ( + contents_column_name_cli_param, + reading_time_textstat, + score_list_cli_param, + spache_readability_textstat, + text_standard_textstat, +) from dpk_readability.runtime import ReadabilityPythonTransformConfiguration @@ -32,7 +38,7 @@ def get_test_transform_fixtures(self) -> list[tuple]: cli_params = { contents_column_name_cli_param: "contents", - # curriculum_cli_param: False + score_list_cli_param: f"['{reading_time_textstat}','{spache_readability_textstat}','{text_standard_textstat}']", } fixtures = [] diff --git a/transforms/language/readability/test/test_readability_python_defaults.py b/transforms/language/readability/test/test_readability_python_defaults.py index 7a976923d6..d428ff898b 100644 --- a/transforms/language/readability/test/test_readability_python_defaults.py +++ b/transforms/language/readability/test/test_readability_python_defaults.py @@ -17,7 +17,8 @@ from data_processing.test_support.launch.transform_test import ( AbstractTransformLauncherTest, ) -from dpk_readability.common import contents_column_name_cli_param, curriculum_cli_param + +# from dpk_readability.common import contents_column_name_cli_param, score_list_cli_param, mcalpine_eflaw_textstat 
from dpk_readability.runtime import ReadabilityPythonTransformConfiguration @@ -32,7 +33,7 @@ class TestPythonReadabilityTransform(AbstractTransformLauncherTest): def get_test_transform_fixtures(self) -> list[tuple]: basedir = os.path.abspath(os.path.join(os.getcwd(), "..", "test-data")) - cli_params = {contents_column_name_cli_param: "contents", curriculum_cli_param: True} + cli_params = {} fixtures = [] launcher = PythonTransformLauncher(ReadabilityPythonTransformConfiguration()) diff --git a/transforms/language/readability/test/test_readability_ray.py b/transforms/language/readability/test/test_readability_ray.py index 2b86800020..9a78a70075 100644 --- a/transforms/language/readability/test/test_readability_ray.py +++ b/transforms/language/readability/test/test_readability_ray.py @@ -15,7 +15,13 @@ AbstractTransformLauncherTest, ) from data_processing_ray.runtime.ray import RayTransformLauncher -from dpk_readability.common import contents_column_name_cli_param, curriculum_cli_param +from dpk_readability.common import ( + contents_column_name_cli_param, + reading_time_textstat, + score_list_cli_param, + spache_readability_textstat, + text_standard_textstat, +) from dpk_readability.ray.runtime import ReadabilityRayTransformConfiguration @@ -33,7 +39,7 @@ def get_test_transform_fixtures(self) -> list[tuple]: cli_params = { contents_column_name_cli_param: "contents", "run_locally": True, - # curriculum_cli_param: False + score_list_cli_param: f"['{reading_time_textstat}','{spache_readability_textstat}','{text_standard_textstat}']", } fixtures = [] diff --git a/transforms/language/readability/test/test_readability_ray_defaults.py b/transforms/language/readability/test/test_readability_ray_defaults.py index a9c624e0a1..d371fa4d6c 100644 --- a/transforms/language/readability/test/test_readability_ray_defaults.py +++ b/transforms/language/readability/test/test_readability_ray_defaults.py @@ -17,7 +17,8 @@ AbstractTransformLauncherTest, ) from data_processing_ray.runtime.ray import RayTransformLauncher -from dpk_readability.common import contents_column_name_cli_param, curriculum_cli_param + +# from dpk_readability.common import contents_column_name_cli_param, curriculum_cli_param from dpk_readability.ray.runtime import ReadabilityRayTransformConfiguration @@ -32,7 +33,8 @@ class TestRayReadabilityTransform(AbstractTransformLauncherTest): def get_test_transform_fixtures(self) -> list[tuple]: basedir = os.path.abspath(os.path.join(os.getcwd(), "..", "test-data")) - cli_params = {"run_locally": True, contents_column_name_cli_param: "contents", curriculum_cli_param: True} + cli_params = {"run_locally": True} + # cli_params = {"run_locally": True, contents_column_name_cli_param: "contents", curriculum_cli_param: True} fixtures = [] launcher = RayTransformLauncher(ReadabilityRayTransformConfiguration()) diff --git a/transforms/pyproject.toml b/transforms/pyproject.toml index 62ae5677da..5c6cd46807 100644 --- a/transforms/pyproject.toml +++ b/transforms/pyproject.toml @@ -87,7 +87,8 @@ language = { file = [ "universal/tokenization/requirements.txt", "universal/web2parquet/requirements.txt", "universal/profiler/requirements.txt", -"universal/resize/requirements.txt" +"universal/resize/requirements.txt", +"universal/rep_removal/requirements.txt" ]} # pyproject.toml must be in a parent and cannot be in sibling diff --git a/transforms/transforms-dev1-testing.ipynb b/transforms/transforms-dev1-testing.ipynb new file mode 100644 index 0000000000..2366dcf5b3 --- /dev/null +++ 
b/transforms/transforms-dev1-testing.ipynb @@ -0,0 +1,886 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e0dfb3c5-7419-48b3-ae05-706ec1829b6e", + "metadata": {}, + "source": [ + "Assumes that the transforms package has been installed in the venv and that all setup required for cargo and rep_removal has been done in the venv" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "8d049f72-9ab5-486b-99d0-70e374c9f656", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/touma/data-prep-kit-pkg/transforms/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "from huggingface_hub import hf_hub_download\n", + "import pyarrow.parquet as pq\n", + "import pandas as pd\n", + "import os" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ad36252c-8730-46fe-8882-a6be7c5076c5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 3.57 s, sys: 4.86 s, total: 8.42 s\n", + "Wall time: 51.9 s\n" + ] + } + ], + "source": [ + "%%time\n", + "REPO_ID = \"HuggingFaceFW/fineweb\"\n", + "FILENAME = \"data/CC-MAIN-2013-20/000_00000.parquet\"\n", + "file1=hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type=\"dataset\")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "90ba29c1-6c70-4fba-b700-8dd2630d8b4e", + "metadata": {}, + "outputs": [], + "source": [ + "#os.path.dirname(file1)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "4204bf13-5af6-4235-9a93-140e181cd3a5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 4.71 s, sys: 7.07 s, total: 11.8 s\n", + "Wall time: 8.42 s\n" + ] + }, + { + "data": { + "text/html": [
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textiddumpurldatefile_pathlanguagelanguage_scoretoken_count
0How AP reported in all formats from tornado-st...<urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff>CC-MAIN-2013-20http://%20jwashington@ap.org/Content/Press-Rel...2013-05-18T05:48:54Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.972142717
1Did you know you have two little yellow, nine-...<urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6>CC-MAIN-2013-20http://1000awesomethings.com/2012/09/24/934-ad...2013-05-18T08:11:45Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.947991821
2Car Wash For Clara!\\nNow is your chance to hel...<urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec>CC-MAIN-2013-20http://1027kord.com/car-wash-for-clara/2013-05-18T06:49:55Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.911518125
3Listeners Get Sky-high View of Missoula From H...<urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7>CC-MAIN-2013-20http://1075zoofm.com/listeners-get-sky-high-vi...2013-05-18T06:25:20Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.956516103
4Log In Please enter your ECode to log in.\\nFor...<urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc>CC-MAIN-2013-20http://1105govinfoevents.com/enterprisearchite...2013-05-18T05:27:01Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.79823575
..............................
1091391PALMS — The winner of a $7 million SuperLotto ...<urn:uuid:9a5989f7-b385-498f-84de-75abc9272805>CC-MAIN-2013-20http://www.scpr.org/news/2010/06/06/15880/7m-s...2013-05-22T08:33:55Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.971524165
1091392Irfan Khan/AFP/Getty Images\\nFormer Bell City ...<urn:uuid:b49419dd-bc94-4302-a097-6c544fa0631e>CC-MAIN-2013-20http://www.scpr.org/news/2011/03/15/24996/atto...2013-05-22T07:56:02Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.973813313
1091393A more common sentiment than you would think (...<urn:uuid:832b678a-df73-4131-b479-b9fbd3370a6f>CC-MAIN-2013-20http://www.scq.ubc.ca/sciencescouts/the-i%E2%8...2013-05-22T07:55:36Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.969990217
1091394Paper Fashions Boutique is here to save you ti...<urn:uuid:1c61271c-9694-4481-aef2-117fea466605>CC-MAIN-2013-20http://www.scrapscene.com/2010/08/new-scrapboo...2013-05-22T08:27:53Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.963822659
1091395Admissions down in Argentina by 7% in first ha...<urn:uuid:8759fd30-1bf9-4538-83d1-1195e0d08f93>CC-MAIN-2013-20http://www.screendaily.com/admissions-down-in-...2013-05-22T08:13:50Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.925611252
\n", + "

1091396 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " text \\\n", + "0 How AP reported in all formats from tornado-st... \n", + "1 Did you know you have two little yellow, nine-... \n", + "2 Car Wash For Clara!\\nNow is your chance to hel... \n", + "3 Listeners Get Sky-high View of Missoula From H... \n", + "4 Log In Please enter your ECode to log in.\\nFor... \n", + "... ... \n", + "1091391 PALMS — The winner of a $7 million SuperLotto ... \n", + "1091392 Irfan Khan/AFP/Getty Images\\nFormer Bell City ... \n", + "1091393 A more common sentiment than you would think (... \n", + "1091394 Paper Fashions Boutique is here to save you ti... \n", + "1091395 Admissions down in Argentina by 7% in first ha... \n", + "\n", + " id dump \\\n", + "0 CC-MAIN-2013-20 \n", + "1 CC-MAIN-2013-20 \n", + "2 CC-MAIN-2013-20 \n", + "3 CC-MAIN-2013-20 \n", + "4 CC-MAIN-2013-20 \n", + "... ... ... \n", + "1091391 CC-MAIN-2013-20 \n", + "1091392 CC-MAIN-2013-20 \n", + "1091393 CC-MAIN-2013-20 \n", + "1091394 CC-MAIN-2013-20 \n", + "1091395 CC-MAIN-2013-20 \n", + "\n", + " url \\\n", + "0 http://%20jwashington@ap.org/Content/Press-Rel... \n", + "1 http://1000awesomethings.com/2012/09/24/934-ad... \n", + "2 http://1027kord.com/car-wash-for-clara/ \n", + "3 http://1075zoofm.com/listeners-get-sky-high-vi... \n", + "4 http://1105govinfoevents.com/enterprisearchite... \n", + "... ... \n", + "1091391 http://www.scpr.org/news/2010/06/06/15880/7m-s... \n", + "1091392 http://www.scpr.org/news/2011/03/15/24996/atto... \n", + "1091393 http://www.scq.ubc.ca/sciencescouts/the-i%E2%8... \n", + "1091394 http://www.scrapscene.com/2010/08/new-scrapboo... \n", + "1091395 http://www.screendaily.com/admissions-down-in-... \n", + "\n", + " date \\\n", + "0 2013-05-18T05:48:54Z \n", + "1 2013-05-18T08:11:45Z \n", + "2 2013-05-18T06:49:55Z \n", + "3 2013-05-18T06:25:20Z \n", + "4 2013-05-18T05:27:01Z \n", + "... ... \n", + "1091391 2013-05-22T08:33:55Z \n", + "1091392 2013-05-22T07:56:02Z \n", + "1091393 2013-05-22T07:55:36Z \n", + "1091394 2013-05-22T08:27:53Z \n", + "1091395 2013-05-22T08:13:50Z \n", + "\n", + " file_path language \\\n", + "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "... ... ... \n", + "1091391 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091392 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091393 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091394 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091395 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "\n", + " language_score token_count \n", + "0 0.972142 717 \n", + "1 0.947991 821 \n", + "2 0.911518 125 \n", + "3 0.956516 103 \n", + "4 0.798235 75 \n", + "... ... ... 
\n", + "1091391 0.971524 165 \n", + "1091392 0.973813 313 \n", + "1091393 0.969990 217 \n", + "1091394 0.963822 659 \n", + "1091395 0.925611 252 \n", + "\n", + "[1091396 rows x 9 columns]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "import pyarrow.parquet as pq\n", + "import pandas as pd\n", + "table = pq.read_table(file1)\n", + "table.to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "b6bbf09e-240d-4017-9bd3-80c809b01d27", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "12:43:08 INFO - Doc id parameters are : {'doc_column': 'text', 'hash_column': 'document_id', 'int_column': 'int_id_column', 'start_id': 5}\n", + "12:43:08 INFO - pipeline id pipeline_id\n", + "12:43:08 INFO - code location None\n", + "12:43:08 INFO - data factory data_ is using local data access: input_folder - /Users/touma/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-doc-id\n", + "12:43:08 INFO - data factory data_ max_files -1, n_sample -1\n", + "12:43:08 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "12:43:08 INFO - orchestrator doc_id started at 2025-02-03 12:43:08\n", + "12:43:08 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}\n", + "12:43:30 INFO - Completed 1 files (100.0%) in 0.374 min\n", + "12:43:30 INFO - Done processing 1 files, waiting for flush() completion.\n", + "12:43:30 INFO - done flushing in 0.0 sec\n", + "12:43:30 INFO - Completed execution in 0.374 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 17.8 s, sys: 6.27 s, total: 24.1 s\n", + "Wall time: 22.5 s\n" + ] + }, + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "from dpk_doc_id.transform_python import DocID\n", + "DocID(input_folder= os.path.dirname(file1),\n", + " output_folder= \"files-doc-id\",\n", + " doc_id_doc_column= \"text\",\n", + " doc_id_hash_column= \"document_id\",\n", + " doc_id_int_column= \"int_id_column\",\n", + " doc_id_start_id= 5).transform()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "2b841fe0-696a-47b9-a93d-683190410710", + "metadata": {}, + "outputs": [], + "source": [ + "#%%time\n", + "#import pyarrow.parquet as pq\n", + "#import pandas as pd\n", + "#table = pq.read_table('files-doc-id/000_00000.parquet')\n", + "#table.to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "72d7a18b-a218-4cd2-9877-61cfb32fff1a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "12:45:13 INFO - pipeline id pipeline_id\n", + "INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id\n", + "12:45:13 INFO - code location None\n", + "INFO:data_processing.runtime.execution_configuration:code location None\n", + "12:45:13 INFO - data factory data_ is using local data access: input_folder - /Users/touma/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - 
files-rep_removal\n", + "INFO:data_processing.data_access.data_access_factory_base9afa7eae-b98d-4b5e-b07d-cd279ce6afde:data factory data_ is using local data access: input_folder - /Users/touma/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal\n", + "12:45:13 INFO - data factory data_ max_files -1, n_sample -1\n", + "INFO:data_processing.data_access.data_access_factory_base9afa7eae-b98d-4b5e-b07d-cd279ce6afde:data factory data_ max_files -1, n_sample -1\n", + "12:45:13 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "INFO:data_processing.data_access.data_access_factory_base9afa7eae-b98d-4b5e-b07d-cd279ce6afde:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "12:45:13 INFO - orchestrator rep_removal started at 2025-02-03 12:45:13\n", + "INFO:data_processing.runtime.pure_python.transform_orchestrator:orchestrator rep_removal started at 2025-02-03 12:45:13\n", + "12:45:13 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}\n", + "INFO:data_processing.runtime.pure_python.transform_orchestrator:Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cpu speed: 3504 MHz, Cores: 12\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:root:timeout is: 35130.8109303653\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "gpu_usage: 0.00%, GPU speed: 0 MHz\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:root:running the merge\n", + "INFO:root:merging complete\n", + "\u001b[1m\u001b[33mwarning\u001b[0m\u001b[1m:\u001b[0m no edition set: defaulting to the 2015 edition while the latest is 2021\n", + "\u001b[1m\u001b[32m Updating\u001b[0m crates.io index\n", + "\u001b[1m\u001b[32m Locking\u001b[0m 48 packages to latest compatible versions\n", + "\u001b[1m\u001b[36m Adding\u001b[0m clap v3.2.25 \u001b[1m\u001b[33m(available: v4.5.27)\u001b[0m\n", + "\u001b[1m\u001b[36m Adding\u001b[0m crossbeam v0.3.2 \u001b[1m\u001b[33m(available: v0.8.4)\u001b[0m\n", + "\u001b[1m\u001b[36m Adding\u001b[0m filebuffer v0.4.0 \u001b[1m\u001b[33m(available: v1.0.0)\u001b[0m\n", + "\u001b[1m\u001b[36m Adding\u001b[0m zstd v0.5.4+zstd.1.4.7 \u001b[1m\u001b[33m(available: v0.13.2)\u001b[0m\n", + "\u001b[1m\u001b[36m Adding\u001b[0m zstd-sys v1.4.18+zstd.1.4.7 \u001b[1m\u001b[33m(available: v1.6.3+zstd.1.5.2)\u001b[0m\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m libc v0.2.169\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m proc-macro2 v1.0.93\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m version_check v0.9.5\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m shlex v1.3.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m either v1.13.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m unicode-ident v1.0.16\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m glob v0.3.2\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m syn v1.0.109\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m autocfg v1.4.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m zstd-safe 
v2.0.6+zstd.1.4.7\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m heck v0.4.1\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m hashbrown v0.12.3\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start load!\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:49:30 INFO - Completed 1 files (100.0%) in 64.286 min\n", + "INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (100.0%) in 64.286 min\n", + "13:49:30 INFO - Done processing 1 files, waiting for flush() completion.\n", + "INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 1 files, waiting for flush() completion.\n", + "13:49:30 INFO - done flushing in 0.001 sec\n", + "INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.001 sec\n", + "13:49:30 INFO - Completed execution in 64.287 min, execution result 0\n", + "INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 64.287 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 43.6 s, sys: 1min 22s, total: 2min 5s\n", + "Wall time: 1h 4min 25s\n" + ] + }, + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 / 1474657457 \n", + "1000000000 / 1474657457 \n", + "Duplicates found: 21535301\n", + "Total time taken: 119206ms\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[1m\u001b[32m Compiling\u001b[0m itertools v0.9.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m os_str_bytes v6.6.1\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m textwrap v0.16.1\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m strsim v0.10.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m proc-macro-error-attr v1.0.4\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m proc-macro-error v1.0.4\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m indexmap v1.9.3\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m clap_lex v0.2.4\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m termcolor v1.4.1\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m once_cell v1.20.2\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m bitflags v1.3.2\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m crossbeam v0.3.2\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m quote v1.0.38\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m jobserver v0.1.32\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m atty v0.2.14\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m filebuffer v0.4.0\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m cc v1.2.11\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m zstd-sys v1.4.18+zstd.1.4.7\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m clap_derive v3.2.25\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m clap v3.2.25\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m zstd v0.5.4+zstd.1.4.7\n", + "\u001b[1m\u001b[32m Compiling\u001b[0m dedup_dataset v1.0.0 (/Users/touma/data-prep-kit-pkg/transforms/venv/lib/python3.11/site-packages/dpk_rep_removal/rust)\n", + "\u001b[1m\u001b[32m Finished\u001b[0m `dev` profile [optimized + debuginfo] target(s) in 12.37s\n", + "\u001b[1m\u001b[32m Running\u001b[0m `venv/lib/python3.11/site-packages/dpk_rep_removal/rust/target/debug/dedup_dataset self-similar --data-file /var/folders/lb/tysjhggx38l6g9xxg5whzxfc0000gn/T/tmp_h3cw1xg/save_dir/parquet --length-threshold 50 --cache-dir /var/folders/lb/tysjhggx38l6g9xxg5whzxfc0000gn/T/tmp_h3cw1xg/cache --num-threads 1 
--frequency-threshold 1 --retain-first-copy`\n" + ] + } + ], + "source": [ + "%%time\n", + "from dpk_rep_removal.runtime import RepRemoval\n", + "RepRemoval(input_folder= os.path.dirname(file1),\n", + " output_folder= \"files-rep_removal\",\n", + " rep_removal_contents_column_name='text', \n", + " rep_removal_num_threads='1',\n", + " ).transform()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "296200e3-503e-4e5f-92f9-4dd78484c615", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 5.9 s, sys: 4.47 s, total: 10.4 s\n", + "Wall time: 11.8 s\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textiddumpurldatefile_pathlanguagelanguage_scoretoken_count
0How AP reported in all formats from tornado-st...<urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff>CC-MAIN-2013-20http://%20jwashington@ap.org/Content/Press-Rel...2013-05-18T05:48:54Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.972142717
1Did you know you have two little yellow, nine-...<urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6>CC-MAIN-2013-20http://1000awesomethings.com/2012/09/24/934-ad...2013-05-18T08:11:45Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.947991821
2Car Wash For Clara!\\nNow is your chance to hel...<urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec>CC-MAIN-2013-20http://1027kord.com/car-wash-for-clara/2013-05-18T06:49:55Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.911518125
3Listeners Get Sky-high View of Missoula From H...<urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7>CC-MAIN-2013-20http://1075zoofm.com/listeners-get-sky-high-vi...2013-05-18T06:25:20Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.956516103
4Log In Please enter your ECode to log in.\\nFor...<urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc>CC-MAIN-2013-20http://1105govinfoevents.com/enterprisearchite...2013-05-18T05:27:01Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.79823575
..............................
1091391PALMS — The winner of a $7 million SuperLotto ...<urn:uuid:9a5989f7-b385-498f-84de-75abc9272805>CC-MAIN-2013-20http://www.scpr.org/news/2010/06/06/15880/7m-s...2013-05-22T08:33:55Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.971524165
1091392Irfan Khan/AFP/Getty Images\\nFormer Bell City ...<urn:uuid:b49419dd-bc94-4302-a097-6c544fa0631e>CC-MAIN-2013-20http://www.scpr.org/news/2011/03/15/24996/atto...2013-05-22T07:56:02Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.973813313
1091393A more common sentiment than you would think (...<urn:uuid:832b678a-df73-4131-b479-b9fbd3370a6f>CC-MAIN-2013-20http://www.scq.ubc.ca/sciencescouts/the-i%E2%8...2013-05-22T07:55:36Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.969990217
1091394Paper Fashions Boutique is here to save you ti...<urn:uuid:1c61271c-9694-4481-aef2-117fea466605>CC-MAIN-2013-20http://www.scrapscene.com/2010/08/new-scrapboo...2013-05-22T08:27:53Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.963822659
1091395Admissions down in Argentina by 7% in first ha...<urn:uuid:8759fd30-1bf9-4538-83d1-1195e0d08f93>CC-MAIN-2013-20http://www.screendaily.com/admissions-down-in-...2013-05-22T08:13:50Zs3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.925611252
\n", + "

1091396 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " text \\\n", + "0 How AP reported in all formats from tornado-st... \n", + "1 Did you know you have two little yellow, nine-... \n", + "2 Car Wash For Clara!\\nNow is your chance to hel... \n", + "3 Listeners Get Sky-high View of Missoula From H... \n", + "4 Log In Please enter your ECode to log in.\\nFor... \n", + "... ... \n", + "1091391 PALMS — The winner of a $7 million SuperLotto ... \n", + "1091392 Irfan Khan/AFP/Getty Images\\nFormer Bell City ... \n", + "1091393 A more common sentiment than you would think (... \n", + "1091394 Paper Fashions Boutique is here to save you ti... \n", + "1091395 Admissions down in Argentina by 7% in first ha... \n", + "\n", + " id dump \\\n", + "0 CC-MAIN-2013-20 \n", + "1 CC-MAIN-2013-20 \n", + "2 CC-MAIN-2013-20 \n", + "3 CC-MAIN-2013-20 \n", + "4 CC-MAIN-2013-20 \n", + "... ... ... \n", + "1091391 CC-MAIN-2013-20 \n", + "1091392 CC-MAIN-2013-20 \n", + "1091393 CC-MAIN-2013-20 \n", + "1091394 CC-MAIN-2013-20 \n", + "1091395 CC-MAIN-2013-20 \n", + "\n", + " url \\\n", + "0 http://%20jwashington@ap.org/Content/Press-Rel... \n", + "1 http://1000awesomethings.com/2012/09/24/934-ad... \n", + "2 http://1027kord.com/car-wash-for-clara/ \n", + "3 http://1075zoofm.com/listeners-get-sky-high-vi... \n", + "4 http://1105govinfoevents.com/enterprisearchite... \n", + "... ... \n", + "1091391 http://www.scpr.org/news/2010/06/06/15880/7m-s... \n", + "1091392 http://www.scpr.org/news/2011/03/15/24996/atto... \n", + "1091393 http://www.scq.ubc.ca/sciencescouts/the-i%E2%8... \n", + "1091394 http://www.scrapscene.com/2010/08/new-scrapboo... \n", + "1091395 http://www.screendaily.com/admissions-down-in-... \n", + "\n", + " date \\\n", + "0 2013-05-18T05:48:54Z \n", + "1 2013-05-18T08:11:45Z \n", + "2 2013-05-18T06:49:55Z \n", + "3 2013-05-18T06:25:20Z \n", + "4 2013-05-18T05:27:01Z \n", + "... ... \n", + "1091391 2013-05-22T08:33:55Z \n", + "1091392 2013-05-22T07:56:02Z \n", + "1091393 2013-05-22T07:55:36Z \n", + "1091394 2013-05-22T08:27:53Z \n", + "1091395 2013-05-22T08:13:50Z \n", + "\n", + " file_path language \\\n", + "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "... ... ... \n", + "1091391 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091392 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091393 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091394 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "1091395 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en \n", + "\n", + " language_score token_count \n", + "0 0.972142 717 \n", + "1 0.947991 821 \n", + "2 0.911518 125 \n", + "3 0.956516 103 \n", + "4 0.798235 75 \n", + "... ... ... 
\n", + "1091391 0.971524 165 \n", + "1091392 0.973813 313 \n", + "1091393 0.969990 217 \n", + "1091394 0.963822 659 \n", + "1091395 0.925611 252 \n", + "\n", + "[1091396 rows x 9 columns]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%%time\n", + "import pyarrow.parquet as pq\n", + "import pandas as pd\n", + "table = pq.read_table('files-rep_removal/000_00000.parquet')\n", + "table.to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e80e2e5a-4318-47bd-a7f0-a446f532e60e", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/transforms/universal/rep_removal/Makefile b/transforms/universal/rep_removal/Makefile index 5740780dd6..5c110983b1 100644 --- a/transforms/universal/rep_removal/Makefile +++ b/transforms/universal/rep_removal/Makefile @@ -20,5 +20,4 @@ run-cli-sample: source venv/bin/activate && \ $(PYTHON) -m dpk_$(TRANSFORM_NAME).runtime \ --data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}" \ - --rep_removal_contents_column_name 'text' \ - --rep_removal_num_threads '1' + --rep_removal_contents_column_name 'text' diff --git a/transforms/universal/rep_removal/README.md b/transforms/universal/rep_removal/README.md index d810479f96..0b28f42dfc 100644 --- a/transforms/universal/rep_removal/README.md +++ b/transforms/universal/rep_removal/README.md @@ -52,29 +52,38 @@ pip install --no-binary :all: psutil ``` -B) Compile the dedup_dataset binary from the **dpk_rep_removal** package dir: -- Install from git clone repo: -```shell -cargo install --path dpk_rep_removal/rust -``` -- Install from pip install (Note: Activate venv before running next commands): -```shell -PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}') -cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust -``` +[//]: # (B) Compile the dedup_dataset binary from the **dpk_rep_removal** package dir:) + +[//]: # (- Install from git clone repo:) + +[//]: # (```shell) + +[//]: # (cargo install --path dpk_rep_removal/rust) + +[//]: # (```) + +[//]: # (- Install from pip install (Note: Activate venv before running next commands):) + +[//]: # (```shell) + +[//]: # (PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')) + +[//]: # (cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust) + +[//]: # (```) ## Input Parameters The transform can be initialized with the following parameters: -| Parameter | Default | Description | -|------------------------------------|------------|---------------------------------------------------| -| `rep_removal_contents_column_name` | `contents` | Name of the column holding the document contents | -| `rep_removal_dedup_level_name` | `parquet` | Name of the type of file to process | -| `rep_remova_length_thresh` | `50` | Length threshold for processing | -| `rep_removal_frequency_threshold` | `1` | Frequency threshold for processing | -| `rep_removal_retain_first_copy` | `True` | Boolean value for whether to retain first copy | -| 
`rep_removal_tokenize` | `True` | Boolean value for whether to tokenize | -| `rep_removal_num_threads` | `4` | Value for number of threads to use for processing | +| Parameter | Default | Description | +|------------------------------------|------------------------------------|---------------------------------------------------| +| `rep_removal_contents_column_name` | `contents` | Name of the column holding the document contents | +| `rep_removal_dedup_level_name` | `parquet` | Name of the type of file to process | +| `rep_remova_length_thresh` | `50` | Length threshold for processing | +| `rep_removal_frequency_threshold` | `1` | Frequency threshold for processing | +| `rep_removal_retain_first_copy` | `True` | Boolean value for whether to retain first copy | +| `rep_removal_tokenize` | `True` | Boolean value for whether to tokenize | +| `rep_removal_num_threads` | `psutils.cpu_count(logical=False)` | Value for number of threads to use for processing | ## Output Format @@ -116,8 +125,7 @@ You can invoke the transform via command line, as shown in sample make command ` ```commandline python -m dpk_rep_removal.runtime \ --data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}" \ - --rep_removal_contents_column_name 'text' \ - --rep_removal_num_threads '1' + --rep_removal_contents_column_name 'text' ``` diff --git a/transforms/universal/rep_removal/dpk_rep_removal/dedup_pq_level.py b/transforms/universal/rep_removal/dpk_rep_removal/dedup_pq_level.py index ab78338f95..a6d15d6adf 100644 --- a/transforms/universal/rep_removal/dpk_rep_removal/dedup_pq_level.py +++ b/transforms/universal/rep_removal/dpk_rep_removal/dedup_pq_level.py @@ -24,12 +24,9 @@ import pandas as pd import struct from collections import defaultdict -import dpk_rep_removal.utils import transformers from transformers import GPT2Tokenizer -run_in_OCP = True - #### Save the tokenizer in a local path to speed up the process #### Get tokenizer from the local path to speed up the process @@ -77,65 +74,10 @@ def decode(x): return out -def load_pq_docs(pq_df, content_col, save_dir, dataset_name, tokenize, num_threads): - global args_tokenize - args_tokenize = tokenize - - pre_sep = b"\xff\xff" - post_sep = b"" - - if not os.path.exists(save_dir): - os.mkdir(save_dir) - - fout = open(os.path.join(save_dir, dataset_name), "wb") - - with mp.get_context("fork").Pool(num_threads) as p: - sizes = [0] - docs_content_text = pq_df[content_col].tolist() - encoded_docs = p.map(encode, docs_content_text) - - for doc in encoded_docs: - next_line = sep() + doc - fout.write(next_line) - sizes.append(sizes[-1] + len(next_line)) - fout.close() - open(os.path.join(save_dir, dataset_name + ".size"), "wb").write(np.array(sizes, dtype=np.uint64).tobytes()) - - -def load_pq_docs_once(pq_df, content_col, save_dir, dataset_name, tokenize, num_threads): - global encoded_docs, loaded_size, args_tokenize - args_tokenize = tokenize - - pre_sep = b"\xff\xff" - post_sep = b"" - - if not os.path.exists(save_dir): - os.mkdir(save_dir) - - fout = open(os.path.join(save_dir, dataset_name), "wb") - - with mp.get_context("fork").Pool(num_threads) as p: - loaded_size = [0] - docs_content_text = pq_df[content_col].tolist() - encoded_docs = p.map(encode, docs_content_text) - - for doc in encoded_docs: - next_line = sep() + doc - fout.write(next_line) - loaded_size.append(loaded_size[-1] + len(next_line)) - fout.close() - open(os.path.join(save_dir, dataset_name + ".size"), "wb").write(np.array(loaded_size, dtype=np.uint64).tobytes()) - ### To 
avoid tokenizing again we pass the tokenized column to use later - # return enc_text, loaded_size - - def load_pq_docs_once_avoidIO(pq_df, content_col, save_dir, dataset_name, tokenize, num_threads): global args_tokenize, encoded_docs, loaded_size args_tokenize = tokenize - pre_sep = b"\xff\xff" - post_sep = b"" - if not os.path.exists(save_dir): os.mkdir(save_dir) @@ -158,31 +100,6 @@ def load_pq_docs_once_avoidIO(pq_df, content_col, save_dir, dataset_name, tokeni # return enc_text, loaded_size -def gen_output_doc(args): - global remove_ex, args_tokenize - - this_idx, row = args - - if this_idx in remove_ex: - if args_tokenize: - row = encode(row) - for start, end in remove_ex[this_idx][::-1]: - if start % 2: - start = start - 1 - if end % 2: - end = end + 1 - # print(start,end) - # end = int(end-6) - # print(start,end) - row = row[:start] + row[end:] - row = decode(row) - else: - for start, end in remove_ex[this_idx][::-1]: - # print(start,end) - row = row[:start] + row[end:] - return row - - def gen_output_doc_once(args): global remove_ex, args_tokenize, encoded_docs @@ -208,27 +125,6 @@ def gen_output_doc_once(args): return row -def save_deduped_pq(pq_df, output_dir, content_col, num_threads, tokenize): - global args_tokenize, remove_ex - args_tokenize = tokenize - - # pq_df = pd.read_parquet(input_pq_list) - pre_content_col_size = sum(pq_df[content_col].str.len()) - - ### Removing the repeated subsequences from all parquet docs - docs = [(i, row) for i, row in enumerate(pq_df[content_col])] - p = mp.get_context("fork").Pool(int(num_threads)) - docs = p.map(gen_output_doc, docs) - - pq_df[content_col] = docs - deduped_content_col_size = sum(pq_df[content_col].str.len()) - - #### saving the output parquet file once - pq_df.to_parquet(output_dir) - - return pre_content_col_size, deduped_content_col_size - - def save_deduped_pq_once(pq_df, output_dir, content_col, num_threads, tokenize): global args_tokenize, remove_ex args_tokenize = tokenize @@ -251,96 +147,6 @@ def save_deduped_pq_once(pq_df, output_dir, content_col, num_threads, tokenize): return pre_content_col_size, deduped_content_col_size -def extract_dup_per_doc(size_file, repeated_pairs): - global remove_ex - remove = [] - fin = open(repeated_pairs) - for line in fin: - if 'out' in line: break - for line in fin: - remove.append(list(map(int, line.split()))) - - sizes = np.frombuffer(open(size_file, "rb").read(), dtype=np.uint64) - - remove_ex = defaultdict(list) - - # count_between_docs = 0 - # duplicate_between_docs = [] ### for printing and investigation - ptr = 0 - for i, byte_start in enumerate(sizes[:-1]): - byte_end = sizes[i + 1] - # print(byte_start, byte_end, remove[ptr]) - while ptr < len(remove) and byte_start <= remove[ptr][0] < byte_end: - # print(remove[ptr]) - - ##### if a duplicate is made from two subsequent documents, - ##### Do not remove it as each part might be the only occurrence in its related doc - ##### This follows our strategy to retain the first occurrence of each duplicate - if remove[ptr][1] > byte_end + 6: - # count_between_docs += 1 - # duplicate_between_docs.append(i) ### for printing and investigation - ptr += 1 - continue ### Do not remove this duplicate - - # The magic value 6 here corresponds to the 4-byte index prefix followed by \xff\xff. 
- remove_ex[i].append((max(int(remove[ptr][0] - byte_start - 6), 0), - int(min(int(remove[ptr][1] - byte_start), - byte_end - byte_start)) - 6)) ################## added -6 to exclude sep - ptr += 1 - # print ('############# Number of duplicate made from two subsequent documents: ', count_between_docs) - # print ('############# Number of duplicate made from two subsequent documents: ', duplicate_between_docs) - - # df_dict = pd.DataFrame(remove_ex) - # print(remove_ex) - # return remove_ex - - -def extract_dup_per_doc_avoidIO(repeated_pairs): - global remove_ex, loaded_size - remove = [] - fin = open(repeated_pairs) - for line in fin: - if 'out' in line: break - for line in fin: - remove.append(list(map(int, line.split()))) - - ### Avoid I/O process for .size file to speed up the process - # sizes = np.frombuffer(open(size_file, "rb").read(), dtype=np.uint64) - sizes = loaded_size - - remove_ex = defaultdict(list) - - # count_between_docs = 0 - # duplicate_between_docs = [] ### for printing and investigation - ptr = 0 - for i, byte_start in enumerate(sizes[:-1]): - byte_end = sizes[i + 1] - # print(byte_start, byte_end, remove[ptr]) - while ptr < len(remove) and byte_start <= remove[ptr][0] < byte_end: - # print(remove[ptr]) - - ##### if a duplicate is made from two subsequent documents, - ##### Do not remove it as each part might be the only occurrence in its related doc - ##### This follows our strategy to retain the first occurrence of each duplicate - if remove[ptr][1] > byte_end + 6: - # count_between_docs += 1 - # duplicate_between_docs.append(i) ### for printing and investigation - ptr += 1 - continue ### Do not remove this duplicate - - # The magic value 6 here corresponds to the 4-byte index prefix followed by \xff\xff. - remove_ex[i].append((max(int(remove[ptr][0] - byte_start - 6), 0), - int(min(int(remove[ptr][1] - byte_start), - byte_end - byte_start)) - 6)) ################## added -6 to exclude sep - ptr += 1 - # print ('############# Number of duplicate made from two subsequent documents: ', count_between_docs) - # print ('############# Number of duplicate made from two subsequent documents: ', duplicate_between_docs) - - # df_dict = pd.DataFrame(remove_ex) - # print(remove_ex) - # return remove_ex - - def extract_dup_per_doc_avoidIO_further(repeated_pairs): global remove_ex, loaded_size remove = [] @@ -395,9 +201,3 @@ def extract_dup_per_doc_avoidIO_further(repeated_pairs): int(min(int(remove[ptr][1] - byte_start), byte_end - byte_start)) - 6)) ################## added -6 to exclude sep ptr += 1 - # print ('############# Number of duplicate made from two subsequent documents: ', count_between_docs) - # print ('############# Number of duplicate made from two subsequent documents: ', duplicate_between_docs) - - # df_dict = pd.DataFrame(remove_ex) - # print(remove_ex) - # return remove_ex \ No newline at end of file diff --git a/transforms/universal/rep_removal/dpk_rep_removal/make_suffix_array.py b/transforms/universal/rep_removal/dpk_rep_removal/make_suffix_array.py index 809d6d499f..7224612e07 100644 --- a/transforms/universal/rep_removal/dpk_rep_removal/make_suffix_array.py +++ b/transforms/universal/rep_removal/dpk_rep_removal/make_suffix_array.py @@ -31,146 +31,147 @@ # See the License for the specific language governing permissions and # limitations under the License. 
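+# Refactored flow: the input file is split into byte-range parts, each part is
+# suffix-arrayed by the Rust dedup_dataset binary ("make-part"), the part tables
+# are integrity-checked (expected table size = data size * ceil(log2(size)/8)),
+# and the per-part tables are merged into one suffix array ("merge").
+# Job counts scale with input size: >10 GB uses 100 jobs (20 at once),
+# >1 GB uses 96, >10 MB uses 4, otherwise 1.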
-import logging import os import time import subprocess import numpy as np +import multiprocessing as mp from dpk_rep_removal.utils import calculate_timeout +from data_processing.utils import get_logger +logger = get_logger(__name__, level="INFO") -logging.basicConfig(level=logging.DEBUG) +pwd = os.path.dirname(__file__) +dedup_program = f"{pwd}/rust/target/release/dedup_dataset" -def make_suffix_array(input, tmp_dir_sub, dedup_level, num_threads, num_cpus): - # data_size = os.path.getsize(sys.argv[1]) - data_size = os.path.getsize(input) - - HACK = 100000 - - started = [] - +# Determine the number of jobs based on the data size (total jobs, and jobs at once) +def determine_job_parameters(data_size): if data_size > 10e9: - total_jobs = 100 - jobs_at_once = 20 + return 100, 20 elif data_size > 1e9: - total_jobs = 96 - jobs_at_once = 96 + return 96, 96 elif data_size > 10e6: - total_jobs = 4 - jobs_at_once = 4 + return 4, 4 else: - total_jobs = 1 - jobs_at_once = 1 + return 1, 1 + + +# Run a subprocess command and return the output +def run_subprocess(cmd, timeout=None): + try: + if timeout is None: + process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, text=True) + stdout, stderr = process.communicate() + else: + process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, text=True, timeout=timeout) + stderr = process.stderr + stdout = process.stdout + + if process.returncode != 0: + raise Exception(f"Error in subprocess: {stderr}") + return stdout + except Exception as e: + logger.error(f"Error running command '{cmd}': {e}") + return None + + +# Create parts of the dataset +def create_part(data_file, start_byte, end_byte): + cmd = f"{dedup_program} make-part --data-file {data_file} --start-byte {start_byte} --end-byte {end_byte}" + logger.info(f"Creating part: {start_byte}-{end_byte}") + return run_subprocess(cmd) + + +# Calculate expected size using FACT +def get_expected_size(file_path): + size_data = os.path.getsize(file_path) + FACT = np.ceil(np.log(size_data) / np.log(2) / 8) + return size_data * FACT + + +# Check the integrity of the files +def check_file_integrity(data_file, started): + logger.info("Checking file integrity...") + while True: + files = [f"{data_file}.part.{s}-{e}" for s, e in started] + wait = [] + + for file, (s, e) in zip(files, started): + if not os.path.exists(file) or not os.path.exists(f"{file}.table.bin") or os.path.getsize( + f"{file}.table.bin") == 0 or get_expected_size(file) != os.path.getsize(file + ".table.bin"): + logger.warning(f"File missing or invalid: {file}, rerunning.") + wait.append((s, e)) + + if not wait: + break + + logger.info(f"Re-running {len(wait)} jobs due to failed integrity checks.") + with mp.Pool(len(wait)) as pool: + pool.starmap(create_part, [(data_file, s, e) for s, e in wait]) + + time.sleep(1) + + +# Merge the suffix trees +def merge_suffix_trees(files, suffix_array_path, threads, timeout=None): + cmd = f"{dedup_program} merge --output-file {suffix_array_path} --suffix-path {' --suffix-path '.join(files)} --num-threads {threads}" + logger.info("Merging suffix trees...") + result = run_subprocess(cmd, timeout) + if result: + logger.info("Merge successful.") + else: + logger.error("Merge failed.") + raise RuntimeError("Merge failed.") + + +# Cleanup and verification of the final table file + +def cleanup_and_verify_final_table(input_file, suffix_array_path, tmp_dir_sub): + logger.info("Final cleanup and verification...") + subprocess.run("cat %s.table.bin.* > 
%s/out.table.bin" % (suffix_array_path, tmp_dir_sub), shell=True) + subprocess.run("mv %s/out.table.bin %s.table.bin" % (tmp_dir_sub, input_file), shell=True) + # Verify file integrity + if os.path.exists(f"{input_file}.table.bin"): + if os.path.getsize(f"{input_file}.table.bin") % os.path.getsize(input_file) != 0: + logger.error("File size is incorrect.") + raise RuntimeError("File size is incorrect.") + else: + logger.error("Failed to create the table file.") + raise RuntimeError("Failed to create the table file.") + + +def make_suffix_array(input, tmp_dir_sub, dedup_level, num_threads, num_cpus): + HACK = 100000 + data_size = os.path.getsize(input) + total_jobs, jobs_at_once = determine_job_parameters(data_size) + chunk_size = data_size // total_jobs + started = [] + logger.info(f"Starting the deduplication process for file: {input}") - S = data_size // total_jobs timeout = calculate_timeout(data_size, cpu_cores=num_cpus) - logging.info(f"timeout is: {timeout}") + logger.info(f"timeout is: {timeout}") - pwd = os.path.dirname(__file__) - dedup_program = f"{pwd}/rust/target/release/dedup_dataset" + # Create dataset parts in parallel + for jobstart in range(0, total_jobs, jobs_at_once): + wait = [] + for i in range(jobstart, jobstart + jobs_at_once): + start_byte, end_byte = i * chunk_size, min((i + 1) * chunk_size + HACK, data_size) + started.append((start_byte, end_byte)) + wait.append((start_byte, end_byte)) - try: - for jobstart in range(0, total_jobs, jobs_at_once): - wait = [] - for i in range(jobstart, jobstart + jobs_at_once): - s, e = i * S, min((i + 1) * S + HACK, data_size) - # cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(sys.argv[1], s, e) - - ########################################################################################################################################### - # cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(input, s, e) - cmd = f"{dedup_program}" + " make-part --data-file %s --start-byte %d --end-byte %d" % (input, s, e) - ########################################################################################################################################### - - started.append((s, e)) - #run the command with subprocess and capture the output - result = subprocess.run(cmd, shell=True, capture_output=True, text=True) - wait.append(result) - - if e == data_size: - break - - #Ensure all commands have finished - for result in wait: - if result.returncode != 0: - raise RuntimeError(f"Error occurred: {result.stderr}") - - # check the output of part files and rerun if necessary - while True: - # files = ["%s.part.%d-%d"%(sys.argv[1],s, e) for s,e in started] - files = ["%s.part.%d-%d" % (input, s, e) for s, e in started] - - wait = [] - for x, (s, e) in zip(files, started): - go = False - if not os.path.exists(x): - go = True - else: - size_data = os.path.getsize(x) - FACT = np.ceil(np.log(size_data) / np.log(2) / 8) - if not os.path.exists(x) or not os.path.exists(x + ".table.bin") or os.path.getsize( - x + ".table.bin") == 0 or size_data * FACT != os.path.getsize(x + ".table.bin"): - go = True - if go: - # cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(sys.argv[1], s, e) - ########################################################################################################################################### - # cmd = "./target/debug/dedup_dataset make-part --data-file %s --start-byte %d --end-byte %d"%(input, s, e) - 
cmd = f"{dedup_program}" + " make-part --data-file %s --start-byte %d --end-byte %d" % (input, s, e) - ########################################################################################################################################### - - # run the command to recreate the missing or failed parts - result = subprocess.run(cmd, shell=True, capture_output=True, text=True) - wait.append(result) - if len(wait) >= jobs_at_once: - break - - # Ensure all commands have finished - for result in wait: - if result.returncode != 0: - raise RuntimeError(f"Error occurred: {result.stderr}") - - time.sleep(1) - # break the loop when no jobs are left - if len(wait) == 0: - break - - #os.popen("rm tmp/out.table.bin.*").read() - - torun = " --suffix-path ".join(files) - # pipe = os.popen("./target/debug/dedup_dataset merge --output-file %s --suffix-path %s --num-threads %d"%("tmp/out.table.bin", torun, num_threads)) - - #### Saving suffix arrays in a sub folder (part of the input file name is used for sub folder name) - #### to avoid conflicts in parallel processes on the same node - suffix_array_path = os.path.join(tmp_dir_sub, dedup_level) - - ########################################################################################################################################### - # pipe = os.popen("./target/debug/dedup_dataset merge --output-file %s --suffix-path %s --num-threads %d"%(suffix_array_path, torun,num_threads )) - cmd = f"{dedup_program}" + " merge --output-file %s --suffix-path %s --num-threads %d" % ( - suffix_array_path, torun, num_threads) - ########################################################################################################################################### - - # run the merge command: - logging.info("running the merge") - result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout) - if result.returncode != 0: - raise RuntimeError("Something went wrong with merging.") - - #### Saving suffix arrays in a sub folder (part of the input file name is used for sub folder name) - #### to avoid conflicts in parallel processes on the same node - subprocess.run("cat %s.table.bin.* > %s/out.table.bin" % (suffix_array_path, tmp_dir_sub), shell=True) - - subprocess.run("mv %s/out.table.bin %s.table.bin" % (tmp_dir_sub, input), shell=True) - - logging.info('merging complete') - # if os.path.exists(sys.argv[1]+".table.bin"): - if os.path.exists(input + ".table.bin"): - if os.path.getsize(input + ".table.bin") % os.path.getsize(input) != 0: - raise RuntimeError("File size is wrong") + logger.info(f"Scheduling {jobs_at_once} jobs to create dataset parts.") + with mp.Pool(jobs_at_once) as pool: + pool.starmap(create_part, [(input, s, e) for s, e in wait]) - else: - raise RuntimeError("Failed to create table") + # Check the integrity of all created parts + check_file_integrity(input, started) + + # Merging the parts into the final dataset + suffix_array_path = os.path.join(tmp_dir_sub, dedup_level) + files = [f"{input}.part.{s}-{e}" for s, e in started] + merge_suffix_trees(files, suffix_array_path, num_threads, timeout) - except subprocess.TimeoutExpired: - raise RuntimeError("subprocess timed out. skipping file") + # Final cleanup and verification + cleanup_and_verify_final_table(input, suffix_array_path, tmp_dir_sub) - except subprocess.CalledProcessError: - raise RuntimeError("error during subprocess call. 
skipping file") + logger.info("Deduplication process completed successfully.") diff --git a/transforms/universal/rep_removal/dpk_rep_removal/runtime.py b/transforms/universal/rep_removal/dpk_rep_removal/runtime.py index e77a53d222..da7a71497b 100644 --- a/transforms/universal/rep_removal/dpk_rep_removal/runtime.py +++ b/transforms/universal/rep_removal/dpk_rep_removal/runtime.py @@ -55,16 +55,16 @@ def add_input_params(self, parser: ArgumentParser) -> None: ) parser.add_argument( "--rep_removal_length_thresh", - type=str, + type=int, required=False, - default="50", + default=50, help="Length threshold for processing", ) parser.add_argument( "--rep_removal_frequency_threshold", - type=str, + type=int, required=False, - default="1", + default=1, help="Frequency threshold for processing.", ) parser.add_argument( @@ -83,14 +83,14 @@ def add_input_params(self, parser: ArgumentParser) -> None: ) parser.add_argument( "--rep_removal_num_threads", - type=str, + type=int, required=False, - default="4", + default=cpu_count(logical=False), help="Value for number of threads to use for processing", ) parser.add_argument( "--rep_removal_num_cpus", - type=str, + type=int, required=False, default=cpu_count(logical=False), help="Value for number of cpus allocated for processing", diff --git a/transforms/universal/rep_removal/dpk_rep_removal/transform.py b/transforms/universal/rep_removal/dpk_rep_removal/transform.py index 8b505a5c51..a1118814f2 100644 --- a/transforms/universal/rep_removal/dpk_rep_removal/transform.py +++ b/transforms/universal/rep_removal/dpk_rep_removal/transform.py @@ -9,8 +9,9 @@ # See the License for the specific language governing permissions and # limitations under the License. ################################################################################ -import logging + import os +import subprocess import tempfile import pyarrow as pa import pandas as pd @@ -20,8 +21,8 @@ from psutil import cpu_count from dpk_rep_removal.make_suffix_array import make_suffix_array from data_processing.transform import AbstractTableTransform - -logging.basicConfig(level=logging.DEBUG) +from data_processing.utils import get_logger +logging = get_logger(__name__, level="INFO") class RepRemovalTransform(AbstractTableTransform): @@ -30,26 +31,32 @@ def __init__(self, config: dict[str, Any]): self.contents_column_name = config.get("rep_removal_contents_column_name", "contents") self.dedup_level = config.get("rep_removal_dedup_level_name", "parquet") - self.length_thresh = config.get("rep_removal_length_thresh", str(50)) - self.frequency_threshold = config.get("rep_removal_frequency_threshold", str(1)) + self.length_thresh = str(config.get("rep_removal_length_thresh", 5)) + self.frequency_threshold = str(config.get("rep_removal_frequency_threshold", 1)) self.retain_first_copy = str(config.get("rep_removal_retain_first_copy", True)) self.tokenize = str(config.get("rep_removal_tokenize", True)) - self.num_threads = config.get("rep_removal_num_threads", str(4)) - self.num_cpus = config.get("rep_removal_num_cpus", cpu_count(logical=False)) + self.num_threads = str(config.get("rep_removal_num_threads", cpu_count(logical=False))) + self.num_cpus = str(config.get("rep_removal_num_cpus", cpu_count(logical=False))) if self.retain_first_copy.lower() == 'false': self.retain_first_copy = False else: self.retain_first_copy = True + + pwd = os.path.dirname(__file__) + manifest_path = f"{pwd}/rust/" + cmd = f"cargo install --path {manifest_path}" + subprocess.run(cmd, shell=True, capture_output=True, text=True) + def 
diff --git a/transforms/universal/rep_removal/dpk_rep_removal/transform.py b/transforms/universal/rep_removal/dpk_rep_removal/transform.py
index 8b505a5c51..a1118814f2 100644
--- a/transforms/universal/rep_removal/dpk_rep_removal/transform.py
+++ b/transforms/universal/rep_removal/dpk_rep_removal/transform.py
@@ -9,8 +9,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 ################################################################################
-import logging
+
 import os
+import subprocess
 import tempfile
 import pyarrow as pa
 import pandas as pd
@@ -20,8 +21,8 @@
 from psutil import cpu_count
 from dpk_rep_removal.make_suffix_array import make_suffix_array
 from data_processing.transform import AbstractTableTransform
-
-logging.basicConfig(level=logging.DEBUG)
+from data_processing.utils import get_logger
+logging = get_logger(__name__, level="INFO")
 
 
 class RepRemovalTransform(AbstractTableTransform):
@@ -30,26 +31,32 @@ def __init__(self, config: dict[str, Any]):
         self.contents_column_name = config.get("rep_removal_contents_column_name", "contents")
         self.dedup_level = config.get("rep_removal_dedup_level_name", "parquet")
-        self.length_thresh = config.get("rep_removal_length_thresh", str(50))
-        self.frequency_threshold = config.get("rep_removal_frequency_threshold", str(1))
+        self.length_thresh = str(config.get("rep_removal_length_thresh", 50))
+        self.frequency_threshold = str(config.get("rep_removal_frequency_threshold", 1))
         self.retain_first_copy = str(config.get("rep_removal_retain_first_copy", True))
         self.tokenize = str(config.get("rep_removal_tokenize", True))
-        self.num_threads = config.get("rep_removal_num_threads", str(4))
-        self.num_cpus = config.get("rep_removal_num_cpus", cpu_count(logical=False))
+        self.num_threads = str(config.get("rep_removal_num_threads", cpu_count(logical=False)))
+        self.num_cpus = str(config.get("rep_removal_num_cpus", cpu_count(logical=False)))
 
         if self.retain_first_copy.lower() == 'false':
             self.retain_first_copy = False
         else:
             self.retain_first_copy = True
+
+        pwd = os.path.dirname(__file__)
+        manifest_path = f"{pwd}/rust/"
+        cmd = f"cargo install --path {manifest_path}"
+        subprocess.run(cmd, shell=True, capture_output=True, text=True)
+
     def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
         """ """
         pq_df = table.to_pandas()
 
         try:
             with tempfile.TemporaryDirectory() as td:
                 save_dir = os.path.join(td, 'save_dir')
+                logging.info("encoding parquet")
                 encoded_pq = os.path.join(save_dir, self.dedup_level)
-
                 load_pq_docs_once_avoidIO(pq_df, self.contents_column_name, save_dir, self.dedup_level, self.tokenize,
                                           int(self.num_threads))
@@ -58,14 +65,16 @@ def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Tab
                 os.makedirs(cache_dir)
                 os.makedirs(temp_dir)
 
+                logging.info("making suffix array")
                 make_suffix_array(encoded_pq, temp_dir, self.dedup_level, int(self.num_threads), int(self.num_cpus))
 
+                logging.info("finding repeated substrings")
                 find_repeated_substrings(encoded_pq, self.length_thresh, cache_dir, self.num_threads,
                                          self.frequency_threshold, self.retain_first_copy)
-
+                logging.info("collecting duplicates")
                 repeated_pairs = collect_duplicates_avoidIO(encoded_pq, self.length_thresh, cache_dir)
 
                 # no duplicates found
-                if repeated_pairs[0] == 'S 0':
+                if 'out' not in repeated_pairs:
                     return [], {"duplicates_found": 0}
 
                 extract_dup_per_doc_avoidIO_further(repeated_pairs)
@@ -75,10 +84,12 @@ def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Tab
                                            self.num_threads, self.tokenize)
 
+                duplicates_found = len(repeated_pairs[repeated_pairs.index('out') + 1:-1])
+                logging.info(f"Num Duplicate Rows: {duplicates_found}")
                 metadata = {
                     "pre_content col size": pre_content_col_size,
                     "rep_removed_content col size": deduped_content_col_size,
-                    "duplicates_found": len(repeated_pairs) - 4,
+                    "duplicates_found": duplicates_found,
                 }
 
                 # add deduped to res table
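The new counting logic reads `repeated_pairs` as a flat token list in which everything between an `'out'` marker and the trailing element is one duplicate row, replacing the fixed `len(repeated_pairs) - 4` offset. A hedged reading of that convention; the sample list below is invented, since the real value comes from `collect_duplicates_avoidIO`:

```python
# Invented sample shaped like the convention the diff appears to rely on.
repeated_pairs = ["S", "2", "out", "dup_row_1", "dup_row_2", "dup_row_3", "done"]

if "out" not in repeated_pairs:
    duplicates_found = 0  # mirrors the early return in the diff
else:
    duplicates_found = len(repeated_pairs[repeated_pairs.index("out") + 1:-1])

assert duplicates_found == 3
```
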
\n" ] }, { @@ -55,18 +38,12 @@ "metadata": {}, "outputs": [], "source": [ - "!whereis cargo" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eaed73c0-95b8-42de-9cac-a0cdf19ad35b", - "metadata": {}, - "outputs": [], - "source": [ - "# set $PATH env to append the rust path\n", - "os.environ['PATH'] = os.environ['PATH'] + ':/OUTPUT/OF/WHEREIS/CARGO/UP/TO/BIN/'" + "import pathlib\n", + "import os\n", + "\n", + "result = !whereis cargo\n", + "cargo_path = os.path.join(pathlib.Path(result[0].split(' ')[1]).parent, '')\n", + "os.environ['PATH'] = os.environ['PATH'] + f':{cargo_path}'" ] }, { @@ -105,7 +82,6 @@ "RepRemoval(input_folder= \"test-data/input\",\n", " output_folder= \"test-data/output\",\n", " rep_removal_contents_column_name='text', \n", - " rep_removal_num_threads='1',\n", " ).transform()" ] }, diff --git a/transforms/universal/rep_removal/test/test_rep_removal_python.py b/transforms/universal/rep_removal/test/test_rep_removal_python.py index 8936c35fee..1e6766a3ed 100644 --- a/transforms/universal/rep_removal/test/test_rep_removal_python.py +++ b/transforms/universal/rep_removal/test/test_rep_removal_python.py @@ -24,7 +24,6 @@ def test_rep_removal(self): RepRemoval(input_folder=basedir + "/input", output_folder=basedir + "/output", rep_removal_contents_column_name='text', - rep_removal_num_threads='1', ).transform() table1 = pq.read_table(os.path.join(basedir, 'expected', 'test1.parquet')) @@ -36,7 +35,6 @@ def test_wrong_contents_field(self): RepRemoval(input_folder=basedir + "/input", output_folder=basedir + "/output", rep_removal_contents_column_name='contents', - rep_removal_num_threads='1', ).transform() with open(os.path.join(basedir, 'output', 'metadata.json'), 'r') as f: @@ -47,7 +45,6 @@ def test_remove_first_copy(self): RepRemoval(input_folder=basedir + "/input", output_folder=basedir + "/output", rep_removal_contents_column_name='text', - rep_removal_num_threads='1', rep_removal_retain_first_copy=False, ).transform() diff --git a/transforms/universal/rep_removal/test/test_rep_removal_ray.py b/transforms/universal/rep_removal/test/test_rep_removal_ray.py index 600f213cbf..56449c4555 100644 --- a/transforms/universal/rep_removal/test/test_rep_removal_ray.py +++ b/transforms/universal/rep_removal/test/test_rep_removal_ray.py @@ -28,7 +28,6 @@ def get_test_transform_fixtures(self) -> list[tuple]: transform_config = { "run_locally": True, "rep_removal_contents_column_name": 'text', - "rep_removal_num_threads": '1', } launcher = RayTransformLauncher(RepRemovalRayTransformConfiguration())