Merge pull request #585 from h2oai/control_embedding_migration

Control embedding migration
h2oai · Aug 2, 2023 · d333423
2 parents 738d7ac + fe6aaef
commit d333423
Showing 31 changed files with 938 additions and 404 deletions.
diff --git a/README.md b/README.md
@@ -17,6 +17,7 @@ Query and summarize your documents or just chat with local private GPT LLMs usin
 - **Inference Servers** support (HF TGI server, vLLM, Gradio, ExLLaMa, OpenAI)
 - **OpenAI-compliant Python client API** for client-server control
 - **Evaluate** performance using reward models
+- **Quality** maintained with over 250 unit and integration tests taking over 4 GPU-hours
 
 ### Getting Started
 
@@ -128,9 +129,10 @@ GPU and CPU mode tested on variety of NVIDIA GPUs in Ubuntu 18-22, but any moder
 - To run h2oGPT tests:
     ```bash
     wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
-    pip install requirements-parser
-    pytest -s -v tests client/tests
+    pip install requirements-parser pytest-instafail
+    pytest --instafail -s -v tests client/tests
     ```
+  or tweak/run `tests/test4gpus.sh` to run tests in parallel.
 
 ### Help
 

diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -182,11 +182,27 @@ This warning can be safely ignored.
    - `CUDA_VISIBLE_DEVICES`: Standard list of CUDA devices to make visible.
    - `PING_GPU`: ping GPU every few minutes for full GPU memory usage by torch, useful for debugging OOMs or memory leaks
    - `GET_GITHASH`: get git hash on startup for system info.  Avoided normally as can fail with extra messages in output for CLI mode
-
+   - `H2OGPT_SCRATCH_PATH`: Choose base scratch folder for scratch databases and files
+   - `H2OGPT_BASE_PATH`: Choose base folder for all files except scratch files
 These can be useful on HuggingFace spaces, where one sets secret tokens because CLI options cannot be used.
 
 > **_NOTE:_**  Scripts can accept different environment variables to control query arguments. For instance, if a Python script takes an argument like `--load_8bit=True`, the corresponding ENV variable would follow this format: `H2OGPT_LOAD_8BIT=True` (regardless of capitalization). It is important to ensure that the environment variable is assigned the exact value that would have been used for the script's query argument.
 
+### How to run functions in src from Python interpreter
+
+E.g.
+```python
+import sys
+sys.path.append('src')
+from src.gpt_langchain import get_supported_types
+non_image_types, image_types, video_types = get_supported_types()
+print(non_image_types)
+print(image_types)
+for x in image_types:
+    print('   - `.%s` : %s Image (optional),' % (x.lower(), x.upper()))
+print(video_types)
+```
+
 ### GPT4All not producing output.
 
 Please contact GPT4All team.  Even a basic test can give empty result.
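
The note in the FAQ hunk above describes a naming rule: an option such as `--load_8bit=True` corresponds to the environment variable `H2OGPT_LOAD_8BIT=True`. The sketch below shows one way such a mapping could work. It is not h2oGPT's implementation; `env_overrides` and its type-conversion rules are assumptions made for illustration.

```python
import os


def env_overrides(defaults, prefix="H2OGPT_", environ=None):
    """Return a copy of `defaults` with values overridden from the environment.

    Hypothetical helper, not h2oGPT's own code: an option named `load_8bit`
    is looked up as `H2OGPT_LOAD_8BIT`, and the string value is converted to
    the type of the default when that type is bool, int, or float.
    """
    environ = os.environ if environ is None else environ
    result = dict(defaults)
    for name, default in defaults.items():
        raw = environ.get(prefix + name.upper())
        if raw is None:
            continue  # no override set for this option
        # bool must be tested before int, because bool is a subclass of int
        if isinstance(default, bool):
            result[name] = raw.strip().lower() in ("1", "true", "yes")
        elif isinstance(default, int):
            result[name] = int(raw)
        elif isinstance(default, float):
            result[name] = float(raw)
        else:
            result[name] = raw
    return result


if __name__ == "__main__":
    # A fake environment keeps the example independent of the real one.
    fake_env = {"H2OGPT_LOAD_8BIT": "True", "H2OGPT_BASE_PATH": "/data/h2ogpt"}
    print(env_overrides({"load_8bit": False, "base_path": ""}, environ=fake_env))
```

Passing `environ` explicitly makes the helper easy to test; in real use it would fall back to `os.environ`.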

diff --git a/docs/README_LangChain.md b/docs/README_LangChain.md
@@ -44,9 +44,72 @@ Open-source data types are supported, .msg is not supported due to GPL-3 require
    - `.odt`: Open Document Text,
    - `.pptx` : PowerPoint Document,
    - `.ppt` : PowerPoint Document,
+   - `.apng` : APNG Image (optional),
+   - `.blp` : BLP Image (optional),
+   - `.bmp` : BMP Image (optional),
+   - `.bufr` : BUFR Image (optional),
+   - `.bw` : BW Image (optional),
+   - `.cur` : CUR Image (optional),
+   - `.dcx` : DCX Image (optional),
+   - `.dds` : DDS Image (optional),
+   - `.dib` : DIB Image (optional),
+   - `.emf` : EMF Image (optional),
+   - `.eps` : EPS Image (optional),
+   - `.fit` : FIT Image (optional),
+   - `.fits` : FITS Image (optional),
+   - `.flc` : FLC Image (optional),
+   - `.fli` : FLI Image (optional),
+   - `.fpx` : FPX Image (optional),
+   - `.ftc` : FTC Image (optional),
+   - `.ftu` : FTU Image (optional),
+   - `.gbr` : GBR Image (optional),
+   - `.gif` : GIF Image (optional),
+   - `.grib` : GRIB Image (optional),
+   - `.h5` : H5 Image (optional),
+   - `.hdf` : HDF Image (optional),
+   - `.icb` : ICB Image (optional),
+   - `.icns` : ICNS Image (optional),
+   - `.ico` : ICO Image (optional),
+   - `.iim` : IIM Image (optional),
+   - `.im` : IM Image (optional),
+   - `.j2c` : J2C Image (optional),
+   - `.j2k` : J2K Image (optional),
+   - `.jfif` : JFIF Image (optional),
+   - `.jp2` : JP2 Image (optional),
+   - `.jpc` : JPC Image (optional),
+   - `.jpe` : JPE Image (optional),
+   - `.jpeg` : JPEG Image (optional),
+   - `.jpf` : JPF Image (optional),
+   - `.jpg` : JPG Image (optional),
+   - `.jpx` : JPX Image (optional),
+   - `.mic` : MIC Image (optional),
+   - `.mpeg` : MPEG Image (optional),
+   - `.mpg` : MPG Image (optional),
+   - `.msp` : MSP Image (optional),
+   - `.pbm` : PBM Image (optional),
+   - `.pcd` : PCD Image (optional),
+   - `.pcx` : PCX Image (optional),
+   - `.pgm` : PGM Image (optional),
    - `.png` : PNG Image (optional),
-   - `.jpg` : JPEG Image (optional),
-   - `.jpeg` : JPEG Image (optional).
+   - `.pnm` : PNM Image (optional),
+   - `.ppm` : PPM Image (optional),
+   - `.ps` : PS Image (optional),
+   - `.psd` : PSD Image (optional),
+   - `.pxr` : PXR Image (optional),
+   - `.qoi` : QOI Image (optional),
+   - `.ras` : RAS Image (optional),
+   - `.rgb` : RGB Image (optional),
+   - `.rgba` : RGBA Image (optional),
+   - `.sgi` : SGI Image (optional),
+   - `.tga` : TGA Image (optional),
+   - `.tif` : TIF Image (optional),
+   - `.tiff` : TIFF Image (optional),
+   - `.vda` : VDA Image (optional),
+   - `.vst` : VST Image (optional),
+   - `.webp` : WEBP Image (optional),
+   - `.wmf` : WMF Image (optional),
+   - `.xbm` : XBM Image (optional),
+   - `.xpm` : XPM Image (optional).
 
 To support image captioning, on Ubuntu run:
 ```bash
@@ -326,6 +389,8 @@ For links to direct to the document and download to your local machine, the orig
 
 * [docquery](https://github.com/impira/docquery) like PrivateGPT but uses LayoutLM.
 
+* [KhoJ](https://github.com/khoj-ai/khoj) but also access from emacs or Obsidian.
+
 * [ChatPDF](https://www.chatpdf.com/) but h2oGPT is open-source and private and many more data types.
 
 * [Sharly](https://www.sharly.ai/) but h2oGPT is open-source and private and many more data types.  Sharly and h2oGPT both allow sharing work through UserData shared collection.

diff --git a/reqs_optional/requirements_optional_langchain.txt b/reqs_optional/requirements_optional_langchain.txt
@@ -20,7 +20,8 @@ chromadb==0.3.25
 unstructured[local-inference]==0.7.4
 #pdf2image==1.16.3
 #pytesseract==0.3.10
-pillow
+pillow>=10.0.0
+posthog>=3.0.1
 
 pdfminer.six==20221105
 urllib3

diff --git a/requirements.txt b/requirements.txt
@@ -66,3 +66,11 @@ text-generation==0.6.0
 tiktoken==0.4.0
 # optional: for OpenAI endpoint or embeddings (requires key)
 openai==0.27.8
+
+requests>=2.31.0
+urllib3>=1.26.16
+filelock>=3.12.2
+joblib>=1.3.1
+tqdm>=4.65.0
+tabulate>=0.9.0
+packaging>=23.1
diff --git a/src/cli.py b/src/cli.py
@@ -15,7 +15,7 @@ def run_cli(  # for local function:
         score_model=None, load_8bit=None, load_4bit=None, load_half=None,
         load_gptq=None, load_exllama=None, use_safetensors=None, revision=None,
         use_gpu_id=None, tokenizer_base_model=None,
-        gpu_id=None, local_files_only=None, resume_download=None, use_auth_token=None,
+        gpu_id=None, n_jobs=None, local_files_only=None, resume_download=None, use_auth_token=None,
         trust_remote_code=None, offload_folder=None, rope_scaling=None, max_seq_len=None, compile_model=None,
         # for some evaluate args
         stream_output=None, async_output=None, num_async=None,
@@ -40,11 +40,13 @@ def run_cli(  # for local function:
         raise_generate_gpu_exceptions=None, load_db_if_exists=None, use_llm_if_no_docs=None,
         my_db_state0=None, selection_docs_state0=None, dbs=None, langchain_modes=None, langchain_mode_paths=None,
         detect_user_path_changes_every_query=None,
-        use_openai_embedding=None, use_openai_model=None, hf_embedding_model=None, cut_distance=None,
+        use_openai_embedding=None, use_openai_model=None,
+        hf_embedding_model=None, migrate_embedding_model=None,
+        cut_distance=None,
         answer_with_sources=None,
         append_sources_to_answer=None,
         add_chat_history_to_context=None,
-        db_type=None, n_jobs=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
+        db_type=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
         use_cache=None,
         auto_reduce_chunks=None, max_chunks=None, model_lock=None, force_langchain_evaluate=None,
         model_state_none=None,

diff --git a/src/client_test.py b/src/client_test.py
@@ -49,6 +49,7 @@
 from bs4 import BeautifulSoup  # pip install beautifulsoup4
 
 from enums import DocumentSubset, LangChainAction
+from tests.utils import get_inf_server
 
 debug = False
 
@@ -58,7 +59,7 @@
 def get_client(serialize=True):
     from gradio_client import Client
 
-    client = Client(os.getenv('HOST', "http://localhost:7860"), serialize=serialize)
+    client = Client(get_inf_server(), serialize=serialize)
     if debug:
         print(client.view_api(all_endpoints=True))
     return client
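
The hunk above replaces the inline `os.getenv('HOST', ...)` lookup with a call to `get_inf_server()` from `tests.utils`, whose body is not shown in this diff. As a rough sketch only, assuming the helper keeps the removed line's behaviour (read `HOST`, fall back to the local Gradio address), it might look like the following. The name `get_inf_server_sketch` and its parameters are hypothetical.

```python
import os


def get_inf_server_sketch(default="http://localhost:7860", environ=None):
    """Hypothetical stand-in for tests.utils.get_inf_server.

    Assumption: the server address still comes from the HOST environment
    variable, with the same default the removed line used.
    """
    environ = os.environ if environ is None else environ
    return environ.get("HOST", default)


if __name__ == "__main__":
    print(get_inf_server_sketch(environ={}))
    print(get_inf_server_sketch(environ={"HOST": "http://gpu-box:7861"}))
```

Centralizing the lookup in one helper would let the client and the test suite agree on which server to target without repeating the default in several places.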

diff --git a/src/eval.py b/src/eval.py
@@ -24,7 +24,7 @@ def run_eval(  # for local function:
         score_model=None, load_8bit=None, load_4bit=None, load_half=None,
         load_gptq=None, load_exllama=None, use_safetensors=None, revision=None,
         use_gpu_id=None, tokenizer_base_model=None,
-        gpu_id=None, local_files_only=None, resume_download=None, use_auth_token=None,
+        gpu_id=None, n_jobs=None, local_files_only=None, resume_download=None, use_auth_token=None,
         trust_remote_code=None, offload_folder=None, rope_scaling=None, max_seq_len=None, compile_model=None,
         # for evaluate args beyond what's already above, or things that are always dynamic and locally created
         temperature=None,
@@ -60,11 +60,13 @@ def run_eval(  # for local function:
         raise_generate_gpu_exceptions=None, load_db_if_exists=None, use_llm_if_no_docs=None,
         my_db_state0=None, selection_docs_state0=None, dbs=None, langchain_modes=None, langchain_mode_paths=None,
         detect_user_path_changes_every_query=None,
-        use_openai_embedding=None, use_openai_model=None, hf_embedding_model=None, cut_distance=None,
+        use_openai_embedding=None, use_openai_model=None,
+        hf_embedding_model=None, migrate_embedding_model=None,
+        cut_distance=None,
         answer_with_sources=None,
         append_sources_to_answer=None,
         add_chat_history_to_context=None,
-        db_type=None, n_jobs=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
+        db_type=None, first_para=None, text_limit=None, verbose=None, cli=None, reverse_docs=None,
         use_cache=None,
         auto_reduce_chunks=None, max_chunks=None,
         model_lock=None, force_langchain_evaluate=None,
@@ -121,7 +123,7 @@ def run_eval(  # for local function:
     num_examples = len(examples)
     scoring_path = 'scoring'
     # if no permissions, assume may not want files, put into temp
-    scoring_path = makedirs(scoring_path, tmp_ok=True)
+    scoring_path = makedirs(scoring_path, tmp_ok=True, use_base=True)
     if eval_as_output:
         used_base_model = 'gpt35'
         used_lora_weights = ''
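
The `makedirs(scoring_path, tmp_ok=True, use_base=True)` call above relies on a helper that is not shown in this diff. The sketch below illustrates one plausible reading of the two flags, based on the inline comment ("if no permissions ... put into temp") and the `H2OGPT_BASE_PATH` variable documented in the FAQ hunk. The function `makedirs_sketch` and the exact fallback naming are assumptions, not the project's code.

```python
import os
import tempfile
import uuid


def makedirs_sketch(path, tmp_ok=False, use_base=False):
    """Create `path` and return the directory that was actually created.

    Hypothetical illustration only:
    - use_base: place a relative path under H2OGPT_BASE_PATH, if that is set.
    - tmp_ok: on a permission error, fall back to a fresh temp directory.
    """
    if use_base and not os.path.isabs(path):
        base = os.getenv("H2OGPT_BASE_PATH")
        if base:
            path = os.path.join(base, path)
    try:
        os.makedirs(path, exist_ok=True)
        return path
    except PermissionError:
        if not tmp_ok:
            raise
        # No write permission: use a uniquely named directory under the
        # system temp folder instead of failing.
        fallback = os.path.join(
            tempfile.gettempdir(),
            "%s_%s" % (os.path.basename(path), uuid.uuid4().hex[:8]),
        )
        os.makedirs(fallback, exist_ok=True)
        return fallback
```

Because the function returns the directory it ended up creating, callers should reassign the result, as the diff does with `scoring_path = makedirs(...)`.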