From 928164d2add9082c380ac27c1fce88be9bf50a73 Mon Sep 17 00:00:00 2001 From: zane <2587359106@qq.com> Date: Thu, 28 Mar 2024 09:48:27 +0800 Subject: [PATCH 01/10] fix empty bug --- examples/llava/clip.cpp | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp index 40c9762617cfd..5954bf6cdec68 100644 --- a/examples/llava/clip.cpp +++ b/examples/llava/clip.cpp @@ -835,9 +835,10 @@ static ggml_cgraph * clip_image_build_graph(clip_ctx * ctx, const clip_image_f32 mlp_2 = ggml_pool_2d(ctx0, mlp_2, GGML_OP_POOL_AVG, 2, 2, 2, 2, 0, 0); // weight ne = [3, 3, 2048, 1] struct ggml_tensor * peg_0 = ggml_conv_depthwise_2d(ctx0, model.mm_model_peg_0_w, mlp_2, 1, 1, 1, 1, 1, 1); - peg_0 = ggml_add(ctx0, peg_0, mlp_2); peg_0 = ggml_cont(ctx0, ggml_permute(ctx0, peg_0, 1, 2, 0, 3)); peg_0 = ggml_add(ctx0, peg_0, model.mm_model_peg_0_b); + mlp_2 = ggml_cont(ctx0, ggml_permute(ctx0, mlp_2, 1, 2, 0, 3)); + peg_0 = ggml_add(ctx0, peg_0, mlp_2); peg_0 = ggml_reshape_3d(ctx0, peg_0, peg_0->ne[0], peg_0->ne[1] * peg_0->ne[2], peg_0->ne[3]); embeddings = peg_0; } @@ -1755,7 +1756,7 @@ int clip_n_patches(const struct clip_ctx * ctx) { int n_patches = (params.image_size / params.patch_size) * (params.image_size / params.patch_size); - if (ctx->proj_type == PROJECTOR_TYPE_LDP) { + if (ctx->proj_type == PROJECTOR_TYPE_LDP || ctx->proj_type == PROJECTOR_TYPE_LDPV2) { n_patches /= 4; } From 741eebf2578e3e848718d99f53e0fe27844b9c0d Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:02:00 +0800 Subject: [PATCH 02/10] Update MobileVLM-README.md added more results on devices --- examples/llava/MobileVLM-README.md | 204 +++++++++++++++++++++++++++-- 1 file changed, 196 insertions(+), 8 deletions(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index 1fc83247a56c1..0bdc10af997f6 100644 --- a/examples/llava/MobileVLM-README.md +++ 
b/examples/llava/MobileVLM-README.md @@ -6,7 +6,7 @@ for more information, please go to [Meituan-AutoML/MobileVLM](https://github.com The implementation is based on llava, and is compatible with llava and mobileVLM. The usage is basically same as llava. -Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion is a little different. Therefore, using MobileVLM as an example, the different conversion step will be shown. +Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models are the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown. ## Usage Build with cmake or run `make llava-cli` to build it. @@ -20,6 +20,17 @@ After building, run: `./llava-cli` to see the usage. For example: -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:" ``` +## GGUF model +If you just want to use it, fetch the gguf format weight from here: +for MobileVLM-1.7B +``` +git clone https://huggingface.co/guinmoon/MobileVLM-1.7B-GGUF +``` +for MobileVLM_V2-1.7B +``` +git clone https://huggingface.co/ZiangWu/MobileVLM_V2-1.7B-GGUF +``` + ## Model conversion - Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally: @@ -36,7 +47,7 @@ git clone https://huggingface.co/openai/clip-vit-large-patch14-336 python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B ``` -3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** the arg is `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF: +3. 
Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** you should use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF: ```sh python ./examples/llava/convert-image-encoder-to-gguf \ @@ -78,7 +89,7 @@ cd examples/llava/android/build_64 ### run on Android refer to `android/adb_run.sh`, modify resources' `name` and `path` -## some result on Android with `Snapdragon 888` chip +## Some result on Android with `Snapdragon 888` chip ### case 1 **input** ```sh @@ -109,7 +120,6 @@ llama_print_timings: total time = 34731.93 ms --image /data/local/tmp/cat.jpeg \ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWhat is in the image? ASSISTANT:" ``` - **output** ```sh encode_image_with_clip: image encoded in 21149.51 ms by CLIP ( 146.87 ms per image patch) @@ -121,12 +131,80 @@ llama_print_timings: eval time = 1279.03 ms / 18 runs ( 71.06 m llama_print_timings: total time = 34570.79 ms ``` + +## Some result on Android with `Snapdragon 778G` chip +### MobileVLM-1.7B case +#### llava-cli release-2005b +**input** +```sh +/data/local/tmp/llava-cli \ + -m /data/local/tmp/ggml-model-q4_k.gguf \ + --mmproj /data/local/tmp/mmproj-model-f16.gguf \ + -t 4 \ + --image /data/local/tmp/many_llamas.jpeg \ + -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWhat's that? ASSISTANT:" +``` +**output** +```sh +encode_image_with_clip: image encoded in 18728.52 ms by CLIP ( 130.06 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that? ASSISTANT: + + A group of llamas are standing in a green pasture. 
+ +llama_print_timings: load time = 20357.33 ms +llama_print_timings: sample time = 2.96 ms / 14 runs ( 0.21 ms per token, 4734.53 tokens per second) +llama_print_timings: prompt eval time = 8119.49 ms / 191 tokens ( 42.51 ms per token, 23.52 tokens per second) +llama_print_timings: eval time = 1005.75 ms / 14 runs ( 71.84 ms per token, 13.92 tokens per second) +llama_print_timings: total time = 28038.34 ms / 205 tokens +``` +#### llava-cli latest-version +**input** +Just the same as above. + +**output**(seems to be much slower) +```sh +encode_image_with_clip: image embedding created: 144 tokens + +encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that? ASSISTANT: + + It is a group of sheep standing together in a grass field. + +llama_print_timings: load time = 818120.91 ms +llama_print_timings: sample time = 3.44 ms / 14 runs ( 0.25 ms per token, 4067.40 tokens per second) +llama_print_timings: prompt eval time = 529274.69 ms / 191 tokens ( 2771.07 ms per token, 0.36 tokens per second) +llama_print_timings: eval time = 43894.02 ms / 13 runs ( 3376.46 ms per token, 0.30 tokens per second) +llama_print_timings: total time = 865441.76 ms / 204 tokens +``` +### MobileVLM_V2-1.7B case +#### llava-cli release-2005b +**input** +Just the same as above. + +**output** +```sh +encode_image_with_clip: image encoded in 20609.61 ms by CLIP ( 143.12 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that? ASSISTANT: + + This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. 
The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting. + +The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama + +llama_print_timings: load time = 22406.77 ms +llama_print_timings: sample time = 49.26 ms / 186 runs ( 0.26 ms per token, 3776.27 tokens per second) +llama_print_timings: prompt eval time = 9044.54 ms / 191 tokens ( 47.35 ms per token, 21.12 tokens per second) +llama_print_timings: eval time = 14497.49 ms / 186 runs ( 77.94 ms per token, 12.83 tokens per second) +llama_print_timings: total time = 44411.01 ms / 377 tokens +``` + ## Orin compile and run ### compile ```sh make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 32 ``` - ### run on Orin ### case 1 **input** @@ -175,8 +253,118 @@ llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 m llama_print_timings: total time = 1365.47 ms / 243 tokens ``` -## Minor shortcomings -The `n_patch` of output in `ldp` is 1/4 of the input. In order to implement quickly, we uniformly modified `clip_n_patches` function to a quarter. when counting the time consumption, the calculated time will be 4 times bigger than the real cost. 
+## Running on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz +### Operating system +Ubuntu22.04 +### compile +```sh +make -j32 +``` +### MobileVLM-1.7B case +**input** +```sh +-m /path/to/ggml-model-q4_k.gguf \ + --mmproj /path/to/mmproj-model-f16.gguf \ + --image /path/to/many_llamas.jpeg \ + -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWhat's that? ASSISTANT:" +``` +**output** +```sh +encode_image_with_clip: image embedding created: 144 tokens + +encode_image_with_clip: image encoded in 2730.94 ms by CLIP ( 18.96 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that?ASSISTANT: + + A group of llamas are walking together in a field. + +llama_print_timings: load time = 5506.60 ms +llama_print_timings: sample time = 0.44 ms / 13 runs ( 0.03 ms per token, 29545.45 tokens per second) +llama_print_timings: prompt eval time = 2031.58 ms / 190 tokens ( 10.69 ms per token, 93.52 tokens per second) +llama_print_timings: eval time = 438.92 ms / 12 runs ( 36.58 ms per token, 27.34 tokens per second) +llama_print_timings: total time = 5990.25 ms / 202 tokens +``` + +### MobileVLM_V2-1.7B case +**input** +Just the same as above. +**ouput** +```sh +encode_image_with_clip: image embedding created: 144 tokens + +encode_image_with_clip: image encoded in 3223.89 ms by CLIP ( 22.39 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that?ASSISTANT: + + The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered.
The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order. + +The park itself is lush and green, with trees dotting the landscape in the background. A sign reading "Llamas Tico Ana" is also visible in the image, possibly indicating the location or the breed of the llamas. The image seems to be taken from a distance, providing a wide view of the scene and the surrounding environment. + +The llamas' positions relative to each other, the sign, and the trees create a harmonious composition. The image does not contain any discernible text. The overall scene is one of peace and natural beauty, with the llamas in their natural habitat, surrounded by the vibrant colors and lush greenery of the park. + +llama_print_timings: load time = 6642.61 ms +llama_print_timings: sample time = 8.15 ms / 223 runs ( 0.04 ms per token, 27358.61 tokens per second) +llama_print_timings: prompt eval time = 2475.07 ms / 190 tokens ( 13.03 ms per token, 76.77 tokens per second) +llama_print_timings: eval time = 8760.60 ms / 222 runs ( 39.46 ms per token, 25.34 tokens per second) +llama_print_timings: total time = 15513.95 ms / 412 tokens +``` + +## Run on Intel(R) Core(TM) Ultra7 115H +### Operating system +Windows11 +### compile +```sh +make -j32 +``` +### MobileVLM-1.7B case +**input** +```sh +-m /path/to/ggml-model-q4_k.gguf \ + --mmproj /path/to/tmp/mmproj-model-f16.gguf \ + -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWhat's that? ASSISTANT:" +``` +**output** +```sh +encode_image_with_clip: image encoded in 4902.81 ms by CLIP ( 34.05 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: +user_prompt: \nWhat's that? ASSISTANT: + + The image features a group of brown and white llamas standing in a grassy field. + +llama_print_timings: load time = 7441.06 ms +llama_print_timings: sample time = 0.72 ms / 19 runs ( 0.04 ms per token, 26279.39 tokens per second) +llama_print_timings: prompt eval time = 2090.71 ms / 191 tokens ( 10.95 ms per token, 91.36 tokens per second) +llama_print_timings: eval time = 512.35 ms / 18 runs ( 28.46 ms per token, 35.13 tokens per second) +llama_print_timings: total time = 7987.23 ms / 209 tokens +``` + +### MobileVLM_V2-1.7B case +**input** +Just the same as above. + +**output** +```sh +encode_image_with_clip: image encoded in 4682.44 ms by CLIP ( 32.52 ms per image patch) +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +user_prompt: \nWhat's that? ASSISTANT: + + This image captures a lively scene of a group of 14 llamas in a grassy field. The llamas, with their distinctive black and white coats, are standing and walking in a line, seemingly engaged in a social activity. One + of them, possibly the first in the line, has its back turned, perhaps observing something in the distance. + +The llama in the front of the line stands out due to its black and white coloring, which is quite unusual for llama patterns. The llama in the front also seems to be more aware of its surroundings, as it faces the camera, giving a sense of engagement with the viewer. + +The image is taken from the side of the llama, providing a clear view of the llama in the front and its companions. The lameness in the llama in + front is not visible, indicating that it might not be the main focus of the photo. + +The background of the image features a grassy field, with a fence and a tree visible in the distance. 
The tree appears to be bare, suggesting that it might be during a time of year when most trees are dormant or have shed their leaves. + + +llama_print_timings: load time = 7015.35 ms +llama_print_timings: sample time = 10.61 ms / 256 runs ( 0.04 ms per token, 24119.09 tokens per second) +llama_print_timings: prompt eval time = 2052.45 ms / 191 tokens ( 10.75 ms per token, 93.06 tokens per second) +llama_print_timings: eval time = 7259.43 ms / 255 runs ( 28.47 ms per token, 35.13 tokens per second) +llama_print_timings: total time = 14371.19 ms / 446 tokens +``` ## TODO @@ -191,5 +379,5 @@ The `n_patch` of output in `ldp` is 1/4 of the input. In order to implement quic ## contributor ```sh -zhangjidong05, yangyang260, huyiming03, chenxiaotao03 +zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77 ``` From 5310114cd6a8689e6721555716bb6a56b408fa84 Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:02:32 +0800 Subject: [PATCH 03/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index 0bdc10af997f6..ad1e422874b74 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -253,7 +253,7 @@ llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 m llama_print_timings: total time = 1365.47 ms / 243 tokens ``` -## Running on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz +## Running on Intel(R) Core(TM) i7-10750H ### Operating system Ubuntu22.04 ### compile From 7fc9c777a3f5209a442a891f522900ee6aa1a387 Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:04:10 +0800 Subject: [PATCH 04/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/llava/MobileVLM-README.md 
b/examples/llava/MobileVLM-README.md index ad1e422874b74..b7b568b691f59 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -288,6 +288,7 @@ llama_print_timings: total time = 5990.25 ms / 202 tokens ### MobileVLM_V2-1.7B case **input** Just the same as above. + **ouput** ```sh encode_image_with_clip: image embedding created: 144 tokens From a4527cb16e83094154e47c8130303ec55dd3a46b Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:05:12 +0800 Subject: [PATCH 05/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index b7b568b691f59..6830b40c6f128 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -160,6 +160,7 @@ llama_print_timings: total time = 28038.34 ms / 205 tokens ``` #### llava-cli latest-version **input** + Just the same as above. **output**(seems to be much slower) @@ -181,6 +182,7 @@ llama_print_timings: total time = 865441.76 ms / 204 tokens ### MobileVLM_V2-1.7B case #### llava-cli release-2005b **input** + Just the same as above. **output** @@ -287,6 +289,7 @@ llama_print_timings: total time = 5990.25 ms / 202 tokens ### MobileVLM_V2-1.7B case **input** + Just the same as above. **ouput** @@ -341,6 +344,7 @@ llama_print_timings: total time = 7987.23 ms / 209 tokens ### MobileVLM_V2-1.7B case **input** + Just the same as above. 
**output** From 79de0e65e16321a1ad1979576108f803fe83f97e Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:07:45 +0800 Subject: [PATCH 06/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index 6830b40c6f128..19f33042524ec 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -47,7 +47,7 @@ git clone https://huggingface.co/openai/clip-vit-large-patch14-336 python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B ``` -3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** you should use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF: +3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF: ```sh python ./examples/llava/convert-image-encoder-to-gguf \ From 1cdd3b0ae3037de8d27d4c349c33303fafb12c3c Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 17:10:21 +0800 Subject: [PATCH 07/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index 19f33042524ec..ab8101b2309f2 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -134,7 +134,7 @@ llama_print_timings: total time = 34570.79 ms ## Some result on Android with `Snapdragon 778G` chip ### MobileVLM-1.7B case -#### llava-cli release-2005b +#### llava-cli release-b2005 **input** ```sh /data/local/tmp/llava-cli \ From 4ab46218c15d7921eaf25f76d6d965f56b5ace14 Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 
Mar 2024 17:58:26 +0800 Subject: [PATCH 08/10] Update MobileVLM-README.md --- examples/llava/MobileVLM-README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index ab8101b2309f2..ae21904e82ad7 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -147,7 +147,7 @@ llama_print_timings: total time = 34570.79 ms **output** ```sh encode_image_with_clip: image encoded in 18728.52 ms by CLIP ( 130.06 ms per image patch) -system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: user_prompt: \nWhat's that? ASSISTANT: A group of llamas are standing in a green pasture. @@ -168,7 +168,7 @@ Just the same as above. encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch) -system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: user_prompt: \nWhat's that? ASSISTANT: It is a group of sheep standing together in a grass field. @@ -188,7 +188,7 @@ Just the same as above. **output** ```sh encode_image_with_clip: image encoded in 20609.61 ms by CLIP ( 143.12 ms per image patch) -system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: user_prompt: \nWhat's that? ASSISTANT: This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting. @@ -256,7 +256,7 @@ llama_print_timings: total time = 1365.47 ms / 243 tokens ``` ## Running on Intel(R) Core(TM) i7-10750H -### Operating system +### Operating system Ubuntu22.04 ### compile ```sh @@ -275,7 +275,7 @@ make -j32 encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 2730.94 ms by CLIP ( 18.96 ms per image patch) -system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: user_prompt: \nWhat's that?ASSISTANT: A group of llamas are walking together in a field. @@ -297,7 +297,7 @@ Just the same as above. encode_image_with_clip: image embedding created: 144 tokens encode_image_with_clip: image encoded in 3223.89 ms by CLIP ( 22.39 ms per image patch) -system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: +system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: user_prompt: \nWhat's that?ASSISTANT: The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered. The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order. From 2a77902a1d290e8f02192adc4488710bf6ffa645 Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 21:37:57 +0800 Subject: [PATCH 09/10] Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov --- examples/llava/MobileVLM-README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index ae21904e82ad7..063b943ff7a2c 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -6,7 +6,7 @@ for more information, please go to [Meituan-AutoML/MobileVLM](https://github.com The implementation is based on llava, and is compatible with llava and mobileVLM. The usage is basically same as llava. -Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models are the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown. +Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown. ## Usage Build with cmake or run `make llava-cli` to build it. 
From 72dbd3250b3a1bb3c87a42dd202e530d94270337 Mon Sep 17 00:00:00 2001 From: Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com> Date: Thu, 28 Mar 2024 21:42:10 +0800 Subject: [PATCH 10/10] Update MobileVLM-README.md remove gguf links --- examples/llava/MobileVLM-README.md | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md index 063b943ff7a2c..96b048525239f 100644 --- a/examples/llava/MobileVLM-README.md +++ b/examples/llava/MobileVLM-README.md @@ -20,17 +20,6 @@ After building, run: `./llava-cli` to see the usage. For example: -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:" ``` -## GGUF model -If you just want to use it, fetch the gguf format weight from here: -for MobileVLM-1.7B -``` -git clone https://huggingface.co/guinmoon/MobileVLM-1.7B-GGUF -``` -for MobileVLM_V2-1.7B -``` -git clone https://huggingface.co/ZiangWu/MobileVLM_V2-1.7B-GGUF -``` - ## Model conversion - Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally: