Official code for the paper "A picture is worth a thousand words? Investigating the Impact of Image Aids in AR on Memory Recall for Everyday Tasks".
Memory retrieval using a multimodal LLM with the HoloLens 2.
This system, developed for deployment on a HoloLens 2 device:
- Captures images from the user's field of view and stores their LLM-generated descriptions in a vector database;
- Retrieves the images that best match the user's question and generates textual answers complementing the retrieved images.
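The sketch below is a minimal illustration of this encode/retrieve loop, assuming hypothetical helper callables `describe_image` (multimodal LLM captioning) and `embed_text` (text embeddings) and a toy in-memory store in place of the vector database actually used in this repository.

```python
# Minimal sketch of the encode/retrieve flow; all names here are illustrative,
# not the identifiers used in this repository.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryStore:
    """Toy in-memory vector store: one normalized embedding per captured image."""
    embeddings: list = field(default_factory=list)
    image_paths: list = field(default_factory=list)

    def add(self, image_path: str, embedding) -> None:
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))
        self.image_paths.append(image_path)

    def query(self, embedding, k: int = 3) -> list:
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        scores = [float(q @ e) for e in self.embeddings]
        best = np.argsort(scores)[::-1][:k]
        return [self.image_paths[i] for i in best]


def encode(store: MemoryStore, image_path: str, describe_image, embed_text) -> None:
    """Encoding: caption the captured frame with the LLM, embed the caption, store it."""
    caption = describe_image(image_path)          # multimodal LLM call (assumed)
    store.add(image_path, embed_text(caption))    # text-embedding call (assumed)


def retrieve(store: MemoryStore, question: str, embed_text, k: int = 3) -> list:
    """Retrieval: embed the user's question and return the best-matching images."""
    return store.query(embed_text(question), k=k)
```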
- Client-related code is written in C#;
- Server-related code is written in Python.
- In particular, this project makes use of:
  - the LISA repository, for integrating the LISA model;
  - Hugging Face's **Transformers** library (with bitsandbytes) for 4-bit model quantization.
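As an illustration of the 4-bit loading path, a model can be quantized at load time with Transformers' `BitsAndBytesConfig`; the model id below is a placeholder, and the exact configuration used for LISA in this repository may differ.

```python
# Sketch of 4-bit loading with Transformers + bitsandbytes (placeholder model id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-model"  # placeholder, not the actual checkpoint used here

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```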
- Clone the repository; to run the server, you will only need the files in the server folder.
```bash
git clone https://github.com/bess-cater/ARMemory.git && cd ARMemory/server
```
- Build the Docker image from the Dockerfile:
```bash
docker build -t memory_image .
```
- And run it:
```bash
docker run -i -t -d --name memory --ipc=host --gpus all -v "$(pwd):/memory" -p 9999:9999 memory_image
```
- Create and activate the conda environment:
```bash
conda env create -f environment.yml && conda activate memory
```
Additional steps:
- For the Image and Image+Text modes, you may want to use a segmentation model. We use LISA: clone its repository and place the corresponding folder inside the server folder.
- You may want to change the port; in that case, make sure to assign the same port in the server- and client-related files. A minimal sketch of the server-side port binding follows below.
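For reference, the sketch below shows a server binding the mapped port; the actual networking code in myserver.py may be organized differently, and the variable names here are assumptions.

```python
# Illustrative only: the port must match docker's -p mapping and the Unity client setting.
import socket

HOST = "0.0.0.0"   # listen on all interfaces inside the container
PORT = 9999        # change here, in `docker run -p`, and in the client scripts together

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen()
    conn, addr = srv.accept()
    print("HoloLens client connected from", addr)
```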
This section describes the process of capturing pictures from the HoloLens built-in camera and processing them with the chosen LLM.
- Before starting the server, make sure to provide your OpenAI key if you are using the GPT-4o model and/or an OpenAI model for vector embeddings; it is required here (an illustrative example of supplying and using the key is shown after these steps).
- Start the server:
```bash
python -m myserver --save <name of a person whose viewpoint is recorded> --scene <place which is recorded>
```
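One way the key could be supplied and used is sketched below; the embedding model name and the function names are assumptions, not necessarily what myserver.py actually does.

```python
# Sketch of supplying the OpenAI key and calling the embedding / GPT-4o endpoints.
import os

from openai import OpenAI

# export OPENAI_API_KEY=... before starting the server (or place the key where the repo expects it)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def embed_text(text: str) -> list:
    """Vector embedding for image descriptions / user questions (model name assumed)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


def answer_question(question: str, context: str) -> str:
    """Textual answer generation with GPT-4o, conditioned on retrieved descriptions."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using the retrieved memory context."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```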
To enable the HoloLens 2 to capture images and send them to the server, you need to build the HoloLens application from the Unity project.
- Open the encoding scene.
- Make sure that the encoding script is attached to the scene and that your IP and port are set.
- Before building the application, make sure your app has access to the WebCam (Edit - Project Settings - Capabilities - WebCam).
- Build and deploy the solution.
- Run it on the HoloLens 2 device (grant access to the WebCam).
In this stage, answers to the users' questions are retrieved with the help of the LLM.
- Run send_message.py:
```bash
python -m send_message --save <name of a person whose viewpoint is recorded> --scene <place which is recorded> --condition <condition>
```
The first two arguments must be the same as in the Encoding stage; for the condition, you may choose image, text, or image_text.
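For example (the argument values below are illustrative):
```bash
python -m send_message --save alice --scene kitchen --condition image_text
```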
- Create a new project in Unity and make sure you have configured it with the Mixed Reality Toolkit. We used version 2.8.3.0.
- For this project, we especially rely on Microsoft Azure Cognitive Services; the package must be downloaded from here and imported into the Unity project.
- Open the retrieval scene. If everything is set up correctly, you will see the following layout:
- To use the services, provide your Speech SDK credentials.
- Do not forget to set your IP and port!
- Before the build, make sure that the InternetClient, Microphone, and SpatialPerception capabilities in the Publishing settings are enabled.
- Build and deploy the solution; run it on the HoloLens 2.