Streaming server, implemented with SocketIO, that receives live video frames from an Android device's camera. The server hosts two deep learning models, a computer vision (CV) model and a natural language processing (NLP) model, and connects them to perform ASL-to-English translation. It also pre-processes the input video frames received from the Android app before sending them to the CV model, and post-processes the output generated by the NLP model before sending the result to the client. The Synaera Android client sends requests to this server and receives translation results using Android's SocketIO API.
On startup, the server loads the two pre-trained deep learning models, the CV model (`Model_13ws_4p_5fps_new.h5`) and the NLP model (`nmt_weights_v8.h5`); these instances are reused for all requests throughout the server's uptime. The server then initializes all socket parameters and sets up socket listeners for session establishment with the Android client and for processing translation requests from the client. The session is initiated by the client, lasts as long as the app is active on the user's device, and is terminated when the internet connection is interrupted or the app is closed.
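A minimal sketch of this startup flow, assuming the server is built on Flask-SocketIO and Keras (an assumption; the event name `receiveImage` comes from the walkthrough below, while `build_nmt_model` and `process_frame` are hypothetical stand-ins for code in `ImageServer.py`):

```python
from flask import Flask
from flask_socketio import SocketIO
from tensorflow.keras.models import load_model

app = Flask(__name__)
socketio = SocketIO(app)

# Load both models once at startup; these instances serve every request.
cv_model = load_model("Model_13ws_4p_5fps_new.h5")
# The NMT model ships as weights only, so its architecture is rebuilt first
# (build_nmt_model is a hypothetical stand-in), then the weights are restored.
nlp_model = build_nmt_model()
nlp_model.load_weights("nmt_weights_v8.h5")

@socketio.on("connect")
def on_connect():
    # Session establishment with the Android client.
    print("client session established")

@socketio.on("receiveImage")
def on_receive_image(frame_bytes):
    # Hand each incoming frame to a worker so the socket loop never blocks.
    socketio.start_background_task(process_frame, frame_bytes)

if __name__ == "__main__":
    socketio.run(app, host="0.0.0.0", port=5000)
```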
- When the user starts recording from the camera screen of the app, the Android client triggers the `receiveImage` event on the server (see `ImageServer.py`) to send the video frames one by one. Each frame is processed in a new thread and prepared for the CV model.
- Using MediaPipe Holistic, keypoints are identified on the person's face, hands and pose, extracted from each frame, and stored as xyz values (sketched in the first example after this list). Note that the server never stores the images after this extraction is performed.
- The xyz values are accumulated into a sequence and sent to the CV model, which detects the sign/action performed by the user.
- Every new prediction made by the CV model is appended to a sentence array. As soon as no sign is being performed (the user's hands are no longer visible in the camera frame) or the user stops recording, the sentence array is sent to the NLP model to form an English sentence from the predicted ASL signs (sketched in the second example after this list).
- The NLP model's result is formatted to fix punctuation and sentence case and to remove any extra spaces.
- The result is sent back to the Android client via the client's callback event, where it is displayed as a caption on the camera screen.
- This process repeats for as long as the user is recording from the camera; translations are generated in real time while the user is signing.
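As a concrete illustration of the keypoint-extraction step, here is a sketch using MediaPipe Holistic. The flattened vector layout and the zero-padding of undetected body parts are assumptions about how the real server stores the xyz values:

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten face, hand and pose landmarks from one frame into xyz values.
    Parts MediaPipe did not detect are zero-padded so the vector length is
    constant (the pose has 33 landmarks, the face mesh 468, each hand 21)."""
    pose = (np.array([[lm.x, lm.y, lm.z] for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 3))
    face = (np.array([[lm.x, lm.y, lm.z] for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left = (np.array([[lm.x, lm.y, lm.z] for lm in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[lm.x, lm.y, lm.z] for lm in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, left, right])

# Usage: run each decoded BGR frame through Holistic, then discard the image.
with mp_holistic.Holistic(min_detection_confidence=0.5) as holistic:
    frame = cv2.imread("example_frame.jpg")        # stand-in for a streamed frame
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    keypoints = extract_keypoints(results)         # only these values are kept
```

Only the returned keypoint vector is kept; the decoded frame can be dropped as soon as `holistic.process()` returns, which matches the note above that images are never stored.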
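The per-frame predictions could then feed the NLP stage roughly as follows, including the final formatting pass. The window size, the `actions` vocabulary, and the `translate()` helper are hypothetical stand-ins for the real model interfaces:

```python
import re
import numpy as np

SEQUENCE_LENGTH = 30   # assumed frames per prediction window; the real model defines its own
sequence, sentence = [], []

def on_keypoints(keypoints, hands_visible, cv_model, nlp_model, actions):
    """Buffer keypoints, predict one sign per window, translate on pause."""
    sequence.append(keypoints)
    if len(sequence) == SEQUENCE_LENGTH:
        probs = cv_model.predict(np.expand_dims(sequence, axis=0))[0]
        sentence.append(actions[int(np.argmax(probs))])   # e.g. "ME", "GO", "STORE"
        sequence.clear()
    if not hands_visible and sentence:
        # No sign in frame: hand the buffered glosses to the NLP model.
        english = translate(nlp_model, sentence)          # hypothetical seq2seq decode
        sentence.clear()
        return format_result(english)

def format_result(text):
    """Post-process the NLP output: collapse extra spaces, fix sentence
    case, and make sure the sentence ends with punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    text = text[:1].upper() + text[1:]
    if text and text[-1] not in ".!?":
        text += "."
    return text
```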
This server is currently hosted on an Azure Cloud VM, but can also be run locally.
- Python 3.7–3.10
Clone this repository on your machine and navigate to the `Server` directory:

```sh
git clone https://github.com/MehrinFirdousi/Synaera-TeamSemaphore.git
cd Synaera-TeamSemaphore/Server
```
Install dependencies:

```sh
pip install -r requirements.txt
```
Start the server:

```sh
python3 ImageServer.py
```

[Optional] Start the server in debug mode to view the frames being received by the server:

```sh
python3 ImageServer.py -d
```
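For reference, the `-d` switch could be wired up along these lines; this is a guess at the mechanism using argparse and OpenCV, not a description of the actual flag handling in `ImageServer.py`:

```python
import argparse
import cv2

parser = argparse.ArgumentParser(description="Synaera streaming server")
parser.add_argument("-d", "--debug", action="store_true",
                    help="display each frame received from the client")
args = parser.parse_args()

# Inside the frame handler, when debug mode is on:
def show_frame(frame):
    if args.debug:
        cv2.imshow("received frame", frame)   # frame: decoded BGR image
        cv2.waitKey(1)                        # keep the preview window responsive
```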
Since this is a client-server use case, to test the server locally you must have the Synaera app installed and running on your Android device. The device must be connected to the internet and on the same LAN as the server. Also note that this application is meant to run on a dedicated cloud-based server, so when testing locally the translation will be relatively slow: a typical PC is not powerful enough to run the intensive ML operations involved in the translation process at full speed.