Why is the target format, for both prediction and training, defined as command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4)?
Will it work on any video I feed it when predicting with the unseen-speaker model weights? (As I understand it, the pipeline extracts the lip region using dlib and then maps the visual content to words.)
This model is trained only on the GRID dataset. If your video says "hello", it won't predict "hello"; instead it will predict some six-word sentence following the fixed GRID grammar command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4). Even with the unseen-speaker weights, you can only predict videos of unseen speakers whose utterances follow that same grammar.
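To make the constraint concrete, here is an illustrative sketch of the GRID sentence grammar. The word lists follow the published GRID corpus description (the letter "w" is excluded, which is why there are 25 letters, and digits are spoken as words); the function name is my own invention, not part of this repository:

```python
import random

# Word categories of the GRID corpus grammar (per the GRID corpus description).
COMMANDS = ["bin", "lay", "place", "set"]           # command(4)
COLORS = ["blue", "green", "red", "white"]          # color(4)
PREPOSITIONS = ["at", "by", "in", "with"]           # preposition(4)
LETTERS = list("abcdefghijklmnopqrstuvxyz")         # letter(25), "w" excluded
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]  # digit(10)
ADVERBS = ["again", "now", "please", "soon"]        # adverb(4)

def random_grid_sentence(rng=random):
    """Every GRID utterance is a fixed six-word sequence, one word per category."""
    return " ".join(rng.choice(words) for words in
                    (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS))
```

So any valid output of the model looks like "bin blue at f two now" — a word like "hello" is simply outside the vocabulary the model can emit.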