First of all, thank you for making this repository available! I am investigating the ONNX models within a Java environment. I was able to load and run the ONNX models, and they appear to be working. I ported the Python decoder to Java, and the timecodes all appear to be accurate. The only issue I am having is duplicated tokens within words in the output.
As an example, audio (16 kHz, mono, 16-bit signed PCM) with the following words:
"Einstein's postulate allows us to understand causal relationships in Minkowski space, a mathematical model of space and time."
...becomes this after STT:
eininstestein' postululate allowallows us to understandunderstand causle relrelationationshiips in mkski spspace a mamathemthematicalical modell of space in time
Perhaps the sequentially duplicated tokens are normal and just something that is filtered out? Or could this be related to my front-end audio processing? I am converting to a float array normalized between -1 and 1.
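For reference, the front-end conversion described above can be sketched as follows. This is a minimal illustration, not the code actually used: it assumes little-endian sample order (the usual WAV layout) and a divisor of 32768, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmToFloat {
    // Converts 16-bit signed mono PCM bytes to floats in [-1, 1).
    // Assumption: samples are little-endian; a trailing odd byte is ignored.
    static float[] pcm16ToFloat(byte[] pcm) {
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[pcm.length / 2];
        for (int i = 0; i < out.length; i++) {
            // Dividing by 32768 maps Short.MIN_VALUE to exactly -1.0f.
            out[i] = buf.getShort() / 32768.0f;
        }
        return out;
    }
}
```

If the byte order were wrong, the decoded audio would be noise rather than a mostly correct transcript, so this step is an unlikely source of the repeated tokens.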
The full "AlignDict" output shows accurate start and stop times:
AlignDict{word='eininstestein'', startTs=0.0, endTs=0.55}
AlignDict{word='postululate', startTs=0.55, endTs=1.11}
AlignDict{word='allowallows', startTs=1.11, endTs=1.5}
AlignDict{word='us', startTs=1.5, endTs=1.66}
AlignDict{word='to', startTs=1.66, endTs=1.82}
AlignDict{word='understandunderstand', startTs=1.98, endTs=2.3}
AlignDict{word='causle', startTs=2.38, endTs=2.93}
AlignDict{word='relrelationationshiips', startTs=2.93, endTs=3.72}
AlignDict{word='in', startTs=3.72, endTs=3.88}
AlignDict{word='mkski', startTs=3.88, endTs=4.59}
AlignDict{word='spspace', startTs=4.59, endTs=4.99}
AlignDict{word='a', startTs=5.54, endTs=5.78}
AlignDict{word='mamathemthematicalical', startTs=5.78, endTs=6.42}
AlignDict{word='modell', startTs=6.42, endTs=6.81}
AlignDict{word='of', startTs=6.81, endTs=6.97}
AlignDict{word='space', startTs=6.97, endTs=7.21}
AlignDict{word='in', startTs=7.21, endTs=7.37}
AlignDict{word='time', startTs=7.37, endTs=7.6}
I have verified that the duplicate tokens are not a result of the Decode() operation. The duplicated tokens are already present in the raw token probability outputs.
{label index} : {label}
969 : e
3 : in
3 : in
473 : ste
473 : ste
3 : in
991 : '
998 :
998 :
85 : po
25 : st
114 : ul
114 : ul
147 : ate
998 :
998 :
928 : allow
928 : allow
975 : s
998 :
80 : us
998 :
12 : to
998 :
998 :
0 : _
0 : _
801 : understand
801 : understand
0 : _
998 :
998 :
115 : ca
0 : _
80 : us
0 : _
24 : le
998 :
998 :
713 : rel
713 : rel
102 : ation
102 : ation
143 : sh
973 : i
973 : i
317 : ps
998 :
3 : in
998 :
998 :
982 : m
0 : _
990 : k
0 : _
0 : _
682 : sk
973 : i
998 :
998 :
229 : sp
229 : sp
211 : ace
0 : _
0 : _
0 : _
0 : _
0 : _
0 : _
0 : _
998 :
998 :
971 : a
998 :
167 : ma
167 : ma
169 : them
169 : them
8 : at
334 : ical
334 : ical
998 :
71 : mo
78 : de
978 : l
978 : l
998 :
20 : of
998 :
229 : sp
211 : ace
998 :
3 : in
998 :
998 :
242 : time
0 : _
0 : _
0 : _
0 : _
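The post does not say how the model was trained, but the dump above looks like frame-level output from a CTC-style model, where a token held across several frames appears repeatedly. If that is the case, greedy decoding usually merges consecutive identical labels and then drops the blank label. The sketch below shows that collapse step under two assumptions taken from the dump, not from the repository: index 0 (`_`) is the blank, and index 998 is the word separator. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CtcCollapse {
    // Assumption: label index 0 ("_") is the CTC blank.
    static final int BLANK = 0;

    // Merges runs of identical frame labels, then drops blanks.
    static List<Integer> collapse(int[] frameIds) {
        List<Integer> out = new ArrayList<>();
        int prev = -1; // no label seen yet
        for (int id : frameIds) {
            // Keep a label only when it starts a new run and is not the blank.
            if (id != prev && id != BLANK) {
                out.add(id);
            }
            prev = id;
        }
        return out;
    }
}
```

Applied to the first seven frames above (969, 3, 3, 473, 473, 3, 991), this yields 969, 3, 473, 3, 991, which spells "e in ste in '". A genuinely repeated token survives only when a blank separates the two occurrences, so 3, 0, 3 stays as 3, 3.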