sequentially duplicate token outputs? #268

alex-nugent · 2024-01-23T19:52:09Z

First of all, thank you for making this repository available! I am investigating the ONNX models in a Java environment. I was able to load and run the ONNX models, and they do appear to be working. I ported the Python Decoder to Java, and the timecodes all appear to be accurate. The only issue I am having is duplicated tokens within words in the output.

As an example, audio (16 kHz, mono, 16-bit signed PCM) containing the following words:

"Einstein's postulate allows us to understand causal relationships in Minkowski space, a mathematical model of space and time."

...becomes this after STT:

eininstestein' postululate allowallows us to understandunderstand causle relrelationationshiips in mkski spspace a mamathemthematicalical modell of space in time

Perhaps the sequentially duplicated tokens are normal and just something that is filtered out? Or could this be related to my front-end audio processing? I am converting to a float array normalized between -1 and 1.
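For reference, the front-end conversion I am describing looks roughly like the sketch below. It assumes little-endian sample order (as in a standard WAV file) and a divisor of 32768; `PcmToFloat` and `toFloat` are illustrative names only, not something from this repository.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmToFloat {
    // Converts raw 16-bit signed PCM bytes to floats in [-1, 1).
    // Assumption: samples are little-endian; a trailing odd byte is ignored.
    static float[] toFloat(byte[] pcm) {
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[pcm.length / 2];
        for (int i = 0; i < out.length; i++) {
            // Short.MIN_VALUE (-32768) maps to exactly -1.0f.
            out[i] = buf.getShort() / 32768.0f;
        }
        return out;
    }
}
```

If the byte order were wrong, I would expect noise rather than recognizable words, so this step is probably not the cause, but it is included in case the scaling matters.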

A full dump of the "AlignDict" entries shows accurate start/stop times.

AlignDict{word='eininstestein'', startTs=0.0, endTs=0.55}
AlignDict{word='postululate', startTs=0.55, endTs=1.11}
AlignDict{word='allowallows', startTs=1.11, endTs=1.5}
AlignDict{word='us', startTs=1.5, endTs=1.66}
AlignDict{word='to', startTs=1.66, endTs=1.82}
AlignDict{word='understandunderstand', startTs=1.98, endTs=2.3}
AlignDict{word='causle', startTs=2.38, endTs=2.93}
AlignDict{word='relrelationationshiips', startTs=2.93, endTs=3.72}
AlignDict{word='in', startTs=3.72, endTs=3.88}
AlignDict{word='mkski', startTs=3.88, endTs=4.59}
AlignDict{word='spspace', startTs=4.59, endTs=4.99}
AlignDict{word='a', startTs=5.54, endTs=5.78}
AlignDict{word='mamathemthematicalical', startTs=5.78, endTs=6.42}
AlignDict{word='modell', startTs=6.42, endTs=6.81}
AlignDict{word='of', startTs=6.81, endTs=6.97}
AlignDict{word='space', startTs=6.97, endTs=7.21}
AlignDict{word='in', startTs=7.21, endTs=7.37}
AlignDict{word='time', startTs=7.37, endTs=7.6}

I have verified that the duplicate tokens are not a result of the Decode() operation. The duplicated tokens are already present in the raw token probability outputs.

{label index} : {label}
969 : e
3 : in
3 : in
473 : ste
473 : ste
3 : in
991 : '
998 :
998 :
85 : po
25 : st
114 : ul
114 : ul
147 : ate
998 :
998 :
928 : allow
928 : allow
975 : s
998 :
80 : us
998 :
12 : to
998 :
998 :
0 : _
0 : _
801 : understand
801 : understand
0 : _
998 :
998 :
115 : ca
0 : _
80 : us
0 : _
24 : le
998 :
998 :
713 : rel
713 : rel
102 : ation
102 : ation
143 : sh
973 : i
973 : i
317 : ps
998 :
3 : in
998 :
998 :
982 : m
0 : _
990 : k
0 : _
0 : _
682 : sk
973 : i
998 :
998 :
229 : sp
229 : sp
211 : ace
0 : _
0 : _
0 : _
0 : _
0 : _
0 : _
0 : _
998 :
998 :
971 : a
998 :
167 : ma
167 : ma
169 : them
169 : them
8 : at
334 : ical
334 : ical
998 :
71 : mo
78 : de
978 : l
978 : l
998 :
20 : of
998 :
229 : sp
211 : ace
998 :
3 : in
998 :
998 :
242 : time
0 : _
0 : _
0 : _
0 : _
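One way to check whether the repeats are ordinary per-frame output is to apply a CTC-style greedy collapse to the label indices listed above: merge consecutive identical indices, then drop blanks. The sketch below assumes, without confirmation from the repository, that the model emits one label per frame in CTC fashion, that index 0 (`_`) is the blank, and that index 998 is the word separator. `CtcCollapse`, `collapse`, and `toText` are hypothetical names, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CtcCollapse {
    // Assumption: label index 0 ("_") is the CTC blank.
    static final int BLANK = 0;

    // Keeps an index only when it differs from the previous frame and is not blank.
    static List<Integer> collapse(int[] frameIndices) {
        List<Integer> out = new ArrayList<>();
        int prev = -1; // no previous frame yet
        for (int idx : frameIndices) {
            if (idx != prev && idx != BLANK) {
                out.add(idx);
            }
            prev = idx; // blanks also update prev, so "x _ x" keeps both x's
        }
        return out;
    }

    // Joins the labels of the collapsed indices; unknown indices become "?".
    static String toText(List<Integer> indices, Map<Integer, String> labels) {
        StringBuilder sb = new StringBuilder();
        for (int idx : indices) {
            sb.append(labels.getOrDefault(idx, "?"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Labels copied from the listing above; 998 is assumed to be a space.
        Map<Integer, String> labels = Map.of(
                969, "e", 3, "in", 473, "ste", 991, "'", 998, " ",
                85, "po", 25, "st", 114, "ul", 147, "ate");
        // First fourteen frames from the listing above.
        int[] frames = {969, 3, 3, 473, 473, 3, 991, 998, 998, 85, 25, 114, 114, 147};
        System.out.println(toText(collapse(frames), labels)); // prints "einstein' postulate"
    }
}
```

Applied by hand to the rest of the listing, the same rule turns `rel rel ation ation sh i i ps` into `relationships`, which suggests the repeats may be something the decoder is expected to merge rather than an audio problem. Under this rule a genuine double letter survives only when a blank separates the two frames.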

Replies: 0 comments
