Pretrained models performing poorly on dense video captioning #22

Open
lawrenceztang opened this issue Aug 22, 2024 · 1 comment

lawrenceztang commented Aug 22, 2024

The HowTo100M + VidChapters-7M + ViTT model performs poorly on dense video captioning: for the video below, it repeats nearly the same caption for almost every segment.

Reproduction:

Run

yt-dlp -P $TRANSFORMERS_CACHE -o video.mp4 https://www.youtube.com/watch?v=WJPyrQqLTl4

to download this specific video.

Follow the steps in the demo using the HowTo100M + VidChapters-7M + ViTT checkpoint.

Output captions:

[{'sentence': 'Dress up for the ceremony.', 'timestamp': [350.1055353535354, 395.77147474747477]}, {'sentence': 'Take a photo with me.', 'timestamp': [395.77147474747477, 426.21543434343437]}, {'sentence': 'Take a photo with me.', 'timestamp': [426.21543434343437, 456.659393939394]}, {'sentence': 'Take a photo with me.', 'timestamp': [471.88137373737374, 487.10335353535356]}, {'sentence': 'Take a photo with me.', 'timestamp': [487.10335353535356, 517.5473131313131]}, {'sentence': 'Take a picture with me.', 'timestamp': [547.9912727272728, 578.4352323232324]}, {'sentence': 'Take a picture with me.', 'timestamp': [593.6572121212122, 608.879191919192]}, {'sentence': 'Take a picture with me.', 'timestamp': [639.3231515151516, 654.5451313131314]}, {'sentence': 'Take a picture with me.', 'timestamp': [669.7671111111111, 684.9890909090909]}, {'sentence': 'Take a picture with me.', 'timestamp': [684.9890909090909, 700.2110707070708]}, {'sentence': 'Take a picture with me.', 'timestamp': [730.6550303030302, 745.8770101010102]}, {'sentence': 'Take a picture with me.', 'timestamp': [745.8770101010102, 761.0989898989899]}, {'sentence': 'Take a picture with me.', 'timestamp': [791.5429494949495, 806.7649292929293]}, {'sentence': 'Take a picture with me.', 'timestamp': [806.7649292929293, 821.9869090909092]}, {'sentence': 'Take a picture with me.', 'timestamp': [837.208888888889, 852.4308686868687]}, {'sentence': 'Take a picture with me.', 'timestamp': [852.4308686868687, 867.6528484848486]}, {'sentence': 'Take a picture with me.', 'timestamp': [882.8748282828284, 913.318787878788]}, {'sentence': 'Take a picture with me.', 'timestamp': [913.318787878788, 928.5407676767677]}, {'sentence': 'Take a picture with me.', 'timestamp': [928.5407676767677, 943.7627474747475]}, {'sentence': 'Take a picture with me.', 'timestamp': [958.9847272727274, 974.2067070707071]}, {'sentence': 'Take a picture with me.', 'timestamp': [989.4286868686869, 1019.8726464646466]}, {'sentence': 'Take a picture with me.', 'timestamp': [1035.0946262626262, 1065.538585858586]}, {'sentence': 'Take a picture with me.', 'timestamp': [1080.7605656565656, 1111.2045252525254]}, {'sentence': 'Take a picture with me.', 'timestamp': [1111.2045252525254, 1141.648484848485]}, {'sentence': 'Take a picture with me.', 'timestamp': [1141.648484848485, 1156.8704646464648]}, {'sentence': 'Take a picture with me.', 'timestamp': [1156.8704646464648, 1172.0924444444445]}, {'sentence': 'Take a picture with me.', 'timestamp': [1172.0924444444445, 1202.536404040404]}, {'sentence': 'Take a picture with me.', 'timestamp': [1202.536404040404, 1217.758383838384]}, {'sentence': 'Take a', 'timestamp': [1217.758383838384, 1232.9803636363638]}]
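To put a number on the repetition in output like this, a small script along these lines may help. It is only a sketch: `summarize_repetition` is a hypothetical helper, not part of the repository, and it assumes the captions come as a list of dicts with `sentence` and `timestamp` keys, as printed above. The sample holds just the first four entries of that output.

```python
from collections import Counter


def summarize_repetition(captions):
    """Return (total, distinct, most common sentence, its count) for a caption list."""
    sentences = [c["sentence"] for c in captions]
    counts = Counter(sentences)
    top_sentence, top_count = counts.most_common(1)[0]
    return len(sentences), len(counts), top_sentence, top_count


# First four entries copied from the output above (hypothetical variable name).
sample = [
    {"sentence": "Dress up for the ceremony.", "timestamp": [350.1055353535354, 395.77147474747477]},
    {"sentence": "Take a photo with me.", "timestamp": [395.77147474747477, 426.21543434343437]},
    {"sentence": "Take a photo with me.", "timestamp": [426.21543434343437, 456.659393939394]},
    {"sentence": "Take a photo with me.", "timestamp": [471.88137373737374, 487.10335353535356]},
]

total, distinct, top, n = summarize_repetition(sample)
print(f"{total} captions, {distinct} distinct; most common: {top!r} x{n}")
```

Running the same function on the full list gives a quick way to compare checkpoints or decoding settings on the same video.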

mylo19 commented Sep 6, 2024

I have the same issue with the HowTo100M + VidChapters-7M + YouCook2 model. For this video the model gives these captions:

[{'sentence': 'Take a look at the hood.', 'timestamp': [3.534141414141414, 10.602424242424242]}, {'sentence': 'Take a look at the hood.', 'timestamp': [12.369494949494948, 22.97191919191919]}, {'sentence': 'Take a look at the hood.', 'timestamp': [22.97191919191919, 33.57434343434343]}, {'sentence': 'Take a look at the hood.', 'timestamp': [35.34141414141414, 45.94383838383838]}, {'sentence': 'Take a look at the hood.', 'timestamp': [45.94383838383838, 56.546262626262624]}, {'sentence': 'Take a look at the hood.', 'timestamp': [56.546262626262624, 65.38161616161617]}, {'sentence': 'Take a look at the hood.', 'timestamp': [65.38161616161617, 74.2169696969697]}, {'sentence': 'Take a look at the hood.', 'timestamp': [77.75111111111111, 86.58646464646465]}, {'sentence': 'Take a look at the hood.', 'timestamp': [86.58646464646465, 95.42181818181818]}, {'sentence': 'Take a look at the hood.', 'timestamp': [97.1888888888889, 107.79131313131313]}, {'sentence': 'Take a look at the hood.', 'timestamp': [109.55838383838385, 118.39373737373737]}, {'sentence': 'Take a look at the hood.', 'timestamp': [120.16080808080808, 123.69494949494948]}, {'sentence': 'Take a look at the hood.', 'timestamp': [123.69494949494948, 132.53030303030303]}, {'sentence': 'Take a look at the hood.', 'timestamp': [132.53030303030303, 139.59858585858586]}, {'sentence': 'Take a look at the hood.', 'timestamp': [139.59858585858586, 144.89979797979797]}, {'sentence': 'Take a look at the hood.', 'timestamp': [144.89979797979797, 150.2010101010101]}, {'sentence': 'Take a look at the hood.', 'timestamp': [150.2010101010101, 155.50222222222223]}, {'sentence': 'Take a look at the hood.', 'timestamp': [155.50222222222223, 160.80343434343433]}]

while for this video it gives:

[{'sentence': 'Introduktion.', 'timestamp': [0.0, 3.2334545454545456]}, {'sentence': 'ffningen.', 'timestamp': [3.2334545454545456, 9.700363636363637]}, {'sentence': 'ffningen.', 'timestamp': [9.700363636363637, 21.017454545454545]}, {'sentence': 'ffningen.', 'timestamp': [21.017454545454545, 30.717818181818185]}, {'sentence': 'ffningen.', 'timestamp': [30.717818181818185, 42.03490909090909]}, {'sentence': 'ffningen.', 'timestamp': [42.03490909090909, 51.73527272727273]}, {'sentence': 'ffningen.', 'timestamp': [51.73527272727273, 61.43563636363637]}, {'sentence': 'ffningen.', 'timestamp': [61.43563636363637, 75.98618181818182]}, {'sentence': 'ffningen.', 'timestamp': [75.98618181818182, 87.30327272727274]}, {'sentence': 'ffningen.', 'timestamp': [87.30327272727274, 98.62036363636365]}, {'sentence': 'ffningen.', 'timestamp': [98.62036363636365, 113.17090909090909]}, {'sentence': 'ffningen.', 'timestamp': [113.17090909090909, 127.72145454545455]}, {'sentence': 'ffningen.', 'timestamp': [127.72145454545455, 147.12218181818184]}, {'sentence': 'ffningen.', 'timestamp': [147.12218181818184, 151.97236363636364]}]
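To make outputs like these easier to read while debugging, consecutive identical captions can be collapsed into one segment. The following is a minimal sketch, not a fix for the underlying problem: `merge_repeated` and its `max_gap` parameter are hypothetical and do not come from the repository, and the sample is just the first four entries of the second output above.

```python
def merge_repeated(captions, max_gap=0.0):
    """Collapse consecutive captions with the same sentence into one segment.

    Two captions are merged when their sentences match and the gap between
    the end of the previous one and the start of the next is <= max_gap seconds.
    """
    merged = []
    for cap in captions:
        start, end = cap["timestamp"]
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["sentence"] == cap["sentence"]
            and start - prev["timestamp"][1] <= max_gap
        ):
            # Extend the previous segment instead of adding a duplicate.
            prev["timestamp"][1] = end
        else:
            merged.append({"sentence": cap["sentence"], "timestamp": [start, end]})
    return merged


# First four entries copied from the second output above (hypothetical variable name).
sample = [
    {"sentence": "Introduktion.", "timestamp": [0.0, 3.2334545454545456]},
    {"sentence": "ffningen.", "timestamp": [3.2334545454545456, 9.700363636363637]},
    {"sentence": "ffningen.", "timestamp": [9.700363636363637, 21.017454545454545]},
    {"sentence": "ffningen.", "timestamp": [21.017454545454545, 30.717818181818185]},
]

for segment in merge_repeated(sample):
    print(segment)
```

A larger `max_gap` would also merge repeats separated by short gaps, such as those in the first output from this comment.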
