How do you handle inputs of different sizes? #24

Answered by v-iashin
v-iashin asked this question in Q&A

Thanks for your question.

If you open the feature archives, you will see that the features have different lengths because the videos have different durations. The features have shape (Tv x 1024) for i3d and (Ta x 256) for vggish, where Tv and Ta are the temporal lengths of the visual and audio streams. The 1024-d and 256-d axes are feature dimensions, not temporal ones.

Proposal generation training needs full videos, so the features are either used as-is or trimmed to max_feature_len (800 for audio and 300 for visual, which are temporally equivalent). When training the captioning module, in contrast, we trim the features according to the ground-truth timestamps. This pipeline is defined here:

def load_features_fr…
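
For readers who want to see the two trimming strategies side by side, here is a minimal sketch. It is not the repository's code: the function names `trim_to_max_len` and `crop_to_segment` are hypothetical, and the mapping from seconds to feature indices (linear in the video duration) is an assumption made for illustration. Features are assumed to be non-empty NumPy arrays of shape (T x D).

```python
import numpy as np


def trim_to_max_len(features: np.ndarray, max_feature_len: int) -> np.ndarray:
    """Keep a full-video feature stack as-is, or cut it to max_feature_len rows.

    Hypothetical helper illustrating the proposal-generation case.
    """
    return features[:max_feature_len]


def crop_to_segment(features: np.ndarray, start_s: float, end_s: float,
                    duration_s: float) -> np.ndarray:
    """Keep only the rows that fall inside [start_s, end_s] of the video.

    Hypothetical helper illustrating the captioning case. Assumes the T rows
    are spread uniformly over the video duration.
    """
    num_steps = features.shape[0]
    # Map seconds to row indices (multiply first to limit rounding error).
    start_idx = int(np.floor(start_s * num_steps / duration_s))
    end_idx = int(np.ceil(end_s * num_steps / duration_s))
    # Clamp so that the slice is valid and contains at least one row.
    start_idx = min(max(start_idx, 0), num_steps - 1)
    end_idx = min(max(end_idx, start_idx + 1), num_steps)
    return features[start_idx:end_idx]


# Dummy features standing in for the i3d (Tv x 1024) and vggish (Ta x 256) arrays.
visual = np.zeros((450, 1024), dtype=np.float32)
audio = np.zeros((500, 256), dtype=np.float32)

visual_full = trim_to_max_len(visual, max_feature_len=300)  # shape (300, 1024)
audio_full = trim_to_max_len(audio, max_feature_len=800)    # shape (500, 256), unchanged
visual_seg = crop_to_segment(visual, start_s=10.0, end_s=20.0, duration_s=90.0)  # shape (50, 1024)
```

In both cases only the temporal axis changes; the 1024-d and 256-d feature axes are left untouched.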

v-iashin
Mar 19, 2021
Maintainer Author

Answer selected by v-iashin