I want to train YOLOV on my dataset, but one of the classes (which is in fact the most important one to me) is very hard to detect in any single frame. However, I suspect it might be detectable from context clues that are easier to spot, e.g. the positions of the humans, the other objects they are carrying, and so on.
As an extreme example, you could potentially guess where a transparent fishing line is in a video of a person fishing, if you have a detector that can clearly see the human, the lake, the splash where the bait sits in the water, the rod, etc. Another, more temporal example: you could potentially detect that a fork has food on it, even if you can't see the food very well, if you can clearly detect the fork and the human carrying it from their plate to their mouth and then chewing.
I have all of those classes annotated, and I've trained a still-image detector on them, but it performs poorly on the most important class I mentioned. From what I understand, the paper's training is not end-to-end, but it does use instance features from all classes during aggregation.
Does that mean that, through attention, it could theoretically be better at "guessing" the presence of an object somewhere, based not on the presence of the same object in surrounding frames but on the presence of other objects in those frames? Or does the aggregation only include instances of a given class when refining that class's predictions?
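To make sure I'm asking the right question, here is a minimal sketch of what I mean by class-agnostic aggregation. This is just my mental model, not the repo's actual code; every shape, name, and parameter below is made up for illustration:

```python
# Sketch (NOT YOLOV's implementation): class-agnostic attention over a
# bank of instance features pooled from neighboring frames. The point:
# if attention weights come from feature similarity alone, nothing stops
# a query proposal from attending to proposals of *other* classes, so
# context (human, rod, lake...) could in principle flow into the
# refinement of the hard class.
import torch
import torch.nn as nn

d_model, num_heads = 256, 8          # assumed dimensions
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Suppose each of T neighboring frames contributes N proposal features,
# mixed across all classes.
T, N = 16, 30
bank = torch.randn(1, T * N, d_model)   # pooled instance features, no class mask

# Queries: proposals from the current frame (again, any class).
query = torch.randn(1, N, d_model)

refined, weights = attn(query, bank, bank)
print(refined.shape)   # (1, N, 256): refined proposal features
print(weights.shape)   # (1, N, T*N): each query attends over every proposal
```

If the real aggregation instead masks the feature bank per class before attending, then the fork/human detections couldn't contribute to refining the food prediction, and that is exactly the distinction I'd like to clarify.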
Thanks.