Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. Applying deep learning methods to these sensors for classification or other tasks typically requires large labeled datasets. However, annotating event data is a costly and laborious process. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised pretraining framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. The pretrained model is then finetuned on a downstream task, consistently improving performance on that task while requiring fewer labels. Our method outperforms the state of the art in object classification on three datasets, N-ImageNet, N-Cars, and N-Caltech101, improving the top-1 accuracy of prior work by significant margins. We further find that MEM even surpasses supervised RGB-based pretraining when tested on real-world event data. Models pretrained with MEM exhibit improved label efficiency, especially in low-data regimes, and generalize well to the dense task of semantic image segmentation.
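
As a rough illustration of the masked-modeling idea described above, the sketch below masks random patches of a two-channel (positive/negative polarity) event histogram and trains a small transformer to reconstruct them. All module names, tensor sizes, and the mask ratio are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Minimal sketch of masked event pretraining, assuming events have been
# accumulated into a 2-channel (polarity) histogram per sample.
# Everything here (sizes, mask ratio, model) is an illustrative assumption.
import torch
import torch.nn as nn

class TinyMEM(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, in_ch=2):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.patch_dim = in_ch * patch * patch
        self.embed = nn.Linear(self.patch_dim, dim)           # patch -> token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, self.patch_dim)            # token -> patch

    def patchify(self, x):
        # (B, C, H, W) -> (B, N, C * patch * patch)
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)                 # B,C,H/p,W/p,p,p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, x, mask_ratio=0.6):
        patches = self.patchify(x)                            # B, N, patch_dim
        tokens = self.embed(patches) + self.pos
        # Replace a random subset of tokens with a learned mask token.
        mask = torch.rand(tokens.shape[:2], device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        pred = self.head(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked patches.
        return ((pred - patches) ** 2)[mask].mean()

# Pretraining step on fake polarity histograms (no labels needed).
model = TinyMEM()
events = torch.rand(4, 2, 64, 64)
loss = model(events)
loss.backward()
```

After pretraining on unlabeled recordings in this fashion, the reconstruction head would be discarded and the encoder finetuned with a task-specific head (e.g., a classifier), which is where the label-efficiency gains reported above come from.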