Generating training sets for deep convolutional neural networks (DCNNs) is a bottleneck for modern real-world applications. The task is particularly demanding when annotating training data is costly, as in semantic segmentation. In the literature, there is still a gap between the performance achieved by a network trained on full annotations and one trained on weak annotations. In this paper, we establish a strategy to measure this gap and to identify the ingredients necessary to close it.
On scribbles, we establish new state-of-the-art results: the gap in mIoU is reduced to 2.4% without CRF post-processing and 2.9% with it. Through a systematic study of the ingredients involved in the weakly supervised setting and an original experimental strategy, we uncover a simple, counter-intuitive mechanism that generalises to other weakly supervised scenarios: averaging the locally predicted annotations with the baseline ones and reusing them to train the DCNN yields new state-of-the-art results.
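As a rough illustration of this averaging step (the function name, shapes, and equal-weight blend are our own assumptions, not taken from the paper), the blending of per-pixel class probabilities could be sketched as follows:

```python
import numpy as np

def average_annotations(baseline_probs, predicted_probs, weight=0.5):
    """Blend baseline (weakly derived) label maps with the network's own
    predictions, both given as per-pixel class-probability arrays of shape
    (H, W, num_classes). The blended map is then reused as the training
    target for the next round of DCNN training.

    `weight` and the equal-average default are illustrative choices only.
    """
    blended = weight * baseline_probs + (1.0 - weight) * predicted_probs
    # Renormalise so each pixel's class distribution sums to one.
    blended /= blended.sum(axis=-1, keepdims=True)
    return blended

# Hypothetical usage: hard training targets can be recovered with argmax.
baseline = np.random.dirichlet(np.ones(21), size=(4, 4))   # toy 4x4 image, 21 classes
predicted = np.random.dirichlet(np.ones(21), size=(4, 4))
targets = np.argmax(average_annotations(baseline, predicted), axis=-1)
```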
Finally, closing the gap was reported only recently for bounding boxes by Khoreva et al. (arXiv:1603.07485v2), at the cost of 10x more training images. By simulating varying amounts of pixel-level annotations that respect the statistics of human scribble annotations, we show that our training strategy responds to small increases in the amount of annotations and requires only 2-5x more annotated pixels, closing the gap with only 3.1% of all pixels annotated. This work contributes new ideas towards closing the gap in real-world applications.
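The simulation of partial pixel-level annotations can be pictured, in a deliberately simplified form (uniform random sampling instead of the scribble-shaped spatial statistics the paper preserves, and an assumed ignore-label convention), as subsampling a dense mask down to a target pixel budget:

```python
import numpy as np

IGNORE_LABEL = 255  # common ignore-index convention in segmentation datasets (assumption)

def simulate_partial_annotation(full_mask, annotated_fraction, rng=None):
    """Keep only a random `annotated_fraction` of the pixels of a fully
    annotated mask and mark the rest as ignored. This uniform sampling is
    our own simplification of the paper's scribble-statistics-aware sampling.
    """
    rng = np.random.default_rng(rng)
    simulated = np.full_like(full_mask, IGNORE_LABEL)
    keep = rng.random(full_mask.shape) < annotated_fraction
    simulated[keep] = full_mask[keep]
    return simulated

# Hypothetical usage: retain 3.1% of the pixels of a toy 8x8 mask.
full = np.random.randint(0, 21, size=(8, 8))
partial = simulate_partial_annotation(full, annotated_fraction=0.031, rng=0)
```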