Recent advancements in deep learning and computer vision have led to
widespread use of deep neural networks to extract building footprints from
remote-sensing imagery. The success of such methods relies on the availability
of large databases of high-resolution remote sensing images with high-quality
annotations. The CrowdAI Mapping Challenge Dataset is one such dataset and has
been used extensively in recent years to train deep neural networks.
This dataset consists of $\sim$280k training images and $\sim$60k testing images, with polygonal building annotations for all images. However, issues such as low-quality and incorrect annotations, extensive duplication of image samples, and data leakage significantly reduce the utility of deep neural networks trained on the dataset. It is therefore imperative to adopt a data validation pipeline that evaluates the quality of the dataset prior to its use. To this end, we propose a drop-in pipeline that employs perceptual hashing techniques for efficient de-duplication of the dataset and identification of instances of data leakage between the training and testing splits. In our experiments, we demonstrate that nearly 250k ($\sim$90%)
images in the training split were identical. Moreover, our analysis of the
validation split demonstrates that roughly 56k of its 60k images also appear in
the training split, a data leakage rate of 93%. The source code used
for the analysis and de-duplication of the CrowdAI Mapping Challenge dataset is
publicly available at https://github.com/yeshwanth95/CrowdAI_Hash_and_search .
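As a minimal illustration of the perceptual-hashing idea behind such a de-duplication pipeline, the sketch below implements an average hash (aHash): each image is block-averaged to an 8x8 grid, thresholded against its mean to yield a 64-bit signature, and pairs of signatures are compared by Hamming distance. The specific hash function, image size, and distance threshold here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of perceptual-hash de-duplication using an average hash (aHash).
# Assumption: images are grayscale 2D lists of pixel intensities; a real
# pipeline would decode image files and may use a different hash (e.g. pHash).

def average_hash(pixels, hash_size=8):
    """Return a 64-bit aHash (as a tuple of bits) for a 2D grayscale image."""
    h, w = len(pixels), len(pixels[0])
    bh, bw = h // hash_size, w // hash_size  # block dimensions
    cells = []
    for i in range(hash_size):
        for j in range(hash_size):
            # Average the pixel intensities inside each block.
            block = [pixels[y][x]
                     for y in range(i * bh, (i + 1) * bh)
                     for x in range(j * bw, (j + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # Each bit records whether a cell is brighter than the global mean.
    return tuple(int(c > mean) for c in cells)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def find_duplicates(images, threshold=0):
    """Return pairs of image ids whose hashes differ by <= threshold bits."""
    hashes = {name: average_hash(img) for name, img in images.items()}
    names = sorted(hashes)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if hamming(hashes[a], hashes[b]) <= threshold]

# Toy example: two identical horizontal gradients and one vertical gradient.
img_a = [[x * 16 for x in range(16)] for _ in range(16)]
img_b = [row[:] for row in img_a]          # exact duplicate of img_a
img_c = [[y * 16 for _ in range(16)] for y in range(16)]
print(find_duplicates({"a": img_a, "b": img_b, "c": img_c}))
```

With `threshold=0` only bit-identical hashes are flagged; raising the threshold would additionally catch near-duplicates (e.g. re-compressed or slightly shifted copies), at the cost of possible false positives.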