Using Machine Learning to Identify Data Contamination

Figure: (Top) Root-mean-square error between predictions (from the model trained with both contaminated and non-contaminated data) and ERA5-derived TCWV. The purple boxes highlight areas with artifacts resembling satellite orbit patterns. (Bottom) Same as the top figure but with predictions from the CNN model trained with 100% and 25% contaminated data. Spatial artifacts in the top figure are not as noticeable here.

When working with large quantities of multi-dimensional data from multiple sources, some dataset contamination is inevitable. Many methods have been developed to diagnose such contamination, but unforeseen errors can still slip through. CISESS Scientists Malarvizhi Arulraj and Veljko Petkovic and their colleagues explore this topic, prompted by an error they noticed in their own research.


While using simulated observations from a future space-borne passive microwave sensor to develop a convolutional neural network (CNN)-based model for predicting Total Column Water Vapor (TCWV), the authors discovered a time-matching oversight that had contaminated a small portion of their TCWV dataset. This prompted them to study how the amount of contaminated data used for training affects how well the model performs and detects errors. They found that the CNN model's output does not reveal the data issue when the model is trained only on contaminated data (see Figure). As the percentage of contaminated samples in the training dataset decreased, however, the model's ability to expose the issue increased. They conclude that, since some level of data contamination is unavoidable, users, especially benchmark dataset developers, should carefully examine the data going into their models to understand its quality.
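The figure's diagnostic, a map of root-mean-square error between model predictions and an independent reference (ERA5-derived TCWV in the paper), can be sketched in a few lines. The following is a minimal toy illustration, not the authors' code: the field shapes, the injected "swath" of extra error, and the z-score threshold for flagging artifacts are all assumptions made for demonstration.

```python
import numpy as np

def rmse_map(pred, ref):
    """Per-grid-cell RMSE between stacks of prediction and reference
    fields, shape (time, lat, lon) -> (lat, lon)."""
    return np.sqrt(np.mean((pred - ref) ** 2, axis=0))

def flag_artifacts(err_map, z_thresh=2.0):
    """Flag grid cells whose RMSE is anomalously high relative to the
    map's own distribution -- a crude stand-in for eyeballing
    orbit-shaped artifacts in an RMSE map."""
    z = (err_map - err_map.mean()) / err_map.std()
    return z > z_thresh

# Toy demonstration with synthetic fields: the model is accurate
# everywhere except a narrow band of columns, mimicking contamination
# confined to a satellite-orbit-like swath.
rng = np.random.default_rng(0)
t, ny, nx = 50, 40, 80
ref = rng.normal(size=(t, ny, nx))
pred = ref + 0.1 * rng.normal(size=(t, ny, nx))        # small error overall
pred[:, :, 30:35] += 0.8 * rng.normal(size=(t, ny, 5))  # extra error in a swath

err = rmse_map(pred, ref)
mask = flag_artifacts(err)
print("flagged columns:", sorted(set(np.where(mask)[1])))
```

Because the swath's RMSE (about 0.8) sits several standard deviations above the rest of the map (about 0.1), only the contaminated columns are flagged. The paper's point is that this kind of signal fades as the training set itself becomes dominated by the contaminated samples.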

Citation: Arulraj, Malarvizhi, Veljko Petkovic, Huan Meng, and Ralph R. Ferraro, 2025: Lessons learned: Can machine learning model expose dataset contamination? Artif. Intell. Earth Syst., accepted, https://doi.org/10.1175/AIES-D-25-0030.1.

This article was put together by the CISESS coordinators based on scientist input.


Debra Baker

Debra Baker is the Coordinator for the Cooperative Institute for Satellite Earth System Studies (CISESS) at the University of Maryland. She received her M.S. in atmospheric science from the University of Maryland, College Park. Before joining ESSIC in 2013, she worked on air quality issues at the Maryland Department of the Environment. Debra also has a law degree from Harvard Law School.


Kate Cooney

Katherine Cooney is a part-time faculty assistant at the Cooperative Institute for Satellite Earth System Studies (CISESS). Kate received a B.S. in environmental science and policy from the University of Maryland (UMD), College Park. She later earned an M.S. in geology from UMD while investigating the isotopic fractionation of precipitation nitrate under the guidance of Distinguished University Professor James Farquhar. After graduation, she worked as an air-quality specialist at the Mid Atlantic Regional Air Management Association in Baltimore, Maryland. While her family was stationed in Tokyo, Japan, she dedicated her time to serving military families and the local community. She is grateful for the opportunity to return to earth system studies, supporting the CISESS Business Office and assisting the CISESS Coordinator Deb Baker since January 2021.


Maureen Cribb