LSpec: A simple meta-algorithm to detect outliers in a dataset while training machine learning models

Abstract

A large, high-quality dataset is probably the single most important factor in the success of any machine learning research project. Still, some areas of machine learning suffer from a lack of good datasets, and machine learning on ultrasound images is a typical case. The obstacles are errors and inconsistencies in data entry (for example, labeling an ultrasound image requires assessment by relevant experts, and two experts will sometimes label the same image differently, whether slightly or significantly) and the diversity of data sources (a large ultrasound dataset is typically composed of images from different probes with different resolutions). It is therefore natural to want to detect outliers in a dataset with the least possible effort. During the research project "Avoiding overfitting effect while feeding ultrasound videos for ML training", we (the ODSL team) came up with the idea of using the outputs of a pre-trained or in-training model for semi-supervised clustering, and named it LSpec (short for learning spectrum). Because LSpec is a meta-algorithm, further research is needed to find more effective concrete solutions. As a result of this research project, we present the use of LSpec to detect bad records intentionally mixed into the original MNIST dataset. We will then go on to adapt LSpec for ultrasound images.