Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning

Xiaobo Xia, Jiale Liu, Jun Yu, Xu Shen, Bo Han, Tongliang Liu*

*Corresponding author for this work

Research output: Contribution to conferenceConference paperpeer-review

24 Citations (Scopus)

Abstract

Deep learning methods nowadays rely on massive data, resulting in substantial costs of data storage and model training. Data selection is a useful tool to alleviate such costs, where a coreset of massive data is extracted to practically perform on par with full data. Based on carefully-designed score criteria, existing methods first count the score of each data point and then select the data points whose scores lie in a certain range to construct a coreset. These methods work well in their respective preconceived scenarios but are not robust to the change of scenarios, since the optimal range of scores varies as the scenario changes. The issue limits the application of these methods, because realistic scenarios often mismatch preconceived ones, and it is inconvenient or unfeasible to tune the criteria and methods accordingly. In this paper, to address the issue, a concept of the moderate coreset is discussed. Specifically, given any score criterion of data selection, different scenarios prefer data points with scores in different intervals. As the score median is a proxy of the score distribution in statistics, the data points with scores close to the score median can be seen as a proxy of full data and generalize different scenarios, which are used to construct the moderate coreset. As a proof-of-concept, a universal method that inherits the moderate coreset and uses the distance of a data point to its class centr as the score criterion, is proposed to meet complex realistic scenarios. Extensive experiments confirm the advance of our method over prior state-of-the-art methods, leading to a strong baseline for future research. The implementation is available at https://github.com/tmllab/Moderate-DS.

Original languageEnglish
Pages1-20
Number of pages20
Publication statusPublished - 1 May 2023
Event11th International Conference on Learning Representations, ICLR 2023 - Kigali, Rwanda
Duration: 1 May 20235 May 2023
https://iclr.cc/Conferences/2023
https://openreview.net/group?id=ICLR.cc/2023/Conference

Conference

Conference11th International Conference on Learning Representations, ICLR 2023
Country/TerritoryRwanda
CityKigali
Period1/05/235/05/23
Internet address

Scopus Subject Areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning'. Together they form a unique fingerprint.

Cite this