Mining contrast sets in classification, regression, and survival data by fusing separate and conquer models

1 April 2022

Abstract

Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In this paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering, a well-established heuristic for decision rule induction. The fusion of multiple passes, accompanied by an attribute penalization scheme, allows the generation of contrast sets that describe the same examples with different attributes, which distinguishes the presented approach from standard sequential covering. The ability to identify contrast sets in regression and survival data sets, a feature not provided by existing algorithms, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and a detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as part of the RuleKit suite, available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: Contrast sets, Sequential covering, Model fusion, Rule induction, Regression, Survival, Knowledge discovery
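
To make the idea of fusing several sequential covering passes with an attribute penalty more concrete, the following is a minimal illustrative sketch for the classification case only. It is not the RuleKit-CS implementation and does not use the RuleKit API: the function names, the precision-based rule quality, the linear penalty, and the toy data are all assumptions chosen for brevity.

```python
from collections import Counter

# Toy sketch (assumed, not RuleKit-CS): a rule is a tuple of
# (attribute, value) conditions; rows are dicts with a "group" label.

def covers(rule, row):
    """True if the row satisfies every condition of the rule."""
    return all(row[a] == v for a, v in rule)

def precision(rule, rows, target):
    """Fraction of covered rows that belong to the target group."""
    covered = [r for r in rows if covers(rule, r)]
    if not covered:
        return 0.0
    return sum(r["group"] == target for r in covered) / len(covered)

def score(rule, rows, target, penalty, usage):
    """Assumed quality: precision minus a penalty for reused attributes."""
    return precision(rule, rows, target) - penalty * sum(usage[a] for a, _ in rule)

def grow_rule(rows, uncovered, target, penalty, usage):
    """Greedily add the condition that most improves the penalized score."""
    rule, best = (), float("-inf")
    while True:
        used = {a for a, _ in rule}
        candidates = sorted({(a, v)
                             for r in uncovered if covers(rule, r)
                             for a, v in r.items()
                             if a != "group" and a not in used})
        if not candidates:
            return rule
        cond = max(candidates,
                   key=lambda c: score(rule + (c,), rows, target, penalty, usage))
        new_score = score(rule + (cond,), rows, target, penalty, usage)
        if new_score <= best:
            return rule
        rule, best = rule + (cond,), new_score

def one_pass(rows, target, penalty, usage):
    """Standard sequential covering: learn a rule, drop what it covers, repeat."""
    uncovered = [r for r in rows if r["group"] == target]
    rules = []
    while uncovered:
        rule = grow_rule(rows, uncovered, target, penalty, usage)
        if not rule:
            break
        rules.append(rule)
        uncovered = [r for r in uncovered if not covers(rule, r)]
    return rules

def mine_contrast_sets(rows, target, passes=2, penalty=0.5):
    """Fuse several covering passes; attributes used earlier are penalized later."""
    usage, fused = Counter(), []
    for _ in range(passes):
        pass_rules = one_pass(rows, target, penalty, usage)
        fused.extend(r for r in pass_rules if r not in fused)
        usage.update(a for rule in pass_rules for a, _ in rule)
    return fused

rows = [
    {"group": "A", "colour": "red", "shape": "round"},
    {"group": "A", "colour": "red", "shape": "round"},
    {"group": "B", "colour": "blue", "shape": "square"},
    {"group": "B", "colour": "blue", "shape": "square"},
]
print(mine_contrast_sets(rows, "A"))
```

In this toy data both `colour` and `shape` separate group A perfectly. A single covering pass stops after the first rule because all examples of A are already covered; the second pass, with `colour` penalized, describes the same examples through `shape` instead, and the fused result contains both descriptions.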
