Automatic Classifiers as Scientific Instruments: One Step Further Away from Ground-Truth

Abstract

Automatic detectors of facial expression, gesture, affect, etc., can serve as scientific instruments to measure many behavioral and social phenomena (e.g., emotion, empathy, stress, engagement), and this has great potential to advance basic science. However, when a detector $d$ is trained to approximate an existing measurement tool (e.g., an observation protocol or questionnaire), care must be taken when interpreting measurements collected using $d$, since they are one step further removed from the underlying construct. We examine how the accuracy of $d$, as quantified by the correlation $q$ of $d$'s outputs with the ground-truth construct $U$, impacts the estimated correlation between $U$ (e.g., stress) and some other phenomenon $V$ (e.g., academic performance). In particular: (1) We show that if the true correlation between $U$ and $V$ is $r$, then the expected sample correlation, over all vectors $\mathcal{T}^n$ whose correlation with $U$ is $q$, is $qr$. (2) We derive a formula to compute the probability that the sample correlation (over $n$ subjects) using $d$ is positive, given that the true correlation between $U$ and $V$ is negative (and vice versa). We show that this probability is non-negligible (around 10-15%) for values of $n$ and $q$ that have been used in recent affective computing studies. (3) With the goal of reducing the variance of correlations estimated by an automatic detector, we show empirically that training multiple neural networks $d^{(1)},\ldots,d^{(m)}$ using different training configurations (e.g., architectures, hyperparameters) for the same detection task provides only limited "coverage" of $\mathcal{T}^n$.
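
The attenuation in (1) and the sign-flip risk in (2) can be illustrated with a small Monte Carlo sketch. This is not the paper's derivation or formula. It is a hedged simulation under stated assumptions: the "true" correlation is taken to be the sample correlation between fixed vectors `u` and `v`, set to exactly `r`, and detector outputs are drawn uniformly from the centered unit vectors whose sample correlation with `u` is exactly `q`. The function names `center_unit`, `random_vector_with_correlation`, and `simulate` are hypothetical.

```python
import numpy as np


def center_unit(x):
    """Subtract the mean and scale to unit length."""
    x = x - x.mean()
    return x / np.linalg.norm(x)


def random_vector_with_correlation(u, q, rng):
    """Draw a centered unit vector whose Pearson correlation with u is exactly q.

    Assumption: the direction orthogonal to u is drawn uniformly at random.
    """
    u_hat = center_unit(u)
    z = rng.standard_normal(u.shape[0])
    z = z - z.mean()                 # make z orthogonal to the constant vector
    z = z - (z @ u_hat) * u_hat      # make z orthogonal to u_hat
    w = z / np.linalg.norm(z)
    return q * u_hat + np.sqrt(1.0 - q**2) * w


def simulate(n=50, q=0.6, r=-0.3, trials=20000, seed=0):
    """Return (mean detector-based correlation, fraction with the wrong sign)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    # Assumption: v is fixed so that corr(u, v) equals r exactly.
    v_hat = random_vector_with_correlation(u, r, rng)
    corrs = np.empty(trials)
    for i in range(trials):
        t = random_vector_with_correlation(u, q, rng)
        # Both vectors are centered with unit norm, so the dot product
        # equals the Pearson correlation between t and v.
        corrs[i] = t @ v_hat
    wrong_sign = np.mean(np.sign(corrs) != np.sign(r))
    return corrs.mean(), wrong_sign


if __name__ == "__main__":
    mean_corr, wrong_sign = simulate()
    print(f"mean correlation via detector: {mean_corr:.3f} (q*r = {0.6 * -0.3:.3f})")
    print(f"fraction of draws with the wrong sign: {wrong_sign:.3f}")
```

Under these assumptions the average correlation should land close to $qr$, and the wrong-sign fraction grows as $n$ or $q$ shrinks. The specific values of `n`, `q`, and `r` above are illustrative choices, not figures taken from the paper.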