
Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining

Abstract

The ubiquitous presence of sequence data across fields such as the web, healthcare, bioinformatics, and text mining has made sequence mining a vital research area. However, sequence mining is particularly challenging because there is no accurate and fast approach for finding the (dis)similarity between sequences. Mainstream data mining methods such as k-means, kNN, and regression have shown the distance between data points in a Euclidean space to be the most effective measure of (dis)similarity. But a distance measure between sequences is not obvious because sequences are unstructured: they are arbitrary strings of arbitrary length. We therefore propose a new function, Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. SGT is scalable because of its low computational cost, and it is applicable to most sequence problems. We show theoretically that SGT can capture both short and long patterns in sequences and that it provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real-world application to clustering, classification, search, and visualization on different sequence problems.
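
To make the idea of embedding sequences in a finite-dimensional Euclidean space concrete, here is a minimal Python sketch. The abstract does not give SGT's definition, so this is not the paper's transform. It is a simplified stand-in that assumes one feature per ordered pair of symbols, computed as the average of an exponentially decayed gap between their positions. The function names `pair_embedding` and `euclidean` and the decay parameter `kappa` are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def pair_embedding(seq, alphabet, kappa=1.0):
    """Map a sequence of any length to a vector with one entry per ordered symbol pair.

    Illustrative assumption, not the paper's definition: each entry is the mean of
    exp(-kappa * gap) over all position pairs (l, m), l < m, where seq[l] is the
    first symbol of the pair and seq[m] is the second.
    """
    index = {pair: i for i, pair in enumerate(product(alphabet, repeat=2))}
    sums = [0.0] * len(index)
    counts = [0] * len(index)
    for l in range(len(seq)):
        for m in range(l + 1, len(seq)):
            i = index[(seq[l], seq[m])]
            sums[i] += math.exp(-kappa * (m - l))
            counts[i] += 1
    # Pairs that never occur in the sequence get a feature value of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


alphabet = "AB"
x = pair_embedding("ABABAB", alphabet)      # alternating pattern
y = pair_embedding("ABABABABAB", alphabet)  # same pattern, longer sequence
z = pair_embedding("AABBBB", alphabet)      # different pattern, same length as x

print(len(x), len(y), len(z))          # all vectors share one dimension
print(euclidean(x, y), euclidean(x, z))
```

Because every sequence maps to a vector of the same length (here, the square of the alphabet size), distance-based methods such as k-means or kNN can be applied directly. In this toy example the two alternating sequences end up closer to each other than to the sequence with a different pattern, even though their lengths differ.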
