Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining
The ubiquitous presence of sequence data across fields like web, healthcare, bioinformatics, and text mining has made sequence mining a vital research area. However, sequence mining is particularly challenging because of the absence of an accurate and fast approach for finding the (dis)similarity between sequences. As a measure of (dis)similarity, mainstream data mining methods like k-means, kNN, and regression have shown the distance between data points in a Euclidean space to be the most effective. But a distance measure between sequences is not obvious because sequences are unstructured: they are arbitrary strings of arbitrary length. We therefore propose a new function, called Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. SGT is scalable due to its low computational complexity and is universally applicable to most sequence problems. We show theoretically that SGT can capture both short and long patterns in sequences and provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real-world application to clustering, classification, search, and visualization on different sequence problems.
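To make the idea of embedding arbitrary-length sequences in a finite-dimensional Euclidean space concrete, here is a minimal Python sketch. It is not the SGT formulation from the paper: the function names (`pairwise_decay_features`, `euclidean_distance`), the exponential decay weighting, the averaging step, and the parameter `kappa` are all assumptions made for illustration. The sketch only shows how a fixed-length vector with one entry per ordered symbol pair lets sequences of different lengths be compared with an ordinary Euclidean distance.

```python
import math
from itertools import product


def pairwise_decay_features(sequence, alphabet, kappa=1.0):
    """Map a sequence to a fixed-length vector, one entry per ordered symbol pair (u, v).

    Each entry is the mean of exp(-kappa * gap) over all position pairs in which
    u occurs before v. This weighting is an illustrative assumption, not the
    paper's definition of SGT.
    """
    totals = {pair: 0.0 for pair in product(alphabet, repeat=2)}
    counts = {pair: 0 for pair in totals}
    for i, u in enumerate(sequence):
        for j in range(i + 1, len(sequence)):
            pair = (u, sequence[j])
            if pair in totals:  # ignore symbols outside the alphabet
                totals[pair] += math.exp(-kappa * (j - i))
                counts[pair] += 1
    # Sorting the pairs fixes the coordinate order, so vectors are comparable.
    return [totals[p] / counts[p] if counts[p] else 0.0 for p in sorted(totals)]


def euclidean_distance(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


alphabet = "AB"
x = pairwise_decay_features("ABABAB", alphabet)
y = pairwise_decay_features("ABABABABAB", alphabet)  # same pattern, longer sequence
z = pairwise_decay_features("AAABBB", alphabet)      # same symbols, different pattern

print(euclidean_distance(x, y))
print(euclidean_distance(x, z))
```

In this toy setting every sequence maps to a vector of length |alphabet|², regardless of sequence length, and the two alternating sequences end up closer to each other than to the blocked sequence. A smaller `kappa` lets distant position pairs contribute more, which is one simple way to trade off short-range against long-range patterns.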