SFA

SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets. Time series analysis, as an application for high dimensional data mining, is a common task in biochemistry, meteoro- logy, climate research, bio-medicine or marketing. Similarity search in data with increasing dimensionality results in an exponential growth of the search space, referred to as Curse of Dimensionality. A common approach to postpone this ef- fect is to apply approximation to reduce the dimensionality of the original data prior to indexing. However, approxim- ation involves loss of information, which also leads to an exponential growth of the search space. Therefore, index- ing an approximation with a high dimensionality, i.e. high quality, is desirable. We introduce Symbolic Fourier Approximation (SFA) and the SFA trie which allows for indexing of not only large data- sets but also high dimensional approximations. This is done by exploiting the trade-off between the quality of the ap- proximation and the degeneration of the index by using a variable number of dimensions to represent each approxim- ation. Our experiments show that SFA combined with the SFA trie can scale up to a factor of 5–10 more indexed di- mensions than previous approaches. Thus, it provides lower page accesses and CPU costs by a factor of 2–25 respect- ively 2–11 for exact similarity search using real world and synthetic data.