Measuring Time Series’ Similarity through Large Singular Features Revealed with Wavelet Transformation

Zbigniew R. Struzik, Arno Siebes
Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
email: [email protected]

Abstract

For the majority of data mining applications, there are no models of the data which would facilitate the task of comparing records of time series. We propose a generic approach to comparing noisy time series using the largest deviations from consistent statistical behaviour. For this purpose we use a powerful framework based on wavelet decomposition, which allows filtering of polynomial bias while capturing the essential singular behaviour. In addition, we are able to reveal a scale-wise ranking of singular events, including their scale-free characteristic: the Hölder exponent. We use a set of such characteristics to design a compact representation of the time series suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations corresponds closely with the subjective feeling of similarity between the time series. In order to test the validity of the subjective criteria, we test records of currency exchange rates, finding convincing levels of (local) correlation.

Note: This work has been carried out under the Impact project.

1. Introduction

The importance of similarity measures in data mining is easily underestimated. This is because most algorithms assume relational data, where similarity is implicitly measured by the similarity (or even equality) of values for given attributes. However, the importance of similarity measures becomes apparent the moment one considers mining non-relational data, such as time series. In such cases, patterns should describe sets of time series with similar behaviour and, thus, a similarity measure is necessary. This similarity measure could be implicit in completely new algorithms which work for this special type of data. Or, the similarity measure could be explicit and used with the standard data mining algorithms. In the context of time series, one way to achieve this is to extract a (fixed) number of features from the time series so that similar time series have similar features (e.g., as numerical values) and vice versa. This is the approach we, and others [1, 2, 3, 4, 5], follow in our research. In other words, we want to define the similarity of time series through a number of features.

In general, quantitative similarity estimation between time series in data mining applications seemingly suffers from a serious internal inconsistency. On the one hand, one wants the similarity to be independent of a large class of linear transformations, such as (amplitude, time) rescaling or the addition of a linear trend or constant bias. This is understandable, since most such operations affect the parameter values of commonly used estimators (e.g. the power spectrum), or destroy any stationarity potentially present in the time series, making estimation impossible. On the other hand, the subjective, qualitative judgement of similarity (by humans) is based precisely on non-stationary behaviour: rapid transients marking the beginnings of trends, extreme fluctuations and, generally speaking, large but rare events. Such local fluctuations, characteristic of single realisations, usually make statistical estimation difficult and result in unreliable estimates. In particular, it is common knowledge that evaluating data distributions from short data sets is an awkward task: the statistics are limited, and local fluctuations of the data override consistent statistical behaviour. However, what is a great disadvantage from a statistical point of view can be an advantage in another context.

In this paper, we propose a method of characterising time series which relies on precisely such deviations from consistent statistical behaviour, as caused by the non-stationarity of the data. We will show how large local fluctuations in relatively short data sets carry the relevant information about the transient ‘shape’ of the time series. In particular, we can then make use of them in order to provide a very compact set of characteristics of the time series, useful for correlation or matching purposes.

But what if the time series data in our application is long enough to give good statistical estimates? The way to go is, of course, to reduce the data length in order to increase the influence of large local fluctuations! What sounds unreasonable is perfectly admissible and technically possible: coarse-grain the data using so-called wavelet filters, within the Wavelet Transformation scheme. In this paper, we will demonstrate how the Wavelet Transform, resulting from the scale-wise decomposition of time series data, provides a natural way to obtain a scale-wise ranking of events in the time series. In addition, by evaluating both the local scaling estimates and the spectral density of singular behaviour in the time series, we will be able to indicate rare events locally in the time series. These will then be used for the purpose of (locally) correlating time series through large or rare events.

In section 2, we will focus on the relevant aspects of the wavelet transformation, in particular its ability to characterise the scale-free behaviour of characteristic events in time series, such as ‘crash’ singularities. The link between such singularities and the non-stationary behaviour of time series will be postulated; together with the hierarchical scale-wise decomposition provided by the wavelet transform, it will enable us to select the interesting large-scale features. In section 3, we will discuss the h-representation of time series, utilising the large-scale characteristics with properly estimated exponents. The issues of the distance metric in the representation and of the correlation between representations will be addressed in section 4. This is followed by a test case correlating examples of currency exchange rates. Section 5 closes the paper with conclusions and suggestions for future developments.
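The feature-based notion of similarity sketched above can be made concrete with a toy example. This is only an illustration of the general idea (a fixed-length feature vector, insensitive to linear trend and rescaling, compared via a correlation product); the jump-based features here are hypothetical stand-ins, not the wavelet-derived Hölder characteristics developed later in the paper:

```python
import numpy as np

def feature_vector(x, k=4):
    """Toy fixed-length features: positions and relative magnitudes of the
    k largest jumps in the linearly detrended series (illustrative only)."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    # remove linear trend and constant bias, so the features ignore them
    r = x - np.polyval(np.polyfit(t, x, 1), t)
    d = np.abs(np.diff(r))
    idx = np.sort(np.argsort(d)[-k:])          # indices of the k largest jumps
    # normalise positions by length and magnitudes by the largest jump,
    # so that (amplitude, time) rescaling cancels out
    return np.concatenate([idx / len(x), d[idx] / (d.max() + 1e-12)])

def similarity(x, y, k=4):
    """Normalised correlation product of the two feature vectors."""
    fx, fy = feature_vector(x, k), feature_vector(y, k)
    fx = fx - fx.mean()
    fy = fy - fy.mean()
    return float(fx @ fy / (np.linalg.norm(fx) * np.linalg.norm(fy) + 1e-12))

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(256))        # random-walk sample
y = 2.0 * x + 0.01 * np.arange(256) + 5.0      # rescaled + trend + bias
```

Because the features are invariant under amplitude rescaling, trend and bias, `similarity(x, y)` is close to 1 even though `y` looks quite different from `x` pointwise.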

2. Continuous Wavelet Transform and its Maxima Used to Reveal the Structure of the Time Series

As already mentioned above, the recently introduced Wavelet Transform (WT), see e.g. Refs. [6, 7], provides a way of analysing the local behaviour of functions. In this, it fundamentally differs from global transforms like the Fourier Transform. In addition to locality, it possesses the often very desirable ability of filtering out polynomial behaviour up to some predefined degree. Correct characterisation of time series is therefore possible, in particular in the presence of non-stationarities like global or local trends or biases. One aspect of the WT which is of great advantage for our purpose is its ability to reveal the hierarchy of (singular) features, including their scaling behaviour: the so-called scale-free behaviour.

Conceptually, the wavelet transform is a convolution product of the time series $f$ with the scaled and translated wavelet $\psi$, usually the $n$-th derivative of a smoothing kernel $\theta$. In the absence of other criteria, the preferred choice is a kernel well localised in both frequency and position. In this paper, we have chosen the Gaussian as the smoothing kernel, since it has optimal localisation in both domains. The scaling and translation actions are performed by two parameters: the scale parameter $s$ ‘adapts’ the width of the wavelet kernel to the microscopic resolution required, thus changing its frequency contents, while the location of the analysing wavelet is determined by the parameter $b$:

$$ W f(s, b) = \frac{1}{s} \int f(t)\, \psi\!\left(\frac{t - b}{s}\right) \mathrm{d}t \qquad (1) $$

where $s, b \in \mathbb{R}$ and $s > 0$ for the continuous version of the Wavelet Transformation (CWT).

Figure 1. Continuous Wavelet Transform representation of the random walk (Brownian process) time series. The wavelet used is the Mexican hat: the second derivative of the Gaussian kernel. The coordinate axes are: position $b$, scale $s$ in logarithm $\log(s)$, and the value of the transform $W f(s, b)$.

In figure 1, we show the wavelet transform of a random walk sample decomposed with the Mexican hat wavelet: the second derivative of the Gaussian kernel. By definition, the transform retains all the temporal locality properties; the position axis is in the forefront of the 3D plot. The standard way of presenting the CWT is on a logarithmic scale, so the scale axis pointing ‘in depth’ of the plot is $\log(s)$. The vertical third axis denotes the magnitude of the transform $W f(s, b)$. The 3D plot shows how the wavelet transform reveals more and more detail while going towards smaller scales, i.e. towards smaller $\log(s)$.
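A minimal numerical sketch of Eq. (1) with the Mexican hat wavelet, of the kind used to produce Figure 1. The function names and the choice of a plain discrete convolution standing in for the integral are our illustrative assumptions, not code from the paper:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat wavelet: the second derivative of the Gaussian kernel
    (up to sign and normalisation)."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(f, scales):
    """Continuous wavelet transform W f(s, b) of a sampled signal f,
    evaluated at the given scales by convolving with psi((t - b)/s)/s."""
    f = np.asarray(f, dtype=float)
    out = np.empty((len(scales), len(f)))
    for i, s in enumerate(scales):
        # sample the scaled kernel on a support of about 4 standard widths
        t = np.arange(-int(4 * s), int(4 * s) + 1)
        psi = mexican_hat(t / s) / s
        # the Mexican hat is symmetric, so convolution equals correlation
        out[i] = np.convolve(f, psi, mode="same")
    return out

# random-walk (Brownian) sample, as in Figure 1
rng = np.random.default_rng(1)
walk = np.cumsum(rng.standard_normal(1024))
scales = 2.0 ** np.arange(1, 7)        # logarithmic scale axis, log(s)
W = cwt(walk, scales)                  # rows: scales, columns: positions b
```

Plotting `W` against position and log-scale reproduces the qualitative picture of Figure 1: coarse features at large scales, with increasing detail towards smaller scales.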
