Since big data is all the rage these days, I've been asked at work to help develop an outlier detection system for streams of environmental data coming from a variety of sensors. Right now we're doing USL/LSL monitoring, but the goal is to capture multi-dimensional outliers and anomalous sequences that nonetheless may be within the normal bounds. The data consists of time-series measurements from T=0 to T=Tmax over a machine cycle, where Tmax is not guaranteed to be consistent from cycle to cycle. The sample period is fairly regular, but some data may be missing.
I've spent the past few days taking a deep dive into machine learning literature, and I think I have an understanding of the high-level steps I need to take, but the sheer variety of approaches for each step is a little overwhelming. Right now, my concerns are mainly focused around two steps:
1. Transform time-series data into point-space data
2. Identify outliers, and most-significant dimensions
If anyone could point me in the right direction for researching methods, or if anyone has any suggested implementations (I'm sure this is far from a unique problem), that would be very much appreciated. So far I've looked at scikit-learn and Jubatus. I know R is popular for data analysis, but I'm not familiar with the language or with its libraries. I know Amazon offers anomaly detection through AWS, but I need something on-premise (besides the fact that it doesn't analyse which dimensions are anomalous).
Sorry for the long post, but I had a bunch of related questions and didn't want to split them all up individually.
tl;dr
Thanks in advance.
Refs
1. Finding Local Anomalies in Very High Dimensional Space (de Vries, Chawla, Houle, 2010)
2. Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences (Budalakoti, Srivastava, et al., 2006)
3. A Symbolic Representation of Time Series, with Implications for Streaming Algorithms (Lin, Keogh, et al., 2003)
(This article was first published on Data Perspective, and kindly contributed to R-bloggers)
Curse of Dimensionality: One of the most common problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional and sparse data. We often face situations where we have a large set of features and few data points, or where the feature vectors are very high-dimensional. In such scenarios, fitting a model to the dataset results in a model with low predictive power. This situation is often termed the curse of dimensionality. In general, adding more data points or shrinking the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality. In this post we discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.
Principal component analysis:
Consider the following scenario: the data we want to work with is in the form of a matrix A of dimension m x n, where A(i,j) represents the value of the i-th observation of the j-th variable. Thus the matrix consists of m observations, each of which is an n-dimensional vector of variable values. If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k, while losing as little information as possible.
Mathematically speaking, PCA is a linear orthogonal transformation that maps the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
When applied, the algorithm linearly transforms the n-dimensional input space into a k-dimensional (k < n) output space, with the objective of minimizing the amount of information (variance) lost by discarding the remaining (n - k) dimensions. In effect, PCA lets us discard the directions along which the data varies least.
Technically speaking, PCA uses an orthogonal projection to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined so that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest possible variance under the constraint that it is orthogonal to the preceding components.
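As a minimal illustration of this transformation (a sketch with synthetic data, not the dataset used below), the principal directions can be computed as the eigenvectors of the covariance matrix:

set.seed(42)
X <- scale(matrix(rnorm(200 * 5), ncol = 5), center = TRUE, scale = FALSE)  # centered data: 200 observations, 5 variables
eig <- eigen(cov(X))                      # eigenvectors = principal directions, eigenvalues = their variances
W2 <- eig$vectors[, 1:2]                  # keep the k = 2 leading directions
T2 <- X %*% W2                            # data expressed in the new (reduced) coordinate system
sum(eig$values[1:2]) / sum(eig$values)    # fraction of the total variance retained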
As an illustration, consider two principal components u1 and u2, where u1 accounts for the highest variance in the dataset and u2 accounts for the next highest variance and is orthogonal to u1.
For today's post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths (in cm), whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights (in cm) of the 3000 criminals.
Let us use apply() on the crimtab dataset column-wise to calculate the variance of each variable and see how much each one varies.
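A minimal sketch of this step, assuming the built-in crimtab dataset (margin 2 applies var() to each column, i.e. each variable):

data(crimtab)
col_var <- apply(crimtab, 2, var)   # variance of each column (variable)
col_var[which.max(col_var)]         # the variable with the largest variance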
We observe that the column "165.1" has the maximum variance in the data. Next, we apply PCA using prcomp().
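One way to do this; the exact call is not shown here, so the centering and scaling choices below are assumptions (prcomp() centers by default, and scaling is left off):

pca <- prcomp(as.matrix(crimtab))
summary(pca)   # standard deviation and proportion of variance explained by each component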
Let's plot all the principal components and see how much variance each component accounts for.
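Continuing with the pca object fitted above, one possible way to draw this is a scree-style plot:

screeplot(pca, type = "lines", main = "Variance explained by each principal component")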
Clearly the first principal component accounts for most of the variance.
Let us interpret the results of the PCA using a biplot, which shows the contribution of each variable along the two principal components.
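Continuing with the same pca object, a minimal call that produces such a plot (default scaling of scores and loadings assumed):

biplot(pca, cex = 0.6)   # plot observations and loading vectors on PC1 vs PC2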
The preceding code produces a biplot showing the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows are the loading vectors, which describe how each original variable contributes to the principal component directions.
From the plot, we can see that the first principal component, PC1, places roughly equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than with the 160.02 and 162.56 features. The second principal component, PC2, places more weight on 160.02 and 162.56, which are less correlated with the first three features.

So now that we understand how to run PCA and how to interpret the principal components, where do we go from here? How do we apply the reduced-variable dataset? We shall answer these questions in our next post.

Complete code for the PCA implementation in R:
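A consolidated sketch of the steps above, under the same assumptions (built-in crimtab data, default centering, no scaling):

data(crimtab)
apply(crimtab, 2, var)               # per-variable variance
pca <- prcomp(as.matrix(crimtab))    # fit PCA (centered, unscaled)
summary(pca)                         # variance explained by each component
screeplot(pca, type = "lines")       # scree plot
biplot(pca, cex = 0.6)               # loadings and scores on PC1/PC2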
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming.[1][2]
Cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
Domains
Combinatorics
In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion. Even in the simplest case of d binary variables, the number of possible combinations is already O(2^d), exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.
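As a quick illustration in R (an example of our own, not from the article), the count doubles with every added binary variable:

d <- c(1, 5, 10, 20, 30)
2^d                                     # 2, 32, 1024, ~1e6, ~1e9 combinations
nrow(expand.grid(rep(list(0:1), 10)))   # explicitly enumerating d = 10 binary variables already yields 1024 rows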
Sampling
There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 10^2 = 100 evenly spaced sample points suffice to sample a unit interval (a '1-dimensional cube') with no more than 10^-2 = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10^-2 = 0.01 between adjacent points would require 10^20 [= (10^2)^10] sample points. In general, with a spacing distance of 10^-n the 10-dimensional hypercube appears to be a factor of 10^(n(10-1)) [= (10^n)^10 / (10^n)] 'larger' than the 1-dimensional hypercube, which is the unit interval. In the above example n = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 10^18 'larger' than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.
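The arithmetic above can be reproduced directly; a small illustrative R sketch (not from the article):

spacing <- 0.01
d <- c(1, 2, 5, 10)
(1 / spacing)^d                     # lattice points needed: 1e2, 1e4, 1e10, 1e20
(1 / spacing)^10 / (1 / spacing)    # the 10-d hypercube needs 1e18 times more points than the unit interval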
Optimization
When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the 'state variable' is large.
Machine learning
In machine learning problems that involve learning a 'state-of-nature' from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.[3] With a fixed number of training samples, the predictive power of a classifier or regressor first increases as the number of dimensions/features used is increased, but then decreases,[4] which is known as the Hughes phenomenon[5] or peaking phenomenon.[3]
Distance functions
When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of samples.
One way to illustrate the 'vastness' of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius r and dimension d to that of a hypercube with edges of length 2r. The volume of such a sphere is 2 r^d π^(d/2) / (d Γ(d/2)), where Γ is the gamma function, while the volume of the cube is (2r)^d. As the dimension d of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension d goes to infinity: the ratio π^(d/2) / (d 2^(d-1) Γ(d/2)) tends to 0.
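A short R check of this ratio (an illustrative sketch: the d-ball volume formula from above divided by (2r)^d):

ball_to_cube <- function(d, r = 1) {
  ball <- 2 * r^d * pi^(d / 2) / (d * gamma(d / 2))   # volume of the inscribed d-ball
  cube <- (2 * r)^d                                   # volume of the enclosing hypercube
  ball / cube
}
sapply(c(2, 3, 5, 10, 20), ball_to_cube)   # ~0.785, 0.524, 0.164, 0.0025, 2.5e-8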
Furthermore, the distance between the center and the corners is r√d, which increases without bound for fixed r. In this sense, nearly all of the high-dimensional space is 'far away' from the centre. To put it another way, the high-dimensional unit hypercube can be said to consist almost entirely of the 'corners' of the hypercube, with almost no 'middle'.
This also helps to understand the chi-squared distribution. Indeed, the (non-central) chi-squared distribution associated to a random point in the interval [-1, 1] is the same as the distribution of the length-squared of a random point in the d-cube. By the law of large numbers, this distribution concentrates itself in a narrow band around d times the standard deviation squared (σ²) of the original distribution. This illuminates the chi-squared distribution and also illustrates that most of the volume of the d-cube concentrates near the surface of a sphere of radius σ√d.
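A small simulation illustrating the claim (an illustrative sketch; coordinates uniform on [-1, 1] have σ² = 1/3):

set.seed(1)
d <- 200
x <- matrix(runif(5000 * d, -1, 1), ncol = d)   # 5000 random points in the d-cube
sq_len <- rowSums(x^2)                          # squared distance from the centre
c(mean = mean(sq_len), expected = d / 3, relative_spread = sd(sq_len) / mean(sq_len))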
A further development of this phenomenon is as follows. Any fixed distribution on ℝ induces a product distribution on points in ℝ^d. For any fixed n, it turns out that the difference between the minimum and the maximum distance between a random reference point Q and a list of n random data points P1, ..., Pn becomes indiscernible compared to the minimum distance:[6] lim_{d→∞} E[(dist_max(d) - dist_min(d)) / dist_min(d)] = 0.
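A simulation of this effect under the i.i.d. uniform assumption (an illustrative sketch, not the cited derivation):

set.seed(1)
relative_contrast <- function(d, n = 1000) {
  q <- runif(d)                                  # random reference point Q
  p <- matrix(runif(n * d), ncol = d)            # n random data points P1..Pn
  dists <- sqrt(rowSums(sweep(p, 2, q)^2))
  (max(dists) - min(dists)) / min(dists)
}
sapply(c(2, 10, 100, 1000), relative_contrast)   # the relative contrast shrinks as d grows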
This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to hold only in the artificial scenario where the one-dimensional distributions on ℝ are independent and identically distributed.[7] When attributes are correlated, the data can become easier to work with and provide higher distance contrast, and the signal-to-noise ratio was found to play an important role; thus feature selection should be used.[7]
Nearest neighbor search
The effect complicates nearest neighbor search in high dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.[8][9]
However, it has recently been observed that the mere number of dimensions does not necessarily result in difficulties,[10] since relevant additional dimensions can also increase the contrast. In addition, for the resulting ranking it remains useful to discern close and far neighbors. Irrelevant ('noise') dimensions, however, reduce the contrast in the manner described above. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliably as long as the signal-to-noise ratio is high enough.[11]
k-nearest neighbor classification
Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs constructed from a data set using a distance function. As the dimension increases, the indegree distribution of the k-NN digraph becomes skewed with a peak on the right because of the emergence of a disproportionate number of hubs, that is, data points that appear in many more k-NN lists of other data points than the average.
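A rough way to observe the hub effect (an illustrative R sketch; the mean in-degree is always k, while the maximum in-degree grows markedly with the dimension):

set.seed(1)
knn_indegree <- function(d, n = 500, k = 10) {
  x <- matrix(rnorm(n * d), ncol = d)
  dm <- as.matrix(dist(x)); diag(dm) <- Inf
  nn <- apply(dm, 1, function(row) order(row)[1:k])   # k nearest neighbours of every point
  indeg <- tabulate(as.vector(nn), nbins = n)         # how often each point appears as someone's neighbour
  c(mean_indegree = mean(indeg), max_indegree = max(indeg))
}
rbind(d2 = knn_indegree(2), d100 = knn_indegree(100))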