From lukas.gudmundsson at geo.uio.no  Wed Nov  5 17:32:19 2008
From: lukas.gudmundsson at geo.uio.no (lukas.gudmundsson at geo.uio.no)
Date: Wed, 5 Nov 2008 17:32:19 +0100 (CET)
Subject: [Simsalabim-communicate] Design issues
Message-ID: <494144b7b7a35c46dae4c74345554a39.squirrel@webmail.uio.no>

Hello Anton,

(@all others: this is a reply to a previous mail from Anton. You will find it at the end of this mail.)

Thank you very much for the detailed description of your ideas for improving SSA in R. In the following I go through your last mail and try to comment as well as I can.

As far as I understand, you are mostly concerned about performance and numerical stability. As you have already found out, I have been "ignoring" both of these important issues to some extent, although I always wished that they would eventually be tackled. I guess you have also noticed that I solve the problem of negative eigenvalues in a naive way, by artificially setting them to a small positive value.

Some of the inefficiencies come from certain design decisions, related to application patterns.

A. As you probably know, I am a geoscientist and use SSA for the identification and extraction of low-frequency patterns. Here I follow the tradition of using eigenvectors and eigenvalues to identify signals (based, e.g., on significance testing or on the dominant frequencies of the eigenvectors). This approach is somewhat different from the w-correlation approach, but it has the advantage of full automation and fewer computations (for the RCs). In addition, it is often of interest to consider only signals with certain time scales (based on the spectra of the eigenvectors). Any change in architecture should either still allow the use of the corresponding methods or adapt them to the new object structures.

B. One goal was to make the usage of SSA and of SSA with missing values as similar as possible, so I tried to keep the object structures of decompSSA and decompSSAM as close as possible. With the use of svd in the "complete" case this was not very straightforward. The reason is that the factor vectors do not appear naturally in some of the algorithms for SSA with missing values (and even if they can be calculated, doing so is numerically inefficient if one uses only R [the * operator]). However, one could easily think about adding an optional slot to the output of decompSSA that lets us include the factor vectors (this would also imply additional methods for reconSSA, depending on the availability of the factor vectors).

C. Finally, I think it is worth considering how much one gains from certain improvements (relative vs. absolute speedup). The computationally by far most expensive part of the current implementation is the diagonal averaging procedure (see the sketch below), as it is fully implemented in R and I have found no way to vectorize it. If I remember correctly, it accounts for more than 80% of the system time. I do not want to say that other approaches are not worth considering, but by changing this part (porting it to C or vectorizing it) the absolute computation time (for large datasets) could drop from hours to minutes.

I am sorry for the long and messy mail and the late reply, but I hope I have made some of my design decisions (based on usage patterns) clear. I am looking forward to improving SSA in R together with you and other people interested in this method.
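For reference, here is roughly what the diagonal averaging mentioned under point C amounts to (a simplified sketch with an illustrative function name, not the code that is actually in simsalabim):

## Simplified, illustrative sketch of diagonal averaging (hankelization);
## not the actual simsalabim code. Reconstructs a series of length
## N = L + K - 1 from an L x K matrix by averaging over its anti-diagonals.
diagAverage <- function(X) {
  L <- nrow(X)
  K <- ncol(X)
  N <- L + K - 1
  s <- numeric(N)
  for (n in seq_len(N)) {
    i <- max(1, n - K + 1):min(L, n)     # all (i, j) with i + j - 1 == n
    s[n] <- mean(X[cbind(i, n - i + 1)]) # average the n-th anti-diagonal
  }
  s
}

## Applied to an exact trajectory matrix it recovers the original series:
x <- sin(2 * pi * (1:100) / 12)
L <- 30
X <- sapply(1:(length(x) - L + 1), function(j) x[j:(j + L - 1)])
all.equal(diagAverage(X), x)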
Best,
Lukas

---------------------------- Original Message ----------------------------
Subject: Re: [Simsalabim-communicate] test
From: "Anton Korobeynikov"
Date: Wed, October 29, 2008 23:35
To: "Lukas Gudmundsson"
--------------------------------------------------------------------------

Hello, Lukas

> I am looking forward to hearing about the outcome of your meeting

We thoroughly discussed the possible problems and bottlenecks of a (possible) R implementation of the basic SSA algorithms, having in mind all the users and SSA use cases known to us. Currently we are interested only in 'basic SSA', since it forms the base for a huge number of other methods. Also, this gives us a clean slate and lets us think about all possible solutions without dealing with backward compatibility problems, etc.

One can easily consider every step of the algorithm either time consuming or memory consuming (or both), so a flexible approach should be introduced. As an example, let us consider the mentioned problem with the factor vectors.

I. Singular decomposition. There are different approaches here:

1. Compute the eigenvalues and eigenvectors of the covariance matrix. But:
 - This is known to be numerically unstable [1], especially if the trajectory matrix is rank-deficient (you can easily obtain, for example, negative eigenvalues even with standard LAPACK routines, and thus with R's eigen() routine) or if there is a noticeable number of small eigenvalues (which is exactly the case in SSA!).
 - You need to compute the covariance matrix. The complexity is O(L*K^2) for the naive approach (series => Hankel matrix => covariance matrix) and O(L*K) for the direct approach (series => covariance matrix).
 - You will need to compute the factor vectors separately, or implicitly later, with O(L*K^2) complexity.

2. Use the singular value decomposition. But:
 - Fast methods are usually numerically unstable due to the finite precision of computer arithmetic, and special care must be taken to keep the set of singular vectors orthogonal.
 - Stable and robust methods (the Jacobi method, or Householder bidiagonalization) are slow.
 - We obtain the set of factor vectors for free as a side effect of the algorithm.

Fortunately, here we have only one clear usage pattern and do not have to think about how users will invoke the singular decomposition routine, so we can end up with some sane solution (or a combination of different approaches depending on the size of the input, etc.).

II. Reconstruction

Reconstruction itself generally costs a lot, since we either need to multiply two matrices (if the factor vectors are known) or construct the trajectory matrix again and compute the factor vectors implicitly (which adds a lot to the complexity!). One typical reconstruction usage pattern, when we are looking for the best grouping of eigentriples, usually involves the construction of the w-correlation matrix, which costs O(N*M^2) (M being the number of reconstructed series) if we already have the reconstructed series, and much more if we only have the sets of eigentriples. Add to this the need to compute the factor vectors and you end up with a rather computation-hungry thing. And of course there is another problem when the factor vectors are too big to keep them all in memory. Thus, we end up with the classical time/memory tradeoff. There are also other patterns, with similar results; I won't describe them here to save your time and mine :)

After studying these problems we decided that computations should be as lazy as possible: for example, the factor vectors need to be computed only when needed, BUT saved for later use (a rough sketch of what I mean is below).
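To make this a bit more concrete, here is a rough sketch of the kind of caching I have in mind (the names are purely illustrative and this is not a proposal for the final interface; it uses the eigendecomposition route only for the sake of the example):

## Rough illustrative sketch of lazy, cached factor vectors; not the
## planned interface. X is the L x K trajectory matrix.
makeLazyDecomp <- function(X) {
  cache <- new.env(parent = emptyenv())
  eig <- eigen(tcrossprod(X), symmetric = TRUE)  # left singular vectors and sigma^2
  list(
    values  = eig$values,
    vectors = eig$vectors,
    factorVectors = function(idx) {
      key <- paste(idx, collapse = ",")
      if (!exists(key, envir = cache, inherits = FALSE)) {
        ## V_i = t(X) %*% U_i / sigma_i, computed on the first request only
        U     <- eig$vectors[, idx, drop = FALSE]
        sigma <- sqrt(pmax(eig$values[idx], 0))
        assign(key, sweep(crossprod(X, U), 2, sigma, "/"), envir = cache)
      }
      get(key, envir = cache, inherits = FALSE)
    }
  )
}

## d <- makeLazyDecomp(X)
## V <- d$factorVectors(1:5)   # computed now, reused on the next call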
Laziness should also be flexible: the user should be able to control how much data is saved and reused. The functions need to provide some sane interface for specifying how much data the user wants precomputed (for example, "compute only the first 20 eigentriples", or "compute the series reconstructions based on the first 5 eigentriples").

I'm currently working on a proof-of-concept implementation of our ideas. Some of them are pretty straightforward, but others look tricky to implement within R. We plan to use simsalabim as the 'reference implementation' and, after completion of the proof-of-concept project, to start porting the current simsalabim algorithms to the new base. This WILL break backward compatibility, but it seems that the user-visible benefits will more than make up for it.

Sorry for such a long and somewhat messy e-mail, but I tried to nail down the problems and possible ways to solve them :) I'm open to any comments, suggestions and questions.

[1] Trefethen, Lloyd N.; Bau III, David. Numerical Linear Algebra. Philadelphia: Society for Industrial and Applied Mathematics, 1997.

--
With best regards, Anton Korobeynikov.
Faculty of Mathematics & Mechanics, Saint Petersburg State University.