Abstract
Objectives
Our series are aggregated intra-daily data (10-second intervals) of the DAX30 constituents. In total we have 251 observation days for each stock on the DAX30 list; aggregating the high-frequency intra-daily data at 10-second intervals yields 3060 intra-daily observations per stock per day. We aggregate the raw, inhomogeneous intra-daily data into equally spaced, homogeneous 10-second intra-daily time series. To mitigate the opening-auction effect, transactions from the first 10 minutes of each day were omitted. The aggregation was done with previous-tick interpolation (Wasserfallen and Zimmermann, 1985); Dacorogna et al. (2001) point out that linear interpolation relies on future information, whereas previous-tick interpolation is based only on information already known.

Our results show that the distribution of intra-daily log-returns differs from day to day, and statistical tests are needed to confirm this. The problem is that we cannot use traditional tests for continuous distributions, since our data follow a mixed-stable (discontinuous) law. We have developed sequential code for k-sample homogeneity tests (Anderson-Darling and Kruskal-Wallis). The difficulty is that testing the homogeneity of a single stock series takes several days. The usual procedure is to stack all series to be analyzed into one vector of 251 days x 3060 intra-daily observations = 768,060 observations. This pooled vector must be sorted and the rank of each element found; each individual series must also be sorted, and the ranks of its elements within both the intra-daily series and the stacked series determined. This gives at least 589,913,859,422 comparison operations for one stock. The algorithm may be divided into 6 main parts:

1) Initialization (memory allocation for specific structures and arrays; reading the data file names from a file, etc.). This part is purely serial and cannot be parallelized;

2) Reading and sorting of the data arrays (this step may be parallelized by reading and sorting the data on different processors):

a) reading the data from the files whose names were read previously (step 1);

b) sorting the read data;

3) Merging of the sorted arrays; may be parallelized by merging arrays on the slave processors, sending the results to the master, and then merging again with the other merged arrays;

4) Calculation of the k-sample Anderson-Darling test statistic [1, formula (3)]; may be parallelized by computing different parts of the test statistic in parallel;

5) Calculation of the variance of the Anderson-Darling test statistic [1, formula (4)]; this is the most time-consuming part and may be parallelized by computing the different sums in parallel;

6) Finalization (calculation of the standardized T value, printing of results to file, freeing of memory, etc.); serial only. Its execution time is not measured.

Initialization and finalization have almost constant execution times, since these parts are serial. All other parts are parallelizable. As already mentioned, the most time-consuming part is step 5, in which at least 589,913,859,422 operations are performed.

Preliminary evaluation of the computation time shows that only one stock data series can be analyzed at a time on my computer. With 28 stocks, each with a series length of ~3000, performing the k-sample homogeneity test takes 36 hours! This is not an acceptable decision-making time. Moreover, the research requires repeating this experiment many times for different series lengths, also taking into account possible program bugs.

The objective of this research is to develop portable parallel programs with MPI that can be used on any parallel architecture, and to perform computational experiments.

Achievements
During my visit all objectives were achieved. A very fast and efficient parallel code was developed and used to check the homogeneity of 28 intra-daily DAX30 stock series from the year 2007. The initial running time on my computer, with one processor and one node, was approximately 4000 seconds; after the visit this was reduced to an average of 1011.55 s on one processor of the EPCC CRAY in Edinburgh (with 128 processors it took 20.24 s). The average running times on the EPCC CRAY supercomputer of the initial, partially improved, and final versions are given in Table 1. One may see that the average efficiency of the full algorithm is close to 0.7 (it varies with the number of processors).

Table 1. Average execution time (s), speedup, and efficiency of the algorithm for different numbers of processors (speedup and efficiency refer to the final version).

n_proc | initial version | partial improvement | final version | speedup     | efficiency
     1 | 4000            | 2312.156956         | 1011.550148   | 1           | 1
     2 | *               | 968.97555           | 568.8115823   | 1.778357157 | 0.889179
     4 | 647.893774      | 486.967435          | 309.9577592   | 3.263509682 | 0.815877
     8 | 363.053988      | 246.80019           | 148.1271944   | 6.82892937  | 0.853616
    16 | 179.675273      | 133.305119          | 80.4860199    | 12.56802299 | 0.785501
    32 | *               | 75.148571           | 47.25069067   | 21.40815582 | 0.669005
    64 | 70.702355       | 42.404254           | 29.48128735   | 34.31160031 | 0.536119
   128 | 42.522539       | 33.206758           | 20.2375152    | 49.98391049 | 0.390499

Note: * not measured

The initialization of the program has an almost constant execution time. Reading and sorting of the data arrays depend weakly on the number of processors, and mostly on the communication between the processors and the hard disks. Merging of the sorted arrays (we use a merge-sort algorithm) is very fast, but somewhat slower for one processor and for large numbers of processors (the influence of interprocessor communication). The calculation of the test statistic scales very well, with efficiency close to 1 for all numbers of processors. The calculation of the theoretical variance is very time consuming, but its average efficiency is close to 0.85 (min = 0.83, max = 0.89).

For small numbers of processors, 95% of the time is spent calculating the theoretical variance, but as the number of processors increases this share drops to 60%. Since the reading and sorting time depends only weakly on the number of processors, with 128 processors its share of the total time grows to 33.5%.

The speedup of the full algorithm (see Table 1) is well described by a power law or a second-order polynomial in the number of processors (with R-squared above 0.99).

Nowadays almost all computers can perform parallel computing: single-processor machines can use multiple cores, and if an NVIDIA graphics card is available, high-performance computing is possible on the GPU as well. We ran the algorithm on an Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66 GHz (2 cores, 2 logical processors) under Microsoft Windows 7 Professional. The execution times on this home-office computer are given in Table 2, which contains the execution times of the algorithm optimized by the VC2010 compiler and of the native build without any optimization.

Table 2. Execution time (s) on the home-office computer, with 1 and 2 cores.

cores | compilation | run time
    1 | native      | 2312.156956
    1 | optimized   | 2268.38367
    2 | native      | 1174.749748
    2 | optimized   | 1151.412488

Our homogeneity test was used to check whether the intra-daily stock returns of the year 2007 are homogeneous. For all 28 stocks from the DAX30 list the hypothesis of homogeneity was rejected.

[1] F. W. Scholz and M. A. Stephens, "K-Sample Anderson-Darling Tests", Journal of the American Statistical Association, Vol. 82, No. 399 (Sep. 1987), pp. 918-924.