A Quadtree-Based Storage Format and Effective Multiplication Algorithm for Sparse Matrices


Abstract: Sparse matrix multiplication is widely used in scientific research and in a large number of applications. However, traditional storage formats for sparse matrices introduce a great deal of indirect data accessing during computation, which results in a high cache-miss rate and poor execution performance. To improve this, we propose a quadtree storage format derived from recursive division of the matrix. With this data structure, the original sparse matrix multiplication can be transformed into computations over independent, relatively small regions that fit well within the cache capacity, so a significant performance improvement can be achieved. Moreover, this region-based computation framework promises substantial parallelism in distributed computing environments. We elaborate our solution for quadtree-based parallel matrix multiplication, including tuning of the region size, instruction-level optimization techniques, and the computation distribution strategy. The extra cost introduced by this data model is discussed as well. Extensive and comprehensive experiments on real-world matrices show that quadtree-based sparse matrix-vector multiplication achieves an average speedup of 1.100 over the CSR format, and sparse matrix-matrix multiplication achieves an average speedup of 1.862 over the CSR format. The quadtree storage format achieves high performance after applying instruction-level optimization techniques, and experiments on the Lenovo Deepcomp7000 demonstrate that sparse matrix-vector multiplication achieves an average speedup of 1.63 over the Intel Cluster Math Kernel Library implementation.


Keywords: Sparse matrix; Quadtree; Cache-hit rate; Matrix multiplication algorithm; Distributed parallelism

The recursive decomposition of a sparse matrix

A sparse matrix can be recursively decomposed into a number of sub-regions. The region length d significantly affects both the storage efficiency (compression ratio) of the matrix and the execution efficiency of the program. Obviously, to improve cache data locality, the ideal case is that data loaded into the cache completes all of its operations before being evicted. This cannot be fully achieved, but the parameters can be adjusted to approximate it. Therefore, an appropriate choice of d helps decompose the matrix into sub-regions that are no larger than the cache. The entire multiplication then divides into independent operations over these sub-regions, which greatly improves data locality and reduces cache misses.
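As a rough illustration of this tuning, the following sketch (a hypothetical helper of ours, not the paper's tuning procedure) picks the largest power-of-two region length d such that one d×d dense block plus the corresponding slices of the input and output vectors fit within a given cache capacity.

#include <stddef.h>

/* cache_bytes: usable cache capacity in bytes; elem_bytes: size of one
 * matrix/vector element, e.g. sizeof(double) */
static size_t choose_region_length(size_t cache_bytes, size_t elem_bytes)
{
    size_t d = 1;
    /* keep doubling d while a (2d x 2d) block plus two slices of length 2d
     * still fit in the cache; the result is the largest power-of-two d
     * that satisfies d*d + 2*d elements <= cache capacity */
    while ((2 * d) * (2 * d) * elem_bytes + 2 * (2 * d) * elem_bytes <= cache_bytes)
        d *= 2;
    return d;
}

In practice the best d also depends on the sparsity pattern, so a value obtained this way would only serve as a starting point for the tuning described later.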


Figure 1 details the process of recursively decomposing a sparse matrix. The process can be expressed as the recursive generation of a quadtree, which is shown in Figure 2. This article defines two kinds of intermediate nodes. One is the empty region (denoted E), which contains no nonzero elements and therefore needs no further decomposition. The other is the mixed region (denoted M), which contains nonzero elements and can be decomposed further. The leaf nodes of the quadtree also fall into two kinds: one is the empty region, and the other is the dense region (denoted D), which may still contain zero elements.
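To make the node kinds concrete, the following C sketch shows one possible node layout. The field names (kind, x, y, d, child, row, col, val, nnz) are our assumptions for illustration, not the authors' exact definitions.

typedef enum { NODE_EMPTY, NODE_MIXED, NODE_DENSE } node_kind_t;

typedef struct quad_node {
    node_kind_t kind;            /* E: empty region, M: mixed region, D: dense region */
    int x, y;                    /* top-left corner of the region within the matrix   */
    int d;                       /* region length                                     */
    struct quad_node *child[4];  /* four sub-regions; used only by mixed (M) nodes    */
    int    *row, *col;           /* local COO indices; used only by dense (D) leaves  */
    double *val;                 /* nonzero values of a dense (D) leaf                */
    int     nnz;                 /* number of nonzeros stored in a dense (D) leaf     */
} quad_node_t;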


Given the target region length d, the quadtree storage structure has the following characteristics:

(1) If the size of the matrix is n×n and the size of the target region is d×d, then the maximum depth of the tree is Dep = log_d(n) and the maximum number of intermediate nodes is N = O(4^(Dep-1)).

(2) The overhead of storing block information and intermediate nodes is as follows. During the generation of the quadtree, each intermediate node contains an index pointer (x, y) to its region (occupying storage space S_I) and the region length d (occupying storage space S_D). The maximum additional space overhead of the quadtree is therefore C_S = (2S_I + S_D) × O(4^(Dep-1)). However, this overhead, which is due only to the transformation, organization, and representation of the matrix, is not involved in the multiplication itself, so it does not increase the time complexity of the computation.

(3) A lower compression ratio than CSR. Since every target region stores its nonzero elements in COO format, the total space overhead is C_M = nθ(S_D + 2S_I) + C_S, where C_M is closely related to the matrix sparsity and the target region size d. In addition, the distribution of nonzero elements also affects the number of leaf nodes.

(4) Elimination of the possible impact of the nonzero distribution on the multiplication. The original multiplication is divided into independent operations over the sub-regions, and most of these sub-regions have a dense form, which makes the algorithm more general.

(5) Ease of programming. The data structure directly reflects the recursive decomposition process and is well suited to block matrix multiplication algorithms.
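As a small illustration of item (4), the following sketch (using the assumed quad_node_t layout above, with local COO indices inside each leaf) multiplies a single dense-region leaf by the matching slice of the input vector; the full sparse matrix-vector multiplication then reduces to running this kernel over the dense leaves independently.

/* y_slice += A_leaf * x_slice for one d x d dense region */
static void leaf_spmv(const quad_node_t *leaf, const double *x, double *y)
{
    /* x and y point to the slices of the global input and output vectors
     * that correspond to the leaf's column range and row range */
    for (int k = 0; k < leaf->nnz; ++k)
        y[leaf->row[k]] += leaf->val[k] * x[leaf->col[k]];
}

Because each leaf touches only a d×d window of the operands, the vector slices it needs can remain resident in the cache for the duration of the loop.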

The implementation of the quadtree-based storage format

To transform a sparse matrix from a traditional storage format (such as CSR) into the quadtree format, two problems must be taken into account. First, traditional tree generation algorithms (such as depth-first and breadth-first construction) are not suitable here, because the large amount of traversal, searching, and stack overhead in such algorithms reduces performance. Second, the generation process of the quadtree itself should be efficient.


Taking these two problems into account, the designed quadtree structure of a matrix is shown in Figure 3. The original matrix is decomposed top-down into four equally sized sub-regions, and each of these four sub-regions is decomposed into four sub-regions in the same way. The decomposition continues until the length of a sub-region reaches the given target region length d, at which point the matrix has been decomposed into leaf nodes. After every leaf node is stored, the transformation of the structure is complete. The storage structure in Figure 2 corresponds to the matrix in Figure 1. With this layout, the computer can index a leaf node directly. The quadtree structure not only reduces space overhead, but also eliminates the overhead of stack and queue operations.
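The following sketch shows one way such a direct, traversal-free construction could look: each nonzero of a CSR matrix is routed to its leaf by integer division with the region length d and appended to that leaf's COO arrays in a single pass. The CSR array names, the row-major layout of the leaves, the assumption that d divides n, and the preallocated leaf storage are simplifications of ours, not the authors' exact routine.

/* index of the leaf owning the nonzero at global position (i, j) */
static inline int leaf_index(int i, int j, int n, int d)
{
    int blocks_per_row = n / d;              /* assumes d divides n evenly */
    return (i / d) * blocks_per_row + (j / d);
}

/* single pass over a CSR matrix; no depth-first or breadth-first traversal */
static void fill_leaves_from_csr(int n, int d,
                                 const int *row_ptr, const int *col_idx,
                                 const double *val,
                                 quad_node_t *leaves /* (n/d)*(n/d) dense leaves */)
{
    for (int i = 0; i < n; ++i) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_idx[k];
            quad_node_t *leaf = &leaves[leaf_index(i, j, n, d)];
            int p = leaf->nnz++;             /* leaf storage assumed preallocated */
            leaf->row[p] = i % d;            /* store local COO indices           */
            leaf->col[p] = j % d;
            leaf->val[p] = val[k];
        }
    }
}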

Performance

[Performance figures: Jacobi benchmark (single core and multiple cores); Pluto results summary]

Authors

Ji-Lin Zhang, Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Li Zhuang, Cloud Technology Research Center, Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Xiao-Fei Zhang, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
Xiang-Hua Xu, Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Jian Wan, Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Ivan Simecek, Department of Computer Science, School of Electrical Engineering, Czech Technical University, Czech Republic

Contact

Please email zhuangli1987@gmail.com or jilin.zhang@hdu.edu.cn

Quad-Tree

Download Quad-Tree files

