Result: Parallel programming frameworks for irregular and regular programs

Title:
Parallel programming frameworks for irregular and regular programs
Publisher Information:
The University of Texas at Austin, 2024.
Publication Year:
2024
Document Type:
Dissertation/Thesis
Language:
English
DOI:
10.26153/tsw/59929
Accession Number:
edsair.doi...........9b2c0dd417bbd232d9be342deb593c6b
Database:
OpenAIRE

Further Information

Developing efficient parallel programs is challenging because it demands a comprehensive understanding spanning from algorithms, including computation types, dependencies between computations, and memory access patterns, to architectures, including the memory hierarchy, computing devices, and hardware topology. Selecting and exploiting suitable programming abstractions is equally essential. One effective strategy for managing these complexities and parallelizing programs is to categorize programs by the main data structures they use. Parallel programs can be classified into irregular and regular programs. Irregular programs typically involve unstructured and sparse data structures such as graphs, trees, or priority queues, which necessitate irregular and sparse memory access patterns. In many irregular programs, computation depends on dynamic data dependencies, making parallel execution hard to predict and plan efficiently. Examples of irregular programs include graph analytics programs such as single-source shortest path, PageRank, and community detection. Regular programs, on the other hand, operate on structured and dense data structures such as matrices and arrays. They exhibit structured, dense memory access patterns with static data dependencies, leading to predictable execution. Linear algebra programs such as Cholesky factorization and computational simulations such as radiative heat transfer are examples of regular programs. Given these distinct properties, irregular and regular programs demand different approaches to achieve highly parallel execution, and many parallel programming frameworks have been proposed, each with programming model and runtime designs tailored to its target workloads. This dissertation advances the state of the art in parallel programming frameworks by enhancing programming models and designing optimizations for high parallelism on modern architectures, addressing both irregular and regular workloads.

Our first contribution is a study of the performance of programming abstractions for graph analytics. Traditionally, parallel graph analytics algorithms have been implemented in systems like Pregel, Galois, and Ligra, which support graph data structures and operations directly. An alternative approach expresses graph algorithms in terms of sparse matrix kernels such as sparse matrix-vector and matrix-matrix multiplication. How does the matrix-based abstraction perform compared to the graph-based one? To study this question, we implemented a matrix API for graph analytics on top of Galois. Experiments on a 56-core CPU show that the matrix-based runtime and API is 5× slower on average than the graph-based runtime and API. Our in-depth study identified inherent performance and expressiveness limitations of the matrix-based programming abstraction for graph analytics workloads.

Our second contribution is in distributed programming frameworks for graph analytics. Recent distributed graph analytics systems such as Gemini and Gluon support only adjacent-vertex algorithms, in which nodes are updated iteratively using the properties of their adjacent neighbors. However, many algorithms, such as Shiloach-Vishkin for connected components and Louvain for community detection, cannot be expressed in this model. These algorithms may produce better-quality output or may be more efficient than adjacent-vertex algorithms. We call them trans-vertex algorithms.
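As a concrete illustration of this distinction, the plain-Python sketch below contrasts an adjacent-vertex update, which reads only the labels of a node's neighbors, with a trans-vertex pointer-jumping step in the style of Shiloach-Vishkin, which reads the property of an arbitrary, generally non-adjacent node. The data structures and function names are illustrative only and are not taken from Galois, Gemini, Gluon, or Kimbap.

    # Schematic contrast between the two algorithm classes discussed above.
    # Plain Python with adjacency lists; all names are illustrative.

    def adjacent_vertex_step(adj, label):
        """One pull-style label-propagation step for connected components:
        each node reads ONLY the labels of its adjacent neighbors."""
        new_label = dict(label)
        for v, neighbors in adj.items():
            for u in neighbors:
                if label[u] < new_label[v]:
                    new_label[v] = label[u]
        return new_label

    def trans_vertex_step(parent):
        """One pointer-jumping step in the style of Shiloach-Vishkin:
        each node reads parent[parent[v]], which is generally NOT one of
        v's neighbors, so the access cannot be expressed as a
        neighbor-only (adjacent-vertex) update."""
        return {v: parent[parent[v]] for v in parent}

    if __name__ == "__main__":
        adj = {0: [1], 1: [0, 2], 2: [1], 3: []}   # components {0,1,2} and {3}
        label = {v: v for v in adj}
        print(adjacent_vertex_step(adj, label))     # neighbor-only reads
        parent = {0: 0, 1: 0, 2: 1, 3: 3}
        print(trans_vertex_step(parent))            # arbitrary-node reads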
In this dissertation, we present Kimbap, a distributed graph analytics system that efficiently supports both adjacent- and trans-vertex algorithms. Kimbap permits the computation at a node to read and write the properties of any node in the graph. Kimbap's programming model allows programmers to write high-level iterative graph analytics programs, while the Kimbap compiler automatically generates high-performance, scalable distributed code. The underlying system uses a distributed node-property map optimized for highly concurrent sparse reads and reductions on arbitrary node properties. Experiments on CPU clusters with up to 256 machines (roughly 12,000 threads in total) show that Kimbap is 4× faster than state-of-the-art hand-optimized trans-vertex programs and matches or outperforms the state-of-the-art distributed graph analytics system on adjacent-vertex programs.

Our last contribution is in task-based parallel programming frameworks. Python has become a popular choice for developing scientific applications, supported by its rapidly expanding infrastructure, such as Jupyter, and a rich set of scientific libraries, such as NumPy and CuPy, that are optimized for modern architectures. However, existing Python-based task-parallel programming systems in this area are often monolithic in the sense that they assume most of the codebase is written within their framework. This prevents programmers from leveraging rapidly evolving, highly optimized libraries and implementations developed in other programming models, runtime systems, or architectures. To address these limitations, we propose Parla, a Python-based task-parallel programming framework for heterogeneous shared-memory machines. The main contribution of Parla is gradual adoption, which enables programmers to easily migrate, integrate, and parallelize existing sequential Python programs. Parla allows external library calls within tasks without any modification. It efficiently parallelizes tasks within a single process using resource-aware task mapping, scheduling, and execution. By internally managing device resources such as GPU contexts, memory, and computation units, Parla alleviates the complexity of programming heterogeneous architectures and allows programmers to transform existing sequential programs into highly parallel versions with ease. Lastly, it is highly optimized for NVIDIA GPU architectures, leveraging device-specific features including CUDA's event mechanism. Parla is funded by a DOE PSAAP grant, and several graduate students are involved in its implementation; my contributions are the design and implementation of the resource management and task scheduling mechanisms and policies in the Parla runtime system. We show that Parla achieves performance competitive with state-of-the-art implementations while simplifying parallel programming.
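To illustrate the gradual-adoption style described above, the following minimal sketch keeps an existing sequential NumPy routine unchanged and only wraps its invocations in tasks. The decorator-style API (Parla, spawn, TaskSpace) follows the pattern of Parla's published examples, but the exact import paths, decorator signatures, and dependency syntax used here are assumptions and may differ across Parla versions.

    # Sketch of gradual adoption: unmodified library code wrapped in tasks.
    # Import layout and spawn/dependency syntax are assumed, not verified.
    import numpy as np
    from parla import Parla, spawn, TaskSpace   # assumed import layout

    def existing_kernel(block):
        """Unmodified sequential code calling an external library (NumPy)."""
        return np.linalg.cholesky(block @ block.T + np.eye(block.shape[0]))

    def main():
        blocks = [np.random.rand(512, 512) for _ in range(4)]
        results = [None] * len(blocks)
        T = TaskSpace("factorize")

        for i, b in enumerate(blocks):
            @spawn(T[i])                      # independent tasks for the runtime to map
            def factorize(i=i, b=b):
                results[i] = existing_kernel(b)   # external library call inside a task

        @spawn(dependencies=[T[0:len(blocks)]])   # assumed dependency syntax: run after all tasks
        def report():
            print([r.shape for r in results])

    if __name__ == "__main__":
        with Parla():    # start the runtime, then spawn tasks inside it
            main()

In this sketch the sequential routine existing_kernel is untouched; only the task annotations are new, which is the migration path the gradual-adoption design is meant to enable.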