Ding Zhiyu Week 2 Study Report - "Parallel Programming Fundamentals" 2.4-2.10
Main Content
Parallel Software
Task Parallelism
- SISD (Single Instruction Single Data)
  - Single Instruction, Single Data stream: in a SISD architecture, the processor executes one instruction at a time, and that instruction operates on a single data item. This is the basic model of the traditional von Neumann computer: a single processing unit works through its tasks sequentially, with no hardware-level parallelism.
- SIMD (Single Instruction Multiple Data)
  - Single Instruction, Multiple Data stream: a SIMD processor executes one instruction at a time, but that instruction operates on multiple data items simultaneously (typically a vector or array). This design is widely used in graphics processing units (GPUs), vector processors, and the vector extensions of modern CPUs (such as Intel's SSE and AVX), and efficiently handles large-scale data-parallel problems.
- SIMT (Single Instruction Multiple Thread)
  - Single Instruction, Multiple Thread: the SIMT architecture is closely associated with the programming model of modern GPUs, notably NVIDIA's CUDA. In CUDA, SIMT is implemented by organizing threads into warps: at any instant, all threads of a warp execute the same instruction, although each thread may operate on different data. When a branch is encountered, threads on the untaken path are masked off and must wait until the paths reconverge (branch divergence), so SIMT in some ways combines characteristics of SIMD and MIMD.
- MIMD (Multiple Instruction Multiple Data)
  - Multiple Instruction, Multiple Data stream: in a MIMD architecture, each processing unit independently executes different instructions and operates on its own independent data set. This structure is common in multi-core CPUs, distributed computing systems, and some parallel processors. Each processing unit has its own control logic and instruction stream, enabling highly parallel execution of different tasks with great flexibility and concurrency.
- SPMD (Single Program, Multiple Data)
  - SPMD is a parallel computing model in which multiple processing units (such as CPU cores or GPU threads) each execute a copy of the same program on its own local data set, so that together they process different parts of the global data. Although all processors execute the same instruction sequence, they operate on different inputs and therefore complete different sub-parts of a large task simultaneously. The model is common in distributed and parallel computing environments such as high-performance computing clusters, multi-core systems, and GPU programming. SPMD simplifies development because the programmer writes a single program that the system replicates across the available computing resources. It is related to the MIMD architecture but emphasizes one program being replicated and run independently on different data; MPI (Message Passing Interface) programming, for example, makes heavy use of the SPMD model, as the sketch below illustrates.
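As a hedged illustration, here is a minimal SPMD-style MPI sketch in C: every rank runs the same program, but branches on its rank to process a different slice of the data. The array size and slicing logic are illustrative assumptions, not from the text.

```c
#include <stdio.h>
#include <mpi.h>

#define N 1000000  /* total problem size (illustrative) */

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    /* Same program everywhere; the rank selects a different data slice. */
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk; /* last rank takes the remainder */

    long long local_sum = 0;
    for (int i = start; i < end; i++)
        local_sum += i;  /* stand-in for real per-element work */

    printf("rank %d processed [%d, %d)\n", rank, start, end);

    MPI_Finalize();
    return 0;
}
```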
Shared Memory
Dynamic Threads and Static Threads
Static Threads:
Static threads have their number and structure determined at program compile or load time. Programmers typically need to explicitly specify the number of threads and pre-assign the tasks each thread will execute. Characteristics of static threads include:
- Preset count: The exact number of threads is known before the program starts.
- Long lifecycle: Once created, threads may persist throughout the entire program execution.
- Fixed task division: The task scope of each thread is clearly defined at the beginning of the program.
- Pre-allocated resources: The system can pre-allocate fixed resources such as stack space for each thread.
Dynamic Threads:
Dynamic threads are created and destroyed dynamically during program execution based on demand. This strategy allows programs to flexibly adjust the number of threads based on actual conditions (such as load, data size, etc.) to optimize performance and resource utilization.
- On-demand creation: New threads are created at runtime based on actual computation needs and can be destroyed after completing their tasks.
- Short lifecycle: Dynamic threads can be created and ended at any time based on task execution, without needing to persist throughout the entire program lifecycle.
- Flexible task assignment: Work tasks can be more flexibly assigned to different threads, adapting to changing computation environments and load variations.
- Real-time resource management: The system dynamically allocates and reclaims resources at runtime, better utilizing system resources.
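A hedged POSIX-threads sketch of the contrast (the worker function and task counts are illustrative, not from the text): a static design launches a fixed, pre-decided number of workers at startup, while a dynamic design spawns a short-lived thread per task as it arrives.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_STATIC_WORKERS 4  /* thread count fixed before the program runs */

void *worker(void *arg) {
    long id = (long)arg;
    printf("worker %ld handling its assigned share\n", id);
    return NULL;
}

int main(void) {
    /* Static style: a fixed pool created up front, joined at the end. */
    pthread_t pool[NUM_STATIC_WORKERS];
    for (long i = 0; i < NUM_STATIC_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_STATIC_WORKERS; i++)
        pthread_join(pool[i], NULL);

    /* Dynamic style: spawn a thread only when a task actually arrives;
       it runs, finishes, and is joined immediately. */
    for (long task = 0; task < 3; task++) {   /* pretend 3 tasks arrive at runtime */
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)task);
        pthread_join(t, NULL);  /* short lifecycle: gone after its task */
    }
    return 0;
}
```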
Nondeterminism
- Nondeterminism in execution order: In an asynchronous parallel environment, multiple tasks may start at different times, execute at different speeds, or end at different times. This means that if there are data dependencies between these tasks, their execution order may affect the program's final state.
- Nondeterminism in results: Due to nondeterminism in execution order, if a parallel program is not correctly designed to handle concurrency, the program's output may vary between executions, even with the same input data. This is known as a race condition in concurrent programming.
- Nondeterminism in performance: The performance of a parallel program may be affected by various factors, including task scheduling strategy, inter-processor communication latency, memory access patterns, etc. Changes in these factors may lead to nondeterministic program performance.
- Nondeterminism in system resource utilization: In a parallel system, system resource usage (such as CPU, memory, I/O) may be affected by multiple concurrently executing tasks. Competition for system resources may cause performance bottlenecks, leading to nondeterministic resource utilization efficiency.
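The result nondeterminism described above is easy to reproduce. In this hedged sketch, two threads increment a shared counter without synchronization; because `counter++` is a read-modify-write sequence, increments can interleave and be lost, so the final value varies from run to run.

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;  /* shared, unprotected */

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter++;  /* read-modify-write: not atomic, so updates can be lost */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but typically prints less -- and a different
       value on each run: a race condition. */
    printf("counter = %ld\n", counter);
    return 0;
}
```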
Mutual Exclusion Lock, Mutex, Lock
In parallel computing and multithreaded programming, the terms mutual exclusion lock (Mutex), mutex, and lock are frequently used. They typically refer to the same concept: a synchronization mechanism used to ensure that multiple threads or processes do not conflict when accessing shared resources.
- Mutual Exclusion Lock (Mutex Lock):
  - Mutex is short for Mutual Exclusion.
  - A mutual exclusion lock protects shared resources, preventing multiple threads from accessing them simultaneously, which could lead to data races and inconsistent results.
  - When a thread acquires the mutex lock, it can access the protected resource. Until that thread releases the lock, any other thread attempting to acquire it will be blocked.
- Mutex:
  - A mutex is the object or data structure that implements the mutual exclusion lock.
  - In many programming languages and operating systems, a mutex is a specific type of variable used to control access to shared resources.
  - A mutex typically has two basic operations: lock (or acquire) and unlock (or release). A thread first locks the mutex to gain exclusive access to the resource; after finishing its work on the resource, it must unlock the mutex so other threads can access it.
- Lock:
  - A lock is a broader term for any synchronization mechanism that controls access to shared resources.
  - A lock can be a mutex lock, or another kind of lock, such as a read-write lock (which allows multiple readers to access a resource simultaneously but requires exclusive access for writers) or a spin lock (where a waiting thread repeatedly checks the lock instead of sleeping).
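Continuing the earlier race-condition sketch, wrapping the shared update in a POSIX mutex restores determinism: only one thread at a time may execute the critical section.

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* acquire: other threads now block here */
        counter++;                    /* critical section */
        pthread_mutex_unlock(&lock);  /* release: let a waiting thread in */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* now reliably 2000000 */
    return 0;
}
```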
Thread Safety
Thread Safety refers to the ability of code to be safely called by multiple threads simultaneously in a multithreaded environment without causing any issues, such as data corruption or inconsistent results. Thread-safe code can correctly handle the read and write operations that multiple threads may perform simultaneously, ensuring the integrity and consistency of shared data.
When code is not thread-safe, it is usually because multiple threads access shared data without proper synchronization, as the example below illustrates.
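A classic C example of the shared-data problem: `strtok` stores its parsing position in hidden static state, so two threads tokenizing different strings would corrupt each other. The reentrant `strtok_r` is thread-safe because the caller supplies the state.

```c
#include <string.h>
#include <stdio.h>

int main(void) {
    char line[] = "a,b,c";
    char *save;  /* per-caller state is what makes strtok_r thread-safe */

    /* strtok(line, ",") would keep its position in a hidden static
       variable shared by all threads -- not thread-safe. */
    for (char *tok = strtok_r(line, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        printf("%s\n", tok);
    return 0;
}
```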
Distributed Memory
Message Passing
Message passing in distributed memory is a parallel computing model where each process has its own private memory space, and processes exchange data by sending and receiving messages. This differs from the shared memory model, where all processes share the same memory space.
In the message passing model, processes must explicitly send messages to other processes and receive messages from other processes. Message passing is usually implemented using libraries, such as the Message Passing Interface (MPI), which is a common standard for this communication approach.
- Blocking:
  - A blocking call means the process does not execute subsequent instructions until the send or receive operation completes. For example, in a blocking send, the process waits until the message data has been copied to the system's buffer before continuing; in a blocking receive, the process waits until the matching message has arrived and been processed.
  - Blocking calls simplify program logic, because after the call returns the programmer knows the operation has completed.
- Non-blocking:
  - In contrast, a non-blocking call lets a process initiate an operation and immediately continue executing subsequent instructions without waiting for the operation to complete.
  - Non-blocking calls can improve performance because the process can perform other computation while the communication completes.
- Broadcast:
  - Broadcast is a communication pattern in which one process sends the same message to all other processes.
  - In MPI, a broadcast is performed with a single call: the `MPI_Bcast` function.
- Reduction:
  - A reduction combines data from all processes, typically to compute something such as a sum, maximum, or minimum.
  - In MPI, a reduction can be performed with the `MPI_Reduce` function, which collects input from all processes, applies the reduction operation, and delivers the result to a specified root process.
These message passing operations are the foundation of distributed memory parallel programming, allowing developers to build complex parallel algorithms across multiple compute nodes, which may be located at different physical locations.
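A hedged end-to-end sketch of the operations above: rank 0 broadcasts a parameter with `MPI_Bcast`, each rank computes a partial sum over its own stripe of the data, and `MPI_Reduce` combines the partial sums at the root. The work-splitting details are illustrative.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, n;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 100;                        /* root decides the problem size */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every rank has n */

    /* Each rank sums its own stripe of 1..n (cyclic split, illustrative). */
    long long local = 0, total = 0;
    for (int i = rank + 1; i <= n; i += size)
        local += i;

    /* Combine all partial sums at rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum 1..%d = %lld\n", n, total);  /* 5050 for n = 100 */

    MPI_Finalize();
    return 0;
}
```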
Hybrid Programming in Mixed Systems
In parallel computing, hybrid programming refers to combining multiple parallel programming models to exploit the multi-level parallelism of modern high-performance computing systems. Modern systems typically include multi-core processors, multi-processor nodes, and possibly multi-node clusters, each level of which can adopt different parallel programming techniques.
Common patterns of hybrid programming include:
- Multi-threading and Message Passing:
  - Use a multi-threading model (such as OpenMP or POSIX threads) within a single node to exploit the parallelism of its multi-core processors.
  - Use a message passing model (such as MPI) between nodes to handle inter-node communication.
- Shared Memory and Distributed Memory:
  - Use a shared memory model within a single node, letting all cores access the same memory space.
  - Use a distributed memory model across nodes, where each node has its own memory space and communicates with other nodes over the network.
- Accelerators (e.g., GPU) and CPU:
  - Use technologies such as CUDA or OpenCL to run parallel computation on accelerators (e.g., GPUs).
  - Simultaneously use the CPU for other computational tasks or for code that is not suited to accelerators.
The main advantage of hybrid programming is the ability to maximize parallel performance and resource utilization. For example, on a cluster with multiple multi-core processors, you can run one thread per core (using OpenMP or other threading libraries) while using MPI for communication between different nodes of the cluster.
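A minimal hybrid sketch under those assumptions: MPI splits the work across processes (and hence nodes), while an OpenMP parallel loop uses the cores within each process. Thread-support negotiation is simplified to a single `MPI_Init_thread` call.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int rank, size, provided;

    /* Ask MPI for thread support suitable for OpenMP inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long long local = 0, total = 0;

    /* Inner level: OpenMP threads share this rank's portion of the work. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += size)
        local += i;

    /* Outer level: MPI combines results across processes/nodes. */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %lld\n", total);

    MPI_Finalize();
    return 0;
}
```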
The challenge of hybrid programming is that it demands a deep understanding of several parallel programming techniques at once, plus careful program design to avoid complex synchronization and communication issues. Debugging and performance tuning are also more complex than with a single programming model.
Performance
In parallel computing, speedup and efficiency are two key metrics for measuring parallel program performance. They help us understand the time savings from parallelizing a problem and the effectiveness of resource utilization.
Speedup
Speedup measures the performance improvement of a parallel algorithm relative to the best serial algorithm. It is defined as the ratio of serial execution time to parallel execution time:

$$S(p) = \frac{T_{\text{serial}}}{T_{\text{parallel}}}$$

Where:

- $S(p)$ is the speedup when using $p$ processors.
- $T_{\text{serial}}$ is the execution time of the best serial algorithm.
- $T_{\text{parallel}}$ is the execution time of the parallel algorithm on $p$ processors.

Theoretically, the ideal speedup is $S(p) = p$ (linear speedup), meaning that with $p$ processors the program's execution time is $1/p$ of the single-processor time. However, due to communication, synchronization, and decomposition overhead, the actual speedup is usually less than $p$.
Brief Introduction to Amdahl's Law
Amdahl's Law provides a theoretical upper bound on speedup, assuming that only a portion of the program can be parallelized. The formula is:

$$S(p) = \frac{1}{(1 - P) + \frac{P}{p}}$$

Where $P$ is the parallelizable portion of the program. Amdahl's Law emphasizes the inherent limitation of parallelization: even with an infinite number of processors, the speedup is bounded by $\frac{1}{1-P}$, the reciprocal of the non-parallelizable portion.
Efficiency
Efficiency is a measure of processor utilization, defined as the ratio of speedup to the number of processors:

$$E(p) = \frac{S(p)}{p}$$

The ideal value of efficiency is 1 (or 100%), indicating that all processors are fully utilized with no waste. For example, a speedup of 3.2 on 4 processors gives an efficiency of $3.2/4 = 0.8$, or 80%. In practice, efficiency is usually less than 1 due to parallel overhead.
Efficiency can be affected by various factors, including:
- Communication overhead: data exchange between processors reduces efficiency.
- Load imbalance: if some processors are busy while others sit idle, efficiency drops.
- Synchronization overhead: time spent waiting for other processors to finish their tasks is wasted.
- Serial portion: according to Amdahl's Law, the serial portion of the program limits both speedup and efficiency.
When designing parallel programs, developers need to minimize these overheads and imbalances to improve speedup and efficiency. This typically involves optimization of algorithms and data structures, as well as a deep understanding of the parallel computing platform.
Relationship Between Efficiency and Scale
The relationship between efficiency and scale is an important consideration in parallel computing. This relationship is typically influenced by two main factors: the problem size (i.e., the amount of work or data set size) and the parallel system scale (i.e., the number of processors). Here are several key concepts to describe this relationship:
- Scalability: The scalability of a parallel program refers to its ability to maintain efficiency or speedup as more processors are added to handle larger data sets. Good scalability means that efficiency does not significantly decrease as the system scale increases.
- Strong Scalability: If a parallel program processes a fixed-size problem, and as the number of processors increases, it can maintain approximately the same amount of work per processor while execution time decreases correspondingly, then the program has strong scalability. In this case, efficiency is relatively stable because each processor has enough work to stay busy.
- Weak Scalability: If the number of processors increases while the problem size also increases proportionally to keep the amount of work per processor constant, then we are examining weak scalability. Ideally, even as problem size and processor count increase, efficiency should remain unchanged.
- Amdahl's Law and Gustafson's Law: Amdahl's Law emphasizes the limitation of the serial portion of a program on speedup and efficiency, while Gustafson's Law proposes that when increasing the problem size, the overall speedup can be improved by increasing the work of the parallel portion, potentially maintaining or improving efficiency.
In practice, efficiency typically decreases as the number of processors increases, due to:
- Communication overhead: communicating among more processors increases total overhead, especially when communication does not overlap well with computation.
- Load imbalance: as the number of processors grows, keeping all processors equally busy becomes more difficult.
- Synchronization delays: more processors may mean more frequent synchronization, which reduces efficiency.
- Resource contention: processors may need to share certain resources, such as memory bandwidth; as their number grows, this contention can become a performance bottleneck.
Generally, the larger the problem size, the greater the speedup and efficiency: as the scale grows, each process or thread has more work to do, so the time spent coordinating processes and threads shrinks relative to computation.
Therefore, choosing appropriate problem size and processor count, as well as optimizing parallel algorithms and implementation, is crucial for maintaining high efficiency.
Amdahl's Law
Amdahl's Law was proposed by computer scientist Gene Amdahl in 1967 to describe the maximum speedup a program can achieve through parallelization. Its core idea is that a program's speedup is limited by its serial portion: even with an unlimited number of processors, the speedup cannot grow without bound.
Amdahl's Law can be expressed as:

$$S(p) = \frac{1}{(1 - P) + \frac{P}{p}}$$

Where:

- $S(p)$ is the speedup, the ratio of execution time before and after parallelization.
- $P$ is the proportion of the program that can be parallelized (between 0 and 1).
- $1 - P$ is the proportion of the program that must execute serially.
- $p$ is the number of processors used for parallel computation.

According to Amdahl's Law, if 10% of a program is serial, then even if the rest is fully parallelized, the maximum speedup is 10x: as the number of processors grows without bound, the speedup approaches the reciprocal of the serial portion, $\frac{1}{1-P} = \frac{1}{0.1} = 10$.
Example
Suppose we have a program that is 90% parallelizable ($P = 0.9$), with the remaining 10% executed serially. We want the theoretical speedup when using 4 processors.

Applying Amdahl's Law:

$$S(4) = \frac{1}{(1 - 0.9) + \frac{0.9}{4}} = \frac{1}{0.1 + 0.225} = \frac{1}{0.325} \approx 3.08$$

This means that even with 4 processors, the maximum theoretical speedup is only about 3.08x, not 4x.
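A small C sketch that tabulates the formula, showing the speedup flattening toward the $1/(1-P) = 10$x ceiling as processors are added:

```c
#include <stdio.h>

/* Amdahl's Law: speedup with parallel fraction P on N processors. */
double amdahl(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    double P = 0.9;  /* 90% parallelizable, as in the example above */
    for (int N = 1; N <= 1024; N *= 4)
        printf("N = %4d  ->  speedup = %.2f\n", N, amdahl(P, N));
    /* Output approaches 10.00 but never exceeds it. */
    return 0;
}
```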
Timing
Parallel programming is the practice of writing programs that execute multiple computational tasks simultaneously on a multi-processor or multi-core computer. Parallel computing can significantly improve program performance, especially when processing large data sets or executing complex computations. Below are some basics of timing in parallel computing.
Timing in Parallel Computing
- Wall-clock time: Also known as real time or elapsed time, it is the actual world time required to complete a specific task.
- CPU time: The time the processor spends on a task. In parallel computing, the total CPU time may exceed the wall-clock time because multiple processors may be working simultaneously.
Timing Tools and Methods
To accurately measure the performance of parallel programs, various timing tools and methods can be used:
- Hardware timers: many processors provide hardware-supported timers (such as cycle counters) that can measure instruction execution precisely.
- Software timers: timing functions provided by the operating system or language runtime, such as `gettimeofday` on UNIX systems or the `std::chrono` library in C++11.
- Performance analysis tools: such as gprof, VTune, and TAU, which provide detailed performance analysis reports.
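A hedged C sketch contrasting the two measurements defined above: `clock_gettime(CLOCK_MONOTONIC)` for wall-clock time and `clock()` for CPU time. In a multithreaded run, the CPU time can exceed the wall-clock time, as noted earlier.

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec w0, w1;
    clock_t c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &w0);  /* wall-clock start */
    c0 = clock();                         /* CPU-time start */

    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)  /* stand-in workload */
        x += 1e-9;

    c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("wall-clock: %.3f s, CPU: %.3f s\n", wall, cpu);
    return 0;
}
```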
Foster's Methodology
In parallel computing, Foster's methodology is a method for designing parallel algorithms and programs. It was proposed by Ian Foster to help developers systematically transform problems and algorithms into effective parallel solutions. Foster's methodology includes four main steps: Partitioning, Communication, Agglomeration, and Mapping. Below is a detailed description of these four steps:
Partitioning
Partitioning is the process of dividing the entire computational task into multiple smaller tasks that can be executed in parallel. Partitioning can be based on data or tasks:
- Data partitioning: Dividing the data set into multiple subsets, each processed in parallel by different processors. For example, in matrix multiplication, a matrix can be divided into multiple sub-matrices, each processed by one processor.
- Task partitioning: Creating different tasks based on different parts of the algorithm, where each task executes a specific part of the algorithm. For example, in pipeline processing, each task may execute one stage of the pipeline.
The goal of partitioning is to maximize the number of parallel tasks while minimizing dependencies between tasks, to reduce synchronization and communication overhead.
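A common data-partitioning idiom, sketched under the assumption of a 1-D array split into near-equal contiguous blocks; the remainder elements are spread over the first few processes so no processor gets more than one extra element.

```c
#include <stdio.h>

/* Block-partition n elements among p processes: process `rank`
   owns [start, start + count). The first n % p ranks get one extra. */
void block_range(int n, int p, int rank, int *start, int *count) {
    int base = n / p, rem = n % p;
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}

int main(void) {
    int start, count;
    for (int rank = 0; rank < 4; rank++) {   /* e.g. 10 elements, 4 processes */
        block_range(10, 4, rank, &start, &count);
        printf("rank %d: [%d, %d)\n", rank, start, start + count);
    }
    return 0;  /* prints [0,3) [3,6) [6,8) [8,10) */
}
```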
Communication
After the partitioning step, it is necessary to determine the communication requirements between tasks. This includes determining which data needs to be transferred between processors and when these transfers occur. Communication can be divided into two categories:
- Local communication: If parallel tasks need to frequently exchange data, they should be physically close to each other to reduce communication latency.
- Global communication: Involves data exchange among all or multiple tasks, such as broadcasts and collective operations, which are usually more expensive and require careful management.
Properly designing the communication strategy is key to ensuring efficient parallel execution.
Agglomeration
In the agglomeration step, developers further combine the previously partitioned tasks into larger task blocks to reduce communication overhead and management complexity. This step involves trade-offs:
- Combining small tasks into large tasks can reduce the number of communications but may reduce parallelism.
- Agglomeration can also help optimize resource utilization, such as cache utilization and memory access patterns.
The goal of agglomeration is to find the optimal balance between task size and count to improve overall program performance.
Mapping
The final step is to map the agglomerated tasks onto processors. This step needs to consider:
- Load balancing: Ensure that each processor has approximately equal workload, avoiding some processors being overloaded while others are idle.
- Affinity: Map related tasks to physically nearby processors to reduce communication latency.
The mapping strategy should consider hardware resource limitations such as the number of processors, performance, and memory.