Usage
Syntax
Overview
PyOMP is an extension to Numba that brings
OpenMP parallel programming capabilities to Python. All PyOMP functionality
is implemented in the numba.openmp module.
To use PyOMP, you must import from the numba.openmp module. Key imports
include:
- njit - The JIT decorator for compiling functions with OpenMP support
- openmp_context (typically aliased as openmp) - The context manager for specifying OpenMP directives
- OpenMP runtime functions - Functions for querying and controlling parallel execution (e.g., omp_get_thread_num(), omp_get_num_threads())
OpenMP directives
OpenMP parallel regions are specified using a with statement for the
openmp context, passing the OpenMP syntax specification as a string.
The with statement for OpenMP regions must always be placed
within a function decorated with the @njit decorator from numba.openmp.
The OpenMP directive syntax in PyOMP is identical to the C/C++ OpenMP syntax.
For a complete list of supported OpenMP directives with detailed information,
see section OpenMP support.
Important
OpenMP regions must be placed within functions decorated with the
@njit decorator from numba.openmp. Otherwise, the behavior is undefined
and may include runtime errors or incorrect execution.
OpenMP runtime functions
Beyond directives, PyOMP exposes OpenMP runtime functions that allow you to
query and control parallel execution behavior. These functions are imported
directly from numba.openmp. Commonly used runtime functions include:
- omp_get_thread_num() - Returns the number of the calling thread within its team (0 for the thread that started the region)
- omp_get_num_threads() - Returns the total number of threads in the current parallel region
- omp_set_num_threads(n) - Sets the number of threads for subsequent parallel regions
- omp_get_wtime() - Returns elapsed wall-clock time in seconds (useful for performance profiling)
- omp_get_max_threads() - Returns the maximum number of threads available to a new parallel region
For a comprehensive list of all available runtime functions, refer to the OpenMP support documentation.
Examples
CPU parallelism example
Here is a minimal parallel “hello world” example for CPU execution:
 1  from numba.openmp import njit
 2  from numba.openmp import openmp_context as openmp
 3  from numba.openmp import omp_get_thread_num
 4
 5  @njit
 6  def hello():
 7      with openmp("parallel"):
 8          print("Hello from thread", omp_get_thread_num())
 9
10  hello()
Key aspects of this example:
- Imports (lines 1–3): Import the njit decorator, openmp_context context manager, and runtime function omp_get_thread_num() from numba.openmp.
- @njit decorator (line 5): Required to compile the function with OpenMP support using Numba’s JIT compiler in nopython mode.
- Parallel region (lines 7–8): The with openmp("parallel") statement creates a parallel region that executes the enclosed code block across multiple threads.
- Runtime function (line 8): omp_get_thread_num() returns the unique thread identifier, demonstrating how to use OpenMP runtime functions within a parallel region.
On an 8-core machine, assuming the runtime starts one thread per core, the output contains one line per thread. Thread execution order is non-deterministic, so the lines may appear in any order:
Hello from thread 4
Hello from thread 5
Hello from thread 7
Hello from thread 0
Hello from thread 2
Hello from thread 3
Hello from thread 1
Hello from thread 6
GPU offloading example
PyOMP supports GPU programming through OpenMP’s target directive for device offloading.
Currently, NVIDIA GPUs are supported (AMD and Intel support are in development).
This example parallelizes a vector addition operation using the GPU:
from numba.openmp import njit
from numba.openmp import openmp_context as openmp
import numpy as np

@njit
def vecadd(a, b, n):
    c = np.empty(n)
    with openmp("target teams distribute parallel for"):
        for i in range(n):
            c[i] = a[i] + b[i]

    return c

n = 1000000
a = np.full(n, 1)
b = np.full(n, 2)
c = vecadd(a, b, n)
print("c = ", c)
The target teams distribute parallel for directive offloads the loop to the GPU.
The directive automatically distributes loop iterations across GPU teams (thread-blocks)
and threads to maximize available parallelism.
Expected output:
c = [3. 3. 3. ... 3. 3. 3.]