Usage

Syntax

Overview

PyOMP is an extension to Numba that brings OpenMP parallel programming capabilities to Python. All PyOMP functionality is implemented in the numba.openmp module.

To use PyOMP, you must import from the numba.openmp module. Key imports include:

njit - The JIT decorator for compiling functions with OpenMP support
openmp_context (typically aliased as openmp) - The context manager for specifying OpenMP directives
OpenMP runtime functions - Functions for querying and controlling parallel execution (e.g., omp_get_thread_num(), omp_get_num_threads())

OpenMP directives

OpenMP directives are specified using a with statement on the openmp context manager, passing the directive as a string. The with statement must always be placed within a function decorated with the @njit decorator from numba.openmp. The directive string follows the C/C++ OpenMP syntax without the leading #pragma omp; for example, with openmp("parallel"): opens a parallel region. For a complete list of supported OpenMP directives with detailed information, see section OpenMP support.

Important

OpenMP regions must be placed within functions decorated with the @njit decorator from numba.openmp. Otherwise the behavior is undefined and may include runtime errors or incorrect execution.
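As a sketch of a directive that carries clauses, the following function approximates pi with a parallel loop and a sum reduction. The names calc_pi and total are illustrative, and the example assumes the installed PyOMP version supports the reduction clause (check section OpenMP support). The try/except fallback is not part of PyOMP; it only defines serial stand-ins so the sketch still runs where PyOMP is not installed.

```python
import contextlib

try:
    from numba.openmp import njit
    from numba.openmp import openmp_context as openmp
except ImportError:
    # Serial stand-ins (NOT part of PyOMP) so the sketch runs without it.
    def njit(func):
        return func

    def openmp(directive):
        return contextlib.nullcontext()


@njit
def calc_pi(num_steps):
    # Midpoint-rule integration of 4 / (1 + x^2) over [0, 1].
    step = 1.0 / num_steps
    total = 0.0
    # The string is what would follow "#pragma omp" in C/C++.
    with openmp("parallel for reduction(+:total)"):
        for i in range(num_steps):
            x = (i + 0.5) * step
            total += 4.0 / (1.0 + x * x)
    return step * total


print(calc_pi(1_000_000))
```

The reduction clause gives each thread its own partial sum and combines them when the loop ends, so the loop body needs no explicit synchronization.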

OpenMP runtime functions

Beyond directives, PyOMP exposes OpenMP runtime functions that allow you to query and control parallel execution behavior. These functions are imported directly from numba.openmp. Commonly used runtime functions include:

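omp_get_thread_num() - Returns the number of the calling thread within its team, from 0 to omp_get_num_threads() - 1
omp_get_num_threads() - Returns the number of threads in the current team (1 when called outside a parallel region)
omp_set_num_threads(n) - Sets the number of threads for subsequent parallel regions that do not specify a num_threads clause
omp_get_wtime() - Returns wall-clock time in seconds since an arbitrary point in the past; the difference between two calls gives elapsed time (useful for performance profiling)
omp_get_max_threads() - Returns an upper bound on the number of threads a new parallel region without a num_threads clause could use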

For a comprehensive list of all available runtime functions, refer to the OpenMP support documentation.
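The sketch below combines two of these functions: omp_set_num_threads() to request a thread count and omp_get_wtime() to time a parallel loop. The function timed_sum and its parameters are illustrative names, and calling these runtime functions from inside the @njit function is an assumption based on how the hello-world example uses omp_get_thread_num(). The fallback branch is not part of PyOMP; it provides serial stand-ins so the sketch runs without it.

```python
import contextlib
import time

try:
    from numba.openmp import njit, omp_get_wtime, omp_set_num_threads
    from numba.openmp import openmp_context as openmp
except ImportError:
    # Serial stand-ins (NOT part of PyOMP) so the sketch runs without it.
    def njit(func):
        return func

    def openmp(directive):
        return contextlib.nullcontext()

    def omp_set_num_threads(n):
        pass

    omp_get_wtime = time.perf_counter


@njit
def timed_sum(n, num_threads):
    # Request a thread count for the parallel region that follows.
    omp_set_num_threads(num_threads)
    total = 0.0
    start = omp_get_wtime()
    with openmp("parallel for reduction(+:total)"):
        for i in range(n):
            total += i * 0.5
    # Only the difference between two omp_get_wtime() calls is meaningful.
    elapsed = omp_get_wtime() - start
    return total, elapsed


total, elapsed = timed_sum(1000, 4)
print("sum =", total, "elapsed seconds =", elapsed)
```

Timing the region from inside the compiled function keeps JIT compilation time out of the measurement.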

Examples

CPU parallelism example

Here is a minimal parallel “hello world” example for CPU execution:

from numba.openmp import njit
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_thread_num

@njit
def hello():
    with openmp("parallel"):
        print("Hello from thread", omp_get_thread_num())

hello()

Key aspects of this example:

Imports (lines 1–3): Import the njit decorator, openmp_context context manager, and runtime function omp_get_thread_num() from numba.openmp.
@njit decorator (line 5): Required to compile the function with OpenMP support using Numba’s JIT compiler in nopython mode.
Parallel region (lines 7–8): The with openmp("parallel") statement creates a parallel region that executes the enclosed code block across multiple threads.
Runtime function (line 8): omp_get_thread_num() returns the unique thread identifier, demonstrating how to use OpenMP runtime functions within a parallel region.

On an 8-core machine with the default thread count, the program prints one line per thread. The order in which threads print is non-deterministic, so it can differ between runs:

Hello from thread 4
Hello from thread 5
Hello from thread 7
Hello from thread 0
Hello from thread 2
Hello from thread 3
Hello from thread 1
Hello from thread 6

GPU offloading example

PyOMP supports GPU programming through OpenMP’s target directive for device offloading. Currently, NVIDIA GPUs are supported; support for AMD and Intel GPUs is in development.

This example parallelizes a vector addition operation using the GPU:

from numba.openmp import njit
from numba.openmp import openmp_context as openmp
import numpy as np

@njit
def vecadd(a, b, n):
    c = np.empty(n)
    with openmp("target teams distribute parallel for"):
        for i in range(n):
            c[i] = a[i] + b[i]

    return c

n = 1000000
a = np.full(n, 1)
b = np.full(n, 2)
c = vecadd(a, b, n)
print("c =", c)

The target teams distribute parallel for directive offloads the loop to the GPU. The directive automatically distributes loop iterations across GPU teams (thread-blocks) and threads to maximize available parallelism.

Expected output:

c = [3. 3. 3. ... 3. 3. 3.]