Usage

Syntax

Overview

PyOMP is an extension to Numba that brings OpenMP parallel programming capabilities to Python. All PyOMP functionality is implemented in the numba.openmp module.

To use PyOMP, you must import from the numba.openmp module. Key imports include:

  • njit - The JIT decorator for compiling functions with OpenMP support

  • openmp_context (typically aliased as openmp) - The context manager for specifying OpenMP directives

  • OpenMP runtime functions - Functions for querying and controlling parallel execution (e.g., omp_get_thread_num(), omp_get_num_threads())

OpenMP directives

OpenMP regions are specified using a with statement on the openmp context manager, passing the OpenMP directive as a string. The with statement must always be placed within a function decorated with the @njit decorator from numba.openmp. The directive syntax in PyOMP is the same as in C/C++ OpenMP, written without the #pragma omp prefix (for example, openmp("parallel")). For a complete list of supported OpenMP directives with detailed information, see the section OpenMP support.

Important

OpenMP regions must be placed within functions decorated with the @njit decorator from numba.openmp. Otherwise the behavior is undefined and may include runtime errors or incorrect execution.

OpenMP runtime functions

Beyond directives, PyOMP exposes OpenMP runtime functions that allow you to query and control parallel execution behavior. These functions are imported directly from numba.openmp. Commonly used runtime functions include:

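  • omp_get_thread_num() - Returns the number of the calling thread within its team, starting at 0

  • omp_get_num_threads() - Returns the number of threads in the current team (1 when called outside a parallel region)

  • omp_set_num_threads(n) - Sets the number of threads to use for subsequent parallel regions that do not specify a num_threads clause

  • omp_get_wtime() - Returns elapsed wall-clock time in seconds, measured from an arbitrary point in the past; subtract two readings to time a section of code

  • omp_get_max_threads() - Returns an upper bound on the number of threads that a new parallel region without a num_threads clause could use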

For a comprehensive list of all available runtime functions, refer to the OpenMP support documentation.

Examples

CPU parallelism example

Here is a minimal parallel “hello world” example for CPU execution:

 1  from numba.openmp import njit
 2  from numba.openmp import openmp_context as openmp
 3  from numba.openmp import omp_get_thread_num
 4
 5  @njit
 6  def hello():
 7      with openmp("parallel"):
 8          print("Hello from thread", omp_get_thread_num())
 9
10  hello()

Key aspects of this example:

  • Imports (lines 1–3): Import the njit decorator, openmp_context context manager, and runtime function omp_get_thread_num() from numba.openmp.

  • @njit decorator (line 5): Required to compile the function with OpenMP support using Numba’s JIT compiler in nopython mode.

  • Parallel region (lines 7–8): The with openmp("parallel") statement creates a parallel region that executes the enclosed code block across multiple threads.

  • Runtime function (line 8): omp_get_thread_num() returns the calling thread's number within the team, demonstrating how to use OpenMP runtime functions within a parallel region.

On an 8-core machine running eight threads, the output contains one line per thread. The order of the lines is non-deterministic and may differ between runs:

Hello from thread 4
Hello from thread 5
Hello from thread 7
Hello from thread 0
Hello from thread 2
Hello from thread 3
Hello from thread 1
Hello from thread 6

GPU offloading example

PyOMP supports GPU programming through OpenMP’s target directive for device offloading. Currently, only NVIDIA GPUs are supported; support for AMD and Intel GPUs is in development.

This example parallelizes a vector addition operation using the GPU:

from numba.openmp import njit
from numba.openmp import openmp_context as openmp
import numpy as np

@njit
def vecadd(a, b, n):
    # Output array; np.empty defaults to float64
    c = np.empty(n)
    # Offload the loop to the GPU
    with openmp("target teams distribute parallel for"):
        for i in range(n):
            c[i] = a[i] + b[i]

    return c

n = 1000000
a = np.full(n, 1)
b = np.full(n, 2)
c = vecadd(a, b, n)
print("c =", c)

The target teams distribute parallel for directive offloads the loop to the GPU and distributes its iterations across GPU teams (thread blocks) and the threads within each team, so that the available parallelism is used.

Expected output:

c = [3. 3. 3. ... 3. 3. 3.]