Usage
=====

Syntax
------

Overview
~~~~~~~~

PyOMP is an extension to Numba that brings OpenMP parallel programming capabilities to Python. All PyOMP functionality is implemented in the ``numba.openmp`` module. To use PyOMP, you **must** import from the ``numba.openmp`` module. Key imports include:

* ``njit`` - The JIT decorator for compiling functions with OpenMP support
* ``openmp_context`` (typically aliased as ``openmp``) - The context manager for specifying OpenMP directives
* OpenMP runtime functions - Functions for querying and controlling parallel execution (e.g., ``omp_get_thread_num()``, ``omp_get_num_threads()``)

OpenMP directives
~~~~~~~~~~~~~~~~~

OpenMP parallel regions are specified using a ``with`` statement for the ``openmp`` context, passing the OpenMP syntax specification as a string. The ``with`` statement for OpenMP regions **must** always be placed within a function decorated with the ``@njit`` decorator from ``numba.openmp``. The OpenMP directive syntax in PyOMP is identical to the C/C++ OpenMP syntax.

For a complete list of supported OpenMP directives with detailed information, see section :doc:`openmp`.

.. important::

   OpenMP regions **must** be placed within functions decorated with the ``@njit`` decorator from ``numba.openmp``. Failure to do so will result in undefined behavior, including potential runtime errors or incorrect execution. Always ensure that any function containing OpenMP directives is properly decorated to avoid such issues.

OpenMP runtime functions
~~~~~~~~~~~~~~~~~~~~~~~~

Beyond directives, PyOMP exposes OpenMP runtime functions that allow you to query and control parallel execution behavior. These functions are imported directly from ``numba.openmp``.
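
As a minimal sketch of that import pattern, the snippet below queries the team size from inside a parallel region. The function name ``team_size`` is hypothetical, and the ``try``/``except`` fallback with its serial stand-ins is an illustrative assumption rather than part of PyOMP; it only lets the snippet run where PyOMP is not installed.

```python
import contextlib

try:
    from numba.openmp import njit, omp_get_num_threads
    from numba.openmp import openmp_context as openmp
except ImportError:
    # ASSUMPTION: serial stand-ins, NOT part of PyOMP. They let this
    # sketch run (on a single thread) when PyOMP is unavailable.
    def njit(func):
        return func

    def openmp(directive):
        return contextlib.nullcontext()

    def omp_get_num_threads():
        return 1


@njit
def team_size():
    # Defined before the region, so the value written inside it is
    # visible after the region ends.
    size = 0
    with openmp("parallel"):
        # "single" lets exactly one thread of the team record the size.
        with openmp("single"):
            size = omp_get_num_threads()
    return size


print("threads in the parallel region:", team_size())
```

The query sits inside the parallel region because ``omp_get_num_threads()`` reports the size of the current team; outside any parallel region that team has a single thread.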
Commonly used runtime functions include:

* ``omp_get_thread_num()`` - Returns the number of the calling thread within its team, starting from 0
* ``omp_get_num_threads()`` - Returns the total number of threads in the current parallel region
* ``omp_set_num_threads(n)`` - Sets the number of threads for subsequent parallel regions
* ``omp_get_wtime()`` - Returns wall-clock time in seconds, measured from an arbitrary fixed point; the difference between two calls gives the elapsed time (useful for performance profiling)
* ``omp_get_max_threads()`` - Returns an upper bound on the number of threads available to a new parallel region

For a comprehensive list of all available runtime functions, refer to the :doc:`openmp` documentation.

Examples
--------

CPU parallelism example
~~~~~~~~~~~~~~~~~~~~~~~

Here is a minimal parallel "hello world" example for CPU execution:

.. code-block:: python
   :linenos:

   from numba.openmp import njit
   from numba.openmp import openmp_context as openmp
   from numba.openmp import omp_get_thread_num

   @njit
   def hello():
       with openmp("parallel"):
           print("Hello from thread", omp_get_thread_num())

   hello()

Key aspects of this example:

* **Imports** (lines 1--3): Import the ``njit`` decorator, the ``openmp_context`` context manager, and the runtime function ``omp_get_thread_num()`` from ``numba.openmp``.
* **@njit decorator** (line 5): Required to compile the function with OpenMP support using Numba's JIT compiler in nopython mode.
* **Parallel region** (lines 7--8): The ``with openmp("parallel")`` statement creates a parallel region in which every thread of the team executes the enclosed code block.
* **Runtime function** (line 8): ``omp_get_thread_num()`` returns the calling thread's number, demonstrating how to use OpenMP runtime functions within a parallel region.

On an 8-core machine, the program prints one line per thread. The order of the lines varies between runs because thread execution order is non-deterministic:

.. code-block:: bash

   Hello from thread 4
   Hello from thread 5
   Hello from thread 7
   Hello from thread 0
   Hello from thread 2
   Hello from thread 3
   Hello from thread 1
   Hello from thread 6

GPU offloading example
~~~~~~~~~~~~~~~~~~~~~~

PyOMP supports GPU programming through OpenMP's ``target`` directive for device offloading. Currently, NVIDIA GPUs are supported (AMD and Intel support are in development).

This example parallelizes a vector addition operation using the GPU:

.. code-block:: python
   :linenos:

   from numba.openmp import njit
   from numba.openmp import openmp_context as openmp
   import numpy as np

   @njit
   def vecadd(a, b, n):
       # Output buffer (float64 by default)
       c = np.empty(n)
       # Offload the loop and spread its iterations over GPU teams and threads
       with openmp("target teams distribute parallel for"):
           for i in range(n):
               c[i] = a[i] + b[i]
       return c

   n = 1000000
   a = np.full(n, 1)
   b = np.full(n, 2)
   c = vecadd(a, b, n)
   print("c =", c)

The ``target teams distribute parallel for`` directive offloads the loop to the GPU. The directive automatically distributes loop iterations across GPU teams (thread-blocks) and threads to maximize available parallelism.

Expected output:

.. code-block:: bash

   c = [3. 3. 3. ... 3. 3. 3.]
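
Parallel loop with reduction (sketch)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because the directive string follows C/C++ OpenMP syntax, clauses such as ``reduction`` go in the same string. The following sketch, which is not taken from this page, combines a ``parallel for`` loop with ``omp_get_wtime()`` to estimate pi by midpoint integration on the CPU. The name ``timed_pi`` and the step count are hypothetical, and the ``try``/``except`` fallback with its serial stand-ins is an illustrative assumption rather than part of PyOMP; it only lets the snippet run where PyOMP is not installed.

```python
import contextlib
import math

try:
    from numba.openmp import njit, omp_get_wtime
    from numba.openmp import openmp_context as openmp
except ImportError:
    # ASSUMPTION: serial stand-ins, NOT part of PyOMP. They let this
    # sketch run (on a single thread) when PyOMP is unavailable.
    import time

    def njit(func):
        return func

    def openmp(directive):
        return contextlib.nullcontext()

    omp_get_wtime = time.perf_counter


@njit
def timed_pi(num_steps):
    step = 1.0 / num_steps
    total = 0.0
    start = omp_get_wtime()
    # Each thread accumulates a private partial sum; the reduction
    # clause adds the partial sums into ``total`` when the loop ends.
    with openmp("parallel for reduction(+:total)"):
        for i in range(num_steps):
            x = (i + 0.5) * step  # midpoint of the i-th sub-interval
            total += 4.0 / (1.0 + x * x)
    elapsed = omp_get_wtime() - start
    return step * total, elapsed


pi_estimate, seconds = timed_pi(100_000)
print("pi is approximately", pi_estimate)
print("loop time in seconds:", seconds)
```

Without the ``reduction`` clause, all threads would update ``total`` concurrently and the result would be unreliable. With it, the only expected difference from a serial run is floating-point rounding caused by the different summation order.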