Profiling Aesara function

Profiling Aesara function#


This method replaces the old ProfileMode. Do not use ProfileMode anymore.

Besides checking for errors, another important task is to profile your code in terms of speed and/or memory usage. You can profile your functions using either of the following two options:

  1. Use the Aesara flag config.profile to enable profiling.
    • To enable the memory profiler use the Aesara flag: config.profile_memory in addition to config.profile.

    • Moreover, to enable the profiling of Aesara rewrite phases, use the Aesara flag: config.profile_optimizer in addition to config.profile.

    • You can also use the Aesara flags profiling__n_apply, profiling__n_ops and profiling__min_memory_size to modify the quantity of information printed.

  2. Pass the argument profile=True to the function aesara.function and then call f.profile.summary() for a single function.

    • Use this option when you want to profile not all the functions but only one or more specific function(s).

    • You can also combine the profile results of many functions:

The profiler will output one profile per Aesara function and produce a profile that is the sum of the printed profiles. Each profile contains four sections: global info, class info, Ops info and Apply node info.

In the global section, the “Message” is the name of the Aesara function. aesara.function() has an optional parameter name that defaults to None. Change it to something else to help profile many Aesara functions. In that section, we also see the number of times the function was called (1) and the total time spent in all those calls. The time spent in Function.vm.__call__() and in thunks is useful to understand Aesara’s overhead.

Also, we see the time spent in the two parts of the compilation process: rewriting (i.e. modifying the graph to make it more stable/faster) and the linking (i.e. compile C code and make the Python callable returned by aesara.function()).

The class, Ops and Apply nodes sections have the same information: i.e. information about the Apply nodes that ran. The Ops section takes the information from the Apply section and merges it with the Apply nodes that have exactly the same Op. If two Apply nodes in the graph have two Ops that compare equal, they will be merged. Some Ops, like Elemwise, will not compare equal if their parameters differ, so the class section will merge more Apply nodes than the Ops section.

Note that the profile also shows which Ops were run with C implementation.

Developers wishing to optimize the performance of their graph should focus on the worst offending Ops and Apply nodes–either by optimizing an implementation, providing a missing C implementation, or by writing a graph rewrite that eliminates the offending Op altogether.

Here is some example output when Aesara’s rewrites are disabled. With all rewrites enabled, there would be only one Op remaining in the graph.

To run the example:

AESARA_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/

The output:

Function profiling
  Message: None
  Time in 1 calls to Function.__call__: 5.698204e-05s
  Time in Function.vm.__call__: 1.192093e-05s (20.921%)
  Time in thunks: 6.198883e-06s (10.879%)
  Total compile time: 3.642474e+00s
    Aesara rewrite time: 7.326508e-02s
       Aesara validate time: 3.712177e-04s
    Aesara Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  100.0%   100.0%       0.000s       2.07e-06s     C        3        3   <class 'aesara.tensor.elemwise.Elemwise'>
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  65.4%    65.4%       0.000s       2.03e-06s     C        2        2   Elemwise{add,no_inplace}
  34.6%   100.0%       0.000s       2.15e-06s     C        1        1   Elemwise{mul,no_inplace}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  50.0%    50.0%       0.000s       3.10e-06s      1     0   Elemwise{add,no_inplace}(x, y)
  34.6%    84.6%       0.000s       2.15e-06s      1     2   Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
  15.4%   100.0%       0.000s       9.54e-07s      1     1   Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)