Profiling Aesara functions
Note
This method replaces the old ProfileMode. Do not use ProfileMode anymore.
Besides checking for errors, another important task is to profile your code in terms of speed and/or memory usage. You can profile your functions using either of the following two options:
- Use the Aesara flag config.profile to enable profiling. To enable the memory profiler, use the flag config.profile_memory in addition to config.profile. Moreover, to enable profiling of Aesara's rewrite phases, use the flag config.profile_optimizer in addition to config.profile. You can also use the flags profiling__n_apply, profiling__n_ops and profiling__min_memory_size to control how much information is printed.
- Pass the argument profile=True to aesara.function and then call f.profile.summary() for a single function. Use this option when you want to profile one or more specific functions rather than all of them.
You can also combine the profile results of several functions.
The profiler will output one profile per Aesara function and produce a profile that is the sum of the printed profiles. Each profile contains four sections: global info, class info, Ops info and Apply node info.
In the global section, the "Message" is the name of the Aesara function. aesara.function() has an optional name parameter that defaults to None; set it to something descriptive to tell the profiles of multiple Aesara functions apart. This section also shows the number of times the function was called (1) and the total time spent in all those calls. The time spent in Function.vm.__call__() and in thunks is useful for understanding Aesara's overhead.
Also shown is the time spent in the two parts of the compilation process: rewriting (i.e. modifying the graph to make it more stable/faster) and linking (i.e. compiling C code and building the Python callable returned by aesara.function()).
The class, Ops and Apply node sections contain the same kind of information: statistics about the Apply nodes that ran. The Ops section takes the information from the Apply section and merges the Apply nodes that have exactly the same Op: if two Apply nodes in the graph have Ops that compare equal, they will be merged. Some Ops, like Elemwise, will not compare equal if their parameters differ, so the class section will merge more Apply nodes than the Ops section.
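This merging can be pictured with a small, library-free sketch. The per-Apply timings below are made up for illustration only:

```python
from collections import defaultdict

# Hypothetical per-Apply timings, as the Apply section would report them:
# (apply_id, op_key, class_name, time_in_seconds)
apply_timings = [
    (0, "Elemwise{add,no_inplace}", "Elemwise", 3.10e-06),
    (1, "Elemwise{add,no_inplace}", "Elemwise", 9.54e-07),
    (2, "Elemwise{mul,no_inplace}", "Elemwise", 2.15e-06),
]

# The Ops section merges Apply nodes whose Ops compare equal.
per_op = defaultdict(float)
# The class section merges even more aggressively, by Op class.
per_class = defaultdict(float)

for _apply_id, op_key, class_name, t in apply_timings:
    per_op[op_key] += t
    per_class[class_name] += t

print(dict(per_op))     # two entries: the two add nodes are merged
print(dict(per_class))  # one entry: everything here is an Elemwise
```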
Note that the profile also shows which Ops were run with a C implementation.
Developers wishing to optimize the performance of their graph should focus on the worst-offending Ops and Apply nodes, either by optimizing an implementation, providing a missing C implementation, or by writing a graph rewrite that eliminates the offending Op altogether.
Here is some example output when Aesara's rewrites are disabled. With all rewrites enabled, there would be only one Op remaining in the graph.
To run the example:
AESARA_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
The output:
Function profiling
==================
Message: None
Time in 1 calls to Function.__call__: 5.698204e-05s
Time in Function.vm.__call__: 1.192093e-05s (20.921%)
Time in thunks: 6.198883e-06s (10.879%)
Total compile time: 3.642474e+00s
Aesara rewrite time: 7.326508e-02s
Aesara validate time: 3.712177e-04s
Aesara Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.07e-06s C 3 3 <class 'aesara.tensor.elemwise.Elemwise'>
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
65.4% 65.4% 0.000s 2.03e-06s C 2 2 Elemwise{add,no_inplace}
34.6% 100.0% 0.000s 2.15e-06s C 1 1 Elemwise{mul,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
50.0% 50.0% 0.000s 3.10e-06s 1 0 Elemwise{add,no_inplace}(x, y)
34.6% 84.6% 0.000s 2.15e-06s 1 2 Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
15.4% 100.0% 0.000s 9.54e-07s 1 1 Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)