How to Make Ops#

Parametrization#

An Op class can represent one function or a wide variety of functions, depending on how you choose to parametrize it. The parameters of an Op do not show up in the structure of the computation graph - they are local to the Op: they are stored on the Op instance itself rather than as Variables, so rewrites and the compiler never see them as graph inputs. When an Op’s make_node function is called on an Op instance with a list of inputs, the computation that is performed depends on the type and value of those inputs and on the internal parameters of the Op.

It is not always obvious what should be a parameter and what should be an input. For example, a generic indexing Op could take a list and an index as graph inputs, whereas a specific indexing Op could have an index parameter, so you could have a specialized Op instance to fetch the nth element of a list, where n is known statically. As a rule of thumb, a quantity that is fixed when the graph is built (and for which you may want specialized behavior) is a good candidate for a parameter, whereas a quantity that can change from one call of the compiled function to the next must be an input. Keep in mind that parameters also become part of the Op's identity (see __eq__, __ne__ and __hash__ below), while inputs do not. Both flavors are sketched below.
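
For instance, a hypothetical indexing op could be written either way; the following sketch (all class names are made up for illustration, and make_node is omitted for brevity) contrasts the two designs:

class GetNth(Op):
    """The index is a parameter: each instance fetches one fixed position."""
    def __init__(self, n):
        self.n = n                       # fixed when the Op instance is created

    def perform(self, node, inputs, output_storage):
        (sequence,) = inputs
        out, = output_storage
        out[0] = sequence[self.n]        # n is baked into the Op


class GetItem(Op):
    """The index is a graph input: one instance handles any index."""
    def perform(self, node, inputs, output_storage):
        sequence, index = inputs         # the index is only known at run time
        out, = output_storage
        out[0] = sequence[index]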

Examples of parameterized Ops in aesara:
Broadcast(<scalar op>, <inplace?>)

upgrades an op that works on scalars so it works on tensors. Can work inplace or not.

Reduce(<scalar op>, <axes>)

reduces the specified axes using the provided scalar op.

Add(<output type inferrer>)

adds scalars and puts the result in a scalar Variable whose type is inferred from the input types using output_type_inferrer(*inputs)

Composite(<graph>)

makes a single Op out of a graph of scalar operations.

In each of these examples, the quantities in angle brackets are parameters, fixed when the Op instance is created, while the scalars or tensors that the resulting Op is applied to are its graph inputs.

Ideas:
MyOp(<debug>)

prints debugging information in perform or the C implementation if debug is True.

MyOp(<allow C>)

always uses the Python implementation if allow C is False (and raises an exception in c_code)

__eq__, __ne__ and __hash__#

In order for certain rewrites to apply (such as the merging of duplicate calculations by MergeOptimizer), it is necessary for Ops that do the same thing to compare equal. If Op instances are generated by a function call (for example) then it can happen that several different Op instances do the same thing; in that case you will have to override Op.__eq__, Op.__ne__, and Op.__hash__ for the MergeOptimizer to recognize them as equal.

Recall: the contract for any __hash__ is that a == b implies hash(a) == hash(b).
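
For instance, if instances of a parametrized op are fully determined by their parameters, the three methods can simply compare and hash those parameters. The sketch below uses a hypothetical Power op; note that in Python 3 a class that defines __eq__ without __hash__ becomes unhashable, so both must be provided:

class Power(Op):
    """Hypothetical op computing x ** n, parametrized by the exponent n."""
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        # Two Power instances do the same thing iff they have the same exponent.
        return type(self) == type(other) and self.n == other.n

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Consistent with __eq__: equal ops hash equal.
        return hash((type(self), self.n))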

Op.make_node()#

The Op.make_node() method is expected to have the following signature:

make_node(self, *inputs)

inputs may be a list of anything that the user wants to provide as symbolic input (symbolic: standing for the actual values that will be passed when the graph is compiled into an executable function). [The Aesara intro should describe symbolic in greater depth, and we should link to that from here.] This may or may not include Variable instances: make_node can also accept plain Python objects (such as numbers) and wrap them into Constants, but if you want the inputs of this Op to sometimes be outputs of another Op, then those inputs must be Variable instances. The return value should be an instance of Apply (see the example below). Here are the tasks typically handled in make_node:

  • Check that the inputs are valid. Since no actual values are available at this point, the checks are limited to static properties: the number of inputs, their types (dtype, number of dimensions, broadcastable pattern, ...), and, for Constant inputs, their values.

  • If needed, wrap the inputs in Variable instances with the proper type.

  • Make the Variable instances that will serve as the outputs of the node.

  • return Apply(self, <wrapped inputs>, <outputs>)

The inputs and outputs arguments to Apply must be lists of Variable instances (or instances of subclasses of Variable). The inputs given to Apply do not have to be the same as the inputs passed to make_node, but it is recommended that the order correspond, so that the mapping between the user-supplied arguments and the node's inputs stays easy to follow. The behavior of make_node should not depend on the structure of the graph that produced its inputs: it may inspect the type of its inputs, but not their owner field, because modifications to the graph structure do not use make_node.

Example:

from aesara.scalar import *

class Add(Op):
    # ...
    def make_node(self, x, y):
        # note 1: constant, int64 and ScalarType are defined in aesara.scalar
        # note 2: constant(x) is roughly equivalent to Constant(type=int64, data=x)
        # note 3: the call int64() is equivalent to Variable(int64, None), i.e. a fresh
        #         Variable of type ScalarType(dtype='int64')
        if isinstance(x, int):
            x = constant(x)
        elif not isinstance(x, Variable) or not x.type == int64:
            raise TypeError("expected an int64 ScalarType")
        if isinstance(y, int):
            y = constant(y)
        elif not isinstance(y, Variable) or not y.type == int64:
            raise TypeError("expected an int64 ScalarType")
        inputs = [x, y]
        outputs = [int64()]
        node = Apply(op=self, inputs=inputs, outputs=outputs)
        return node
    # ...

add = Add()                               # I make an instance of Add
node1 = add.make_node(int64(), int64())   # I make a node with two Variable inputs
node2 = add.make_node(1, 2)               # this works too
node3 = add.make_node(int64(), 79)        # so does mixing a Variable and an int
node4 = add.make_node(float64(), int64()) # this raises a TypeError

Note that add itself is an Op instance, not a node or a Variable: make_node returns an Apply node, and it is that node's outputs (node.outputs), or equivalently the Variables returned by calling add(x, y) (see __call__ below), that can be used as inputs to other Ops.

Two Apply nodes node1 and node2 are assumed by the compiler to represent the same behavior if:

1. node1.op == node2.op

2. all(input1.type == input2.type for input1, input2 in zip(node1.inputs, node2.inputs))

3. all(output1.type == output2.type for output1, output2 in zip(node1.outputs, node2.outputs))

It is considered an error to have conditions 1 and 2 but not condition 3. A corollary to those conditions is that repeated calls to make_node with the same inputs should produce equivalent nodes.
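
Continuing the Add example above, the snippet below illustrates these conditions: two nodes built from the same op on inputs of the same types are interchangeable as far as the compiler is concerned, so MergeOptimizer may merge them.

x, y = int64(), int64()
node_a = add.make_node(x, y)
node_b = add.make_node(x, y)

assert node_a.op == node_b.op
assert all(i1.type == i2.type for i1, i2 in zip(node_a.inputs, node_b.inputs))
assert all(o1.type == o2.type for o1, o2 in zip(node_a.outputs, node_b.outputs))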

__call__#

In Op, __call__ is defined in terms of make_node. Instead of returning a node, it returns the output Variables directly, which is more convenient for the user. In pseudocode:

if len(outputs) == 1:
    __call__(*inputs) <=> make_node(*inputs).outputs[0]
else:
    __call__(*inputs) <=> make_node(*inputs).outputs

It is not necessary or recommended to override __call__ unless you want to hide some outputs from view (see hidden outputs section).
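
Continuing the Add example, calling the op instance therefore returns a Variable that can be fed directly to other ops, and the underlying Apply node remains reachable through its owner field:

add = Add()
x, y = int64(), int64()
z = add(x, y)                    # __call__ builds the Apply node internally

assert isinstance(z, Variable)   # a Variable, usable as input to other ops
assert z.owner.op is add         # the Apply node built by make_node
assert z.owner.inputs[0] is x and z.owner.inputs[1] is y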

perform#

The perform method is expected to have the following signature:

perform(self, node, inputs, output_storage)

Where:
  • node: a reference to an Apply instance - node is assumed to have been produced by a previous call to self.make_node.

  • inputs: not the same as node.inputs - node.inputs holds the symbolic Variables, whereas inputs is the list of concrete values (e.g. numbers or NumPy arrays) computed for those Variables at run time.

  • output_storage: not the same as node.outputs - it is a list with one entry per output of the node, each entry being a list of length 1 in which the computed output value must be placed.

[Would it be better to call inputs ‘inputs_storage’?]

Here is an example of a properly defined perform:

class Add(Op):
    # ...
    def perform(self, node, inputs, output_storage):
        # this computes z = x + y
        x, y = inputs        # extract the two input values
        z, = output_storage  # unpack the single storage cell (the comma is not optional:
                             # without it, z would be the whole output_storage list)
        z[0] = x + y         # we must put the result in z[0]
    # ...

add = Add()                               # I make an instance of Add
node = add.make_node(int64(), int64())    # I make a node with two integer inputs
storage = [None]                          # I make my storage as a 1-element list with None
add.perform(node, (3, 7), (storage,))     # I provide the node, two inputs and storage for one output
print(storage[0])                         # prints 10

In this example neither self nor node is actually used: Add has no parameters, and the computation does not need any information from the node. The node parameter is not always needed, but it can come in handy, for example when perform needs to know the exact output type (such as its dtype), which it can read from node.outputs[0].type. There are as many entries in output_storage as there are in node.outputs, and each entry is a list of length 1. The outputs must be computed from the inputs and put in those lists. The lists in output_storage must not be resized: the only allowed operations are to set or read their first element.
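
For an op with several outputs, the same rules apply per output. Here is a minimal sketch of a hypothetical DivMod op with two outputs (the import paths assume the usual aesara.graph and aesara.scalar locations):

from aesara.graph.basic import Apply
from aesara.graph.op import Op
from aesara.scalar import int64

class DivMod(Op):
    """Hypothetical op computing both x // y and x % y."""
    def make_node(self, x, y):
        return Apply(self, [x, y], [int64(), int64()])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        quot, rem = output_storage   # one single-element list per output
        quot[0] = x // y             # write only the first element of each cell
        rem[0] = x % y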

reusing outputs#

The output storage in output_storage might not be empty: whatever perform allocated and stored there on a previous execution of the node may still be present the next time around. This is an intended feature and it is acceptable for perform to reuse what is in the output storage if it is worth it. For example, if perform must add two 1000x1000 matrices into a new matrix of the same size and there is already a 1000x1000 matrix in the corresponding output storage, it may reuse it and thus save a lot of memory and allocation time. It may also freely discard what is already there.

Note that it is not guaranteed that the outputs will stick around: the linker may, at its discretion, clean them up. Nor is it guaranteed (though it will usually be the case) that the contents of the output storage were allocated by a previous call to perform. It is, however, guaranteed that the contents are either None or a structure of the proper type, which perform can use.

If the contents of the storage are None, new storage is expected for that output (the typical case is that we “gave” the output to the user, so we don’t own it anymore). Therefore, it is not acceptable to keep a private cache of previously allocated storage unless you know what you are doing.
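
For example, a perform for an elementwise addition of NumPy arrays could reuse the previously allocated output array whenever its shape and dtype still match (a sketch only, not Aesara's actual elementwise implementation):

import numpy

class Add(Op):
    # ...
    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z, = output_storage
        old = z[0]
        if old is None or old.shape != x.shape or old.dtype != x.dtype:
            # Nothing reusable: allocate fresh storage.
            z[0] = numpy.empty(x.shape, dtype=x.dtype)
        # Fill the (new or reused) array in place.
        numpy.add(x, y, out=z[0])
    # ...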

Advanced note: for an Op with multiple outputs, it is possible that some of them can be reused and some others not. If an Op with multiple outputs shares storage between them (e.g. the first output is a view of the second), then when the first output's storage is reset to None, the second's storage should not be reused even if it is available, because a fresh output is expected for the first. It is not recommended in general to share storage between outputs unless one of them is hidden (see the hidden outputs section), because the engine does not know how to handle that situation safely.

grad#

grad is specific to Aesara's symbolic differentiation machinery, as opposed to the core graph rewriting and compilation facilities, which never call it; it provides a useful interface to differentiation. Its expected signature is:

grad(self, inputs, output_gradients)

where:
  • inputs is a list of Variable instances. It is assumed to be the inputs field of a node produced by make_node.

  • output_gradients is a list of Variable instances. They have the same properties as the outputs of the node, but are filled with gradient values.

Essentially, the semantics are:

# Schematic pseudocode: dz/dx stands for the symbolic derivative of output z
# with respect to input x.
def grad(self, inputs, output_gradients):      # one input x, one output z
    (x,), (gz,) = inputs, output_gradients
    return [gz * (dz/dx)]

def grad(self, inputs, output_gradients):      # two inputs x, y; one output z
    (x, y), (gz,) = inputs, output_gradients
    return [gz * (dz/dx), gz * (dz/dy)]

def grad(self, inputs, output_gradients):      # two inputs x, y; two outputs z, w
    # each returned gradient has the shape of the corresponding input
    (x, y), (gz, gw) = inputs, output_gradients
    return [gz * (dz/dx) + gw * (dw/dx), gz * (dz/dy) + gw * (dw/dy)]

More specifically, grad must return a list or tuple of input gradients, as many as there are inputs. Let C be a Variable (currently assumed to be a scalar) that depends through an Aesara symbolic expression on the node outputs. Then each output_gradients[i] represents symbolically dC/doutputs[i]. The returned input gradients should represent symbolically dC/dinputs[i].

Example:

class Mul(Op):
    ...
    def grad(self, inputs, output_gradients):
        x, y = inputs
        gz, = output_gradients   # here again, the comma is not optional
        return mul(gz, y), mul(gz, x)
    ...
mul = Mul()

If the op is not differentiable with respect to one of its inputs, the gradient for that input should be None; if the op is not differentiable with respect to any of its inputs, it should return something equivalent to [None] * len(inputs) (see the sketch after the reference below). If grad is not implemented for some op in a graph, the symbolic gradient engine will raise an error when it reaches that op.

If the op only has one input, be careful to still return a list or tuple:
  • fine: return gx,

  • fine: return [gx]

  • not fine: return gx

The principle behind this is explained in section 2 of http://www.iro.umontreal.ca/~pift6266/A06/cours/gradient.pdf.
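
As an illustration, a hypothetical op computing x scaled by an integer input c would return a symbolic gradient for x and None for c (a sketch only; mul is the scalar multiplication op from aesara.scalar):

from aesara.scalar import mul

class Scale(Op):
    # ... make_node builds a node with inputs [x, c] and one output ...
    def grad(self, inputs, output_gradients):
        x, c = inputs
        gz, = output_gradients
        # Differentiable with respect to x, but not with respect to the
        # integer input c: still return one entry per input, with None for c.
        return [mul(gz, c), None]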

Destroyers and viewers#

Destroyers#

An Op may change the contents of its inputs. For example, z = add_inplace(x, y) will increment x with y, erasing the previous contents of x; z represents x after it was incremented. The engine needs to be told about all this, so that it can guarantee that add_inplace is only executed once x is no longer needed anywhere else.

This is done by setting the destroy_map field of the op. destroy_map must be a dictionary that maps an output index (or None) to the list of input indices destroyed by that output. For example, add_inplace.destroy_map == {0: [0]}, because the first input is overwritten by the first output. If it were y that was overwritten, then destroy_map would be {0: [1]}, because the second input is overwritten by the first output. In a nutshell, to each output must correspond the list of inputs that were changed and share storage with that output. Use None for inputs that were only destroyed to do temporary calculations and are not reused as output storage.
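
A sketch of how this looks in an Op definition (assuming NumPy array inputs; the class name is illustrative):

class AddInplace(Op):
    """Hypothetical in-place addition: the output reuses the first input's storage."""
    destroy_map = {0: [0]}   # output 0 overwrites input 0

    def make_node(self, x, y):
        return Apply(self, [x, y], [x.type()])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z, = output_storage
        x += y        # modifies the first input's buffer in place
        z[0] = x      # the output shares storage with input 0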

Viewers#

Similarly, an Op might not modify its inputs, but return an output that shares state with one or several of them. For example, transpose can be done efficiently by viewing the same data as the original with modified dimensions and strides. That is fine, but the compiler needs to be told.

This is done by setting the view_map field of the op. It works like the destroy_map field: to an output index is associated the list of inputs that it shares state with. For example, transpose.view_map == {0: [0]}, because its first output uses the same data as its first input. view_map is conservative: if there is any possibility that an output will be a view of an input, that input must be in the view list of that output.

Important note: currently, an output can only be the view of one input. This is limiting, as an ‘if’ or ‘switch’ op would need to declare its output as a view of both its then and else branches, but for the time being the framework is not powerful enough to handle it. A future version should address this issue.
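
A sketch of a view-producing op (ignoring, for brevity, the fact that a real transpose must also adjust the static shape information of the output type):

class TransposeView(Op):
    """Hypothetical transpose whose output is a NumPy view of its input."""
    view_map = {0: [0]}   # output 0 may share memory with input 0

    def make_node(self, x):
        return Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x, = inputs
        z, = output_storage
        z[0] = x.T    # x.T is a view: no data is copied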

Hidden outputs (as a form of op state)#

For performance purposes, an op might want to have a hidden internal state.

Example: if we expect to call the op repeatedly on incrementally bigger inputs, we might want private output storage that is much bigger than needed and take incrementally bigger views on it, to save allocation overhead. To do this, we can have two outputs: one that we return normally and that contains the answer, and another that serves as the (larger) container. In this case, the advanced note in the ‘reusing outputs’ section applies. Furthermore, __call__ should return only the first output instead of both of them; in the sketch below this is done by setting default_output = 0 rather than overriding __call__ directly. Here is what the example’s make_node and perform would look like:

import numpy

class Add(Op):
    """
    Use a hidden buffer to prevent unnecessary reallocation of memory.
    """
    default_output = 0   # __call__ returns only the first (visible) output

    def make_node(self, x, y):
        return Apply(self, [x, y], [x.type.make_variable(), x.type.make_variable()])

    def perform(self, node, inputs, output_storage):
        x, y = inputs
        z, stor = output_storage              # z: visible output, stor: hidden buffer
        if z[0] is None or stor[0] is None:
            stor[0] = numpy.empty(x.size * 2, dtype=x.dtype)
        elif x.size > stor[0].size:
            stor[0].resize(x.size * 2, refcheck=False)
        z[0] = stor[0][:x.size].reshape(x.shape)  # visible output is a view of the buffer
        numpy.add(x, y, out=z[0])
    # ...

Another example: for an FFTW Op, we would like to cache FFTW’s plan along with the inputs it was computed on, so we can reuse it if the new inputs are similar to the previous ones.

It is also possible, but potentially more complicated, to use “private inputs” to do the same thing: inputs cannot be set, though their contents can be modified, so a wrapper would be needed, and the input must be marked as destroyed by the Op using the destroy_map field.