basic – Basic Ops for neural networks

aesara.tensor.nnet.basic.sigmoid(x)
Returns the standard sigmoid nonlinearity applied to x.
Parameters: x – symbolic Tensor (or compatible)
Return type: same as x
Returns: element-wise sigmoid: $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$.
Note: see ultra_fast_sigmoid() or hard_sigmoid() for faster versions. Speed comparison for 100M float64 elements on a Core2 Duo @ 3.16 GHz:
 hard_sigmoid: 1.0s
 ultra_fast_sigmoid: 1.3s
 sigmoid (with amdlibm): 2.3s
 sigmoid (without amdlibm): 3.7s
Precision: sigmoid (with or without amdlibm) > ultra_fast_sigmoid > hard_sigmoid.
Example:
    import aesara.tensor as at
    x, y, b = at.dvectors('x', 'y', 'b')
    W = at.dmatrix('W')
    y = at.sigmoid(at.dot(W, x) + b)
Note
The underlying code will return an exact 0 or 1 if an element of x is too small or too big.
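As a rough illustration of the formula and of the saturation behaviour described in the note, here is a minimal NumPy sketch. It is a reference computation only, not the library's implementation; the helper name `sigmoid_np` is hypothetical.

```python
import numpy as np

def sigmoid_np(x):
    """Element-wise logistic sigmoid 1 / (1 + exp(-x)) (NumPy reference only)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1 and cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

# Very negative / very positive inputs saturate to exactly 0.0 and 1.0 in float64.
values = sigmoid_np(np.array([-800.0, -1.0, 0.0, 1.0, 800.0]))
```

The two-branch form is one common way to avoid overflow in `exp`; other formulations are possible.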

aesara.tensor.nnet.basic.ultra_fast_sigmoid(x)
Returns the approximated standard sigmoid() nonlinearity applied to x.
Parameters: x – symbolic Tensor (or compatible)
Return type: same as x
Returns: approximated element-wise sigmoid: $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$.
Note: To automatically change all sigmoid() ops to this version, use the Aesara optimization local_ultra_fast_sigmoid. This can be done with the Aesara flag optimizer_including=local_ultra_fast_sigmoid. This optimization is done late, so it should not affect stabilization optimization.
Note
The underlying code will return 0.00247262315663 as the minimum value and 0.997527376843 as the maximum value. So it never returns 0 or 1.
Note
Using ultra_fast_sigmoid directly in the graph will disable the stabilization optimization associated with it, but using the optimization to insert it won’t disable the stabilization optimization.

aesara.tensor.nnet.basic.hard_sigmoid(x)
Returns the approximated standard sigmoid() nonlinearity applied to x.
Parameters: x – symbolic Tensor (or compatible)
Return type: same as x
Returns: approximated element-wise sigmoid: $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$.
Note: To automatically change all sigmoid() ops to this version, use the Aesara optimization local_hard_sigmoid. This can be done with the Aesara flag optimizer_including=local_hard_sigmoid. This optimization is done late, so it should not affect stabilization optimization.
Note
The underlying code will return an exact 0 or 1 if an element of x is too small or too big.
Note
Using hard_sigmoid directly in the graph will disable the stabilization optimization associated with it, but using the optimization to insert it won’t disable the stabilization optimization.
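To give a feel for what a "hard" sigmoid looks like, the following NumPy sketch uses a clipped linear function. The slope of 0.2 and shift of 0.5 are assumptions (a common choice); this page does not state the constants used by the library, and `hard_sigmoid_np` is a hypothetical helper.

```python
import numpy as np

def hard_sigmoid_np(x, slope=0.2, shift=0.5):
    """Piecewise-linear sigmoid approximation: clip(slope * x + shift, 0, 1).

    slope and shift are assumed values, not taken from this page.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.clip(slope * x + shift, 0.0, 1.0)

# Outside the linear region the result is exactly 0 or 1.
hs = hard_sigmoid_np(np.array([-10.0, 0.0, 10.0]))
```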

aesara.tensor.nnet.basic.softplus(x)
Returns the softplus nonlinearity applied to x.
Parameter: x – symbolic Tensor (or compatible)
Return type: same as x
Returns: element-wise softplus: $\mathrm{softplus}(x) = \log_e(1 + \exp(x))$.
Note
The underlying code will return an exact 0 if an element of x is too small.
    import aesara.tensor as at
    x, y, b = at.dvectors('x', 'y', 'b')
    W = at.dmatrix('W')
    y = at.nnet.softplus(at.dot(W, x) + b)
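The softplus formula and the exact-zero behaviour mentioned in the note can be checked numerically with a small NumPy sketch. This is a reference computation under the assumption of float64 inputs, not the library code; `softplus_np` is a hypothetical name.

```python
import numpy as np

def softplus_np(x):
    """Element-wise softplus log(1 + exp(x)), computed without overflow."""
    # logaddexp(0, x) = log(exp(0) + exp(x)) = log(1 + exp(x))
    return np.logaddexp(0.0, np.asarray(x, dtype=np.float64))

# Very negative inputs give exactly 0.0; large inputs give approximately x.
values = softplus_np(np.array([-800.0, 0.0, 800.0]))
```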

aesara.tensor.nnet.basic.softmax(x)
Returns the softmax function of x.
Parameter: x – symbolic 2D Tensor (or compatible).
Return type: same as x
Returns: a symbolic 2D tensor whose ij-th element is $\mathrm{softmax}_{ij}(x) = \frac{\exp(x_{ij})}{\sum_k \exp(x_{ik})}$.
The softmax function will, when applied to a matrix, compute the softmax values row-wise.
Note: this supports Hessian-free optimization as well. The code of the softmax op is more numerically stable because it uses this code:
    e_x = exp(x - x.max(axis=1, keepdims=True))
    out = e_x / e_x.sum(axis=1, keepdims=True)
Example of use:
    import aesara.tensor as at
    x, y, b = at.dvectors('x', 'y', 'b')
    W = at.dmatrix('W')
    y = at.nnet.softmax(at.dot(W, x) + b)
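The max-subtraction trick quoted above can be tried directly in NumPy. The sketch below mirrors the two lines of pseudocode; it illustrates why shifting each row by its maximum keeps `exp` finite and is not the op's actual source. `softmax_np` is a hypothetical name.

```python
import numpy as np

def softmax_np(x):
    """Row-wise softmax of a 2D array, shifted by the row maximum for stability."""
    x = np.asarray(x, dtype=np.float64)
    e_x = np.exp(x - x.max(axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)

# The second row would overflow exp() without the shift; both rows differ only
# by a constant offset, so they yield the same probabilities.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = softmax_np(logits)
```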

aesara.tensor.nnet.relu(x, alpha=0)
Compute the element-wise rectified linear activation function.
New in version 0.7.1.
Parameters:
 x (symbolic tensor) – Tensor to compute the activation function for.
 alpha (scalar or tensor, optional) – Slope for negative input, usually between 0 and 1. The default value of 0 will lead to the standard rectifier, 1 will lead to a linear activation function, and any value in between will give a leaky rectifier. A shared variable (broadcastable against x) will result in a parameterized rectifier with learnable slope(s).
Returns: Element-wise rectifier applied to x.
Return type: symbolic tensor
Notes
This is numerically equivalent to switch(x > 0, x, alpha * x) (or maximum(x, alpha * x) for alpha < 1), but uses a faster formulation or an optimized Op, so we encourage you to use this function.
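The equivalences stated in the notes can be verified numerically. The NumPy sketch below compares the switch form, the maximum form, and one algebraically equivalent branch-free form; the latter is only an assumption about what a "faster formulation" might look like and is not confirmed by this page. All function names are hypothetical.

```python
import numpy as np

def relu_switch(x, alpha=0.0):
    """Rectifier written as switch(x > 0, x, alpha * x)."""
    return np.where(x > 0, x, alpha * x)

def relu_abs(x, alpha=0.0):
    """Branch-free form: equals x for x > 0 and alpha * x otherwise."""
    return 0.5 * (1 + alpha) * x + 0.5 * (1 - alpha) * np.abs(x)

x = np.linspace(-3.0, 3.0, 13)
# alpha = 0: standard rectifier, 0.1: leaky rectifier, 1: linear activation.
results = {a: (relu_switch(x, a), relu_abs(x, a), np.maximum(x, a * x))
           for a in (0.0, 0.1, 1.0)}
```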

aesara.tensor.nnet.elu(x, alpha=1)
Compute the element-wise exponential linear activation function [2].
New in version 0.8.0.
Parameters:
 x (symbolic tensor) – Tensor to compute the activation function for.
 alpha (scalar) –
Returns: Element-wise exponential linear activation function applied to x.
Return type: symbolic tensor
References
[2] Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”, <http://arxiv.org/abs/1511.07289>.
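For orientation, the exponential linear unit defined in reference [2] is x for x > 0 and alpha * (exp(x) - 1) otherwise. A minimal NumPy sketch of that definition follows; it is not the library implementation, and `elu_np` is a hypothetical name.

```python
import numpy as np

def elu_np(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise (NumPy reference only)."""
    x = np.asarray(x, dtype=np.float64)
    # expm1 is evaluated on min(x, 0) so large positive inputs cannot overflow.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

# Negative inputs saturate towards -alpha; positive inputs pass through.
elu_values = elu_np(np.array([-1000.0, -1.0, 0.0, 2.0]), alpha=1.0)
```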

aesara.tensor.nnet.selu(x)
Compute the element-wise Scaled Exponential Linear unit [3].
New in version 0.9.0.
Parameters: x (symbolic tensor) – Tensor to compute the activation function for.
Returns: Element-wise scaled exponential linear activation function applied to x.
Return type: symbolic tensor
References
[3] Klambauer G, Unterthiner T, Mayr A, Hochreiter S. “Self-Normalizing Neural Networks” <https://arxiv.org/abs/1706.02515>
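As described in reference [3], SELU is an ELU multiplied by a fixed scale, with fixed alpha. The NumPy sketch below uses the constants published in that paper (rounded to float64); it is an illustration, not the library's code, and the names `SELU_ALPHA`, `SELU_SCALE` and `selu_np` are hypothetical.

```python
import numpy as np

# Constants from the self-normalizing networks paper, rounded to float64.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu_np(x):
    """SELU: scale * (x if x > 0 else alpha * (exp(x) - 1))."""
    x = np.asarray(x, dtype=np.float64)
    neg = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, neg)

selu_values = selu_np(np.array([-1000.0, 0.0, 1.0]))
```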

aesara.tensor.nnet.basic.binary_crossentropy(output, target)
Computes the binary cross-entropy between a target and an output.
Parameters:
 target – symbolic Tensor (or compatible)
 output – symbolic Tensor (or compatible)
Return type: same as target
Returns: a symbolic tensor, where the following is applied element-wise: $\mathrm{crossentropy}(t, o) = -(t \cdot \log(o) + (1 - t) \cdot \log(1 - o))$.
The following block implements a simple auto-associator with a sigmoid nonlinearity and a reconstruction error which corresponds to the binary cross-entropy (note that this assumes that x will contain values between 0 and 1):
    x, y, b, c = at.dvectors('x', 'y', 'b', 'c')
    W = at.dmatrix('W')
    V = at.dmatrix('V')
    h = at.sigmoid(at.dot(W, x) + b)
    x_recons = at.sigmoid(at.dot(V, h) + c)
    recon_cost = at.nnet.binary_crossentropy(x_recons, x).mean()
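The element-wise formula can be evaluated on a few numbers with NumPy. This sketch assumes outputs strictly between 0 and 1 (otherwise the logarithm is infinite) and is only a reference computation; `binary_crossentropy_np` is a hypothetical name.

```python
import numpy as np

def binary_crossentropy_np(output, target):
    """Element-wise -(t * log(o) + (1 - t) * log(1 - o)); assumes 0 < o < 1."""
    output = np.asarray(output, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return -(target * np.log(output) + (1.0 - target) * np.log(1.0 - output))

output = np.array([0.9, 0.2, 0.5])
target = np.array([1.0, 0.0, 1.0])
loss = binary_crossentropy_np(output, target)
mean_loss = loss.mean()  # analogous to the .mean() in the example above
```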

aesara.tensor.nnet.basic.sigmoid_binary_crossentropy(output, target)
Computes the binary cross-entropy between a target and the sigmoid of an output.
Parameters:
 target – symbolic Tensor (or compatible)
 output – symbolic Tensor (or compatible)
Return type: same as target
Returns: a symbolic tensor, where the following is applied element-wise: $\mathrm{crossentropy}(t, o) = -(t \cdot \log(\mathrm{sigmoid}(o)) + (1 - t) \cdot \log(1 - \mathrm{sigmoid}(o)))$.
It is equivalent to binary_crossentropy(sigmoid(output), target), but with more efficient and numerically stable computation, especially when taking gradients.
The following block implements a simple auto-associator with a sigmoid nonlinearity and a reconstruction error which corresponds to the binary cross-entropy (note that this assumes that x will contain values between 0 and 1):
    x, y, b, c = at.dvectors('x', 'y', 'b', 'c')
    W = at.dmatrix('W')
    V = at.dmatrix('V')
    h = at.sigmoid(at.dot(W, x) + b)
    x_precons = at.dot(V, h) + c
    # final reconstructions are given by sigmoid(x_precons), but we leave
    # them unnormalized as sigmoid_binary_crossentropy applies sigmoid
    recon_cost = at.nnet.sigmoid_binary_crossentropy(x_precons, x).mean()
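One way to see why a fused formulation is more stable: the loss simplifies to softplus(o) - t * o, which never takes the logarithm of a saturated sigmoid. The NumPy sketch below compares that form with the naive composition; it illustrates the idea under that algebraic identity and does not claim to be the formulation the library uses. Function names are hypothetical.

```python
import numpy as np

def sigmoid_bce_stable(logits, target):
    """softplus(z) - t * z, with softplus(z) = max(z, 0) + log1p(exp(-|z|))."""
    z = np.asarray(logits, dtype=np.float64)
    t = np.asarray(target, dtype=np.float64)
    return np.maximum(z, 0.0) - t * z + np.log1p(np.exp(-np.abs(z)))

def sigmoid_bce_naive(logits, target):
    """binary_crossentropy(sigmoid(z), t) written out directly."""
    z = np.asarray(logits, dtype=np.float64)
    t = np.asarray(target, dtype=np.float64)
    s = 1.0 / (1.0 + np.exp(-z))
    return -(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))

z = np.array([-3.0, -0.5, 0.0, 2.0])
t = np.array([0.0, 1.0, 1.0, 0.0])
stable = sigmoid_bce_stable(z, t)
naive = sigmoid_bce_naive(z, t)

# For extreme logits the stable form stays finite, while the naive form would
# evaluate log(0).
extreme = sigmoid_bce_stable(np.array([1000.0, -1000.0]), np.array([0.0, 1.0]))
```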

aesara.tensor.nnet.basic.categorical_crossentropy(coding_dist, true_dist)
Return the cross-entropy between an approximating distribution and a true distribution. The cross-entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the “true” distribution p. Mathematically, this function computes $H(p, q) = -\sum_x p(x) \log(q(x))$, where p=true_dist and q=coding_dist.
Parameters:
 coding_dist – symbolic 2D Tensor (or compatible). Each row represents a distribution.
 true_dist – symbolic 2D Tensor OR symbolic vector of ints. In the case of an integer vector argument, each element represents the position of the ‘1’ in a 1-of-N encoding (aka “one-hot” encoding)
Return type: tensor of rank one-less-than coding_dist
Note
An application of the scenario where true_dist has a 1-of-N representation is in classification with softmax outputs. If coding_dist is the output of the softmax and true_dist is a vector of correct labels, then the function will compute $y_i = -\log(\mathrm{coding\_dist}[i, \mathrm{one\_of\_n}[i]])$, which corresponds to computing the neg-log-probability of the correct class (which is typically the training criterion in classification settings).
    y = at.nnet.softmax(at.dot(W, x) + b)
    cost = at.nnet.categorical_crossentropy(y, o)
    # o is either the above-mentioned 1-of-N vector or 2D tensor
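The two accepted forms of true_dist (integer labels or a 2D one-hot matrix) give the same per-row result, which a small NumPy sketch can confirm. This is a reference computation assuming strictly positive probabilities in coding_dist, not the library code; `categorical_crossentropy_np` is a hypothetical name.

```python
import numpy as np

def categorical_crossentropy_np(coding_dist, true_dist):
    """Per-row cross-entropy; true_dist is a 2D distribution or 1D int labels."""
    coding_dist = np.asarray(coding_dist, dtype=np.float64)
    true_dist = np.asarray(true_dist)
    if true_dist.ndim == 1:
        # Integer labels: pick -log of the probability of the correct class.
        rows = np.arange(coding_dist.shape[0])
        return -np.log(coding_dist[rows, true_dist])
    # Full distributions: -sum_x p(x) log q(x) along each row.
    return -np.sum(true_dist * np.log(coding_dist), axis=1)

q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
labels = np.array([0, 2])
one_hot = np.eye(3)[labels]

ce_labels = categorical_crossentropy_np(q, labels)
ce_one_hot = categorical_crossentropy_np(q, one_hot)
```

The result has one entry per row, that is, one rank less than coding_dist.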

aesara.tensor.nnet.h_softmax(x, batch_size, n_outputs, n_classes, n_outputs_per_class, W1, b1, W2, b2, target=None)
Two-level hierarchical softmax.
This function implements a two-layer hierarchical softmax. It is commonly used as an alternative to the softmax when the number of outputs is large (it is common to use it for millions of outputs). See reference [1] for more information about the computational gains.
The n_outputs outputs are organized in n_classes classes, each class containing the same number n_outputs_per_class of outputs. For an input x (last hidden activation), the first softmax layer predicts its class and the second softmax layer predicts its output among its class.
If target is specified, it will only compute the outputs of the corresponding targets. Otherwise, if target is None, it will compute all the outputs.
The outputs are grouped in classes in the same order as they are initially defined: if n_outputs=10 and n_classes=2, then the first class is composed of the outputs labeled {0,1,2,3,4} while the second class is composed of {5,6,7,8,9}. If you need to change the classes, you have to re-label your outputs.
New in version 0.7.1.
Parameters:
 x (tensor of shape (batch_size, number of features)) – the mini-batch input of the two-layer hierarchical softmax.
 batch_size (int) – the size of the mini-batch input x.
 n_outputs (int) – the number of outputs.
 n_classes (int) – the number of classes of the two-layer hierarchical softmax. It corresponds to the number of outputs of the first softmax. See note at the end.
 n_outputs_per_class (int) – the number of outputs per class. See note at the end.
 W1 (tensor of shape (number of features of the input x, n_classes)) – the weight matrix of the first softmax, which maps the input x to the probabilities of the classes.
 b1 (tensor of shape (n_classes,)) – the bias vector of the first softmax layer.
 W2 (tensor of shape (n_classes, number of features of the input x, n_outputs_per_class)) – the weight matrix of the second softmax, which maps the input x to the probabilities of the outputs.
 b2 (tensor of shape (n_classes, n_outputs_per_class)) – the bias vector of the second softmax layer.
 target (tensor of shape either (batch_size,) or (batch_size, 1)) – (optional, default None) contains the indices of the targets for the mini-batch input x. For each input, the function computes the output for its corresponding target. If target is None, then all the outputs are computed for each input.
Returns: Output tensor of the two-layer hierarchical softmax for input x. Depending on argument target, it can have two different shapes. If target is not specified (None), then all the outputs are computed and the returned tensor has shape (batch_size, n_outputs). Otherwise, when target is specified, only the corresponding outputs are computed and the returned tensor thus has shape (batch_size, 1).
Return type: tensor of shape (batch_size, n_outputs) or (batch_size, 1)
Notes
The product of n_outputs_per_class and n_classes has to be greater than or equal to n_outputs. If it is strictly greater, then the irrelevant outputs will be ignored. n_outputs_per_class and n_classes have to be the same as the corresponding dimensions of the tensors of W1, b1, W2 and b2. The most computationally efficient configuration is when n_outputs_per_class and n_classes are equal to the square root of n_outputs.
Examples
The following example builds a simple hierarchical softmax layer.
>>> import numpy as np
>>> import aesara
>>> import aesara.tensor as at
>>> from aesara.tensor.nnet import h_softmax
>>>
>>> # Parameters
>>> batch_size = 32
>>> n_outputs = 100
>>> dim_x = 10  # dimension of the input
>>> n_classes = int(np.ceil(np.sqrt(n_outputs)))
>>> n_outputs_per_class = n_classes
>>> output_size = n_outputs_per_class * n_outputs_per_class
>>>
>>> # First level of h_softmax
>>> floatX = aesara.config.floatX
>>> W1 = aesara.shared(
...     np.random.normal(0, 0.001, (dim_x, n_classes)).astype(floatX))
>>> b1 = aesara.shared(np.zeros((n_classes,), floatX))
>>>
>>> # Second level of h_softmax
>>> W2 = np.random.normal(0, 0.001,
...     size=(n_classes, dim_x, n_outputs_per_class)).astype(floatX)
>>> W2 = aesara.shared(W2)
>>> b2 = aesara.shared(np.zeros((n_classes, n_outputs_per_class), floatX))
>>>
>>> # We can now build the graph to compute a loss function, typically the
>>> # negative log-likelihood:
>>>
>>> x = at.imatrix('x')
>>> target = at.imatrix('target')
>>>
>>> # This only computes the output corresponding to the target.
>>> # The complexity is O(n_classes + n_outputs_per_class).
>>> y_hat_tg = h_softmax(x, batch_size, output_size, n_classes,
...                      n_outputs_per_class, W1, b1, W2, b2, target)
>>>
>>> negll = -at.mean(at.log(y_hat_tg))
>>>
>>> # We may need to compute all the outputs (at test time usually):
>>>
>>> # This computes all the outputs.
>>> # The complexity is O(n_classes * n_outputs_per_class).
>>> output = h_softmax(x, batch_size, output_size, n_classes,
...                    n_outputs_per_class, W1, b1, W2, b2)
References
[1] J. Goodman, “Classes for Fast Maximum Entropy Training,” ICASSP, 2001, <http://arxiv.org/abs/cs/0108006>.
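To make the two-level structure of h_softmax concrete, the following NumPy sketch computes the same kind of quantity on small arrays: a class probability from the first level multiplied by a within-class probability from the second level. It is an illustration under stated assumptions, not the library implementation: `h_softmax_np` is a hypothetical helper that infers n_classes and n_outputs_per_class from the shape of W2 instead of taking them (and batch_size) as arguments, and it assumes outputs are numbered class by class as described above.

```python
import numpy as np

def softmax_last(z):
    """Softmax over the last axis, shifted by the maximum for stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def h_softmax_np(x, n_outputs, W1, b1, W2, b2, target=None):
    """Two-level hierarchical softmax on NumPy arrays (illustration only)."""
    n_classes, _, n_outputs_per_class = W2.shape
    # First level: probability of each class, shape (batch, n_classes).
    class_probs = softmax_last(x @ W1 + b1)
    if target is None:
        # Second level for every class: (batch, n_classes, n_outputs_per_class).
        within = softmax_last(np.einsum('bf,cfo->bco', x, W2) + b2)
        full = (class_probs[:, :, None] * within).reshape(x.shape[0], -1)
        # Ignore padding outputs if n_classes * n_outputs_per_class > n_outputs.
        return full[:, :n_outputs]
    target = np.asarray(target).reshape(-1)
    cls = target // n_outputs_per_class   # class containing each target
    idx = target % n_outputs_per_class    # position of the target in its class
    rows = np.arange(x.shape[0])
    # Second level only for each row's target class: (batch, n_outputs_per_class).
    within = softmax_last(np.einsum('bf,bfo->bo', x, W2[cls]) + b2[cls])
    return (class_probs[rows, cls] * within[rows, idx])[:, None]

rng = np.random.default_rng(0)
batch_size, dim_x, n_classes, n_outputs_per_class = 4, 5, 3, 3
n_outputs = n_classes * n_outputs_per_class
x = rng.normal(size=(batch_size, dim_x))
W1 = rng.normal(size=(dim_x, n_classes))
b1 = np.zeros(n_classes)
W2 = rng.normal(size=(n_classes, dim_x, n_outputs_per_class))
b2 = np.zeros((n_classes, n_outputs_per_class))
target = np.array([0, 4, 8, 2])

all_probs = h_softmax_np(x, n_outputs, W1, b1, W2, b2)          # (4, 9)
tgt_probs = h_softmax_np(x, n_outputs, W1, b1, W2, b2, target)  # (4, 1)
```

With a target, only one second-level softmax per row is evaluated, which is where the saving over computing all outputs comes from.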