Flattens the input array into a 2-D array by collapsing the higher dimensions. Note: Flatten is deprecated. Use flatten instead. For an input array with shape (d1,d2,...,dk), the flatten operation reshapes the input array into an output array of shape (d1,d2*...*dk). Note that the behavior of this function is different from numpy.ndarray.flatten, which behaves similarly to mxnet.ndarray.reshape((-1,)).
Given data that is quantized in int32 and the corresponding thresholds, requantize the data into int8 using min and max thresholds either calculated at runtime or from calibration.
Slices a region of the array like the shape of another array. This function is similar to slice; however, the begin values are always 0 and the end values of the specified axes are inferred from the second input, shape_like. Given a second input shape_like of shape=(d_0,d_1,...,d_n-1), a slice_like operator with the default empty axes performs the following operation: ``out = slice(input, begin=(0, 0, ..., 0), end=(d_0, d_1, ..., d_n-1))``. When axes is not empty, it specifies which axes are sliced. Given a 4-d input, a slice_like operator with axes=(0,2,-1) performs the following operation: ``out = slice(input, begin=(0, 0, 0, 0), end=(d_0, None, d_2, d_3))``. The first and second inputs may have different numbers of dimensions, but in that case axes must be specified and must not exceed the dimension limits. For example, given input_1 with shape=(2,3,4,5) and input_2 with shape=(1,2,3), ``out = slice_like(a, b)`` is not allowed because ndim of input_1 is 4 and ndim of input_2 is 3. The following is allowed in this situation: ``out = slice_like(a, b, axes=(0, 2))``.
Return an array with evenly spaced values. If axis is not given, the output will
have the same shape as the input array. Otherwise, the output will be a 1-D array with size of
the specified axis in input shape.
start (double, optional, default=0) – Start of interval. The interval includes this value. The default start value is 0.
step (double, optional, default=1) – Spacing between values.
repeat (int, optional, default='1') – Number of times each element is repeated. E.g. with repeat=3, the element a is repeated three times: a, a, a.
ctx (string, optional, default='') – Context of output, in format [cpu|gpu|cpu_pinned](n). Only used for imperative calls.
axis (int or None, optional, default='None') – Arrange elements according to the size of a certain axis of the input array. Negative numbers are interpreted as counting from the end. If not provided, elements are arranged according to the input shape.
name (string, optional.) – Name of the resulting symbol.
batch_dot is used to compute the dot product of x and y when x and
y are batched data, namely N-D (N >= 3) arrays with shape (B_0, …, B_i, :, :).
For example, given x with shape (B_0, …, B_i, N, M) and y with shape
(B_0, …, B_i, M, K), the result array will have shape (B_0, …, B_i, N, K),
which is computed by:
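The formula itself is missing from this extract. As a minimal sketch of the idea, assuming a single batch axis and plain nested lists, each batch entry is multiplied independently with an ordinary matrix product. The helper names `matmul` and `batch_dot_3d` are hypothetical and are not part of the library.

```python
def matmul(a, b):
    """Plain 2-D matrix product of nested lists: (N, M) x (M, K) -> (N, K)."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col))
             for col in zip(*b)]
            for row in a]


def batch_dot_3d(x, y):
    """Apply the matrix product independently to each batch entry."""
    return [matmul(x_b, y_b) for x_b, y_b in zip(x, y)]


x = [[[1, 2], [3, 4]],      # batch 0, shape (2, 2)
     [[1, 0], [0, 1]]]      # batch 1, identity matrix
y = [[[5], [6]],            # batch 0, shape (2, 1)
     [[7], [8]]]            # batch 1
result = batch_dot_3d(x, y)
print(result)  # [[[17], [39]], [[7], [8]]]
```

Here x has shape (2, 2, 2) and y has shape (2, 2, 1), so the result has shape (2, 2, 1), matching the (B_0, N, K) pattern described above.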
transpose_a (boolean, optional, default=0) – If true then transpose the first input before dot.
transpose_b (boolean, optional, default=0) – If true then transpose the second input before dot.
forward_stype ({None, 'csr', 'default', 'row_sparse'}, optional, default='None') – The desired storage type of the forward output given by the user. If the combination of input storage types and this hint does not match any implemented ones, the dot operator will perform a fallback operation and still produce an output of the desired storage type.
name (string, optional.) – Name of the resulting symbol.
Flattens the input array into a 2-D array by collapsing the higher dimensions.
.. note:: Flatten is deprecated. Use flatten instead.
For an input array with shape (d1,d2,...,dk), flatten operation reshapes
the input array into an output array of shape (d1,d2*...*dk).
Note that the behavior of this function is different from numpy.ndarray.flatten,
which behaves similarly to mxnet.ndarray.reshape((-1,)).
Example:
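The original example is not present in this extract. The following is a small stand-in sketch in pure Python that keeps the first axis and collapses the rest, which is the (d1, d2*...*dk) reshaping described above. The helper names are hypothetical and the nested-list representation is an assumption made for illustration.

```python
def flatten_trailing(values):
    """Recursively flatten an arbitrarily nested list into a flat list."""
    if not isinstance(values, list):
        return [values]
    flat = []
    for item in values:
        flat.extend(flatten_trailing(item))
    return flat


def flatten_2d(x):
    """Keep the first axis and collapse all remaining axes into one."""
    return [flatten_trailing(sample) for sample in x]


x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]   # shape (2, 2, 3)
out = flatten_2d(x)
print(out)  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]], i.e. shape (2, 6)
```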
Both mean and var return a scalar by treating the input as a vector.
Assume the input has size k on axis 1, then both gamma and beta
have shape (k,). If output_mean_var is set to true, then the operator outputs both data_mean and
the inverse of data_var, which are needed for the backward pass. Note that the gradients of these
two outputs are blocked.
Besides the inputs and the outputs, this operator accepts two auxiliary
states, moving_mean and moving_var, which are k-length
vectors. They are global statistics for the whole dataset, which are updated
by:
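The update rule is missing from this extract. A minimal sketch, assuming the usual exponential moving average controlled by the momentum parameter listed below (this form is an assumption, and `update_moving_stat` is a hypothetical helper):

```python
def update_moving_stat(moving, batch_stat, momentum=0.9):
    """Assumed rule: moving = moving * momentum + batch_stat * (1 - momentum)."""
    return [m * momentum + b * (1.0 - momentum)
            for m, b in zip(moving, batch_stat)]


moving_mean = [0.0, 1.0]    # k = 2 channels
data_mean = [10.0, 1.0]     # per-channel mean of the current batch
moving_mean = update_moving_stat(moving_mean, data_mean, momentum=0.9)
print(moving_mean)          # first channel moves 10% of the way toward 10.0
```

With a momentum close to 1, the moving statistics change slowly and smooth out batch-to-batch noise. A channel whose batch statistic already equals the moving value stays where it is.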
If use_global_stats is set to be true, then moving_mean and
moving_var are used instead of data_mean and data_var to compute
the output. It is often used during inference.
The parameter axis specifies which axis of the input shape denotes
the ‘channel’ (separately normalized groups). The default is 1. Specifying -1 sets the channel
axis to be the last item in the input shape.
Both gamma and beta are learnable parameters. But if fix_gamma is true,
then set gamma to 1 and its gradient to 0.
Note
When fix_gamma is set to True, no sparse support is provided. If fix_gamma is set to False,
the sparse tensors will fall back.
Defined in /home/smola/mxnet/src/operator/nn/batch_norm.cc:L636
eps (double, optional, default=0.0010000000474974513) – Epsilon to prevent div 0. Must be no less than CUDNN_BN_MIN_EPSILON defined in cudnn.h when using cudnn (usually 1e-5)
momentum (float, optional, default=0.899999976) – Momentum for moving average
fix_gamma (boolean, optional, default=1) – Fix gamma while training
use_global_stats (boolean, optional, default=0) – Whether to use global moving statistics instead of local batch statistics. This forces batch-norm to act as a scale-shift operator.
output_mean_var (boolean, optional, default=0) – Output the mean and inverse std
axis (int, optional, default='1') – Specify which axis of the input shape is the channel axis
cudnn_off (boolean, optional, default=0) – Do not select CUDNN operator, if available
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32 obtained through calibration. If present, it will be used by the quantized batch norm op to calculate the primitive scale. Note: this calib_range is used to calibrate the bn output.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32 obtained through calibration. If present, it will be used by the quantized batch norm op to calculate the primitive scale. Note: this calib_range is used to calibrate the bn output.
name (string, optional.) – Name of the resulting symbol.
The matching is performed on score matrix with shape [B, N, M]
- B: batch_size
- N: number of rows to match
- M: number of columns as reference to be matched against.
Returns:
x : matched column indices. -1 indicating non-matched elements in rows.
y : matched row indices.
The output will be sorted in descending order according to score. Boxes with
overlaps larger than overlap_thresh, smaller scores and background boxes
will be removed and filled with -1; the corresponding positions will be recorded
for backward propagation.
During back-propagation, the gradient will be copied to the original
position according to the input index. For positions that have been suppressed,
the in_grad will be assigned 0.
In summary, gradients stick to their boxes and will either be moved or discarded
according to their original index in the input.
By default, a box is [id, score, xmin, ymin, xmax, ymax, …],
additional elements are allowed.
id_index: optional, use -1 to ignore, useful if force_suppress=False, which means
we will skip highly overlapped boxes if one is apple while the other is car.
background_id: optional, default=-1, class id for background boxes, useful
when id_index>=0 which means boxes with background id will be filtered before nms.
coord_start: required, default=2, the starting index of the 4 coordinates.
Two formats are supported:
corner: [xmin, ymin, xmax, ymax]
center: [x, y, width, height]
score_index: required, default=1, box score/confidence.
When two boxes overlap with IoU > overlap_thresh, the one with the smaller score will be suppressed.
in_format and out_format: default=’corner’, specify in/out box formats.
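The suppression rule described above can be sketched as a greedy loop. This is a simplified illustration only: it assumes the default box layout [id, score, xmin, ymin, xmax, ymax] in corner format, ignores id_index, background_id, valid_thresh and topk, and the function names are hypothetical rather than the operator's implementation.

```python
def iou(a, b):
    """Intersection over union of two corner-format boxes [xmin, ymin, xmax, ymax]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, overlap_thresh=0.5, score_index=1, coord_start=2):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    ordered = sorted(boxes, key=lambda box: box[score_index], reverse=True)
    kept = []
    for box in ordered:
        coords = box[coord_start:coord_start + 4]
        if all(iou(coords, k[coord_start:coord_start + 4]) <= overlap_thresh
               for k in kept):
            kept.append(box)
    # Suppressed rows are filled with -1 so the output keeps the input's row count.
    filler = [-1] * len(boxes[0]) if boxes else []
    return kept + [list(filler) for _ in range(len(boxes) - len(kept))]


boxes = [[0, 0.9, 0, 0, 10, 10],
         [0, 0.8, 1, 1, 10, 10],     # IoU with the first box is 0.81
         [1, 0.7, 20, 20, 30, 30]]   # does not overlap the others
out = nms(boxes)
print(out)
# [[0, 0.9, 0, 0, 10, 10], [1, 0.7, 20, 20, 30, 30], [-1, -1, -1, -1, -1, -1]]
```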
overlap_thresh (float, optional, default=0.5) – Overlapping(IoU) threshold to suppress object with smaller score.
valid_thresh (float, optional, default=0) – Filter input boxes to those whose scores are greater than valid_thresh.
topk (int, optional, default='-1') – Apply nms to the topk boxes with descending scores; -1 means no restriction.
coord_start (int, optional, default='2') – Start index of the consecutive 4 coordinates.
score_index (int, optional, default='1') – Index of the scores/confidence of boxes.
id_index (int, optional, default='-1') – Optional, index of the class categories, -1 to disable.
background_id (int, optional, default='-1') – Optional, id of the background class which will be ignored in nms.
force_suppress (boolean, optional, default=0) – Optional, if set to false and id_index is provided, nms will only apply to boxes that belong to the same category
Broadcasting is a mechanism that allows ndarrays to perform arithmetic operations
with arrays of different shapes efficiently without creating multiple copies of arrays.
See also Broadcasting for more explanation.
Broadcasting is allowed on axes with size 1, such as from (2,1,3,1) to
(2,8,3,9). Elements will be duplicated on the broadcasted axes.
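As a small illustration of duplication along a size-1 axis, the sketch below broadcasts a (2, 1) nested list to (2, 3). The helper `broadcast_axis1` is hypothetical and only handles this 2-D case.

```python
def broadcast_axis1(rows, size):
    """Broadcast a (n, 1) nested list to (n, size) by repeating each element."""
    for row in rows:
        if len(row) != 1:
            raise ValueError("only axes of size 1 can be broadcast")
    return [row * size for row in rows]


x = [[1], [2]]                 # shape (2, 1)
out = broadcast_axis1(x, 3)    # shape (2, 3)
print(out)  # [[1, 1, 1], [2, 2, 2]]
```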
This operator checks whether all the elements in a boolean tensor are true.
If not, a ValueError exception will be raised in the backend with the given error message.
To evaluate this operator, multiply the original tensor by the return value
of this operator so that it becomes part of the computation graph;
otherwise the check will not work in symbolic mode.
msg (string) – The error message in the exception.
Returns:
out – If all the elements in the input tensor are true,
array(True) is returned; otherwise a ValueError exception is
raised before anything is returned.

>>> loc = np.zeros((2, 2))
>>> scale = np.array(#some_value)
>>> constraint = (scale > 0)
>>> np.random.normal(loc, scale * npx.constraint_check(constraint, 'Scale should be larger than zero'))

If the elements in the scale tensor are all greater than zero, npx.constraint_check returns
np.array(True), which does not change the value of scale when the two are multiplied.
If some of the elements in the scale tensor violate the constraint,
i.e. there exists False in the boolean tensor constraint,
a ValueError exception with given message ‘Scale should be larger than zero’ would be raised.
where
quantized_range = MinAbs(max(int8), min(int8)) and
scale = quantized_range / MaxAbs(min_range, max_range).
When out_type is auto, the output type is automatically determined by min_calib_range if present.
If min_calib_range < 0.0f, the output type will be int8; otherwise it will be uint8.
If min_calib_range isn’t present, the output type will be int8.
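The formula that the definitions of quantized_range and scale belong to is not included in this extract. The sketch below assumes a common symmetric int8 scheme, out = sign(x) * min(|x * scale| + 0.5, quantized_range), truncated to an integer. Both that formula and the helper name `quantize_int8` are assumptions made for illustration.

```python
def quantize_int8(data, min_range, max_range):
    """Symmetric int8 quantization sketch using the scale definition above."""
    quantized_range = min(abs(127), abs(-128))    # MinAbs(max(int8), min(int8)) = 127
    scale = quantized_range / max(abs(min_range), abs(max_range))
    out = []
    for x in data:
        # Adding 0.5 and truncating rounds the magnitude to the nearest integer.
        magnitude = min(abs(x * scale) + 0.5, quantized_range)
        sign = -1 if x < 0 else 1
        out.append(sign * int(magnitude))
    return out, scale


q, scale = quantize_int8([-1.0, 0.0, 0.5, 1.0], min_range=-1.0, max_range=1.0)
print(q)      # [-127, 0, 64, 127]
print(scale)  # 127.0
```

Because the range is symmetric around zero, -128 is never produced, and the value 0.0 always maps to the integer 0.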
Note
This operator only supports forward propagation. DO NOT use it in training.
Defined in /home/smola/mxnet/src/operator/quantization/quantize_v2.cc:L104
out_type ({'auto', 'int8', 'uint8'},optional, default='int8') – Output data type. auto can be specified to automatically determine output type according to min_calib_range.
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32. If present, it will be used to quantize the fp32 data into int8 or uint8.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32. If present, it will be used to quantize the fp32 data into int8 or uint8.
name (string, optional.) – Name of the resulting symbol.
RNN operator for input data of type uint8. The weights of each
gate are converted to int8, while the bias is accumulated in float32.
The hidden state and cell state are in float32. For the input data, two more arguments
of type float32 must be provided, representing the thresholds for quantizing the input from
float32 to uint8. The final outputs contain the recurrent result in float32.
It only supports quantization for the vanilla LSTM network.
Note
This operator only supports forward propagation. DO NOT use it in training.
Defined in /home/smola/mxnet/src/operator/quantization/quantized_rnn.cc:L320
state_size (int (non-negative), required) – size of the state for each layer
num_layers (int (non-negative), required) – number of stacked layers
bidirectional (boolean, optional, default=0) – whether to use bidirectional recurrent layers
mode ({'gru', 'lstm', 'rnn_relu', 'rnn_tanh'}, required) – the type of RNN to compute
p (float, optional, default=0) – drop rate of the dropout on the outputs of each RNN layer, except the last layer.
state_outputs (boolean, optional, default=0) – Whether to have the states as symbol outputs.
projection_size (int or None, optional, default='None') – size of the projection
lstm_state_clip_min (double or None, optional, default=None) – Minimum clip value of LSTM states. This option must be used together with lstm_state_clip_max.
lstm_state_clip_max (double or None, optional, default=None) – Maximum clip value of LSTM states. This option must be used together with lstm_state_clip_min.
lstm_state_clip_nan (boolean, optional, default=0) – Whether to stop NaN from propagating in state by clipping it to min/max. If clipping range is not specified, this option is ignored.
use_sequence_length (boolean, optional, default=0) – If set to true, this layer takes in an extra input parameter sequence_length to specify variable length sequence
name (string, optional.) – Name of the resulting symbol.
If no_bias is set to be true, then the bias term is ignored.
The default data layout is NCHW, namely (batch_size, channel, height,
width). We can choose other layouts such as NWC.
If num_group is larger than 1, denoted by g, then split the input data
evenly into g parts along the channel axis, and also evenly split weight
along the first dimension. Next compute the convolution on the i-th part of
the data with the i-th weight part. The output is obtained by concatenating all
the g results.
1-D convolution does not have height dimension but only width in space.
data: (batch_size, channel, width)
weight: (num_filter, channel, kernel[0])
bias: (num_filter,)
out: (batch_size, num_filter, out_width).
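The value of out_width is not spelled out in this extract. A minimal sketch, assuming the usual convolution output-length formula floor((x + 2*p - d*(k-1) - 1) / s) + 1 with padding p, stride s and dilation d (the formula and the helper names are assumptions, not taken from the text above):

```python
def conv_out_size(x, k, p=0, s=1, d=1):
    """Output length along one spatial axis for input length x and kernel size k."""
    return (x + 2 * p - d * (k - 1) - 1) // s + 1


def conv1d_shapes(batch_size, channel, width, num_filter, kernel, p=0, s=1, d=1):
    """Return the data, weight, bias and output shapes of a 1-D convolution."""
    out_width = conv_out_size(width, kernel, p=p, s=s, d=d)
    return {
        "data": (batch_size, channel, width),
        "weight": (num_filter, channel, kernel),
        "bias": (num_filter,),
        "out": (batch_size, num_filter, out_width),
    }


shapes = conv1d_shapes(batch_size=8, channel=3, width=10, num_filter=16, kernel=3)
print(shapes["out"])                      # (8, 16, 8)
print(conv_out_size(10, 3, p=1, s=2))     # 5
print(conv_out_size(10, 3, d=2))          # 6
```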
3-D convolution adds an additional depth dimension besides height and
width. The shapes are
cudnn_tune: enabling this option leads to higher startup time but may give
faster speed. Options are
off: no tuning
limited_workspace: run test and pick the fastest algorithm that doesn’t
exceed workspace limit.
fastest: pick the fastest algorithm and ignore workspace limit.
None (default): the behavior is determined by environment variable
MXNET_CUDNN_AUTOTUNE_DEFAULT. 0 for off, 1 for limited workspace
(default), 2 for fastest.
workspace: A large number leads to more (GPU) memory usage but may improve
the performance.
Defined in /home/smola/mxnet/src/operator/nn/convolution.cc:L509
stride (Shape(tuple), optional, default=[]) – Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
dilate (Shape(tuple), optional, default=[]) – Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
num_filter (int (non-negative), required) – Convolution filter(channel) number
num_group (int (non-negative), optional, default=1) – Number of group partitions.
workspace (long (non-negative), optional, default=1024) – Maximum temporary workspace allowed (MB) in convolution. This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when limited_workspace strategy is used.
no_bias (boolean, optional, default=0) – Whether to disable bias parameter.
cudnn_tune ({None, 'fastest', 'limited_workspace', 'off'},optional, default='None') – Whether to pick convolution algo by running performance test.
cudnn_off (boolean, optional, default=0) – Turn off cudnn for this layer.
The data tensor consists of sequences of activation vectors (without applying softmax),
with i-th channel in the last dimension corresponding to i-th label
for i between 0 and alphabet_size-1 (i.e always 0-indexed).
Alphabet size should include one additional value reserved for blank label.
When blank_label is "first", the 0-th channel is reserved for the
activation of the blank label; otherwise, if it is “last”, the (alphabet_size-1)-th channel is
reserved for the blank label.
label is an index matrix of integers. When blank_label is "first",
the value 0 is then reserved for blank label, and should not be passed in this matrix. Otherwise,
when blank_label is "last", the value (alphabet_size-1) is reserved for blank label.
If a sequence of labels is shorter than label_sequence_length, use the special
padding value at the end of the sequence to conform it to the correct
length. The padding value is 0 when blank_label is "first", and -1 otherwise.
For example, suppose the vocabulary is [a, b, c], and in one batch we have three sequences
‘ba’, ‘cbb’, and ‘abac’. When blank_label is "first", we can index the labels as
{‘a’: 1, ‘b’: 2, ‘c’: 3}, and we reserve the 0-th channel for blank label in data tensor.
The resulting label tensor should be padded to be:
[[2,1,0,0],[3,2,2,0],[1,2,1,3]]
When blank_label is "last", we can index the labels as
{‘a’: 0, ‘b’: 1, ‘c’: 2}, and we reserve the channel index 3 for blank label in data tensor.
The resulting label tensor should be padded to be:
[[1,0,-1,-1],[2,1,1,-1],[0,1,0,2]]
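The two indexing conventions above can be reproduced with a short sketch. The helper `encode_labels` is hypothetical and only covers building the padded label matrix, not the loss computation.

```python
def encode_labels(sequences, vocabulary, blank_label="first"):
    """Map token sequences to a padded integer label matrix."""
    if blank_label == "first":
        offset, pad = 1, 0      # index 0 is the blank label and is used as padding
    elif blank_label == "last":
        offset, pad = 0, -1     # blank is alphabet_size - 1, padding value is -1
    else:
        raise ValueError("blank_label must be 'first' or 'last'")
    index = {token: i + offset for i, token in enumerate(vocabulary)}
    width = max(len(seq) for seq in sequences)
    return [[index[t] for t in seq] + [pad] * (width - len(seq))
            for seq in sequences]


sequences = ["ba", "cbb", "abac"]
first = encode_labels(sequences, "abc", blank_label="first")
last = encode_labels(sequences, "abc", blank_label="last")
print(first)  # [[2, 1, 0, 0], [3, 2, 2, 0], [1, 2, 1, 3]]
print(last)   # [[1, 0, -1, -1], [2, 1, 1, -1], [0, 1, 0, 2]]
```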
out is a list of CTC loss values, one per example in the batch.
See Connectionist Temporal Classification: Labelling Unsegmented
Sequence Data with Recurrent Neural Networks, A. Graves et al. for more
information on the definition and the algorithm.
Defined in /home/smola/mxnet/src/operator/nn/ctc_loss.cc:L104
label (Symbol) – Ground-truth labels for the loss.
data_lengths (Symbol) – Lengths of data for each of the samples. Only required when use_data_lengths is true.
label_lengths (Symbol) – Lengths of labels for each of the samples. Only required when use_label_lengths is true.
use_data_lengths (boolean, optional, default=0) – Whether the data lengths are decided by data_lengths. If false, the lengths are equal to the max sequence length.
use_label_lengths (boolean, optional, default=0) – Whether the label lengths are decided by label_lengths, or derived from padding_mask. If false, the lengths are derived from the first occurrence of the value of padding_mask. The value of padding_mask is 0 when first CTC label is reserved for blank, and -1 when last label is reserved for blank. See blank_label.
blank_label ({'first', 'last'}, optional, default='first') – Set the label that is reserved for blank label. If “first”, 0-th label is reserved, and label values for tokens in the vocabulary are between 1 and alphabet_size-1, and the padding mask is 0. If “last”, last label value alphabet_size-1 is reserved for blank label instead, and label values for tokens in the vocabulary are between 0 and alphabet_size-2, and the padding mask is -1.
name (string, optional.) – Name of the resulting symbol.
Computes 1D, 2D or 3D transposed convolution (aka fractionally strided convolution) of the input tensor. This operation can be seen as the gradient of Convolution operation with respect to its input. Convolution usually reduces the size of the input. Transposed convolution works the other way, going from a smaller input to a larger output while preserving the connectivity pattern.
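To make the "smaller input to larger output" relationship concrete, the sketch below uses the standard transposed-convolution length formula s*(x-1) + d*(k-1) + 1 - 2*p + adj and checks it against the matching forward convolution. Both formulas and the helper names are assumptions for illustration; this extract does not state the operator's exact shape rule.

```python
def deconv_out_size(x, k, s=1, p=0, d=1, adj=0):
    """Output length of a transposed convolution along one spatial axis."""
    return s * (x - 1) + d * (k - 1) + 1 - 2 * p + adj


def conv_out_size(x, k, s=1, p=0, d=1):
    """Output length of the corresponding forward convolution."""
    return (x + 2 * p - d * (k - 1) - 1) // s + 1


upsampled = deconv_out_size(4, k=3, s=2, p=1, adj=1)
print(upsampled)                                  # 8
print(conv_out_size(upsampled, k=3, s=2, p=1))    # 4
```

With stride 2, several input lengths map to the same convolution output length, which is why an adjustment term such as adj is needed to select one of them.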
Parameters:
data (Symbol) – Input tensor to the deconvolution operation.
weight (Symbol) – Weights representing the kernel.
bias (Symbol) – Bias added to the result after the deconvolution operation.
kernel (Shape(tuple), required) – Deconvolution kernel size: (w,), (h, w) or (d, h, w). This is same as the kernel size used for the corresponding convolution
stride (Shape(tuple), optional, default=[]) – The stride used for the corresponding convolution: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
dilate (Shape(tuple), optional, default=[]) – Dilation factor for each dimension of the input: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – The amount of implicit zero padding added during convolution for each dimension of the input: (w,), (h, w) or (d, h, w). (kernel-1)/2 is usually a good choice. If target_shape is set, pad will be ignored and a padding that will generate the target shape will be used. Defaults to no padding.
adj (Shape(tuple), optional, default=[]) – Adjustment for output shape: (w,), (h, w) or (d, h, w). If target_shape is set, adj will be ignored and computed accordingly.
target_shape (Shape(tuple), optional, default=[]) – Shape of the output tensor: (w,), (h, w) or (d, h, w).
num_filter (int (non-negative), required) – Number of output filters.
num_group (int (non-negative), optional, default=1) – Number of group partitions.
workspace (long (non-negative), optional, default=1024) – Maximum temporary workspace allowed (MB) in deconvolution. This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the deconvolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when limited_workspace strategy is used.
no_bias (boolean, optional, default=1) – Whether to disable bias parameter.
cudnn_tune ({None, 'fastest', 'limited_workspace', 'off'},optional, default='None') – Whether to pick convolution algorithm by running performance test.
cudnn_off (boolean, optional, default=0) – Turn off cudnn for this layer.
layout ({None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'}, optional, default='None') – Set layout for input, output and weight. Empty for default layout, NCW for 1d, NCHW for 2d and NCDHW for 3d. NHWC and NDHWC are only supported on GPU.
name (string, optional.) – Name of the resulting symbol.
If no_bias is set to be true, then the bias term is ignored.
The default data layout is NCHW, namely (batch_size, channel, height,
width).
If num_group is larger than 1, denoted by g, then split the input data
evenly into g parts along the channel axis, and also evenly split weight
along the first dimension. Next compute the convolution on the i-th part of
the data with the i-th weight part. The output is obtained by concatenating all
the g results.
If num_deformable_group is larger than 1, denoted by dg, then split the
input offset evenly into dg parts along the channel axis, and also evenly
split data into dg parts along the channel axis. Next compute the
deformable convolution, applying the i-th part of the offset to the i-th part
of the data.
Both weight and bias are learnable parameters.
Defined in /home/smola/mxnet/src/operator/deformable_convolution.cc:L80
Parameters:
data (Symbol) – Input data to the DeformableConvolutionOp.
offset (Symbol) – Input offset to the DeformableConvolutionOp.
During training, each element of the input is set to zero with probability p.
The whole array is rescaled by \(1/(1-p)\) to keep the expected
sum of the input unchanged.
During testing, this operator does not change the input if mode is ‘training’.
If mode is ‘always’, the same computation as during training will be applied.
Example::

    random.seed(998)
    input_array = array([[3., 0.5, -0.5, 2., 7.], [2., -0.4, 7., 3., 0.2]])
    a = symbol.Variable('a')
    dropout = symbol.Dropout(a, p = 0.2)
    executor = dropout.simple_bind(a = input_array.shape)

    ## If training
    executor.forward(is_train = True, a = input_array)
    executor.outputs
    [[ 3.75   0.625 -0.     2.5    8.75 ]
     [ 2.5   -0.5    8.75   3.75   0.   ]]

    ## If testing
    executor.forward(is_train = False, a = input_array)
    executor.outputs
    [[ 3.     0.5   -0.5    2.     7.   ]
     [ 2.    -0.4    7.     3.     0.2  ]]

Defined in /home/smola/mxnet/src/operator/nn/dropout.cc:L95
Parameters:
data (Symbol) – Input array to which dropout will be applied.
p (float, optional, default=0.5) – Fraction of the input that gets dropped out during training time.
mode ({'always', 'training'},optional, default='training') – Whether to only turn on dropout during training or to also turn on for inference.
axes (Shape(tuple), optional, default=[]) – Axes for variational dropout kernel.
cudnn_off (boolean or None, optional, default=0) – Whether to turn off cudnn in dropout operator. This option is ignored if axes is specified.
name (string, optional.) – Name of the resulting symbol.
Maps integer indices to vector representations (embeddings).
This operator maps words to real-valued vectors in a high-dimensional space,
called word embeddings. These embeddings can capture semantic and syntactic properties of the words.
For example, it has been noted that in the learned embedding spaces, similar words tend
to be close to each other and dissimilar words far apart.
For an input array of shape (d1, …, dK),
the shape of an output array is (d1, …, dK, output_dim).
All the input values should be integers in the range [0, input_dim).
If the input_dim is ip0 and output_dim is op0, then shape of the embedding weight matrix must be
(ip0, op0).
When “sparse_grad” is False, if any index mentioned is too large, it is replaced by the index that
addresses the last vector in an embedding matrix.
When “sparse_grad” is True, an error will be raised if invalid indices are found.
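The lookup and the two out-of-range behaviours described above can be sketched as follows. `embedding_lookup` is a hypothetical helper for a 1-D index list. Rejecting negative indices in both modes is a choice made for this sketch, since the text only says that valid indices lie in [0, input_dim).

```python
def embedding_lookup(indices, weight, sparse_grad=False):
    """Return one row of `weight` (shape (input_dim, output_dim)) per index."""
    input_dim = len(weight)
    out = []
    for i in indices:
        if i < 0:
            raise IndexError("negative indices are outside [0, input_dim)")
        if i >= input_dim:
            if sparse_grad:
                raise IndexError(f"index {i} is out of range for input_dim={input_dim}")
            i = input_dim - 1   # a too-large index falls back to the last vector
        out.append(list(weight[i]))
    return out


weight = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]   # input_dim=4, output_dim=2
vectors = embedding_lookup([1, 3, 7], weight)
print(vectors)  # [[1.0, 1.1], [3.0, 3.1], [3.0, 3.1]]
```

The input has shape (3,) and the output has shape (3, 2), following the (d1, …, dK, output_dim) rule.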
The storage type of weight can be either row_sparse or default.
Note
If “sparse_grad” is set to True, the storage type of gradient w.r.t weights will be
“row_sparse”. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad
and Adam. Note that by default lazy updates are turned on, which may perform differently
from standard updates. For more details, please check the Optimization API at:
https://mxnet.apache.org/versions/master/api/python/docs/api/optimizer/index.html
Defined in /home/smola/mxnet/src/operator/tensor/indexing_op.cc:L758
Parameters:
data (Symbol) – The input array to the embedding operator.
input_dim (long, required) – Vocabulary size of the input indices.
output_dim (long, required) – Dimension of the embedding vectors.
dtype ({'bfloat16', 'float16', 'float32', 'float64', 'int32', 'int64', 'int8', 'uint8'},optional, default='float32') – Data type of weight.
sparse_grad (boolean, optional, default=0) – Compute row sparse gradient in the backward calculation. If set to True, the grad’s storage type is row_sparse.
name (string, optional.) – Name of the resulting symbol.
Applies a linear transformation: \(Y = XW^T + b\).
If flatten is set to be true, then the shapes are:
data: (batch_size, x1, x2, …, xn)
weight: (num_hidden, x1 * x2 * … * xn)
bias: (num_hidden,)
out: (batch_size, num_hidden)
If flatten is set to be false, then the shapes are:
data: (x1, x2, …, xn, input_dim)
weight: (num_hidden, input_dim)
bias: (num_hidden,)
out: (x1, x2, …, xn, num_hidden)
The learnable parameters include both weight and bias.
If no_bias is set to be true, then the bias term is ignored.
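A minimal pure-Python sketch of Y = XW^T + b for already flattened 2-D data, with the bias omitted when none is given. The function name `fully_connected` and the nested-list representation are illustrative assumptions.

```python
def fully_connected(data, weight, bias=None):
    """data: (batch_size, input_dim), weight: (num_hidden, input_dim), bias: (num_hidden,)."""
    out = []
    for x in data:
        # Each output unit is the dot product of the input with one weight row.
        row = [sum(x_i * w_i for x_i, w_i in zip(x, w)) for w in weight]
        if bias is not None:
            row = [r + b for r, b in zip(row, bias)]
        out.append(row)
    return out


data = [[1, 2], [3, 4]]               # (batch_size=2, input_dim=2)
weight = [[1, 0], [0, 1], [1, 1]]     # (num_hidden=3, input_dim=2)
bias = [10, 20, 30]                   # (num_hidden,)
print(fully_connected(data, weight, bias))  # [[11, 22, 33], [13, 24, 37]]
print(fully_connected(data, weight))        # [[1, 2, 3], [3, 4, 7]]
```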
Note
The sparse support for FullyConnected is limited to forward evaluation with row_sparse
weight and bias, where the length of weight.indices and bias.indices must be equal
to num_hidden. This could be useful for model inference with row_sparse weights
trained with importance sampling or noise contrastive estimation.
To compute linear transformation with ‘csr’ sparse data, sparse.dot is recommended instead
of sparse.FullyConnected.
Defined in /home/smola/mxnet/src/operator/nn/fully_connected.cc:L288
Gather elements or slices from data and store to a tensor whose
shape is defined by indices.
Given data with shape (X_0, X_1, …, X_{N-1}) and indices with shape
(M, Y_0, …, Y_{K-1}), the output will have shape (Y_0, …, Y_{K-1}, X_M, …, X_{N-1}),
where M <= N. If M == N, output shape will simply be (Y_0, …, Y_{K-1}).
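The shape rule can be illustrated with a sketch restricted to indices of shape (M, Y_0): each column of indices holds the M leading coordinates of one pick. The function below is a hypothetical pure-Python stand-in, not the operator itself.

```python
def gather_nd(data, indices):
    """Gather from nested-list `data` using `indices` of shape (M, Y_0)."""
    out = []
    for coords in zip(*indices):   # one tuple of M coordinates per output element
        item = data
        for c in coords:
            item = item[c]
        out.append(item)
    return out


data = [[0, 1], [2, 3]]                              # shape (2, 2), so N = 2
print(gather_nd(data, [[1, 1, 0], [0, 1, 0]]))       # M == N: [2, 3, 0]
print(gather_nd(data, [[1, 0]]))                     # M < N:  [[2, 3], [0, 1]]
```

In the first call M == N, so the output shape is (Y_0,) = (3,). In the second call M = 1, so whole rows are gathered and the output shape is (Y_0, X_1) = (2, 2).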
The input channels are separated into num_groups groups, each containing num_channels/num_groups channels.
The mean and standard deviation are calculated separately over each group.
Indexes for indicating update positions.
For example, array([[0, 1], [2, 3], [4, 5]]) indicates that there are two positions to
be updated, which are (0, 2, 4) and (1, 3, 5).
Note: ‘ind’ cannot be an empty array ‘[]’; in that case, use the operator ‘add’ instead.
0 <= ind.ndim <= 2.
ind.dtype should be ‘int32’ or ‘int64’
val (ndarray) – Input data. The array to update the input ‘a’.
Update values to input according to given indexes.
If multiple indices refer to the same location it is undefined which update is chosen; it may choose
the order of updates arbitrarily and nondeterministically (e.g., due to concurrent updates on some
hardware platforms). Using repeated positions is not recommended.
Parameters:
a (ndarray) – Input data. The array to be updated.
Support dtype: ‘float32’, ‘float64’, ‘int32’, ‘int64’.
Indexes for indicating update positions.
For example, array([[0, 1], [2, 3], [4, 5]]) indicates that there are two positions to
be updated, which are (0, 2, 4) and (1, 3, 5).
Note: ‘ind’ cannot be an empty array ‘[]’; in that case, use the operator ‘add’ instead.
0 <= ind.ndim <= 2.
ind.dtype should be ‘int32’ or ‘int64’
val (ndarray) – Input data. The array to update the input ‘a’.
Support dtype: ‘float32’, ‘float64’, ‘int32’, ‘int64’.
This layer is similar to batch normalization layer (BatchNorm)
with two differences: first, the normalization is
carried out per example (instance), not over a batch. Second, the
same normalization is applied both at test and train time. This
operation is also known as contrast normalization.
If the input data is of shape [batch, channel, spatial_dim1, spatial_dim2, …],
gamma and beta parameters must be vectors of shape [channel].
Compute the matrix multiplication between the projections of
queries and keys in multihead attention used as self-attention.
The input must be a single tensor of interleaved projections
of queries, keys and values following the layout:
(seq_length, batch_size, num_heads * head_dim * 3)
Compute the matrix multiplication between the projections of
values and the attention weights in multihead attention used as self-attention.
The inputs must be a tensor of interleaved projections
of queries, keys and values following the layout:
(seq_length, batch_size, num_heads * head_dim * 3)
and the attention weights following the layout:
(batch_size, seq_length, seq_length)
Multiply matrices using 8-bit integers. data * weight.
Input tensor arguments are: data weight [scaling] [bias]
data: either float32 or prepared using intgemm_prepare_data (in which case it is int8).
weight: must be prepared using intgemm_prepare_weight.
scaling: present if and only if out_type is float32. If so this is multiplied by the result before adding bias. Typically:
scaling = (max passed to intgemm_prepare_weight)/127.0 if data is in float32
scaling = (max passed to intgemm_prepare_data)/127.0 * (max passed to intgemm_prepare_weight)/127.0 if data is in int8
bias: present if and only if !no_bias. This is added to the output after scaling and has the same number of columns as the output.
out_type: type of the output.
Defined in /home/smola/mxnet/src/operator/contrib/intgemm/intgemm_fully_connected_op.cc:L284
Parameters:
data (Symbol) – First argument to multiplication. Tensor of float32 (quantized on the fly) or int8 from intgemm_prepare_data. If you use a different quantizer, be sure to ban -128. The last dimension must be a multiple of 64.
weight (Symbol) – Second argument to multiplication. Tensor of int8 from intgemm_prepare_weight. The last dimension must be a multiple of 64. The product of non-last dimensions must be a multiple of 8.
scaling (Symbol) – Scaling factor to apply if output type is float32.
Compute the maximum absolute value in a tensor of float32 fast on a CPU. The tensor’s total size must be a multiple of 16 and aligned to a multiple of 64 bytes.
mxnet.nd.contrib.intgemm_maxabsolute(arr) == arr.abs().max()
Defined in /home/smola/mxnet/src/operator/contrib/intgemm/max_absolute_op.cc:L102
Parameters:
data (Symbol) – Tensor to compute maximum absolute value of
name (string, optional.) – Name of the resulting symbol.
This operator converts a weight matrix in column-major format to intgemm’s internal fast representation of weight matrices. MXNet customarily stores weight matrices in column-major (transposed) format. This operator is not meant to be fast; it is meant to be run offline to quantize a model.
In other words, it prepares weight for the operation C = data * weight^T.
If the provided weight matrix is float32, it will be quantized first. The quantization function is (int8_t)(127.0 / max * weight) where max is provided as argument 1 (the weight matrix is argument 0). Then the matrix will be rearranged into the CPU-dependent format.
If the provided weight matrix is already int8, the matrix will only be rearranged into the CPU-dependent format. This way one can quantize with intgemm_prepare_data (which just quantizes), store to disk in a consistent format, then at load time convert to CPU-dependent format with intgemm_prepare_weight.
The internal representation depends on register length. So AVX512, AVX2, and SSSE3 have different formats. AVX512BW and AVX512VNNI have the same representation.
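The quantization step described above can be sketched in plain Python. This only illustrates the arithmetic of `(int8_t)(127.0 / max * weight)`; it does not perform the CPU-dependent rearrangement. The function names are hypothetical, truncation toward zero is assumed to mirror a C cast (the real kernel may round differently), and the clamp to [-127, 127] reflects the advice to ban -128.

```python
def max_absolute(values):
    """Largest absolute value in a flat list of floats."""
    return max(abs(v) for v in values)


def quantize_weights(weights, maxabs):
    """Quantize a flat list of float weights to int8-range integers."""
    multiplier = 127.0 / maxabs
    quantized = []
    for w in weights:
        q = int(multiplier * w)      # int() truncates toward zero (assumption)
        q = max(-127, min(127, q))   # keep -128 out of the result
        quantized.append(q)
    return quantized
```

A typical use is `quantize_weights(w, max_absolute(w))`, which maps the largest-magnitude weight to 127 or -127.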
Defined in /home/smola/mxnet/src/operator/contrib/intgemm/prepare_weight_op.cc:L152
Parameters:
weight (Symbol) – Parameter matrix to be prepared for multiplication.
maxabs (Symbol) – Maximum absolute value for scaling. The weights will be multiplied by 127.0 / maxabs.
already_quantized (boolean, optional, default=0) – Is the weight matrix already quantized?
name (string, optional.) – Name of the resulting symbol.
Normalizes the channels of the input tensor by mean and variance, and applies a scale gamma as
well as offset beta.
Assume the input has more than one dimension and we normalize along axis 1.
We first compute the mean and variance along this axis and then
compute the normalized output, which has the same shape as the input, as follows:
\(out = \frac{data - mean(data, axis)}{\sqrt{var(data, axis) + \epsilon}} * gamma + beta\)
Unlike BatchNorm and InstanceNorm, the mean and var are computed along the channel dimension.
Assume the input has size k on axis 1, then both gamma and beta
have shape (k,). If output_mean_var is set to true, then the operator outputs both data_mean and
data_std. Note that no gradient will be passed through these two outputs.
The parameter axis specifies which axis of the input shape denotes
the ‘channel’ (separately normalized groups). The default is -1, which sets the channel
axis to be the last item in the input shape.
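A minimal sketch of the normalization over the last axis of a 2-D input, written in plain Python. It assumes the usual layer-norm definition (subtract the mean, divide by the square root of the variance plus eps, scale by gamma, shift by beta) with the population variance; the function name `layer_norm` is illustrative only.

```python
import math


def layer_norm(rows, gamma, beta, eps=1e-5):
    """Normalize each row of a 2-D list along its last axis."""
    out = []
    for row in rows:
        k = len(row)
        mean = sum(row) / k
        # Population variance (divide by k), an assumption of this sketch.
        var = sum((v - mean) ** 2 for v in row) / k
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std * g + b
                    for v, g, b in zip(row, gamma, beta)])
    return out
```

With gamma set to ones and beta set to zeros, each output row has approximately zero mean and unit variance.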
Defined in /home/smola/mxnet/src/operator/nn/layer_norm.cc:L401
axis (int, optional, default='-1') – The axis to perform layer normalization. Usually, this should be the axis of the channel dimension. Negative values mean indexing from right to left.
eps (float, optional, default=9.99999975e-06) – An epsilon parameter to prevent division by 0.
output_mean_var (boolean, optional, default=0) – Output the mean and std calculated along the given axis.
name (string, optional.) – Name of the resulting symbol.
Applies Leaky rectified linear unit activation element-wise to the input.
Leaky ReLUs attempt to fix the “dying ReLU” problem by allowing a small slope
when the input is negative, while keeping a slope of one when the input is positive.
The following modified ReLU Activation functions are supported:
elu: Exponential Linear Unit. y = x > 0 ? x : slope * (exp(x)-1)
gelu: Gaussian Error Linear Unit. y = 0.5 * x * (1 + erf(x / sqrt(2)))
gelu_erf: Same as gelu.
gelu_tanh: Gaussian Error Linear Unit using tanh function.
y = 0.5 * x * (1 + tanh((sqrt(2/pi) * (x + 0.044715*x^3))))
selu: Scaled Exponential Linear Unit. y = lambda * (x > 0 ? x : alpha * (exp(x) - 1)) where
lambda = 1.0507009873554804934193349852946 and alpha = 1.6732632423543772848170429916717.
leaky: Leaky ReLU. y = x > 0 ? x : slope * x
prelu: Parametric ReLU. This is the same as leaky except that slope is learnt during training.
rrelu: Randomized ReLU. Same as leaky, but the slope is chosen uniformly at random from
[lower_bound, upper_bound) for training, while fixed to be
(lower_bound+upper_bound)/2 for inference.
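The scalar formulas listed above translate directly into plain Python. The sketch below is illustrative: the function name is hypothetical, `prelu` is omitted because its slope is a learned parameter, and `rrelu` uses the fixed inference-time slope only.

```python
import math

SELU_LAMBDA = 1.0507009873554804934193349852946
SELU_ALPHA = 1.6732632423543772848170429916717


def leaky_relu(x, act_type="leaky", slope=0.25,
               lower_bound=0.125, upper_bound=0.334):
    """Apply one of the listed activation variants to a scalar x."""
    if act_type == "leaky":
        return x if x > 0 else slope * x
    if act_type == "elu":
        return x if x > 0 else slope * (math.exp(x) - 1.0)
    if act_type == "selu":
        return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))
    if act_type in ("gelu", "gelu_erf"):
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
    if act_type == "gelu_tanh":
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    if act_type == "rrelu":
        # Inference behaviour: the slope is the midpoint of the interval.
        fixed_slope = (lower_bound + upper_bound) / 2.0
        return x if x > 0 else fixed_slope * x
    raise ValueError("unsupported act_type: %r" % (act_type,))
```

The erf and tanh forms of GELU agree closely; at `x = 1.0` they differ by less than `1e-3`.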
Defined in /home/smola/mxnet/src/operator/leaky_relu.cc:L196
Parameters:
data (Symbol) – Input data to activation function.
gamma (Symbol) – Input data to activation function.
act_type ({'elu', 'gelu_erf', 'gelu_tanh', 'leaky', 'prelu', 'rrelu', 'selu'},optional, default='leaky') – Activation function to be applied.
slope (float, optional, default=0.25) – Init slope for the activation. (For leaky and elu only)
lower_bound (float, optional, default=0.125) – Lower bound of random slope. (For rrelu only)
upper_bound (float, optional, default=0.333999991) – Upper bound of random slope. (For rrelu only)
name (string, optional.) – Name of the resulting symbol.
axis (int, optional, default='-1') – The axis along which to compute softmax.
temperature (double or None, optional, default=None) – Temperature parameter in softmax
dtype ({None, 'float16', 'float32', 'float64'},optional, default='None') – DType of the output in case this can’t be inferred. Defaults to the same as input’s dtype if not defined (dtype=None).
use_length (boolean or None, optional, default=0) – Whether to use the length input as a mask over the data input.
name (string, optional.) – Name of the resulting symbol.
If no_bias is set to be true, then the bias term is ignored.
The default data layout is NCHW, namely (batch_size, channel, height,
width).
If num_group is larger than 1, denoted by g, then split the input data
evenly into g parts along the channel axis, and also evenly split weight
along the first dimension. Next compute the convolution on the i-th part of
the data with the i-th weight part. The output is obtained by concatenating all
the g results.
If num_deformable_group is larger than 1, denoted by dg, then split the
input offset evenly into dg parts along the channel axis, and also split
out evenly into dg parts along the channel axis. Next compute the
deformable convolution, applying the i-th part of the offset to the i-th
part of out.
Both weight and bias are learnable parameters.
Defined in /home/smola/mxnet/src/operator/modulated_deformable_convolution.cc:L83
Parameters:
data (Symbol) – Input data to the ModulatedDeformableConvolutionOp.
offset (Symbol) – Input offset to ModulatedDeformableConvolutionOp.
mask (Symbol) – Input mask to the ModulatedDeformableConvolutionOp.
stride (Shape(tuple), optional, default=[]) – Convolution stride: (h, w) or (d, h, w). Defaults to 1 for each dimension.
dilate (Shape(tuple), optional, default=[]) – Convolution dilate: (h, w) or (d, h, w). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – Zero pad for convolution: (h, w) or (d, h, w). Defaults to no padding.
num_filter (int (non-negative), required) – Convolution filter(channel) number
num_group (int (non-negative), optional, default=1) – Number of group partitions.
num_deformable_group (int (non-negative), optional, default=1) – Number of deformable group partitions.
workspace (long (non-negative), optional, default=1024) – Maximum temporary workspace allowed for convolution (MB).
no_bias (boolean, optional, default=0) – Whether to disable bias parameter.
im2col_step (int (non-negative), optional, default=64) – Maximum number of images per im2col computation. The total batch size should be divisible by this value or smaller than this value. If you run out of memory, try a smaller value here.
Return the indices of the elements that are non-zero.
Returns an ndarray with ndim equal to 2. Each row contains the indices
of the non-zero elements. The values in a are always tested and returned in
row-major, C-style order.
The result of this is always a 2-D array, with a row for
each non-zero element.
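The output layout described above (one row of indices per non-zero element, visited in row-major order) can be sketched for a 2-D nested list as follows. The function name is illustrative and the sketch is limited to two dimensions.

```python
def nonzero_indices(array_2d):
    """Return [row, col] index pairs of the non-zero entries, row-major."""
    result = []
    for i, row in enumerate(array_2d):
        for j, value in enumerate(row):
            if value != 0:
                result.append([i, j])
    return result
```

For the input `[[1, 0], [0, 2]]` this returns `[[0, 0], [1, 1]]`.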
This operator computes the norm on an ndarray with the specified axis, depending
on the value of the ord parameter. By default, it computes the L2 norm on the entire
array. Currently only ord=2 supports sparse ndarrays.
ord (int, optional, default='2') – Order of the norm. Currently ord=1 and ord=2 are supported.
axis (Shape or None, optional, default=None) –
The axis or axes along which to perform the reduction.
The default, axis=(), will compute over all elements into a
scalar array with shape (1,).
If axis is int, a reduction is performed on a particular axis.
If axis is a 2-tuple, it specifies the axes that hold 2-D matrices,
and the matrix norms of these matrices are computed.
out_dtype ({None, 'float16', 'float32', 'float64', 'int32', 'int64', 'int8'},optional, default='None') – The data type of the output.
keepdims (boolean, optional, default=0) – If this is set to True, the reduced axis is left in the result as dimension with size one.
name (string, optional.) – Name of the resulting symbol.
Pads an input array with a constant or edge values of the array.
Note
Pad is deprecated. Use pad instead.
Note
Current implementation only supports 4D and 5D input arrays with padding applied
only on axes 1, 2 and 3. Expects axes 4 and 5 in pad_width to be zero.
This operation pads an input array with either a constant_value or edge values
along each axis of the input array. The amount of padding is specified by pad_width.
pad_width is a tuple of integer padding widths for each axis of the format
(before_1,after_1,...,before_N,after_N). The pad_width should be of length 2*N
where N is the number of dimensions of the array.
For dimension N of the input array, before_N and after_N indicate how many values
to add before and after the elements of the array along dimension N.
The widths of the higher two dimensions before_1, after_1, before_2,
after_2 must be 0.
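The flattened pad_width layout can be unpacked into per-axis (before, after) pairs, which is the form numpy.pad uses. The helper below is a sketch with a hypothetical name; it also checks the stated restriction that the first two axes carry no padding.

```python
def split_pad_width(pad_width, ndim):
    """Turn (before_1, after_1, ..., before_N, after_N) into N pairs."""
    if len(pad_width) != 2 * ndim:
        raise ValueError("pad_width must have length 2 * ndim")
    pairs = list(zip(pad_width[0::2], pad_width[1::2]))
    # before_1, after_1, before_2 and after_2 must all be 0.
    if any(pair != (0, 0) for pair in pairs[:2]):
        raise ValueError("padding on the first two axes must be 0")
    return pairs
```

For a 4-D input, `split_pad_width((0, 0, 0, 0, 1, 1, 2, 3), 4)` yields `[(0, 0), (0, 0), (1, 1), (2, 3)]`.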
mode ({'constant', 'edge', 'reflect'}, required) – Padding type to use. “constant” pads with constant_value; “edge” pads using the edge values of the input array; “reflect” pads by reflecting values with respect to the edges.
pad_width (Shape(tuple), required) – Widths of the padding regions applied to the edges of each axis. It is a tuple of integer padding widths for each axis of the format (before_1,after_1,...,before_N,after_N). It should be of length 2*N where N is the number of dimensions of the array. This is equivalent to pad_width in numpy.pad, but flattened.
constant_value (double, optional, default=0) – The value used for padding when mode is “constant”.
name (string, optional.) – Name of the resulting symbol.
axis (int or None, optional, default='-1') – int or None. The axis along which to pick the elements. Negative values mean indexing from right to left. If None, the elements are picked by index from the flattened input.
keepdims (boolean, optional, default=0) – If true, the axis where we pick the elements is left in the result as dimension with size one.
mode ({'clip', 'wrap'},optional, default='clip') – Specify how out-of-bound indices behave. Default is “clip”. “clip” means clip to the range. So, if all indices mentioned are too large, they are replaced by the index that addresses the last element along an axis. “wrap” means to wrap around.
name (string, optional.) – Name of the resulting symbol.
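The two out-of-bound modes can be illustrated with a small plain-Python sketch that picks one element per row along the last axis of a 2-D list. Function names are hypothetical, and clipping negative indices to 0 is an assumption of this sketch.

```python
def resolve_index(index, size, mode="clip"):
    """Map a possibly out-of-bound index into the range [0, size)."""
    if mode == "clip":
        return max(0, min(size - 1, index))
    if mode == "wrap":
        return index % size
    raise ValueError("mode must be 'clip' or 'wrap'")


def pick_last_axis(rows, indices, mode="clip"):
    """Pick rows[i][indices[i]] for every row, handling bad indices."""
    return [row[resolve_index(i, len(row), mode)]
            for row, i in zip(rows, indices)]
```

With `rows = [[1, 2], [3, 4], [5, 6]]`, the index 5 is clipped to the last column in "clip" mode and wraps to column 1 in "wrap" mode.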
The definition of f depends on pooling_convention, which has two options:
valid (default):
f(x,k,p,s)=floor((x+2*p-k)/s)+1
full, which is compatible with Caffe:
f(x,k,p,s)=ceil((x+2*p-k)/s)+1
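The two conventions differ only in the rounding direction. A short sketch, assuming x is the input size, k the kernel size, p the padding and s the stride along one spatial axis (the function name is illustrative):

```python
import math


def pooled_size(x, k, p, s, pooling_convention="valid"):
    """Output size along one axis for the 'valid' and 'full' conventions."""
    span = (x + 2 * p - k) / s
    if pooling_convention == "valid":
        return math.floor(span) + 1
    if pooling_convention == "full":
        return math.ceil(span) + 1
    raise ValueError("unsupported pooling_convention")
```

The results differ only when the stride does not divide `x + 2*p - k` evenly: for `x=8, k=3, p=0, s=2`, "valid" gives 3 and "full" gives 4.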
When global_pool is set to be true, then global pooling is performed. It will reset
kernel=(height,width) and set the appropriate padding to 0.
Four pooling options are supported by pool_type:
avg: average pooling
max: max pooling
sum: sum pooling
lp: Lp pooling
For 3-D pooling, an additional depth dimension is added before
height. Namely the input data and output will have shape (batch_size, channel, depth,
height, width) (NCDHW layout) or (batch_size, depth, height, width, channel) (NDHWC layout).
Notes on Lp pooling:
Lp pooling was first introduced by this paper: https://arxiv.org/pdf/1204.3968.pdf.
L-1 pooling is simply sum pooling, while L-inf pooling is simply max pooling.
Lp pooling stands between those two. In practice, the most common value for p is 2.
For each window X, the mathematical expression for Lp pooling is:
\(f(X) = \sqrt[p]{\sum_{x}^{X} x^p}\)
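The expression above, evaluated for a single pooling window, can be sketched as follows. The function name is illustrative and the window is passed as a flat list of values.

```python
def lp_pool_window(window, p):
    """Compute the p-th root of the sum of x**p over one pooling window."""
    return sum(x ** p for x in window) ** (1.0 / p)
```

With `p=1` this reduces to sum pooling, and with `p=2` the window `[3.0, 4.0]` gives `5.0`.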
Defined in /home/smola/mxnet/src/operator/nn/pooling.cc:L410
Parameters:
data (Symbol) – Input data to the pooling operator.
kernel (Shape(tuple), optional, default=[]) – Pooling kernel size: (y, x) or (d, y, x)
pool_type ({'avg', 'lp', 'max', 'sum'},optional, default='max') – Pooling type to be applied.
global_pool (boolean, optional, default=0) – Ignore kernel size, do global pooling based on current input feature map.
cudnn_off (boolean, optional, default=0) – Turn off cudnn pooling and use MXNet pooling operator.
pooling_convention ({'full', 'same', 'valid'},optional, default='valid') – Pooling convention to be applied.
stride (Shape(tuple), optional, default=[]) – Stride: for pooling (y, x) or (d, y, x). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – Pad for pooling: (y, x) or (d, y, x). Defaults to no padding.
p_value (int or None, optional, default='None') – Value of p for Lp pooling, can be 1 or 2, required for Lp Pooling.
count_include_pad (boolean or None, optional, default=None) – Only used for AvgPool. Specifies whether to count padding elements in the average calculation. For example, with a 5*5 kernel on a 3*3 corner of an image, the sum of the 9 valid elements will be divided by 25 if this is set to true, or by 9 if this is set to false. Defaults to true.
Activation operator for input and output data type of int8.
The input and output data comes with min and max thresholds for quantizing
the float32 data into int8.
Note
This operator only supports forward propagation. DO NOT use it in training.
This operator only supports relu
Defined in /home/smola/mxnet/src/operator/quantization/quantized_activation.cc:L96
Convolution operator for input, weight and bias data type of int8,
and accumulates in type int32 for the output. For each argument, two more arguments of type
float32 must be provided representing the thresholds of quantizing argument from data
type float32 to int8. The final outputs contain the convolution result in int32, and min
and max thresholds representing the thresholds for quantizing the float32 output into int32.
Note
This operator only supports forward propagation. DO NOT use it in training.
Defined in /home/smola/mxnet/src/operator/quantization/quantized_conv.cc:L189
stride (Shape(tuple), optional, default=[]) – Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
dilate (Shape(tuple), optional, default=[]) – Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
num_filter (int (non-negative), required) – Convolution filter(channel) number
num_group (int (non-negative), optional, default=1) – Number of group partitions.
workspace (long (non-negative), optional, default=1024) – Maximum temporary workspace allowed (MB) in convolution. This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when limited_workspace strategy is used.
no_bias (boolean, optional, default=0) – Whether to disable bias parameter.
cudnn_tune ({None, 'fastest', 'limited_workspace', 'off'},optional, default='None') – Whether to pick convolution algo by running performance test.
cudnn_off (boolean, optional, default=0) – Turn off cudnn for this layer.
elemwise_add operator for input dataA and input dataB data type of int8,
and accumulates in type int32 for the output. For each argument, two more arguments of type
float32 must be provided representing the thresholds of quantizing argument from data
type float32 to int8. The final outputs contain result in int32, and min
and max thresholds representing the thresholds for quantizing the float32 output into int32.
Note
This operator only supports forward propagation. DO NOT use it in training.
Parameters:
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
enable_float_output (boolean, optional, default=0) – Whether to enable float32 output
name (string, optional.) – Name of the resulting symbol.
input_dim (long, required) – Vocabulary size of the input indices.
output_dim (long, required) – Dimension of the embedding vectors.
dtype ({'bfloat16', 'float16', 'float32', 'float64', 'int32', 'int64', 'int8', 'uint8'},optional, default='float32') – Data type of weight.
sparse_grad (boolean, optional, default=0) – Compute row sparse gradient in the backward calculation. If set to True, the grad’s storage type is row_sparse.
name (string, optional.) – Name of the resulting symbol.
Fully Connected operator for input, weight and bias data type of int8,
and accumulates in type int32 for the output. For each argument, two more arguments of type
float32 must be provided representing the thresholds of quantizing argument from data
type float32 to int8. The final outputs contain the fully connected result in int32, and min
and max thresholds representing the thresholds for quantizing the float32 output into int32.
Note
This operator only supports forward propagation. DO NOT use it in training.
Defined in /home/smola/mxnet/src/operator/quantization/quantized_fully_connected.cc:L328
elemwise_add operator for input dataA and input dataB data type of int8,
and accumulates in type int32 for the output. For each argument, two more arguments of type
float32 must be provided representing the thresholds of quantizing argument from data
type float32 to int8. The final outputs contain result in int32, and min
and max thresholds representing the thresholds for quantizing the float32 output into int32.
Note
This operator only supports forward propagation. DO NOT use it in training.
Parameters:
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int8 output data.
Pooling operator for input and output data type of int8.
The input and output data comes with min and max thresholds for quantizing
the float32 data into int8.
Note
This operator only supports pool_type of avg or max.
Backward propagation computes the data gradient and returns zero min/max gradients.
Defined in /home/smola/mxnet/src/operator/quantization/quantized_pooling.cc:L443
kernel (Shape(tuple), optional, default=[]) – Pooling kernel size: (y, x) or (d, y, x)
pool_type ({'avg', 'lp', 'max', 'sum'},optional, default='max') – Pooling type to be applied.
global_pool (boolean, optional, default=0) – Ignore kernel size, do global pooling based on current input feature map.
cudnn_off (boolean, optional, default=0) – Turn off cudnn pooling and use MXNet pooling operator.
pooling_convention ({'full', 'same', 'valid'},optional, default='valid') – Pooling convention to be applied.
stride (Shape(tuple), optional, default=[]) – Stride: for pooling (y, x) or (d, y, x). Defaults to 1 for each dimension.
pad (Shape(tuple), optional, default=[]) – Pad for pooling: (y, x) or (d, y, x). Defaults to no padding.
p_value (int or None, optional, default='None') – Value of p for Lp pooling, can be 1 or 2, required for Lp Pooling.
count_include_pad (boolean or None, optional, default=None) – Only used for AvgPool. Specifies whether to count padding elements in the average calculation. For example, with a 5*5 kernel on a 3*3 corner of an image, the sum of the 9 valid elements will be divided by 25 if this is set to true, or by 9 if this is set to false. Defaults to true.
min_data (Symbol) – The minimum scalar value possibly produced for the data
max_data (Symbol) – The maximum scalar value possibly produced for the data
newshape (Shape(tuple), required) – The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions. -2 to -6 are used for data manipulation. -2 copies this dimension from the input to the output shape. -3 skips the current dimension if and only if its size is one. -4 copies all remaining input dimensions to the output shape. -5 uses the product of two consecutive dimensions of the input shape as the output. -6 splits one dimension of the input into the two dimensions that follow -6 in the new shape.
reverse (boolean, optional, default=0) – If true then the special values are inferred from right to left
order (string, optional, default='C') – Read the elements of a using this index order, and place the elements into the reshaped array using this index order. ‘C’ means to read/write the elements using C-like index order, with the last axis index changing fastest, back to the first axis index changing slowest. Note that currently only C-like order is supported
name (string, optional.) – Name of the resulting symbol.
Given data that is quantized in int32 and the corresponding thresholds,
requantize the data into int8 using min and max thresholds either calculated at runtime
or from calibration. It’s highly recommended to pre-calculate the min and max thresholds
through calibration since it is able to save the runtime of the operator and improve the
inference accuracy.
Note
This operator only supports forward propagation. DO NOT use it in training.
Defined in /home/smola/mxnet/src/operator/quantization/requantize.cc:L83
min_range (Symbol) – The original minimum scalar value in the form of float32 used for quantizing data into int32.
max_range (Symbol) – The original maximum scalar value in the form of float32 used for quantizing data into int32.
out_type ({'auto', 'int8', 'uint8'},optional, default='int8') – Output data type. auto can be specified to automatically determine output type according to min_calib_range.
min_calib_range (float or None, optional, default=None) – The minimum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int32 data into int8.
max_calib_range (float or None, optional, default=None) – The maximum scalar value in the form of float32 obtained through calibration. If present, it will be used to requantize the int32 data into int8.
name (string, optional.) – Name of the resulting symbol.
The new shape should be compatible with the original shape.
If an integer, then the result will be a 1-D array of that length.
One shape dimension can be -1. In this case, the value is inferred
from the length of the array and remaining dimensions.
-2 to -6 are used for data manipulation.
-2 copies this dimension from the input to the output shape.
-3 skips the current dimension if and only if its size is one.
-4 copies all remaining input dimensions to the output shape.
-5 uses the product of two consecutive dimensions of the input
shape as the output.
-6 splits one dimension of the input into the two dimensions that
follow -6 in the new shape.
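The special values listed above can be turned into a shape-inference routine. The following plain-Python sketch is one possible reading of those rules, not the library's implementation: the function names are hypothetical, and it assumes that every explicit size and every -1 consumes exactly one input dimension.

```python
def _product(dims):
    result = 1
    for d in dims:
        result *= d
    return result


def infer_reshape(in_shape, newshape):
    """Resolve the special codes -1 to -6 in newshape against in_shape."""
    out = []
    src = 0            # index of the next input dimension to consume
    unknown_at = None  # position in `out` of the dimension marked -1
    k = 0
    while k < len(newshape):
        code = newshape[k]
        if code == -1:
            if unknown_at is not None:
                raise ValueError("only one dimension can be -1")
            unknown_at = len(out)
            out.append(1)  # placeholder, filled in at the end
            src += 1
        elif code == -2:   # copy this dimension
            out.append(in_shape[src])
            src += 1
        elif code == -3:   # skip a dimension of size one
            if in_shape[src] != 1:
                raise ValueError("-3 requires a dimension of size 1")
            src += 1
        elif code == -4:   # copy all remaining dimensions
            out.extend(in_shape[src:])
            src = len(in_shape)
        elif code == -5:   # merge two consecutive dimensions
            out.append(in_shape[src] * in_shape[src + 1])
            src += 2
        elif code == -6:   # split one dimension into the next two values
            first, second = newshape[k + 1], newshape[k + 2]
            dim = in_shape[src]
            if first == -1:
                first = dim // second
            elif second == -1:
                second = dim // first
            if first * second != dim:
                raise ValueError("split sizes do not match the dimension")
            out.extend([first, second])
            src += 1
            k += 2
        else:              # explicit size
            out.append(code)
            src += 1
        k += 1
    total = _product(in_shape)
    if unknown_at is not None:
        out[unknown_at] = total // _product(out)
    if _product(out) != total:
        raise ValueError("new shape is incompatible with the input shape")
    return tuple(out)
```

For instance, `infer_reshape((2, 3, 4), (-5, -2))` merges the first two axes and copies the third, giving `(6, 4)`.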
reverse (bool, optional) – If set to true, the special values will be inferred from right to left.
order ({'C'}, optional) – Read the elements of a using this index order, and place the
elements into the reshaped array using this index order. ‘C’
means to read / write the elements using C-like index order,
with the last axis index changing fastest, back to the first
axis index changing slowest. Other order types such as ‘F’/’A’
may be added in the future.
Returns:
reshaped_array – It will always be a copy of the original array. This behavior is different
from the official NumPy reshape operator where views of the original array may be
generated.
More precise control over how dimensions are inherited is achieved by specifying slices over the lhs and rhs array dimensions. Only the sliced lhs dimensions are reshaped to the rhs sliced dimensions, with the non-sliced lhs dimensions staying the same.
lhs_begin (int or None, optional, default='None') – Defaults to 0. The beginning index along which the lhs dimensions are to be reshaped. Supports negative indices.
lhs_end (int or None, optional, default='None') – Defaults to None. The ending index along which the lhs dimensions are to be used for reshaping. Supports negative indices.
rhs_begin (int or None, optional, default='None') – Defaults to 0. The beginning index along which the rhs dimensions are to be used for reshaping. Supports negative indices.
rhs_end (int or None, optional, default='None') – Defaults to None. The ending index along which the rhs dimensions are to be used for reshaping. Supports negative indices.
name (string, optional.) – Name of the resulting symbol.
Applies recurrent layers to input data. Currently, vanilla RNN, LSTM and GRU are
implemented, with both multi-layer and bidirectional support.
When the input data is of type float32 and the environment variables MXNET_CUDA_ALLOW_TENSOR_CORE
and MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION are set to 1, this operator will try to use
pseudo-float16 precision (float32 math with float16 I/O) in order to use
Tensor Cores on suitable NVIDIA GPUs. This can sometimes give significant speedups.
Vanilla RNN
Applies a single-gate recurrent layer to input X. Two kinds of activation function are supported:
ReLU and Tanh.
When the projection size is set, LSTM can use the projection feature to reduce the parameter
size and gain some speedup without significant loss of accuracy.
Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech
Recognition - Sak et al. 2014. https://arxiv.org/abs/1402.1128
state_cell (Symbol) – initial cell state for LSTM networks (only for LSTM)
sequence_length (Symbol) – Vector of valid sequence lengths for each element in batch. (Only used if use_sequence_length kwarg is True)
state_size (int (non-negative), required) – size of the state for each layer
num_layers (int (non-negative), required) – number of stacked layers
bidirectional (boolean, optional, default=0) – whether to use bidirectional recurrent layers
mode ({'gru', 'lstm', 'rnn_relu', 'rnn_tanh'}, required) – the type of RNN to compute
p (float, optional, default=0) – drop rate of the dropout on the outputs of each RNN layer, except the last layer.
state_outputs (boolean, optional, default=0) – Whether to have the states as symbol outputs.
projection_size (int or None, optional, default='None') – size of the projection
lstm_state_clip_min (double or None, optional, default=None) – Minimum clip value of LSTM states. This option must be used together with lstm_state_clip_max.
lstm_state_clip_max (double or None, optional, default=None) – Maximum clip value of LSTM states. This option must be used together with lstm_state_clip_min.
lstm_state_clip_nan (boolean, optional, default=0) – Whether to stop NaN from propagating in state by clipping it to min/max. If clipping range is not specified, this option is ignored.
use_sequence_length (boolean, optional, default=0) – If set to true, this layer takes in an extra input parameter sequence_length to specify variable length sequence
name (string, optional.) – Name of the resulting symbol.
Performs region of interest (ROI) pooling on the input array.
ROI pooling is a variant of a max pooling layer, in which the output size is fixed and
region of interest is a parameter. Its purpose is to perform max pooling on the inputs
of non-uniform sizes to obtain fixed-size feature maps. ROI pooling is a neural-net
layer mostly used in training a Fast R-CNN network for object detection.
This operator takes a 4D feature map as an input array and region proposals as rois,
then it pools over sub-regions of input and produces a fixed-sized output array
regardless of the ROI size.
To crop the feature map accordingly, you can resize the bounding box coordinates
by changing the parameters rois and spatial_scale.
The cropped feature maps are pooled by standard max pooling operation to a fixed size output
indicated by a pooled_size parameter. batch_size will change to the number of region
bounding boxes after ROIPooling.
The size of each region of interest doesn’t have to be perfectly divisible by
the number of pooling sections (pooled_size).
Example:
x = [[[[  0.,   1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.,  16.,  17.],
       [ 18.,  19.,  20.,  21.,  22.,  23.],
       [ 24.,  25.,  26.,  27.,  28.,  29.],
       [ 30.,  31.,  32.,  33.,  34.,  35.],
       [ 36.,  37.,  38.,  39.,  40.,  41.],
       [ 42.,  43.,  44.,  45.,  46.,  47.]]]]
// region of interest i.e. bounding box coordinates.
y = [[0,0,0,4,4]]
// returns array of shape (2,2) according to the given roi with max pooling.
ROIPooling(x, y, (2,2), 1.0) = [[[[ 14.,  16.],
                                  [ 26.,  28.]]]]
// region of interest is changed due to the change in `spatial_scale` parameter.
ROIPooling(x, y, (2,2), 0.7) = [[[[  7.,   9.],
                                  [ 19.,  21.]]]]
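The example results above can be reproduced with a plain-Python sketch of ROI max pooling on a single 2-D feature map. This is an illustration under stated assumptions, not the operator's source: the function name is hypothetical, scaled coordinates are rounded to the nearest integer, and each bin spans from the floor of its start to the ceiling of its end.

```python
import math


def roi_pool(feature, roi, pooled_size, spatial_scale):
    """Max-pool one ROI of a 2-D feature map (H x W) to pooled_size.

    roi is [batch_index, x1, y1, x2, y2]; batch_index is ignored here
    because only one feature map is passed in.
    """
    pooled_h, pooled_w = pooled_size
    height, width = len(feature), len(feature[0])

    def scale(v):
        # Round to the nearest integer (assumed rounding rule).
        return int(math.floor(v * spatial_scale + 0.5))

    x1, y1, x2, y2 = (scale(v) for v in roi[1:])
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    bin_h = roi_h / pooled_h
    bin_w = roi_w / pooled_w

    out = []
    for ph in range(pooled_h):
        h0 = min(max(y1 + math.floor(ph * bin_h), 0), height)
        h1 = min(max(y1 + math.ceil((ph + 1) * bin_h), 0), height)
        row = []
        for pw in range(pooled_w):
            w0 = min(max(x1 + math.floor(pw * bin_w), 0), width)
            w1 = min(max(x1 + math.ceil((pw + 1) * bin_w), 0), width)
            values = [feature[h][w]
                      for h in range(h0, h1) for w in range(w0, w1)]
            row.append(max(values) if values else 0.0)
        out.append(row)
    return out
```

Because the bin boundaries use floor and ceiling, neighbouring bins may overlap by one row or column when the ROI size is not divisible by pooled_size.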
Defined in /home/smola/mxnet/src/operator/roi_pooling.cc:L217
Parameters:
data (Symbol) – The input array to the pooling operator, a 4D Feature maps
rois (Symbol) – Bounding box coordinates, a 2D array of [[batch_index, x1, y1, x2, y2]], where (x1, y1) and (x2, y2) are top left and bottom right corners of designated region of interest. batch_index indicates the index of corresponding image in the input array
pooled_size (Shape(tuple), required) – ROI pooling output shape (h,w)
spatial_scale (float, required) – Ratio of input feature map height (or w) to raw image height (or w). Equals the reciprocal of total stride in convolutional layers
name (string, optional.) – Name of the resulting symbol.
In forward pass, returns element-wise rounded value to the nearest integer of the input (same as round()).
In backward pass, returns gradients of 1 everywhere (instead of 0 everywhere as in round()):
\(\frac{d}{dx}{round\_ste(x)} = 1\) vs. \(\frac{d}{dx}{round(x)} = 0\).
This is useful for quantized training.
Reference: Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.
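The forward and backward behaviour described above can be sketched without any framework. The function names are illustrative, and rounding halves away from zero is an assumption about the tie-breaking rule.

```python
import math


def round_half_away(x):
    """Round to the nearest integer, with ties going away from zero."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def round_ste_forward(xs):
    """Forward pass: element-wise rounding."""
    return [round_half_away(x) for x in xs]


def round_ste_backward(upstream_grads):
    """Backward pass: the local gradient is 1 everywhere, so the
    upstream gradient passes through unchanged."""
    return list(upstream_grads)
```

Plain rounding would return zeros in the backward pass, which blocks learning; passing the gradient straight through is what makes the operator usable in quantized training.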
Samples are distributed according to a Poisson distribution parametrized by lambda (rate).
Samples will always be returned as a floating point data type.
Example:
poisson(lam=4,shape=(2,2))=[[5.,2.],[4.,6.]]
Defined in /home/smola/mxnet/src/operator/random/sample_op.cc:L152
Parameters:
lam (float, optional, default=1) – Lambda parameter (rate) of the Poisson distribution.
shape (Shape(tuple), optional, default=None) – Shape of the output.
ctx (string, optional, default='') – Context of output, in format [cpu|gpu|cpu_pinned](n). Only used for imperative calls.
dtype ({'None', 'bfloat16', 'float16', 'float32', 'float64'},optional, default='None') – DType of the output in case this can’t be inferred. Defaults to float32 if not defined (dtype=None).
name (string, optional.) – Name of the resulting symbol.
This function takes an n-dimensional input array of the form
[max_sequence_length, batch_size, other_feature_dims] and returns an (n-1)-dimensional array
of the form [batch_size, other_feature_dims].
Parameter sequence_length is used to handle variable-length sequences. sequence_length should be
an input array of positive ints of dimension [batch_size]. To use this parameter,
set use_sequence_length to True, otherwise each example in the batch is assumed
to have the max sequence length.
Defined in /home/smola/mxnet/src/operator/sequence_last.cc:L103
Parameters:
data (Symbol) – n-dimensional input array of the form [max_sequence_length, batch_size, other_feature_dims] where n>2
sequence_length (Symbol) – vector of sequence lengths of the form [batch_size]
use_sequence_length (boolean, optional, default=0) – If set to true, this layer takes in an extra input parameter sequence_length to specify variable length sequence
axis (int, optional, default='0') – The sequence axis. Only values of 0 and 1 are currently supported.
name (string, optional.) – Name of the resulting symbol.
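A minimal sketch of the selection this operator performs, assuming axis=0 and using nested Python lists indexed as data[t][b]. The function name sequence_last is used here only for illustration, and passing sequence_length=None stands in for use_sequence_length=False.

```python
def sequence_last(data, sequence_length=None):
    """Pick the last valid time step for every example in the batch.

    data is indexed as data[t][b]; each entry holds that example's features.
    """
    max_len = len(data)
    batch_size = len(data[0])
    if sequence_length is None:
        # Without lengths, every example is assumed to span the full sequence.
        sequence_length = [max_len] * batch_size
    return [data[sequence_length[b] - 1][b] for b in range(batch_size)]


# Three time steps, two examples, one feature each.
data = [[[1.0], [10.0]],
        [[2.0], [20.0]],
        [[3.0], [30.0]]]
full = sequence_last(data)            # last step of each example
ragged = sequence_last(data, [1, 2])  # example 0 ends at t=0, example 1 at t=1
```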
Sets all elements outside the sequence to a constant value.
This function takes an n-dimensional input array of the form
[max_sequence_length, batch_size, other_feature_dims] and returns an array of the same shape.
Parameter sequence_length is used to handle variable-length sequences. sequence_length
should be an input array of positive ints of dimension [batch_size].
To use this parameter, set use_sequence_length to True,
otherwise each example in the batch is assumed to have the max sequence length and
this operator works as the identity operator.
Defined in /home/smola/mxnet/src/operator/sequence_mask.cc:L186
Parameters:
data (Symbol) – n-dimensional input array of the form [max_sequence_length, batch_size, other_feature_dims] where n>2
sequence_length (Symbol) – vector of sequence lengths of the form [batch_size]
use_sequence_length (boolean, optional, default=0) – If set to true, this layer takes in an extra input parameter sequence_length to specify variable length sequence
value (float, optional, default=0) – The value to be used as a mask.
axis (int, optional, default='0') – The sequence axis. Only values of 0 and 1 are currently supported.
name (string, optional.) – Name of the resulting symbol.
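The masking rule can be sketched as follows, assuming axis=0, nested lists indexed as data[t][b], and a flat feature list per entry. The function name and the use of sequence_length=None for use_sequence_length=False are illustrative choices, not the operator's API.

```python
def sequence_mask(data, sequence_length=None, value=0.0):
    """Replace every entry at or beyond an example's length with `value`."""
    if sequence_length is None:
        # No lengths supplied: behaves as the identity (returns a copy).
        return [[list(feat) for feat in step] for step in data]
    masked = []
    for t, step in enumerate(data):
        row = []
        for b, feat in enumerate(step):
            if t < sequence_length[b]:
                row.append(list(feat))
            else:
                row.append([value] * len(feat))
        masked.append(row)
    return masked


data = [[[1.0], [10.0]],
        [[2.0], [20.0]],
        [[3.0], [30.0]]]
masked = sequence_mask(data, sequence_length=[1, 2], value=-1.0)
```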
This function takes an n-dimensional input array of the form [max_sequence_length, batch_size, other_feature_dims]
and returns an array of the same shape.
Parameter sequence_length is used to handle variable-length sequences.
sequence_length should be an input array of positive ints of dimension [batch_size].
To use this parameter, set use_sequence_length to True,
otherwise each example in the batch is assumed to have the max sequence length.
Defined in /home/smola/mxnet/src/operator/sequence_reverse.cc:L118
Parameters:
data (Symbol) – n-dimensional input array of the form [max_sequence_length, batch_size, other_feature_dims] where n>2
sequence_length (Symbol) – vector of sequence lengths of the form [batch_size]
use_sequence_length (boolean, optional, default=0) – If set to true, this layer takes in an extra input parameter sequence_length to specify variable length sequence
axis (int, optional, default='0') – The sequence axis. Only 0 is currently supported.
name (string, optional.) – Name of the resulting symbol.
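A short sketch of the reversal along axis 0, with data indexed as data[t][b]. It assumes that when lengths are supplied only the first sequence_length[b] steps of each example are reversed and any trailing padding stays in place; the function name is illustrative.

```python
def sequence_reverse(data, sequence_length=None):
    """Reverse each example along the time axis, up to its own length."""
    max_len = len(data)
    batch_size = len(data[0])
    if sequence_length is None:
        sequence_length = [max_len] * batch_size
    out = [[None] * batch_size for _ in range(max_len)]
    for b in range(batch_size):
        n = sequence_length[b]
        for t in range(max_len):
            # Valid steps are mirrored; padding steps keep their position.
            src = n - 1 - t if t < n else t
            out[t][b] = data[src][b]
    return out


data = [[1, 10],
        [2, 20],
        [3, 30]]
full = sequence_reverse(data)
ragged = sequence_reverse(data, [2, 3])
```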
In forward pass, returns element-wise sign of the input (same as sign()).
In backward pass, returns gradients of 1 everywhere (instead of 0 everywhere as in sign()):
\(\frac{d}{dx}{sign\_ste(x)} = 1\) vs. \(\frac{d}{dx}{sign(x)} = 0\).
This is useful for quantized training.
Reference: Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.
Example::
x = sign_ste([-2, 0, 3])
x.backward()
x = [-1., 0., 1.]
x.grad() = [1., 1., 1.]
The storage type of sign_ste output depends upon the input storage type:
sign_ste(default) = default
sign_ste(row_sparse) = row_sparse
sign_ste(csr) = csr
Defined in /home/smola/mxnet/src/operator/contrib/stes_op.cc:L80
In this attention pattern,
given a fixed window size 2w, each token attends to w tokens on the left side
if we use causal attention (setting symmetric to False),
otherwise each token attends to w tokens on each side.
The shapes of the inputs are:
- score :
(batch_size, seq_length, num_heads, w + w + 1) if symmetric is True,
(batch_size, seq_length, num_heads, w + 1) otherwise.
- value : (batch_size, seq_length, num_heads, num_head_units)
- dilation : (num_heads,)
The shape of the output is:
- context_vec : (batch_size, seq_length, num_heads, num_head_units)
Defined in /home/smola/mxnet/src/operator/contrib/transformer.cc:L1045
In this attention pattern,
given a fixed window size 2w, each token attends to w tokens on the left side
if we use causal attention (setting symmetric to False),
otherwise each token attends to w tokens on each side.
The shapes of the inputs are:
- score :
(batch_size, seq_length, num_heads, w + w + 1) if symmetric is True,
(batch_size, seq_length, num_heads, w + 1) otherwise.
- dilation : (num_heads,)
- valid_length : (batch_size,)
The shape of the output is:
- mask : same as the shape of score
Defined in /home/smola/mxnet/src/operator/contrib/transformer.cc:L909
Compute the sliding window attention score, which is used in
Longformer (https://arxiv.org/pdf/2004.05150.pdf). In this attention pattern,
given a fixed window size 2w, each token attends to w tokens on the left side
if we use causal attention (setting symmetric to False),
otherwise each token attends to w tokens on each side.
The shapes of the inputs are:
- query : (batch_size, seq_length, num_heads, num_head_units)
- key : (batch_size, seq_length, num_heads, num_head_units)
- dilation : (num_heads,)
The shape of the output is:
- score :
(batch_size, seq_length, num_heads, w + w + 1) if symmetric is True,
(batch_size, seq_length, num_heads, w + 1) otherwise.
Defined in /home/smola/mxnet/src/operator/contrib/transformer.cc:L969
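To make the window pattern concrete, the sketch below lists which key positions a single query token attends to. It is illustrative only: the helper name is hypothetical, and both the left-to-right ordering of the window slots and the treatment of dilation as a stride between attended positions are assumptions rather than documented behavior.

```python
def window_positions(i, seq_length, w, symmetric=True, dilation=1):
    """Key positions attended to by query token i, one entry per window slot.

    Slots that fall outside the sequence are reported as None.
    """
    offsets = range(-w, w + 1) if symmetric else range(-w, 1)
    positions = []
    for off in offsets:
        j = i + off * dilation
        positions.append(j if 0 <= j < seq_length else None)
    return positions


both_sides = window_positions(2, seq_length=6, w=2)                # 2w + 1 slots
causal = window_positions(0, seq_length=6, w=2, symmetric=False)   # w + 1 slots
dilated = window_positions(2, seq_length=6, w=2, dilation=2)
```

The number of slots per token matches the last dimension of score: w + w + 1 in the symmetric case and w + 1 in the causal case.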
This function returns a sliced array between the indices given
by begin and end with the corresponding step.
For an input array of shape=(d_0,d_1,...,d_n-1),
slice operation with begin=(b_0,b_1...b_m-1),
end=(e_0,e_1,...,e_m-1), and step=(s_0,s_1,...,s_m-1),
where m <= n, results in an array with the shape
(|e_0-b_0|/|s_0|,...,|e_m-1-b_m-1|/|s_m-1|,d_m,...,d_n-1).
The resulting array’s k-th dimension contains elements
from the k-th dimension of the input array starting
from index b_k (inclusive) with step s_k
until reaching e_k (exclusive).
If the k-th elements are None in the sequence of begin, end,
and step, the following rule will be used to set default values.
If s_k is None, set s_k=1. If s_k > 0, set b_k=0, e_k=d_k;
else, set b_k=d_k-1, e_k=-1.
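The default-value rule can be sketched for a single axis as below. The helper name is hypothetical, explicit begin and end values are assumed to be non-negative, and a default e_k of -1 is read as "stop just before index 0", which is what Python's range does with that stop value.

```python
def slice_axis_indices(d_k, b_k=None, e_k=None, s_k=None):
    """Indices selected along one axis of length d_k, applying the defaults above."""
    if s_k is None:
        s_k = 1
    if s_k > 0:
        b_k = 0 if b_k is None else b_k
        e_k = d_k if e_k is None else e_k
    else:
        b_k = d_k - 1 if b_k is None else b_k
        e_k = -1 if e_k is None else e_k
    # b_k is inclusive, e_k is exclusive.
    return list(range(b_k, e_k, s_k))


everything = slice_axis_indices(5)
strided = slice_axis_indices(5, b_k=1, e_k=4, s_k=2)
reversed_axis = slice_axis_indices(5, s_k=-1)
```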
The storage type of slice output depends on the storage type of the input:
* slice(csr) = csr
* otherwise, slice generates output with default storage
Note
When input data storage type is csr, it only supports
step=(), or step=(None,), or step=(1,) to generate a csr output.
For other step parameter values, it falls back to slicing
a dense tensor.
squeeze_axis=1 removes the axis with length 1 from the shapes of the output arrays.
Note that setting squeeze_axis to 1 removes an axis of length 1 only
along the axis on which the input is split.
Also, squeeze_axis can be set to true only if input.shape[axis]==num_outputs.
num_outputs (int, required) – Number of splits. Note that this should evenly divide the length of the axis.
axis (int, optional, default='1') – Axis along which to split.
squeeze_axis (boolean, optional, default=0) – If true, removes the axis with length 1 from the shapes of the output arrays. Note that setting squeeze_axis to true removes an axis of length 1 only along the axis on which the input is split. Also, squeeze_axis can be set to true only if input.shape[axis]==num_outputs.
name (string, optional.) – Name of the resulting symbol.
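The effect of num_outputs, axis and squeeze_axis on output shapes can be sketched with a small shape calculator. The helper name split_shapes is hypothetical, and the sketch works on shape tuples only, not on array data.

```python
def split_shapes(shape, num_outputs, axis=1, squeeze_axis=False):
    """Shapes of the arrays produced by an even split along `axis`."""
    shape = tuple(shape)
    if shape[axis] % num_outputs != 0:
        raise ValueError("num_outputs must evenly divide the length of the axis")
    part = shape[axis] // num_outputs
    if squeeze_axis:
        if part != 1:
            raise ValueError("squeeze_axis requires shape[axis] == num_outputs")
        piece = shape[:axis] + shape[axis + 1:]
    else:
        piece = shape[:axis] + (part,) + shape[axis + 1:]
    return [piece] * num_outputs


kept = split_shapes((2, 3, 4), num_outputs=3)
squeezed = split_shapes((2, 3, 4), num_outputs=3, squeeze_axis=True)
```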
Slices a region of the array like the shape of another array.
This function is similar to slice; however, the begin values are always 0
and the end values of the specified axes are inferred from the second input shape_like.
Given the second shape_like input of shape=(d_0,d_1,...,d_n-1),
a slice_like operator with default empty axes performs the
following operation:
``out = slice(input, begin=(0, 0, ..., 0), end=(d_0, d_1, ..., d_n-1))``.
When axes is not empty, it is used to specify which axes are being sliced.
Given a 4-d input data, slice_like operator with axes=(0,2,-1)
will perform the following operation:
``out = slice(input, begin=(0, 0, 0, 0), end=(d_0, None, d_2, d_3))``.
Note that the first and second inputs are allowed to have different numbers of dimensions;
however, in that case the axes must be specified and must not exceed the
dimension limits of either input.
For example, given input_1 with shape=(2,3,4,5) and input_2 with
shape=(1,2,3), it is not allowed to use:
``out = slice_like(a, b)`` because ndim of input_1 is 4, and ndim of input_2
is 3.
The following is allowed in this situation:
``out = slice_like(a, b, axes=(0, 2))``
Example:
axes (Shape(tuple), optional, default=[]) – List of axes on which input data will be sliced according to the corresponding size of the second input. By default will slice on all axes. Negative axes are supported.
name (string, optional.) – Name of the resulting symbol.
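The way slice_like derives the end argument of the equivalent slice call can be sketched as follows. The helper name is hypothetical, and resolving negative axes against the first input's number of dimensions is an assumption.

```python
def slice_like_end(input_shape, like_shape, axes=()):
    """Build the `end` tuple of the equivalent slice call (begin is all zeros)."""
    ndim = len(input_shape)
    if not axes:
        # Empty axes: slice on every axis, so both inputs need the same ndim.
        if len(like_shape) != ndim:
            raise ValueError("axes must be given when the inputs differ in ndim")
        return tuple(like_shape)
    end = [None] * ndim
    for axis in axes:
        if axis < 0:
            axis += ndim  # assumption: negative axes count from the first input's end
        if not 0 <= axis < min(ndim, len(like_shape)):
            raise ValueError("axis exceeds the dimension limits")
        end[axis] = like_shape[axis]
    return tuple(end)


same_ndim = slice_like_end((5, 6, 7, 8), (2, 3, 4, 5), axes=(0, 2, -1))
mixed_ndim = slice_like_end((2, 3, 4, 5), (1, 2, 3), axes=(0, 2))
```

A None entry in the result means that axis is left unsliced, matching the None in the documented end tuple.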
axis (int, optional, default='-1') – The axis along which to compute softmax.
temperature (double or None, optional, default=None) – Temperature parameter in softmax
dtype ({None, 'float16', 'float32', 'float64'},optional, default='None') – DType of the output in case this can’t be inferred. Defaults to the same as input’s dtype if not defined (dtype=None).
use_length (boolean or None, optional, default=0) – Whether to use the length input as a mask over the data input.
name (string, optional.) – Name of the resulting symbol.
Stops the accumulated gradient of the inputs from flowing through this operator
in the backward direction. In other words, this operator prevents the contribution
of its inputs from being taken into account when computing gradients.
Normalizes a data batch by mean and variance, and applies a scale gamma as
well as offset beta.
Standard BN [1] implementations only normalize the data within each device.
SyncBN normalizes the input within the whole mini-batch.
We follow the sync-once implementation described in the paper [2].
Assume the input has more than one dimension and we normalize along axis 1.
We first compute the mean and variance along this axis:
Both mean and var return a scalar by treating the input as a vector.
Assume the input has size k on axis 1, then both gamma and beta
have shape (k,). If output_mean_var is set to true, the operator also outputs data_mean and
data_var, which are needed for the backward pass.
Besides the inputs and the outputs, this operator accepts two auxiliary
states, moving_mean and moving_var, which are k-length
vectors. They are global statistics for the whole dataset, which are updated
by:
If use_global_stats is set to be true, then moving_mean and
moving_var are used instead of data_mean and data_var to compute
the output. It is often used during inference.
Both gamma and beta are learnable parameters. But if fix_gamma is true,
then set gamma to 1 and its gradient to 0.
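The formulas for the statistics and for the moving-average update are not reproduced in this text. The sketch below therefore assumes the conventional batch-normalization form (biased variance, and a moving average weighted by momentum) for a 2-D input of shape (N, k) normalized along axis 1. It runs on a single device and does not model the cross-device synchronization; the function name batch_norm_2d is hypothetical.

```python
import math


def batch_norm_2d(data, gamma, beta, moving_mean, moving_var,
                  eps=1e-3, momentum=0.9,
                  use_global_stats=False, fix_gamma=True):
    """Normalize a list of rows (N x k) per column; return (out, moving_mean, moving_var)."""
    n, k = len(data), len(data[0])
    if fix_gamma:
        gamma = [1.0] * k
    if use_global_stats:
        mean, var = list(moving_mean), list(moving_var)
    else:
        mean = [sum(row[c] for row in data) / n for c in range(k)]
        var = [sum((row[c] - mean[c]) ** 2 for row in data) / n for c in range(k)]
        # Assumed update rule: new = old * momentum + batch_stat * (1 - momentum).
        moving_mean = [m * momentum + b * (1 - momentum)
                       for m, b in zip(moving_mean, mean)]
        moving_var = [v * momentum + b * (1 - momentum)
                      for v, b in zip(moving_var, var)]
    out = [[(row[c] - mean[c]) / math.sqrt(var[c] + eps) * gamma[c] + beta[c]
            for c in range(k)] for row in data]
    return out, moving_mean, moving_var


batch = [[1.0, 2.0], [3.0, 6.0]]
out, new_mean, new_var = batch_norm_2d(
    batch, gamma=[1.0, 1.0], beta=[0.0, 0.0],
    moving_mean=[0.0, 0.0], moving_var=[1.0, 1.0], eps=0.0)
```

With use_global_stats=True the batch statistics are skipped and the stored moving_mean and moving_var are used directly, which reduces the operator to a fixed scale and shift.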
Reference:
Defined in /home/smola/mxnet/src/operator/contrib/sync_batch_norm.cc:L97
eps (float, optional, default=0.00100000005) – Epsilon to prevent div 0
momentum (float, optional, default=0.899999976) – Momentum for moving average
fix_gamma (boolean, optional, default=1) – Fix gamma while training
use_global_stats (boolean, optional, default=0) – Whether to use global moving statistics instead of local batch statistics. This forces batch-norm to behave as a scale-shift operator.
output_mean_var (boolean, optional, default=0) – Output the mean and variance in addition to the normalized data.
ndev (int, optional, default='1') – The count of GPU devices
key (string, required) – Hash key for synchronization. Set the same hash key for the same layer on every device; Block.prefix is typically used, as in gluon.nn.contrib.SyncBatchNorm.
name (string, optional.) – Name of the resulting symbol.
Concurrent sampling from multiple
Poisson distributions with parameters lambda (rate).
The parameters of the distributions are provided as an input array.
Let [s] be the shape of the input array, n be the dimension of [s], [t]
be the shape specified as the parameter of the operator, and m be the dimension
of [t]. Then the output will be a (n+m)-dimensional array with shape [s]x[t].
For any valid n-dimensional index i with respect to the input array, output[i]
will be an m-dimensional array that holds randomly drawn samples from the distribution
which is parameterized by the input value at index i. If the shape parameter of the
operator is not set, then one sample will be drawn per distribution and the output array
has the same shape as the input array.
Samples will always be returned as a floating point data type.
Defined in /home/smola/mxnet/src/operator/random/multisample_op.cc:L340
Parameters:
lam (Symbol) – Lambda (rate) parameters of the distributions.
shape (Shape(tuple), optional, default=[]) – Shape to be sampled from each random distribution.
dtype ({'None', 'float16', 'float32', 'float64'},optional, default='None') – DType of the output in case this can’t be inferred. Defaults to float32 if not defined (dtype=None).
name (string, optional.) – Name of the resulting symbol.
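The [s]x[t] shape rule can be illustrated with nested Python lists. This is a sketch only: the function names are hypothetical, and Knuth's multiplication method is swapped in as a simple Poisson sampler without any claim that the operator uses it.

```python
import math
import random


def poisson_knuth(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method, returned as a float."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return float(k)
        k += 1


def draw(lam, shape, rng):
    # Build a nested list of samples with the requested per-distribution shape [t].
    if not shape:
        return poisson_knuth(lam, rng)
    return [draw(lam, shape[1:], rng) for _ in range(shape[0])]


def sample_poisson(lams, shape=(), seed=None):
    """Replace every rate in the nested list `lams` by samples of shape `shape`."""
    rng = random.Random(seed)

    def walk(node):
        if isinstance(node, list):
            return [walk(child) for child in node]
        return draw(node, tuple(shape), rng)

    return walk(lams)


# Input shape [s] = (2,), per-distribution shape [t] = (2, 3): output shape (2, 2, 3).
samples = sample_poisson([1.0, 8.5], shape=(2, 3), seed=0)
# Empty shape: one sample per distribution, same shape as the input.
one_each = sample_poisson([[1.0], [2.0]], seed=0)
```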
Returns the indices of the top k elements in an input array along the given
axis (by default).
If ret_typ is set to ‘value’, the values of the top k elements are returned instead of the indices.
If ret_typ is ‘both’, both the values and the indices are returned.
The returned elements will be sorted.
axis (int or None, optional, default='-1') – Axis along which to choose the top k indices. If not given, the flattened array is used. Default is -1.
k (int, optional, default='1') – Number of top elements to select. It should always be smaller than or equal to the number of elements along the given axis. A global sort is performed if k < 1.
ret_typ ({'both', 'indices', 'mask', 'value'}, optional, default='indices') – The return type. “value” means to return the top k values, “indices” means to return the indices of the top k values, “mask” means to return a mask array containing 0 and 1, where 1 marks the top k values. “both” means to return a list of both values and indices of the top k elements.
is_ascend (boolean, optional, default=0) – Whether to choose k largest or k smallest elements. Top K largest elements will be chosen if set to false.
dtype ({'float16', 'float32', 'float64', 'int32', 'int64', 'uint8'},optional, default='float32') – DType of the output indices when ret_typ is “indices” or “both”. An error will be raised if the selected data type cannot precisely represent the indices.
name (string, optional.) – Name of the resulting symbol.
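The return types can be compared on a 1-D list with the sketch below. The helper name topk_1d is hypothetical, and it returns plain Python ints for indices rather than the floating-point dtype the operator defaults to.

```python
def topk_1d(data, k=1, ret_typ="indices", is_ascend=False):
    """Top-k selection on a flat list, mirroring the documented return types."""
    order = sorted(range(len(data)), key=lambda i: data[i], reverse=not is_ascend)
    if k < 1:
        k = len(data)  # k < 1 requests a sort of the whole axis
    chosen = order[:k]
    values = [data[i] for i in chosen]
    if ret_typ == "value":
        return values
    if ret_typ == "indices":
        return chosen
    if ret_typ == "both":
        return values, chosen
    if ret_typ == "mask":
        picked = set(chosen)
        return [1 if i in picked else 0 for i in range(len(data))]
    raise ValueError("unknown ret_typ: %r" % (ret_typ,))


scores = [3.0, 1.0, 4.0, 1.5, 5.0]
top2_idx = topk_1d(scores, k=2)
top2_val = topk_1d(scores, k=2, ret_typ="value")
top2_mask = topk_1d(scores, k=2, ret_typ="mask")
```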