DETAILED ACTION 
Response to Arguments
The amendments filed 5/28/2021 have been entered and made of record. 

The Applicant's amendments and arguments filed 5/28/2021 have been considered but are not persuasive.
Re Claim 1: Applicant asserts that the cited references do not teach or disclose the claim limitation “obtaining key-value coupling data determined based on an operation between new key data determined using a first nonlinear transformation for key data of an attention layer, and value data of the attention layer corresponding to the key data”,
in particular the recitation of new key data determined using a first nonlinear transformation for key data of an attention layer.
However, the Examiner disagrees, for the following reasons:
as illustrated by Oord’s Fig. 2,

    [media_image1.png: Oord, Fig. 2 (greyscale image, not reproduced)]


and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent.--, in [0078]-[0080],
herein, “a set of spatially indexed key and value vectors, p.sup.key” reads on “new key data”, which is “determined by a non-linear function of a combination of q.sub.t and p.sub.j.sup.key”, as disclosed in [0082]-[0083]; herein, p.sub.j.sup.key reads on “key data of an attention layer”; and further see:
-- q.sub.t relates to the current pixel, and p.sub.j.sup.key more particularly j runs over the spatial locations of the supporting patches for each support image and has e.g. S.times.K.times.K values. Alternatively, for example, the non-linear function may be defined by a feedforward neural network jointly trained with the other system components.--, in [0082],
Thus, q.sub.t above is an input for the current pixel of the support data, not “query data”, which is the output shown as image 230 in Fig. 2,
Oord then further discloses obtaining key-value coupling data determined based on an operation between new key data determined using a first nonlinear transformation for key data of an attention layer, and value data of the attention layer corresponding to the key data (see Oord: e.g., --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016]).
Oord thus clearly teaches that the set of scores for the soft attention query vector is determined based on an operation that is “to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches”.
also see Shazeer: e.g., -- The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value.  The attention layer then applies the attention mechanism described above using these layer-specific queries, keys, and values to generate initial outputs for the attention layer.--, in [0058].
Van den Oord as modified by Shazeer further disclose wherein the key-value-coupling data is fixed by determined, independent of the query data, based on the operation between the new key data and the value data (see Oord: e.g., --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016]; also see: --in which the model learns to perform a task and in which the model parameters may then be fixed whilst the model is conditioned on one or a few new examples to generate a target output.--, in [0068]-[0070]; also see: --a procedure for using the neural network system 200 of FIG. 2 for few-shot learning. In some implementations the system is first trained as described later and then parameters of the system are fixed. The trained system may then be used to implement few-shot learning as a form of inference, inducing a representation of a probability density distribution in the system by presenting the previously trained system with one or a few new examples. 
These new examples are received by the system as a support data set, for example as one or more new example images. In effect the system is trained to perform a task using the support data set, for example to copy the new example(s) or to process the new example(s) in some other way. The system then performs the same task on the new examples. The initial training can be considered a form of meta-learning.--, in [0091]), 
Oord clearly discloses above that the model parameters may then be fixed, so that the key-value-coupling data, as model parameters, is fixed.
	Therefore, claims 1-6 and 8-27 remain not patentably distinguishable over the prior art reference(s). Further discussion is provided in the prior art rejection section below.
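For illustration only, the arrangement recited in the limitation discussed above can be sketched in Python. The sketch is not taken from Oord, Shazeer, or the application. The function names (feature_map, key_value_coupling, attend), the choice of ELU(x) + 1 as the nonlinear transformation, and the use of a summed outer product as the “operation” are all assumptions made for the example. It shows coupling data that is computed from the keys and values alone and is then reused, unchanged, for any query.

```python
import math

def feature_map(vec):
    # Hypothetical nonlinear transformation, ELU(x) + 1, assumed for illustration.
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def key_value_coupling(keys, values):
    # Assumed "operation": sum over pairs of outer(feature_map(key), value).
    # No query is involved in this step.
    d_k, d_v = len(keys[0]), len(values[0])
    coupling = [[0.0] * d_v for _ in range(d_k)]
    for key, value in zip(keys, values):
        new_key = feature_map(key)
        for a in range(d_k):
            for b in range(d_v):
                coupling[a][b] += new_key[a] * value[b]
    return coupling

def attend(query, coupling):
    # Transform the query, then multiply it into the precomputed coupling data.
    new_query = feature_map(query)
    d_v = len(coupling[0])
    return [
        sum(new_query[a] * coupling[a][b] for a in range(len(new_query)))
        for b in range(d_v)
    ]

# Usage: the coupling data is computed once and shared by both queries.
example_keys = [[0.0, 1.0], [1.0, 0.0]]
example_values = [[1.0, 2.0], [3.0, 4.0]]
example_coupling = key_value_coupling(example_keys, example_values)
first_output = attend([0.5, -0.5], example_coupling)
second_output = attend([-1.0, 2.0], example_coupling)
```

Under these assumptions the output equals the sum, over key-value pairs, of the value weighted by the dot product of the transformed query and the transformed key; no normalization of the weights is included in the sketch.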


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8-13, and 15-27 are rejected under 35 U.S.C. 103 as being unpatentable over van den Oord (US 20200250528 A1, DATE FILED: October 25, 2018) in view of Shazeer (US 20190130213 A1).
Re Claims 1, 23, and 25, Van den Oord discloses a processor-implemented method of implementing an attention mechanism in a neural network (see van den Oord: e.g., -- Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.  Some neural networks include one or more hidden layers in addition to an output layer.  The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.  Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. [0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that, in some implementations, is capable of learning to generate a data item from just a few examples.  In broad terms the system is autoregressive in that it generates values of the data item dependent upon previously generated values of the data item.  However the system also employs a soft attention mechanism which enables attention-controlled context for the generated data item values.  Thus rather than the context for the data item generation being the same for all the generated item values it depends upon the item value being generated, and more particularly the context is controlled by the previous item values generated.--, in [0003]-[0004], and, --A soft attention mechanism may be provided to attend to one or more suitable patches for use in generating the current data item value.  Thus the soft attention mechanism may determine a set of weightings or scores for the support data patches, for example in the form of a soft attention query vector (e.g. .alpha..sub.ij later) dependent upon the previously generated values of the data item.  The soft attention query vector may then be used to query the memory for generating a value of the data item at a current iteration.  
When generating the value of the data item at the current iteration one or more layers of the causal convolutional neural network may be conditioned upon the support data patches weighted by the scores.  The support data patches typically each comprise an encoding of supporting data for generating the data item, and the encodings may be combined weighted by the scores.--, in [0013]-[0014], and [0032]), the method comprising:
obtaining key-value coupling data determined based on an operation between new key data determined using a first nonlinear transformation for key data of an attention layer, and value data of the attention layer corresponding to the key data (see van den Oord: e.g., Fig. 2, and, --Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.--, in [0003]; --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016], --convolutional neural network layers 110 may have a gated activation function in place of a conventional activation function.  
In a gated activation function, the output of an element-wise non-linearity, i.e., of a conventional activation function, is element-wise multiplied by a gate vector that is generated by applying an element-wise non-linearity to the output of a convolution.--, in [0063]-[0069]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080], and, --the generated data item values may comprise pixel values of an image representing a predicted image frame which results from the agent performing the action.  
This additional data may be transformed to generate a latent feature vector, for example using one or more neural network layers such as one or more convolutional layers and/or an MLP (multilayer perceptron), and the convolutional neural network module 220 may be conditioned on the latent feature vector.--, in [0093]-[0094], and, --The CNN module 220 has 16 layers with 128-dimensional feature maps and skip connections each conditioned on the global context features and the upper 8 layers also conditioned on the attention-controlled context features.--, in [0098], and, --The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.  Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data--, in [0108]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent.--, in [0078]-[0080],
herein, “a set of spatially indexed key and value vectors, p.sup.key” reads on “new key data”, which is “determined by a non-linear function of a combination of q.sub.t and p.sub.j.sup.key”, as disclosed in [0082]-[0083]; herein, p.sub.j.sup.key reads on “key data of an attention layer”; and further see:
-- q.sub.t relates to the current pixel, and p.sub.j.sup.key more particularly j runs over the spatial locations of the supporting patches for each support image and has e.g. S.times.K.times.K values. Alternatively, for example, the non-linear function may be defined by a feedforward neural network jointly trained with the other system components.--, in [0082],
Thus, q.sub.t above is an input for the current pixel, not “query data”, which is the output shown as image 230 in Fig. 2;
also see Shazeer: e.g., -- The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value.  The attention layer then applies the attention mechanism described above using these layer-specific queries, keys, and values to generate initial outputs for the attention layer.--, in [0058]);
determining new query data by applying a second transformation to query data corresponding to input data of the attention layer (see van den Oord: e.g., --A soft attention mechanism may be provided to attend to one or more suitable patches for use in generating the current data item value.  Thus the soft attention mechanism may determine a set of weightings or scores for the support data patches, for example in the form of a soft attention query vector (e.g. .alpha..sub.ij later) dependent upon the previously generated values of the data item.  The soft attention query vector may then be used to query the memory for generating a value of the data item at a current iteration.  When generating the value of the data item at the current iteration one or more layers of the causal convolutional neural network may be conditioned upon the support data patches weighted by the scores.  The support data patches typically each comprise an encoding of supporting data for generating the data item, and the encodings may be combined weighted by the scores.--, in [0013]-[0014], and [0032]; also see: --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  
It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080]);
However, Oord does not explicitly disclose that the above second transformation is a nonlinear transformation.
Shazeer teaches determining new query data by applying a second nonlinear transformation to query data corresponding to input data of the attention layer (see Shazeer: e.g., --the sequence of transformations can include two or more learned linear transformations each separated by an activation function, e.g., a non-linear elementwise activation function--, in [0043]-[0044]; and, --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [0050] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230.  In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values.  The attention sub-layer then computes a weighted sum of the values in accordance with these weights.  Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.--, in [0049]-[0050]);
Van den Oord and Shazeer are combinable as they are in the same field of endeavor: using the key-value attention mechanism. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Oord’s method using Shazeer’s teachings by including determining new query data by applying a second nonlinear transformation to query data corresponding to input data of the attention layer in Oord’s generation of query data, in order to compute the dot products of the query with all of the keys and apply a softmax function over the scaled dot products to obtain the weights on the values (see Shazeer: e.g. in [0043]-[0044], and [0049]-[0050]);
Van den Oord as modified by Shazeer further disclose determining output data of the attention layer based on an operation between the new query data and the key-value coupling data (see van den Oord: e.g., -- Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.  Some neural networks include one or more hidden layers in addition to an output layer.  The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.  Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. [0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that, in some implementations, is capable of learning to generate a data item from just a few examples.  In broad terms the system is autoregressive in that it generates values of the data item dependent upon previously generated values of the data item.  However the system also employs a soft attention mechanism which enables attention-controlled context for the generated data item values.  Thus rather than the context for the data item generation being the same for all the generated item values it depends upon the item value being generated, and more particularly the context is controlled by the previous item values generated.--, in [0003]-[0004], and, --convolutional neural network layers 110 may have a gated activation function in place of a conventional activation function.  
In a gated activation function, the output of an element-wise non-linearity, i.e., of a conventional activation function, is element-wise multiplied by a gate vector that is generated by applying an element-wise non-linearity to the output of a convolution.--, in [0063]-[0069]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080]; also see Shazeer: e.g., --the sequence of transformations can include two or more learned linear transformations each separated by an activation function, e.g., a non-linear elementwise activation function--, in [0043]-[0044]; and, --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [0050] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230. 
 In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values.  The attention sub-layer then computes a weighted sum of the values in accordance with these weights.  Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.--, in [0049]-[0050]); and, 
wherein the key-value-coupling data is fixed by determined, independent of the query data, based on the operation between the new key data and the value data (see Oord: e.g., --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016]; also see: --in which the model learns to perform a task and in which the model parameters may then be fixed whilst the model is conditioned on one or a few new examples to generate a target output.--, in [0068]-[0070]; also see: --a procedure for using the neural network system 200 of FIG. 2 for few-shot learning. In some implementations the system is first trained as described later and then parameters of the system are fixed. The trained system may then be used to implement few-shot learning as a form of inference, inducing a representation of a probability density distribution in the system by presenting the previously trained system with one or a few new examples. These new examples are received by the system as a support data set, for example as one or more new example images. 
In effect the system is trained to perform a task using the support data set, for example to copy the new example(s) or to process the new example(s) in some other way. The system then performs the same task on the new examples. The initial training can be considered a form of meta-learning.--, in [0091]; {It is clearly disclosed above by Oord, that which the model parameters may then be fixed, so that model parameters as the key-value-coupling data is fixed}; and, also see Shazeer: e.g., -- the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the columns of the matrix.--, in [0051]; and see Shazeer: e.g., -- The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value.  The attention layer then applies the attention mechanism described above using these layer-specific queries, keys, and values to generate initial outputs for the attention layer.--, in [0058]).
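The scaled dot-product attention quoted from Shazeer in [0049]-[0050] above can be sketched as follows for a single query. The function names and the use of plain Python lists are assumptions made for illustration; the sketch is not code from Shazeer.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(query, keys, values):
    # Dot product of the query with every key, divided by sqrt(dimension).
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax over the scaled dot products gives the weights on the values.
    weights = softmax(scores)
    # The output is the weighted sum of the values.
    d_v = len(values[0])
    return [sum(w * value[b] for w, value in zip(weights, values)) for b in range(d_v)]
```

In this form the weights depend on the query, so they are recomputed for each new query.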

Re Claims 2, 15 and 26,  Van den Oord as modified by Shazeer further disclose wherein the obtaining comprises: determining the new key data by applying the first nonlinear transformation to the key data (see van den Oord: e.g., Fig. 2, and, --Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.--, in [0003]; --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016], --convolutional neural network layers 110 may have a gated activation function in place of a conventional activation function.  
In a gated activation function, the output of an element-wise non-linearity, i.e., of a conventional activation function, is element-wise multiplied by a gate vector that is generated by applying an element-wise non-linearity to the output of a convolution.--, in [0063]-[0069]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080]); and
determining the key-value coupling data based on an operation between the value data and the new key data (see Shazeer: e.g., --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key….the attention sub-layer computes the attention over a set of queries simultaneously.  In particular, the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the columns of the matrix. --, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value. --, in [0057]-[0059]).
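The gated activation function quoted from Oord in [0063]-[0069] above may be sketched as follows. The quoted passage does not name the non-linearities; tanh and a logistic sigmoid are assumed here, and the two convolution outputs are passed in as precomputed lists. The names gated_activation, conv_out, and gate_conv_out are hypothetical.

```python
import math

def sigmoid(x):
    # Logistic sigmoid, assumed here as the gate non-linearity.
    return 1.0 / (1.0 + math.exp(-x))

def gated_activation(conv_out, gate_conv_out):
    # Element-wise: non-linearity of one output times the gate vector,
    # where the gate vector is a non-linearity of a convolution output.
    return [math.tanh(a) * sigmoid(g) for a, g in zip(conv_out, gate_conv_out)]
```

Under these assumptions each gate value lies between 0 and 1 and scales the corresponding activation.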

Re Claims 3, and 16, Van den Oord as modified by Shazeer further disclose wherein the new key data includes a first new key, and the value data includes a first value corresponding to the first new key (see Shazeer: e.g., --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key….the attention sub-layer computes the attention over a set of queries simultaneously.  In particular, the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the columns of the matrix. --, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value. --, in [0057]-[0059]), and
the key-value coupling data includes a single item of aggregated data determined based on an operation between the first new key and the first value with respect to a first key-value pair of the first new key and the first value (see Oord: e.g., Fig. 2, and, --a connection from an input of a convolutional layer to a summer to sum this with an intermediate output of the layer effectively allowing the network to skip or partially skip a layer.--, in [0025], and, --the mask restricts the connections in a given pixel in the output feature map of the additional convolutional layer to those neighboring pixels in the input feature map to the additional convolutional layer that are before the given pixel in the sequence, to features corresponding to those colors in the corresponding pixel in the input feature map that have already been generated, and to features corresponding to the given color in the corresponding pixel in the input feature map.  [0062] The neural network system 100 can implement this masking in any of a variety of ways.  For example, each convolutional layer can have a kernel with the corresponding weights zeroed out. [0063] In some cases, the initial neural network layers 110 may include two stacks of convolutional neural network layers: a horizontal one that, for a given pixel in a given row, conditions on the color values already generated for the given row so far and a vertical one that conditions on all rows above the given row.  In these cases, the vertical stack, which does not have any masking, allows the receptive field to grow in a rectangular fashion without any blind spot, and the outputs of the two stacks may be combined, e.g., summed, after each layer.--, in [0061]-[0063]; also see Shazeer: e.g., -- an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0050], and [0057]-[0060]).

Re Claims 4, 17, and 24, Van den Oord as modified by Shazeer further discloses wherein either one or both of the first nonlinear transformation and the second nonlinear transformation uses either one or both of a sine function and a cosine function as a nonlinear factor (see Shazeer: e.g., in [0027]).

Re Claims 5 and 18, Van den Oord as modified by Shazeer further discloses wherein the first nonlinear transformation and the second nonlinear transformation use the same function (see Shazeer: e.g., in [0027], and, --the sequence of transformations can include two or more learned linear transformations each separated by an activation function, e.g., a non-linear elementwise activation function--, in [0043]-[0044]).

Re Claim 6, Van den Oord as modified by Shazeer further discloses wherein the output data of the attention layer is determined based on an operation between the new query data and the fixed key-value coupling data (see Shazeer: e.g., -- The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0051]).


Re Claims 8 and 19, Van den Oord as modified by Shazeer further discloses wherein an operation between the new key data and the new query data corresponds to a similarity between the key data and the query data (see van den Oord: e.g., -- The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.--, in [0079]; also see Shazeer: e.g., -- an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [0050] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230.  In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values.  The attention sub-layer then computes a weighted sum of the values in accordance with these weights.  Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.--, in [0049]-[0050]).
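
For illustration only, the scaled dot-product attention described in the Shazeer passage quoted above ([0050]) can be sketched as follows. This is a minimal sketch and not code from either reference: the function names softmax and scaled_dot_product_attention, and the use of plain Python lists for the query, key, and value vectors, are assumptions made for this example.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Attend over (key, value) pairs for one query vector.

    The compatibility function is the dot product of the query with each
    key, divided by the square root of the query/key dimension.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in keys
    ]
    weights = softmax(scores)
    value_dim = len(values[0])
    # The output is the weighted sum of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(value_dim)
    ]
    return output, weights
```

In this sketch the weights depend on the query, so the weighted sum over the values has to be recomputed for each new query.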

Re Claim 9, Van den Oord as modified by Shazeer further discloses wherein the determining of the output data of the attention layer comprises normalizing a result of the operation between the new query data and the key-value coupling data (see van den Oord: e.g., -- The pixel query vector is used to determine a soft attention query vector .alpha..sub.tj which may comprise a normalized set of scores each defining a respective matching between the pixel query vector q.sub.t and one of the supporting patches as represented by its key p.sub.j.sup.key.  A score e.sub.tj defining such a matching may be determined by a non-linear function of a combination of q.sub.t and p.sub.j.sup.key.--, in [0082]).

Re Claims 10, 21 {claim 21 is rejected for the same reasons as the rejections of claim 10 and claim 4, as discussed above}, and 27, Van den Oord as modified by Shazeer further discloses performing an inference operation using the neural network based on the output data of the attention layer, wherein the neural network includes additional trained layers (see van den Oord: e.g., -- In some implementations the system is first trained as described later and then parameters of the system are fixed.  The trained system may then be used to implement few-shot learning as a form of inference, inducing a representation of a probability density distribution in the system by presenting the previously trained system with one or a few new examples.  These new examples are received by the system as a support data set, for example as one or more new example images.--, in [0091]).

Re Claim 11, Van den Oord as modified by Shazeer further disclose outputting an image recognition result for the input data by applying the output data of the attention layer to the neural network (see van den Oord: e.g., -- Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.  Some neural networks include one or more hidden layers in addition to an output layer.  The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.  Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. [0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that, in some implementations, is capable of learning to generate a data item from just a few examples.  In broad terms the system is autoregressive in that it generates values of the data item dependent upon previously generated values of the data item.  However the system also employs a soft attention mechanism which enables attention-controlled context for the generated data item values.  Thus rather than the context for the data item generation being the same for all the generated item values it depends upon the item value being generated, and more particularly the context is controlled by the previous item values generated.--, in [0003]-[0004]).

Re Claim 12, claim 12 is the storage medium claim corresponding to claim 1. Thus, claim 12 is rejected for reasons similar to those for claim 1. Furthermore, Van den Oord as modified by Shazeer discloses a non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the method (see Oord: e.g., Fig. 1, and in [0102]).

Re Claim 13, van den Oord discloses a processor-implemented nonlocal filtering method (see van den Oord: e.g., -- In the context of a convolutional neural network layer operating on a data sequence this can be implemented, for example, by the use of one or more masks to mask input(s) to a convolution operation from data item values in a sequence following those at a current time or iteration step of the sequence.  Additionally or alternatively a causal convolutional may be implemented by applying a normal convolution then shifting the output by a number of time or iteration steps, in particular shifting the output forward by (filter length-1) steps prior to applying an activation function for the convolutional layer, where "filter length" is the length of the filter of the convolution that is being applied.--, in [0015], and, -- where W.sub.f,k is the main filter for the layer k, x is the layer input, * denotes a convolution, .circle-w/dot.  denotes element-wise multiplication, and W.sub.g,k is the gate filter for the layer k. Adding such a multiplicative function, i.e. the gate filter and activation, may assist the network to model more complex interactions.--, in [0064]), comprising:
obtaining key-value coupling data determined based on an operation between new key data determined using a first nonlinear transformation for key data corresponding to patches in an input image, and value data of representative pixels in the patches (see van den Oord: e.g., Fig. 2, and, --Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.--, in [0003]; --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016], --convolutional neural network layers 110 may have a gated activation function in place of a conventional activation function.  
In a gated activation function, the output of an element-wise non-linearity, i.e., of a conventional activation function, is element-wise multiplied by a gate vector that is generated by applying an element-wise non-linearity to the output of a convolution.--, in [0063]-[0069]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080], and, --the generated data item values may comprise pixel values of an image representing a predicted image frame which results from the agent performing the action.  
This additional data may be transformed to generate a latent feature vector, for example using one or more neural network layers such as one or more convolutional layers and/or an MLP (multilayer perceptron), and the convolutional neural network module 220 may be conditioned on the latent feature vector.--, in [0093]-[0094], and, --The CNN module 220 has 16 layers with 128-dimensional feature maps and skip connections each conditioned on the global context features and the upper 8 layers also conditioned on the attention-controlled context features.--, in [0098], and, --The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.  Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data--, in [0108]);
determining new query data by applying a second transformation to query data corresponding to a target patch among the patches (see van den Oord: e.g., --A soft attention mechanism may be provided to attend to one or more suitable patches for use in generating the current data item value.  Thus the soft attention mechanism may determine a set of weightings or scores for the support data patches, for example in the form of a soft attention query vector (e.g. .alpha..sub.ij later) dependent upon the previously generated values of the data item.  The soft attention query vector may then be used to query the memory for generating a value of the data item at a current iteration.  When generating the value of the data item at the current iteration one or more layers of the causal convolutional neural network may be conditioned upon the support data patches weighted by the scores.  The support data patches typically each comprise an encoding of supporting data for generating the data item, and the encodings may be combined weighted by the scores.--, in [0013]-[0014], and [0032]; also see: --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  
It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080]); 
Oord, however, does not explicitly disclose that the above second transformation is a nonlinear transformation.
Shazeer discloses determining new query data by applying a second nonlinear transformation to query data corresponding to a target patch among the patches (see Shazeer: e.g., --the sequence of transformations can include two or more learned linear transformations each separated by an activation function, e.g., a non-linear elementwise activation function--, in [0043]-[0044]; and, --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [0050] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230.  In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values.  The attention sub-layer then computes a weighted sum of the values in accordance with these weights.  Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.--, in [0049]-[0050]);
Van den Oord and Shazeer are combinable as they are in the same field of endeavor: using the key-value attention mechanism. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Oord’s method using Shazeer’s teachings by including determining new query data by applying a second nonlinear transformation to query data corresponding to a target patch among the patches in Oord’s generation of query data, in order to compute the dot products of the query with all of the keys and apply a softmax function over the scaled dot products to obtain the weights on the values (see Shazeer: e.g., in [0043]-[0044], and [0049]-[0050]);
Van den Oord as modified by Shazeer further disclose determining output data for denoising of a representative pixel in the target patch, based on an operation between the new query data and the key-value coupling data (see van den Oord: e.g., -- Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input.  Some neural networks include one or more hidden layers in addition to an output layer.  The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.  Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. [0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that, in some implementations, is capable of learning to generate a data item from just a few examples.  In broad terms the system is autoregressive in that it generates values of the data item dependent upon previously generated values of the data item.  However the system also employs a soft attention mechanism which enables attention-controlled context for the generated data item values.  Thus rather than the context for the data item generation being the same for all the generated item values it depends upon the item value being generated, and more particularly the context is controlled by the previous item values generated.--, in [0003]-[0004], and, --convolutional neural network layers 110 may have a gated activation function in place of a conventional activation function.  
In a gated activation function, the output of an element-wise non-linearity, i.e., of a conventional activation function, is element-wise multiplied by a gate vector that is generated by applying an element-wise non-linearity to the output of a convolution.--, in [0063]-[0069]; and, --the output from CNN 254 comprises a set of K.times.K.times.2P spatial feature maps and this provides a set of spatially indexed key and value vectors, p.sup.key and p.sup.value which together make up support memory 210.  A support memory constructed in this manner allows gradients to be backpropagated through the support memory to train parameters of the local support set encoder, CNN 254, by gradient descent. [0079] The similarity of a query vector to a key vector can be used to query the memory and the value vector provides a corresponding output.  The support memory effectively provides a mechanism for the CNN module to learn to use encoded patches (data regions) from the support data set when generating the target output x. To achieve this it is possible to use the same vector as both the key and value but using separate key and value vectors may provide additional flexibility.  It is not necessary for the set of feature maps to have the same dimensions in the horizontal and vertical directions within the support image(s).--, in [0078]-[0080]; also see Shazeer: e.g., --the sequence of transformations can include two or more learned linear transformations each separated by an activation function, e.g., a non-linear elementwise activation function--, in [0043]-[0044]; and, --an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, and values are all vectors.  The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. [0050] More specifically, each attention sub-layer applies a scaled dot-product attention mechanism 230. 
 In scaled dot-product attention, for a given query, the attention sub-layer computes the dot products of the query with all of the keys, divides each of the dot products by a scaling factor, e.g., by the square root of the dimensions of the queries and keys, and then applies a softmax function over the scaled dot products to obtain the weights on the values.  The attention sub-layer then computes a weighted sum of the values in accordance with these weights.  Thus, for scaled dot-product attention the compatibility function is the dot product and the output of the compatibility function is further scaled by the scaling factor.--, in [0049]-[0050]);
wherein the key-value-coupling data is fixed by being determined, independent of the query data, based on the operation between the new key data and the value data (see Oord: e.g., --the stored support data patches each have a support data patch key (p.sup.key).  The support data patch key may facilitate learning a score relating the value of the data item to be generated at an iteration to a supporting patch.  For example the soft attention mechanism may be configured to combine an encoding of the previously generated values of the data item (q.sub.t; upon which the current data item value depends), with the support data patch key for each of the support data patches, to determine the set of scores for the soft attention query vector.  The encoding of the previously generated values of the data item may comprise a set of features from a layer of the causal convolutional neural network.  In broad terms a set of scores links the generation of a current value for the data item, e.g. a current pixel or waveform value, with a set of keys identifying the best support data patches for generating the value.  The scores may be normalized.--, in [0016]; also see: --in which the model learns to perform a task and in which the model parameters may then be fixed whilst the model is conditioned on one or a few new examples to generate a target output.--, in [0068]-[0070]; also see: --a procedure for using the neural network system 200 of FIG. 2 for few-shot learning. In some implementations the system is first trained as described later and then parameters of the system are fixed. The trained system may then be used to implement few-shot learning as a form of inference, inducing a representation of a probability density distribution in the system by presenting the previously trained system with one or a few new examples. These new examples are received by the system as a support data set, for example as one or more new example images. 
In effect the system is trained to perform a task using the support data set, for example to copy the new example(s) or to process the new example(s) in some other way. The system then performs the same task on the new examples. The initial training can be considered a form of meta-learning.--, in [0091]; {Oord clearly discloses above that the model parameters may then be fixed, so that the model parameters, as the key-value-coupling data, are fixed}; and, also see Shazeer: e.g., -- the attention sub-layer packs the queries into a matrix Q, packs the keys into a matrix K, and packs the values into a matrix V. To pack a set of vectors into a matrix, the attention sub-layer can generate a matrix that includes the vectors as the columns of the matrix.--, in [0051]; and see Shazeer: e.g., -- The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.--, in [0049]-[0051], and, --apply a learned key linear transformation to each original key to generate a layer-specific key for each original key, and apply a learned value linear transformation to each original value to generate a layer-specific values for each original value.  The attention layer then applies the attention mechanism described above using these layer-specific queries, keys, and values to generate initial outputs for the attention layer.--, in [0058]).
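
For illustration only, the claim language "key-value coupling data ... determined, independent of the query data" can be read as in the following minimal sketch. It is one hypothetical reading of the limitation, not the Applicant's implementation and not code from Oord or Shazeer. The function names (feature_map, key_value_coupling, attend), the cosine/sine feature map, and the sum-based normalization are all assumptions made for this example.

```python
import math


def feature_map(vec):
    """Hypothetical nonlinear transformation: cosines then sines of each entry."""
    return [math.cos(x) for x in vec] + [math.sin(x) for x in vec]


def key_value_coupling(keys, values):
    """Aggregate the new keys with their values, without using any query.

    Returns the sum over all pairs of outer(new_key, value), plus the sum of
    the new keys, which this sketch uses later for normalization.
    """
    new_keys = [feature_map(k) for k in keys]
    feat_dim = len(new_keys[0])
    value_dim = len(values[0])
    coupling = [[0.0] * value_dim for _ in range(feat_dim)]
    key_sum = [0.0] * feat_dim
    for new_key, value in zip(new_keys, values):
        for a in range(feat_dim):
            key_sum[a] += new_key[a]
            for b in range(value_dim):
                coupling[a][b] += new_key[a] * value[b]
    return coupling, key_sum


def attend(query, coupling, key_sum):
    """Combine a new query with the precomputed coupling and normalize.

    The normalizer can be zero or negative for some inputs with this
    feature map; the sketch does not guard against that.
    """
    new_query = feature_map(query)
    value_dim = len(coupling[0])
    numerator = [
        sum(new_query[a] * coupling[a][b] for a in range(len(new_query)))
        for b in range(value_dim)
    ]
    denominator = sum(q * s for q, s in zip(new_query, key_sum))
    return [n / denominator for n in numerator]
```

Under these assumptions, key_value_coupling is called once for a given set of key-value pairs and its result is reused unchanged for every query passed to attend. With this particular feature map, the product of a new query and a new key equals the sum of cos(q_i - k_i) over the vector entries, which is one possible similarity measure between the original query and key.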

Re Claim 20, Oord as modified by Shazeer further discloses denoising the representative pixel in the target patch based on the output data (see Oord: e.g., -- for image processing tasks such as de-noising, de-blurring, image completion and the like by employing additional data defining a noisy or incomplete image; for image modification tasks by employing additional data defining a modified image--, in [0010]).

Re Claim 22, Oord as modified by Shazeer further discloses wherein the at least one layer is a respective attention layer that performs a corresponding attention mechanism (see Shazeer: e.g., Fig. 2A, -- techniques allow images to effectively be generated by an attention-based neural network by (i) effectively representing the images that are processed by the neural network and (ii) modifying the self-attention scheme applied the self-attention layers in the neural network.  Because of this, the neural network used to generate the image generates high-quality images and is computationally efficient even when generating large images--, in [0006]-[0007], -- FIG. 2A is a diagram showing attention mechanisms that are applied by the attention sub-layers in the subnetworks of the decoder neural network.--, in [0010]).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over van den Oord as modified by Shazeer, and in view of Chen (US 20180350034 A1).
Re Claim 14, van den Oord as modified by Shazeer, however, does not explicitly disclose that the representative pixels in the patches are center pixels in the patches, and the representative pixel in the target patch is a center pixel in the target patch.
Chen teaches the representative pixels in the patches are center pixels in the patches, and the representative pixel in the target patch is a center pixel in the target patch (see Chen: e.g., -- In this coordinate system, the center of each pixel is defined as the index of the pixel.  The index values of the pixels in target image 700 are set as the same as the index values of the pixels in interior image 205.  As such, the range of index values for pixels in target image 700 is from (-(Ci-1)/2, -(Ci-1)/2) to ((Ci-1)/2, (Ci-1)/2), which is (-7.5, -7.5) to (7.5, 7.5) in this example.  For a given pixel of target image 700 with index values (X,Y), the coordinate of the center of the pixel is (X*P.sub.W, Y*P.sub.W), where P.sub.W is determined using the equation (17) described above. --, in [0060], also see: -- transform the source image into a different source image.  For example, file manager 110 may modify the size of the pixels of each of the exterior images to be the same size as the pixels of the interior image of the source image.  FIG. 3 illustrates a representation of a transformed source image according to some embodiments.  Specifically, FIG. 3 illustrates source image 300, which is a transformed source image of source image 200.  As shown, source image 300 includes interior image 205 and four pixel groups 310-325.  For this example, the size of the pixels in the pixel groups 310-325 is the same as the size of the pixels of interior image 205.--, in [0039]);
Van den Oord (as modified by Shazeer) and Chen are combinable as they are in the same field of endeavor: image data transformation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the method of Oord (as modified by Shazeer) using Chen’s teachings, by including in the pixel data transformation of Oord (as modified by Shazeer) {see Oord’s Fig. 2, from support image to target image} that the representative pixels in the patches are center pixels in the patches, and the representative pixel in the target patch is a center pixel in the target patch, in order to determine the values of the pixels of the target image (see Chen: e.g. in [0039], and [0060]-[0063]).

Conclusion
Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEI WEN YANG whose telephone number is (571)270-5670.  The examiner can normally be reached on 8:00 - 5:00 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached on 571-272-7778.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/WEI WEN YANG/Primary Examiner, Art Unit 2667