DETAILED ACTION
This action is in response to the Applicant Response filed 19 April 2022 for application 16/192,649 filed 15 November 2018.
Claims 1, 15, 18-19 are currently amended.
Claim 20 is new.
Claims 1-20 are pending.
Claims 1-20 are rejected.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments regarding the objections to the specification have been fully considered and, in light of the amendments to the specification, are persuasive.

Applicant’s arguments regarding the objections to the claims have been fully considered and, in light of the amendments to the claims, are persuasive. However, in light of the amendments to the claims, new claim objections have arisen as noted below.

Applicant’s arguments with respect to the 35 U.S.C. 103 rejections of claims 1-19 have been fully considered but are moot because the arguments do not apply to any of the references being used in the current rejections.

Claim Objections
Claims 1-20 are objected to because of the following informalities:
Claim 1, line 5, “object is associated a first” should read “object is associated with a first”
Claim 18, line 6, “object is associated a first” should read “object is associated with a first”
Claim 19, line 7, “object is associated a first” should read “object is associated with a first”
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Jaderberg et al. (Spatial Transformer Networks, hereinafter referred to as "Jaderberg") in view of Vaswani et al. (Attention Is All You Need, hereinafter referred to as "Vaswani") and further in view of He et al. (Channel Pruning for Accelerating Very Deep Neural Networks, hereinafter referred to as “He”).

Regarding claim 1 (Currently Amended), Jaderberg teaches a method comprising, by one or more computing systems (Jaderberg, section 3.3 – teaches GPU implementation; Jaderberg, section 4 – teaches experiments with results using real-world data sets): 
training a baseline machine-learning model based on a neural network (Jaderberg, section 3.4 – teaches the self-contained spatial transformer modules can be dropped into a CNN architecture at any point and in any number; Jaderberg, section 4.2 – teaches training a baseline character sequence CNN; see also Jaderberg, Abstract, section 1, section 4.3) comprising a plurality of stages, wherein each stage comprises a plurality of neural blocks (Jaderberg, section 3.4 – teaches the self-contained spatial transformer modules can be dropped into a CNN architecture at any point and in any number; Jaderberg, section 4.2 – teaches training a baseline character sequence CNN and extending the baseline CNN by including spatial transformers before each of the first four convolutional layers [between stages comprising neural blocks, e.g., convolutional layers and pooling layers]; see also Jaderberg, Abstract, section 1, section 4.3); 
accessing a plurality of training samples comprising a plurality of content objects, respectively (Jaderberg, section 1 - teaches the action of the spatial transformers is conditioned on individual data samples with the appropriate behavior learnt during training for the task in question; Jaderberg, section 4.2 - teaches using real-world dataset of Street View House Numbers; see also Jaderberg, section 4.1, 4.3), wherein each content object of the plurality of content objects is associated a first number of channels (Jaderberg, section 3.1 – teaches input feature map to the spatial transformer having a given number of channels [Therefore, the training sample input was associated with a first number of channels]); 
determining one or more non-local operations (Jaderberg, section 3 – teaches the three parts of the spatial transformer module; see also Jaderberg, Fig. 2) ...; 
generating one or more non-local blocks based on the plurality of training samples (Jaderberg, section 1 - teaches the action of the spatial transformers is conditioned on individual data samples with the appropriate behavior learnt during training for the task in question; Jaderberg, section 4.2 - teaches using real-world dataset of Street View House Numbers; see also Jaderberg, sections 3, 4.1, 4.3) and the one or more non-local operations (Jaderberg, section 3 – teaches the three parts of the spatial transformer module; see also Jaderberg, Fig. 2)...; 
determining a stage from the plurality of stages of the neural network (Jaderberg, section 3.4 – teaches the self-contained spatial transformer modules can be dropped into a CNN architecture at any point and in any number; Jaderberg, section 4.2 – teaches training a baseline character sequence CNN and extending the baseline CNN by including spatial transformers before each of the first four convolutional layers [between neural blocks]; see also Jaderberg, Abstract, section 1, section 4.3); and 
training a non-local machine-learning model (Jaderberg, section 1 - teaches the action of the spatial transformers is conditioned on individual data samples with the appropriate behavior learnt during training for the task in question; Jaderberg, section 3.4 - teaches training the spatial transformer; Jaderberg, section 4.2 - teaches training the model using house numbers dataset; see also Jaderberg sections 4.1, 4.3) by inserting each of the one or more non-local blocks in between at least two of the plurality of neural blocks in the determined stage of the neural network (Jaderberg, section 3.4 – teaches the self-contained spatial transformer modules can be dropped into a CNN architecture at any point and in any number; Jaderberg, section 4.2 – teaches training a baseline character sequence CNN and extending the baseline CNN by including spatial transformers before each of the first four convolutional layers [between neural blocks]; see also Jaderberg, Abstract, section 1, section 4.3).
While Jaderberg teaches inserting spatial transformers between convolutional layers, i.e., stages, Jaderberg does not explicitly teach wherein each non-local operation is based on one or more pairwise functions and one or more unary functions, wherein each of the one or more non-local operations is associated with a respective plurality of weight matrices. Further, while Jaderberg teaches generating a non-local block with inputs having channels, Jaderberg does not explicitly teach wherein the generation comprises reducing a number of channels associated with each weight matrix of the plurality of weight matrices to be less than the first number of channels.
Vaswani teaches 
determining one or more non-local operations (Vaswani, section 3.2 – teaches self-attention output is computed as a weighted sum of the values [unary] where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [pairwise]; [Specification of instant application states – “The non-local machine learning model may be related to the recent self-attention method (i.e., a conventional work) for machine translation. A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. Self-attention may be viewed as a form of the non-local mean ...” (¶0078)]), wherein each non-local operation is based on one or more pairwise functions and one or more unary functions (Vaswani, section 3.2 – teaches self-attention output is computed as a weighted sum of the values [unary] where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [pairwise]; Vaswani, Fig. 2, section 3.2.1 – teaches scaled dot product attention), wherein each of the one or more non-local operations is associated with a respective plurality of weight matrices (Vaswani, section 3.2.2 – teaches each of the non-local operations comprising a plurality of weight matrices); 
generating one or more non-local blocks based on the plurality of training samples (Vaswani, section 5 – teaches training on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs) and the one or more non-local operations (Vaswani, section 3.2 – teaches self-attention output is computed as a weighted sum of the values [unary] where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [pairwise]; Vaswani, Fig. 2, section 3.2.1 – teaches scaled dot product attention)...
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to modify Jaderberg with the teachings of Vaswani in order to improve existing models while reducing training costs of the best known existing models in the field of adding non-local function blocks, such as transformer modules, to existing models (Vaswani, Abstract – “The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.”).
While Jaderberg in view of Vaswani teaches generating a non-local block with inputs having channels, Jaderberg in view of Vaswani does not explicitly teach wherein the generation comprises reducing a number of channels associated with each weight matrix of the plurality of weight matrices to be less than the first number of channels.
He teaches generating one or more non-local blocks ... wherein the generation comprises reducing a number of channels associated with each weight matrix of the plurality of weight matrices to be less than the first number of channels (He, Fig. 2, section 3.1 – teaches using a channel reduction module [non-local block] to reduce the number of channels wherein the channels associated with the weight matrices are less than that of the input).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to modify Jaderberg in view of Vaswani with the teachings of He in order to accelerate DNNs using channel reduction while maintaining similar accuracy in the field of adding non-local function blocks to existing models (He, Abstract – “In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhance the compatibility with various architectures. Our pruned VGG-16 achieves the state-of-the-art results by 5× speed-up along with only 0.3% increase of error. More importantly, our method is able to accelerate modern networks like ResNet, Xception and suffers only 1.4%, 1.0% accuracy loss under 2× speedup respectively, which is significant.”).
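For illustration only, the channel-reduction limitation discussed above can be pictured with a small sketch. It is not a characterization of He's pruning algorithm or of the applicant's implementation. It assumes that each weight matrix acts as a linear projection from the first number of channels to a smaller number of channels. The names `theta`, `phi`, and `g`, the halving factor, and the toy sizes are hypothetical choices made for this example.

```python
import random


def make_weight(c_in, c_out, rng):
    """Build a random (c_in x c_out) weight matrix as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(c_out)] for _ in range(c_in)]


def project(features, weight):
    """Multiply a (positions x c_in) feature list by a (c_in x c_out) weight matrix."""
    c_in, c_out = len(weight), len(weight[0])
    return [
        [sum(row[i] * weight[i][o] for i in range(c_in)) for o in range(c_out)]
        for row in features
    ]


rng = random.Random(0)
first_channels = 8                       # channels of the input (assumed toy value)
reduced_channels = first_channels // 2   # assumed reduction factor of 2

# Five positions, each carrying `first_channels` feature values.
features = [[rng.uniform(-1.0, 1.0) for _ in range(first_channels)] for _ in range(5)]

# Hypothetical weight matrices of the block; each maps to fewer channels than the input.
weights = {
    name: make_weight(first_channels, reduced_channels, rng)
    for name in ("theta", "phi", "g")
}
embedded = {name: project(features, w) for name, w in weights.items()}
output_channels = {name: len(rows[0]) for name, rows in embedded.items()}
```

In this sketch every projected feature has `reduced_channels` entries, which is fewer than `first_channels`, so the pairwise and unary computations that follow operate on smaller matrices.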

Regarding claim 2 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. Jaderberg further teaches wherein the neural network comprises one or more of a convolutional neural network or a recurrent neural network (Jaderberg, section 3.4 – teaches the self-contained spatial transformer modules can be dropped into a CNN architecture at any point and in any number; see also, Jaderberg, sections 4.1, 4.2, 4.3 – examples using CNN baseline networks).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 3 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. Jaderberg teaches wherein each of the plurality of content objects comprises one or more of a text, an audio clip, an image, or a video (Jaderberg, section 1 – teaches digit images; Jaderberg, section 4.2 – images of street view house numbers; see also Jaderberg section 4.1, 4.3).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 4 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. Jaderberg further teaches wherein the neural network is based on one or more of a two-dimensional architecture or a three-dimensional architecture (Jaderberg, section 3 – teaches CNN with 2-D kernel; see also Jaderberg, Appendix A.3 – teaches extended 2-D transformations to 3-D).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 5 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. Jaderberg further teaches generating a plurality of feature representations for the plurality content objects based on the baseline machine-learning model, respectively (Jaderberg, section 3, Fig. 2 – teaches the spatial transformer taking as an input feature map [for each content object contained in each image] generated by the baseline CNN; see also Jaderberg, sections 1, 4.1, 4.2, 4.3 – examples of content objects).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 6 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 5 as noted above. Jaderberg further teaches wherein generating each of the one or more non-local blocks comprises: 
applying each of the one or more non-local operations to the feature representation of one of the plurality of content objects (Jaderberg, section 3, Fig. 2 – teaches the spatial transformer taking as an input feature map [for each content object contained in each image] generated by the baseline CNN; see also Jaderberg, sections 1, 4.1, 4.2, 4.3 – examples of content objects).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 7 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 5 as noted above. Jaderberg further teaches determining, for each of the plurality of content objects, an output position and a plurality of positions associated with the output position (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location [plurality of positions associated with the output position] in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 8 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 7 as noted above. Jaderberg further teaches wherein the output position is in one or more of space, time, or spacetime (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region [output position and associated positions] in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions; [output bounding regions are spatial]).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 7 above.

Regarding claim 9 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 7 as noted above. Vaswani further teaches wherein each of the one or more non-local operations is based on a function $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$, and wherein:
$x_i$ indicates the feature representation at the output position; 
$x_j$ indicates the feature representation at one of the plurality of positions; 
$y_i$ indicates an output response at the output position; 
$f(x_i, x_j)$ indicates the pairwise function; 
$g(x_j)$ indicates the unary function; and 
$C(x)$ indicates a normalization factor (Vaswani, section 3.2 – teaches self-attention output is computed as a weighted sum of the values [unary] where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [pairwise]; Vaswani, Equation 1, Fig. 2 (see below), section 3.2.1 – teaches scaled dot product attention represented by $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ where $\mathrm{softmax}(QK^T)$ represents the normalized pairwise function of query across all keys, and $V$ represents the unary function).

[Image: media_image1.png]

The specification of the instant application (¶0090) discloses how the softmax function is an example of the normalized pairwise function and the self-attention model of Vaswani is an example of the function claimed in claim 9:

[Image: media_image2.png]

It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to implement a non-local operation based on a pairwise function and a unary function because it creates superior models while being more parallelizable and requiring less training time (Vaswani, Abstract).
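The correspondence relied on for claim 9 can be checked numerically with a minimal sketch. It assumes the Gaussian pairwise function f(x_i, x_j) = e^(x_i^T x_j), an identity unary function g(x_j) = x_j, and a normalization factor C(x) equal to the sum of f over all positions j. It also assumes Q = K = V = x and omits the 1/sqrt(d_k) scaling of Vaswani, Equation 1. The helper names and toy feature values are hypothetical and are not taken from either reference.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def non_local(x, f, g):
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j), with C(x) = sum_j f(x_i, x_j)."""
    out = []
    for xi in x:
        weights = [f(xi, xj) for xj in x]   # pairwise term for every position j
        c = sum(weights)                    # normalization factor C(x)
        values = [g(xj) for xj in x]        # unary term for every position j
        dim = len(values[0])
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / c for d in range(dim)]
        )
    return out


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(r - m) for r in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Compute softmax(Q K^T) V (unscaled, an assumption made for this sketch)."""
    out = []
    for qi in q:
        w = softmax([dot(qi, kj) for kj in k])
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))])
    return out


x = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6]]        # toy features at three positions


def gaussian(xi, xj):
    return math.exp(dot(xi, xj))                  # f(x_i, x_j) = e^(x_i^T x_j)


def identity(xj):
    return xj                                     # g(x_j) = x_j, chosen for simplicity


y_non_local = non_local(x, gaussian, identity)
y_attention = attention(x, x, x)                  # Q = K = V = x
```

Under these assumptions the two results agree to floating-point precision, because dividing e^(x_i^T x_j) by its sum over j is exactly the softmax over j.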

Regarding claim 10 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 9 as noted above. Vaswani further teaches where the pairwise function is based on one or more of: 
a Gaussian function $f(x_i, x_j) = e^{x_i^T x_j}$; 
an embedded Gaussian function $f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$, wherein $\theta$ is an embedding for $x_i$ and $\phi$ is an embedding for $x_j$ (see explanation below);
a dot product function $f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$; or
a concatenation function $f(x_i, x_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)])$, wherein $\mathrm{ReLU}$ indicates a function of a rectified linear unit, and wherein $w_f$ is a weight vector projecting a concatenated vector of $\theta(x_i)$ and $\phi(x_j)$ to a scalar.
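For illustration only, the four recited pairwise functions can be sketched as follows. This is a minimal sketch, not the claimed or cited implementation: it assumes the embeddings θ and φ are linear maps given by hypothetical matrices `W_theta` and `W_phi`, and the helper names `dot` and `matvec` are likewise made up for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def gaussian(x_i, x_j):
    # f(x_i, x_j) = exp(x_i^T x_j)
    return math.exp(dot(x_i, x_j))

def embedded_gaussian(x_i, x_j, W_theta, W_phi):
    # f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); theta and phi assumed linear
    return math.exp(dot(matvec(W_theta, x_i), matvec(W_phi, x_j)))

def dot_product(x_i, x_j, W_theta, W_phi):
    # f(x_i, x_j) = theta(x_i)^T phi(x_j)
    return dot(matvec(W_theta, x_i), matvec(W_phi, x_j))

def concatenation(x_i, x_j, W_theta, W_phi, w_f):
    # f(x_i, x_j) = ReLU(w_f^T [theta(x_i), phi(x_j)])
    concat = matvec(W_theta, x_i) + matvec(W_phi, x_j)  # list concatenation
    return max(0.0, dot(w_f, concat))

# Toy inputs, chosen only for illustration.
x_i, x_j = [1.0, 0.0], [0.5, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

With identity embeddings the embedded Gaussian reduces to the plain Gaussian, which is why the two are recited as closely related alternatives.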
As discussed above with respect to claim 9, Vaswani teaches the scaled dot product attention represented by $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, which is an example of a non-local operation based on a normalized pairwise function and a unary function.
The following is an excerpt from the specification (¶0090) of the instant application explaining how the normalized pairwise function of the softmax is interpreted as the embedded Gaussian function:

[Image media_image3.png: excerpt from ¶0090 of the specification]
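
The correspondence between the softmax and a normalized embedded Gaussian can be checked numerically. The following is a minimal sketch under stated assumptions: the query `q` stands in for θ(x_i), each key for φ(x_j), each value for the unary function g(x_j), and the 1/√d_k scaling is treated as absorbed into the embeddings. The function names are hypothetical and are not taken from Vaswani or from the specification.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(q, K, V):
    """One output row of softmax(q K^T / sqrt(d_k)) V."""
    d_k = len(q)
    weights = softmax([dot(q, k) / math.sqrt(d_k) for k in K])
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]

def non_local_row(q, K, V):
    """y_i = (1 / C) * sum_j f_j * g_j, with f_j = exp(q . k_j / sqrt(d_k))."""
    d_k = len(q)
    f = [math.exp(dot(q, k) / math.sqrt(d_k)) for k in K]
    C = sum(f)  # normalization factor: sum of the pairwise terms over all j
    return [sum(f_j * v[c] for f_j, v in zip(f, V)) / C for c in range(len(V[0]))]

# Toy values, chosen only for illustration.
q = [1.0, 2.0]
K = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
row_a = attention_row(q, K, V)
row_b = non_local_row(q, K, V)
```

Both functions return the same row up to floating-point error, because dividing each exponential by the sum of exponentials is exactly what the softmax does.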

It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to implement a non-local operation based on a pairwise function, including an embedded Gaussian function, and a unary function because it creates superior models while being more parallelizable and requiring less training time (Vaswani, Abstract).

Regarding claim 11 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 5 as noted above. Vaswani further teaches generating, for each of the plurality of content objects, a subsampled content object by applying subsampling to the feature representation of the content object, wherein the subsampled content object is associated with a subsampled feature representation (Vaswani, section 3.2.2 – teaches subsampling by linearly projecting the queries, keys and values a given number of times with different linear projections, performing the attention module in parallel on each projection, and combining the results in order to jointly attend to information from different representation subspaces at different positions).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to subsample the content object because it allows for jointly handling information in parallel from different representations at different positions without increasing computational costs (Vaswani, Abstract).
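
The multi-head mechanism cited from Vaswani, section 3.2.2, can be sketched as below. This is a minimal illustration that assumes self-attention over a single input matrix `X`; the projection matrices in `heads` and the output projection `W_o` are hypothetical toy values. The sketch shows only the project, attend, and concatenate steps and takes no position on how those steps map to the claim language.

```python
import math

def matmul(A, B):
    """Matrix product of A (n x m) and B (m x p), both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def softmax_rows(S):
    """Row-wise, numerically stable softmax."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(Q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)

def multi_head(X, heads, W_o):
    """Project X once per head, attend per head, concatenate, project with W_o."""
    outs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
            for W_q, W_k, W_v in heads]
    concat = [sum((o[i] for o in outs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Toy setup: 3 positions, model width 4, two heads of width 2 (illustrative only).
X = [[1.0, 2.0, 3.0, 4.0]] * 3
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half, first_half, first_half), (second_half, second_half, second_half)]
W_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
Y = multi_head(X, heads, W_o)
```

Each head operates on a lower-dimensional projection of the input, which is how the total cost stays comparable to that of a single full-width attention.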

Regarding claim 12 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 11 as noted above. Jaderberg further teaches wherein the subsampling (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions) comprises pooling, the pooling comprises one or more of max pooling or average pooling (Jaderberg, Appendix A.4 – teaches average pooling after the spatial transformers; Jaderberg, Appendix A.5 – teaches max-pooling layers).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 11 above.
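
As a concrete illustration of the two pooling variants recited in claim 12, the following minimal sketch pools a 2-D feature map over non-overlapping windows. The function name `pool2d`, the window size, and the toy feature map are assumptions made for the example and are not drawn from Jaderberg.

```python
def pool2d(fmap, size, mode="max"):
    """Pool a 2-D feature map (list of rows) over non-overlapping size x size windows."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [fmap[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

# Toy 4 x 4 feature map, chosen only for illustration.
fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
max_pooled = pool2d(fmap, 2, "max")
avg_pooled = pool2d(fmap, 2, "avg")
```

Either variant reduces the 4 x 4 map to 2 x 2, so later operations see one quarter as many positions.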

Regarding claim 13 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 11 as noted above. Vaswani further teaches wherein generating each of the one or more non-local blocks comprises: 
applying each of the one or more non-local operations to the feature representation of one of the plurality of content objects and the subsampled feature representation of the subsampled content object corresponding to the content object (Vaswani, section 3.2.2 – teaches subsampling by linearly projecting the queries, keys and values a given number of times with different linear projections, performing the attention module in parallel on each projection, and combining the results in order to jointly attend to information from different representation subspaces at different positions).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to subsample the content object because it allows for jointly handling information in parallel from different representations at different positions without increasing computational costs (Vaswani, Abstract).

Regarding claim 14 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 11 as noted above. Jaderberg further teaches 
determining, for each of the plurality of content objects, an output position (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions); and 
determining, for each of the plurality of subsampled content objects corresponding to the content object (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location [plurality of positions associated with the output position] in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions; see also Figures 1,3-5 and Tables 1-5 – examples of subsampling the image to a region), a plurality of positions associated with the output position (Jaderberg, section 1 – teaches a spatial transformer can crop out and scale normalize the appropriate region in an image [content object] for classification; see also Jaderberg, section 3.2 – teaches that to perform a warping of the input feature map each output pixel [output position] is computed by applying a sampling kernel centered at a particular location [plurality of positions associated with the output position] in the input feature map; see also Jaderberg, sections 4.1-4.3 – teaches output bounding regions).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 11 above.

Regarding claim 15 (Currently Amended), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 14 as noted above. Vaswani further teaches wherein each of the one or more non-local operations is based on a function $y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j)$, and wherein:
$x_i$ indicates the feature representation at the output position;
$\hat{x}_j$ indicates the subsampled feature representation at one of the plurality of positions;
$y_i$ indicates an output response at the output position;
$f(x_i, \hat{x}_j)$ indicates the pairwise function;
$g(\hat{x}_j)$ indicates the unary function; and
$C(\hat{x})$ indicates a normalization factor.
As noted above with respect to claim 9, Vaswani teaches the non-local function comprising a pairwise function and a unary function (Vaswani, section 3.2). Additionally, Vaswani teaches subsampling by linearly projecting the queries, keys and values a given number of times with different linear projections, performing the attention module in parallel on each projection, and combining the results in order to jointly attend to information from different representation subspaces at different positions (Vaswani, section 3.2.2). Because, for each of the attention modules in the multi-head attention block, Vaswani teaches performing the functions of claim 9 on a subsample of the content object, Vaswani teaches the limitations of claim 15 as well.
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to subsample the content object because it allows for jointly handling information in parallel from different representations at different positions without increasing computational costs (Vaswani, Abstract).
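
The function recited in claim 15 can be illustrated with a short sketch. Several choices below are assumptions made only for the example: the subsampling is max pooling over pairs of adjacent positions, the pairwise function f is the Gaussian form, the unary function g is the identity, and the normalization factor C is the sum of the pairwise terms. None of these choices is asserted to be what the cited references or the application use.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def subsample(X, stride=2):
    """Max-pool feature vectors over non-overlapping groups of `stride` positions."""
    return [[max(col) for col in zip(*X[s:s + stride])]
            for s in range(0, len(X), stride)]

def non_local(X, X_hat):
    """y_i = (1 / C) * sum_j f(x_i, xhat_j) * g(xhat_j), with g the identity."""
    Y = []
    for x_i in X:
        f = [math.exp(dot(x_i, xh)) for xh in X_hat]  # Gaussian pairwise terms
        C = sum(f)                                    # normalization factor
        Y.append([sum(f_j * xh[c] for f_j, xh in zip(f, X_hat)) / C
                  for c in range(len(X_hat[0]))])
    return Y

# Toy features at four positions, chosen only for illustration.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
X_hat = subsample(X)
Y = non_local(X, X_hat)
```

The output keeps one response per original position, while the sum over j runs over only the subsampled positions, which halves the number of pairwise terms in this example.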

Regarding claim 16 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 15 as noted above. Vaswani further teaches where the pairwise function is based on one or more of: 
a Gaussian function $f(x_i, \hat{x}_j) = e^{x_i^T \hat{x}_j}$;
an embedded Gaussian function $f(x_i, \hat{x}_j) = e^{\theta(x_i)^T \phi(\hat{x}_j)}$, wherein $\theta$ is an embedding for $x_i$ and $\phi$ is an embedding for $\hat{x}_j$;
a dot product function $f(x_i, \hat{x}_j) = \theta(x_i)^T \phi(\hat{x}_j)$; or
a concatenation function $f(x_i, \hat{x}_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(\hat{x}_j)])$, wherein $\mathrm{ReLU}$ indicates a function of a rectified linear unit, and wherein $w_f$ is a weight vector projecting a concatenated vector of $\theta(x_i)$ and $\phi(\hat{x}_j)$ to a scalar.
As noted above with respect to claim 10, Vaswani teaches the pairwise function based on an embedded Gaussian function (Vaswani, section 3.2). Additionally, Vaswani teaches subsampling by linearly projecting the queries, keys and values a given number of times with different linear projections, performing the attention module in parallel on each projection, and combining the results in order to jointly attend to information from different representation subspaces at different positions (Vaswani, section 3.2.2). Because Vaswani teaches performing the functions of claim 10 on a subsample of the content object for each of the attention modules in the multi-head attention block, Vaswani teaches the limitations of claim 16 as well.
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to subsample the content object because it allows for jointly handling information in parallel from different representations at different positions without increasing computational costs (Vaswani, Abstract).
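
For illustration only, the four pairwise functions recited in claim 16 can be sketched as follows. This is a minimal sketch and is not taken from the application or from any cited reference: it assumes that the embeddings θ and ϕ are linear maps given by the hypothetical matrices `W_theta` and `W_phi`, and the function names and the input name `x_hat_j` (a subsampled position) are likewise hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Apply a matrix (list of rows) to a vector."""
    return [dot(row, x) for row in W]


def gaussian(x_i, x_hat_j):
    """f = exp(x_i^T x_hat_j)."""
    return math.exp(dot(x_i, x_hat_j))


def embedded_gaussian(x_i, x_hat_j, W_theta, W_phi):
    """f = exp(theta(x_i)^T phi(x_hat_j)); linear embeddings are an assumption."""
    return math.exp(dot(matvec(W_theta, x_i), matvec(W_phi, x_hat_j)))


def dot_product(x_i, x_hat_j, W_theta, W_phi):
    """f = theta(x_i)^T phi(x_hat_j)."""
    return dot(matvec(W_theta, x_i), matvec(W_phi, x_hat_j))


def concatenation(x_i, x_hat_j, W_theta, W_phi, w_f):
    """f = ReLU(w_f^T [theta(x_i), phi(x_hat_j)])."""
    concatenated = matvec(W_theta, x_i) + matvec(W_phi, x_hat_j)  # list concatenation
    return max(0.0, dot(w_f, concatenated))  # ReLU clamps negative scalars to zero


# Hypothetical inputs: identity embeddings and a hand-picked weight vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_i, x_hat_j = [1.0, 0.0], [0.5, 2.0]
w_f = [1.0, -1.0, 1.0, -1.0]
```

With identity embeddings the embedded Gaussian function reduces to the Gaussian function, which is one way to see that the two differ only in whether the inputs are embedded before the inner product is taken.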

Regarding claim 17 (Original), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. Jaderberg further teaches 
receiving a querying content object (Jaderberg, section 1 – teaches taking as input a digit image; see also Jaderberg, sections 4.1-4.3 – taking in handwriting images, street number images, and bird images); and 
determining a category for the querying content object based on the non-local machine-learning model (Jaderberg, section 1 – teaches taking as input a digit image and classifying the digit; see also Jaderberg, sections 4.1-4.3 – taking in handwriting images, street number images, and bird images and classifying the images).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 18 (Currently Amended), it is the computer-readable storage media embodiment of claim 1 with similar limitations to claim 1 and is rejected under the same reasoning found in claim 1. Jaderberg further teaches one or more computer-readable non-transitory storage media embodying software that is operable when executed (Jaderberg, section 3.3 – teaches GPU implementation; Jaderberg, section 4 – teaches experiments with results using real-world data sets) ...
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 19 (Currently Amended), it is the system embodiment of claim 1 with similar limitations to claim 1 and is rejected under the same reasoning found in claim 1. Jaderberg further teaches a system comprising: 
one or more processors (Jaderberg, section 3.3 – teaches GPU implementation); and 
a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions (Jaderberg, section 4 – teaches experiments with results using real-world data sets) ...
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He for the same reasons as disclosed in claim 1 above.

Regarding claim 20 (New), Jaderberg in view of Vaswani and further in view of He teaches all of the limitations of the method of claim 1 as noted above. He further teaches wherein reducing the number of channels associated with each weight matrix of the plurality of weight matrices to be less than the first number of channels comprises reducing the number of channels associated with each weight matrix of the plurality of weight matrices to be half of the first number of channels (He, Fig. 2, section 3.1 – teaches reducing the number of channels wherein the channels associated with the weight matrices are less than that of the input, where the reduced number of channels for the weight matrices is manually set [Manually setting the number of channels means that any number can be chosen, including half. See He, section 4.1.2 where the chosen mapping was 1:1.15]).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Jaderberg, Vaswani and He in order to reduce the number of channels to maintain accuracy while accelerating the processing of DNNs (He, Abstract).
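
For illustration only, the dimensional effect of the claim 20 limitation can be sketched as follows. This is a minimal sketch and is not He's channel-reduction method: it assumes a hypothetical weight matrix that averages adjacent channel pairs, chosen solely to show a weight matrix whose output has half of the first number of channels. The names `halving_weight_matrix` and `project` are hypothetical.

```python
def halving_weight_matrix(first_num_channels):
    """Build a (C/2) x C weight matrix; pair-averaging weights are an assumption."""
    if first_num_channels % 2 != 0:
        raise ValueError("first_num_channels must be even to halve it exactly")
    reduced = first_num_channels // 2
    return [
        [0.5 if c // 2 == r else 0.0 for c in range(first_num_channels)]
        for r in range(reduced)
    ]


def project(W, x):
    """Apply the weight matrix to the channel vector at one position."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Hypothetical feature vector with a first number of channels equal to 4.
W = halving_weight_matrix(4)
reduced_features = project(W, [1.0, 3.0, 2.0, 6.0])  # 2 output channels
```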

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARSHALL WERNER whose telephone number is (469) 295-9143. The examiner can normally be reached Monday – Thursday, 7:30 AM – 4:30 PM ET.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/MARSHALL L WERNER/
Examiner, Art Unit 2125

/BRIAN M SMITH/
Primary Examiner, Art Unit 2122