DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's “Reply to Office Action dated 06/23/2020,” filed on 16 September 2020 [hereinafter Response] has been entered, where:
Claims 1, 27, and 31 have been amended.
Claims 12, 14, 16, and 21 have been cancelled.
Claims 1-11, 13, 15, 17-20, and 22-31 are pending.
Claims 1-11, 13, 15, 17-20, and 22-31 are rejected.
Claim Rejections - 35 U.S.C. § 103
3.	The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
4.	The factual inquiries for determining obviousness under 35 U.S.C. § 103 are summarized as follows:
1.	Determining the scope and contents of the prior art.
2.	Ascertaining the differences between the prior art and the claims at issue.
3.	Resolving the level of ordinary skill in the pertinent art.
4. 	Considering objective evidence present in the application indicating obviousness or nonobviousness.
5.	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
6.	Claims 1-3, 9-11, 13, 23, 27-29, and 31 are rejected under 35 U.S.C. § 103 as being unpatentable over Battaglia et al., “Interaction Networks for Learning about Objects, Relations and Physics,” (2016) [hereinafter Battaglia] in view of Vinyals et al., “StarCraft II: A new Challenge for Reinforcement Learning,” (2017) [hereinafter Vinyals], and Choi et al., “Multi-Focus Attention Network for Efficient Deep Reinforcement Learning,” AAAI (2017) [hereinafter Choi].
Regarding claim 1, Battaglia teaches [a] computer-implemented neural network system (Battaglia, page 4, Section 2, second full paragraph, teaches [w]e use standard deep neural network building blocks (neural network system) . . . ) for reinforcement learning, wherein the neural network system is used to control an agent interacting with an environment to perform a task in an attempt to achieve a specified result (Battaglia, Fig. 4 & caption, teaches an interaction network for use in state prediction (reinforcement learning) in which the model takes objects and relations as input (state data), reasons about their interactions (agent interacting with an environment to perform a task in an attempt to achieve a specified result), and applies the effects and physical dynamics to predict the new state (to select actions to be performed)), the system comprising:
an input network configured to receive state data comprising an image in pixel form that characterizes the environment and extract from the state data, respective features (Battaglia, Figures 1(a), 1(b) teaches:

[Figure: Battaglia, Figures 1(a) and 1(b), schematic of an interaction network]

where the caption further teaches a schematic of an interaction network (that is, the interaction network of Figs. 1(a) & (b) is to receive state data comprising an image of the environment and extract, from the state data, respective features); Battaglia, page 4, “2 Model: implementation”, last paragraph, teaches [a] CNN treats a local neighborhood of pixels as related, interacting entities: each pixel is effectively a receiver object and its neighboring pixels are senders (that is, an image in pixel form that characterizes the environment)) . . .;
. . . the relational network comprising at least one attention block (Battaglia, Figures 1(a), 1(b), & caption, teaches [f]or more complex systems (plurality of the entities) the model takes as input a graph that represents a system of objects . . . and relations . . . instantiates the pairwise interaction terms, bk, and computes their effects, ek, via a relational model fR(•) (a relational network comprising at least one attention block)), each attention block comprising a respective transform network (Battaglia, page 5, 3-Experiments, first full paragraph, teaches the “model architecture,” where MLPs contained multiple hidden layers of linear transforms (transform network) plus biases (that is, each attention block comprising a respective transform network)) . . . : 
* * *
and an output network arranged to receive the respective final features (Battaglia, Figures 1(a), 1(b), & caption, teaches [t]he [effects] ek are then aggregated and combined with the oj and external effects, xj , to generate input (as cj) (an output network), for an object model, fO(•), which predicts how the interactions and dynamics influence the objects, p (arranged to receive the respective final features)) . . . .
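For illustration only, Battaglia's observation that a CNN treats each pixel as a receiver object and its neighboring pixels as senders can be sketched as the list of relations an interaction network would consume. This is a minimal Python sketch under stated assumptions: the function name `pixel_relations` is hypothetical, and the 8-connected (3 x 3) neighborhood is an assumption rather than a detail taken from Battaglia.

```python
def pixel_relations(height, width):
    """List (receiver, sender) pixel pairs for an 8-connected neighborhood.

    Hypothetical helper: each pixel is a receiver object and every
    in-bounds neighbor in its 3 x 3 window is a sender.
    """
    relations = []
    for r in range(height):
        for c in range(width):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a pixel does not send to itself
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < height and 0 <= nc < width:
                        relations.append(((r, c), (nr, nc)))
    return relations


relations = pixel_relations(3, 3)
center_senders = [s for recv, s in relations if recv == (1, 1)]
print(len(relations), len(center_senders))  # 40 8
```

In a 3 x 3 image the center pixel has eight senders, each corner pixel three, and each edge pixel five.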
	Though Battaglia teaches the feature of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, Battaglia, however, does not explicitly teach -
an input network . . . 
a relational network configured to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image, . . . each transform network being arranged to generate respective modified features for the corresponding cell, by attending over input features the plurality of cells; and
an output network . . . , and use the respective final features to select an action to be performed by the agent in response to receiving the state data.
But Vinyals teaches -
an input network . . . (Vinyals, Fig. 1, teaches:

[Figure: Vinyals, Figure 1]

Vinyals, page 4, 3.2-Observations, first paragraph, teaches the StarCraft II API . . . generates (extracts) a set of “feature layers”, which abstract away from the RGB images seen during human play, while maintaining the core spatial and graphical concepts (that is, extract, from the state data, respective features); Vinyals, page 5, 3.2-Observations, first full paragraph, teaches the main observations come as sets of feature layers which are rendered at N x M pixels (that is, the N x M feature layers of Vinyals are a plurality of spatially distinct cells in the image));
a relational network configured to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image (Vinyals, page 5, 3.2-Observations, second full paragraph, teaches the minimap [features] is a coarse representation of the state of the entire world, and the screen [features] is a detailed view of a subsection of the world corresponding to the player’s on-screen view (that is, to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image)), . . . each transform network being arranged to generate respective modified features for the corresponding cell, by attending over input features the plurality of cells (Vinyals, page 5, 3.2-Observations, second full paragraph, teaches the human interface for the game provides various non-spatial observations. These include the amount of gas and minerals collected, the set of actions currently available (which depends on game context, e.g., which units are selected), detailed information about selected units, build queues, and units in a transport vehicle);
an output network . . . , and use the respective final features to select an action to be performed by the agent in response to receiving the state data (Vinyals, Figure 3, teaches:

[Figure: Vinyals, Figure 3]

Vinyals, page 6, 3.3-Actions, fifth paragraph, teaches we provide a list of available actions via the observations given to the agent at each step (that is, use the respective final features to select an action to be performed by the agent in response to receiving the state data)).
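As a hedged illustration of the mapping above, the N x M feature layers described by Vinyals can be regrouped so that each pixel position becomes one spatially distinct cell whose feature vector collects one value from every layer. The sketch assumes the layers are given as equally sized nested Python lists; `layers_to_cells` is a hypothetical name and is not part of the StarCraft II API.

```python
def layers_to_cells(layers):
    """Regroup C feature layers of shape N x M into N * M per-cell feature vectors.

    Cells are returned in row-major order; cell (r, c) is at index r * M + c.
    """
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    for layer in layers:
        if len(layer) != n_rows or any(len(row) != n_cols for row in layer):
            raise ValueError("all feature layers must share the same N x M shape")
    return [[layer[r][c] for layer in layers]
            for r in range(n_rows) for c in range(n_cols)]


# Two hypothetical 2 x 2 layers (e.g., unit type and hit points).
cells = layers_to_cells([[[1, 2], [3, 4]], [[10, 20], [30, 40]]])
print(cells)  # [[1, 10], [2, 20], [3, 30], [4, 40]]
```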
Battaglia and Vinyals are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the interaction network of Battaglia with the cell-based environment images of Vinyals.
	The motivation for doing so is to ensure the availability of domains that are beyond the capabilities of current reinforcement learning methods in one or more dimensions. (Vinyals, page 1, 1-Introduction, first paragraph).
Though Battaglia teaches the feature of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, where such interaction is in the reinforcement learning of the “cell” environment of Vinyals, the combination of Battaglia and Vinyals, however, does not explicitly teach -
* * *
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell; and
generate, using the respective attention weights, respective modified features for the cell;
* * *
But Choi teaches -
* * *
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell (Choi, left column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Input Segmentation”, second paragraph, teaches [w]e partitioned input image into uniform grid and used the cells (small image patches) in the grid as partial states; Choi, right column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Parallel Attentions”, first paragraph, teaches [u]sing the key features extracted from the feature extraction module, parallel attention layers determine what partial states are important by using [equation (5)], where N is the number of attention layers, $A_i^n$ is i-th element of n-th soft attention weight vector, $i' \in \{0, 1, \ldots, K\} \setminus \{i\}$ and $a_n$ is n-th selector vector which is trainable like other weights of [the] network (that is, using at least a query vector and a value vector derived from the input features for the cell)); and
generate, using the respective attention weights, respective modified features for the cell (Choi, left column of p. 4, “Multi-focus Attention Network - Single Agent Setting, State-Action Value Estimation”, first paragraph, teaches [u]sing attention weights from parallel attention layers, weighted value feature is defined as [equation (8)] . . . . Then the concatenated feature is used to estimate state-action value as follows: [equation (10)] (that is, generate, using the respective attention weights, respective modified features for the cell));
* * *
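The two limitations above (an attention weight between a cell and every cell, then a weighted combination yielding modified features) can be illustrated with a generic single-head, scaled dot-product attention sketch in Python. This is an assumption-laden illustration and not a reproduction of Choi's equations (5) and (8): Choi derives weights from key features and trainable selector vectors, whereas this sketch uses per-cell query, key, and value projections. The projection matrices and function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_cells(cells, w_query, w_key, w_value):
    """For each cell, return its attention weights over all cells and its modified features."""
    queries = [matvec(w_query, f) for f in cells]
    keys = [matvec(w_key, f) for f in cells]
    values = [matvec(w_value, f) for f in cells]
    scale = 1.0 / math.sqrt(len(keys[0]))  # scaled dot-product attention
    all_weights, modified = [], []
    for query in queries:
        # One attention weight between this cell and each of the cells.
        weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
        # Modified features: the attention-weighted sum of the value vectors.
        modified.append([sum(w * value[d] for w, value in zip(weights, values))
                         for d in range(len(values[0]))])
        all_weights.append(weights)
    return all_weights, modified


identity = [[1.0, 0.0], [0.0, 1.0]]
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical input features, 3 cells
weights, modified = attend_over_cells(cells, identity, identity, identity)
print([round(w, 3) for w in weights[0]])  # cells 0 and 2 receive equal, larger weight
```

Each row of weights sums to one, so each cell's modified features are a convex combination of the value vectors of all cells.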
Battaglia, Vinyals and Choi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Battaglia and Vinyals pertaining to an interaction network based on environment images having cells with the multi-focus attention network of Choi.
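The motivation for doing so is to mimic the human ability to spatially abstract the low-level sensory input into multiple entities and attend to them simultaneously, which also achieves faster learning than existing models. (Choi, Abstract).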
Regarding claim 2, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
Battaglia teaches wherein each of the transform networks comprises one or more head sections, and an adaptive network (Battaglia, page 3, second full paragraph, teaches standard deep neural network (transform network) building blocks (one or more head sections), multilayer perceptrons (MLP) (an adaptive network), matrix operations, etc., . . . ) to generate the modified features from the outputs of head sections (Battaglia, Figure 1(b) & caption, teaches the model takes as input a graph that represents a system of objects . . . and relations . . . , instantiates the pairwise interaction terms bk, and computes their effects ek (that is, modified features) . . . to generate input (as cj), for an object model fO (to generate the modified features from the outputs of head sections) . . . ).
Regarding claim 3, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 2, as described above.
Choi teaches -
wherein, denoting the number of head sections in each transform network as h, each attention block is operative to, for each of the h generate h value vectors for each cell using the input features for the plurality of cells, and each head section is operative to form a sum of the value vectors for the plurality of cells weighted by respective attention weights (Choi, left column of p. 3, “Multi-focus Attention Network, Single Agent Setting - Input Segmentation,” first paragraph, teaches partition[ing] input image into uniform grid and used the cells (small image patches) in the grid as partial states (that is, Choi teaches applying attention weights for each cell using the input features); Choi, left column of p. 4, “Multi-focus Attention Network, Single Agent Setting - Parallel Attentions,” first paragraph, teaches [u]sing attention weights from parallel attention layers (that is, denoting the number of head sections in each transform network as h), weighted value feature is defined as [equation (8)] wherein $h_n$ (that is, for each of the h generate h value vectors) is the n-th sum of value features weighted by [attention weight vector] $A^n$ (that is, each head section is operative to form a sum of the value vectors weighted by respective attention weights)).
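A hedged Python sketch of the multi-head arrangement recited in claim 3 follows: each of h head sections generates its own value vector for every cell and forms a sum of those value vectors weighted by its attention weights, and the per-cell head outputs are then concatenated (the arrangement recited in claim 9). The projection matrices, the dot-product scoring, and the function names are illustrative assumptions and are not taken from Choi.

```python
import math


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head(cells, heads):
    """heads: h (w_query, w_key, w_value) triples, one per head section.

    Returns, for every cell, the h weighted value sums concatenated in head order.
    """
    outputs = [[] for _ in cells]
    for w_query, w_key, w_value in heads:
        keys = [matvec(w_key, f) for f in cells]
        values = [matvec(w_value, f) for f in cells]  # this head's value vector per cell
        scale = 1.0 / math.sqrt(len(keys[0]))
        for i, features in enumerate(cells):
            query = matvec(w_query, features)
            weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
            head_sum = [sum(w * value[d] for w, value in zip(weights, values))
                        for d in range(len(values[0]))]
            outputs[i].extend(head_sum)  # concatenate this head's output for cell i
    return outputs


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
cells = [[1.0, 0.0], [0.0, 1.0]]
# Two heads (h = 2); zero queries give uniform weights, so each head averages its values.
print(multi_head(cells, [(zeros, identity, identity), (zeros, identity, double)]))
# [[0.5, 0.5, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0]]
```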
Battaglia, Vinyals and Choi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Battaglia and Vinyals pertaining to an interaction network based on environment images having cells with the multi-focus attention network of Choi, which operates on a per-cell basis.
The motivation for doing so is to mimic the human ability to spatially abstract the low-level sensory input into multiple entities and attend to them simultaneously, which also achieves faster learning than existing models. (Choi, Abstract).
Regarding claim 9, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 3, as described above.
Battaglia teaches wherein each transform network is arranged to concatenate the weighted value vectors (Battaglia, page 4, fourth full paragraph, teaches [the] marshalling function, m, computes the matrix products, $OR_r$ and $OR_s$ (weighted value vectors), and concatenates them with $R_a$: $m(G) = [OR_r; OR_s; R_a] = B$ (each transform network is arranged to concatenate the weighted value vectors)), and generate the modified features using the concatenated weighted value vectors (Battaglia, page 4, fourth and fifth full paragraphs, teaches [t]he resulting $B$ is a $(2D_S + D_R) \times N_R$ matrix . . . . The [resulting] $B$ is input to [relational model] $\phi_R$ (generate the modified features using the concatenated weighted value vectors) . . . . ).
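The marshalling function quoted above, $m(G) = [OR_r; OR_s; R_a] = B$, can be checked with a small worked example. In this minimal Python sketch (an illustration, not Battaglia's implementation), $O$ is a $D_S \times N_O$ matrix of object states stored one column per object, $R_r$ and $R_s$ are $N_O \times N_R$ binary matrices marking each relation's receiver and sender, and $R_a$ is a $D_R \times N_R$ matrix of relation attributes. The numeric values and function names are hypothetical.

```python
def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def marshal(objects, r_receivers, r_senders, r_attributes):
    """Stack O R_r, O R_s and R_a vertically into B, one column per relation."""
    receiver_states = matmul(objects, r_receivers)  # D_S x N_R
    sender_states = matmul(objects, r_senders)      # D_S x N_R
    return receiver_states + sender_states + r_attributes  # (2 D_S + D_R) x N_R


O = [[1, 2, 3],
     [4, 5, 6]]                 # D_S = 2 state features for N_O = 3 objects
R_r = [[1, 0], [0, 0], [0, 1]]  # receivers: relation 0 -> object 0, relation 1 -> object 2
R_s = [[0, 1], [1, 0], [0, 0]]  # senders: relation 0 <- object 1, relation 1 <- object 0
R_a = [[7, 8]]                  # D_R = 1 attribute per relation
B = marshal(O, R_r, R_s, R_a)
print(B)  # [[1, 3], [4, 6], [2, 1], [5, 4], [7, 8]]
```

Column 0 of $B$ holds the receiver state (1, 4), the sender state (2, 5), and the attribute 7 for relation 0, giving the $(2D_S + D_R) \times N_R = 5 \times 2$ shape stated above.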
Regarding claim 10, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 9, as described above.
Battaglia teaches wherein each transform network is arranged to add the concatenated weighted value vectors to the input features for the corresponding entity to form a summed vector (Battaglia, page 4, sixth and seventh full paragraphs, teaches [the] resulting $C$ is a $(D_S + D_X + D_E) \times N_O$ matrix, whose $N_O$ columns represent the object states, external effects, and per-object aggregate interaction effects (each transform network is arranged to add the concatenated weighted value vectors to the input features for the corresponding entity to form a summed vector)), and transmit the summed vector to the adaptive network (Battaglia, page 4, sixth and seventh full paragraphs, teaches [t]he $C$ is input to $\phi_O$ (transmit the summed vector to the adaptive network) . . . . ).
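In Battaglia's model, a matrix of the $(D_S + D_X + D_E) \times N_O$ shape described above is formed by stacking the object states $O$, the external effects $X$, and the per-object sums of relation effects $\bar{E} = E R_r^T$, where $E$ is the $D_E \times N_R$ matrix of effects output by $\phi_R$ and $R_r$ is the receiver matrix. The minimal Python sketch below uses illustrative values, and the function name `aggregate` is an assumption.

```python
def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def aggregate(objects, external, effects, r_receivers):
    """Return C = [O; X; E R_r^T], one column per object."""
    summed_effects = matmul(effects, transpose(r_receivers))  # D_E x N_O
    return objects + external + summed_effects


O = [[1, 2, 3], [4, 5, 6]]      # D_S = 2, N_O = 3
X = [[0, 0, 1]]                 # D_X = 1 external effect per object
E = [[10, 20]]                  # D_E = 1 effect for each of N_R = 2 relations
R_r = [[1, 0], [0, 0], [0, 1]]  # relation 0 is received by object 0, relation 1 by object 2
C = aggregate(O, X, E, R_r)
print(C)  # [[1, 2, 3], [4, 5, 6], [0, 0, 1], [10, 0, 20]]
```

Object 1 receives no relation, so its aggregated effect is 0.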
Regarding claim 11, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 2, as described above.
Battaglia teaches wherein the adaptive network comprises a multi-layer perceptron (Battaglia, page 3, second full paragraph, teaches multilayer perceptrons (MLP) (an adaptive network comprises a multi-layer perceptron), . . . ).
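A minimal Python sketch of an adaptive network implemented as a multi-layer perceptron and applied to the summed vector of claim 10 (the input features plus the concatenated weighted value vectors) is shown below. The two-layer size, ReLU activation, residual sum, weights, and function names are illustrative assumptions rather than details drawn from Battaglia.

```python
def relu(x):
    return max(0.0, x)


def mlp(x, layers):
    """Apply (weights, biases) layers in turn; ReLU on hidden layers, linear output."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [relu(v) for v in x]
    return x


def transform_update(input_features, attended, layers):
    """Add the attended features to the input features, then pass the sum to the MLP."""
    summed = [a + b for a, b in zip(input_features, attended)]
    return mlp(summed, layers)


layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer, 2 units
          ([[1.0, 1.0]], [0.5])]                    # linear output, 1 unit
print(transform_update([1.0, 0.0], [2.0, 1.0], layers))  # [2.5]
```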
Regarding claim 13, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
Battaglia further teaches wherein each input network comprises at least one convolutional layer (Battaglia, page 4, last full paragraph, teaches [a] [convolutional neural network (CNN)] treats a local neighborhood of pixels as related, interacting entities (input network comprises at least one convolutional layer) . . . .).
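As a hedged illustration of an input network with a convolutional layer, the Python sketch below slides small kernels over a single-channel image (a "valid" cross-correlation, which is what most CNN libraries compute) and returns one feature vector per output position, each position serving as a spatially distinct cell. The kernels and image are hypothetical; padding, stride, multiple input channels, and nonlinearities are omitted for brevity.

```python
def conv2d_valid(image, kernel, bias=0):
    """Valid (no padding, stride 1) 2-D cross-correlation of one channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw)) + bias
             for c in range(out_w)]
            for r in range(out_h)]


def conv_cell_features(image, kernels):
    """One feature map per kernel, regrouped into a feature vector per cell."""
    maps = [conv2d_valid(image, k) for k in kernels]
    h, w = len(maps[0]), len(maps[0][0])
    return [[m[r][c] for m in maps] for r in range(h) for c in range(w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],   # diagonal sum
           [[1, 1], [1, 1]]]   # 2 x 2 box sum
print(conv_cell_features(image, kernels))  # [[6, 12], [8, 16], [12, 24], [14, 28]]
```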
Regarding claim 23, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
Battaglia further teaches wherein the output network is configured to generate a baseline value (Battaglia, page 6, last full paragraph of Section 3, teaches a constant velocity baseline . . . ; an MLP baseline (the output network is configured to generate a baseline value)).
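For illustration only, an output network that both selects an action and emits a scalar baseline value (in reinforcement learning, commonly an estimate of expected return used to reduce the variance of policy-gradient updates) could be sketched as follows. The feature-wise max pooling over cells, the single linear layers, greedy action selection, and all names and weights are assumptions for this Python sketch, not details of Battaglia or of the claimed system.

```python
import math


def linear(x, weights, biases):
    """Affine layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def output_network(cell_features, policy_w, policy_b, value_w, value_b):
    """Pool final cell features, then return (action, action probabilities, baseline)."""
    # Feature-wise max over cells gives a fixed-size summary for any number of cells.
    pooled = [max(column) for column in zip(*cell_features)]
    logits = linear(pooled, policy_w, policy_b)
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    baseline = linear(pooled, value_w, value_b)[0]  # scalar baseline value
    action = max(range(len(probs)), key=probs.__getitem__)  # greedy selection
    return action, probs, baseline


action, probs, baseline = output_network(
    [[1.0, 0.0], [0.0, 2.0]],                          # final features for two cells
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0] * 3,   # three hypothetical actions
    [[0.5, 0.5]], [0.0])
print(action, baseline)  # 1 1.5
```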
Regarding claim 27, Battaglia teaches [a] method for controlling an agent interacting with an environment to perform a task in an attempt to achieve a specified result (Battaglia, Fig. 4 & caption, teaches an interaction network for use in state prediction in which the model takes objects and relations as input (state data), reasons about their interactions (agent interacting with an environment to perform a task in an attempt to achieve a specified result), and applies the effects and physical dynamics to predict the new state (in an attempt to achieve a specified result)), the method comprising:
receiving state data comprising an image in pixel form that characterizes an environment (Battaglia, Figures 1(a), 1(b) teaches:

[Figure: Battaglia, Figures 1(a) and 1(b), schematic of an interaction network]

where the caption to Figures 1(a) & (b) further teaches a schematic of an interaction network (that is, the interaction network of Figs. 1(a) & (b) is to receive state data comprising an image of the environment); Battaglia, page 4, “2 Model: implementation”, last paragraph, teaches [a] CNN treats a local neighborhood of pixels as related, interacting entities: each pixel is effectively a receiver object and its neighboring pixels are senders (that is, an image in pixel form that characterizes the environment));
extracting from the state data, for each of multiple entities, respective features (Battaglia, Figures 1(a), 1(b), & caption, teaches the model takes as input a graph that represents a system of objects, $o_j$, and relations, $\langle i, j, r_k \rangle_k$, [and] instantiates the pairwise interaction terms $b_k$ (extracting from the state data, for each of multiple entities potentially in the environment, respective entity data indicative of the presence of the entity in the environment)) . . . ;
* * *
Though Battaglia teaches the feature of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, Battaglia, however, does not explicitly teach -
* * *

generating, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image using a relational neural network; and
selecting an action to be performed by the agent in response to the received state data based on the respective final features for each of the cells.
However, Vinyals teaches - 
* * *
(Vinyals, Fig. 1, teaches:

[Figure: Vinyals, Figure 1]

Vinyals, page 4, 3.2-Observations, first paragraph, teaches the StarCraft II API . . . generates (extracts) a set of “feature layers”, which abstract away from the RGB images seen during human play, while maintaining the core spatial and graphical concepts (that is, extract, from the state data, respective features); Vinyals, page 5, 3.2-Observations, first full paragraph, teaches the main observations come as sets of feature layers which are rendered at N x M pixels (that is, the N x M feature layers of Vinyals are a plurality of spatially distinct cells in the image));
generating, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image (Vinyals, page 5, 3.2-Observations, second full paragraph, teaches the minimap [features] is a coarse representation of the state of the entire world, and the screen [features] is a detailed view of a subsection of the world corresponding to the player’s on-screen view (that is, to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image)) using a relational neural network (Vinyals, Abstract, teaches giv[ing] initial relational neural network)); and
selecting an action to be performed by the agent in response to the received state data based on the respective final features for each of the cells (Vinyals, Figure 3, teaches:

[Figure: Vinyals, Figure 3]

Vinyals, page 6, 3.3-Actions, fifth paragraph, teaches we provide a list of available actions via the observations given to the agent at each step (that is, use the respective final features to select an action to be performed by the agent based on the respective final features for each of the cells)).
	Battaglia and Vinyals are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the interaction network of Battaglia with the cell-based environment images of Vinyals.
The motivation for doing so is to ensure the availability of domains that are beyond the capabilities of current reinforcement learning methods in one or more dimensions. (Vinyals, page 1, 1-Introduction, first paragraph).
Though Battaglia teaches the feature of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, where such interaction is in the reinforcement learning of the “cell” environment of Vinyals, the combination of Battaglia and Vinyals, however, does not explicitly teach -
* * *
. . . wherein the relational neural network includes a plurality of transform networks corresponding to the plurality of cells that are each configured to:
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell; and
generate, using the respective attention weights, respective modified features for the cell;
* * *
But Choi teaches -
* * *
. . . wherein the relational neural network includes a plurality of transform networks corresponding to the plurality of cells (Choi, left column of p. 3, “Multi-focus Attention Network, Input Segmentation”, first paragraph, teaches that [its] . . . (that is, the relational neural network includes a plurality of transform networks corresponding to the plurality of cells)) that are each configured to:
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell (Choi, left column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Input Segmentation”, second paragraph, teaches [w]e partitioned input image into uniform grid and used the cells (small image patches) in the grid as partial states; Choi, right column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Parallel Attentions”, first paragraph, teaches [u]sing the key features extracted from the feature extraction module, parallel attention layers determine what partial states are important by using [equation (5)], where N is the number of attention layers, $A_i^n$ is i-th element of n-th soft attention weight vector, $i' \in \{0, 1, \ldots, K\} \setminus \{i\}$ and $a_n$ is n-th selector vector which is trainable like other weights of [the] network (that is, using at least a query vector and a value vector derived from the input features for the cell)); and
generate, using the respective attention weights, respective modified features for the cell (Choi, left column of p. 4, “Multi-focus Attention Network - Single Agent Setting, State-Action Value Estimation”, first paragraph, teaches [u]sing attention weights from parallel attention layers, weighted value feature is defined as [equation (8)] . . . . Then the concatenated feature is used to estimate state-action value as follows: [equation (10)] (that is, generate, using the respective attention weights, respective modified features for the cell));
* * *
Battaglia, Vinyals and Choi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Battaglia and Vinyals pertaining to an interaction network based on environment images having cells with the multi-focus attention network of Choi.
The motivation for doing so is to mimic the human ability to spatially abstract the low-level sensory input into multiple entities and attend to them simultaneously, which also achieves faster learning than existing models. (Choi, Abstract).
Regarding claim 28, the combination of Battaglia, Vinyals and Choi teaches all of the limitations of claim 27, as described above. 
Battaglia teaches -
wherein extracting the respective features comprises:
processing the state data using an input neural network (Battaglia, page 4, “Implementation”, first paragraph, teaches the interaction network uses standard deep neural network building blocks . . . (processing the state data using an input neural network)) . . . .
However, Battaglia does not explicitly teach -
to generate the respective features for each of the plurality of spatially distinct cells.
But Vinyals teaches to generate the respective features for each of the plurality of spatially distinct cells (Vinyals, page 4, 3.2-Observations, first paragraph, teaches the Starcraft II API . . . generates a set of “feature layers” (that is, to generate the respective features for each of the plurality of spatially distinct cells); also, with respect to neural network architectures, Vinyals, page 9, 4.1-Learning Algorithm, first paragraph, teaches [o]ur reinforcement learning agents are built using a deep neural network).
Battaglia, Vinyals and Choi are from the same or similar field of endeavor. Battaglia teaches an interaction network, based on neural network building blocks, that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting, also based on a neural network architecture. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Battaglia with the cell-based feature generation of Vinyals and the multi-focus attention network of Choi.
The motivation for doing so is to ensure the availability of domains that are beyond the capabilities of current reinforcement learning methods in one or more dimensions. (Vinyals, page 1, 1-Introduction, first paragraph).
Regarding claim 29, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 27, as described above. 
Battaglia teaches wherein the relational neural network includes a plurality of attention blocks (Battaglia, Figures 1(a), 1(b), & caption, teaches [f]or more complex systems the model takes as input a graph that represents a system of objects . . . and relations . . . instantiates the pairwise interaction terms, bk, and computes their effects, ek, via a relational model fR(•) (a relational neural network includes a plurality of attention blocks)) that are each configured to receive respective input features and to generate respective updated features by attending over the respective input features (Battaglia, Figure 1(b) & caption, teaches the model takes as input a graph that represents a system of objects . . . and relations . . . , instantiates the pairwise interaction terms bk, and computes their effects ek (that is, modified features) . . . to generate input (as cj), for an object model fO (to generate the modified features by attending over the respective input features)).
However, Battaglia does not explicitly teach -

Vinyals teaches -
(regarding “plurality of cells,” Vinyals, page 2, 1-Introduction, third paragraph, teaches observations and actions are defined in terms of low resolution grids of features (that is, the features pertaining to each of a plurality of cells); see also Vinyals, page 10, 4.3-Agent Architectures, first paragraph, which teaches re-scale numerical features with a logarithmic transformation as some of them such as hit-points or minerals might attain substantially high values (that is, a form of attention block . . . to generate respective updated features for each of the plurality of cells)) . . . .
The motivation for doing so is to ensure the availability of domains that are beyond the capabilities of current reinforcement learning methods in one or more dimensions. (Vinyals, page 1, 1-Introduction, first paragraph).
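For illustration only, the logarithmic re-scaling that Vinyals describes for numerical features that can attain substantially high values may be sketched as below. The use of log(1 + x) (Python's `math.log1p`) is an assumption chosen so that zero-valued features map to zero, and the helper name `rescale_log` is hypothetical; neither is relied upon as the exact form used by Vinyals.

```python
import math

def rescale_log(values):
    """Compress large non-negative feature values (hypothetical helper).

    Uses log(1 + x) so that a zero-valued feature stays zero; this exact
    form is an assumption for illustration.
    """
    return [math.log1p(v) for v in values]

raw_features = [0, 10, 100, 10000]   # e.g., hit points or mineral counts
print([round(v, 3) for v in rescale_log(raw_features)])
```

The transform preserves the ordering of the raw values while shrinking their range, which is the property the re-scaling is relied upon for.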
Regarding claim 31, Battaglia teaches [o]ne or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers (Battaglia, page 5, 3-Experiments, sixth paragraph, teaches [e]ach of the training, validation, test data sets were generated by simulating 2000 scenes over 1000 time steps, and randomly sampling 1 million . . . one-step input/target pairs (that is, one or more non-transitory computer-readable storage media storing instructions)) to implement:
an input network configured to receive state data comprising an image in pixel form that characterizes the environment and extract, from the state data, respective features (Battaglia, Figures 1(a), 1(b) teaches:

[Image: Battaglia, Figs. 1(a) and 1(b)]

where the caption further teaches the figure is a schematic of an interaction network (that is, the interaction network of Figs. 1(a) & (b) is to receive state data comprising an image of the environment and extract from the state data, respective features); Battaglia, page 4, “2 Model: implementation”, last paragraph, teaches [a] CNN treats a local neighborhood of pixels as related, interacting entities: each pixel is effectively a receiver object and its neighboring pixels are senders (that is, an image in pixel form that characterizes the environment)) . . . ;
. . . the relational network comprising at least one attention block (Battaglia, Figures 1(a), 1(b), & caption, teaches [f]or more complex systems (plurality of the entities) the model takes as input a graph that represents a system of objects . . . and relations . . . instantiates the pairwise interaction terms, bk, and computes their effects, ek, via a relational model fR(•) (a relational network comprising at least one attention block)), each attention block comprising a respective transform network for each of the cells, each transform network (Battaglia, page 5, 3-Experiments, first full paragraph, teaches the “model architecture,” where the MLPs contained multiple hidden layers of linear transforms (transform network) plus biases (that is, each attention block comprising a respective transform network)) . . . ; and
an output network arranged to receive the respective final features (Battaglia, Figures 1(a), 1(b), & caption, teaches [t]he [effects] ek are then aggregated and combined with the oj and external effects, xj , to generate input (as cj) (an output network), for an object model, fO(•), which predicts how the interactions and dynamics influence the objects, p (arranged to receive the respective final features)) . . . .
	Though Battaglia teaches the feature of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, Battaglia, however, does not explicitly teach -
an input network . . . 
a relational network configured to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image, . . . 
* * *
and an output network . . . , and use the respective final features to select an action to be performed by the agent in response to receiving the state data.
But Vinyals teaches -
an input network . . . (Vinyals, Fig. 1, teaches:

[Image: Vinyals, Fig. 1]

Vinyals, page 4, 3.2-Observations, first paragraph, teaches the Starcraft II API . . . generates (extracts) a set of “feature layers”, which abstract away from the RGB images seen during human play, while maintaining the core spatial and graphical concepts (that is, extract, from the state data, respective features); Vinyals, page 5, 3.2-Observations, first full paragraph, teaches the main observations come as sets of feature layers which are rendered at N x M pixels (that is, the N x M feature layers of Vinyals is a plurality of spatially distinct cells in the image));
a relational network configured to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image (Vinyals, page 5, 3.2-Observations, second full paragraph, teaches the minimap [features] is a coarse representation of the state of the entire world, and the screen [features] is a detailed view of a subsection of the world corresponding to the player’s on-screen view (that is, to generate, for each cell, respective final features for the cell by attending over the respective features of the plurality of spatially distinct cells in the image)), . . .;
* * *
and an output network . . . , and use the respective final features to select an action to be performed by the agent in response to receiving the state data (Vinyals, Figure 3, teaches:

[Image: Vinyals, Figure 3]

Vinyals, page 6, 3.3-Actions, fifth paragraph, teaches we provide a list of available actions via the observations given to the agent at each step (that is, use the respective final features to select an action to be performed by the agent in response to receiving the state data)).
	Battaglia and Vinyals are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Battaglia pertaining to an interaction network with the image with cells of an environment of Vinyals.
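The motivation for doing so is to ensure the availability of domains that are beyond the capabilities of current reinforcement learning methods in one or more dimensions. (Vinyals, page 1, 1-Introduction, first paragraph).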
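For illustration only, selecting an action from among the actions that Vinyals reports as available at each step may be sketched as restricting the agent's scores to the available actions before choosing. The function name, the score dictionary, and the greedy (highest-score) choice are assumptions and are not drawn from Vinyals or from the Applicant's disclosure.

```python
import math

def select_available_action(scores, available):
    """Return the highest-scoring action among those listed as available.

    scores: dict mapping action id -> score (hypothetical network output)
    available: iterable of action ids reported as available for this step
    """
    allowed = set(available)
    best_id, best_score = None, -math.inf
    for action_id, score in scores.items():
        if action_id in allowed and score > best_score:
            best_id, best_score = action_id, score
    if best_id is None:
        raise ValueError("no available action has a score")
    return best_id

scores = {0: 0.1, 1: 2.5, 2: 1.7, 3: 0.4}
print(select_available_action(scores, available=[0, 2, 3]))  # → 2
```

Action 1 has the highest score but is not available at this step, so the next-best available action is selected.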
Although Battaglia teaches an agent interacting with an environment to perform a task in an attempt to achieve a specified result, and such interaction takes place in the reinforcement learning “cell” environment of Vinyals, the combination of Battaglia and Vinyals does not explicitly teach -
* * *
. . . each transform network being arranged to:
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell; and
generate, using the respective attention weights, respective modified features for the cell;
* * *
But Choi teaches -
* * *
. . . each transform network (Choi, left column of p. 3, “Multi-focus Attention Network, Input Segmentation”, first paragraph, teaches that [its] segmentation module segments the low-level sensory input into multiple segments [that] can be done by various methods. We believe that we can apply more sophisticated methods like . . . spatial transformer networks (that is, each transform network)) being arranged to:
determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell (Choi, left column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Input Segmentation”, second paragraph, teaches [w]e partitioned input image into uniform grid and used the cells (small image patches) in the grid as partial states; Choi, right column of p. 3, “Multi-focus Attention Network - Single Agent Setting, Parallel Attentions”, first paragraph, teaches [u]sing the key features extracted from the feature extraction module, parallel attention layers determine what partial states are important by using [equation (5)], where N is the number of attention layers, $A^{n}_{i}$ is i-th element of n-th soft attention weight vector, $i' \in \{0, 1, \ldots, K\} \setminus \{i\}$ and $a_n$ is n-th selector vector which is trainable like other weights of [the] network (that is, using at least a query vector and a value vector derived from the input features for the cell)); and
generate, using the respective attention weights, respective modified features for the cell (Choi, left column of p. 4, “Multi-focus Attention Network - Single Agent Setting, State-Action Value Estimation”, first paragraph, teaches [u]sing attention weights from parallel attention layers, weighted value feature is defined as [equation (8)] . . . . Then the concatenated feature is used to estimate state-action value as follows: [equation (10)] (that is, generate, using the respective attention weights, respective modified features for the cell));
* * *
Battaglia, Vinyals and Choi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Battaglia and Vinyals pertaining to an interaction network based on environment images having cells with the multi-focus attention network of Choi.
The motivation for doing so is to mimic the human ability to spatially abstract the low-level sensory input into multiple entities and attend to them simultaneously, which also achieves faster learning than existing models. (Choi, Abstract).
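For illustration only, the limitations mapped above (partitioning an image into a uniform grid of cells, determining an attention weight between each cell and every cell, and generating modified features for each cell from those weights) may be sketched as follows. The sketch substitutes scaled dot-product attention with identity query, key, and value projections for the selector-vector attention of Choi's equation (5); it is a hypothetical example, not the implementation of Choi or of the Applicant, and it assumes the image height and width are multiples of the cell size.

```python
import math

def split_into_cells(image, cell):
    """Partition a 2-D image (list of rows) into a uniform grid of
    cell x cell patches, flattening each patch row-major into a vector."""
    h, w = len(image), len(image[0])
    cells = []
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            cells.append([image[i][j]
                          for i in range(r, r + cell)
                          for j in range(c, c + cell)])
    return cells

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features):
    """For each cell, score every cell by a scaled dot product between the
    cell's query and that cell's key, normalize the scores into attention
    weights, and sum the value vectors with those weights.
    Queries, keys, and values are the raw features (identity projections);
    a trained network would use learned projections instead."""
    d = len(features[0])
    modified, all_weights = [], []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in features]
        w = softmax(scores)
        all_weights.append(w)
        modified.append([sum(w[n] * features[n][i]
                             for n in range(len(features)))
                         for i in range(d)])
    return modified, all_weights

image = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 2, 2],
         [0, 0, 2, 2]]
cells = split_into_cells(image, 2)
modified, weights = attend(cells)
print(len(cells), len(cells[0]))  # → 4 4
```

Each row of `weights` holds one cell's attention weights over all four cells and sums to one; each entry of `modified` is that cell's updated feature vector.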
7.	Claims 4 and 5 are rejected under 35 U.S.C. § 103 as being unpatentable over Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics (2016) [hereinafter Battaglia] in view of Vinyals et al., “StarCraft II: A new Challenge for Reinforcement Learning,” (2017) [hereinafter Vinyals], Choi et al., “Multi-Focus Attention Network for Efficient Deep Reinforcement Learning,” AAAI (2017) [hereinafter Choi], and Duan et al., “One-Shot Imitation Learning,” (2017) [hereinafter Duan].
Regarding claim 4, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 3, as described above.
However, the combination of Battaglia, Vinyals, and Choi does not explicitly teach -
 the attention block comprises h value networks, each value network being for generating value vectors from input features.
But Duan teaches -
wherein the attention block comprises h value networks, each value network being for generating value vectors from input features (Duan, right column of p. 5, “4.2.1. Neighborhood Attention,” second paragraph, teaches [t]he input to neighborhood attention is a list of embeddings $h^{in}_{1}, \ldots, h^{in}_{B}$ of the same dimension, which can be the result of a projection operation over a list of block positions; Duan, Fig. 4 & caption, teaches:

[Image: Duan, Fig. 4]

where the caption of Duan teaches [neighborhood attention,] for each block, performs one attention query corresponding to each block, and outputs a list of embeddings which have the same dimension as the input (that is, the attention block comprises h value networks, each value network for generating value vectors from input features)).
Battaglia, Vinyals, Choi, and Duan are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Duan teaches reinforcement learning implementing soft attention to generalize conditions and tasks unseen in training data. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network based on environment images having cells incorporating multi-focus attention networks with the soft attention embodiment of Duan.
The motivation for doing so is to broaden the policies to accomplish a variety of reinforcement learning tasks. (Duan, Abstract).
Regarding claim 5, the combination of Battaglia, Vinyals, Choi, and Duan teaches all of the limitations of claim 4, as described above.
Battaglia teaches wherein each value network produces value vectors by applying a linear transform to input features (Battaglia, page 6, first full paragraph, teaches [t]he [relational] fR and [object] fO MLPs contained multiple hidden layers of linear transforms plus biases (each value network produces value vectors by applying a linear transform to entity data) . . . .).
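For illustration only, an attention block having h value networks, each applying a linear transform (weights plus bias) to every input feature vector, may be sketched as follows. The weights, biases, and the choice of h = 2 are arbitrary hypothetical values chosen so the arithmetic is easy to check.

```python
def linear(weight, bias, x):
    """Compute W x + b, with the weight matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)]

def value_vectors(value_nets, features):
    """Apply each of the h value networks (one linear map per head) to every
    input feature vector; the result is indexed as values[head][cell]."""
    return [[linear(W, b, x) for x in features] for (W, b) in value_nets]

# Two hypothetical heads (h = 2) mapping 3-dim features to 2-dim values.
value_nets = [
    ([[1, 0, 0], [0, 1, 0]], [0, 0]),     # head 0: keep the first two features
    ([[0, 0, 1], [1, 1, 1]], [0.5, 0]),   # head 1: a different linear map
]
features = [[1, 2, 3], [4, 5, 6]]
values = value_vectors(value_nets, features)
print(values[0])  # → [[1, 2], [4, 5]]
print(values[1])  # → [[3.5, 6], [6.5, 15]]
```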
8.	Claims 6-8, 15, 18-20, and 22 are rejected under 35 U.S.C. § 103 as being unpatentable over Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics (2016) [hereinafter Battaglia] in view of Vinyals et al., “StarCraft II: A new Challenge for Reinforcement Learning,” (2017) [hereinafter Vinyals], Choi et al., “Multi-Focus Attention Network for Efficient Deep Reinforcement Learning,” AAAI (2017) [hereinafter Choi], and Mottaghi et al., “‘What happens if . . .’ Learning to Predict the Effect of Forces in Images,” (2016) [hereinafter Mottaghi].
Regarding claim 6, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 3, as described above.
However, the combination of Battaglia, Vinyals, and Choi does not explicitly teach -
for each entity, each respective head section is arranged to generate the attention weights by generating respective salience values for each of the plurality of cells, and combining the salience values using a non-linear function to form the attention weights.
But Mottaghi teaches -
for each entity, each respective head section is arranged to generate the attention weights (Mottaghi, page 7, Section 5.1, second full paragraph, teaches RNNs composed of ReLUs and initialized with identity weight matrix (for each entity, each respective head section is arranged to generate the attention weights)) by generating respective salience values for each of the plurality of cells (Mottaghi, page 7, Section 5.1, second full paragraph, teaches the velocities at different time steps are dependent on each other, and the RNN can capture (by generating) these temporal dependencies (respective salience values for each of the corresponding plurality of entities)), and combining the salience values using a non-linear function to form the attention weights (Mottaghi, page 7, Section 5.1, second full paragraph, teaches [t]he output (combining the salience values) at each time step o_t is a function of the [hidden state h_t]. More concretely, [output at each timestep is] o_t = SoftMax(g(h_t)), where g is a linear function, which is augmented by a ReLU (using a non-linear function to form the attention weights)).
Battaglia, Vinyals, Choi, and Mottaghi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Mottaghi teaches a deep neural network model that learns long-term sequential dependencies of object movements while taking into account the geometry and appearance of the scene. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the applicant’s invention to modify the teachings of the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network including a plurality of spatially distinct cells with multi-focus attention network with the long-term sequential dependencies of object movement of Mottaghi.
The motivation for doing so is to predict long term movement of objects as their reaction to external forces from a single image. (Mottaghi, Abstract).
Regarding claim 7, the combination of Battaglia, Vinyals, Choi, and Mottaghi teaches all of the limitations of claim 6, as described above.
Mottaghi teaches wherein the non-linear function is a soft-max function (Mottaghi, page 7, Section 5.1, second full paragraph, teaches [output at each timestep is] o_t = SoftMax(g(h_t)) (the non-linear function is a soft-max function) . . . . ). 
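For illustration only, combining salience values with a non-linear function to form attention weights, in the manner of o_t = SoftMax(g(h_t)) with g augmented by a ReLU, may be sketched as below. The salience values are assumed to have already been produced by the linear part of g, and subtracting the maximum before exponentiating is a standard numerical-stability choice for which Mottaghi is not cited.

```python
import math

def relu(x):
    return max(0.0, x)

def softmax(xs):
    """Numerically stable soft-max: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(salience):
    """Apply a ReLU and then a soft-max to salience values that are assumed
    to be the output of a linear map, giving weights that sum to one."""
    return softmax([relu(s) for s in salience])

w = attention_weights([2.0, -1.0, 0.5])
print([round(x, 3) for x in w])
```

The negative salience value is clipped to zero by the ReLU, so it receives the smallest, but still non-zero, weight after the soft-max.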
Regarding claim 8, the combination of Battaglia, Vinyals, Choi, and Mottaghi teaches all of the limitations of claim 6, as described above.
Battaglia teaches wherein, denoting the number of head sections in each transform network as h, each attention block comprises h query networks for generating a query vector for each cell from the plurality of cells (Battaglia, Figs. 1a, 1b & caption, teaches the model takes objects [(O1, O2, O3)] and relations [(r1, r2)] as input (denoting the number of head sections in each transform network as h), reasons about their interactions (each attention block comprises h query networks for generating a query vector for each entity); Examiner points out that query pertains to determining/querying the next state based on a given state, and that a query vector includes those elements, or tuple, relating to the query), and h key networks for generating a key vector (Battaglia, Figs. 1a, 1b & caption, teaches [the model] applies the effects and physical dynamics (h key networks for generating a key vector for each entity from corresponding entity data) . . . . ; Examiner points out that key pertains to answering/solving the next state based on the given state, and that a key vector includes those elements, or tuple, relating to the key), 
each head section being arranged to use the query vector (Battaglia, page 4, sixth full paragraph, teaches [t]he G, X, and E are input (each head section being arranged to use the query vector for the corresponding entity to generate the salience values for each of the plurality of entities) to a, which computes the $D_E \times N_O$ matrix product (dot product of the query vector and the respective key vector), $\bar{E} = E R_r^{T}$).
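For illustration only, the aggregation that Battaglia computes as $\bar{E} = E R_r^{T}$ can be checked on a toy graph. In this hypothetical example, with three objects and three relations, column j of $\bar{E}$ is the sum of the effects $e_k$ of the relations whose receiver is object j, so object 2, which receives relations 0 and 1, is assigned $e_0 + e_1$. The matrices are invented for the example and are not taken from Battaglia.

```python
def matmul(a, b):
    """Plain-Python product of an (n x k) matrix a and a (k x m) matrix b."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical toy graph: N_O = 3 objects, N_R = 3 relations, D_E = 2.
# R_r[j][k] = 1 when object j is the receiver of relation k.
R_r = [[0, 0, 0],    # object 0 receives nothing
       [0, 0, 1],    # object 1 receives relation 2
       [1, 1, 0]]    # object 2 receives relations 0 and 1
# E is D_E x N_R; column k is the effect e_k of relation k.
E = [[1.0, 3.0, 5.0],
     [2.0, 4.0, 6.0]]
E_bar = matmul(E, transpose(R_r))   # D_E x N_O aggregated effects
print(E_bar)  # → [[0.0, 5.0, 4.0], [0.0, 6.0, 6.0]]
```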
Regarding claim 15, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
However, the combination of Battaglia, Vinyals, and Choi fails to explicitly teach - 
wherein
But Mottaghi teaches -
wherein (Mottaghi, Fig. 4 & caption, teaches input to the model is a force image and an RGB-M image . . . [depicting a region of interest relating to an object] (for each entity, the respective entity data further comprises data indicative of a position of the cell in the input image)).
Battaglia, Vinyals, Choi, and Mottaghi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Mottaghi teaches a deep neural network model that learns long-term sequential dependencies of object movements while taking into account the geometry and appearance of the scene. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the applicant’s invention to modify the teachings of the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network including a plurality of spatially distinct cells with a multi-focus attention network with the CNNs as force tower and image tower respectively relating to object movement indicative of object position of Mottaghi.
The motivation for doing so is to predict long term movement of objects as their reaction to external forces from a single image. (Mottaghi, Abstract).
Regarding claim 18, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
Battaglia further teaches the input network including at least one recurrent layer (Battaglia, page 9, first partial paragraph, teaches [b]y adapting the interaction network into a recurrent neural network, even more accurate long-term predictions might be possible (at least one recurrent layer) . . . . ).
Regarding claim 19, the combination of Battaglia, Vinyals, Choi and Mottaghi teaches all of the limitations of claim 18, as described above.
However, the combination of Battaglia, Vinyals, and Choi does not explicitly teach that the recurrent layer is an LSTM layer.
	But Mottaghi teaches the recurrent layer is an LSTM layer (Mottaghi, page 7, second full paragraph, teaches [recurrent neural networks] composed of ReLUs . . . are as powerful as standard LSTMs (the recurrent layer is an LSTM layer)).
Regarding claim 20, the combination of Battaglia, Vinyals, Choi, and Mottaghi teaches all of the limitations of claim 19, as described above.
Mottaghi further teaches wherein the LSTM layer is a convolutional LSTM layer (Mottaghi, abstract, teaches combining convolutional and recurrent neural networks; Mottaghi, page 7, second full paragraph, teaches that LSTMs are a standard form of recurrent neural network).
Regarding claim 22, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
However, the combination of Battaglia, Vinyals, and Choi does not explicitly teach wherein the output network comprises a rectified linear unit.
	But Mottaghi teaches wherein the output network comprises a rectified linear unit (Mottaghi, Fig. 4 & caption, teaches an output network that includes a [rectified linear unit (ReLU)] (output network comprises a rectified linear unit)).
Battaglia, Vinyals, Choi and Mottaghi are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Mottaghi teaches a deep neural network model that learns long-term sequential dependencies of object movements while taking into account the geometry and appearance of the scene. Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the applicant’s invention to modify the teachings of the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network including a plurality of spatially distinct cells with a multi-focus attention network with the CNNs as force tower and image tower respectively relating to object movement indicative of object position of Mottaghi.
The motivation for doing so is that RNNs composed of ReLUs and initialized with identity weight matrix are as powerful as standard LSTMs. (Mottaghi, page 7, second full paragraph).
9.	Claims 24-26 are rejected under 35 U.S.C. § 103 as being unpatentable over Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics (2016) [hereinafter Battaglia] in view of Vinyals et al., “StarCraft II: A new Challenge for Reinforcement Learning,” (2017) [hereinafter Vinyals], Choi et al., “Multi-Focus Attention Network for Efficient Deep Reinforcement Learning,” AAAI (2017) [hereinafter Choi], and US Published Application 20190266489 to Hu et al. [hereinafter Hu].
Regarding claim 24, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
However, the combination of Battaglia, Vinyals, and Choi fails to explicitly teach wherein the output network is configured to generate a policy defining a distribution of respective probability values for each action of a space of possible actions, and select the action stochastically using the policy.
But Hu teaches wherein the output network is configured to generate a policy defining a distribution of respective probability values for each action of a space of possible actions (Hu ¶ 0054 teaches [a] policy (π) may be a strategy (distribution of respective probability values) employed to determine the next action for the agent based on the current state (for each action of a space of possible actions)), and select the action stochastically using the policy (Hu ¶ 0060 teaches [t]he agent may select one action from a set of available actions (select the action stochastically using the policy), which results in a new state and a new reward for a subsequent time step).
	Battaglia, Vinyals, Choi, and Hu are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Hu teaches interaction-aware decision making may include generating a multi-goal, multi-agent, multi-stage, interaction-aware decision making network policy based on the first agent neural network and the second agent neural network. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of applicant’s invention to modify the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network of a multi-cell image environment with the multi-goal, multi-agent, multi-stage, interaction-aware decision making network policy of Hu.
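	The motivation for doing so is for a decentralized policy trained using a double critic, including a decentralized value function for learning local objectives and a centralized action-value function for learning cooperation, thereby enabling local objectives or goals to be considered while also considering cooperation between N number of agents by showing two equivalent views of the policy gradient and implementing the new actor-critic or agent-critic adaptation (Hu ¶ 0085).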
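For illustration only, a policy defining a probability value for each action in a space of possible actions, and stochastic selection of the action from that policy, may be sketched with Python's standard `random.choices`. The three-action policy and the function name are hypothetical and are not taken from Hu; the generator is seeded only so the example is repeatable.

```python
import random

def sample_action(probabilities, rng=random):
    """Draw one action index at random, weighted by the policy's probabilities."""
    actions = list(range(len(probabilities)))
    return rng.choices(actions, weights=probabilities, k=1)[0]

policy = [0.1, 0.6, 0.3]       # hypothetical pi(a | s) over three actions
rng = random.Random(0)          # seeded so the run is repeatable
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_action(policy, rng)] += 1
print(counts)                   # roughly proportional to 0.1 / 0.6 / 0.3
```

Unlike a greedy choice, every action with non-zero probability can be selected, and over many steps the selection frequencies track the policy's distribution.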
Regarding claim 25, the combination of Battaglia, Vinyals, Choi, and Hu teaches all of the limitations of claim 24, as described above.
Hu teaches wherein the output network is arranged to generate one or more action-related arguments, whereby the agent can perform the selected action based on the action-related arguments (Hu ¶ 0060 teaches [t]he agent may select one action from a set of available actions (whereby the agent can perform the selected action based on the action-related arguments), which results in a new state and a new reward for a subsequent time step. The goal of the agent is generally to collect the greatest amount of rewards possible (output network is arranged to generate one or more action-related arguments)).
	Battaglia, Vinyals, Choi and Hu are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Hu teaches interaction-aware decision making may include generating a multi-goal, multi-agent, multi-stage, interaction-aware decision making network policy based on the first agent neural network and the second agent neural network. Thus, it would have been obvious to one of ordinary skill in the art as of the effective filing date of applicant’s invention to modify the combination of Battaglia, Vinyals, and Choi pertaining to an interaction network of an image environment having a plurality of cells with the multi-goal, multi-agent, multi-stage, interaction-aware decision making network policy of Hu.
The motivation for doing so is for a decentralized policy trained using a double critic, including a decentralized value function for learning local objectives and a centralized action-value function for learning cooperation, thereby enabling local objectives or goals to be considered while also considering cooperation between N number of agents by showing two equivalent views of the policy gradient and implementing the new actor-critic or agent-critic adaptation (Hu ¶ 0085).
Regarding claim 26, the combination of Battaglia, Vinyals, Choi, and Hu teaches all of the limitations of claim 25, as described above.
Battaglia teaches wherein the action-related arguments comprise respective values for each of plurality of locations in an array having the same number of dimensions as the environment (Battaglia, page 4, third full paragraph, teaches [w]e define O as a $D_S \times N_O$ matrix, whose columns correspond to the objects’ $D_S$-length state vectors. . . . For the graph in Fig. 1b [depicting an n-dimensional environment], . . . [t]he X is a $D_X \times N_R$ matrix (array having the same number of dimensions as the environment), whose columns represent the interaction terms, $b_k$, (comprise respective values for each of plurality of locations) for the $N_R$ relations . . . .).
10.	Claims 17 and 30 are rejected under 35 U.S.C. § 103 as being unpatentable over Battaglia et al., “Interaction Networks for Learning about Objects, Relations and Physics” (2016) [hereinafter Battaglia] in view of Vinyals et al., “StarCraft II: A new Challenge for Reinforcement Learning” (2017) [hereinafter Vinyals] and Per-Arne Andersen, “Deep Reinforcement Learning using Capsules in Advanced Game Environments,” University of Agder (Thesis, Jan 2018) [hereinafter Andersen].
Regarding claim 17, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 1, as described above.
However, the combination of Battaglia, Vinyals, and Choi does not explicitly teach -
wherein the output network comprises a max pooling layer.
But Andersen teaches -
wherein the output network comprises a max pooling layer (Andersen, page 14, 2.2.1 Pooling, second paragraph, teaches [t]here are several ways to perform pooling. Max and Average pooling are considered the most stable methods in whereas Max pooling is most used in state-of-the-art research).
Battaglia, Vinyals, Choi, and Andersen are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Andersen teaches a model-free RL technique for solving difficult game environments, extending Deep Q-learning to combine RL and ANNs (Artificial Neural Networks). Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Battaglia, Vinyals, and Choi, pertaining to an interaction network for an image of an environment having a plurality of cells, with the max pooling of Andersen.
The motivation for doing so is that pooling reduces the number of parameters to optimize, thus decreasing the computational requirements of the system (Andersen, page 15, 2.2.1 Pooling, second paragraph).
Regarding claim 30, the combination of Battaglia, Vinyals, and Choi teaches all of the limitations of claim 27, as described above. 
However, the combination of Battaglia, Vinyals, and Choi fails to explicitly teach -
wherein selecting the action to be performed comprises processing the respective final features using an output neural network comprising a max pooling layer for combining the respective final features for the plurality of cells.
But Andersen teaches -
wherein selecting the action to be performed comprises processing the respective final features using an output neural network comprising a max pooling layer for combining the respective final features for the plurality of cells (Andersen, page 14, 2.2.1 Pooling, second paragraph, teaches [t]here are several ways to perform pooling. Max and Average pooling are considered the most stable methods in whereas Max pooling is most used in state-of-the-art research).
Battaglia, Vinyals, Choi, and Andersen are from the same or similar field of endeavor. Battaglia teaches an interaction network that can reason about how objects in complex systems interact, supporting dynamical predictions, as well as inferences about the abstract properties of the system. Vinyals teaches a reinforcement learning environment having a multi-agent problem with multiple players interacting. Choi teaches applying, in a reinforcement learning environment, a multi-focus attention network (AM Net) that enhances the agent’s ability to attend to important entities by using multiple parallel attentions. Andersen teaches a model-free RL technique for solving difficult game environments, extending Deep Q-learning to combine RL and ANNs (Artificial Neural Networks). Thus, it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to modify Battaglia, Vinyals, and Choi, pertaining to an interaction network for an image of an environment having a plurality of cells, with the max pooling of Andersen.
The motivation for doing so is that pooling reduces the number of parameters to optimize, thus decreasing the computational requirements of the system (Andersen, page 15, 2.2.1 Pooling, second paragraph).
Response to Arguments
11.	Examiner has fully considered Applicant’s arguments; however, Applicant’s amendments to the claims have rendered Applicant’s arguments moot.
“Battaglia does not disclose or suggest an attention-based neural network, namely, a plurality of transform networks each "being arranged to: determine a respective attention weight between the cell and each of the plurality of cells using at least a query vector and a value vector derived from the input features for the cell; and generate, using the respective attention weights, respective modified features for the cell," as recited by the currently amended independent claims 1 and 31.” (Response at p. 10).
Examiner points out that Applicant appears to argue limitations that are not present in the claims, such as an “attention-based neural network.” The claims instead merely recite the general features of “attention weight” (see, e.g., claim 1), “attention block” (see, e.g., claim 3), etc., which are construed as “weights” and “blocks” pertaining to a general neural network because neither Applicant’s specification nor Applicant’s claims define these terms in a manner contrary to their plain and ordinary meaning. Applicant’s specification instead merely refers to a feature of a “neural network system” (see, e.g., PGPUB ¶¶ 0020, 0023-24, 0052).
Accordingly, Examiner submits that the reliance on Battaglia as teaching features of Applicant’s claimed invention is proper under the broadest reasonable interpretation (BRI) of Applicant’s claims.
Moreover, the rejections herein clearly set forth which claim limitations are taught by each of the prior art references, and the reasons why it would have been obvious to a person having ordinary skill in the art as of the effective filing date of the Applicant’s invention to combine their teachings. Also, Applicant has not explained why the cited prior art references cannot be combined in the manner set forth in the rejections.
Conclusion
13.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Marcel Binz, “Learning Goal-Directed Behaviour,” Degree Project in Computer Science and Eng’g (2017), teaches attention mechanisms in which the attention weights describe how much attention to pay to certain parts of a given input.
14.	Any inquiry concerning this communication or earlier communications from the Examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the Examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the Examiner by telephone are unsuccessful, the Examiner’s supervisor, KAKALI CHAKI can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business 
/K.L.S./
Examiner, Art Unit 2122

/BABOUCARR FAAL/Primary Examiner, Art Unit 2184