DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Acknowledgement is made of Applicant's claim amendments filed on 2/2/2021. The claim amendments are entered. Claims 1-7, 21, and 23-34 are now pending. Claim 29 has been amended. Claims 8-20 and 22 remain cancelled. Claim 34 is newly added.

Response to Arguments
Applicant's arguments filed on 2/2/2021 have been fully considered but they are not persuasive.

Applicant argues that Weston does not apply because it allegedly does not teach that the weight matrices relate to an internal state and there is no similarity computation between an output of the network model and an internal state (Applicant’s reply pgs. 7-10). This is not persuasive. 
Weston discloses that the output component determines the output based on a current state of the memory network, as well as a relevancy computation ([0021] and [0041]). The current state of the memory comprises existing memory slots, wherein each memory slot is a vector ([0041]). The relevancy computation involves matching functions, i.e. a similarity, between the data ([0013]). Accordingly, Applicant’s argument that the output component in Weston does not 
In [0062], Weston describes an update to the memory network model wherein the computation involves looping over memories at various memory locations and times. The computations comprise argmax calculations with a relevancy score (So). As shown above and also in [0036], relevancy involves matching/similarity and the memory locations involve an internal state and memory vectors. 
Regarding the weight matrices, the matrices are computed based on an input feature vector and a dimensional feature space of the existing memory ([0056]). Wherein the weight matrices are part of the relevancy computations that involve current/existing memory slots and vectors ([0045]-[0046] and [0041]). Therefore, contrary to Applicant’s arguments, the weight matrices are related to an internal state of the memory.
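For illustration only, the relationship relied on above between a weight (embedding) matrix, the stored memory slots, and the loop over memories can be sketched as follows. The dot-product form of the score, the function names, and the toy values are assumptions made for this sketch; they are not quoted from Weston and do not limit the reading of the reference.

```python
def embed(U, features):
    """Project a length-D feature vector into n dimensions using the n x D matrix U."""
    return [sum(u * f for u, f in zip(row, features)) for row in U]


def relevancy(U, input_features, memory_features):
    """Assumed matching score: dot product of the two embedded vectors."""
    a = embed(U, input_features)
    b = embed(U, memory_features)
    return sum(p * q for p, q in zip(a, b))


def most_relevant_slot(U, input_features, memory_slots):
    """Loop over the memory slots, keeping the current winner at each step."""
    winner = 0
    for i in range(1, len(memory_slots)):
        if relevancy(U, input_features, memory_slots[i]) > relevancy(
            U, input_features, memory_slots[winner]
        ):
            winner = i
    return winner


# Toy values (hypothetical): n = 2, D = 3, three stored memory slots.
U = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
memory_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.0, 1.0, 0.0]
winner = most_relevant_slot(U, query, memory_slots)
print(winner)
```

In this sketch the score for each slot depends on both the weight matrix and the contents of the slot, so the weight matrix cannot be evaluated apart from the stored memory.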

Applicant also argues that Socher does not apply because it allegedly cannot be combined with Weston (Applicant’s reply pgs. 9-10). This is not persuasive. Socher describes a “dynamic memory network” (Socher title and abstract), while Weston describes a memory network that is dynamic, wherein the memory component of the network can be dynamically updated over time (Weston [0015]). Moreover, the dynamic memory in Socher, like Weston, can operate at a plurality of time steps. Thus, it is conceivable that Socher can operate in conjunction with Weston given that both relate to a dynamic memory network.
Furthermore, Applicant’s argument that Weston allegedly does not modify an initial state (Applicant’s reply pgs. 9-10) is unfounded because Weston teaches a continuous update of its internal memory components (Weston [0015]). Wherein the update can be performed by a memory update component (Weston [0019] and [0025]-[0029]).


Applicant also argues that Giles does not apply because it allegedly cannot be combined with Weston (Applicant’s reply pgs. 10-11). This is not persuasive. Giles teaches a higher order neural network, i.e. a network comprising internal representations with varying dimensional space that can incorporate learning when updating the neural network (Giles pg. 4972). Weston discloses a memory network that can possess varying dimensions for an internal space (Weston [0045] and [0077]), as well as updates for the neural network (Weston [0028]), wherein the updates can involve learning (Weston [0012] and [0014]). Thus, it is conceivable that Giles can operate in conjunction with Weston given that both relate to a neural network with varying dimensions and updates to the network that can involve learning.
Applicant also argues that Weston allegedly does not permit an invariance in the input data and therefore is incompatible with Giles (Applicant’s reply pgs. 10-11). This is not persuasive because Weston is not wholly bound to a particular input sequence order. Weston also allows for a stream/bag of words as an input, wherein such stream/bag of words are devoid of being in an organized or segmented order (Weston [0051]). Accordingly, an invariance in the input data is permitted in Weston, thereby refuting Applicant’s argument. 
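The general property relied on above, namely that a bag-of-words representation does not depend on word order, can be shown with a minimal sketch. The vocabulary, the example tokens, and the function name are hypothetical and are not taken from Weston; the sketch illustrates the property only, not Weston's specific implementation.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and two orderings of the same words.
vocabulary = ["john", "milk", "the", "took"]
ordered = bag_of_words(["john", "took", "the", "milk"], vocabulary)
shuffled = bag_of_words(["the", "milk", "john", "took"], vocabulary)
print(ordered == shuffled)
```

Both orderings produce the same count vector, which is the sense in which such an input representation is order invariant.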

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 7, 21, 23-25, and 30-34 are rejected under 35 U.S.C. 103 as being unpatentable over Weston et al. (U.S. Pat. App. Pre-Grant Pub. No. 2017/0103324, hereinafter Weston) in view of Giles et al., “Learning, invariance, and generalization in high-order neural networks” (hereinafter Giles) and Socher et al. (U.S. Pat. App. Pre-Grant Pub. No. 2016/0350653, hereinafter Socher).

Regarding claim 1, Weston teaches:
	A neural network system implemented by one or more computers, the neural network system comprising:
	a read neural network configured to ([0017]-[0018]: “memory network 200 includes … an input feature map component [i.e. a read neural network]”):
		receive an input set comprising a plurality of inputs ([0018]: “The memory network 200 can receive an incoming input 205 (noted as x), e.g., in form of a character, a word, a sentence, an image, an audio, a video, etc.” Similarly, see [0074]: “[t]he input or the response can include, e.g., a character, a word, a text, a sentence, an image, an audio, a video, a user interface instruction, a computer-generated action, etc.”), and
		process the input set to generate a set of memory vectors, each memory vector corresponding to a different input from the input set ([0018]-[0019]: “The input feature map component 220 can convert the incoming input 205 into an input feature vector 225 in an internal feature representation space, noted as I(x). The input feature vector 225 can be a sparse or dense feature vector, depending on the choice of the internal feature representation space. For textual inputs, the input feature map component 220 can further perform preprocessing (e.g., parsing, co-reference and entity resolution) on the textual inputs. Using the input feature vector 225, the memory update component 230 can update the memory component 210 by, e.g., compressing and generalizing the memory component 210 for some intended future use.” Wherein the memory component is denoted as “mi” ([0019]). See also [0077]: describing that, upon receiving an input, the memory network comprising an input feature map component “can convert the input using a mapping function” into an “internal feature space [with] a dimension of D”. 
	See also [0031]-[0033]: describing that “the input feature map component 420 converts the input 405 into an input feature vector x. The slot-choosing function returns the next empty memory slot N… [Wherein t]he memory update component 430 stores the input feature vector x into the next empty memory slot.”);
	a process neural network configured to maintain an internal state ([0017]: “memory network 200 includes … an output feature map component 240 [i.e. a process neural network]”. See also [0034]: describing the various duties of the output feature map component. See also [0019]-[0022]: describing a process for a “state of the memory component”, which is a part of the memory network, to generate an “internal feature representation space”.); and
	a write neural network configured to ([0017]: “The memory network 200 includes a … response component 250 [i.e. write neural network].”): 
	process an order-invariant numeric embedding to generate a neural network output for the input set ([0022]-[0023]: “The response component 250 converts (e.g., decodes) the output feature vector 245 into a response 290 [i.e. an output via] a desired response format, e.g., a textual response or an action: r =R(o). 
	In other words, the response component 250 produces the actual wording of the answer. In various embodiments, the response component 250 can include, e.g., a recurrent neural network (RNN) that is conditioned on the output of the output feature map component 240 to produce a sentence as the response 290.”); 
([0014]: describing that “[a] memory network 100 is an artificial neural network integrated with a long-term memory component”. Similarly, see Fig. 4 and [0031], [0033], and [0034]: showing various components of the memory network and further describing them.) to generate the order-invariant numeric embedding for the input set ([0045]: “U (referred to as “embedding matrix” or “weight matrix”) is a n×D matrix, where D is the number of features and n is an embedding dimension. The embedding dimension can be chosen based on a balance between computational cost and model accuracy. The mapping functions Φx and Φy map the original input text to an input feature vector in a D-dimensional feature space [enabling an order invariance of the input text]. The D-dimensional feature space can be, e.g., based on an ensemble of words that appear in the existing memory.” See also [0013]: describing that the “one or more supporting memory vectors that are most relevant to the input feature vector among the stored memory slots” can be determined by “the memory network model”. 
	See also [0066]-[0067]: describing the models and embedding space. See also [0081]-[0082]: describing “a weight matrix (also referred to as an “embedding matrix”)”.
	See also [0051]-[0052]: describing a segmentation component technique for when the input is a stream of words rather than a sentence, wherein “c is the sequence of input words representing a bag of words using a separate dictionary”. Whereby the bag of words can also represent a type of order invariance. Furthermore, the “segmentation component [technique] can be modeled similarly to the output feature map component [comprising the embedding matrix computations] and response component” ([0052]).) 
… based on a weighted sum of memory vectors from the set of memory vectors, wherein each ([0043]-[0050]: describing the relevancy calculations in correlation with the embedding/weight matrices, wherein the “relevancy scoring functions SO and SR can use different weight matrices UO and UR”. Then, a sum can be computed for the relevancy functions and embedding/weight matrices in correlation with the corresponding memory vectors ([0049]). 
	See also [0062]: describing an update to the memory network model so that “[w]hen selecting supporting memory slots, the arg max function is replaced by a loop over memories: i=1, . . . , N.  The [memory network] model keeps the winning memory (y or y′) at each step, and compares the current winner to the next memory mi.” Wherein the current or present winner can comprise an initial version.), 
	….

While Weston teaches the limitations of claim 1, Weston does not explicitly teach “wherein the order-invariant numeric embedding is permutation invariant to the inputs in the input set” on lines 18-19. Giles discloses the claim limitations, teaching: “In high-order [neural] networks it is possible to handcraft the units such that their output is invariant under the action of an arbitrary finite group of transformations on the input space [i.e. input set]…. This invariance is imposed by averaging the weight matrices over the group, thus eliminating the unit's ability to detect correlations which are incompatible with the imposed group invariance.” (Giles Section V). Wherein the computation for this handcrafted unit comprises a “second step [that] follows from the fact that a sum over g = g'h⁻¹ is equivalent to a sum over g', since multiplying all terms in a sum over a group by a member of the same group simply results in a permutation of the terms in the sum.” (Giles Section V). That is, the computations enable a permutation invariance of the input space.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the neural network in Weston to include the permutation invariance of Giles. Doing so would enable a determination of “what constraints must be placed on the W[eight] matrix in order to ensure that the [neural network] unit's output y is invariant under the transformation group T” and would allow these computational techniques to be applied not just to “translational invariance” issues, since they “can be generalized to implement any invariance under an arbitrary transformation group” (Giles Section V).
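The group-averaging technique quoted from Giles Section V can be sketched for the specific case of a second-order unit and the group of input permutations. The choice of the permutation group, the toy weight matrix, the omission of the unit's nonlinearity, and all names below are assumptions made for this illustration and are not taken from Giles.

```python
import math
from itertools import permutations


def second_order_output(W, x):
    """Second-order unit before any nonlinearity: sum over i, j of W[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def average_over_permutations(W):
    """Average the weight matrix over every permutation of its indices."""
    n = len(W)
    group = list(permutations(range(n)))
    avg = [[0.0] * n for _ in range(n)]
    for p in group:
        for i in range(n):
            for j in range(n):
                avg[i][j] += W[p[i]][p[j]] / len(group)
    return avg


# Toy weight matrix and input (hypothetical values).
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [4.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]
W_avg = average_over_permutations(W)

# Compare the output for every reordering of the input against the original order.
reference = second_order_output(W_avg, x)
invariant = all(
    math.isclose(second_order_output(W_avg, [x[k] for k in q]), reference, abs_tol=1e-9)
    for q in permutations(range(len(x)))
)
print(invariant)
```

With the unaveraged matrix W the output changes when two inputs are swapped, whereas with the averaged matrix it does not, which is the effect of imposing the invariance on the weights rather than on the input.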

While the cited references teach the limitations of claim 1, they do not explicitly teach: “by performing operations at each of a plurality of time steps, the operations at each time step including modifying an initial version of the internal state of the process neural network at the time step” on lines 12-14. Socher discloses the claim limitations, teaching: an episodic memory process using long short term memory (LSTM) with a plurality of “time step[s]” and a relevance coefficient (Socher [0070]-[0073], [0075], and [0076]), wherein the episodic memory iterates over the time steps in correlation with the sentences/words and memory sequences. That is, the “episodic memory module 120 iterates over representations of questions 942 and input facts 931-938 provided by question module 140 and input module 130, respectively, while updating its internal episodic memory” (Socher [0086]). Similarly, see Socher [0087] and [0089]-[0095]: describing computations involving time steps in the memory process, whereby some process involves attention mechanisms. 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the model in the cited references to include the computation in Socher. Doing so would enable a “dynamic memory network … [i.e., a] unified framework [that] reduces every task in natural language processing to a question answering problem over an input sequence. Inputs and questions are used to create and connect deep memory sequences. Answers are then generated based on dynamically retrieved memories.” (Socher Abstract).
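The kind of computation at issue in this limitation, an internal state modified at each of several time steps using a similarity-weighted sum of memory vectors, can be sketched generically. The softmax weighting, the additive state update (a placeholder standing in for an LSTM or episodic-memory update), and every name and value below are assumptions made for illustration; the sketch is not taken from Weston, Socher, or the application.

```python
import math


def softmax(scores):
    """Turn similarity scores into weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_readout(state, memory):
    """Weight each memory vector by its dot-product similarity to the state."""
    scores = [sum(s * m for s, m in zip(state, vec)) for vec in memory]
    weights = softmax(scores)
    return [sum(w * vec[k] for w, vec in zip(weights, memory))
            for k in range(len(state))]


def run_steps(memory, steps=3):
    """Modify the internal state at each time step using the weighted readout."""
    state = [0.0] * len(memory[0])
    for _ in range(steps):
        readout = weighted_readout(state, memory)
        state = [s + r for s, r in zip(state, readout)]  # placeholder update
    return state


# Toy memory vectors (hypothetical), processed in two different orders.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state_a = run_steps(memory)
state_b = run_steps(list(reversed(memory)))
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(state_a, state_b))
print(same)
```

Because the readout is a sum over the memory vectors, the final state in this sketch is the same regardless of the order in which the memory vectors are supplied.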

Regarding claim 2, Weston teaches:
	The neural network system of claim 1, wherein the process neural network comprises:
	a long short-term memory (LSTM) neural network ([0014]: “A memory network 100 is an artificial neural network integrated with a long-term memory component. The memory network 100 conducts logic reasoning using its inference component 120 combined with the long-term memory component 110 (also referred to as “memory component”).” Wherein “[t]he long-term memory component 110 acts as a knowledge base for the memory network 100 to make a predicted response (e.g., answer). The knowledge base is dynamic, meaning that the memory network 100 continues to update the long-term memory component 110 using additional inputs, e.g., over time.” ([0015]).) 
	


Regarding claim 7, Weston teaches:
The neural network system of claim 1, wherein the write neural network is a recurrent neural network configured to process the order-invariant numeric embedding to generate a sequence of neural network outputs ([0023]: “the response component 250 [i.e. write neural network] produces the actual wording of the answer. In various embodiments, the response component 250 can include, e.g., a recurrent neural network (RNN) that is conditioned on the output of the output feature map component 240 to produce a sentence as the response 290.” Wherein an output of the output feature map component can comprise an embedding matrix (see [0045], [0081], and [0082]).).

Regarding claim 21, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations, Socher further teaches:
The neural network system of claim 1, wherein the initial version of the internal state of the process neural network at each time step after the initial time step in the plurality of time steps is a version of the internal state of the process neural network that results from updating the modified version of the internal state of the process neural network from a preceding time step (Socher [0087]-[0092]: describing “updating internal episodic memory” that can occur at iterations and “time step[s]” of the memory.). 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the model in the cited references to include the updating in Socher. Doing so would enable a “dynamic memory network … [i.e., a] unified framework [that] reduces every task in natural language processing to a question answering problem over an input sequence. Inputs and questions are used to create and connect deep memory sequences. Answers are then generated based on dynamically retrieved memories.” (Socher Abstract).

Regarding claim 23, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “wherein the read neural network comprises a feedforward neural network”. Socher discloses the claim limitation, teaching: a “Feedforward Neural Network Language Model” for natural language processing (Socher [0053]). 
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the model in the cited references to include the feedforward neural network of Socher. Doing so would enable a “dynamic memory network … [i.e., a] unified framework [that] reduces every task in natural language processing to a question answering problem over an input sequence. Inputs and questions are used to create and connect deep memory sequences. Answers are then generated based on dynamically retrieved memories.” (Socher Abstract).

Regarding claim 24, Weston teaches:
A computer-implemented method, comprising: 
receiving, by a read neural network ([0017]-[0018]: “memory network 200 includes … an input feature map component [i.e. a read neural network]”), 
an input set comprising a plurality of inputs ([0018]: “The memory network 200 can receive an incoming input 205 (noted as x), e.g., in form of a character, a word, a sentence, an image, an audio, a video, etc.” Similarly, see [0074]: “[t]he input or the response can include, e.g., a character, a word, a text, a sentence, an image, an audio, a video, a user interface instruction, a computer-generated action, etc.”);
	processing, by the read neural network, the input set to generate a set of memory vectors, each memory vector corresponding to a different input from the input set ([0018]-[0019]: “The input feature map component 220 can convert the incoming input 205 into an input feature vector 225 in an internal feature representation space, noted as I(x). The input feature vector 225 can be a sparse or dense feature vector, depending on the choice of the internal feature representation space. For textual inputs, the input feature map component 220 can further perform preprocessing (e.g., parsing, co-reference and entity resolution) on the textual inputs. Using the input feature vector 225, the memory update component 230 can update the memory component 210 by, e.g., compressing and generalizing the memory component 210 for some intended future use.” Wherein the memory component is denoted as “mi” ([0019]). See also [0077]: describing that, upon receiving an input, the memory network comprising an input feature map component “can convert the input using a mapping function” into an “internal feature space [with] a dimension of D”. 
See also [0031]-[0033]: describing that “the input feature map component 420 converts the input 405 into an input feature vector x. The slot-choosing function returns the next empty memory slot N… [Wherein t]he memory update component 430 stores the input feature vector x into the next empty memory slot.”); 
providing a process neural network that maintains an internal state ([0017]: “memory network 200 includes … an output feature map component 240 [i.e. a process neural network]”. See also [0034]: describing the various duties of the output feature map component. See also [0019]-[0022]: describing a process for a “state of the memory component”, which is a part of the memory network, to generate an “internal feature representation space”.); 
	generating, by an auxiliary system (Fig. 4: showing various components of the memory network. Similarly, see [0031], [0033], [0034]: further describing the components in Fig. 4.), an order-invariant numeric embedding for the input set …, wherein generating the order- invariant numeric embedding comprises ([0045]: “U (referred to as “embedding matrix” or “weight matrix”) is a n×D matrix, where D is the number of features and n is an embedding dimension. The embedding dimension can be chosen based on a balance between computational cost and model accuracy. The mapping functions Φx and Φy map the original input text to an input feature vector in a D-dimensional feature space [enabling an order invariance of the input text]. The D-dimensional feature space can be, e.g., based on an ensemble of words that appear in the existing memory.” See also [0013]: describing that the “one or more supporting memory vectors that are most relevant to the input feature vector among the stored memory slots” can be determined by “the memory network model”. 
	See also [0066]-[0067]: describing the models and embedding space. See also [0081]-[0082]: describing “a weight matrix (also referred to as an “embedding matrix”)”.
See also [0051]-[0052]: describing a segmentation component technique for when the input is a stream of words rather than a sentence, wherein “c is the sequence of input words representing a bag of words using a separate dictionary”. Whereby the bag of words can also represent a type of order invariance. Furthermore, the “segmentation component [technique] can be modeled similarly to the output feature map component [comprising the embedding matrix computations] and response component” ([0052]).), …:
 (i) computing a weighted sum of memory vectors from the set of memory vectors ([0062]: describing an update to the memory network model so that “[w]hen selecting supporting memory slots, the arg max function is replaced by a loop over memories: i=1, . . . , N.  The [memory network] model keeps the winning memory (y or y′) at each step, and compares the current winner to the next memory mi.” Wherein the current or present winner can comprise an initial version.), 
wherein each memory vector is weighted based on a level of similarity between the memory vector and the initial version of the internal state of the process neural network ([0043]-[0050]: describing the relevancy calculations in correlation with the embedding/weight matrices, wherein the “relevancy scoring functions SO and SR can use different weight matrices UO and UR”. Then, a sum can be computed for the relevancy functions and embedding/weight matrices in correlation with the corresponding memory vectors ([0049]). 
See also [0062]: describing an update to the memory network model so that “[w]hen selecting supporting memory slots, the arg max function is replaced by a loop over memories: i=1, . . . , N.  The [memory network] model keeps the winning memory (y or y′) at each step, and compares the current winner to the next memory mi.” Wherein the current or present winner can comprise an initial version.); and 
(ii) …; and 
	processing, by a write neural network ([0017]: “The memory network 200 includes a … response component 250 [i.e. write neural network].”), the order-([0022]-[0023]: “The response component 250 converts (e.g., decodes) the output feature vector 245 into a response 290 [i.e. an output via] a desired response format, e.g., a textual response or an action: r =R(o). 
	In other words, the response component 250 produces the actual wording of the answer. In various embodiments, the response component 250 can include, e.g., a recurrent neural network (RNN) that is conditioned on the output of the output feature map component 240 to produce a sentence as the response 290.”).

While Weston teaches the limitations of claim 24, Weston does not explicitly teach “that is permutation invariant to the inputs in the input set” on lines 6-7. Giles discloses the claim limitations, teaching: “In high-order [neural] networks it is possible to handcraft the units such that their output is invariant under the action of an arbitrary finite group of transformations on the input space [i.e. input set]…. This invariance is imposed by averaging the weight matrices over the group, thus eliminating the unit's ability to detect correlations which are incompatible with the imposed group invariance.” (Giles Section V). Wherein the computation for this handcrafted unit comprises a “second step [that] follows from the fact that a sum over g = g'h⁻¹ is equivalent to a sum over g', since multiplying all terms in a sum over a group by a member of the same group simply results in a permutation of the terms in the sum.” (Giles Section V). That is, the computations enable a permutation invariance of the input space.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the neural network in Weston to include the permutation invariance of Giles. Doing so would enable a determination of “what constraints must be placed on the W[eight] matrix in order to ensure that the [neural network] unit's output y is invariant under the transformation group T” and would allow these computational techniques to be applied not just to “translational invariance” issues, since they “can be generalized to implement any invariance under an arbitrary transformation group” (Giles Section V).

While the cited references teach the limitations of claim 24, they do not explicitly teach: “at each of a plurality of time steps” on line 8 and “modifying the initial version of the internal state of the process neural network based on the weighted sum of memory vectors” on lines 13-14. Socher discloses the claim limitations, teaching: an episodic memory process using long short term memory (LSTM) with a plurality of “time step[s]” and a relevance coefficient with a weighted sum (Socher [0070]-[0073], [0075], and [0076]), wherein the episodic memory iterates over the time steps in correlation with the sentences/words and memory sequences. That is, the “episodic memory module 120 iterates over representations of questions 942 and input facts 931-938 provided by question module 140 and input module 130, respectively, while updating its internal episodic memory” (Socher [0086]). Similarly, see Socher [0087] and [0089]-[0095]: describing computations involving time steps in the memory process, whereby some process involves attention mechanisms.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the model in the cited references to include the computation in Socher. Doing so would enable a “dynamic memory network … [i.e., a] unified framework [that] reduces every task in natural language processing to a question answering problem over an input sequence. Inputs and questions are used to create and connect deep memory sequences. Answers are then generated based on dynamically retrieved memories.” (Socher Abstract).

Regarding claim 25, claim 25 is substantially similar to claim 2 and therefore is rejected on the same ground as claim 2. Claim 25 is a method claim that corresponds to system claim 2.

Regarding claim 30, claim 30 is substantially similar to claim 7 and therefore is rejected on the same ground as claim 7. Claim 30 is a method claim that corresponds to system claim 7.

Regarding claim 31, claim 31 is substantially similar to claim 21 and therefore is rejected on the same ground as claim 21. Claim 31 is a method claim that corresponds to system claim 21.

Regarding claim 32, Weston teaches:
The computer-implemented method of claim 24, wherein the auxiliary system is a part of the process neural network (Fig. 4: showing various internal components of the memory network. Similarly, see [0031], [0033] and [0034]: further describing the components in Fig. 4.) or is external to the process neural network.

Regarding claim 33, claim 33 is substantially similar to claim 23 and therefore is rejected on the same ground as claim 23. Claim 33 is a method claim that corresponds to system claim 23.

Regarding claim 34, the rejection of claim 1 is incorporated. Socher further teaches:
The neural network system of claim 1, wherein the initial version of the internal state of the process neural network at the time step is generated at the time step by updating the internal state of the process neural network independently of the set of memory vectors (Socher [0087]: describing updating the episodic memory, wherein the update can comprise a transitive inference of the data. Whereby the data can comprise memory vectors based on input data (Socher [0065] and [0067]) and the transitive inference can denote an independence since the vectors are being inferred (Socher [0066] and [0152]).).
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the system in the cited references to include the updating in Socher. A motivation to combine the cited references with Socher was previously given.

Claims 3-5 and 26-28 are rejected under 35 U.S.C. 103 as being unpatentable over Weston et al. (U.S. Pat. App. Pre-Grant Pub. No. 2017/0103324, hereinafter Weston), Giles et al., “Learning, invariance, and generalization in high-order neural networks” (hereinafter Giles), and Socher et al. (U.S. Pat. App. Pre-Grant Pub. No. 2016/0350653, hereinafter Socher) in view of Luong et al., “Effective Approaches to Attention-based Neural Machine Translation” (hereinafter Luong).

Regarding claim 3, Weston teaches:
	The neural network system of claim 1, wherein the order-invariant numeric embedding for the input set ([0045]: “U (referred to as “embedding matrix” or “weight matrix”) is a n×D matrix, where D is the number of features and n is an embedding dimension. The embedding dimension can be chosen based on a balance between computational cost and model accuracy. The mapping functions Φx and Φy map the original input text to an input feature vector in a D-dimensional feature space [enabling an order invariance of the input text]. The D-dimensional feature space can be, e.g., based on an ensemble of words that appear in the existing memory.” See also [0013]: describing that the “one or more supporting memory vectors that are most relevant to the input feature vector among the stored memory slots” can be determined by “the memory network model”. 
	See also [0066]-[0067]: describing the models and embedding space. See also [0081]-[0082]: describing “a weight matrix (also referred to as an “embedding matrix”)”.
See also [0051]-[0052]: describing a segmentation component technique for when the input is a stream of words rather than a sentence, wherein “c is the sequence of input words representing a bag of words using a separate dictionary”. Whereby the bag of words can represent a type of order invariance. Furthermore, the “segmentation component [technique] can be modeled similarly to the output feature map component [comprising the embedding matrix computations] and response component” ([0052]).)
….
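The order-invariance reading of the bag-of-words mapping cited above can be illustrated with a minimal sketch. The vocabulary, the values in U, and the helper names phi and embed are hypothetical and chosen only for illustration; they are not taken from Weston. The sketch assumes the mapping Φ produces a word-count vector in the D-dimensional feature space, which is then multiplied by an n×D embedding matrix U.

```python
# Hypothetical vocabulary; each word is one of D = 5 features.
VOCAB = {"john": 0, "went": 1, "to": 2, "kitchen": 3, "the": 4}

# Hypothetical n x D embedding matrix U with n = 2 (values are made up).
U = [
    [1.0, 0.0, 2.0, -1.0, 0.5],
    [0.0, 3.0, 1.0, 2.0, -0.5],
]

def phi(words):
    """Map a word sequence to a D-dimensional bag-of-words count vector."""
    counts = [0.0] * len(VOCAB)
    for w in words:
        counts[VOCAB[w]] += 1.0
    return counts

def embed(words):
    """Compute U * phi(words); only word counts matter, not word order."""
    x = phi(words)
    return [sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]

a = embed(["john", "went", "to", "the", "kitchen"])
b = embed(["kitchen", "the", "to", "went", "john"])
print(a == b)  # True: both orderings yield the same embedding
```

Because phi discards position, any permutation of the same words produces the same count vector and therefore the same embedding.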

While the cited references teach the limitations of claim 3, they do not explicitly teach “comprises a modified version of the internal state of the process neural network that results from operations performed at the last time step of the plurality of time steps”. Luong discloses the claim limitations, teaching: “Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions. We illustrate these two model types in Figure 2 and 3 respectively. Common to these two types of models is the fact that at each time step t in the decoding phase, both approaches first take as input the hidden state $h_t$ at the top layer of a stacking LSTM. The goal is then to derive a context vector $c_t$ that captures relevant source-side information to help predict the current target word $y_t$. While these models differ in how the context vector $c_t$ is derived, they share the same subsequent steps.” (Luong Section 3). Luong Section 3 further describes the computations involving a modification of the various states and memory vectors via a comparison of the states to derive an “attentional hidden state” vector. Luong Sections 3.1 and 3.2 further describe the “Global Attention” and “Local Attention” mechanisms, respectively.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the neural network in the cited references to include the operations in Luong. Doing so would enable the use of “two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT [workshop on machine translation] translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU [bilingual evaluation understudy] points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT [neural machine translation] and an n-gram reranker.” (Luong Abstract).

Regarding claim 4, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: 
“The neural network system of claim 1, wherein modifying the initial version of the internal state of the process neural network at each time step comprises: determining a respective similarity value for each of the memory vectors from the set of memory vectors, wherein the respective similarity value represents a similarity between the initial version of the internal state of the process neural network and the memory vector; generating, based on the respective similarity values, a respective attention weight for each of the memory vectors; generating a read vector by combining the memory vectors in accordance with the attention weights; and combining the initial version of the internal state of the process neural network and the read vector to generate the modified version of the internal state of the process neural network for the time step.” Luong discloses the claim limitations, teaching:
“The neural network system of claim 1, wherein modifying the initial version of the internal state of the process neural network at each time step comprises:
determining a respective similarity value for each of the memory vectors from the set of memory vectors, wherein the respective similarity value represents a similarity between the initial version of the internal state of the process neural network and the memory vector (Luong Section 3.1: “a variable-length alignment vector $a_t$, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state $h_t$ with each source hidden state $\bar{h}_s$”. Wherein the alignment vector $a_t$ can comprise a respective similarity, $h_t$ can comprise an initial internal state, and $\bar{h}_s$ can comprise a memory vector. See also equation (7) showing the formula. See also Fig. 2: showing the various memory vectors.
Similarly, see Section 3.2: describing the local attention mechanism.);
generating, based on the respective similarity values, a respective attention weight for each of the memory vectors (Luong Section 3.1: “Given the alignment vector as weights, the context vector $c_t$ is computed as the weighted average over all the source hidden states.” See also Fig. 2: “Global attentional model – at each time step t, the model infers a variable-length alignment weight vector $a_t$ based on the current target state $h_t$ and all source states $\bar{h}_s$.”
Similarly, see Section 3.2 and Fig. 3: “Local attention model – the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and those source states $\bar{h}_s$ in the window.”);
generating a read vector by combining the memory vectors in accordance with the attention weights (Luong Section 3.1: “Given the alignment vector as weights, the context vector $c_t$ is computed as the weighted average over all the source hidden states.” See also Fig. 2: “Global attentional model – at each time step t, the model infers a variable-length alignment weight vector $a_t$ based on the current target state $h_t$ and all source states $\bar{h}_s$. A global context vector $c_t$ is then computed as the weighted average, according to $a_t$, over all the source states.” Wherein the context vector can comprise a read vector.
Similarly, see Section 3.2 and Fig. 3: “Local attention model – the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and those source states $\bar{h}_s$ in the window.”); and
combining the initial version of the internal state of the process neural network and the read vector to generate the modified version of the internal state of the process neural network for the time step (Luong Section 3: “given the target hidden state $h_t$ and the source-side context vector $c_t$, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state $\tilde{h}_t$ …. The attentional vector $\tilde{h}_t$ is then fed through the softmax layer to produce the predictive distribution….” See equations 5 and 6, showing the formula for $\tilde{h}_t$ and the predictive distribution, respectively. Wherein $\tilde{h}_t$ can comprise a modified internal state.
Similarly, see Sections 3.1 and 3.2: further describing the $\tilde{h}_t$ computation in relation to the “Global Attention” and “Local Attention” mechanisms, respectively.
See also Section 3.2: stating that the present “computation path is simpler; we go from $h_t$ → $a_t$ → $c_t$ → $\tilde{h}_t$ then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2.”).”
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the neural network in the cited references to include the attention mechanism of Luong. Doing so would enable the use of “two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT [workshop on machine translation] translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU [bilingual evaluation understudy] points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT [neural machine translation] and an n-gram reranker.” (Luong Abstract).
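For illustration only, the four steps mapped above for claim 4 (similarity value, attention weight, read vector, combination) can be sketched as follows. This is a minimal sketch that assumes a dot-product similarity and a tanh concatenation layer in the manner of the global attention described in Luong Section 3; the function names, the matrix W_c values, and the example vectors are hypothetical and are not taken from the application or from Luong.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_t, memory, W_c):
    """One attention step over a set of memory vectors.

    h_t    : current (initial) state vector
    memory : list of memory vectors (the source hidden states)
    W_c    : hypothetical combination matrix of shape (out_dim, 2 * dim)
    """
    # Step 1: a similarity value for each memory vector (dot product assumed).
    scores = [dot(h_t, h_s) for h_s in memory]
    # Step 2: an attention weight for each memory vector.
    weights = softmax(scores)
    # Step 3: read (context) vector as the weighted average of the memory.
    dim = len(h_t)
    c_t = [sum(w * h_s[i] for w, h_s in zip(weights, memory)) for i in range(dim)]
    # Step 4: combine the state and read vector: tanh(W_c [c_t; h_t]).
    concat = c_t + h_t
    h_tilde = [math.tanh(dot(row, concat)) for row in W_c]
    return weights, c_t, h_tilde

# Made-up example values.
h_t = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_c = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
weights, c_t, h_tilde = attend(h_t, memory, W_c)
print(weights, c_t, h_tilde)
```

In this sketch the memory vector most aligned with h_t receives the largest weight, and h_tilde plays the role of the modified state.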

Regarding claim 5, the rejection of claim 4 is incorporated. While the cited references teach the claim limitations, they do not explicitly teach: “wherein determining the respective similarity value for each of the memory vectors comprises determining a dot product between the initial version of the internal state of the process neural network and the memory vector.” Luong discloses the claim limitations, teaching: a dot product calculation between $h_t$ and $\bar{h}_s$ (Luong Section 3.1). See also Sections 4.3 and 5.3, and Tables 3 and 4: describing the different machine translation experiments involving dot products.
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the neural network in the cited references to include the attention mechanism of Luong. Doing so would enable the use of “a better alignment function, [such as] the content-based dot product one, together with dropout yields another gain of +2.7 BLEU [for the German-English translation results].” (Luong Section 4.3). That is, a higher-quality translation, as measured by BLEU, was obtained.
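For reference, the content-based alignment functions of Luong Section 3.1, of which the dot product is one alternative, can be written as below. This is a restatement from recollection of the cited paper rather than a quotation; $W_a$ and $v_a$ denote learned parameters that are not otherwise discussed in this action, and $\bar{h}_s$ denotes a source hidden state.

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top} \tanh\!\left(W_a [h_t; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
\qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
```

The dot variant corresponds to the claimed dot product between the internal state and the memory vector, and the normalized $a_t(s)$ corresponds to the attention weight.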

Regarding claim 26, claim 26 is substantially similar to claim 3 and therefore is rejected on the same ground as claim 3. Claim 26 is a method claim that corresponds to system claim 3.

Regarding claim 27, claim 27 is substantially similar to claim 4 and therefore is rejected on the same ground as claim 4. Claim 27 is a method claim that corresponds to system claim 4.

Regarding claim 28, claim 28 is substantially similar to claim 5 and therefore is rejected on the same ground as claim 5. Claim 28 is a method claim that corresponds to system claim 5.

Claims 6 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Weston et al. (U.S. Pat. App. Pre-Grant Pub. No. 2017/0103324, hereinafter Weston), Giles et al., “Learning, invariance, and generalization in high-order neural networks” (hereinafter Giles), and Socher et al. (U.S. Pat. App. Pre-Grant Pub. No. 2016/0350653, hereinafter Socher) in view of Stewart et al., “Spaun: A Perception-Cognition-Action Model Using Spiking Neurons” (hereinafter Stewart).

Regarding claim 6, the rejection of claim 1 is incorporated. While the cited references teach the claim limitations and “wherein the write neural network is process the order-invariant numeric embedding”, they do not explicitly teach: “a pointer recurrent neural network configured to … generate a plurality of pointers to the inputs in the input set.” Stewart discloses the claim limitations, teaching: “The Semantic Pointer Architecture: Unified Network model consists of multiple modules, depicted in Figure 2. These modules are considered to be cortical and subcortical areas that implement different operations. All components consist of LIF neurons connected via synaptic weights (Eq. 3), but each area computes a different set of functions. We adapt Hinton's (2010) Deep Belief Network to use LIF spiking neurons (via the NEF [neural engineering framework]) and use it to compress a 28x28 image into a 50-dimensional vector we refer to as a semantic pointer: it is semantic because the high-level representation maintains similarity relationships from the image space [from the input set]; and it is a pointer because, as we will see, the system can recover the original information [of the input set] from the compressed form…. A third internal hierarchy (discussed in more detail below in the serial working memory section) forms a working memory capable of binding and unbinding arbitrary semantic pointers, providing the compositionality that is crucial for complex cognition.” (Stewart pg. 1019-1020). Wherein Spaun comprises five subsystems, with the “working memory subsystem includ[ing] eight distinct memory systems, each of which can store semantic pointers.” (Stewart pg. 1020).
Thus, it would have been obvious to a Person Having Ordinary Skill in the Art (PHOSITA) before the effective filing date (EFD) to modify the recurrent neural network in the cited references to include the pointer neural network of Stewart. Doing so would enable a “model [that consists] of 2.3 million spiking neurons whose neural properties, organization, and connectivity match that of the mammalian brain…. Tasks can be presented in any order, with no “rewiring” of the brain for each task. Instead, the model is capable of internal cognitive control (via the basal ganglia), selectively routing information throughout the [neural network] brain and recruiting different cortical components as needed for each task.” (Stewart Abstract).

Regarding claim 29, claim 29 is substantially similar to claim 6 and therefore is rejected on the same ground as claim 6. Claim 29 is a method claim that corresponds to system claim 6.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SELENE A HAEDI whose telephone number is (571)270-5762.  The examiner can normally be reached on M-F 11 AM - 7 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.H./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121