Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-13 and 25-31 are pending. Claims 1, 5, 8, 11, 25, and 29 are independent. Claims 14-24 are canceled.

Response to Amendment
This office action is responsive to the amendment filed on 01/05/2021. As directed by the amendment, claims 1-11, 25-26, and 29 are amended.

Response to Arguments
Applicant's arguments filed 01/05/2021 have been fully considered but they are not persuasive. Applicant argues on page 18 of the Arguments/Remarks that “claim 1 recites features where the output of a preceding layer is included in the subsequent overlaying layers. Specifically, the claimed "POS label embedding vectors" produced by "a POS label embedding layer" are used by "a chunk label embedding layer" and "a dependency parsing layer." The claimed "word embeddings" are used by the "POS label embedding layer," the "chunk label embedding layer," and the "dependency parsing layer." The structure in Goldberg appears to include "pos", "chunk", and "ccg" tasks. Id. Goldberg, however, does not teach or suggest that the output of the layer for the i.e. "pos" task is an input to the layers for both the "chunk" and "ccg" tasks.” Examiner respectfully disagrees. Goldberg discloses on page 233 that the architecture described .

Claim Objections
Claim 8 is objected to because of the following informalities: Claim 8, line 11, recites "chunk layer embedding layer"; Examiner believes the claim intends to recite "chunk label embedding layer" instead. Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-4 and 5-7 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claims do not fall within at least one of the four categories of patent eligible subject matter because they are directed toward a “neural network” and a “dependency parsing layer of a neural network” without expressly reciting any hardware components or physical structure. From the description in the specifications and under broadest reasonable interpretation of the

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-13 and 25-31 are rejected under 35 U.S.C. 103 as being unpatentable over Goldberg, et al. ("Deep multi-task learning with low level tasks supervised at lower layers", hereafter "Goldberg") in view of Zhang, et al. ("Stack-propagation: Improved Representation Learning for Syntax", hereafter "Zhang") and Collobert, et al. (US 2011/0301942, hereafter "Collobert").

Regarding Claim 1
Goldberg discloses: A neural network that processes words in an input sentence ([Page 232-233] “in sequence tagging, the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”… “We assume T different training set, D1, · · · , DT , where each Dt contains pairs of input-output sequences (w1:n, yt1:n ), wi ∈ V , yt i ∈ Lt. The input vocabulary V is shared across tasks, but the output vocabularies (tagset) Lt  are task dependent.”), the neural network comprising:  
a part-of-speech (POS) label embedding layer that produces POS label embeddings from word embeddings generated from the words in the input sentence ([Page 233] “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”);
a chunk label embedding layer overlaying the POS label embedding layer, the chunk label embedding layer produces chunk label embeddings and chunk state vectors from the POS label embeddings and the word embeddings ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”);
a dependency parsing layer overlaying the chunk label embedding layer ([Page 234] “The method is effective only when the lower-level POS supervision is applied at the lower layer, supporting the importance of supervising different tasks at different layers.”), the dependency parsing layer including:
a bi-directional long-short term memory (LSTM) that processes the word embeddings, the POS label embeddings, the chunk label embeddings and the parent label state vectors ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”); 
Goldberg does not explicitly disclose: an attention encoder that: produces parent label probability mass vectors from the parent label state vectors; and produces parent label embedding vectors from the parent label probability mass vectors; and a dependency relationship label classifier that: exponentially normalizes the parent label state vectors and the parent label embedding vectors to produce dependency relationship label probability mass vectors; produces dependency relationship label embedding vectors from the dependency relationship label probability mass vectors; and an output that outputs the dependency relationship label embedding vectors.
However, Zhang discloses in the same field of endeavor: an attention encoder that: produces parent label probability mass vectors from the parent label state vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”); and produces parent label embedding vectors from the parent label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”); and
a dependency relationship label classifier that: exponentially normalizes the parent label state vectors and the parent label embedding vectors to produce dependency relationship label probability mass vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”); and
produces dependency relationship label embedding vectors from the dependency relationship label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”); and  
an output that outputs the dependency relationship label embedding vectors ([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: An output processor.
However, Collobert discloses in the same field of endeavor: A processor ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for out putting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).
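
For illustration, the exponential normalization relied upon from Zhang above, $P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$, can be sketched in Python as follows. This is a minimal sketch only and is not code from Zhang, Goldberg, or the application; the function names and the toy values of h0, beta, and b are hypothetical, chosen only to show how a state vector is turned into a probability mass vector over labels.

```python
import math


def exponential_normalize(scores):
    """Softmax: turn raw scores into a probability mass vector."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(h0, beta, b):
    """Compute P(y) proportional to exp(beta_y . h0 + b_y) for every label y."""
    scores = [
        sum(w * x for w, x in zip(beta_y, h0)) + b_y
        for beta_y, b_y in zip(beta, b)
    ]
    return exponential_normalize(scores)


# Hypothetical toy values: a 2-dimensional state vector and three labels.
h0 = [1.0, 2.0]
beta = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]
b = [0.0, 0.0, 0.0]
probs = label_probabilities(h0, beta, b)
```

In this sketch, probs holds one probability per label and sums to one, which is the sense in which "probability mass vector" is read above.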

Regarding Claim 5
Goldberg discloses: A dependency parsing layer of a neural network system that processes words in an input sentence ([Page 232] “If we take the inputs x1:n to correspond to a sequence of sentence words w1, · · · , wn, we can think of vi = BIRNN(x1:n, i) as inducing an infinite window around a focus word wi.”);
Serial No. 15/421,424
the dependency parsing layer overlies a chunk label embedding layer that produces chunk label embeddings and chunk state vectors ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”) from POS label embeddings and word embeddings of the words in the input sentence ([Page 233] “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”);
the chunk label embedding layer, in turn, overlies a part-of-speech (POS) label embedding layer ([Page 233 and ] “We do MTL training for either (POS+chunking) or (POS+CCG), with POS being the lower-level task.”), the POS label embedding layer that produces the POS label embeddings and the POS state vectors from the word embeddings ([Page 233 and ] “We do MTL training for either (POS+chunking) or (POS+CCG), with POS being the lower-level task.”);
the dependency parsing layer including a dependency parent layer and a dependency relationship label classifier ([Page 234] “The method is effective only when the lower-level POS supervision is applied at the lower layer, supporting the importance of supervising different tasks at different layers.”), wherein the dependency parent layer includes:
a dependency parent analyzer, implemented as a bi-directional long-short term memory (LSTM) ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”), that processes the words in the input sentences, including processing, for each word, the word embeddings, the POS label embeddings, the chunk label embeddings, and the chunk state vector to accumulate forward and backward state vectors ([Page 232] Goldberg states “A bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is composed of two RNNs, RNNF and RNNR, one reading the sequence in its regular order, and the other reading it in reverse.” When describing bi-RNNs, Goldberg further states “the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”); and
an attention encoder that: processes the forward and backward state vectors for each respective word in the input sentence, and encodes attention as inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward state vectors for the word or the other words prior to the inner products ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^{(t)} v + b^{(t)})[i], where the weights matrix W^{(t)} and bias vector b^{(t)} are task-specific parameters.”);
Goldberg does not explicitly disclose: applies exponential normalization to vectors of the inner products to produce parent label probability mass vectors and projects the parent label probability mass vectors to produce parent label embedding vectors; and wherein the dependency relationship label classifier, for each respective word in the input sentence: processes the forward and backward state vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors; and projects the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors; and an output that outputs at least the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors.
However, Zhang discloses in the same field of endeavor: applies exponential normalization to vectors of the inner products to produce parent label probability mass vectors and projects the parent label probability mass vectors to produce parent label embedding vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”); and
wherein the dependency relationship label classifier, for each respective word in the input sentence: processes the forward and backward state vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors ([Section 2.1 The tagger Network, Section 2, and Figure 2-3] “We use two such networks in this work: a window-based version for tagging and a transition-based version for dependency parsing.”); and projects the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors ([Section 6 and Figure 2-4] “We present a stacking neural network model for dependency parsing and tagging.”); and
an output that outputs at least the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors ([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: an output processor.
However, Collobert discloses in the same field of endeavor: an output processor ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for out putting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).
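
For illustration, the attention limitation recited in claim 5 (inner products between each word and the other words after a linear transform, followed by exponential normalization and projection) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the names parent_probability_mass and project, the matrix W, the choice to project onto the state vectors themselves, and all toy values are hypothetical.

```python
import math


def softmax(scores):
    """Exponentially normalize raw scores into a probability mass vector."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Apply a linear transform (matrix times vector)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def parent_probability_mass(states, W):
    """For each word t, score each candidate parent j as h_t . (W h_j),
    then exponentially normalize the scores over j."""
    transformed = [matvec(W, h) for h in states]
    return [softmax([dot(h_t, wh_j) for wh_j in transformed]) for h_t in states]


def project(prob_mass, embedding_rows):
    """Project a probability mass vector to an embedding as a weighted sum."""
    dim = len(embedding_rows[0])
    return [
        sum(p * row[d] for p, row in zip(prob_mass, embedding_rows))
        for d in range(dim)
    ]


# Hypothetical toy input: three words, 2-dimensional state vectors.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity transform, assumed for simplicity
mass = parent_probability_mass(states, W)
parent_embeddings = [project(p, states) for p in mass]
```

Each row of mass is a distribution over candidate parent words for one word, and each entry of parent_embeddings is the corresponding weighted-sum embedding under the assumptions stated above.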

Regarding Claim 8
Goldberg discloses: A method for parsing words in an input sentence using a neural network ([Page 232-233] “in sequence tagging, the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”… “We assume T different training set, D1, · · · , DT , where each Dt contains pairs of input-output sequences (w1:n, yt1:n ), wi ∈ V , yt i ∈ Lt. The input vocabulary V is shared across tasks, but the output vocabularies (tagset) Lt are task dependent.”), the method comprising:
producing at a part-of-speech (POS) label embedding layer POS label embeddings from word embeddings of the words in the input sentence ([Page 233] “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”);
producing at a chunk label embedding layer that overlies the POS label embedding layer, chunk label embeddings and chunk state vectors from the POS label embeddings and the word embeddings ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”);
accessing a dependency parsing layer that overlies a chunk layer embedding layer, dependency parsing layer including a dependency parent layer and a dependency relationship label classifier ([Page 234] “The method is effective only when the lower-level POS supervision is applied at the lower layer, supporting the importance of supervising different tasks at different layers.”);
processing, at the dependency parent layer that includes a bi-directional long-short term memory (LSTM) and one or more classifiers, the word embeddings, the POS label embeddings, the chunk label embeddings and the chunk state vectors, to produce parent label state vectors ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”);
Goldberg does not explicitly disclose: classifying and exponentially normalizing the parent label state vectors to produce parent label probability mass vectors; producing, at the dependency parent layer, parent label embedding vectors from the parent label probability mass vectors; producing, at the dependency relationship label classifier, dependency relationship label probability mass vectors by classifying and exponentially normalizing the parent label state vectors and the parent label embedding vectors; and producing, at the dependency relationship label classifier, dependency relationship label embedding vectors from the dependency relationship label probability mass vectors; and providing at least the dependency relationship label embedding vectors or dependency relationship labels based thereon.
However, Zhang discloses in the same field of endeavor: classifying and exponentially normalizing ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”) the parent label state vectors to produce parent label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”);
producing, at the dependency parent layer, parent label embedding vectors from the parent label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”);
producing, at the dependency relationship label classifier, dependency relationship label probability mass vectors by classifying and exponentially normalizing the parent label state vectors and the parent label embedding vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”); and
producing, at the dependency relationship label classifier, dependency relationship label embedding vectors from the dependency relationship label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”); and
providing at least the dependency relationship label embedding vectors or dependency relationship labels based thereon ([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: A device.
However, Collobert discloses in the same field of endeavor: A device ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for out putting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).

Regarding Claim 11
Goldberg discloses: A method of dependency parsing using a neural network that processes words in an input sentence ([Page 232-233] “in sequence tagging, the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”… “We assume T different training set, D1, · · · , DT , where each Dt contains pairs of input-output sequences (w1:n, yt1:n ), wi ∈ V , yt i ∈ Lt. The input vocabulary V is shared across tasks, but the output vocabularies (tagset) Lt are task dependent.”);
producing, at a part-of-speech (POS) label embedding layer POS label embeddings from word embeddings of the words in the input sentence ([Page 233] “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”);
producing, at a chunk label embedding layer that overlies the POS label embedding layer, chunk label embeddings from the POS label embeddings and the word embeddings ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”);
accessing the dependency parsing layer that overlays the chunk label embedding layer, the dependency parsing layer including a dependency parent layer and a dependency relationship label classifier ([Page 234] “The method is effective only when the lower-level POS supervision is applied at the lower layer, supporting the importance of supervising different tasks at different layers.”);
processing, using a bi-directional long-short term memory (LSTM) of a dependency parent analyzer in the dependency parent layer the words in the input sentences ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”), including processing, for each word, the word embeddings, the POS label embeddings, the chunk label embeddings, and the chunk state vector to accumulate forward and backward state vectors that represent forward and backward progressions of interactions among the words in the input sentence ([Page 232] Goldberg states “A bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is composed of two RNNs, RNNF and RNNR, one reading the sequence in its regular order, and the other reading it in reverse.” When describing bi-RNNs, Goldberg further states “the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”); and
processing, in an attention encoder of the dependency parent layer the forward and backward state vectors for each respective word in the input sentence ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^{(t)} v + b^{(t)})[i], where the weights matrix W^{(t)} and bias vector b^{(t)} are task-specific parameters.”);
encoding, in the attention encoder, attention as inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward state vectors for the word or the other words prior to the inner products ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^{(t)} v + b^{(t)})[i], where the weights matrix W^{(t)} and bias vector b^{(t)} are task-specific parameters.”);
Goldberg does not explicitly disclose: applying exponential normalization to vectors of the inner products to produce parent label probability mass vectors; projecting the parent label probability mass vectors to produce parent label embedding vectors; in the dependency relationship label classifier, for each respective word in the input sentence: classifying and normalizing the forward and backward state vectors and the parent label embedding vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors; projecting the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors; and outputting the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors.
However, Zhang discloses in the same field of endeavor: applying exponential normalization to vectors of the inner products to produce parent label probability mass vectors; projecting the parent label probability mass vectors to produce parent label embedding vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”);
in the dependency relationship label classifier, for each respective word in the input sentence: classifying and normalizing the forward and backward state vectors and the parent label embedding vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors ([Section 2.1 The tagger Network, Section 2, and Figure 2-3] “We use two such networks in this work: a window-based version for tagging and a transition-based version for dependency parsing.”);
projecting the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors ([Section 6 and Figure 2-4] “We present a stacking neural network model for dependency parsing and tagging.”); and
outputting the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors ([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: an output processor.
However, Collobert discloses in the same field of endeavor: an output processor ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for out putting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).

Regarding Claim 25
Goldberg discloses: A non-transitory machine-readable medium having stored thereon instructions for performing a method for processing words in an input sentence ([Page 232-233] “in sequence tagging, the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”… “We assume T different training set, D1, · · · , DT , where each Dt contains pairs of input-output sequences (w1:n, yt1:n ), wi ∈ V , yt i ∈ Lt. The input vocabulary V is shared across tasks, but the output vocabularies (tagset) Lt are task dependent.”), the method comprising machine executable code which when executed by at least one machine, causes the machine to:
produce POS label embedding vectors in a part-of-speech (POS) label embedding layer from word embeddings of words in an input sentence ([Page 233] “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”);
produce chunk label embeddings and chunk state vectors in a chunk label embedding layer overlaying the POS label embedding layer from the POS label embeddings and the word embeddings ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”);
dependency parsing layer overlaying a chunk label embeddings layer, the dependency parsing layer including a dependency parent layer and a dependency relationship label classifier, the dependency parent layer including a bi-directional long-short term memory (LSTM) and one or more classifiers ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”);
process, in the dependency parent layer, word embeddings, the POS label embeddings, the chunk label embeddings and the chunk state vectors ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”),
Goldberg does not explicitly disclose: to produce parent label probability mass vectors by classification and exponential normalization of parent label state vectors produced by the bi-directional LSTM; produce, in the dependency parent layer, parent label embedding vectors from the parent label probability mass vectors; exponentially normalize, in the dependency relationship label classifier, the parent label state vectors and the parent label embedding vectors; produce, in the dependency relationship label classifier, dependency relationship label embedding vectors from the parent label probability mass vectors.
However, Zhang discloses in the same field of endeavor: to produce parent label probability mass vectors by classification and exponential normalization of parent label state vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”) produced by the bi-directional LSTM ([Page 2] “our architecture outperforms recurrent approaches that build custom word representations using character-based LSTMs”); produce, in the dependency parent layer, parent label embedding vectors from the parent label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”); exponentially normalize ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”), in the dependency relationship label classifier, the parent label state vectors and the parent label embedding vectors ([Section 2.1 The tagger Network, Section 2, and Figure 2-3] “We use two such networks in this work: a window-based version for tagging and a transition-based version for dependency parsing.”); produce, in the dependency relationship label classifier, dependency relationship label embedding vectors from the parent label probability mass vectors ([Section 2.1 The tagger Network and Figure 3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag”)
([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: A machine.
However, Collobert discloses in the same field of endeavor: a machine ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for outputting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).
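
For illustration only, the exponential normalization quoted from Zhang above, in which P(y) is proportional to exp{β_y^T h_0 + b_y}, can be sketched as follows. This is a minimal sketch and is not taken from the application or from any cited reference; the function name `label_probabilities` and all numeric values are hypothetical.

```python
import math

def label_probabilities(h0, beta, b):
    """Return P(y) for every label y, where P(y) is proportional to
    exp(beta[y] . h0 + b[y]) (exponential normalization, i.e. softmax)."""
    scores = [sum(w * x for w, x in zip(beta_y, h0)) + b_y
              for beta_y, b_y in zip(beta, b)]
    top = max(scores)  # shift by the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: a 3-dimensional state vector and 2 labels.
h0 = [0.5, -1.0, 2.0]
beta = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b = [0.0, 0.1]
probs = label_probabilities(h0, beta, b)
```

The returned list is a probability mass vector: its entries are non-negative and sum to one, with the largest mass on the label whose linear score is highest.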

Regarding Claim 29
Goldberg discloses: A non-transitory machine-readable medium having stored thereon instructions for performing a method for processing words in an input sentence ([Page 232-233] “in sequence tagging, the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”… “We assume T different training set, D1, · · · , DT , where each Dt contains pairs of input-output sequences (w1:n, yt1:n ), wi ∈ V , yt i ∈ Lt. The input vocabulary V is shared across tasks, but the output vocabularies (tagset) Lt  are task dependent.”), the method comprising machine executable code which when executed by at least one machine, causes the machine to: produce POS label embeddings and POS state vectors in a part-of-speech (POS) label embedding layer from word embeddings generated from words in an input sequence ([Page 233], “we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)})”); produce chunk label embeddings and chunk state vectors from the POS label embeddings and the word embeddings in a chunk label embedding layer of a neural network that overlays the POS label embedding layer ([Page 233], Goldberg “Instead of conditioning all tasks on the outermost bi-RNN layer, we associate an RNN level l(t) with each task t, and let the task specific classifier feed from that layer, e.g., pos_tag(w1:n, i) = f_pos(v_i^{l(pos)}) This enables a hierarchy a task with cascaded predictions, as well as deep task-specific learning for high-level tasks.” … “chunk_tag(w1:n, i) = f_chunk(v_i^{l(chunk)})”); access a dependency parsing layer overlaying the chunk label embedding layer and including a dependency parent layer and a dependency relationship label classifier, the dependency parent layer including a dependency parent analyzer, implemented as a bi-directional long-short term memory (LSTM) ([Page 232] “We use a specific flavor of Recurrent Neural Networks (RNNs) (Elman, 1990) called long short-term memory networks (LSTMs)”) and an attention encoder; process, in the dependency parent analyzer, the words in the input sentences, including processing, for each word, the word embeddings, the POS label embeddings, the chunk label embeddings, and the chunk state vector to accumulate forward and backward state vectors that represent forward and backward progressions of interactions among the words in the input sentence ([Page 232], Goldberg states “A bidirectional RNN (Schuster and Paliwal, 1997; Irsoy and Cardie, 2014) is composed of two RNNs, RNNF and RNNR, one reading the sequence in its regular order, and the other reading it in reverse.” When describing bi-RNNs, Goldberg further states “the input may be the words in the sentence, and the different tasks can be POS-tagging, named entity recognition, syntactic chunking, or CCG supertagging.”); process, in the attention encoder, the forward and backward state vectors for each respective word in the input sentence, and encode attention as inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward state vectors for the word or the other words prior to the inner product ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^(t) v + b^(t))[i], where the weights matrix W^(t) and bias vector b^(t) are task-specific parameters.”);
Goldberg does not explicitly disclose: apply, in the attention encoder, exponential normalization to vectors of the inner products to produce parent label probability mass vectors and project the parent label probability mass vectors to produce parent label embedding vectors; in the dependency relationship label classifier, for each respective word in the input sentence: process the forward and backward state  vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors; and project the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors; and output, in an output processor, at least results reflecting classification labels for a dependency relationship of each word, the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors.
However, Zhang discloses in the same field of endeavor: apply, in the attention encoder, exponential normalization to vectors of the inner products to produce parent label probability mass vectors and project the parent label probability mass vectors to produce parent label embedding vectors ([Fig 3 and Section 2 Continuous Stacking Model] “$P(y) \propto \exp\{\beta_y^{T} h_0 + b_y\}$”); in the dependency relationship label classifier, for each respective word in the input sentence: process the forward and backward state vectors and the parent label embedding vectors, to produce dependency relationship label probability mass vectors ([Section 2.1 The tagger Network, Section 2, and Figure 2-3] “We use two such networks in this work: a window-based version for tagging and a transition-based version for dependency parsing.”); and project the dependency relationship label probability mass vectors to produce dependency relationship label embedding vectors ([Section 6 and Figure 2-4] “We present a stacking neural network model for dependency parsing and tagging.”); and output, in an output processor, at least results reflecting classification labels for a dependency relationship of each word, the dependency relationship label probability mass vectors, or the dependency relationship label embedding vectors ([Section 2.1 and Figure 2-3] “A final softmax layer reads in the activations and outputs probabilities for each possible POS tag.”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg and Zhang. Doing so can apply dependency parsing and tagging (Abstract, Zhang).
Goldberg in view of Zhang does not explicitly disclose: A device and output processor.
However, Collobert discloses in the same field of endeavor: a device and output processor ([Para 0049] “The computer system 600 includes at least one CPU 620… an output 660 for outputting data”).
It would have been obvious to one of skill in the art at the time of filing to combine Goldberg, Zhang, and Collobert. Doing so can provide a computer system for implementing the method (Para 0015, Collobert).
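
For illustration only, the attention encoder limitation recited in claim 29 (inner products between each word and the other words, a linear transform applied before the inner product, exponential normalization, and projection to embedding vectors) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the names `parent_attention`, `states`, and `W` are hypothetical, and the projection step is assumed here to be a probability-weighted sum of the state vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    top = max(scores)  # shift by the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parent_attention(states, W):
    """states: one state vector per word (forward and backward halves concatenated).
    W: square matrix applied to the other words' states before the inner product."""
    n = len(states)
    transformed = [[dot(row, h) for row in W] for h in states]
    mass_vectors, embeddings = [], []
    for t in range(n):
        # Score word t against every other word by inner product.
        others = [j for j in range(n) if j != t]
        p_others = softmax([dot(states[t], transformed[j]) for j in others])
        mass = [0.0] * n  # a word is never scored against itself
        for j, p in zip(others, p_others):
            mass[j] = p
        mass_vectors.append(mass)
        # Assumed projection: probability-weighted sum of the state vectors.
        dim = len(states[t])
        embeddings.append([sum(mass[j] * states[j][k] for j in range(n))
                           for k in range(dim)])
    return mass_vectors, embeddings

# Hypothetical toy input: three words with 2-dimensional states, identity transform.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
mass_vectors, embeddings = parent_attention(states, W)
```

Each row of `mass_vectors` sums to one over the candidate parent words, and each entry of `embeddings` has the same dimensionality as a state vector.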

Regarding Claim 2
Goldberg in view of Zhang and Collobert discloses: The neural network of claim 1: wherein the parent label state vectors produced by the bi-directional LSTM are forward and backward parent label state vectors for each respective word in the input sentence, which represent forward and backward progressions of interactions among the words in the input sentence from which the parent label probability mass vectors are produced ([Page 232], Goldberg “If we take the inputs x1:n to correspond to a sequence of sentence words w1, · · · , wn, we can think of vi = BIRNN(x1:n, i) as inducing an infinite window around a focus word wi . We can then use vi as an input to a multiclass classification function f(vi), to assign a tag yˆi to each input location i.”); and wherein an attention encoder processes the forward and backward parent label state vectors for each respective word in the input sentence, encodes attention as vectors of inner products between each respective word and other words in the input sentence, with a linear transform applied to the forward and backward parent label state vectors for the word or the other words, and produces the parent label embedding vectors from the encoded attention vectors ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^(t) v + b^(t))[i], where the weights matrix W^(t) and bias vector b^(t) are task-specific parameters.”).
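
For illustration only, the forward and backward state vectors discussed above, with v_i = BIRNN(x_1:n, i), can be sketched with a simplified recurrence. This is a minimal sketch and is not taken from the application or from any cited reference: it substitutes a plain tanh recurrence with scalar weights for an LSTM, and the names `run_rnn`, `bi_rnn`, `w_in`, and `w_rec` are hypothetical.

```python
import math

def run_rnn(xs, w_in=0.5, w_rec=0.5):
    """Toy recurrence h_i = tanh(w_in * x_i + w_rec * h_{i-1}), applied elementwise.
    A simplification standing in for an LSTM."""
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
        out.append(h)
    return out

def bi_rnn(xs):
    """Concatenate, for each position, the state of a left-to-right pass
    with the state of a right-to-left pass over the same sequence."""
    fwd = run_rnn(xs)
    bwd = run_rnn(xs[::-1])[::-1]  # read in reverse, then realign to positions
    return [f + b for f, b in zip(fwd, bwd)]

# Hypothetical toy input: three words with 1-dimensional embeddings.
xs = [[1.0], [0.0], [-1.0]]
v = bi_rnn(xs)
```

Each `v[i]` holds a forward half summarizing the words up to position i and a backward half summarizing the words from position i onward.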

Regarding Claim 3
Goldberg in view of Zhang and Collobert discloses: The neural network of claim 2, wherein the linear transform is trainable during training of the dependency parent layer and the dependency relationship label classifier ([Page 233], Goldberg “The network is trained using back-propagation and SGD with batch-sizes of size 1, with the default learning rate.”).

Regarding Claim 4
Goldberg in view of Zhang and Collobert discloses: The neural network of claim 2, wherein a number of available analytical framework labels, over which the parent label probability mass vectors are calculated, is one-fifth or less a dimensionality of the forward and backward parent label state vectors ([Page 234], Goldberg “Our results are significantly better (p < 0.05) than our baseline, and POS supervision at the lower layer is consistently better than standard MTL.”), thereby forming a dimensionality bottleneck that reduces overfitting when training a neural network stack of the bi-directional LSTMs ([Section 4], Zhang “we used section 24 to tune any hyperparameters of the model to avoid overfitting to the development set.”).
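
For illustration only, the "one-fifth or less" relationship recited in claim 4 is simple arithmetic on two sizes. The sketch below is not taken from the application or from any cited reference; the function name `forms_bottleneck` and the example sizes are hypothetical.

```python
def forms_bottleneck(num_labels, state_dim):
    """True when the label count is one-fifth or less of the state dimensionality."""
    return 5 * num_labels <= state_dim

# Hypothetical sizes, not taken from the application or the references.
examples = [(20, 100), (45, 100)]
results = [forms_bottleneck(n, d) for n, d in examples]
```

With 100-dimensional state vectors, 20 labels satisfy the stated ratio and 45 labels do not.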

Regarding Claim 6
Goldberg in view of Zhang and Collobert discloses: The dependency parsing layer of the neural network system of claim 5, wherein the linear transform applied prior to the inner product is trainable during training of the dependency parent layer and the dependency relationship label classifier ([Page 233], Goldberg “We use CNN’s LSTM implementation as our RNN variant. The classifiers f_t() take the form of a linear transformation followed by a softmax f_t(v) = arg max_i softmax(W^(t) v + b^(t))[i], where the weights matrix W^(t) and bias vector b^(t) are task-specific parameters.”).

Regarding Claim 7
Goldberg in view of Zhang and Collobert discloses: The dependency parsing layer of a neural network system of claim 5, wherein a number of available analytical framework labels, over which the dependency relationship label probability mass vectors are calculated, is one-fifth or less a dimensionality of the forward and backward state vectors ([Page 234], Goldberg “Our results are significantly better (p < 0.05) than our baseline, and POS supervision at the lower layer is consistently better than standard MTL.”), thereby forming a dimensionality bottleneck that reduces overfitting when training a neural network stack of the bi-directional LSTMs ([Section 4], Zhang “we used section 24 to tune any hyperparameters of the model to avoid overfitting to the development set.”).

Regarding Claim 9
(CLAIM 9 IS A METHOD CLAIM THAT CORRESPONDS TO DEVICE CLAIM 2 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 10
(CLAIM 10 IS A METHOD CLAIM THAT CORRESPONDS TO DEVICE CLAIM 4 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 12
(CLAIM 12 IS A METHOD CLAIM THAT CORRESPONDS TO DEVICE CLAIM 6 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 13
(CLAIM 13 IS A METHOD CLAIM THAT CORRESPONDS TO DEVICE CLAIM 7 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 26
(CLAIM 26 IS A NON-TRANSITORY MACHINE-READABLE MEDIUM CLAIM THAT CORRESPONDS TO DEVICE CLAIM 2 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 27
(CLAIM 27 IS A NON-TRANSITORY MACHINE-READABLE MEDIUM CLAIM THAT CORRESPONDS TO DEVICE CLAIM 3 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 28
(CLAIM 28 IS A NON-TRANSITORY MACHINE-READABLE MEDIUM CLAIM THAT CORRESPONDS TO DEVICE CLAIM 4 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 30
(CLAIM 30 IS A NON-TRANSITORY MACHINE-READABLE MEDIUM CLAIM THAT CORRESPONDS TO DEVICE CLAIM 6 AND IS REJECTED ON THE SAME GROUND)

Regarding Claim 31
(CLAIM 31 IS A NON-TRANSITORY MACHINE-READABLE MEDIUM CLAIM THAT CORRESPONDS TO DEVICE CLAIM 7 AND IS REJECTED ON THE SAME GROUND)

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TEWODROS E MENGISTU whose telephone number is (571)270-7714.  The examiner can normally be reached on Mon-Fri 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached on (571)272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-






/TEWODROS E MENGISTU/Examiner, Art Unit 2121



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121