Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This office action is in response to the claims filed on 02/23/2018.
Claims 1-20 are presented for examination.
Information Disclosure Statement
The information disclosure statement (IDS) filed 04/04/2018 is in compliance with the provisions of 37 CFR 1.97 and 1.98. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 15-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a signal per se (media claims).
Claims 15-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Specifically, claims 15-19 are directed to “machine-readable media”. The broadest reasonable interpretation of “machine-readable media” covers transitory propagating signals, which are non-statutory. When the broadest reasonable interpretation of a claim covers a signal per se, the claim must be rejected under 35 U.S.C. 101 as covering non-statutory subject matter. See In re Nuijten, 500 F.3d 1346, 1356-57 (Fed. Cir. 2007) (transitory embodiments are not directed to statutory subject matter); MPEP 9th Ed., § 2106.I. To overcome this rejection, applicant should insert --non-transitory-- before “machine readable storage device”. Such an amendment is not considered new matter. See the “Subject Matter Eligibility of Computer Readable Media” memo dated January 26, 2010 (OG Cite: 1351 OG 212; OG Date: 23 Feb 2010).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-6, 8, 9, 11, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ying et al. (Pub. No. US 2003/0135356, hereinafter Ying) in view of Deng et al. (Pub. No. US 2012/0065976, hereinafter Deng) and further in view of Graves et al. (NPL: SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS, Department of Computer Science, University of Toronto, hereinafter Graves).
Regarding claim 1, Ying teaches modeling a distribution of the output-sequence probability with a set of artificial neural networks (Ying, [Par.0034], “FIG. 7 shows a block diagram of an RNN according to one embodiment. As described above, a sentence being detected is segmented into a plurality of words, each of the plurality of words associated with a POS tag (e.g., T1 T2 T3 . . . Tn). A purpose of the RNN is to predict whether there is a phrase break between each tag. Referring to FIG. 7, a tag sequence is generated from the words with tags, such as T1 T2 T3 . . . Tn. Initially, initial breaks B1 and B2 is assigned as TRUE, which indicates a break and a punctuation tag (e.g., T1 here) is assigned in front of the tag sequence. T2 and T3 represent the first and second tag of the tag sequence respectively. The RNN will detect whether there is a phrase break (e.g., B3) between tag T2 and T3. Typically, B1, T1, B2, T2, and T3 are inputted to the first to fifth inputs of the RNN respectively. Once all of the inputs (e.g., B1, T1, B2, T2, and T3) are fed into the RNN, the previously trained RNN will generate B3. A value of one indicate B3 is a phrase break and value of zero indicate B3 is not a phrase break.” Examiner’s note: Ying uses the RNN to generate the output-sequence probability (output sentence) based on the input speech.),
the set of artificial neural networks modeling the distributions of the output-segment probabilities with respective instances of a first recurrent neural network having an associated softmax layer (Ying, [Par.0034], “FIG. 7 shows a block diagram of an RNN according to one embodiment. As described above, a sentence being detected is segmented into a plurality of words, each of the plurality of words associated with a POS tag (e.g., T1 T2 T3 . . . Tn). A purpose of the RNN is to predict whether there is a phrase break between each tag. Referring to FIG. 7, a tag sequence is generated from the words with tags, such as T1 T2 T3 . . . Tn. Initially, initial breaks B1 and B2 is assigned as TRUE, which indicates a break and a punctuation tag (e.g., T1 here) is assigned in front of the tag sequence. T2 and T3 represent the first and second tag of the tag sequence respectively. The RNN will detect whether there is a phrase break (e.g., B3) between tag T2 and T3. Typically, B1, T1, B2, T2, and T3 are inputted to the first to fifth inputs of the RNN respectively. Once all of the inputs (e.g., B1, T1, B2, T2, and T3) are fed into the RNN, the previously trained RNN will generate B3. A value of one indicate B3 is a phrase break and value of zero indicate B3 is not a phrase break.” Examiner’s note: Ying uses the RNN to generate the output-sequence probability (output sentence) based on the input speech. For example, the RNN detects phrase breaks in the input speech, and a phrase break, which segments the sentence into a plurality of words, is considered an output-sequence segmentation.),
and using one or more hardware processors to train the set of artificial neural networks, wherein a dynamic programming algorithm is used to recursively compute the output-sequence probability from the output-segment probabilities (Ying, [Par.0025], “The present invention utilizes a recurrent neural network (RNN) to detect a prosodic phrase break. FIG. 3 shows an embodiment of a TTS system with an RNN. A text sentence is inputted to a text processing unit 401 for text analysis. During the text processing, the sentence may be segmented into a plurality of words. Then the text processing unit assigns a part of speech (POS) tag to each of the words. The tags of the words may be classified into a specific class as discussed above. As a result, a tag sequence corresponding to the words are generated. The tag sequence is then inputted to the recurrent neural network (RNN) 402. The RNN performs detection of a prosodic phrase break between each of the words. Each of the tags in the tag sequence is sequentially inputted to the RNN. For each inputted tag, a phrase break state is generated from the RNN. The outputted phrase breaks, as well as previously inputted tags are then fed back into the RNN to assist the subsequent prosodic phrase break detection of the subsequent tags of the tag sequence. As a result, a sentence with prosodic phrase break is created. Based on the phrase break detected, the speech features, such as duration, energy, and pitch may be modified. With the phrase break, the length of a word may be longer than a normal one. The sentence with prosodic break is then inputted into the speech processing unit 403 to perform speech synthesis. As a result, a speech (e.g., voice output) is generated through the speech processing unit 403.” Examiner’s note: each input tag is fed into the recurrent neural network, and the output for a POS tag is fed back to the RNN to generate the output for the next inputted POS tag; this corresponds to a dynamic programming algorithm being used to recursively compute the output-sequence probability.).
Ying discloses an output-sequence probability and an output-sequence segmentation.
However, Ying does not disclose a method comprising constructing an output-sequence probability as a sum, taken over all valid output-sequence segmentations, of products of output-segment probabilities, or the recurrent neural network having an associated softmax layer.
On the other hand, Deng teaches a method comprising constructing an output-sequence probability as a sum, taken over all valid output-sequence segmentations, of products of output-segment probabilities (Deng, [Par.0060], “[image: media_image1.png]” Examiner’s note: the output state-sequence probabilities sum to one, the sum being taken over the valid word or phoneme sequences only, and the probability of an output word or phoneme sequence is considered the product of output-segment probabilities.);
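For illustration only, the claimed construction (an output-sequence probability formed as a sum, over all valid segmentations, of products of output-segment probabilities, computed recursively by dynamic programming) can be sketched as follows. This sketch is not taken from Ying or Deng. The names `sequence_probability`, `brute_force`, and `toy_segment_prob` are hypothetical, and the toy segment score merely stands in for a value that a neural network would supply. The optional `max_len` argument reflects a maximum segment length of the kind recited in claim 4.

```python
from itertools import combinations
from math import isclose, prod


def sequence_probability(n, segment_prob, max_len=None):
    """Sum, over all segmentations of positions [0, n), of the product of
    segment probabilities, computed with a forward recursion.

    alpha[j] holds the total mass of all segmentations of the prefix [0, j).
    """
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for j in range(1, n + 1):
        start = 0 if max_len is None else max(0, j - max_len)
        # The last segment spans [i, j); everything before it is in alpha[i].
        alpha[j] = sum(alpha[i] * segment_prob(i, j) for i in range(start, j))
    return alpha[n]


def brute_force(n, segment_prob):
    """Same quantity by explicit enumeration of every segmentation."""
    total = 0.0
    for k in range(n):  # k = number of interior cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            total += prod(segment_prob(a, b) for a, b in zip(bounds, bounds[1:]))
    return total


def toy_segment_prob(i, j):
    # Hypothetical score (assumption): 0.5 per element in the segment.
    return 0.5 ** (j - i)


p_dp = sequence_probability(4, toy_segment_prob)
p_enum = brute_force(4, toy_segment_prob)
```

The recursion visits each prefix once, so its cost grows quadratically in the sequence length (or linearly times `max_len`), whereas explicit enumeration grows exponentially.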
Ying and Deng are analogous art because they are in the same field of endeavor, namely using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data using a neural network, in combination with Deng’s method, by constructing an output-sequence probability as a sum, taken over all valid output-sequence segmentations, of products of output-segment probabilities. The modification would have been obvious because one of ordinary skill in the art would have been motivated to process the speech data (Deng, [Par.0063], “There are several approaches to learning the intermediate layer representations in the DHCRF 700. For example, the intermediate layer learning problem can be cast into a multi-objective programming (MOP) problem in which the average frame-level conditional entropy is minimized and the state occupation entropy is maximized at a substantially similar time. Minimizing the average frame-level conditional entropy can force the intermediate layers to be sharp indicators of subclasses (or clusters) for each input vector, while maximizing the occupation entropy guarantees that the input vectors be represented distinctly by different intermediate states. The MOP optimization algorithm alternates the steps in optimizing these two contradictory criteria until no further improvement in the criteria is possible or the maximum number of iterations is reached.”).
However, Ying and Deng do not teach the recurrent neural network having an associated softmax layer.
On the other hand, Graves teaches the recurrent neural network having an associated softmax layer (Graves, [Sec.3.1], “The first method, known as Connectionist Temporal Classification (CTC) [8, 9], uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence. This distribution covers the K phonemes plus an extra blank symbol ∅ which represents a non-output (the softmax layer is therefore size K + 1). Intuitively the network decides whether to emit any label, or no label, at every timestep.”).
Ying, Deng, and Graves are analogous art because they are in the same field of endeavor, namely using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data using a neural network, in combination with Graves’s method, by having the recurrent neural network associated with a softmax layer. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the speech data processing (Graves, [Sec.3.2], “In the original formulation Pr(k|t, u) was defined by taking an ‘acoustic’ distribution Pr(k|t) from the CTC network, a ‘linguistic’ distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalizing. An improvement introduced in this paper is to instead feed the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalized with a softmax function to yield Pr(k|t, u). This allows a richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalization. In particular we have found that the number of deletion errors encountered during decoding is reduced.”).
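For illustration only, a softmax layer of size K + 1 (K labels plus one blank symbol), of the kind described in the Graves passage quoted for Sec. 3.1, can be sketched as follows. The function name `softmax`, the value K = 3, and the activation values are assumptions chosen for the example and do not come from Graves.

```python
import math


def softmax(logits):
    """Normalize raw activations into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


K = 3  # hypothetical number of phoneme labels
# Hypothetical activations at one timestep: K phonemes followed by the blank.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)  # Pr(k|t) over the K + 1 outputs
blank_prob = probs[K]    # probability of emitting no label at this timestep
```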
Regarding claim 16, the claim is rejected for the same reasons as claim 1.
Additionally, Ying further teaches one or more machine-readable media storing instructions (Ying, [Claim 19], “A machine-readable medium having stored thereon executable code which causes a machine to perform a method,”).
Regarding claim 20, the claim is rejected for the same reasons as claim 1.
Additionally, Ying teaches a system comprising: one or more hardware processors (Ying, [Par.0023], “As shown in FIG. 2, the computer system 200 which is a form of a data processing system, includes a bus 202 which is coupled to a microprocessor 203 and a ROM 207 and volatile RAM 205 and a non-volatile memory 206. The microprocessor 203 is coupled to cache memory 204 as shown in the example of FIG. 2.”);
and one or more machine-readable media storing data (Ying, [Claim 24], “A machine-readable medium having stored thereon executable code which causes a machine to perform a method”).
Regarding claim 2, Ying teaches the method of claim 1, wherein the output-segment probabilities depend on respective concatenations of preceding output segments, and wherein the set of artificial neural networks models the concatenations with a second recurrent neural network (Ying, [Par.0026-0027], “In general, an RNN is used for analysis temporal classification problems. An RNN consists of a set of units, an example of which is shown in FIG. 5A. The unit has a weight associated with each unit. A function of the weights and inputs (e.g., a squashing function applied to the sum of the weight-input products) is then generated as an output. These individual units may be connected together as shown in FIG. 5B, with an input layer, output layer, and usually a hidden layer… A recurrent neural network (RNN) allows for temporal classification, as shown in FIG. 5C. Referring to FIG. 5C, a context layer is added to the structure, which retains information between observations. At each time step, new inputs are fed into the RNN. The previous contents of the hidden layer are passed into the context layer. These contents then feed back into the hidden layer in the next time step. For a classification, post-processing of the outputs of RNN is usually performed. For example, when a threshold on the output from one of the nodes is observed, that particular class has been observed.” Examiner’s note: the context layer is added to the second neural network so that information is passed in the forward and backward directions to generate an output segment (phrase break). Adding the context layer to the hidden layer is considered to be the set of artificial neural networks modeling the concatenations with a second recurrent neural network.).
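For illustration only, the context-layer feedback described in the quoted passage (the previous hidden-layer contents are copied into a context layer and fed back at the next time step) can be sketched as follows. This is not code from Ying. The names `elman_step` and `run_rnn`, the use of `tanh` as the squashing function, and all weight values are assumptions.

```python
import math


def elman_step(x, context, w_in, w_ctx):
    """One time step: each hidden unit squashes the sum of its
    weight-input products and weight-context products."""
    hidden = []
    for row_in, row_ctx in zip(w_in, w_ctx):
        total = sum(w * v for w, v in zip(row_in, x))
        total += sum(w * c for w, c in zip(row_ctx, context))
        hidden.append(math.tanh(total))  # tanh is an assumed squashing function
    return hidden


def run_rnn(inputs, w_in, w_ctx):
    context = [0.0] * len(w_in)  # the context layer starts empty
    states = []
    for x in inputs:
        hidden = elman_step(x, context, w_in, w_ctx)
        states.append(hidden)
        context = hidden  # previous hidden contents pass into the context layer
    return states


# Hypothetical weights: two hidden units, one input.
W_IN = [[0.5], [-0.3]]
W_CTX = [[0.1, 0.2], [0.0, 0.4]]
states = run_rnn([[1.0], [1.0]], W_IN, W_CTX)
```

Because the context layer carries the earlier hidden state forward, the same input presented at two consecutive time steps produces two different hidden states.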
Regarding claim 3, Ying teaches the method of claim 1, wherein computing the output-sequence probability comprises recursively computing forward and backward probabilities for two-way output-sequence partitioning (Ying, [Par.0026-0027], “In general, an RNN is used for analysis temporal classification problems. An RNN consists of a set of units, an example of which is shown in FIG. 5A. The unit has a weight associated with each unit. A function of the weights and inputs (e.g., a squashing function applied to the sum of the weight-input products) is then generated as an output. These individual units may be connected as shown in FIG. 5B, with an input layer, output layer, and usually a hidden layer… A recurrent neural network (RNN) allows for temporal classification, as shown in FIG. 5C. Referring to FIG. 5C, a context layer is added to the structure, which retains information between observations. At each time step, new inputs are fed into the RNN. The previous contents of the hidden layer are passed into the context layer. These contents then feed back into the hidden layer in the next time step. For a classification, post-processing of the outputs of RNN is usually performed. For example, when a threshold on the output from one of the nodes is observed, that particular class has been observed.” Examiner’s note: the contents of the hidden layer are passed to the context layer and are then fed back to the hidden layer at the next time step, so that outputs are generated continually.).
Regarding claim 4, Ying teaches the method of claim 1, wherein, in computing the output-segment probabilities, an output-segment length is limited to a specified maximum value (Ying, [Par.0003], “Some of the languages, such as Chinese and Japanese, do not have space between the words. The first step of text analysis for such language processing is word segmentation. Because of the difficulty of syntactic parsing for these languages, most of the conventional TTS systems segment the words in the text analysis procedure, and limit the average length of the words after the segmentation at about 1.6 syllables, through the intrinsic properties of the words. Thus a small pause will be inserted every 1.6 syllables during the speech synthesis if there is no other higher level linguistic information, such as prosodic word, prosodic phrase and intonational phrase. As a result, the speech is not fluent enough. Native speakers tend to group words into phrases whose boundaries are marked by duration and intonational cues. Many phonological rules are constrained to operate only within such phrases, usually termed prosodic phrases. Prosodic phrase will help the TTS system produce more fluent speech, while the prosodic structure of the sentence will also help improve the intelligibility and naturalness of the speech. Therefore placing phrase boundaries is very important to ensure a naturally and sounding TTS system. With correct prosodic phrases detected from text, high quality prosodic model can be created and the acoustic parameters can be provided, which include pitch, energy, and duration, for the speech synthesis.” Examiner’s note: the output segment (the detected phrase break of the sentence) is limited to a specific length or number of words.).
Regarding claim 5, Ying teaches the method of claim 1, wherein the set of artificial neural networks is trained using backward propagation of errors (Ying, [Par.0026], “Through algorithm such as back propagation, the weights of the neural net can be adjusted so as to produce an output on the appropriate unit when a particular pattern at the input is observed.” Examiner’s note: back propagation is used to adjust parameters such as the weights to produce a satisfactory output; therefore, back propagation is used to reduce an error. See also [Par.0033], “Then the system performs prosodic phrasing on the plurality of words with POS tags 602 and matches with the prosodic phrases from the speech database 603. The prosodic phrasing processing is typically performed based on a set of rules, such as energy and cross-zero rates, etc. During the processing, the attributes of the objective functions used by the RNN are adjusted. Then the trainer may perform manually checking 606 to ensure the outputs are satisfied.”),
and wherein, in computing the output-segment probabilities during a forward propagation phase and in computing gradients of the output-segment probabilities used during a backward propagation phase (Ying, [Par.0028], “Before an RNN can be used, it has to be trained. Training the recurrent network is the most computationally difficult process in the development of a system. Once each frame of the training data has been assigned a label, the RNN training is effectively decoupled form the system training. An objective function may be used to ensure that the network input-output mapping satisfies the desired probabilistic interpretation is specified. Training of the recurrent network is performed using gradient methods. Implementation of the gradient parameter search leads to two integral aspect of the RNN training: computation of the gradient and application of the gradient to update the parameters.” Examiner’s note: computing the gradient and applying the gradient to update the parameters corresponds to the gradient being applied during forward and backward propagation.).
However, Ying does not teach that contributions computed for longer output segments are reused during computations for shorter output segments contained in the respective longer output segments.
On the other hand, Graves teaches that contributions computed for longer output segments are reused during computations for shorter output segments contained in the respective longer output segments (Graves, [Sec.3.2], “Whereas CTC determines an output distribution at every input timestep, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input timestep t and output timestep u. As with CTC, each distribution covers the K phonemes plus ∅. Intuitively the network ‘decides’ what to output depending both on where it is in the input sequence and the outputs it has already emitted. For a length U target sequence z, the complete set of TU decisions jointly determines a distribution over all possible alignments between x and z, which can then be integrated out with a forward-backward algorithm to determine log Pr(z|x) [10].” Examiner’s note: back propagation operates on the output sentence generated by the forward pass of the neural network; therefore, the output segment sequence (words) of the whole sentence is reused during back propagation.).
Ying, Deng, and Graves are analogous art because they are in the same field of endeavor, namely using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data using a neural network, in combination with Graves’s method, such that contributions computed for longer output segments are reused during computations for shorter output segments contained in the respective longer output segments. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the speech data processing (Graves, [Sec.3.2], “In the original formulation Pr(k|t, u) was defined by taking an ‘acoustic’ distribution Pr(k|t) from the CTC network, a ‘linguistic’ distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalizing. An improvement introduced in this paper is to instead feed the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalized with a softmax function to yield Pr(k|t, u). This allows a richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalization. In particular we have found that the number of deletion errors encountered during decoding is reduced.”).
Regarding claim 6, Ying teaches the method of claim 1, further comprising: using the one or more hardware processors to perform a beam search algorithm to determine an output sequence for a given input based on the trained set of artificial neural networks (Ying, [Par.0025], “The present invention utilizes a recurrent neural network (RNN) to detect a prosodic phrase break. FIG. 3 shows an embodiment of a TTS system with an RNN. A text sentence is inputted to a text processing unit 401 for text analysis. During the text processing, the sentence may be segmented into a plurality of words. Then the text processing unit assigns a part of speech (POS) tag to each of the words. The tags of the words may be classified into a specific class as discussed above. As a result, a tag sequence corresponding to the words are generated. The tag sequence is then inputted to the recurrent neural network (RNN) 402. The RNN performs detection of a prosodic phrase break between each of the words. Each of the tags in the tag sequence is sequentially inputted to the RNN. For each inputted tag, a phrase break state is generated from the RNN. The outputted phrase breaks, as well as previously inputted tags are then fed back into the RNN to assist the subsequent prosodic phrase break detection of the subsequent tags of the tag sequence. As a result, a sentence with prosodic phrase break is created. Based on the phrase break detected, the speech features, such as duration, energy, and pitch may be modified. With the phrase break, the length of a word may be longer than a normal one. The sentence with prosodic break is then inputted into the speech processing unit 403 to perform speech synthesis. As a result, a speech (e.g., voice output) is generated through the speech processing unit 403…”).
Regarding claim 8, Ying teaches the method of claim 1, wherein the output-sequence probability is constructed for non-sequence input and wherein empty segments are not permitted in output sequences (Ying, [Par.0036], “For the subsequent detections, portion of the previous inputted tags and breaks, such as B2, T2, and T3, as well as previously outputted breaks, such as B3, are fed back to the RNN with shifts. For example, the next detection for detecting whether there is a phrase break between tag T3 and the next tag, such as T4 of the tag sequence, will use previously inputs and outputs. In this case, B2, T2, B3, and T3 are inputted to the first, second, third, and fourth inputs of the RNN respectively. The next tag on the tag sequence, such as T4 is retrieved from the tag sequence and inputted to the fifth input of the RNN. As a result, a phrase break B4 is generated from the RNN. A value of one indicates B4 is a phrase break and value of zero indicates B4 is not a phrase break. These processes are repeated until there is no more tag left in the tag sequence.” Examiner’s note: the output stops when no tags remain in the input tag sequence.).
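For illustration only, the shifted feedback loop described in the quoted passages of Ying (inputs B1, T1, B2, T2, T3 yield B3; inputs B2, T2, B3, T3, T4 then yield B4; and so on until no tags remain) can be sketched as follows. The function `detect_breaks`, the `predict_break` callback standing in for the trained RNN, the `"PUNCT"` tag name, and the toy rule are all hypothetical.

```python
def detect_breaks(tags, predict_break, punctuation_tag="PUNCT"):
    """Slide a five-input window over the tag sequence, feeding each
    predicted break back in as an input for the next prediction."""
    if not tags:
        return []
    # Initial window: B1 and B2 are TRUE, and a punctuation tag (T1) is
    # placed in front of the tag sequence; tags[0] plays the role of T2.
    b_old, t_old = True, punctuation_tag
    b_new, t_new = True, tags[0]
    breaks = []
    for t_next in tags[1:]:
        predicted = predict_break(b_old, t_old, b_new, t_new, t_next)
        breaks.append(predicted)
        # Shift the window: the newest break and tag become feedback inputs.
        b_old, t_old = b_new, t_new
        b_new, t_new = predicted, t_next
    return breaks  # the loop ends when no tags remain


calls = []


def toy_predictor(b1, t1, b2, t2, t3):
    # Hypothetical rule standing in for the trained RNN.
    calls.append((b1, t1, b2, t2, t3))
    return t2 == "noun" and t3 == "verb"


result = detect_breaks(["adj", "noun", "verb", "noun"], toy_predictor)
```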
Regarding claim 9, Ying teaches the method of claim 1, wherein the output-sequence probability is constructed for an input sequence, and wherein an output sequence is modeled as monotonically aligned with the input sequence and as having a number of segments equal to a number of elements in the input sequence, empty segments being permitted (Ying, [Par.0024], “FIG. 3 shows a block diagram of a text to speech (TTS) system. The system 300 receives the inputted texts 301 and performs text analysis 309 on the texts. During the text analysis 309, the words of the inputted text would be segmented 302 into a plurality of words. Each word would be assigned with a part of speech (POS) tag associated with the word. The POS tags are typically categorized into several classes. In one embodiment, the tag classification includes adjective, adverb, noun, verb, number, quantifier, preposition, conjunction, idiom, punctuation, and others. Additional classes may be utilized. Based on the POS tags of the words, the system performs prosodic phrase detection 303 using prosodic phrasing model 304. The prosodic phrase model 304 includes many factors, such as energy and duration information of the phrase. The system then utilizes the prosodic phrase break to apply in the prosodic implementation 305. During the prosodic implementation 305, the system may use the prosodic break to modify the syllables of the phrase and apply the prosodic model 306 which may includes pitch information of the phrase. As a result, a prosodic sentence with phrase break is created. The system next performs speech synthesis on the prosodic sentence with phrase break and generates a final voice output 308 (e.g., speech).”  ).
Regarding claim 11, Ying teaches the method of claim 10, further comprising training a second set of neural networks that generates the input sequence from the human-language sequence in the first language (Ying, [Par.0026-0028], “The present invention utilizes a recurrent neural network (RNN) to detect a prosodic phrase break. FIG. 3 shows an embodiment of a TTS system with an RNN. A text sentence is inputted to a text processing unit 401 for text analysis. During the text processing, the sentence may be segmented into a plurality of words. Then the text processing unit assigns a part of speech (POS) tag to each of the words. The tags of the words may be classified into a specific class as discussed above. As a result, a tag sequence corresponding to the words are generated. The tag sequence is then inputted to the recurrent neural network (RNN) 402. The RNN performs detection of a prosodic phrase break between each of the words. Each of the tags in the tag sequence is sequentially inputted to the RNN. For each inputted tag, a phrase break state is generated from the RNN. The outputted phrase breaks, as well as previously inputted tags are then fed back into the RNN to assist the subsequent prosodic phrase break detection of the subsequent tags of the tag sequence. As a result, a sentence with prosodic phrase break is created. Based on the phrase break detected, the speech features, such as duration, energy, and pitch may be modified. With the phrase break, the length of a word may be longer than a normal one. The sentence with prosodic break is then inputted into the speech processing unit 403 to perform speech synthesis. As a result, a speech (e.g., voice output) is generated through the speech processing unit 403… A recurrent neural network (RNN) allows for temporal classification, as shown in FIG. 5C. Referring to FIG. 5C, a context layer is added to the structure, which retains information between observations. At each time step, new inputs are fed into the RNN. The previous contents of the hidden layer are passed into the context layer. These contents then feed back into the hidden layer in the next time step. For a classification, post-processing of the outputs for the RNN is usually performed. For example, when a threshold on the output from one of the nodes is observed, that particular class has been observed.” Examiner’s note: the output sequence is then fed back into the RNN to process the next text input, wherein the text input is considered a human-language sequence in a first language.).
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ying et al. (Pub. No. US 2003/0135356, hereinafter Ying) in view of Deng et al. (Pub. No. US 2012/0065976, hereinafter Deng), further in view of Graves et al. (NPL: SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS, Department of Computer Science, University of Toronto, hereinafter Graves), further in view of Tillmann et al. (NPL: Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, IBM T.J. Watson Research Center, hereinafter Tillmann), and further in view of Yamamoto et al. (Pub. No. US 2011/0173000, hereinafter Yamamoto).
Regarding claim 7, Ying teaches the method of claim 6, wherein the input is an input sequence (Ying, [Par.0025], “The present invention utilizes a recurrent neural network (RNN) to detect a prosodic phrase break. FIG. 3 shows an embodiment of a TTS system with an RNN. A text sentence is inputted to a text processing unit 401 for text analysis. During the text processing, the sentence may be segmented into a plurality of words. Then the text processing unit assigns a part of speech (POS) tag to each of the words. The tags of the words may be classified into a specific class as discussed above. As a result, a tag sequence corresponding to the words are generated. The tag sequence is then inputted to the recurrent neural network (RNN) 402. The RNN performs detection of a prosodic phrase break between each of the words. Each of the tags in the tag sequence is sequentially inputted to the RNN.)
However, Ying does not teach that the beam search algorithm comprises, for each element of the input sequence, performing a left-to-right beam search.
Tillmann teaches that the beam search algorithm comprises, for each element of the input sequence, performing a left-to-right beam search (Tillmann, [Sec.3.9], “3.9 Beam Search Implementation In this section, we describe the implementation of the beam search algorithm presented in the previous sections and show how it is applied to the full set of IBM-4 model parameters. 3.9.1 Baseline DP Implementation. The implementation described here is similar to that used in beam search speech recognition systems, as presented in Ney et al. (1992). The similarities are given mainly in the following:
• The implementation is data driven. Both its time and memory requirements are strictly linear in the number of path hypotheses (disregarding the sorting steps explained in this section).
• The search procedure is developed to work most efficiently when the input sentences are processed mainly monotonically from left to right. The algorithm works cardinality-synchronously, meaning that all the hypotheses that are processed cover subsets of source sentence positions of equal cardinality.
• Since full search is prohibitive, we use a beam search concept, as in speech recognition. We use appropriate pruning techniques in connection with our cardinality-synchronous search procedure.”).
Ying, Deng, Graves and Tillmann are analogous art because they are in the same field of endeavor of using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data by using a neural network in combination with Tillmann’s method by having the beam search performed from left to right. The modification would have been obvious because one of ordinary skill in the art would be motivated to perform the search from left to right, which is how the search procedure works most efficiently (Tillmann, [Sec.3.9], “3.9 Beam Search Implementation In this section, we describe the implementation of the beam search algorithm presented in the previous sections and show how it is applied to the full set of IBM-4 model parameters. 3.9.1 Baseline DP Implementation. The implementation described here is similar to that used in beam search speech recognition systems, as presented in Ney et al. (1992). The similarities are given mainly in the following:
• The implementation is data driven. Both its time and memory requirements are strictly linear in the number of path hypotheses (disregarding the sorting steps explained in this section).
• The search procedure is developed to work most efficiently when the input sentences are processed mainly monotonically from left to right. The algorithm works cardinality-synchronously, meaning that all the hypotheses that are processed cover subsets of source sentence positions of equal cardinality.
• Since full search is prohibitive, we use a beam search concept, as in speech recognition. We use appropriate pruning techniques in connection with our cardinality-synchronous search procedure.”).
However, Ying and Tillmann do not teach thereafter merging any identical partial candidate output sequences obtained for multiple respective segmentations of the output sequence.
On the other hand, Yamamoto teaches thereafter merging any identical partial candidate output sequences obtained for multiple respective segmentations of the output sequence (Yamamoto, [Par.0042], “This processing can be made more efficient by pruning or the like. When word category sequences of the same word category continue, postprocessing can also be applied to combine them and output the combined word category sequence.”).
Ying, Deng, Graves, Tillmann and Yamamoto are analogous art because they are in the same field of endeavor of using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data by using a neural network in combination with Yamamoto’s method by merging any identical partial candidate output sequences obtained for multiple respective segmentations of the output sequence. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the data processing (Yamamoto, [Par.0042], “This processing can be made more efficient by pruning or the like. When word category sequences of the same word category continue, postprocessing can also be applied to combine them and output the combined word category sequence.”).
Regarding claim 17, the claim is rejected for the same reasons as claim 7.
Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ying et al. (Pub. No. US 2003/0135356, hereinafter Ying) in view of Deng et al. (Pub. No. US 2012/0065976, hereinafter Deng), and further in view of Graves et al. (NPL: SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS, Department of Computer Science, University of Toronto, hereinafter Graves), and further in view of Wang et al. (NPL: Translating Phrases in Neural Machine Translation, Soochow University, Suzhou, China, hereinafter Wang).
Regarding claim 10, Ying as modified in view of Wang teaches the method of claim 9, wherein the input sequence represents a human-language sequence in a first language and the output sequence represents a human-language sequence in a second language that corresponds to a translation from the first language (Wang, [Abstract], “In this work, we propose a method to translate phrases in NMT by integrating a phrase memory storing target phrases from a phrase-based statistical machine translation (SMT) system into the encoder-decoder architecture of NMT. At each decoding step, the phrase memory is first re-written by the SMT model, which dynamically generates relevant target phrases with contextual information provided by the NMT model. Then the proposed model reads the phrase memory to make probability estimations for all phrases in the phrase memory. If phrase generation is carried on, the NMT decoder selects an appropriate phrase from the memory to perform phrase translation and updates its decoding state by consuming the words in the selected phrase. Otherwise, the NMT decoder generates a word from the vocabulary as the general NMT decoder does. Experiment results on the Chinese→English translation show that the proposed model achieves significant improvements over the baseline on various test sets.” Examiner’s note: the Chinese-to-English translation is generated by using an encoder and a decoder; therefore, Chinese is considered the first human language and English is considered the second human language.).
Ying, Deng, Graves, and Wang are analogous art because they are in the same field of endeavor of using machine learning for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data by using a neural network in combination with Wang’s method by having the input sequence represent a human-language sequence in a first language and the output sequence represent a human-language sequence in a second language that corresponds to a translation from the first language. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve a translation process (Wang, [Abstract], “In this work, we propose a method to translate phrases in NMT by integrating a phrase memory storing target phrases from a phrase-based statistical machine translation (SMT) system into the encoder-decoder architecture of NMT. At each decoding step, the phrase memory is first re-written by the SMT model, which dynamically generates relevant target phrases with contextual information provided by the NMT model. Then the proposed model reads the phrase memory to make probability estimations for all phrases in the phrase memory. If phrase generation is carried on, the NMT decoder selects an appropriate phrase from the memory to perform phrase translation and updates its decoding state by consuming the words in the selected phrase. Otherwise, the NMT decoder generates a word from the vocabulary as the general NMT decoder does. Experiment results on the Chinese→English translation show that the proposed model achieves significant improvements over the baseline on various test sets.”).
Regarding claim 18, the claim is rejected for the same reasons as claim 10.
Claims 12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ying et al. (Pub. No. US 2003/0135356, hereinafter Ying) in view of Deng et al. (Pub. No. US 2012/0065976, hereinafter Deng), and further in view of Graves et al. (NPL: SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS, Department of Computer Science, University of Toronto, hereinafter Graves), and further in view of Tillmann et al. (NPL: Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation, IBM T.J. Watson Research Center, hereinafter Tillmann), and further in view of Kaplan et al. (Pub. No. US 2009/0119162, hereinafter Kaplan).
Regarding claim 12, Ying as modified in view of Tillmann teaches the machine translation that locally reorders elements of a sequence of embedded representations of elements of the human-language sequence in the first language (Tillmann, [Sec.1], “The search procedure presented in this article is based on a DP algorithm to solve the traveling-salesman problem (TSP). A data-driven beam search approach is presented on the basis of this DP-based algorithm. The cities in the TSP correspond to source positions of the input sentence. By imposing constraints on the possible word re-orderings similar to that described in Berger et al. (1996), the DP-based approach becomes more effective: when the constraints are applied, the number of word re-orderings is greatly reduced. The original reordering constraint in Berger et al. (1996) is shown to be a special case of a more general restriction scheme in which the word reordering constraints are expressed in terms of simple combinatorical restrictions on the processed sets of source sentence positions. A set of four parameters is given to control the word reordering. Additionally, a set of four states is introduced to deal with grammatical reordering restrictions (e.g., for the translation direction German to English, the word order difference between the two languages is mainly due to the German verb group). In combination with the reordering restrictions, a data-driven beam search organization for the search procedure is proposed.”).
Ying, Deng, Graves and Tillmann are analogous art because they are in the same field of endeavor of using a neural network for speech recognition.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data by using a neural network in combination with Tillmann’s method of locally reordering elements of a sequence of embedded representations of elements of the human-language sequence in the first language. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the data processing (Tillmann, [Sec.1], “Tillmann, Vogel, Ney, and Zubiaga (1997) proposes a dynamic programming (DP)–based search algorithm for statistical MT that monotonically translates the input sentence from left to right. The word order difference is dealt with using a suitable preprocessing step. Although the resulting search procedure is very fast, the preprocessing is language specific and requires a lot of manual work. Currently, most search algorithms for statistical MT proposed in the literature are based on the A∗ concept (Nilsson 1971). Here, the word reordering can be easily included in the search procedure, since the input sentence positions can be processed in any order. The work presented in Berger et al. (1996) that is based on the A∗ concept, however, introduces word reordering restrictions in order to reduce the overall search space.”).
However, Tillmann does not teach wherein the second set of neural networks comprises a network layer that locally reorders elements.
On the other hand, Kaplan teaches wherein the second set of neural networks comprises a network layer that locally reorders elements (Kaplan, Claim 28, “The method of claim 28, wherein the prompting is performed using ordering or reordering and an order or reordering of the elements in the queue are determined using at least one of (i) collective intelligence analysis, (ii) using collaborative filtering, (iii) statistical correlation methods for determining relationships between the elements, and (iv) a neural network approach.”).
Ying, Deng, Graves, Tillmann and Kaplan are analogous art because they are in the same field of endeavor of using a neural network to generate data.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ying’s method of processing speech data by using a neural network in combination with Kaplan’s method of having the second set of neural networks comprise a network layer that locally reorders elements. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the process of prediction (Kaplan, [Par.0055], “In other words, generally, the more extreme the forecasts are, the more information those forecasts may contain. Thus, one way to maximize the useful information obtained by the system in the shortest time is to look at all of the stock symbols for which a user has made forecasts in the past (e.g., those remembered by the cookie) and then order these stock symbols so that the symbols with a history of more extreme forecasts are placed first in the queue. When a user returns to the site, not only will the system remember which symbols the user predicted or forecast on before, but also the system will present these symbols in an order so that those symbols where the user historically has made the most extreme predictions come before symbols with less (historically) extreme predictions. The result is the maximum information obtained from the user in the minimum amount of time.”).
Regarding claim 19, the claim is rejected for the same reasons as claim 12.

Conclusion
The prior art made of record and not relied upon, which is considered pertinent to applicant’s disclosure, is provided below.
Sarawagi et al. (NPL: Efficient Inference on Sequence Segmentation Models, hereinafter Sarawagi) teaches a method of inference for sequence segmentation models.
Wiseman et al. (NPL: Sequence-to-Sequence Learning as Beam-Search Optimization, Harvard University, Cambridge, MA, USA, hereinafter Wiseman) teaches applying sequence-to-sequence modeling to NLP tasks to improve text generation and sequence labeling.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to EM N TRIEU, whose telephone number is (571) 272-5747. The examiner can normally be reached 7:30 - 5:00, M-Th.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached at (571) 272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.T./Examiner, Art Unit 2128  

/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128