DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 08 August 2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.  Patentability shall not be negatived by the manner in which the invention was made.

Claims 1-3, 6, 8-12, 15-17, and 20 are rejected under pre-AIA  35 U.S.C. 103(a) as being obvious over US 20180350347, hereinafter referred to as Fukuda et al., in view of US 20170032244, hereinafter referred to as Kurata.

Regarding claim 1, Fukuda et al. discloses a computer-implemented method for data augmentation for speech data (“FIG. 1 illustrates a block diagram of a speech recognition system that includes a data augmentation system for augmenting training data for an acoustic model according to an exemplary embodiment of the present invention,” Fukuda et al., para [0007].), the method comprising:

obtaining original speech data including a sequence of feature frames (“The speech recognition engine 104 is configured to convert from input speech signal into a text. The speech recognition engine 104 may receive speech signals digitalized by sampling analog audio input 102, which may be input from a microphone for instance, at a predetermined sampling frequency and a predetermined bit depth,” Fukuda et al., para [0018].); 

generating a partially prolonged copy of the original speech data by inserting one or more new frames into the sequence of feature frames (“As shown in FIG. 1…and a voice stretching module 138 for making given voice data slower,” Fukuda et al., para [0036]. Also, “The voice stretching module 138 may be configured to make given voice data slower. In an embodiment, the voice stretching module 138 may stretch a vowel in the given voice data longer,” Fukuda et al., para [0042]. And, “At step S106, the processing unit may make each speech segment slower by stretching vowels in each speech segment longer. At step S107, the processing unit may store a resultant pseudo faint voice data into the speech data 120,” Fukuda et al., para [0057]. The examiner notes that stretching the voice is equivalent to prolonging the original speech data.); and 

outputting the partially prolonged copy as augmented speech data for training an acoustic model (“The speech recognition engine 104 finds a word sequence with maximum likelihood by using the speech recognition model 106 (including the acoustic model 108) based on the sequence of the acoustic features, and outputs the word sequence found as the decoded result,” Fukuda et al., para [0023].).  

Although Fukuda et al. teaches speech segments, Fukuda et al. does not specifically disclose speech feature frames.

Kurata is cited to disclose speech feature frames (“In another embodiment of the present principles, the original training data may be acoustic, the input segment may be n-frame acoustic features, the extended segment may be n+m-frame acoustic features, and the additional segment may be m-frame acoustic features preceding and/or succeeding the n-frame acoustic features,” Kurata, para [0014].). Kurata benefits Fukuda et al. by providing a system capable of improving recognition accuracy without increasing latency and computation cost during recognition processing (Kurata, para [0008]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Fukuda et al. with those of Kurata to enhance the acoustic model training of Fukuda et al.  

As to claim 11, system claim 11 and method claim 1 are related as method and system of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 11 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0047] teaches processor, memory, and instructions. 

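As to claim 16, product claim 16 and method claim 1 are related as method and product of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 16 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0089] teaches CRM.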


Regarding claim 2, Fukuda et al., as modified by Kurata, discloses the method of claim 1, wherein the sequence of feature frames of the original speech data has labels representing speech sounds (“Since the input original voice data (e.g., an utterance) may be associated with label information (e.g. a transcription corresponding to the utterance), the resultant pseudo faint voice data (e.g. corresponding to a part of the utterance) and label information that may be a part of the label information associated with the input original voice data (e.g. a part of transcription corresponding to the part of the utterance) can be used to train the acoustic model 108 for speech recognition,” Fukuda et al., para [0045]. Also, “If the process is one for training from scratch, at step S202, the processing unit may train the acoustic model 108 from scratch by using training data that includes the resultant pseudo faint voice data. Note that the original voice data may be associated with label information, thus the resultant voice data and the label information associated with the original voice data may be used to train the acoustic model 108 for speech recognition in step S202. Then, the process may end at step S205,” Fukuda et al., para [0062].) and each new frame is inserted at a position corresponding to a processing frame in response to the processing frame being related to at least one of predetermined speech sounds (“At step S106, the processing unit may make each speech segment slower by stretching vowels in each speech segment longer. At step S107, the processing unit may store a resultant pseudo faint voice data into the speech data 120,” Fukuda et al., para [0057]. Here, the predetermined speech sound is a faint vowel segment.).  

As to claim 12, system claim 12 and method claim 2 are related as method and system of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 12 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0047] teaches processor, memory, and instructions. 

As to claim 17, product claim 17 and method claim 2 are related as method and product of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 17 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0089] teaches CRM. 

Regarding claim 3, Fukuda et al., as modified by Kurata, discloses the method of claim 2, wherein the predetermined speech sounds includes one or more vowels (“The voice stretching module 138 may be configured to make given voice data slower. In an embodiment, the voice stretching module 138 may stretch a vowel in the given voice data longer,” Fukuda et al., para [0042].).  
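For illustration only, the limitation mapped above (inserting a new frame at each position whose label is one of the predetermined speech sounds, here a vowel) can be sketched as follows. The function name prolong_by_label, the vowel label set, the frame layout, and the choice to duplicate the processing frame together with its label are assumptions made for this sketch; they are not taken from Fukuda et al., Kurata, or the claims.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical frame layout: (feature vector, speech-sound label).
Frame = Tuple[List[float], str]

# Assumed set of "predetermined speech sounds" for this sketch.
VOWELS: Set[str] = {"a", "e", "i", "o", "u"}


def prolong_by_label(frames: Sequence[Frame],
                     targets: Set[str] = VOWELS) -> List[Frame]:
    """Return a partially prolonged copy of `frames`.

    A new frame is inserted right after every frame whose label is in
    `targets`. The new frame here simply duplicates the features and the
    label of that frame (an assumption, not the claimed method).
    """
    out: List[Frame] = []
    for feats, label in frames:
        out.append((list(feats), label))      # copy of the original frame
        if label in targets:
            out.append((list(feats), label))  # inserted frame with copied label
    return out
```

Only frames labeled with a target sound are lengthened, so the copy is prolonged in part rather than uniformly, and the input sequence is left unmodified.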

Regarding claim 6, Fukuda et al., as modified by Kurata, discloses the method of claim 5, wherein each feature frame has dynamic acoustic features in addition to the static acoustic features and the method further comprises: 

(“The acoustic features may further include dynamical features such as delta features and delta-delta features of the aforementioned acoustic features,” Fukuda et al., para [0019]. Delta and delta-delta features are dynamic acoustic features.).

As to claim 15, system claim 15 and method claim 6 are related as method and system of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 15 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0047] teaches processor, memory, and instructions. 

As to claim 20, product claim 20 and method claim 6 are related as method and product of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 20 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0089] teaches CRM. 

Regarding claim 8, Fukuda et al., as modified by Kurata, discloses the method of claim 2, wherein each new frame has a copy of a label assigned to a previous or subsequent frame thereof (“Since the input original voice data (e.g., an utterance) may be associated with label information (e.g. a transcription corresponding to the utterance), the resultant pseudo faint voice data (e.g. corresponding to a part of the utterance) and label information that may be a part of the label information associated with the input original voice data (e.g. a part of transcription corresponding to the part of the utterance) can be used to train the acoustic model 108 for speech recognition,” Fukuda et al., para [0045].).


Regarding claim 9, Fukuda et al., as modified by Kurata, discloses the method of claim 1, wherein the method further comprises: 

training the acoustic model using the augmented speech data solely or in combination with the original speech data and/or other speech data, the acoustic model including an input layer receiving one or more input feature frames (Fukuda et al., fig. 1 – hidden layers.).  


Regarding claim 10, Fukuda et al., as modified by Kurata, discloses the method of claim 2, wherein the sequence of feature frames of the original speech data is generated by extracting acoustic feature from audio signal data including a series of sampled values of audio signal and each feature frame has a label assigned by aligning a transcription to the sequence of feature frames (“Since the input original voice data (e.g., an utterance) may be associated with label information (e.g. a transcription corresponding to the utterance), the resultant pseudo faint voice data (e.g. corresponding to a part of the utterance) and label information that may be a part of the label information associated with the input original voice data (e.g. a part of transcription corresponding to the part of the utterance) can be used to train the acoustic model 108 for speech recognition,” Fukuda et al., para [0045].) or by detecting a speech sound segment in the sequence of feature frames (“The voice stretching module 138 may be configured to make given voice data slower. In an embodiment, the voice stretching module 138


Claims 4, 13, and 18 are rejected under pre-AIA  35 U.S.C. 103(a) as being obvious over US 20180350347, hereinafter referred to as Fukuda et al., in view of US 20170032244, hereinafter referred to as Kurata, and further in view of US 20170098444, hereinafter referred to as Song.

Regarding claim 4, Fukuda et al., as modified by Kurata, discloses the method of claim 2, but not wherein each new frame is inserted with a predetermined probability at the position corresponding to the processing frame related to the at least one of the predetermined speech sounds. Song is cited to disclose wherein each new frame is inserted with a predetermined probability at the position corresponding to the processing frame related to the at least one of the predetermined speech sounds (“For example, in the case where the number of all frames of the first speech is N, or where there are a total of N frames of the first speech made available by the speech input section 210 to the preprocessor 220, and a predetermined uniform interval is K, it may be considered that pronunciation probabilities of successive frames between each Kth frame of the total N frames may be similar to each other. That is, it may be considered that the pronunciation probability of an i-th frame may be similar to those of i+1-th, i+2-th, up to i+(K-1)-th frames, for example,” Song, para [0077].). Song benefits Fukuda et al. by alleviating certain technological problems in automated speech recognition systems, such as the increased required time for calculating pronunciation probabilities corresponding to respective speech units (Song, para [0008]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Fukuda et al. with those of Song to enhance the acoustic model training of Fukuda et al.
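For illustration only, the claim 4 limitation recited above (each new frame inserted with a predetermined probability at a position whose processing frame relates to a predetermined speech sound) can be sketched as follows. The function name prolong_with_probability, the frame layout, and the use of a seeded random generator are assumptions made for this sketch; nothing here is drawn from Song or from the application.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical frame layout: (feature vector, speech-sound label).
Frame = Tuple[List[float], str]


def prolong_with_probability(frames: Sequence[Frame],
                             targets: Set[str],
                             prob: float,
                             rng: Optional[random.Random] = None) -> List[Frame]:
    """Insert a duplicate after each target-labeled frame with probability `prob`."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    rng = rng or random.Random()
    out: List[Frame] = []
    for feats, label in frames:
        out.append((list(feats), label))
        # rng.random() lies in [0.0, 1.0), so prob=1.0 always inserts
        # and prob=0.0 never inserts.
        if label in targets and rng.random() < prob:
            out.append((list(feats), label))
    return out
```

Passing a seeded random.Random instance makes a given augmentation run repeatable, while different seeds yield differently prolonged copies of the same utterance.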

As to claim 13, system claim 13 and method claim 4 are related as method and system of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 13 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0047] teaches processor, memory, and instructions. 

As to claim 18, product claim 18 and method claim 4 are related as method and product of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0089] teaches CRM. 


Claims 5, 14, and 19 are rejected under pre-AIA  35 U.S.C. 103(a) as being obvious over US 20180350347, hereinafter referred to as Fukuda et al., in view of US 20170032244, hereinafter referred to as Kurata, and further in view of US 20040138888, hereinafter referred to as Ramabadran.

Regarding claim 5, Fukuda et al., as modified by Kurata, discloses the method of claim 1, but not wherein each feature frame has static acoustic features and each new frame has new values of the static acoustic features generated by interpolating previous and subsequent frames. Ramabadran is cited to disclose wherein each feature frame has static acoustic features and each new frame has new values of the static acoustic features generated by interpolating previous and subsequent frames (“Once the midpoint frequency, amplitude and phase values are known, the amplitudes and phases at other points may be calculated. For example, once the amplitudes at the midpoints of the current and previous voiced frames are known, the amplitudes at the sub-frame boundaries may be calculated using linear interpolation with an adjustment for the energies at these points,” Ramabadran, para [0054].). Ramabadran benefits Fukuda et al. by providing a method and apparatus for speech reconstruction within a distributed speech recognition system that makes use of missing MFCC values to improve speech reconstruction (Ramabadran, para [0012]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Fukuda et al. with those of Ramabadran to enhance the acoustic model training of Fukuda et al.
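For illustration only, the claim 5 limitation recited above (new values of the static acoustic features generated by interpolating previous and subsequent frames) can be sketched with simple linear interpolation. The function names interpolate_static and insert_interpolated, the equal weighting of the two neighbors, and the representation of a frame as a plain list of static feature values are assumptions made for this sketch; they are not taken from Ramabadran or from the application.

```python
from typing import List, Sequence


def interpolate_static(prev: Sequence[float],
                       nxt: Sequence[float],
                       weight: float = 0.5) -> List[float]:
    """Linearly interpolate two static feature vectors of equal length."""
    if len(prev) != len(nxt):
        raise ValueError("feature vectors must have the same length")
    return [(1.0 - weight) * p + weight * n for p, n in zip(prev, nxt)]


def insert_interpolated(frames: Sequence[Sequence[float]],
                        index: int) -> List[List[float]]:
    """Return a copy of `frames` with one new frame inserted after `index`.

    The new frame is the midpoint of frames[index] and frames[index + 1].
    """
    if not 0 <= index < len(frames) - 1:
        raise IndexError("index must have a subsequent frame")
    new_frame = interpolate_static(frames[index], frames[index + 1])
    head = [list(f) for f in frames[:index + 1]]
    tail = [list(f) for f in frames[index + 1:]]
    return head + [new_frame] + tail
```

With frames [[0.0], [2.0], [4.0]] and index 1, the inserted frame is [3.0], the midpoint of its previous and subsequent neighbors.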

As to claim 14, system claim 14 and method claim 5 are related as method and system of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 14 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0047] teaches processor, memory, and instructions. 

As to claim 19, product claim 19 and method claim 5 are related as method and product of using the same, with each claimed element’s function corresponding to the method step. Accordingly claim 19 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Fukuda et al., para [0089] teaches CRM. 



Allowable Subject Matter
Claim 7 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Fukuda et al. and Kurata teach calculating delta and delta-delta features, but none of the prior art discloses that the delta-delta features are calculated for a group of neighboring frames wider than that used for the delta features.


Conclusion

Other prior art is noted on attached PTO-892. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached at (571)272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  






/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2656