Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 3/21/2021 have been fully considered but they are not persuasive. 
The applicant contends:
I. Rejection of Claims 1, 2, 3, 6, 7, 8, 9, 10, 11, 14-20 Under 35 U.S.C. §102(a)(1) 
Claims 1, 2, 3, 6, 7, 8, 9, 10, 11, 14-20 are rejected under 35 U.S.C. §102(a)(1) as being anticipated by Lopatka, et al. (US Publication No.: 20190043489) ("Lopatka"). The withdrawal of this rejection is respectfully requested for at least the following reasons. 
Claim 1 has been amended to incorporate the features of claim 2, which has been cancelled accordingly. Claim 1 has been amended to recite "wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sequence of frames to determine whether the temporal structure constraint is met". Basis for this amendment can be found in at least paragraph [0026] of the Specification, which describes how "applying the temporal structure constraint to a sequence of frames may comprise processing the sequence of frames to determine whether a constraint, i.e. the temporal structure constraint, is met". 
Lopatka fails to disclose each and every feature of amended claim 1. For example, Lopatka does not disclose an application of a temporal constraint that comprises processing a sequence of frames to determine whether a temporal structure constraint is met. Lopatka relates to acoustic event detection (see Lopatka at Abstract), and states that "[e]vent detection and recognition systems are employed to automatically detect acoustic events of interest, for example to trigger a desired action based on the event" (see Lopatka at paragraph [0001]). In particular, Lopatka at paragraph [0034] states that "[t]he classifier operates on the impulsive features 330, or the continuous features 450, depending on the branch of the event detector, to generate event scores 530 which indicate the probability that an audio frame belongs to a given event class with an associated label." The processing of these event scores is described in paragraphs [0039] and [0040] of Lopatka. 
Lopatka at paragraph [0039] states that "[t]he difference calculation circuit 640 is configured to calculate the difference between target event score (label 0) and the aggregation of the non-target event scores (labels 1-7)." Paragraph [0040] of Lopatka states that "[t]he smoothing circuit 650 is configured to smooth the calculated difference 752... The purpose of the exponential smoothing is to disregard short-term, primarily random, variability of the output scores." However, Lopatka does not disclose applying a constraint to the event scores, either before or after this smoothing process in contrast to amended claim 1. For example, although the smoothing is intended to reduce random variation, there is no disclosure of applying a constraint to check whether smoothing is required, or to determine whether the smoothing process has met a particular constraint consistent with amended claim 1. 

The examiner disagrees. The claim recites the newly added limitation of “applying the temporal structure constraint to a sequence of frames may comprise processing the sequence of frames to determine whether the temporal structure constraint is met”. This limitation merely recites that the sequence of frames is processed in order to determine whether the constraint is met. It does not specify how the sequence of frames is processed, what operations that processing comprises, or how it is determined whether the constraint is met. The claim language also does not indicate what constitutes a temporal structure constraint. As a result, the claim language is given its broadest reasonable interpretation in light of the specification. The limitation is therefore interpreted to mean merely that applying the constraint includes processing the sequence of frames to determine whether the constraint is met.
	Although the applicant contends that Lopatka does not disclose the newly added limitation, given the breadth of the claim language, Lopatka discloses the newly recited limitation, as shown in the office action below as well as in this rebuttal. Lopatka discloses processing the sequence of frames (Fig. 1: the spectral frames are input into labels 140, 160. Fig. 2 shows the internals of labels 140, 160, and Fig. 6 shows the internals of label 230. Fig. 6, labels 650, 810, shows the application of a temporal structure constraint.). Lopatka discloses:
“The smoothing circuit 650 is configured to smooth the calculated difference 752. This smoothed difference score is illustrated as 754 in plot 750. In some embodiments, the smoothing operation may be described by the following equation: …. where 0<a<1 is an exponential smoothing constant.” (paragraph 40)


“Also shown, is a threshold marker 756. The threshold circuit 660 is configured to generate an event detection, 150, 170, when the smoothed difference 754 exceeds the selected threshold 756.” (paragraph 41)

This indicates that the smoothed difference, generated as described in paragraph 40 (quoted above) using an exponential smoothing constant, is compared to a threshold in order to generate an event detection. In other words, the temporal structure constraint is the application of an exponential smoothing constant to calculate the smoothed difference together with the comparison of that smoothed difference to a threshold, and the event detection is generated upon a determination that this constraint is met.
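For illustration only, the smoothing and thresholding operations discussed above can be sketched as follows. This is a minimal sketch and not Lopatka's implementation: the equation of paragraph 40 is elided in the quotation above, so the recursion s[n] = a*s[n-1] + (1-a)*d[n] used here is an assumed, commonly used form of exponential smoothing, and the function name, default values and zero initial state are hypothetical.

```python
def smooth_and_detect(diff_scores, a=0.9, threshold=0.5):
    """Exponentially smooth per-frame difference scores, then threshold them.

    Assumed recursion (not taken from Lopatka): s[n] = a*s[n-1] + (1-a)*d[n],
    with 0 < a < 1 and an initial state of 0.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("smoothing constant a must satisfy 0 < a < 1")
    smoothed, detections = [], []
    s = 0.0  # hypothetical initial state
    for d in diff_scores:
        s = a * s + (1.0 - a) * d          # smoothing over the frame sequence
        smoothed.append(s)
        detections.append(s > threshold)   # detection only when threshold is exceeded
    return smoothed, detections
```

Under these assumptions, a single-frame spike in the difference score does not produce a detection, whereas a sustained run of high scores does, which is the sense in which the smoothing constant and threshold act over a sequence of frames.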
Furthermore, although Lopatka [0040] states that "[b]y setting [the decay constant] a to a value closer to 1 (e.g. a = 0.999) the backend processing circuit can be tuned to react to longer events", there is no disclosure of determining the value of a such that a particular constraint is met. The smoothing process is merely a way of reducing random variation in general qualitative terms in Lopatka, which is different from the approach of amended claim 1. Therefore Lopatka does not disclose "wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision, wherein applying the temporal structure constraint comprises processing the sequence of frames to determine whether the temporal structure constraint is met," as recited in amended claim 1. Accordingly, Lopatka fails to anticipate claim 1, as well as claims 3, 6, 7, 8, 9, 10, and 11 depending therefrom. 

The examiner disagrees. Although the applicant contends, as indicated in the applicant’s remarks, that Lopatka's disclosure of the decay constant does not disclose the recited limitation, the recited claim language is broad and is interpreted as indicated above. As explained above, Lopatka discloses the recited claim language in paragraphs 40 and 41 in conjunction with Fig. 1 (labels 180, 140, 160), Fig. 2 (which shows the internals of labels 140, 160) and Fig. 6 (which shows the internals of label 230 shown in Fig. 2). An explanation is found above as well as in the office action below.



The examiner disagrees. The newly recited limitation found in claim 1 is similarly found in claims 19,20. Please see the rebuttal of claim 1, found above.
For at least the reasons explained above, claims 1, 3, 6, 7, 8, 9, 10, 11, 14-20 are patentable over Lopatka, and the withdrawal of this rejection is respectfully requested.
 
	The examiner disagrees. The rebuttal regarding claims 1, 19 and 20 is found above. Claims 3, 6, 7, 8, 9, 10, 11 and 14-18 are dependent claims and hence incorporate the limitations of the independent claim. Please see the rebuttal of the respective independent claim.

II. Rejection of Claims 12 and 13 Under 35 U.S.C. §103 
Claims 12 and 13 are rejected under 35 U.S.C. §103 as being unpatentable over Lopatka in view of McLaughlin, et al. (Title: Continuous robust sound event classification using time-frequency features and deep learning) ("McLaughlin"). Claims 12 and 13 are patentable for at least the same reasons, as explained above with respect to amended claim 1, from which these claims depend, and for the specific elements recited therein. The addition of McLaughlin fails to rectify the deficiencies of Lopatka, as explained above with respect to amended claim 1. Thus, Lopatka in view of McLaughlin does not teach or suggest to one of ordinary skill in the art how to implement at least the above-discussed feature of amended claim 1, from which claims 12 and 13 depend. Therefore, Lopatka in view of McLaughlin does not render claims 12 and 13 obvious to one of ordinary skill in the art. Accordingly, the withdrawal of this rejection is respectfully requested. 

	The examiner disagrees. These claims depend on the respective independent claims. Please see the rebuttal of the respective independent claim.

III. Allowable Subject Matter 
Claims 4 and 5 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the features of the base claim and any intervening claims. Applicant thanks the Examiner for the allowance of claims 4 and 5. 

	These claims have not been amended; hence their scope is unchanged. As a result, the status of these claims has not changed. Please see the office action below.



Claim Objections
Claim 16 recites “A non-transitory data carrier carrying processor control code which when running on a device causes the device to perform the method of claim 1.” For better clarity, the examiner suggests amending the claimed language to incorporate the limitations that are performed in claim 1. For example, “A non-transitory data carrier carrying processor control code which when running on a device causes the device to perform: for each frame of the sequence: process the frame …” as recited in claim 1.
Claim 17 recites “A computer system configured to implement the method of claim 1”. For better clarity, the examiner suggests amending the claimed language to incorporate the limitations that are performed in claim 1. For example, “A computer system configured to: for each frame of the sequence: process the frame …” as recited in claim 1.
	Claim 18 recites “A consumer electronic device comprising the computer system of claim 17.” For better clarity, the examiner suggests amending the claimed language to incorporate the limitations that are performed by the computer system of claim 17. For example, “A consumer electronic device comprising a computer system configured to: for each frame of the sequence: process the frame …” as recited in claims 1 and 17.
 
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3, 6, 7, 8, 9, 10, 11 and 14-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lopatka et al (US Publication No.: 20190043489).
Claim 1, Lopatka et al discloses
Preamble: A method for recognizing at least one of a non-verbal sound event and a scene in an audio signal (Fig. 1 shows a classifier of an audio signal or acoustic input signal. Label 150,170 as non-verbal sound event and a scene. Paragraph 21 discloses classification of the acoustic signal into classes of scenes. Paragraph 22 discloses classification of acoustic signal into classes of non-verbal sound event.) comprising 
a sequence of frames of audio data (Fig. 1, label spectral frames.), the method comprising:
for each frame of the sequence (label 140,160 performs classification for each frame.):
processing the frame of the audio data to extract multiple acoustic features for the frame of audio data (Fig. 1, label 140,160 processes the spectral frames. Fig. 3, label 330 outputs acoustic features for the spectral frames of the audio data. Fig. 4, label 450 outputs the acoustic features for the spectral frames of the audio data.); and

processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame (Fig. 2, label 230 performs processing of the outputs from the DNN classifier, shown in Fig. 5, to determine the class. Paragraph 38 discloses that Fig. 7 shows the raw output scores vs. time (frame) for each of 8 labeled events. This indicates that the output of label 230 is for each frame.),
wherein processing the sound class scores includes applying a temporal structure constraint to the sound class scores to generate the sound class decision (Paragraph 40 discloses a smoothing constant applied to the scores in order to tune the backend processing circuit to react to longer events. Paragraph 41 discloses a threshold marker 756, where an event detection is output when the smoothed difference exceeds the threshold. The threshold and the smoothing constant are considered the temporal structure constraint.),
wherein applying the temporal structure constraint comprises processing the sequence of frames to determine whether the temporal structure constraint is met (Fig. 2, label 210,220 processes the spectral frames of the acoustic input signal shown in Fig. 1, label acoustic input signal and 180. The processing of such frames generates 
processing the sound class decisions for the sequence of frames to recognize that at least one of a non-verbal sound event and a scene. (Paragraph 1 discloses processing the sound class decisions output by label 230, for example, to trigger a desired action based on the event. This indicates that the at least one non-verbal event and a scene (classes as indicated in paragraphs 21-22) are recognized.)
Claim 3, Lopatka et al discloses classifying the acoustic features comprises classifying the frame of audio data using a set of first classifiers (Fig. 2, label 220 for both impulsive event detection and continuous event.) and wherein applying the temporal structure constraint comprises processing the sound class scores using a second classifier (Fig. 2, label 230 as the second classifier.).
Claim 6, Lopatka et al discloses the set of first classifiers comprises a set of neural network classifiers (Fig. 2, label DNN classifier, 220).

Claim 8, Lopatka et al discloses wherein the frame of audio data comprises time domain audio data for a time window (Fig. 1, label spectral frames, Paragraph 20 discloses STFT), and wherein processing the frame of audio data to extract the acoustic features for the frame of audio data comprises transforming the frame of audio data into frequency domain audio data (Paragraph 20 discloses STFT.).  
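For illustration only, the transformation of one time-domain frame into frequency-domain data can be sketched as below. This is a minimal sketch of a single windowed discrete Fourier transform, i.e. one column of a short-time Fourier transform; it is not taken from Lopatka or from the application. The Hann window, the function name and the direct (non-FFT) summation are assumptions chosen for readability.

```python
import cmath
import math

def frame_to_spectrum(frame):
    """Return the magnitude spectrum (bins 0..N/2) of one windowed frame."""
    n = len(frame)
    # Hann window (an assumption) to reduce leakage at the frame edges.
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct DFT sum for bin k; an FFT would be used in practice.
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum
```

Applying this function to successive, possibly overlapping, time windows of the audio signal would yield a sequence of spectral frames.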
Claim 9, Lopatka et al discloses wherein processing the frame of audio data to extract multiple acoustic features for the frame of audio data comprises processing the frame of audio data using a feature extraction neural network to extract the acoustic features for the frame (Paragraph 31 discloses MFCC processing of the input to generate features for the frames.).
Claim 10, Lopatka et al discloses wherein prior to said classifying the acoustic features to classify the frame (Fig. 1, label DNN classifier.), 
the method comprises concatenating the multiple acoustic features for the frame of audio data with multiple acoustic features for an adjacent frame of audio data in the sequence (Fig. 4, label 450 is the output of concatenation of label 430, Paragraph 32).  
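For illustration only, concatenating the acoustic features of a frame with those of adjacent frames can be sketched as follows. This is a minimal sketch, not the implementation of Lopatka or of the application; the function name, the symmetric context width and the edge handling (repeating the first or last frame) are hypothetical choices.

```python
def concatenate_with_context(features, context=1):
    """Concatenate each frame's feature vector with those of its neighbours.

    features: list of per-frame feature lists, all of equal length.
    context:  number of adjacent frames taken on each side (assumed symmetric).
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    n = len(features)
    stacked = []
    for i in range(n):
        row = []
        for j in range(i - context, i + context + 1):
            row.extend(features[min(max(j, 0), n - 1)])  # clamp index at the edges
        stacked.append(row)
    return stacked
```

With a context of 1, each output vector is three times the length of a single frame's feature vector.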
Claim 11, Lopatka et al discloses further comprising adjusting the sound class scores for multiple frames of the sequence of frames based on one or more of: knowledge about one or more of the sound classes; and knowledge about an environment in which the audio data was captured (Fig. 6, label target scores and non-target scores are related to the target events, according to the label associated with the 
Claim 14, Lopatka et al discloses wherein processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame comprises: filtering the sound class scores for the multiple frames to generate a smoothed score for each frame (Fig. 6, label smoothing circuit performs smoothing or filtering of the difference between the scores, label 640.); and 
comparing each smoothed score to a threshold to determine a sound class decision for each frame (Fig. 6, label 660 compares the smoothed scores to a threshold to determine the event detections.).
Claim 15, Lopatka et al discloses wherein processing the class decisions for the sequence of frames to recognize the at least one of a non-verbal sound event and scene further comprises determining a start and an end time of the at least one of a non-verbal sound event and a scene. (Paragraph 29 discloses “to constrain the size of the resulting feature vector, the frame rate of the spectral frames 180 may be limited to 20 frames per second or less.” This indicates that the at least one of the non-verbal sound event and the scene of the frame will have a duration, i.e., a start and end time. Paragraph 1 discloses triggering a desired action based on the event. Based on paragraphs 1 and 29, the triggering of the event, where the event is determined based on the classification shown in Fig. 1, involves the duration of the frame and hence the start and end time of the event.)
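For illustration only, one way in which per-frame decisions at a known frame rate map to start and end times can be sketched as below. This is a hypothetical sketch and is not drawn from Lopatka or from the application; the function name and the interval convention (end time taken at the first inactive frame) are assumptions, and the default of 20 frames per second simply reuses the figure quoted from paragraph 29.

```python
def event_intervals(decisions, frame_rate=20.0):
    """Convert per-frame boolean decisions into (start_s, end_s) intervals.

    decisions:  sequence of booleans, one per frame (True = event active).
    frame_rate: frames per second (20.0 is an assumed default).
    """
    intervals = []
    start = None
    for i, active in enumerate(decisions):
        if active and start is None:
            start = i                      # event begins at this frame
        elif not active and start is not None:
            intervals.append((start / frame_rate, i / frame_rate))
            start = None                   # event ended before this frame
    if start is not None:                  # event still active at the last frame
        intervals.append((start / frame_rate, len(decisions) / frame_rate))
    return intervals
```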
Claim 16, Lopatka et al discloses A non-transitory data carrier carrying processor control code which when running on a device causes the device to perform the method of claim 1.  (paragraph 65. Please see claim 1.)

Claim 18, Lopatka et al discloses A consumer electronic device comprising the computer system of claim 17. (Paragraphs 62-65. Please see claims 1 and 17.)
Claim 19, Lopatka et al discloses 
Preamble: A system for recognizing at least one of a non-verbal sound event and a scene in an audio signal (Fig. 1 shows a classifier of an audio signal or acoustic input signal. Label 150,170 as non-verbal sound event and a scene. Paragraph 21 discloses classification of the acoustic signal into classes of scenes. Paragraph 22 discloses classification of acoustic signal into classes of non-verbal sound event.) comprising a sequence of frames of audio data (Fig. 1, label spectral frames.), the system comprising 
a microphone (paragraph 52) to capture the audio data (Fig. 1, label acoustic input) and one or more processors (Paragraph 52,53), 
wherein the system is configured to: 
for each frame of the sequence (label 140,160 performs classification for each frame.): 
process the frame of audio data to extract multiple acoustic features for the frame of audio data (Fig. 1, label 140,160 processes the spectral frames. Fig. 3, label 330 outputs acoustic features for the spectral frames of the audio data. Fig. 4, label 450 outputs the acoustic features for the spectral frames of the audio data.); and 
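classify the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class (Paragraph 34 discloses the classifier outputs event scores, “which indicate the probability that an audio frame belongs to a given event class with an associated label. A variety of event classes are possible include classes that represent target event …” ); 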
process the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame (Fig. 2, label 230 performs processing of the outputs from the DNN classifier, shown in Fig. 5, to determine the class. Paragraph 38 discloses that Fig. 7 shows the raw output scores vs. time (frame) for each of 8 labeled events. This indicates that the output of label 230 is for each frame.); and 
process the class decisions for the sequence of frames to recognize the at least one of a non-verbal sound event and scene. (Paragraph 1 discloses processing the sound class decisions output by label 230, for example, to trigger a desired action based on the event. This indicates that the at least one non-verbal event and a scene (classes as indicated in paragraphs 21-22) are recognized.) 
Claim 20, Lopatka et al discloses
Preamble: A sound recognition device for recognizing at least one of a non-verbal sound event and scene in an audio signal (Fig. 1 shows a classifier of an audio signal or acoustic input signal. Label 150,170 as non-verbal sound event and a scene. Paragraph 21 discloses classification of the acoustic signal into classes of scenes. Paragraph 22 discloses classification of acoustic signal into classes of non-verbal sound event.) comprising a sequence of frames of audio data (Fig. 1, label spectral frames.), the sound recognition device comprising: 
a microphone (paragraph 52) to capture the audio data (Fig. 1, label acoustic input); and 
a processor (Paragraph 52,53) configured to: 

for each frame of the sequence (label 140,160 performs classification for each frame.): 
process the frame of audio data to extract multiple acoustic features for the frame of audio data (Fig. 1, label 140,160 processes the spectral frames. Fig. 3, label 330 outputs acoustic features for the spectral frames of the audio data. Fig. 4, label 450 outputs the acoustic features for the spectral frames of the audio data.); and 
classify the acoustic features to classify the frame by determining, for each of a set of sound classes, a score that the frame represents the sound class (Paragraph 34 discloses the classifier outputs event scores, “which indicate the probability that an audio frame belongs to a given event class with an associated label. A variety of event classes are possible include classes that represent target event …” ); 
process the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame (Fig. 2, label 230 performs processing of the outputs from the DNN classifier, shown in Fig. 5, to determine the class. Paragraph 38 discloses that Fig. 7 shows the raw output scores vs. time (frame) for each of 8 labeled events. This indicates that the output of label 230 is for each frame.); and 
process the class decisions for the sequence of frames to recognize the at least one of a non-verbal sound event and scene (Paragraph 1 discloses processing the sound class decisions output by label 230, for example, to trigger a desired action based on the event. This indicates that the at least one non-verbal event and a scene (classes as indicated in paragraphs 21-22) are recognized.).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 12 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Lopatka et al (US Publication No.: 20190043489) in view of McLoughlin et al (Title: Continuous robust sound event classification using time-frequency features and deep learning).
Claim 12, Lopatka et al discloses wherein processing the sound class scores for multiple frames of the sequence of frames to generate, for each frame, a sound class decision for each frame (Fig. 2, label 230 performs processing of the outputs from the DNN classifier, shown in Fig. 5, to determine the class. Paragraph 38 discloses that Fig. 7 shows the raw output scores vs. time (frame) for each of 8 labeled events. This indicates that the output of label 230 is for each frame.) comprises using features or MFCCs across more than one frame (Figs. 6 and 8 show the event scores processed, wherein such event scores are generated by processing MFCCs or features in the DNN, Fig. 1, 
	McLoughlin et al discloses performing MFCC-HMM using the Viterbi algorithm. (Section MFCC-HMM discloses the use of Viterbi to determine features, and Section SIF with SVM, DNN and CNN performs classification.) It would have been obvious to one skilled in the art before the effective filing date of the application to modify the DNN classifier of Lopatka et al with the DNN using the Viterbi algorithm as disclosed by McLoughlin et al so as to improve the performance of the classifier.
	Claim 13, McLoughlin et al discloses the optimal path search algorithm is a Viterbi algorithm (Section MFCC-HMM discloses using Viterbi algorithm.).
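For illustration only, a generic log-domain Viterbi search over per-frame class scores can be sketched as follows. This is a textbook formulation offered as a minimal sketch; it is not the MFCC-HMM system of McLoughlin et al nor the implementation of the application. The function name, argument layout and the use of log probabilities are assumptions, and at least one frame is assumed to be present.

```python
def viterbi(log_emissions, log_transitions, log_initial):
    """Return the most likely state (class) sequence for a frame sequence.

    log_emissions[t][s]:   log score of state s at frame t (non-empty).
    log_transitions[p][s]: log probability of moving from state p to state s.
    log_initial[s]:        log probability of starting in state s.
    """
    n_states = len(log_initial)
    score = [log_initial[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emissions)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_transitions[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_transitions[best_prev][s]
                             + log_emissions[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace the optimal path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In such a formulation the transition terms penalize rapid switching between classes, so an isolated frame whose score favors a different class need not change the decoded sequence.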

Allowable Subject Matter
Claims 4 and 5 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the 

/LINDA WONG/Primary Examiner, Art Unit 2655