Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 09/26/2022 has been entered.
 

Reason for no 101 rejection for claim 28
Examiner’s position for not making a 101 rejection is that “A computer-readable storage device storing instructions…..” cannot be interpreted as a transitory, propagating signal. Thus, a storage device does not cover a signal per se.
                                             
Compact Prosecution
Examiner would like to propose incorporating the limitation wherein the target sound detector detects target sounds such as an alarm, a doorbell, a siren, a baby crying, and a dog barking, and displays an indication on a screen to inform a user of the specific sound detected.


Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP §2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and

 (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
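
For illustration only, the three-prong test quoted above can be read as a conjunction of three findings. The following is a minimal sketch under that reading; the class name, field names, and function name are hypothetical labels chosen for this example and do not appear in MPEP §2181.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    # Hypothetical labels for the three findings described in prongs (A)-(C).
    uses_means_or_placeholder: bool         # prong (A): "means", "step", or a generic placeholder
    modified_by_functional_language: bool   # prong (B): e.g. "means for", "configured to"
    recites_sufficient_structure: bool      # prong (C) is met when this is False


def meets_three_prong_test(lim: Limitation) -> bool:
    """Return True only when all three prongs, as quoted above, are satisfied."""
    return (
        lim.uses_means_or_placeholder
        and lim.modified_by_functional_language
        and not lim.recites_sufficient_structure
    )
```

Under this sketch, a limitation that recites sufficient structure fails prong (C), which mirrors how the presumption arising from the word “means” is rebutted.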

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are:
“means for detecting a target sound, the means for detecting the target sound comprising a first stage… wherein the first stage includes means for generating a binary target… ” in claim 29. 
“means for detecting an audio scene…” “means for detecting an audio scene change in the audio data…” “means for classifying the audio data as a particular audio scene in …”

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. 
In Section 0032 of the specification, applicant discloses that device 102, which includes microphone… and wireless speaker and voice command device, for example “a smart speaker device or home automation system”, corresponds to the claimed means for detecting a target sound.
Also, in Section 0035 of the specification, applicant discloses that at least one of a Bayesian classifier or a Gaussian Mixed Model (GMM) classifier is the corresponding structure for the means for generating a binary target, as recited in claim 29.
In Section 0066 of the specification, applicant provides that the scene detector, which can be a camera, a GPS receiver, or an audio scene detector, corresponds to the means for detecting an audio scene.
In Sections 0058 and 0059 of the specification, applicant provides an audio scene change detector, which includes a scene transition classifier and which corresponds to the structure for the means for detecting an audio scene change in the audio data recited in claim 30.
Also, in Section 0035 of the specification, applicant discloses that at least one of a Bayesian classifier or a Gaussian Mixed Model (GMM) classifier is the corresponding structure for the means for classifying the audio data as a particular audio scene, as recited in claim 30.

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

No 112 rejection is needed since corresponding structures are provided in the specification, as discussed above.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-14 and 16-22 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant argues that the Office has failed to provide a proper rationale for combining Parthasarathi and Gross. Due to the new amendments to the claims, Examiner has replaced the secondary reference (Gross (US 20180108369)) with a new reference (Mitchell et al. (US10783434)). See below for the rationale for combining Parthasarathi (20170270919) with Mitchell et al. (US10783434).
In reply, Examiner’s position is that Parthasarathi discloses a system wherein a target sound detector (Wakeword detection 220 in device 110) comprises a first stage (before Wakeword detection) and a second stage (after Wakeword detection), the first stage including a binary target sound classifier (Classifier 1520) configured to process the audio data (Section 0149, lines 2-5; thus the classifier confirms the wake word from the other audio signal).
Upon the system detecting the wakeword, the local device may wake and begin transmitting audio data corresponding to input audio to the server for speech processing (Section 0038, lines 1-8: “local device may “wake” and begin transmitting audio data”).
This means that if the received speech data is without the wakeword, no transmission or processing of the speech data takes place. (See Section 0101, lines 7-9, where the system describes “wakeword triggered interaction the system”.)
Parthasarathi teaches a Wakeword confirmation module that confirms a wakeword within an utterance before the audio data is recognized or processed. Parthasarathi does not disclose detecting only target non-speech data for further processing.
The secondary reference, Mitchell, discloses a similar system (an automatic recognition system) that recognizes audio data only when target non-speech audio data is detected by a classifier. (Col. 3 lines 44-52: Mitchell’s system is described as “… a system configured to recognize only target sounds within an audio data”.) Therefore, combining the teaching of recognizing only target sounds within an utterance will improve the performance of the system because only the needed speech data is transmitted and processed. This will make the system faster and will not waste bandwidth.
Applicant further argues that Parthasarathi classifies all non-speech audio as "non-speech" and does not distinguish between different types of non-speech audio.
In other words, applicant argues that the cited references fail to disclose distinguishing between different types of non-speech audio.
In reply, Examiner respectfully disagrees because, as indicated by applicant in the remarks filed on 05/04/2022, the main reference teaches a classifier H that is used to label incoming audio as belonging to desired speech or undesired speech. Hence Parthasarathi clearly discloses classifying all non-speech audio as non-speech and does not further classify non-speech audio into different classes.
The newly added secondary reference (Mitchell) clearly teaches recognizing only target sounds, such as a baby crying, a dog barking, or a female speaking, and not recognizing the non-target sounds class. This therefore means the detected target sounds are classified as either baby crying, dog barking, or female speaking. (Mitchell: Col. 3 lines 15-25 and lines 45-55; Col. 4 lines 35-49.)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include the teaching of detecting only target speech data. The motivation is that the combination will make the system perform better and faster.

Applicant argues that the combination fails to disclose "a first stage ... configured to activate a second stage in response to detection of at least one of the one or more target non-speech sounds," as in claim 1 (emphasis added).
In reply, Examiner respectfully disagrees because Parthasarathi discloses a system wherein a target sound detector (Wakeword detection 220 in device 110) comprises a first stage (before Wakeword detection) and a second stage (after Wakeword detection), the first stage including a binary target sound classifier (Classifier 1520, Gaussian Mixture Model (GMM) techniques) configured to process the audio data (Section 0149, lines 2-5; thus the classifier confirms the wake word from the other audio signal).
Upon the system detecting the wakeword, the local device may wake and begin transmitting audio data corresponding to input audio to the server for speech processing; this is the second stage of the process, where the local device begins to transmit the utterance upon detecting the wakeword (Section 0038, lines 1-8: “local device may “wake” and begin transmitting audio data”).
This means that if the received speech data is without the wakeword, no transmission or processing of the speech data takes place. (See Section 0101, lines 7-9, where the system describes “wakeword triggered interaction the system”.)
Parthasarathi teaches a Wakeword confirmation module that confirms a wakeword within an utterance before the audio data is recognized or processed. Parthasarathi does not disclose detecting only target non-speech data for further processing.
The secondary reference also teaches a similar system where only targeted non-speech data are recognized and processed. The first stage is when the machine learning model classifies a plurality of frames individually; thus the machine learning model determines whether a signal is a target speech or not a target speech (Col. 3 lines 44-55). The second stage is when the machine learning model trains on the sounds that are classified as target sounds. See Col. 4 lines 54-60.
Regarding claims 6, 8, 18, 21, 24, and 30
As explained above, all the deficiencies argued by applicant have been addressed with regard to how the combination of Parthasarathi and Mitchell reads on claims 6, 8, 18, 21, 24, and 30. The rejections of claims 6, 8, 18, 21, 24, and 30 are therefore maintained.
Regarding claims 15 and 26
The same limitations are argued by applicant in reference to claims 1-5, 7, 9-14, 16, 17, 19, 20, 22, 23, 25, and 27-29; therefore the responses above are applicable to claims 15 and 26, and the arguments are moot.


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7, 9-14, 16-17, 19, 22-23, 25, and 27-29 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Mitchell et al. (10783434).
Claim 1, Parthasarathi discloses a device to perform sound detection (Device 110, fig. 19), comprising:
one or more processors (Controllers/Processor 1904- shown in fig. 19) comprising:
a buffer configured to store audio data  (Section 0033, lines 19-21- the audio data is stored in the acoustic front end located on the device 110 prior to transmission and it is transmitted only based on when it contains a Wakeword, also see Section 0103, lines 8-10) 
a target sound detector, (Wakeword detection 220 in device 110) configured to detect the presence or absence of multiple one or more target non-speech sounds in the audio data, (Section 0032, lines 1-5- the input audio includes speech or other noise which means the audio can include other non-speech sounds) 

[Image: media_image1.png]

A “Yes” output from the Wakeword Confirmation means a target sound has been detected.  The wake word “Alexa” has been detected within the input data 111. 


comprising a first stage (before Wakeword detection) including a binary target sound classifier (Classifier 1520, Gaussian Mixture Model (GMM) techniques, Section 0035, lines 14-17) configured to process the audio data, the first stage configured to activate a second stage in response to detection of at least one of the one or more target speech sounds (Section 0038, lines 1-8: “local device may “wake” and begin transmitting audio data”; the second stage is when the local device begins transmitting the audio data upon detection of the wakeword).

and a second stage (after Wakeword detection), wherein the second stage is configured to receive the audio data from the buffer in response to the detection of the target sound. (Based on Section 0038, lines 1-8, the local device wakes up and begins transmitting audio data upon the wakeword confirmation module detecting a wakeword in the input data.)
(This means that the system recognizes and processes the input data upon the system detecting a wakeword. For example, in “Alexa, play some music”, the utterance “play some music” is processed only when the system detects “Alexa”, which reads on the limitation Wakeword.)
The same idea is disclosed in applicant’s specification; see Paragraph 0044, “a wakeup interrupt signal”.
Parthasarathi does not disclose wherein the second stage is configured to receive the audio data from the buffer in response to the detection of the non-target sound.
Mitchell discloses a similar system (an automatic recognition system) that recognizes audio data only when target non-speech audio data is detected by a classifier. (Col. 3 lines 44-52: Mitchell’s system is described as “… a system configured to recognize only target sounds within an audio data”.) Therefore, combining the teaching of recognizing only target sounds within an utterance will improve the performance of the system because only the needed speech data is transmitted and processed. This will make the system faster and will not waste bandwidth.
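
For illustration only, the two-stage arrangement recited in claim 1 (a buffer, a binary first stage, and a second stage that receives the buffered audio only after the first stage fires) might be sketched as follows. This is a minimal sketch under stated assumptions: the class name, the energy-based gate, the peak-based labeler, and all thresholds are hypothetical and are not taken from Parthasarathi, Mitchell, or applicant’s specification.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Frame = List[float]  # one non-empty frame of audio samples (assumed representation)


class TwoStageDetector:
    """Hypothetical two-stage detector: the second stage runs only when the first fires."""

    def __init__(
        self,
        first_stage: Callable[[Frame], bool],
        second_stage: Callable[[List[Frame]], str],
        buffer_frames: int = 8,
    ) -> None:
        self.buffer: Deque[Frame] = deque(maxlen=buffer_frames)
        self.first_stage = first_stage
        self.second_stage = second_stage

    def push(self, frame: Frame) -> Optional[str]:
        # Every frame is buffered, whether or not a target sound is present.
        self.buffer.append(frame)
        if not self.first_stage(frame):
            return None  # binary "no": the second stage stays inactive
        # Binary "yes": the second stage receives the audio data from the buffer.
        return self.second_stage(list(self.buffer))


def energy_gate(frame: Frame) -> bool:
    # Stand-in binary classifier (assumption): mean energy above a fixed threshold.
    return sum(x * x for x in frame) / len(frame) > 0.25


def label_by_peak(frames: List[Frame]) -> str:
    # Stand-in second stage (assumption): label the buffered audio by its peak amplitude.
    peak = max(abs(x) for f in frames for x in f)
    return "loud_event" if peak > 0.9 else "soft_event"
```

In this sketch a quiet frame is stored but produces no output, while a later frame that passes the gate causes the second stage to process the buffered frames, including the earlier quiet one.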
Claim 2, Parthasarathi in view of Mitchell (Col. 3 lines 14-20, recognizing “non-verbal sound” such as baby crying or dog barking) discloses wherein the binary target sound classifier is further configured to generate a binary signal including a first value and a second value (Parthasarathi: Section 0038, lines 1-8; thus “a Yes or No value” from the Wakeword confirmation module), wherein the first value is set to activate the second stage in response to detecting the presence of at least one or more target sounds in the audio data (Parthasarathi: Section 0038, lines 1-4; the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first value); and
the second value is set to refrain from activating the second stage in response to detecting that none of the one or more target non-speech sounds are in the audio data. (Parthasarathi: Section 0113, lines 11-12- thus second probability score 0 reads on the second value which will be for undesired speech) 
Claim 3, Parthasarathi in view of Mitchell disclose wherein the binary target sound classifier includes a neural network. (Parthasarathi: Section 0096, lines 2-4-thus a classifier trained in a manner to use the RNN which is Recurrent Neural Network- See Section 0074, lines 10-11)
Claim 4, Parthasarathi in view of Mitchell discloses wherein the binary target sound classifier includes at least one of a Bayesian classifier or a Gaussian Mixed Model  classifier. (Parthasarathi: Section 0035, lines 15-18- thus the Gaussian Mixture Model (GMM) technique) 
Claim 5, Parthasarathi in view of Mitchell discloses wherein the one or more target non-speech sounds include a first target sound and a second target sound (Mitchell: one or more target sound classes reads on a first and second target sound)
wherein the second stage includes a multiple target sound classifier configured to generate a detector output that indicates the presence or absence of the first target sound in the audio data and the presence or absence of the second target sound in the audio data (Mitchell: Col. 4 lines 41-49; target sound classes means there will be first and second target sounds, for example a baby cry or a dog bark, Col. 3 lines 44-50), and wherein the multiple target non-speech sounds correspond to multiple classes of sound events, and wherein the one or more processors are further configured to cause an output device to indicate each target sound detected in the audio data. (Mitchell: Col. 3 lines 53-56: …for each set of sound class a score is output or determined to represent the sound class..)
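
For illustration only, a detector output that indicates presence or absence per target sound class, followed by an indication of each detected sound, might be sketched as below. This is a minimal sketch assuming per-class scores and a single fixed threshold; the function names, class labels, and threshold value are hypothetical and are not taken from Mitchell or Parthasarathi.

```python
from typing import Dict


def detector_output(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Map each class score to a presence (True) or absence (False) flag."""
    return {label: score >= threshold for label, score in scores.items()}


def indicate(output: Dict[str, bool]) -> str:
    """Build the text an output device could show for each detected target sound."""
    detected = sorted(label for label, present in output.items() if present)
    if not detected:
        return "No target sound detected"
    return "Detected: " + ", ".join(detected)
```

With scores of 0.8 for a hypothetical "baby_cry" class and 0.2 for "dog_bark", the output marks only the first class as present, and the indication names that class alone.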
Claim 7, Parthasarathi in view of Mitchell discloses wherein the signal corresponds to a wakeup interrupt signal. (Parthasarathi: Section 0038 lines 1-4- thus detecting the wake word reads on the wakeup signal) 
Claim 9, Parthasarathi in view of Mitchell discloses further comprising a scene detector configured to classify an environment of the device at least partially (Parthasarathi: Section 0035, lines 11-23; “environmental noise or background noise” reads on classification based on environment; see also Section 0041) based on an input signal from the camera, wherein the second stage includes a multiple target sound classifier that is configured to classify the audio data from among one or more classes of sound events (Parthasarathi: Section 0109, lines 8-13; labeling the audio frames based on desired speech, non-desired speech, or non-speech reads on the multiple classes/categories), and wherein operation of the multiple target sound classifier is at least partially based on the environment classified by the scene detector. (Parthasarathi: Section 0120; environmental characteristics are used to detect or classify the scene/environment.)
Claim 10, Parthasarathi in view of Mitchell discloses wherein the multiple target sound classifier is adjusted to focus on one or more particular classes of the one or more classes of sound events that correspond to the environment. (Parthasarathi: Section 0127, lines 12-17; thus training the model or adaptation of the acoustic condition to improve the distinction between the units of speech and between speech and noise reads on the adjustments of the classes.)
Claim 11, Parthasarathi in view of Mitchell discloses wherein the multiple target sound classifier is further configured to select a particular set of sound event classes that correspond to the environment from among multiple sets of sound event classes; (Gross: Section 0034,lines 2-6- thus the “sound representing a construction back ground” reads on the environment of the sound event) and classify the audio data based on the sound event classes of the particular set. (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) 
Claim 12, Parthasarathi in view of Mitchell discloses wherein the target sound detector is configured to select, from among one or more sets of trained data, a particular set of trained data that corresponds to a detected environment of the device and to process the audio data based on the particular set of trained data. (Parthasarathi: Section 0146, lines 1-5- thus “determining whether the audio corresponds to particular keywords recognizable by the device”… lines 11-18- thus the models includes predicted location which can be a detected environment of the device) 
Claim 13, Parthasarathi in view of Mitchell discloses wherein the environment is detected based on at least one of a camera, a location detection system, (Parthasarathi: Section 0146, lines 11-18- thus predicted location) or an audio scene detector.
Claim 14, Parthasarathi in view of Mitchell discloses further comprising an audio scene detector that is configured to be activated responsive to detection of the presence of the at least one or more target sounds in the audio data by the binary target sound classifier (Parthasarathi: Section 0121, lines 5-10; thus the desired talker’s speech, where the features correspond to a desired user, with the understanding that a desired user is selected from a plurality of target voices or sounds), the audio scene detector comprising:
an audio scene change detector configured to process the audio data and to generate a scene change signal in response to detection of an audio scene change (Parthasarathi: Section 0127, lines 12-15; thus acoustic condition adaptation means the system can respond to an audio scene change); and an audio scene classifier configured to receive the audio data from the buffer in response to the detection of the audio scene change. (Parthasarathi: Section 0041, lines 4-6; thus environmental noise changed to background noise reads on the audio scene change.)
Claim 16, Parthasarathi in view of Mitchell discloses wherein the audio scene change detector is further configured to detect the audio scene change based on detecting changes in at least one of noise statistics or non-stationary sound statistics. (Parthasarathi: Section 0127, lines 12-15- thus acoustic condition adaptation means the system can respond to an audio scene change)
Claim 17, Parthasarathi in view of Mitchell discloses wherein the audio scene change detector includes a classifier trained using audio data corresponding to transitions between scenes. (Parthasarathi: Section 0127, lines 12-15- thus acoustic condition adaptation means the system can respond to an audio scene change)
Claim 19, Parthasarathi in view of Mitchell discloses further comprising a microphone coupled to the one or more processors and configured to generate the audio data. (Parthasarathi: Fig. 19, Microphone 1950 generates the audio data) 
Claim 22, Parthasarathi discloses wherein the second stage is configured to process a first portion of the audio data (Parthasarathi: “First portion of audio data”, see fig. 11), the first portion corresponding to a portion of the audio data in which the first stage detected the presence of the at least one of the one or more target non-speech sounds. (Parthasarathi: Section 0104, lines 6-8: first portion of that input audio corresponds to the desired speaker … that early portion of the input audio may be determined…)

[Image: media_image2.png]

The first portion of audio data 1102 as indicated on the screenshot reads on the first portion that will indicate if the audio data contains the wakeword. 


Claim 23, Parthasarathi discloses a method of target sound detection (Wakeword detection 220 in device 110), the method comprising:
storing audio data in a buffer; (Section 0033, lines 19-21; the audio data is stored in the acoustic front end located on the device 110 prior to transmission, and it is transmitted only when it contains a Wakeword; also see Section 0103, lines 8-10)
processing the audio data in the buffer using a binary target sound classifier (Classifier 1520, Gaussian Mixture Model (GMM) techniques, Section 0035, lines 14-17) in a first stage (before Wakeword detection) of a target sound detector configured to detect the presence or absence of one or more target non-speech sounds; (Section 0038, lines 1-8: “local device may “wake” and begin transmitting audio data”; the second stage is when the local device begins transmitting the audio data upon detection of the wakeword).

activating a second stage of the target sound detector (after Wakeword detection) in response to detection of at least one of the one or more target non-speech sounds by the first stage (Section 0038, lines 1-8: “local device may “wake” and begin transmitting audio data”) and processing the audio data from the buffer using a multiple target sound classifier in the second stage. (Based on Section 0038, lines 1-8, the local device wakes up and begins transmitting audio data upon the wakeword confirmation module detecting a wakeword in the input data.)
(This means that the system recognizes and processes the input data upon the system detecting a wakeword. For example, in “Alexa, play some music”, the utterance “play some music” is processed only when the system detects “Alexa”, which reads on the limitation Wakeword.)
The same idea is disclosed in applicant’s specification; see Paragraph 0044, “a wakeup interrupt signal”.
Parthasarathi does not disclose wherein the second stage is configured to receive the audio data from the buffer in response to the detection of the non-target sound.
Mitchell discloses a similar system (an automatic recognition system) that recognizes audio data only when target non-speech audio data is detected by a classifier. (Col. 3 lines 44-52: Mitchell’s system is described as “… a system configured to recognize only target sounds within an audio data”.) Therefore, combining the teaching of recognizing only target sounds within an utterance will improve the performance of the system because only the needed speech data is transmitted and processed. This will make the system faster and will not waste bandwidth.


Claim 25, Parthasarathi in view of Mitchell discloses further comprising causing an output device to indicate each target sound detected in the audio data. (Parthasarathi: Section 0109, lines 8-13- labeling the audio frames based on desired speech, non-desired speech or non-speech reads on the multiple classes/categories)   

Claim 27, Parthasarathi in view of Mitchell discloses further comprising processing the audio data to detect an audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) and classifying the audio data based on a second set of audio scene classes, (Mitchell: Col. 4 lines 35-40- a target sound class and scenes such as a railway station or a kitchen) wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of audio scene classes in the second set of audio scene classes. (Mitchell: Col. 16 lines 8-16- thus thresholding the scores of frames of baby cry one threshold per class/per score) 

Claim 28, Parthasarathi discloses a computer-readable storage device storing instructions that, when executed by one or more processors, (Controllers/Processor 1904- shown in fig. 19) cause the one or more processors to store audio data in a buffer; (Section 0033, lines 19-21- the audio data is stored in the acoustic front end located on the device 110 prior to transmission and it is transmitted only based on when it contains a Wakeword, also see Section 0103, lines 8-10) 
process the audio data in the buffer using a binary target sound classifier (Classifier 1520-Gaussian Mixture Model (GMM) techniques, section 0035, lines 14-17) in a first stage (before Wakeword detection) of a target sound detector configured to detect the presence or absence of one or more target non-speech sounds;  (Section 0038, lines 1-8 “local device may “wake”  and begin transmitting audio data- the second stage is when the local device begins transmitting the audio data upon detection of the wakeword).
activate a second stage of the target sound detector (after Wakeword detection) in response to detection of at least one of the one or more target non-speech sounds by the first stage; (Section 0038, lines 1-8 “local device may “wake”  and begin transmitting audio data) and process the audio data from the buffer using a multiple target sound classifier in the second stage. (Based on Section 0038, lines 1-8 the local device wakes up and begin transmitting audio data upon the wakeword confirmation module detecting a wakeword in the input data.) 
(This means that the system recognizes and processes the input data upon detecting a wakeword. For example, in “Alexa, play some music,” the utterance “play some music” is processed only when the system detects “Alexa,” which reads on the limitation Wakeword.)
The same idea is disclosed in applicant’s specification; see Paragraph 0044, “a wakeup interrupt signal.”
Parthasarathi does not disclose wherein the second stage is configured to receive the audio data from the buffer in response to the detection of the non-target sound.
Mitchell discloses a similar system (an automatic recognition system) that recognizes audio data only when target non-speech audio data is detected by a classifier. (Col. 3 lines 44-52: Mitchell’s system is described as “… a system configured to recognize only target sounds within an audio data”.) Therefore, combining the teaching of recognizing only target sounds within an utterance will improve the performance of the system because only the needed speech data is transmitted and processed. This will make the system faster and will not waste bandwidth.




Claim 29, Parthasarathi discloses an apparatus (Device 110, Fig. 12) comprising means for detecting one or more target non-speech sounds, (Wakeword detection 220 in device 110: in detecting the wakeword, some non-speech sounds such as other noise can also be detected as input data. See Section 0032, line 2)
the means for detecting the one or more target non-speech sounds comprising a first stage (before Wakeword detection) and a second stage, (after Wakeword detection) wherein the first stage includes means for generating a binary target sound classification of audio data and for activating the second stage in response to classifying the audio data as including any of the one or more target non-speech sounds; (Section 0038, lines 1-8: “local device may ‘wake’ and begin transmitting audio data”; the second stage is when the local device begins transmitting the audio data upon detection of the wakeword)
and means for buffering the audio data and for providing the audio data to the second stage in response to the classification of the audio data as including at least one of the one or more target non-speech sounds. (Based on Section 0038, lines 1-8, the local device wakes up and begins transmitting audio data upon the wakeword confirmation module detecting a wakeword in the input data.)
(This means that the system recognizes and processes the input data upon detecting a wakeword. For example, in “Alexa, play some music,” the utterance “play some music” is processed only when the system detects “Alexa,” which reads on the limitation Wakeword.)
The same idea is disclosed in applicant’s specification; see Paragraph 0044, “a wakeup interrupt signal.”
Parthasarathi does not disclose wherein the second stage is configured to receive the audio data from the buffer in response to the detection of the non-target sound.
Mitchell discloses a similar system (an automatic recognition system) that recognizes audio data only when target non-speech audio data is detected by a classifier. (Col. 3 lines 44-52: Mitchell’s system is described as “… a system configured to recognize only target sounds within an audio data”.) Therefore, combining the teaching of recognizing only target sounds within an utterance will improve the performance of the system because only the needed speech data is transmitted and processed. This will make the system faster and will not waste bandwidth.

Claim(s) 6, 8, 18, 21, 24 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Mitchell (US 20180108369) and further in view of Fink (US9749528).

Claim 6, Parthasarathi in view of Mitchell discloses wherein the binary target sound classifier (Parthasarathi: Classifier 1520- Section 0035, lines 13-18 “linear classifiers”).
Parthasarathi in view of Mitchell does not disclose a buffer included in a low-power domain and configured to operate in an always-on mode, wherein the second stage is configured to transition from a low-power state to an active state responsive to receiving the signal.
 
Fink discloses a buffer included in a low-power domain and configured to operate in an always-on mode (Fink: Col. 3 lines 38-40, low power consumption), wherein the second stage is configured to transition from a low-power state to an active state responsive to receiving the signal. (Fink: Col. 4 lines 48-50, second power stage reads on when the stage has transitioned from first stage to second stage)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of operating a system where a low-power state is activated when the mode is switched. The motivation is that electric power will be saved.

Claim 8, Parthasarathi in view of Mitchell discloses the first stage (Parthasarathi: before the wake word is detected, Section 0038); however, Parthasarathi in view of Mitchell does not disclose activating a camera in response to the detection of a target sound by the first stage.
Fink discloses activating a camera in response to the detection of a target sound by the first stage. (Fink: Col. 4 lines 3-6, thus the wakeup signal activates one or more components of the video processor (camera))
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of automatic camera activation. The motivation is that the system saves power when it activates the camera only when needed.
Claim 18, Parthasarathi in view of Mitchell discloses wherein the audio scene detector corresponds to a hierarchical detector (Mitchell: Col. 3 lines 15-21 audio scene indicator ) and the audio scene classifier is configured to classify the audio data according to a second set of audio scene classes, (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) 
 Parthasarathi in view of Mitchell does not disclose the audio scene change detector is configured to detect the audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; 
wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of the audio scene classes in the second set of audio scene classes. 
Fink discloses the audio scene change detector is configured to detect the audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; (Sensor stage A (102a) is the first audio scene detector) 
wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of the audio scene classes in the second set of audio scene classes. (Abstract, lines 11-15- thus the detection or activation of the wakeup signal calls for more power which means less count for the first stage is needed)  
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.

Claim 21, Parthasarathi in view of Mitchell discloses wherein the one or more processors are implemented in a wireless speaker (Parthasarathi: Speaker 1960 wireless because device 110 is a wireless device) and voice activated device (Parthasarathi: Speech Controlled device 110a in Fig. 21) that includes:
an integrated assistant application configured to be activated responsive to the integrated assistant application, (Parthasarathi: Speech Controlled device 110a in Fig. 21 supports assistant application)
Parthasarathi in view of Mitchell does not disclose  a camera further configured to be activated responsive to detection of the presence of any of multiple target sounds in the audio data by the binary target sound classifier. 
Fink discloses a camera further configured to be activated responsive to detection of the presence of any of multiple target sounds in the audio data by the binary target sound classifier. 
 (Col. 2 lines 45-50- sensor stages 102a-102n allows the camera system to detect a particular type of activity which includes sound detection) 
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.

Claim 24, Parthasarathi in view of Mitchell and further in view of Fink discloses wherein the binary target sound classifier and the buffer operate in an always-on mode, (Parthasarathi: Section 0113, lines 1-3, thus “a first score that corresponds to audio data frame prior to the detection of a desired speech/wake word” where the data is always on) and wherein activating the second stage includes sending a signal from the first stage to the second stage (Parthasarathi: Section 0038, lines 1-4, the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first score) and transitioning the second stage from a low-power state to an active state responsive to receiving the signal at the second stage. (Fink: Col. 4 lines 48-50, second power stage reads on when the stage has transitioned from first stage to second stage)

Claim 30, Parthasarathi in view of Mitchell discloses the apparatus further comprising means for detecting an audio scene, the means for detecting the audio scene (Col. 3 lines 19-26, indicating audio scene) comprising:
and means for classifying the audio data as a particular audio scene in response to detection of the audio scene change. (Parthasarathi: Section 0035, lines 15-18- thus the Gaussian Mixture Model (GMM) technique)
Parthasarathi in view of Mitchell does not disclose means for detecting an audio scene change in the audio data;
Fink discloses means for detecting an audio scene change in the audio data;
 (Fink: Col. 4 lines 48-50 second power stage reads on when the stage has transitioned from first stage to second stage) 
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.
Claims 15 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Mitchell et al. (US 10783434) as applied to claims 1-5, 7, 9-14, 16-17, 19-20, 22-23, 25 and 27-29 above and further in view of Mitchell et al. (US20210193155).
Claims 15 and 26, Parthasarathi in view of Mitchell et al. (US 10783434) discloses wherein the audio scene classifier is configured to classify the audio data according to multiple audio scene classes, (Parthasarathi: Section 0129 lines 15-18- thus the classifier may take the form of an acoustic model) the multiple audio scene classes in a car, (Mitchell: Col. 3 lines 19-26 audio scenes indicators) 
Parthasarathi in view of Mitchell et al. (US 10783434)  does not disclose wherein the multiple audio scene classes including at least two of at home, in an office, in a restaurant, on a train, on a street, indoors, or outdoors scenes.
Mitchell (US20170270919) discloses multiple audio scene classes including at least two of at home, in an office, in a restaurant, on a train, (Section 0028, line 4, “an in-vehicle device”) on a street, indoors, or outdoors (Section 0010, lines 9-12, “Indoors and Outdoor scene”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of adding vehicle, indoor and outdoor classes to the set of multiple classes. The motivation is that the detection system will be able to recognize sound from more scenes.

Claims 15 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Mitchell et al. (US 10783434) as applied to claims 1-5, 7, 9-14, 16-17, 19-20, 22-23, 25 and 27-29 above and further in view of Gross (US 20180108369).
Claim 20, Parthasarathi in view of Mitchell discloses wherein the second stage includes a multiple target sound classifier configured to generate a detector output that indicates, for each of multiple target sounds, (Parthasarathi: Section 0121, lines 10-16, “the system train a classifier to better classify a desired talker’s speech”; thus the system can indicate if the audio is desired or undesired or no speech audio and therefore reads on the multiple target sounds) the presence or absence of that target sound in the audio data, (Parthasarathi: Section 0122, lines 15-19; thus it is determined if the sound is from the target/desired person or from a different person) and
Parthasarathi in view of Mitchell does not disclose wherein the multiple target sounds correspond to one or more of a vehicle door opening or closing, road noise, a window opening or closing, braking, a hand brake engaging or disengaging, windshield wipers, a turn signal, or an engine revving.
Gross discloses wherein the multiple target sounds correspond to one or more of a vehicle door opening or closing, road noise, a window opening or closing, braking, a hand brake engaging or disengaging, windshield wipers, a turn signal, or an engine revving. (Gross: Section 0027, lines 6-9; thus sounds played outside of the vehicle may be doors, windows, windshield)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of adding vehicle, indoor and outdoor classes to the set of multiple classes. The motivation is that the detection system will be able to recognize sound from more scenes.

Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kasilya Sudarsan et al. (US20180047414) discloses that a method that constantly monitors a received audio signal may consume resources of the device, such as processor cycles and/or power (e.g., battery power). Further, the device may not need to constantly receive ambient noise in order to effectively detect a relevant sound.
Sundaram (US10121494) discloses a system that detects a user's presence if a request is made to inquire into whether a user is present, for example if a system receives a call request or other query as to whether a user is present. It may be beneficial for a system to have presence information available to it prior to receiving such a request.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong whose telephone number is (571)270-3438. The examiner can normally be reached Mon-Fri. 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AKWASI M SARPONG/
Primary Examiner, Art Unit 2675
10/20/2022