Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Reason for No 101 Rejection for Claim 28
Examiner’s position for not making a 101 rejection is that “a computer-readable storage device storing instructions…” cannot be interpreted as a transitory, propagating signal. Thus, a storage device does not cover a signal per se.
                                             
Compact Prosecution
Examiner would like to propose incorporating the limitation wherein the target sound detector detects target sounds such as an alarm, a doorbell, a siren, a baby crying, and a dog barking, and displays on a screen an indication to a user of the specific sound detected.


Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked.

As explained in MPEP §2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:

(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;

(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and

 (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.

Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are:
“means for detecting a target sound, the means for detecting the target sound comprising a first stage… wherein the first stage includes means for generating a binary target… ” in claim 29. 
“means for detecting an audio scene…”; “means for detecting an audio scene change in the audio data…”; and “means for classifying the audio data as a particular audio scene in …”.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. 
In Section 0032 of the specification, applicant discloses that device 102, which includes a microphone… and a wireless speaker and voice command device, for example “a smart speaker device or home automation system,” corresponds to the claimed means for detecting a target sound.
Also, in Section 0035 of the specification, applicant discloses that at least one of a Bayesian classifier or a Gaussian Mixture Model (GMM) classifier is the corresponding structure for the means for generating a binary target as recited in claim 29.
In Section 0066 of the specification, applicant provides that a scene detector, which can be a camera, a GPS receiver, or an audio scene detector, corresponds to the means for detecting an audio scene.
In Sections 0058 and 0059 of the specification, applicant provides an audio scene change detector, which includes a scene transition classifier and corresponds to the structure for the means for detecting an audio scene change in the audio data recited in claim 30.
Also, in Section 0035 of the specification, applicant discloses that at least one of a Bayesian classifier or a Gaussian Mixture Model (GMM) classifier is the corresponding structure for the means for classifying the audio data as a particular audio scene as recited in claim 30.

If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

No 112 rejection is needed since the corresponding structures are provided in the specification, as discussed above.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-14 and 16-22 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant argues that, regarding claims 1-14 and 16-22, the cited references (Parthasarathi, US20170270919, in view of Gross, US 20180108369) do not disclose a target sound detector, configured to detect the presence or absence of multiple target non-speech sounds in the audio data, comprising a first stage and a second stage, 
the first stage including a binary target sound classifier configured to process the audio data, 
the first stage configured to activate the second stage in response to detection of a target sound of the multiple target non-speech sounds by the first stage.
In reply, Examiner respectfully disagrees because Parthasarathi discloses a system wherein a target sound detector (Wakeword detection 220 in device 110) comprises a first stage (before Wakeword detection) and a second stage (after Wakeword detection), the first stage including a binary target sound classifier (Classifier 1520) configured to process the audio data, (Section 0149, lines 2-5- thus the classifier confirms the wake word from the other audio signal) 


[Image: media_image1.png]
Figure 1: The Classifier Classifies the Desired Speech from the Not Desired Speech
the first stage configured to activate the second stage in response to detection of a target sound by the first stage, (Section 0036, lines 1-3 “Once a voice activity (wake word) is detected in the audio”)
(Secondary reference (Gross)  also addresses this issue see Section 0027, lines 1-4- thus the first neural network determines either the sound is inside or outside of a vehicle) 
and the second stage configured to receive the audio data from the buffer in response to the detection of the target sound. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted from the AFE to the server (Section 0033, lines 19-21)  for processing only when a wake word (sound) is detected) 

In other words, applicant argues that the cited references fail to disclose distinguishing between different types of non-speech audio.
In reply, Examiner respectfully disagrees. As indicated by applicant in the remarks filed on 05/04/2022, the main reference teaches a classifier H that is used to label incoming audio as belonging to desired speech or undesired speech. Hence, Parthasarathi classifies all non-speech audio as non-speech and does not further classify non-speech audio into different classes.
Gross, however, discloses further classification of non-speech audio. (Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing)


[Image: media_image2.png]

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.

Regarding claims 23-25 and 27-30, applicant argues that, because the cited references (Parthasarathi, US20170270919, in view of Gross, US 20180108369) allegedly fail to disclose a target sound detector configured to detect the presence or absence of multiple target non-speech sounds in the audio data, the limitation “processing the audio data in the buffer using a binary target sound classifier in a first stage of a target sound detector configured to detect the presence or absence of one or more target sounds of multiple target non-speech sounds” in claim 23 is not disclosed by the combination of Parthasarathi (US20170270919) in view of Gross (US 20180108369).
In reply, as explained above, the combination of Parthasarathi (US20170270919) in view of Gross (US 20180108369) discloses the limitation “a target sound detector, configured to detect the presence or absence of multiple target non-speech sounds in the audio data.” Accordingly, the limitation “processing the audio data in the buffer using a binary target sound classifier in a first stage of a target sound detector configured to detect the presence or absence of one or more target sounds of multiple target non-speech sounds” in claim 23 is also disclosed by the combination of Parthasarathi (US20170270919) in view of Gross (US 20180108369). Please see below for detailed explanations.
processing the audio data from the buffer using a multiple target sound classifier in the second stage. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted for processing only when a wake word (sound) is detected) 
Parthasarathi does not disclose that the target sound detector detects the presence or absence of multiple target non-speech sounds in the audio data.
Gross discloses a system that detects the presence or absence of multiple target non-speech sounds in the audio data. (Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.
Regarding claims 15 and 26
The same limitations are argued by applicant in reference to claims 15 and 26; therefore, the response above is applicable to claims 15 and 26, and the arguments are moot.


Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 7, 9-14, 16-17, 19-20, 22-23, 25, and 27-29 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Gross (US 20180108369).
Claim 1, Parthasarathi discloses a device to perform sound detection (Device 110 fig. 19) , comprising: 
one or more processors (Controllers/Processor 1904- shown in fig. 19) comprising:
a buffer configured to store audio data  (Section 0033, lines 19-21- the audio data is stored in the acoustic front end located on the device 110 prior to transmission and it is transmitted only based on when it contains a Wakeword, also see Section 0103, lines 8-10) 
a target sound detector (Wakeword detection 220 in device 110) comprising a first stage (before Wakeword detection) and a second stage, (after Wakeword detection) the first stage including a binary target sound classifier (Classifier 1520) configured to process the audio data, (Section 0149, lines 2-5- thus the classifier confirms the wake word from the other audio signal) 


[Image: media_image1.png]
Figure 2: The Classifier Classifies the Desired Speech from the Not Desired Speech
the first stage configured to activate the second stage in response to detection of a target sound by the first stage, (Section 0036, lines 1-3 “Once a voice activity (wake word) is detected in the audio”)
(Secondary reference (Gross)  also addresses this issue see Section 0027, lines 1-4- thus the first neural network determines either the sound is inside or outside of a vehicle) 
and the second stage configured to receive the audio data from the buffer in response to the detection of the target sound. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted from the AFE to the server (Section 0033, lines 19-21)  for processing only when a wake word (sound) is detected) 
Parthasarathi does not disclose that the target sound detector detects the presence or absence of multiple target non-speech sounds in the audio data.
Gross discloses a system that detects the presence or absence of multiple target non-speech sounds in the audio data. (Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. Thus the classifier determines if the sound is a child or not child if it is an animal sound. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing).


[Image: media_image2.png]

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.
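For illustration only, the two-stage arrangement recited in claim 1 (a buffer, a first-stage binary classifier, and a second stage that receives the buffered audio only after a detection) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Parthasarathi, Gross, or the instant application: the names TwoStageDetector, energy_gate, and toy_multi_classifier, the energy threshold, and the peak-based labels are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

Frame = List[float]


def energy_gate(frame: Frame, threshold: float = 0.5) -> bool:
    """Hypothetical first-stage binary classifier: mean energy above a threshold."""
    return sum(x * x for x in frame) / len(frame) > threshold


def toy_multi_classifier(frames: List[Frame]) -> Dict[str, bool]:
    """Hypothetical second-stage classifier: labels chosen from the peak amplitude."""
    peak = max(abs(x) for f in frames for x in f)
    return {"dog_bark": peak > 2.0, "siren": 1.0 < peak <= 2.0}


class TwoStageDetector:
    def __init__(
        self,
        binary_classifier: Callable[[Frame], bool],
        multi_classifier: Callable[[List[Frame]], Dict[str, bool]],
        buffer_size: int = 4,
    ) -> None:
        # Bounded buffer: the oldest frame is dropped once buffer_size is reached.
        self.buffer: deque = deque(maxlen=buffer_size)
        self.binary_classifier = binary_classifier
        self.multi_classifier = multi_classifier
        self.second_stage_runs = 0

    def process(self, frame: Frame) -> Optional[Dict[str, bool]]:
        self.buffer.append(frame)
        # First stage: a binary decision on the incoming frame.
        if not self.binary_classifier(frame):
            return None  # second stage is not activated
        # Second stage: activated by the first stage, reads from the buffer.
        self.second_stage_runs += 1
        return self.multi_classifier(list(self.buffer))
```

In this sketch, calling process on a quiet frame returns None and leaves the second stage idle, while a frame that passes the gate causes the second stage to classify the buffered frames.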
Claim 2, Parthasarathi in view of Gross discloses wherein the binary target sound classifier is further configured to generate a binary signal including a first value and a second value, (Parthasarathi: Section 0113, lines 1-3- thus “a first score that corresponds to audio data frame with a desired speech/non desired speech”) wherein the first value is set to activate the second stage in response to detecting the presence of any of multiple target sounds in the audio data; (Parthasarathi: Section 0038, lines 1-4- the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first value) and 
the second value is set to refrain from activating the second stage in response to detecting that none of the multiple target non-speech sounds are in the audio data. (Parthasarathi: Section 0113, lines 11-12- thus second probability score 0 reads on the second value which will be for undesired speech) 
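For illustration only, the binary signal of claim 2 can be sketched as a function that returns a first value when any target sound is detected and a second value otherwise. This is a sketch under stated assumptions, not the method of either reference: the constants ACTIVATE and REFRAIN, the function binary_signal, the per-target score dictionary, and the 0.5 threshold are hypothetical.

```python
from typing import Dict

ACTIVATE = 1  # first value: activate the second stage
REFRAIN = 0   # second value: refrain from activating the second stage


def binary_signal(per_target_scores: Dict[str, float], threshold: float = 0.5) -> int:
    """Return ACTIVATE if any target sound's score meets the threshold, else REFRAIN."""
    detected = any(score >= threshold for score in per_target_scores.values())
    return ACTIVATE if detected else REFRAIN
```

The signal carries no information about which target sound was detected; under this reading, that determination is left to the second stage.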
Claim 3, Parthasarathi in view of Gross disclose wherein the binary target sound classifier includes a neural network. (Parthasarathi: Section 0096, lines 2-4-thus a classifier trained in a manner to use the RNN which is Recurrent Neural Network- See Section 0074, lines 10-11)
Claim 4, Parthasarathi in view of Gross discloses wherein the binary target sound classifier includes at least one of a Bayesian classifier or a Gaussian Mixture Model classifier. (Parthasarathi: Section 0035, lines 15-18- thus the Gaussian Mixture Model (GMM) technique) 
Claim 5, Parthasarathi in view of Gross discloses wherein the second stage includes a multiple target sound classifier configured to generate a detector output that indicates, for each of the multiple target non-speech sounds, the presence or absence of that target sound in the audio data, (Parthasarathi: Section 0080, lines 8-13- thus multiple target sounds such as HELO, HALO and YELO indicate a generated output sound by the classifier) and wherein the multiple target non-speech sounds correspond to multiple classes of sound events, (Gross: Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing) and wherein the one or more processors are further configured to cause an output device to indicate each target sound detected in the audio data. (Gross: Section 0014, lines 22-29- trigger actions upon determining that the occupant or vehicle is at risk) 
Claim 7, Parthasarathi in view of Gross discloses wherein the signal corresponds to a wakeup interrupt signal. (Parthasarathi: Section 0038 lines 1-4- thus detecting the wake word reads on the wakeup signal) 
Claim 9, Parthasarathi in view of Gross discloses further comprising a scene detector configured to classify an environment of the device at least partially (Parthasarathi: Section 0035, lines 11-23- “environmental noise or background noise” reads on classification based on environment; see also Section 0041) based on an input signal from the camera, wherein the second stage includes a multiple target sound classifier that is configured to classify the audio data from among multiple classes of sound events, (Parthasarathi: Section 0109, lines 8-13- labeling the audio frames based on desired speech, non-desired speech or non-speech reads on the multiple classes/categories) and wherein operation of the multiple target sound classifier is at least partially based on the environment classified by the scene detector. (Parthasarathi: Section 0120- environmental characteristics are used to detect or classify the scene/environment)
Claim 10, Parthasarathi in view of Gross discloses wherein the multiple target sound classifier is adjusted to focus on one or more particular classes of the multiple classes of sound events that correspond to the environment. (Parthasarathi: Section 0127, lines 12-17- thus training the model or adaptation of the acoustic condition to improve the distinction between the units of speech and between speech and noise reads on the adjustments of the classes) 
Claim 11, Parthasarathi in view of Gross discloses wherein the multiple target sound classifier is further configured to select a particular set of sound event classes that correspond to the environment from among multiple sets of sound event classes; (Gross: Section 0034,lines 2-6- thus the “sound representing a construction back ground” reads on the environment of the sound event) and classify the audio data based on the sound event classes of the particular set. (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) 
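For illustration only, the selection recited in claim 11 (choosing one set of sound event classes for the detected environment from among multiple sets, then classifying only within that set) can be sketched as follows. This is a sketch under stated assumptions, not the method of either reference: the CLASS_SETS table, the environment names, the class labels, and the 0.5 threshold are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical sets of sound event classes, keyed by detected environment.
CLASS_SETS: Dict[str, Tuple[str, ...]] = {
    "vehicle": ("door", "wipers", "turn_signal"),
    "home": ("doorbell", "baby_crying", "dog_barking"),
}


def select_classes(environment: str) -> Tuple[str, ...]:
    """Select the set of sound event classes for the environment (empty if unknown)."""
    return CLASS_SETS.get(environment, ())


def classify(scores: Dict[str, float], environment: str, threshold: float = 0.5) -> Dict[str, bool]:
    """Report presence or absence only for classes in the selected set."""
    return {label: scores.get(label, 0.0) >= threshold for label in select_classes(environment)}
```

Scores for classes outside the selected set are ignored, so a high "wipers" score has no effect when the detected environment is "home".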
Claim 12, Parthasarathi in view of Gross discloses wherein the target sound detector is configured to select, from among one or more sets of trained data, a particular set of trained data that corresponds to a detected environment of the device and to process the audio data based on the particular set of trained data. (Parthasarathi: Section 0146, lines 1-5- thus “determining whether the audio corresponds to particular keywords recognizable by the device”… lines 11-18- thus the models includes predicted location which can be a detected environment of the device) 
Claim 13, Parthasarathi in view of Gross discloses wherein the environment is detected based on at least one of a camera, a location detection system, (Parthasarathi: Section 0146, lines 11-18- thus predicted location) or an audio scene detector.
Claim 14, Parthasarathi in view of Gross discloses further comprising an audio scene detector that is configured to be activated responsive to detection of the presence of any of multiple target sounds in the audio data by the binary target sound classifier, (Parthasarathi: Section 0121, lines 5-10- thus the features correspond to a desired talker’s speech, and the desired user is understood to be selected from a plurality of target voices or sounds) the audio scene detector comprising:
an audio scene change detector configured to process the audio data and to generate a scene change signal in response to detection of an audio scene change; (Parthasarathi: Section 0127, lines 12-15- thus acoustic condition adaptation means the system can respond to an audio scene change) and an audio scene classifier configured to receive the audio data from the buffer in response to the detection of the audio scene change. (Parthasarathi: Section 0041, lines 4-6- thus environmental noise changed to background noise reads on the change of audio scene change) 
Claim 16, Parthasarathi in view of Gross discloses wherein the audio scene change detector is further configured to detect the audio scene change based on detecting changes in at least one of noise statistics or non-stationary sound statistics. (Parthasarathi: Section 0127, lines 12-15- thus acoustic condition adaptation means the system can respond to an audio scene change)
Claim 17, Parthasarathi in view of Gross discloses wherein the audio scene change detector includes a classifier trained using audio data corresponding to transitions between scenes. (Parthasarathi: Section 0127, lines 12-15- thus acoustic condition adaptation means the system can respond to an audio scene change)
Claim 19, Parthasarathi in view of Gross discloses further comprising a microphone coupled to the one or more processors and configured to generate the audio data. (Parthasarathi: Fig. 19, Microphone 1950 generates the audio data) 
Claim 20, Parthasarathi in view of Gross discloses wherein the second stage includes a multiple target sound classifier configured to generate a detector output that indicates, for each of multiple target sounds, (Parthasarathi: Section 0121, lines 10-16 “the system train a classifier to better classify a desired talker’s speech”- thus the system can indicate if the audio is desired or undesired or no speech audio and therefore reads on the multiple target sounds) the presence or absence of that target sound in the audio data, (Parthasarathi: Section 0122, lines 15-19- thus it is determined if the sound is from the target/desired person or from a different person) and wherein the multiple target sounds correspond to one or more of a vehicle door opening or closing, road noise, a window opening or closing, braking, a hand brake engaging or disengaging, windshield wipers, a turn signal, or an engine revving. (Gross: Section 0027, lines 6-9- thus sounds played outside of the vehicle may be doors, windows, windshield) 
Claim 22, Parthasarathi discloses wherein the one or more processors are implemented in a portable electronic device. (Parthasarathi: Speech Controlled device 110a in Fig. 21)
Claim 23, Parthasarathi discloses a method of target sound detection, the method comprising processing the audio data in a memory using a binary target sound classifier in a first stage of a target sound detector; (Parthasarathi: Section 0113, lines 1-3- thus “a first score that corresponds to audio data frame with a desired speech/wake word”)
a buffer configured to store audio data in the detection device. (Section 0033, lines 19-21- the audio data is stored in the acoustic front end located on the device 110 prior to transmission and it is transmitted only based on when it contains a Wakeword, also see Section 0103, lines 8-10) 
activating a second stage of the target sound detector in response to detection of a target sound by the first stage; (Section 0038, lines 1-4- the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first score)
and processing the audio data from the buffer using a multiple target sound classifier in the second stage. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted for processing only when a wake word (sound) is detected) 
Parthasarathi does not disclose that the target sound detector detects the presence or absence of multiple target non-speech sounds in the audio data.
Gross discloses a system that detects the presence or absence of multiple target non-speech sounds in the audio data. (Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.

Claim 25, Parthasarathi in view of Gross discloses further comprising causing an output device to indicate each target sound detected in the audio data. (Parthasarathi: Section 0109, lines 8-13- labeling the audio frames based on desired speech, non-desired speech or non-speech reads on the multiple classes/categories)   

Claim 27, Parthasarathi in view of Gross discloses further comprising processing the audio data to detect an audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) and classifying the audio data based on a second set of audio scene classes, (Gross: Section 0016, lines 14-16 sound into categories which reads on the audio scenes classes) wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of audio scene classes in the second set of audio scene classes. (Gross: Section 0031, lines 27-30 “an accuracy with a mean square error less than 0.2% on the determination of sounds” reads on how audio scene class is determined based on the accuracy count) 

Claim 28, Parthasarathi discloses a computer-readable storage device storing instructions that, when executed by one or more processors, (Controllers/Processor 1904- shown in fig. 19) cause the one or more processors to:
a buffer configured to store audio data  (Section 0033, lines 19-21- the audio data is stored in the acoustic front end located on the device 110 prior to transmission and it is transmitted only based on when it contains a Wakeword, also see Section 0103, lines 8-10) 

process the audio data in the buffer using a binary target sound classifier in a first stage of a target sound detector (Section 0149, lines 2-5- thus the classifier confirms the wake word from the other audio signal) configured to detect the presence or absence of one or more target sounds (Section 0113, lines 1-3- thus “a first score that corresponds to audio data frame with a desired speech/wake word”)

[Image: media_image1.png]
Figure 3: The Classifier Classifies the Desired Speech from the Not Desired Speech

activate a second stage (after Wakeword detection) of the target sound detector in response to detection of a target sound of the target non-speech sounds by the first stage; (Section 0038, lines 1-4- the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first score)
and process the audio data from the buffer using a multiple target sound classifier in the second stage. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted for processing only when a wake word (sound) is detected) 
Parthasarathi does not disclose that the target sound detector detects the presence or absence of multiple target non-speech sounds in the audio data.
Gross discloses a system that detects the presence or absence of multiple target non-speech sounds in the audio data. (Section 0027, lines 14-19- thus the system recognizes the sounds into classified categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing).

Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.
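
For illustration only, the two-stage arrangement mapped above for claim 28 (a buffer, a binary first stage, and a multiple target sound classifier that runs only after a first-stage detection) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Parthasarathi, Gross, or the present application: the class name TwoStageDetector, the toy classifiers, the amplitude thresholds, and the labels "siren" and "dog_bark" are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Frame = List[float]  # one frame of audio samples (hypothetical representation)


class TwoStageDetector:
    """Hypothetical two-stage target sound detector (illustration only)."""

    def __init__(
        self,
        binary_classifier: Callable[[List[Frame]], bool],
        multi_classifier: Callable[[List[Frame]], str],
        buffer_frames: int = 4,
    ) -> None:
        # Bounded buffer: the oldest frame is dropped once the buffer is full.
        self.buffer: Deque[Frame] = deque(maxlen=buffer_frames)
        self.binary_classifier = binary_classifier
        self.multi_classifier = multi_classifier
        self.second_stage_active = False

    def push(self, frame: Frame) -> Optional[str]:
        """Buffer one frame and return a label only if the first stage fires."""
        self.buffer.append(frame)
        frames = list(self.buffer)
        # First stage: presence or absence of any target sound.
        if not self.binary_classifier(frames):
            self.second_stage_active = False
            return None
        # Second stage is activated only in response to a first-stage detection
        # and processes the same buffered audio data.
        self.second_stage_active = True
        return self.multi_classifier(frames)


def toy_binary(frames: List[Frame]) -> bool:
    """Assumed stand-in: a target sound is present if peak amplitude > 0.5."""
    return max(abs(s) for f in frames for s in f) > 0.5


def toy_multi(frames: List[Frame]) -> str:
    """Assumed stand-in: picks one of two hypothetical labels by peak amplitude."""
    peak = max(abs(s) for f in frames for s in f)
    return "siren" if peak > 0.8 else "dog_bark"
```

In this sketch the multiple target sound classifier is never called on quiet audio, which is the gating behavior the claim language describes; real classifiers would replace the toy amplitude tests.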

Claim 29, Parthasarathi discloses an apparatus (Device 110- Fig. 12) comprising means for detecting a target sound, (Wakeword detection 220 in device 110) the means for detecting the target sound comprising a first stage (before Wakeword detection) and a second stage, (after Wakeword detection) wherein the first stage includes means for generating a binary target sound classification of audio data (Section 0149, lines 2-5- thus the classifier confirms the wake word from the other audio signal) 

Figure 4: The Classifier Classifies the Desired Speech from the Not Desired Speech

 and for activating the second stage in response to classifying the audio data as including the target sound; (Section 0036, lines 1-3 “Once a voice activity (wake word) is detected in the audio”)
(Secondary reference (Gross) also addresses this issue; see Section 0027, lines 1-4- thus the first neural network determines whether the sound is inside or outside of a vehicle)
and for providing the audio data to the second stage in response to the classification of the audio data as including the target sound. (Section 0038, lines 1-4- thus the second stage reads on when the wake word (sound) is detected and the audio data is transmitted for processing only when a wake word (sound) is detected) 
Parthasarathi does not disclose that the target sound detector detects the presence or absence of multiple target non-speech sounds in the audio data.
Gross discloses a system that detects the presence or absence of multiple target non-speech sounds in the audio data. (Section 0027, lines 14-19- thus the system classifies the sounds into categories such as sounds of an adult or child and an animal. In another example, the multiple target non-speech sounds may include a child laughing or crying, a dog barking, or a cat meowing).

Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the target sound detector of Parthasarathi to include detecting different non-speech audio, such as dog barking or cat meowing, as taught by Gross. The motivation is that the system can detect specific audio sounds.
 


Claims 6, 8, 18, 21, 24 and 30 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Gross (US 20180108369) and further in view of Fink (US9749528).

Claim 6, Parthasarathi in view of Gross discloses wherein the binary target sound classifier (Classifier 1520- Section 0035, lines 13-18 “linear classifiers”).
Parthasarathi in view of Gross does not disclose a buffer included in a low-power domain and configured to operate in an always-on mode, or wherein the second stage is configured to transition from a low-power state to an active state responsive to receiving the signal.
 
Fink discloses a buffer included in a low-power domain and configured to operate in an always-on mode (Fink: Col. 3 lines 38-40- low power consumption) and wherein the second stage is configured to transition from a low-power state to an active state responsive to receiving the signal. (Fink: Col. 4 lines 48-50- the second power stage reads on the stage having transitioned from the first stage to the second stage).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of operating a system in which a low-power state is activated when the mode is switched. The motivation is that electric power will be saved.
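
For illustration only, the power-state limitation of claim 6 (an always-on first stage that signals a second stage, which then transitions from a low-power state to an active state) can be sketched as a small state machine. This is a hedged sketch, not Fink's circuitry or the applicant's design: the names PowerState, SecondStage, AlwaysOnFirstStage, and the 0.5 score threshold are hypothetical assumptions.

```python
from enum import Enum


class PowerState(Enum):
    LOW_POWER = "low_power"
    ACTIVE = "active"


class SecondStage:
    """Hypothetical second stage that stays in low power until signaled."""

    def __init__(self) -> None:
        self.state = PowerState.LOW_POWER

    def on_signal(self) -> None:
        # Transition from the low-power state to the active state on the signal.
        self.state = PowerState.ACTIVE

    def sleep(self) -> None:
        # Return to the low-power state once processing is finished.
        self.state = PowerState.LOW_POWER


class AlwaysOnFirstStage:
    """Hypothetical always-on first stage in the low-power domain."""

    def __init__(self, second_stage: SecondStage, threshold: float = 0.5) -> None:
        self.second_stage = second_stage
        self.threshold = threshold  # assumed detection threshold

    def process(self, score: float) -> bool:
        """Send the wake signal only when the detection score meets the threshold."""
        if score >= self.threshold:
            self.second_stage.on_signal()
            return True
        return False
```

The second stage draws active-state power only after a first-stage detection, which is the power-saving rationale stated in the motivation above.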

Claim 8, Parthasarathi in view of Gross discloses wherein the first stage (Parthasarathi: before the wake word is detected- Section 0038); however, Parthasarathi in view of Gross does not disclose activating a camera in response to the detection of a target sound by the first stage.
Fink discloses activating a camera in response to the detection of a target sound by the first stage. (Fink: Col. 4 lines 3-6- thus the wakeup signal activates one or more components of the video processor, i.e., the camera).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of automatic camera activation. The motivation is that the system saves power by activating the camera only when needed.

Claim 18, Parthasarathi in view of Gross discloses wherein the audio scene detector corresponds to a hierarchical detector (Gross: See fig. 4) 
and the audio scene classifier is configured to classify the audio data according to a second set of audio scene classes, (Parthasarathi: Section 0035, lines 18-22- thus audio scene can be classified as either environmental or background noise) 
 Parthasarathi in view of Gross does not disclose the audio scene change detector is configured to detect the audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; 
wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of the audio scene classes in the second set of audio scene classes. 
Fink discloses the audio scene change detector is configured to detect the audio scene change based on detecting changes between audio scene classes in a first set of audio scene classes; (Fink: Sensor stage A (102a) is the first audio scene detector) 
wherein a first count of the audio scene classes in the first set of audio scene classes is less than a second count of the audio scene classes in the second set of audio scene classes. (Fink: Abstract, lines 11-15- thus the detection or activation of the wakeup signal calls for more power, which means a smaller count is needed for the first stage)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.

Claim 21, Parthasarathi in view of Gross discloses wherein the one or more processors are implemented in a wireless speaker (Parthasarathi: Speaker 1960 wireless because device 110 is a wireless device) and voice activated device (Parthasarathi: Speech Controlled device 110a in Fig. 21) that includes:
an integrated assistant application configured to be activated responsive to the integrated assistant application, (Parthasarathi: Speech Controlled device 110a in Fig. 21 supports assistant application)
Parthasarathi in view of Gross does not disclose a camera further configured to be activated responsive to detection of the presence of any of multiple target sounds in the audio data by the binary target sound classifier.
Fink discloses a camera further configured to be activated responsive to detection of the presence of any of multiple target sounds in the audio data by the binary target sound classifier. 
(Fink: Col. 2 lines 45-50- sensor stages 102a-102n allow the camera system to detect a particular type of activity, which includes sound detection)
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.

Claim 24, Parthasarathi in view of Gross and further in view of Fink discloses wherein the binary target sound classifier and the buffer operate in an always-on mode, (Parthasarathi: Section 0113, lines 1-3- thus “a first score that corresponds to audio data frame prior to the detection of a desired speech/wake word” where the data is always on) and wherein activating the second stage includes sending a signal from the first stage to the second stage (Parthasarathi: Section 0038, lines 1-4- the first score activates the detection system to transmit the audio data to the server for processing, which reads on the second stage; hence the first probability score of 1 reads on the first score) and transitioning the second stage from a low-power state to an active state responsive to receiving the signal at the second stage. (Fink: Col. 4 lines 48-50- the second power stage reads on the stage having transitioned from the first stage to the second stage)

Claim 30, Parthasarathi in view of Gross discloses that the apparatus further comprises means for detecting an audio scene, the means for detecting the audio scene comprising:
and means for classifying the audio data as a particular audio scene in response to detection of the audio scene change. (Parthasarathi: Section 0035, lines 15-18- thus the Gaussian Mixture Model (GMM) technique)
Parthasarathi in view of Gross does not disclose means for detecting an audio scene change in the audio data;
Fink discloses means for detecting an audio scene change in the audio data;
 (Fink: Col. 4 lines 48-50 second power stage reads on when the stage has transitioned from first stage to second stage) 
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of a camera detecting audio scenes. The motivation is that it will help the system to detect a plurality of scenes.


Claims 15 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Parthasarathi et al. (US20170270919) in view of Gross (US 20180108369) as applied to claims 1-5, 7, 9-14, 16-17, 19-20, 22-23, 25 and 27-29 above and further in view of Mitchell et al. (US20210193155).
Claims 15 and 26, Parthasarathi in view of Gross discloses wherein the audio scene classifier is configured to classify the audio data according to multiple audio scene classes, (Parthasarathi: Section 0129 lines 15-18- thus the classifier may take the form of an acoustic model) the multiple audio scene classes in a car, (Gross: Section 0015, lines 1-5- thus sounds detected from outside of a vehicle)
Parthasarathi in view of Gross does not disclose wherein the multiple audio scene classes including at least two of at home, in an office, in a restaurant, on a train, on a street, indoors, or outdoors scenes.
Mitchell discloses multiple audio scene classes including at least two of at home, in an office, in a restaurant, on a train, (Section 0028, line 4 “an in-vehicle device”) on a street, indoors, or outdoors (Section 0010, lines 9-12 “Indoors and Outdoor scene”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of adding vehicle, indoor and outdoor classes to the set of multiple classes. The motivation is that the detection system will be able to recognize sound from more scenes.


Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kasilya Sudarsan et al. (US20180047414) discloses that a method that constantly monitors a received audio signal may consume resources of the device, such as processor cycles and/or power (e.g., battery power). Further, the device may not need to constantly receive ambient noise in order to effectively detect a relevant sound.
Sundaram (US10121494) discloses a system that detects a user's presence if a request is made to inquire into whether a user is present, for example if a system receives a call request or other query as to whether a user is present. It may be beneficial for a system to have presence information available to it prior to receiving such a request.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong whose telephone number is (571)270-3438. The examiner can normally be reached Mon-Fri. 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AKWASI M SARPONG/
Primary Examiner, Art Unit 2675
06/22/2022