DETAILED ACTION
This office action is in response to applicant’s communication dated 1/26/2022. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/26/2022 has been entered.

Claims’ Status
Claims 1-21 and 23 are pending and are herein examined.
Claims 1 and 10 are independent.
Claim 22 is cancelled.

Claim Objections
Claim(s) 1 and 10 is/are objected to because of the following informalities: 

Claim 1:
For reciting “and a blocking area; the blocking area designated to”, which it appears the applicant intended: “and a blocking area[[;]], wherein the blocking area is designated to”.
Appropriate correction is required.

Claim 10:
For reciting “and a blocking area wherein the blocking area designated to”, which it appears the applicant intended: “and a blocking area, wherein the blocking area is designated to”.
For reciting “the blocking area”, which includes a semicolon that is struck through. The strike-through of the semicolon is not in compliance with 37 C.F.R. 1.121(c)(2), which states that "The text of any deleted subject matter must be shown by being placed within double brackets if strike-through cannot be easily perceived." Also see MPEP 714.II.C(B). Also, there is a semicolon missing after the word “signal”. This is correctable to, and is being interpreted for examination purposes as: “the blocking area[[;]] to render a conditioned signal;”.
For reciting the phrase "an updated background noise" within the limitation “attenuating the noise components and an updated background noise”. Here it appears that applicant intended: “an updated background noise measurement”.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

	Claim(s) 1-2, 6 and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth, Evan (hereinafter Hildreth – WO 2009042579 A1) in view of Robinson; Danny Brant (hereinafter Robinson – US 20140156833 A1), and further in view of Nesta; Francesco et al. (hereinafter Nesta – US 20200412772 A1), Hetherington; Phillip A. et al. (hereinafter Hetherington – US 20070078649 A1) and Nikitin; Alexei V. (hereinafter Nikitin – US 20140195577 A1).

Independent Claim 1:
	Hildreth teaches A computer implemented method of controlling an electronic device in an absence of a physical contact with the electronic device, comprising:
designating an interactive space into a virtual detection space and a blocking area[[;]], (tracking a moving user involves centering images of a user, e.g. by panning and zooming, based on a user’s reference position, wherein tracking the user also includes localizing audio to focus on the moving user, ¶¶ 8 and 189-190. The blocking of the images and sounds that are not in focus is accomplished, at least in part, by turning off microphones. The images and sounds that are in focus are designated the “virtual detection space”, and the images and sounds that are not in focus form the “blocking area”)
wherein the blocking area is designated to prevent the electronic device from tracking users and conveying audio and images captured by a camera and a microphone array; (tracking a user involves focusing on images and audio of a user, ¶¶ 8, 189-190 and 195. Here, the image and sounds that are not in focus, are prevented from being conveyed)
sampling an aural signal received by the microphone array correlated with a noise event […]; (microphone includes filtering process for suppressing background noise, ¶ 64. In order for the filtering of noise to occur, the process necessarily includes sampling aural signals correlated with noise event(s))
correlating a sample of the aural signal with attributes of a noise signal; (microphone includes filtering process for suppressing background noise, ¶ 64. In order for the filtering of noise to occur, the process necessarily includes identifying [correlating] attributes of the aural signals as noise)
[…]; […];
detecting a user's presence within the virtual detection space of the camera while the electronic device is in a standby state […]; (Identifying an engagement hand gesture while the system is in a standby state, which transitions into a state facilitating audio or video communications with 
transitioning the electronic device to an interactive state when the user's presence is detected; (Identifying an engagement hand gesture while the system is in a standby state, which transitions into a state facilitating audio or video communications with another system [“interactive state”], ¶¶ 183 and 172.)
detecting speech segments in the detection space and converting the speech segments into electrical signals; (¶ 22 – speech signals are received/detected by microphones. Also see ¶ 55, as a basic function, microphones detect/receive sound waves and convert them into electrical sound data; The detection includes detection from the direction of a first user with current focus [“in the detection space”], who can be interrupted only when speech is not detected from the first user, ¶ 198)
converting the electrical signals into digital signals at periodic intervals; (the microphones, which are part of the computer system, turn sounds into electrical signals, then the electrical signals produced by the microphones are converted into digital signals, that is, they are “digitized”, ¶¶ 40 and 64. If this were not so, the computer would not be able to understand the signals, also see ¶ 238. Here, the at periodic intervals”, which allow the ADC sufficient “conversion time” for its conversion process)
identifying the speech segments in the digital signals; (Sounds are classified as voice or not voice, ¶ 198)
attenuating an input comprising the audio and the images from the blocking area to render a conditioned signal; (Tracking/focusing on a user by zooming, panning, cropping, scaling an image, ¶ 189, which is attenuating/conditioning an image signal in the blocking area. Tracking a user also involves focusing on audio of a user, ¶¶ 8, 189-190 and 195. Focusing the audio on the user is considered attenuating the audio from the blocking area to render a conditioned signal)
locating a physical location of a speech source generating the speech segments; (A multi-sensor microphone localizes sounds/voices of users, ¶¶ 64 and 195-196)
adjusting the camera automatically on the physical location of the speech source generating the speech segments; (camera tracking processes, ¶ 74, focus of the camera tracks a moving user and centers user image on screen, ¶¶ 189 and 192. Here, since it is the system performing the tracking of the user via camera adjustments, this is indicative of “adjusting the camera automatically on the physical location of the speech source generating the speech segments”)
and transmitting the conditioned signal to a remote destination. (see at least ¶¶ 68-69 – “Video of a remote user may be transmitted over a network as compressed data, which is decompressed before being displayed by the user interface 201”. Also see ¶¶ 70, 72-73 and 227)
Hildreth does not appear to expressly teach that the user’s presence is detected by detecting noise components within the virtual detection space by the noise signal model. 
However, Robinson teaches/suggests that the user’s presence is detected by detecting noise components within the virtual detection space by the noise signal model (user’s presence at an endpoint is detected based at least in part on ambient noise, ¶¶ 98 and 165).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth wherein the user’s presence is detected by detecting noise components within the virtual detection space by the noise signal model, as taught/suggested by Robinson.

Hildreth does not appear to expressly teach that the noise event is “within a virtual detection space of the camera”.
However, Nesta teaches/suggests that the noise event is “within a virtual detection space of the camera” (audio coming from an audio source that has been identified by a video is selectively enhanced by reducing/removing noise from the signal, ¶¶ 17-18).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth wherein the noise event is “within a virtual detection space of the camera”, as taught/suggested by Nesta.
One would have been motivated to make such a combination in order to provide more robustness and preciseness of the method by providing the ability to selectively enhance audio in a noisy environment (Nesta ¶¶ 2 and 17-18).
Hildreth does not appear to expressly teach modeling spectral components of the sample aural signal correlated with the noise signal to generate a noise signal model for the virtual detection space captured by the camera, and updating a background noise when a speech segment is undetected and when a measurement of the noise signal is at or below a median noise measurement in the virtual detection space. 
However, Hetherington teaches/suggests 
modeling spectral components of the sample aural signal correlated with the noise signal to generate a noise signal model for the virtual detection space captured by the camera (the system models spectral characteristics of noises and may detect noise characteristics in a signal and condition/enhance the signal by removing/dampening those characteristics, Hetherington ¶¶ 8 and 21)
updating a background noise when a speech segment is undetected and when a measurement of the noise signal is at or below a [central tendency] noise measurement in the virtual detection space (a noise model, such as an average background noise model, for estimating background noises, is updated when noise events are detected in the absence of speech, Hetherington ¶¶ 32 and 46 and fig. 6, and the calculating of estimates with the model are enabled, in the absence of speech, when instantaneous background noise does not exceed an average [central tendency] background noise, Hetherington ¶¶ 32, 44 and 46).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth to include modeling spectral components of the sample aural signal correlated with the noise signal to generate a noise signal model for the virtual detection space captured by the camera, and updating a background noise when a speech segment is undetected and when a measurement of the noise signal is at or below a median noise measurement in the virtual detection space, as taught/suggested by Hetherington.
One would have been motivated to make such a combination in order to improve 
Hildreth as modified by Hetherington does not appear to expressly teach that the central tendency measurement is a “median” noise measurement.
However, Nikitin teaches/suggests that the central tendency measurement is a “median” noise measurement (enhancing the quality of noisy signals by using a median filter, ¶ 22; mean [average] and median are measures of central tendency, ¶¶ 330 and 338). 
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to further modify the method of Hildreth, as modified, wherein the central tendency measurement is a “median” noise measurement, as taught/suggested by Nikitin.
One would have been motivated to make such a combination in order to improve the robustness of audio signals to noisy signal outliers (¶¶ 22 and 352).
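For illustration only, the update condition recited in claim 1 (updating a background noise when a speech segment is undetected and the noise measurement is at or below a median noise measurement) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Hetherington, Nikitin, or the applicant: the function name, the exponential smoothing factor `alpha`, and the use of a list of recent noise levels as the basis for the median are all hypothetical.

```python
from statistics import median


def update_background_noise(background, frame_level, history,
                            speech_detected, alpha=0.1):
    """Return the (possibly updated) background-noise estimate.

    background      -- current background-noise estimate (hypothetical units)
    frame_level     -- noise measurement for the current frame
    history         -- recent noise measurements used to compute the median
    speech_detected -- True when a speech segment is detected in the frame
    alpha           -- assumed smoothing factor; not taken from any reference
    """
    # Condition 1: never update while speech is detected (or with no history).
    if speech_detected or not history:
        return background
    # Condition 2: update only when the measurement is at or below the median.
    if frame_level <= median(history):
        return (1 - alpha) * background + alpha * frame_level
    return background
```

Using the median rather than the mean as the gating threshold keeps a single loud outlier in `history` from raising the threshold, which is the robustness rationale the rejection attributes to Nikitin.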

Claim 2:
	The rejection of claim 1 is incorporated. Hildreth further teaches rendering an acknowledgement in response to the virtual detection via a speech synthesis engine. (a user is alerted of incoming call via a text-to-speech capability, Hildreth ¶ 41. This alert is “in response to the virtual detection” because, as explained above, the notification state is not activated unless a user is detected as being present, Hildreth ¶ 176)

Claim 6:
The rejection of claim 1 is incorporated. Hildreth further teaches where the locating a physical location of the speech source comprises identifying a physical location through an acoustic localization based on a time difference of signal arrival between the microphones in the microphone array (Hildreth ¶ 195 – “A sound localization process may utilize a beamform[ing] process, whereby the phase and amplitude of the signal received by each sensor of the microphone array is compared”. It is well understood that a beamforming process for speech audio relies on the difference in time of arrival between audio signals received from multiple microphones, see e.g., Knode; Galen E. et al., US 20190173446 A1, ¶ 56).
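For illustration only, a time difference of signal arrival between two microphones can be estimated by finding the lag that maximizes the cross-correlation of the two sampled signals. The sketch below is an assumption-laden example, not the method of Hildreth or Knode: the function names, the brute-force lag search, and the `max_lag` bound are hypothetical.

```python
def estimate_delay_samples(sig, ref, max_lag):
    """Return the lag (in samples) at which `sig` best aligns with `ref`.

    A positive result means `sig` arrives later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of ref with sig shifted by `lag` samples.
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(sig):
                score += r * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_seconds(sig, ref, sample_rate_hz, max_lag):
    """Convert the best-aligning lag into a time difference in seconds."""
    return estimate_delay_samples(sig, ref, max_lag) / sample_rate_hz
```

With a known microphone spacing and the speed of sound, such a time difference constrains the direction of the source relative to the microphone pair.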

Claim 21:
	The rejection of claim 1 is incorporated. Hildreth teaches where the designating the interactive space into the virtual detection space and the blocking area occurs through a localization. (localizing audio and gestures to focus on user, Hildreth ¶¶ 8, 189-190 and 195.)

Claim(s) 3-5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1), Hetherington (US 20070078649 A1) and Nikitin (US 20140195577 A1), as applied to claim 1, and further in view of Weiss; Ron J. et al. (hereinafter Weiss – US 8131543 B1).

Claim 3:
The rejection of claim 1 is incorporated. Hildreth further teaches classifying the [sound] as a speech or the noise signal (¶ 64, noise suppressed; ¶ 198, at least – “classifying the sound as voice or not voice”).
Hildreth doesn’t directly teach converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients. 
However, Weiss suggests/discloses converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients (calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components, col 1:45-54). 
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients, because this would lead to a more efficient method of adequately differentiating noise from speech (Weiss Abstract and Weiss col 1:18-25).

Claim 4:
	The rejection of claim 3 is incorporated. Hildreth further teaches identifying a human presence […] (¶ 180 – e.g., through a hand gesture).
Hildreth doesn’t directly teach that the identifying is “in response to processing the cepstral coefficients”. 
However, Weiss discloses classifying the signals as speech or noise by using the 
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include identifying the human presence in response to processing the cepstral coefficients, because this would lead to a more flexible method allowing presence recognition through speech, e.g., for users who are immobile or who prefer speech over other input forms such as gestures.

Claim 5:
The rejection of claim 1 is incorporated. Hildreth doesn’t directly teach the speech segments are identified by correlating spectral shapes [e.g., amplitudes] of the digital signals attributed with voiced and unvoiced speech. 
However, Weiss suggests/discloses the speech segments are identified by correlating spectral shapes [e.g., amplitudes] of the digital signals attributed with voiced and unvoiced speech (calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components and related signal amplitudes, col 1:45-54 and 12:52-13:2).
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include the speech segments are identified by correlating spectral shapes of the digital signals attributed with voiced and unvoiced speech, because this would lead to a more efficient method of adequately differentiating noise from speech (Weiss Abstract and .

Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1), Hetherington (US 20070078649 A1) and Nikitin (US 20140195577 A1), as applied to claim 6, and further in view of Krupka; Eyal et al. (hereinafter Krupka – US 20190341054 A1).

Claim 7:
The rejection of claim 6 is incorporated. Hildreth further teaches 
where the locating the physical location of the speech source comprises a video localization executed by a video locator (detecting faces within camera images, including determining a location of the faces, ¶ 87. Also a focus is maintained when user is speaking, ¶ 9. Therefore the location of the first user’s face is the “speech source”)
and an augmentor, (¶ 103 – a learning process to reduce noise and change classification of clusters in segmentation process)
[…];
and further comprises: […]; […];
identifying the speech source by a classification (see ¶ 92, at least – “performing statistical analysis to classify the eigenimage as a particular user's face”. Also see ¶¶ 94-95, 108, 162)
[…];
and identifying the physical location of the speech source (“[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds” and ¶ 196 – “a microphone may localize the voice of the second user”. Also see “[0195] The system further may include localizing audio to focus on a user based on a user reference position…A sound localization process may increase the sensitivity of sound originating in the direction corresponding to the user reference position, and decrease the sensitivity of sound originating from other directions”)
based on a relative position of the speech source to images of a plurality of objects captured by the camera. (See ¶¶ 103-104, e.g., skin color of people in images)
Hildreth doesn’t directly teach generating a bounding box that encloses a participant's head, extracting features of the participant from within the bounding box, that the augmentor extracts the features when a predicted score exceeds a predetermined threshold and that the identifying speech source by a classification is based on the classification that renders a highest confidence score. 
However, Krupka teaches/suggests 
generating a bounding box that encloses a participant's head, (bounding boxes surrounding participant’s head, see ¶ 23 and Fig. 4)
extracting features of the participant from within the bounding box,
that the features are extracted when a predicted score exceeds a predetermined threshold and that the identifying speech source by a classification is based on the classification that renders a highest confidence score. (outputting of candidate faces within a bounding box, ¶¶ 23-24, and their names, according to a selection of candidates with highest audio match confidence by classification machine learning algorithms, ¶ 63, also see ¶¶ 21, 24, 27 and 64)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth to include generating a bounding box that encloses a participant's head, extracting features of the participant from within the bounding box, that the augmentor extracts the features when a predicted score exceeds a predetermined threshold and that the identifying speech source by a classification is based on the classification that renders a highest confidence score, as taught/suggested by Krupka.
One would have been motivated to make such a combination in order to achieve a more reliable audio source association in a multi-source environment (Krupka ¶ 2).

Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1), Hetherington (US 20070078649 A1), Nikitin (US 20140195577 A1) and Krupka (US 20190341054 A1), as applied to claim 7 above, and further in view of Hoang Do et al. (hereinafter Do – Non-Patent Literature, “A Real-.

Claim 8:
The rejection of claim 7 is incorporated. Hildreth doesn’t directly teach where the locating a physical location of a speech source is based on detecting a maximum in a steered response power segment, comprising: generating cross-correlation and a phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; and generating an image of reverberation effects within the interactive space; where the phase transform determines a time difference of arrival of a signal between the microphone pair.
However, Do discloses
where the locating a physical location of a speech source is based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1) 
comprising: generating cross-correlation and a phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values 
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
and generating an image of reverberation effects within the interactive space; (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
where the phase transform determines a time difference of arrival of a signal between the microphone pair. (finding a TDOA, see Pg 3)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth where the locating a physical location of a speech source is based on detecting a maximum in a steered response power segment, comprising: generating cross-correlation and a phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; and generating an image of reverberation effects within the interactive space; where the phase transform determines a time difference of arrival of a signal between the microphone pair, as taught/suggested by Do.
One would have been motivated to make such a combination in order to achieve a more cost effective localization (see Do Abstract).

Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1), Hetherington (US 20070078649 A1) and Nikitin (US 20140195577 A1), as applied to claim 6 above, and further in view of Do (Non-Patent Literature, “A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contraction (SRC) On A Large-Aperture Microphone Array”, 2006).

Claim 9:
The rejection of claim 6 is incorporated. Hildreth doesn’t directly teach where the locating the physical location of the speech source is based on detecting a maximum in a steered response power segment;  comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space. 
However, Do discloses
where the locating the physical location of the speech source is based on detecting a maximum in a steered response power segment (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local 
comprising:
generating a cross-correlation and phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pair of the set of microphones, see Pgs 1-2)
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
where the phase transform determines a time difference of arrival of a signal between the microphone pair; (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pair of the set of microphones, see Pgs 1-2; Pg 3 – TDOA, time-difference-of-arrival)
and generating an image showing reverberation effects within the interactive space. (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the method of Hildreth where the locating the physical location of the speech source is based on detecting a maximum in a steered response power segment;  comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space, as taught/suggested by Do.
One would have been motivated to make such a combination in order to achieve a more cost effective localization (see Do Abstract).
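For illustration only, the pairwise building block of the SRP-PHAT approach discussed above, namely a cross-correlation with phase transform (PHAT) weighting whose peak gives a time difference of arrival for one microphone pair, can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Do: the function names are hypothetical, a naive O(n²) DFT is used so the example is self-contained, and the summation over all microphone pairs and candidate locations that forms the steered response power is omitted.

```python
import cmath


def _dft(x, inverse=False):
    """Naive discrete Fourier transform (assumed helper; O(n^2))."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(x):
            acc += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_lag(sig, ref):
    """Return the lag (samples) of `sig` relative to `ref` using GCC-PHAT.

    A positive result means `sig` arrives later than `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad to avoid circular wrap-around
    spec_sig = _dft(list(sig) + [0.0] * (n - len(sig)))
    spec_ref = _dft(list(ref) + [0.0] * (n - len(ref)))
    # Cross-power spectrum of the microphone pair.
    cross = [s * r.conjugate() for s, r in zip(spec_sig, spec_ref)]
    # Phase transform: discard magnitude, keep only phase information.
    whitened = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in _dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda k: cc[k])
    # Indices above n/2 correspond to negative lags.
    return peak if peak <= n // 2 else peak - n
```

Dividing the returned lag by the sampling rate gives the time difference of arrival for the pair; the PHAT whitening sharpens the correlation peak, which is the property that makes it attractive in reverberant rooms.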

Claim(s) 10-11, 15 and 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth, Evan (hereinafter Hildreth – WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1) and Hetherington (US 20070078649 A1).

Independent Claim 10:
	Hildreth teaches An electronic device, comprising; a display; a processor in communication with the display; and a computer program stored in a non-transitory memory executed by the processor that causes actions to be carried out through instructions for: (computer implemented process, ¶ 5, including a display, ¶¶ 4 and 40)
monitoring an interactive space comprising a detection space and a blocking area, (a current user is tracked, ¶¶ 8 and 189-190. This is considered “monitoring” a detection space. Nevertheless, an area that doesn’t have the current focus may be monitored in order to detect a hand gesture of a second user, and switches the focus to the second user, ¶ 
wherein the blocking area is designated to prevent the electronic device from tracking users and conveying audio signals and images captured by a camera and a microphone array; (tracking a user involves focusing on images and audio of a user, ¶¶ 8, 189-190 and 195. Here, the image and sounds that are not in focus, are prevented from being conveyed)
sampling an input signal received by the microphone array correlated with a noise event […]; (microphone includes filtering process for suppressing background noise, ¶ 64. In order for the filtering of noise to occur, the process necessarily includes sampling aural signals correlated with noise event(s))
correlating a sample of the input signal with attributes of an audio noise signal; (microphone includes filtering process for suppressing background noise, ¶ 64. In order for the filtering of noise to occur, the process necessarily includes identifying [correlating] attributes of the aural signals as noise)
[…]; […]; 
detecting a user's presence within the detection space of the camera while the electronic device is in a standby state […]; (Identifying an engagement hand gesture while the system is in a standby state, which transitions into a state facilitating audio or video communications with another system [“interactive state”], ¶¶ 183 and 172. Also, the system enters a notification state, if a face is detected, in order to avoid turning on a display device 
transitioning the electronic device to an interactive state when the user's presence is detected within the detection space […]; (Identifying an engagement hand gesture while the system is in a standby state, which transitions into a state facilitating audio or video communications with another system [“interactive state”], ¶¶ 183 and 172.)
detecting speech in the detection space and converting the speech into electrical signals; (¶ 22 – speech signals are received/detected by microphones. Also see ¶ 55, as a basic function, microphones detect/receive sound waves and convert them into electrical sound data; The detection includes detection from the direction of a first user with current focus [“in the detection space”], who can be interrupted only when speech is not detected from the first user, ¶ 198)
converting the electrical signals into digital signals at periodic intervals; (the microphones, which are part of the computer system, turn sounds into electrical signals, then the electrical signals produced by the microphones are converted into digital signals, that is, they are “digitized”, ¶¶ 40 and 64. If this were not so, the computer would not be able to understand the signals, also see ¶ 238. Here, the microphone performs the functions of a digitizer, that is, an analog to digital converter (ADC). It would be understood by a person having at periodic intervals”, which allow the ADC sufficient “conversion time” for its conversion process)
identifying speech segments in the digital signals; (Sounds are classified as voice or not voice, ¶ 198)
attenuating the noise components and an updated background noise measurement in the digital signals (suppressing background noise, ¶ 64) 
and aural signals and images from the blocking area[[;]] to render a conditioned signal; (Tracking/focusing on a user by zooming, panning, cropping, scaling an image, ¶ 189, which is attenuating…images from the blocking area to render a conditioned signal. Tracking a user also involves focusing on audio of a user, ¶¶ 8, 189-190 and 195. Focusing the audio on the user is considered attenuating the aural signals…from the blocking area to render a conditioned signal)
locating a physical location of a speech source generating the speech segments; (A multi-sensor microphone localizes sounds/voices of users, ¶¶ 64 and 195-196)
adjusting the camera automatically based on the physical location of the speech source generating the speech segments (camera tracking processes, ¶ 74, focus of the camera tracks a moving user and centers user image on screen, ¶¶ 189 and 192. Here, since it is the system performing the tracking of the user via camera adjustments, the adjusting is “based on the physical location of the speech source generating the speech segments”)
and a physical location of participants as participants enter or leave the detection space; (a physical location of users is tracked, ¶¶ 8 and 189-190. During tracking of a first user, focus on a first user is maintained until the user finishes speaking, ¶ 9, and panning camera image from first user to a second user who is speaking when the first user is finished speaking, ¶¶ 196 and 198, so adjusting the camera as the participants enter…the detection space)
and transmitting the conditioned signal to a remote destination. (see at least ¶¶ 68-69 – “Video of a remote user may be transmitted over a network as compressed data, which is decompressed before being displayed by the user interface 201”. Also see ¶¶ 70, 72-73 and 227)
Hildreth does not appear to expressly teach that the user’s presence is detected by detecting noise components within the detection space by the noise signal model. 
However, Robinson teaches/suggests that the user’s presence is detected by detecting noise components within the detection space by the noise signal model (user’s presence at an endpoint is detected based at least in part on ambient noise, ¶¶ 98 and 165).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth to include that the user’s presence is detected by detecting noise components within the detection space by the noise signal model, as taught/suggested by Robinson.
One would have been motivated to make such a combination in order to increase the detection versatility and accuracy of the device by allowing another way to detect user presence, including when a user is not speaking (Robinson ¶ 165).
Hildreth as modified by Robinson does not appear to teach that the transitioning occurs when noise components are detected within the detection space by the noise signal model.
However, given that Hildreth teaches transitioning when a user presence is detected (¶¶ 172 and 183) and Robinson adds to Hildreth’s versatility by including user detection by noise, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the transitioning in the device of Hildreth such that the transitioning occurs when noise components are detected within the detection space by the noise signal model, in order to apply the additional versatility of Hildreth as modified by Robinson.
Hildreth does not appear to expressly teach that the noise event is “within a detection space of the camera”.
However, Nesta teaches/suggests that the noise event is “within a detection space of the camera” (audio coming from an audio source that has been identified by a video is selectively enhanced by reducing/removing noise from the signal, ¶¶ 17-18).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth to include that the noise event is “within a detection space of the camera”, as taught/suggested by Nesta.
One would have been motivated to make such a combination in order to provide greater robustness and precision in the device by providing the ability to selectively enhance audio in a noisy environment (Nesta ¶¶ 2 and 17-18).
Hildreth, as modified, does not appear to expressly teach modeling spectral components of the sample input signal to generate a noise signal model for the detection space captured by the camera and updating a background noise measurement when a speech segment is undetected and only when a noise measurement of the noise signal is equal to or below an average noise measurement of a plurality of prior background noise measurements in the detection space.
However, Hetherington teaches/suggests 
modeling spectral components of the sample input signal to generate a noise signal model for the detection space captured by the camera (the system models spectral characteristics of noises and may detect noise characteristics in a signal and condition/enhance the signal by removing/dampening those characteristics, Hetherington ¶¶ 8 and 21)
updating a background noise measurement when a speech segment is undetected and only when a noise measurement of the noise signal is equal to or below an average noise measurement of a plurality of prior background noise measurements in the detection space (a noise model, such as an average background noise model, for estimating background noises, is updated when noise events are detected in the absence of 
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to further modify the device of Hildreth to include modeling spectral components of the sample input signal to generate a noise signal model for the detection space captured by the camera, and updating a background noise measurement when a speech segment is undetected and only when a noise measurement of the noise signal is equal to or below an average noise measurement of a plurality of prior background noise measurements in the detection space, as taught/suggested by Hetherington.
One would have been motivated to make such a combination in order to improve the perceptual quality of voice signals (Hetherington ¶ 43).
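For illustration only, the conditional update recited in the limitation mapped above (updating the background noise measurement when no speech segment is detected, and only when the current noise measurement is equal to or below the average of prior background noise measurements) can be sketched as follows. The class name, the history length, the use of a plain arithmetic mean, and the unconditional acceptance of the first measurement are assumptions made for this sketch; they are not taken from Hetherington or from the application.

```python
from collections import deque


class BackgroundNoiseTracker:
    """Hypothetical helper illustrating the claimed update condition."""

    def __init__(self, history_len=8):
        # Sliding window of previously accepted background noise measurements.
        self.history = deque(maxlen=history_len)

    def average(self):
        """Mean of prior background measurements, or None if none exist yet."""
        if not self.history:
            return None
        return sum(self.history) / len(self.history)

    def update(self, noise_level, speech_detected):
        """Return True if the background measurement was updated."""
        if speech_detected:
            # Never update while a speech segment is present.
            return False
        avg = self.average()
        # Assumption: the very first measurement is accepted unconditionally.
        if avg is None or noise_level <= avg:
            self.history.append(noise_level)
            return True
        return False
```

Under these assumptions, a loud transient that exceeds the running average leaves the background estimate unchanged, as does any frame in which speech is present.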

Claim 11:
	The rejection of claim 10 is incorporated. Hildreth teaches further comprising instructions for rendering, via a speech synthesis engine, an acknowledgement in response to the detection of the user's presence identifying the physical location. (a user is alerted of an incoming call via a text-to-speech capability, Hildreth ¶ 41. This alert is “in response to the virtual detection” because, as explained above, the notification state is not activated unless a user is detected as being present, Hildreth ¶ 176)

Claim 15:
	The rejection of claim 10 is incorporated. Hildreth further teaches further comprising instructions where the locating the physical location of a speech source comprises an acoustic localization based on a signal latency received by a microphone pair executed by an acoustic locator. (Hildreth ¶ 195 – “A sound localization process may utilize a beamform[ing] process, whereby the phase and amplitude of the signal received by each sensor of the microphone array is compared”. It is well understood that a beamforming process for speech audio relies on the difference in time of arrival between audio signals received from multiple microphones, see e.g., Knode; Galen E. et al., US 20190173446 A1, ¶ 56).
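For illustration only, the notion of a signal latency between a microphone pair (a time difference of arrival) can be sketched with a plain time-domain cross-correlation. This is a generic textbook technique and is not asserted to be the method of Hildreth or Knode; the function names, the sample rate, and the toy signals are hypothetical.

```python
def estimate_delay_samples(ref, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `ref`.

    A positive result means the signal reached the second microphone later.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(delayed):
                # Accumulate the cross-correlation at this candidate lag.
                score += ref[i] * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_seconds(ref, delayed, sample_rate_hz, max_lag):
    """Convert the best-matching lag into a time difference of arrival."""
    return estimate_delay_samples(ref, delayed, max_lag) / sample_rate_hz


# Toy example: the same pulse arrives three samples later at the second microphone.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_delay_samples(mic_a, mic_b, max_lag=5)
```

Given a known microphone spacing and the speed of sound, such a delay constrains the direction from which the sound arrived.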

Claim 23:
	The rejection of claim 10 is incorporated. Hildreth further teaches further comprising monitoring the background noise by a microphone of the microphone array. (microphone 206, which consists of multiple sensors [microphone array], filters out background noises, ¶ 64)

Claim(s) 12-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1) and Hetherington (US 20070078649 A1), as applied to claim 10, and further in view of Weiss (US 8131543 B1).

Claim 12:
	The rejection of claim 10 is incorporated. Hildreth teaches classifying the [sound] as a speech or the noise signal. (¶ 64, noise suppressed; ¶ 198, at least – “classifying the sound as voice or not voice”).
Hildreth doesn’t directly teach converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients. 
However, Weiss suggests/discloses converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients (calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components, col 1:45-54). 
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients, because this would lead to a more efficient device adequately differentiating noise from speech (Weiss Abstract and Weiss col 1:18-25).

Claim 13:
	The rejection of claim 12 is incorporated. Hildreth further teaches further comprising instructions that identify a human presence […]. (¶ 180 – e.g., through a hand gesture). 
Hildreth doesn’t directly teach that the identifying is in response to processing the cepstral coefficients. 
However, Weiss discloses/suggests that the identifying is in response to processing the cepstral coefficients (classifying the signals as speech or noise by using the MFCC components, col 1:45-54). It was well within the capacities of a person having ordinary skill in the art to realize that identifying speech would be reflective of a human presence.
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include that the identifying is in response to processing the cepstral coefficients, because this would lead to a more flexible device allowing presence recognition through speech, e.g., for users who are immobile or prefer speech over other input forms, e.g., gestures.

Claim 14:
	The rejection of claim 10 is incorporated. Hildreth doesn’t directly teach where the speech segments are identified by correlating spectral shapes of the digital signals attributed with voiced and unvoiced speech.
However, Weiss suggests/discloses where the speech segments are identified by correlating spectral shapes of the digital signals attributed with voiced and unvoiced speech (calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components and related signal amplitudes, col 1:45-54 and 12:52-13:2). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth to include where the speech segments are identified by correlating spectral shapes of the digital signals attributed with voiced and unvoiced speech, because this would lead to a more efficient device adequately differentiating noise from speech (Weiss Abstract and Weiss col 1:18-25).

Claim(s) 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1) and Hetherington (US 20070078649 A1), as applied to claim 10, and further in view of Krupka; Eyal et al. (hereinafter Krupka – US 20190341054 A1).

Claim 16:
	The rejection of claim 10 is incorporated. Hildreth further teaches 
where the locating the physical location of a speech source comprises a video localization executed by a video locator and an augmentor, (detecting faces within camera images, including determining a location of the faces [video locator], ¶ 87. Also a focus is maintained when user is speaking, ¶ 9. Therefore the location of the first user’s face is the “speech source”; ¶ 103 – a learning process [augmentor] to reduce noise and change classification of clusters in segmentation process)
[…];
identifying an active speaker by a classification […]; (see ¶ 92, at least – “performing statistical analysis to classify the eigenimage as a particular 
and identifying a physical location of the active speaker  (“[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds” and ¶ 196 – “a microphone may localize the voice of the second user”. Also see “[0195] The system further may include localizing audio to focus on a user based on a user reference position…A sound localization process may increase the sensitivity of sound originating in the direction corresponding to the user reference position, and decrease the sensitivity of sound originating from other directions”)
based on a relative position of the active speaker to images of a plurality of other objects captured by the camera in the interactive space. (See ¶¶ 103-104, e.g., skin color of people in images)
Hildreth does not appear to expressly teach the augmentor generating a bounding box that encloses an active speaker's facial features further comprises: extracting the facial features from a bounding box when a predicted score exceeds a predetermined threshold and that the identifying of speaker is by confidence score. 
However, Krupka teaches/suggests 
the augmentor generating a bounding box that encloses an active speaker's facial features further comprises: extracting the facial features 
and that the identifying of speaker is by confidence score. (outputting of candidate faces within a bounding box, ¶¶ 23-24, and their names, according to a selection of candidates with highest audio match confidence by classification machine learning algorithms, ¶ 63, also see ¶¶ 21, 24, 27 and 64)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth to include the augmentor generating a bounding box that encloses an active speaker's facial features further comprises: extracting the facial features from a bounding box when a predicted score exceeds a predetermined threshold and that the identifying of speaker is by confidence score, as taught/suggested by Krupka.
One would have been motivated to make such a combination in order to lead to a more reliable audio source association in a multi-sources environment (¶ 2).

Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1), Hetherington (US 20070078649 A1) and Krupka (US 20190341054 A1), as applied to claim 16 above, and further in view of Hoang Do et al. (hereinafter Do – Non-Patent Literature, “A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contraction (SRC) On A Large-Aperture Microphone Array”, 2006).

Claim 17:
	The rejection of claim 16 is incorporated. Hildreth doesn’t directly teach further comprising instructions for locating the physical location of the speech source based on detecting a maximum in a steered response power segment, comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space.
However, Do discloses 
further comprising instructions for locating the physical location of the speech source based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1)
comprising:
generating a cross-correlation and phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pair of the set of microphones, see Pgs 1-2)
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
where the phase transform determines a time difference of arrival of a signal between the microphone pair; (finding a TDOA, see Pg 3)
and generating an image showing reverberation effects within the interactive space. (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth by further comprising instructions for locating the physical location of the speech source based on detecting a maximum in a steered response power segment, comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space, as taught/suggested by Do.

Claim(s) 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Robinson (US 20140156833 A1), and further in view of Nesta (US 20200412772 A1) and Hetherington (US 20070078649 A1), as applied to claims 10 and 15 above, and further in view of Do (Non-Patent Literature, “A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contraction (SRC) On A Large-Aperture Microphone Array”, 2006).

Claim 18:
The rejection of claim 15 is incorporated. Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment. 
However, Do discloses where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment. (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1). 
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, as taught/suggested by Do.
One would have been motivated to make such a combination in order to achieve a more cost effective localization (see Do Abstract).

Claim 19:
The rejection of claim 10 is incorporated. Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment and a stochastic region contraction. However, Do discloses 
where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment 
and a stochastic region contraction (Abstract, Pg 1, “The problem with computing SRP is that the SRP space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum. Grid search is too expensive for a real-time system…we propose using stochastic region contraction(SRC) to make computing the SRP practical…we show that SRC saves computation by more than two orders of magnitude with almost no loss in accuracy”).
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment and a stochastic region contraction, as taught/suggested by Do.
One would have been motivated to make such a combination in order to achieve a more cost effective localization (see Do Abstract).
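For illustration only, the general idea of a stochastic region contraction search, as characterized in the Do passage quoted above (replacing an exhaustive grid search for a global maximum with random sampling over a progressively contracted search region), can be sketched in one dimension. This is a simplified sketch and is not asserted to reproduce Do's algorithm; the toy objective merely stands in for a steered response power surface, and every name and parameter value below is hypothetical.

```python
import math
import random


def toy_power(x):
    """Stand-in objective: global peak near x = 3, smaller local peak near x = 7."""
    return math.exp(-(x - 3.0) ** 2) + 0.5 * math.exp(-(x - 7.0) ** 2)


def src_maximize(objective, low, high, n_samples=60, n_keep=10,
                 min_width=1e-3, max_iters=50, seed=0):
    """Search [low, high] for a maximum by sampling and shrinking the region."""
    rng = random.Random(seed)  # local generator so results are repeatable
    best_x, best_val = None, float("-inf")
    for _ in range(max_iters):
        if high - low < min_width:
            break
        # Evaluate the objective at random points inside the current region.
        points = [rng.uniform(low, high) for _ in range(n_samples)]
        scored = sorted(((objective(x), x) for x in points), reverse=True)
        if scored[0][0] > best_val:
            best_val, best_x = scored[0]
        # Contract the region to the span of the highest-scoring points.
        kept = [x for _, x in scored[:n_keep]]
        low, high = min(kept), max(kept)
    return best_x, best_val


peak_x, peak_val = src_maximize(toy_power, 0.0, 10.0)
```

Each pass evaluates only a fixed number of candidate points, so the total number of objective evaluations is bounded by the sample count times the iteration limit rather than by the resolution of a grid.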

Claim 20:
The rejection of claim 10 is incorporated. Hildreth further teaches that locating is based on a video classifier (“[0087] The system may implement a process to detect faces with[in] one or more camera images[.] The face detection process may determine the location, size, or other physical characteristics of human faces within the one or more camera images[.] [0088] A process to detect faces within a camera image may include analyzing color[.] Analyzing color may include comparing camera images to a color model, identifying parts of the camera image that have colors consistent with 
Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, [and] a stochastic region contraction.
However, Do discloses 
where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract)
[and] a stochastic region contraction […]. (Abstract, Pg 1, “The problem with computing SRP is that the SRP space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum. Grid search is too expensive for a real-time 
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the device of Hildreth where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, [and] a stochastic region contraction as taught/suggested by Do.
One would have been motivated to make such a combination in order to achieve a more cost effective localization (see Do Abstract).

Response to Arguments
Applicant’s prior art arguments have been fully considered but are moot in view of the new grounds of rejection presented above.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Below is a list of these references, including why they are pertinent:
Knode; Galen E. et al., US 20190173446 A1, ¶ 56, shows that it is understood that a beamforming process for speech audio relies on the difference in time of arrival between audio signals received from multiple microphones.
Ryan; Joseph F. et al. (US 20150269954 A1), pertinent for disclosing transforming digital sound information into MFCCs, Mel Frequency Cepstral Coefficients, and using the MFCCs to identify human voice (¶¶ 36 and 41).
Feng (US Patent Application Publication 20190158733), pertinent because “[0008] Embodiments of this disclosure pertain to one or more cameras which are automatically adjusted to continuously and instantly provide an optimal view of all persons attending a video conference using auto-framing. Embodiments of this disclosure pertain to automatically adjusting one or more cameras continuously to provide an optimal view of a person who is speaking.” 
Hart (US Patent Application Publication 20120062729), pertinent because it teaches “A computing device can analyze image or video information to determine a relative position of an active user. The computing device can optimize audio or video data capture based at least in part upon the relative location. The device can capture audio using one or more microphones pointing toward the relative location of the active user, and can use other microphones to determine audio from other sources to be removed from the captured audio. If video data is being captured, a video capture element can be adjusted to focus primarily on the active user. The 
Attorre (US Patent Application Publication 20190035431) pertinent because “[0126] The features or attributes associated with audio information, collectively referred to as “available audio features”, can include,…(k) Mel-Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but are distributed according to the mel-scale (“MFCCs”)”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GABRIEL S MERCADO whose telephone number is (408)918-7537. The examiner can normally be reached Mon-Fri 8am-5pm (Eastern Time).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William L. Bashore can be reached on (571) 272-4088. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 





/Gabriel Mercado/Examiner, Art Unit 2175           


/DANIEL RODRIGUEZ/Primary Examiner, Art Unit 2175