DETAILED ACTION
This office action is in response to applicant’s authorization, on 9/20/2021, of examiner’s amendment. 
Any citation of the specification is as published under US Patent Application Publication 20210294424.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Claims’ Status
Claims 1-20 are pending and are herein examined.

Response to Arguments
Applicant’s arguments, see Pg 8, filed 9/20/2021, that Hildreth does not anticipate claims 1-2, 6-7, 10-11 and 15-16 under 35 U.S.C. 102(a)(1) have been fully considered and are persuasive with respect to the claims as amended. Therefore, the 102 rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of 35 U.S.C. 103. See the details in the 103 rejection section below.
The remaining claims are also rejected under 103.

Claim Objections
Claim(s) 1 and 10 is/are objected to because of the following informalities: 
Claims 1 and 10 recite “and a blocking area; the blocking area designated to”, which is correctable to, and is being interpreted for examination purposes as: “and a blocking area[[;]], wherein the blocking area is designated to”.
Claim 10 recites “the blocking area; to render a conditioned signal”, which includes a semicolon in the wrong place and is missing a semicolon at the end of the clause. This is correctable to, and is being interpreted for examination purposes as: “the blocking area[[;]] to render a conditioned signal;”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112(a)
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Claims 1 and 10 recite “partitioning an interactive space into a virtual detection space and a blocking area” and “partitioning an interactive space into a detection space and a blocking area”, respectively. The specification describes that users can “partition the interactive space into blocking areas or blocking regions” (¶ 16), and “A system automatically frames locations by detecting a user's presence within a virtual detection space” (Abstract). However, the specification doesn't sufficiently describe “partitioning an interactive space into a virtual detection space and a blocking area” or “partitioning an interactive space into a detection space and a blocking area”.

Claims 1 and 10 recite “audio and images captured by a camera and a microphone array” and “audio signals and images captured by a camera and a microphone array”. The claims then further involve steps/functions of sampling, modeling, updating, detecting, transitioning, and attenuating in relation to noise or noise signal model that are specific to a “detection space” or “virtual detection space” of a camera. The specification describes detectors that can detect voice [herein interpreted as “microphones”] that interface with or are a unitary part of one or more cameras (see 

Claim 1 recites “updating a background noise when a speech segment is undetected and when a detected background noise measurement is below a median noise measurement in the virtual detection space”. Here, the specification doesn't sufficiently describe what background noise is being “updated”. The specification discloses continuously updating a “background noise measurement”, specifically, that some alternate systems measure continuous background noise and continuously update the measurement during intervals when voice and unvoiced segments are not detected, and that background noise is not measured at all during other intervals when detecting transient noise (those noises that exceed an average or median measurement of prior background measurements), see ¶ 20. However, the specification doesn't sufficiently describe how the background noise or the disclosed “background noise measurement” should be interpreted (e.g., the specification doesn't sufficiently describe how the background noise measurement can be interpreted as anything other than a median measurement of the prior background noise measurements).

Claim 10 recites “updating a background noise when a speech segment is undetected and only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement in the detection space”. Here, the specification doesn't sufficiently describe what background noise is being “updated”, for similar reasons provided above for claim 1. Furthermore, the limitation is problematic as emphasized here: “updating a background noise…only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement in the detection space”. The specification teaches updating a noise measurement and not measuring a measurement when transient noise events are identified, and that the transient noise event is identified when a measurement exceeds an average or median measurement of prior background noise measurement. See ¶ 20. In other words, the description includes the embodiment of updating a noise measurement when a background noise measurement is equal to an average noise measurement of a plurality of prior background measurement. However, the specification doesn't sufficiently describe that the updating occurs “only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement”.
NOTE: Even assuming that claim 10 were to be amended to “only when a background noise measurement is below or equal to an average noise measurement of a plurality of prior background measurement”, the specification would lack sufficient description of what the “noise measurement” is and why it makes sense to update such noise measurement “only when a background noise measurement is below or equal to an average noise measurement of a plurality of prior background measurement”, since, in statistics, a transient/outlier is typically not identified when a measurement is above an average or median, but instead when the

Claims 1 and 10 recite one or more of “detecting a user's presence within the virtual detection space of the camera while the electronic device is in a standby state by detecting noise components within the virtual detection space by the noise signal model” (claim 1), “detecting a user's presence within the detection space of the camera while the electronic device is in a standby state by detecting noise components within the virtual detection space by the noise signal model” (claim 10) and “transitioning the electronic device to an interactive state when the user's presence is detected within the detection space when noise components are detected within the detection space by the noise signal model” (claim 10). The specification describes using models to remove noise from a signal (¶ 13) and that “voice detectors” detect the presence of participants (¶ 14). However, the specification doesn't sufficiently describe that “the user's presence” is detected by using “noise components” or a “noise signal model”. 

Claim 1 recites “attenuating the noise components and an updated background noise from the virtual detection space and the speech segments and images from the blocking area to render a conditioned signal”. The specification describes updating a background noise measurement (¶ 20), updating models (¶ 21), updating speech source localizations (¶ 25), updating source location estimates (¶ 29), removing unwanted noise (¶ 12), and enhancing speech by dampening/attenuating undesired signals or background noise (see ¶¶ 12, 18 and 21). However, the specification doesn’t sufficiently describe what is meant by “attenuating…updated background noise” or what is meant by “attenuating…speech segments”.

Claim 10 recites “the user's presence is detected within the detection space when noise components are detected within the detection space by the noise signal model”. The specification describes using models to remove noise from a signal (¶ 13) and that “voice detectors” detect the presence of participants (¶ 14). However, the specification doesn't sufficiently describe “the user's presence is detected…when noise components are detected…by the noise signal model”.

Claims 2-9 and 11-20 are also rejected as they depend on the claim(s) above.

Claim Rejections - 35 USC § 112(b)
	The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Claims 1 and 10 recite “partitioning an interactive space into a virtual detection space and a blocking area” and “partitioning an interactive space into a detection space and a blocking area”, respectively. This is unclear for being inconsistent with the specification. The specification teaches that users can “partition the interactive space into blocking areas or blocking regions” (¶ 16), and “A system automatically frames locations by detecting a user's presence within a virtual detection space” (Abstract). The specification doesn’t teach “partitioning an interactive space into a virtual detection space and a blocking area” or “partitioning an interactive space into a detection space and a blocking area”. A claim, although clear on its face, may also be indefinite when a conflict or inconsistency between the claimed subject matter and the specification disclosure renders the scope of the claim uncertain as inconsistency with the specification disclosure or prior art teachings may make an otherwise definite claim take on an unreasonable degree of uncertainty. In re Moore, 439 F.2d 1232, 1235-36, 169 USPQ 236, 239 (CCPA 1971); In re Cohn, 438 F.2d 989, 169 USPQ 95 (CCPA 1971); In re Hammack, 427 F.2d 1378, 166 USPQ 204 (CCPA 1970). See MPEP 2173.03. For examination purposes, the examiner interprets the phrases as:
“partitioning at least part of an interactive space into at least one blocking area” and
“partitioning at least part of an interactive space into at least one blocking area”.

Claims 1 and 10 recite “audio and images captured by a camera and a microphone array” and “audio signals and images captured by a camera and a microphone array”. The claims then further involve steps/functions of sampling, modeling, updating, detecting, transitioning, and attenuating in relation to noise or noise signal model that are specific to a “detection space” or “virtual detection space” of a camera. Here, it is unclear how audio signals involving noise, captured by a microphone array that is separate from a camera, are specific to a camera’s detection space. The specification describes detectors that can detect voice [herein interpreted as “microphones”] that interface with or are a unitary part of one or more cameras (see ¶ 14). However, the specification doesn’t further illuminate how audio signals involving noise, captured by a microphone array that is separate from a camera, are specific to a camera’s detection space. For examination purposes, the examiner interprets the limitations as: 
“audio and images captured by a camera [[and]] with a built-in microphone array” and
“audio signals and images captured by a camera [[and]] with a built-in microphone array”

Claim 1 recites “updating a background noise when a speech segment is undetected and when a detected background noise measurement is below a median noise measurement in the virtual detection space”. Here, it is unclear what background noise is being “updated”. The specification discloses continuously updating a “background noise measurement”, specifically, that some alternate systems measure continuous background noise and continuously update the measurement during intervals when voice and unvoiced segments are not detected, and that background noise is not measured at all during other intervals when detecting transient noise (those noises that exceed an average or median measurement of prior background measurements), see ¶ 20. However, the specification doesn’t further illuminate how the background noise or the disclosed “background noise measurement” should be interpreted (e.g., the specification doesn’t explain how the background noise measurement can be interpreted as anything other than a median measurement of the prior background noise measurements). For examination purposes, the examiner interprets the phrase in claim 1 as:
“updating a median measurement of the prior background noise measurements when a speech segment is undetected and when a detected background noise measurement does not exceed the median measurement of the prior background noise measurements in the virtual detection space of the camera” and see further note below for interpretation of claim 10.

Claim 10 recites “updating a background noise when a speech segment is undetected and only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement in the detection space”. Here, it is unclear what background noise is being “updated”, for similar reasons provided above for claim 1. Furthermore, the limitation is also unclear for being inconsistent with the specification for reciting “updating a background noise…only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement in the detection space”. The specification teaches updating a noise measurement and not measuring a measurement when transient noise events are identified, and that the transient noise event is identified when a measurement exceeds an average or median measurement of prior background noise measurement. See ¶ 20. In other words, the description includes an embodiment of updating a noise measurement when a background noise measurement is equal to an average noise measurement of a plurality of prior background measurement. The specification doesn’t teach that the updating occurs “only when a background noise measurement is below an average noise measurement of a plurality of prior background measurement”. A claim, although clear on its face, may also be indefinite when a conflict or inconsistency between the claimed subject matter and the specification disclosure renders the scope of the claim uncertain as inconsistency with the specification disclosure or prior art teachings may make an otherwise definite claim take on an unreasonable degree of uncertainty. In re Moore, 439 F.2d 1232, 1235-36, 169 USPQ 236, 239 (CCPA 1971); In re Cohn, 438 F.2d 989, 169 USPQ 95 (CCPA 1971); In re Hammack, 427 F.2d 1378, 166 USPQ 204 (CCPA 1970). See MPEP 2173.03. For examination purposes, the examiner interprets claim 10 as:
“updating [[a]] an average measurement of prior background noise measurements when a speech segment is undetected and [[only]] when a background noise measurement is below an average noise measurement of a plurality of prior background measurements in the detection space of the camera”
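For illustration only, the update condition as interpreted above for claims 1 and 10 can be sketched as follows. This is a minimal sketch and not the applicant's disclosed implementation: the class name `BackgroundNoiseTracker`, the `statistic` parameter, and the choice to accept the very first measurement unconditionally are hypothetical, and the sketch applies a "does not exceed" comparison to both the median and the average variants.

```python
from statistics import mean, median


class BackgroundNoiseTracker:
    """Hypothetical tracker of a statistic over prior background noise measurements."""

    def __init__(self, statistic="median"):
        # "median" mirrors the claim 1 interpretation, "average" the claim 10 one.
        self._reduce = median if statistic == "median" else mean
        self._history = []

    @property
    def reference(self):
        # Statistic of the accepted prior measurements; None before the first one.
        return self._reduce(self._history) if self._history else None

    def update(self, measurement, speech_detected):
        """Accept the measurement only when no speech is detected and it does not
        exceed the current reference. Return True if it was accepted."""
        if speech_detected:
            return False
        ref = self.reference
        if ref is not None and measurement > ref:
            # Treated as a transient noise event: not folded into the reference.
            return False
        self._history.append(measurement)
        return True
```

Under this reading the reference can never increase, because only measurements at or below the current reference are ever accepted.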

Claims 1 and 10 recite one or more of “detecting a user's presence within the virtual detection space of the camera while the electronic device is in a standby state by detecting noise components within the virtual detection space by the noise signal model” (claim 1), “detecting a user's presence within the detection space of the camera while the electronic device is in a standby state by detecting noise components within the virtual detection space by the noise signal model” (claim 10) and “transitioning the electronic device to an interactive state when the user's presence is detected within the detection space when noise components are detected within the detection space by the noise signal model” (claim 10). Here, these limitations are unclear for being inconsistent with the specification. The specification describes using models to remove noise from a signal (¶ 13) and that “voice detectors” detect the presence of participants (¶ 14). The specification doesn’t teach that “the user's presence” is detected by using “noise components” or by a “noise signal model”. A claim, although clear on its face, may also be indefinite when a conflict or inconsistency between the claimed subject matter and the specification disclosure renders the scope of the claim uncertain as inconsistency with the specification disclosure or prior art teachings may make an otherwise definite claim take on an unreasonable degree of uncertainty. In re Moore, 439 F.2d 1232, 1235-36, 169 USPQ 236, 239 (CCPA 1971); In re Cohn, 438 F.2d 989, 169 USPQ 95 (CCPA 1971); In re Hammack, 427 F.2d 1378, 166 USPQ 204 (CCPA 1970). See MPEP 2173.03. For examination purposes, the examiner interprets the phrases as:
“detecting a user's presence within the virtual detection space of the camera while the electronic device is in a standby state by detecting voice components within the virtual detection space of the camera” (claim 1)
“detecting a user's presence within the detection space of the camera while the electronic device is in a standby state by detecting voice components within the virtual detection space of the camera” (claim 10)
“transitioning the electronic device to an interactive state when the user's presence is detected within the detection space when voice components are detected within the detection space of the camera” (claim 10)

Claim 1 recites “attenuating the noise components and an updated background noise from the virtual detection space and the speech segments and images from the blocking area to render a conditioned signal”. Here, it is unclear what is meant by “attenuating…updated background noise”, and it is unclear what is meant by “attenuating…speech segments”. The specification describes updating a background noise measurement (¶ 20), updating models (¶ 21), updating speech source localizations (¶ 25), updating source location estimates (¶ 29), removing unwanted noise (¶ 12), and enhancing speech by dampening/attenuating undesired signals or background noise (see ¶¶ 12, 18 and 21). However, none of these further illuminate what is meant by “attenuating…updated background noise” or what is meant by “attenuating…speech segments”. For examination purposes, the examiner interprets the phrase as:
“attenuating [[the]] noise components and images of the blocking area of the camera, wherein the noise components include background noise, and wherein the attenuating results in enhanced speech segments”.

Claim 10 recites “attenuating the noise components and an updated background noise, in the digital signals and aural signals and images from the blocking area”. Here, the limitation is unclear for reciting “attenuating…updated background noise” and “attenuating…speech segments”, for the reason(s) provided above for claim 1. Furthermore, this claim is unclear for reciting “in the digital signals and aural signals and images from the blocking area”. Here, it is unclear what the difference is between “aural signals” and “digital signals”, because “aural signals” are already in the form of “digital signals”. See the following preceding limitations of claim 10: “converting the speech into electrical signals” and “converting the electrical signals into digital signals at periodic intervals”. For examination purposes, the examiner interprets the phrase as: 
“attenuating [[the]] aural components and images of the blocking area, wherein the aural components include background noise, and wherein the attenuating results in enhanced speech segments”

Claim 11 recites “rendering an acknowledgement in response to identifying the physical location via a speech synthesis engine”. Here, it is unclear how to apply “via a speech synthesis engine”. Specifically, it’s unclear if the “rendering” is “via a speech synthesis engine”, or if “identifying the physical location” is “via a speech synthesis engine”. Furthermore, it is unclear for being inconsistent with the specification. The specification teaches an acknowledgement that is in response to transitioning to an active state, which occurs after a user’s presence is detected (see ¶¶ 15-16). The specification doesn’t teach that the acknowledgement is in response to identifying the physical location, as claimed. For examination purposes, the examiner interprets the phrase as:
“rendering, via a speech synthesis engine, an acknowledgement in response to the detection of the user’s presence”

Claim 6 recites “where the locating a physical location of the speech source comprises identifying a physical location through an acoustic localization based on a signal latency received by a microphone pair executed by an acoustic locator”. Here, it is unclear what is meant by “signal latency received by a microphone pair executed by an acoustic locator”. The specification teaches “[0022] With noise and undesired signals dampened, a locator 110 executes an acoustic localization through the microphone array 404 that comprises several microphones equidistant from each other. The time difference of arrival from between microphones is processed to determine the direction of arrival of the speech signals.” However, the specification doesn’t further illuminate what is meant by “signal latency received by a microphone pair executed by an acoustic locator”. For examination purposes, the examiner interprets the phrase as: 
“where the locating a physical location of the speech source comprises identifying a physical location through an acoustic localization based on a time difference of signal arrival between the microphones in the microphone array”
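As an illustration of the interpreted limitation, a time difference of arrival between two microphones can be mapped to a direction of arrival under a far-field (plane wave) assumption. The following is a minimal sketch only; the function name `direction_of_arrival`, the 343 m/s speed of sound, and the two-microphone geometry are assumptions and are not taken from the specification or the cited art.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air at room temperature


def direction_of_arrival(tdoa_s, mic_spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Return the far-field angle of arrival in degrees, measured from broadside
    of a two-microphone pair, given the time difference of arrival in seconds."""
    ratio = speed * tdoa_s / mic_spacing_m
    # Clamp to [-1, 1] so measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A zero time difference corresponds to a source directly in front of the pair, and the sign of the time difference indicates which microphone the sound reached first.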

Claims 2-9 and 11-20 are also rejected as they depend on the claim(s) above.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2, 6, 10-11 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth, Evan (hereinafter Hildreth – WO 2009042579 A1) in view of Wang, Zhe (hereinafter Wang – US 20160078873 A1) and further in view of Shin, Wonho et al. (hereinafter Shin – US 20200092519 A1).

	As per claim 1, Hildreth teaches
A computer implemented method of controlling an electronic device in an absence of a physical contact with the electronic device, comprising:
	partitioning (segmenting) at least part of an interactive space into at least one blocking area[[;]], (The background space [“blocked area”] is segmented/classified [“partitioning”] differently from the foreground space and noise and images are blocked from the background space, e.g., by displaying a segmented image by rendering only parts of that image that are classified as foreground, see at least FIG. 6 and ¶¶ 64, 103 and 115. Also see ¶¶ 94, 103 and 119, among others)
wherein the blocking area is designated to prevent the electronic device from tracking users (the position of the user is tracked based on a centroid of a foreground part of a segmented image, see ¶¶ 191-192, therefore the background part of the image [“blocking area”] is “designated to prevent the electronic device from tracking users”)
and conveying audio and images captured by a camera [[and]] with a built-in microphone array; (see ¶ 195 for “a microphone array”. Parts of the images are classified as foreground and background based on distance of the objects, see ¶ 100. Microphones may be turned off for sounds located far from the user reference position, see ¶ 195; therefore, the classification is able to “prevent…conveying audio…captured by…a microphone array”. As mentioned above, only parts of images classified as foreground are rendered on a display, so the classification is also able to “prevent…conveying…images captured by a camera”, see ¶ 115. Images are camera images, see at least ¶ 8. Camera with a built-in microphone array – see device 200 in FIG. 2 and ¶¶ 52-53, 70 and 72.)
sampling an aural signal received by the microphone array correlated with a noise event within a virtual detection space of the camera; (noise sounds are received in order to update a background model, see ¶ 103, so the noise signals received are samples [“sampling”] of noise sound signals [“aural signal”], and are necessarily correlated with “a noise event” through a filtering process of a chroma keying process, see ¶ 97. The sounds are necessarily received through a sound range of the microphone of the camera system 200, see FIG. 2)
correlating a sample of the aural signal with attributes of a noise signal; (noise sounds are received in order to update a background model, see ¶ 103, so the noise signal(s) received is “a sample of the aural signal”, and are necessarily correlated with “attributes of a noise signal” through a filtering process of a chroma keying process, see ¶ 97)
modeling […] the sample aural signal correlated with the noise signal to generate a noise signal model for the virtual detection space captured by the camera; (noise sounds are received in order to update a background model, see ¶ 103, at least an updated noise signal model is generated, and are necessarily correlated with “the noise signal” through a filtering process of a chroma keying process, see ¶ 97)
[…]; 
detecting a user's presence within the virtual detection space of the camera while the electronic device is in a standby state by detecting  [visual] components within the virtual detection space of the camera (“[0180] While [in] the standby state 3001 and notification state 3002, the system may monitor images captured by a camera for an engagement hand gesture[.] A processor may detect an engagement hand gesture such that the user may engage the system by performing an engagement hand gesture”)
	transitioning the electronic device to an interactive state when the user's presence is detected; (“[0183] A hand gesture identification process may identify an engagement hand gesture while the system is in the standby state 3001 and/or the notification state 3002[.] Referring to FIG 30, the system may enter the menu state 3004 when an engagement hand gesture is detected while in a standby state 3001[.] The system may enter the call state 3003 when an engagement hand gesture is detected while [in] a notification state 3002” and ¶ 172 – “a call state 3003, where the system facilitates audio or video communications with another system”, this is an “interactive state”. Also see “[0176] The system may be configured to enter a notification state 3002 if a face detection process detects a face…to avoid turning on a display device when no user is present”. Because a user has to be present to perform a hand gesture, the 
	detecting speech segments in the detection space of the camera and converting the speech segments into electrical signals; (¶ 22 – speech signals are received/detected by microphones. Also see ¶ 55, as a basic function, microphones detect/receive sound waves and convert them into electrical sound data)
	converting the electrical signals into digital signals at periodic intervals; (the microphones are part of the computer system, as can be seen in “[0040]…The media hub 110 is configured to accept incoming…video conference calls…The media hub 110 also includes or is otherwise connected to a microphone for receiving and digitizing ambient sounds” and “[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds…The microphone 206 may be part of the user interface 201, such as where a computer monitor”, therefore the sounds received by the microphone are converted into electrical signals and then into digital signals (that is, they are “digitized”). If this were not so, the computer would not be able to understand the signals, also see “[0238] The features described may be implemented [in] digital electronic circuitry”. Here, the microphone performs the functions of a digitizer, that is, an analog to digital converter (ADC). It would be understood by a person having ordinary skill in the art that ADCs cannot make instantaneous conversions, so the conversion of 
	identifying the speech segments in the digital signals; (“[0097]…a filtering process to reduce noise and change the classification of small isolated clusters (e.g., to remove isolated parts of the background that may be classified as foreground)”)
attenuating [[the]] noise components and images of the blocking area of the camera, wherein the noise components include background noise, and wherein the attenuating results in enhanced speech segments; (only images from the foreground are rendered and background noise is “attenuated”/blocked from distant microphones, see the mapping of the blocking area in the partitioning limitation above. Also see ¶ 64 – “The microphone 206 may include a filtering process operable to suppress background noise and cancel echoes”. Echoes can be from speech segments, see ¶ 70, the speech is conditioned/enhanced by virtue of attenuating the noise components)
	locating a physical location of a speech source generating the speech segments; (“[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds” and ¶ 196 – “a microphone may localize the voice of the second user”. Also see “[0195] 
adjusting the camera automatically on the physical location of the speech source generating the speech segments; (“[0074] The processor 205 may be operable to perform several camera tracking processes” and “[0192] The system may track a user reference position so that the camera maintains focus on a user while a user moves (e.g., the camera image follows the user)[.] Camera panning and zooming may help assure that the user remains within the transmitted image (e.g., [during] videoconferencing)[.] Camera panning and zooming also may help assure that buttons that may be displayed on the display device remain within easy reach of the user” and “[0189] The system may be configured to focus on a user[.] Focusing on a user may include panning and zooming a camera, so that the user's face appears centered, and at a specified size, in the camera image”. Here, since it is the system that tracks the user via camera adjustments, this is indicative of “adjusting the camera automatically”. Also see “[0009]…the process may include determining whether the first user has relinquished the focus…Determining whether the first user has relinquished the focus may include determining whether the first user has finished speaking”. 
and transmitting the conditioned signal to a remote destination. (see at least ¶¶ 68-69 – “Video of a remote user may be transmitted over a network as compressed data, which is decompressed before being displayed by the user interface 201”. Also see ¶¶ 70, 72-73 and 227)
Hildreth doesn’t directly teach updating a median measurement of the prior background noise measurements when a speech segment is undetected and when a detected background noise measurement does not exceed measurement of the prior background noise measurements of the camera. However, Wang discloses an encoder that selects a group of silence frames (collected during a silence period wherein signals with no voice included in them are received) that do not include a transient component, wherein such selection is to obtain an average or median value that is of better quality (see at least ¶ 4 – “the silence signal refers to a signal not including a call voice… an encoder intermittently encodes and sends a special encoding frame, namely, a silence descriptor (SID) frame”, ¶ 177 – “when an SID frame is encoded, a parameter of the SID frame is obtained by obtaining an average value or a median value of parameters of multiple silence frames within the analysis interval. However, an actual background noise spectrum may include various unexpected transient spectral components. Once the analysis interval includes such spectral components, the components may be added in the SID frame in a method for obtaining an average value, and a silence spectrum including such spectral components may even be incorrectly encoded in the SID frame 
Hildreth doesn’t directly teach that modeling is of “spectral components of” the sample aural signal. However, Wang further discloses using spectral parameters [“components”] to analyze sound data for the purpose of improving sound quality (¶ 182). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Wang to Hildreth, because this would lead to improving the quality of the sound to ensure comfort of listening users (Wang ¶¶ 4 and 182).
Hildreth doesn’t directly teach that the user presence is detected using “voice” components. However, Shin discloses a device with a voice activation module that recognizes a user’s wake up command while in a standby state (¶ 174 and FIG. 7). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Shin to Hildreth, because this would lead to a more flexible method allowing voice activation to more users, e.g., users who are immobile or prefer speech over other input forms, e.g., gestures.
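
For illustration only, and not taken from Hildreth, Wang, or the application: a minimal sketch of a median-based background noise update of the general kind discussed above. The function name `update_noise_median`, the list `history`, and the acceptance rule (no speech detected, and the new level does not exceed the median of the prior levels) are assumptions made for this sketch.

```python
from statistics import median


def update_noise_median(history, level, speech_detected):
    """Return the median noise level, updating `history` only when allowed.

    Assumptions (hypothetical, for illustration): `history` is a non-empty
    list of prior background noise levels; a new level is accepted only when
    no speech is detected and it does not exceed the current median.
    """
    current = median(history)
    if not speech_detected and level <= current:
        history.append(level)        # accept the quiet, speech-free measurement
        current = median(history)    # recompute the median with the new sample
    return current


# Hypothetical measurements in arbitrary units.
noise_history = [0.2, 0.3, 0.4]
after_loud = update_noise_median(noise_history, 0.9, speech_detected=False)
after_speech = update_noise_median(noise_history, 0.1, speech_detected=True)
after_quiet = update_noise_median(noise_history, 0.25, speech_detected=False)
```

In this sketch the loud measurement and the measurement taken during speech leave the median unchanged, while the quiet, speech-free measurement is added to the history.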

As per claim 2, Hildreth further teaches rendering an acknowledgement in response to the virtual detection via a speech synthesis engine. (‘[0041]…the media hub 110 detects an incoming call and alerts the user 104 via an audio and/or video message…the user 104 is alerted that the incoming call is from the board of directors at the user's company when the speakers 113 output audio indicating "Incoming Call The Board”…generated…by applying a text-to-speech capabilities to a caller-ID system’ This alert is “in response to the virtual detection” because, as explained above, the notification state is not activated unless a user is detected as being present. “[0176] The system may be configured to enter a notification state 3002 if a face detection process detects a face…to avoid turning on a display device when no user is present”)

	As per claim 6, Hildreth further teaches where the locating a physical location of the speech source comprises identifying a physical location through an acoustic localization based on a time difference of signal arrival between the microphones in the microphone array (¶ 195 – “A sound localization process may utilize a beamform[ing] process, whereby the phase and amplitude of the signal received by each sensor of the microphone array is compared”. A beamforming process used for speech audio is understood to include comparing the times of signal arrival from sources to the microphones in a microphone array to calculate the location of speech sources).
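
For illustration only, and not drawn from Hildreth: a minimal sketch of how a time difference of arrival between two microphones can be estimated by cross-correlating their signals. The function name `estimate_delay` and the sample signals are hypothetical.

```python
def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` received the sound later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += x * other[m]  # correlation of ref with shifted other
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical example: the same pulse reaches the second microphone
# three samples after it reaches the first.
pulse = [0.0, 1.0, 0.5, -0.3, 0.0]
mic_a = pulse + [0.0] * 8
mic_b = [0.0] * 3 + pulse + [0.0] * 5
delay_samples = estimate_delay(mic_a, mic_b, max_lag=6)
```

Dividing the lag by the sampling rate gives the delay in seconds; with a known microphone spacing, that delay constrains the direction of the source.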

Claims 10-11 and 15 are device claims containing functions found in limitations of method claims 1-2 and 6, respectively, and are rejected using similar rationales.

Claim(s) 3-5 and 12-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Wang (US 20160078873 A1) and further in view of Shin (US 20200092519 A1), as applied to claims 1 and 10, and further in view of Weiss; Ron J. et al. (hereinafter Weiss – US 8131543 B1).

	As per claim 3, Hildreth further teaches classifying the [sound] as a speech or the noise signal (¶ 198, at least – “classifying the sound as voice or not voice”).
Hildreth doesn’t directly teach converting the digital signals into a plurality of cepstral coefficients and that the classification is of the cepstral coefficients. However, Weiss discloses calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components (col 1:45-54). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth, because this would lead to a more efficient method adequately differentiating noise from speech (Weiss Abstract and Weiss col 1:18-25).

	As per claim 4, Hildreth further teaches identifying a human presence […] (¶ 180 – e.g., through hand gesture).
Hildreth doesn’t directly teach in response to processing the cepstral coefficients. However, Weiss discloses classifying the signals as speech or noise by using the MFCC components (col 1:45-54). It was well within the capacities of a person having ordinary skill in the art to realize that identifying speech would be reflective of a human 

	As per claim 5, Hildreth doesn’t directly teach the speech segments are identified by correlating spectral shapes [e.g., amplitudes] of the digital signals attributed with voiced and unvoiced speech. However, Weiss discloses calculating Mel cepstral coefficient, MFCC, components associated with audio signal, and classifying the signals as speech or noise by using the MFCC components and related signal amplitudes (col 1:45-54 and 12:52-13:2). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Weiss to Hildreth, because this would lead to a more efficient method adequately differentiating noise from speech (Weiss Abstract and Weiss col 1:18-25).

Claims 12-14 are device claims containing functions found in limitations of one or more of claims 3-5 and are rejected using similar rationales.

Claim(s) 7 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Wang (US 20160078873 A1) and further in view of Shin (US 20200092519 A1), as applied to claims 6 and 10, and further in view of Venalainen, Kevin Juho (hereinafter Venalainen – US 20190025400 A1).

As per claim 7, Hildreth further teaches 
where the locating the physical location of the speech source comprises a video localization executed by a video locator (“[0087] The system may implement a process to detect faces with[in] one or more camera images[.] The face detection process may determine the location…or other physical characteristics of human faces” Also see “[0009]…the process may include determining whether the first user has relinquished the focus…Determining whether the first user has relinquished the focus may include determining whether the first user has finished speaking”. Therefore the location of the first user’s face is the “speech source”)
and an augmentor, the augmentor (¶ 103 – a learning process to reduce noise and change classification of clusters in segmentation process)
generating a bounding box that encloses a participant's head (a camera image may be identified as user’s body or face and display, e.g., as a silhouette or contour box, which includes dividing/segmenting the camera image into foreground, e.g., a part of a user, and background, that is, generating a bounding box that encloses a participant's head. Also see Hildreth, Par 114-115. NOTE: a bounding box is not limited to rectangular in shape.)
and further comprises:
	extracting features of the participant from within the bounding box (a camera image may be identified as user’s body or face, that is,  features of the participant, and display, e.g., as a silhouette or contour box, which includes dividing/segmenting the camera image into foreground, e.g., a part of a user, and background, that is, extracting features…within the bounding box. Also see Hildreth, Par 114-115.)
[…];
identifying the speech source by a classification (see ¶ 92, at least – “performing statistical analysis to classify the eigenimage as a particular user's face”. Also see ¶¶ 94-95, 108, 162)
[…];
and identifying the physical location of the speech source (“[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds” and ¶ 196 – “a microphone may localize the voice of the second user”. Also see “[0195] The system further may include localizing audio to focus on a user based on a user reference position…A sound localization process may increase the sensitivity of sound originating in the direction corresponding to the user reference position, and decrease the sensitivity of sound originating from other directions”)
based on a relative position of the speech source to images of a plurality of objects captured by the camera. (See ¶¶ 103-104, e.g., skin color of people in images)
Hildreth doesn’t directly teach that the augmentor extracts the features when a predicted score exceeds a predetermined threshold and that the identifying of the speech source by a classification is based on the classification that renders a highest confidence score. However, Venalainen discloses a machine learning logic (“augmentor”) that includes, for an object of analysis, a high level of confidence that is considered in subsequent processing (see at least ¶ 46) that may include a decision based on a new sample object exceeding a certain threshold (see at least ¶ 69). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Venalainen to Hildreth, because this would lead to improved accuracy of localization estimate (Venalainen ¶ 16).

As per claim 16, Hildreth further teaches 
where the locating the physical location of a speech source comprises a video localization executed by a video locator (“[0087] The system may implement a process to detect faces with[in] one or more camera images[.] The face detection process may determine the location…or other physical characteristics of human faces” Also see “[0009]…the process may include determining whether the first user has relinquished the focus…Determining whether the first user has relinquished the focus may include determining whether the first user has finished speaking”. Therefore the location of the first user’s face is the “speech source”)
and an augmentor, the augmentor (¶ 103 – a learning process to reduce noise and change classification of clusters in segmentation process)
generating a bounding box that encloses an active speaker's facial features (a camera image may be identified as user’s body or face and display, e.g., as a silhouette or contour box, which includes dividing/segmenting the camera image into foreground, e.g., a part of a user, and background, that is, generating a bounding box that encloses a participant's head. Also see Hildreth, Par 114-115. NOTE: a bounding box is not limited to rectangular in shape.)
further comprises:
extracting the facial features from a bounding box […]; (a camera image may be identified as user’s body or face, that is,  features of the participant, and display, e.g., as a silhouette or contour box, which includes dividing/segmenting the camera image into foreground, e.g., a part of a user, and background, that is, extracting features…within the bounding box. Also see Hildreth, Par 114-115.)
identifying an active speaker by a classification […] (see ¶ 92, at least – “performing statistical analysis to classify the eigenimage as a particular user's face”. Also see ¶¶ 94-95, 108, 162 and “[0009]…the process may include determining whether the first user has relinquished the focus…Determining whether the first user has relinquished the focus may include determining whether the first user has finished speaking”)
and identifying a physical location of the active speaker (“[0064] The microphone 206 may include multiple sensors that are operable to spatially localize sounds” and ¶ 196 – “a microphone may localize the voice of the second user”. Also see “[0195] The system further may include localizing audio to focus on a user based on a user reference position…A sound localization process may increase the sensitivity of sound originating in the direction corresponding to the user reference position, and decrease the sensitivity of sound originating from other directions”)
based on a relative position of the active speaker to images of a plurality of other objects captured by the camera in the interactive space. (See ¶¶ 103-104, e.g., skin color of people in images)
Hildreth doesn’t directly teach that the augmentor extracts the features when a predicted score exceeds a predetermined threshold and that the identifying of the speaker is by both a classification and a confidence score. However, Venalainen discloses a machine learning logic (“augmentor”) that includes, for an object of analysis, a high level of confidence that is considered in subsequent processing (see at least ¶ 46) that may include a decision based on a new sample object exceeding a certain threshold (see at least ¶ 69). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Venalainen to Hildreth, because this would lead to improved accuracy of localization estimate (Venalainen ¶ 16).

Claim(s) 8 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Wang (US 20160078873 A1) and further in view of Shin (US 20200092519 A1) and of Venalainen (US 20190025400 A1), as applied to claims 7 and 16 above, and further in view of Hoang Do et al. (hereinafter Do – Non-Patent Literature, “A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contraction (SRC) On A Large-Aperture Microphone Array”, 2006).

As per claim 8, Hildreth doesn’t directly teach where the locating a physical location of a speech source is based on detecting a maximum in a steered response power segment, comprising: generating cross-correlation and a phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; and generating an image of reverberation effects within the interactive space; where the phase transform determines a time difference of arrival of a signal between the microphone pair.
However, Do discloses 
where the locating a physical location of a speech source is based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus  
comprising: 
generating cross-correlation and a phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pairs of the set of microphones, see Pgs 1-2)
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
and generating an image of reverberation effects within the interactive space; (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
where the phase transform determines a time difference of arrival of a signal between the microphone pair. (finding a TDOA, see Pg 3)
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).
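
For illustration only, and not Do's implementation: a minimal sketch of the generalized cross-correlation with phase transform (GCC-PHAT) for a single microphone pair, the per-pair quantity that SRP-PHAT accumulates over all pairs. The function names and sample signals are hypothetical, and a naive discrete Fourier transform is used to keep the sketch self-contained.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(x):
            acc += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig_a, sig_b):
    """Return the PHAT-weighted circular cross-correlation of two frames."""
    spec_a, spec_b = dft(sig_a), dft(sig_b)
    cross = [b * a.conjugate() for a, b in zip(spec_a, spec_b)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def tdoa_samples(sig_a, sig_b):
    """Lag (in samples) of sig_b relative to sig_a at the correlation peak."""
    corr = gcc_phat(sig_a, sig_b)
    n = len(corr)
    peak = max(range(n), key=lambda i: corr[i])
    return peak if peak <= n // 2 else peak - n  # circular index to signed lag


# Hypothetical frames: the pulse arrives three samples later at microphone B.
frame_a = [0.0, 1.0, 0.5, -0.3] + [0.0] * 12
frame_b = [0.0] * 3 + frame_a[:13]
lag = tdoa_samples(frame_a, frame_b)
```

Because the weighting keeps only phase, the correlation peak stays sharp regardless of the signal's spectral shape, which is the usual motivation given for PHAT in reverberant rooms.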

As per claim 17, Hildreth doesn’t directly teach further comprising instructions for locating the physical location of the speech source based on detecting a maximum in a steered response power segment, comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space.
However, Do discloses 
further comprising instructions for locating the physical location of the speech source based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1)
comprising:
generating a cross-correlation and phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pairs of the set of microphones, see Pgs 1-2)
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
where the phase transform determines a time difference of arrival of a signal between the microphone pair; (finding a TDOA, see Pg 3)
and generating an image showing reverberation effects within the interactive space. (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).

Claim(s) 9 and 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hildreth (WO 2009042579 A1) in view of Wang (US 20160078873 A1) and further in view of Shin (US 20200092519 A1), as applied to claims 6, 10 and 15 above, and further in view of Do (Non-Patent Literature, “A Real-Time SRP-PHAT Source Location Implementation Using Stochastic Region Contraction (SRC) On A Large-Aperture Microphone Array”, 2006).

	As per claim 9, Hildreth doesn’t directly teach where the locating the physical location of the speech source is based on detecting a maximum in a steered response power segment; comprising: generating a cross-correlation and phase transform values at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; where the phase transform determines a time difference of arrival of a signal between the microphone pair; and generating an image showing reverberation effects within the interactive space. 
However, Do discloses 
where the locating the physical location of the speech source is based on detecting a maximum in a steered response power segment (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1)
comprising:
generating a cross-correlation and phase transform values (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pairs of the set of microphones, see Pgs 1-2)
at a plurality of time delays associated with a sensing direction of a plurality of microphone pairs processing the aural signal; (see Pg 1, localization algorithms for sound sources and Pg 3 – TDOA, time-difference-of-arrival)
where the phase transform determines a time difference of arrival of a signal between the microphone pair; (that the SRP is calculated using the phase transform (SRP-PHAT) values and cross-correlations for all possible pairs of the set of microphones, see Pgs 1-2; Pg 3 – TDOA, time-difference-of-arrival)
and generating an image showing reverberation effects within the interactive space. (see Pg 3, Figure 1, example of SRC, Stochastic Region Contraction)
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).

As per claim 18, Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment. However, Do discloses where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment. (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1). Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).

As per claim 19, Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment and a stochastic region contraction. However, Do discloses 
where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1)
and a stochastic region contraction (Abstract, Pg 1, “The problem with computing SRP is that the SRP space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum. Grid search is too expensive for a real-time system…we propose using stochastic region contraction(SRC) to make computing the SRP practical…we show that SRC saves computation by more than two orders of magnitude with almost no loss in accuracy”).
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).
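
For illustration only, and not Do's implementation: a minimal sketch of the stochastic region contraction idea quoted above, in which a search box is sampled at random, the best-scoring points are kept, and the box is shrunk around them until it is small. The function name `src_maximize`, its parameters, and the toy objective standing in for a steered response power surface are all hypothetical.

```python
import random


def src_maximize(objective, bounds, samples=200, keep=20, tol=1e-3,
                 max_iters=50, seed=0):
    """Search for a maximum of `objective` by repeatedly shrinking a box."""
    rng = random.Random(seed)
    best_point, best_value = None, float("-inf")
    for _ in range(max_iters):
        # Evaluate the objective at random points inside the current box.
        points = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(samples)]
        scored = sorted(((objective(p), p) for p in points), reverse=True)
        if scored[0][0] > best_value:
            best_value, best_point = scored[0]
        top = [p for _, p in scored[:keep]]
        # Contract the search region to the bounding box of the best points.
        bounds = [(min(p[d] for p in top), max(p[d] for p in top))
                  for d in range(len(bounds))]
        if all(hi - lo < tol for lo, hi in bounds):
            break
    return best_point, best_value


def toy_power(point):
    """Hypothetical stand-in for an SRP surface, peaking at (1.5, 2.0)."""
    x, y = point
    return -((x - 1.5) ** 2 + (y - 2.0) ** 2)


location, power = src_maximize(toy_power, [(0.0, 5.0), (0.0, 5.0)])
```

The sketch evaluates the objective only at sampled points rather than on a full grid, which is the source of the computational saving the quoted abstract describes; a real SRP surface has many local maxima, so sample counts would need to be chosen with that in mind.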

As per claim 20, Hildreth further teaches that locating is based on a video classifier (“[0087] The system may implement a process to detect faces with[in] one or more camera images[.] The face detection process may determine the location, size, or other physical characteristics of human faces within the one or more camera images[.] [0088] A process to detect faces within a camera image may include analyzing color[.] Analyzing color may include comparing camera images to a color model, identifying parts of the camera image that have colors consistent with human sk[in] and facial features, clustering those parts of the camera image having colors consistent with human skin and facial features, and classifying a cluster as a face if it meets a set of size and shape criteria”).
Hildreth doesn’t directly teach where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, [and] a stochastic region contraction.
However, Do discloses 
where the locating the physical location of a speech source is based on detecting a maximum in a steered response power segment, (in localizing sources in a noisy, reverberant environment, it has been shown that computing the steered response power (SRP) is more robust than faster, two-stage, direct time-difference of arrival methods, and that the SRP [steered response power] space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum, see Abstract and Pg 1)
[and] a stochastic region contraction […]. (Abstract, Pg 1, “The problem with computing SRP is that the SRP space has many local maxima and thus computationally intensive grid-search methods are used to find a global maximum. Grid search is too expensive for a real-time system…we propose using stochastic region contraction(SRC) to make computing the SRP practical…we show that SRC saves computation by more than two orders of magnitude with almost no loss in accuracy”)
Therefore, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to apply Do to Hildreth, because this would lead to more cost effective localization (see Do Abstract).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Below is a list of these references, including why they are pertinent:
Ryan; Joseph F. et al. (US 20150269954 A1), pertinent for disclosing transforming digital sound information into MFCCs, Mel Frequency Cepstral Coefficients, and using the MFCCs to identify human voice (¶ 36 and 41).
Feng (US Patent Application Publication 20190158733), pertinent because “[0008] Embodiments of this disclosure pertain to one or more 
Hart (US Patent Application Publication 20120062729), pertinent because it teaches “A computing device can analyze image or video information to determine a relative position of an active user. The computing device can optimize audio or video data capture based at least in part upon the relative location. The device can capture audio using one or more microphones pointing toward the relative location of the active user, and can use other microphones to determine audio from other sources to be removed from the captured audio. If video data is being captured, a video capture element can be adjusted to focus primarily on the active user. The position of the user can be monitored so the audio and video data capture can be adjusted accordingly.” See abstract. 
Attorre (US Patent Application Publication 20190035431) pertinent because “[0126] The features or attributes associated with audio information, collectively referred to as “available audio features”, can include,…(k) Mel-Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but are distributed according to the mel-scale (“MFCCs”)”
THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GABRIEL S MERCADO whose telephone number is (408)918-7537. The examiner can normally be reached Mon-Fri 8am-5pm (Eastern Time).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William L. Bashore can be reached on (571) 272-4088. The fax phone 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Gabriel Mercado/
Examiner, Art Unit 2175

/WILLIAM L BASHORE/
Supervisory Patent Examiner, Art Unit 2175