Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/19/2019 and 07/23/2019 are being considered by the examiner.
Drawings
The drawing submitted on 07/19/2019 is being considered by the examiner.
Claim Objections
Claim 15 is objected to because of the following informalities: the last word of claim 15 should be changed from “sign” to “signal”. Appropriate correction is required.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 9-11 and 18-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by XU (US 2019/0333498 A1).
Regarding Claim 1, XU teaches: A method of estimating a distance from a device to a signal source using a deep learning system, the method comprising: receiving an audio signal, produced by a microphone of the device, that is responsive to sound from the signal source whose distance from the device is to be estimated; processing the audio signal to estimate a direct component (direct path component) of the sound from the signal source, and a reverberant component
(reverberant component) of the sound from the signal source ([0047] The data processing module comprises a classification unit or classifier 100 which is configured to receive data representing at least one audio signal. This may be for example a portion of the audio signal(s) or features extracted from the audio signal(s). The audio signal may be derived from a signal generated by a single or multiple microphones of the device. Based on the received data, the classification block is operable to classify an acoustic environment of the audio signal. For example, the classifier may be operable to classify the acoustic environment of the sound signal derived by the microphone based on the audio data. Thus, the classifier may be configured to enable the estimation of one or more parameters of the real-time acoustic scenario, such as distance between the source--e.g. speaker--and the microphone, and/or the direction of sound projection relative to the microphone. [0050] Reflected sounds captured by a microphone will have travelled on a longer path compared to the direct path and will therefore arrive after sound waves which have travelled on the direct path. By considering the time of arrival of speech sounds it becomes possible to classify received speech into "early" speech sounds--which are speech sounds received within a specified time interval from the start of speech--and "late" speech sounds received by the microphone after the specified time interval. 
The early speech sounds can be considered to comprise the direct path component of speech, whilst the late speech sounds can be considered to comprise only the reverberant components.); extracting signal characteristics of the direct component and the reverberant component; and estimating, by the deep learning system, the distance of the signal source from the device based on the extracted signal characteristics of the direct component and the reverberant component ([0053] Thus, it will be appreciated that by considering the ratio of the energy of the early arriving sounds to the energy of the late arriving sounds, it is possible to infer one or more characteristics of the acoustic scenario such as distance between the source--e.g. speaker--and the microphone, the direction of sound projection relative to the microphone and the acoustic reverberant condition.  [0054] One or more examples described herein rely upon classifying the acoustic environment of received audio data based on a consideration of one or more features of the audio signal and/or the energy of the received audio signal. In particular, one or more examples described herein rely upon consideration of a ratio of the direct to reverberant sound energy or the Direct to Reverberant Ratio (DRR). [0057] Thus, according to one or more examples the classifier may comprise a model. The model may have been trained offline in order to characterize a plurality of different acoustic environments. For example, the model may have been trained using neural networks. The model may be built by deriving one or more metrics in a plurality of different acoustic environments. At runtime it is possible to use the model to identify at least one likely acoustic scenario based on data or features which are extracted from a real-time audio signal. 
One or more examples described herein at least partially rely upon a model which has been trained using acoustic features of received sound and/or a metric which represents the energy of received sound. Preferably, one or more example described herein at least partially rely upon a model which has been trained based on information about a ratio of the direct to reverberant sound energy or the Direct to Reverberant Ratio (DRR).).
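For illustration only, the early/late energy ratio described in the cited paragraphs [0050]-[0054] can be sketched as follows. This is a minimal sketch and not the implementation of XU or of the present application: the function name, the 5 ms early window, the use of the strongest peak as the direct-path arrival, and the toy impulse responses are all assumptions introduced here.

```python
import numpy as np

def direct_to_reverberant_ratio_db(impulse_response, sample_rate, early_window_s=0.005):
    """Energy ratio (in dB) of early-arriving sound to late-arriving sound.

    Assumption: the strongest peak marks the direct-path arrival, and everything
    within `early_window_s` after it counts as "early" sound.
    """
    h = np.asarray(impulse_response, dtype=float)
    onset = int(np.argmax(np.abs(h)))                    # assumed direct-path arrival
    split = onset + int(round(early_window_s * sample_rate))
    early_energy = np.sum(h[:split] ** 2)
    late_energy = np.sum(h[split:] ** 2)
    eps = 1e-12                                          # avoids log of zero
    return 10.0 * np.log10((early_energy + eps) / (late_energy + eps))

# Toy impulse responses (hypothetical): the same reverberant tail, but a
# stronger direct-path peak for the nearer source.
sample_rate = 16000
n = np.arange(4000)
tail = 0.05 * np.exp(-n / 800.0)
near = tail.copy()
near[0] = 1.0
far = tail.copy()
far[0] = 0.2

drr_near = direct_to_reverberant_ratio_db(near, sample_rate)
drr_far = direct_to_reverberant_ratio_db(far, sample_rate)
```

In this toy setup the nearer source yields the larger ratio, which is consistent with the relationship between the ratio and source distance described in the cited paragraph [0053].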

Regarding Claims 9 and 18, XU teaches: The method of claim 1, wherein the distance estimated by the deep learning system comprises one of a classification output that provide a discrete estimate of the distance from the device to the signal source or a regression output that provides a continuous estimate of the distance from the device to the signal source (See rejection of claim 1).

Regarding Claim 10, XU teaches: The method of claim 1, further comprising training the deep learning system using training data to learn a mapping between audio signals of the training data received by the microphone of the device and a distance to a source of the training data, and wherein estimating by the deep learning system, the distance of the signal source from the device based on the extracted signal characteristics of the direct component and the reverberant component comprises estimating the distance based on the learned mapping ([0047] For example, the classifier may be operable to classify the acoustic environment of the sound signal derived by the microphone based on the audio data. Thus, the classifier may be configured to enable the estimation of one or more parameters of the real-time acoustic scenario, such as distance between the source--e.g. speaker--and the microphone, and/or the direction of sound projection relative to the microphone. [0057] Thus, according to one or more examples the classifier may comprise a model. The model may have been trained offline in order to characterize a plurality of different acoustic environments. For example, the model may have been trained using neural networks. The model may be built by deriving one or more metrics in a plurality of different acoustic environments. At runtime it is possible to use the model to identify at least one likely acoustic scenario based on data or features which are extracted from a real-time audio signal. One or more examples described herein at least partially rely upon a model which has been trained using acoustic features of received sound and/or a metric which represents the energy of received sound. Preferably, one or more example described herein at least partially rely upon a model which has been trained based on information about a ratio of the direct to reverberant sound energy or the Direct to Reverberant Ratio (DRR).).

Regarding Claims 11 and 19, XU teaches:  A system configured to learn and estimate a distance from a device to a signal source comprising: a processor; and a memory coupled to the processor to store instructions, which when executed by the processor, cause the processor to: receive an audio signal, produced by a microphone of the device, that is responsive to sound from the signal source whose distance from the device is to be estimated; process the audio signal to estimate a direct component of the sound from the signal source and a reverberant component of the sound from the signal source; extract signal characteristics of the direct component and the reverberant component; and estimate the distance of the signal source from the device based on the extracted signal characteristics of the direct component and the reverberant component based on a learned mapping between an audio signal received by the microphone of the device from a training signal source and a learned distance from the device to the training signal source (See rejection of Claim 1).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 2-6, 8, 12-15, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over XU in view of Jensen et al. (US 2015/0256956 A1).

Regarding Claims 2, 12, and 20, XU does not teach: The method of claim 1, wherein the processing the audio signal to estimate a direct component and a reverberant component of the sound comprises transforming the audio signal into a time-frequency representation of the audio signal and wherein the direct component and the reverberant component are in time-frequency representation.
Jensen et al. teach: wherein the processing the audio signal to estimate a direct component and a reverberant component of the sound comprises transforming the audio signal into a time-frequency representation of the audio signal and wherein the direct component and the reverberant component are in time-frequency representation ([0018] The disclosure is based on the fact that the spatial characteristics of a typical target speech signal and of a reverberant sound field are quite different. Specifically, the proposed method exploits that a reverberant sound field may be modelled as being approximately isotropic, that is, for a given frequency, the reverberant signal power originating from any direction is (approximately) the same. The direct part of a target speech signal, on the other hand, is confined to roughly one direction. [0021] An object of the present application is to provide a scheme for estimating the signal power as a function of time and frequency of a reverberant part of a reverberant speech signal. [0041] In an embodiment, the method comprises determining separate characteristics (e.g. spatial fingerprints) of the target signal and of the noise signal components. [0077] In an embodiment, the audio processing system, e.g. the microphone unit, and or the transceiver unit comprise(s) a TF-conversion unit for providing a time-frequency representation of an input signal. In an embodiment, the time-frequency representation comprises an array or map of corresponding complex or real values of the signal in question in a particular time and frequency range. In an embodiment, the TF conversion unit comprises a filterbank for filtering a (time varying) input signal and providing a number of (time varying) output signals each comprising a distinct frequency range of the input signal. In an embodiment, the TF conversion unit comprises a Fourier transformation unit for converting a time variant input signal to a (time variant) signal in the frequency domain. 
[0097] FIGS. 2A-2B schematically illustrate a conversion of a signal in the time domain to the time-frequency domain, FIG. 2A illustrating a time dependent sound signal (amplitude versus time) and its sampling in an analogue to digital converter, FIG. 2B illustrating a resulting `map` of time-frequency units after a (short-time) Fourier transformation of the sampled signal. [0106] FIG. 1A schematically shows an example of an acoustically propagated signal from an audio source (S in FIG. 1A) to a listener (L in FIG. 1A) via direct (p.sub.0) and reflected propagation paths (p.sub.1, P.sub.2, P.sub.3; P.sub.4, respectively) in an exemplary location (Room). The direct (p.sub.0) and early reflections (here the one time reflected (p.sub.1)) propagation paths are indicated FIG. 1A in dashed line, whereas the `late reflections` (here the 2, 3, and 4 times reflected (P.sub.2, P.sub.3, P.sub.4)) time reflected (p.sub.1)) are indicated FIG. 1A in dotted line. FIG. 1B schematically illustrates an example of a resulting time variant sound signal (magnitude |MAG| [dB] versus time) from the sound source S as received at the listener L. In FIG. 1B a predetermined time .DELTA.t.sub.pd defining the `late reverberations` is indicated. The late reverberations are in the present example taken to be those signal components that arrive at the listener a time t.sub.pd after it was issued by the sound source S. In other words, `late reverberations` are signal components of a sound that arrive at a given input unit (e.g. the i.sup.th) a predefined time .DELTA.t.sub.pd after the first peak (p0) of the impulse response has arrived at the input unit in question. The appropriate number of reflections and/or the appropriate predefined time .DELTA.t.sub.pd separating the target signal components (dashed part of the graph in FIG. 1B) from the (undesired) reverberation (noise) signal components (dotted part of the graph in FIG. 
1B) depend on the location (distance to and properties of reflective surfaces) and the distance between audio source (S) and listener (L), the effect of reverberation being smaller the smaller the distance between source and listener.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify XU to include the teaching of Jensen et al. in order to separate the target and reverberant signal components of an input signal by converting a time-variant input signal to a (time-variant) signal in the frequency domain.
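For illustration only, the time-frequency conversion referred to in the cited paragraphs [0077] and [0097] of Jensen et al. (framing followed by a short-time Fourier transformation) can be sketched as follows. This is a minimal sketch under stated assumptions and not the TF conversion unit of Jensen et al.: the function name, the Hann window, the 512-sample frame length, the 256-sample hop, and the 1 kHz test tone are hypothetical choices made here.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Return a (frames x frequency bins) magnitude map of a time-domain signal."""
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))           # pad very short inputs
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)                       # assumed analysis window
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One FFT per frame; rfft keeps the non-negative frequency bins only.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input (hypothetical): one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

tf_map = stft_magnitude(tone)
peak_bin = int(np.argmax(tf_map.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 512                   # bin index to frequency
```

Each row of `tf_map` corresponds to one audio frame and each column to one frequency bin, which is the kind of "map" of time-frequency units the cited paragraph [0097] describes.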

Regarding Claims 3 and 13: The method of claim 2, wherein the extracting signal characteristics of the direct component and the reverberant component comprises calculating spectral characteristics of the time-frequency representation of the direct component and the reverberant component (See rejection of Claim 2 and Figs. 1-2.).

Regarding Claim 4: The method of claim 2, wherein the extracting signal characteristics of the direct component and the reverberant component comprises calculating a ratio between the signal characteristics of the direct component and the signal characteristics of the reverberant component (See XU teaching, [0053] Thus, it will be appreciated that by considering the ratio of the energy of the early arriving sounds to the energy of the late arriving sounds, it is possible to infer one or more characteristics of the acoustic scenario such as distance between the source--e.g. speaker--and the microphone, the direction of sound projection relative to the microphone and the acoustic reverberant condition. [0057] Thus, according to one or more examples the classifier may comprise a model. The model may have been trained offline in order to characterize a plurality of different acoustic environments. For example, the model may have been trained using neural networks. The model may be built by deriving one or more metrics in a plurality of different acoustic environments. At runtime it is possible to use the model to identify at least one likely acoustic scenario based on data or features which are extracted from a real-time audio signal. One or more examples described herein at least partially rely upon a model which has been trained using acoustic features of received sound and/or a metric which represents the energy of received sound. Preferably, one or more example described herein at least partially rely upon a model which has been trained based on information about a ratio of the direct to reverberant sound energy or the Direct to Reverberant Ratio (DRR).).

Regarding Claims 5 and 14: The method of claim 1, further comprising dividing the audio signal into a plurality of audio frames, and wherein the processing the audio signal to estimate a direct component and a reverberant component of the sound comprises processing the plurality of audio frames to estimate the direct component and the reverberant component for each of the plurality of audio frames (see rejection of Claim 2).

Regarding Claims 6 and 15:  The method of claim 5, wherein the estimating, by the deep learning system, the distance of the signal source from the device based on the extracted signal characteristics of the direct component and the reverberant component comprises estimating the distance for each of the plurality of audio frames and wherein the method further comprises: detecting that the audio signal from the signal source is an active speech signal (See XU teaching, [0006] Following detection of a predetermined trigger word, or trigger phrase, the voice trigger system is operable to cause a data buffer containing data representing the speech sounds arriving after the trigger phrase to be output for subsequent processing and speech recognition, e.g. by generating an indication to stream data arriving after the trigger phrase to the subsequent speech recognition system. [0078] FIG. 11 is a state transition diagram which illustrates the transitions between a plurality of states of a processing system according to a present example. In a first control state, the device is in mode configured to listen for a trigger word, in other words the processing system is in a "trigger word required" mode. Thus, the processing system is configured to process speech signals derived by a microphone of the device in order to identify one or more features which are indicative of a trigger word or phrase.); and tracking the distance estimated by the deep learning system during a duration of the active speech signal (See XU teaching in rejection of Claim 1 and Figs. 5-6 and Jensen et al. teaching of Claim 2 and Fig. 2.).

Regarding Claims 8 and 17: The method of claim 6, wherein the detecting that the audio signal from the signal source is an active speech signal comprises recognizing a keyword, and wherein the method further comprises computing statistics of the tracked distance estimated by the deep learning system when the keyword is recognized (See XU teaching in rejection of claim 1 and [0003] The device is configured to sense the speech signals and to process the speech signals in order to recognize that the key phrase has been spoken. In response to the detection of the key phrase, the device may be operable to "wake up" or enable the speech recognition function of the device in order that it is receptive to speech commands. [0006] Following detection of a predetermined trigger word, or trigger phrase, the voice trigger system is operable to cause a data buffer containing data representing the speech sounds arriving after the trigger phrase to be output for subsequent processing and speech recognition, e.g. by generating an indication to stream data arriving after the trigger phrase to the subsequent speech recognition system. [0078] Examples of the present aspects may be understood by considering the functionality of the processing system in each of a plurality of states as well as the conditions or requirements for a transition between first and second states. FIG. 11 is a state transition diagram which illustrates the transitions between a plurality of states of a processing system according to a present example. In a first control state, the device is in mode configured to listen for a trigger word, in other words the processing system is in a "trigger word required" mode. Thus, the processing system is configured to process speech signals derived by a microphone of the device in order to identify one or more features which are indicative of a trigger word or phrase. 
In the first control state, other functionality of the device, in particular the speech recognition functionality, is in a stand-by or sleep mode. [0079] Upon the detection of the trigger word, the processing system--in particular the control unit 200--is operable to carry out decision making processing. Thus, the device may be considered to be in an interim control state Si or a decision making state. The determination of a trigger word also initiates or resets a timer which is configured to count down from a pre-set value. The decision making processing may be carried out on a frame by frame basis. [0080] In the decision making state the processing system is operable to derive a representation of the acoustic environment. The acoustic environment may be represented by the DRR. Thus, the processing system may be configured to derive a representation of the DRR of the received audio signal based on a pre-trained model of the DRR variations in a plurality of acoustic environments. The representation of the acoustic environment or DRR is preferably derived on a frame by frame basis. The processing system is further configured to compare the representation of the DRR obtained in a given frame with a predetermined threshold or is configured to choose the category corresponding to the maximum output. This classifies the audio frame to be e.g. close-talk, far-field talk, or noise. If both a trigger word and a close-talk acoustic environment are detected, a decision is issued to cause a transition from the first control state to the second control state.).
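For illustration only, the frame-by-frame threshold comparison described in the cited paragraph [0080] of XU, together with summary statistics over per-frame values of the kind recited in claims 8 and 17, can be sketched as follows. This is a minimal sketch and not the decision logic of XU or of the present application: the function names, the 0 dB threshold, the two-way labelling (the "noise" category is omitted), and the per-frame values are hypothetical.

```python
import numpy as np

def classify_frames(drr_db_per_frame, close_talk_threshold_db=0.0):
    """Label each frame by comparing its DRR (dB) with an assumed threshold."""
    drr = np.asarray(drr_db_per_frame, dtype=float)
    return np.where(drr >= close_talk_threshold_db, "close-talk", "far-field")

def summarize(values):
    """Simple statistics over a sequence of tracked per-frame values."""
    v = np.asarray(values, dtype=float)
    return {
        "mean": float(v.mean()),
        "median": float(np.median(v)),
        "std": float(v.std()),
    }

# Hypothetical per-frame DRR values (dB) observed after a keyword is detected.
frame_drr_db = [4.0, 3.5, -2.0, 5.0]

labels = classify_frames(frame_drr_db)
stats = summarize(frame_drr_db)
```

The same `summarize` helper could be applied to a sequence of per-frame distance estimates instead of DRR values; DRR is used here only because it is the quantity named in the cited paragraph.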

Allowable Subject Matter
Claims 7 and 16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art of record, Markovich Golan et al. (US 2018/0240471 A1), teaches: [0030] Thus, in the second paradigm, the reverberation is modeled as a diffuse noise field. In this case, a minimum variance distortionless response (MVDR) superdirective beamformer may be applied in the STFT domain to reduce reverberations. A steering vector towards the desired speaker may be defined using the early component of the impulse responses (IRs). The relative early impulse responses are estimated by: a) dereverberating the microphone signals using a single channel Wiener filter; b) estimating the relative transfer function of the remaining speech components. See, Schwartz et al., "Multimicrophone speech dereverberation and noise reduction using relative early transfer functions," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 240-251 (2015).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM, whose telephone number is (571) 270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656