DETAILED ACTION
Information Disclosure Statement
The information disclosure statement (IDS) submitted on October 30, 2019 was filed on or after the mailing date of the instant application on October 30, 2019.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 2, 7, and 12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention. Claims 2, 7, and 12 are rejected because of the following informalities: they recite, inter alia, “… wherein a suppression amount to the utterance start direction sound pressure is an estimated amount of a sound pressure based on a sound inputted to one of the first microphone and the second microphone which has directionality in a direction of a sound source at the time point of detecting the utterance start, the sound being inputted in a direction different from the direction of the sound source at the time point of detecting the utterance start.” It is unclear whether the recited “directionality in a direction of a sound source” corresponds to the “sound pressure based on a sound” or the “one of the first microphone and the second microphone.” If the “directionality in a direction of a sound source” corresponds to the “sound pressure based on a sound” that would seem to contradict the rest of the claim language, which states “the sound being inputted in a direction different from the direction of the sound source”.
Therefore, for the purpose of examination, Examiner will interpret Claims 2, 7, and 12 to recite “… wherein (a) one of the first microphone and the second microphone has directionality in a direction of a sound source at the time point of detecting the utterance start, and (b) a suppression amount to the utterance start direction sound pressure is an estimated amount of a sound pressure based on a sound inputted to one of the first microphone and the second microphone, the sound being inputted in a direction different from the direction of the sound source at the time point of detecting the utterance start.” This interpretation appears to be more consistent with the rest of the specification.

Claim Rejections - 35 USC § 103 
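The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.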


Claims 1-4, 6-9, and 11-14 are rejected under 35 U.S.C. 103 as being unpatentable over Furuta et al. (PCT Patent Pub WO2020110228 A1), with an effective filing date of November 28, 2018, in view of Noriaki et al. (Japan Patent 5,549,166 B2). The documents and translated copies of the documents have been provided.
With regard to Claims 1, 6, and 11, Furuta teaches (a) a computer-readable storage medium (recording medium 138), (b) an utterance detection method (various programs for realizing sound source separation processing) of causing a computer to execute processing (via processor 136), and (c) an utterance detection apparatus (sound source separation device 100) comprising: a memory (memory 137); and a processor (processor 136) coupled to the memory, the processor being configured to:
[0054] The hardware configuration of the sound source separation device 100 can be realized by a computer having a built-in CPU (Central Processing Unit), such as a tablet-type portable computer or a microcomputer for embedded devices such as a car navigation system. Alternatively, the hardware configuration of the sound source separation device 100 may be realized by an LSI (Large Scale Integration) circuit such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).
 
[0061] The memory 137 is a ROM (Read Only) used as a program memory for storing various programs for realizing sound source separation processing, a work memory used when the processor 

[0063-0064] The recording medium 138 is used for accumulating various data such as various setting data and signal data of the processor 136. As the recording medium 138, for example, a volatile memory such as SDRAM or a non-volatile memory such as HDD or SSD can be used. It is possible to store programs including an OS (Operating System), various setting data, and various data such as audio signal data. The data in the memory 137 can be stored in the recording medium 138. The processor 136 uses the memory 137 as a working memory, and operates according to the computer program read from the memory 137, whereby it can function as the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111.

detecting an utterance start (See Fig. 4C, 5A and 5B):
Examiner notes that Furuta determines the presence of an utterance by first converting an analog signal into a digital signal divided into frame units and then classifying each frame unit as containing a target sound, a disturbing sound, or both. Furuta also teaches that the target sound is a voice from a speaker (i.e., an utterance). As shown in Fig. 4C, the utterance is detected when the utterance amount ratio reaches a threshold, and the threshold for the target sound differs from the threshold for the disturbing/interfering sound. Because Furuta determines which of the target sound and the disturbing/interfering sound is dominant in the predetermined frame section, the utterance start time is obtained when the utterance amount ratio transitions from one threshold range to another, for example when the ratio rises above 0.5 and the utterance arrives from the direction of the target sound.

[0020] For example, the A/D conversion unit 103 generates the first observation digital signal by sampling the first observation analog signal given from the first microphone 101 at a predetermined sampling frequency and converting it into a digital signal divided in frame units. Similarly, the A/D conversion unit 103 generates a second observation digital signal by sampling the second observed analog signal given from the second microphone 102 at a predetermined sampling frequency and converting it into a digital signal divided in frame units. Here, the sampling frequency is, for example, 16 kHz, and the frame unit is, for example, 16 ms.
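
By way of illustration only, the example values quoted in [0020] imply 16,000 samples/s × 0.016 s = 256 samples per frame. The sketch below shows that arithmetic and a simple frame split. The function names are hypothetical, and the choice of non-overlapping frames with the trailing partial frame dropped is an assumption; the quoted paragraph does not state how Furuta handles either point.

```python
def samples_per_frame(sampling_hz: int = 16_000, frame_ms: int = 16) -> int:
    """Number of samples in one frame for the example values quoted in [0020]."""
    return sampling_hz * frame_ms // 1000


def split_into_frames(samples, sampling_hz: int = 16_000, frame_ms: int = 16):
    """Split a sequence of samples into consecutive frames.

    Assumption (not stated in the quoted text): frames do not overlap and a
    trailing partial frame is discarded.
    """
    n = samples_per_frame(sampling_hz, frame_ms)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
```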

[0017] … Here, the target sound and the disturbing sound will be described as being voices by different single speakers. 

[0097] In other words, by calculating the integral value of the power spectrum of the predetermined frame section, the occupancy rate of the target sound and the disturbing sound in the predetermined frame section, specifically, it is possible to analyze which is speaking longer or which is louder. Therefore, it is possible to determine which voice is dominant at the time of double talk between the target sound and the disturbing sound, and it is possible to separate the sound source with higher accuracy.

[0040] As shown in FIG. 4C, in the case of a frame satisfying SR(τ) < 0.3, there is a high possibility of only disturbing sound, while in the case of a frame satisfying SR(τ) > 0.5, it can be seen that there is a high possibility of only the target sound. Further, when 0.3 ≦ SR(τ) ≦ 0.5, it can be considered that both the target sound and the disturbing sound are present.
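
For reference, the three ranges of the utterance amount ratio SR(τ) quoted in [0040] can be expressed as a small classifier. This is a minimal sketch: the thresholds 0.3 and 0.5 come from the quoted paragraph, while the function name and the label strings are hypothetical.

```python
def classify_frame(sr: float) -> str:
    """Classify one frame from its utterance amount ratio SR(tau).

    Thresholds follow the ranges quoted in [0040]; labels are illustrative.
    """
    if sr < 0.3:
        return "disturbing_only"   # high possibility of only disturbing sound
    if sr > 0.5:
        return "target_only"       # high possibility of only the target sound
    return "both"                  # 0.3 <= SR <= 0.5: both are considered present
```
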
based on a first sound pressure (voice of the target sound speaker) based on first audio data (first observed analog signal, first channel Ch1, first observed digital signal, first short-time spectrum component X1(ω, τ) ) acquired from a first microphone (microphone 101) and a second sound pressure (voice of the disturbing sound/interfering sound speaker)based on second audio data (second observed analog signal, second channel Ch2, second observed digital signal, second short-time spectrum component X2(ω, τ) ) acquired from a second microphone (microphone 102);
[0017] Here, the first observed analog signal acquired by the first microphone 101 is also referred to as a first channel Ch1, and the second observed analog signal acquired by the second microphone 102 is also referred to as a second channel Ch2 …
[0019] The A/D conversion unit 103 performs analog/digital conversion (hereinafter, A/D conversion) on each of the first observation analog signal given from the first microphone 101 and the second observation analog signal given from the second microphone 102, whereby each is converted into a digital signal, and a first observation digital signal and a second observation digital signal are generated.

[0023] Specifically, the T/F conversion unit 104 performs, for example, a fast Fourier transform of 512 points on the first observed digital signal x1(t) to generate a first short-time spectral component X1(ω, τ). Similarly, the T/F conversion unit 104 generates a second short-time spectral component X2(ω, τ) from the second observed digital signal x2(t). In the following, unless otherwise specified, the short-time spectral component of the current frame is simply referred to as a spectral component.

[0017] … The direction in which the target sound arrives is also referred to as the first direction, and the direction in which the disturbing sound arrives is also referred to as the second direction. Here, the target sound and the disturbing sound will be described as being voices by different single speakers. 

[0027] In order to determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the disturbing sound, it is necessary to estimate, using the signals from the first microphone 101 and the second microphone 102, whether the direction from which the sound arrives is in the desired range. Here, since the time difference generated between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, it is possible to estimate the arrival direction by using this time difference. Hereinafter, a description will be given with reference to FIGS. 2 and 3.
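
The relationship between the inter-microphone time difference and the angle θ referred to in [0027] can be illustrated with the common far-field (plane wave) model, in which the delay is δ = d·sin(θ)/c for microphone spacing d and speed of sound c. The sketch below inverts that model. The model itself, the value of c, and the function name are assumptions made for illustration; Furuta's own equations are not reproduced in the quoted passage.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def arrival_angle_deg(delay_s: float, mic_distance_m: float,
                      c: float = SPEED_OF_SOUND_M_S) -> float:
    """Estimate the arrival angle from the inter-microphone time difference.

    Assumes a far-field plane wave, delay = d * sin(theta) / c, with theta
    measured from the direction perpendicular to the microphone axis.
    """
    ratio = c * delay_s / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding or noisy delays
    return math.degrees(math.asin(ratio))
```

Under this model the sign of the delay indicates the side from which the sound arrives, which is consistent with classifying a sound by whether its estimated direction falls within a desired range.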

[0035] The utterance amount ratio calculation unit 107 calculates, based on the cross spectrum D(ω, τ) of the first spectral component X1(ω, τ) of the first channel Ch1 and the second spectral component X2(ω, τ) of the second channel Ch2, the utterance volume ratio, which is the ratio between the utterance volume of the target sound speaker and the utterance volume of the disturbing sound speaker. In other words, the utterance volume ratio is the ratio of the amount of the spectral component, within the first spectral component X1(ω, τ), of the sound coming from the first range including the first direction in which the target sound arrives to the amount of the spectral component of the sound coming from the second range including the second direction in which the disturbing sound arrives.

[0089] … the arrival direction of the target sound is determined by the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) of the equation (1), but as in the equation (13). In addition, in the conditional expression, by combining the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2, which means the angle of the arrival direction, the target speaker and the disturbing sound are calculated from the calculation of the utterance amount. 

[0085] The inputs in the second embodiment include the sounds of the target sound speaker and the disturbing sound speaker captured through the first microphone 101 and the second microphone 102, noise such as automobile running noise, and hands-free calling. It is a received sound of a far-end speaker transmitted from a speaker, a guidance sound transmitted by a car navigation system, an acoustic echo around which car audio music or the like wraps around. Sound other than the voice of the target sound speaker and the disturbing sound speaker is regarded as noise. Further, the noise signal is used as a noise signal. Then, in the second embodiment, the sound coming from a direction not included in the first range including the first direction in which the target sound arrives and the second range including the second direction in which the disturbing sound arrives. …
suppressing (see Fig. 5B, output signal) an utterance direction (arrival direction) sound pressure (either the target sound arriving in a first direction or a disturbing sound/interfering sound arriving in a second direction),
D/A conversion unit 112 generates an output signal by converting the output digital signal y(t) into an analog signal ... FIGS. 5(A) and 5(B) are graphs for explaining the effect in the first embodiment. FIG. 5A is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101, similarly to FIG. 4A. FIG. 5B is a graph showing an example of time variation of the output signal output from the D/A conversion unit 112. As is clear from FIGS. 5A and 5B, it can be seen that most of the disturbing sound is removed from the output signal and only the target sound is separated.

[0017] … Furthermore, the directional range in which the target sound and the disturbing sound can reach shall not change with time. The direction in which the target sound arrives is also referred to as the first direction, and the direction in which the disturbing sound arrives is also referred to as the second direction …

[0024] The mask generation unit 105 receives the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) and calculates the time-frequency filter coefficient bmod(ω, τ), which is a filtering coefficient for performing masking to separate the target sound. … The filtering coefficient for masking a spectral component of the sound arriving from a direction different from the first direction in which the target sound arrives is calculated from the time difference between the time of arrival at the [first] microphone 101 and the time of arrival at the second microphone 102.

[0016] … The sound source separation device 100 forms a masking filter based on the signal in the frequency domain generated from the signal in the time domain acquired by the first microphone 101 and the second microphone 102, and multiplies the signal in the frequency domain corresponding to the signal acquired by the first microphone 101 by the masking filter, thereby obtaining the output signal of the target sound from which the disturbing sound is removed.

[0041] Therefore, by using the utterance ratio SR(τ) obtained by the above equation (8) and controlling the masking intensity according to the mode of the observed analog signal, the target sound can be separated with high separation accuracy and little distortion. More specifically, for example, in a frame having a small utterance ratio SR(τ), increasing the value of the masking filter coefficient strongly suppresses disturbing sounds to improve the separation performance, and in a frame having a large utterance ratio SR(τ), it is possible to control the distortion of the target sound by reducing the value of the masking filter coefficient.

[0042] Returning to FIG. 2, the gain calculation unit 108 uses the utterance ratio SR(τ) obtained by the above equation (8) to calculate, by the following equation (9), a correction gain g(ω, τ) that corrects the constant M in the mask coefficient b(ω, τ) of the above equation (5). … Here, GTgt, GInt, and GDT are predetermined correction gain constants, GTgt is a constant when the observed analog signal is likely to be only the target sound, and GInt … is a constant when there is a high possibility that the observed analog signal is only a disturbing sound, and GDT is a constant when there is a high possibility that both the target sound and the disturbing sound are present in the observed analog signal. In the present embodiment, GTgt=1.5, GDT=0.99, and GInt=0.01 are suitable examples.
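
As an illustration of how the three constants in [0042] might be selected, the sketch below picks a correction gain from the utterance ratio SR(τ). Equation (9) is not reproduced in the quoted passage, so the piecewise selection, which reuses the 0.3 and 0.5 thresholds quoted from [0040], is an assumption, as are the function and constant names. The frequency dependence of g(ω, τ) and the manner in which the gain is applied to M are not modeled.

```python
# Example constants quoted in [0042].
G_TGT = 1.5    # observed signal is likely only the target sound
G_DT = 0.99    # both target and disturbing sound are likely present
G_INT = 0.01   # observed signal is likely only a disturbing sound


def correction_gain(sr: float) -> float:
    """Select a correction gain from the utterance ratio SR(tau).

    Assumption: a piecewise selection on the [0040] thresholds; the actual
    form of Furuta's equation (9) is not shown in the quoted text.
    """
    if sr > 0.5:
        return G_TGT
    if sr < 0.3:
        return G_INT
    return G_DT
```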

[0047] As a further effect of correcting by frequency in this way, when environmental noise is mixed in the observed noise, masking to acoustic signals (for example, noise or music) other than the target voice or abnormal sound is masked. Since the influence of noise is reduced, unpleasant artificial noise (musical tone) caused by unnecessary masking against environmental noise is reduced, malfunction of the voice recognition device or abnormal sound monitoring device due to artificial noise is reduced, and hands-free calling is performed. It also has the side effect of reducing the unpleasant noise of time.

which is one of the first sound pressure and the second sound pressure being larger (speaking longer or speaking louder)
Examiner notes that Furuta determines the presence of an utterance by first converting an analog signal into a digital signal divided into frame units. It then classifies each frame unit as containing a target sound and/or a disturbing sound. Furuta also teaches that the target sound is a voice from a speaker (i.e. an utterance).
 
[0097] In other words, by calculating the integral value of the power spectrum of the predetermined frame section, the occupancy rate of the target sound and the disturbing sound in the predetermined frame section, specifically, it is possible to analyze which is speaking longer or which is louder. Therefore, it is possible to determine which voice is dominant at the time of double talk between the target sound and the disturbing sound, and it is possible to separate the sound source with higher accuracy.

[0040-0041] As shown in FIG. 4C, in the case of a frame satisfying SR(τ) < 0.3, there is a high possibility of only disturbing sound, while in the case of a frame satisfying SR(τ) > 0.5, it can be seen that there is a high possibility of only the target sound. Further, when 0.3 ≦ SR(τ) ≦ 0.5, it can be considered that both the target sound and the disturbing sound are present. Therefore, by using the utterance ratio SR(τ) obtained by the above equation (8) and controlling the masking intensity according to the mode of the observed analog signal, the target sound can be separated with high separation accuracy and little distortion. More specifically, for example, in a frame having a small utterance ratio SR(τ), increasing the value of the masking filter coefficient strongly suppresses disturbing sounds to improve the separation performance, and in a frame having a large utterance ratio SR(τ), it is possible to control the distortion of the target sound by reducing the value of the masking filter coefficient.

[0095-0096] The utterance volume ratio calculation unit 307 calculates the utterance volume ratio SR(τ) by using the above equation (8), and further smooths SR(τ) with the utterance amount ratio SR(τ-1) of the preceding frame by using the following equation (14) … Here, α is a smoothing coefficient. By smoothing the most recently calculated utterance volume ratio with the utterance volume ratio calculated in the past, the utterance volume ratio can be calculated stably even if noise is mixed in the observed analog signal, and even more accurate sound source separation becomes possible.
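
The smoothing described in [0095-0096] resembles a first-order recursive (exponential) smoother. Equation (14) is not reproduced in the quoted passage, so the sketch below assumes the common form SR_s(τ) = α·SR_s(τ-1) + (1-α)·SR(τ). The function name, the default α, and the treatment of the first frame are likewise assumptions.

```python
def smooth_ratio(sr_values, alpha: float = 0.9):
    """Smooth a sequence of per-frame utterance ratios.

    Assumed form (equation (14) is not quoted):
        s[t] = alpha * s[t-1] + (1 - alpha) * sr[t]
    The first frame is passed through unchanged.
    """
    smoothed = []
    prev = None
    for sr in sr_values:
        prev = sr if prev is None else alpha * prev + (1.0 - alpha) * sr
        smoothed.append(prev)
    return smoothed
```

A larger α weights past frames more heavily, which trades responsiveness for stability when noise is mixed into the observed signal.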

Furuta does not teach using the detected utterance to timestamp an utterance start and an utterance end based on the suppressed utterance start direction sound pressure, or comparing sound pressures at a time point of detecting the utterance to see which one is larger. Noriaki, however, teaches detecting an utterance start time point (note start timing):
[0007] In the sound processing device of the present invention, the sound pressure transition specifying means specifies, from the input sound, a sound pressure transition which represents a transition of the sound pressure in the input sound over the course of time. A start timing detection means detects, as a note start timing, each time at which, as time elapses within a section in which the specified sound pressure transition is monotonically increasing, an increase rate of the sound pressure in a first prescribed period defined in the sound pressure transition becomes equal to or larger than a prescribed value defined in advance.
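
One possible reading of the start timing detection quoted in [0007] is sketched below: a start is reported when the sound pressure has increased monotonically over a prescribed period and the increase over that period first reaches a prescribed value. The function name, the parameter names, the use of a fixed sample window as the "prescribed period", and the rule of reporting only the first qualifying point of each rise are assumptions, not Noriaki's stated implementation.

```python
def detect_start_timings(pressure, period: int = 2, min_increase: float = 3):
    """Return indices at which a note/utterance start is detected.

    Illustrative reading of [0007]: within a strictly increasing stretch of
    `period` steps, a start is flagged when the increase over that stretch
    first becomes >= `min_increase`.
    """
    starts = []
    was_above = False
    for i in range(period, len(pressure)):
        window = pressure[i - period:i + 1]
        rising = all(a < b for a, b in zip(window, window[1:]))
        above = rising and (window[-1] - window[0]) >= min_increase
        if above and not was_above:
            starts.append(i)  # first point at which the condition is met
        was_above = above
    return starts
```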

[0023] Note that when the note period estimation means in the voice processing apparatus of the present invention is configured …, the note period estimation means further includes … a sound pressure variation time point in which sound pressure in the sound pressure change is lower than or equal to the sound pressure in the previous start timing, in advance of a time progression from the rear start timing. It is preferable that the time point of sound pressure change is specified as a note end timing which is paired with the previous start timing.

[0048] In S150, start and end timing estimation processing is performed, which estimates the start timing and the end timing of each pronunciation period, that is, each period of the input voice in which utterance continued at or above a prescribed sound pressure. 

Furthermore, Furuta does not explicitly teach suppressing an utterance start direction sound pressure when the utterance start direction sound pressure falls below a non-utterance start direction sound pressure. However, Furuta teaches determining the arrival direction of the source sound and/or the disturbing sound, which is assumed to be constant through time (see [0027]), and Noriaki teaches suppressing an utterance start sound pressure when the utterance start sound pressure falls below a non-utterance start sound pressure (average pressure), comparing sound pressures at a time point of detecting the utterance to see which one is larger, and detecting an utterance end (note end timing) based on the suppressed utterance start direction sound pressure:
[0015] ... Incidentally, as described in claim 3, the musical note period estimating means may be configured so that the time point at which the sound pressure in the sound pressure change has become lower than the sound pressure at the note start timing is specified as the note end timing, which is paired with the note start timing … 

[0108] Subsequently, in S 520, based on the sound pressure for each unit interval derived in S 510, a sound pressure transition representing a change in sound pressure along the time progression of the input sound is derived.

[0109] Then, in step S 530, as shown in FIG. 11 a, a noise sound pressure of a predetermined size is subtracted from each sound pressure corresponding to each unit interval in the smoothed sound pressure transition. At this time, for the sound pressure in which the subtraction result becomes negative, the value is set to 0.
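
The subtraction step quoted in [0109] amounts to subtracting a noise level from each per-interval sound pressure and clamping negative results to 0. A minimal sketch follows; the function and parameter names are hypothetical.

```python
def subtract_noise(pressures, noise_pressure):
    """Subtract a noise sound pressure from each unit-interval sound pressure.

    As described in [0109], any result that would be negative is set to 0.
    """
    return [max(p - noise_pressure, 0) for p in pressures]
```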

[0023] Note that when the note period estimation means in the voice processing apparatus of the present invention is configured …, the note period estimation means further includes …  a sound pressure variation time point in which sound pressure in the sound pressure change is lower than or equal to the sound pressure in the previous start timing, in advance of a time progression from the rear start timing. It is preferable that the time point of sound pressure change is specified as a note end timing which is paired with the previous start timing.

[0135] For example, assume that, by executing the start/end timing estimation process, sound generation start timings (first to fourth sound generation start timings) and a sound generation end timing (first sound generation end timing) are specified as shown in FIG. 12 a, and that the determination target section including the third sound generation start timing and the fourth sound generation start timing is specified as the vibrato period (Step S 2). In such a case, since the third sound generation start timing and the fourth sound generation start timing are removed as timings within the period, only two, the first sound generation start timing and the second sound generation start timing, remain as shown in FIG. 12 b. Note that all of the sound generation end timings are left without being removed.

[0183] In addition, in Step S 530 in the start end timing estimation process of the above embodiment, a predetermined defined value is used as a noise sound pressure, but the noise sound pressure is not limited thereto. For example, the average sound pressure from the start point of time along the time course of the processed sound data to the first sound generation start timing along the time progression may be set as the noise sound pressure, or a larger value of the specified value and the average sound pressure may be used as the noise sound pressure.
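
The alternatives for the noise sound pressure described in [0183] can be sketched as follows: the average sound pressure before the first sound generation start timing, or the larger of that average and a specified value. The function name, the parameter names, and the handling of an empty leading section are assumptions made for illustration.

```python
def noise_pressure(pressures, first_start_index: int,
                   specified_value: float, use_larger: bool = True) -> float:
    """Choose a noise sound pressure per the alternatives in [0183].

    `pressures[:first_start_index]` is the section from the start of the data
    up to the first sound generation start timing.
    Assumption: if that section is empty, its average is taken as 0.0.
    """
    leading = pressures[:first_start_index]
    average = sum(leading) / len(leading) if leading else 0.0
    return max(specified_value, average) if use_larger else average
```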

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of 
With regard to Claims 2, 7, and 12, Furuta teaches (a) a computer-readable storage medium (recording medium 138), (b) an utterance detection method (various programs for realizing sound source separation processing) of causing a computer to execute processing (via processor 136), and (c) an utterance detection apparatus (sound source separation device 100) comprising: a memory (memory 137); and a processor (processor 136) coupled to the memory, wherein a suppression amount (masking strength) to the utterance start direction (arrival direction) sound pressure is an estimated amount (mask coefficient b(ω, τ),  time-frequency filter coefficient bmod(ω, τ)) of a sound pressure (estimated filtering percentage of the arriving interfering sound/disturbing sound, the lower the coefficient percentage the higher the suppression amount) based on a sound (observation sound) inputted to one of the first microphone and the second microphone (See Fig. 10),
[0041] … using the utterance ratio SR(τ) obtained by the above equation (8) and controlling the masking intensity according to the mode of the observed analog signal, the target sound can be separated with high separation accuracy and little distortion. More specifically, for example, in a frame having a small utterance ratio SR(τ), increasing the value of the masking filter coefficient strongly suppresses disturbing sounds to improve the separation performance, and in a frame having a large utterance ratio SR(τ), it is possible to control the distortion of the target sound by reducing the value of the masking filter coefficient.

[0034] The mask coefficient b(ω, τ) represented by the equation (5) is 1 when it is presumed to be the target sound and M when it is presumed to be a disturbing sound. Here, when M=0, the mask coefficient is a binary of 1 or 0, so a filter having such a mask coefficient is called a binary mask. A decimal number other than binary may be used as the filter coefficient, and the filter in this case is also called a soft mask. However, the filter coefficient is a value less than 1 for both the target sound and the disturbing sound. In this embodiment, for example, M=0.5 is used.
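
The mask coefficient described in [0034] takes the value 1 for bins presumed to be the target sound and M for bins presumed to be a disturbing sound, with M=0 giving a binary mask and M=0.5 given as the example for a soft mask. The sketch below illustrates that rule and the multiplication of the mask with a spectrum. The function names are hypothetical, and the per-bin target/disturbing decision is taken as a given input because equation (5) is not reproduced in the quoted passage.

```python
def mask_coefficients(is_target_bins, m: float = 0.5):
    """Return b = 1 for bins presumed target and b = M otherwise ([0034])."""
    return [1.0 if is_target else m for is_target in is_target_bins]


def apply_mask(spectrum, coefficients):
    """Multiply each spectral component by its mask coefficient."""
    return [b * x for b, x in zip(coefficients, spectrum)]
```

With `m=0.0` the result is a binary mask that removes bins presumed to be disturbing sound entirely; with `m=0.5` those bins are attenuated rather than removed.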

[0053] FIGS. 5(A) and 5(B) are graphs for explaining the effect in the first embodiment. FIG. 5A is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101, similarly to FIG. 4A. FIG. 5B is a graph showing an example of time variation of the output signal output from the D/A conversion unit 112. As is clear from FIGS. 5A and 5B, it can be seen that most of the disturbing sound is removed from the output signal and only the target sound is separated.

[0089] … determine whether the arrival direction is the target sound direction or the disturbing sound direction

[0027] In order to determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the disturbing sound, it is necessary to estimate, using the signals from the first microphone 101 and the second microphone 102, whether the direction from which the sound arrives is in the desired range. Here, since the time difference generated between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, it is possible to estimate the arrival direction by using this time difference. Hereinafter, a description will be given with reference to FIGS. 2 and 3.

[0090-0091] FIG. 10 is a schematic diagram showing an example of a method for excluding the influence of noise other than the target sound and the disturbing sound in the equation (13). In the example of FIG. 10, the exclusion range is described with reference to the first channel Ch1. As shown in FIG. 10, by setting the exclusion range in the calculation of the utterance amount, the influence of noise other than the target sound and the disturbing sound can be excluded, so that the calculation accuracy of the utterance amount ratio is improved and a sound source separation device of even higher quality can be configured. Since the sound source separation device 200 according to the second embodiment is configured as described above, it is possible to create a masking filter having high separation performance at a low calculation cost even under various noise conditions. Therefore, since the target sound can be accurately acquired even under the noise in the automobile, a high-precision voice recognition device, a high-quality hands-free calling device, or an abnormal sound monitoring device for detecting abnormal sound in the automobile can be provided.

[0017-0018] … The direction in which the target sound arrives is also referred to as the first direction, and the direction in which the disturbing sound arrives is also referred to as the second direction. Here, the target sound and the disturbing sound will be described as being voices by different single speakers. The first microphone 101 generates the first observation analog signal by converting the observation sound into an electric signal. The first observed analog signal is given to the A / D conversion unit 103. The second microphone 102 generates a second observation analog signal by converting the observation sound into an electric signal. The second observed analog signal is given to the A / D conversion unit 103.

which has directionality in a direction (see Fig. 3, first direction) of a sound source (see Fig. 3, target sound) at the time point of detecting the utterance start,
[0014] FIG. 2 is a block diagram schematically showing an internal configuration of a mask generation unit according to the first to third embodiments. FIG. 3 is a schematic diagram for explaining the arrangement of the first microphone and the second microphone and the direction of arrival of the target sound. FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio when the target sound speaker and the disturbing sound speaker speak.

[0017] Here, the first observed analog signal acquired by the first microphone 101 is also referred to as a first channel Ch1, and the second observed analog signal acquired by the second microphone 102 is also referred to as a second channel Ch2. Further, for simplification of the following description, as shown in FIG. 3, the first microphone 101 and the second microphone 102 are located on the same horizontal plane, and their positions are known and shall not change over time. Furthermore, the directional range in which the target sound and the disturbing sound can reach shall not change with time. The direction in which the target sound arrives is also referred to as the first direction, and the direction in which the disturbing sound arrives is also referred to as the second direction.

the sound being inputted in a direction (disturbing sound direction) different from the direction (target sound direction) of the sound source at the time point of detecting the utterance start (See Figs. 4A-4B, the target sound and the disturbing sound arrive from different directions and see Figs. 5A-5B, the disturbing sound is suppressed).
[0024-0027] The mask generation unit 105 receives the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) and calculates the time-frequency filter coefficient bmod(ω, τ), which is a filtering coefficient for performing masking to separate the target sound. For example, the mask generation unit 105 uses the cross-correlation function of the first spectral component X1(ω, τ) and the second spectral component X2(ω, τ) to calculate, from the time difference between the time at which the observation sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, the filtering coefficient for masking a spectral component of the sound arriving from a direction different from the first direction in which the target sound arrives. In determining the time-frequency filter coefficient bmod(ω, τ), as shown in FIG. 3, it is assumed that, in the horizontal plane where the first microphone 101 and the second microphone 102 are provided, the target sound arrives from a direction included in a predetermined angle θ with respect to the perpendicular direction V1 of the first microphone 101 and the perpendicular direction V2 of the second microphone 102. The disturbing sound is assumed to arrive from the side opposite to the target sound with respect to the perpendicular direction V1 of the first microphone 101 and the perpendicular direction V2 of the second microphone 102. Here, the perpendicular directions V1 and V2 are perpendicular to the straight line connecting the first microphone 101 and the second microphone 102. The directions V1 and V2 need only be predetermined reference directions and are not necessarily perpendicular. Further, it is assumed that the distance between the first microphone 101 and the second microphone 102 is the distance d.
In order to determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the disturbing sound, it is necessary to estimate, using the signals from the first microphone 101 and the second microphone 102, whether the direction from which the sound arrives is in the desired range. Here, since the time difference generated between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, it is possible to estimate the arrival direction by using this time difference. Hereinafter, a description will be given with reference to FIGS. 2 and 3.
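
By way of illustration only, the relationship between the inter-microphone time difference and the angle θ described in [0027] can be sketched as follows. The sketch assumes a far-field plane-wave model, delay = d·sin(θ)/c, which is a standard textbook relation and is not quoted from Furuta. The function names, the speed-of-sound constant, and the sign convention (a positive delay indicating the target side) are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from Furuta


def delay_from_angle(theta_rad, mic_distance_m, c=SPEED_OF_SOUND):
    """Far-field time difference for a source at angle theta from the perpendicular."""
    return mic_distance_m * math.sin(theta_rad) / c


def angle_from_delay(delay_s, mic_distance_m, c=SPEED_OF_SOUND):
    """Invert the far-field model; clamp to [-1, 1] to tolerate measurement noise."""
    s = delay_s * c / mic_distance_m
    s = max(-1.0, min(1.0, s))
    return math.asin(s)


def in_target_range(delay_s, mic_distance_m, max_angle_rad):
    """True when the estimated angle lies on the (assumed) target side within max_angle_rad."""
    angle = angle_from_delay(delay_s, mic_distance_m)
    return 0.0 < angle <= max_angle_rad
```

Under these assumptions, a sound whose estimated angle falls on the opposite side of the perpendicular (negative delay) would be treated as a disturbing sound.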

[0039] … FIG. 4B is a graph showing an example of time variation in the amount of speech between the target sound speaker and the disturbing sound speaker.

mod(ω, τ)) to the utterance start direction sound pressure is set to a larger value as a degree of similarity (utterance amount ratio) of the first audio data to the second audio data becomes larger:
[See Figs. 4C and 5B.] Examiner notes that during the periods when only the disturbing sound is present, the similarity between 

[0043] That is, the gain calculation unit 108 calculates the correction gain for correcting the mask coefficient so that the higher the utterance amount ratio, the lower the intensity at which masking is performed.

[0035] The utterance amount ratio calculation unit 107 receives the first spectral component X1(ω, τ) of the first channel Ch1, the second spectral component X2(ω, τ) of the second channel Ch2, and the cross spectrum D(ω, τ), and calculates the utterance amount ratio, which is the ratio between the utterance amount of the target sound speaker and the utterance amount of the disturbing sound speaker. In other words, the utterance amount ratio is the ratio of the amount of the spectral component, within the first spectral component X1(ω, τ), of the sound coming from the first range including the first direction in which the target sound arrives to the amount of the spectral component of the sound coming from the second range including the second direction in which the disturbing sound arrives.

[0040-0041] As shown in FIG. 4C, in the case of a frame satisfying SR(τ) < 0.3, there is a high possibility of only disturbing sound, while in the case of a frame satisfying SR(τ) > 0.5, it can be seen that there is a high possibility of only target sound … Therefore, by using the utterance ratio SR(τ) obtained by the above equation (8) and controlling the masking intensity according to the mode of the observed analog signal, the target sound can be separated with high separation accuracy and little distortion. More specifically, for example, in a frame having a small utterance ratio SR(τ), increasing the value of the masking filter coefficient strongly suppresses disturbing sounds to improve the separation performance, and in a frame having a large utterance ratio SR(τ), it is possible to control the distortion of the target sound by reducing the value of the masking filter coefficient.
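
For illustration, the control described in [0040-0041] and [0043] (stronger masking for a small utterance ratio, weaker masking for a large one) can be sketched as below. Only the thresholds 0.3 and 0.5 come from the cited passage. Equation (8) is not reproduced here, and the linear interpolation between the thresholds, the function name, and the `strong`/`weak` values are assumptions made for the sketch rather than Furuta's actual gain calculation.

```python
def masking_strength(sr, low=0.3, high=0.5, strong=1.0, weak=0.2):
    """Map an utterance ratio SR to a masking strength (hypothetical mapping).

    sr <= low  : frame is likely disturbing sound only, so mask strongly.
    sr >= high : frame is likely target sound only, so mask weakly to limit distortion.
    otherwise  : interpolate linearly between the two (an assumption).
    """
    if sr <= low:
        return strong
    if sr >= high:
        return weak
    t = (sr - low) / (high - low)
    return strong + t * (weak - strong)
```

The mapping is monotonically non-increasing in SR, which matches the stated direction of the correction: the higher the utterance amount ratio, the lower the masking intensity.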

1(ω, τ) ) and the second audio data (see Fig. 2, (X2(ω, τ) ):
Examiner notes that in [0081-0083] of the instant application, applicant defines the correlation coefficient as integrating the power spectrum across frequencies. Furthermore applicant notes that “A small correlation coefficient presumably represents a 

[0028] … the cross spectrum D(ω,τ) is calculated as shown in the following equation (1). Then, the mask coefficient calculation unit 106 gives the calculated cross spectrum D(ω,τ) to the utterance amount ratio calculation unit 107.

[0097] In other words, by calculating the integral value of the power spectrum of the predetermined frame section, it is possible to analyze the occupancy rate of the target sound and the disturbing sound in the predetermined frame section, specifically, which is speaking longer or which is louder. Therefore, it is possible to determine which voice is dominant at the time of double talk between the target sound and the disturbing sound, and it is possible to separate the sound source with higher accuracy.


	Claims 5, 10, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Furuta in view of Noriaki, further in view of Furukawa (US Patent Pub 2019/0303443), which claims priority to US Provisional Patent 62/649,904 filed on March 29, 2018.
With regard to Claims 5, 10, and 15, the combination of Furuta and Noriaki teaches all of the limitations of Claims 1, 6, and 11 respectively, as noted above. Additionally, Furuta teaches (a) a computer-readable storage medium (recording medium 138), (b) an utterance detection method (various programs for realizing sound source separation processing) of causing a computer to execute processing (via processor 136), and (c) an utterance detection apparatus (sound source separation device 100) comprising: a memory (memory 137); and a processor (processor 136) coupled to the memory, wherein the program further causes the computer to execute processing comprising: determining a direction (arrival direction) of a sound source (voice of the target sound speaker / voice of the disturbing sound speaker) of each of the first audio data (first observed analog signal, first channel Ch1, first observed digital signal, first short-time spectrum component X1(ω, τ)) and the second audio data (second observed analog signal, second channel Ch2, second observed digital signal, second short-time spectrum component X2(ω, τ)) based on a sound pressure
[0089] … the arrival direction of the target sound is determined by the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) of the equation (1), but, as in the equation (13), by additionally combining in the conditional expression the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2, which represents the angle of the arrival direction, the influence of noise other than the target sound speaker and the disturbing sound speaker can be excluded from the calculation of the utterance amount. Here, δθDT and δθDN are threshold values of the time difference of the observed analog signal to be excluded from the calculation of the utterance amount, and are predetermined constants obtained by converting the arrival direction angle into the time difference.
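
As a rough sketch of the determination described in [0089], the following shows how the sign of the imaginary part of a cross spectrum, combined with a time-difference threshold, could classify a single time-frequency bin. Equations (1) and (13) of Furuta are not reproduced in this action, so the definition D = X1·conj(X2), the assignment of a positive imaginary part to the target side, the single threshold `min_delay` (standing in for δθDT and δθDN), and all function names are assumptions.

```python
import cmath
import math


def cross_spectrum(x1, x2):
    """Cross spectrum of two complex spectral components (assumed convention)."""
    return x1 * x2.conjugate()


def time_difference(d, omega):
    """Inter-channel time difference implied by the phase of the cross spectrum."""
    return cmath.phase(d) / omega


def classify_bin(x1, x2, omega, min_delay):
    """Label one bin as 'target', 'disturbing', or 'excluded' (hypothetical rule)."""
    d = cross_spectrum(x1, x2)
    delta = time_difference(d, omega)
    if abs(delta) < min_delay:
        # Too close to the reference direction: leave out of the utterance amount.
        return "excluded"
    return "target" if d.imag > 0 else "disturbing"
```

With this convention, a component that reaches the first channel before the second yields a positive imaginary part. The phase-based delay is only unambiguous while |ω·δ| stays below π.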

[0027] In order to determine whether the sound collected by the first microphone 101 and the second microphone 102 is the target sound or the disturbing sound, it is necessary to estimate, using the signals from the first microphone 101 and the second microphone 102, whether the direction from which the sound arrives is in the desired range. Here, since the time difference generated between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, it is possible to estimate the arrival direction by using this time difference. Hereinafter, a description will be given with reference to FIGS. 2 and 3.

[0097] In other words, by calculating the integral value of the power spectrum of the predetermined frame section, it is possible to analyze the occupancy rate of the target sound and the disturbing sound in the predetermined frame section, specifically, which is speaking longer or which is louder. Therefore, it is possible to determine which voice is dominant at the time of double talk between the target sound and the disturbing sound, and it is possible to separate the sound source with higher accuracy.


[2019/0303443: 0006; 62/649,904: Pg 2, lines 20-26] … (i) identifies that an utterer who utters speech is one of the user and the conversation partner, based on the sound source direction estimated by the sound source direction estimator after the start of the translation is instructed by the translation start button, using a positional relationship indicated by a layout information item selected in advance from a plurality of layout information items that are stored in storage and respectively indicate different positional relationships between the user, the conversation partner, and a display, and (ii) determines a translation direction indicating an input language in which content of the acoustic signal is recognized and an output language into which the content of the acoustic signal is translated, the input language being one of a first language and a second language and the output language being the other one of the first language and the second language; a translator which obtains, according to the translation direction determined by the controller, (i) original text indicating the content of the acoustic signal obtained by causing a recognition processor to recognize the acoustic signal in the input language and (ii) translated text indicating the content of the acoustic signal obtained by causing a translation processor to translate the original text into the output language; 


[2019/0303443: 0066; 62/649,904: Pg 6, line 20-Pg 7, line 13] In addition, for example, the speech translation apparatus may further include: a speech determiner which determines whether the acoustic signal obtained by the microphone array unit includes speech, wherein the controller may determine the translation direction only when (i) the acoustic signal is determined to include speech by the speech determiner and (ii) the sound source direction estimated by the sound source direction estimator indicates the position of the user or the position of the conversation partner in the positional relationship indicated by the layout information item.

[2019/0303443: 0079; 62/649,904: Pg 8, lines 1-9, Pg 7, line 13-18] Speech translation apparatus 100 is an apparatus which translates bi-directionally conversation between user 51 who utters in a first language and conversation partner 52 who utters in a second language. In other words, speech translation apparatus 100 is an apparatus which recognizes each of the languages of the utterances by user 51 and conversation partner 52 among the two different languages of the utterances by user 51 and conversation partner 52, and translates each utterance in one of the languages into an utterance in the other language. Speech translation apparatus 100 is configured to have an elongated shape such as a card for example, and is implemented as a mobile terminal such as a card-shaped terminal, a smartphone, and a tablet. As illustrated in FIG. 1, speech translation apparatus 100 includes: microphone array unit 200 including a microphone array of a plurality of microphones for receiving utterances; and display 300 which displays a result of translation as text. It is to be noted that display 300 is used in portrait orientation or in landscape orientation. 

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of Furuta with the components of Noriaki that add the capability to timestamp utterance start and end times, as taught by Noriaki, to further suppress one voice source over the other based on the methods provided by Furuta and Noriaki. Furthermore, such a system is capable of performing the voice activity detection of Noriaki either before or after the voice recognition and sound source separation of Furuta, to detect utterance start
Additionally, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of Furuta and Noriaki by adding the capability to identify a language based on the direction of sound observed, as taught by Furukawa in order to process time ranges of speech data with detected voice utterances arriving from directional regions with a specified target source, and translate the speech data with high precision even if multiple users are speaking in different directions. That is, it would have been obvious to combine the system of Furuta and Noriaki with the features set forth in Furukawa to create a convenient and fast translation program that is easy to use for the user and the conversation partner and avoids the painstaking need to interact with the device and perform button operations (such as switching the input and output languages) before making each utterance in a conversation. (See [0004], [0074] of Furukawa).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Dean Webb whose telephone 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on (571) 272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/Paras D Shah/ Primary Examiner, Art Unit 2659
02/21/2021