DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-5, 7-9, 11-14, 16-18, and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-5, 8-9, 11, 13-14, 17-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yoshioka (US 2009/0192788 A1), in view of Gerkmann et al. (Gerkmann, T., Krawczyk, M., & Martin, R. (2010, March). Speech presence probability estimation based on temporal cepstrum smoothing. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4254-4257). IEEE.), hereinafter referred to as Gerkmann, and further in view of Wu et al. (US 11,164,592 B1), hereinafter referred to as Wu.

Regarding claim 1, Yoshioka teaches:
A voice processing method comprising: 
estimating a probability of an audio signal representing sound collected by a first microphone, including a person's voice (Fig. 2 element 34, para [0046], where the index D1 is the probability value) based on a cepstrum waveform of the audio signal (para [0084], where the modulation spectrum is determined using cepstral analysis, which is used to calculate D1 as in para [0047]) by:
determining a gain of the audio signal to be:
from among a range of zero to one in a state where the first probability value is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1); and
zero in a state where the second probability value is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1);
processing the audio signal based on the determined gain of the audio signal to improve an audio quality at a far-end side (Fig. 2 element 44, para [0053], where the processing determines the output signal based on the D1 and magnitude values, to emit only sounds that the user needs to hear); and 
sending the processed audio signal to the far-end side, where a voice processing device, located at the far-end side reproduces the received processed audio signal to emit sound from a speaker (Fig. 1 element Sout, R2, para [0036], [0053], where the output Sout is sent to a separate space R2 and sound is emitted).  
Yoshioka does not teach:
detecting a peak in a high-order cepstrum of the cepstrum waveform;
setting a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal; and
setting a second probability value of zero in a case where the peak in the high-order cepstrum is not detected; 
determining a gain of the audio signal to be from among a range of greater than zero and less than one in a state where the first probability value is set;
Gerkmann teaches:
detecting a peak in a high-order cepstrum of the cepstrum waveform (page 4256, section 5, second paragraph, where a maximum in the upper cepstrum is searched for);
setting a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal (page 4256, section 5, second paragraph, where a peak that is greater than a threshold is considered voiced, interpreted as a high probability value); and
setting a second probability value in a case where the peak in the high-order cepstrum is not detected (page 4256, section 5, second paragraph, where a peak is lower than a threshold, the signal segment is assumed to be unvoiced, interpreted as a low probability value);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka by using the cepstral analysis of Gerkmann (Gerkmann page 4256, section 5, second paragraph) in the speech detection of Yoshioka (Yoshioka Fig. 2 element 34, para [0046]) in order to estimate speech presence at each time-frequency point (Gerkmann page 4254 section 1 second paragraph).
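Yoshioka in view of Gerkmann does not teach that the second probability value is zero, or that the gain is from among a range of greater than zero and less than one in a state where the first probability value is set.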
Wu teaches:
setting probabilities to zero (col. 13 lines 6-24, where binary outputs of 1 indicate presence of target speech and 0 indicate absence of target speech), and
determining a gain of the audio signal to be: from among a range of greater than zero and less than one in a state where the first probability value is set (col. 13 lines 6-24, 56-64, where a gain is determined to be greater than 0 and less than 1, when voice activity is detected);
The prior art of Yoshioka in view of Gerkmann contained a method which differed from the claimed method by the substitution of some components (the probability of Gerkmann, page 4256, section 5, second paragraph) with other components (the binary VAD outputs and gains of Wu, col. 13 lines 6-24, 56-64); the substituted components and their functions were known in the art; one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable.
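For illustration only, the combination applied to claim 1 (peak detection in the high-order cepstrum, probability setting, and gain determination) can be sketched as follows. This is a minimal sketch and is not taken from Yoshioka, Gerkmann, or Wu: the function names, the quefrency search range, the threshold, and the mapping from probability to gain are all hypothetical assumptions chosen for the example.

```python
import cmath
import math


def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum.

    A naive O(n^2) DFT keeps the sketch free of external dependencies.
    """
    n = len(frame)
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Small offset avoids log(0) for empty bins (assumed value).
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


def voice_probability(frame, q_min=10, q_max=40, threshold=0.5):
    """Search the upper (high-order) cepstrum for a peak.

    q_min, q_max, and threshold are hypothetical parameters. If no peak
    reaches the threshold, the probability is zero; otherwise it follows
    the peak level, clamped to at most 1.
    """
    ceps = real_cepstrum(frame)
    peak = max(ceps[q_min:q_max + 1])
    if peak < threshold:
        return 0.0
    return min(1.0, peak)


def gain_from_probability(probability):
    """Hypothetical mapping: zero gain for zero probability, otherwise a
    gain strictly between zero and one (here 0.1 to 0.9)."""
    if probability == 0.0:
        return 0.0
    return 0.1 + 0.8 * probability
```

In this sketch, a frame with a periodic excitation (for example, an impulse train with a period of 20 samples) produces a cepstral peak at the corresponding quefrency and therefore a nonzero probability and a gain between zero and one, whereas a silent frame produces a probability of zero and a gain of zero.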

Regarding claim 2, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, further comprising: 
estimating an audio signal-to-noise (SN) ratio in the audio signal according to sound collected by the first microphone (Yoshioka Fig. 10 element 64, para [0068], where an SN ratio is calculated), wherein 
the determining determines the gain of the audio signal to be from among the range of zero to one based on the estimated SN ratio in the state where the first probability value is set (Yoshioka Fig. 10 element 44, para [0066-69], [0053-55], where the index D1 probability value is altered by the weighting from the SN ratio to D3, which is used in the gain determination).  
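For illustration only, a gain determination that depends on an estimated SN ratio while a nonzero probability value is set can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the weighting of Yoshioka: the function names, the decibel range, and the linear weighting are hypothetical.

```python
import math


def snr_db(signal_power, noise_power, floor=1e-12):
    """SN ratio in decibels; the floor guards against division by zero."""
    return 10.0 * math.log10(max(signal_power, floor) / max(noise_power, floor))


def gain_from_snr(probability, snr, snr_low=0.0, snr_high=20.0):
    """Hypothetical rule: zero gain when the probability is zero; otherwise
    the probability is weighted by the SN ratio, scaled linearly from
    snr_low to snr_high, giving a gain in the range of zero to one."""
    if probability <= 0.0:
        return 0.0
    weight = (snr - snr_low) / (snr_high - snr_low)
    weight = min(1.0, max(0.0, weight))
    return min(1.0, probability) * weight
```

Under these assumptions, a frame with a probability of 1.0 and an SN ratio of 10 dB receives a gain of 0.5, and any frame with a probability of zero receives a gain of zero regardless of the SN ratio.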

Regarding claim 4, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, wherein the processing gradually reduces or instantly increases the determined gain of the audio signal (Yoshioka Fig. 14, para [0082], where the signal is muted for nonvocal intervals, but output for vocal intervals, showing an instant increase in the gain).  
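For illustration only, the limitation "gradually reduces or instantly increases" can be sketched as an asymmetric gain update applied frame by frame. This is a minimal sketch, not a description of Yoshioka: the function name and the per-frame release step are hypothetical.

```python
def smooth_gain(previous_gain, target_gain, release_step=0.25):
    """Asymmetric gain update for one frame.

    An increase is applied instantly; a decrease is applied gradually,
    by at most release_step per frame (hypothetical parameter).
    """
    if target_gain >= previous_gain:
        return target_gain
    return max(target_gain, previous_gain - release_step)
```

With these assumed values, a gain of 1.0 followed by a target of 0.0 steps down through 0.75, 0.5, 0.25, and 0.0 over four frames, while a rise from 0.0 to 1.0 takes effect in a single frame.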

Regarding claim 5, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, wherein the determining determines the gain of the audio signal to be: 
a minimum in the state where the second probability value, which is less than a predetermined value, is set (Yoshioka para [0083], [0050], [0053], where the probability below a threshold indicates a nonvocal sound, which has a volume set to zero); and 
a value greater than the minimum in the state where the first probability value, which is greater than the predetermined value, is set (Yoshioka para [0083], [0050], [0053], where the probability above a threshold indicates a vocal sound, which has a volume set to a nonzero value).

Regarding claim 8, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, wherein the peak in the high-order cepstrum of the cepstrum waveform corresponds to a fundamental tone of a person’s voice (Gerkmann page 4255 section 3 second paragraph, where the upper cepstrum represents the speech fundamental period).  

Regarding claim 9, Yoshioka teaches:
A voice processing device comprising: 
a first microphone (Fig. 1 element 12, para [0037], where a microphone is used); 
a memory storing instructions (para [0038], where storage is used); and
a processor that implements the stored instructions to execute a plurality of tasks (para [0038], where a processing unit is used) including: 
a voice estimating task that estimates probability of an audio signal collected by the first microphone including a person’s voice (Fig. 2 element 34, para [0046], where the index D1 is the probability value) using a cepstrum waveform of the audio signal (para [0084], where the modulation spectrum is determined using cepstral analysis, which is used to calculate D1 as in para [0047]), the voice estimating task:
a gain determining task that determines a gain of the audio signal: 
from among a range of zero to one in a state where the first probability value is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1); and
zero in a state where the second probability is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1);
a signal processing task that processes the audio signal based on the determined gain of the audio signal to improve an audio quality at a far-end side (Fig. 2 element 44, para [0053], where the processing determines the output signal based on the D1 and magnitude values, to emit only sounds that the user needs to hear); and 
a sending task that sends the processed audio signal to the far-end side, where a voice processing device, located at the far-end side reproduces the received processed audio signal to emit sound from a speaker (Fig. 1 element Sout, R2, para [0036], [0053], where the output Sout is sent to a separate space R2 and sound is emitted).  
Yoshioka does not teach:
detects a peak in a high-order cepstrum of the cepstrum waveform; 
a probability value setting task that sets:
a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal; and
a second probability value of zero in a case where the peak in the high-order cepstrum is not detected; 
a gain determining task that determines a gain of the audio signal from among a range of greater than zero and less than one in a state where the first probability value is set;
Gerkmann teaches:
detects a peak in a high-order cepstrum of the cepstrum waveform (page 4256, section 5, second paragraph, where a maximum in the upper cepstrum is searched for); 
a probability value setting task that sets:
a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal (page 4256, section 5, second paragraph, where a peak that is greater than a threshold is considered voiced, interpreted as a high probability value); and
a second probability value in a case where the peak in the high-order cepstrum is not detected (page 4256, section 5, second paragraph, where a peak is lower than a threshold, the signal segment is assumed to be unvoiced, interpreted as a low probability value);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka by using the cepstral analysis of Gerkmann (Gerkmann page 4256, section 5, second paragraph) in the speech detection of Yoshioka (Yoshioka Fig. 2 element 34, para [0046]) in order to estimate speech presence at each time-frequency point (Gerkmann page 4254 section 1 second paragraph).
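Yoshioka in view of Gerkmann does not teach that the second probability value is zero, or that the gain is from among a range of greater than zero and less than one in a state where the first probability value is set.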
Wu teaches:
setting probabilities to zero (col. 13 lines 6-24, where binary outputs of 1 indicate presence of target speech and 0 indicate absence of target speech), and
determining a gain of the audio signal to be: from among a range of greater than zero and less than one in a state where the first probability value is set (col. 13 lines 6-24, 56-64, where a gain is determined to be greater than 0 and less than 1, when voice activity is detected);
The prior art of Yoshioka in view of Gerkmann contained a device which differed from the claimed device by the substitution of some components (the probability of Gerkmann, page 4256, section 5, second paragraph) with other components (the binary VAD outputs and gains of Wu, col. 13 lines 6-24, 56-64); the substituted components and their functions were known in the art; one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable.

Regarding claim 11, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, wherein: 
the plurality of tasks include an audio signal-to-noise (SN) ratio estimating task that estimates an audio SN ratio in the audio signal according to sound collected by the first microphone (Yoshioka Fig. 10 element 44, para [0066-69], [0053-55], where an SN ratio is calculated, and where the index D1 probability value is altered by the weighting from the SN ratio to D3, which is used in the gain determination), and
the gain determining task determines the gain of the audio signal to be from among the range of zero to one based on the estimated SN ratio in a state where the first probability value is set (Yoshioka Fig. 10 element 44, para [0066-69], [0053-55], where an SN ratio is calculated, and where the index D1 probability value is altered by the weighting from the SN ratio to D3, which is used in the gain determination).  

Regarding claim 13, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, wherein the signal processing task gradually reduces or instantly increases the gain of the audio signal (Yoshioka Fig. 14, para [0082], where the signal is muted for nonvocal intervals, but output for vocal intervals, showing an instant increase in the gain).  

Regarding claim 14, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, wherein the gain determining task determines the gain of the audio signal to be: 
a minimum in the state where the second probability value, which is less than a predetermined value, is set (Yoshioka para [0083], [0050], [0053], where the probability below a threshold indicates a nonvocal sound, which has a volume set to zero); and 
a value greater than the minimum in the state where the first probability value, which is greater than the predetermined value, is set (Yoshioka para [0083], [0050], [0053], where the probability above a threshold indicates a vocal sound, which has a volume set to a nonzero value).  

Regarding claim 17, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, wherein the peak in the high-order cepstrum of the cepstrum waveform corresponds to a fundamental tone of a person’s voice (Gerkmann page 4255 section 3 second paragraph, where the upper cepstrum represents the speech fundamental period).  

Regarding claim 18, Yoshioka teaches:
A voice processing method comprising:  
extracting a feature amount of voice of an audio signal representing sound collected by a microphone (Fig. 2 element 34, para [0046], where the magnitude L1 is a voice feature amount), the feature amount representing probability of the audio signal including a person’s voice based on a cepstrum waveform of the audio signal (para [0084], where the modulation spectrum is determined using cepstral analysis, which is used to calculate D1 as in para [0047]) by: 
determining a gain of the audio signal to be:
from among a range of zero to one in a state where the first probability value is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1); and
zero in a state where the second probability value is set (Fig. 2 element 44, para [0053-55], where the volume or gain is set to 0 for a nonvocal sound, and the input signal is output for a vocal sound, indicating a gain of 1);
processing the audio signal based on the determined gain of the audio signal to improve an audio quality at a far-end side (Fig. 2 element 44, para [0053], where the processing determines the output signal based on the D1 and magnitude values, to emit only sounds that the user needs to hear); and 
sending the processed audio signal to the far-end side, where a voice processing device, located at the far-end side reproduces the received processed audio signal to emit sound from a speaker (Fig. 1 element Sout, R2, para [0036], [0053], where the output Sout is sent to a separate space R2 and sound is emitted).  
Yoshioka does not teach:
detecting a peak in a high-order cepstrum of the cepstrum waveform;
setting a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal; and
establishing a second probability value of zero in a case where the peak in the high-order cepstrum is not detected; 
determining a gain of the audio signal to be: from among a range of greater than zero and less than one in a state where the first probability value is set;
Gerkmann teaches:
detecting a peak in a high-order cepstrum of the cepstrum waveform (page 4256, section 5, second paragraph, where a maximum in the upper cepstrum is searched for);
setting a first probability value indicative of the audio signal including a person’s voice in a case where the peak in the high-order cepstrum is detected, the first probability value corresponding to a peak level in the audio signal (page 4256, section 5, second paragraph, where a peak that is greater than a threshold is considered voiced, interpreted as a high probability value); and
establishing a second probability value in a case where the peak in the high-order cepstrum is not detected (page 4256, section 5, second paragraph, where a peak is lower than a threshold, the signal segment is assumed to be unvoiced, interpreted as a low probability value);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka by using the cepstral analysis of Gerkmann (Gerkmann page 4256, section 5, second paragraph) in the speech detection of Yoshioka (Yoshioka Fig. 2 element 34, para [0046]) in order to estimate speech presence at each time-frequency point (Gerkmann page 4254 section 1 second paragraph).
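Yoshioka in view of Gerkmann does not teach that the second probability value is zero, or that the gain is from among a range of greater than zero and less than one in a state where the first probability value is set.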
Wu teaches:
setting probabilities to zero (col. 13 lines 6-24, where binary outputs of 1 indicate presence of target speech and 0 indicate absence of target speech), and
determining a gain of the audio signal to be: from among a range of greater than zero and less than one in a state where the first probability value is set (col. 13 lines 6-24, 56-64, where a gain is determined to be greater than 0 and less than 1, when voice activity is detected);
The prior art of Yoshioka in view of Gerkmann contained a method which differed from the claimed method by the substitution of some components (the probability of Gerkmann, page 4256, section 5, second paragraph) with other components (the binary VAD outputs and gains of Wu, col. 13 lines 6-24, 56-64); the substituted components and their functions were known in the art; one of ordinary skill in the art could have substituted one known element for another, and the results of the substitution would have been predictable.

Regarding claim 20, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 18, further comprising: 
estimating an audio signal-to-noise (SN) ratio in the audio signal according to sound collected by the first microphone (Yoshioka Fig. 10 element 64, para [0068], where an SN ratio is calculated), 
wherein the determining determines the gain of the audio signal to be from among the range of zero to one based on the estimated SN ratio in the state where the first probability value is set (Yoshioka Fig. 10 element 44, para [0066-69], [0053-55], where the index D1 probability value is altered by the weighting from the SN ratio to D3, which is used in the gain determination).

Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Yoshioka, in view of Gerkmann and Wu, and further in view of Matsuo (US 2015/0325253 A1).

Regarding claim 3, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, further comprising:
Yoshioka in view of Gerkmann and Wu does not teach:
estimating a correlation value of the audio signal representing sound collected by a plurality of microphones, including the first microphone, wherein 
the determining determines the gain of the audio signal representing the sound collected by the plurality of microphones based on the first or second set probability value associated with each of the plurality of microphones and the estimated correlation value.  
Matsuo teaches:
estimating a correlation value of the audio signal representing sound collected by a plurality of microphones, including the first microphone (para [0093], where cross correlation is performed on different inputs from microphones as in para [0084]), wherein 
the determining determines the gain of the audio signal representing the sound collected by the plurality of microphones based on the first or second set probability value associated with each of the plurality of microphones and the estimated correlation value (Fig. 11 element 114, para [0090-91], where the gain determination unit determines gain based on the correlation from element 17 and the probability from element 116).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka in view of Gerkmann and Wu by using the correlation of Matsuo (Matsuo para [0084]) in the gain determination of Yoshioka in view of Gerkmann and Wu (Yoshioka para [0053-55]) by correlating signals from different microphones, in order to determine likelihood that a frame contains speech (Matsuo para [0093]).
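For illustration only, a correlation value between the signals of two microphones can be sketched as a zero-lag normalized cross-correlation. This is a minimal sketch and is not taken from Matsuo: the function name and the choice of a zero-lag, mean-removed correlation are assumptions made for the example.

```python
import math


def normalized_correlation(x, y):
    """Zero-lag normalized cross-correlation of two frames, in [-1, 1].

    Returns 0.0 when either frame has no variation (assumed convention).
    """
    n = min(len(x), len(y))
    mean_x = sum(x[:n]) / n
    mean_y = sum(y[:n]) / n
    dx = [x[i] - mean_x for i in range(n)]
    dy = [y[i] - mean_y for i in range(n)]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denominator
```

A value near 1 indicates that both microphones picked up essentially the same waveform, which a gain determination could weigh together with the per-microphone probability values.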

Regarding claim 12, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, further comprising:
Yoshioka in view of Gerkmann and Wu does not teach:
a plurality of microphones including the first microphone, wherein the plurality of tasks include a correlation value calculating task that estimates a correlation value of the audio signal representing the sound collected by the plurality of microphones, and 
wherein the gain determining task determines the gain of the audio signal representing the sound collected by the plurality of the microphones based on the first or second set probability value associated with each of the plurality of microphones and the estimated correlation value.  
Matsuo teaches:
a plurality of microphones including the first microphone, wherein the plurality of tasks include a correlation value calculating task that estimates a correlation value of the audio signal representing the sound collected by the plurality of microphones (para [0093], where cross correlation is performed on different inputs from microphones as in para [0084]), and 
wherein the gain determining task determines the gain of the audio signal representing the sound collected by the plurality of the microphones based on the first or second set probability value associated with each of the plurality of microphones and the estimated correlation value (Fig. 11 element 114, para [0090-91], where the gain determination unit determines gain based on the correlation from element 17 and the probability from element 116).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka in view of Gerkmann and Wu by using the correlation of Matsuo (Matsuo para [0084]) in the gain determination of Yoshioka in view of Gerkmann and Wu (Yoshioka para [0053-55]) by correlating signals from different microphones, in order to determine likelihood that a frame contains speech (Matsuo para [0093]).

Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Yoshioka, in view of Gerkmann and Wu, and further in view of Brown (US 2021/0058720 A1).

Regarding claim 7, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing method according to claim 1, wherein
Yoshioka in view of Gerkmann and Wu does not teach:
the probability value is set using machine learning.
Brown teaches:
the probability value is set using machine learning (para [0064], where machine learning is used to update a voice detection decision tree).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka in view of Gerkmann and Wu by using the machine learning of Brown (Brown para [0064]) in determining the probability of Yoshioka in view of Gerkmann and Wu (Yoshioka para [0046]) by updating weights in the decision process, so that the system becomes personalized for the user (Brown para [0062]).

Regarding claim 16, Yoshioka in view of Gerkmann and Wu teaches:
The voice processing device according to claim 9, wherein
Yoshioka in view of Gerkmann and Wu does not teach:
the probability value setting task sets the probability value using machine learning.
Brown teaches:
the probability value setting task sets the probability value using machine learning (para [0064], where machine learning is used to update a voice detection decision tree).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system of Yoshioka in view of Gerkmann and Wu by using the machine learning of Brown (Brown para [0064]) in determining the probability of Yoshioka in view of Gerkmann and Wu (Yoshioka para [0046]) by updating weights in the decision process, so that the system becomes personalized for the user (Brown para [0062]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 2019/0267022 A1 para [0023] teaches a gain value between 0 and 1.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRYAN S BLANKENAGEL whose telephone number is (571)270-0685. The examiner can normally be reached 8:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at 571-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BRYAN S BLANKENAGEL/Primary Examiner, Art Unit 2658