DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 29 March 2022 has been entered.

Response to Amendments and Arguments
The applicant’s arguments with respect to the art rejections are moot in view of new grounds for rejection. 

Claim Rejections - 35 USC § 103

1. The following is a quotation of 35 U.S.C. 103 which forms the basis for all
obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that
the claimed invention is not identically disclosed as set forth in section 102
of this title, if the differences between the claimed invention and the prior
art are such that the claimed invention as a whole would have been
obvious before the effective filing date of the claimed invention to a person
having ordinary skill in the art to which the claimed invention pertains.
Patentability shall not be negated by the manner in which the invention
was made.

1.	Claims 1, 3, 6-8, 10-11, 13, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110288860, hereinafter referred to as Schevciw et al., in view of US 20130144622, hereinafter referred to as Yamada et al.

Regarding claim 1 (currently amended), Schevciw et al. discloses a method, comprising:

obtaining a spatial audio signal including a plurality of original audio signals collected by a plurality of audio signal collection devices, each audio signal collection device being placed to collect a different speech signal (Schevciw et al., para [0009], [0073], [0082], [0112], [0129], [0157], and [0184]); and

separating a continuous speech signal and a corresponding directivity

Schevciw et al. does not disclose combining each continuous speech signal with the corresponding directivity flag signal for the respective sound source direction at times indicated by the corresponding directivity flag signal, to generate a respective speech activation detection signal for the sound source direction such that the speech activation detection signal has directivity;

wherein combining the continuous speech signal with the corresponding directivity flag signal for the sound source direction to generate a speech activation detection signal for the sound source direction comprises: 

determining the directivity flag signal corresponding to respective frame of the continuous speech signal in a frame-by-frame manner; 

obtaining a determination result by determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner; 

according to both a determination result of the respective frame as a speech signal and the corresponding directivity flag signal of the respective frame indicating that there is a speech signal at a time of the respective frame, setting respective frame of the continuous speech signal as the speech signal or the non-speech signal; and 

determining a signal in respective frame of the continuous speech signal that is set to be the speech signal as a speech activation detection signal.

Yamada et al. is cited to disclose combining each continuous speech signal with the corresponding directivity flag signal for the respective sound source direction at times indicated by the corresponding directivity flag signal (“Direction-specific speech detector 430 extracts a front, a left, and a right speech from the four-channel A/D-converted digital acoustic signals through microphone array 120. Specifically, direction-specific speech detector 430 applies a known directivity control technique to the four-channel digital acoustic signals. Direction-specific speech detector 430 uses such a technique to determine the directivity for each of the front, the left, and the right of user 200 and then detects a front, a left, and a right speech. Direction-specific speech detector 430 determines the presence or absence of speech at short time intervals using the power information on the extracted direction-specific speeches and determines the presence or absence of other speech from each direction for every frame, on the basis of the results of the determination. Direction-specific speech detector 430 then outputs speech or non-speech information indicating the presence or absence of other speech of every frame and each direction to total-amount-of-speech calculator 440 and established-conversation calculator 450,” Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.), to generate a respective speech activation detection signal for the sound source direction such that the speech activation detection signal has directivity (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.);

wherein combining the continuous speech signal with the corresponding directivity flag signal for the sound source direction to generate a speech activation detection signal for the sound source direction comprises: 

determining the directivity flag signal corresponding to respective frame of the continuous speech signal in a frame-by-frame manner (“Direction-specific speech detector 430 extracts a front, a left, and a right speech from the four-channel A/D-converted digital acoustic signals through microphone array 120. Specifically, direction-specific speech detector 430 applies a known directivity control technique to the four-channel digital acoustic signals. Direction-specific speech detector 430 uses such a technique to determine the directivity for each of the front, the left, and the right of user 200 and then detects a front, a left, and a right speech. Direction-specific speech detector 430 determines the presence or absence of speech at short time intervals using the power information on the extracted direction-specific speeches and determines the presence or absence of other speech from each direction for every frame, on the basis of the results of the determination. Direction-specific speech detector 430 then outputs speech or non-speech information indicating the presence or absence of other speech of every frame and each direction to total-amount-of-speech calculator 440 and established-conversation calculator 450,” Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.); 

obtaining a determination result by determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.); 

according to both a determination result of the respective frame as a speech signal and the corresponding directivity flag signal of the respective frame indicating that there is a speech signal at a time of the respective frame, setting respective frame of the continuous speech signal as the speech signal or the non-speech signal (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.); and 

determining a signal in respective frame of the continuous speech signal that is set to be the speech signal as a speech activation detection signal (As explained in Yamada et al., para [0047], determining a speech/non-speech frame is speech activity detection.). Yamada et al. benefits Schevciw et al. by providing a speech processing device that can extract a conversation group of three or more speakers from a plurality of speakers with high accuracy (Yamada et al., para [0009]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Schevciw et al. with those of Yamada et al. to improve the speaker detection and speech processing of Schevciw et al.
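For illustration only, the frame-by-frame combination recited in claim 1 can be sketched as follows. This sketch is not taken from Schevciw et al., Yamada et al., or the application; the energy-threshold speech test and the names frame_energy, combine_vad_with_flags, select_speech_frames, and energy_threshold are hypothetical choices made for the example. It assumes one Boolean directivity flag per frame and marks a frame as speech only when both the per-frame determination and the flag indicate speech.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def combine_vad_with_flags(
    frames: Sequence[Sequence[float]],
    direction_flags: Sequence[bool],
    energy_threshold: float,
) -> List[bool]:
    """Label each frame as speech only if both tests agree.

    The energy threshold is an assumed stand-in for the per-frame
    speech/non-speech determination; any other detector could be used.
    """
    if len(frames) != len(direction_flags):
        raise ValueError("one directivity flag is needed per frame")
    decisions = []
    for frame, flag in zip(frames, direction_flags):
        is_speech = frame_energy(frame) > energy_threshold
        decisions.append(is_speech and bool(flag))
    return decisions


def select_speech_frames(
    frames: Sequence[Sequence[float]], decisions: Sequence[bool]
) -> List[List[float]]:
    """Keep only the frames labeled as speech (the detection output)."""
    return [list(frame) for frame, keep in zip(frames, decisions) if keep]
```

In this sketch a frame with high energy but a cleared flag (speech arriving from another direction) is dropped, which is what gives the output its directivity.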

As to claim 11, CRM claim 11 and method claim 1 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 11 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

As to claim 19, apparatus claim 19 and method claim 1 are related as method and apparatus of using same, with each claimed element's function corresponding to the method step. Accordingly claim 19 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0178], teaches a processor, memory, and instructions.


Claims 2 and 12 canceled.   


Regarding claim 3 (previously presented), Schevciw et al., as modified by Yamada et al., discloses the method of claim 1, wherein separating the continuous speech signal and the corresponding directivity for a sound source direction from the spatial audio signal comprises:

estimating a signal arrival direction for a sound source direction (Schevciw et al., para [0084], [0085], [0091], [0095], [0129], [0184]);

according to the signal arrival direction for the preset sound source direction (Schevciw et al., para [0082], [0084], [0092]), generating the directivity for the preset sound source direction (Schevciw et al., para [0084], [0085], [0091], [0095], [0129], [0184]); and

performing a beamforming processing on the plurality of original audio signals to generate the continuous speech signal for the sound source direction (Schevciw et al., para [0007], [0117]).

And, Yamada et al. teaches a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.).
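As a hedged illustration of the arrival-direction and flag-generation steps of claim 3, the sketch below converts an inter-microphone delay into an arrival angle and compares it with a preset direction. It assumes a two-microphone, far-field model and a speed of sound of 343 m/s; these assumptions, and the names arrival_angle_deg, direction_flag, and tolerance_deg, are hypothetical and do not come from the cited references.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def arrival_angle_deg(
    delay_samples: float, sample_rate_hz: float, mic_spacing_m: float
) -> float:
    """Far-field arrival angle, in degrees from broadside, for two microphones."""
    delay_s = delay_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


def direction_flag(angle_deg: float, target_deg: float, tolerance_deg: float) -> bool:
    """True when the estimated angle lies within the tolerance of the preset direction."""
    return abs(angle_deg - target_deg) <= tolerance_deg
```

Evaluating direction_flag once per frame would yield a per-frame flag sequence of the kind discussed above.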

As to claim 13, CRM claim 13 and method claim 3 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 13 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

Claims 5 and 15 canceled.


Regarding claim 7 (currently amended), Schevciw et al., as modified by Yamada et al., discloses the method of claim [[5]]1, wherein the directivity indicates that there is the speech signal or the nonspeech signal at a time of respective frame (Schevciw et al., para [0084], [0085], [0117]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

And, Yamada et al. teaches a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.).

As to claim 17, CRM claim 17 and method claim 7 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 17 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

Regarding claim 8 (original), Schevciw et al., as modified by Yamada et al., discloses the method of claim 7, wherein setting respective frame of the continuous speech signal as the speech signal or the non-speech signal comprises:

if the determination result of a frame of the continuous speech signal is the speech signal (Schevciw et al., para [0117]-[0119], [0131]), and/or a corresponding directivity of the frame indicates that there is the speech signal at the time of the frame (Schevciw et al., para [0084], [0085], [0117]), setting the frame of the continuous speech signal to be the speech signal (Schevciw et al., para [0117]-[0119], [0131]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

And, Yamada et al. teaches a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.).

As to claim 18, CRM claim 18 and method claim 8 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Claim 20 canceled. 


2.	Claims 4, 9, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110288860, hereinafter referred to as Schevciw et al., in view of US 20130144622, hereinafter referred to as Yamada et al., and further in view of US 20170270919, hereinafter referred to as Amazon.

Regarding claim 4 (original), Schevciw et al., as modified by Yamada et al., discloses the method of claim 3, wherein performing the beamforming processing on the plurality of original audio signals to generate the continuous speech signal for the sound source direction comprises:

determining a delay difference between every two signals in the plurality of original audio signals (Schevciw et al., para [0007], [0086], [0090], [0092], [0117], [0125]);
 
performing a delay compensation on the plurality of original audio signals according to the delay difference between every two signals (Schevciw et al., para [0081], [0119], [0122]); and

performing a summation on the plurality of original audio signals to generate the continuous speech signal for the sound source direction (Schevciw et al., para [0082]-[0084], [0140]).

Schevciw et al., though, does not teach a weighted summation.

Amazon is cited to teach a weighted summation (Amazon, para [0078], [0079], [0129]).
It would have been obvious to one of ordinary skill in the art at the time of the applicant's invention to modify the apparatus of Schevciw et al. with weighted summation of audio signals as taught by Amazon to enhance the desired speech processing.
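The three steps of claim 4 (delay difference, delay compensation, weighted summation) correspond to what is commonly called delay-and-sum beamforming. The following is a minimal sketch under stated assumptions and is not the method of Schevciw et al. or Amazon: all channels have equal length, delays are whole samples, and each channel is aligned to the first channel rather than to every other channel. The names estimate_delay, shift, delay_and_sum, and max_lag are hypothetical.

```python
from typing import List, Sequence


def estimate_delay(
    reference: Sequence[float], signal: Sequence[float], max_lag: int
) -> int:
    """Lag in samples at which `signal` best matches `reference` (cross-correlation)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, ref_value in enumerate(reference):
            m = n + lag
            if 0 <= m < len(signal):
                score += ref_value * signal[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def shift(signal: Sequence[float], lag: int) -> List[float]:
    """Advance `signal` by `lag` samples, padding with zeros at the edges."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


def delay_and_sum(
    channels: Sequence[Sequence[float]], weights: Sequence[float], max_lag: int
) -> List[float]:
    """Align every channel to the first one, then form the weighted sum."""
    if len(channels) != len(weights):
        raise ValueError("one weight is needed per channel")
    reference = channels[0]
    output = [0.0] * len(reference)
    for channel, weight in zip(channels, weights):
        lag = estimate_delay(reference, channel, max_lag)  # delay difference
        aligned = shift(channel, lag)                      # delay compensation
        for i, value in enumerate(aligned):
            output[i] += weight * value                    # weighted summation
    return output
```

With equal weights the sum reduces to a plain average of the aligned channels; unequal weights allow some microphones to contribute more than others.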

As to claim 14, CRM claim 14 and method claim 4 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 14 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Regarding claim 9 (currently amended), Schevciw et al., as modified by Yamada et al., discloses the method of claim [[5]]1, wherein obtaining a determination result by determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner (Yamada et al., para [0047].) comprises:

inputting the continuous speech signal (Yamada et al., para [0047].); and

determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner (Yamada et al., para [0047].).

Schevciw et al., though, does not teach a preset neural network model.

Amazon is cited to disclose a preset neural network model (Amazon, abstract, para [0076], [0111]). It would have been obvious to one of ordinary skill in the art at the time of the applicant's invention to modify the apparatus of Schevciw et al. with a preset neural network model as taught by Amazon to enhance the desired speech processing.


3.	Claims 6, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110288860, hereinafter referred to as Schevciw et al., in view of US 20130144622, hereinafter referred to as Yamada et al., and further in view of US 20170309297, hereinafter referred to as Arsikere et al.

Regarding claim 6 (currently amended), Schevciw et al., as modified by Yamada et al., discloses the method of claim [[5]]1, but not wherein after setting respective frame of the continuous speech signal as the speech signal or the non-speech signal, the method further comprises: determining a duration of a non-speech segment; and

setting respective frame of the continuous speech signal in the non-speech segment with the duration less than a first preset threshold to be the speech signal.

Arsikere et al. is cited to disclose determining a duration of a non-speech segment (““Pause” may refer to a time duration determined between two audio segments of an audio signal. In an embodiment, the pause may also be identified based on a non-speech frame in the audio segment, such that the time duration of the non-speech frame may correspond to the pause. In another embodiment, the pause in an audio segment may be identified from a set of temporally adjacent one or more non-speech frames in the audio segment, such that the collective time duration of the set of temporally adjacent one or more non-speech frames may correspond to the pause. In an embodiment, the time duration between the two audio segments, or the collective time duration of the set of temporally adjacent one or more non-speech frames of an audio segment of an audio signal may be considered as the pause only if the time duration or the collective time duration of the set of temporally adjacent one or more non-speech frames exceeds a predetermined duration,” Arsikere et al., para [0031].); and

setting respective frame of the continuous speech signal in the non-speech segment with the duration less than a first preset threshold to be the speech signal (““Pause” may refer to a time duration determined between two audio segments of an audio signal. In an embodiment, the pause may also be identified based on a non-speech frame in the audio segment, such that the time duration of the non-speech frame may correspond to the pause. In another embodiment, the pause in an audio segment may be identified from a set of temporally adjacent one or more non-speech frames in the audio segment, such that the collective time duration of the set of temporally adjacent one or more non-speech frames may correspond to the pause. In an embodiment, the time duration between the two audio segments, or the collective time duration of the set of temporally adjacent one or more non-speech frames of an audio segment of an audio signal may be considered as the pause only if the time duration or the collective time duration of the set of temporally adjacent one or more non-speech frames exceeds a predetermined duration,” Arsikere et al., para [0031]. Here, treating a set of non-speech frames as a pause only when its time duration exceeds a predetermined duration (i.e., a threshold) is equivalent to treating a set of non-speech frames whose time duration is below the threshold as a speech signal.). Arsikere et al. benefits Schevciw et al. by providing a classification of the dialogue act into one or more categories, thereby allowing an organization to derive one or more inferences pertaining to the dialogue act that may further be utilized to determine how efficiently a customer care representative has answered a query of a customer (Arsikere et al., para [0003]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Schevciw et al. with those of Arsikere et al. to extend the usefulness of the speech signal recognition of Schevciw et al.
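The short-segment relabeling recited in claim 6 can be illustrated with the sketch below, which is offered only as an example and is not drawn from Arsikere et al. It assumes that duration is counted in frames and that a non-speech run is relabeled only when speech frames sit on both sides of it; the second condition is an assumption of this example, not a limitation found in the claim. The names fill_short_gaps and min_gap_frames are hypothetical.

```python
from typing import List, Sequence


def fill_short_gaps(decisions: Sequence[bool], min_gap_frames: int) -> List[bool]:
    """Relabel interior non-speech runs shorter than `min_gap_frames` as speech."""
    smoothed = list(decisions)
    i, n = 0, len(smoothed)
    while i < n:
        if smoothed[i]:
            i += 1
            continue
        start = i
        while i < n and not smoothed[i]:
            i += 1
        # The run covers frames start..i-1; its duration is i - start frames.
        bounded_by_speech = start > 0 and i < n
        if bounded_by_speech and (i - start) < min_gap_frames:
            for k in range(start, i):
                smoothed[k] = True
    return smoothed
```

The same routine could be applied to a per-frame directivity flag sequence with a different threshold, which is the arrangement recited in claim 10.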

As to claim 16, CRM claim 16 and method claim 6 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 16 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

Regarding claim 10 (currently amended), Schevciw et al., as modified by Yamada et al., discloses the method of claim [[5]]1, but not wherein prior to determining the directivity corresponding to respective frame of the continuous speech signal, the method further comprises: determining a duration of a non-speech indication segment, the non-speech indication segment being composed of a continuous directivity indicating that there is the nonspeech signal at the time of respective frame; and setting the directivity of respective frame of the non-speech indication segment having the duration less than a second preset threshold to indicate that there is the speech signal at the time of respective frame.

Arsikere et al. is cited to disclose determining a duration of a non-speech indication segment, the non-speech indication segment being composed of a continuous directivity indicating that there is the nonspeech signal at the time of respective frame (““Pause” may refer to a time duration determined between two audio segments of an audio signal. In an embodiment, the pause may also be identified based on a non-speech frame in the audio segment, such that the time duration of the non-speech frame may correspond to the pause. In another embodiment, the pause in an audio segment may be identified from a set of temporally adjacent one or more non-speech frames in the audio segment, such that the collective time duration of the set of temporally adjacent one or more non-speech frames may correspond to the pause. In an embodiment, the time duration between the two audio segments, or the collective time duration of the set of temporally adjacent one or more non-speech frames of an audio segment of an audio signal may be considered as the pause only if the time duration or the collective time duration of the set of temporally adjacent one or more non-speech frames exceeds a predetermined duration,” Arsikere et al., para [0031].); and

setting the directivity of respective frame of the non-speech indication segment having the duration less than a second preset threshold to indicate that there is the speech signal at the time of respective frame (““Pause” may refer to a time duration determined between two audio segments of an audio signal. In an embodiment, the pause may also be identified based on a non-speech frame in the audio segment, such that the time duration of the non-speech frame may correspond to the pause. In another embodiment, the pause in an audio segment may be identified from a set of temporally adjacent one or more non-speech frames in the audio segment, such that the collective time duration of the set of temporally adjacent one or more non-speech frames may correspond to the pause. In an embodiment, the time duration between the two audio segments, or the collective time duration of the set of temporally adjacent one or more non-speech frames of an audio segment of an audio signal may be considered as the pause only if the time duration or the collective time duration of the set of temporally adjacent one or more non-speech frames exceeds a predetermined duration,” Arsikere et al., para [0031]. Here, treating a set of non-speech frames as a pause only when its time duration exceeds a predetermined duration (i.e., a threshold) is equivalent to treating a set of non-speech frames whose time duration is below the threshold as a speech signal.). Arsikere et al. benefits Schevciw et al. by providing a classification of the dialogue act into one or more categories, thereby allowing an organization to derive one or more inferences pertaining to the dialogue act that may further be utilized to determine how efficiently a customer care representative has answered a query of a customer (Arsikere et al., para [0003]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Schevciw et al. with those of Arsikere et al. to extend the usefulness of the speech signal recognition of Schevciw et al.

As previously noted, Yamada et al. teaches a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (Yamada et al., para [0047]. The speech/non-speech information indicating the direction for every frame is a directivity flag.).


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2659