DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 22 July 2021 has been entered.
Claim 1 and the other independent claims are amended. 
All previous objections and rejections directed to the Applicant’s disclosure and claims not discussed in this Office Action have been withdrawn by the Examiner.


Response to Amendments and Arguments
The applicant’s arguments and remarks have been considered, but they are not persuasive. 
The applicant states that neither Schevciw nor Brakish teaches or fairly suggests “each directivity flag signal indicating whether there is a speech signal in a respective sound source direction at a same time point” as amended claim 1 recites. The applicant further states that according to Brakish, “[t]he extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time.” The applicant continues: “In contrast, amended claim 1 recites each human speaker sound source to distinguish between relevant and irrelevant sound sources that exist in the same time frame in the received audio signal.”
 The applicant further quotes para [0137] of the applicant’s specification: “[t]he directivity flag signal indicates whether the sound source direction has a speech signal at each time point. For example, the directivity flag signal for the main driver’s seat indicates whether there is a speech at the main driver’s seat at various time points. The directivity flag signal from the front passenger’s seat indicates whether there is a speech at the front passenger’s seat at various time points.” However, the examiner again refers to Brakish, para [0073], “…This allows identifying each human speaker within a time domain and frequency domain and separating each identified speaker from other speakers as well as identification of a human speaker in relation to other types of sound sources defined as noise. The frequency characterization also allows distinguishing one or more relevant speakers from non-relevant speakers in the area 20.” Thus, Brakish teaches that speech may be extracted from a particular location at a particular point in time, as described in the applicant’s specification, para [0137]. 
Additionally, Brakish, para [0126], states, “Some embodiments may thus enable speaker certainty, allowing the system to provide speaker change detection (e.g., to detect that Person A was the speaker so far, and that Person B is now the dominant speaker while Person A is silent); to be used for speaker identification (e.g., by taking into account the high-quality and noise-free sound that is captured from the target region; and optionally by taking into account other information or parameters, for example, a pre-defined knowledge that the hybrid microphone is directed towards a driver within a vehicle, or towards a lecturer within a lecture hall)…” Thus, this excerpt explains that Brakish is able to perform the same speech detection steps as described in the applicant’s specification, para [0137] – namely, providing a directivity flag signal for the main driver’s seat indicating whether there is a speech at the main driver’s seat at various time points.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) a same time point”.

Claim Rejections - 35 USC § 103

1. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1. Claims 1-3, 5-8, 10-13, and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110288860, hereinafter referred to as Schevciw et al., in view of US 20170150254, hereinafter referred to as Brakish et al.

Regarding claim 1 (currently amended), Schevciw et al. discloses a method, comprising:

obtaining a spatial audio signal (Schevciw et al., para [0064], [0112], [0113]);



combining the continuous speech signal with the corresponding directivity for the sound source direction to generate a speech activation detection signal for the sound source direction (Schevciw et al., para [0084], [0095], [0116], [0117], [0119]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.). It would have been obvious to one of ordinary skill in the art at the time of the applicant's invention to modify the apparatus of Schevciw et al. to include a flag signal as taught by Brakish et al. for the advantage of providing more desirable voice activation segments.

As to claim 11, CRM claim 11 and method claim 1 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 11 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Regarding claim 2 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 1, wherein the spatial audio signal includes a plurality of original audio signals collected by a plurality of audio signal collection devices (Schevciw et al., para [0073], [0082], [0112], [0129], [0184]).

As to claim 12, CRM claim 12 and method claim 2 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 12 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Regarding claim 3 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 2, wherein separating the continuous speech signal and the corresponding directivity for a sound source direction from the spatial audio signal comprises:

estimating a signal arrival direction for a sound source direction (Schevciw et al., para [0084], [0085], [0091], [0095], [0129], [0184]);

according to the signal arrival direction for the preset sound source direction (Schevciw et al., para [0082], [0084], [0092]), generating the directivity for the preset sound source direction (Schevciw et al., para [0084], [0085], [0091], [0095], [0129], [0184]); and

performing a beamforming processing on the plurality of original audio signals to generate the continuous speech signal for the sound source direction (Schevciw et al., para [0007], [0117]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).

As to claim 13, CRM claim 13 and method claim 3 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 13 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

Regarding claim 5 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 1, wherein combining the continuous speech signal with the corresponding directivity flag signal for the sound source direction to generate a speech activation detection signal for the sound source direction comprises:

determining the directivity corresponding to respective frame of the continuous speech signal (Schevciw et al., para [0084], [0085], [0095], [0116]-[0119]);

obtaining a determination result by determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner (Schevciw et al., para [0117]-[0119], [0131]);

according to the determination result and the corresponding directivity of respective frame of the continuous speech signal, setting respective frame of the continuous speech signal as the speech signal or the non-speech signal (Schevciw et al., para [0117]-[0119], [0131]); and

determining a signal in respective frame of the continuous speech signal that is set to be the speech signal as a speech activation detection signal (Schevciw et al., para [0117]-[0119], [0131]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).

As to claim 15, CRM claim 15 and method claim 5 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.


Regarding claim 6 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 5, wherein after setting respective frame of the continuous speech signal as the speech signal or the non-speech signal, the method further comprises:

determining a duration of a non-speech segment (Schevciw et al., para [0102], [0111], [0118], [0130]), the non-speech segment being a segment composed of respective successive frame of the continuous speech signal that is set to be the non-speech signal (Schevciw et al., para [0102], [0111], [0118], [0130]); and

setting respective frame of the continuous speech signal in the non-speech segment with the duration less than a first preset threshold to be the speech signal (Schevciw et al., para [0102], [0111], [0118], [0130]).
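
For illustration only, the gap-filling step recited in claim 6 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Schevciw et al., Brakish et al., or the applicant's specification: the function name `fill_short_gaps`, the boolean frame labels, and the threshold expressed as a frame count are all hypothetical.

```python
def fill_short_gaps(frame_is_speech, min_gap_frames):
    """Relabel short non-speech runs as speech.

    frame_is_speech: list of bool, one label per frame (True = speech).
    min_gap_frames: hypothetical "first preset threshold", in frames.
    A run of consecutive non-speech frames whose length is strictly less
    than min_gap_frames is set to speech; longer runs are left unchanged.
    """
    labels = list(frame_is_speech)  # copy so the input is not modified
    n = len(labels)
    i = 0
    while i < n:
        if labels[i]:
            i += 1
            continue
        # Find the end of the current non-speech run [i, j).
        j = i
        while j < n and not labels[j]:
            j += 1
        # Duration of the non-speech segment, measured in frames.
        if j - i < min_gap_frames:
            for k in range(i, j):
                labels[k] = True
        i = j
    return labels


labels = [True, False, False, True, False, False, False, True]
print(fill_short_gaps(labels, 3))
```

In this sketch the two-frame gap is relabeled as speech, while the three-frame gap is not, because its duration is not less than the assumed threshold of three frames.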

As to claim 16, CRM claim 16 and method claim 6 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 16 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Regarding claim 7 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 5, wherein the directivity indicates that there is the speech signal or the non-speech signal at a time of respective frame (Schevciw et al., para [0084], [0085], [0117]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).

As to claim 17, CRM claim 17 and method claim 7 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.


Regarding claim 8 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 7, wherein setting respective frame of the continuous speech signal as the speech signal or the non-speech signal comprises:

if the determination result of a frame of the continuous speech signal is the speech signal (Schevciw et al., para [0117]-[0119], [0131]), and/or a corresponding directivity of the frame indicates that there is the speech signal at the time of the frame (Schevciw et al., para [0084], [0085], [0117]), setting the frame of the continuous speech signal to be the speech signal (Schevciw et al., para [0117]-[0119], [0131]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).
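
For illustration only, the frame-by-frame combination recited in claims 5 and 8 can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of either cited reference: the names `combine_decisions` and `speech_frames`, the boolean representation of the per-frame determination result and of the directivity flag, and the `mode` parameter (which models the "and/or" alternative in claim 8) are hypothetical.

```python
def combine_decisions(vad_decisions, directivity_flags, mode="or"):
    """Combine a per-frame speech decision with a per-frame directivity flag.

    vad_decisions: list of bool, True if the frame was judged to be speech.
    directivity_flags: list of bool, True if the flag indicates speech in
        the sound source direction at the time of the frame.
    mode: "or" sets a frame to speech if either input indicates speech;
        "and" requires both. Both readings of "and/or" are shown.
    """
    if len(vad_decisions) != len(directivity_flags):
        raise ValueError("one decision and one flag are needed per frame")
    if mode == "or":
        return [bool(v) or bool(f) for v, f in zip(vad_decisions, directivity_flags)]
    if mode == "and":
        return [bool(v) and bool(f) for v, f in zip(vad_decisions, directivity_flags)]
    raise ValueError("mode must be 'or' or 'and'")


def speech_frames(frames, decisions):
    """Keep only the frames set to speech (the speech activation detection signal)."""
    return [frame for frame, keep in zip(frames, decisions) if keep]


vad = [True, False, False]
flags = [False, False, True]
combined = combine_decisions(vad, flags, mode="or")
print(combined)
print(speech_frames(["frame0", "frame1", "frame2"], combined))
```

The `mode` switch is included only because the claim language admits both alternatives; which one applies in a given embodiment is not stated in this action.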

As to claim 18, CRM claim 18 and method claim 8 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 18 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.

Regarding claim 10 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 5, wherein prior to determining the directivity corresponding to respective frame of the continuous speech signal, the method further comprises:

determining a duration of a non-speech indication segment, the non-speech indication segment being composed of a continuous directivity indicating that there is the non-speech signal at the time of respective frame (Schevciw et al., para [0084], [0085], [0097], [0118], [0123]); and

setting the directivity of respective frame of the non-speech indication segment having the duration less than a second preset threshold to indicate that there is the speech signal at the time of respective frame (Schevciw et al., para [0090], [0091], [0102], [0103]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).


Regarding claim 19 (currently amended), Schevciw et al. discloses an apparatus, comprising,

one or more processors (Schevciw et al., para [0175], [0179]),

memory (Schevciw et al., para [0175], [0179]), coupled to the one or more processors, the memory storing thereon computer-executable instructions that, when executed by

obtaining a spatial audio signal (Schevciw et al., para [0064], [0112], [0113]);

separating a continuous speech signal and a corresponding directivity for a sound source direction from the spatial audio signal (Schevciw et al., para [0084], [0085], [0091], [0095], [0117], [0129], and [0184]); and

combining the continuous speech signal with the corresponding directivity for the sound source direction to generate a speech activation detection signal for the sound source direction (Schevciw et al., para [0084], [0095], [0116], [0117], [0119]).

Schevciw et al., though, does not teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal.

Bakish et al. is cited to teach a flag signal, the directivity flag signal indicating whether there is a speech signal in the sound source direction such that the speech activation detection signal has directivity at times indicated by the corresponding directivity flag signal (“The extracted audio signal is then analyzed for identification of human speakers and separation of the identified human speakers by executing VAD algorithm over the extracted audio signal to identify whether human speech is detected at the given moment in time…The extracted audio signal from the OSDS contains only the relevant human speaker, since the optical transmitted signal is directed to a single direction at each time,” Brakish et al., para [0079]. Thus, the extracted audio signal is a directivity flag on which the VAD algorithm operates to determine whether speech is detected at a given time.).


Regarding claim 20 (original), Schevciw et al., as modified by Brakish et al., discloses the apparatus of claim 19, wherein the spatial audio signal includes a plurality of original audio signals collected by a plurality of audio signal collection devices (Schevciw et al., para [0073], [0082], [0112], [0129], [0184]).


Claims 4, 9, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over US 20110288860, hereinafter referred to as Schevciw et al., in view of US 20170150254, hereinafter referred to as Brakish et al., and further in view of US 20170270919, hereinafter referred to as Amazon.

Regarding claim 4 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 3, wherein performing the beamforming processing on the plurality of original audio signals to generate the continuous speech signal for the sound source direction comprises:

determining a delay difference between every two signals in the plurality of original audio signals (Schevciw et al., para [0007], [0086], [0090], [0092], [0117], [0125]);
 
performing a delay compensation on the plurality of original audio signals according to the delay difference between every two signals (Schevciw et al., para [0081], [0119], [0122]); and

performing a summation on the plurality of original audio signals to generate the continuous speech signal for the sound source direction (Schevciw et al., para [0082]-[0084], [0140]).

Schevciw et al., though, does not teach a weighted summation.

Amazon is cited to teach a weighted summation (Amazon, para [0078], [0079], [0129]).
It would have been obvious to one of ordinary skill in the art at the time of the applicant's invention to modify the apparatus of Schevciw et al. with weighted summation of audio signals as taught by Amazon to enhance the desired speech processing.
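
For illustration only, the three steps recited in claim 4 (delay difference, delay compensation, weighted summation) correspond to what is commonly called delay-and-sum beamforming, and can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the processing of Schevciw et al. or Amazon: the function names, the integer-sample delays, the brute-force cross-correlation search, and the simplification of measuring each delay against the first channel (rather than between every two signals) are all hypothetical choices.

```python
def estimate_delay(reference, signal, max_lag):
    """Return the lag (in samples) at which `signal` best matches `reference`.

    The score for a lag d is sum(reference[n] * signal[n + d]); a positive
    result means `signal` arrives later than `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(signal):
                score += r * signal[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensate(signal, lag):
    """Advance `signal` by `lag` samples, zero-padding where data is missing."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


def delay_and_sum(channels, weights, max_lag):
    """Align every channel to channel 0 and return their weighted sum."""
    if len(channels) != len(weights):
        raise ValueError("one weight is needed per channel")
    if any(len(ch) != len(channels[0]) for ch in channels):
        raise ValueError("all channels must have the same length")
    reference = channels[0]
    output = [0.0] * len(reference)
    for channel, weight in zip(channels, weights):
        lag = estimate_delay(reference, channel, max_lag)
        aligned = compensate(channel, lag)
        for i, sample in enumerate(aligned):
            output[i] += weight * sample
    return output


mic0 = [0, 0, 1, 2, 1, 0, 0, 0]
mic1 = [0, 0, 0, 0, 1, 2, 1, 0]  # same pulse, arriving two samples later
print(estimate_delay(mic0, mic1, max_lag=3))
print(delay_and_sum([mic0, mic1], [0.5, 0.5], max_lag=3))
```

With equal weights the sketch reduces to a plain average of the aligned channels; unequal weights are what distinguish the weighted summation from the unweighted summation.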

As to claim 14, CRM claim 14 and method claim 4 are related as method and CRM of using same, with each claimed element's function corresponding to the method step.
Accordingly claim 14 is similarly rejected under the same rationale as applied above with respect to method claim. Also, Schevciw et al., para [0062], teaches computer readable medium.


Regarding claim 9 (original), Schevciw et al., as modified by Brakish et al., discloses the method of claim 5, wherein obtaining a determination result by determining respective frame of the continuous speech signal as a speech signal or a non-speech signal in a frame-by-frame manner (Schevciw et al., para [0117]-[0119], [0131]) comprises:


[0111], [0118]).

Schevciw et al., though, does not teach a preset neural network model.

Amazon is cited to disclose a preset neural network model (Amazon, abstract, para [0076], [0111]). It would have been obvious to one of ordinary skill in the art at the time of the applicant's invention to modify the apparatus of Schevciw et al. with a preset neural network model as taught by Amazon to enhance the desired speech processing.

Conclusion
The prior art made of record and not relied upon is considered pertinent to the applicant’s disclosure and is listed in form 892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2656