Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
1.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/11/2022 has been entered.
Response to Amendment
2.	Claims 1-10 have been amended.  
Response to Arguments
3.	Applicant's arguments have been considered but are moot in view of the new grounds of rejection, which are responsive to the amendments and in which the prior art, Lyren, teaches audio diarization that includes speaker identification.

Claim Rejections - 35 USC § 102
4.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

5.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


6.	Claims 1-3 and 7-10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lyren et al. (9,584,946).

Regarding claim 1, Lyren teaches a non-transitory computer-readable storage medium for storing a detection program which causes a processor to perform processing (fig 1, 13-14; col 5 l. 18: computer system), the processing comprising: 
acquiring first voice information containing voices of a plurality of speakers (col 3 l. 56-60: multiple different voices or speakers can be present in an audio input, such as multiple voices in a telephone call, teleconference call…movie or video); 
dividing the first voice information into a plurality of frames, each of the plurality of frames having a predetermined time length (col 11 l. 57-60: converts audio input into …features from which multiple features are extracted for each frame and stored as feature vectors); 
detecting, based on a first acoustic feature of a first speaker among the plurality of speakers, one or more first frames from among the plurality of frames, the one or more first frames corresponding to a first speech segment in which a voice of the first speaker is included, the first acoustic feature being an acoustic feature obtained by performing a machine learning on second voice information containing a voice of the first speaker (col 4 l. 15 – 18: divide or segment audio input into different sounds or sound segments such as dividing audio input into different speech segments; 
col 5 l. 33-36: audio diarization system 110 executes speaker diarization on the audio input. Speaker diarization (aka, speaker diarisation) is a process that divides audio input into segments according to speaker identity. For audio input with voices, speaker diarization combines speaker segmentation and speaker clustering to determine who spoke, when they spoke, and when they did not speak;
col 5 l. 48-54: Speaker diarization can be actioned with one or more other audio systems, such as combining an audio diarization system with a speaker recognition system to verify, authenticate, or identify a person;
col 6 l. 13-20: The audio diarization system 110 can also operate with information or knowledge regarding the content of the audio input 150. For example, the audio diarization system is provided with or determines a number of speakers, such as a known number of speakers in a radio or television archive broadcast. As another example, the audio diarization system is provided with audio samples, voiceprints/voice identifications (IDs) or voice models of the speakers;
col 6 l. 34-39: audio diarization system can execute machine learning methods); and 
in response to the detecting of the one or more first frames, detecting, based on a second acoustic feature of a second speaker among the plurality of speakers, one or more second frames from among a portion of the plurality of frames, the one or more second frames corresponding to a second speech segment in which a voice of the second speaker is included, the second acoustic feature being an acoustic feature obtained from the portion of the plurality of frames, the portion of the plurality of frames corresponding to a time range after the detected first speech segment and being limited to a time length adjusted based on a time length of the detected first speech segment (col 4 l. 15-18; col 5 l. 33-36; col 5 l. 48-54; col 6 l. 13-20; col 6 l. 34-39;
col 6 l. 57-65: speaker change locations, such as temporal locations when one speaker stops speaking and another begins speaking;
here Lyren teaches obtaining audio input with multiple speakers, and segmenting and identifying the speakers).
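
For illustration only, the sequence of steps recited in claim 1 can be sketched as below. This is a minimal sketch of the claim language under stated assumptions; it is not taken from Lyren or from the application. The function names (`split_frames`, `first_speaker_frames`, `search_window`), the toy acoustic feature (mean absolute amplitude), the tolerance `tol`, and the rule that shortens the search window by half the first segment length are all hypothetical.

```python
from typing import List, Tuple


def split_frames(samples: List[float], frame_len: int) -> List[List[float]]:
    """Divide a signal into consecutive frames of a fixed, predetermined length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_feature(frame: List[float]) -> float:
    """Toy acoustic feature (assumption): mean absolute amplitude of the frame."""
    return sum(abs(x) for x in frame) / len(frame)


def first_speaker_frames(frames: List[List[float]],
                         enrolled_feature: float,
                         tol: float) -> List[int]:
    """Indices of frames whose feature lies within tol of the first speaker's
    previously learned (enrolled) feature."""
    return [i for i, f in enumerate(frames)
            if abs(frame_feature(f) - enrolled_feature) <= tol]


def search_window(first_idx: List[int], n_frames: int,
                  base_len: int) -> Tuple[int, int]:
    """Half-open frame range [start, end) that follows the first speech segment.

    Hypothetical adjustment rule: the window is base_len frames, shortened by
    half the length of the detected first segment, with a minimum of one frame.
    An empty range is returned when no first-speaker frame was detected.
    """
    if not first_idx:
        return (0, 0)
    window_len = max(1, base_len - len(first_idx) // 2)
    start = first_idx[-1] + 1
    return (start, min(n_frames, start + window_len))


# Two quiet frames (first speaker) followed by two loud frames.
samples = [0.1] * 8 + [0.9] * 8
frames = split_frames(samples, frame_len=4)
first = first_speaker_frames(frames, enrolled_feature=0.1, tol=0.05)
window = search_window(first, len(frames), base_len=3)
```

In this sketch the second speaker's feature would then be estimated only from the frames inside `window`, which is the part of the claim that ties the second detection to the length of the first speech segment.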

Regarding claim 2, Lyren teaches the non-transitory computer-readable storage medium according to claim 1, wherein the detecting of the one or more first frames is configured to detect the one or more first frames based on a similarity of the first acoustic feature to an acoustic feature included in each of the plurality of frames (col 5 l. 33-42: The audio diarization system 110 executes speaker diarization on the audio input. Speaker diarization (aka, speaker diarisation) is a process that divides audio input into segments according to speaker identity. For audio input with voices, speaker diarization combines speaker segmentation and speaker clustering to determine who spoke, when they spoke, and when they did not speak;
Speaker diarization can determine general speaker identity, such as labeling a voice in the audio input as “Speaker 1” and determining when the speaker speaks
– detecting similar speech segments).

Regarding claim 3, Lyren teaches the non-transitory computer-readable storage medium according to claim 1, the processing further comprising: 
updating the first acoustic feature based on an acoustic feature of the one or more first frames detected as the first speech segment (col 5 l. 33-42: The audio diarization system 110 executes speaker diarization on the audio input. Speaker diarization (aka, speaker diarisation) is a process that divides audio input into segments according to speaker identity; col 21 l. 5-17: assignments are updated).

Regarding claim 7, Lyren teaches the non-transitory computer-readable storage medium according to claim 1, wherein 
the detecting of the one or more second frames is configured to 
specify a mode value of the second acoustic feature obtained from the portion of the plurality of frames (col 6 l. 42-50), and 
detect, as the second speech segment, the one or more second frames each of which has an acoustic feature that is close to the mode value (col 6 l. 42-50: executes segmentation and then clustering. This technique splits the audio into successive clusters and merges redundant clusters until each cluster corresponds to a speaker. For example, the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster. – detecting second speech segments, with different characteristics than first segment).  
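
For illustration only, the "mode value" limitation of claim 7 can be sketched as below. This is a minimal sketch under stated assumptions, not an implementation from Lyren or from the application: each frame is assumed to carry a single scalar feature, the mode is approximated as the centre of the most populated histogram bin, and the names `mode_value`, `frames_near_mode`, `bin_width`, and `max_dist` are hypothetical.

```python
from collections import Counter
from typing import List


def mode_value(features: List[float], bin_width: float) -> float:
    """Approximate mode of continuous values: centre of the most populated bin.

    Ties are broken in favour of the lower bin (an arbitrary choice).
    """
    bins = Counter(int(f // bin_width) for f in features)
    best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) * bin_width


def frames_near_mode(features: List[float], bin_width: float,
                     max_dist: float) -> List[int]:
    """Indices of frames whose feature lies within max_dist of the mode value."""
    m = mode_value(features, bin_width)
    return [i for i, f in enumerate(features) if abs(f - m) <= max_dist]


# Per-frame features from the portion of frames after the first segment.
features = [2.1, 2.4, 2.6, 7.3, 2.2, 9.8]
m = mode_value(features, bin_width=1.0)
second = frames_near_mode(features, bin_width=1.0, max_dist=0.5)
```

Here the frames at indices 0, 1, 2, and 4 fall in the dominant bin and would be treated as the second speech segment, while the two outlying frames are excluded.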

Regarding claim 8, Lyren teaches the non-transitory computer-readable storage medium according to claim 7, wherein 
the specifying of the mode value includes 
obtaining the mode value of a similarity between the first acoustic feature and the second acoustic feature, and obtaining a threshold corresponding to the obtained mode value, and the detecting of the one or more second frames includes detecting the one or more second frames by using the obtained threshold (col 6 l. 42-50: executes segmentation and then clustering. This technique splits the audio into successive clusters and merges redundant clusters until each cluster corresponds to a speaker. For example, the technique divides the audio input into a number of segments and then iteratively chooses clusters that closely match to repeatedly reduce an overall number of clusters. Clusters can be modeled with GMM in which a distance metric identifies closest clusters. The process repeats until each speaker has one cluster.).
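
For illustration only, the segmentation-then-clustering technique described in the cited passage of Lyren (col 6 l. 42-50) can be sketched as a bottom-up merge. This is a minimal sketch under stated assumptions: each segment is reduced to a single scalar feature, clusters are compared by the distance between their centroids rather than by GMMs as Lyren mentions, and merging stops when the closest pair is farther apart than a hypothetical `stop_dist`. The name `merge_clusters` is also hypothetical.

```python
from typing import List


def merge_clusters(segment_features: List[float],
                   stop_dist: float) -> List[List[int]]:
    """Start with one cluster per segment and repeatedly merge the two clusters
    with the closest centroids until the closest pair exceeds stop_dist.

    Returns clusters as lists of segment indices.
    """
    clusters = [[i] for i in range(len(segment_features))]

    def centroid(cluster: List[int]) -> float:
        return sum(segment_features[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_dist:
            break  # remaining clusters are treated as distinct speakers
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Five segments: three near 1.0 and two near 5.0.
speakers = merge_clusters([1.0, 1.2, 5.0, 5.1, 0.9], stop_dist=1.0)
```

With these inputs the merge ends with two clusters, one per assumed speaker, which mirrors the statement that the process repeats until each speaker has one cluster.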


Regarding claim 9, Lyren teaches a detection method implemented by a computer, the detection method comprising: 
acquiring voice information containing voices of a plurality of speakers; 
dividing the voice information into a plurality of frames, each of the plurality of frames having a predetermined time length; 
detecting, based on a first acoustic feature of a first speaker among the plurality of speakers, one or more first frames from among the plurality of frames, the one or more first frames corresponding to a first speech segment in which a voice of the first speaker is included, the first acoustic feature being an acoustic feature obtained by performing a machine learning on second voice information containing a voice of the first speaker; and 
in response to the detecting of the one or more first frames, detecting, based on a second acoustic feature of a second speaker among the plurality of speakers, one or more second frames from among a portion of the plurality of frames, the one or more second frames corresponding to a second speech segment in which a voice of the second speaker is included, the second acoustic feature being an acoustic feature obtained from the portion of the plurality of frames, the portion of the plurality of frames corresponding to a time range after the detected first speech segment and being limited to a time length adjusted based on a time length of the detected first speech segment.  
Claim 9 recites limitations similar to those of claim 1 and is rejected under the same rationale and reasoning.


Regarding claim 10, Lyren teaches a detection apparatus comprising: a memory; and a processor coupled to the memory, the processor being configured to 
acquire voice information containing voices of a plurality of speakers, 
divide the voice information into a plurality of frames, each of the plurality of frames having a predetermined time length; 
detect, based on a first acoustic feature of a first speaker among the plurality of speakers, one or more first frames from among the plurality of frames, the one or more first frames corresponding to a first speech segment in which a voice of the first speaker is included, the first acoustic feature being an acoustic feature obtained by performing a machine learning on second voice information containing a voice of the first speaker, and 
in response to the detecting of the one or more first frames, detect, based on a second acoustic feature of a second speaker among the plurality of speakers, one or more second frames from among a portion of the plurality of frames, the one or more second frames corresponding to a second speech segment in which a voice of the second speaker is included, the second acoustic feature being an acoustic feature obtained from the portion of the plurality of frames, the portion of the plurality of frames corresponding to a time range after the detected first speech segment and being limited to a time length adjusted based on a time length of the detected first speech segment.
Claim 10 recites limitations similar to those of claim 1 and is rejected under the same rationale and reasoning.



Claim Rejections - 35 USC § 103
7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


8.	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Lyren in view of Huang et al (2005/0027515).

Regarding claim 4, Lyren teaches that multiple different voices or speakers can be present in an audio input, such as multiple voices in a call, a movie or video (col 3 l. 56-59), and teaches speaker diarization that divides audio input into segments according to speaker identity (col 5 l. 35-36),
but does not specifically teach the following limitation, which Huang teaches:
The non-transitory computer-readable storage medium according to claim 1, the processing including acquiring any of video information on a face or a phonatory organ of the first speaker and vibration information on the phonatory organ, wherein the detecting of the one or more first frames is configured to detect the one or more first frames as the first speech segment by using any of the video information and the vibration information
 (abstract: The speech sensor signal is generated based on an action undertaken by a speaker during speech, such as facial movement, bone vibration, throat vibration, throat impedance changes, etc. A speech detector component receives an input from the speech sensor and outputs a speech detection signal indicative of whether a user is speaking. The speech detector generates the speech detection signal based on the microphone signal and the speech sensor signal.). 
Lyren teaches that the audio diarization system segments audio input into speech and non-speech segments (abstract). 
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the speech sensor of Huang into the system of Lyren, so as to receive an input from the speech sensor and output a speech detection signal indicative of whether a user is speaking. 
Lyren already teaches speech detection and speaker diarization, and the incorporation of Huang would allow the use of additional sensors to better determine when speech is being spoken, yielding the predictable results of still segmenting the audio input into speech and non-speech segments (Lyren abstract) and determining who spoke, when they spoke, and when they did not speak (Lyren col 5 l. 38-39).



9.	Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Lyren in view of Olguin Olguin et al (9,443,521).

Claim 5 recites The non-transitory computer-readable storage medium according to claim 1, the processing further comprising: 
in response to performing the detecting of the one or more first frames a plurality of times, calculating an average segment length of a plurality of first speech segments obtained by performing the detecting of the one or more first frames the plurality of times;  and  
adjusting, based on the average segment length, the time range to be used to the detecting of the one or more second frames;
where Lyren teaches detecting the one or more first frames a plurality of times (col 5 l. 33-42) and determining temporal locations at speaker turns and segment boundaries (col 6 l. 57-64; col 8 l. 50-64).
Lyren does not specifically teach:
in response to performing the detecting of the one or more first frames a plurality of times, calculating an average segment length of a plurality of first speech segments obtained by performing the detecting of the one or more first frames the plurality of times; and  
adjusting, based on the average segment length, the time range to be used to the detecting of the one or more second frames.
Olguin Olguin teaches an average time between two turns of the current speaker (col 2 l. 35-43) and an average speaking segment length (col 11 l. 57).

Lyren already teaches determining time points where speakers change, and speech segments of particular speakers.  It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate an average segment length (as taught by Olguin) for the speech segments of a particular speaker.  This would allow the system of Lyren to determine separate time ranges in which a separate voice or speaker may be present, thereby improving the diarization of Lyren.
Therefore, the combination of Lyren and Olguin teaches:
in response to performing the detecting of the one or more first frames a plurality of times, calculating an average segment length of a plurality of first speech segments obtained by performing the detecting of the one or more first frames the plurality of times;  and  
adjusting, based on the average segment length, the time range to be used to the detecting of the one or more second frames.


Regarding claim 6, Lyren and Olguin teach the non-transitory computer-readable storage medium according to claim 5, wherein 
the adjusting of the time range includes 
increasing the time range to be used to the detecting of the one or more second frames adjacent to a corresponding first speech segment among the plurality of first segments, when the corresponding first speech segment is shorter than the average segment length; and 
reducing the time range to be used to the detecting of the one or more second frames adjacent to the corresponding first speech segment, when the corresponding first speech segment is equal to or longer than the average segment length.
Claim 6 is rejected under the same rationale and reasoning as claim 5, where Lyren already teaches determining time points where speakers change, and speech segments of particular speakers.  It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate an average segment length (as taught by Olguin) for the speech segments of a particular speaker.  This would allow the system of Lyren to determine separate time ranges (by increasing or reducing the time range based on the first segment) in which a separate voice or speaker may be present, thereby improving the diarization of Lyren.
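
For illustration only, the adjustment recited in claims 5 and 6 can be sketched as below. This is a minimal sketch of the claim language under stated assumptions; it is not taken from Lyren, Olguin, or the application. The function names, the fixed adjustment amount `step`, and the floor of zero on the reduced range are hypothetical.

```python
from typing import List


def average_segment_length(first_segment_lengths: List[float]) -> float:
    """Average length of the first speech segments detected so far."""
    if not first_segment_lengths:
        raise ValueError("at least one detected first speech segment is required")
    return sum(first_segment_lengths) / len(first_segment_lengths)


def adjusted_time_range(base_range: float, segment_length: float,
                        average_length: float, step: float) -> float:
    """Time range used to look for the second speaker next to a first segment.

    The range is increased when the first segment is shorter than the average
    and reduced when it is equal to or longer than the average. The amount
    step and the floor at zero are assumptions made for this sketch.
    """
    if segment_length < average_length:
        return base_range + step
    return max(0.0, base_range - step)


lengths = [2.0, 4.0, 6.0]  # seconds, one entry per detected first segment
avg = average_segment_length(lengths)
ranges = [adjusted_time_range(10.0, s, avg, step=1.5) for s in lengths]
```

In this sketch only the 2.0 second segment, which is shorter than the 4.0 second average, receives the widened range; the other two segments receive the reduced range.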


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655