Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to the remarks filed 6/16/2022.
Response to Amendment
2.	Claims 1 and 9-10 have been amended.  The amended title has been accepted and the objection is overcome.
Response to Arguments
3.	Applicant's arguments have been considered but are moot in view of the new grounds of rejection, which are responsive to the amendments. The prior art, Chaudhuri, teaches segments (of a given time length) of separate speakers that are adjacent to each other (figs. 3, 4, 9; see art rejection below).
	The additional claims are rejected based on arguments presented above and art rejections below.
Claim Rejections - 35 USC § 102
4.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

5.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


6.	Claims 1-3 and 7-10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Chaudhuri et al. (2018/0174600).

Regarding claim 1 Chaudhuri et al. (2018/0174600) teaches A non-transitory computer-readable storage medium for storing a detection program which causes a processor to perform processing (abstract: computer implemented method for speech diarization; fig 1 voice detection; 16; 140: apparatus, processor), the processing comprising: 
acquiring voice information containing voices of a plurality of speakers (38: The voice detection subsystem 130 detects separate voices (i.e., those voices that belong to separate speakers/persons) in a video and indicates the temporal positions of these voices in the video.); 
detecting a first speech segment of a first speaker among the plurality of speakers included in the voice information based on a first acoustic feature of the first speaker, the first acoustic feature being obtained by performing a machine learning (figures 2, 3, 4, 9; paragraphs 42: groups detected voices together that are determined to belong to the same speaker. 79: machine learning model, which accepts as input the values of the features for a segment of the audio; [0087] For each speech segment, the voice feature encoding 234 process extracts the features from the speech segment; 112-114); and 
detecting a second speech segment of a second speaker among the plurality of speakers based on a second acoustic feature, the second acoustic feature being an acoustic feature included in the voice information associated with a predetermined time range, the predetermined time range being a time range that is adjacent to the first speech segment and that is limited to a given time length (figures 2, 3, 4, 9; paragraphs 42; 79; 87; 89: takes the embedding generated in the voice feature encoding 234 process and uses the voice clustering 236 process to determine which voices belong to the same speaker; 90: compares each embedding for each speech segment with the other embeddings for the other speech segments in the video. A clustering of two embeddings means that the voice clustering 236 process has determined that two embeddings are likely from speech segments belonging to the same speaker; 112-114; fig 9: temporal positions of separate voices; 108-110: segments, where the speech segments of different speakers are adjacent to each other and are a certain length of time).
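
For illustration only, the claimed "predetermined time range" that is adjacent to the first speech segment and limited to a given time length can be sketched as follows. The function name, the choice to look on both sides of the segment, and the clipping to the recording duration are assumptions made for this sketch; they are not taken from the claims or from Chaudhuri.

```python
def adjacent_windows(first_segment, window_length, total_duration):
    """Return the (start, end) windows immediately before and after a segment.

    first_segment  -- (start, end) of the first speaker's segment, in seconds
    window_length  -- the "given time length" limiting each window (assumed value)
    total_duration -- length of the recording, used only to clip the windows
    """
    start, end = first_segment
    # Window that ends where the first speech segment begins.
    before = (max(0.0, start - window_length), start)
    # Window that begins where the first speech segment ends.
    after = (end, min(total_duration, end + window_length))
    return before, after


# Example: a first speech segment from 5 s to 8 s in a 9 s recording.
print(adjacent_windows((5.0, 8.0), 2.0, 9.0))
```

A second speech segment would then be searched for only inside these windows rather than over the whole recording.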


Regarding claim 2 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 1, wherein the detecting of a first speech segment is configured to detect the first speech segment based on a similarity of the learned acoustic feature to an acoustic feature included in the voice information (figures 3-4; 85-87 speech segment, features; cluster; difference; 90; 112-114 – detecting similar speech segments).  

Regarding claim 3 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 1, causing the computer to execute the processing further comprising: updating the learned acoustic feature based on an acoustic feature of the first speech segment (figures 3-4; 85-87; 90; 112-114).  

Regarding claim 7 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 1, wherein the detecting of a second speech segment is configured to 
specify a mode value of the acoustic feature in a plurality of frames included in the predetermined time range outside the first speech segment (figures 2-4; 42; 85-87; 89-91; 112-114), and 
detect, as the second speech segment, the segment including the frame being close to the mode value (figures 2-4; 42; 85-87; 89-91; 112-114 – detecting second speech segments, with different characteristics than first segment).  

Regarding claim 8 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 1, wherein the detecting of a second speech segment is configured to 
obtain a mode value of a similarity of the first acoustic feature and the second acoustic feature (90; 91; 94; 112-114), 
obtain a threshold corresponding to the obtained mode value (90; 91; 94; 112-114), and 
detect the second speech segment by using the obtained threshold ([0090] The voice clustering 236 process compares each embedding for each speech segment with the other embeddings for the other speech segments in the video. A clustering of two embeddings means that the voice clustering 236 process has determined that two embeddings are likely from speech segments belonging to the same speaker.; 91; 94: threshold level; 112-114).  
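
For illustration only, the threshold-based comparison of segment embeddings cited above (paragraphs 90-94 of Chaudhuri) might be sketched as follows. The use of cosine similarity, the greedy grouping strategy, the function names, and the 0.8 threshold are all assumptions for this sketch; the reference is cited only for comparing embeddings against a threshold level.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.8):
    """Assign a speaker label to each segment embedding.

    A segment joins the most similar existing speaker if the similarity
    reaches the threshold; otherwise it starts a new speaker.
    """
    representatives = []  # first embedding seen for each speaker (assumption)
    labels = []
    for emb in embeddings:
        best_label, best_sim = None, threshold
        for label, rep in enumerate(representatives):
            sim = cosine_similarity(emb, rep)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is None:
            representatives.append(emb)
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels


# Two similar segments followed by a dissimilar one.
print(cluster_segments([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))  # [0, 0, 1]
```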


Regarding claim 9 Chaudhuri teaches A detection method implemented by a computer, the detection method comprising: 
acquiring voice information containing voices of a plurality of speakers; 
detecting a first speech segment of a first speaker among the plurality of speakers included in the voice information based on a first acoustic feature of the first speaker, the first acoustic feature being obtained by performing a machine learning; and 
detecting a second speech segment of a second speaker among the plurality of speakers based on a second acoustic feature, the second acoustic feature being an acoustic feature included in the voice information associated with a predetermined time range, the predetermined time range being a time range that is adjacent to the first speech segment and that is limited to a given time length.  
	Claim 9 recites limitations similar to those of claim 1 and is rejected under the same rationale and reasoning.

Regarding claim 10 Chaudhuri teaches A detection apparatus comprising: a memory; and a processor coupled to the memory, the processor being configured to 
acquire voice information containing voices of a plurality of speakers, 
detect a first speech segment of a first speaker among the plurality of speakers included in the voice information based on a first acoustic feature of the first speaker, the first acoustic feature being obtained by performing a machine learning, and 
detect a second speech segment of a second speaker among the plurality of speakers based on a second acoustic feature, the second acoustic feature being an acoustic feature included in the voice information associated with a predetermined time range, the predetermined time range being a time range that is adjacent to the first speech segment and that is limited to a given time length.
Claim 10 recites limitations similar to those of claim 1 and is rejected under the same rationale and reasoning.


Claim Rejections - 35 USC § 103
7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8.	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri et al (2018/0174600) in view of Huang et al (2005/0027515).

Regarding claim 4 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 1, wherein any of video information on a face or a phonatory organ of the first speaker and vibration information on the phonatory organ is acquired (abstract: separate faces in a video using face detection), and 
the detecting of a first speech segment is configured to detect the first speech segment by using any of the video information [and the vibration information] (abstract; 30 face detection; 38);
but does not specifically teach the phonatory organ and the vibration information. Huang teaches the phonatory organ and the vibration information (abstract: The speech sensor signal is generated based on an action undertaken by a speaker during speech, such as facial movement, bone vibration, throat vibration, throat impedance changes, etc. A speech detector component receives an input from the speech sensor and outputs a speech detection signal indicative of whether a user is speaking. The speech detector generates the speech detection signal based on the microphone signal and the speech sensor signal.). 
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate the teachings of Huang into Chaudhuri for an improved system that allows for an additional indication of speech. 


9.	Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhuri et al (2018/0174600) in view of Olguin Olguin et al (9,443,521).


Claim 5 recites The non-transitory computer-readable storage medium according to claim 1, the processing further comprising: 
calculating an average value of time intervals each ranging from a point of detection of the first speech segment to a point of detection of a subsequent first speech segment in the detecting a first speech segment; and 
setting the predetermined time range based on the average value; 
where Chaudhuri teaches the predetermined time range (fig 9).
Regarding the predetermined time range, it appears according to claims 1 and 5 that it is the time range when the first speech segment is not occurring (any time that is NOT the first speech segment; "a time range outside the first speech segment" (claim 1)).  Chaudhuri teaches diarization that determines the specific speaker at different time intervals.  Chaudhuri teaches determining temporal positions of separate voices (fig 9).
However, Chaudhuri does not specifically teach:
calculating an average value of time intervals each ranging from a point of detection of the first speech segment to a point of detection of a subsequent first speech segment in the detecting a first speech segment; and 
setting the predetermined time range based on the average value. 
Olguin Olguin teaches: average time between two turns of current speaker (col 2 l. 35-43).

It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate an average time interval for an improved system to determine separate time spaces where there may be a separate voice or speaker.
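
For illustration only, the limitation of claim 5 (averaging the intervals between successive detections of the first speech segment and setting the predetermined time range from that average) can be sketched as follows. The function names and the `scale` factor relating the average to the time range are assumptions; neither the claim nor the cited references specify how the range is derived from the average.

```python
def average_onset_interval(onsets):
    """Mean time between consecutive first-speaker segment onsets (seconds)."""
    if len(onsets) < 2:
        raise ValueError("need at least two onsets to form an interval")
    intervals = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return sum(intervals) / len(intervals)


def set_time_range(onsets, scale=1.0):
    """Set the predetermined time range as a multiple of the average interval.

    `scale` is a hypothetical tuning parameter introduced for this sketch.
    """
    return scale * average_onset_interval(onsets)


# Onsets at 0 s, 4 s, 10 s and 12 s give intervals of 4 s, 6 s and 2 s.
print(average_onset_interval([0.0, 4.0, 10.0, 12.0]))  # 4.0
print(set_time_range([0.0, 4.0, 10.0, 12.0], scale=0.5))  # 2.0
```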


Regarding claim 6 Chaudhuri teaches The non-transitory computer-readable storage medium according to claim 5, the processing further comprising: 
calculating a[n average] segment length of a plurality of the first speech segments (fig 4; 9; 74; 95; 135 temporal positions of separate voices); 
increasing the predetermined time range when the corresponding first speech segment is shorter than the [average] segment length (fig 4; 9; 74; 95; 135); and 
reducing the predetermined time range when the corresponding first speech segment is equal to or longer than the [average] segment length (fig 4; 9; 74; 95; 135). 
Regarding claim 6, as discussed with regard to claim 5, Chaudhuri teaches the predetermined time range (a time range outside of the first speech segment).
The first speech segment has a segment length, and based on that segment length, the predetermined time range is determined (all other times outside of the first speech segment).
Thus, for a shorter first speech segment, there will naturally be more time for additional speech or silence, i.e., a larger (increased) predetermined time range;
and for a longer first speech segment, there will be less time for the other segments.

However, Chaudhuri does not specifically teach the following, which Olguin Olguin teaches:
calculating an average segment length of a plurality of the first speech segments (col 11 l. 57: average speaking segment length). 
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate an average segment length for an improved system to determine separate time spaces where there may be a separate voice or speaker, allowing for:
increasing the predetermined time range when the corresponding first speech segment is shorter than the average segment length; and 
reducing the predetermined time range when the corresponding first speech segment is equal to or longer than the average segment length.
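
For illustration only, the adjustment recited in claim 6 can be sketched as follows. The function name, the fixed `step` size, and the clamp at zero are assumptions for this sketch; the claim recites only that the range is increased for a shorter-than-average segment and reduced otherwise.

```python
def adjust_time_range(base_range, segment_lengths, current_length, step=0.5):
    """Widen or narrow the predetermined time range for one first speech segment.

    base_range      -- current predetermined time range, in seconds
    segment_lengths -- lengths of the detected first speech segments
    current_length  -- length of the segment whose adjacent range is being set
    step            -- hypothetical adjustment amount, in seconds
    """
    average_length = sum(segment_lengths) / len(segment_lengths)
    if current_length < average_length:
        # Shorter than average: increase the range.
        return base_range + step
    # Equal to or longer than average: reduce the range, never below zero.
    return max(0.0, base_range - step)


# Average segment length is 2.0 s for segments of 1.0 s and 3.0 s.
print(adjust_time_range(2.0, [1.0, 3.0], 1.0))  # 2.5
print(adjust_time_range(2.0, [1.0, 3.0], 3.0))  # 1.5
```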


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655