Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-9 are pending. Claims 1, 7, and 9 are independent.  The independent Claims and Claims 2-3, 6 and 8 are amended.
This Application was published as U.S. 2021/0327446.
Apparent priority:  10 March 2020.
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/26/2022 has been entered.
Response to Amendments and Arguments
Applicant’s arguments are directed to the material added by amendment, which is addressed by a newly added reference.
Applicant has amended the independent Claims, including Claim 1, to recite “wherein the boundary between tokens is based on morpheme feature.”  A reference is added to address this limitation, and a review of the prior art indicated that Far Eastern languages (e.g., Korean, Chinese) tend to rely on morpheme boundaries.  See the Conclusion section.
1. A voice conversation reconstruction method performed by a voice conversation reconstruction apparatus, 
the method comprising: 
acquiring a plurality of speaker-specific voice recognition data corresponding to a plurality of speakers about voice conversation; 
dividing each of the plurality of the speaker-specific voice recognition data into a plurality of blocks using a boundary between tokens depending upon a predefined division criterion; 
arranging the plurality of blocks of each of the plurality of the speaker-specific voice recognition data in chronological order irrespective of a speaker; 
merging blocks from continuous utterance of the same speaker among the arranged plurality of blocks; and 
reconstructing the plurality of blocks subjected to the merging in a conversation format in chronological order and based on a speaker, 
wherein the merging blocks determine continuous utterance of a same speaker based on silent section of a predetermined time duration between previous block and current block, and
wherein the boundary between tokens is based on morpheme feature.
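
For illustration only, the steps recited in Claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the Specification or from any cited reference: the names Token, Block, divide, arrange, merge and reconstruct, the thresholds min_silence and max_silence, and the sample data are all hypothetical. Tokens are assumed to arrive already segmented (e.g., as morphemes or words) with time stamps, and the division criterion used here is the silent-period alternative rather than a morpheme feature.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str
    start: float  # seconds (hypothetical time stamps)
    end: float


@dataclass
class Block:
    speaker: str
    tokens: List[Token]

    @property
    def start(self) -> float:
        return self.tokens[0].start

    @property
    def end(self) -> float:
        return self.tokens[-1].end

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)


def divide(speaker: str, tokens: List[Token], min_silence: float) -> List[Block]:
    """Cut one speaker's tokens at token boundaries whose gap is >= min_silence."""
    blocks: List[Block] = []
    for tok in tokens:
        if blocks and tok.start - blocks[-1].end < min_silence:
            blocks[-1].tokens.append(tok)
        else:
            blocks.append(Block(speaker, [tok]))
    return blocks


def arrange(blocks: List[Block]) -> List[Block]:
    """Order all blocks by start time, irrespective of speaker."""
    return sorted(blocks, key=lambda b: b.start)


def merge(blocks: List[Block], max_silence: float) -> List[Block]:
    """Join adjacent blocks of the same speaker separated by <= max_silence."""
    merged: List[Block] = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if prev is not None and prev.speaker == b.speaker and b.start - prev.end <= max_silence:
            prev.tokens.extend(b.tokens)
        else:
            merged.append(Block(b.speaker, list(b.tokens)))
    return merged


def reconstruct(data: Dict[str, List[Token]],
                min_silence: float = 0.5,
                max_silence: float = 1.0) -> List[Tuple[str, str]]:
    """Return (speaker, text) turns in chronological order."""
    blocks = [blk for spk, toks in data.items() for blk in divide(spk, toks, min_silence)]
    return [(b.speaker, b.text) for b in merge(arrange(blocks), max_silence)]


# Hypothetical sample: speaker B interjects between two utterances of speaker A.
sample = {
    "A": [Token("hello", 0.0, 0.4), Token("everyone", 0.4, 1.0),
          Token("lets", 1.6, 1.9), Token("begin", 1.9, 2.4),
          Token("thanks", 5.0, 5.5)],
    "B": [Token("ok", 3.0, 3.3)],
}
print(reconstruct(sample))
```

In this sample, speaker A's first two blocks are merged because only 0.6 s of silence separates them, while the third block of A stays separate because the block of speaker B intervenes in the chronological order.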

This Disclosure pertains to diarization, which separates the speech of several participants in a conversation and reflects in the transcript which portion was spoken by whom.  Diarization itself is neither novel nor nonobvious.  The instant Application appears to hint at a modified diarization method in which the transcribed segments of speech from the same speaker are kept together even when there is an interruption/barge-in from a second (or other) speaker participating in the conversation.  In other words, the Disclosure purports to deviate from a standard diarization method by ignoring or moving some portions of the speech of the second participant in the interest of keeping the segments of speech of the first participant together.  However, keeping the speech of a single speaker together despite a barge-in/interruption is discussed in Jung in Figure 3 and at [0047]-[0048] and [0054], as applied to Claims 1 and 2.

Claim 2
Note also that dependent Claim 2 is amended as follows:
2. The method of claim 1, wherein acquiring the speaker-specific voice recognition data includes:
acquiring a first recognition result for each speaker generated on an EPD (End Point Detection) basis from the voice conversation and a second recognition result for each speaker generated every preset time from the voice conversation; and 
collecting the first recognition result and the second recognition result to generate the speaker-specific voice recognition data.
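
For illustration only, the collecting step of Claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the Applicant's implementation: the names Result and collect and the sample data are hypothetical, and it is assumed that the results generated every preset time are cumulative, so that the latest one supersedes earlier ones for the same open utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    text: str
    start: float  # seconds (hypothetical time stamps)
    end: float


def collect(epd_results: List[Result], timed_results: List[Result]) -> List[Result]:
    """Combine EPD-based results with the newest preset-time result.

    Only preset-time results that begin at or after the last EPD are kept,
    so the combined data has no overlap with text already finalized by EPD.
    """
    ordered = sorted(epd_results, key=lambda r: r.start)
    last_epd = ordered[-1].end if ordered else float("-inf")
    tail = [r for r in timed_results if r.start >= last_epd]
    if tail:
        # Assumption: partial results are cumulative, so keep the longest one.
        ordered.append(max(tail, key=lambda r: r.end))
    return ordered


# Hypothetical sample for one speaker.
epd = [Result("good morning", 0.0, 1.2)]
timed = [
    Result("good", 0.0, 0.5),                 # already covered by the EPD result
    Result("the agenda", 1.5, 2.5),           # partial result after the last EPD
    Result("the agenda today is", 1.5, 3.5),  # later partial result, supersedes the one above
]
print([r.text for r in collect(epd, timed)])
```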

Applicant argues that the intent of Claim 2 is to collect and patch together the portions/partial results of speech by the same speaker.
First, Jung was applied to all of Claim 2 and to most of Claim 1 for its teaching of diarization, yet Applicant addresses only Kahn.  Response 6.
Second, note that the previous mapping of Jung is still applicable because the Examiner considered this interpretation.  Figure 3 of Jung and its cited written description describe how portions of speech of a same speaker (first speaker) are placed/merged together despite a momentary barge-in/interruption from the other speakers and presented on the screen.  Therefore, Jung teaches Claim 2 even in view of the amendments and arguments.  Figure 3 of Jung is repeated here:

    [Image: media_image1.png, Jung, Figure 3]

Third, the claim language may still be made clearer by reciting a first and a second speaker, consistent with the arguments:
Suggested. The method of claim 1, wherein acquiring the speaker-specific voice recognition data includes:
acquiring a first recognition result for [[each]] a first speaker generated on an EPD (End Point Detection) basis from the voice conversation and a second recognition result for [[each]] the first speaker generated every preset time from the voice conversation; and 
collecting the first recognition result and the second recognition result to generate the speaker-specific voice recognition data for the first speaker,
wherein the first speaker is in conversation with at least one second speaker.

Note the Specification corresponding to the Arguments:
[0041] FIG. 2 is a flowchart for illustration of a voice conversation reconstruction method according to one embodiment. FIG. 3 is a flowchart illustrating a process of acquiring the voice recognition data per speaker in the voice conversation reconstruction method according to one embodiment. FIG. 4 is a diagram illustrating a result of the voice conversation reconstruction using the voice conversation reconstruction apparatus according to one embodiment.
[0042] Hereinafter, the voice conversation reconstruction method performed by the voice conversation reconstruction apparatus 100 according to one embodiment of the present disclosure will be described in detail with reference to FIG. 1 to FIG. 4.
[0043] First, the input unit 110 individually receives voice data about the voice conversation per speaker, and provides the received speaker-specific voice data to the processor 120.
[0044] Then, the speaker-specific data processor 121 of the processor 120 acquires the speaker-specific voice recognition data about the voice conversation. For example, the ASR included in the speaker-specific data processor 121 may remove noise via a preprocessing process of the speaker-specific voice data input through the input unit 110a and may extract the character string therefrom to obtain the speaker-specific voice recognition data composed of the character string S210.
[0045] In connection therewith, the speaker-specific data processor 121 may apply a plurality of timings at which the recognition result is generated in obtaining the speaker-specific voice recognition data. The speaker-specific data processor 121 generates the first speaker-specific recognition result about the voice conversation on the EPD basis. In addition, the speaker-specific data processor 121 generates the second speaker-specific recognition result every preset time after the last EPD at which the first speaker-specific recognition result is generated occurs S211. In addition, the speaker-specific data processor 121 collects the first speaker-specific recognition result and the second speaker-specific recognition result per speaker without overlap and redundance therebetween, and finally generates the speaker-specific voice recognition data (S212).
[0046] The speaker-specific voice recognition data acquired by the speaker-specific data processor 121 may be reconstructed into a conversation format later using the conversation reconstructor 125. However, in reconstruction of the data into the conversation format having a text format other than the voice, a situation may occur in which a second speaker interjects during a first speaker's speech. When trying to present this situation in the text format, the apparatus has to determine a point corresponding to the second speaker utterance. For example, the apparatus may divide the entire conversation duration into the data of all speakers based on the silence section, then collect the data of all speakers and arrange the data in chronological order. In this case, when text is additionally recognized around the EPD, a length of the text may be added to the screen at once. Thus, the position in text the user is reading may be disturbed or the construction of the conversation may change. Further, in connection therewith, when a construction unit of the conversation is natural, the context of the conversation is damaged. For example, when the second speaker utters “OK” during the continuous speech from the first speaker, the “OK” may not be expressed in the actual context and may be attached to an end portion of the continuous long word from the first speaker. Further, in connection therewith, in terms of the real time response, the recognition result may not be identified on the screen until EPD occurs even though the speaker is speaking and recognizing the speech. Rather, despite the first speaker speaking first, the word from the second speaker later is short and thus terminates before the speech from the first speaker terminates. Thus, a situation may occur where there is no word from the first speaker on the screen, but only the words from the second speaker are displayed on the screen. 
In order to cope with these various situations, the voice conversation reconstruction apparatus 100 according to one embodiment may execute the block generation process by the block generator 122, the arrangement process by the block arranger 123, and the merging process by the block merger 124. The block generation process and the arrangement process serve to insert the words of another speaker between the words of one speaker to satisfy an original conversation flow. The merging process is intended to prevent a sentence constituting the conversation from being divided into excessively short portions due to generation of blocks as performed for the insertion.
[0047] The block generator 122 of the processor 120 divides the speaker-specific voice recognition data acquired by the speaker-specific data processor 121 into a plurality of blocks according to the predefined division criterion, for example, using a boundary between tokens (words/phrases/morphemes) and may provide the plurality of blocks to the block arranger 122 of the processor 120. For example, the predefined division criterion may be a silent period longer than or equal to a predetermined time duration or a morpheme feature (for example, between words) related to the previous token. The block generator 122 may divide the speaker-specific voice recognition data into a plurality of blocks using the silent section of the predetermined time or longer or the morpheme feature related to the previous token as the division criterion (S220).
[0048] Subsequently, the block arranger 123 of the processor 120 arranges the plurality of blocks generated by the block generator 122 in chronological order irrespective of the speaker and provides the arranged blocks to the block merger 124 of the processor 120. For example, the block arranger 123 may use a start time of each block as the arrangement criterion, or may use a middle time of each block as the arrangement criterion (S230).
[0049] Then, the block merger 124 of the processor 120 may merge blocks from the continuous utterance of the same speaker among the plurality of blocks arranged by the block arranger 123, and may provide the speaker-specific voice recognition data as the results of the block merging to the conversation reconstructor 125. For example, the block merger 124 may determine the continuous utterance of the same speaker based on the silent section of a predetermined time duration or smaller between the previous block and the current block or the syntax feature between the previous block and the current block (for example, when the previous block is an end portion of a sentence) (S240).
[0050] Next, the conversation reconstructor 125 of the processor 120 reconstructs the plurality of blocks as the merging result by the block merger 124 in the conversation format in the chronological order and based on the speaker, and provides the reconstructed voice recognition data to the output unit 130 (S250).
[0051] Then, the output unit 130 outputs the processing result from the processor 120. For example, the output unit 130 may output the converted data provided from the processor 120 to another electronic device connected to the output interface under the control of the processor 120. Alternatively, the output unit 130 may transmit the converted data provided from the processor 120 through the network under the control of the processor 120. Alternatively, the output unit 130 may display the processing result by the processor 120 on the screen of the display apparatus as shown in FIG. 4. As shown in an example of FIG. 4, the output unit 130 may display the voice recognition data about the voice conversation as reconstructed in a conversation format using the conversation reconstructor 125 on the screen in chronological order and based on the speaker. In connection therewith, when updating and outputting the reconstructed voice recognition data, the output unit 130 may update and output a screen reflecting the first speaker-specific recognition result generated in step S211. That is, in step S250, the conversation reconstructor 125 provides the voice recognition data reflecting the first speaker-specific recognition result to the output unit 130 (S260).

    [Image: media_image2.png]


Drawings
The drawings are objected to.
Figure 3, box S212 includes an incomplete sentence:  “Collecting first per-speaker recognition result and second per-speaker recognition result without.”  This was probably intended to read “without overlap and redundancy.”  Alternatively, the word “without” can be removed, which yields a complete sentence but a flowchart step with no point.  The best wording would probably be: “Merging first per-speaker recognition result and further per-speaker recognition result without overlap and redundancy.”  This would require changing S211 as well.
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: “input unit” in Claim 7.   Refer to the previous Office actions for explanation.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-9 are rejected under 35 U.S.C. 103 as being unpatentable over Jung (U.S. 20190392837) in view of Kahn (U.S. 20060149558) and further in view of Abe (U.S. 20090265166).
Jung:

    [Image: media_image3.png, Jung]

    [Image: media_image4.png, Jung]

Regarding Claim 1, Jung teaches:
1. A voice conversation reconstruction method performed by a voice conversation reconstruction apparatus, [Jung, “Examples described herein improve the way in which a transcript is generated and displayed so that the context of a conversation taking place during a meeting or another type of collaboration event can be understood by a person that reviews the transcript (e.g., reads or browses through the transcript)….”  Abstract.]
the method comprising: 
acquiring a plurality of speaker-specific voice recognition data corresponding to a plurality of speakers about voice conversation; [Jung, Figure 4, “Voice Recognition Profiles 412” teach the “speaker-specific voice recognition data” of the Claim. Figure 1, “Voice Recognition Module 126” is a speaker identification module.   “[0035] The voice recognition module 126 is configured to receive the meeting speech data 120 from the image capture device 116 and to recognize a voice that speaks an utterance. Thus, the voice recognition module 126 matches a voice with a voice recognition profile to identify a user that spoke. …”  “[0017] FIG. 4 is a diagram illustrating components of an example device configured to receive speech data, match a voice with a voice recognition profile, convert the speech data to text, and segment the text to generate a transcript that captures the context of a conversation.”]
dividing each of the plurality of the speaker-specific voice recognition data into a plurality of blocks using a boundary between tokens depending upon a predefined division criterion; [Jung, Figure 1, “Transcript Viewing Application 136” and Figure 2 showing the “sequence of text segments 138” separated by “User IDs 140” of the users (Lisa, Joe, Beth, Lisa again …) who spoke each segment.  “[0043] As shown, the graphical user interface 200 provides separation between individual text segments so that a viewer can better associate the text segment with a user that spoke the words. Furthermore, the user identifiers can include one or more graphical elements useable to enable the viewer to identify a user and/or gather information about the user. A graphical element can include a user name, a user alias, a user avatar, a user photo, a title, a user location, and so forth…”  The reference divides the recognized text/“recognition data” of the Claim into “text segments 204, 206, etc.”/“plurality of blocks.”  The “predefined division criterion” of the Claim is taught by the change of speaker and voice.]
arranging the plurality of blocks of each of the plurality of the speaker-specific voice recognition data in chronological order irrespective of a speaker; [Jung, Figure 2 shows the arrangement of the text blocks in chronological order as the corresponding speech was output.  “Sequence of text segments 138.”  “[0037] … As shown via the second area 140 of the graphical user interface, the identifier <UserA> is graphically level with the first <text segment> listed in the first area 138 and thus a viewer of the transcript can deduce that UserA spoke the first text segment listed in the first area 138, <UserB> is graphically level with the second <text segment> listed in the first area 138 and thus the viewer can deduce that UserB spoke the second text segment listed, <UserC> is graphically level with the third <text segment> listed in the first area 138 and thus the viewer can deduce that UserC spoke the third text segment listed, <UserD> is level with the fourth <text segment> listed in the first area 138 and thus the viewer can deduce that UserD spoke the fourth text segment listed, and so forth.”]
merging blocks from continuous utterance of the same speaker among the arranged plurality of blocks; and [Jung, Figures 5B and 7 and flowcharts of Figures 8-9 show the “re-positioning”/reorganizing of the text segments according to speaker or subject/topic.  Figure 5B and Figure 8 teach this limitation.  In Figure 5B only the text segments spoken by Lisa R. are shown in a sequence.  Figure 8, 808.  “[0086] At operation 808, a transcript of the conversation or meeting is generated using the text. As described above, the transcript includes a sequence of text segments and an individual text segment in the sequence of text segments includes an utterance spoken by a single user of the multiple users.”]
reconstructing the plurality of blocks subjected to the merging in a conversation format in chronological order and based on a speaker, [Jung, Figure 5B. “[0071] Upon receiving further user input that specifies a user identifier and/or keyword(s), the transcript generation module 428 is configured to search for and identify text segments in the sales meeting transcript 202 that include the user identifier and/or the keyword(s) specified by the user input. As shown in the example graphical user interface 508 of FIG. 5B, the sales meeting transcript 202 is filtered so that the identified text segments are displayed. In this example, a person reviewing the sales meeting transcript enters "Lisa R." into the text entry window 504 or selects the user identifier corresponding to Lisa R. in the user identifier selection area 506. In response, the text segments that capture utterances spoken by Lisa R. are configured for display and/or to be scrolled through. These text segments include text segments 204 and 210 from FIG. 2, but also text segments 510, 512, 514, and 516. Note that text segments 510, 512, 204, 210, 514, and 516 are displayed in an order in which the utterances are spoken by Lisa R.”]
wherein the merging blocks determine continuous utterance of a same speaker based on silent section of a predetermined time duration between previous block and current block, and
wherein the boundary between tokens is based on morpheme feature.
Jung refers to pauses between the segments of speech of the same speaker.  “[0005] …  In one example, the interruption can be associated with an interjection of words that causes the user to pause for a short period of time (e.g., a few seconds) after the first set of words are spoken and before speaking the second set of words. The user may pause to listen to the words being spoken by the other user….”  See also [0046] and [0048] and Figure 3 of Jung.  Jung teaches that its system is capable of combining the segments of speech of a same speaker irrespective of the interruptions by another participant in the conversation where this interruption may have been instigated by momentary pauses in the speech of the first speaker.
Jung does not teach the pauses/periods of silence or their duration as indicators of end of speech by a speaker.
Kahn teaches:
wherein the merging blocks determine continuous utterance of a same speaker based on silent section of a predetermined time duration between previous block and current block, and [Kahn teaches that the voice model of a particular speaker is developed partly by determining the duration of silence that is characteristic of the particular speaker’s speech and is not considered an end-point or a boundary between one sentence and another.  For each speaker this characteristic silence period is “predetermined.”   “[0093] Techniques are disclosed for user-dependent data generation of segmentation modules for speech that may be used for manual or automatic processing. In one approach, speech analysis may be used to determine typical silence length separating a speaker's utterances. ….”  “[0090] The disclosure also teaches methods for processing speech from two or more speakers at a meeting or legal proceeding. The techniques utilize the ability of the session file editor to change the boundaries of audio segments, align the correct audio to a given speaker, and create verbatim text to create a speech user profile for the group as a whole and speaker-specific profiles.”  Figure 6, 601: Silence Detection used for segmentation:  “[0234] … As described in FIG. 6, the first step 601 within the speech segmentation module 600 is to determine whether the module uses silence detection or another method….”  “[0235] … To define the silence before or after an utterance, a silence threshold of consecutive frames may be necessary.”  See also Figure 3 for segmentation and Figure 8 for “speech user profile 812.”]

Jung and Kahn both pertain to segmentation of speech according to speaker.  Jung primarily uses speaker identification to determine segmentation and also relies on the duration of pauses as an auxiliary method of determining segmentation.  It would have been obvious to use, from Kahn, the speaker-characteristic and thus “predetermined” periods of silence/pause as an indication that the same speaker is still talking (even when interrupted by the speech of another participant) with the system of Jung, in order to determine that the two segments of speech need to be combined.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.  (Kahn:  “[0091] The disclosure teaches use of the multispeaker techniques primarily for situations where the same speakers, or a subset thereof, are repeatedly present, such as a lengthy legal proceeding, sequential meetings of a corporation's board of directors, or long-term surveillance of suspect individuals by national security or law enforcement groups….”)
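
For illustration only, a “silence threshold of consecutive frames” of the kind mentioned in Kahn at [0235] can be sketched as follows. This is a minimal sketch under stated assumptions and is not Kahn's implementation: the function name silent_runs, the per-frame energy representation, the parameters energy_floor and min_frames, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def silent_runs(frame_energy: List[float],
                energy_floor: float,
                min_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) frame ranges that count as silence.

    A range counts as silence only if at least min_frames consecutive
    frames have energy below energy_floor; shorter dips are ignored.
    """
    runs: List[Tuple[int, int]] = []
    run_start = None
    for i, energy in enumerate(frame_energy):
        if energy < energy_floor:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the signal.
    if run_start is not None and len(frame_energy) - run_start >= min_frames:
        runs.append((run_start, len(frame_energy)))
    return runs


# Hypothetical per-frame energies: a 2-frame dip (ignored) and a 4-frame pause (kept).
energies = [0.9, 0.8, 0.01, 0.02, 0.7, 0.01, 0.01, 0.01, 0.02, 0.6]
print(silent_runs(energies, energy_floor=0.1, min_frames=3))
```

A speaker-specific value of min_frames would correspond to the speaker-characteristic, “predetermined” silence duration discussed above.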

Neither Jung nor Kahn teaches the use of morphemes as the tokens of choice.
Abe is directed to “Boundary Estimation …” and teaches the use of morphemes as a suitable speech unit for determining boundaries between portions of speech:
dividing each of the plurality of the speaker-specific voice recognition data into a plurality of blocks using a boundary between tokens depending upon a predefined division criterion; [Abe, Figure 1, “boundary estimation units 102, 130,” Figure 5 showing the phoneme recognition result of input speech with an estimated boundary position, and Figures 10 and 11 showing the estimated boundary positions.]
…
wherein the boundary between tokens is based on morpheme feature. [Abe, Figure 2, the “Feature Acquisition Unit 112,” which is part of the “Analysis Speech Acquisition Unit 101,” uses “morphemes” as a “linguistic feature.”  [0034].  See also Figure 8, where the “speech recognition unit 251,” whose output is used for boundary estimation, uses morphemes as input:  “[0067] The speech recognition unit 251 performs the speech recognition to the input speech 14 to generate word information 21 showing a sequence of words included in a language text corresponding to the contents of the input speech 14, and thus to input the word information 21 to the boundary possibility calculation unit 253. Here, the word information 21 includes the notation information and the reading information of morpheme.”  See also:  “1. A boundary estimation apparatus, comprising …  a pattern generating unit configured to analyze at least one of acoustic feature and linguistic feature in an analysis interval around the second boundary of the second speech to generate a representative pattern showing representative characteristic in the analysis interval;…” and “6. … wherein the linguistic characteristic is at least one of notation information, reading information and part-of-speech information of morpheme obtained by performing a speech recognition processing to a speech.”]
Jung/Kahn and Abe pertain to segmentation of speech in order to detect boundaries of sentences and make the proper attribution and it would have been obvious to use morphemes which are the smallest meaningful units of speech/text as tokens of parsing when analyzing the input speech.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.  (Abe:  “[0006] For example, a speech recorded in a meeting, a lecture, and so on is separated for each predetermined meaning group (in units of meanings) such as sentences, clauses, or statements to be indexed, and thus to find the beginning of an intended position in the speech in accordance with the indexes, whereby it is possible to effectively listen to the speech. In order to perform such an indexing, a boundary separating a speech in units of the meaning is required to be estimated.”)
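
For illustration only, dividing recognized text at a boundary determined by a morpheme feature can be sketched as follows. This is a minimal sketch, assuming the recognizer output is already a sequence of (surface, tag) morpheme pairs; the function name, the tag names and the sample data are hypothetical and are not taken from Abe or from the Specification of the instant Application.

```python
from typing import Iterable, List, Set, Tuple


def split_on_morpheme_feature(morphemes: Iterable[Tuple[str, str]],
                              boundary_tags: Set[str]) -> List[List[str]]:
    """Close the current block after any morpheme whose tag is a boundary tag."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for surface, tag in morphemes:
        current.append(surface)
        if tag in boundary_tags:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing morphemes with no closing boundary
    return blocks


# Hypothetical tagged output; "SENT_END" is an assumed tag for a sentence-final morpheme.
tagged = [("we", "PRON"), ("agree", "VERB"), (".", "SENT_END"),
          ("next", "ADJ"), ("item", "NOUN")]
print(split_on_morpheme_feature(tagged, {"SENT_END"}))
```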

Regarding Claim 2, Jung teaches:
2. The method of claim 1, wherein acquiring the speaker-specific voice recognition data includes: 
acquiring a first recognition result for each speaker generated on an EPD (End Point Detection) basis from the voice conversation and a second recognition result for each speaker generated every preset time from the voice conversation; and [Jung, the End Point Detection basis in Jung is the change of speaker, which is detected by the speech recognizer based on speech profiles of the speakers.  Figure 3 shows the timeline on the left-hand side of the drawing and shows the “interrupting utterance 306” / “second speaker-specific recognition result.”  “[0047] …  Consequently, the transcript generation module 130 determines that utterance 306 is an interruption with regard to utterance 304 because two voices of two different people are detected and recognized during a same period of time, the time between t2 and t3.”  “[0048] Rather than generate a single flow text in which words of utterance 306 are interspersed with words of utterance 304 in a strictly time-based manner, the transcript generation module 130 separately identifies the words that comprise utterance 304 and the words that comprise utterance 306 using voice recognition profiles, and groups them into separate text segments to be displayed in the transcript. …”  (Note the description of “preset time” from the instant Application that follows.  According to the Specification of the instant Application, the “preset time” just means a time after the EPD of the first speaker:  “[0034] …For example, the speaker-specific data processor 121 may generate a first speaker-specific recognition result about the voice conversation on an EPD (End Point Detection) basis, and generate a second speaker-specific recognition result at each preset time. 
For example, the second speaker-specific recognition result may be generated after a last EPD at which the first speaker-specific recognition result is generated occurs….”  “[0045] …For example, the speaker-specific data processor 121 may generate a first speaker-specific recognition result about the voice conversation on an EPD (End Point Detection) basis, and generate a second speaker-specific recognition result at each preset time. For example, the second speaker-specific recognition result may be generated after a last EPD at which the first speaker-specific recognition result is generated occurs….”)]
collecting the first recognition result and the second recognition result to generate the speaker-specific voice recognition data. [Jung, Figure 1, 136 and Figure 2, 200 show the segmentation of the speech of the various speakers (first, second, third, etc.) and allocation of each segment to the corresponding speaker who spoke the segment.  If two speakers speak simultaneously such that their speech overlaps, the system of Jung separates them such that the end result is “without overlap and redundancy”:  “[0005] … In one example, the techniques combine a first set of words and a second set of words (e.g., a set can include one or more words), that are part of an utterance spoken by a user, into a single text segment. The techniques distinguish between the first set of words and the second set of words due to a detected interruption (e.g., the first set of words and the second set of words are separated by an interruption). For instance, the interruption can include a set of words spoken by another user. In one example, the interruption can be associated with an interjection of words that causes the user to pause for a short period of time (e.g., a few seconds) after the first set of words are spoken and before speaking the second set of words. The user may pause to listen to the words being spoken by the other user. In another example, the interruption can be associated with the other user beginning to speak his or her words at the same time the user is speaking the second set of words. Stated another way, the other user begins speaking before the user finishes speaking thereby resulting in an overlapping time period in which multiple people are speaking.”  “[0006] Consequently, the techniques described herein are configured to combine the first and second sets of words spoken by a single user into a single text segment even though there are intervening or overlapping words spoken by the other user. To this end, the first and second sets of words comprise an utterance spoken by the user and the single text segment can be placed in the sequence of text segments of the transcript before a subsequent text segment that captures the set of words spoken by the other user.”]

[Image: media_image1.png]


Regarding Claim 3, Jung teaches:
3. The method of claim 2, wherein the second recognition result is generated after a last EPD occurs. [Jung, Figure 2: the EPD is the change of speaker. After the first speaker, Lisa R, the EPD/change of speaker occurs, and then the second speaker, Joe S., is recognized.]

Regarding Claim 4, Jung teaches and therefore suggests:
4. The method of claim 1, wherein the predefined division criterion includes a silence period longer than or equal to a predetermined time duration or a morpheme feature related to a previous token. [Jung uses the change of speaker, as determined by the speaker-recognition based on speaker voice profiles, as its EPD and line of demarcation and is not looking for silence which is the more usual method of end-point detection in speech.  But Jung teaches that a “linguistic unit condition” may also be used to determine the interruption in the speech.  The “linguistic unit condition” teaches or at the least suggests the “morpheme feature related to a previous token” of the Claim.  See Figure 3, “[0054] In various examples, a determination that a first set of words and a second set of words are part of a same linguistic unit can be used as a condition when creating text segments, so words spoken by a single user in a short period of time (e.g., five seconds, ten seconds, etc.) are grouped together in a single text segment rather than being chopped up into multiple different text segments, given a situation where there is an interruption caused by another user speaking an utterance. A linguistic unit can comprise a phrase, a clause, a sentence, a paragraph, or another type of linguistic unit that can be understood on its own from a grammar perspective. A type of linguistic unit (e.g., a sentence) can be predefined for a text segment.”]
Jung teaches the use of “linguistic units” to determine that separate parts of the utterance pertain to the same segment.  This teaching suggests the use of “morphemes,” which are parts of a word, as Claimed.
Kahn teaches:
wherein the predefined division criterion includes a silence period longer than or equal to a predetermined time duration or a morpheme feature related to a previous token. [Kahn, Figure 6, “segmentation parameters 635” and “silence detection 601.”  Kahn teaches that detection of “long silence” between utterances leads to segmentation.  To be objective and usable by a machine, “long silence” teaches or suggests a silence longer than a threshold.  “[0108] FIG. 6 is a flow diagram illustrating an overview of an exemplary embodiment of end point silence detection for segmentation of utterance for a speech segmentation module.”  “[0609] In one approach, as illustrated in FIG. 6, speech input segmentation is based upon detection of long silence between utterances…  Other data 630 may also be considered, such as average long silence length seen in other speakers.”  “[0612] By requiring longer pauses (silence) between words to define an utterance, the segmentation parameters 635 will result in fewer utterance segments. ….”  “[0093] Techniques are disclosed for user-dependent data generation of segmentation modules for speech that may be used for manual or automatic processing. In one approach, speech analysis may be used to determine typical silence length separating a speaker's utterances….”  Kahn, Figure 6 also includes a “speech user profile (Fig. 3).”  Figure 3, “Speech user profile 312” is used for segmentation of speech according to speaker.]
Jung and Kahn pertain to segmentation of speech according to speaker and it would have been obvious to use the periods of silence/pause as indicators of segmentation from Kahn with the system of Jung that primarily uses speaker identification to determine segmentation and also relies on duration of pauses as auxiliary methods of determining segmentation.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
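For illustration only, the silence-based alternative of the division criterion discussed above can be sketched as follows. This is a minimal sketch and is not taken from Jung, Kahn, or the application: the Token structure, the function name split_on_silence, and the 0.5 second default threshold are all assumptions made for the example. A new block is started whenever the silence between two consecutive tokens is longer than or equal to the threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical recognized token with start/end times in seconds."""
    text: str
    start: float
    end: float


def split_on_silence(tokens: List[Token], min_silence: float = 0.5) -> List[List[Token]]:
    """Divide one speaker's time-ordered tokens into blocks.

    A boundary between two tokens becomes a block boundary when the
    silence between them is >= min_silence (an assumed threshold).
    """
    blocks: List[List[Token]] = []
    for tok in tokens:
        if blocks and tok.start - blocks[-1][-1].end < min_silence:
            # Short pause: the token continues the current block.
            blocks[-1].append(tok)
        else:
            # First token, or a silence at least as long as the threshold.
            blocks.append([tok])
    return blocks
```

Raising min_silence in this sketch yields fewer, longer blocks, which is consistent with the effect Kahn describes at [0612] for requiring longer pauses.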

Regarding Claim 5, Jung teaches:
5. The method of claim 1, wherein the merging include determining the continuous utterance from the same speaker based on a silence period shorter than or equal to a predetermined time duration or [Jung determines that the utterance is from a particular speaker based on the “voice recognition profile 412” of the user that is stored for each speaker.  Jung also teaches that if the pause/silence between two portions of speech by the same speaker is shorter than a predefined period of time (Figure 3, 308) the two portions are considered to be continuous.  “[0049] In various examples, the first and second sets of words being spoken within a predefined period of time 308 (e.g., the time between t1 and t3 is less than the predefined period of time 308) may be a condition that must be satisfied to combine the first and second sets of words into a single text segment given a situation where there is an interruption caused by another user speaking an utterance. For example, the predefined period of time 308 can be ten seconds, fifteen seconds, twenty seconds, thirty seconds, one minute, and so forth….”  The role of “predefined period of time 308” also applies to the converse situation:  “[0051] Note that the minimum threshold number of words condition applies in situations where the user continues to speak. Consequently, if a user says a small number of words without continuing to speak within the predefined period of time 308 (e.g., the user says "yes" or "no" in response to a question or the user says, "I agree" and stops speaking), then the user's word(s) can amount to an utterance and a corresponding text segment using the techniques described herein.”] a syntax feature related to a previous block.[Jung teaches both alternatives because it also includes the situation where the “linguistic unit” criterion is used to determine that the portions should be continuous/merged: “15. The method of claim 12, further comprising determining that the first set of words and the second set of words are part of a same linguistic unit, wherein the combining of the first set of words and the second set of words spoken by the first user into the corresponding utterance for the single text segment occurs based on the determining that the first set of words and the second set of words are part of the same linguistic unit.”] [Use of OR makes only one of the conditions limiting.  A reference that teaches one of the alternatives teaches the Claim.]
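For illustration only, the silence-based alternative of the merging limitation can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Jung or of the application: the Block structure, the function name merge_continuous, and the 1.0 second default are hypothetical. The sketch arranges blocks chronologically irrespective of speaker and merges only adjacent blocks from the same speaker, whereas Jung combines a user's word sets even across another user's intervening words.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical block of one speaker's recognized text, times in seconds."""
    speaker: str
    text: str
    start: float
    end: float


def merge_continuous(blocks: List[Block], max_gap: float = 1.0) -> List[Block]:
    """Merge chronologically adjacent blocks from the same speaker.

    Two blocks are treated as one continuous utterance when they share a
    speaker and the silence between them is <= max_gap (assumed threshold).
    """
    merged: List[Block] = []
    # Arrange blocks in chronological order irrespective of speaker.
    for blk in sorted(blocks, key=lambda b: b.start):
        prev: Optional[Block] = merged[-1] if merged else None
        if (prev is not None
                and prev.speaker == blk.speaker
                and blk.start - prev.end <= max_gap):
            # Same speaker and a short enough pause: extend the previous block.
            merged[-1] = Block(prev.speaker, prev.text + " " + blk.text,
                               prev.start, max(prev.end, blk.end))
        else:
            merged.append(blk)
    return merged
```

Because only the silence alternative is modeled, the sketch does not address the syntax-feature or linguistic-unit alternative.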

Regarding Claim 6, Jung teaches:
6. The method of claim 2, 
wherein the method further comprises outputting the voice recognition data reconstructed in the conversation format on a screen, [Jung, Figure 1, 138 or Figure 2 showing the screen with speech recognition results.  “… The techniques described herein further configure a graphical user interface layout, in which the transcript can be displayed….”  Abstract.]
wherein when the screen is updated, the speaker-specific voice recognition data is collectively updated or is updated based on the first recognition result. [Jung, Figure 2, the arrows at the top and bottom indicate a scroll feature that updates the screen as more speech comes in.  “[0043] …As the user scrolls through the sequence of text segments, the user identifiers will also scroll to maintain the graphical association between a user identifier and a text segment.”]

Claim 7 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
7. A voice conversation reconstruction apparatus comprising: 
an input unit configured to receive voice conversation input; and [Jung, Figure 1, “speech capture device 116.”  Figure 4, “device 400.” ]
a processor configured to process voice recognition of the voice conversation received through the input unit, [Jung, Figure 4, “device 400.” “[0059] Device 400 includes one or more processing unit(s) 402, computer-readable media 404, input/output (I/O) interfaces 406 that enable the use of I/O devices, and communication interface(s) 408….”] wherein the processor is configured to: 
…

Claim 8 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.

Claim 9 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
9. A computer-readable recording medium storing therein a computer program, wherein the computer program includes instructions for enabling, when the instructions are executed by a processor, the processor to: [Jung, Figure 4, “device 400.” “[0059] Device 400 includes one or more processing unit(s) 402, computer-readable media 404, input/output (I/O) interfaces 406 that enable the use of I/O devices, and communication interface(s) 408….”]
…
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Itsui (U.S. 20190080688):  “A language model generating device according to the present invention includes: a paraphrase generating unit to generate, by using morphemes of a phrase included in learning example sentences that include a plurality of sentences and using synonyms for original expressions of the morphemes, a plurality of paraphrases that include a combination of an original expression of a morpheme and a synonym for an original expression of a morpheme and a combination of synonyms for original expressions of morphemes; and a language model generating unit to generate a language model that is based on an n-gram model from the plurality of paraphrases generated and the learning example sentences.” Abstract.
Hill (U.S. 20040162724): [0147] The concept recognition engine (CRE) provides a robust, language independent way of understanding users' natural language questions from both textual and audio sources. The technology is an advanced natural language processing technology for indexing, mapping and interacting with information based on the meaning, or semantic context, of the information rather than on the literal wording. As opposed to the majority of other natural language efforts, the technology does not rely on a complete formal linguistic analysis of phrases in an attempt to produce a full "understanding" of the text. Instead, the technology is based on a morpheme-level analysis of phrases enabling it to produce an "understanding" of the major components of the encapsulated meaning. [0148] Morphemes are defined as the smallest unit of language that contains meaning, or semantic context. A word may contain one or several morphemes, each of which may have single or multiple meanings. A relatively simple example of this is illustrated using the word geography that is comprised of the morphemes geo, meaning the globe, and graph that means illustration. These two distinct morphemes, when combined, form a concept meaning the study of the globe. Thus, individual units of meaning can be combined to form new concepts that are easily understood in normal communication.
Zass (U.S. 20180020285):  “[0116] In some embodiments, identifying audio portions (654) may comprise analyzing the audio data and/or the preprocessed audio data to identify one or more portions of the audio data. In some examples, an identified portion of the audio data may comprise a continuous part of the audio data or a non-continuous part of the audio data. In some examples, at least one of the one or more portions of the audio data may correspond to at least one of: a silent part of the audio data; a part of the audio data that does not contain speech; a utterance; a phoneme; a syllable; a morpheme; a word; a sentence; a conversation; a number of phonemes; a number of syllables; a number of morphemes; a number of words; a number of sentences; a number of conversations; a continuous part of the audio data corresponding to a single speaker; a non-continuous part of the audio data corresponding to a single speaker; a continuous part of the audio data corresponding to a group of speakers; a non-continuous part of the audio data corresponding to a group of speakers; and so forth.”
Abuelsaad (U.S. 10089067): “In an embodiment, teleconference management program 101 segments each audio signal into one or more of the following speech units: phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. In an embodiment, teleconference management program 101 determines intonational attributes associated with the utterances. Intonational attributes may include, but are not limited to, pitch envelope (i.e., a combination of the speaking fundamental frequency, pitch range, and the shape and timing of the pitch contour), overall speech rate, utterance timing (i.e., duration of segments and pauses), vocal quality, and intensity (i.e., loudness). Teleconference management program 101 stores the speech units and intonational attributes corresponding to the utterances in conversation database 104.”  See also Figure 2 regarding removing/muting unrelated portions of speech in a conference.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659