Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-31 are pending. Claims 1, 16, 23 and 28 are independent.  Claims are amended.
This Application is published as U.S. 20220246167.
            Apparent priority: 29 January 2021.
	The previous Office action mentioned, and this Office action repeats:  The embodiment shown in Figures 3A and 3B appears to include the intended method of the instant Application, features of which are not currently included in the Claims with sufficient particularity.  

1) The definition of “CHARACTERS” and the manner of “DETERMINING” these characters are missing from the Independent Claims. 2)  %, undefined, cannot be a source of distinction.  Claim needs to define the WHOLE first.  The whole is defined by the moving window. 3) where do the blanks occur? Are they consecutive? Or just occurring anywhere in the sentence?  4) Finally, how does the Rate relate to the threshold?
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection that, if presented, were necessitated by the amendments to the Claims.
This action is Final.
This Application appears related to 16/876,433 published as U.S. 2021/0358490 and 16/884,675 published as 2021/0370188.  Claims of 16/876,433 are close and Obviousness Double Patenting will be re-evaluated closer to allowability.
Response to Arguments
	Applicant’s arguments are moot in view of the new grounds of rejection that were necessitated by the amendments to the Claims.
	Hofer (primary) and Chen (cited in the Conclusion) are directed to end-point detection of speech, and both use the rate/speed of speech of a speaker as a factor for arriving at the pause/silence threshold that they use to detect the end of speech (end of sentence).  Cheng (secondary) is also directed to segmentation of speech by detecting a threshold pause duration.  In Cheng, the threshold is calculated based on the number of blank symbols in a window shifting over audio data chunks/segments.  Cheng mentions “speed of speaking” (Col. 1, 25-40) as a factor impacting speech recognition accuracy. Further, the time dependency of the symbols obtained by Cheng is related to the speed of speech.
Interview:
	Currently submitted amendments step back from those submitted for the Interview which were not considered sufficient.
	A number of questions were raised regarding the claimed language and are reflected in the Interview Summary.  The more of those questions that are answered inside the Claim language, the faster the prosecution.

How do you get from the audio segment to characters?  
Percentage of blank characters in WHAT?  What is the WHOLE from which the % is calculated.
HOW does the Rate determine the EOS threshold?  
No particularity regarding key features in the Claim.

	Claim:
The amendments are reflected in bold:
1. A method comprising:
determining an end of speech (EOS) threshold comprising a percentage of blank characters based at least in part on a rate of speech;
predicting, for at least one segment of a set of segments of an audio input, a set of characters representative of the at least one segment; and 
determining an EOS corresponding to the at least one segment based at least in part on a determination that the EOS threshold is satisfied for the set of characters predicted for the at least one segment.
	
	This Claim is now unclear as explained under Suggestions below.

	Applicant argues that Hofer does not teach the amended “determining an end of speech (EOS) threshold comprising a percentage of blank characters based at least in part on a rate of speech” and that: 

[Two images excerpted from Applicant’s Response (media_image1.png, media_image2.png) are not reproduced here.]

Response 9-10.
	
	IN REPLY: 
1) Hofer, in Figure 1, teaches an “Adaptive Endpoint Detector 15.”  “[0012] … an adaptive endpoint detector 15 …  to determine if the …  phrase spoken by the user corresponds to a complete request ….”  This means that the “Adaptive Endpoint Detector 15” of Hofer finds the EOS of the Claim.
 2) Hofer, then teaches that in the Figure 1, “Adaptive Endpoint Detector 15,” the “adaptive” pertains to an adjustable “Threshold.”  “[0012] … For example, the adaptive endpoint detector 15 may be further configured to … adjust a pause threshold …” Thus, Hofer teaches “determining an end of speech (EOS) threshold” of the Claim.
3) Hofer, next teaches that adjustment of the “threshold” for its Endpoint (EOS) Detector is according to the “conversation speed”  / “rate of speech” of the user.  “[0030] … the time threshold may be adaptively adjusted based on the conversation speed of the user ….”  Thus, Hofer teaches “determining an end of speech (EOS) threshold … based at least in part on a rate of speech.”

What remains is the definition, or measurement, of the “EOS threshold” in terms or units of “a percentage of blank characters.”
Applicant argues that Hofer adjusts its “pause threshold” / “EOS threshold” based on “pause statistics associated with the user” (Hofer [0012]) whereas the Claim determines its “(EOS) threshold comprising a percentage of blank characters.”  
Issue: Whether the “pause threshold” determined based on “pause statistics associated with the user,” as taught by Hofer, teaches or suggests the “(EOS) threshold comprising a percentage of blank characters”?

First:  What are these “blank characters”?  What is a “character” to begin with?
Second: “percentage of blank characters” of what?  What is the Whole of which a % is taken out?
The remainder of the language, “predicting, for at least one segment of a set of segments of an audio input, a set of characters representative of the at least one segment; and,” indicates that the input is an “audio input” that has “a set of segments” and that, for each audio “segment” of the Claim, “a set of characters representative” of the audio “segment” is “predicted.” 
This limitation helps little:  How are the characters predicted for the audio segment?  Does the system conduct speech recognition? Are the characters “letters” of the recognized text?  
“Characters” are NOT DEFINED by the Claim other than by being “PREDICTED FOR a SEGMENT” of an “audio input.”  We don’t know if they are letters and spaces between words.  Do we?  “Characters” have a specific definition in this Application which must be inside the Claim in order for “%” to have meaning:  “[0021] … In one example, these characters may include any appropriate alphanumerical character, as well as potentially one or more special characters such as blanks to represent time steps or audio frames in which no other character is detected….”  Published Application.  [0022] equates the blank characters with non-speech states.
Another missing link is in WHAT you determine the number (and thus %) of blank characters.  Speech recognizer puts one space between words and you cannot determine an EOS based on text that is recognized in an ordinary manner.  Therefore, the Claim needs to first have the definition of “Character” and then state an interval in which the number of Characters are counted in order to arrive at a % value: “[0022] The EOS detector, in an embodiment, determines a percentage of blank symbols within the string of characters for a particular window of a sliding window. In one example, for a set of time steps within the string of characters, the EOS detector determines if the current state is a speech state (e.g., a current character indicates speech) or a non-speech state (e.g., the current indicates a blank and/or silence). The EOS detector may then determine if the percentage of blank symbols (e.g., non-speech states) satisfies the EOS threshold….”  Published Application.
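
For illustration only, the sliding-window determination described in [0022] of the Published Application might be sketched as follows. The blank marker "_", the window length, the function names, and the reading of "satisfies" as "meets or exceeds" are assumptions made for this sketch; none of them is taken from the Claims or the Disclosure.

```python
from typing import List, Optional

BLANK = "_"  # assumed marker for a blank (non-speech) time step


def blank_percentage(window: List[str]) -> float:
    """Percentage of blank characters in one window of per-time-step characters."""
    if not window:
        return 0.0
    return 100.0 * sum(1 for c in window if c == BLANK) / len(window)


def find_eos(chars: List[str], window_len: int, eos_threshold_pct: float) -> Optional[int]:
    """Slide a fixed-length window over the characters, one time step at a time.

    Returns the start index of the first window whose blank percentage meets
    or exceeds the threshold (assumed meaning of "satisfies"), or None if no
    window does.
    """
    for start in range(0, len(chars) - window_len + 1):
        window = chars[start:start + window_len]
        if blank_percentage(window) >= eos_threshold_pct:
            return start
    return None


# Hypothetical per-time-step output for the word "hello" followed by silence.
chars = list("hh_e_ll_lo______")
print(find_eos(chars, window_len=5, eos_threshold_pct=80.0))
```

In this sketch the window is the WHOLE against which the percentage is taken, which is the particularity that the Claim language currently lacks.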

Considering the BROAD language of the Claim and after referring to the supporting Disclosure to arrive at the potential intent of the language, “CHARACTERS” of the Claim are indeed units that correspond to TIME.  A BLANK space/character in text has no time associated with it and no matter how long a speaker pauses between two words of a sentence, the speech recognizer places one blank space between the words.  
Accordingly, considering the drastic lack of particularity in the Claim language, the “characters” of the Claim are taught by the “pauses” in the speech/ “audio input” segments as taught by Hofer, both the “characters” and “pauses” being measured in units of time.

Nevertheless, and in anticipation of more particular Claim language, the secondary reference Cheng is added.
Note the comparison of Figures of the instant Application and Cheng.
Instant Application:

[Figure from the instant Application (media_image3.png) not reproduced.]

Cheng:

[Figure from Cheng (media_image4.png) not reproduced.]

See also Figure 15C of Cheng which shows the “window length” input and its description that teaches the shifting of the window across the data chunks (similar to what the instant Application does).

[Figure from Cheng (media_image5.png) not reproduced.]

“Turning to FIG. 15C, in executing a pause identification component 2317 of the control routine 2310, core(s) of the processor 2350 or 2550 may be caused to adaptively identify longer pauses defined by larger quantities of consecutive pause data chunks 2131p as likely sentence pauses. More specifically, and starting with the data chunk 2131a that represents the temporally earliest chunk of the speech audio of the speech data set 2130, a window 2236 that covers a preselected quantity of temporally consecutive ones of the data chunks 2131a may be shifted across the length of the speech audio, starting with the temporally earliest data chunk 2131a and proceeding throughout all of the data chunks 2131a in temporal order toward the temporally last data chunk 2131a. Thus, with the window 2236 positioned to begin with the earliest data chunk 2131a (regardless of whether it is a pause data chunk 2131p or a speech data chunk 2131s), measurements of the lengths of each pause represented by multiple consecutive pause data chunks 2131p within the window 2236 (if there are any pauses represented by multiple consecutive pause data chunks 2131p within the window 2236) may be taken to identify the longest pause thereamong. The longest pause that is so identified within the window 2236 (i.e., the pause represented by the greatest quantity of consecutive pause chunks 2131p) may then be deemed likely to be a sentence pause.”
Cheng, Col. 47, 7-32.
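
A rough sketch of the measurement described in the quoted passage of Cheng, for a single position of the window, is given below. Each data chunk is assumed to have been labeled already as a pause chunk (True) or a speech chunk (False); the function name and the data layout are assumptions of this sketch and are not details taken from Cheng.

```python
from typing import List, Optional, Tuple


def longest_pause_in_window(
    is_pause: List[bool], start: int, window_len: int
) -> Optional[Tuple[int, int]]:
    """Return (start_index, length) of the longest run of two or more
    consecutive pause chunks inside the window, or None if there is none."""
    best: Optional[Tuple[int, int]] = None
    run_start, run_len = 0, 0
    end = min(start + window_len, len(is_pause))
    # Iterate one position past the window so a run touching the edge is closed.
    for i in range(start, end + 1):
        if i < end and is_pause[i]:
            if run_len == 0:
                run_start = i
            run_len += 1
        else:
            if run_len >= 2 and (best is None or run_len > best[1]):
                best = (run_start, run_len)
            run_len = 0
    return best


# Hypothetical chunk labels: speech, 2 pause chunks, speech, 3 pause chunks, speech.
labels = [False, True, True, False, True, True, True, False]
print(longest_pause_in_window(labels, start=0, window_len=8))
```

Shifting the window, as Cheng describes, would amount to calling the function again with a later `start`.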
Suggestions
Claims 1, 16, 23 and 28 can benefit from improved clarity.
Claim 1 provides:
1. A method comprising:
determining an end of speech (EOS) threshold comprising a percentage of blank characters based at least in part on a rate of speech;
predicting, for at least one segment of a set of segments of an audio input, a set of characters representative of the at least one segment; and 
determining an EOS corresponding to the at least one segment based at least in part on a determination that the EOS threshold is satisfied for the set of characters predicted for the at least one segment.

The amended language is unclear.  Which is it?
A. determining an end of speech (EOS) threshold based at least in part on a rate of speech, wherein the EOS threshold is expressed as a percentage of blank characters;
OR
B. determining an end of speech (EOS) threshold comprising a percentage of blank characters, wherein the percentage of blank characters is based at least in part on a rate of speech;

Applicant refers to [0036] of the Specification as support:

[Image of paragraph [0036] of the Specification (media_image6.png) not reproduced.]


Based on the supporting paragraph of the Specification, the following is suggested:
1. A method comprising:
determining an end of speech (EOS) threshold , wherein the EOS threshold is expressed as a percentage of blank characters;
predicting, for at least one segment of a set of segments of an audio input, a set of characters representative of the at least one segment; and 
determining an EOS corresponding to the at least one segment based at least in part on a determination that the EOS threshold is satisfied for the set of characters predicted for the at least one segment.

The remaining independent Claims, while different, include similar language and thus suffer from the same lack of clarity.

The first limitation of Claim 1, in view of the Specification, is interpreted as intending: “determining an end of speech (EOS) threshold based at least in part on a rate of speech, wherein the EOS threshold is expressed as a percentage of blank characters.”

Claim 11 is amended to state:
11. The method of claim 1, wherein:
the percentage of blank characters is a threshold percentage of blank characters; and
 determining the EOS corresponding to the at least one segment of the set of segments further comprises determining a predicted percentage of blank characters included in the set of characters predicted for the at least one segment, and 
determining the EOS threshold is satisfied based at least in part on the predicted percentage and threshold percentage.

The highlighted portion appears to restate the concept already present in Claim 1 and therefore casts doubt on the intended meaning of Claim 1.
If Claim 11 has to say that “the percentage of blank characters is a threshold percentage of blank characters;” then what did “determining an end of speech (EOS) threshold comprising a percentage of blank characters …” of Claim 1 intend?

It is suggested that the language of Claim 1 be clarified and that Claim 11 also be modified, for example, as follows:
11. The method of claim 1, wherein [[:]]

 determining the EOS corresponding to the at least one segment of the set of segments further comprises: 
determining a predicted percentage of blank characters included in the set of characters predicted for the at least one segment, and 
determining the EOS threshold is satisfied based at least in part on the predicted percentage 

The EOS threshold is expressed as a % of blank characters.  Here, the % of blank characters in an audio segment is determined and then compared against the EOS threshold to determine whether the % of blank characters in the audio segment “satisfies” the EOS threshold (which could mean either exceeds or is below the threshold).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-8, 10-12, and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Hofer (U.S. 2018/0090127) in view of Cheng (U.S. 11,138,979).
Regarding Claim 1, Hofer teaches:
1. A method comprising:
determining an end of speech (EOS) threshold [Hofer, Figure 1, “Adaptive Endpoint Detector 15.”  “[0012] … an adaptive endpoint detector 15 communicatively coupled to the decoder 14 to determine if the decoded phrase spoken by the user corresponds to a complete request …. For example, the adaptive endpoint detector 15 may be further configured to retrieve pause statistics associated with the user and to adjust a pause threshold based on the pause statistics associated with the user. The adaptive endpoint detector 15 may also be further configured to retrieve pause statistics associated with a word, phrase, and/or other contextual interpretation and to adjust the pause threshold based on the decoded phrase spoken by the user and the pause statistics associated with the word/phrase/other contextual interpretation.”  See Figure 2 for “Pause Threshold Adjuster 24.”]  
comprising a percentage of blank characters [Hofer, as provided in the Response to Arguments, arguably teaches or suggests this feature by teaching that it uses the statistics of pauses (blank characters) as threshold for detecting the end of speech.  This is considering that none of Characters, their manner of Prediction, or what constitutes a Blank Character, is defined by the Claim and the supporting Disclosure indicates Characters to be durations of time like “pauses.”]
based at least in part on a rate of speech; [Hofer, “[0030] As an example of how the time threshold may be adaptively adjusted based on the conversation speed of the user during operation of an embodiment of the system 50, the durations of all pauses between words of the user may be determined using an endpoint detection algorithm. Those pause duration values may be stored in a database….”]
predicting, for at least one segment of a set of segments of an audio input, a set of characters representative of the at least one segment; and [Hofer, Figure 1, the “Decoder 14” and “WFST Decoder 59” of Figure 4 generate “Recognition Results” and provide them to the “Endpoint Detector 15 or 60” which teaches “a set of characters representative of the at least one segment” of the Claim.  “[0028] …Those acoustic scores may then be provided to a decoder 59 (e.g., based on WFST) to determine the phrase spoken by the user 55.”  The “set of segments” are the “frames” of audio input as shown in Figure 1 or at Figure 5, “68: go to next time frame,” or any other segmentation applied to the audio input shown in Figure 1.]
determining an EOS corresponding to the at least one segment based at least in part on a determination that the EOS threshold is satisfied for the set of characters predicted for the at least one segment. [Hofer, Figures 3A, 3B, 3C.  “[0020] Turning now to FIGS. 3A to 3C, an embodiment of a method 30 of detecting an endpoint of speech may include detecting a presence of speech in an electronic speech signal at block 31, measuring a duration of a pause following a period of detected speech at block 32, detecting if the pause measured following the period of detected speech is greater than a pause threshold corresponding to an end of an utterance at block 33, and adaptively adjusting the pause threshold corresponding to an end of an utterance based on stored pause information at block 34. For example, the stored pause information may include one or more of pause information associated with a user or pause information associated with one or more contextual interpretations at block 35.”  Figure 5, “Is pause duration > threshold? 69.”  The pause durations are “context dependent” and “context” in Hofer means after a word or phrase or sentence etc.  “[0012] … The adaptive endpoint detector 15 may also be further configured to retrieve pause statistics associated with a word, phrase, and/or other contextual interpretation and to adjust the pause threshold based on the decoded phrase spoken by the user and the pause statistics associated with the word/phrase/other contextual interpretation.” “[0022] Some embodiments of the method 30 may further include determining statistics of pause durations associated with one or more phrase contexts at block 41 ….”]
Hofer arguably teaches or suggests that the EOS threshold is expressed as a % of blank characters (duration of pause or other “statistics” associated with the pause duration) considering the lack of definition and particularity with respect to this term.  However, a reference is added that is more express.
Cheng teaches and suggests:
determining an end of speech (EOS) threshold comprising a percentage of blank characters based at least in part on a rate of speech; [Cheng, Figure 16B, the input audio “Hello. Please leave a message” is divided into “data chunks 2131” / “segments” and shows the individual blank symbols /characters that are detected by the model and used for pause identification according to a “threshold blank string length.”  The length of the consecutive “blank symbols” is detected and compared against a threshold length where the “blank symbols” of Cheng are obtained in a similar fashion to the “blank characters” of the instant Application and each signify a duration of time.  Cheng does not teach that the “threshold blank string length” is expressed in a “percentage.”  However, CLAIM DOES NOT DEFINE ITS CHARACTERS, BLANK CHARACTERS, OR HOW THE PERCENTAGE IS OBTAINED AND PERCENTAGE OF WHAT WHOLE IT IS.  Therefore, the “threshold blank string length” teaches or suggests the “EOS threshold comprising a percentage of blank characters” of the Claim.  “FIG. 16A illustrates the initial division of the speech data set 2130 into data chunks 2131c that each represent a chunk of the speech audio of the speech data set 2130, and the provision of those data chunks 2131c as an input to a neural network 2355 …. FIG. 16B illustrates the use of such a neural network, which has been configured to implement an acoustic model, to identify likely sentence pauses for inclusion in a candidate set 2237c of likely sentence pauses within the speech audio.”  Col. 48, lines 35-46.  “However, it has been observed (and then confirmed by experimentation) that such a trained neural network with a CTC output may also be useful in identifying sentence pauses. …, the CTC output also has a tendency to generate relatively long strings of consecutive blank symbols that correspond quite well to where sentence pauses occur.”  Col. 50, 11-19.  
“As each of these outputs are provided by the neural network 2355 or 2555, the length of each string of consecutive blank symbols that may be present therein is compared to a threshold blank string length. Where a string of consecutive blank symbols in such an output is at least as long as the threshold blank string length (e.g., the string of blank symbols corresponding to the pause between the words “Hello” and “Please”), such a string of blank symbols may be deemed likely to correspond to a sentence pause.”  Col. 50, line 64 to Col. 51, line 5.  “In performing such comparisons of the lengths of strings of consecutive blank symbols to the threshold blank string length, an indication of the threshold blank string length may be retrieved from the configuration data 2335. In some embodiments, the threshold blank string length may have been previously derived during neural network training and/or testing to develop the neural network acoustic model configuration data included in the configuration data 2335 for use in configuring the neural network 2355 or 2555 to implement an acoustic model. During such training, it may be that portions of speech audio that are known to include pauses between sentences may be used, and the lengths of the resulting strings of blank symbols that correspond to those sentence pauses may be measured to determine what the threshold blank string length should be to enable its use in distinguishing pauses between sentences from at least pauses between words.”  Col. 51, lines 15-32.]
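
The comparison that Cheng describes, in which each string of consecutive blank symbols is measured against a threshold blank string length, can be sketched as below. The blank marker "_", the sample output string, the threshold value, and the function names are assumptions of this sketch; in Cheng the threshold is retrieved from configuration data.

```python
from itertools import groupby
from typing import List, Tuple

BLANK = "_"  # assumed marker for a CTC blank symbol


def blank_runs(ctc_output: str) -> List[Tuple[int, int]]:
    """Return (start_index, length) for every run of consecutive blank symbols."""
    runs = []
    pos = 0
    for symbol, group in groupby(ctc_output):
        length = len(list(group))
        if symbol == BLANK:
            runs.append((pos, length))
        pos += length
    return runs


def likely_sentence_pauses(ctc_output: str, threshold_len: int) -> List[Tuple[int, int]]:
    """Keep only the blank runs that are at least as long as the threshold."""
    return [(s, n) for s, n in blank_runs(ctc_output) if n >= threshold_len]


# Hypothetical output: a short blank inside "hello", a long one before "please".
print(likely_sentence_pauses("hel_lo_____plea_se", threshold_len=4))
```

Note that this test is a count of consecutive blank symbols, not a percentage of any window.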

Regarding Claim 2, Hofer teaches:
2. The method of claim 1, wherein the set of characters predicted for the segment is generated by a connectionist temporal classification (CTC) function generating, as an output, a probability distribution for a character of the set of characters. [Hofer, Figure 4, uses an “Acoustic frontend 57” and “Acoustic Scoring 58” and WFST for speech recognition/ generation of characters:  “[0012] … a feature extractor 12 (e.g., an acoustic feature extractor) communicatively coupled to the speech converter 11 to extract speech features from the electronic signal, a score converter 13 communicatively coupled to the feature extractor 12 to convert the speech features into scores of phonetic units, a decoder 14 (e.g., a weighted finite state transducer/WFST based decoder) communicatively coupled to the score converter 13 to decode a phrase spoken by the user based on the phonetic scores ….”]
Hofer does not teach the use of a CTC for acoustic modeling but a WFST is used as the decoder of the recognition stage that goes well with a CTC acoustic model.
Cheng teaches:
wherein the set of characters predicted for the segment is generated by a connectionist temporal classification (CTC) function generating, as an output, a probability distribution for a character of the set of characters. [Cheng is directed to segmentation of the audio input based on detection of end of sentence pauses.  Figures 16A and 16B showing the use of a neural network with CTC for segmentation of audio and Figures 17A and 17B shows the use of same models for generation of character outputs some of which are blank symbols.  “6. The apparatus of claim 1, wherein: the speech audio is also divided into multiple alternate data chunks that each represent an alternate chunk of multiple alternate chunks of the speech audio; and the at least one processor is caused to perform operations of a second segmentation technique comprising: configure a neural network to implement an acoustic model, wherein the neural network comprises a connectionist temporal classification (CTC) output; provide each alternate data chunk of the multiple alternate data chunks to the neural network as an input and monitor the CTC output for a string of blank symbols generated based on the alternate data chunk; compare a length of each string of blank symbols from the CTC output to a predetermined blank threshold length; and store an indication of each string of blank symbols from the CTC output that is at least as long as the predetermined blank threshold length as a likely sentence pause of a second candidate set of likely sentence pauses.”  “Alternatively or additionally, the multiple segmentation techniques may include the use of a connectionist temporal classification (CTC) segmentation technique in which instances of consecutive blank symbols (sometimes also referred to as "non-alphabetical symbols" or "artificial symbols") generated by a CTC output of a neural network trained to implement an acoustic model are used to identify likely sentence pauses. 
A neural network incorporating a CTC output and trained to implement an acoustic model would normally be used to identify likely text characters in speech audio based on various acoustic features that are identified as present therein. In such normal use, the CTC output serves to augment the probabilistic indications of text characters that are generated by the neural network with blank symbols that serve to identify instances of consecutive occurrences of the same text character (e.g., the pair of "s" characters in the word "chess"), despite the absence of an acoustic feature that would specifically indicate such a situation (e.g., no acoustic feature in the pronunciation of the "s" sound in the word "chess" that indicates that there are two consecutive "s" characters therein). However, it has been observed through experimentation that the CTC output of such a trained neural network may also be useful in identifying sentence pauses, as it has been observed that the CTC output has a tendency to generate relatively long strings of consecutive blank symbols that tend to correspond to where sentence pauses occur.”  Col. 5, lines 20-46.  “FIGS. 17A, 17B and 17C, taken together, illustrate an example of generating and using the converged set 2238 of likely sentence pauses. FIG. 17A illustrates the combining of multiple candidate sets 2237 of likely sentence pauses to generate the converged set 2238. FIG. 17B illustrates the use of the converged set 2238 in dividing the speech data set 2130 into data segments 2139 representing segments of the speech audio of the speech data set 2130. FIG. 17C illustrates the use of the same neural network implementation of acoustic model as was used in the CTC segmenting technique to perform character identification.”  Col. 51, lines 32-45.]
Hofer and Cheng pertain to audio segmentation for speech recognition and it would have been obvious to combine the use of CTC from Cheng with the system of Hofer and use the neural network with CTC from Cheng instead of the acoustic front end and acoustic scoring of Hofer because as Cheng indicates (see above) this method has been shown to identify double occurrence of a sound (such as s in Chess) which acoustic models are generally not able to do.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
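
The role of the blank symbol described in the quoted passage of Cheng (the pair of "s" characters in "chess") can be illustrated with the usual CTC collapsing rule: merge consecutive repeated symbols, then remove blanks. The blank marker "_", the sample frame strings, and the function name are assumptions of this sketch; neither reference provides code.

```python
from itertools import groupby

BLANK = "_"  # assumed marker for a CTC blank symbol


def ctc_collapse(frames: str) -> str:
    """Merge consecutive repeated symbols, then drop blank symbols."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(s for s in merged if s != BLANK)


# With a blank between the two "s" frames, both characters survive.
print(ctc_collapse("cc_h_ee_s_ss_"))
# Without a separating blank, the repeated "s" frames merge into one.
print(ctc_collapse("cchheesss"))
```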

Regarding Claim 3, Hofer teaches:
3. The method of claim 2, wherein the rate of speech is determined based at least in part on a set of inter-word intervals. [Hofer, “[0030] As an example of how the time threshold may be adaptively adjusted based on the conversation speed of the user during operation of an embodiment of the system 50, the durations of all pauses between words of the user may be determined using an endpoint detection algorithm. Those pause duration values may be stored in a database. …”  The pause durations are “context dependent” and “context” in Hofer means after a word or phrase or sentence etc.  “[0012] … The adaptive endpoint detector 15 may also be further configured to retrieve pause statistics associated with a word, phrase, and/or other contextual interpretation and to adjust the pause threshold based on the decoded phrase spoken by the user and the pause statistics associated with the word/phrase/other contextual interpretation.” “[0022] Some embodiments of the method 30 may further include determining statistics of pause durations associated with one or more phrase contexts at block 41 ….”  “[0026] … Another technique may involve adjusting the wait time based on the context of previously spoken words or phrases. For example, after the user says "open it", a short pause usually indicates that the sentence is finished. On the other hand, a short pause after the user says "could you" usually doesn't indicate that the sentence is finished. Advantageously, the statistics of pauses may be estimated separately for different words, phrases, and/or sentences. For example, an initial set of statistics may be estimated based on audio recordings from a large set of speakers and may thereafter be adapted at run-time to each individual user. …” ]

Regarding Claim 4, Hofer teaches:
4. The method of claim 3, wherein determining the EOS threshold further comprises modifying the EOS threshold based at least in part on a maximum inter-word interval of the set of inter-word intervals. [Hofer, Figure 3A, 34:  “Adaptively adjust the pause threshold …”  Figure 3B, 37: “Adjust the pause threshold based on the stored pause durations …”  “Pause durations” are stored according to context which includes words such that the pause durations become “inter-word intervals.”  Hofer stores “statistics” of pause durations.  Statistics is known to include max, min, mean, median, std dev. Etc.  Additionally, Hofer teaches that it stores the max pause duration after a partial phrase (which could be a word, see Figure 7 where some of the partial phrases are shown to be words and Figure 8 where they are all words) and uses it to set the threshold:  “[0031] As an example of how the time threshold may be adaptively adjusted based on the context of what was spoken during operation of an embodiment of the system 50, an audio database of spoken phrases may be utilized to determine statistics of pause durations after partially spoken phrases. Only pauses within the partial phrases are considered. Pauses at the end of a complete phrase are not used. As an example, for every partial phrase consisting of n words (n-gram) that was at least spoken by m different speakers, the maximum pause duration after that partial phrase is computed and stored. The wait time threshold for this partial phrase can then be set ten percent (10%) longer than the longest corresponding partial phrase pause duration stored in the database.”]
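
The threshold setting quoted from Hofer [0031] can be sketched as follows: for each partial phrase spoken by at least m different speakers, the longest observed pause is found and the wait time threshold is set 10% longer. The tuple layout of the observations, the sample values, and the function and parameter names are assumptions of this sketch, not details taken from Hofer.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wait_thresholds(
    observations: Iterable[Tuple[str, str, float]],
    min_speakers: int,
    margin: float = 0.10,
) -> Dict[str, float]:
    """observations: (partial_phrase, speaker_id, pause_seconds) tuples.

    Returns a wait-time threshold per partial phrase, set `margin` (10% by
    default) longer than the longest pause observed after that phrase.
    Phrases spoken by fewer than `min_speakers` distinct speakers are skipped.
    """
    speakers = defaultdict(set)
    longest = defaultdict(float)
    for phrase, speaker, pause in observations:
        speakers[phrase].add(speaker)
        longest[phrase] = max(longest[phrase], pause)
    return {
        phrase: longest[phrase] * (1.0 + margin)
        for phrase in longest
        if len(speakers[phrase]) >= min_speakers
    }


# Hypothetical pause observations.
obs = [
    ("could you", "speaker_a", 0.3),
    ("could you", "speaker_b", 0.5),
    ("open it", "speaker_a", 0.2),
]
print(wait_thresholds(obs, min_speakers=2))
```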

Regarding Claim 5, Hofer teaches:
5. The method of claim 3, wherein determining the EOS threshold further comprises modifying the EOS threshold based at least in part on a mean of the set of inter-word intervals. [Hofer, Figure 7 shows storing the average/mean pause times between phrases and partial phrases (sub-phrase) which can be a word:  “[0038] Turning now to FIG. 7, an embodiment of an average pause time database may include a representation in a simplified hash table 91….”  “[0040] Turning now to FIG. 8, an embodiment of an average pause time database may include representation in an FST….”  “[0044] …For example, a user carrying a smartphone may approach a public kiosk, the kiosk receives a wireless signal from the smartphone that identifies the user to the human machine interface in the kiosk, the kiosk retrieves the pause profile associated with the identified user, and when the user speaks to the kiosk the user advantageously has a better user experience because the speech recognition system of the human machine interface is adaptively adjusted to the user's average conversation speed and/or contextual partial phrase pause habits.”  The “average conversation speed” is obtained from the “average pause duration.”]

Regarding Claim 6, Hofer teaches or suggests:
6. The method of claim 3, wherein determining the EOS threshold further comprises modifying the EOS threshold based at least in part on a variance of the set of inter-word intervals and a maximum inter-word interval of the set of inter-word intervals. [Hofer, Figures 7 and 8 show storing of average pauses in between words or phrases which are used for adapting the end-point detector pause duration.  Hofer at [0031] also teaches the use of “average conversation speed” which is based on or related to “average pause” to adjust the pause before reply response time.  Hofer acknowledges “[0024] … Also speech disfluency may cause the user to make pause variations….”  Further, the fact of adaptation according to pause (inter-word interval) duration teaches or at the least suggests that the EOS threshold is based on “a variance” of the pause duration.] [Note that Hofer does not teach using the statistical variance of the pause duration (standard deviation to the second power) in a formula for adapting the threshold for pause.  However, this Claim too broadly refers to “variance” without providing particulars or even clarifying that the statistical variance is intended.  Mentioning “variance” in a general way is too broad.  All statistical characteristics of the collected pause durations come into play in the adaptation of the duration, directly or indirectly.]

Regarding Claim 7, Hofer teaches:
7. The method of claim 3, wherein the set of inter-word intervals is calculated by at least determining an amount of time between a first subset of characters of the set of characters and a second subset of characters of the set of characters. [Hofer, Figure 2, “pause duration measurer 22.”  Figure 3C, Figure 3B, 36:  “[0021] Some embodiments of the method 30 may further include storing the measured duration of pauses in the detected speech in the stored pause information associated with the user at block 36, and adjusting the pause threshold based on the stored pause durations associated with the user at block 37….”   The “pause duration” is the “amount of time” between a first word/phrase/sentence and a second word/phrase/sentence where the word/phrase/sentence/ “context” teaches the “subset of characters.”]

Regarding Claim 8, Hofer teaches:
8. The method of claim 7, wherein the amount of time corresponds to a number of segments of the set of segments. [Hofer, in Figures 5 and 6, the analysis and evaluation of “pause duration” is performed on a frame by frame basis in the sense that presence or absence of speech is detected per frame and the analysis is moved forward to the next frame such that the pause duration is incremented in units of frame / “number of segments” of the Claim.  See [0034]-[0036].  “[0036] …If the current pause duration is less than or equal to the context-based time threshold at block 86, then the current pause duration is increased by one (1) time frame at block 87 and processing may continue at the next time frame at block 88….”]
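For illustration only, the frame-by-frame counting quoted from Hofer at [0036] may be sketched as follows. The per-frame speech flags and the function name are hypothetical, and resetting the counter when speech resumes is an assumption of this sketch rather than a quotation from Hofer.

```python
def first_end_of_utterance(speech_flags, threshold_frames):
    """Return the index of the first frame at which the pause duration,
    counted in frames, exceeds `threshold_frames`; return None otherwise.

    `speech_flags` is a hypothetical list of booleans, one per frame
    (True = speech detected in that frame).
    """
    pause_frames = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            pause_frames = 0  # assumption: detected speech resets the pause
            continue
        if pause_frames > threshold_frames:
            return index  # pause exceeds threshold: end of utterance
        pause_frames += 1  # pause duration increased by one time frame
    return None


flags = [True, False, False, True, False, False, False, False]
end_frame = first_end_of_utterance(flags, threshold_frames=2)
```

Because the pause duration only ever grows by one frame at a time, the "amount of time" is at every step an integer number of frames, i.e., a number of segments.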

Regarding Claim 10, Hofer teaches:
10. The method of claim 9, wherein the first subset of characters and the second subset of characters further comprise a blank character. [Hofer teaches that its WFST Decoder, which is used for speech recognition, detects speech vs. non-speech, which means that the “characters” that are recognized include “blank symbols/characters.”  “[0015] In some embodiments of the speech detector apparatus 20, the speech detector 21 may be a part of the WFST decoder that bases speech/non-speech classification on the WFST state that the best active token is currently in.  In different embodiments, the speech detector 21 may be an individual classifier, for example, operating on the acoustic signal or the features from the feature extractor 12….”  Additionally, Figure 2, “speech detector 21” may perform that same classification (letter vs. blank).]

Regarding Claim 11, Hofer teaches and therefore suggests:
11. The method of claim 1, wherein:
the percentage of blank characters is a threshold percentage of blank characters; and [Hofer as discussed in the Response to Arguments and as provided in the rejection of Claim 1 suggests this limitation.  Further, this limitation is redundant and has 112(d) and 112(b) implications because Claim 1 was amended to state that the EOS threshold is stated as a % of blank characters.  For example, if the % of blank characters detected in an audio segment exceeds 70% then that segment coincides with an EOS.]
 determining the EOS corresponding to the at least one segment of the set of segments further comprises determining a predicted percentage of blank characters included in the set of characters predicted for the at least one segment, and [Hofer teaches that it determines “statistics of pause durations associated with one or more phrase contexts.”  See claim 16.  (The “phrase contexts” include words/phrases/sentences as contexts.)  As shown in Figures 5 and 6, the audio “segments” that are investigated have a constant length/duration (frames).  When the duration of pause is known and duration of the segment in which the pauses occurred is also known, the percentage of the segment that is filled with pause can be calculated.  Symbols are states of the finite state transducer (WFST Decoder 59 in Figure 4).  The “output symbols” in Figure 8 teach the pause duration / blank character of the Claim.  “[0040] Turning now to FIG. 8, an embodiment of an average pause time database may include representation in an FST. Advantageously, the pause duration may be evaluated by traversing through the FST. The input symbols of the paths (i.e. the label to the left of the colon in FIG. 8) may represent the words in the phrase. The last output symbol (i.e. the numbers to the right of the colon in FIG. 8) may be the average pause duration of the sub-phrase. In a larger FST, there may also be a failure path with "?:?" labels that is traversed if no input symbol from the current state matches any of the paths from the current state. For example, the phrase "the door" first propagates from the start state (S) through the "the:0.6" path into state two (2). From there it propagates through the "door:1.2" path. The output symbol "1.2" represents the average pause duration of the phrase "the door" as stored in the FST database.”  See also Figure 6 where the flowchart refers to the WFST (84) to retrieve the symbol when speech is not detected.  
“[0036] …If speech is not currently detected at block 82, an utterance hypothesis may be retrieved from the decoder at block 84. A context-based time threshold for the longest partial phrase corresponding to the utterance hypothesis may be retrieved from a database of context sensitive thresholds at block 85….”]
determining the EOS threshold is satisfied based at least in part on the predicted percentage and threshold percentage. [Hofer, Figure 5, 69, “Is pause duration >threshold?”  This limitation states the inherent nature of a threshold and its use.  There is a threshold, value is compared against the threshold, if value is above (or below) the threshold then the value falls within a class.]

	Cheng also teaches and suggests:
determining the EOS corresponding to the at least one segment of the set of segments further comprises determining a predicted percentage of blank characters included in the set of characters predicted for the at least one segment, and [Cheng keeps track of the string of “blank symbols” in a segment and therefore is capable of calculating a percentage of blank symbols.  “6. … provide each alternate data chunk of the multiple alternate data chunks to the neural network as an input and monitor the CTC output for a string of blank symbols generated based on the alternate data chunk; compare a length of each string of blank symbols from the CTC output to a predetermined blank threshold length; and store an indication of each string of blank symbols from the CTC output that is at least as long as the predetermined blank threshold length as a likely sentence pause of a second candidate set of likely sentence pauses.”  Sentence pause of Cheng teaches EOS of the Claim.] [Only “percentage” is suggested because Cheng does not calculate a % or a ratio; rather it uses a straight length number.  However, calculation of ratio/percentage is trivial when both numerator and denominator are known as is the case in Cheng.  See Figure 16B where the blank symbols/characters and the non-blank symbols are indicated and one can count the blanks and divide by the total.]
Rationale for combination as provided for Claim 2.
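For illustration only, the difference between the straight length number used by Cheng and the ratio/percentage discussed above may be sketched as follows. The underscore blank symbol, the function names, and the sample chunk are hypothetical; Cheng is quoted only for comparing the length of a string of blank symbols to a threshold length.

```python
BLANK = "_"  # hypothetical stand-in for the CTC blank symbol


def longest_blank_run(symbols):
    """Length of the longest string of consecutive blank symbols."""
    longest = current = 0
    for symbol in symbols:
        current = current + 1 if symbol == BLANK else 0
        longest = max(longest, current)
    return longest


def blank_percentage(symbols):
    """Blank symbols as a percentage of all symbols in the chunk."""
    if not symbols:
        return 0.0
    blanks = sum(1 for symbol in symbols if symbol == BLANK)
    return 100.0 * blanks / len(symbols)


# Hypothetical per-time-step output for one data chunk: 10 symbols, 5 blanks.
chunk = list("he_llo____")
run_length = longest_blank_run(chunk)
percent_blank = blank_percentage(chunk)
```

Both quantities are computed from the same symbol sequence; the percentage requires only the additional division by the total symbol count, which is known for each chunk.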

Regarding Claim 12, Hofer teaches and therefore suggests:
12. The method of claim 11, wherein determining the EOS threshold is satisfied when the predicted percentage exceeds the threshold percentage. [Hofer, Figure 5, when at 69, Pause Duration > Threshold the flowchart goes to 72: Signal End of Utterance Detected to Decoder.  Figure 6, 86 and 89.  These teachings require the duration of pause within a frame to be greater than a threshold.  Considering that the duration of a frame is a constant, this condition can easily translate into the “percentage” of the blank/non-speech exceeding a threshold.  Thus, the percentage is suggested by the teachings of Figures 5 and 6 at 69 and 86.]
[Cheng teaches, in Figure 16B, obtaining symbols/characters for both characters and pauses/blank in each speech segment (or in each window in Figure 15c) and considering that the total number of symbols/characters is also determined, a ratio or percentage can be obtained.  “As each of these outputs are provided by the neural network 2355 or 2555, the length of each string of consecutive blank symbols that may be present therein is compared to a threshold blank string length. Where a string of consecutive blank symbols in such an output is at least as long as the threshold blank string length (e.g., the string of blank symbols corresponding to the pause between the words “Hello” and “Please”), such a string of blank symbols may be deemed likely to correspond to a sentence pause….”  Col. 50, line 64 to Col. 51, line 15.  See also the next paragraph of Cheng: col. 51, 15-32.  See “… a window 2236 that covers a preselected quantity of temporally consecutive ones of the data chunks 2131a may be shifted across the length of the speech audio, …” Col. 47, lines 6-31.  Figure 16B shows that for each data chunk all of the symbols (blank and non-blank) are determined such that the total number is known; hence the ratio/% calculation becomes trivial.]

Regarding Claim 14, Hofer teaches and therefore suggests:
14. The method of claim 1, wherein predicting the set of one or more characters comprises using one or more neural networks to predict the set of one or more characters. [Hofer teaches that its speech recognition system, which also detects pauses, may benefit from a SL algorithm such as a RNN.  It does not teach that the speech recognition is conducted by RNN but that the pause detection algorithm that is used by the speech recognizer is implemented in RNN.  The Claim is broad and thus the teaching of Hofer may be sufficient for teaching or at least suggesting the language of the Claim:  “[0041] According to some embodiments of a speech recognition system, pause duration may be modeled using sequence labeling. For example, a sequence labeling (SL) algorithm such as, for example, a recurrent neuronal network (RNN) may be utilized to determine the duration of an optional pause before the next word starts… An advantage of an SL technique as compared to a database is that it could utilize an extensive history (e.g., theoretically unlimited history). For example, there may be no arbitrary limitation for the n of the n-grams. Using an SL algorithm may also be beneficial for the robustness of the speech recognition system, for example, if the automatic speech recognition makes errors.”]
Cheng expressly teaches:
wherein predicting the set of one or more characters comprises using one or more neural networks to predict the set of one or more characters. [Cheng Figures 13A,B and  17C teach the use of “neural network” to generate the text data 2539.  Figure 16B shows the generation of blank symbols and spaces as well as text characters with the use of the same neural networks.]
Rationale for combination as provided for Claim 2.

Regarding Claim 15, Hofer uses a neural network for its segmentation task:  “[0021] …Alternatively, or additionally, some embodiments may utilize a machine learning approach to learn pauses given a context. Suitable machine learning algorithms may be, for example, based on recurrent neuronal networks (RNN).”  Hofer does not teach the use of CTC.
Cheng teaches:
15. The method of claim 14, wherein the predicting the set of one or more characters further comprises applying an output of the one or more neural networks to a connectionist temporal classification (CTC) function that generates, as an output, a probability distribution for a character of the set of characters and provides the probability distribution to an EOS detector. [Cheng teaches the use of a neural network including a CTC for both segmentation (detection of pauses) and for the final speech recognition (conversion of speech to text).  See Figures 16A,B, and 17A, B,C.  The use of CTC for segmentation teaches generating “a probability distribution for a character … and provides … to the EOS detector.”]
Rationale for combination as provided for Claim 2.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Hofer and Cheng and further in view of Ye (U.S. 20190279614).
Regarding Claim 9, Hofer does not teach the use of a CTC and Cheng does not mention a greedy selection algorithm.
Ye teaches:
9. The method of claim 7, wherein the set of characters is determined by at least applying a greedy selection algorithm to the output of the CTC function. [ Ye, “[0019] An exemplary hybrid neural network model may be utilized during ASR processing, for example, to enhance accuracy in speech recognition detection as well as reduce WER when compared to traditional word-based modeling for speech recognition. In one example, the hybrid neural network model is a hybrid CTC model as referenced in the foregoing description. Significant progress has been made in ASR when acoustic models trained with feed-forward deep neural networks switched to LSTM RNNs since the latter can better model speech sequences. As referenced above, CTC modeling is utilized to map speech input frames into an output label sequence. When working with speech frames as input in speech recognition tasks, CTC introduces a special blank label and allows for repetition of labels to force the output and input sequences to have the same length. This is optimal for time evaluation of specific portions of frames. CTC modeling outputs are usually dominated by blank symbols and the output tokens corresponding to the non-blank symbols usually occur with spikes in respective posteriors. Greedy decoding is a decoding strategy used to generate ASR outputs from CTC modeling, where non-blank tokens corresponding to the posterior spikes are concatenated and subsequently collapsing those tokens into word outputs if needed. Examples described herein may be configured to utilize greedy decoding but are not so limited, where examples may also utilize other decoding schemes known to one skilled in the field of art.”]
Hofer, Cheng, and Ye pertain to speech recognition and it would have been obvious to combine the greedy decoding algorithm of Ye with the system of the combination, which uses a CTC, because, as Ye states, this is an algorithm interchangeable with other decoding schemes, particularly when blank symbols/tokens exist in the results.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
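For illustration only, greedy decoding of a CTC output as generally understood in the art (select the most probable label at each frame, collapse repeated labels, and drop blank labels) may be sketched as follows. This is the commonly known best-path procedure and is not asserted to be the specific implementation of Ye; the label set, the posterior values, and the function name are hypothetical.

```python
def greedy_ctc_decode(posteriors, labels, blank_index=0):
    """Greedy (best-path) decoding of per-frame CTC posteriors.

    `posteriors` is a hypothetical list of per-frame probability lists and
    `labels` maps each index to a character; index `blank_index` is the
    special blank label.
    """
    # Step 1: pick the most probable label index at every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in posteriors]
    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


labels = ["_", "a", "b"]  # index 0 is the blank label
posteriors = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.20, 0.70, 0.10],  # a
    [0.10, 0.10, 0.80],  # b
    [0.80, 0.10, 0.10],  # blank
]
text = greedy_ctc_decode(posteriors, labels)
```

The blank frames remain visible in the intermediate best path before they are dropped, which is why a greedy selection over the CTC output yields both the characters and the blank characters relied upon elsewhere in this rejection.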

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Hofer and Cheng and further in view of Kahn (U.S. 20060149558).
Regarding Claim 13, Hofer and Cheng do not discuss a moving window over the frames of speech/audio input.  (Actually, Cheng does teach a moving window in Figure 15C but the rejection may not be modified at this stage:  “… Starting at the beginning of the speech audio, a window that covers a preselected quantity of temporally adjacent chunks may be shifted across the length of the speech audio, starting with the earliest chunk and proceeding through temporally adjacent chunks toward the temporally latest chunk….”  Cheng, Col. 14, line 44 to Col. 15, line 5.  “… a window 2236 that covers a preselected quantity of temporally consecutive ones of the data chunks 2131a may be shifted across the length of the speech audio, …” Col. 47, lines 6-31.)
Kahn teaches:
13. The method of claim 1, wherein determining the EOS corresponding to the at least one segment of the set of segments further comprises using a sliding window to identify the at least one segment before predicting the set of segments. [Kahn, “[0235] In one frames-based silence detection method, statistical analysis for each window of sound (e.g. 100 msec in length) may be performed using moving sample window techniques in relation to the defined silence and sound thresholds. ….”  See also the preceding [0234].]
Hofer, Cheng, and Kahn pertain to speech recognition and it would have been obvious to combine the moving window of Kahn, which also pertains to silence/pause detection and segmentation, with the system of the combination for the reasons that Kahn uses this method.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
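For illustration only, a moving sample window evaluated against a silence threshold, of the general kind described in Kahn at [0235], may be sketched as follows. The mean-absolute-amplitude statistic, the window length, the hop size, the threshold value, and the function name are assumptions of this sketch; Kahn is quoted only for performing statistical analysis per window in relation to defined silence and sound thresholds.

```python
def silent_windows(samples, window_len, hop, silence_threshold):
    """Slide a fixed-length window across `samples` and flag silent windows.

    Returns a list of (start_index, is_silent) pairs.  A window is treated
    as silent when its mean absolute amplitude is below `silence_threshold`
    (an assumed statistic chosen for this sketch).
    """
    results = []
    for start in range(0, len(samples) - window_len + 1, hop):
        window = samples[start:start + window_len]
        mean_abs = sum(abs(x) for x in window) / window_len
        results.append((start, mean_abs < silence_threshold))
    return results


# Hypothetical normalized audio samples.
samples = [0.5, 0.6, 0.0, 0.0, 0.0, 0.4]
flags = silent_windows(samples, window_len=2, hop=2, silence_threshold=0.1)
```

Each window identifies one candidate segment before any further processing is applied to it, which corresponds to the ordering recited in the Claim.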

Claims 16-27 are rejected under 35 U.S.C. 103 as being unpatentable over Hofer in view of Cheng.
Regarding Claim 16, Hofer teaches (See also rejection of Claim 1):
16. A system comprising:
one or more processors; and [Hofer, [0013], [0019], and [0023] describe the devices shown in Figures 1 and 2 and used for implementation of method of Figures 3A,3B, and 3C and teach that processor and memory or a computing device are used.]
memory storing instructions that, as a result of being executed by the one or more processors,  [Hofer, “[0013] … Alternatively, or additionally, these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., to be executed by a processor or computing device….”]
cause the system to:
generate a determination of an end of speech (EOS) threshold comprising a percentage of blank characters based at least in part on a rate of speech; and [Hofer, Figure 1, “Adaptive Endpoint Detector 15.”  See [0030] and [0032]-[0034] for dependence of the Endpoint Detection on the speed of the conversation / “rate of speech.”  Additionally, see “[0042] … If the user is completely unknown, embodiments of a speech endpoint detector may begin building a set of stored pause information for the unknown user and adapt the response time based on the conversation speed of the unknown user. Substantial improvement in the user experience based on conversation speed, however, may involve a larger sample set that can be developed from a short, one-time interaction. On the other hand, some embodiments may improve the user experience with stored context-based pause information, even for unknown users. Some embodiments may include a semi-adaptive approach where an initial determination is quickly made regarding average conversation speed of the unknown user (e.g., fast, medium, slow, etc.) and setting a corresponding pause threshold based on previously determined pause statistics for the initial determination, and thereafter fine tuning or changing the pause threshold as the sample size increases….”]
generate a determination of an EOS for a window of audio based at least in part on the EOS threshold. [Hofer, Figure 5, “Signal End of Utterance detected to decoder 72” after the “Is Pause Duration >Threshold 69” goes to YES.  The “window” or segment considered in Figure 5 is a frame.  [0034].]
Hofer suggests that the EOS threshold is expressed as a % of blank characters and Cheng also suggests this feature as provided with respect to Claim 1.  Rationale for combination as provided for Claim 1.

Regarding Claim 17, Hofer teaches:
17. The system of claim 16, wherein the memory further includes instructions that as a result of being executed by the one or more processors, cause the system to 
obtain a set of characters predicted for the window of audio generated using at least one of: [Hofer, Figure 4, “Recognition Result” generated by the “WFST Decoder” teaches “set of characters” of the Claim.  Hofer obtains words, phrases, or sentences all of which teach “a set of characters” of the Claim.  “[0012] … The adaptive endpoint detector 15 may also be further configured to retrieve pause statistics associated with a word, phrase, and/or other contextual interpretation and to adjust the pause threshold based on the decoded phrase spoken by the user and the pause statistics associated with the word/phrase/other contextual interpretation.”  Figures 7 and 8 showing the examples of recognized words and phrases: “open,” “the,” “open the,” “the door.”]
one or more neural networks implementing a neural acoustic model or [Hofer, “[0033] …In some embodiments, the database may include one or more of a relational database, a graphical relationship, or a function mapping speech features to an expected pause duration. A suitable function may be, for example, a non-linear function trained using a machine learning approach (e.g., such as RNN).”  See also [0041] that teaches using RNNs for sequence labeling as a part of the speech recognition system.  Hofer suggests but does not teach implementing the acoustic front end using neural networks.]
a connectionist temporal classification (CTC) function.
Hofer does not teach that the acoustic model is implemented in a NN.
Cheng as applied to Claim 2 teaches:
obtain a set of characters predicted for the window of audio generated using at least one of: [Cheng, Figure 17C, “Text Data 2539” is a set of characters predicted for the “speech data set 2130” shown in Figure 17B. “As words of the speech audio are identified, it may be the processor(s) 2550 of the control device 2500 that assembles the identified words to generate the text data 2539, which may then be transmitted to the requesting device 2700 from which a request may have been received to perform the speech-to-text conversion.”  ]
one or more neural networks implementing a neural acoustic model or [Cheng, Figure 17C showing the “neural network 2355 or 2555” receive the output of “feature detection” which is acoustic features.  “FIGS. 17A, 17B and 17C, taken together, illustrate an example of generating and using the converged set 2238 of likely sentence pauses. FIG. 17A illustrates the combining of multiple candidate sets 2237 of likely sentence pauses to generate the converged set 2238. FIG. 17B illustrates the use of the converged set 2238 in dividing the speech data set 2130 into data segments 2139 representing segments of the speech audio of the speech data set 2130. FIG. 17C illustrates the use of the same neural network implementation of acoustic model as was used in the CTC segmenting technique to perform character identification.”  Col. 51, lines 32-45.]
a connectionist temporal classification (CTC) function. [Cheng, Figure 17C where the acoustic model/ neural network 2355 or 2555 includes a “CTC output 2356 or 2556.”  “FIGS. 16A and 16B, taken together, illustrate an example of use of a connectionist temporal classification (CTC) segmentation technique during pre-processing to also enable the division of the same speech data set 2130 into segments. FIG. 16A illustrates the initial division of the speech data set 2130 into data chunks 2131c that each represent a chunk of the speech audio of the speech data set 2130, and the provision of those data chunks 2131c as an input to a neural network 2355 of one of the node devices 2300, or as an input to a neural network 2555 of the control device 2500. FIG. 16B illustrates the use of such a neural network, which has been configured to implement an acoustic model, to identify likely sentence pauses for inclusion in a candidate set 2237c of likely sentence pauses within the speech audio.”  Col. 48, lines 32-45.]
Rationale for combination as provided for Claim 2.

Regarding Claim 18, Hofer teaches:
18. The system of claim 16, wherein the memory further includes instructions that as a result of being executed by the one or more processors, cause the system to:
obtain a set of words from an audio signal, where the window of audio represents a portion of the audio signal; and [Hofer, Figures 7 and 8 show the recognized words and phrases and Figures 5, 68, and 6, 83, teach that segments of audio that are considered are frames.]
determine the rate of speech based at least in part on a set of inter-word intervals determined based at least in part on the set of words. [Hofer teaches that “speed of conversation” / “rate of speech” is based on the pauses between the words/phrases/sentences of the recognized speech. [0030]-[0034].   “[0044] … For example, a user carrying a smartphone may approach a public kiosk, the kiosk receives a wireless signal from the smartphone that identifies the user to the human machine interface in the kiosk, the kiosk retrieves the pause profile associated with the identified user, and when the user speaks to the kiosk the user advantageously has a better user experience because the speech recognition system of the human machine interface is adaptively adjusted to the user's average conversation speed and/or contextual partial phrase pause habits….”]

Regarding Claim 19, Hofer teaches:
19. The system of claim 16, wherein the memory further includes instructions that as a result of being executed by the one or more processors, cause the system to 
determine the rate of speech based at least in part on a maximum value of a set of inter-word intervals calculated based at least in part on a set of words included in a set of windows representing an audio signal, where the window of audio is a member of the set of windows. [Hofer, Figures 5, 68, and 6, 83, teach that segments/windows of audio that are considered are frames.  Figure 3C teaches storing pause statistics that would include “maximum value” of the pause / “inter-word intervals.”  See also:  “[0031] … As an example, for every partial phrase consisting of n words (n-gram) that was at least spoken by m different speakers, the maximum pause duration after that partial phrase is computed and stored….”]

Regarding Claim 20, Hofer teaches and therefore suggests:
20. The system of claim 16, wherein the window of audio includes a plurality of time steps of an audio signal including speech. [Hofer teaches the use of a frame of audio signal that may include speech.  See [0034]-[0036].  Audio processing generally includes dividing the input audio into frames that are further divided into time steps in order to convert the audio to the frequency domain from which the “Acoustic Features” (Figure 1 and Figure 4) are extracted.  Accordingly, the teaching of the frames and the teaching of the extraction of acoustic features taken together suggest the dividing of the frames of input audio into time steps.]
(Additionally, the blank symbols in Figure 16B of Cheng at the least suggest the time steps corresponding to silence.)

Regarding Claim 21, Hofer teaches:
21. The system of claim 16, wherein the memory further includes instructions that as a result of being executed by the one or more processors, cause the system to 
execute a speech processing pipeline; and [Hofer, the various components shown in Figures 1, 2, or 4 teach the “speech processing pipeline” of the Claim. Figure 4 teaches that the “Endpoint Detector 60” provides its output to the “WFST Decoder 59” and recognition result output from this decoder is provided to the “Language Interpreter Execution Unit 61” which generates a Response.]
wherein instructions that cause the system to generate the determination of the EOS further include instructions that as a result of being executed by the one or more processors, cause the system to generate the determination of the EOS as part of the speech processing pipeline. [Hofer, Figure 1 shows that the “Adaptive Endpoint Detector 15” is receiving input from the other components forming the “pipeline.”  See also Figure 4 for the same proposition.]

Regarding Claim 22, Hofer teaches:
22. The system of claim 21, wherein generating the determination of the EOS is executed by an EOS detector of the speech processing pipeline. [Hofer, Figure 1, “Adaptive Endpoint Detector 15.”  Figure 4, “Endpoint Detector 60.”]

Regarding Claim 23, Hofer teaches:
23. A method comprising:
flagging a subset of audio frames of a set of audio frames of an audio signal as end of speech (EOS) based at least in part on an EOS threshold, [Hofer, Figures 5 and 6 go through Frames of input audio and determine whether the frames in which no speech is detected (NO from the top decision step 65 or 82) satisfy a pause time threshold (69 or 86), in which case (YES) they signal an End of Utterance (EOS) (72 or 89).]
wherein the EOS threshold is determined based at least in part on a set of inter-word intervals and comprises a percentage of blank characters. [Hofer, see rejection of Claim 1.  The EOS threshold is adapted according to the pauses detected between words/phrases/sentences of the input speech.  See Figure 3C.]
Regarding the “EOS threshold … comprises a percentage of blank characters” see the rejection of Claim 1 and the combination with Cheng.

Regarding Claim 24, Hofer teaches: 
24. The method of claim 23, wherein the method further comprises determining the set of inter-word intervals based at least in part on a set of characters generated by a connectionist temporal classification (CTC) function using as an input a set of features of the set of audio frames generated by an acoustic model.  [Hofer, generates words/phrases/sentences (calls them “context”) which teach the “set of characters” of the Claim and these characters are the result of “Feature extraction 12” on the input audio followed by a “Score Converter 13” in Figure 1 or “Acoustic Frontend 57” and “Acoustic Scoring 58” that teach the “acoustic model” of the Claim.  The “adaptive endpoint detector 15” or “endpoint detector 60” determine the “set of inter-word intervals” / “pauses.”  “[0012] Turning now to FIG. 1, an embodiment of a speech recognition system 10 may include a speech converter 11 to convert speech from a user into an electronic signal, a feature extractor 12 (e.g., an acoustic feature extractor) communicatively coupled to the speech converter 11 to extract speech features from the electronic signal, a score converter 13 communicatively coupled to the feature extractor 12 to convert the speech features into scores of phonetic units, a decoder 14 (e.g., a weighted finite state transducer/WFST based decoder) communicatively coupled to the score converter 13 to decode a phrase spoken by the user based on the phonetic scores, an adaptive endpoint detector 15 communicatively coupled to the decoder 14 to determine if the decoded phrase spoken by the user corresponds to a complete request, and a request interpreter 16 communicatively coupled to the decoder 14 to interpret the request from the user.  For example, the adaptive endpoint detector 15 may be further configured to retrieve pause statistics associated with the user and to adjust a pause threshold based on the pause statistics associated with the user. 
The adaptive endpoint detector 15 may also be further configured to retrieve pause statistics associated with a word, phrase, and/or other contextual interpretation and to adjust the pause threshold based on the decoded phrase spoken by the user and the pause statistics associated with the word/phrase/other contextual interpretation.”  “[0028] … The system 50 may also record audio with a microphone 51, process the acoustic data with the processor 52, and then output speech (e.g., via loudspeaker 53) or visual information (e.g., via display 54) to the user or execute commands based on the user's request. The speech from a user 55 may be captured by the microphone 51 and converted into digital signals by an analog-to-digital (A/D) converter 56 before being processed by the processor 52. The processor 52 may include an acoustic frontend 57 to extract acoustic features, which may then be converted into acoustic scores of phonetic units by an acoustic scorer 58. Those acoustic scores may then be provided to a decoder 59 (e.g., based on WFST) to determine the phrase spoken by the user 55.”]
	Hofer does not teach the use of a CTC.
Cheng as applied to Claim 2 teaches:
wherein the method further comprises determining the set of inter-word intervals based at least in part on a set of characters generated by a connectionist temporal classification (CTC) function using as an input a set of features of the set of audio frames generated by an acoustic model. [Cheng, Audio acoustic features are provided to the CTC which identifies likely text characters corresponding to the acoustic features and also finds the pauses / “inter-word intervals” in the speech.  Figures 16A and 16B.  “In some embodiments, the same trained neural network with CTC output that is employed in the CTC segmentation technique during pre-processing may also be employed during the subsequent processing to perform the function for which it was trained. Specifically, that same trained neural network may be used to identify likely text characters from acoustic features detected in the speech audio, including using its CTC output to augment such probabilistic indications of text characters with blank symbols indicative of instances in which there are likely instances of consecutive occurrences of the same text character.”  Col. 16, lines 14-24.  “7. The apparatus of claim 6, wherein the predetermined blank threshold length is based on observations of lengths of strings of blank symbols generated by the CTC output during training of the neural network to implement the acoustic model to identify likely text characters from acoustic features or during testing of the implementation of the acoustic model by the neural network with speech sounds known to include sentence pauses as input.”  “10. 
The apparatus of claim 6, wherein performing the speech-to-text conversion using the multiple data segments as input comprises the at least one processor performing operations comprising: configure another neural network to implement the acoustic model, wherein the other neural network also comprises a CTC output; provide indications of detected acoustic features of the speech segment of each data segment to the neural network as an input and monitor the CTC output for an instance of the blank symbol indicating that two consecutive instances of a text character output by the neural network as likely characters in a sentence spoken in the speech audio should not be merged into a single instance of the text character; and provide the output of the neural network to a language model to identify the sentence spoken in each speech segment.”]
Rationale for combination as provided for Claim 2.
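For illustration only, the role of the blank symbol in a CTC output, and one simplified way of reading pauses from it, may be sketched as follows. The collapse rule (merge repeated symbols, then drop blanks) is the standard greedy CTC decoding rule. The treatment of every interior run of blanks as a candidate interval is an assumption of this example and is not asserted to be the method of Cheng or of the instant Application; the names ctc_collapse and interior_blank_runs and the blank symbol "_" are hypothetical.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(frames):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks.
    A blank between two identical characters keeps them from merging."""
    return "".join(sym for sym, _ in groupby(frames) if sym != BLANK)


def interior_blank_runs(frames):
    """Lengths of blank runs that have non-blank symbols on both sides.
    Leading and trailing blank runs are ignored."""
    runs = [(sym, len(list(group))) for sym, group in groupby(frames)]
    return [
        length
        for i, (sym, length) in enumerate(runs)
        if sym == BLANK and 0 < i < len(runs) - 1
    ]


if __name__ == "__main__":
    print(ctc_collapse("hh_e_ll_lo"))
    print(interior_blank_runs("__hi___yo_"))
```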

Regarding Claim 25, Hofer teaches (See also rejection of Claim 5):
25. The method of claim 23, further comprising:
determining an average interval of the set of inter-word intervals, [Hofer, Figures 7 and 8:  “[0038] Turning now to FIG. 7, an embodiment of an average pause time database may include a representation in a simplified hash table 91. The left most column of the table may correspond to a hash index, the middle column may correspond to a stored partial phrase, and the right column may correspond to a phrase duration….”  Figure 8 has another representation of “average pause time” between words: “[0040] Turning now to FIG. 8, an embodiment of an average pause time database may include representation in an FST….”]
the EOS threshold being based at least in part on the average interval of the set of inter-word intervals. [Hofer teaches a threshold for the pause duration (EOS threshold) and when the pause duration is greater than this threshold (Figure 5, 69 and Figure 6, 86) then end of utterance (EOS) is detected (Figure 5, 72, Figure 6, 89).  Figures 7 and 8 correspond to “an average pause time database” for inter-phrase intervals and inter-subphrase intervals which are shown as words in Figures 7 and 8.  See [0038]-[0041].  When the pause information is associated with a user, the “pause threshold” is adapted according to the “pause statistics” of the user.  “Pause statistics” includes pause average.  Thus the EOS/Pause threshold begins with statistics/average pause duration and is adapted according to the user’s conversation speed:  “[0042] Some embodiments may store pause information associated with a user. …. Some embodiments may include a semi-adaptive approach where an initial determination is quickly made regarding average conversation speed of the unknown user (e.g., fast, medium, slow, etc.) and setting a corresponding pause threshold based on previously determined pause statistics for the initial determination, and thereafter fine tuning or changing the pause threshold as the sample size increases….”]
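For illustration only, a threshold that is based at least in part on an average of observed intervals may be sketched as follows. The scale factor, the floor value, and the function name eos_threshold_from_intervals are assumptions of this example; neither Hofer nor the instant Application is asserted to use this particular formula.

```python
from statistics import mean


def eos_threshold_from_intervals(intervals_s, scale=2.0, floor_s=0.2):
    """Derive a pause threshold (seconds) from observed inter-word
    intervals (seconds): a multiple of their mean, never below a floor.
    `scale` and `floor_s` are illustrative values only."""
    if not intervals_s:
        return floor_s  # no statistics yet, fall back to the floor
    return max(floor_s, scale * mean(intervals_s))


if __name__ == "__main__":
    print(eos_threshold_from_intervals([0.1, 0.2, 0.3]))
```

Under these assumptions a speaker with longer average gaps between words receives a longer threshold, which is the qualitative behavior the cited passages describe.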

Regarding Claim 26, Hofer teaches and suggests:
26. The method of claim 23, wherein the method further comprises generating a transcript of the audio signal as a result of flagging the subset of audio frames as EOS. [Hofer is directed to a “speech recognition system” and in Figure 4 teaches generating a “Recognition Result.”  Hofer does not expressly teach generating a “transcript” or “text” from the recognized speech.  However, generating a transcript is a trivial step after speech has been recognized.  Further, the system of Figure 4 includes a “Display 54” for display of the “Response” which could be equally used for display of the text of the “Recognition Result.”  “[0032] … An example current recognition hypothesis may be "open the door"….”]
Cheng expressly teaches:
wherein the method further comprises generating a transcript of the audio signal as a result of flagging the subset of audio frames as EOS. [Cheng, Abstract:  “An apparatus includes processor(s) to: divide a speech data set into multiple data chunks that each represent a chunk of speech audio; derive a threshold amplitude based on at least one peak amplitude of the speech audio; designate each data chunk with a peak amplitude below the threshold amplitude a pause data chunk; within a set of temporally consecutive data chunks of the multiple data chunks, identify a longest subset of temporally consecutive pause data chunks; within the set of temporally consecutive data chunks, designate the longest subset of temporally consecutive pause data chunks as a likely sentence pause of a candidate set of likely sentence pauses; based on at least the candidate set, divide the speech data set into multiple data segments that each represent a speech segment of the speech audio; and perform speech-to-text conversion, to identify a sentence spoken in each speech segment.”] 
Rationale for combination as provided for Claim 17.  Hofer and Cheng pertain to speech recognition with endpoint detection and Cheng expressly teaches generation of “text” whereas Hofer stops at the recognition step because it is more focused on interactive voice response applications.

Regarding Claim 27, Hofer teaches (See also Claim 21):
27. The method of claim 23, wherein the method further comprises providing the EOS threshold to a speech recognition pipeline. [Hofer, Figure 4 teaches that the “Endpoint Detector 60” provides its output to the “WFST Decoder 59” and recognition result output from this decoder is provided to the “Language Interpreter Execution Unit 61” which generates a Response.  Figures 1 and 2 also show a series of operations and components that teach the “pipeline” of the Claim.]

Claims 28-31 are rejected under 35 U.S.C. 103 as being unpatentable over Hofer and Cheng in view of Yu (U.S. 2020/0074983).
Regarding Claim 28, Hofer teaches (See also rejection of Claim 1 over Hofer and Cheng): 
28. A processor comprising: [Hofer, “[0013] … Alternatively, or additionally, these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., to be executed by a processor or computing device….”]
one or more arithmetic logic units (ALUs) to 
generate a determination of an EOS threshold comprising a percentage of blank characters based at least in part on a rate of speech using one or more neural networks, [Hofer, Figure 1, “Adaptive Endpoint Detector 15.”  Figure 3A, “Adaptively adjust the pause threshold …. 34.”  “[0021] …Alternatively, or additionally, some embodiments may utilize a machine learning approach to learn pauses given a context. Suitable machine learning algorithms may be, for example, based on recurrent neuronal networks (RNN).”]
at least in part, by:
predicting, by the one or more neural networks, for a segment of a set of segments of an audio input, a set of characters associated with the segment; and [Hofer, Figure 4, “Recognition Result,” Figures 7 and 8 showing the recognized words and phrases; Figures 5 and 6 showing the segment of audio being a frame of audio.]
determining, by the one or more neural networks, an EOS corresponding to the segment based at least in part on a determination that the EOS threshold is satisfied for the set of characters associated with the segment. [Hofer, Figures 5 and 6 showing the determination of the EOS/End of Utterance (72 and 89) corresponding to a segment of audio (see 68 and 83: “go to next frame”).  “[0021] …Alternatively, or additionally, some embodiments may utilize a machine learning approach to learn pauses given a context. Suitable machine learning algorithms may be, for example, based on recurrent neuronal networks (RNN).”  See also [0033], [0037], and [0041] for the use of RNN.]
Cheng was combined (see rejection of Claims 1 and 2) for more expressly showing the blank and non-blank symbols/characters.
Hofer does not teach the use of an ALU.  (See Hofer:  “[0013] … For example, hardware implementations may include configurable logic such as, for example, programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), or in fixed-functionality logic hardware using circuit technology such as, for example, application specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof….”)
Neither does Cheng.
Yu teaches:
one or more arithmetic logic units (ALUs) [Yu teaches that an ALU is commonly used in processing audio: “[0063] The units described herein may be implemented using hardware components and software components. For example, the hardware components may include microphones, amplifiers, band-pass filters, audio to digital convertors, non-transitory computer memory and processing devices. A processing device may be implemented using one or more general-purpose or special purpose computers, such as, for example, a processor, a controller and an arithmetic logic unit, a digital signal processor, a microcomputer, a field programmable array, a programmable logic unit, a microprocessor, hardware circuitry or any other device capable of responding to and executing instructions in a defined manner. The processing device also may access, store, manipulate, process, and create data in response to execution of the software.”] 
Hofer/Cheng and Yu pertain to speech recognition and endpoint detection is a part of speech recognition and it would have been obvious to use an ALU as taught in Yu for the tasks of endpoint detection discussed in Hofer.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.  

Regarding Claim 29, Hofer teaches the use of a neural network (NN) but does not expressly teach that the acoustic front end is implemented in an NN. Cheng, as applied to Claim 17, teaches this feature. Cheng is directed to a pre-processing segmentation of audio that is implemented entirely in a CTC NN.
Yu expressly teaches: 
29. The processor of claim 28, wherein the one or more neural networks further comprise a neural acoustic model. [Yu teaches that it uses a neural network with phone based CTC classification to be applied to the acoustic frames which means that it has a neural network acoustic model.  “Methods and apparatuses are provided for performing acoustic to word (A2W) speech recognition training performed by at least one processor. The method includes initializing, by the at least one processor, one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC), initializing, by the at least one processor, one or more second layers of the neural network with grapheme based CTC, acquiring, by the at least one processor, training data and performing, by the at least one processor, A2W speech recognition training based the initialized one or more first layers and one or more second layers of the neural network using the training data.”  Abstract.  Figure 4 receiving the acoustic data and “[0060] Next, a joint CTC-CE training unit 113 is described herein according to an embodiment. For instance, Cross Entropy (CE) and CTC are two different loss functions for training speech recognition systems. The CE loss is used in related art speech recognition systems where a fixed alignment between acoustic frames and labels is needed. On the other hand, CTC loss is used in related art end-to-end speech recognition systems, where the loss is computed from all alignment paths belong to given target label sequence.”]
Hofer/Cheng and Yu pertain to speech recognition and it would have been obvious to use the neural network layers with CTC classification as taught in Yu for the task of acoustic modeling in place of the acoustic front end of Hofer/Cheng.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 30, Hofer teaches: 
30. The processor of claim 28, wherein the one or more neural networks are used to implement a speech processing pipeline. [Hofer teaches that the “Endpoint Detector 15/60,” which is a part of the speech processing pipeline in Figures 1 and 4, and the WFST, which performs the speech recognition, are implemented in an RNN.  “[0037] … Advantageously, the database representation as a WFST enables the use of weighted composition to also compute the context sensitive wait time. According to some embodiments, a database of pause information may include a non-linear function using RNN.”  See also “[0033] … The database may not necessarily correspond to a relational database. In some embodiments, the database may include one or more of a relational database, a graphical relationship, or a function mapping speech features to an expected pause duration. A suitable function may be, for example, a non-linear function trained using a machine learning approach (e.g., such as RNN).”]

Regarding Claim 31, Hofer teaches (See also rejection of Claim 24): 
31. The processor of claim 28, 
wherein the one or more ALUs further determine a set of inter-word intervals based at least in part on a set of characters generated by a connectionist temporal classification (CTC) function using as an input a set of features of the set of segments of the audio input generated by an acoustic model. [Hofer, generates words/phrases/sentences (calls them “context”) which teach the “set of characters” of the Claim and these characters are the result of “Feature extraction 12” on the input audio followed by a “Score Converter 13” in Figure 1 or “Acoustic Frontend 57” and “Acoustic Scoring 58” that teach the “acoustic model” of the Claim.  “[0012] Turning now to FIG. 1, an embodiment of a speech recognition system 10 may include a speech converter 11 to convert speech from a user into an electronic signal, a feature extractor 12 (e.g., an acoustic feature extractor) communicatively coupled to the speech converter 11 to extract speech features from the electronic signal, a score converter 13 communicatively coupled to the feature extractor 12 to convert the speech features into scores of phonetic units, a decoder 14 (e.g., a weighted finite state transducer/WFST based decoder) communicatively coupled to the score converter 13 to decode a phrase spoken by the user based on the phonetic scores, an adaptive endpoint detector 15 communicatively coupled to the decoder 14 to determine if the decoded phrase spoken by the user corresponds to a complete request, and a request interpreter 16 communicatively coupled to the decoder 14 to interpret the request from the user…..”  “[0028] … The system 50 may also record audio with a microphone 51, process the acoustic data with the processor 52, and then output speech (e.g., via loudspeaker 53) or visual information (e.g., via display 54) to the user or execute commands based on the user's request. 
The speech from a user 55 may be captured by the microphone 51 and converted into digital signals by an analog-to-digital (A/D) converter 56 before being processed by the processor 52. The processor 52 may include an acoustic frontend 57 to extract acoustic features, which may then be converted into acoustic scores of phonetic units by an acoustic scorer 58. Those acoustic scores may then be provided to a decoder 59 (e.g., based on WFST) to determine the phrase spoken by the user 55.”]
	Hofer does not teach the use of an ALU or a CTC.
	Cheng teaches the use of a CTC but not expressly an ALU.
Yu teaches the use of “one or more first layers of a neural network with phone based Connectionist Temporal Classification (CTC)” for performing speech recognition which includes generating the pauses between the words.  Abstract.  Yu also teaches that its processors include ALU. [0063].
Rationale for combination as provided for Claim 28.  The use of an ALU is common in speech processing applications and a neural network using a CTC classifier may be used in speech recognition where a part of speech recognition is endpoint detection for segmentation.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Chen (U.S. 2020/0410985):  “[0004] Current methods of sentence segmentation based on the speech recognition technology are not accurate because they do not take into account the fluctuation in speech speed when a person speaks. As a result, they suffer from the problem of frequent sentence segmentation or no sentence segmentation in a long time, and their accuracies are reduced.”  Chen takes into account the rate/speed of speech when conducting speech segmentation which is based on finding the start and end points of speech.  Figure 6, “Adaptive Threshold Calculation T6.”  “[0039] FIG. 6 is a schematic flowchart of adaptive sentence segmentation based on a speech speed according to an embodiment of this application.”   “[0054] Optionally, the first threshold is generated according to a previous result of sentence segmentation and/or speech information, that is to say, the first threshold is adaptively and dynamically adjusted according to features (for example, sentence duration and speech speed) of a speech given by a person….”   “[0132] In step B6, a next duration threshold can be adaptively calculated by using the speech speed of this sentence. The reason is that from a perspective of statistics, a higher speech speed indicates a shorter pause between sentences, and on the contrary, a lower speech speed indicates a longer pause between sentences. Therefore, the speech speed and the duration threshold are in a negative correlation.” 
Chen teaches a “Pause time t” which is defined to be an “inter-speech pause time” and is obtained by a “speech detection.”  This “pause time” cannot teach the “inter-word interval” of the claim because the “pause time” is determined before speech recognition and appears to be using a VAD.  The “inter-word intervals” of the Claims are later defined in the succeeding dependents to be intervals between characters and are shown in Figure 5 of the instant Application to be obtained after obtaining a string of characters, i.e., after speech recognition or at least after the application of an acoustic model to the speech.  At any rate, it is not a VAD type determination.  It is noted, however, that this Claim and many of the succeeding Claims are so broad as to be taught by the “Pause time t” of Chen as well.  “[0046] … Referring to FIG. 1, …  a user 191 begins to speak in a meeting room. Content of what the user says is speech information. After the speech information of the user is received by a sentence-segmentation-of-speech apparatus 192 (or a sentence-segmentation apparatus) and passes through a speech front-end signal processing module (or a front-end speech processing unit), an audio stream obtained after the speech information experiences speech detection and noise reduction processing is then outputted and an inter-speech pause time that is obtained by the speech detection is outputted. The audio stream is inputted into a speech recognition module for recognition processing and pause information is compared with an adaptively changing duration threshold….”  “[0051] In this embodiment, speech processing is performed on the first to-be-processed speech information through front-end speech processing to obtain an audio data stream, that is, audio stream, and an inter-speech pause time can be detected as well, that is, a first pause duration.”
The “word number information” which yields the “speech speed” teaches the “inter-word interval” and feeds the “adaptive threshold calculation B6/C6”. Paragraph [0125] teaches the “modifying the EOS threshold” according to the number of words per unit time (which is related to inter-word interval).  See description of “Adaptive threshold calculation B6” of Figure 6 or C6 of Figure 7.  “[0132] In step B6, a next duration threshold can be adaptively calculated by using the speech speed of this sentence. The reason is that from a perspective of statistics, a higher speech speed indicates a shorter pause between sentences, and on the contrary, a lower speech speed indicates a longer pause between sentences. Therefore, the speech speed and the duration threshold are in a negative correlation.”  “[0162] In step C6, a next duration threshold can be adaptively calculated by using the duration information and the speech speed of this sentence…  From a perspective of statistics, a higher speech speed indicates a shorter pause between sentences, and on the contrary, a lower speech speed indicates a longer pause between sentences. Therefore, the speech speed and the duration threshold are in a negative correlation.”
Chen uses the “speech speed” (rate or speed of speech) in the form of the “word number information” of Figures 6 and 7, which provides a words-per-unit-of-time measure of speed/rate that is related to the mean of the “inter-word intervals.”   “[0090] … The speed of the speech can be determined according to the word number said in a unit time ….”
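For illustration only, the negative correlation between speech speed and duration threshold described in Chen at [0132] and [0162] may be sketched as follows. Chen is not quoted here as giving any formula; the inverse-proportional relation, the reference speed, the clamping bounds, and the function name adaptive_pause_threshold are assumptions made for this example.

```python
def adaptive_pause_threshold(word_count, duration_s,
                             base_s=0.8, ref_wps=2.5,
                             min_s=0.3, max_s=2.0):
    """Return a pause threshold (seconds) for the next sentence given the
    word count and duration of the previous one. Faster speech yields a
    shorter threshold. All default values are illustrative only."""
    if word_count <= 0 or duration_s <= 0:
        return base_s  # no usable speed estimate
    words_per_second = word_count / duration_s
    threshold = base_s * ref_wps / words_per_second  # inverse relation
    return min(max_s, max(min_s, threshold))  # clamp to a sane range


if __name__ == "__main__":
    print(adaptive_pause_threshold(20, 4.0))  # fast speaker
    print(adaptive_pause_threshold(5, 4.0))   # slow speaker
```

Clamping keeps a single unusually fast or slow sentence from producing an extreme threshold, which relates to the problem of overly frequent or absent segmentation noted in Chen at [0004].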

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659