DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 04/01/2021. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 12 and 17 are objected to because of the following informalities:
“filter words…” should read “filler words.”
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-14 and 16-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

The independent claim 1 recites:
obtaining, by a media production platform, a transcript that comprises a series of words arranged in sequential order as uttered in a corresponding audio file;
tokenizing, by the media production platform, each word in the transcript to create a series of tokens;
labeling, by the media production platform, the series of tokens with a Natural Language Processing (NLP) library;
applying, by the media production platform, a rule associated with a filler word to the labeled series of tokens; and
identifying, by the media production platform, a word in the transcript that causes the rule to be positive as an instance of the filler word. 

The limitations of “obtaining…”, “tokenizing…”, “labeling…”, “applying…”, and “identifying…”, as drafted, cover a human organizing of activities. More specifically, a human could: receive data (e.g., text/transcript) from another human corresponding to a spoken utterance/speech; divide said text into smaller units (e.g., words/tokens); label or categorize said words (e.g., noun, verb, etc.) to create a list of the words with their respective categories; filter said words using a predefined known criterion; and identify words with no significant meaning (e.g., uh, ah, um, etc.) from said list.
This judicial exception is not integrated into a practical application because, for example, the claim recites “by the media production platform”. However, [0046] of the as-filed specification discloses: “For convenience, the media production platform 210 may be referred to as a computer program that resides within the memory 204. However, the media production platform 210 could be comprised of software, firmware, and/or hardware components implemented in, or accessible to, the computing device 200. In accordance with embodiments described herein, the media production platform 210 may include a processing module 212, tokenizing module 214, labeling module 216, and graphical user interface (GUI) module 218…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claims are not patent eligible.
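
For illustration only, the sequence of steps recited in claim 1 (tokenizing, labeling, applying a rule, and identifying) can be sketched as follows. This is a minimal sketch under stated assumptions: the toy tag table stands in for an NLP library, and the names TOY_TAGS, tokenize, label, um_rule, and find_fillers are hypothetical and are not taken from the application or from any cited reference.

```python
import re
from typing import Callable, List, Tuple

LabeledToken = Tuple[str, str]

# Hypothetical toy tag table standing in for an NLP library; a real
# system would call an actual part-of-speech tagger at this point.
TOY_TAGS = {"um": "UH", "uh": "UH", "ah": "UH", "lab": "NN", "test": "NN"}


def tokenize(transcript: str) -> List[str]:
    """Split the transcript into lowercase word tokens, keeping utterance order."""
    return re.findall(r"[a-z']+", transcript.lower())


def label(tokens: List[str]) -> List[LabeledToken]:
    """Attach a label to each token; unknown words get the placeholder 'X'."""
    return [(tok, TOY_TAGS.get(tok, "X")) for tok in tokens]


def um_rule(labeled: List[LabeledToken], index: int) -> bool:
    """Hypothetical rule for the filler word 'um': positive when the token
    is 'um' and carries the interjection label 'UH'."""
    word, tag = labeled[index]
    return word == "um" and tag == "UH"


def find_fillers(
    transcript: str, rule: Callable[[List[LabeledToken], int], bool]
) -> List[int]:
    """Return the positions of tokens for which the rule is positive."""
    labeled = label(tokenize(transcript))
    return [i for i in range(len(labeled)) if rule(labeled, i)]
```

Under these assumptions, calling find_fillers("lab test um", um_rule) flags only the third token as an instance of the filler word.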

With respect to claim 2, the claim recites: 
causing, by the media production platform, display of the transcript on an interface;
wherein said identifying comprises visually distinguishing the word from other words in the transcript. 

The claim relates to a human organizing of ideas. This reads on a human writing down text on a piece of paper and underlining or highlighting specific words such as filler words. An interface is presented as an additional limitation.	
This judicial exception is not integrated into a practical application because, for example, the claim recites “an interface”. However, [0037] of the as-filed specification discloses: “Accordingly, the interfaces 104 may be viewed on personal computers, tablet computers, mobile phones, wearable electronic devices (e.g., watches or fitness accessories), network-connected ("smart") electronic devices (e.g., televisions or home assistant devices), gaming consoles, or virtual/augmented reality systems (e.g., head-mounted displays).” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

With respect to claim 3, the claim recites: 
wherein the rule is one of multiple rules applied to the labeled series of tokens by the media production platform, and wherein the multiple rules are associated with multiple filler words.

The claim relates to a human organizing of ideas. This reads on a human determining a set of multiple predefined criteria to identify different filler words. No additional limitations are present other than the additional limitation discussed in independent claim 1, above.

With respect to claim 4, the claim recites: 
wherein the multiple rules are applied simultaneously so as to concurrently determine whether the word represents an instance of any of the multiple filler words.
The claim relates to a human organizing of ideas. This reads on two humans using two different known predefined criteria at the same time to determine if a specific filler word is present in the text. No additional limitations are present.

With respect to claim 5, the claim recites: 
wherein each of the multiple rules is associated with a different filler word.

The claim relates to a human organizing of ideas. This reads on a human defining the known predefined criteria in a way that each criterion is related to a different filler word. No additional limitations are present.

With respect to claim 6, the claim recites: 
wherein at least two of the multiple rules are associated with a single filler word.

This reads on a human defining the known predefined criteria in a way that two of the criteria are related to the same filler word. No additional limitations are present.

With respect to claim 7, the claim recites: 
wherein said obtaining comprises:
generating the transcript by performing a speech-to-text (STT) operation on the corresponding audio file.
The claim relates to a human organizing of ideas. This reads on a human writing down the spoken utterance from another human (e.g., transcript). No additional limitations are present.

With respect to claim 8, the claim recites:
wherein said obtaining comprises:
receiving input indicative of a selection of the corresponding audio file, and
retrieving, in response to said receiving, the transcript from a storage medium.

The claim relates to a human organizing of ideas. This reads on a human receiving data (e.g., audio/utterance) from another human (upon request) and finding a transcript corresponding to requested audio. A storage medium is presented as an additional limitation.	
This judicial exception is not integrated into a practical application because, for example, [0093] of the as-filed specification discloses: “Further examples of machine- and computer-readable media include recordable-type media such as volatile and non-volatile memory devices 810, removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMS) and Digital Versatile Disks (DVDs)), cloud-based storage, and transmission-type media such as digital and analog communication links…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.


The independent claim 9 recites:
A non-transitory computer-readable medium with instructions stored thereon that, when executed by a processor of a computing device, cause the computing device to perform operations comprising:
tokenizing each word in a transcript to create a series of tokens;
labeling the series of tokens with a Natural Language Processing (NLP) library;
discovering at least one filler word in the transcript by applying at least two rules to the labeled series of tokens, wherein each of the at least two rules is associated with a different filler word; and
causing display of the transcript on an interface in such manner that the at least one filler words is visually distinguishable from other words in the transcript.

The limitations of “tokenizing…”, “labeling…”, “discovering…”, and “causing…”, as drafted, cover a human organizing of activities. More specifically, a human could: divide text (e.g., received from another human) into smaller units (e.g., words/tokens); label or categorize said words (e.g., noun, verb, etc.) to create a list of the words with their respective categories; identify words with no significant meaning (e.g., uh, ah, um, etc.) from said list using predefined known criteria; and write said text on paper with the identified words underlined or highlighted in another color.
This judicial exception is not integrated into a practical application because, for example, the claim recites “a non-transitory computer-readable medium with instructions stored thereon that, when executed by a processor of a computing device”, and [0093] of the as-filed specification discloses: “Further examples of machine- and computer-readable media include recordable-type media such as volatile and non-volatile memory devices 810, removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMS) and Digital Versatile Disks (DVDs)), cloud-based storage, and transmission-type media such as digital and analog communication links…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

With respect to claim 10, the claim recites: 
wherein said labeling comprises employing part-of-speech tagging to assign a separate label to each of the series of tokens.

The claim relates to a human organizing of ideas. This reads on a human labeling or categorizing said words (e.g., noun, verb, etc.) to create a list of the words with their respective categories. No additional limitations are present. 	

With respect to claim 11, the claim recites: 
wherein said labeling comprises employing labeled dependency parsing to assign a separate label to each of the series of tokens.

The claim relates to a human organizing of ideas. This reads on a human labeling or categorizing said words (e.g., noun, verb, etc.) to create a list of the words with their respective categories. No additional limitations are present. 	

With respect to claim 12, the claim recites: 
wherein the at least two rules are applied simultaneously so as to concurrently determine whether each labeled token represents an instance of any of the filter words associated with the at least two rules.

The claim relates to a human organizing of ideas. This reads on two humans using two different known predefined criteria at the same time to determine if a specific filler word is present in the text. No additional limitations are present.

With respect to claim 13, the claim recites: 
receiving input indicative of a selection of an audio file through the interface;
retrieving, in response to said receiving, the audio file from a storage medium; and
generating the transcript by performing a speech-to-text (STT) operation on the audio file.

The claim relates to a human organizing of ideas. This reads on a human receiving data (e.g., audio/utterance) from another human (upon request) and finding a transcript corresponding to requested audio; and a human writing down the spoken utterance from another human (e.g., transcript). 
This judicial exception is not integrated into a practical application because, for example, [0093] of the as-filed specification discloses: “Further examples of machine- and computer-readable media include recordable-type media such as volatile and non-volatile memory devices 810, removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMS) and Digital Versatile Disks (DVDs)), cloud-based storage, and transmission-type media such as digital and analog communication links…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

With respect to claim 14, the claim recites: 
wherein the storage medium is accessible to the computing device across a network.
The claim recites “storage medium is accessible to the computing device across a network” as an additional limitation.
This judicial exception is not integrated into a practical application because, for example, [0028] of the as-filed specification discloses: “For example, while embodiments may be described in the context of a computer program implemented on a network-accessible server system, the relevant features may be similarly applicable to computer programs implemented on computing devices such as mobile phones, tablet computers, or personal computers…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

With respect to claim 16, the claim recites: 
wherein each rule is represented as a data structure that specifies (i) a filler word and (ii) a contextual parameter indicative of a criterion that must be satisfied for the rule to indicate that a given labeled token represents an instance of the filler word.

The claim relates to a human organizing of ideas. This reads on a human writing down the known predefined criteria as a list that specifies the filler words and the corresponding criterion to be met in order to determine the presence of each filler word. No additional limitations are present.
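
For illustration only, a rule represented as a data structure that specifies (i) a filler word and (ii) a contextual parameter, as recited in claim 16, might look like the sketch below. The class name FillerRule, the helper not_followed_by_verb, and the particular criterion are hypothetical assumptions chosen for the example; they are not drawn from the application or from any cited reference.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

LabeledToken = Tuple[str, str]  # (word, label) pair


@dataclass(frozen=True)
class FillerRule:
    """Hypothetical rule record: a filler word plus a contextual criterion."""
    filler_word: str
    criterion: Callable[[List[LabeledToken], int], bool]

    def matches(self, labeled: List[LabeledToken], index: int) -> bool:
        """True when the token is the filler word and the criterion is satisfied."""
        word, _ = labeled[index]
        return word == self.filler_word and self.criterion(labeled, index)


def not_followed_by_verb(labeled: List[LabeledToken], index: int) -> bool:
    """Assumed criterion: the next token, if any, is not tagged as a base-form verb."""
    nxt = index + 1
    return nxt >= len(labeled) or labeled[nxt][1] != "VB"


# One rule instance pairing the filler word "like" with the assumed criterion.
like_rule = FillerRule("like", not_followed_by_verb)
```

In this sketch the criterion is stored as a callable so that several rule instances, including more than one for the same filler word, can be kept in an ordinary list and applied to each labeled token in turn.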

The independent claim 17 recites:
A system comprising:	
a memory that includes instructions for formulating rules for discovering instances of filler words; and
a processor that, upon executing the instructions, is configured to:
acquire first input indicative of a filler word whose presence is to be discovered in transcripts, 
acquire second input indicative of a contextual parameter for discovering the filter word, and
program a data structure with the first and second input such that when the data structure is applied to token that is representative of a word, an output is produced that indicates whether the word is an instance of the filler word.

The limitations of “acquire…”, “acquire…”, and “program…”, as drafted, cover a human organizing of activities. More specifically, a human could: receive a first set of words or text comprising words with no significant meaning (or filler words); receive a second set of text comprising predefined criteria to identify said filler words; define and write down said filler words along with said criteria to evaluate any received data (i.e., text received from another human); and write down, as a result, whether said received text contained a filler word.
This judicial exception is not integrated into a practical application because, for example, the claim recites “a system comprising a memory […] and a processor […]”, and [0088-0089] of the as-filed specification discloses: “[…] For example, some components of the processing system 800 may be hosted on computing device that includes a media production platform […] The processing system 800 may include a processor 802, main memory 806, non-volatile memory 810, network adapter 812 (e.g., a network interface), video display 818, input/output device 820, control device 822 (e.g., a keyboard, pointing device, or mechanical input such as a button), drive unit 824 that includes a storage medium 826, or signal generation device 830 that are communicatively connected to a bus 816…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.



With respect to claim 18, the claim recites: 
wherein the contextual parameter is indicative of a criterion that must be satisfied for the data structure to indicate that the word is an instance of the filler word.

The claim relates to a human organizing of ideas. This reads on a human writing down the known predefined criterion to be met in order to determine presence of the filler words. No additional limitations are present. 
	
With respect to claim 19, the claim recites: 
generate an interface accessible via a computing device, wherein the first input and/or the second input is provided through the interface.

The claim relates to a human organizing of ideas. This reads on a human writing down the first and second input data (e.g., text or audio transcription) accessible for display. An interface is presented as an additional limitation.	
This judicial exception is not integrated into a practical application because, for example, the claim recites “an interface”. However, [0037] of the as-filed specification discloses: “Accordingly, the interfaces 104 may be viewed on personal computers, tablet computers, mobile phones, wearable electronic devices (e.g., watches or fitness accessories), network-connected ("smart") electronic devices (e.g., televisions or home assistant devices), gaming consoles, or virtual/augmented reality systems (e.g., head-mounted displays).” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

With respect to claim 20, the claim recites: 
wherein the second input is acquired by applying a computer-implemented model to one or more transcripts that includes the filler word.

The claim relates to a human organizing of ideas. This reads on a human applying a set of known predefined steps/criteria to text received from another human. A computer-implemented model is presented as an additional limitation.
This judicial exception is not integrated into a practical application because, for example, [0043] of the as-filed specification discloses: “The processor 202 can have generic characteristics similar to general-purpose processors, or the processor 202 may be an application-specific integrated circuit (ASIC) that provides control functions to the computing device 200. As shown in Figure 2, the processor 202 can be coupled to all components of the computing device 200, either directly or indirectly, for communication purposes.…” Therefore, a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is described as a general-purpose computing device, as noted. The claim is not patent eligible.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-8 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.) and further in view of Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.) and Zhang; Pin et al. (US 20210004437 A1; hereinafter referred to as Zhang et al.). 

As to independent claim 1, Wightman et al. teaches a method comprising:
obtaining, by a media production platform, a transcript that comprises a series of words arranged in sequential order as uttered in a corresponding audio file (see Fig. 4A-B and Col. 3, line 66 – Col. 4, line 5; Col. 6, line 48 – Col. 7, line 10: “The speech recognizer 18 receives a digitized recording of speech and performs full word-level recognition of the speech, including recognition of filled pause "words," such as "um" or "ah." The output from the speech recognizer 18 is then fed to the suppressing processor 20, which analyzes the output file and marks for suppression one or more silent pauses and filled pauses. […] (20) FIGS. 4a-4d show in schematic form the operations performed in the suppressing processor 20 and recognizer 18. FIG. 4a shows a waveform 120 in real time representing the sounds of "lab test um", which might be uttered by a speaker in dictation, starting at a time T0 132. As can be seen, there are pauses between each of the three words. FIG. 4b, which is not configured in real time, shows how the text file 52 produced by recognizer 18 might appear after the recognizer processes the portion of a digital recording 50 corresponding to the waveform 120. […] (21) As can be seen in FIG. 4b, the word "lab" is correctly recognized and transcribed. A start time 132 and a stop time 134 relative to T0 or some other time base is associated with the text characters "1-a-b" 130. Following "lab," the recognizer 18 finds a silent pause, identified by code SP in data field 140. A start time 142 and a stop time 144 relative to T0 or some other time base is also associated with the silent pause. Next the text for "test" produced by recognizer 18 is shown; again a start time 152 and a stop time 154 relative to T0 or some other time base is associated with the text characters "t-e-s-t" 150. Following "t-e-s-t" the recognizer 18 identifies a filled pause, "um", identified by FP in field 160, with some silence both preceding and following. 
This filled pause and silence surrounding it may be merged to create one filled pause, with a corresponding start time 162 and a stop time 164, so that the entire segment may be suppressed.” Here, the words “lab,” “test,” and “um” are interpreted as the series of words arranged in sequential order in the transcript (the text file produced by the recognizer and received at the suppressing processor).);
tokenizing, by the media production platform, each word in the transcript to create a series of tokens (see Fig. 4A-B and Col. 3, line 66 – Col. 4, line 5; Col. 6, line 48 – Col. 7, line 10 citations as in limitation above: Here, an example transcript, “lab test um”, is presented as shown in Fig. 4A, and the tokenization is presented in Fig. 4B.);

However, Wightman et al. does not explicitly teach, but Lease et al. teaches:
applying, by the media production platform (disclosed by Wightman et al. above), a rule associated with a filler word to the labeled series of tokens (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work: “[…] As shown in Fig. 2 and described in Section VI, input to our filler detection component consists of tokenized words segmented by detected sentence boundaries. As described below, we also exploit detected POS and syntactic information provided by the syntactic language model (Section III), which outputs the most likely parse tree as well as the language model score for each repair analysis candidate generated by the TAG (Section II). […] As a result, classifying the terms above by their most frequent labeling (DM or nonfiller) and detecting FPs as described earlier only achieves a filler word detection (FWD) error of about 30%, where error is defined as the number of misclassifications divided by the number of true filler words. To improve upon this, a few simple lexical, POS, and syntactic rules were adopted, as listed below. […] Lexical rules • like: label as nonfiller if (a) preceded by ’m , ’re , ’s, feel, I, n’t, seem, something, sound, stuff, things, was, would, or you or (b) followed by that or to. • oh: label as DM whenever it is not the first word of a sentence or the sentence is longer than four words. […]”); and
identifying, by the media production platform (disclosed by Wightman et al. above), a word in the transcript that causes the rule to be positive as an instance of the filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in previous limitation and further: “As a final note, recall our earlier comment that the TAG model identifies fillers involved in speech repairs, but that most fillers actually occur outside of repair contexts. Because the deterministic rules above worked well in both repair and nonrepair contexts, we found that even oracle detection of fillers in repair contexts could only negligibly improve overall performance. Therefore, we discarded filler predictions made by the TAG and predicted fillers entirely on the basis of the rules described above.”).
Wightman et al. and Lease et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor, namely speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Lease et al. of applying a rule associated with a filler word to the labeled series of tokens and identifying a word in the transcript that causes the rule to be positive as an instance of the filler word, which provides the benefit of reducing the filler word detection error, i.e., the rate of misclassification of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).
However, Wightman et al. in combination with Lease et al. do not explicitly teach, but Zhang et al. teaches:
labeling, by the media production platform, the series of tokens with a Natural Language Processing (NLP) library (see ¶ [0031]: “[…] For example, a message containing a plurality of words can be processed by a NLP library (e.g., MeCab) where each message is parsed into its words and each word can be tagged with a Part of Speech (POS) identifier (e.g., noun, adverb, adjective, etc.). […]”).
Wightman et al., Lease et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Lease et al. to incorporate the teachings of Zhang et al. of labeling, by the media production platform, the series of tokens with a Natural Language Processing (NLP) library which provides the benefit of generating message effectiveness predictions and/or other insights associated with messages in a manner that resolves the shortcomings of conventional techniques ([0031] of Zhang et al.).
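
For illustration only, the two lexical rules quoted from Lease et al. above (for “like” and “oh”) can be sketched as follows. This is a minimal sketch rather than the reference's implementation: it assumes the sentence has already been split into tokens with contractions separated (e.g., "I'm" as "i" and "'m"), and the function and constant names are hypothetical.

```python
from typing import List

# Context words taken from the quoted lexical rule for "like"
# (apostrophes normalized to the ASCII form as an assumption).
LIKE_PRECEDERS = {"'m", "'re", "'s", "feel", "i", "n't", "seem", "something",
                  "sound", "stuff", "things", "was", "would", "you"}
LIKE_FOLLOWERS = {"that", "to"}


def like_is_nonfiller(tokens: List[str], i: int) -> bool:
    """Quoted rule: label "like" as nonfiller if (a) preceded by one of the
    listed words or (b) followed by "that" or "to"."""
    prev_tok = tokens[i - 1].lower() if i > 0 else None
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else None
    return prev_tok in LIKE_PRECEDERS or next_tok in LIKE_FOLLOWERS


def oh_is_discourse_marker(tokens: List[str], i: int) -> bool:
    """Quoted rule: label "oh" as DM whenever it is not the first word of
    the sentence or the sentence is longer than four words."""
    return i != 0 or len(tokens) > 4
```

Under these assumptions, “like” in the token list ["i", "like", "pizza"] is labeled nonfiller because it is preceded by "i", whereas “like” in ["it", "is", "like", "really", "cold"] satisfies neither condition and is left to the default labeling.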

Regarding claim 2, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 1, above.
Wightman et al. further teaches: 
(the method) further comprising:
causing, by the media production platform, display of the transcript on an interface (see Col. 4, lines 5 – 22: “(5) […] The suppressing processor 20 may also only mark for suppression those silent or filled pauses that are of a minimum length, and may adjust the length of the suppressed pauses by a guard period so that abrupt cut-off or resumption of the speech will be less likely. During playback of the recorded speech, the playback software 16 identifies the locations where a filled pause has been marked for suppression by the suppressing processor 20 and skips over that segment of the recording. The transcriptionist, therefore, does not have to wait through filled pauses to resume transcription of the speech or editing of the written text. When a suppressed filled pause is not played back by the software, a user interface 24 provides a visual or audio signal to alert the transcriptionist to the suppression. This permits the transcriptionist, if desired, to rewind, disable the suppression and hear the recording without the suppression, to determine whether the suppressed portion has useful speech information.”);
wherein said identifying comprises visually distinguishing the word from other words in the transcript (see Fig. 4A-4D and Col. 4, lines 16 – 22; Col. 6, line 58 – Col. 7, line 10; and Col. 9, lines 19-40: “(5) […] When a suppressed filled pause is not played back by the software, a user interface 24 provides a visual or audio signal to alert the transcriptionist to the suppression. […] (21) […] Following "lab," the recognizer 18 finds a silent pause, identified by code SP in data field 140. […] Following "t-e-s-t" the recognizer 18 identifies a filled pause, "um", identified by FP in field 160, with some silence both preceding and following. […] (34) In yet another embodiment, the user interface 24 alerts the user when speech is suppressed during the playback of the digital recording 50. […] In one embodiment, the user interface 24 may be a light or other visual signal (e.g., a flashing icon on a monitor) that indicates when speech is suppressed during playback. For instance, a light could turn on when speech is suppressed during playback or a message could be displayed indicating suppression and, optionally, the duration of the suppression. […]” Here, visually distinguishing the word from other words in the transcript is interpreted as shown in Fig. 4B-4D, wherein the filled pauses and silent pauses are identified, tagged, and suppressed, and the suppression is communicated to the user via a sound/visual alert as disclosed above.).

Regarding claim 3, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 1, above.
Lease et al. further teaches: 
wherein the rule is one of multiple rules applied to the labeled series of tokens by the media production platform, and wherein the multiple rules are associated with multiple filler words (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work: “[…] As shown in Fig. 2 and described in Section VI, input to our filler detection component consists of tokenized words segmented by detected sentence boundaries. As described below, we also exploit detected POS and syntactic information provided by the syntactic language model (Section III), which outputs the most likely parse tree as well as the language model score for each repair analysis candidate generated by the TAG (Section II). […] To improve upon this, a few simple lexical, POS, and syntactic rules were adopted, as listed below. Lexical rules reduced overall error to about 22%, POS rules to about 20%, and syntactic rules to about 19%. Lexical rules • like: label as nonfiller if (a) preceded by ’m , ’re , ’s, feel, I, n’t, seem, something, sound, stuff, things, was, would, or you or (b) followed by that or to. • oh: label as DM whenever it is not the first word of a sentence or the sentence is longer than four words. POS rules • like: label as nonfiller if (a) followed by VB or VBP or (b) preceded NN, NNS, or VBZ. • so: label as nonfiller if followed by (a) IN, (b) preceded by AUX or RB, or (c) if the two preceding tokens were both CC. Syntactic rules • actually: label as DM only if it is either part of an interjection (UH) phrase or if it begins the utterance • so: label as nonfiller if part of an adjectival (ADJP) or adverbial (ADVP) phrase. 
[…] We also augmented this system with a set of manually constructed deterministic rules for detecting fillers and showed that repair and filler predictions could be combined to predict self-interruption points (IPs) as well.” Here, the multiple rules are interpreted as the lexical, POS and syntactic rules, which are associated with filler words such as: (like, oh), (like, so), and (actually, so), respectively.).
Wightman et al., Lease et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Lease et al. wherein the rule is one of multiple rules applied to the labeled series of tokens by the media production platform, and wherein the multiple rules are associated with multiple filler words, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).
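
For illustration only, the lexical and POS rules for "like" quoted above from Lease et al. can be sketched as follows. This sketch is not code from Lease et al. or from the application. The function names, the (word, tag) token representation, and the choice to treat "like" as a filler unless a rule labels it nonfiller are all assumptions made for the example.

```python
# Illustrative sketch only; names and the token format are hypothetical.
# Each token is a (word, POS tag) pair assumed to come from an upstream tagger.

# Words quoted in the lexical rule for "like" (straight apostrophes used here).
LIKE_PRECEDING_WORDS = {
    "'m", "'re", "'s", "feel", "i", "n't", "seem", "something",
    "sound", "stuff", "things", "was", "would", "you",
}
LIKE_FOLLOWING_WORDS = {"that", "to"}


def like_nonfiller_lexical(tokens, i):
    """Lexical rule: nonfiller if preceded or followed by a listed word."""
    prev_word = tokens[i - 1][0].lower() if i > 0 else None
    next_word = tokens[i + 1][0].lower() if i + 1 < len(tokens) else None
    return prev_word in LIKE_PRECEDING_WORDS or next_word in LIKE_FOLLOWING_WORDS


def like_nonfiller_pos(tokens, i):
    """POS rule: nonfiller if followed by VB/VBP or preceded by NN/NNS/VBZ."""
    prev_tag = tokens[i - 1][1] if i > 0 else None
    next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
    return next_tag in {"VB", "VBP"} or prev_tag in {"NN", "NNS", "VBZ"}


def find_like_fillers(tokens):
    """Return indices of "like" tokens that neither rule labels as nonfiller.

    Assumption: "like" defaults to filler when no rule fires.
    """
    return [
        i
        for i, (word, _tag) in enumerate(tokens)
        if word.lower() == "like"
        and not (like_nonfiller_lexical(tokens, i) or like_nonfiller_pos(tokens, i))
    ]
```

In this sketch, two rules of different kinds (lexical and POS) bear on the same word, and both are consulted for every occurrence of that word in the tagged token sequence.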
Regarding claim 4, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 3, above.
Lease et al. further teaches: 
wherein the multiple rules are applied simultaneously so as to concurrently determine whether the word represents an instance of any of the multiple filler words (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in previous claim 3: Also, it is interpreted that the filler rules are simultaneously applied as shown in Fig. 2: “Deterministic filler and IP rules” → “Detected repairs fillers and IPs”.).
Wightman et al., Lease et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Lease et al. wherein the multiple rules are applied simultaneously so as to concurrently determine whether the word represents an instance of any of the multiple filler words, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).

Regarding claim 5, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 3, above.
Lease et al. further teaches: 
wherein each of the multiple rules is associated with a different filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in previous claim 3: Here, the multiple (manually constructed) rules are interpreted to be associated with different filler words, such as in the examples provided: lexical rule(s): “oh”, POS rule(s): “so”, and syntactic rule(s): “actually”.).
Wightman et al., Lease et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Lease et al. wherein each of the multiple rules is associated with a different filler word, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).

Regarding claim 6, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 3, above.
Lease et al. further teaches: 
wherein at least two of the multiple rules are associated with a single filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in previous claim 3: Here, the lexical rule(s) and the POS rule(s) are both associated with “like”.).
Wightman et al., Lease et al., and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Lease et al. wherein at least two of the multiple rules are associated with a single filler word, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).

Regarding claim 7, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 1, above.
Wightman et al. further teaches: 
wherein said obtaining comprises:
generating the transcript by performing a speech-to-text (STT) operation on the corresponding audio file (see Fig. 4A-4B and Col. 6, lines 48-57: “(20) […] FIG. 4a shows a waveform 120 in real time representing the sounds of "lab test um", which might be uttered by a speaker in dictation, starting at a time T0 132. […] FIG. 4b, which is not configured in real time, shows how the text file 52 produced by recognizer 18 might appear after the recognizer processes the portion of a digital recording 50 corresponding to the waveform 120.”).

Regarding claim 8, Wightman et al. in combination with Lease et al. and Zhang et al. teach all of the limitations as in claim 1, above.
Wightman et al. further teaches:
wherein said obtaining comprises:
receiving input indicative of a selection of the corresponding audio file (see Col. 4, lines 38-65: “(8) FIG. 2 shows the speech recognizer 18 of FIG. 1 in greater detail. In one embodiment, the speech recognizer 18 uses a digital recording 50 of speech as an input. In the speech playback system 10 of the invention, the digital recording 50 may be made from analog sound received directly over the telephone line. This may be a live, real time dictation to the system 10. Alternatively, a user who wishes to have an audio recording transcribed may simply play the audio recording over the telephone and a digital recording 50 of the audio recording may be made at the receiving end for input into the speech recognizer 18.”), and
retrieving, in response to said receiving, the transcript from a storage medium (see Col. 4, lines 38-65: “(15) As noted above, with a particularly clean input digital recording 50, the recognizer output file 54 can be used to produce a first draft transcription. (17) The recognizer output file 54 from the speech recognizer 18 becomes the input to the suppressing processor 20, as illustrated in the embodiment of FIG. 3. The suppressing processor 20 may be software designed to mark for suppression appropriate filled pauses and silent pauses. These marks may later be used to suppress playback of the appropriate filled pauses and silent pauses. The output from the suppressing processor 20, which will be referred to throughout this specification as the LIX file 60, is a massaged version of the recognizer output file 54. The LIX file 60 may contain the original digital recording 50 along with a suppression file 62. The suppression file 62 in the LIX file 60 may contain suppression indicators and numbers inserted in the text file 52 that represent the start and stop times of speech suppression during playback of the speech. The numbers representing the start and stop times of suppression correspond to the time locations in the digital audio recording 50 in the LIX file 60 where a specific suppressed sound occurs.  (31) The playback software 16 may be any variety of software usable on the computer 12, and may contain a graphical user interface to allow the user to access LIX files 60 to begin a transcription session.” Here, it is interpreted that the transcript retrieved from a storage medium corresponds to the LIX file (“massaged version of the recognizer output file”) accessed by the user using a generic computer.).

Claims 9-10, 12-14, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.) and further in view of Zhang; Pin et al. (US 20210004437 A1; hereinafter referred to as Zhang et al.), Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.), and Scholz; Brian A. et al. (US 20200111386 A1; hereinafter referred to as Scholz et al.).


As to independent claim 9, Wightman et al. teaches:
a non-transitory computer-readable medium with instructions stored thereon that (see Col. 3, lines 8-30: “(2) The accompanying Figures depict embodiments of the speech playback system and methods of the present invention, and features and components thereof. With regard to references in this specification to computers, the computers may be any standard computer including standard attachments and components thereof (e.g., a disk drive, hard drive, CD player or network server that communicates with a CPU and main memory, a keyboard and mouse, and a monitor).”), when executed by a processor of a computing device, cause the computing device to perform operations comprising:
tokenizing each word in a transcript to create a series of tokens (see Fig. 4A-B and Col. 3, line 66 – Col. 4, line 5; Col. 6, line 48 – Col. 7, line 10 citations as in claim 1 above);
However, Wightman et al. does not explicitly teach, but Zhang et al. teaches:
labeling the series of tokens with a Natural Language Processing (NLP) library (see ¶ [0031] citation as in claim 1 above);
Wightman et al. and Zhang et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. to incorporate the teachings of Zhang et al. of labeling, by the media production platform, the series of tokens with a Natural Language Processing (NLP) library which provides the benefit of generating message effectiveness predictions and/or other insights associated with messages in a manner that resolves the shortcomings of conventional techniques ([0031] of Zhang et al.).

However, Wightman et al. in combination with Zhang et al. do not explicitly teach, but Lease et al. teaches:
discovering at least one filler word in the transcript by applying at least two rules to the labeled series of tokens, wherein each of the at least two rules is associated with a different filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in claim 1: “[…] As shown in Fig. 2 and described in Section VI, input to our filler detection component consists of tokenized words segmented by detected sentence boundaries. As described below, we also exploit detected POS and syntactic information provided by the syntactic language model (Section III), which outputs the most likely parse tree as well as the language model score for each repair analysis candidate generated by the TAG (Section II). […] As a result, classifying the terms above by their most frequent labeling (DM or nonfiller) and detecting FPs as described earlier only achieves a filler word detection (FWD) error of about 30%, where error is defined as the number of misclassifications divided by the number of true filler words. To improve upon this, a few simple lexical, POS, and syntactic rules were adopted, as listed below. […] Lexical rules • like: label as nonfiller if (a) preceded by ’m , ’re , ’s, feel, I, n’t, seem, something, sound, stuff, things, was, would, or you or (b) followed by that or to. • oh: label as DM whenever it is not the first word of a sentence or the sentence is longer than four words. […] As a final note, recall our earlier comment that the TAG model identifies fillers involved in speech repairs, but that most fillers actually occur outside of repair contexts. Because the deterministic rules above worked well in both repair and nonrepair contexts, we found that even oracle detection of fillers in repair contexts could only negligibly improve overall performance. Therefore, we discarded filler predictions made by the TAG and predicted fillers entirely on the basis of the rules described above.”);
Wightman et al., Zhang et al., and Lease et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Zhang et al. to incorporate the teachings of Lease et al. of discovering at least one filler word in the transcript by applying at least two rules to the labeled series of tokens, wherein each of the at least two rules is associated with a different filler word, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).

However, Wightman et al., in combination with Zhang et al., and Lease et al. do not explicitly teach, but Scholz et al. teaches:
causing display of the transcript on an interface in such manner that the at least one filler words is visually distinguishable from other words in the transcript (see Fig. 8 and ¶ [0105]: “In particular embodiments, the scoring module (60) can further function to identify and highlight (132) filler words (55) in the formatted text (98). The highlight (132) can be depicted by under lineation of the filler words (55); however, this example does not preclude any manner of visually viewable highlight of filler words (55), such as shading, colored shading, encircling, dots, bold lines, or the like.”).
Wightman et al., Zhang et al., Lease et al., and Scholz et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have combined Wightman et al. in combination with Zhang et al. and Lease et al. with the teachings of Scholz et al. of causing display of the transcript on an interface in such manner that the at least one filler words is visually distinguishable from other words in the transcript in order to yield predictable results of identifying and highlighting filler words present in the text. (See KSR v. Teleflex).
 
Regarding claim 10, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 9, above.
Zhang et al. further teaches:
wherein said labeling comprises employing part-of-speech tagging to assign a separate label to each of the series of tokens (see ¶ [0031]: “[…] For example, a message containing a plurality of words can be processed by a NLP library (e.g., MeCab) where each message is parsed into its words and each word can be tagged with a Part of Speech (POS) identifier (e.g., noun, adverb, adjective, etc.). […]”).
Wightman et al., Zhang et al., Lease et al., and Scholz et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. to further incorporate the teachings of Zhang et al. of wherein said labeling comprises employing part-of-speech tagging to assign a separate label to each of the series of tokens which provides the benefit of generating message effectiveness predictions and/or other insights associated with messages in a manner that resolves the shortcomings of conventional techniques ([0031] of Zhang et al.).

Regarding claim 12, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 9, above.
Lease et al. further teaches:
wherein the at least two rules are applied simultaneously so as to concurrently determine whether each labeled token represents an instance of any of the filter words associated with the at least two rules (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in previous claim 3: Also, it is interpreted that the filler rules are simultaneously applied as shown in Fig. 2: “Deterministic filler and IP rules” → “Detected repairs fillers and IPs”).
Wightman et al., Zhang et al., Lease et al., and Scholz et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. to further incorporate the teachings of Lease et al. wherein the multiple rules are applied simultaneously so as to concurrently determine whether the word represents an instance of any of the multiple filler words, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).

Regarding claim 13, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 9, above.
Wightman et al. further teaches:
wherein the operations further comprise:
receiving input indicative of a selection of an audio file through the interface (see Col. 4, lines 38-65 as in claim 8: “(8) FIG. 2 shows the speech recognizer 18 of FIG. 1 in greater detail. In one embodiment, the speech recognizer 18 uses a digital recording 50 of speech as an input. In the speech playback system 10 of the invention, the digital recording 50 may be made from analog sound received directly over the telephone line. This may be a live, real time dictation to the system 10. […] Alternatively, a user who wishes to have an audio recording transcribed may simply play the audio recording over the telephone and a digital recording 50 of the audio recording may be made at the receiving end for input into the speech recognizer 18.”);
retrieving, in response to said receiving, the audio file from a storage medium (see Col. 4, lines 38-65 as in claim 8: Here, it is interpreted that the played audio recording (by the user) is stored in a storage medium/device capable of playing the audio recording.); and
generating the transcript by performing a speech-to-text (STT) operation on the audio file (see Fig. 4A-4B and Col. 6, lines 48-57 citations as in claim 7: “(20) […] FIG. 4a shows a waveform 120 in real time representing the sounds of "lab test um", which might be uttered by a speaker in dictation, starting at a time T0 132. […] FIG. 4b, which is not configured in real time, shows how the text file 52 produced by recognizer 18 might appear after the recognizer processes the portion of a digital recording 50 corresponding to the waveform 120.”).

Regarding claim 14, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 13, above.
Wightman et al. further teaches:
wherein the storage medium is accessible to the computing device across a network (see Col. 3, lines 8-30: “(2) The accompanying Figures depict embodiments of the speech playback system and methods of the present invention, and features and components thereof. With regard to references in this specification to computers, the computers may be any standard computer including standard attachments and components thereof (e.g., a disk drive, hard drive, CD player or network server that communicates with a CPU and main memory, a keyboard and mouse, and a monitor).”).

Regarding claim 16, Wightman et al. in combination with Lease et al., Zhang et al., and Scholz et al. teach all of the limitations as in claim 9, above.
Lease et al. further teaches:
wherein each rule is represented as a data structure that specifies (i) a filler word and (ii) a contextual parameter indicative of a criterion that must be satisfied for the rule to indicate that a given labeled token represents an instance of the filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations as in claim 3 above: “[…] As shown in Fig. 2 and described in Section VI, input to our filler detection component consists of tokenized words segmented by detected sentence boundaries. As described below, we also exploit detected POS and syntactic information provided by the syntactic language model (Section III), which outputs the most likely parse tree as well as the language model score for each repair analysis candidate generated by the TAG (Section II). […] To improve upon this, a few simple lexical, POS, and syntactic rules were adopted, as listed below. Lexical rules reduced overall error to about 22%, POS rules to about 20%, and syntactic rules to about 19%. Lexical rules • like: label as nonfiller if (a) preceded by ’m , ’re , ’s, feel, I, n’t, seem, something, sound, stuff, things, was, would, or you or (b) followed by that or to. • oh: label as DM whenever it is not the first word of a sentence or the sentence is longer than four words. POS rules • like: label as nonfiller if (a) followed by VB or VBP or (b) preceded NN, NNS, or VBZ. • so: label as nonfiller if followed by (a) IN, (b) preceded by AUX or RB, or (c) if the two preceding tokens were both CC. Syntactic rules • actually: label as DM only if it is either part of an interjection (UH) phrase or if it begins the utterance • so: label as nonfiller if part of an adjectival (ADJP) or adverbial (ADVP) phrase. 
[…] We also augmented this system with a set of manually constructed deterministic rules for detecting fillers and showed that repair and filler predictions could be combined to predict self-interruption points (IPs) as well.” Here, each of the rules (lexical, POS, and syntactic) is interpreted to specify a filler word (“oh”, “so”, “actually”, etc.), and each is associated with a criterion for determining whether the word is a filler word.).
Wightman et al., Lease et al., Zhang et al., and Scholz et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Lease et al., Zhang et al., and Scholz et al. to further incorporate the teachings of Lease et al. wherein each rule is represented as a data structure that specifies (i) a filler word and (ii) a contextual parameter indicative of a criterion that must be satisfied for the rule to indicate that a given labeled token represents an instance of the filler word, which provides the benefit of reducing the filler word detection error, i.e., the number of misclassifications of filler words (section V. Filler Word Detection, page 1570, column 2, paragraph 3 of Lease et al.).
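
As a hedged illustration of the claim 16 limitation as interpreted above, a rule can be pictured as a record that pairs a filler word with a contextual criterion. The sketch below is not taken from the application or from any cited reference. The `FillerRule` type, the function names, and the (word, tag) token format are hypothetical. The "oh" criterion follows the quoted lexical rule as written, while the "so" criterion is an assumed inversion of the quoted nonfiller conditions (the syntactic rule for "so" is omitted).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag); hypothetical representation


@dataclass(frozen=True)
class FillerRule:
    """Hypothetical rule record: a filler word plus a contextual criterion."""
    filler_word: str
    criterion: Callable[[Sequence[Token], int], bool]


def oh_criterion(sentence: Sequence[Token], i: int) -> bool:
    # Quoted lexical rule: not the first word, or the sentence exceeds four words.
    return i != 0 or len(sentence) > 4


def so_criterion(sentence: Sequence[Token], i: int) -> bool:
    # Assumption: "so" counts as a filler unless a quoted nonfiller condition holds.
    next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else None
    prev_tag = sentence[i - 1][1] if i > 0 else None
    two_prev_cc = i >= 2 and sentence[i - 1][1] == "CC" and sentence[i - 2][1] == "CC"
    nonfiller = next_tag == "IN" or prev_tag in {"AUX", "RB"} or two_prev_cc
    return not nonfiller


RULES = [FillerRule("oh", oh_criterion), FillerRule("so", so_criterion)]


def apply_rules(sentence: Sequence[Token], rules: Sequence[FillerRule] = RULES) -> List[Tuple[int, str]]:
    """Check every rule against every token; return (index, filler word) hits."""
    hits = []
    for i, (word, _tag) in enumerate(sentence):
        for rule in rules:
            if word.lower() == rule.filler_word and rule.criterion(sentence, i):
                hits.append((i, rule.filler_word))
    return hits
```

Keeping the word and its criterion together in one record means a new filler word can be covered by appending a rule to the list without changing the loop that applies the rules.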

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.) and further in view of Zhang; Pin et al. (US 20210004437 A1; hereinafter referred to as Zhang et al.), Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.), and Scholz; Brian A. et al. (US 20200111386 A1; hereinafter referred to as Scholz et al.), as applied to claim 9, and further in view of Marey; Yusuf AbdElhakam (US 20220012296 A1; hereinafter referred to as Marey et al.). 

Regarding claim 11, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 9, above.
However, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. do not explicitly teach, but Marey et al. teaches:
wherein said labeling comprises employing labeled dependency parsing to assign a separate label to each of the series of tokens (see ¶ [0026]: “[0026] In some embodiments, statistical NLP techniques may be employed, and natural language understanding (NLU) analytics may be used to identify and parse text. In some natural language recognition models, grammar induction and grammar inference algorithms, such as context-free Lempel-Ziv-Welch algorithm or byte-pair encoding and optimization, may be employed. Lemmatization tasks may be employed to remove inflectional endings, morphological segmentation may be performed to separate words into individual morphemes and identify the class of morphemes, part-of-speech tagging (e.g., using SpaCy, a Python library for advanced NLP), dependency parsing, parsing, […]”).
Wightman et al., Zhang et al., Lease et al., Scholz et al., and Marey et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have substituted the labeling technique of Zhang et al., as used in the combination of Wightman et al., Zhang et al., Lease et al., and Scholz et al., with the dependency parsing taught by Marey et al., wherein said labeling comprises employing labeled dependency parsing to assign a separate label to each of the series of tokens, in order to yield predictable results of identifying and parsing text. (See KSR v. Teleflex).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.) and further in view of Zhang; Pin et al. (US 20210004437 A1; hereinafter referred to as Zhang et al.), Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.), and Scholz; Brian A. et al. (US 20200111386 A1; hereinafter referred to as Scholz et al.), as applied to claim 9, and further in view of Ispahani; Abigail  (US 20210064327 A1; hereinafter referred to as Ispahani). 

Regarding claim 15, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. teach all of the limitations as in claim 9, above.
Wightman et al. further teaches:
wherein the operations further comprise:
receiving input indicative of a selection of an audio file through the interface (see Col. 4, lines 38-65 as in claim 13 above);
retrieving, in response to said receiving, the audio file from a storage medium (see Col. 4, lines 38-65 as in claim 13 above);

However, Wightman et al. in combination with Zhang et al., Lease et al., and Scholz et al. do not explicitly teach, but Ispahani teaches:
forwarding the audio file to a transcription service via an application programming interface (see ¶ [0019-0020 and 0024]: “[0019] From step 105, the method continues to either step 106a or step 106b in accordance with the mode of operation selected by the user. If the user has chosen to transcribe the entire audio stream into text (for example, by selecting an option to transcribe the entire audio stream in a user interface provided by the application), the method continues to step 106a. If the user has instead chosen to transcribe selected portions of the audio stream on demand during playback (as described in more detail below), the method continues to step 106b. [0020] In step 106a, the AHM begins transcribing spoken words from the audio stream into text immediately. From step 106a, the method continues to step 107. [0024] In embodiments where the speech-to-text converter resides on the same computer system as that of the AHM, the AHM sends each digital audio chunk to the speech-to-text converter for transcription, for example with an API call to an offline speech recognition software library. The speech-to-text converter transcribes the speech content of each audio chunk into a text string and returns each text string to the AHM in accordance with the conventions of the speech-to-text API.”); and
obtaining the transcript from the transcription service via the application programming interface (see ¶ [0019-0020 and 0024] as in previous limitation: “[…] The speech-to-text converter transcribes the speech content of each audio chunk into a text string and returns each text string to the AHM in accordance with the conventions of the speech-to-text API.”).
Wightman et al., Lease et al., Zhang et al., Scholz et al. and Ispahani are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wightman et al. in combination with Lease et al., Zhang et al. and Scholz et al. to further incorporate the teachings of Ispahani of forwarding the audio file to a transcription service via an application programming interface and obtaining the transcript from the transcription service via the application programming interface, which provides the benefit of allowing a listener to transcribe spoken word audio passages in, for example, a podcast or audio book, for later searching and/or reference ([0003] of Ispahani).
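
For illustration only, the chunk-wise exchange described in ¶ [0024] of Ispahani (audio chunks sent to a speech-to-text converter through an API call, text strings returned) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Ispahani or of the claimed invention: the names SpeechToText, EchoConverter, split_chunks and obtain_transcript, the fixed byte-based chunk size, and the stand-in converter are all hypothetical.

```python
from typing import Iterator, Protocol


class SpeechToText(Protocol):
    """Hypothetical API surface of a transcription service."""

    def transcribe(self, chunk: bytes) -> str:
        ...


class EchoConverter:
    """Stand-in converter (assumption): reports chunk size instead of real speech recognition."""

    def transcribe(self, chunk: bytes) -> str:
        return f"<{len(chunk)} bytes>"


def split_chunks(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Split the retrieved audio into consecutive chunks of at most chunk_size bytes."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def obtain_transcript(audio: bytes, converter: SpeechToText, chunk_size: int = 4) -> str:
    """Forward each chunk through the API and join the returned text strings in order."""
    return " ".join(converter.transcribe(chunk) for chunk in split_chunks(audio, chunk_size))


transcript = obtain_transcript(b"x" * 10, EchoConverter(), chunk_size=4)
```

Any object that exposes a transcribe method accepting bytes and returning a string could be substituted for the stand-in converter, which is the sense in which the transcript is obtained "via the application programming interface."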

Claims 17-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.) and further in view of Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.).

As to independent claim 17, Lease et al. teaches:
(see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work of Lease et al. as in claim 1 above)

acquire first input indicative of a filler word whose presence is to be discovered in transcripts (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations of Lease et al. as in claim 16 above: Here, it is interpreted that the first input corresponding to the filler words to be discovered are the words: oh, so, actually, like, etc.), 
acquire second input indicative of a contextual parameter for discovering the filter word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations of Lease et al. as in claim 16 above: Here, it is interpreted that the second input corresponding to the contextual parameters to discover the filler words are associated with the deterministic filler rules/criterion/labels of (for example): label as nonfiller if (a) preceded by ’m , ’re , ’s, feel, I, n’t, seem, something, sound, stuff, things, was, would, or you or (b) followed by that or to […], label as DM whenever it is not the first word of a sentence or the sentence is longer than four words, […] or label as nonfiller if part of an adjectival (ADJP) or adverbial (ADVP) phrase.), and 
program a data structure with the first and second input such that when the data structure is applied to token that is representative of a word, an output is produced that indicates whether the word is an instance of the filler word (see Fig. 2 of Lease et al. “Program System Architecture.” Here, the input sentences, which are tokenized, are interpreted to include the first input (filler words) and the second input (the deterministic filler rules), which include the criteria needed for the discovery of filler words (criteria disclosed above) and which in many instances depend on the tokens/words adjacent to the filler words and/or their labels (ADJP, ADVP, etc.). Also, the output is interpreted as the detected repairs, fillers, and IPs of Fig. 2.).
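
For illustration only, the interpretation above (a data structure holding a filler word as first input and contextual criteria as second input, applied to a token to produce a filler/nonfiller output) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the system of Lease et al. or of the claimed invention: the names FillerRule, LIKE_RULE and find_fillers are hypothetical, the pairing of the quoted preceded-by/followed-by criteria with the word "like" is an assumption made for the example, the tokens are assumed to be pre-tokenized with clitics split off, and the POS and syntactic (ADJP/ADVP) criteria are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FillerRule:
    """Hypothetical data structure: a filler word plus its contextual criteria."""

    filler_word: str                                   # first input
    nonfiller_if_preceded_by: frozenset = frozenset()  # second input (left context)
    nonfiller_if_followed_by: frozenset = frozenset()  # second input (right context)

    def is_filler(self, tokens, i):
        """Return True if tokens[i] is judged to be an instance of the filler word."""
        if tokens[i].lower() != self.filler_word:
            return False
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        return (prev_tok not in self.nonfiller_if_preceded_by
                and next_tok not in self.nonfiller_if_followed_by)


# Criteria copied from the rule quoted above; applying them to "like" is an assumption.
LIKE_RULE = FillerRule(
    filler_word="like",
    nonfiller_if_preceded_by=frozenset({
        "'m", "'re", "'s", "feel", "i", "n't", "seem", "something",
        "sound", "stuff", "things", "was", "would", "you",
    }),
    nonfiller_if_followed_by=frozenset({"that", "to"}),
)


def find_fillers(tokens, rule):
    """Return the positions of the tokens that the rule marks as filler words."""
    return [i for i in range(len(tokens)) if rule.is_filler(tokens, i)]


hits = find_fillers(["so", "like", "we", "went", "home"], LIKE_RULE)
```

In the sketch, "like" in "so like we went home" is marked as a filler, whereas "like" in "I like that song" is not, because it is preceded by "I".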

However, Lease et al. does not explicitly teach, but Wightman et al. teaches:
A system comprising:
a memory (see Col. 3, lines 8-30 of Wightman et al. citation as in claim 9: “(2) The accompanying Figures depict embodiments of the speech playback system and methods of the present invention, and features and components thereof. With regard to references in this specification to computers, the computers may be any standard computer including standard attachments and components thereof (e.g., a disk drive, hard drive, CD player or network server that communicates with a CPU and main memory, a keyboard and mouse, and a monitor).”); and
a processor that (see Col. 3, lines 8-30 of Wightman et al. citation as in previous limitation), upon executing the 
Lease et al. and Wightman et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lease et al. to incorporate the teachings of Wightman et al. of a system comprising a memory and a processor which provides the benefit of using any variety of software usable on a computer system (including a memory and processor) which may contain a graphical user interface to allow the user to access files/transcription process (Col. 8, lines 44-47 of Wightman et al.).

Regarding claim 18, Lease et al. in combination with Wightman et al. teach all of the limitations as in claim 17, above.
Lease et al. further teaches:
wherein the contextual parameter is indicative of a criterion that must be satisfied for the data structure to indicate that the word is an instance of the filler word (see Fig. 2 (Overall system architecture) and sections V. Filler Word Detection and VIII. Conclusion and Future Work citations of Lease et al. as in claims 16 and 17 above).

Regarding claim 20, Lease et al. in combination with Wightman et al. teach all of the limitations as in claim 17, above.
Lease et al. further teaches:
wherein the second input is acquired by applying a (disclosed in Wightman et al.) to one or more transcripts that includes the filler word (See sections V. Filler Word Detection and VIII. Conclusion and Future Work citations of Lease et al. as in claims 16 and 17 above and Fig. 2: “Overall system architecture”, where the second input is acquired by applying the computer-implemented deterministic rules disclosed above (lexical, POS, and syntactic). Also, even though in Lease et al. there is not an explicit mention of the computer-implemented model, it is interpreted that the overall architecture presented in Fig. 2 is implemented in a general purpose computer. Additionally, the primary reference Wightman et al. does disclose “computerized methods”(Col. 1, lines 8-10).).

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Lease, Matthew et al. (Lease, Matthew, Mark Johnson, and Eugene Charniak. "Recognizing disfluencies in conversational speech." IEEE Transactions on Audio, Speech, and Language Processing 14.5 (2006): 1566-1573.; hereinafter referred to as Lease et al.) and further in view of Wightman; Colin W. et al. (US 6161087 A; hereinafter referred to as Wightman et al.) as applied to claim 17 above and further in view of Scholz; Brian A. et al. (US 20200111386 A1; hereinafter referred to as Scholz et al.).

Regarding claim 19, Lease et al. in combination with Wightman et al. teach all of the limitations as in claim 17, above.
However, Lease et al. in combination with Wightman et al. does not explicitly teach, but Scholz et al. teaches:
wherein the processor is further configured to:
generate an interface accessible via a computing device, wherein the first input and/or the second input is provided through the interface (see ¶ [0031 and 0046]: “[0031] A display surface (22), such as a graphical display surface, provided by a monitor screen or other type of display device can also be connected to the computing devices (3)(3A)(3B). In addition, each of the one or more computing devices (3)(3A)(3B) can further include peripheral input devices (23) such as a video recorder (24), for example a camera, video camera, web camera, mobile phone camera, video phone, or the like, and an audio recorder (25) such as microphones, speaker phones, computer microphones, or the like. [0046] Whether the media input module (36) functions during acquisition of the video stream (24B) or the audio stream (25B) or functions to retrieve media files (40), the media input module (36) can utilize a plurality of different parsers (41) to read video stream data (24C), audio stream data (25C), or the combined stream data (24C/25C) or from any file format or media type. Once the media input module (36) receives the video stream data (24C) or the audio stream data (25C) or combined stream data (24C/25C) and opens the media file (40), the media input module (36) uses a video and audio stream decoder (42) to decode the video stream data (24C) or the audio stream data (25C) or the combined stream data (24C/25C).”).
Lease et al., Wightman et al. and Scholz et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Lease et al. in combination with Wightman et al. to incorporate the teachings of Scholz et al. of causing display of the transcript on an interface in such manner that the at least one filler words is visually distinguishable from other words in the transcript, which provides the benefit of providing a filler word usage metric as part of an interactive presentation assessment ([0106-0107] of Scholz et al.).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659
06/03/2022