DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 02 December 2020. 
Claims 1-20 are pending in the application. As such, claims 1-20 have been examined. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings were received on 02 December 2020.  These drawings have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 7-9, 14, 15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Weil et al. (US Patent Pub. No. 2018/0270446), hereinafter Weil, in view of Wilder et al. (US Patent Pub. No. 2015/0019206), hereinafter Wilder.

Regarding claim 1, Weil teaches a computer-implemented method (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones), 
the method comprising: 
obtaining a textual rendering (Weil [0041] FIG. 2 is a conceptual illustration of a media message 200 generated by media messaging application 104. As described above, media message 200 (e.g., media project, media sequence) can include a sequence of video clips. For example, media message 200 can include clips 210, 220, 230, 240, 250 and/or 260. Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data)) 
of an audio portion of a video (Weil [0041] Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data)), 
wherein the textual rendering is generated by a natural language speech-to-text processor (Weil [0041] and/or transcription data (e.g., a speech to text transcription or translation of the audio data)); 
generating an augmented rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed);
identifying a mistranscription within the textual rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space (Weil [0036] In some implementations, user device can include dictation service 110. For example, dictation service 110 can perform transcriptions of speech in audio data by sending audio data to a network dictation service (described below) and/or by performing transcriptions itself on user device); 
selecting from among a multi-word vocabulary of the pretrained word embedding model a plurality of candidate words for replacing the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score),
the selecting based on similarity values determined for each vocabulary word (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score), 
each similarity value indicating a closeness of a corresponding vocabulary word to the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score);
and modifying the textual rendering by replacing the mistranscription with a candidate word (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
that, based on a comparison of average semantic similarity values of each candidate word in relation to each word contained in the augmented rendering, is more similar to the mistranscription than is each of the other candidate words (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score).
Weil teaches correcting a transcript of an audio portion of a video clip; however, Weil does not teach:
generating an augmented rendering by “combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video”;
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space “derived from the contextualizing data.”
Wilder teaches
combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata);

derived from the contextualizing data (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata).
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder to extract data from the video other than the audio data, such as OCR data and object recognition data. Doing so would allow the additional data to be incorporated into the interpretation and transcription of the audio portion of the video.
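For illustration only, the selecting and modifying limitations of claim 1 can be sketched as below. This is one possible reading of the claim language, not a characterization of how Weil, Wilder, or Applicant implements it. The toy two-dimensional vectors, the function names (cosine, top_candidates, best_replacement), and the example words are all hypothetical; cosine similarity is assumed as the similarity measure, which the claim does not require.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_candidates(embeddings, mistranscription, k):
    """Rank every other vocabulary word by closeness to the mistranscribed word."""
    target = embeddings[mistranscription]
    scores = {
        word: cosine(vec, target)
        for word, vec in embeddings.items()
        if word != mistranscription
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def best_replacement(embeddings, candidates, augmented_words):
    """Pick the candidate with the highest average similarity to the
    words of the augmented rendering (transcript plus contextual data)."""
    def average_similarity(candidate):
        sims = [
            cosine(embeddings[candidate], embeddings[w])
            for w in augmented_words
            if w in embeddings and w != candidate
        ]
        return sum(sims) / len(sims) if sims else 0.0
    return max(candidates, key=average_similarity)

# Hypothetical toy embedding space; a real model would be pretrained.
EMBEDDINGS = {
    "bare": (1.0, 0.0),
    "bear": (0.8, 0.6),
    "bar": (0.9, -0.4),
    "market": (0.0, 1.0),
    "stocks": (0.1, 0.9),
    "zebra": (-1.0, 0.0),
}

candidates = top_candidates(EMBEDDINGS, "bare", k=2)
replacement = best_replacement(EMBEDDINGS, candidates, ["market", "stocks"])
print(candidates, replacement)
```

In this sketch the first step narrows the vocabulary to words near the mistranscription, and the second step uses the surrounding and contextual words to choose among them. By contrast, the passage of Weil [0066] quoted above describes presenting the candidate word with the highest confidence score.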

Regarding claim 2, Weil in view of Wilder teaches the method of claim 1.
Weil does not specifically teach, but Wilder further teaches,
further comprising generating the electronically garnered contextualizing data by performing at least one of: 
extracting metadata from the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

generating machine-encoded text based on optical character recognition of one or more frames of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

tagging one or more objects recognized in one or more frames of the video based on classifying the objects using a machine learning classification model (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

or summarizing textual renderings of audio portions of other videos previously captured from one or more channels.
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder to extract data from the video other than the audio data, such as OCR data and object recognition data. Doing so would allow the additional data to be incorporated into the interpretation and transcription of the audio portion of the video.
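For illustration only, the combination of a textual rendering with time-aligned contextual data, as discussed for claims 1 and 2, can be sketched as below. The sketch assumes that each source (speech transcript, OCR text, object tags) yields timestamped text items and that the items are merged by timestamp; the TimedText type, the function names, and the sample data are hypothetical and are not taken from Weil, Wilder, or the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedText:
    start: float   # seconds from the start of the video
    text: str      # recognized or transcribed text
    source: str    # e.g. "speech", "ocr", "object" (labels are assumptions)

def build_augmented_rendering(transcript, *context_streams):
    """Merge transcript items with contextual items, ordered by start time."""
    items = list(transcript)
    for stream in context_streams:
        items.extend(stream)
    return sorted(items, key=lambda item: item.start)

def words_of(rendering):
    """Flatten an augmented rendering into a lowercase word list."""
    return [word.lower() for item in rendering for word in item.text.split()]

# Hypothetical sample data standing in for real recognizer output.
transcript = [TimedText(0.0, "the bare market", "speech")]
ocr_text = [TimedText(0.5, "STOCKS FALL", "ocr")]
object_tags = [TimedText(0.2, "chart", "object")]

augmented = build_augmented_rendering(transcript, ocr_text, object_tags)
print([item.source for item in augmented])
print(words_of(augmented))
```

The resulting word list is the kind of input against which candidate replacement words could then be compared.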

Regarding claim 7, Weil in view of Wilder teaches the method of claim 1.
Weil further teaches
wherein the modifying is performed in real time during a real-time rendering of the video over a channel (Weil [0095] In some implementations, GUI 1000 can include graphical element 1020 to enable or disable automatic titling for the selected clip. As described above, in near real time while recording audio data, media messaging application 104 can transcribe speech in the audio data stream into transcription data (e.g., text). The transcription data can be presented overlaid on the video data presented in area 1004 in near real time while recording audio data and/or video data. The user can invoke a graphical user interface (e.g., GUI 1100) to enable and/or disable transcription (e.g., titling, captioning, etc.) and/or select a titling style for presenting transcription data by selecting graphical element).

Regarding claim 8, Weil in view of Wilder teaches a system (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones), 
comprising: 
a processor (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones)

configured to initiate operations including: 

obtaining a textual rendering (Weil [0041] FIG. 2 is a conceptual illustration of a media message 200 generated by media messaging application 104. As described above, media message 200 (e.g., media project, media sequence) can include a sequence of video clips. For example, media message 200 can include clips 210, 220, 230, 240, 250 and/or 260. Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data))
of an audio portion of a video (Weil [0041] Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data)), 
wherein the textual rendering is generated by a natural language speech-to-text processor (Weil [0041] and/or transcription data (e.g., a speech to text transcription or translation of the audio data)); 
generating an augmented rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed);

identifying a mistranscription within the textual rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space (Weil [0036] In some implementations, user device can include dictation service 110. For example, dictation service 110 can perform transcriptions of speech in audio data by sending audio data to a network dictation service (described below) and/or by performing transcriptions itself on user device);
selecting from among a multi-word vocabulary of the pretrained word embedding model a plurality of candidate words for replacing the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score), 
the selecting based on similarity values determined for each vocabulary word (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score), 
each similarity value indicating a closeness of a corresponding vocabulary word to the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score); 
and modifying the textual rendering by replacing the mistranscription with a candidate word (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
that, based on a comparison of average semantic similarity values of each candidate word in relation to each word contained in the augmented rendering, is more similar to the mistranscription than is each of the other candidate words (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score).
Weil teaches correcting a transcript of an audio portion of a video clip; however, Weil does not teach:
generating an augmented rendering by “combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video”;
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space “derived from the contextualizing data.”

Wilder teaches
combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata);

derived from the contextualizing data (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata).
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder to extract data from the video other than the audio data, such as OCR data and object recognition data. Doing so would allow the additional data to be incorporated into the interpretation and transcription of the audio portion of the video.

Regarding claim 9, Weil in view of Wilder teaches the system of claim 8.
Weil does not specifically teach, but Wilder further teaches,
further comprising generating the electronically garnered contextualizing data by performing at least one of: 
extracting metadata from the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

generating machine-encoded text based on optical character recognition of one or more frames of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

tagging one or more objects recognized in one or more frames of the video based on classifying the objects using a machine learning classification model (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

or summarizing textual renderings of audio portions of other videos previously captured from one or more channels.
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder to extract data from the video other than the audio data, such as OCR data and object recognition data. Doing so would allow the additional data to be incorporated into the interpretation and transcription of the audio portion of the video.

Regarding claim 14, Weil teaches a computer program product (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones), 
the computer program product comprising: 
one or more computer-readable storage media and program instructions collectively stored on the one or more computer-readable storage media (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones), 

the program instructions executable by a processor (Weil [0165] This disclosure above describes various Graphical User Interfaces (GUIs) for implementing various features, processes or workflows. These GUIs can be presented on a variety of electronic devices including but not limited to laptop computers, desktop computers, computer terminals, television systems, tablet computers, e-book readers and smart phones)

to cause the processor to initiate operations including: 

obtaining a textual rendering (Weil [0041] FIG. 2 is a conceptual illustration of a media message 200 generated by media messaging application 104. As described above, media message 200 (e.g., media project, media sequence) can include a sequence of video clips. For example, media message 200 can include clips 210, 220, 230, 240, 250 and/or 260. Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data)) 
of an audio portion of a video (Weil [0041] Each video clip can include video data (e.g., still image, sequence of video frames, etc.), audio data (e.g., recorded speech), and/or transcription data (e.g., a speech to text transcription or translation of the audio data)), 
wherein the textual rendering is generated by a natural language speech-to-text processor (Weil [0041] and/or transcription data (e.g., a speech to text transcription or translation of the audio data)); 
generating an augmented rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)

identifying a mistranscription within the textual rendering (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space (Weil [0036] In some implementations, user device can include dictation service 110. For example, dictation service 110 can perform transcriptions of speech in audio data by sending audio data to a network dictation service (described below) and/or by performing transcriptions itself on user device); 
selecting from among a multi-word vocabulary of the pretrained word embedding model a plurality of candidate words for replacing the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score),
the selecting based on similarity values determined for each vocabulary word (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score), 
each similarity value indicating a closeness of a corresponding vocabulary word to the mistranscription (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score);
and modifying the textual rendering by replacing the mistranscription with a candidate word (Weil [0039] This allows media messaging application 104 to receive and present text translations for the first portions of the audio data in near real time while also allowing dictation service to use the context provided by the speech audio data added in subsequent portions of the audio data to correct the speech to text translations. Thus, a text translation initially presented by media messaging application 104 may be adjusted or changed after additional speech data is received and processed)
that, based on a comparison of average semantic similarity values of each candidate word in relation to each word contained in the augmented rendering, is more similar to the mistranscription than is each of the other candidate words (Weil [0066] In some implementations, each token 410, 420, 430 generated by dictation service 110 can include word candidates 412, 422 and/or 432, respectively. For example, word candidates 412 can include a collection of word-confidence score pairs. When dictation service 110 translates the speech from the audio data received from media messaging application 104 into text, dictation service 110 can determine the most likely words (candidate words) that match a detected word in the audio data. Dictation service 110 can store the candidate words in the token (e.g., token 410) along with the respective confidence scores (e.g., probabilities) for each candidate word. The pairings of candidate words and confidence scores for token 410 can be stored in word candidates 412. When media messaging application 104 presents token 410 during playback of a clip, media messaging application 104 can present the word in word candidates 412 that has the highest confidence score).
Weil teaches correcting a transcript of an audio portion of a video clip, however Weil does not teach
generating an augmented rendering by “combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video”;
using a pretrained word embedding model that creates machine-encoded data structures based on mapping words to an embedding space “derived from the contextualizing data.”
Wilder teaches
combining the textual rendering with contextualizing data electronically garnered from one or more sources other than the audio portion of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata);

derived from the contextualizing data (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata).
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil further in view of Wilder to allow for extracting other data from the video, in addition to the audio data, such as OCR data, object recognition data, etc. Doing so would allow for incorporating the additional other data into the interpretation and transcription of the audio portion of the video.

Regarding claim 15, Weil in view of Wilder teaches the computer program product of claim 14.
Weil does not specifically teach, however Wilder further teaches
further comprising generating the electronically garnered contextualizing data by performing at least one of: 
extracting metadata from the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

generating machine-encoded text based on optical character recognition of one or more frames of the video (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

tagging one or more objects recognized in one or more frames of the video based on classifying the objects using a machine learning classification model (Wilder [0020] In accordance with an exemplary embodiment of the claimed invention, the aforesaid method performs at least one of the following: an optical character recognition (OCR) analysis on the time-aligned video frames by the server processor to extract time-aligned OCR metadata; a facial recognition analysis on the time-aligned video frames by the server processor to extract time-aligned facial recognition metadata; and an object recognition analysis on the time-aligned video frames by the server processor to extract time-aligned object recognition metadata); 

or summarizing textual renderings of audio portions of other videos previously captured from one or more channels.
Wilder is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil further in view of Wilder to allow for extracting other data from the video, in addition to the audio data, such as OCR data, object recognition data, etc. Doing so would allow for incorporating the additional other data into the interpretation and transcription of the audio portion of the video.

Regarding claim 20, Weil in view of Wilder teaches the computer program product of claim 14.
Weil further teaches
wherein the modifying is performed in real time during a real-time rendering of the video over a channel (Weil [0095] In some implementations, GUI 1000 can include graphical element 1020 to enable or disable automatic titling for the selected clip. As described above, in near real time while recording audio data, media messaging application 104 can transcribe speech in the audio data stream into transcription data (e.g., text). The transcription data can be presented overlaid on the video data presented in area 1004 in near real time while recording audio data and/or video data. The user can invoke a graphical user interface (e.g., GUI 1100) to enable and/or disable transcription (e.g., titling, captioning, etc.) and/or select a titling style for presenting transcription data by selecting graphical element).

Claims 3, 10 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Weil in view of Wilder in further view of Prorock et al. (US Patent Pub. No. 2011/0112832), hereinafter Prorock.

Regarding claim 3, Weil in view of Wilder teaches the method of claim 1.
Weil in view of Wilder does not specifically teach, however Prorock teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering that is not contained in the multi-word vocabulary of the pretrained word embedding model (Prorock [0078] Using image context recognition: in this example assume that there is an unknown (or misspelled) word in a transcript. Because of the synchronization capability of all media resources, the corresponding video image for the same point in time is analyzed. Then the image is analyzed and a "toaster" is identified. A decision is made to determine if the unidentified (or misspelled) word in the transcript is the word "toaster". Accordingly, object recognition techniques can be used to provide information for correction of errors. The object recognition may be applied to media information available in files distinct from the file containing erroneous information being corrected).
Prorock is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Prorock to allow for handling an unidentified word in the transcript. Doing so would allow for replacing the unidentified word with a correct word.
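
For illustration only, the limitation of identifying a word that is not contained in the multi-word vocabulary can be sketched as a set-membership check. This is a minimal sketch and is not drawn from Prorock or from the application; the function name `find_out_of_vocabulary` and the sample word lists are hypothetical, and the vocabulary is assumed to be available as a plain list of strings.

```python
def find_out_of_vocabulary(transcript_words, vocabulary):
    """Return (index, word) pairs for transcript words missing from the vocabulary.

    Assumption: matching is case-insensitive and the vocabulary is a plain
    collection of strings standing in for a word embedding model's vocabulary.
    """
    vocab = {w.lower() for w in vocabulary}
    return [(i, w) for i, w in enumerate(transcript_words) if w.lower() not in vocab]


# Hypothetical data for illustration only.
vocabulary = ["put", "the", "bread", "in", "toaster"]
transcript = ["put", "the", "bread", "in", "the", "toster"]

flagged = find_out_of_vocabulary(transcript, vocabulary)
print(flagged)
```

Under these assumptions, only the final word "toster" is flagged as a possible mistranscription, because every other transcript word appears in the vocabulary.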

Regarding claim 10, Weil in view of Wilder teaches the system of claim 8.
Weil in view of Wilder does not specifically teach, however Prorock teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering that is not contained in the multi-word vocabulary of the pretrained word embedding model (Prorock [0078] Using image context recognition: in this example assume that there is an unknown (or misspelled) word in a transcript. Because of the synchronization capability of all media resources, the corresponding video image for the same point in time is analyzed. Then the image is analyzed and a "toaster" is identified. A decision is made to determine if the unidentified (or misspelled) word in the transcript is the word "toaster". Accordingly, object recognition techniques can be used to provide information for correction of errors. The object recognition may be applied to media information available in files distinct from the file containing erroneous information being corrected).
Prorock is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Prorock to allow for handling an unidentified word in the transcript. Doing so would allow for replacing the unidentified word with a correct word.

Regarding claim 16, Weil in view of Wilder teaches the computer program product of claim 14.
Weil in view of Wilder does not specifically teach, however Prorock teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering that is not contained in the multi-word vocabulary of the pretrained word embedding model (Prorock [0078] Using image context recognition: in this example assume that there is an unknown (or misspelled) word in a transcript. Because of the synchronization capability of all media resources, the corresponding video image for the same point in time is analyzed. Then the image is analyzed and a "toaster" is identified. A decision is made to determine if the unidentified (or misspelled) word in the transcript is the word "toaster". Accordingly, object recognition techniques can be used to provide information for correction of errors. The object recognition may be applied to media information available in files distinct from the file containing erroneous information being corrected).
Prorock is considered to be analogous to the claimed invention because it is in the same field of transcribing audio data from a video. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Prorock to allow for handling an unidentified word in the transcript. Doing so would allow for replacing the unidentified word with a correct word.

Claims 4, 11 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Weil in view of Wilder in further view of Zhang et al. (US Patent Pub. No. 2021/0192147), hereinafter Zhang.

Regarding claim 4, Weil in view of Wilder teaches the method of claim 1.
Weil in view of Wilder does not specifically teach, however Zhang teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering having an average similarity distance from each word in the multi-word vocabulary of the pretrained word embedding model greater than a predetermined level (Zhang [0082] The polysemy translation method according to the embodiment of the present disclosure determines the word vectors of each interpretation for each candidate word, and combines the interpretations of the corresponding candidate word according to the similarity distance between the word vectors of the interpretations. Thus, by combining the interpretations of the polysemy, the rate of text translation can be increased; [0095] In another possible implementation, the identifying module 420 is further configured to: identify the polysemy from the source language text according to a polysemy library; in which the polysemy library is determined according to a polysemy probability of each word, the polysemy probability of the polysemy is greater than a set threshold; and the polysemy probability comprises a probability P(e|Ti) that a word e is translated to each interpretation Ti, and a probability P(Ti|e) that each interpretation Ti is used as a translation of the word e, where i is a serial number of interpretations of the polysemy, which is a natural number ranging from 1 to n, and n is a total number of the interpretations of the polysemy).
Zhang is considered to be analogous to the claimed invention because it is in the same field of translating using natural language processing technologies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Zhang to allow for translating according to the similarity distance. Doing so would allow for increasing accuracy when translating a text.
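
For illustration only, the limitation of identifying a word whose average similarity distance from each vocabulary word exceeds a predetermined level can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Zhang or from the application: cosine distance (one minus cosine similarity) is assumed as the distance measure, and the two-dimensional vectors, the threshold of 0.6, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_distance(word, embeddings):
    """Mean cosine distance from `word` to every other vocabulary word."""
    others = [w for w in embeddings if w != word]
    if not others:
        return 0.0
    total = sum(1.0 - cosine_similarity(embeddings[word], embeddings[w]) for w in others)
    return total / len(others)


def flag_outliers(transcript_words, embeddings, threshold):
    """Return transcript words whose average distance exceeds the threshold."""
    return [
        w for w in transcript_words
        if w in embeddings and average_distance(w, embeddings) > threshold
    ]


# Hypothetical toy embeddings; a real model would have many more dimensions.
embeddings = {
    "bread": (1.0, 0.1),
    "toast": (0.9, 0.2),
    "butter": (0.8, 0.3),
    "rocket": (-0.2, 1.0),
}

flagged = flag_outliers(["bread", "toast", "rocket"], embeddings, threshold=0.6)
print(flagged)
```

With these toy vectors, only "rocket" lies far from the rest of the vocabulary on average, so it is the only word flagged.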

Regarding claim 11, Weil in view of Wilder teaches the system of claim 8.
Weil in view of Wilder does not specifically teach, however Zhang teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering having an average similarity distance from each word in the multi-word vocabulary of the pretrained word embedding model greater than a predetermined level (Zhang [0082] The polysemy translation method according to the embodiment of the present disclosure determines the word vectors of each interpretation for each candidate word, and combines the interpretations of the corresponding candidate word according to the similarity distance between the word vectors of the interpretations. Thus, by combining the interpretations of the polysemy, the rate of text translation can be increased; [0095] In another possible implementation, the identifying module 420 is further configured to: identify the polysemy from the source language text according to a polysemy library; in which the polysemy library is determined according to a polysemy probability of each word, the polysemy probability of the polysemy is greater than a set threshold; and the polysemy probability comprises a probability P(e|Ti) that a word e is translated to each interpretation Ti, and a probability P(Ti|e) that each interpretation Ti is used as a translation of the word e, where i is a serial number of interpretations of the polysemy, which is a natural number ranging from 1 to n, and n is a total number of the interpretations of the polysemy).
Zhang is considered to be analogous to the claimed invention because it is in the same field of translating using natural language processing technologies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Zhang to allow for translating according to the similarity distance. Doing so would allow for increasing accuracy when translating a text.

Regarding claim 17, Weil in view of Wilder teaches the computer program product of claim 14.
Weil in view of Wilder does not specifically teach, however Zhang teaches
wherein the identifying a mistranscription comprises 
identifying a word in the textual rendering having an average similarity distance from each word in the multi-word vocabulary of the pretrained word embedding model greater than a predetermined level (Zhang [0082] The polysemy translation method according to the embodiment of the present disclosure determines the word vectors of each interpretation for each candidate word, and combines the interpretations of the corresponding candidate word according to the similarity distance between the word vectors of the interpretations. Thus, by combining the interpretations of the polysemy, the rate of text translation can be increased; [0095] In another possible implementation, the identifying module 420 is further configured to: identify the polysemy from the source language text according to a polysemy library; in which the polysemy library is determined according to a polysemy probability of each word, the polysemy probability of the polysemy is greater than a set threshold; and the polysemy probability comprises a probability P(e|Ti) that a word e is translated to each interpretation Ti, and a probability P(Ti|e) that each interpretation Ti is used as a translation of the word e, where i is a serial number of interpretations of the polysemy, which is a natural number ranging from 1 to n, and n is a total number of the interpretations of the polysemy).
Zhang is considered to be analogous to the claimed invention because it is in the same field of translating using natural language processing technologies. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Zhang to allow for translating according to the similarity distance. Doing so would allow for increasing accuracy when translating a text.

Claims 5, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Weil in view of Wilder in further view of Bui et al. (US Patent Pub. No. 2018/0373952), hereinafter Bui.

Regarding claim 5, Weil in view of Wilder teaches the method of claim 1.
Weil in view of Wilder does not specifically teach, however Bui teaches
wherein the selecting is based on a Levenshtein distance (Bui [0070] In the resequencing validation stage of the operation of ERA 150, a subset of the document corpus 106 may be manually annotated by a user employing annotation module 164. Annotation module may include a visualization tool and/or a user interface (UI). The annotations may be ground-truth reading order annotations that are to be used in training the various language models. The validation module 164 may compare the ground-truth reading orders generated via the user annotations and the reading orders generated by classifier 118. For instance, a distance metric, such as but not limited to a levenshtein distance metric, may be determined to generate a validation score. The validation score may be employed as feedback to the language model generation module 116 during the training of the one or more language models. Furthermore, the validation score may be used to evaluate and update language models during normal run-time use of ERA 150 as new and different documents are added to document corpus 106. Thus, various language models may be tuned and/or updated based on the various documents).
Bui is considered to be analogous to the claimed invention because it is in the same field of processing text segments based on trained natural language models. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Bui to allow for using the Levenshtein distance in the processing. Doing so would allow for increasing accuracy when processing a text.
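
For illustration only, a Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) can be computed with the standard dynamic-programming recurrence, and candidate words can then be ranked by that distance. This is a minimal sketch and is not drawn from Bui or from the application; the helper `rank_candidates`, its `limit` parameter, and the sample words are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b using two rolling rows."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(mistranscription, vocabulary, limit=3):
    """Return up to `limit` vocabulary words closest to the mistranscription.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ranked = sorted(vocabulary, key=lambda w: (levenshtein(mistranscription, w), w))
    return ranked[:limit]


# Hypothetical data for illustration only.
candidates = rank_candidates("toster", ["bread", "butter", "toaster"], limit=2)
print(candidates)
```

In this example "toaster" ranks first because a single insertion converts "toster" into it, while the other words require more edits.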

Regarding claim 12, Weil in view of Wilder teaches the system of claim 8.
Weil in view of Wilder does not specifically teach, however Bui teaches
wherein the selecting is based on a Levenshtein distance (Bui [0070] In the resequencing validation stage of the operation of ERA 150, a subset of the document corpus 106 may be manually annotated by a user employing annotation module 164. Annotation module may include a visualization tool and/or a user interface (UI). The annotations may be ground-truth reading order annotations that are to be used in training the various language models. The validation module 164 may compare the ground-truth reading orders generated via the user annotations and the reading orders generated by classifier 118. For instance, a distance metric, such as but not limited to a levenshtein distance metric, may be determined to generate a validation score. The validation score may be employed as feedback to the language model generation module 116 during the training of the one or more language models. Furthermore, the validation score may be used to evaluate and update language models during normal run-time use of ERA 150 as new and different documents are added to document corpus 106. Thus, various language models may be tuned and/or updated based on the various documents).
Bui is considered to be analogous to the claimed invention because it is in the same field of processing text segments based on trained natural language models. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Bui to allow for using the Levenshtein distance in the processing. Doing so would allow for increasing accuracy when processing a text.

Regarding claim 18, Weil in view of Wilder teaches the computer program product of claim 14.
Weil in view of Wilder does not specifically teach, however Bui teaches
wherein the selecting is based on a Levenshtein distance (Bui [0070] In the resequencing validation stage of the operation of ERA 150, a subset of the document corpus 106 may be manually annotated by a user employing annotation module 164. Annotation module may include a visualization tool and/or a user interface (UI). The annotations may be ground-truth reading order annotations that are to be used in training the various language models. The validation module 164 may compare the ground-truth reading orders generated via the user annotations and the reading orders generated by classifier 118. For instance, a distance metric, such as but not limited to a levenshtein distance metric, may be determined to generate a validation score. The validation score may be employed as feedback to the language model generation module 116 during the training of the one or more language models. Furthermore, the validation score may be used to evaluate and update language models during normal run-time use of ERA 150 as new and different documents are added to document corpus 106. Thus, various language models may be tuned and/or updated based on the various documents).
Bui is considered to be analogous to the claimed invention because it is in the same field of processing text segments based on trained natural language models. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Bui to allow for using the Levenshtein distance in the processing. Doing so would allow for increasing accuracy when processing a text.

Claims 6, 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Weil in view of Wilder in further view of Jain et al. (US Patent Pub. No. 2020/0242563), hereinafter Jain.

Regarding claim 6, Weil in view of Wilder teaches the method of claim 1.
Weil in view of Wilder does not specifically teach, however Jain teaches
wherein each average semantic similarity value is an average cosine similarity between a candidate word and each word of the augmented rendering (Jain [0062] In this technique the word vectors of multiple word skills (for e.g. adobe flash player) are combined. Further, the vectors of constituent words are averaged and cosine similarity over averaged vectors is computed).
Jain is considered to be analogous to the claimed invention because it is in the same field of making a determination based on examining textual data. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Jain to allow for using the average cosine similarity in the determination process. Doing so would allow for increasing accuracy of recommendations.
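
For illustration only, an average cosine similarity between a candidate word and each word of an augmented rendering can be sketched as follows. This is a minimal sketch that follows the claim wording (a mean of per-word cosine similarities); it is not drawn from Jain or from the application, and the toy two-dimensional vectors, the context words, and all function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_cosine_similarity(candidate, context_words, embeddings):
    """Mean cosine similarity between the candidate and each embedded context word."""
    sims = [
        cosine_similarity(embeddings[candidate], embeddings[w])
        for w in context_words
        if w in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


def best_candidate(candidates, context_words, embeddings):
    """Pick the candidate with the highest average similarity to the context."""
    return max(
        candidates,
        key=lambda c: average_cosine_similarity(c, context_words, embeddings),
    )


# Hypothetical toy embeddings; a real model would have many more dimensions.
embeddings = {
    "toaster": (0.9, 0.1),
    "roster": (0.1, 0.9),
    "bread": (1.0, 0.2),
    "kitchen": (0.8, 0.3),
}

# Stand-in for the words of an augmented rendering (transcript plus context data).
augmented_rendering = ["bread", "kitchen"]

chosen = best_candidate(["toaster", "roster"], augmented_rendering, embeddings)
print(chosen)
```

With these toy vectors, "toaster" is selected over "roster" because its vector points in nearly the same direction as the context words "bread" and "kitchen".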

Regarding claim 13, Weil in view of Wilder teaches the system of claim 8.
Weil in view of Wilder does not specifically teach, however Jain teaches
wherein each average semantic similarity value is an average cosine similarity between a candidate word and each word of the augmented rendering (Jain [0062] In this technique the word vectors of multiple word skills (for e.g. adobe flash player) are combined. Further, the vectors of constituent words are averaged and cosine similarity over averaged vectors is computed).
Jain is considered to be analogous to the claimed invention because it is in the same field of making a determination based on examining textual data. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Weil in view of Wilder further in view of Jain to allow for using the average cosine similarity in the determination process. Doing so would allow for increasing accuracy of recommendations.

Regarding claim 19, Weil in view of Wilder teaches the computer program product of claim 14.
Weil in view of Wilder does not specifically teach; however, Jain teaches
wherein each average semantic similarity value is an average cosine similarity between a candidate word and each word of the augmented rendering (Jain [0062] In this technique the word vectors of multiple word skills (for e.g. adobe flash player) are combined. Further, the vectors of constituent words are averaged and cosine similarity over averaged vectors is computed).
Jain is considered to be analogous to the claimed invention because it is in the same field of making a determination based on examining textual data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Weil and Wilder with the teachings of Jain to use the average cosine similarity in the determination process. Doing so would increase the accuracy of recommendations.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL J MUELLER whose telephone number is (571)272-1875. The examiner can normally be reached M-F 7:30am-5:30pm (Eastern).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAUL J MUELLER/Examiner, Art Unit 2657
/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657