DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 06/03/2021 has been entered.

Preliminary Remarks
This is a reply to the Request for Continued Examination (RCE) filed on 06/03/2021, in which claims 21, 31, and 40 have been amended. Claims 21-40 remain pending in the present application, with claims 21, 31, and 40 being independent claims.
When making claim amendments, the applicant is encouraged to consider the references in their entireties, including those portions that have not been cited by the examiner and their equivalents as they may most broadly and appropriately apply to any particular anticipated claim amendments.

Response to Arguments
Applicant's arguments with respect to claims 21, 31, and 40 have been considered but are not persuasive.
On Pages 11-12, Applicant argues that, “Applicant respectfully submits that there is no teaching in Fontana, Thong, Boguraev and Divay, nor any other reference of record, regarding 
In response, Examiner respectfully disagrees. The Time Event Tracker Module disclosed in Thong receives a time-stamped audio and records the time the words were typed in to produce a time aligned transcription (see Thong, paragraph [0041]: “the time event tracker 23 receives the time-stamped audio 21 and records the time the words were typed in by the operator 53. This provides a rough time alignment of the corresponding text 25 that will be precisely realigned by the next module 29. The recorded time events are mapped back to the 
On Page 12, Applicant argues that, “there is no motivation in any of the references to modify them to change the functionality of what is taught by the references into what is taught and claimed in the instant application. Nor is there any teaching or suggestion or motivation to combine the references to achieve what is taught in the instant application, and using the instant application as a roadmap to combine the art is impermissible hindsight reconstruction.”
In response, Examiner respectfully disagrees. The motivation to combine Thong is to ensure that the system has the ability to use the speech recognition techniques at the phoneme level disclosed in Thong to produce time-stamped transcription by analyzing the audio content associated with the multimedia content which is processed by Fontana’s audio processing module; to use the audio classifier disclosed in Thong to determine and separate the audio portions containing spoken words and the audio portions containing non-speech sounds such as silence or pause in order to capture non-speech audio; and to use the segmenter disclosed in Thong to analyze time aligned audio, and in particular to read time stamps from one word to another in time aligned audio and to detect pauses acoustically, thereby providing punctuation, capitalization and other sentence formatting by using timing information in the raw speech transcripts disclosed in Boguraev in order to increase accuracy and efficiency in search engine 

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

Claims 21, 24-27, 30-31, 34-37, and 40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claim 1 of U.S. Patent No. 10,332,506 B2 (hereinafter “’2506”) in view of Thong et al. (US 20020010916 A1, hereinafter Thong).
Regarding claim 21 of this application, 
Claim 21 of the application:
A method comprising:
analyzing, via a computing device, a video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data;
determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription;
automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks;
determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and
storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription.
Claim 1 of ‘2506:
A method comprising:
identifying, via a computing device, a video file;
analyzing, via the computing device, the video file to identify audio data associated with the video file, said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file;
determining, via the computing device, a phoneme-level transcription from the audio data, said determination comprising by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text;
determining, via the computing device, a timestamp for each word in the text of the phoneme-level transcription, said timestamp indicating a time each word appears in the phoneme-level transcription;
determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps;
automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription;
determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing, based on said punctuation, said character set in the punctuated time-aligned transcription;
storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription;
determining, via the computing device, a topic shift among said text of the modified time-aligned transcript based on an applied hyponymy algorithm executed by the computing device;
determining, via the computing device, a location for a paragraph break within the text of the modified time-aligned transcript based on said determined topic shift;
inserting, via the computing device, a paragraph break in said modified time-aligned transcript at said determined location; and
updating, via the computing device, said stored modified time-aligned transcript based on said insertion of the paragraph break.

Regarding claim 21 of this application, claim 1 of '2506 discloses all the subject matter of the claimed invention with the exceptions of the phoneme-level transcription representing  from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks.
Thong from the same or similar fields of endeavor discloses the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data (see Thong, paragraph [0046]: “audio classifier 15 determines and separates the audio portions containing spoken words and the audio portions containing non-speech sounds needing transcribing” and paragraph [0048]: “additional sound or general filler models can be added to the phoneme models in order to capture non-speech audio”); 
said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription (see Thong, paragraph [0059]: “The resulting output text 31 is thus a sequence of characters with time stamps indicating time occurrence relative to the time scale of the original audio 13”) and their associated timestamps against an acoustic model (see Thong, paragraph [0047]: “This approach is known in the literature as “audio classification”. Numerous techniques may be used. For instance, a HMM (Hidden Markov Model) or neural net system may be trained to recognize broad classes of audio including silence, music, particular sounds, and spoken words”) that has a timing scheme 
the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks (see Thong, paragraphs [0067]-[0070]: “segmenter 33 analyzes time aligned audio 31 and in particular reads time stamps from one word to another in time aligned audio 31. The difference between the time stamp at the end of one word and the time stamp at the beginning of an immediately following word is the amount of time between the two words. That is, that difference measures the length of time of the pause between the two words. If the pause is greater than a predefined suitable threshold (e.g., one second), then segmenter 33 indicates or otherwise records this pair of words as defining a possible break point (between the two words) for captioning purposes… segmenter 33 detects pauses acoustically… segmenter 33 forms groups or units of words between such pauses/ends of sentences. These word groupings are effectively sentences and step 109 thus provides punctuation, capitalization and other sentence formatting and visual structuring”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Thong with the teachings as in claim 1 of '2506. The motivation for doing so would ensure the system to have  from the phoneme-level transcription in order to increase accuracy and efficiency in search engine optimization by utilizing the formatted and readable transcript.
Regarding claim 24 of this application, 
Claim 24 of the application:
The method of claim 21, wherein said capitalizing further comprises:
applying a language model to said punctuated time-aligned transcription, wherein said determined character set is further based on the applied language model.
Claim 5 of ‘2506:
The method of claim 1, wherein said capitalizing further comprises:
applying a language model to said punctuated time-aligned transcription, wherein said determined character set is further based on the applied language model.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 24 of the instant application is recited in claim 5 of ‘2506.
Regarding claim 25 of this application, 
Claim 25 of the application:
The method of claim 21, wherein said video file comprises video data and said audio data, wherein said audio data is extracted from said video file.
Claim 6 of ‘2506:
The method of claim 1, wherein said video file comprises video data and said audio data, wherein said audio data is extracted from said video file.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 25 of the instant application is recited in claim 6 of ‘2506.
Regarding claim 26 of this application, 
Claim 26 of the application:
The method of claim 21, wherein said audio data is stored as an audio file in association with said video file in said database, wherein said method further comprises:
identifying said audio file in said database based on information associated with said video file.
Claim 7 of ‘2506:
The method of claim 1, wherein said audio data is stored as an audio file in association with said video file in said database, wherein said identification of the audio data comprises identifying the audio file based on the identification of the video file.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 26 of the instant application is recited in claim 7 of ‘2506.
Regarding claim 27 of this application, 
Claim 27 of the application:
The method of claim 21, further comprising:
determining a set of words from the text of the phoneme-level transcription;
comparing each word from the set to a dictionary of terms; and
confirming each word upon said comparison satisfying a similarity threshold.
Claim 8 of ‘2506:
The method of claim 1, further comprising:
determining a set of words from the text of the phoneme-level transcription;
comparing each word from the set to a dictionary of terms;
confirming each word upon said comparison satisfying a similarity threshold.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 27 of the instant application is recited in claim 8 of ‘2506.
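For illustration only, the dictionary comparison recited in claim 27 can be sketched as follows. Neither claim specifies a similarity measure, so the use of Python's difflib.SequenceMatcher ratio, the function name confirm_words, and the 0.8 threshold are all assumptions made for this sketch and are not drawn from the application, from ‘2506, or from any cited reference.

```python
from difflib import SequenceMatcher


def confirm_words(words, dictionary, threshold=0.8):
    """Return the words whose best match in `dictionary` meets `threshold`.

    Hypothetical helper: the similarity measure (SequenceMatcher ratio)
    and the default threshold are illustrative assumptions.
    """
    confirmed = []
    for word in words:
        # Best similarity score of this word against any dictionary term.
        best = max(
            (SequenceMatcher(None, word.lower(), term.lower()).ratio()
             for term in dictionary),
            default=0.0,  # an empty dictionary confirms nothing
        )
        if best >= threshold:
            confirmed.append(word)
    return confirmed


if __name__ == "__main__":
    terms = ["video", "transcript", "audio"]
    print(confirm_words(["video", "transcrpt", "xyzzy"], terms))
```

A word is confirmed when its closest dictionary term scores at or above the threshold; a word sharing no characters with any term scores zero and is not confirmed.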
Regarding claim 30 of this application, 
Claim 30 of the application:
The method of claim 21, further comprising:
receiving a request for the video file;
determining a context of the video file based on the modified time-aligned transcript associated with the video file;
causing communication, over the network, of said context to a third party content platform to obtain a digital content item associated with said context; and
communicating said identified digital content item in association with said communication of said video file.
Claim 11 of ‘2506:
The method of claim 1, further comprising:
receiving a request for the video file;
determining a context of the video file based on the modified time-aligned transcript associated with the video file;
causing communication, over the network, of said context to an advertisement platform to obtain an advertisement associated with said context; and
communicating said identified advertisement in association with said communication of said video file.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 30 of the instant application is recited in claim 11 of ‘2506.
Claim 31 is rejected for the same reasons as discussed in claim 21 above. In addition, the combination teachings of claim 1 of '2506 and Thong as discussed above also disclose a non-transitory computer-readable storage medium (see Thong, paragraph [0038]: “It is 
The motivation to combine claim 1 of ‘2506 and Thong has been discussed in claim 21 above.
Claim 34 is rejected for the same reasons as discussed in claim 24 above.
Claim 35 is rejected for the same reasons as discussed in claim 25 above.
Claim 36 is rejected for the same reasons as discussed in claim 26 above.
Claim 37 is rejected for the same reasons as discussed in claim 27 above.
Claim 40 is rejected for the same reasons as discussed in claim 21 above. In addition, the combination teachings of claim 1 of '2506 and Thong as discussed above also disclose a processor (see Thong, paragraph [0038]: “digital processor”); and 
a non-transitory computer-readable storage medium (see Thong, paragraph [0038]: “It is understood that these steps/modules are performed by a digital processor in a computer system having appropriate working memory, cache storage and the like …”).
The motivation to combine claim 1 of ‘2506 and Thong has been discussed in claim 21 above.
Claims 22-23 and 32-33 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 4-8, and 11 of U.S. Patent No. 10,332,506 B2 (hereinafter “’2506”) in view of Thong, and further in view of Holzrichter (US 6006175 A1, hereinafter Holzrichter).
Regarding claim 22 of this application, 
Claim 22 of the application:
The method of claim 21, wherein said inserting punctuation further comprises:
parsing the time-aligned transcription and identifying a feature indicating a space between said text characters, said space associated with a natural language pause between words of said speech as indicated by said non-audible content and said mapping between the non-audible content and the audible content; and
inserting a punctuation mark in said time-aligned transcription based on said identified feature.
Claim 4 of ‘2506:
parsing the time-aligned transcription and identifying a feature indicating a space between said text characters, said space associated with a natural language pause between words of said speech; and
inserting a punctuation mark in said time-aligned transcription based on said identified feature.

Regarding claim 22 of this application, claim 1 of '2506 and Thong as discussed above disclose all the subject matter of the claimed invention with the exception of the limitation "as indicated by said non-audible content and said mapping between the non-audible content and the audible content."
Holzrichter from the same or similar fields of endeavor discloses indicated by said non-audible content and said mapping between the non-audible content and the audible content (see Holzrichter, Column 15, lines 63-65: “FIG. 19 is a flow chart of an algorithmic procedure for start of speech, end of speech, identification of voiced or unvoiced phoneme, presence of pause, and extraneous noise presence”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Holzrichter with the teachings as in claim 1 of '2506 and Thong. The motivation for doing so would be to ensure that the system has the ability to use the non-acoustic speech characterization and recognition disclosed in Holzrichter to identify voiced phonemes, unvoiced phonemes, and the presence of pauses in between, thus determining the phoneme-level transcription representing audible content and non-audible content and a sequential relationship between each in order to determine a time that a word and a non-word appears in the phoneme-level transcription.
Regarding claim 23 of this application, the combination teachings of claim 1 of '2506, Thong, and Holzrichter as discussed above also disclose the method of claim 22, further comprising: 
analyzing said feature, and based on said analysis, determining a dimensional value of the feature (see Holzrichter, Column 54, lines 21-30: “The method includes using one or more of 
determining a type of said punctuation mark, wherein said inserted punctuation mark is based on said type (see Holzrichter, Column 55, lines 44-51: “The method can automatically generate such multi-word vectors of known multi-word sounds for the purpose of defining, through training, libraries of known multi-word feature vectors, and automatically parse the multi-word vectors by phoneme units (including the silence phoneme) into units defined by prosody constraints, e.g. prosody constraints associated with punctuation marks or associated with pauses in thought by the speaker”).
The motivation for combining the references has been discussed in claim 21 above.
Claim 32 is rejected for the same reasons as discussed in claim 22 above.
Claim 33 is rejected for the same reasons as discussed in claim 23 above.
Claims 28, 29, 38, and 39 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 9 and 10 of U.S. Patent No. 10,332,506 B2 (hereinafter “’2506”) in view of Thong, and further in view of Fontana et al. (US 20120078712 A1, hereinafter Fontana).
Regarding claim 28 of this application, 
Claim 28 of the application:
The method of claim 21, further comprising:
receiving a search request for a video file; and
identifying, based on the search request, said video file.
Claim 9 of ‘2506:
The method of claim 1, wherein said identification of the video file is based on a search request.

Regarding claim 28, the combination teachings of claim 1 of '2506 and Thong as discussed above disclose all the subject matter of the claimed invention with the exceptions of receiving a search request for a video file.
Fontana from the same or similar fields of endeavor discloses receiving a search request for a video file (see Fontana, paragraph [0069]: “The request handler 414 can also receive search queries relating to the metadata stored in the grid 408, for example from content 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Fontana with the teachings as in claim 1 of '2506 and Thong. The motivation for doing so would be to ensure that the system has the ability to use the content view request handler of the multimedia content processing and distribution system disclosed in Fontana to receive a search request to retrieve multimedia content stored in the grid's data storage, thus receiving a search request for a video file in order to analyze the stored video information.
Regarding claim 29 of this application, 
Claim 29 of the application:
The method of claim 28, further comprising:
performing a search for said video file by analyzing modified time-aligned transcripts of video files in the database.
Claim 10 of ‘2506:
The method of claim 9, further comprising:
performing a search for said video file by analyzing modified time-aligned transcripts of video files in the database.

It should be noted that the table above distinguishes the equivalent limitations between the instant application and that of ‘2506. In conclusion, claim 29 of the instant application is recited in claim 10 of ‘2506.
Claim 38 is rejected for the same reasons as discussed in claim 28 above.
Claim 39 is rejected for the same reasons as discussed in claim 29 above.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
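The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
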
Claims 21, 24-26, 28-31, 34-36, and 38-40 are rejected under 35 U.S.C. 103 as being unpatentable over Fontana et al. (US 20120078712 A1, hereinafter Fontana), in view of Thong et al. (US 20020010916 A1, hereinafter Thong), and further in view of Boguraev et al. (US 20020178002 A1, hereinafter Boguraev). 
Regarding claim 21, Fontana discloses a method comprising: 
analyzing, via a computing device, a video file to identify audio data associated with the video file (see Fontana, paragraph [0091]: “The audio processing module 606 is configured to process audio content associated with the multimedia content”), said audio data comprising information associated with text corresponding to speech that is to be rendered contemporaneously with video data of the video file (see Fontana, paragraph [0010]: “upon notification of a request for playback of the multimedia content at the one or more computing systems, providing the text metadata and the object metadata associated with the container for synchronized use during playback of the multimedia content via the container”). 
Regarding claim 21, Fontana discloses all the claimed limitations with the exception of determining, via the computing device, a phoneme-level transcription from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text, the phoneme-level transcription representing audible content and non-audible content from the audio data and a mapping of the audible content and non-audible content from within the audio data, the non-audible content corresponding to a region of no speech within the audio data; determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription; determining, via the computing device, a time- from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks; determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription.
Thong from the same or similar fields of endeavor discloses determining, via the computing device (see Thong, paragraph [0038]: “It is understood that these steps/modules are performed by a digital processor in a computer system having appropriate working memory, cache storage and the like as made apparent by the functional details below”), a phoneme-level transcription (see Thong, paragraph [0040]: “uses speech recognition techniques at the phoneme level”) from the audio data by extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text (see Thong, paragraph [0040]: “The audio produced 21 is time-stamped since a time dependent transformation has been applied to the audio samples” and paragraph [0042]: “The fourth module 29 receives the roughly aligned text 27 and realigns precisely the text on the audio track 13 using speech recognition techniques at the word level”), the phoneme-level transcription representing audible content and 
determining, via the computing device, a timestamp for the audible and non-audible content in the phoneme-level transcription that indicates a time that a word and a non-word appears in the phoneme-level transcription (see Thong, paragraphs [0050]-[0051]: “speech units are typically phonemes… speech recognizer 41 analyzes the audio speech 17 which is a recorded speech stream and produces a count of speech units for a given unit of time … the speech rate control module 19 produces a time-stamped audio 21 transformation of audio speech”); 
determining, via the computing device, a time-aligned transcription of the audio data based on the phoneme-level transcription and associated timestamps (see Thong, paragraph [0063]: “the output 31 of the realigner module 29 (FIG. 3) is time-stamped text. This timing information is useful to the segmentation process since the length of pauses between words gives an indication of where sentence breaks might be”), said time-aligned transcription determination comprising comparing occurrences of words and non-words in the phoneme-level transcription (see Thong, paragraph [0059]: “The resulting output text 31 is thus a sequence of characters with time stamps indicating time occurrence relative to the time scale of the original audio 13”) and their associated timestamps against an acoustic model (see Thong, paragraph [0047]: “This approach is known in the literature as “audio classification”. Numerous techniques may be used. For instance, a HMM (Hidden Markov Model) or neural net system may be trained to recognize broad classes of audio including silence, music, particular sounds, and spoken words”) that has a timing scheme corresponding to a length of the video file (see Thong, 
automatically inserting, via the computing device, punctuation into the time-aligned transcription based on the text in the time-aligned transcription and the indicated mapping from the phoneme-level transcription, said punctuation based on information associated with the audible content, regions of speech indicated by the non-audible content and paragraphs breaks (see Thong, paragraphs [0067]-[0070]: “segmenter 33 analyzes time aligned audio 31 and in particular reads time stamps from one word to another in time aligned audio 31. The difference between the time stamp at the end of one word and the time stamp at the beginning of an immediately following word is the amount of time between the two words. That is, that difference measures the length of time of the pause between the two words. If the pause is greater than a predefined suitable threshold (e.g., one second), then segmenter 33 indicates or otherwise records this pair of words as defining a possible break point (between the two words) for captioning purposes… segmenter 33 detects pauses acoustically… segmenter 33 forms groups or units of words between such pauses/ends of sentences. These word groupings are effectively sentences and step 109 thus provides punctuation, capitalization and other sentence formatting and visual structuring”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Thong with the teachings as in Fontana. The motivation for doing so would ensure the system to have the ability to use the speech recognition techniques at the phoneme level disclosed in Thong to produce a time-stamped transcription by analyzing the audio speech, which is a recorded speech stream; to use the audio classifier disclosed in Thong to determine and separate the audio portions containing spoken words and the audio portions containing non-speech sounds in order to capture non-speech audio; to use the segmenter disclosed in Thong to analyze time-aligned audio and in particular to read time stamps from one word to another in the time-aligned audio; to detect pauses acoustically in order to provide punctuation, capitalization and other sentence formatting and visual structuring; to use the Time Event Tracker Module disclosed in Thong to receive a time-stamped audio and record the time the words were typed in to produce a time-aligned transcription; to also use the Time Event Tracker Module disclosed in Thong to link the time-stamped audio stream with the original audio or video recording; and to use the realigner disclosed in Thong to receive an original audio track and the roughly aligned text from the Time Event Tracker Module and, by comparing the time-aligned transcription with the original audio track, to output a new sequence of caption text with improved time alignments, wherein the resulting output text is a sequence of characters with time stamps indicating time occurrence relative to the time scale of the original audio, thus extracting the text from the audio data and compiling the phoneme-level transcription based on the extracted text; determining the phoneme-level transcription representing audible content and non-audible content; determining a time-aligned transcription of the audio data based on the phoneme-level transcription, wherein the time-aligned transcription determination comprises comparing occurrences of words and non-words in the phoneme-level transcription and their associated timestamps against an acoustic model that has a timing scheme corresponding to a length of the video file, such that each word and non-word and their associated timestamps are mapped and stored in  from the phoneme-level transcription in order to increase accuracy and efficiency in search engine optimization by utilizing the formatted and readable transcript.
Regarding claim 21, the combination teachings of Fontana and Thong as discussed above disclose all the subject matter of the claimed invention with the exceptions of determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription; and storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription.
Boguraev from the same or similar fields of endeavor discloses determining, via the computing device, a character set from the text of the punctuated time-aligned transcription based on said punctuation, and automatically capitalizing said character set in the punctuated time-aligned transcription (see Boguraev, paragraphs [0047]-[0053]: “Timing information in the raw speech transcripts is used… pauses of 1.2 seconds or more are replaced with a new paragraph, by adding a period, two blank lines and a capital letter to the next word… Whenever introductory words or phrases are found, a period and two spaces are inserted and the introductory word or first word of an introductory phrase is capitalized”); and 
storing, via the computing device, a modified time-aligned transcript in association with the video file in a database, said modified time-aligned transcript comprising the punctuated and capitalized time-aligned transcription (see Boguraev, paragraph [0071]: “Any medium known or developed that can store information suitable for use with a computer system may be used”).
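For illustration only, the pause rule quoted from Boguraev above (pauses of 1.2 seconds or more replaced with a period, two blank lines, and a capital letter on the next word) can be sketched as follows. This is a minimal sketch and not Boguraev's implementation: the function name, the `(word, pause_before)` token layout, and the sample data are hypothetical.

```python
from typing import List, Tuple


def apply_paragraph_breaks(
    tokens: List[Tuple[str, float]], min_pause: float = 1.2
) -> str:
    """Join words, starting a new paragraph after each long pause.

    tokens is a list of (word, pause_before) pairs, where pause_before is
    the silence in seconds preceding the word (hypothetical layout).
    A pause of min_pause or more becomes a period, two blank lines, and a
    capital letter on the next word; shorter pauses become a single space.
    """
    pieces = []
    for index, (word, pause_before) in enumerate(tokens):
        if index == 0:
            # The first word has no preceding word to separate from.
            pieces.append(word)
        elif pause_before >= min_pause:
            # Period, then a line break plus two blank lines, then capitalize.
            pieces.append(".\n\n\n" + word[:1].upper() + word[1:])
        else:
            pieces.append(" " + word)
    return "".join(pieces)


# Hypothetical sample: a 1.5 second pause precedes "now".
sample_tokens = [("thanks", 0.0), ("everyone", 0.2), ("now", 1.5), ("we", 0.1)]
formatted = apply_paragraph_breaks(sample_tokens)
```

The sketch covers only the pause rule; the handling of introductory words and phrases described in the same passage is not modeled.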
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Boguraev with the teachings as in Fontana and Thong. The motivation for doing so would ensure the system to 
Regarding claim 24, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, wherein said capitalizing further comprises: 
applying a language model to said punctuated time-aligned transcription, wherein said determined character set is further based on the applied language model (see Boguraev, paragraph [0053]: “In addition to the analysis of the raw speech transcripts provided by a speech engine, there are some English language cues that may be used to improve the recognition of sentence boundaries”).
The motivations for combining the references have been discussed in claim 21 above.
Regarding claim 25, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, wherein said video file comprises video data and said audio data, wherein said audio data is extracted from said video file (see Thong, paragraph [0079]: “Each word of the document is time marked to indicate the location of the word in the audio stream or video frame where the document is a video recording”).
The motivations for combining the references have been discussed in claim 21 above.
Regarding claim 26, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, wherein said audio data is stored as an 
identifying said audio file in said database based on information associated with said video file (see Fontana, paragraphs [0097]-[0098]: “the container 625 is configured to include identifying information capable of referencing the metadata generated describing the content… The metadata from the audio processing module 606 and video processing module 608 is passed to a database 626, which collects metadata and other information derived from the multimedia content”).
The motivations for combining the references have been discussed in claim 21 above.
Regarding claim 28, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, further comprising: 
receiving a search request for a video file (see Fontana, paragraph [0069]: “The request handler 414 can also receive search queries relating to the metadata stored in the grid 408, for example from content consumers seeking a particular piece of multimedia content, or seeking a list of pieces of multimedia content in which the search criteria is found”); and 
identifying, based on the search request, said video file (see Fontana, paragraph [0084]: “FIGS. 9-10 provide details regarding identification of objects within the multimedia content for identification, searching, playback and other multimedia enhancements”).
The motivations for combining the references have been discussed in claim 21 above.
Regarding claim 29, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 28, further comprising: 
performing a search for said video file by analyzing modified time-aligned transcripts of video files in the database (see Fontana, paragraph [0164]: “an audio separation operation 1128 strips, or extracts, the audio from the multimedia content. The audio information is then analyzed, in a speech to text conversion operation 1130, to convert audio information to text information”).

Regarding claim 30, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, further comprising: 
receiving a request for the video file (see Fontana, paragraph [0177]: “A request operation 1306 corresponds to receipt of a request for the multimedia content”); 
determining a context of the video file based on the modified time-aligned transcript associated with the video file (see Fontana, paragraph [0178]: “A metadata association operation 1308 corresponds to selection and association of a portion of the generated multimedia data with the content identified by the request”); 
causing communication, over the network, of said context to a third party content platform to obtain a digital content item associated with said context (see Fontana, paragraph [0185]: “Content providers and their advertisers can provide up-to-date information on products, specials or other items to the viewer of the content, and can tailor this information based on known user information. In certain embodiments, broadcast or multicast advertising can be associated with one or more of the videos to overlay dynamic content”); and 
communicating said identified digital content item in association with said communication of said video file (see Fontana, paragraph [0117]: “FIG. 7M illustrates example advertisement data 716 that can be used in association with multimedia content, to link one or more advertisements with multimedia content during playback”).
The motivations for combining the references have been discussed in claim 21 above.
Claim 31 is rejected for the same reasons as discussed in claim 21 above. In addition, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose a non-transitory computer-readable storage medium (see Fontana, paragraph [0076]: “electronic computing device 500 includes a non-volatile storage device 510. Non-volatile storage device 510 is a computer-readable data storage medium that is capable of storing data and/or instructions”).
Claim 34 is rejected for the same reasons as discussed in claim 24 above.
Claim 35 is rejected for the same reasons as discussed in claim 25 above.
Claim 36 is rejected for the same reasons as discussed in claim 26 above.
Claim 38 is rejected for the same reasons as discussed in claim 28 above.
Claim 39 is rejected for the same reasons as discussed in claim 29 above.
Claim 40 is rejected for the same reasons as discussed in claim 21 above. In addition, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose a processor (see Fontana, paragraph [0074]: “electronic computing device 500 comprises a processing unit”); and 
a non-transitory computer-readable storage medium (see Fontana, paragraph [0076]: “electronic computing device 500 includes a non-volatile storage device 510. Non-volatile storage device 510 is a computer-readable data storage medium that is capable of storing data and/or instructions”).
Claims 22, 23, 32, and 33 are rejected under 35 U.S.C. 103 as being unpatentable over Fontana, Thong, and Boguraev as applied to claim 21, and further in view of Holzrichter (US 6006175 A1, hereinafter Holzrichter).
Regarding claim 22, the combination teachings of Fontana, Thong, and Boguraev as discussed above also disclose the method of claim 21, wherein said inserting punctuation further comprises: 
parsing the time-aligned transcription and identifying a feature indicating a space between said text characters, said space associated with a natural language pause between words of said speech (see Boguraev, paragraph [0049]: “Some speech engines provide silence information as a series of "silence tokens," where each token was assigned a duration. Frequently, there would be several sequential silence tokens, presumably separated by non-speech sounds”); and 

The motivations for combining the references have been discussed in claim 21 above.
Regarding claim 22, the combination teachings of Fontana, Thong, and Boguraev as discussed above disclose all the claimed limitations with the exception of the following limitation: as indicated by said non-audible content and said mapping between the non-audible content and the audible content.
Holzrichter from the same or similar fields of endeavor discloses as indicated by said non-audible content and said mapping between the non-audible content and the audible content (see Holzrichter, Column 15, lines 63-65: “FIG. 19 is a flow chart of an algorithmic procedure for start of speech, end of speech, identification of voiced or unvoiced phoneme, presence of pause, and extraneous noise presence”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Holzrichter with the teachings as in Fontana, Thong, and Boguraev. The motivation for doing so would ensure the system to have the ability to use the non-acoustic speech characterization and recognition disclosed in Holzrichter to identify a voiced phoneme, an unvoiced phoneme, and the presence of a pause in between, thus determining the phoneme-level transcription representing audible content and non-audible content and a sequential relationship between each in order to determine a time that a word and a non-word appear in the phoneme-level transcription.
Regarding claim 23, the combination teachings of Fontana, Thong, Boguraev, and Holzrichter as discussed above also disclose the method of claim 22, further comprising: 
analyzing said feature, and based on said analysis, determining a dimensional value of the feature (see Holzrichter, Column 54, lines 21-30: “The method includes using one or more of 
determining a type of said punctuation mark, wherein said inserted punctuation mark is based on said type (see Holzrichter, Column 55, lines 44-51: “The method can automatically generate such multi-word vectors of known multi-word sounds for the purpose of defining, through training, libraries of known multi-word feature vectors, and automatically parse the multi-word vectors by phoneme units (including the silence phoneme) into units defined by prosody constraints, e.g. prosody constraints associated with punctuation marks or associated with pauses in thought by the speaker”).
The motivations for combining the references have been discussed in claim 22 above.
Claim 32 is rejected for the same reasons as discussed in claim 22 above.
Claim 33 is rejected for the same reasons as discussed in claim 23 above.
Claims 27 and 37 are rejected under 35 U.S.C. 103 as being unpatentable over Fontana, Thong, and Boguraev as applied to claim 21, and further in view of Ellozy et al. (US 5649060 A, hereinafter Ellozy).
Regarding claim 27, the combination teachings of Fontana, Thong, and Boguraev as discussed above disclose all the claimed limitations with the exceptions of the method of claim 21, further comprising: determining a set of words from the text of the phoneme-level transcription; comparing each word from the set to a dictionary of terms; and confirming each word upon said comparison satisfying a similarity threshold.
Ellozy from the same or similar fields of endeavor discloses the method of claim 21, further comprising: 
determining a set of words from the text of the phoneme-level transcription (see Ellozy, Column 7, lines 33-37: “The mapping block 44 receives as input data a transcript 24 and a decoded text 38 (FIG. 3). The transcript 24 goes to the block 201 where it is partitioned into 
comparing each word from the set to a dictionary of terms (see Ellozy, Column 7, lines 21-27: “The work of the decoder 34 is controlled by the segmentation block 32. The block 32 receives control parameters from the mapping block 44. These parameters include the size of the text, grammatical structures for the text, etc. These parameters are used to determine (1) the size of speech segment from 19 to be passed to the decoder 34, and (2) dictionary and grammar constraints”); and 
confirming each word upon said comparison satisfying a similarity threshold (see Ellozy, Column 9, lines 56-59: “For each word segment in the speech data, compare the score of the aligned word from the provided script Ti with the score of a decoded word in DTi for that speech segment. Insert or replace the script word with the decoded word if the difference satisfies a specified threshold”).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to utilize the teachings as in Ellozy with the teachings as in Fontana, Thong, and Boguraev. The motivation for doing so would ensure the system to have the ability to use Ellozy’s mapping block to receive a transcript and a decoded text as input data and to partition the transcript; to use Ellozy’s segmentation block to receive control parameters from the mapping block, wherein a dictionary is one of the control parameters; and to compare the score of the aligned word from the provided script Ti with the score of a decoded word in DTi for that speech segment and to insert or replace the script word with the decoded word if the difference satisfies a specified threshold, thus determining a set of words from the text of the phoneme-level transcription; comparing each word from the set to a dictionary of terms; and confirming each word upon said comparison satisfying a similarity threshold in order to implement a machine learning algorithm to help create a time-aligned transcription of multimedia content.
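For illustration only, the score comparison quoted from Ellozy above can be sketched as follows. This is a minimal sketch and not Ellozy's implementation: the function name, the tuple layout, the sample scores, and the reading of "satisfies a specified threshold" as "decoded score exceeds script score by at least the threshold" are all assumptions.

```python
from typing import List, Tuple

# (script_word, script_score, decoded_word, decoded_score) per speech segment.
Segment = Tuple[str, float, str, float]


def reconcile_words(segments: List[Segment], threshold: float) -> List[str]:
    """Pick the script word or the decoded word for each speech segment.

    Assumption: the decoded word replaces the script word only when its
    score exceeds the script word's score by at least the threshold.
    """
    chosen = []
    for script_word, script_score, decoded_word, decoded_score in segments:
        if decoded_score - script_score >= threshold:
            chosen.append(decoded_word)
        else:
            chosen.append(script_word)
    return chosen


# Hypothetical sample: only the second segment clears a threshold of 0.5.
sample_segments = [
    ("cat", 0.90, "cap", 0.95),
    ("sat", 0.20, "set", 0.80),
]
reconciled = reconcile_words(sample_segments, threshold=0.5)
```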
Claim 37 is rejected for the same reasons as discussed in claim 27 above.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NIENRU YANG whose telephone number is (571)272-4212.  The examiner can normally be reached on Monday - Friday 10 AM - 6 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI TRAN, can be reached at 571-272-7382.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


NIENRU YANG
Examiner
Art Unit 2484





/THAI Q TRAN/
Supervisory Patent Examiner, Art Unit 2484