DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Arguments
Applicant’s arguments, see pages 11+ of Remarks, filed 01/28/2022, with respect to the rejection(s) of claims 1-20 under Claim Rejections - 35 USC § 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Xu et al. (U.S. Pub. No. 2019/0387263).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
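The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.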

Claims 1-8, 10-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Nir (U.S. Pub. No. 2016/0066055) in view of Xu et al. (U.S. Pub. No. 2019/0387263).

Regarding claims 1, 11 and 20, Nir discloses a video stream processing method, the method comprising:
obtaining, by processing circuitry, first audio stream data in live video stream data (see paragraphs 0013-0015; a CPU for processing the received audio and video signals);
performing, by the processing circuitry, speech recognition on the first audio stream data to generate speech recognition text (see paragraph 0059; a speech recognition module 37 converts each audio time slice to text that includes the transcription of the audio time slice);
generating, by the processing circuitry, caption data according to the speech recognition text, the caption data including caption text and time information corresponding to the caption text (see paragraph 0018; a speech recognition module for converting each audio time slice to text that contains the transcription of the audio time slice); and
adding, by the processing circuitry, the caption text to a corresponding picture frame in the live video stream data corresponding to the caption text to generate 
However, Nir fails to disclose the time information being extracted from a speech start audio frame in the first audio stream data and indicating (1) a time point corresponding to speech start audio frame of the segment of speech on which the speech recognition is performed to generate the speech recognition text of the segment of speech and (2) a duration of the segment of speech on which the speech recognition is performed and adding, by the processing circuitry, the caption text to a corresponding picture frame in the live video stream data according to the time information corresponding to the caption text to generate captioned live video stream data, the corresponding picture frame being determined based on the time point and the duration.
Xu et al. discloses the time information being extracted from a speech start audio frame in the first audio stream data and indicating (1) a time point corresponding to speech start audio frame of the segment of speech on which the speech recognition is performed to generate the speech recognition text of the segment of speech (see paragraphs 0010, 0027; start timestamp) and (2) a duration of the segment of speech on which the speech recognition is performed (see paragraphs 0010, 0027, 0042, 0054, 0106; subtitle timestamp having a start timestamp and an end timestamp) and
adding, by the processing circuitry, the caption text to a corresponding picture frame in the live video stream data according to the time information corresponding to 
It would have been obvious to a skilled artisan before the effective filing date of the claimed invention to modify the system of Nir with the teachings of Xu et al., the motivation being to synchronously display and match the streaming media and subtitles (see abstract).
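
For illustration only, the timing limitation discussed above (a time point plus a duration determining the corresponding picture frames) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Nir or Xu et al.: the function name `frames_for_caption`, the millisecond units, and the 25 fps frame spacing are all hypothetical.

```python
from typing import List


def frames_for_caption(start_ms: int, duration_ms: int,
                       frame_timestamps_ms: List[int]) -> List[int]:
    """Return indices of picture frames whose timestamps fall inside the caption window."""
    # A start time point plus a duration is equivalent to a start/end timestamp pair.
    end_ms = start_ms + duration_ms
    return [i for i, ts in enumerate(frame_timestamps_ms)
            if start_ms <= ts < end_ms]


# Hypothetical 25 fps stream: one picture frame every 40 ms.
timestamps = [n * 40 for n in range(10)]
target_frames = frames_for_caption(start_ms=100, duration_ms=120,
                                   frame_timestamps_ms=timestamps)
print(target_frames)  # [3, 4, 5]
```

In this sketch the caption window is half-open, so a frame stamped exactly at the end of the segment is not captioned; a closed interval would be an equally plausible reading.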


Regarding claims 2 and 12, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses wherein the adding comprises:
separating the live video stream data into second audio stream data and first picture frame stream data (see paragraph 0033);
determining a target picture frame in the first picture frame stream data, the target picture frame corresponding to the time information (see abstract, paragraphs 0013-0024);
generating a caption image of the caption text (see abstract, paragraphs 0013-0024);
superimposing the caption image on the target picture frame to generate superimposed picture frame stream data (see abstract, paragraphs 0013-0024); and
combining the second audio stream data with the superimposed picture frame stream data to generate the captioned live video stream data (see abstract, paragraphs 0013-0024).


Regarding claims 3 and 13, Nir and Xu et al. disclose everything claimed as applied above (see claims 2 and 12).  Nir discloses wherein the combining comprises:
synchronizing the second audio stream data and the superimposed picture frame stream data according to the time information (see abstract, paragraphs 0022, 0040, 0062); and
combining the synchronized second audio stream data and the superimposed picture frame stream data to generate the captioned live video stream data (see abstract, paragraphs 0022, 0040, 0062).

Regarding claims 4 and 14, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses wherein before the adding, the method includes obtaining second picture frame stream data in the live video stream data (see abstract, paragraphs 0013-0024), and
the adding includes:
determining a target picture frame in the second picture frame stream data, the target picture frame corresponding to the time information (see abstract, paragraphs 0013-0024);
generating a caption image of the caption text (see abstract, paragraphs 0013-0024);

combining the first audio stream data with the superimposed picture frame stream data to generate the captioned live video stream data (see abstract, paragraphs 0013-0024).
Xu et al. discloses the target picture frame being determined based on the time point and the duration (see paragraph 0106).


Regarding claims 5 and 15, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses adding, after a delay of a preset duration from a first moment, the caption text to the corresponding picture frame in the live video stream data according to the time information corresponding to the caption text to generate the captioned live video stream data, the first moment being a time the live video stream data is obtained (see paragraphs 0049, 0062, 0065).

Regarding claims 6 and 16, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses adding, after the caption data is stored, the caption text to the corresponding picture frame in the live video stream data according to the time information corresponding to the caption text to generate the captioned live video stream data (see paragraphs 0021-0022, 0039, 0049, 0062).

Regarding claims 7 and 17, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses wherein the performing the speech recognition comprises:
performing a speech start-end detection on the first audio stream data to obtain the speech start audio frame and a speech end audio frame in the first audio stream data, the speech start audio frame corresponding to a beginning of a segment of speech, and the speech end audio frame corresponding to an end of the segment of speech (see paragraphs 0030-0031);
extracting at least one segment of speech data from the first audio stream data according to the speech start audio frame and the speech end audio frame in the first audio stream data, the speech data including another audio frame between the speech start audio frame and the speech end audio frame (see paragraph 0058);
performing speech recognition on the at least one segment of speech data to obtain recognition sub-text corresponding to the at least one segment of speech data (see paragraphs 0018, 0059); and
determining the recognition sub-text corresponding to the at least one segment of speech data as the speech recognition text (see paragraphs 0018, 0059).
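
For illustration only, the speech start-end detection and segment extraction steps recited above can be sketched as follows. This is a minimal sketch, assuming a simple per-frame energy threshold as the detector; the cited paragraphs are not asserted to use this technique, and the names `detect_speech_segments` and `extract_segments` and the sample energy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def detect_speech_segments(frame_energies: Sequence[float],
                           threshold: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) index pairs, inclusive, for each run of speech."""
    segments = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold and start is None:
            start = i                        # speech start audio frame
        elif energy < threshold and start is not None:
            segments.append((start, i - 1))  # previous frame is the speech end audio frame
            start = None
    if start is not None:                    # speech runs to the end of the stream
        segments.append((start, len(frame_energies) - 1))
    return segments


def extract_segments(frames: Sequence, segments: List[Tuple[int, int]]) -> List[list]:
    """Slice out each segment, including the frames between start and end."""
    return [list(frames[s:e + 1]) for s, e in segments]


# Hypothetical per-frame energies; values at or above 0.5 are treated as speech.
energies = [0.1, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.9]
segments = detect_speech_segments(energies, threshold=0.5)
print(segments)  # [(1, 3), (6, 7)]
```

Each extracted segment would then be passed to a speech recognizer to obtain the recognition sub-text; that step is omitted here because it depends on the recognizer used.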

Regarding claims 8 and 18, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses wherein the generating the caption data comprises:
translating the speech recognition text into translated text corresponding to a target language (see paragraphs 0029, 0049);

generating the caption data including the caption text (see paragraphs 0029, 0049).

Regarding claim 10, Nir and Xu et al. disclose everything claimed as applied above (see claim 1).  Nir discloses receiving a video stream obtaining request from a user terminal (see paragraphs 0010, 0025, 0029, 0032-0040, 0049 and 0059);
obtaining language indication information in the video stream obtaining request, the language indication information indicating a caption language (see paragraphs 0010, 0025, 0029, 0032-0040, 0049 and 0059); and
pushing the captioned live video stream data to the user terminal when the caption language indicated by the language indication information corresponds to the caption text (see paragraphs 0010, 0025, 0029, 0032-0040, 0049 and 0059).

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Nir and Xu et al. as applied to claims 1 and 11 above, and further in view of Cuthbert et al. (U.S. Patent No. 9,953,631).


Regarding claims 9 and 19, Nir and Xu et al. disclose everything claimed as applied above (see claims 1 and 11).  Nir discloses wherein the generating the caption data comprises:

However, Nir and Xu et al. are silent as to generating the caption text according to the translated text, the caption text including the speech recognition text and the translated text; and generating the caption data including the caption text.
Cuthbert et al. discloses generating the caption text according to the translated text, the caption text including the speech recognition text and the translated text (see col. 5, line 23 to col. 6, line 10; figs. 3A-3C; displaying both original speech text and translated text); and
generating the caption data including the caption text (see col. 5, line 23 to col. 6, line 10; figs. 3A-3C).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to modify the method and system of Nir and Xu et al. to include generating the caption text according to the translated text, the caption text including the speech recognition text and the translated text; and generating the caption data including the caption text, as taught by Cuthbert et al., for the advantage of enhancing the conversation experience.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
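
For illustration only, the date arithmetic described in the preceding paragraph can be sketched as follows. This is a minimal sketch of one reading of that paragraph, not an authoritative deadline calculator: it ignores weekends, holidays, and extension fees, and the function names and the sample dates are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def shortened_period_end(mailing: date,
                         first_reply: Optional[date] = None,
                         advisory_mailed: Optional[date] = None) -> date:
    """End of the shortened statutory period under the rule stated above."""
    two_months = add_months(mailing, 2)
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)
    if (first_reply is not None and advisory_mailed is not None
            and first_reply <= two_months and advisory_mailed > three_months):
        # The period moves to the advisory action mailing date, never past six months.
        return min(advisory_mailed, six_months)
    return three_months


# Hypothetical dates chosen only to exercise each branch.
mailing = date(2022, 3, 1)
print(shortened_period_end(mailing))                                      # 2022-06-01
print(shortened_period_end(mailing, date(2022, 4, 15), date(2022, 6, 10)))  # 2022-06-10
print(shortened_period_end(mailing, date(2022, 4, 15), date(2022, 10, 1)))  # 2022-09-01
```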
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NNENNA NGOZI EKPO whose telephone number is (571)270-1663. The examiner can normally be reached M-W 10:00am - 6:30pm, TH-F 8:00am - 4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Brian Pendleton, can be reached at 571-272-7527. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is 

NNENNA EKPO
Primary Examiner
Art Unit 2425



/NNENNA N EKPO/ Primary Examiner, Art Unit 2425
February 18, 2022