DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
Claims 17-19, 24-26, and 31-33 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 17, 19, 24, 26, and 31 each recite the limitation “the reference plurality of video frames.” There is insufficient antecedent basis for this limitation in the claims. Claims 18, 25, 32, and 33 are rejected due to their respective dependencies on a claim rejected under 35 U.S.C. § 112.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 16-21, 23-28, and 30-34 are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Ramnani et al. (US 2016/0373830) and Zafarifar et al. (US 2012/0206567).

Regarding claim 1, Ramnani teaches a method comprising:
providing, by one or more processors ([0025], [0099], Figs. 1, 11), a media device with a plurality of video frames to be rendered in accordance with first metadata ([0055], “FIG. 3 is a flow diagram illustrating a method for automatically testing CC rendering according to one embodiment. For example, method 300 can be performed by processing system 100.” [0056], “Referring now to FIG. 3, at block 305, a processing system receives a reference AV stream without CC and a reference AV stream with CC from an AV source. For example, frame dumper 104 receives a reference AV stream without CC and a reference AV stream with CC from AV source 101. At block 310, the processing system generates reference CC images and reference metadata based on the reference AV stream without CC and the reference AV stream with CC.” Fig. 3),
each frame including one or more primary screen objects and at least one secondary screen object ([0039], “Frame 202 includes CC image 211 wherein 
the media device causing each video frame in the plurality of video frames to be rendered based on the first metadata with superposition of the at least one secondary screen object of that video frame onto the one or more primary screen objects of that video frame ([0033], “the caption extractor is configured to generate reference metadata for the reference frames and test metadata for the test frames. In one such embodiment, the metadata includes, but is not limited to, position metadata, frame count metadata, and time point metadata, or any combination thereof. The position metadata indicates the coordinate (e.g., the top left X, Y coordinate) of a CC image in the frame. The frame count metadata indicates the number of frames for which a CC image is in the AV stream (e.g., the number of frames that the CC image is displayed on the screen). The time point metadata indicates the time at which the CC image appeared in the AV stream, relative to the recording start time.” [0047], [0056], “Referring now to FIG. 3, at block 305, a processing system receives a reference AV stream without CC and a reference AV stream with CC from an AV source. For example, frame dumper 104 receives a reference AV stream without CC and a reference AV stream with CC from AV source 101. At block 310, the processing system generates reference CC images and reference metadata based on the reference AV stream without CC and the reference AV stream with CC.” [0039], “Frame 202 includes CC image 211 wherein the top left of CC image 211 is located at 
inputting, by the one or more processors, the rendered plurality of video frames to indicate, for each inputted video frame, whether any secondary screen object is present in that inputted video frame ([0033], [0057], “At block 320, the processing system receives the test AV stream without CC and the test AV stream with CC from the AV source. For example, in response to determining that AV source 101 has been upgraded with a new software, AV source driver 102 automatically causes AV source 101 to send a test AV stream without CC and a test AV stream with CC, wherein the test AV stream is the same as the reference AV stream, except that CC rendering in the test AV stream is performed by the upgraded software.” Fig. 3);
obtaining, by the one or more processors, second metadata that indicates, for each inputted video frame, whether any secondary screen object is present in that inputted video frame ([0058], “At block 325, the processing system generates test CC images and test metadata based on the test AV stream without CC and the test AV stream with CC. For example, frame dumper 104 extracts test frames with CC 120 and test frames without CC 121 from the test AV stream with CC and the test AV stream without CC, respectively. Caption extractor 108 then generates test CC images 123 and test metadata 122 from test frames with CC 120 and test frames without CC 121.” Fig. 3);
causing, by the one or more processors, a comparison of the second metadata to the first metadata in accordance with which the plurality of video 
providing, by the one or more processors and based on the comparison of the second metadata to the first metadata, a validation result that indicates whether the at least one secondary screen objects in the plurality of video frames were rendered correctly ([0059], “For example, caption comparator 109 compares test CC images 123 against reference CC images 133, and/or compares test metadata 122 against reference metadata 132, to determine if AV source 101 performs CC rendering properly after it has been upgraded with the new software. At block 335, the processing system provides the results of the comparison. For example, caption comparator 109 generates results 110 to provide the results of the comparison.”).
Ramnani does not expressly teach that the inputting comprises inputting the rendered plurality of video frames into a data model trained to indicate, for each inputted video frame, whether any secondary screen object is present in that inputted video frame. Ramnani also does not expressly teach that the obtaining comprises obtaining the second metadata from the data model.

In view of Zafarifar’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramnani such that the inputting comprises inputting the rendered plurality of video frames into a data model trained to indicate, for each inputted video frame, whether any secondary screen object is present in that inputted video frame, and such that the obtaining comprises obtaining the second metadata from the data model. The modification would improve Ramnani by providing a supplemental and/or alternative means for detecting secondary screen objects. The modification would thereby improve accuracy and/or provide an alternative detection means should a technique fail.

Regarding claim 23, Ramnani teaches a system comprising: one or more processors; and a memory storing instructions that, when executed by at least one processor among the one or more processors ([0025], [0099], Figs. 1, 11). The rejection of claim 1 is similarly applied to the remaining limitations of claim 23.

Regarding claim 30, Ramnani teaches a non-transitory machine-readable medium comprising instructions that, when executed by one or more processors of a machine ([0025], [0099], Figs. 1, 11). The rejection of claim 1 is similarly applied to the remaining limitations of claim 30.

Regarding claims 17, 24, and 31, the combination teaches the limitations specified above, and teaches:
inputting first training data that includes a reference plurality of reference video frames (Ramnani: [0027], “In one embodiment, the AV source driver is configured communicate with the AV source to cause the AV source to send a first reference AV stream to the processing system. As used herein, a ‘reference AV stream’ refers to an AV stream wherein the CC rendering is manually verified (e.g., by a tester visually inspecting the CC displayed on the screen) and determined to be correct.” [0029]); and
inputting second training data that includes reference data items associated with the reference plurality of video frames, each reference data item indicating whether any secondary screen object is present in a corresponding reference video frame among the plurality of reference video frames (Ramnani: [0027], [0029], [0033], “In one embodiment, the caption extractor is configured to generate reference metadata for the reference frames and test metadata for the test frames. In one such embodiment, the metadata includes, but is not limited to, position metadata, frame count metadata, and time point metadata, or any combination thereof. The position metadata indicates the coordinate (e.g., the top left X, Y coordinate) of a CC image in the frame. The frame count metadata indicates the number of frames for which a CC image is in the AV stream (e.g., the number of frames that the CC image is displayed on the screen). The time point metadata indicates the time at which the CC image appeared in the AV stream, relative to the recording start time.”).
However, the combination as applied above does not expressly teach that the data model is trained based on the above operations, including inputting the first training data and the second training data into the data model.
Zafarifar teaches training a data model to detect subtitles ([0048], “A subtitle detection system and method is illustrated in FIG. 1. The method starts with the existing static region detection process, and prune its results by combining it with a feature based on the density of horizontal transition pairs. This feature computes horizontal transition pairs and selects regions that have a high density of transition pairs. Next, the method performs adaptive temporal filtering on the pruned static region map which integrates the information of the pruned static region map in each image area using an appropriate area-specific filter characterized by whether the region had been detected in previous frames as potential subtitle. The method then performs a bounding box computation, which roughly locates candidate subtitle areas by a rectangle, using height, width and filling-degree constraints. The method then computes, for each bounding box, a horizontal and a vertical text stroke alignment feature and establish a binary classifier that uses these features to identify the bounding boxes that contain subtitles.” [0124], “For training and testing the classifier, we use two videos with similar subtitle fonts, with a total length of around 10700 frames.” [0125], “Training the Classifier”).
In view of Zafarifar’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ramnani such that the data model is trained based on the above operations, including inputting the first training data and the second training data into the data model. The modification would serve to further improve the accuracy of secondary screen object detection.

Regarding claims 18, 25, and 32, the combination further teaches wherein: the data model is trained based on the operations, further including: obtaining, from the data model, extracted metadata that indicates, for each reference video frame, whether any secondary screen object is present in that reference video frame (Ramnani: [0027], [0029], [0033], “In one embodiment, the caption extractor is configured to generate reference metadata for the reference frames and test metadata for the test frames. In one such embodiment, the metadata includes, but is not limited to, position metadata, frame count metadata, and time point metadata, or any combination thereof. The position metadata indicates the coordinate (e.g., the top left X, Y coordinate) of a CC image in the frame. The frame count metadata indicates the number of frames for which a CC image is in the AV stream (e.g., the number of frames that the CC image is displayed on the screen). The time point metadata indicates the time at which the CC image appeared in the AV stream, relative to the recording start time.”).

Regarding claims 20, 27, and 33, the combination further teaches the data model is trained based on the operations, further including:
causing a reference device to produce the reference plurality of reference video frames by producing a first rendering of a reference video stream with secondary screen objects visible (Ramnani: [0029], “In an embodiment where the reference AV streams are sent with and without CC, the frame dumper is to extract reference frames from both the reference AV stream without CC and the reference AV stream with CC, and store them in the reference repository.” [0041]);
causing the reference device to produce a comparison plurality of reference video frames by producing a second rendering of the reference video stream without secondary screen objects visible (Ramnani: [0029], “In an embodiment where the reference AV streams are sent with and without CC, the frame dumper is to extract reference frames from both the reference AV stream without CC and the reference AV stream with CC, and store them in the reference repository.” [0041]); and
obtaining the reference data items of the second training data by comparing the first rendering of the reference video stream with secondary screen objects visible to the second rendering of the reference video stream without secondary screen objects visible (Ramnani: [0056], “At block 310, the processing system generates reference CC images and reference metadata based on the reference AV stream without CC and the reference AV stream with CC. For example, frame dumper 104 extracts reference frames with CC 130 and reference frames without CC 131 from the reference AV stream with CC and the reference AV stream without CC, respectively. Caption extractor 108 then generates reference CC images 133 and reference metadata 132 from reference frames with CC 130 and reference frames without CC 131.” [0060], “FIG. 4 is a flow diagram illustrating a method for performing caption extraction (cropping out the CC image from the video frame) according to one embodiment. This approach is called Caption Filter. For example, method 400 can be performed by caption extractor 108.” [0061], “At block 405, the caption extractor syncs each frame with CC to the corresponding frame without CC using a syncing algorithm (described below).” [0063], “the caption extractor saves the CC image along with the corresponding metadata (e.g., time point, position, and frame count).” Figs. 3-4).

Regarding claims 21, 28, and 34, the combination further teaches:
generating the validation result based on operations that include: identifying a characteristic of a secondary screen object in a video frame among the plurality of video frames; comparing the identified characteristic to a corresponding characteristic represented in the first metadata in accordance with which the plurality of video frames was rendered; and detecting a variance in the identified characteristic based on the comparing of the identified characteristic to the corresponding characteristic (Ramnani: [0084], “FIG. 7 is a diagram illustrating a generated log file according to one embodiment. For example, log file 700 can be implemented as part of results 110. … File ‘1.bmp’ indicates that the first test CC image passes the bitmap image comparison (e.g., the content, style, language, etc., of the first test CC image matches the content, style, language, etc., of its corresponding reference CC image). File ‘1.bmp’ indicates, however, that the frame count of the first test CC image does not match the frame count of its corresponding reference CC image.” [0086], “Log file 700 further includes information summarizing the cumulative results of all three test CC images. In particular, log file 700 indicates that there is: 1) a 0% bitmap mismatch for all three test CC images, 2) 0% bitmap missing, 3) 14.285% duration mismatch, 4) 0% time point mismatch, and 5) 0% anchor point mismatch.” Figs. 6-7).

Claims 19, 22, 26, 29, and 35 are rejected under 35 U.S.C. 103 as being unpatentable over a combination of Ramnani, Zafarifar, and Cronin et al. (US 2015/0271442).

Regarding claims 19 and 26, the combination of Ramnani and Zafarifar teaches the limitations specified above, but does not expressly teach wherein the data model is trained based on the operations, further including:
applying at least one function that reduces error between the extracted metadata from the data model and the reference data items associated with the reference plurality of video frames.
Cronin teaches applying at least one function that reduces error between media data and closed captioning data ([0011], “An automated method of aligning a closed caption track to a media content item is disclosed.” [0043], “Returning to FIG. 2, at 214 method 200 includes aligning the closed caption track to the media content item as a function of the drift value and the offset value”).
In view of Cronin’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination such that the data model is trained based on the operations, further including: applying at least one function that reduces error between the extracted metadata from the data model and the reference data items associated with the reference plurality of video frames. The modification would serve to ensure correct presentation of closed captioning with video content.

Regarding claims 22, 29, and 35, the combination further teaches:
responsive to the variance being detected, calculating an offset based on the detected variance (Ramnani: [0084], “FIG. 7 is a diagram illustrating a generated log file according to one embodiment. For example, log file 700 can be implemented as part of results 110. … File ‘1.bmp’ indicates that the first test CC image passes the bitmap image comparison (e.g., the content, style, language, etc., of the first test CC image matches the content, style, language, etc., of its corresponding reference CC image). File ‘1.bmp’ indicates, however, that the frame count of the first test CC image does not match the frame count of its corresponding reference CC image.” [0086], “Log file 700 further includes information summarizing the cumulative results of all three test CC images. In particular, log file 700 indicates that there is: 1) a 0% bitmap mismatch for all three test CC images, 2) 0% bitmap missing, 3) 14.285% duration mismatch, 4) 0% time point mismatch, and 5) 0% anchor point mismatch.” Figs. 6-7).
However, the combination does not expressly teach causing the media device to adjust a subsequent rendering of the plurality of video frames based on the offset.
Cronin teaches causing a media device to adjust a rendering of a plurality of video frames based on an offset ([0011], “An automated method of aligning a closed caption track to a media content item is disclosed.” [0043], “Returning to FIG. 2, at 214 method 200 includes aligning the closed caption track to the media content item as a function of the drift value and the offset value”).
In view of Cronin’s teaching, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to include causing the media device to adjust a subsequent rendering of the plurality of video frames based on the offset. The modification would serve to ensure correct presentation of closed captioning with video content.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Pham (US 2012/0143606) discloses a system for monitoring video content and logging caption errors ([0010]-[0011]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL R TELAN whose telephone number is (571)270-5940. The examiner can normally be reached 9:30AM-6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi, can be reached at (571) 272-4195. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL R TELAN/           Primary Examiner, Art Unit 2426