DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 7/19/2021 have been fully considered but they are not persuasive.
Regarding claim 1 and Prior art Verrilli, Applicant argues:
The disclosures of Verrilli and Koo fail to teach or suggest all the features of claim 1. As discussed during the Interview on June 29, 2021, the disclosure of Verrilli focuses on the detection of text in a text overlay or screenshot. See at least paragraphs [0035], [0045], and [0078]. This is in contrast to the present claims where the text being detected is included in the actual video data and may include embedded textual data, but is not limited to embedded textual data as in Verrilli. Indeed, Applicant's specification clearly differentiates between these two types of textual data. See at least paragraphs [0004] ("Any such text embedded in the image component of the video data is referred to herein as "on-screen text.” On-screen text is differentiated from text rendered from textual data included in the video data in that it is not associated with computer readable data and exists only as an image"), and [0022] ("In one embodiment, the server can analyze the video data to detect text depicted in the visual video content . .. Some video sources generate and embed additional text that can also be included in the visual video content. For example, a news broadcast may include overlays of graphics and/or text that emphasize some aspect of a news story."). Therefore, the disclosure of Verrilli does not teach or disclose, "the video data including text data."

embed text data into the images of the video content. Such text can be rendered as an overlay to portray certain information in addition to or in parallel to the other information being portrayed in the images or audio of the video content” (emphasis added).  As applicant has pointed out above, “Verrilli focuses on the detection of text in a text overlay or screenshot. See at least paragraphs [0035], [0045], and [0078]” (emphasis added). Therefore, Examiner maintains that Verrilli may still be relied upon to teach the newly amended portion of “video data including text data and/or embedded text data.” 

Regarding claim 1 and prior art Koo, Applicant argues:
The disclosure of Koo focuses on detecting text on a moving object, e.g. a vehicle, via an image processing device, e.g. a phone camera. See at least paragraph [0004]. In particular, Koo discloses analyzing a plurality of frames to detect the same text by analyzing a subset, e.g. more than one, of a plurality of frames. See at least paragraphs [0045] ("For example, while the tracker 114 may generate a frame result for every frame from frame 1 to frame n, the object detector/recognizer 124 may generate an output for only frames 1, 5, 13, ..., and n, as shown in FIG. 1") and [0048] ("In a particular embodiment, the object detector/recognizer 124 may have a multi-frame latency. For example, the object detector/recognizer 124 may not generate a frame result for one or more frames of the plurality of frames (i.e., the object detector/recognizer 124 generates a frame result less frequently than the tracker 112)"). (Emphasis added). This is in contrast to the present claim, where the text identification occurs for each frame of the plurality of frames, i.e. "for a frame in the plurality of frames temporarily stored in the frame buffer." Indeed, throughout the disclosure of Koo, Koo states that the techniques disclosed are different from detecting text in a single image, e.g. single frame. See at least paragraphs [0006] and [0022]. Therefore, the disclosure of Koo actually teaches away from the present claim.

Examiner respectfully disagrees.  As applicant has pointed out, the feature recites, “for a frame in the plurality of frames temporarily stored in the frame buffer” (emphasis added).  Under the broadest reasonable interpretation, the identification of a location within the frame corresponding to a region likely containing text is interpreted as needing to be performed for only a frame in the plurality of frames, not for each frame.  Additionally, Examiner notes that, in the citations provided by the applicant, the object detector/recognizer does generate an output for multiple frames in a plurality of frames.  This is further taught in Para 0096, wherein the detector/recognizer 124 of Fig. 1 may be configured to detect and/or recognize the object 151 in a subset of frames of the plurality of frames and to generate a single frame result for every N frame results generated by the tracker 112, where N is an integer greater than 1.  Examiner maintains that Koo teaches the identification of text in a frame as well as in one or more frames in a plurality of frames.

Regarding prior art Oztaskent, Applicant argues:
On pages 10-11 of the Office Action the Examiner cites paragraphs [0063], [0072], and Fig. 4 of Oztaskent as teaching "wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query." (Emphasis Added). However, the disclosure of Oztaskent only discloses initiating an image search using an image selected from video content. During the Interview, the Examiner argued that Oztaskent discloses a "region of interest" may be selected to use in the search query which in combination with Verrilli and Koo would teach "initiates a search using the textual data as a search query." However, in the context of the full disclosure, the "region of interest" is still only an image and not text. Paragraph [0063] of Oztaskent discloses:
"At 140, the client application can enter a result display mode that transmits the user selections, which can include a selected image, a selected region of interest, a selected face, a selected object, and/or any other suitable portion of an image, to the search server." (Emphasis Added).
Thus, it is clear that the "region of interest" is a "suitable portion of an image" and is not text data as in the present claim. Paragraph [0072] of Oztaskent similarly only discloses a user selecting a portion of an image to initiate a search. Therefore, the disclosure of Oztaskent does not teach or suggest the above-identified feature of claim 1.

Examiner respectfully disagrees.  Applicant misconstrues examiner’s position.  Applicant has already claimed that the graphical user interface element definition corresponds to a region based on the textual data in the 7th limitation.  Then, it was established that generating a graphical user interface element is based on the graphical user interface element definition as cited in the first 2 lines of the 8th limitation.  Therefore, examiner has broadly interpreted that the selection of the graphical user interface element already includes the textual data that is identified and recognized in steps prior.  
As cited in the previous office action, Verrilli is relied upon to teach performing a character recognition operation to generate recognized characters, generating textual data based on the recognized data, and generating a GUI element definition corresponding to the region.
  

Swan is relied upon for teaching generating an output text region that includes input from multiple input text regions, wherein the character recognition technology is applied to the detected region and modifies the output video data to include data based on the character values detected in that region.
As discussed above and cited in the previous office action, Para 0063 and 0072 of Oztaskent are relied upon for teaching the transmission of user selections, which can include a selected image, a selected region of interest, a selected face, a selected object, and/or any other suitable portion of an image, to the search server, wherein the client application can then receive and present one or more search results associated with the selected image and/or region of interest to the user.
Therefore, Examiner interprets that the combined teachings of Verrilli’s feature of OCR data or textual data that corresponds to a region, Swan’s outputted text region, and Oztaskent’s feature of using the region of interest in a search query would teach the applicant’s feature of a user selection of the graphical user interface element initiating a search using the textual data as a search query.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3-5, 7-13, and 17-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Verrilli et al. (“Verrilli” US 20140082647) and further in view of Koo et al. (“Koo” US 20130177203), Pickering et al. (“Pickering” US 7489334), Swan (“Swan” US 20100259676), and Oztaskent et al. (“Oztaskent” US 20140282660).

Regarding claim 1, Verrilli teaches a method comprising:
receiving, by a computer system, video data comprising a plurality of frames arranged in an order, the video data including text data and/or embedded text data [i.e. text overlay]; [Verrilli - Fig. 1a: suggests any device may receive video data and digital video data from the broadcast system or content provider.  Para 0035, 0045, 0058: teaches display data which includes a text overlay including information about the playing broadcast media program];
providing, in a frame buffer of the computer system, temporary storage of the video data [Verrilli - Para 0057-0058: discloses a client device obtaining screen capture data from the video signal.  Para 0071, Fig. 8: discloses a client device (item 102-1) with a data module (item 420) included in the memory (item 406) that stores display data in a display data cache (item 844)]; and
for a frame in the plurality of frames temporarily stored in the frame buffer: [Verrilli – Para 0075: discloses the display data cache 844 is used to store images and other data frequently downloaded by the client device 102-1.]
by the computer system, based on an analysis of the video data in the frame buffer [Verrilli - Para 0006: discloses evaluating the display data to determine whether or not the display data includes a text overlay], 
performing, by the computer system, a character recognition operation [i.e. optical character recognition] on the region to generate recognized characters [Verrilli - Para 0014: discloses applying an optical character recognition process to extract the text], 
generating, by the computer system, textual data [i.e. OCR data] based on the recognized characters [Verrilli - Para 0055, Fig. 2: discloses OCR data that it receives and stores in memory]; and
generating, by the computer system, a graphical user interface element definition [i.e. included in OCR data] corresponding to the region based on the textual data [Verrilli - Para 0055: discloses OCR data that includes data about all that was captured],
Verrilli does not explicitly teach identifying, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the computer system identifies the location within the frame corresponding to a region containing text and/or embedded text based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score;
wherein performing the character recognition operation on the region comprises performing the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary;
generating, by the computer system, a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Koo teaches identifying, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the computer system identifies the location within the frame corresponding to a region containing text and/or embedded text based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score; [Koo - Para 0073: discloses determining a location of the text in each of the plurality of frames as the text moves relative to the image capture device 102 over a period of time, or as the image capture device 102 moves relative to the text 153 in each of the plurality of frames over a period of time.  Fig. 7: step 750 and 760 suggests estimating motion of object between a particular frame and a previous frame.  Para 0038: discloses to improve precision, a particular text box is shown only when the particular text box is detected in at least m times in recent n frames.  Assuming that the detection probability of a text box is p, this technique may improve precision of text box detection. The improved precision may be expressed as: $f(p, n, m) = \sum_{k=m}^{n} \binom{n}{k} p^{k} (1 - p)^{n-k}$.  Therefore, the probability that a frame will contain a text box will be based on the calculation according to the previous frames];
wherein performing the character recognition operation on the region comprises performing the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary [Koo - Para 0073: discloses generating proposed text data (e.g., via optical character recognition (OCR)) representing the text in each of the plurality of frames.  Para 0052, 0070, 0083: teaches accessing one or more dictionaries stored in the memory to verify the proposed text data];
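For illustration only, the precision expression quoted from Koo Para 0038 can be evaluated numerically. The following is a minimal sketch, assuming the expression is the binomial tail sum, i.e. the probability that a text box is detected at least m times in n independent frames when each frame has detection probability p. The function name and the sample values are hypothetical and do not appear in Koo.

```python
from math import comb


def detection_precision(p: float, n: int, m: int) -> float:
    """Binomial tail: probability of at least m detections in n frames.

    Hypothetical helper for illustration; assumes independent frames,
    each with the same per-frame detection probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")
    # Sum C(n, k) * p^k * (1 - p)^(n - k) for k = m, ..., n.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    # Sample values chosen arbitrarily: 3 or more detections in the last 5 frames.
    print(detection_precision(0.8, 5, 3))
```

Under these assumptions, the value for a given frame depends on detection results from the n most recent frames, which is consistent with the reading that the probability is based on the previous frames.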
Verrilli and Koo are analogous in the art because they are from the same field of frame processing [abstract].  It would have been obvious to one of ordinary skill in the 
Verrilli and Koo do not explicitly teach identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer;
generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Pickering teaches identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer; [Pickering – Col. 4, Line12-19: discloses as images are captured and passed to the central processing device, they are stored in an image buffer as shown at step 130 so that they can be compared with each other as detailed below by the image processing modules 140 to 160 where, at module 140, object priority and sensitivity is established; at module 150, frame to frame changes (i.e., 
Verrilli, Koo, and Pickering are analogous in the art because they are from the same field of image analysis [Col. 1, Line7-16].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli and Koo in view of Pickering to use data from previous frames for the reasons of improving accuracy by comparing frames when determining regions of interest.
Verrilli, Koo, and Pickering do not explicitly teach generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Swan teaches generate a graphical user interface element definition comprising a boundary box [i.e. output text region] corresponding to the region based on the textual data [Swan - Para 0043-0044: discloses generating an output text region that includes input from multiple input text regions]; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters [Swan - Para 0045-0047, Fig. 12: discloses applying character recognition technology to a detected text region, modifying the output video data to include data based on the character values detected in the region.  The text output region may include a rendering of the text].
Verrilli, Koo, Pickering, and Swan are analogous in the art because they are from the same field of character recognition in video signals [abstract].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli, Koo, and Pickering in view of Swan to include GUI elements for the reasons of focusing the display on the recognized characters.
Verrilli, Koo, Pickering, and Swan do not explicitly teach wherein the graphical user interface element is user-selectable, and wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Oztaskent teaches wherein the graphical user interface element is user-selectable, and wherein a user selection of the graphical user interface element initiates a search using the textual data as a search query. [Oztaskent – Para 0063, 0072, Fig. 4: teaches the client application can enter a result display mode 
Verrilli, Koo, Pickering, Swan, and Oztaskent are analogous in the art because they are from the same field of providing information related to media content [abstract].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli, Koo, Pickering, and Swan in view of Oztaskent to include user-selectable elements for the reasons of improving the watching experience by providing additional information when the user selects to search for the identified data.

Regarding claim 3, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 1 further comprising accessing, by the computer system, a dictionary [i.e. database] comprising expected textual data, and wherein generating the textual data comprises comparing the recognized characters with the expected textual data [Verrilli - Para 0061: discloses cross referencing with a database to ensure validity of the information].

Regarding claim 4, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 1 further comprising transmitting the video data and the graphical user interface element definition from the computer system to a remote client computing device [i.e. laptops, tablets, phones] for display on the client computing device [Verrilli - Para 0039: discloses that video data may be received by any number of display devices, including computers, laptop computers, tablet computers, smart phones and the like].

Regarding claim 5, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 1 further comprising storing, by the computer system, the video data and the graphical user interface element definition in one or more data stores accessible to a plurality of client computing devices [Verrilli - Para 0051: discloses memory may optionally include one or more storage devices remotely located from the CPUs].

Regarding claim 7, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 1 further comprising:
generating, by the computer system, a graphical user interface element [i.e. on screen button] based on the graphical user interface element definition [Verrilli - Para 0077: discloses an “INFO” button on the application interface displayed]; and
associating, by the computer system, an operation [i.e. the initiation of the overlay] to be performed in response to a user input received through the user interface element [Verrilli - Para 0078: discloses the user input will initiate the display of the program information overlay].

Regarding claim 8, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 7 wherein the user interface element comprises a visual representation [i.e. on screen results] of at least a portion of the region or the text [Verrilli - Para 0078, Fig. 7a, 7b: discloses the character recognition can be used to do a search query with the results displayed on screen].

Regarding claim 9, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 7 further comprising generating, by the computer system, a graphical user interface [i.e. information box] comprising the graphical user interface element, wherein the graphical user interface is superimposed on the frame and one or more other frames in the plurality of frames [Verrilli - Fig. 7a: suggests that the information box will be an overlay on the playing television program].

Regarding claim 10, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 7 further comprising executing, by the computer system, the operation, wherein the operation uses the textual data as input [Verrilli - Para 0067: discloses performing an internet search based on at least some of the extracted text by submitting a search query to the search server system].

Regarding claim 11, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 10, wherein the operation comprises generating a request [i.e. query] for data comprising the textual data, the method further comprising:
sending the request for data from the computer system to an external data source [i.e. search server system] [Verrilli - Para 0067, Fig. 6: discloses the search queries are submitted to a search server system];
receiving, in response to the request for data, additional data [i.e. associated content] related to the textual data [Verrilli - Para 0067: discloses the search server system responds to a received search query by providing information and/or access to information.  Para 0074: discloses an associated content search module to produce one or more search queries transmitting to the search server system]; and
generating, by the computer system, another graphical user interface comprising information based on the additional data [Verrilli - Para 0075: discloses that the search results will be displayed in information box].

Regarding claim 12, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 1 further comprising determining, by the computer system, metadata associated with the video data and comprising information about the content of the video data, and wherein generating the textual data is further based on the metadata [Verrilli - Para 0048: discloses metadata associated with content files.  Para 0045: discloses the metadata being displayed on screen in a text overlay.  Para 0054: discloses the OCR data obtained from information on screen].

Regarding claim 13, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 12 wherein determining the metadata comprises receiving electronic program guide data comprising descriptions of content of the video data [Verrilli - Para 0050: discloses metadata is associated with the content received from the broadcast system].

Regarding claim 17, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 12 wherein determining the metadata comprises receiving a custom dictionary [i.e. content database] of expected textual data associated with the metadata or a user, and wherein generating the textual data comprises comparing the recognized characters with the custom dictionary [Verrilli - Para 0059: discloses the client device may communicate with the media server in order to check the validity of the extracted information using a content database].

Regarding claim 18, Verrilli, Koo, Pickering, Swan, and Oztaskent teach the method of claim 12 wherein the metadata further comprises predetermined coordinates [i.e. position of expected text overlay] for the region in the frame and an area, and wherein determining the region is based on the metadata [Verrilli - Para 0045, Fig. 1b: discloses the expected text overlay that includes program channel, title, and information about actors, characters, synopses. Para 0050: discloses the application program interface instructions are included with the signal from the broadcasting system, so the metadata that is received will determine how it is displayed based on the instructions].

Regarding claim 19, Verrilli teaches a method comprising:
receiving, by a computer system, video data comprising a plurality of frames arranged in an order, the video data including text data and/or embedded text data; [Verrilli - Fig. 1a: suggests any device or system may receive video data and digital video data from the broadcast system or content provider];
providing, in a frame buffer of the computer system, temporary storage of the video data [Verrilli - Para 0057-0058: discloses a client device obtaining screen capture data from the video signal.  Fig. 1B: suggests a client device with memory]; and
for a frame in the plurality of frames temporarily stored in the frame buffer:
determining, by the computer system, contextual data associated with the video data based on an analysis of the video data in the frame buffer [Verrilli - Para 0058: discloses indicators that identify what is being shown with the content];
by the computer system, based on the contextual data a region [i.e. text overlay] containing text and/or embedded text [Verrilli - Para 0006: discloses evaluating the display data to determine whether or not the display data includes a text overlay];
performing, by the computer system, a character recognition operation on the region to generate recognized characters [Verrilli - Para 0014: discloses applying an optical character recognition process to extract the text],
generating, by the computer system, textual data based on the recognized characters [Verrilli - Para 0055, Fig. 2: discloses OCR data that it receives and stores in memory],
Verrilli does not explicitly teach identifying, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the computer system identifies the location within the frame corresponding to a region containing text and/or embedded text based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score;
wherein performing the character recognition operation on the region comprises performing the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary;
generating, by the computer system, a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generating a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Koo teaches identifying, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the computer system identifies the location within the frame corresponding to a region containing text and/or embedded text based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score [Koo - Para 0073: discloses determining a location of the text in each of the plurality of frames as the text moves relative to the image capture device 102 over a period of time, or as the image capture device 102 moves relative to the text 153 in each of the plurality of frames over a period of time.  Fig. 7: step 750 and 760 suggests estimating motion of object between a particular frame and a previous frame.  Para 0038: discloses to improve precision, a particular text box is shown only when the particular text box is detected in at least m times in recent n frames.  Assuming that the detection probability of a text box is p, this technique may improve precision of text box detection. The improved precision may be expressed as: $f(p, n, m) = \sum_{k=m}^{n} \binom{n}{k} p^{k} (1 - p)^{n-k}$.  Therefore, the probability that a frame will contain a text box will be based on the calculation according to the previous frames];
wherein performing the character recognition operation on the region comprises performing the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary [Koo - Para 0073: discloses generating proposed text data (e.g., via optical character recognition (OCR)) representing the text in each of the plurality of frames.  Para 0052, 0070, 0083: teaches accessing one or more dictionaries stored in the memory to verify the proposed text data];
wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer [Koo - Para 0073: discloses determining a location of the text in each of the plurality of frames as the text moves relative to the image capture device 102 over a period of time, or as the image capture device 102 moves relative to the text 153 in each of the plurality of frames over a period of time.  Fig. 7: step 750 and 760 suggests estimating motion of object between a particular frame and a previous frame].
In addition, the rationale of claim 1 regarding Koo is used for claim 19.
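For illustration only, the expression quoted above from Koo Para 0038 can be evaluated numerically. The following is a minimal sketch and is not part of the Koo disclosure: the function name detection_precision is hypothetical, and the sketch assumes that detections in successive frames are independent events, each with probability p, which is the assumption under which the binomial sum applies.

```python
from math import comb


def detection_precision(p: float, n: int, m: int) -> float:
    """Evaluate f(p, n, m) = sum_{k=m}^{n} C(n, k) * p^k * (1 - p)^(n - k).

    This is the probability that a text box is detected in at least m of
    the n most recent frames, ASSUMING independent per-frame detections
    that each succeed with probability p (an illustrative assumption).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")
    # Sum the binomial probabilities of exactly k detections for k = m..n.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    # Two recent frames, per-frame detection probability 0.5,
    # text box required in at least one of them.
    print(detection_precision(0.5, 2, 1))
```

Under these assumptions, requiring a detection in at least one of two frames with p = 0.5 yields 0.75, and requiring a detection in every one of the n frames reduces the expression to p raised to the power n.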
Verrilli and Koo do not explicitly teach identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer;
generating, by the computer system, a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generating a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Pickering teaches identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer [Pickering – Col. 4, Lines 12-19: discloses as images are captured and passed to the central processing device, they are stored in an image buffer as shown at step 130 so that they can be compared with each other as detailed below by the image processing modules 140 to 160 where, at module 140, object priority and sensitivity is established; at module 150, frame to frame changes (i.e., comparing the Nth frame with the N-1th frame within a datastream) are checked; and, at module 160, motion is identified and/or predicted].
In addition, the rationale of claim 1 regarding Pickering is used for claim 19.
Verrilli, Koo, and Pickering do not explicitly teach generating, by the computer system, a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generating a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Swan teaches generating, by the computer system, a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data [Swan - Para 0043-0044: discloses generating an output text region [i.e. output text region] that includes input from multiple input text regions]; and
generating a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters [Swan - Para 0045-0047, Fig. 12: discloses applying character recognition technology to a detected text region, modifying the output video data to include data based on the character values detected in the region.  The text output region may include a rendering of the text];
In addition, the rationale of claim 1 regarding Swan is used for claim 19.
Verrilli, Koo, Pickering, and Swan do not explicitly teach wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Oztaskent teaches wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query. [Oztaskent – Para 0063, 0072, Fig. 4: teaches the client application can enter a result display mode that transmits the user selections, which can include a selected image, a selected region of interest, a selected face, a selected object, and/or any other suitable portion of an image, to the search server. The client application can then receive and present one or more search results associated with the selected image and/or region of interest to the user.]
In addition, the rationale of claim 1 regarding Oztaskent is used for claim 19.

Regarding claim 20, Verrilli teaches a computing system comprising:
one or more processors [Verrilli - Para 0042: discloses any device to include one or more processors that is able to connect to the communication network]; and
a memory comprising instructions that, when executed by the processors, configure the one or more processors to be configured to [Verrilli - Para 0042: discloses any device to include one or more processors and memory]:
receive video data comprising a plurality of frames arranged in an order, the video data including text data and/or embedded text data [Verrilli - Fig. 1a: suggests any device or system may receive video data and digital video data from the broadcast system or content provider];
temporarily store the video data in a frame buffer of the computing system [Verrilli - Para 0057-0058: discloses a client device obtaining screen capture data from the video signal.  Fig. 1B: suggests a client device with memory]; and
for a frame in the plurality of frames temporarily stored in the frame buffer:
based on an analysis of the video data in the frame buffer, a region containing text and/or embedded text [Verrilli - Para 0006: discloses evaluating the display data to determine whether or not the display data includes a text overlay],
perform a character recognition operation on the region to generate recognized characters [Verrilli - Para 0014: discloses applying an optical character recognition process to extract the text], 
generate textual data based on the recognized characters [Verrilli - Para 0055, Fig. 2: discloses OCR data that it receives and stores in memory];
Verrilli does not explicitly teach identify, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the location within the frame corresponding to a region containing text and/or embedded text is identified based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score;
wherein to perform the character recognition operation on the region comprises to perform the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary;
generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Koo teaches identify, a location within the frame corresponding to a region likely containing text and/or embedded text, wherein identifying the location further comprises identifying the location based upon stored data from a character recognition operation, and wherein the stored data comprises a score that is associated with a high probability of the region of the previous frame containing recognizable text and/or embedded text, wherein the location within the frame corresponding to a region containing text and/or embedded text is identified based upon an identified association between frame context data of the video data in the frame buffer and frame context data that is associated with the score [Koo - Para 0073: discloses determining a location of the text in each of the plurality of frames as the text moves relative to the image capture device 102 over a period of time, or as the image capture device 102 moves relative to the text 153 in each of the plurality of frames over a period of time.  Fig. 7: step 750 and 760 suggests estimating motion of object between a particular frame and a previous frame.  Para 0038: discloses to improve precision, a particular text box is shown only when the particular text box is detected in at least m times in recent n frames.  Assuming that the detection probability of a text box is p, this technique may improve precision of text box detection. The improved precision may be expressed as: $f(p, n, m) = \sum_{k=m}^{n} \binom{n}{k} p^{k} (1 - p)^{n - k}$.  Therefore, the probability that a frame will contain a text box will be based on the calculation according to the previous frames];
wherein to perform the character recognition operation on the region comprises to perform the character recognition operation on corresponding regions containing the text and/or embedded text in one or more other frames in the plurality of frames, and wherein performing the character recognition operation on the region comprises referencing a standard dictionary [Koo - Para 0073: discloses generating proposed text data (e.g., via optical character recognition (OCR)) representing the text in each of the plurality of frames.  Para 0052, 0070, 0083: teaches accessing one or more dictionaries stored in the memory to verify the proposed text data];
wherein to identify the location further comprises identifying the location based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer [Koo - Para 0073: discloses determining a location of the text in each of the plurality of frames as the text moves relative to the image capture device 102 over a period of time, or as the image capture device 102 moves relative to the text 153 in each of the plurality of frames over a period of time.  Fig. 7: step 750 and 760 suggests estimating motion of object between a particular frame and a previous frame].
In addition, the rationale of claim 1 regarding Koo is used for claim 20. 
Verrilli and Koo do not explicitly teach identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer;
generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Pickering teaches identifying based upon stored data from a character recognition operation previously performed on a region of a previous frame in the frame buffer [Pickering – Col. 4, Lines 12-19: discloses as images are captured and passed to the central processing device, they are stored in an image buffer as shown at step 130 so that they can be compared with each other as detailed below by the image processing modules 140 to 160 where, at module 140, object priority and sensitivity is established; at module 150, frame to frame changes (i.e., comparing the Nth frame with the N-1th frame within a datastream) are checked; and, at module 160, motion is identified and/or predicted].
In addition, the rationale of claim 1 regarding Pickering is used for claim 20.
Verrilli, Koo, and Pickering do not explicitly teach generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters, wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Swan teaches generate a graphical user interface element definition comprising a boundary box corresponding to the region based on the textual data [Swan - Para 0043-0044: discloses generating an output text region [i.e. output text region] that includes input from multiple input text regions]; and
generate a graphical user interface element based on the graphical user interface element definition, wherein the graphical user interface element is superimposed over the boundary box for the frame and for one or more other frames in the plurality of frames, and the graphical user interface element comprises a textual representation of at least a portion of the textual data based on the recognized characters [Swan - Para 0045-0047, Fig. 12: discloses applying character recognition technology to a detected text region, modifying the output video data to include data based on the character values detected in the region.  The text output region may include a rendering of the text];
In addition, the rationale of claim 1 regarding Swan is used for claim 20. 
Verrilli, Koo, Pickering, and Swan do not explicitly teach wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query.

However, Oztaskent teaches wherein the graphical user interface element is user-selectable, and wherein a user-selection of the graphical user interface element initiates a search using the textual data as a search query [Oztaskent – Para 0063, 0072, Fig. 4: teaches the client application can enter a result display mode that transmits the user selections, which can include a selected image, a selected region of interest, a selected face, a selected object, and/or any other suitable portion of an image, to the search server. The client application can then receive and present one or more search results associated with the selected image and/or region of interest to the user.]
In addition, the rationale of claim 1 regarding Oztaskent is used for claim 20. 

Claims 2, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Verrilli, Koo, Pickering, Swan, and Oztaskent as applied to claim 1 above, and further in view of Cummins et al. ("Cummins" US 20150169971).

Regarding claim 2, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 2.  However, Cummins teaches the method of claim 1, wherein the stored data comprises one or more of an estimate of successful recognition and a score describing a likelihood of accurate text recognition [Cummins - Para 0029: discloses an OCR engine confidence score that indicates a confidence level that the associated term has been correctly recognized].
Verrilli, Koo, Pickering, Swan, Oztaskent, and Cummins are analogous art because they are from the same field of character recognition [abstract].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli, Koo, Pickering, Swan, and Oztaskent in view of Cummins to use confidence levels in order to improve the accuracy of the determined text.

Regarding claim 21, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 21.  However, Cummins teaches the method of claim 19, wherein the stored data comprises one or more of an estimate of successful recognition and a score describing a likelihood of accurate text recognition [Cummins - Para 0029: discloses an OCR engine confidence score that indicates a confidence level that the associated term has been correctly recognized].
In addition, the rationale of claim 2 is used for claim 21.

Regarding claim 22, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 22.  However, Cummins teaches the method of claim 20, wherein the stored data comprises one or more of an estimate of successful recognition and a score describing a likelihood of accurate text recognition [Cummins - Para 0029: discloses an OCR engine confidence score that indicates a confidence level that the associated term has been correctly recognized].
In addition, the rationale of claim 2 is used for claim 22.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Verrilli, Koo, Pickering, Swan, and Oztaskent as applied to claim 1 above, and further in view of Ray ("Ray" US 20050105803).

Regarding claim 6, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 6.  However, Ray teaches the method of claim 1 further comprising associating, by the computer system, the graphical user interface element definition with the frame and one or more other frames in the plurality of frames contiguous with the frame according to the order [Ray - Para 0053, Fig. 8: discloses scanning an image in search for text to store as metadata.  After a scan for the image or text, it will proceed to scan the next image in the collection].
Verrilli, Koo, Pickering, Swan, Oztaskent, and Ray are analogous art because they are from the same field of performing image scans [Para 0013].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli, Koo, Pickering, Swan, and Oztaskent in view of Ray to use content scanning in order to perform a scan on different displayed images.

Claims 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Verrilli, Koo, Pickering, Swan, and Oztaskent as applied to claim 12 above, and further in view of Bachman ("Bachman" US 8745650).

Regarding claim 14, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 14.  However, Bachman teaches the method of claim 12 wherein determining the metadata comprises analyzing the video data to detect one or more segments of the video data [Bachman - Col 2, Line 38 – Col 3, Line 8: discloses using metadata to identify content segments, but when no metadata is available, it records and analyzes the video segment for comparison with a backend system to determine its metadata].
Verrilli, Koo, Pickering, Swan, Oztaskent, and Bachman are analogous art because they are from the same field of video segment analysis [Col 2, Line 38].  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Verrilli, Koo, Pickering, Swan, and Oztaskent in view of Bachman to analyze metadata in order to identify content segments.

Regarding claim 15, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 15.  However, Bachman teaches the method of claim 14 wherein the segments of the video data are defined by continuity of audio data [Bachman - Col 2, Line 38 – Col 3, Line 8: discloses segments can be determined using time shifted analysis.  When there is some kind of interruption, depending on how long, it can separate and analyze the segments].
In addition, the rationale for claim 14 can be used for claim 15.

Regarding claim 16, Verrilli, Koo, Pickering, Swan, and Oztaskent do not explicitly teach claim 16.  However, Bachman teaches the method of claim 14 wherein the segments of the video data are defined by continuity of visual data [Bachman - Col 2, Line 38 – Col 3, Line 8: discloses segments can be determined using time shifted analysis.  When there is some kind of interruption, depending on how long, it can separate and analyze the segments].
In addition, the rationale for claim 14 can be used for claim 16.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi, can be reached at 571.272.4195. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JAYCEE IMPERIAL/           Examiner, Art Unit 2426


/NASSER M GOODARZI/           Supervisory Patent Examiner, Art Unit 2426