DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office action is responsive to the Request for Continued Examination (RCE) filed under 37 CFR 1.114 for the instant application on May 18, 2022. The RCE has been properly filed and entered into the application, and an examination on the merits follows.
 	Claims 1-6, 11-14, and 17-18 are amended; and claims 1-20 are pending and have been considered below.



Claim Rejections - 35 USC § 103
 	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
 	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

 	Claims 1, 3-11, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sull et al. (U.S. 2007/0044010) in view of White et al. (U.S. 2015/0370806) and further in view of Chen et al. (U.S. Patent No. 8,874,584). 
With regard to claim 1, Sull teaches one or more computer storage media storing computer-useable instructions that, when used by one or more computing devices ([0032] Multimedia data are accessed by ever increasing kinds of devices such as hand-held computers (HHCs), personal digital assistants (PDAs), and smart cellular phones; [0050] The system of the present invention can be a computer server that is operably connected to a network that has connected to it one or more client devices), cause the one or more computing devices to perform operations comprising: 
 	accessing a hierarchical segmentation of a video timeline of a video (Figs. 54-56; Figs. 70-73; [0133]-[0134] Figs. 55-56 are timeline diagrams; [0148]-[0151] timeline comparison), the hierarchical segmentation ([abstract] The multimedia bookmark facilitates the searching of portions or segments of multimedia files; [0030]-[0031] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure where the larger segment contains the smaller segments; [0066] selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; [claim 1] each of the metadata files having at least metadata of one segment to be edited) associating extracted metadata extracted from the video with corresponding video segments ([0286] This metadata is the source of information used by the recommendation engine of the present invention to examine the users' viewing preferences. After extracting the metadata from the EPG channel stream 5104, the multimedia bookmark process 5106 creates a new multimedia bookmark and places the multimedia bookmark into the user's multimedia bookmark folder on the user's storage device 5108) defined by a first level of the hierarchical segmentation ([abstract] The multimedia bookmark facilitates the searching of portions or segments of multimedia files; [0030]-[0031] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure where the larger segment contains the smaller segments; [0066] selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure); 
 	receiving an input identifying a textual search criterion ([abstract] A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet…The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine…Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine; [0017] Each segment may be described by some elementary semantic information using texts; [0030] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment; [0036] an automatic text caption detector to automatically annotate keywords to the temporally segmented shots; [0040] One important source of information about image and video is the text contained therein.  The video can be easily indexed if access to this textual information content is available.  The text provides clear semantics of video and are extremely useful in deducing the contents of video; [0041] There are many ways that segment and recognize text in printed documents; [0047] The multimedia content can be one or more frames of video, audio data, text data such as a string of characters, or any combination or permutation thereof; [0164] The content information 214 may be composed of audio-visual features and textual features); 
 	executing a search of the extracted metadata using the textual search criterion ([abstract] A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet…The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine…Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine; [0017] Each segment may be described by some elementary semantic information using texts; [0030] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment; [0036] an automatic text caption detector to automatically annotate keywords to the temporally segmented shots; [0040] One important source of information about image and video is the text contained therein. The video can be easily indexed if access to this textual information content is available. The text provides clear semantics of video and are extremely useful in deducing the contents of video; [0041] There are many ways that segment and recognize text in printed documents; [0047] The multimedia content can be one or more frames of video, audio data, text data such as a string of characters, or any combination or permutation thereof; [0164] The content information 214 may be composed of audio-visual features and textual features) to identify matching metadata segments of the extracted metadata and corresponding matching video segments of video segments defined by the first level of the hierarchical segmentation ([0017] However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata; [0314] Matching the words in the e-mail text by scanning the e-mail contents for words like "enclose," or "attach" or their equivalent in other languages, preferably the language setting designated by the user); and 
 	emphasizing on the video timeline the corresponding matching video segments from the first level ([0017] However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata). However, Sull does not specifically teach visually emphasizing on the video timeline. White teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference to visually emphasize the matching video segments on the video timeline, as taught by White, in order to achieve a system and method for visually displaying an active learning-based query refinement. However, Sull does not specifically teach: 
- 	by one or more machine learning models
Chen teaches a system for content recognition, search, and retrieval in visual data and extracting distinct activity-agnostic content descriptors from the visual data at each level of a hierarchical content descriptor module [abstract]. Chen also teaches one or more machine learning models ([col. 1, line 10] retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement; [col. 2, lines 51-55] The storage module is searched for visual data containing a content of interest based on a user query.  The user query is then refined using an active learning model based on a set of feedback from a user; [col. 3, lines 50-51] a system for content recognition, search, and retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement; [col. 5, lines 10-12] a multi-level set of activity-agnostic content descriptors (i.e., descriptors which are not dependent on the specific types of activity the system is capable of handling), hierarchical and graph-based indexing, and active learning models for query refinement). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the video timeline as taught by Sull and visual emphasis taught by White, to have included the machine learning model taught by Chen, to have achieved a system and method for content recognition, search, and retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement.
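
For illustration only, and not as a characterization of the claims or of any cited reference, the operations recited in claim 1 (accessing a hierarchical segmentation, searching extracted metadata with a textual criterion, and identifying the matching segments of one level so that they can be emphasized on a timeline) can be sketched as follows. The names Segment, MetadataItem, search_level and emphasis_flags, and all sample data, are hypothetical and do not appear in the application, Sull, White, or Chen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive


@dataclass(frozen=True)
class MetadataItem:
    time: float   # position of the item on the video timeline, in seconds
    text: str     # e.g. a transcript word or a detected tag


def search_level(hierarchy, level_index, items, query):
    """Return the segments of one level whose metadata contains `query`."""
    needle = query.casefold()
    matches = []
    for seg in hierarchy[level_index]:
        # Metadata is associated with a segment when its time falls inside it.
        texts = [it.text for it in items if seg.start <= it.time < seg.end]
        if any(needle in t.casefold() for t in texts):
            matches.append(seg)
    return matches


def emphasis_flags(level, matches):
    """One flag per segment of the level; True means 'emphasize on the timeline'."""
    matched = set(matches)
    return [seg in matched for seg in level]


# Hypothetical two-level segmentation of a 60-second video.
hierarchy = [
    [Segment(0, 30), Segment(30, 60)],                                    # level 0: coarse
    [Segment(0, 10), Segment(10, 30), Segment(30, 45), Segment(45, 60)],  # level 1: fine
]
# Hypothetical extracted metadata.
items = [
    MetadataItem(5, "beach"),
    MetadataItem(20, "Beach volleyball"),
    MetadataItem(50, "city"),
]

first_level_matches = search_level(hierarchy, 0, items, "beach")
flags = emphasis_flags(hierarchy[0], first_level_matches)
```

In this sketch the search is a case-insensitive substring test; how metadata is actually extracted or matched is outside what the sketch assumes.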

With regard to claim 3, the limitations are addressed above and Sull teaches wherein emphasizing the matching video segments comprises animating the corresponding matching video segments on the video timeline by inducing a traveling wave that displaces the representations of the corresponding matching video segments on the video timeline as the traveling wave travels down the video timeline ([0480] For example, MPEG adopts discrete cosine transform (DCT) of 8×8 block into which 64 neighboring pixels are exclusively grouped. Therefor, whatever compression scheme (DCT, discrete wavelet transform, vector quantization, etc.) is adopted for a given block, one need only decompress a small number of blocks in an intra-coded frame, instead of decoding the whole blocks composing the frame when only few pixels out of the whole pixels are needed; [0487] For some compression schemes using the discrete cosine transform (DCT) for intra-frame coding like Motion-JPEG and MPEG or any other transform domain compression schemes such as discrete wavelet transform, it is further possible to reduce the time for constructing visual rhythm). However, Sull does not specifically teach:
visually emphasizing the matching video segments
White teaches systems and methods for video editing and playback [abstract]. White also teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference to visually emphasize the matching video segments on the video timeline, as taught by White, in order to achieve a system and method for visually displaying an active learning-based query refinement.

With regard to claim 4, the limitations are addressed above and Sull teaches the operations further comprising emphasizing the matching metadata segments on a composite list of the extracted metadata segmented ([0066] The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; [0407]; [0408] composing segment of which the metadata is newly defined in the metafile of the edited video such as segments 3380 and 3382) at locations in the composite list corresponding to boundaries of the corresponding video segments defined by the first level of the hierarchical segmentation (Fig. 8; Figs. 55-56; Figs. 70-73; [0066] The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; [0407]; [0408] composing segment of which the metadata is newly defined in the metafile of the edited video such as segments 3380 and 3382). However, Sull does not specifically teach: 
visually emphasizing the matching video segments
White teaches systems and methods for video editing and playback [abstract]. White also teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference to visually emphasize the matching video segments on the video timeline, as taught by White, in order to achieve a system and method for visually displaying an active learning-based query refinement.

With regard to claim 5, the limitations are addressed above and Sull teaches the operations further comprising causing the matching metadata segments to be emphasized in a metadata panel (Fig. 6; Fig. 35; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0065] A metadata file is created for each of the video files, each of the metadata files having at least one segment to be edited; [0084] FIG. 6 is an example of two multimedia contents and their associated metadata; [0113] FIG. 35 is a flowchart of an exemplary method of the present invention for virtual video editing based on metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata) and the corresponding matching video segments to be emphasized on the video timeline using a same type of emphasis ([0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0036] For scene change detection, a matching process between two consecutive frames is required; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0163] The content information 214 is used for visually displaying multimedia bookmarks in a bookmark list 208, as well as for searching one or more multimedia content databases for the content that matches the content information 214; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata). However, Sull does not specifically teach: 
visually emphasized
White teaches systems and methods for video editing and playback [abstract]. White also teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference to visually emphasize the matching video segments on the video timeline, as taught by White, in order to achieve a system and method for visually displaying an active learning-based query refinement.

With regard to claim 6, the limitations are addressed above and Sull teaches the operations further comprising: 
 	segmenting, in response to an input navigating from the first level to a different level of the hierarchical segmentation ([0170] The key frame hierarchy illustrated in FIG. 4 is a tree-structured representation for multi-level abstraction of a video by key frames, where a node denotes each key frame. A number Df is associated with each node and represents the maximum distance between the low-level feature vector of the node 414 and those of its decendent nodes in its subtree (for example, nodes 416 and 418)), a composite list of the extracted metadata into an updated set of metadata segments ([0066] The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; [0407]; [0408] composing segment of which the metadata is newly defined in the metafile of the edited video such as segments 3380 and 3382), the composite list segmented at locations in the composite list corresponding to boundaries of a second set of video segments defined by the different level of the hierarchical segmentation ([0066] The present invention also includes a method for editing a multimedia file by providing a metafile, the metafile having at least one segment that is selectable; selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure; [0178] FIG. 6 shows an example of two multimedia contents and their associated metadata. Since the first multimedia content has five variations and the second has three variations, there are five media profiles in the metadata of the first multimedia content 602, and three media profiles in the metadata of the second 604; [0183] FIG. 7 shows an example of a list of bookmarks 702 for the variations of two multimedia contents in FIG. 6. The list contains the first and second bookmarks 704 and 706 for the first variation…Thus, these two bookmarks have the same metadata ID referring to the second multimedia content); 
 	identifying an updated set of matching metadata segments, from the updated set of metadata segments defined by the different level, that match the search criterion ([0017] However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata; [0314] Matching the words in the e-mail text by scanning the e-mail contents for words like "enclose," or "attach" or their equivalent in other languages, preferably the language setting designated by the user); and 
 	emphasizing on the video timeline an updated set of matching video segments, of the second set of video segments defined by the different level, corresponding to the updated set of matching metadata segments ([0017] However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata; [0314] Matching the words in the e-mail text by scanning the e-mail contents for words like "enclose," or "attach" or their equivalent in other languages, preferably the language setting designated by the user). However, Sull does not specifically teach: 
visually emphasizing…matching video segments
White teaches systems and methods for video editing and playback [abstract]. White also teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference to visually emphasize the matching video segments on the video timeline, as taught by White, in order to achieve a system and method for visually displaying an active learning-based query refinement.
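
For illustration only, and not as a characterization of the claims or of any cited reference, the level-navigation limitation of claim 6 (re-segmenting a composite list of metadata at the boundaries of a different level and then identifying an updated set of matches) can be sketched as follows. The function names, the boundary lists, and the sample metadata are hypothetical and do not appear in the application or in Sull, White, or Chen.

```python
from bisect import bisect_left


def split_at_boundaries(composite, boundaries):
    """Split a time-sorted list of (time, text) items at the given boundary times.

    `boundaries` holds the sorted segment start times of the selected level
    followed by the end time of the last segment.
    """
    times = [t for t, _ in composite]
    cuts = [bisect_left(times, b) for b in boundaries]
    return [composite[lo:hi] for lo, hi in zip(cuts, cuts[1:])]


def matching_indices(groups, query):
    """Indices of the metadata segments containing `query` (case-insensitive)."""
    needle = query.casefold()
    return [
        i for i, group in enumerate(groups)
        if any(needle in text.casefold() for _, text in group)
    ]


# Hypothetical composite list of extracted metadata, sorted by time in seconds.
composite = [(5, "beach"), (20, "Beach volleyball"), (50, "city")]

# Hypothetical boundaries for two levels of the same 60-second timeline.
levels = {
    "coarse": [0, 30, 60],
    "fine": [0, 10, 30, 45, 60],
}

# Navigating from the coarse level to the fine level re-segments the same list.
coarse_matches = matching_indices(split_at_boundaries(composite, levels["coarse"]), "beach")
fine_matches = matching_indices(split_at_boundaries(composite, levels["fine"]), "beach")
```

The same composite list yields one matching segment at the coarse level and two at the fine level, which is the sense in which the set of matches is "updated" when the level changes.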

With regard to claim 7, the limitations are addressed above and Sull teaches wherein the extracted metadata comprises transcribed audio of the corresponding video segments ([0164] The content information 214 may be composed of audio-visual features and textual features. The audio-visual features are the information, for example, obtained by capturing or sampling the multimedia content 204 at the bookmarked position 206; [0166] In the case of an audio bookmark of the present invention, the positional information 212 is composed of a URI, a URL, or the like, and a bookmarked position such as elapsed time. Similarly, the content information 214 is composed of audio-visual features such as the sampled audio signal (typically of short duration) and its visualized image.  The content information 214 of an audio bookmark 210 is also composed of such textual features as a title, optionally specified by a user or simply delivered with the content, and annotated text of an audio segment corresponding to the bookmarked position).

With regard to claim 8, the limitations are addressed above and Sull teaches wherein the extracted metadata comprises tags visually extracted from video frames of the corresponding video segments ([abstract] A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images…Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine; [0154] The present invention provides various methodologies for tagging multimedia files to facilitate the indexing, searching, and retrieving of the tagged files. The tags themselves can be embedded in the electronic file, or stored separately in, for example, a search engine database; [0155] Other aspects of the present invention include using hypershell and other techniques to read text information embedded in multimedia files for use in indexing, particularly tag indexes. Still more methods of the present invention enable the virtual editing of multimedia files by manipulating metadata and/or tags rather than editing the multimedia files themselves; [0161] The method and system of the present invention include a tag that can contain information about all or a portion of a multimedia file).

With regard to claim 9, the limitations are addressed above and Sull teaches wherein the extracted metadata comprises log event tags extracted from a temporal log associated with the video ([abstract] A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images…Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine; [0154] The present invention provides various methodologies for tagging multimedia files to facilitate the indexing, searching, and retrieving of the tagged files. The tags themselves can be embedded in the electronic file, or stored separately in, for example, a search engine database; [0155] Other aspects of the present invention include using hypershell and other techniques to read text information embedded in multimedia files for use in indexing, particularly tag indexes. Still more methods of the present invention enable the virtual editing of multimedia files by manipulating metadata and/or tags rather than editing the multimedia files themselves; [0161] The method and system of the present invention include a tag that can contain information about all or a portion of a multimedia file).

With regard to claim 10, the limitations are addressed above and Sull teaches the operations further comprising detecting the input selecting a metadata tag as the search criterion from a popup list of top metadata tags in the extracted metadata (Fig. 9; Fig. 28; [abstract] Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine; [0044] There is, therefore, a need in the art for a method and system that will enable the tagging of multimedia images for indexing, editing, searching and retrieving. There is also a need in the art to enable the indexing of textual information that is embedded in graphical images or other multimedia data so that the text in the image can also be tagged, indexed, searched and retrieved, as is other textual information; [0163] The content information 214 is used for visually displaying multimedia bookmarks in a bookmark list 208, as well as for searching one or more multimedia content databases for the content that matches the content information 214; [0168] The search engine then retrieves a list of relevant segments 334 with their positional information such as URI, URL and the like, and the relative position.  With a multimedia player 336, a user can start playing from the retrieved segments of the contents. The retrieved segments 334 are usually those segments having contents relevant or similar to the content information saved in the multimedia bookmark; [0194] FIG. 9 shows an example of a user interface incorporating the multimedia bookmark of the present invention.  The user interface 900 is composed of a playback area 912 and a bookmark list 916; [0492] As illustrated in FIG. 28, the method of the present invention similarly enables to locate the caption text 2804 of a frame 2802, as well as multiple captions 2808, 2810, and 2812 from another frame 2806 and extract the text and obtain the binarized results 2804', 2808', 2810', and 2812' for subsequent processing, recognizing text, indexing, storing and retrieving).

With regard to claim 11, the method claim corresponds to the media claim 1, respectively, and is therefore rejected with the same rationale.

With regard to claim 13, the method claim corresponds to the media claim 3, respectively, and is therefore rejected with the same rationale.

With regard to claim 14, the method claim corresponds to the media claim 4, respectively, and is therefore rejected with the same rationale.

With regard to claim 15, the method claim corresponds to the media claim 6, respectively, and is therefore rejected with the same rationale.

With regard to claim 16, the method claim corresponds to the media claim 10, respectively, and is therefore rejected with the same rationale.

With regard to claim 17, the system claim corresponds to the media claim 1, respectively, and is therefore rejected with the same rationale.

With regard to claim 19, the system claim corresponds to the media claim 6, respectively, and is therefore rejected with the same rationale.

With regard to claim 20, the limitations are addressed above and Sull teaches the operations further comprising: 
 	placing a set of the corresponding matching video segments from the level into an operational queue ([0021] a user can manually establish relevance between a query and retrieved images, and the relevant images can be used for refining the query; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0055] The server may then generate a query for each feature received and, subsequently, use each query generated to search one or more storage devices; [0057] Upon initiation of a search request at the user system, a query message including multimedia features is preferably broadcast to the peer to peer environment. Upon receipt of the query message, a multimedia search engine on a multimedia database included in a storage device on one or more active nodes is preferably executed; [0061] The present invention also provides a new method to fast find from a large database of image/frames the objects close enough to a query image/frame under a certain distortion); and 
 	executing an operation on a set of the video segments in the operational queue (Figs. 30-31; Fig. 54; [0168] Content information characteristics such as captured frame 322, sampled audio data 324, annotated text of the segment corresponding to a bookmarked position 328, and the title delivered with the content 330 can be used as query input to a multimedia search engine 332; [0169] The method arranges key frames in a hierarchical fashion to enable fast and accurate searching of frames similar to a query image; [0195] The search control button 924 is used for searching multimedia database for multimedia contents relevant to the selected content information 914 as a multimedia query input).




 	Claims 2, 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sull et al. (U.S. 2007/0044010) in view of White et al. (U.S. 2015/0370806) in view of Chen et al. (U.S. Patent No. 8,874,584) and further in view of Balakrishnan et al. (U.S. 2015/0005646).
With regard to claim 2, the limitations are addressed above and Sull teaches wherein emphasizing the matching video segments comprises the representations of the corresponding matching video segments on the video timeline ([0017] However, in this scenario, it is essential to know the start position of recorded video with respect to the video stream used to generate the metadata in the server/content provider in order to match the temporal position referenced by the metadata to the position of the recorded video; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata). However, Sull does not specifically teach visually emphasizing on the video timeline. White teaches visually emphasizing on the video timeline the corresponding matching video segments (Fig. 5, selected portion 230; Fig. 8, visual like indicators 840; Fig. 9; Figs. 11-12; [0094] visual like indicators; [0103] The various available filters can be visually provided to the user for selection via a user interface such as that shown in FIGS. 11 and 12 (STEP 1006); [0104] Points of interest can be matched directly or semantics in the name of the point of interest (e.g. the word “beach”) can be used. In addition, images and video can be processed to recognize particular activities (e.g., snowboarding, skydiving, driving, etc.), particular objects or scenes (e.g., the Empire State Building, the Boston skyline, a snow-covered mountain), weather, lighting conditions, and so on, and audio can be processed to further inform the recognition process (e.g., the sounds of a beach, a crowd, music, etc., can be processed to help identify a location or event)). Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the Sull reference, to have visually indicated the video timeline as taught by White, to have achieved a system and method for displaying visually an active learning-based query refinement. However, Sull does not specifically teach: 
- 	animating by inducing a transient oscillating displacement of the corresponding matching video segments 
Balakrishnan teaches heart rates and beat lengths extracted from videos by measuring subtle head motion caused by the Newtonian reaction to the influx of blood at each beat [abstract]. Balakrishnan also teaches animating by inducing a transient oscillating displacement of the corresponding matching video segments ([0003] Several sources of involuntary head movement can complicate the isolation of movements attributable to pulsatile activity. One is the pendular oscillatory motion that keeps the head in dynamic equilibrium; [0012] In an embodiment of the present invention, subtle head oscillations that accompany the cardiac cycle are exploited to extract information about cardiac activity from videos; [0013] In an embodiment, a method includes selecting a region of a video, tracking features of the selected region of the video, and analyzing the features of the selected region of the video to determine an oscillation rate of a subject shown in the video; [0022] In an embodiment, a system can include a selection module configured to select a region of a video, a tracking module configured to track features of the selected region of the video, and an analysis module the features of the selected region of the video to determine an oscillation rate of a subject shown in the video; [0075] if head displacement is proportional to the force of blood being pumped by the heart, it may serve as a useful metric to estimate blood stroke volume and cardiac output). 
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which said subject matter pertains to have modified the video timeline as taught by Sull and visual emphasis taught by White and machine learning model taught by Chen, to have included the oscillatory motion and displacement as taught by Balakrishnan, to have achieved a system and method of indexing and searching multimedia files as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of the multimedia information.

With regard to claim 12, the method claim corresponds to the media claim 2, respectively, and is therefore rejected with the same rationale.

With regard to claim 18, the system claim corresponds to the media claim 2, respectively, and is therefore rejected with the same rationale.



Response to Arguments
 	Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The White reference was incorporated as it teaches a system for video editing and p

 	In Applicant’s arguments, the Sull reference is stated to not teach the limitation, “a hierarchical segmentation of a video timeline of a video, the hierarchical segmentation associating extracted metadata extracted from the video by one or more machine learning models with corresponding video segments defined by a first level of the hierarchical segmentation.” The Chen reference was incorporated as it teaches a system for content recognition, search, and retrieval in visual data and extracting distinct activity-agnostic content descriptors from the visual data at each level of a hierarchical content descriptor module [abstract]. Chen teaches one or more machine learning models ([col. 1, line 10] retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement; [col. 2, lines 51-55] The storage module is searched for visual data containing a content of interest based on a user query.  The user query is then refined using an active learning model based on a set of feedback from a user; [col. 3, lines 50-51] a system for content recognition, search, and retrieval in visual data which combines multi-level descriptor generation, hierarchical indexing, and active learning-based query refinement; [col. 5, lines 10-12] a multi-level set of activity-agnostic content descriptors (i.e., descriptors which are not dependent on the specific types of activity the system is capable of handling), hierarchical and graph-based indexing, and active learning models for query refinement). The Sull reference teaches a system and method for tagging, indexing, searching, retrieving, manipulating and editing video images on a wide area network such as the internet. Figures 54-56 display a timeline diagram (accessing a hierarchical segmentation of a video timeline of a video) of the rewind method and Figures 70-73 display a timeline comparison of the offset recording capability. 
Sull teaches that the multimedia bookmark facilitates the searching of portions or segments of multimedia files [abstract]. The metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment.  The metadata of segments can form a hierarchical structure where the larger segment contains the smaller segments, which shows that a hierarchical segmentation of the video exists ([0030] – [0031]). Additionally, Sull teaches selecting a segment (hierarchical segmentation) in the metafile as further determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure ([0030] – [0031]; [claim 1] each of the metadata files having at least metadata of one segment to be edited). Furthermore, Sull teaches associating extracted metadata extracted from the video with corresponding video segments ([0286] This metadata is the source of information used by the recommendation engine of the present invention to examine the users' viewing preferences.  After extracting the metadata from the EPG channel stream 5104, the multimedia bookmark process 5106 creates a new multimedia bookmark and places the multimedia bookmark into the user's multimedia bookmark folder on the user's storage device 5108) defined by a first level of the hierarchical segmentation ([abstract] The multimedia bookmark facilitates the searching of portions or segments of multimedia files; [0030]-[0031] Metadata of a video segment contain textual information such as time information (for example, starting frame number and duration, or starting frame number as well as the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment.  The metadata of segments can form a hierarchical structure where the larger segment contains the smaller segments; [0066] selecting a segment in the metafile; determining if a composing segment should be created, and if the composing segment should be created, then creating a composing segment in a hierarchical structure). Sull goes on to teach the limitation of emphasizing upon the video timeline the corresponding matching video segments such as by knowing the start position of recorded video with respect to the video stream used to generate the metadata in order to match the temporal position referenced by the metadata to the position of the recorded video ([0017]; [0018] The exact location of the TV program, then, can be easily found by simply matching the reference frames with all the recorded frames for the program; [0023] For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find best matches to the query image/video from the database; [0053] The present invention also relates to the techniques to solve the two problems by downloading the metadata from a distant metadata server and then synchronizing/matching the content with the received metadata; [0291] The delivered metadata might include a set of time codes/frame numbers pointing to the segments of the video content of interest. Since these time codes are defined relative to the start of the video used to generate the metadata, they are meaningful only when the start of the recorded video matches that of the video used for metadata). However, Sull does not specifically mention one or more machine learning models as specified in the amended claim language. The Chen reference teaches that particular aspect of the claims and the combination of the references sufficiently teaches the claim language.


Conclusion
 	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANDREA C. LEGGETT whose telephone number is (571)270-7700. The examiner can normally be reached M-F 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Ell, can be reached at 571-270-3264. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ANDREA C LEGGETT/Primary Examiner, Art Unit 2171