Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over US 2017/0124400 A1 to Yehezkel et al., hereinafter “Yehezkel”.
Claim 1. A system comprising: a processor; an artificial intelligence (AI) chip; and non-transitory computer readable medium containing programming instructions that, when executed, will cause the processor to: Yehezkel [0018] teaches FIG. 1 is a block diagram of an example of an environment including a system 100 for automatic video summarization, according to an embodiment. The system 100 may include a camera 105 (to capture the video), a storage device 110 (to buffer or store the video), a semantic classifier 115, a relevancy classifier 120, and a multiplexer 125. All of these components are implemented in electromagnetic hardware, such as circuits (e.g., circuit sets described below), processors, memories, disks, etc. In an example, some or all of these components may be co-located in a single device 130.
access a plurality of image frames of a first video; Yehezkel [0019] teaches the storage device 110 is arranged to hold the video. In an example, the video is delivered to the storage device 110 from the camera 105. In an example, the video is delivered by another entity, such as a mobile phone, personal computer, etc. that obtained access to the video at some point. The storage device 110 provides the store from which other components of the system 100 may retrieve and analyze frames, or other data, of the video.
Yehezkel [0058] teaches the component 420 learns a pseudo-semantic domain. This is done using an unsupervised learning algorithm, such as training a deep Boltzmann machine, spectral embedding, an auto-encoder, sparse filtering, etc. The “semantics” naturally arise from the type of features used. The unsupervised learning algorithm reduces noise while maintaining semantic interpretation. Thus, two frames that have similar “semantic” information with respect to other video frames (e.g., transductive inference) are mapped to points with small distance between them with respect to mappings of other video frames. The optional input 2 allows for crowdsourcing by using semantic domains from videos captured under the same or similar context (e.g., preferences of the camera operator, time, place, event, etc.).
where $\tilde{X}$ is the set of key frames, X is the set of mapped frames, and y is a constant controlling regularization. The input 3 to the component 435 is an optional input in which data from other videos with the same or similar context may be used to, for example, identify the key frames. The output of component 435 is the generative model that, when queried, provides a semantic-similarity metric with regard to frames in the video (e.g., a measure of how semantically similar two frames are).
use the AI chip to determine first feature descriptors of the first video, each of the first feature descriptors associated with a respective image frame of the plurality of image frames of the first video; Yehezkel [0021] teaches the low level features include a GIST descriptor. A GIST descriptor may be computed by convolving a frame with a number of Gabor filters at different scales and orientations to produce a number of feature maps. In an example, there are thirty two Gabor filters, four scales, and eight orientations used to produce thirty two feature maps for the GIST descriptor. These feature maps may then be divided into a number of regions (e.g., sixteen regions or four by four grids) in which the average feature values of each region are calculated. Finally, the averaged values may be concatenated (e.g., joined) to produce the GIST descriptor. Other low level feature techniques may be used, such as Hough transforms to identify shapes or lines in the frames, color based measurements, etc. In an example, metadata of the frames may be measured for feature extraction, such as the geographic location of the frame capture. In an example, low level sound features may be used. In an example, Mel-frequency cepstral coefficients (MFCCs) may be employed as low level features. In general, audio cues, such as the presence of loud noise or the absence of noise may contribute to identifying interesting (e.g., relevant) portions of the video.
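For illustration only, the region-averaging and concatenation steps that Yehezkel [0021] describes can be sketched as follows. This is a minimal sketch, not Yehezkel's implementation: the Gabor filtering is assumed to have been done already, each feature map is assumed to be a list of rows whose dimensions divide evenly by the grid size, and the names `region_averages` and `gist_like_descriptor` are hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]  # one filter response, stored as rows of values


def region_averages(feature_map: FeatureMap, grid: int = 4) -> List[float]:
    """Average a 2-D feature map over a grid x grid layout of equal regions."""
    rows, cols = len(feature_map), len(feature_map[0])
    if rows % grid or cols % grid:
        # Assumption of this sketch: regions are equal-sized.
        raise ValueError("map dimensions must be divisible by the grid size")
    region_h, region_w = rows // grid, cols // grid
    averages = []
    for grid_row in range(grid):
        for grid_col in range(grid):
            total = 0.0
            for r in range(grid_row * region_h, (grid_row + 1) * region_h):
                for c in range(grid_col * region_w, (grid_col + 1) * region_w):
                    total += feature_map[r][c]
            averages.append(total / (region_h * region_w))
    return averages


def gist_like_descriptor(feature_maps: List[FeatureMap], grid: int = 4) -> List[float]:
    """Concatenate the region averages of every feature map into one vector."""
    descriptor: List[float] = []
    for feature_map in feature_maps:
        descriptor.extend(region_averages(feature_map, grid))
    return descriptor
```

With thirty two feature maps and a four by four grid, the resulting vector has 32 x 16 = 512 entries per frame.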
Yehezkel [0023] teaches once features of the frames are extracted, the semantic classifier 115 organizes the frames in a data structure based on the extracted features. Such organization provides a meaningful way for the model to represent the commonality of frames based on the commonality of the respective features of the frames. In an example, generating the semantic model includes the semantic classifier 115 arranged to generate a pseudo-semantic domain from the extracted frame features. Such a pseudo-semantic domain is an n-dimensional space derived from the found features. For example, if each frame were measured on three features, the respective measurements of each feature would be a coordinate in a three dimensional space for the respective frame.
Yehezkel [0024] teaches the pseudo-semantic domain may be processed and realized by a number of artificial intelligence networks. For example, the extracted features (e.g., those features found in the frames) may be used to train a deep Boltzmann machine, a type of neural network initialized and trained without supervision. A variety of other unsupervised artificial intelligence models may also be used. In an example, however the pseudo-semantic domain is created, it is created from only the features present in the video's frames and not from an external source. As will be discussed later, this feature scales the differences between the frames to permit sub-scene differentiation across a wider variety of subject videos than current techniques allow. Other example artificial intelligence techniques that may be used include generative models, such as probabilistic graphical models or mixture models.
Yehezkel [0025] teaches after the pseudo-semantic domain is created, the semantic classifier 115 maps the individual frames to the pseudo-semantic domain. As noted above, such mapping may include using individual feature extraction values as coordinates for the frames. These values may be normalized so as to function as valid coordinates in the n-dimensional space together. In an example, the normalization is not performed, and the raw values are used. In the example of the pseudo-semantic domain built using a network, such as the deep Boltzmann network, mapping the individual frames may simply involve feeding each frame through the network to arrive at the resultant coordinates particular to that frame in the n-dimensional space.
Yehezkel [0027-0028]
access second feature descriptors of a second video, each of the second feature descriptors associated with a respective image frame of a plurality of image frames of the second video; Yehezkel [Abstract] teaches system and techniques for automatic video summarization are described herein. A video may be obtained and a semantic model of the video may be generated from frames of the video. Respective relevancy scores may be assigned to the frames. The semantic model may be initialized with the respective relevancy scores. The semantic model may then be iteratively processed to produce sub-scenes of the video, the collection of sub-scenes being the video summarization.
Yehezkel [0023] teaches once features of the frames are extracted, the semantic classifier 115 organizes the frames in a data structure based on the extracted features. Such organization provides a meaningful way for the model to represent the commonality of frames based on the commonality of the respective features of the frames. In an example, generating the semantic model includes the semantic classifier 115 arranged to generate a pseudo-semantic domain from the extracted frame features. Such a pseudo-semantic domain is an n-dimensional space derived from the found features. For example, if each frame were measured on three features, the respective measurements of each feature would be a coordinate in a three dimensional space for the respective frame.
Yehezkel [0024] teaches the pseudo-semantic domain may be processed and realized by a number of artificial intelligence networks. For example, the extracted features (e.g., those features found in the frames) may be used to train a deep Boltzmann machine, a type of neural network initialized and trained without supervision. A variety of other unsupervised artificial intelligence models may also be used. In an example, however the pseudo-semantic domain is created, it is created from only the features present in the video's frames and not from an external source. As will be discussed later, this feature scales the differences between the frames to permit sub-scene differentiation across a wider variety of subject videos than current techniques allow. Other example artificial intelligence techniques that may be used include generative models, such as probabilistic graphical models or mixture models.
Yehezkel [0025] teaches after the pseudo-semantic domain is created, the semantic classifier 115 maps the individual frames to the pseudo-semantic domain. As noted above, such mapping may include using individual feature extraction values as coordinates for the frames. These values may be normalized so as to function as valid coordinates in the n-dimensional space together. In an example, the normalization is not performed, and the raw values are used. In the example of the pseudo-semantic domain built using a network, such as the deep Boltzmann network, mapping the individual frames may simply involve feeding each frame through the network to arrive at the resultant coordinates particular to that frame in the n-dimensional space.
Yehezkel [0027-0028]
Yehezkel [0058] teaches the component 420 learns a pseudo-semantic domain. This is done using an unsupervised learning algorithm, such as training a deep Boltzmann machine, spectral embedding, an auto-encoder, sparse filtering, etc. The “semantics” naturally arise from the type of features used. The unsupervised learning algorithm reduces noise while maintaining semantic interpretation. Thus, two frames that have similar “semantic” information with respect to other video frames (e.g., transductive inference) are mapped to points with small distance between them with respect to mappings of other video frames. The optional input 2 allows for crowdsourcing by using semantic domains from videos captured under the same or similar context (e.g., preferences of the camera operator, time, place, event, etc.).
where $\tilde{X}$ is the set of key frames, X is the set of mapped frames, and y is a constant controlling regularization. The input 3 to the component 435 is an optional input in which data from other videos with the same or similar context may be used to, for example, identify the key frames. The output of component 435 is the generative model that, when queried, provides a semantic-similarity metric with regard to frames in the video (e.g., a measure of how semantically similar two frames are).
and compare the first feature descriptors and the second feature descriptors to determine a subset of image frames in the second video. Yehezkel [0017] teaches creating the semantic model in this way allows for the intrinsic differences between sub-scenes to define sub-scene boundaries, rather than relying on arbitrary timing or specially trained classifiers. The system does use classifiers for relevancy cues, but the semantic model permits a much less accurate relevancy classification to be used to produce useful results. Thus, classifiers trained in different environments and scenarios may be used because the results do not depend upon the ultimate objective accuracy of the classifier to what one would consider relevant, but rather on the comparative relevancy within the video. Finally, the system combines the generated semantic model with the imperfect relevancy classification to iteratively generate sub-scenes from the video and thus automatically summarize the video.
Yehezkel [0027] teaches a set of key frames from the mapped frames may be identified. The mapped frames, or points in the n-dimensional space, represent points on a surface of a manifold in the n-dimensional space. It is ultimately this manifold which is the underlying model, however, its exact definition is not necessary to perform the techniques described herein. In fact, a subset of the frames, the key frames, may be used instead. The key frames are single frames that represent a group of frames for a semantic concept. For example, a cluster of frames in the n-dimensional space represent a similar scene. A frame from the cluster may therefore represent the cluster and is thus a key frame. A variety of key frame identification techniques may be employed, such as finding a kernel to a cluster. In an example, the key frames may be recursively identified by scoring the frames and successively taking the top scoring frame until a threshold number of key frames are acquired. In an example, the threshold is determined by the length of the video. In an example, the threshold is determined by a number of identified clusters in the n-dimensional space. In an example, where the scoring of frames involves distance between key frames, the threshold is a minimum distance between frames. That is, if the distance between two frames is below the threshold, the recursive search stops.
Yehezkel [0028] teaches the key frames may be scored by distance. Here, identifying frames that are far apart from each other identifies parts of the video that show different things. To score the distance between the frames, a first frame is chosen as the first key frame. In an example, the first frame is chosen based on being the farthest from the origin of the n-dimensional space. A second frame is chosen to be in the set of key frames by choosing the frame that is farthest from the first frame. The third frame chosen is the farthest from both the first frame and the second frame in the set of key frames. As noted above, this may continue until the distance between the nth frame is below a threshold. Thus, the set of key frames is a recursive identifying of key frames by adding a next frame to the set of key frames with the highest score in the set of frames that are mapped. The score of a frame in this example, being the inverse of the sum of a square norm of the coordinate of the frame multiplied by a constant and divided by the square of the norm of the distance between the frames and another frame in the set of key frames for all members of the set of key frames.
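For illustration only, the distance-based key frame selection that Yehezkel [0028] describes can be sketched as a greedy farthest-point selection. This is a simplified sketch and not Yehezkel's scoring equation: it assumes that "farthest from" the existing key frames means the largest distance to the nearest key frame, and the function name `select_key_frames` and parameter `min_distance` are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]  # coordinates of one mapped frame


def select_key_frames(points: List[Point], min_distance: float) -> List[int]:
    """Return indices of key frames chosen by greedy farthest-point selection."""
    if not points:
        return []
    origin = [0.0] * len(points[0])
    # First key frame: the mapped frame farthest from the origin.
    first = max(range(len(points)), key=lambda i: math.dist(points[i], origin))
    keys = [first]
    # nearest[i] holds the distance from frame i to its closest key frame so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(keys) < len(points):
        candidate = max(range(len(points)), key=lambda i: nearest[i])
        if nearest[candidate] < min_distance:
            break  # remaining frames are all close to an existing key frame
        keys.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return keys
```

In this sketch, frames that lie within `min_distance` of an already selected key frame are never selected, so each cluster of nearby frames contributes at most one key frame.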
Yehezkel [0058] teaches the component 420 learns a pseudo-semantic domain. This is done using an unsupervised learning algorithm, such as training a deep Boltzmann machine, spectral embedding, an auto-encoder, sparse filtering, etc. The “semantics” naturally arise from the type of features used. The unsupervised learning algorithm reduces noise while maintaining semantic interpretation. Thus, two frames that have similar “semantic” information with respect to other video frames (e.g., transductive inference) are mapped to points with small distance between them with respect to mappings of other video frames. The optional input 2 allows for crowdsourcing by using semantic domains from videos captured under the same or similar context (e.g., preferences of the camera operator, time, place, event, etc.).
where $\tilde{X}$ is the set of key frames, X is the set of mapped frames, and y is a constant controlling regularization. The input 3 to the component 435 is an optional input in which data from other videos with the same or similar context may be used to, for example, identify the key frames. The output of component 435 is the generative model that, when queried, provides a semantic-similarity metric with regard to frames in the video (e.g., a measure of how semantically similar two frames are).
Yehezkel [0075] teaches the user interface 900 illustrates a search-engine-like interface, where presented sub-scenes are ordered in decreasing relevancy (as detected automatically). The user can slide with her finger up or down on the screen to browse sub-scenes. If, for example, the user browses beyond an end of the sub-scenes already generated, the system may produce additional sub-scenes to populate the menu (e.g., by activating the feedback from component 720 to component 715).
It would have been obvious, before the effective filing date of the claimed invention, to one of ordinary skill in the art to modify and combine the embodiments of Yehezkel. One skilled in the art would have been motivated to modify the embodiments in this manner because it would allow different processes to achieve optimal results and would not cause significant change to the design.
Claim 2. The system of claim 1 further comprising programming instructions configured to display a query output based on the subset of image frames in the second video, wherein the subset of image frames in the second video include image frames in the second video that are similar to the first video, and wherein the query output includes a slide show of the subset of image frames. Yehezkel [0020] teaches the semantic classifier 115 is arranged to generate a semantic model of the video from the frames of the video. As used herein, the semantic model is a device by which to represent the similarity between frames. In an example, to generate the model, the semantic classifier 115 is arranged to extract features of the frames.
Yehezkel [0051] teaches the system 200 employs a user interface to present identified sub-scenes and accept user input. The user interface may provide an additional feature to the system 200 of allowing the user to manually select the relevant sub-scenes from a relevancy-wise ordered list (similar to a search engine result list, or a decision-support system). Since the operation of the component 240 is iterative, this list can be grown on-the-fly, in real-time. Moreover, in case a video's fully-automatic summarization differs from a semiautomatic summarization (e.g., including human input), the system 200 can update its semantic model (online and active learning schemes) to incorporate the user's feedback by adjusting the relevancy cues (e.g., relevancy label's domain 220) or intrinsic model (e.g., pseudo-semantic domain 230).
where $\tilde{X}$ is the set of key frames, X is the set of mapped frames, and y is a constant controlling regularization. The input 3 to the component 435 is an optional input in which data from other videos with the same or similar context may be used to, for example, identify the key frames. The output of component 435 is the generative model that, when queried, provides a semantic-similarity metric with regard to frames in the video (e.g., a measure of how semantically similar two frames are).
Claim 3. The system of claim 1 further comprising programming instructions configured to display a query output based on the subset of image frames in the second video, wherein the subset of image frames in the second video includes image frames in the second video that are similar to the first video, and wherein the query output includes a video comprising at least the subset of image frames. Yehezkel [0027] teaches the key frames are single frames that represent a group of frames for a semantic concept. For example, a cluster of frames in the n-dimensional space represent a similar scene.
Yehezkel [0041] teaches as each sub-scene is selected, a clip of the sub-scene is created. Creating the clip may involve simply identifying the frames that are part of the clip. In an example, creating the clip includes copying the sequence of frames to create the clip.
Claim 4. The system of claim 1, wherein the programming instructions for comparing the first feature descriptors and the second feature descriptors to determine the subset of image frames in the second video further comprise programming instructions configured to: for each image frame of the plurality of image frames in the second video, determine whether the image frame is similar to the first video; Yehezkel [0027] teaches the key frames are single frames that represent a group of frames for a semantic concept. For example, a cluster of frames in the n-dimensional space represent a similar scene.
and determine the subset of image frames that are similar to the first video. Yehezkel [0026] teaches the semantic model is generated when the frames are placed in the n-dimensional metric-space such that distances between the frames in the space are calculable. As a simple example, consider the Euclidean distance metric in a two dimensional space (e.g., the dimensions denoted by x and y), the distance from one point (e.g., frame) to another follows $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ = distance between the two points 1 and 2. After the creation of the semantic model, the similarity of any frame to another is the exponential of the negative square distance between the two frames in the n-dimensional space. That is, the closer two frames are, the more similar they are.
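For illustration only, the Euclidean distance and the exponential similarity measure quoted from Yehezkel [0026] can be sketched as follows. The function names are hypothetical, and frames are assumed to be given as coordinate tuples of equal length in the n-dimensional space.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two mapped frames."""
    return math.dist(a, b)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Exponential of the negative squared distance; equals 1.0 for identical points."""
    d = euclidean_distance(a, b)
    return math.exp(-d * d)
```

Under this measure the similarity falls toward zero as the distance between two frames grows, which matches the statement that closer frames are more similar.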
Yehezkel [0051] teaches the system 200 employs a user interface to present identified sub-scenes and accept user input. The user interface may provide an additional feature to the system 200 of allowing the user to manually select the relevant sub-scenes from a relevancy-wise ordered list (similar to a search engine result list, or a decision-support system).
Claim 5. The system of claim 4, wherein the programming instructions for determining whether the image frame in the second video is similar to the first video further comprise programming instructions configured to: determine a distance between the image frame in the second video and a reference frame in the second video; and upon determining that the distance between the image frame in the second video and the reference frame in the second video is below a first threshold, determine whether the image frame is similar to the first video based on whether the reference frame is similar to the first video; otherwise determine whether the image frame is similar to the first video by comparing the image frame with the first video. Yehezkel [0026] teaches the semantic model is generated when the frames are placed in the n-dimensional metric-space such that distances between the frames in the space are calculable. As a simple example, consider the Euclidean distance metric in a two dimensional space (e.g., the dimensions denoted by x and y), the distance from one point (e.g., frame) to another follows $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ = distance between the two points 1 and 2. After the creation of the semantic model, the similarity of any frame to another is the exponential of the negative square distance between the two frames in the n-dimensional space. That is, the closer two frames are, the more similar they are.
Yehezkel [0027] teaches in an example, a set of key frames from the mapped frames may be identified. The mapped frames, or points in the n-dimensional space, represent points on a surface of a manifold in the n-dimensional space. It is ultimately this manifold which is the underlying model, however, its exact definition is not necessary to perform the techniques described herein. In fact, a subset of the frames, the key frames, may be used instead. The key frames are single frames that represent a group of frames for a semantic concept. For example, a cluster of frames in the n-dimensional space represent a similar scene. A frame from the cluster may therefore represent the cluster and is thus a key frame. A variety of key frame identification techniques may be employed, such as finding a kernel to a cluster. In an example, the key frames may be recursively identified by scoring the frames and successively taking the top scoring frame until a threshold number of key frames are acquired. In an example, the threshold is determined by the length of the video. In an example, the threshold is determined by a number of identified clusters in the n-dimensional space. In an example, where the scoring of frames involves distance between key frames, the threshold is a minimum distance between frames. That is, if the distance between two frames is below the threshold, the recursive search stops.
Yehezkel [0028] teaches in an example, the key frames may be scored by distance. Here, identifying frames that are far apart from each other identifies parts of the video that show different things. To score the distance between the frames, a first frame is chosen as the first key frame. In an example, the first frame is chosen based on being the farthest from the origin of the n-dimensional space. A second frame is chosen to be in the set of key frames by choosing the frame that is farthest from the first frame. The third frame chosen is the farthest from both the first frame and the second frame in the set of key frames. As noted above, this may continue until the distance between the nth frame is below a threshold. Thus, the set of key frames is a recursive identifying of key frames by adding a next frame to the set of key frames with the highest score in the set of frames that are mapped. The score of a frame in this example, being the inverse of the sum of a square norm of the coordinate of the frame multiplied by a constant and divided by the square of the norm of the distance between the frames and another frame in the set of key frames for all members of the set of key frames. The following equation illustrates this scoring:
Yehezkel [0026-0038]
Yehezkel [0033] teaches in an example, to initialize the model, the multiplexer 125 is arranged to construct a graph in which nodes correspond to the frames of the video and edges are weighted to the exponent of the negative square distance between frames in the semantic model. Thus, the closer the frames are in the semantic model, the greater the weight of edges connecting the frames. The value of the nodes is the corresponding relevancy score for the frame. In an example, edges are omitted (e.g., never placed in the graph) or removed when a distance between two frames is beyond a threshold. That is, if two frames are far enough removed, no edge will remain to connect the corresponding nodes of these frames in the graph. Such edge reduction may increase computational efficiency in converging the model by reducing the number of calculations at each iteration. In an example, the minimum distance is determined such that the graph is fully connected (e.g., there exists a sequence of edges such that each node can reach each other node).
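For illustration only, the graph initialization that Yehezkel [0033] describes can be sketched as follows. This is a minimal sketch under stated assumptions: frames are given as coordinate tuples, the graph is stored as a dictionary of node values and a dictionary of undirected edge weights, and the names `build_relevancy_graph` and `max_distance` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def build_relevancy_graph(
    coords: List[Sequence[float]],
    relevancy: List[float],
    max_distance: float,
) -> Tuple[Dict[int, float], Dict[Tuple[int, int], float]]:
    """Build nodes valued by relevancy score and edges weighted by exp(-d^2)."""
    if len(coords) != len(relevancy):
        raise ValueError("one relevancy score is required per frame")
    # Node value: the relevancy score of the corresponding frame.
    nodes = {i: relevancy[i] for i in range(len(coords))}
    edges: Dict[Tuple[int, int], float] = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d > max_distance:
                continue  # frames too far apart: the edge is omitted
            edges[(i, j)] = math.exp(-d * d)
    return nodes, edges
```

Omitting distant edges keeps the graph sparse, which is the computational saving the cited paragraph attributes to edge reduction. The sketch does not check that the resulting graph remains connected.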
Claim 6. The system of claim 5 further comprising programming instructions configured to, upon determining whether the image frame is similar to the first video by comparing the image frame with the first video: determine the reference frame and a next image frame in the second video; and determine whether the next image in the second video is similar to the first video based on whether the reference frame is similar to the first video. Yehezkel [Abstract] teaches system and techniques for automatic video summarization are described herein. A video may be obtained and a semantic model of the video may be generated from frames of the video. Respective relevancy scores may be assigned to the frames. The semantic model may be initialized with the respective relevancy scores. The semantic model may then be iteratively processed to produce sub-scenes of the video, the collection of sub-scenes being the video summarization.
Yehezkel [0026-0028], [0030-0039] teaches determining respective relevancy scores (frames)
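For illustration only, the reference-frame limitation recited in claims 5 and 6 can be read as the following control flow. The sketch is drawn from the claim language, not from Yehezkel, and makes two assumptions: the full comparison against the first video is supplied as a caller-provided function, and a frame that undergoes the full comparison becomes the new reference frame. All names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]


def similar_frames(
    second_video: List[Point],
    is_similar_to_first: Callable[[Point], bool],
    first_threshold: float,
) -> List[int]:
    """Return indices of second-video frames judged similar to the first video."""
    similar: List[int] = []
    reference: Optional[Point] = None
    reference_is_similar = False
    for index, frame in enumerate(second_video):
        if reference is not None and math.dist(frame, reference) < first_threshold:
            # Close to the reference frame: reuse the reference frame's result.
            frame_is_similar = reference_is_similar
        else:
            # Otherwise compare the frame with the first video directly.
            frame_is_similar = is_similar_to_first(frame)
            # Assumption: the fully compared frame becomes the new reference.
            reference, reference_is_similar = frame, frame_is_similar
        if frame_is_similar:
            similar.append(index)
    return similar
```

Under these assumptions the full comparison runs only when a frame departs from the current reference frame by at least the first threshold.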
Claim 7. The system of claim 5, wherein the programming instructions for determining whether the image frame is similar to the first video further comprise programming instructions configured to: determine a plurality of distance values each between the image frame and a respective image frame of the plurality of image frames of the first video; and combine the plurality of distance values to determine whether the image frame is similar to the first video. Yehezkel [Abstract] teaches system and techniques for automatic video summarization are described herein. A video may be obtained and a semantic model of the video may be generated from frames of the video. Respective relevancy scores may be assigned to the frames. The semantic model may be initialized with the respective relevancy scores. The semantic model may then be iteratively processed to produce sub-scenes of the video, the collection of sub-scenes being the video summarization.
Yehezkel [0026-0028], [0030-0039] teaches determining respective relevancy scores (frames)
Claim 8. The system of claim 7, wherein the programming instructions for combining the plurality of distance values to determine whether the image frame is similar to the first video further comprise programming instructions configured to: perform an average operation on the plurality of distance values to determine an average distance; and upon determining the average distance is below a second threshold, determine that the image frame is similar to the first video; otherwise determine that the image frame is not similar to the first video. Yehezkel [Abstract] teaches system and techniques for automatic video summarization are described herein. A video may be obtained and a semantic model of the video may be generated from frames of the video. Respective relevancy scores may be assigned to the frames. The semantic model may be initialized with the respective relevancy scores. The semantic model may then be iteratively processed to produce sub-scenes of the video, the collection of sub-scenes being the video summarization.
Yehezkel [0026-0028], [0030-0039] teaches determining respective relevancy scores for the frames.
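For illustration only, the averaging limitation recited in claim 8 can be sketched as below. The sketch is not taken from the application or from Yehezkel: the Euclidean metric, the representation of a frame as a list of numbers, and the names euclidean and is_similar_to_video are all assumptions made for the example, since the claim does not specify a distance measure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_similar_to_video(frame_descriptor, video_descriptors, threshold):
    """Average the distances from one frame to every frame of the first video
    and compare the average against the threshold (the claim's "second threshold")."""
    if not video_descriptors:
        raise ValueError("the first video must contain at least one frame")
    # One distance value per image frame of the first video.
    distances = [euclidean(frame_descriptor, d) for d in video_descriptors]
    average = sum(distances) / len(distances)
    # Below the threshold: similar; otherwise: not similar.
    return average < threshold
```

With descriptors [3, 4] and [0, 0] for the first video and a query frame at [0, 0], the distances are 5 and 0, so the average is 2.5; the frame is treated as similar for a threshold of 3.0 and as not similar for a threshold of 2.0.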
Claim 9. The system of claim 5 further comprising programming instructions configured to: initialize the reference frame in the second video; determine whether the reference frame is similar to the first video by: determining a plurality of distance values each between the reference frame and a respective image frame of the plurality of image frames of the first video; and determining an average distance of the plurality of distance values; and upon determining the average distance is below a second threshold, determining that the reference frame is similar to the first video; otherwise determining that the image frame is not similar to the first video. Yehezkel [0026-0028]. Yehezkel [0028] teaches in an example, the key frames may be scored by distance. Here, identifying frames that are far apart from each other identifies parts of the video that show different things. To score the distance between the frames, a first frame is chosen as the first key frame. In an example, the first frame is chosen based on being the farthest from the origin of the n-dimensional space. A second frame is chosen to be in the set of key frames by choosing the frame that is farthest from the first frame. The third frame chosen is the farthest from both the first frame and the second frame in the set of key frames. As noted above, this may continue until the distance between the nth frame is below a threshold. Thus, the set of key frames is a recursive identifying of key frames by adding a next frame to the set of key frames with the highest score in the set of frames that are mapped. The score of a frame in this example, being the inverse of the sum of a square norm of the coordinate of the frame multiplied by a constant and divided by the square of the norm of the distance between the frames and another frame in the set of key frames for all members of the set of key frames. 
Yehezkel [0086] teaches a first subset of key frames are identified as low-relevancy frames and a second subset of key frames are identified as high-relevancy frames based on the respective relevancy scores. 
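For illustration only, the distance-based key frame selection described in Yehezkel [0028] can be approximated by a greedy farthest-point procedure. This is a simplified sketch, not the scoring formula quoted above: it assumes frames are already mapped to coordinates, reads "farthest from both" as maximizing the smallest distance to the key frames chosen so far, and uses hypothetical names (select_key_frames, min_distance).

```python
import math


def norm(v):
    """Euclidean norm of a coordinate vector."""
    return math.sqrt(sum(x * x for x in v))


def dist(a, b):
    """Euclidean distance between two coordinate vectors."""
    return norm([x - y for x, y in zip(a, b)])


def select_key_frames(points, min_distance):
    """Return indices of key frames chosen greedily by distance.

    The first key frame is the point farthest from the origin. Each later
    key frame is the remaining point whose smallest distance to the chosen
    key frames is largest. Selection stops once that distance drops below
    min_distance (an assumed stand-in for the threshold in [0028]).
    """
    if not points:
        return []
    first = max(range(len(points)), key=lambda i: norm(points[i]))
    selected = [first]
    remaining = set(range(len(points))) - {first}
    while remaining:
        def gap(i):
            # Smallest distance from candidate i to any chosen key frame.
            return min(dist(points[i], points[j]) for j in selected)

        best = max(sorted(remaining), key=gap)
        if gap(best) < min_distance:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

For the points (0, 0), (10, 0), (0, 9), and (0.5, 0) with min_distance set to 2.0, the point at index 1 is chosen first, followed by indices 2 and 3; the point at the origin is then only 0.5 away from an already chosen key frame, so selection stops.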
Claim 10. The system of claim 1, wherein the programming instructions for determining one of the first feature descriptors associated with the respective image frame of the first video further comprise programming instructions configured to execute the AI chip configured to: determine one or more feature maps of the respective image frame; Yehezkel [0021] teaches the low level features include a GIST descriptor. A GIST descriptor may be computed by convolving a frame with a number of Gabor filters at different scales and orientations to produce a number of feature maps. In an example, there are thirty two Gabor filters, four scales, and eight orientations used to produce thirty two feature maps for the GIST descriptor. These feature maps may then be divided into a number of regions (e.g., sixteen regions or four by four grids) in which the average features values of each region are calculated. Finally, the averaged values may be concatenated (e.g., joined) to produce the GIST descriptor. Other low level feature techniques may be used, such as Hough transforms to identify shapes or lines in the frames, color based measurements, etc. In an example, metadata of the frames may be measured for feature extraction, such as the geographic location of the frame capture. In an example, low level sound features may be used. In an example, Mel-frequency cepstral coefficients (MFCCs) may be employed as low level features. In general, audio cues, such as the presence of loud noise or the absence of noise may contribute to identifying interesting (e.g., relevant) portions of the video.
Yehezkel [0025] teaches after the pseudo-semantic domain is created, the semantic classifier 115 maps the individual frames to the pseudo-semantic domain. As noted above, such mapping may include using individual feature extraction values as coordinates for the frames. These values may be normalized so as to function as valid coordinates in the n-dimensional space together. In an example, the normalization is not performed, and the raw values are used. In the example of the pseudo-semantic domain built using a network, such as the deep Boltzmann network, mapping the individual frames may simply involve feeding each frame through the network to arrive at the resultant coordinates particular to that frame in the n-dimensional space.
and use an invariance pooling layer to generate the feature descriptor based on the one or more feature maps. Yehezkel [0057] teaches the video to be summarized is placed in a storage device 405. Scene features are extracted by component 410. These features are common features used to classify between scene types (e.g., indoor/outdoor, beach/sunset/party, etc.). Example features may include GIST descriptors or output of the first layers of a deep convolution network trained for scene classification. The extracted features may be placed in the storage device 415 for use by other components.
Yehezkel [0078] teaches at operation 1010, a semantic model of the video may be generated from the frames of the video. In an example, generating the semantic model includes extracting features of the frames. In an example, extracting the features includes finding low level features. In an example, the low level features include a GIST descriptor. In an example, the low level features include location.
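For illustration only, the region-averaging and concatenation steps that Yehezkel [0021] describes for the GIST descriptor can be sketched as below. The sketch assumes the feature maps have already been produced (the Gabor filtering stage is omitted), that each map is a rectangular list of rows whose dimensions divide evenly by the grid size, and that the names region_averages and pooled_descriptor are hypothetical. It is not offered as the invariance pooling layer of the claim.

```python
def region_averages(feature_map, grid=4):
    """Split one feature map into grid x grid regions and average each region."""
    rows, cols = len(feature_map), len(feature_map[0])
    if rows % grid or cols % grid:
        raise ValueError("map dimensions must be divisible by the grid size")
    region_h, region_w = rows // grid, cols // grid
    averages = []
    for gr in range(grid):
        for gc in range(grid):
            cells = [
                feature_map[r][c]
                for r in range(gr * region_h, (gr + 1) * region_h)
                for c in range(gc * region_w, (gc + 1) * region_w)
            ]
            averages.append(sum(cells) / len(cells))
    return averages


def pooled_descriptor(feature_maps, grid=4):
    """Concatenate the region averages of every feature map into one descriptor."""
    descriptor = []
    for feature_map in feature_maps:
        descriptor.extend(region_averages(feature_map, grid))
    return descriptor
```

Under these assumptions, thirty-two feature maps pooled over a four by four grid give a descriptor with 32 x 16 = 512 values, which matches the sixteen regions per map described in [0021].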
Claim 11. The system of claim 1 further comprising an image sensor configured to capture the plurality of image frames of the first video. Yehezkel [0018] teaches FIG. 1 is a block diagram of an example of an environment including a system 100 for automatic video summarization, according to an embodiment. The system 100 may include a camera 105 (to capture the video).
Claim 12. It differs from claim 1 in that it is a method performed by the system of claim 1. Therefore claim 12 has been analyzed and reviewed in the same way as claim 1. See the above analysis. 
Claim 13. The method of claim 12, wherein outputting the query output comprises displaying a slide show of the subset of image frames, Yehezkel [0075] teaches FIG. 9 illustrates a user interface 900 to support supervised learning for sub-scene selection, according to an embodiment. The unique capabilities for video summarization of the systems discussed above allow for an intuitive user interface to most review or edit sub-scene inclusion in the summarization, as well as for providing some feedback to facilitate supervised learning of the relevancy cues or semantic modeling. For example, when user intervention is requested (e.g., decision-support system mode of operation), the user interface 900 may be employed. If the user is not satisfied with the fully automatic summarization result, particular sub-scenes may be removed, or re-ordered to be less prominent in the summary. The user interface 900 illustrates search-engine-like interface, where presented sub-scenes are ordered in decreasing relevancy (as detected automatically). The user can slide with her finger up or down on the screen to browse sub-scenes. If, for example, the user browses beyond an end of the sub-scenes already generated, the system may produce additional sub-scenes to populate the menu (e.g., by activating the feedback from component 720 to component 715). The user may also select which presented sub-scenes to include, or not to include, in the summary by sliding her finger, for example, to the left and the right respectively. 
wherein the subset of image frames in the second video include image frames in the second video that are similar to the first video. Yehezkel [0086] teaches a first subset of key frames are identified as low-relevancy frames and a second subset of key frames are identified as high-relevancy frames based on the respective relevancy scores. 
Claim 14. The method of claim 12, wherein outputting the query output comprises displaying a video comprising at least the subset of image frames, wherein the subset of image frames in the second video include image frames in the second video that are similar to the first video. Yehezkel [0086] teaches a first subset of key frames are identified as low-relevancy frames and a second subset of key frames are identified as high-relevancy frames based on the respective relevancy scores. 
Claim 15. It differs from claim 4 in that it is the method performed by the system of claim 4. Therefore claim 15 has been analyzed and reviewed in the same way as claim 4. See the above analysis. 
Claim 16. It differs from claim 5 in that it is the method performed by the system of claim 5. Therefore claim 16 has been analyzed and reviewed in the same way as claim 5. See the above analysis. 
Claim 17. It differs from claim 6 in that it is the method performed by the system of claim 6. Therefore claim 17 has been analyzed and reviewed in the same way as claim 6. See the above analysis. 
Claim 18. It differs from claim 7 in that it is the method performed by the system of claim 7. Therefore claim 18 has been analyzed and reviewed in the same way as claim 7. See the above analysis. 
Claim 19. It differs from claim 8 in that it is the method performed by the system of claim 8. Therefore claim 19 has been analyzed and reviewed in the same way as claim 8. See the above analysis. 
Claim 20. It differs from claim 7 in that it is the method performed by the system of claim 7. Therefore claim 20 has been analyzed and reviewed in the same way as claim 7. See the above analysis. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. US 2016/0300110 A1 to Prosek.
Prosek [0005] teaches identify a first video represented based on a first set of image frames. A first subset of image frames can be extracted from the first set of image frames.
Prosek [0006] teaches a second video represented based on a second set of image frames can be identified. A second subset of image frames can be extracted from the second set of image frames.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DELOMIA L GILLIARD whose telephone number is (571)272-1681. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DELOMIA L GILLIARD/Primary Examiner, Art Unit 2661