DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This action is in response to the amendments received on 7/20/21.  Claims 1-21 are pending in the application.  Claim 22 has been added. Applicants' arguments have been carefully and respectfully considered.
Claims 1-22 are rejected under 35 U.S.C. 112.
Claims 1, 10 and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over He et al. (US 2017/0293638), and further in view of Song et al., “Hierarchical LSTMs with Adaptive Attention for Visual Captioning,” August 2015, and Wang et al. (US 2007/0219945).
Claims 2-5 and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Mei et al. (US 2017/0150235).
Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song, Wang, and Mei, and further in view of Yu et al. (US 2017/0127016).
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Britz, Recurrent Neural Network Tutorial, Part 4, October 27, 2015.
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Venugopalan et al., “Translating Videos to Natural Language Using Deep Recurrent Neural Networks”, June 5, 2015.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Mei et al. (US 2017/0150235) and Yu et al. (US 2017/0127016).
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song, Wang, Mei, and Yu, and further in view of Britz, Recurrent Neural Network Tutorial, Part 4, October 27, 2015.

Allowable Subject Matter
Claim 22 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
Song seems to describe the loss function algorithm on page 13, equation 35.  However, the negative order-violation penalty definition and asymmetric order-violation function definition together with the pairwise ranking loss function would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112 and including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
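(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.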

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-22 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Claims 1, 10, and 17 recite “determining, based on a loss function, one or more instances of video content from the video library that correspond to the textual query.”  The loss function is described in pa 0066; however, the specification does not explain how the loss function is used to determine video content that corresponds to the textual query.  Paragraph 0067 appears to describe how the loss function (c,v) relates captions and images, but not how it is used to determine video content corresponding to a query.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


Claims 1, 10 and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over He et al. (US 2017/0293638), and further in view of Song et al., “Hierarchical LSTMs with Adaptive Attention for Visual Captioning,” August 2015, and Wang et al. (US 2007/0219945).

With respect to claim 1, He teaches a computer-implemented method of querying video content, the computer-implemented method comprising: 
receiving, from a requesting entity, a textual query to be evaluated relative to a video library (He, pa 0037, user can submit a query, pa 0043, data store can be a repository for images, images may be frames of video), the video library containing a plurality of instances of video content (He, pa 0043, images may be frames of video); 
determining one or more instances of video content from the video library that correspond to the textual query (He, pa 0042, provide a display for answers to queries), by analyzing the textual query using a data model that includes a soft-attention neural network module (He, pa 0075-0076, query-feature-determining module can operate a network computational model (NCM) to determine feature information of the query text where NCM can include a neural network model including at least one LSTM), wherein the data model is trained by encoding each phrase describing each of a plurality of training samples as a matrix (He, pa 0074, and 
returning at least an indication of the one or more instances of video content to the requesting entity (He, pa 0094, select output elements from top ranks).
He doesn't expressly discuss a data model that includes a soft-attention neural network module that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module wherein the soft-attention neural network module is used to generate, by calculating an attention-weighted average of video frames of an instance of the one or more instances of video content.
Song teaches determining, based on a loss function, one or more instances of video content (Song, pg. 13, section 6.3, the parameters Ɵ in image captioning model are pretrained by minimizing MLE loss), 
a data model that includes a soft-attention neural network module (Song, pg. 2, “To tackle these issues”, hLSTMat framework), a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module, which are jointly trained (Song, Fig. 2 & section 4.2, two LSTM layers, the bottom decodes visual features and the top focuses on mining deep language context information for video captioning), wherein the soft-attention neural network module is used to align an output of a last state of the language LSTM neural network module with feature vectors of an instance of the one or more instances of video content (Song, pg. 5, section 4.2, MLP layer interpret the output of the softmax layer pt as a probability distribution over word P(zt | z<t, V, Ɵ) where V denotes the features of the corresponding input video) and generate, by calculating an attention-weighted average of video frames of the instance of video content (Song, section 4.4, compute the average of features across a video, incorporating the attention weights, See Eq. 10), an attention-based representation that is fed to the video LSTM neural network module (Song, section 4.5, use attention mechanism to select important regions in captioning), wherein the data model is trained by encoding each phrase describing each of a plurality of training samples as a matrix (Song, pg. 4, section 4.1, a video x, we extract L=28 frame-level features V, represented as V={v1, …, vL}), wherein each word of each phrase is encoded as a vector using a trained model for word representation, wherein the trained model is separate from the data model (Song, Fig. 2&3, pg. 5, or attention based LSTM, context vector is in general an important factor, since it provides meaningful visual evidence for caption generation [14]. In order to efficiently adjust the choose of visual information or sentence context information for caption generation, we defined an adaptive temporal context vector ct and a temporal context vector ct at time t., pg. 6, section 4.7, we utilize two different pre-trained hLSTMat models. Each hLSTMat model is trained by taking one type of video feature as the input. As a result, each model separately generates a distribution of the words.).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He with the teachings of Song because it indicates where to look at visual information and when to rely on language context information (Song, pg. 2, “It’s worthwhile to highlight”).

Wang teaches determining a weighted ranking of each phrase based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases (Wang, Fig. 3 step 304 & pa 0038 & pa 0048, determining phrase length property for a phrase, preferring longer names & Fig. 3 step 308 & pa 0043, ranking phrases by score).
It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song with the teachings of Wang because it provides key information for the searched content (Wang, pa 0001 & pa 0048).
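
For illustration only, the length-based ranking that the rejection maps to Wang can be sketched as follows. This is a minimal sketch under stated assumptions, not Wang's or Applicants' implementation: the function name rank_phrases_by_length, the use of word count as the measure of phrase length, the normalization of the weights, and the sample phrases are all hypothetical.

```python
def rank_phrases_by_length(phrases):
    """Return (phrase, weight) pairs, longest phrase first.

    Assumption: length is measured in whitespace-separated words and the
    weight is each phrase's share of the total word count.
    """
    lengths = [len(p.split()) for p in phrases]
    total = sum(lengths) or 1  # avoid division by zero for empty input
    weighted = [(p, n / total) for p, n in zip(phrases, lengths)]
    # sorted() is stable, so phrases of equal length keep their input order.
    return sorted(weighted, key=lambda pw: pw[1], reverse=True)


ranked = rank_phrases_by_length(
    ["a dog", "a dog runs on the beach", "dog running"]
)
```

In this sketch a lengthier phrase always receives a larger weight than a shorter one, which is the only property the claim language requires of the ranking.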

With respect to claim 10, He teaches a computer-implemented method of querying video content, the computer-implemented method, comprising: 
receiving, from a requesting entity, a textual query to be evaluated relative to a video library (He, pa 0037, user can submit a query, pa 0043, data store can be a repository for images, images may be frames of video), the video library containing a plurality of instances of video content (He, pa 0043, images may be frames of video); 
training a data model based on a plurality of training samples, wherein the data model comprises a soft-attention neural network module (He, pa 0075-0076, query-feature-determining module can operate a network computational model (NCM) to determine feature information of the query text where NCM can include a neural network model including at least one LSTM), wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content (He, pa 0068, training set of data includes images and corresponding text), wherein training the data model comprises, for each of the plurality of training samples: 
encoding each of the plurality of phrases for the training sample as a matrix (He, pa 0084-0085, NCM computes attention information by using matrix for each m image region), wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation (He, pa 0084, NCM can take as inputs one or more elements of a vector of the feature input of the image); 
encoding the respective instance of video content for the training sample as a sequence of frames (He, pa 0043, images may be frames of video); 
extracting frame features from the sequence of frames (He, pa 0069, determining feature information of images); 
performing an object classification analysis on the extracted frame features (He, pa 0071, pooling for image region data); and 
generating a matrix representing the respective instance of video content, based on the extracted frame features and the object classification analysis (He, pa 0084-0085, NCM computes attention information by using matrix for each m image region), the matrix including feature vectors (He, pa 0084, NCM 418(1) can take as inputs one or more elements of a vector of the feature input of the image and one or 
processing the textual query using the trained data model to identify one or more instances of video content from the plurality of instances of video content (He, pa 0042, provide a display for answers to queries, pa 0075-0076, query-feature-determining module can operate a network computational model (NCM) to determine feature information of the query text where NCM can include a neural network model including at least one LSTM & pa 0085, determining a matrix that provides a vector of the first attention information representing a relevance of each image region m to the query); and 
returning at least an indication of the one or more instances of video content to the requesting entity (He, pa 0094, select output elements from top ranks).
He doesn't expressly discuss a soft-attention neural network module that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module wherein the soft-attention neural network module is used to generate, by calculating an attention-weighted average of video frames of an instance of the one or more instances of video content.
Song teaches wherein the data model comprises a soft-attention neural network module (Song, pg. 2, “To tackle these issues”, hLSTMat framework), a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module, which are jointly trained (Song, Fig. 2 & section 4.2, two LSTM layers, the bottom decodes visual features and the top focuses on mining deep language context information for video captioning), wherein training the data model comprises, for each of the plurality of training samples: 
encoding each of the plurality of phrases for the training sample as a matrix (Song, pg. 4, section 4.1, a video x, we extract L=28 frame-level features V, represented as V={v1, …, vL}) , wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation (Song, pg. 4, section 4.1, video x has L=28 frame-level features V, where V is represented as V={v1, …, vL}), wherein the trained model is separate from the data model (Song, Fig. 2&3, pg. 5, or attention based LSTM, context vector is in general an important factor, since it provides meaningful visual evidence for caption generation [14]. In order to efficiently adjust the choose of visual information or sentence context information for caption generation, we defined an adaptive temporal context vector ct and a temporal context vector ct at time t., pg. 6, section 4.7, we utilize two different pre-trained hLSTMat models. Each hLSTMat model is trained by taking one type of video feature as the input. As a result, each model separately generates a distribution of the words.); 
encoding the respective instance of video content for the training sample as a sequence of frames (Song, section 4.1, pg. 4, preprocess each video clip by selected equally-spaced 28 frames out of the first 360 frames);
extracting frame features from the sequence of frames (Song, pg. 4, section 4.1, a video x, we extract L=28 frame-level features V, represented as V={v1, …, vL}); 
performing an object classification analysis on the extracted frame features (Song, pg. 5, determining a context vector ct for vector V); and 
generating a matrix representing the respective instance of video content, based on the extracted frame features and the object classification analysis, the matrix including feature vectors (Song, pg. 4, section 4.1, a video x, we extract L=28 frame-level features V, represented as V={v1, …, vL});
identify, based on a loss function, one or more instances of video content (Song, pg. 13, section 6.3, the parameters Ɵ in image captioning model are pretrained by minimizing MLE loss), 
wherein the soft-attention neural network module is used to align an output of a last state of the language LSTM neural network module with feature vectors of an instance of the one or more instances of video content (Song, pg. 5, section 4.2, MLP layer interpret the output of the softmax layer pt as a probability distribution over word P(zt | z<t, V, Ɵ) where V denotes the features of the corresponding input video) and generate, by calculating an attention-weighted average of video frames of the instance of video content (Song, section 4.4, compute the average of features across a video, incorporating the attention weights, See Eq. 10), an attention-based representation that is fed to the video LSTM neural network module (Song, section 4.5, use attention mechanism to select important regions in captioning).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He with the teachings of Song because it indicates where to look at visual information and when to rely on language context information (Song, pg. 2, “It’s worthwhile to highlight”).

Wang teaches determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that more lengthy phrases are ranked above less lengthy phrases (Wang, Fig. 3 step 304 & pa 0038 & pa 0048, determining phrase length property for a phrase, preferring longer names & Fig. 3 step 308 & pa 0043, ranking phrases by score).
It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song with the teachings of Wang because it provides key information for the searched content (Wang, pa 0001 & pa 0048).
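
For illustration only, the frame-encoding step cited from Song (equally spaced 28 frames selected from the first 360 frames of a clip) can be sketched as follows. This is a sketch under stated assumptions: the function name sample_frame_indices and the floor-based spacing rule are hypothetical, since the cited passage does not state how the spacing is computed.

```python
def sample_frame_indices(total_frames, num_samples=28, window=360):
    """Pick equally spaced frame indices from the first `window` frames.

    Assumption: clips no longer than `num_samples` frames keep every frame,
    and spacing is computed by flooring a fractional step.
    """
    usable = min(total_frames, window)
    if usable <= 0:
        return []
    if usable <= num_samples:
        return list(range(usable))
    step = usable / num_samples
    return [int(i * step) for i in range(num_samples)]


indices = sample_frame_indices(1000)
```

Each selected index would then be passed to a feature extractor to obtain the frame-level features V={v1, …, vL} referred to in the mapping above.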

With respect to claim 16, He in view of Song and Wang teaches the computer-implemented method of claim 10, wherein determining the weighted ranking between the plurality of phrases further comprises: determining, for each of the plurality of phrases, the respective length of the phrase (Wang, Fig. 3 step 304 & pa 0038 & pa 0048, determining phrase length property for a phrase, preferring longer names & Fig. 3 step 308 & pa 0043, ranking phrases by score).

With respect to claim 17, He teaches a computer-implemented method, the computer-implemented method comprising:
receiving, from a requesting entity, a textual query to be evaluated relative to a video library (He, pa 0037, user can submit a query, pa 0043, data store can be a repository for images, images may be frames of video), the video library containing a plurality of instances of video content (He, pa 0043, images may be frames of video); 
training a data model based in part on a plurality of training samples, wherein the data model comprises a soft-attention neural network module (He, pa 0075-0076, query-feature-determining module can operate a network computational model (NCM) to determine feature information of the query text where NCM can include a neural network model including at least one LSTM), wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content (He, pa 0068, training set of data includes images and corresponding text), wherein a first one of the plurality of training samples comprises a single frame instance of video content generated from an image file (He, pa 0043, images may be frames of video), and wherein training the data model further comprises, for each of the plurality of training samples: 
encoding each of the plurality of phrases for the training sample as a matrix (He, pa 0084-0085, NCM computes attention information by using matrix for each m image region), wherein each word within the plurality of phrases is encoded as a vector (He, pa 0084, NCM can take as inputs one or more elements of a vector of the feature input of the image); 
generating a matrix representing the respective instance of video content, based at least in part on an object classification analysis performed on frame features extracted from the respective instance of video content (He, pa 0085, determining a matrix that provides a vector of the first attention information representing a relevance of each image region m to the query), the matrix including feature vectors (He, pa 0084, NCM 418(1) can take as inputs one or more elements of a vector of the feature input of the image and one or more elements of a vector of the feature information of the query. NCM 418(1) can then determine the feature information, e.g., as a vector of output values.); and 
processing the textual query using the trained data model to identify one or more instances of video content from the video library that are related to the textual query (He, pa 0085, determining a matrix that provides a vector of the first attention information representing a relevance of each image region m to the query); and 
returning at least an indication of the one or more instances of video content to the requesting entity (He, pa 0094, select output elements from top ranks).
He doesn't expressly discuss a data model that includes a soft-attention neural network module that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module wherein the soft-attention neural network module is used to generate, by calculating an attention-weighted average of video frames of an instance of the one or more instances of video content.
Song teaches wherein the data model comprises a soft-attention neural network module (Song, pg. 2, “To tackle these issues”, hLSTMat framework) that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module (Song, Fig. 2 & section 4.2, two LSTM layers, the bottom decodes visual features and the top focuses on mining deep language context information for video captioning);
wherein the soft-attention neural network module is used to align an output of a last state of the language LSTM neural network module with feature vectors of an instance of the one or more instances of video content (Song, pg. 5, section 4.2, MLP layer interpret the output of the softmax layer pt as a probability distribution over word P(zt | z<t, V, Ɵ) where V denotes the features of the corresponding input video) and generate, by calculating an attention-weighted average of video frames of the instance of video content (Song, section 4.4, compute the average of features across a video, incorporating the attention weights, See Eq. 10), an attention-based representation that is fed to the video LSTM neural network module (Song, section 4.5, use attention mechanism to select important regions in captioning)
encoding each of the plurality of phrases for the training sample as a matrix (Song, pg. 4, section 4.1, a video x, we extract L=28 frame-level features V, where V is a vector), wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation (Song, pg. 4, section 4.1, video x has L=28 frame-level features V, where V is represented as V={v1, …, vL}), wherein the trained model is separate from the data model (Song, Fig. 2&3, pg. 5, or attention based LSTM, context vector is in general an important factor, since it 
identify, based on a loss function, one or more instances of video content (Song, pg. 13, section 6.3, the parameters Ɵ in image captioning model are pretrained by minimizing MLE loss).
It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He with the teachings of Song because it indicates where to look at visual information and when to rely on language context information (Song, pg. 2, “It’s worthwhile to highlight”).
He in view of Song doesn't expressly discuss determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that more lengthy phrases are ranked above less lengthy phrases.
Wang teaches determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases (Wang, Fig. 3 step 304 & pa 0038 & pa 0048, determining phrase length property for a phrase, preferring longer names & Fig. 3 step 308 & pa 0043, ranking phrases by score).
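It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song with the teachings of Wang because it provides key information for the searched content (Wang, pa 0001 & pa 0048).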

With respect to claim 18, He in view of Song and Wang teaches the computer-implemented method of claim 17, wherein an alignment module within the data model generates a matching score mt,i at each time step t of the language LSTM neural network module, wherein the matching score represents how well a sequentially modeled sentence up to time t - 1 and the video frame vi are semantically matched to one another (Song, pg. 13, section 6.3.1, Contrastive Loss, determining the similarity between an image x and caption c, where x is an image from a video, equation 35 indicates that a pair of caption and image is matched).

With respect to claim 19, He in view of Song and Wang teaches the computer-implemented method of claim 18, wherein the matching score mt,i is used to determine a relevance of the video frame vi and the language LSTM hidden state at the time t - 1, wherein the matching score mt,i is defined as mt,i = Ф(ht-1, vi) (Song, Fig. 7, input to 2nd LSTM & section 6.2.2, h2t-1 = LSTM(y2t, h2t-1) represents the output from the second LSTM that provides the caption for the language context and video frame), where ht-1 represents the language LSTM hidden state at the time t - 1 that contains information related to a sequentially modeled sentence up to the time t - 1 (Song, section 6.2, ht-1 is the output of the LSTM unit at the t-1 time step).

Claims 2-5 and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Mei et al. (US 2017/0150235).

With respect to claim 2, He in view of Song and Wang teaches the method of claim 1, as discussed above.  He in view of Song and Wang doesn't expressly discuss aligning encoded portions of textual phrases with video frames of video content by backpropagating gradients of the language and video LSTM neural network modules through both the language and video LSTM neural networks and through the soft-attention neural network module.
Mei teaches aligning encoded portions of textual phrases with frames of video content by backpropagating gradients of the language and video LSTM neural network modules through both the language and video LSTM neural networks and through the soft-attention neural network module. (Mei, pa 0020).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song and Wang with the teachings of Mei because backpropagating allows appropriate adjustment in the model (Mei, pa 0020).

With respect to claim 3, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 2, wherein an alignment module within the data model generates a matching score mt,i at each time step t of the language LSTM neural network, wherein the matching score represents how well a sequentially modeled sentence up to time t - 1 and the video frame vi are semantically matched to one another (Song, pg. 13, section 6.3.1, Contrastive Loss, determining the similarity between an image x and caption c, where x is an image from a video, equation 35 indicates that a pair of caption and image is matched).

With respect to claim 4, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 3, wherein the matching score mt,i is used to determine a relevance of the video frame vi and the language LSTM hidden state at the time t - 1, wherein the matching score mt,i is defined as mt,i = Ф(ht-1, vi) (Mei, pa 0050, E(v,s) measures the relevance between content of the video and sentence semantics), where ht-1 represents the language LSTM hidden state at the time t - 1 that contains information related to a sequentially modeled sentence up to the time t - 1 (Mei, Fig. 5, ht-1 & pa 0060, mapping input sequences to a sequence of hidden states).

With respect to claim 5, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 4, wherein determining the one or more one or more instances of video content from the video library that correspond to the textual query comprises: calculating a single value for matching score mt,i by taking a sum of states ht-1 with each video-data vi to obtain a matching-vector and transforming the matching-vector to produce the matching score mt,i (Mei, pa 0065, computing the sum of E(V,S)).
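
For illustration only, the matching-score calculation recited in claim 5 (summing the state ht-1 with a frame feature vi to obtain a matching-vector, then transforming that vector into a single value) can be sketched as follows. This is a sketch under stated assumptions, not the method of Mei or of the application: the function name matching_score, the projection vector w, and the tanh-then-dot-product transform are hypothetical, and ht-1 and vi are assumed to have the same dimension.

```python
import math


def matching_score(h_prev, v_i, w):
    """Score one frame against the language state.

    h_prev: language LSTM hidden state at time t-1 (list of floats)
    v_i:    feature vector of frame i, assumed to have the same length
    w:      hypothetical projection vector that reduces the matching
            vector to a single value
    """
    if not (len(h_prev) == len(v_i) == len(w)):
        raise ValueError("h_prev, v_i and w must have the same length")
    # Element-wise sum of state and frame feature gives the matching vector.
    matching_vector = [h + v for h, v in zip(h_prev, v_i)]
    # Assumed transform: tanh squashing followed by a dot product with w.
    return sum(wk * math.tanh(mk) for wk, mk in zip(w, matching_vector))


h_prev = [0.5, -0.5]
frames = [[-0.5, 0.5], [0.5, 0.5]]
scores = [matching_score(h_prev, v, [1.0, 1.0]) for v in frames]
```

One score is produced per frame, so a sequence of T frames yields T values mt,1 through mt,T at each time step t.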

With respect to claim 13, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 10, wherein extracting the frame features from the sequence of frames is performed using a pretrained spatial convolutional neural network (CNN) (Mei, pa 0020).

With respect to claim 14, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 10, wherein the generated matrix representing the video is formed as $V = \{v_1, \dots, v_M\} \in \mathbb{R}^{M \times d_v}$ of M video feature vectors, wherein each video feature vector has dv dimensions (Mei, pa 0050 & 0077).

With respect to claim 15, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 14, wherein the object classification analysis is performed on the extracted frame features using a deep convolutional neural network trained for object classification, wherein the deep convolutional neural network comprises a defined number of convolutional layers and a defined number of fully connected layers (Mei, Fig. 4 & pa 0020), followed by a softmax output layer (Mei, pa 0063).

Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song, Wang, and Mei, and further in view of Yu et al. (US 2017/0127016).

With respect to claim 6, He in view of Song, Wang, and Mei teaches the computer-implemented method of claim 5, as discussed above.  He in view of Song, Wang, and Mei doesn't expressly discuss the limitations of claim 6.
Yu teaches wherein determining the one or more one or more instances of video content from the video library that correspond to the textual query further comprises: computing an attention weight wt,i for a video frame i of an instance of video content at a time t as $w_{t,i} = \frac{\exp(m_{t,i})}{\sum_{j=1}^{T} \exp(m_{t,j})}$, wherein the attention weight wt,i defines a soft-alignment between encoded sentences and video frames, such that a higher attention weight wt,i reflects more saliency attributes to a specific video frame i with respect to words in the sentence (Yu, pa 0045, computing attention weight βtm for features in video feature pool).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song, Wang, and Mei to have included the teachings of Yu because it provides efficient localization of objects in datasets (Yu, pa 0043).
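For illustration only, the normalization recited in claim 6 is a softmax over the matching scores of the T frames at a time t. The following is a minimal Python sketch under that reading; the function name `attention_weights` and the example scores are hypothetical and are not taken from Yu or from the claims.

```python
import math


def attention_weights(scores):
    """Return w_{t,i} = exp(m_{t,i}) / sum_j exp(m_{t,j}) for one time step t."""
    # Subtracting the maximum leaves the ratios unchanged and only
    # guards against overflow in exp for large scores.
    peak = max(scores)
    exps = [math.exp(m - peak) for m in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical matching scores m_{t,1..3} for T = 3 video frames.
weights = attention_weights([1.0, 2.0, 3.0])
```

The weights are positive and sum to one, so a frame with a higher matching score receives a larger share of the attention.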

With respect to claim 7, He in view of Song, Wang, Mei and Yu teaches the computer-implemented method of claim 6, wherein determining the one or more instances of video content from the video library that correspond to the textual query further comprises: generating an attention-based representation kt(A) of the instance of video content by calculating a weighted average kt(A) of the video frames using the computed attention weights wt,i, where $k_t(A) = \sum_{i=1}^{T} w_{t,i} v_i$ (Yu, pa 0046, a single feature vector $u_t = \sum_{m=1}^{KM} \beta_m^t v_m$).
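For illustration only, the weighted average recited in claim 7 can be sketched as follows, assuming each video frame is represented by a feature vector of equal length. The function name `attention_representation` and the sample values are hypothetical.

```python
def attention_representation(weights, frames):
    """Return k_t(A) = sum_i w_{t,i} * v_i for frame feature vectors v_i."""
    if len(weights) != len(frames):
        raise ValueError("need exactly one attention weight per frame")
    dim = len(frames[0])
    rep = [0.0] * dim
    for w, v in zip(weights, frames):
        for d in range(dim):
            rep[d] += w * v[d]
    return rep


# Hypothetical two-dimensional features for T = 3 frames and their weights.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.2, 0.3, 0.5]
k_t = attention_representation(weights, frames)
```

Because the attention weights sum to one, the result is a convex combination of the frame features, dominated by the frames with the largest weights.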

With respect to claim 8, He in view of Song, Wang, Mei and Yu teaches the computer-implemented method of claim 7, wherein the data model is used to determine a respective relevance of each respective video frame of the one or more instances of video content (He, pa 0084-0085, output of NCM can represent a relevance of each image region m to the query).

With respect to claim 9, He in view of Song, Wang, Mei and Yu teaches the computer-implemented method of claim 8, wherein the last state ht-1 of the language LSTM neural network module is updated according to the following intermediate functions:
$i_t = \sigma(W_i v_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_{if} v_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o v_t + U_o h_{t-1} + b_o)$
$g_t = \tanh(W_c v_t + U_c h_{t-1} + b_c)$
$c_t = f_t c_{t-1} + i_t g_t$
$h_t = o_t \tanh(c_t)$

wherein $i_t$ represents an input gate, $f_t$ represents a forget gate, $o_t$ represents an output gate, and $c_t$ represents a cell gate of the language LSTM neural network module at a time t (Mei, pa 0062).
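For illustration only, one update of the recited LSTM functions can be sketched in plain Python. The sketch assumes the products of gate values are taken element by element, as in a standard LSTM cell, and stores the parameters in a dictionary whose key names (`W_i`, `U_i`, `b_i`, and so on) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, U, h, b):
    """Compute W v + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * x for w, x in zip(W_row, v))
        + sum(u * x for u, x in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(v_t, h_prev, c_prev, p):
    """Apply the gate equations once and return (h_t, c_t)."""
    i_t = [sigmoid(x) for x in affine(p["W_i"], v_t, p["U_i"], h_prev, p["b_i"])]
    f_t = [sigmoid(x) for x in affine(p["W_f"], v_t, p["U_f"], h_prev, p["b_f"])]
    o_t = [sigmoid(x) for x in affine(p["W_o"], v_t, p["U_o"], h_prev, p["b_o"])]
    g_t = [math.tanh(x) for x in affine(p["W_c"], v_t, p["U_c"], h_prev, p["b_c"])]
    # Assumption: element-wise products, as in a standard LSTM cell.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical one-dimensional cell with all parameters set to zero.
params = {}
for name in ["i", "f", "o", "c"]:
    params["W_" + name] = [[0.0]]
    params["U_" + name] = [[0.0]]
    params["b_" + name] = [0.0]

h_t, c_t = lstm_step([1.0], [0.0], [1.0], params)
```

With all parameters at zero, every sigmoid gate evaluates to 0.5 and the candidate g_t is 0, so the new cell state is half of the previous one.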

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Britz, Recurrent Neural Network Tutorial, Part 4, October 27, 2015.

With respect to claim 11, He in view of Song and Wang teaches the method of claim 10, as discussed above.  He in view of Song and Wang doesn't expressly discuss wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis.
Britz teaches wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis (Britz, pg. 8, Adding an embedding layer, GloVe is a popular method that creates vectors with semantic meaning and allows the network to generalize to unseen words).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song and Wang with the teachings of Britz because forming vectors from words means the neural network needs to learn less about the language (Britz, pg. 8, Adding an embedding layer).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Venugopalan et al., “Translating Videos to Natural Language Using Deep Recurrent Neural Networks”, June 5, 2015.

With respect to claim 12, He in view of Song and Wang teaches the computer-implemented method of claim 10, as discussed above.  He in view of Song and Wang doesn't expressly discuss wherein training the data model based on the plurality of training samples further comprises: generating the sequence of frames for at least one instance of video content by sampling the instance of video content at a predefined interval.
	Venugopalan teaches wherein training the data model based on the plurality of training samples further comprises: generating the sequence of frames for at least one instance of video content by sampling the instance of video content at a predefined interval (Venugopalan, section 3.2, sample frames in the video, 1 in every 10 frames).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song and Wang with the teachings of Venugopalan because it creates an effective summarization of the video for generating video descriptions (Venugopalan, section 3.2).
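For illustration only, sampling an instance of video content at a predefined interval can be sketched as follows. The default interval of 10 mirrors the "1 in every 10 frames" example cited from Venugopalan; the function name `sample_frames` and the integer stand-ins for frames are hypothetical.

```python
def sample_frames(frames, interval=10):
    """Keep every `interval`-th frame, starting with the first one."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return frames[::interval]


# Hypothetical clip of 25 frames, represented here by their indices.
sampled = sample_frames(list(range(25)), interval=10)
```

The sampled sequence is a shorter summary of the clip that still preserves the temporal order of the frames.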

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song and Wang, and further in view of Mei et al. (US 2017/0150235) and Yu et al. (US 2017/0127016).

With respect to claim 20, He in view of Song and Wang teaches the computer-implemented method of claim 19, as discussed above.  He in view of Song and Wang doesn't expressly discuss claim 20. 
Mei teaches computing a matching score mt,i by taking a sum of states ht-1 with each video-data vi to obtain a matching-vector and transforming the matching-vector to produce the matching score mt,i (Mei, pa 0065, computing the sum of E(V,S)).
It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song and Wang to have included the teachings of Mei because it allows the model to take into account both the contextual relationships (coherence) among the words in the input sentence and the relationships between the semantics of the entire sentence and video features (relevance) (Mei, pa 0066).
He in view of Song, Wang, and Mei doesn't expressly discuss the remainder of claim 20.
Yu teaches computing an attention weight wt,i for a video frame i of the instance of video content at a time t as $w_{t,i} = \frac{\exp(m_{t,i})}{\sum_{j=1}^{T} \exp(m_{t,j})}$, wherein the attention weight wt,i defines a soft-alignment between encoded sentences and video frames, such that a higher attention weight wt,i reflects more saliency attributes to a specific video frame i with respect to words in the sentence (Yu, pa 0045, computing attention weight βtm for features in video feature pool);
generating the attention-based representation, kt(A), by calculating a weighted average kt(A) of the video frames using the computed attention weights wt,i, where $k_t(A) = \sum_{i=1}^{T} w_{t,i} v_i$ (Yu, pa 0046, a single feature vector may be obtained by weighted averaging in a weighted average block $u_t = \sum_{m=1}^{KM} \beta_m^t v_m$).

Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over He in view of Song, Wang, Mei, and Yu, and further in view of Britz, Recurrent Neural Network Tutorial, Part 4, October 27, 2015.

With respect to claim 21, He in view of Song, Wang, Mei, and Yu teaches the computer-implemented method of claim 20, wherein the frame features are extracted using a pretrained spatial convolutional neural network (CNN) (Mei, pa 0020), wherein the object classification analysis is performed on the extracted frame features using a deep convolutional neural network trained for object classification, wherein the deep convolutional neural network comprises a defined number of convolutional layers and a defined number of fully connected layers (Mei, Fig. 4 & pa 0020), followed by a softmax output layer (Mei, pa 0063);
wherein the data model is used to determine a respective relevance of each respective video frame of the one or more instances of video content (He, pa 0084-0085, output of NCM can represent a relevance of each image region m to the query), wherein the last state ht-1 of the language LSTM neural network module is updated according to the following intermediate functions:
$i_t = \sigma(W_i v_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_{if} v_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o v_t + U_o h_{t-1} + b_o)$
$g_t = \tanh(W_c v_t + U_c h_{t-1} + b_c)$
$c_t = f_t c_{t-1} + i_t g_t$
$h_t = o_t \tanh(c_t)$

wherein $i_t$ represents an input gate, $f_t$ represents a forget gate, $o_t$ represents an output gate, and $c_t$ represents a cell gate of the language LSTM neural network module at a time t (He, pa 0076, equations 2-6 & Song, section 3.1 equation 1).
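For illustration only, the candidate, cell, and hidden-state updates recited above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of He, Song, or the application: the gate activations are taken as already computed, the products between gates and states are assumed to be element-wise, and the function name `lstm_state_update` and all argument names are hypothetical.

```python
import math


def lstm_state_update(i_t, f_t, o_t, v_t, h_prev, c_prev, W_c, U_c, b_c):
    """Compute g_t, c_t and h_t for one time step.

    Assumptions (not taken from the cited references): gate vectors
    i_t, f_t, o_t are supplied by the caller, and gate/state products
    are element-wise. Matrices are lists of rows.
    """
    n = len(b_c)
    # g_t = tanh(W_c v_t + U_c h_{t-1} + b_c)
    g_t = [
        math.tanh(
            sum(W_c[r][k] * v_t[k] for k in range(len(v_t)))
            + sum(U_c[r][k] * h_prev[k] for k in range(len(h_prev)))
            + b_c[r]
        )
        for r in range(n)
    ]
    # c_t = f_t * c_{t-1} + i_t * g_t (element-wise, by assumption)
    c_t = [f_t[r] * c_prev[r] + i_t[r] * g_t[r] for r in range(n)]
    # h_t = o_t * tanh(c_t)
    h_t = [o_t[r] * math.tanh(c_t[r]) for r in range(n)]
    return g_t, c_t, h_t


# Hypothetical one-dimensional example with all weights set to zero.
g, c, h = lstm_state_update(
    i_t=[1.0], f_t=[0.5], o_t=[1.0],
    v_t=[1.0], h_prev=[0.0], c_prev=[2.0],
    W_c=[[0.0]], U_c=[[0.0]], b_c=[0.0],
)
```

With zero weights the candidate $g_t$ vanishes, so the new cell state is simply the forget gate applied to the previous cell state.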
He in view of Song, Wang, Mei, and Yu doesn't expressly discuss wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis.
Britz teaches wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis (Britz, pg. 8, Adding an embedding layer, GloVe is a popular method that creates vectors with semantic meaning and allows the network to generalize to unseen words).
	It would have been obvious at the effective filing date of the invention to a person having ordinary skill in the art to which said subject matter pertains to have modified He in view of Song, Wang, Mei, and Yu with the teachings of Britz because it allows the neural network to learn less about the language by forming vectors from words (Britz, pg. 8, Adding an embedding layer).
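As a purely illustrative aid, encoding each word of a phrase as a GloVe-style vector can be sketched as a table lookup. The sketch assumes the common plain-text layout of pre-trained GloVe files (one word per line followed by its vector components). The three-dimensional values below are made up for the example, the helper names `parse_glove` and `encode_phrase` are hypothetical, and mapping unknown words to a zero vector is an assumption rather than anything taught by Britz.

```python
def parse_glove(text):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    table = {}
    for line in text.strip().splitlines():
        parts = line.split()
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def encode_phrase(phrase, table, dim):
    """Encode each word of a phrase as a vector.

    Words missing from the table map to a zero vector (an assumption
    made for this sketch only).
    """
    return [table.get(word, [0.0] * dim) for word in phrase.lower().split()]


# Toy, made-up vectors; real GloVe vectors have 50 or more dimensions.
TOY_VECTORS = """\
dog 0.1 0.2 0.3
runs 0.4 0.5 0.6
"""

table = parse_glove(TOY_VECTORS)
vectors = encode_phrase("Dog runs fast", table, dim=3)
```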


Response to Arguments
Rejection under 35 U.S.C. 103
Applicant argues that He in view of Song and Wang does not teach video content being identified based on a loss function.  The Examiner respectfully disagrees.  Since the video content captions are determined by incorporating a loss function (Song, pg. 4, section 3.2.3), the identification of video content can be “based on a loss function.”

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRITTANY N ALLEN whose telephone number is (571)270-3566.  The examiner can normally be reached on M-F 9 am - 5:00 pm EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached on 571-272-4046.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BRITTANY N ALLEN/
Primary Examiner, Art Unit 2169