DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/22/21 has been entered.
 
Remarks
In response to the amendment filed 11/22/21, claims 1, 5-8, 10, 17, 18, and 20-22 have been amended.  Claims 1-10 and 13-22 are pending in the application, of which claims 1, 10, and 17 are presented in independent form.  

In view of the Examiner's Amendment, authorized by the Attorney of Record, claims 1-10, 17-19, 21, and 22 are further amended by the examiner and claims 23-31 have been added (details provided below).

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in a telephone interview with Johnny Lam on 3/9/22.



1-9.	(Canceled)

10.	(Currently Amended)	A computer-implemented method of querying video content, the computer-implemented method comprising:
	receiving, from a requesting entity, a textual query to be evaluated relative to a video library, the video library containing a plurality of instances of video content;
	training a data model based on a plurality of training samples, wherein the data model comprises a soft-attention neural network module, a language Long Short-term Memory (LSTM) neural network module, and a video LSTM neural network module, which are jointly trained, wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content, wherein training the data model comprises, for each of the plurality of training samples:
	encoding each of the plurality of phrases for the training sample as a matrix, wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation, wherein the trained model is separate from the data model;
	determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases;
	encoding the respective instance of video content for the training sample as a sequence of frames;
	extracting frame features from the sequence of frames;
	performing an object classification analysis on the extracted frame features; and
	generating a matrix representing the respective instance of video content, based on the extracted frame features and the object classification analysis, the matrix including feature vectors;
	processing the textual query using the trained data model to identify a ranking of the plurality of instances of video content responsive to the textual query, wherein the soft-attention neural network module aligns an output of a last state of the language LSTM neural network module with feature vectors of an instance of the plurality of instances of video content, wherein the ranking is identified by generating an attention-based representation that is fed to the video LSTM neural network module, wherein the attention-based representation is generated by calculating an attention-weighted average of frames of the instance of video content based on the aligned output and minimizing a ranking loss function having a penalty function that is asymmetric; and
	returning at least an indication of the ranking of the plurality of instances of video content to the requesting entity.



13.	(Previously Presented)	The computer-implemented method of claim 10, wherein extracting the frame features from the sequence of frames is performed using a pretrained spatial convolutional neural network.

14.	(Previously Presented)	The computer-implemented method of claim 10, wherein the generated matrix representing the video is formed as [equation image: media_image1.png] of M video feature vectors, wherein each video feature vector has [equation image: media_image2.png] dimensions.

15.	(Previously Presented)	The computer-implemented method of claim 14, wherein the object classification analysis is performed on the extracted frame features using a deep convolutional neural network trained for object classification, wherein the deep convolutional neural network comprises a defined number of convolutional layers and a defined number of fully connected layers, followed by a softmax output layer.

16.	(Previously Presented)	The computer-implemented method of claim 10, wherein determining the weighted ranking between the plurality of phrases comprises:
determining, for each of the plurality of phrases, the respective length of the phrase.
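
For illustration only, the length-based ranking recited in claims 10 and 16 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: phrase length is assumed to be a whitespace-separated word count, and the normalized weight is a hypothetical choice that the claims do not specify.

```python
def rank_phrases_by_length(phrases):
    """Return (phrase, length, weight) tuples, lengthier phrases first."""
    # Assumption: "length" means the number of whitespace-separated words.
    lengths = [len(p.split()) for p in phrases]
    total = sum(lengths) or 1  # guard against division by zero
    # Assumption: the weight is each phrase's share of the total length.
    scored = [(p, n, n / total) for p, n in zip(phrases, lengths)]
    # sorted() is stable, so equal-length phrases keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


captions = ["a dog", "a dog runs on the beach", "a dog runs"]
ranked = rank_phrases_by_length(captions)
for phrase, length, weight in ranked:
    print(f"{length:2d}  {weight:.3f}  {phrase}")
```

The longest caption is listed first and the shortest last, which matches the requirement that lengthier phrases are ranked above less lengthy phrases.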


receiving, from a requesting entity, a textual query to be evaluated relative to a video library, the video library containing a plurality of instances of video content;
	training a data model based in part on a plurality of training samples, wherein the data model comprises a soft-attention neural network module that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module, wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content, wherein a first one of the plurality of training samples comprises a single-frame instance of video content generated from an image file, and wherein training the data model comprises, for each of the plurality of training samples:
encoding each of the plurality of phrases for the training sample as a matrix, wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation, wherein the trained model is separate from the data model;
	determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases; and
generating a matrix representing the respective instance of video content, based at least in part on an object classification analysis performed on frame 
	processing the textual query using the trained data model to identify a ranking of the plurality of instances of video content responsive to the textual query, wherein the soft-attention neural network module aligns an output of a last state of the language LSTM neural network module with feature vectors of an instance of the plurality of instances of video content, wherein the ranking is identified by generating an attention-based representation that is fed to the video LSTM neural network module, wherein the attention-based representation is generated by calculating an attention-weighted average of video frames of the instance of video content based on the aligned output and minimizing a ranking loss function having a penalty function that is asymmetric; and
returning at least an indication of the ranking of the plurality of instances of video content to the requesting entity.

18.	(Currently Amended)	The computer-implemented method of claim 17, wherein the textual query is projected into a joint-embedding space to determine the ranking based on a respective distance from the textual query to each of the plurality of instances of video content in the joint-embedding space, wherein an alignment module within the data model generates a matching score [equation image: media_image3.png] for each video frame [equation image: media_image4.png] at each time step t of the language LSTM neural network module, wherein the matching score [equation image: media_image5.png] and the video frame [equation image: media_image4.png] are semantically matched to one another.

19.	(Currently Amended)	The computer-implemented method of claim 18, wherein the matching score [equation image: media_image3.png] represents a determination of a relevance of the video frame [equation image: media_image4.png] and the language LSTM hidden state at the time [equation image: media_image5.png], wherein the matching score [equation image: media_image3.png] is defined as [equation image: media_image6.png], and where [equation image: media_image7.png] represents the language LSTM hidden state at the time [equation image: media_image5.png] that contains information related to a sequentially modeled sentence up to the time [equation image: media_image5.png].

20.	(Previously Presented)	The computer-implemented method of claim 19, wherein identifying the ranking comprises:
	calculating a single value for matching score [equation image: media_image3.png] by taking a sum of states [equation image: media_image8.png] with each video-data [equation image: media_image4.png] to obtain a matching-vector and transforming the matching-vector to produce the matching score [equation image: media_image3.png];
computing an attention weight [equation image: media_image9.png] for a video frame i of the instance of video content at a time t as [equation image: media_image10.png], wherein the attention weight [equation image: media_image9.png] defines a soft-alignment between encoded sentences and video frames, such that a higher attention [equation image: media_image9.png] reflects more saliency attributes to a specific video frame i with respect to words in the sentence; and
generating the attention-based representation, [equation image: media_image11.png], by calculating a weighted average [equation image: media_image11.png] of the video frames of the instance of video content using the computed attention weights [equation image: media_image9.png], where [equation image: media_image12.png].
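
For illustration only, the soft-alignment steps of claim 20 can be sketched as follows. The claim's formulas are embedded as images and are not reproduced in this record, so the sketch assumes the common soft-attention form: the attention weights are a softmax over per-frame matching scores, and the attention-based representation is the resulting weighted average of the frame feature vectors. The function names `softmax` and `attend` and the example values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, matching_scores):
    """Return (attention weights, attention-weighted average of frames)."""
    # Assumption: weights are a softmax over the per-frame matching scores.
    weights = softmax(matching_scores)
    dim = len(frame_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy frame vectors
weights, context = attend(frames, [0.0, 0.0, 0.0])  # equal scores
peaked_weights, _ = attend(frames, [10.0, 0.0, 0.0])  # frame 0 dominates
```

With equal matching scores every frame receives the same weight, so the representation is the plain mean of the frames; a much higher score for one frame concentrates nearly all of the weight on that frame.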

21.	(Currently Amended)	The computer-implemented method of claim 20, wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis, wherein the trained model comprises a GloVe model, wherein the GloVe model is trained using distinct word tokens across distinct data sets, wherein the frame features are extracted using a pretrained spatial convolutional neural network, wherein the object classification analysis is performed on the extracted frame features using a deep convolutional neural network trained for object classification, wherein the deep convolutional neural network comprises a defined number of convolutional layers and a defined number of fully connected layers, followed by a softmax output layer;
wherein based on the data model, a respective relevance of each video frame of the instance of video content is determined, wherein the last state, [equation image: media_image7.png], of the language LSTM neural network module is updated according to the following intermediate functions:

[equation image: media_image13.png]

[equation image: media_image14.png]

[equation image: media_image15.png]

[equation image: media_image16.png]

[equation image: media_image17.png]

[equation image: media_image18.png],
wherein [equation image: media_image19.png] represents an input gate, [equation image: media_image20.png] represents a forget gate, [equation image: media_image21.png] represents an output gate, and [equation image: media_image22.png] represents a cell gate of the language LSTM neural network module at a time t.
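
For reference only, a conventional LSTM update has the form below. The six intermediate functions of claim 21 are embedded as images and are not recoverable from this record, so the symbols here (weight matrices W and U, biases b, input x_t, cell gate g_t, cell state c_t, hidden state h_t) are assumptions drawn from the standard formulation and may differ from the claimed notation.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(cell gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```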

22.	(Currently Amended)	The computer-implemented method of claim 21, wherein each of the data model, the trained model, the pretrained spatial convolutional neural network, and the deep convolutional neural network comprises a distinct model, wherein a second one of the plurality of training samples is generated by sampling video content at a predefined interval, wherein each instance of video content comprises a respective video, wherein each phrase comprises a caption, wherein each word of each phrase is encoded as a respective word vector using the trained model for word representation;
wherein the ranking loss function comprises a pairwise ranking loss function given by:

[equation image: media_image23.png]

where (c, v) represents a ground-truth pair of caption and video described by the caption, where c’ represents contrastive captions not describing the video v, where v’ represents contrastive videos not described by the caption c, where α represents a margin hyperparameter, and where S represents a similarity function;
wherein the similarity function includes the penalty function, which comprises , wherein the penalty function is given by:

[equation image: media_image24.png]

where E represents an asymmetric order-violation function given by[[;]]:

[equation image: media_image25.png]

wherein the asymmetric order-violation function is configured to capture, regardless of caption length, relatedness of captions describing, at different levels of detail, a same video.
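
For illustration only, a pairwise ranking loss with an asymmetric order-violation penalty can be sketched as follows. The three formulas of claim 22 are embedded as images and are not recoverable from this record, so the sketch assumes the formulation of Vendrov et al. cited in this action: E(x, y) = ||max(0, y - x)||^2 and S(c, v) = -E(v, c). The argument order, the default margin value, and the function names are assumptions.

```python
def order_violation(x, y):
    """Assumed form E(x, y) = ||max(0, y - x)||^2.

    Zero when y <= x in every coordinate, positive otherwise, so in
    general E(x, y) != E(y, x): the penalty is asymmetric.
    """
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def similarity(caption_vec, video_vec):
    # Assumed convention: S(c, v) = -E(v, c).
    return -order_violation(video_vec, caption_vec)


def pairwise_ranking_loss(c, v, contrastive_captions, contrastive_videos,
                          alpha=0.05):
    """Hinge loss for one ground-truth (c, v) pair with margin alpha."""
    s_true = similarity(c, v)
    loss = 0.0
    for c_neg in contrastive_captions:  # captions c' not describing v
        loss += max(0.0, alpha - s_true + similarity(c_neg, v))
    for v_neg in contrastive_videos:  # videos v' not described by c
        loss += max(0.0, alpha - s_true + similarity(c, v_neg))
    return loss


c, v = [0.5, 0.5], [1.0, 1.0]  # toy embeddings with no order violation
separated = pairwise_ranking_loss(c, v, [[2.0, 2.0]], [[0.0, 0.0]])
confused = pairwise_ranking_loss(c, v, [[0.5, 0.5]], [])
```

In this toy example the first call yields zero loss because both contrastive items score at least the margin below the true pair, while the second call incurs exactly the margin because the contrastive caption is indistinguishable from the true caption.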

23.	(New)	A system to query video content, the system comprising:
one or more computer processors;
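a memory containing a program executable by the one or more computer processors to perform an operation comprising:
	receiving, from a requesting entity, a textual query to be evaluated relative to a video library, the video library containing a plurality of instances of video content;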
	training a data model based on a plurality of training samples, wherein the data model comprises a soft-attention neural network module, a language Long Short-term Memory (LSTM) neural network module, and a video LSTM neural network module, which are jointly trained, wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content, wherein training the data model comprises, for each of the plurality of training samples:
	encoding each of the plurality of phrases for the training sample as a matrix, wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation, wherein the trained model is separate from the data model;
	determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases;
	encoding the respective instance of video content for the training sample as a sequence of frames;
	extracting frame features from the sequence of frames;
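	performing an object classification analysis on the extracted frame features; and
	generating a matrix representing the respective instance of video content, based on the extracted frame features and the object classification analysis, the matrix including feature vectors;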
	processing the textual query using the trained data model to identify a ranking of the plurality of instances of video content responsive to the textual query, wherein the soft-attention neural network module aligns an output of a last state of the language LSTM neural network module with feature vectors of an instance of the plurality of instances of video content, wherein the ranking is identified by generating an attention-based representation that is fed to the video LSTM neural network module, wherein the attention-based representation is generated by calculating an attention-weighted average of frames of the instance of video content based on the aligned output and minimizing a ranking loss function having a penalty function that is asymmetric; and
	returning at least an indication of the ranking of the plurality of instances of video content to the requesting entity.

24.	(New)	The system of claim 23, wherein extracting the frame features from the sequence of frames is performed using a pretrained spatial convolutional neural network.

25.	(New)	The system of claim 23, wherein the generated matrix representing the video is formed as [equation image: media_image1.png] of M video feature vectors, wherein each video feature vector has [equation image: media_image2.png] dimensions.

26.	(New)	The system of claim 25, wherein the object classification analysis is performed on the extracted frame features using a deep convolutional neural network trained for object classification, wherein the deep convolutional neural network comprises a defined number of convolutional layers and a defined number of fully connected layers, followed by a softmax output layer.

27.	(New)	The system of claim 23, wherein determining the weighted ranking between the plurality of phrases comprises:
determining, for each of the plurality of phrases, the respective length of the phrase.


28.	(New)	The system of claim 23, wherein each word within the plurality of phrases is encoded as a vector using Global Vectors for Word Representation (GloVe) analysis, wherein the trained model comprises a GloVe model, wherein the GloVe model is trained using distinct word tokens across distinct data sets.

29.	(New)	The system of claim 23, wherein the textual query is projected into a joint-embedding space to determine the ranking based on a respective distance from the textual query to each of the plurality of instances of video content in the joint-embedding space.
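
For illustration only, ranking by distance in a joint-embedding space, as recited in claims 18 and 29, can be sketched as follows. The sketch assumes that the query and each video have already been projected to fixed-length vectors and that distance is Euclidean; the claims do not name a metric, and the identifiers below are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_videos(query_vec, video_vecs):
    """Return video ids ordered from nearest to farthest from the query.

    video_vecs maps a (hypothetical) video id to its embedding vector.
    """
    return sorted(video_vecs,
                  key=lambda vid: euclidean(query_vec, video_vecs[vid]))


library = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [0.9, 1.0]}
ranking = rank_videos([1.0, 1.0], library)
print(ranking)
```

The returned list is the indication of the ranking that would be sent back to the requesting entity, with the closest video first.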



31.	(New)	The system of claim 23, wherein the penalty function comprises a negative order-violation penalty.

Allowance
Claims 10 and 13-31 (renumbered as claims 1-20) are allowed over the prior art.

The following is an examiner’s statement of reasons for allowance:
The applicant’s amendment, filed on 11/22/21, and the examiner's amendment authorized by the attorney of record on 3/9/22, overcome the cited prior art with respect to the independent claims:
“receiving, from a requesting entity, a textual query to be evaluated relative to a video library, the video library containing a plurality of instances of video content;
	training a data model based on a plurality of training samples, wherein the data model comprises a soft-attention neural network module, a language Long Short-term Memory (LSTM) neural network module, and a video LSTM neural network module, which are jointly trained, wherein each of the plurality of training samples includes (i) a respective instance of video content and (ii) a respective plurality of phrases describing the respective instance of video content, wherein training the data model comprises, for each of the plurality of training samples:
	encoding each of the plurality of phrases for the training sample as a matrix, wherein each word within the plurality of phrases is encoded as a vector using a trained model for word representation, wherein the trained model is separate from the data model;
	determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases;
	encoding the respective instance of video content for the training sample as a sequence of frames;
	extracting frame features from the sequence of frames;
	performing an object classification analysis on the extracted frame features; and
	generating a matrix representing the respective instance of video content, based on the extracted frame features and the object classification analysis, the matrix including feature vectors;
	processing the textual query using the trained data model to identify a ranking of the plurality of instances of video content responsive to the textual query, wherein the soft-attention neural network module aligns an output of a last state of the language LSTM neural network module with feature vectors of an instance of the plurality of instances of video content, wherein the ranking is identified by generating an attention-based representation that is fed to the video LSTM neural network module, wherein the attention-based representation is generated by calculating an attention-weighted average of frames of the instance of video content based on the aligned output and minimizing a ranking loss function having a penalty function that is asymmetric; and
	returning at least an indication of the ranking of the plurality of instances of video content to the requesting entity.”

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Vendrov et al. Order-Embeddings of Images and Language. 2015. Discloses determining appropriate captions for images and evaluating a query caption for determining a ranked dataset of images.  When comparing the query caption to image captions, a pairwise ranking loss with an asymmetric order-violation penalty is used (Vendrov, section 4).  However, Vendrov does not discuss training a data model as claimed in detail, “determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases;” or “generating an attention-based representation that is fed to the video LSTM neural network module, wherein the attention-based representation is generated by calculating an attention-weighted average of frames of the instance of video content based on the aligned output.”



Song et al., “Hierarchical LSTMs with Adaptive Attention for Video Captioning.” August 2015. Is directed towards applying attention mechanisms in a hierarchical LSTM that considers visual information and contextual information.  Song does not teach comparing captions to a query, “determining a weighted ranking between the plurality of phrases, based on a respective length of each phrase, such that lengthier phrases are ranked above less lengthy phrases;”, or determining a ranking by “minimizing a ranking loss function having a penalty function that is asymmetric.”

Ran Xu, Caiming Xiong, Wei Chen, and Jason J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15). AAAI Press, 2346–2352. Discusses a joint video-language model where the distance 

Agrawal, Harsh, et al. "Sort story: Sorting jumbled images and captions into stories." arXiv preprint arXiv:1606.07493 (2016) discusses aligning image-caption pairs for creating an ordered sequence.  Agrawal et al. does not disclose determining a ranking using the data model as claimed.

	The prior art of record does not disclose, teach, or suggest the above claimed limitations (in combination with all other features in the claim).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRITTANY N ALLEN whose telephone number is (571)270-3566.  The examiner can normally be reached on M-F 9 am - 5:00 pm EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached on 571-272-4046.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/BRITTANY N ALLEN/Primary Examiner, Art Unit 2169