DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Election/Restrictions
Applicant’s election without traverse of claims 1-15 in the reply filed on 12 September 2022 is acknowledged.

Drawings
Color photographs and color drawings are not accepted in utility applications unless a petition filed under 37 CFR 1.84(a)(2) is granted. Any such petition must be accompanied by the appropriate fee set forth in 37 CFR 1.17(h), one set of color drawings or color photographs, as appropriate, if submitted via EFS-Web or three sets of color drawings or color photographs, as appropriate, if not submitted via EFS-Web, and, unless already present, an amendment to include the following language as the first paragraph of the brief description of the drawings section of the specification:

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

Color photographs will be accepted if the conditions for accepting color drawings and black and white photographs have been satisfied. See 37 CFR 1.84(b)(2).
Note that the requirement for three sets of color drawings under 37 C.F.R. § 1.84(a)(2)(ii) is not applicable to color drawings submitted via EFS-Web. Therefore, only one set of such color drawings is necessary when filing via EFS-Web.
The drawings are objected to because Figures 2, 6A and 6B include color, but no petition under 37 C.F.R. § 1.84(a)(2) has been filed.

Information Disclosure Statement
Items CB, CL, CP, CT, CU and CW in the information disclosure statement filed 14 February 2020 fail to comply with the provisions of 37 CFR 1.97, 1.98 and MPEP § 609 because no date has been provided.  The items have been placed in the application file, but the information referred to therein has not been considered as to the merits.  Applicant is advised that the date of any re-submission of any item of information contained in this information disclosure statement or the submission of any missing element(s) will be the date of submission for purposes of determining compliance with the requirements based on the time of filing the statement, including all certification requirements for statements under 37 CFR 1.97(e).  See MPEP § 609.05(a).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 6, 8, 9, 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Luo, Minnan, et al. "Simple to complex cross-modal learning to rank." Computer Vision and Image Understanding 163 (2017): 67-77, hereinafter, “Luo”, in view of Wang, Liwei, Yin Li, and Svetlana Lazebnik. "Learning Deep Structure-Preserving Image-Text Embeddings." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, hereinafter, “Wang”.

As per claim 1, Luo discloses a system for training a cross-modal search system (Luo, Abstract, cross-modal retrieval; Luo, page 4, 2.2. SPL and SPLD, training the multi-modal embedding space), comprising: 
a training dataset including first objects of a first modality and second objects of a second modality that are associated with the first objects, respectively, wherein the first modality is different than the second modality, and wherein the second objects include text that is descriptive of the first objects (Luo, page 69, 3.1. Problem Formulation, the training dataset consists of n image-text pairs); 
a first matrix including first relevance values indicative of relevance between the first objects and the second objects, respectively (Luo, pages 69-70, 3.1. Problem Formulation, We associate each image with a natural language description such that the training dataset consists of n image-text pairs, i.e. D = {(xi, zi): i = 1, 2, ..., n}, where xi ∈ X ⊆ Rp represents a p-dimensional visual feature vector extracted from the i-th image and zi ∈ Z ⊆ Rq refers to a q-dimensional feature vector extracted from the i-th text (sentence) … we collect all sentences and images in X = {x1, x2, ..., xn} and Z = {z1, z2, ..., zn}, respectively. Note that the order of components in X and Z should correspond with each other such that the i-th image xi ∈ X and the i-th text zi ∈ Z come from the same pair in D … Given an image query x ∈ X, we define a non-linear mapping from image feature space into the shared multimodal embedding space via h [Equation 1], where W1 is a d x p transformation matrix and b1 ∈ Rd is a bias vector. Similarly, we map each text feature into the shared embedding space by non-linear mapping g [Equation 2] where W2 is a d x p transformation matrix and b2 ∈ Rd is a bias vector. Through non-linear mapping h and g, the similarity measurement (relevance score) S(x, z) between image query x and the retrieved text z can be obtained via computing the cosine similarity in the shared embedding space ... the underlying correspondence between image and text lies in the embedding parameters W1, b1 and W2, b2); 
a second matrix including second relevance values indicative of relevance between the second objects and the first objects, respectively (Luo, pages 69-70, 3.1. Problem Formulation, We associate each image with a natural language description such that the training dataset consists of n image-text pairs, i.e. D = {(xi, zi): i = 1, 2, ..., n}, where xi ∈ X ⊆ Rp represents a p-dimensional visual feature vector extracted from the i-th image and zi ∈ Z ⊆ Rq refers to a q-dimensional feature vector extracted from the i-th text (sentence) … we collect all sentences and images in X = {x1, x2, ..., xn} and Z = {z1, z2, ..., zn}, respectively. Note that the order of components in X and Z should correspond with each other such that the i-th image xi ∈ X and the i-th text zi ∈ Z come from the same pair in D … Given an image query x ∈ X, we define a non-linear mapping from image feature space into the shared multimodal embedding space via h [Equation 1], where W1 is a d x p transformation matrix and b1 ∈ Rd is a bias vector. Similarly, we map each text feature into the shared embedding space by non-linear mapping g [Equation 2] where W2 is a d x p transformation matrix and b2 ∈ Rd is a bias vector. Through non-linear mapping h and g, the similarity measurement (relevance score) S(x, z) between image query x and the retrieved text z can be obtained via computing the cosine similarity in the shared embedding space ... the underlying correspondence between image and text lies in the embedding parameters W1, b1 and W2, b2); and 
a training module configured to: 
based on similarities between ones of the second objects, generate a third matrix by selectively adding first additional relevance values to the first matrix (Luo, pages 68-69, 1. Introduction, Given each image query, the retrieved sentences are ordered according to their ranking loss, as specified by the numbers in Fig. 1. It is reasonable to believe that the sentences ranked higher, i.e., with a smaller loss, are usually more accurate and important ... select ranking sentences together with the corresponding image queries … we adaptively assign each ranking with an importance weight and learn a more optimal multi-modal embedding space gradually from easy to more complex rankings with respect to diverse image queries … we associate each ranking by cross-modal query with an importance weight to train the CMLR model; Luo, page 70, 3.1. Problem Formulation, assume the aligned text zk ranks higher than the other text zj ∈ Z (j ≠ k) given an image query … we associate each image query xk a tetrad set).
Luo does not explicitly disclose the following limitation as further recited; however, Wang discloses:
based on the similarities between the ones of the second objects, generate a fourth matrix by selectively adding second additional relevance values to the second matrix (Wang, page 5006, 1. Introduction, preserve neighborhood structure within each individual view. Specifically, in the learned latent space, we want images (resp. sentences) with similar meaning to be close to each other ... for each image its target neighbors from the same class are closer than samples from other classes; Wang, page 5006, 2.1. Network Structure, As shown in Figure 1, our deep model has two branches, each composed of fully connected layers with weight matrices Wl and Vl; Wang, pages 5006-5007, 2.2. Training Objective, Our training objective is a stochastic margin-based loss that includes bidirectional cross-view ranking constraints, together with within-view structure-preserving constraints ... Structure-preserving constraints. Let N(xi) denote the neighborhood of xi containing images that share the same meaning. In our case, this is the set of images described by the same sentence as xi. Then we want to enforce a margin of m between N(xi) and any point outside of the neighborhood [Equation 3]. Analogously to (3), we define the constraints for the sentence side as [Equation 4] where N(yi') contains sentences describing the same image ... Figure 2 gives an intuitive illustration of how within-view structure preservation can help with cross-view matching ... within-view structure constraints are added, pushing semantically similar sentences (same color circles) closer to each other ... The weights λ2, λ3 control the importance of the structure-preserving terms, which act as regularizers for the bi-directional retrieval tasks); and 
store the third and fourth matrices in memory of a search module for cross-modal retrieval in response to receipt of search queries (Wang, page 5008, 3.1. Features and Network Settings, memory and training time ... On the image (X) side, when using 4096-dimensional visual features, W1 is a 4096 × 2048 matrix, and W2 is a 2048 × 512 matrix. That is, the output dimensions of the two layers are [2048, 512]. On the text (Y) side, the output dimensions of the V1 and V2 layers are [2048, 512]; Wang, page 5008, 3.2. Image-sentence retrieval, given a test set of 1000 images and 5000 corresponding sentences, we use the images to retrieve sentences and vice versa, and report performance).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo to include the within-view neighborhood structure as taught by Wang in order to cluster images / sentences with similar meaning close to each other in the learned latent embedding space so that target samples from within the same class are closer to each other than samples from other classes thereby improving accuracy of matching and image-to-text and text-to-image retrieval (Wang, Abstract; Wang, page 5006, 1. Introduction).
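
For illustration only, the relevance scores S(x, z) cited from Luo above can be read as the entries of an n × n matrix computed over the training pairs. The following Python sketch assumes a tanh nonlinearity for the mappings h and g, since the cited passage identifies only a transformation matrix and a bias vector for each mapping; the function names are hypothetical and are not taken from either reference.

```python
import math

def affine_tanh(W, b, v):
    # Hypothetical stand-in for the mappings h and g: tanh(W v + b).
    # The tanh nonlinearity is an assumption made for this sketch.
    return [math.tanh(sum(w * x for w, x in zip(row, v)) + bi)
            for row, bi in zip(W, b)]

def cosine(u, v):
    # Cosine similarity between two embedded vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relevance_matrix(images, texts, W1, b1, W2, b2):
    # Entry [i][j] is S(x_i, z_j): the cosine similarity between the
    # embedded i-th image and the embedded j-th text.
    hx = [affine_tanh(W1, b1, x) for x in images]
    gz = [affine_tanh(W2, b2, z) for z in texts]
    return [[cosine(h, g) for g in gz] for h in hx]
```

In this reading, the diagonal entries correspond to the aligned image-text pairs of the training dataset and the off-diagonal entries to misaligned pairs.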

As per claim 2, Luo and Wang disclose the system of claim 1. Wang further discloses the system comprising: 
a fifth matrix including third relevance values indicative of relevance between the first objects and the first objects, respectively (Wang, page 5007, Triplet sampling. Our loss involves all triplets consisting of a target instance, a positive match, and a negative match ... This is done by computing pairwise similarities between all (xi, yj), (xi, xj) and (yi, yj)); and 
a sixth matrix including fourth relevance values indicative of relevance between the second objects and the second objects, respectively (Wang, page 5007, Triplet sampling. Our loss involves all triplets consisting of a target instance, a positive match, and a negative match ... This is done by computing pairwise similarities between all (xi, yj), (xi, xj) and (yi, yj)).  The motivation would be the same as above in claim 1.

As per claim 3, Luo and Wang disclose the system of claim 1, wherein the training module is further configured to: 
based on the similarities between the ones of the second objects, generate a seventh matrix by selectively adding third additional relevance values to the fifth matrix (Luo, pages 68-69, Introduction, Given each image query, the retrieved sentences are ordered according to their ranking loss, as specified by the numbers in Fig. 1. It is reasonable to believe that the sentences ranked higher, i.e., with a smaller loss, are usually more accurate and important ... select ranking sentences together with the corresponding image queries … we adaptively assign each ranking with an importance weight); 
based on the similarities between the ones of the second objects, generate an eighth matrix by selectively adding fourth additional relevance values to the sixth matrix; and store the seventh and eighth matrices in the memory of the search module for the cross-modal retrieval in response to receipt of search queries (Luo, page 70, 3.2. Self-paced CMLR with diversity regularization, the importance weight vkj ∈ [0, 1] is updated for each tetrad (xk, zk, zj, ykj) with fixed embedding parameters W; Wang, page 5008, 3.1. Features and Network Settings).

As per claim 6, Luo and Wang disclose the system of claim 1 wherein the first objects are one of images, sounds, and videos (Luo, page 69, 3.1. Problem Formulation, the training dataset consists of n image-text pairs).

As per claim 8, Luo and Wang disclose the system of claim 1.  Wang discloses wherein the training module is configured to: 
determine triplet losses based on triplets of the training dataset and using the third and fourth matrices (Wang, page 5007, Triplet sampling. Our loss involves all triplets consisting of a target instance, a positive match, and a negative match ... This is done by computing pairwise similarities between all (xi, yj), (xi, xj) and (yi, yj) within the mini-batch. For each positive pair (i.e., a ground truth image-sentence pair, two neighboring images, or two neighboring sentences), we then find at most top K violations of each relevant constraint); 
train first and second functions for cross-modal retrieval based on the triplet losses (Wang, page 5007, Triplet sampling. Our loss involves all triplets consisting of a target instance, a positive match, and a negative match ... This is done by computing pairwise similarities between all (xi, yj), (xi, xj) and (yi, yj) within the mini-batch. For each positive pair (i.e., a ground truth image-sentence pair, two neighboring images, or two neighboring sentences), we then find at most top K violations of each relevant constraint ... For the experiments with the structure-preserving constraints, in order to get a non-empty set of constraint triplets, we need a moderate number of positive pairs (i.e., at least two sentences that are matched to the same image) in each mini-batch); and 
store the first and second functions in the memory of the search module (Wang, page 5008, 3.1. Features and Network Settings, memory and training time ... On the image (X) side, when using 4096-dimensional visual features, W1 is a 4096 × 2048 matrix, and W2 is a 2048 × 512 matrix. That is, the output dimensions of the two layers are [2048, 512]. On the text (Y) side, the output dimensions of the V1 and V2 layers are [2048, 512]; Wang, page 5008, 3.2. Image-sentence retrieval, given a test set of 1000 images and 5000 corresponding sentences, we use the images to retrieve sentences and vice versa, and report performance). The motivation is the same as above in claim 1.
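
For illustration only, the triplet sampling cited from Wang (a target instance, a positive match, a negative match, and at most the top K violations per positive pair) may be sketched as below. The sketch assumes a generic margin-based triplet hinge on distances; the function names and the exact loss form are assumptions and are not quoted from Wang.

```python
def triplet_loss(d_pos, d_neg, margin):
    # Hinge on the margin constraint d_pos + margin < d_neg.
    # A positive value means the triplet violates the constraint.
    return max(0.0, margin + d_pos - d_neg)

def top_k_violations(d_pos, d_negs, margin, k):
    # For one positive pair, keep at most the k largest violations
    # among the candidate negatives (non-violating triplets are dropped).
    losses = sorted((triplet_loss(d_pos, d, margin) for d in d_negs),
                    reverse=True)
    return [loss for loss in losses[:k] if loss > 0.0]
```

A negative that already lies farther away than the positive by more than the margin contributes no loss and is never selected.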

As per claim 9, Luo and Wang disclose the system of claim 1 wherein the training module is configured to: 
determine the quantized mean average precision (mAP) losses based on the training dataset and using the third and fourth matrices (Luo, page 73, 5.4. Evaluation metric, Mean Average Precision (mAP) is used as an evaluation metric ... Given the Average Precision (AP) of all queries, mAP is the mean of all AP values. And the value AP of a query is calculated according to the formula (30) where Y and Y denotes the true ranking list and the predicted ranking list); 
train first and second functions for cross-modal retrieval based on the quantized mAP losses (Luo, page 70, 3.2. Self-paced CMLR with diversity regularization, the importance weight vkj ∈ [0, 1] is updated for each tetrad (xk, zk, zj, ykj) with fixed embedding parameters W ... it is necessary to impose the diversity regularization on the importance weights vector, such that the selected tetrads are scattered over different image queries; Luo, page 74, Figures 3-5); and 
store the first and second functions in the memory of the search module (Wang, page 5008, 3.1. Features and Network Settings).
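
For illustration only, the Mean Average Precision metric cited from Luo may be sketched as below. The sketch uses the conventional binary-relevance definition of Average Precision, which is an assumption; Luo's formula (30) is not reproduced in this action and may differ in form. The sketch shows only the evaluation metric and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 0/1 flags in predicted ranking order.
    # AP is the mean of precision@rank taken at each relevant position.
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    # mAP is the mean of the AP values over all queries.
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```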

As per claim 11, Luo and Wang disclose the system of claim 1 wherein the third and fourth matrices include values selected from a group consisting of 0 and 1 (Luo, page 70, 3.2. Self-paced CMLR with diversity regularization, the importance weight vkj ∈ [0, 1] is updated for each tetrad (xk, zk, zj, ykj) with fixed embedding parameters W).

As per claim 12, Luo and Wang disclose the system of claim 1 wherein the third and fourth matrices include values selected from a group consisting of 0, 1, and values between 0 and 1 (Luo, page 68, Figure 1; Luo, page 70, 3.1. Problem Formulation, For each ranking text by the k-th image query xk, we define the incurred ranking loss function as (7) where W = {W1, b1, W2, b2} collects the embedding parameters used in functions (1) and (2); l(xk, zk, zj, ykj; W) is usually given as a hinge loss by l(xk, zk, zj, ykj; W) = max(0, ykj[S(xk, zj) − S(xk, zk)] + Δ) (8) with margin Δ ≥ 0 … encourages aligned image-text pairs in D to have a higher score than misaligned pairs by a margin).
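
For illustration only, the hinge loss of Luo's equation (8) as quoted above, max(0, ykj[S(xk, zj) − S(xk, zk)] + Δ), may be written as the following Python sketch. The function and parameter names are hypothetical.

```python
def hinge_ranking_loss(s_aligned, s_other, y, delta):
    # s_aligned: S(xk, zk), score of the aligned text for image query xk.
    # s_other:   S(xk, zj), score of another text zj.
    # y:         ranking label ykj for the tetrad.
    # delta:     margin, assumed non-negative as in the quoted passage.
    return max(0.0, y * (s_other - s_aligned) + delta)
```

With y = 1, the loss is zero whenever the aligned pair outscores the other text by at least the margin, and positive otherwise.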


Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Luo, Minnan, et al. "Simple to complex cross-modal learning to rank." Computer Vision and Image Understanding 163 (2017): 67-77, hereinafter, “Luo”, in view of Wang, Liwei, Yin Li, and Svetlana Lazebnik. "Learning Deep Structure-Preserving Image-Text Embeddings." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, hereinafter, “Wang”, as applied to claim 1 above, and further in view of Jiang, Lu, et al. "Self-paced learning with diversity." Advances in neural information processing systems 27 (2014), hereinafter, “Jiang”.

As per claim 4, Luo and Wang disclose the system of claim 1, and Luo further discloses incorporating self-paced learning with diversity (Luo, page 68, 1. Introduction, To this end, we incorporate a self-paced learning with diversity (SPLD) theory into CMLR to train an optimal embedding space). Luo and Wang do not explicitly disclose the following limitations as further recited; however, Jiang discloses wherein the training module is configured to add a first relevance value to the first matrix when a first similarity value representative of a similarity between a first one of the second objects and a second one of the second objects is greater than a predetermined threshold value (Jiang, page 4, 4.2 SPLD Algorithm, input the groups of samples, the up-to-date model parameter w, and two self-paced parameters, and outputs the optimal v ... Samples with L(yi, f(xi, w)) < λ will be selected in training (vi = 1) ... Samples with L(yi, f(xi, w)) > λ + γ will not be selected in training (vi = 0) ... Other samples will be selected by comparing their losses to a threshold λ ... The sample with a smaller loss than the threshold will be selected in training ... the threshold decreases considerably as the rank i grows ... We study a tractable example that allows for clearer diagnosis in Fig. 2, where each keyframe represents a video sample on the event “Rock Climbing” of the TRECVID MED data, and the number below indicates its loss. The samples are clustered into four groups based on the visual similarity).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo and Wang to include the thresholds as taught by Jiang in order to ensure diversity in sample selection thereby preventing the system from monotonously selecting samples from the same group (Jiang, page 4, 4.2 SPLD Algorithm). 

As per claim 5, Luo, Wang and Jiang disclose the system of claim 4. Jiang discloses wherein the training module is configured to add a second relevance value to the second matrix when a second similarity value representative of a second similarity between a third one of the second objects and a fourth one of the second objects is greater than the predetermined threshold value (Jiang, page 5, Figure 2; Jiang, pages 4-5, 4.2 SPLD Algorithm, input the groups of samples, the up-to-date model parameter w, and two self-paced parameters, and outputs the optimal v ... Samples with L(yi, f(xi, w)) < λ will be selected in training (vi = 1) ... Samples with L(yi, f(xi, w)) > λ + γ will not be selected in training (vi = 0) ... Other samples will be selected by comparing their losses to a threshold λ ... The sample with a smaller loss than the threshold will be selected in training ... the threshold decreases considerably as the rank i grows ... We study a tractable example that allows for clearer diagnosis in Fig. 2, where each keyframe represents a video sample on the event “Rock Climbing” of the TRECVID MED data, and the number below indicates its loss. The samples are clustered into four groups based on the visual similarity … When λ ≠ 0 and γ ≠ 0 in Fig. 2(b), SPLD balances the easiness and the diversity, and produces a reasonable and diverse curriculum ... In an extreme case where λ = 0 and γ ≠ 0, as illustrated in Fig. 2(c), SPLD selects only diverse samples). The motivation would be the same as above in claim 4.
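
For illustration only, the group-wise threshold selection cited from Jiang may be sketched as below. The sketch assumes a rank-dependent threshold of the form λ + γ/(√i + √(i − 1)) within each group, which is consistent with the quoted bounds (always selected below λ, never selected above λ + γ, threshold decreasing as the rank i grows); that form and the function name are assumptions and should be checked against Jiang's algorithm listing.

```python
import math

def spld_select(groups, lam, gamma):
    # groups: dict mapping a group name to the list of sample losses.
    # Returns, per group, a 0/1 selection vector v in the input order.
    selected = {}
    for name, losses in groups.items():
        # Rank samples within the group by ascending loss.
        order = sorted(range(len(losses)), key=lambda j: losses[j])
        v = [0] * len(losses)
        for rank, j in enumerate(order, start=1):
            # Assumed threshold: equals lam + gamma at rank 1 and
            # approaches lam as the rank grows.
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            if losses[j] < threshold:
                v[j] = 1
        selected[name] = v
    return selected
```

Because the threshold shrinks with rank inside each group, a further sample from an already-represented group must have a lower loss to be selected than the first sample from a new group, which is how the selection favors diversity.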


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Luo, Minnan, et al. "Simple to complex cross-modal learning to rank." Computer Vision and Image Understanding 163 (2017): 67-77, hereinafter, “Luo”, in view of Wang, Liwei, Yin Li, and Svetlana Lazebnik. "Learning Deep Structure-Preserving Image-Text Embeddings." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, hereinafter, “Wang”, as applied to claim 1 above, and further in view of Wu, Fei, et al. "Cross-media semantic representation via bi-directional learning to rank." Proceedings of the 21st ACM international conference on Multimedia. 2013, hereinafter, “Wu”.

As per claim 7, Luo and Wang disclose the system of claim 1 and the third and fourth matrices (Luo, pages 68-69, 1. Introduction; Wang, page 5006, 1. Introduction; Wang, page 5006, 2.1. Network Structure), but do not explicitly disclose the following limitations as further recited; however, Wu discloses wherein the training module is configured to: 
determine listwise losses based on the training dataset (Wu, Abstract, cross-media ranking algorithm to optimize the bi-directional listwise ranking loss with a latent space embedding … The latent space embedding is discriminatively learned by the structural large margin learning for optimization with certain ranking criteria); 
train first and second functions for cross-modal retrieval based on the listwise losses (Wu, Abstract, cross-media ranking algorithm to optimize the bi-directional listwise ranking loss with a latent space embedding … The latent space embedding is discriminatively learned by the structural large margin learning for optimization with certain ranking criteria; Wu, page 878, 1. Introduction, learn a latent space which can be applied to both image-query-text retrieval and text-query image retrieval, assuming bi-directional ranking examples); and 
store the first and second functions in the memory of the search module (Wu, page 880, 3.2 The Linear Mapping Functions, the text and the image are mapped to a common k-dimensional latent aspect space ... a k-dimensional latent aspect space but are also faster to compute and lead to much smaller storage by representing the imagery and text in the k dimensions).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo and Wang to use the listwise loss function as taught by Wu as an alternative means to evaluate the accuracy of a ranking list based on the relevance between the query and retrieved documents (Wu, page 882, 3.4 Algorithm and Implementation).


Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Luo, Minnan, et al. "Simple to complex cross-modal learning to rank." Computer Vision and Image Understanding 163 (2017): 67-77, hereinafter, “Luo”, in view of Wang, Liwei, Yin Li, and Svetlana Lazebnik. "Learning Deep Structure-Preserving Image-Text Embeddings." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, hereinafter, “Wang”, as applied to claim 1 above, and further in view of Hua, Yan, et al. "Cross-modal correlation learning by adaptive hierarchical semantic aggregation." IEEE Transactions on Multimedia 18.6 (2016): 1201-1216, hereinafter, “Hua”.

As per claim 10, Luo and Wang disclose the system of claim 1 and the third and fourth matrices (Luo, pages 68-69, 1. Introduction; Wang, page 5006, 1. Introduction; Wang, page 5006, 2.1. Network Structure), but do not explicitly disclose the following limitations as further recited; however, Hua discloses wherein the training module is configured to: 
determine the quantized normalized discounted cumulative gain (NDCG) losses based on the training dataset (Hua, pages 1209-1210, Evaluation criteria: We adopt ... normalized discount cumulative gain (NDCG) for performance evaluation ... To measure the performance on data with multi-level semantic relevance, we adopt NDCG); 
train first and second functions for cross-modal retrieval based on the quantized NDCG losses and store the first and second functions in the memory of the search module (Hua, page 1204, IV. Semantic Hierarchy, We are given a cross-modal dataset ... where xi ∈ Rdx and yi ∈ Rdy denote the ith training data pair from X and Y modalities, respectively. ci ∈ {1, 2, . . . , C} denotes the category index of the ith training pair. Since there is complicated semantic relation among the categories, we construct a semantic category hierarchy H on D by combining the similarity modeling from visual domain, textual domain and ontology relatedness).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo and Wang to use the normalized discounted cumulative gain measurement as taught by Hua in order to measure the ranking list based on the positions of objects with different degrees of relevance (Hua, pages 1209-1210, Evaluation criteria).
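
For illustration only, the normalized discounted cumulative gain measurement cited from Hua may be sketched as below. The sketch uses the widely used exponential-gain, log2-discount definition, which is an assumption; Hua's exact formula is not reproduced in this action. The sketch shows only the evaluation measure and the function names are hypothetical.

```python
import math

def dcg(relevances):
    # relevances: graded relevance levels in predicted ranking order.
    # Gain 2^rel - 1 is discounted by log2(position + 1).
    return sum((2 ** rel - 1) / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Placing a highly relevant object lower in the list reduces the score, which reflects the position-sensitive, multi-level relevance measurement relied upon above.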


Claims 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Luo, Minnan, et al. "Simple to complex cross-modal learning to rank." Computer Vision and Image Understanding 163 (2017): 67-77, hereinafter, “Luo”, in view of Wang, Liwei, Yin Li, and Svetlana Lazebnik. "Learning Deep Structure-Preserving Image-Text Embeddings." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, hereinafter, “Wang”, and further in view of Lee et al., U.S. Publication No. 2020/0097604, hereinafter, “Lee”.

As per claim 13, Luo discloses a method for cross-modal search, comprising: 
receiving, at a search module, a first search query in a first modality (Luo, page 68, 1. Introduction, Given each image query, the retrieved sentences are ordered according to their ranking loss); 
accessing a matrix in memory of the search module in response to the first search query (Luo, page 69, 3.1. Problem Formulation, W1 is a d x p transformation matrix … W2 is a d x p transformation matrix); 
encoding the first search query (Luo, page 69, 3.1. Problem Formulation, a p-dimensional visual feature vector extracted from the i-th image and zi ... a q-dimensional feature vector extracted from the i-th text); 
wherein the third and fourth matrices are generated by: 
accessing a training dataset including first objects of the first modality and second objects of a second modality that are associated with the first objects, respectively, the first modality being different than the second modality, and the second objects including text that is descriptive of the first objects (Luo, page 69, 3.1. Problem Formulation, the training dataset consists of n image-text pairs); 
obtaining a first matrix including first relevance values indicative of relevance between the first objects and the second objects, respectively (Luo, pages 69-70, 3.1. Problem Formulation, We associate each image with a natural language description such that the training dataset consists of n image-text pairs, i.e. D = {(xi, zi): i = 1, 2, ..., n}, where xi ∈ X ⊆ Rp represents a p-dimensional visual feature vector extracted from the i-th image and zi ∈ Z ⊆ Rq refers to a q-dimensional feature vector extracted from the i-th text (sentence) … we collect all sentences and images in X = {x1, x2, ..., xn} and Z = {z1, z2, ..., zn}, respectively. Note that the order of components in X and Z should correspond with each other such that the i-th image xi ∈ X and the i-th text zi ∈ Z come from the same pair in D … Given an image query x ∈ X, we define a non-linear mapping from image feature space into the shared multimodal embedding space via h [Equation 1], where W1 is a d x p transformation matrix and b1 ∈ Rd is a bias vector. Similarly, we map each text feature into the shared embedding space by non-linear mapping g [Equation 2] where W2 is a d x p transformation matrix and b2 ∈ Rd is a bias vector. Through non-linear mapping h and g, the similarity measurement (relevance score) S(x, z) between image query x and the retrieved text z can be obtained via computing the cosine similarity in the shared embedding space ... the underlying correspondence between image and text lies in the embedding parameters W1, b1 and W2, b2); 
obtaining a second matrix including second relevance values indicative of relevance between the second objects and the first objects, respectively (Luo, pages 69-70, 3.1. Problem Formulation, We associate each image with a natural language description such that the training dataset consists of n image-text pairs, i.e. D = {(xi, zi): i = 1, 2, ..., n}, where xi ∈ X ⊆ Rp represents a p-dimensional visual feature vector extracted from the i-th image and zi ∈ Z ⊆ Rq refers to a q-dimensional feature vector extracted from the i-th text (sentence) … we collect all sentences and images in X = {x1, x2, ..., xn} and Z = {z1, z2, ..., zn}, respectively. Note that the order of components in X and Z should correspond with each other such that the i-th image xi ∈ X and the i-th text zi ∈ Z come from the same pair in D … Given an image query x ∈ X, we define a non-linear mapping from image feature space into the shared multimodal embedding space via h [Equation 1], where W1 is a d x p transformation matrix and b1 ∈ Rd is a bias vector. Similarly, we map each text feature into the shared embedding space by non-linear mapping g [Equation 2] where W2 is a d x p transformation matrix and b2 ∈ Rd is a bias vector. Through non-linear mapping h and g, the similarity measurement (relevance score) S(x, z) between image query x and the retrieved text z can be obtained via computing the cosine similarity in the shared embedding space ... the underlying correspondence between image and text lies in the embedding parameters W1, b1 and W2, b2); 
based on similarities between ones of the second objects, generating the third matrix by selectively adding first additional relevance values to the first matrix (Luo, pages 68-69, Introduction, Given each image query, the retrieved sentences are ordered according to their ranking loss, as specified by the numbers in Fig. 1. It is reasonable to believe that the sentences ranked higher, i.e., with a smaller loss, are usually more accurate and important ... select ranking sentences together with the corresponding image queries … we adaptively assign each ranking with an importance weight and learn a more optimal multi-modal embedding space gradually from easy to more complex rankings with respect to diverse image queries … we associate each ranking by cross-modal query with an importance weight to train the CMLR model; Luo, page 70, 3.1. Problem Formulation, assume the aligned text zk ranks higher than the other text zj ∈ Z (j ≠ k) given an image query … we associate each image query xk a tetrad set).
Luo does not explicitly disclose the following limitations as further recited; however, Wang discloses:
accessing a third matrix and fourth matrix in memory of the search module in response to the first search query (Wang, page 5006, 1. Introduction; Wang, page 5006, 2.1. Network Structure, As shown in Figure 1, our deep model has two branches, each composed of fully connected layers with weight matrices Wl and Vl; Wang, pages 5006-5007, 2.2. Training Objective);
encoding the first search query using a first function including the third and fourth matrices (Wang, page 5006, 2. Deep Structure-Preserving Embedding, Let X and Y denote the collections of training images and sentences, each encoded according to their own feature vector representation. We want to map the image and sentence vectors (which may have different dimensions initially) to a joint space of common dimension);
based on the similarities between the ones of the second objects, generating the fourth matrix by selectively adding second additional relevance values to the second matrix (Wang, page 5006, 1. Introduction, preserve neighborhood structure within each individual view. Specifically, in the learned latent space, we want images (resp. sentences) with similar meaning to be close to each other ... for each image its target neighbors from the same class are closer than samples from other classes; Wang, page 5006, 2.1. Network Structure, As shown in Figure 1, our deep model has two branches, each composed of fully connected layers with weight matrices Wl and Vl; Wang, pages 5006-5007, 2.2. Training Objective, Our training objective is a stochastic margin-based loss that includes bidirectional cross-view ranking constraints, together with within-view structure-preserving constraints ... Structure-preserving constraints. Let N(xi) denote the neighborhood of xi containing images that share the same meaning. In our case, this is the set of images described by the same sentence as xi. Then we want to enforce a margin of m between N(xi) and any point outside of the neighborhood [Equation 3]. Analogously to (3), we define the constraints for the sentence side as [Equation 4] where N(yi') contains sentences describing the same image ... Figure 2 gives an intuitive illustration of how within-view structure preservation can help with cross-view matching ... within-view structure constraints are added, pushing semantically similar sentences (same color circles) closer to each other ... The weights λ2, λ3 control the importance of the structure-preserving terms, which act as regularizers for the bi-directional retrieval tasks).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo to include the within-view neighborhood structure as taught by Wang in order to cluster images / sentences with similar meaning close to each other in the learned latent embedding space so that target samples from within the same class are closer to each other than samples from other classes thereby improving accuracy of matching and image-to-text and text-to-image retrieval (Wang, Abstract; Wang, page 5006, 1. Introduction).
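For illustration only, and not as a reproduction of Wang's Equations 3 and 4 (which are not set out in this action), the within-view margin constraint quoted above may be sketched as a hinge-style penalty. The sketch assumes Euclidean distance in the embedding space and a penalty of max(0, m + d(xi, xj) - d(xi, xk)) for each neighbor xj in N(xi) and each non-neighbor xk; the function names and sample vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def structure_violation(anchor, neighbors, others, margin):
    """Sum of hinge penalties max(0, margin + d(anchor, n) - d(anchor, o)).

    A penalty is incurred whenever a non-neighbor `o` is not at least
    `margin` farther from the anchor than a neighbor `n`.
    """
    total = 0.0
    for n in neighbors:
        for o in others:
            total += max(0.0, margin + euclidean(anchor, n) - euclidean(anchor, o))
    return total


# Hypothetical embedded points: one neighbor and two non-neighbors.
anchor = [0.0, 0.0]
neighbors = [[0.0, 1.0]]
others = [[3.0, 4.0], [0.0, 1.5]]
penalty = structure_violation(anchor, neighbors, others, margin=1.0)
```

In this sketch the distant non-neighbor contributes no penalty, while the non-neighbor lying within the margin of the neighbor does, which corresponds to the quoted goal of keeping same-meaning samples closer together than samples outside the neighborhood.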
Luo and Wang do not explicitly disclose the following limitations as further recited; however, Lee discloses:
encoding the first search query using a first function (Lee, ¶0032, the first encoding model 210 may be an image-encoding model for encoding images, and the second encoding model 212 may be a text-encoding model for encoding text);
identifying at least one search result for the first search query based on a result of the encoding using the first function (Lee, ¶0089, The digital assistant may facilitate a search of images available on the Internet when a user input a query text into the Internet browser. That is, the Internet browser on the digital assistant may send the query text to the search engine 204 in the server device 804 to retrieve matching image(s)); and 
transmitting the at least one search result from the search module (Lee, ¶0089, The digital assistant may facilitate a search of images available on the Internet when a user input a query text into the Internet browser. That is, the Internet browser on the digital assistant may send the query text to the search engine 204 in the server device 804 to retrieve matching image(s); Lee, ¶0085, in act 718, the candidate image with the highest similarity score may be determined as being the most similar to the search query sentence. And finally, in act 720, the best candidate image may be returned as the search result image).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Luo and Wang to include the encoding, identifying, retrieving and transmitting / returning the search result as taught by Lee in order to improve the accuracy of matching data of different modalities requested / queried by users (Lee, ¶0017; Lee, ¶0001).
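For illustration only, the combination discussed above (a cosine-similarity relevance score in a shared embedding space, as quoted from Luo, followed by return of the highest-scoring candidate, as quoted from Lee) may be sketched as follows. The sketch assumes the query and the candidates have already been encoded into vectors of a common dimension; the function names, candidate identifiers and vector values are hypothetical and are not drawn from any cited reference.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return candidate identifiers ordered from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )


# Hypothetical encoded text query and encoded candidate images.
query = [1.0, 0.0]
candidates = {
    "img_a": [0.9, 0.1],
    "img_b": [0.0, 1.0],
    "img_c": [-1.0, 0.0],
}
ranking = rank_candidates(query, candidates)
best = ranking[0]  # candidate that would be returned as the search result
```

The same ranking routine applies in the opposite direction (an image query scored against encoded sentences), since the score depends only on the two embedded vectors.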

As per claim 14, Luo, Wang and Lee disclose the method of claim 13.  Lee discloses further comprising, by the search module: 
receiving a second search query in the second modality (Lee, ¶0096, the cross-modal attention model 208 may also be used in a different context of searching for a matching text in response to a query image. That is, the client application 202 may provide a search query image to the search engine 204, the search engine 204 may retrieve candidate sentences from the database 206); 
encoding the second search query using a second function including the third and fourth matrices (Lee, ¶0032, the first encoding model 210 may be an image-encoding model for encoding images, and the second encoding model 212 may be a text-encoding model for encoding text); and 
identifying at least one search result for the second search query based on a result of the encoding using the second function (Lee, ¶0096, the cross-modal attention model 208 may also be used in a different context of searching for a matching text in response to a query image. That is, the client application 202 may provide a search query image to the search engine 204, the search engine 204 may retrieve candidate sentences from the database 206; Lee, ¶0085, in act 718, the candidate image with the highest similarity score may be determined as being the most similar to the search query sentence. And finally, in act 720, the best candidate image may be returned as the search result image).  The motivation would be the same as above in claim 13.

As per claim 15, Luo, Wang and Lee disclose the method of claim 14.  Lee discloses further comprising: receiving the first search query from a user device over a network (Lee, ¶0028, a client application 202 may initiate the search by providing a search query sentence to a search engine 204. The client application 202 may be a browser or another type of software application on a client device; Lee, ¶0078, a search engine in a server device may receive the search query sentence provided by a client application in a client device. The search query sentence may be provided by a user for the purpose of finding a matching image); and transmitting the at least one search result for the first search query to the user device over the network (Lee, ¶0085, in act 718, the candidate image with the highest similarity score may be determined as being the most similar to the search query sentence. And finally, in act 720, the best candidate image may be returned as the search result image).  The motivation would be the same as above in claim 13.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571)270-5189. The examiner can normally be reached M-F, 9:30AM TO 6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/TRACY MANGIALASCHI/Examiner, Art Unit 2668
/ALEX KOK S LIEW/Primary Examiner, Art Unit 2668