Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This action is responsive to the application filed in the U.S. on 5/7/2020. Claims 1-22 are pending in the case. Claims 1, 10, 16, and 22 are written in independent form.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 10-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-patentable subject matter. The claims are directed to an abstract idea without significantly more.
Claims 10-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The judicial exception is not integrated into a practical application. The claims do not include additional elements that are sufficient to amount to significantly more than judicial exception. The eligibility analysis in support of these findings is provided below.

As per Claims 10-15,
STEP 1 (Yes): In accordance with Step 1 of the eligibility inquiry (as explained in MPEP 2106), it is noted that the claimed method (claims 10-15) is directed to one of the eligible categories of subject matter and therefore satisfies Step 1.

STEP 2A Prong One (Yes): In accordance with Step 2A Prong One, it is noted that the claims recite an abstract idea by reciting mathematical concepts (including mathematical relationships, formulas, equations, and calculations), which fall into the “Mathematical Concepts” group within the enumerated groupings of abstract ideas. The claims recite the abstract idea of applying a first and second neural network (function) to training data, using a binary hash algorithm (function) to generate a binary hash code and a quantization algorithm (function) to generate a quantization code, “calculating a loss value using a loss function…”, and updating parameters of the first and second neural networks, the binary hash function, and the quantization algorithm (function), which falls within the abstract idea of performing mathematical relationships, formulas, equations, and calculations. The recitation of generic computer components does not negate the abstractness of the given limitations.
The limitations include:
Regarding Claim 10
a method, performed by a computer system, for training a system to perform cross-modal retrieval of a database item having similar semantic meaning to a query item of a different modality, the method comprising:
(a) obtaining a training data set, wherein the training data set includes pairs of data items wherein each pair includes a data item of a first modality and a data item of a second modality, wherein the first and second modalities are different, wherein a first subset of pairs have data items of similar semantic meaning and a second subset of pairs of have data items of dissimilar semantic meaning, and wherein each pair is labeled based on the similarity or dissimilarity of semantic meaning of the data items in the pair (performing a function to collect training data already determined to have different modalities and labeled pairs);
 (b) applying a first neural network to the data items of the first modality in the training data set to generate a feature vector for each of the data items of the first modality (applying a neural network is merely applying a mathematical formula or equation to the data items);
 (c) applying a second neural network to the data items of the second modality in the training data set to generate a feature vector for the each of the data items of the second modality (applying a neural network is merely applying a mathematical formula or equation to the data items);
(d) generating a binary hash code and a quantization code for each of the data items in the training data set based on the feature vectors for the data items, wherein the binary hash codes are generated using a binary hashing algorithm and the quantization codes are generated using a quantization algorithm (mathematical formulas for generating a binary hash code and quantization code);
 (e) calculating a loss value using a loss function that measures an extent to which the feature vectors, binary hash codes, and the quantization codes preserve semantic similarity correlations in the training data pairs and that enables the system to simultaneously optimize the feature vectors, the binary hash codes, and the quantization codes (loss function for calculating a loss value);
 (f) updating parameters of the first and second neural networks, the binary hash algorithm, and the quantization algorithm to reduce the loss value (updating parameters for executing mathematical formulas and equations); and
 (g) repeating steps (b)-(f) for a number of iterations (performing an iterative loop function);

Regarding Claim 11
wherein the first modality is text, and the second modality is images (additional element). 

Regarding Claim 12
wherein the first modality is images, and the second modality is text (additional element).

Regarding Claim 13
wherein the loss function comprises:
a similarity loss sub-function that measures similarities between feature vectors of the data items in each of the pairs of training data items;
a hash loss sub-function that measures binary code error; and
a quantization loss sub-function that measures quantization error;
The claim further recites at a high level of generality sub-functions or formulas/equations that make up the previously recited loss function.
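For illustration only, one way such a composite loss could be written is shown below. The additive form, the weights α and β, and the symbol names are assumptions made for this sketch; the claim itself only recites that the loss function comprises the three sub-functions.

```latex
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{sim}} \;+\; \alpha\,\mathcal{L}_{\mathrm{hash}} \;+\; \beta\,\mathcal{L}_{\mathrm{quant}},
\qquad \alpha,\ \beta \ge 0 \ \text{(hypothetical weights)}
\]
```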

Regarding Claim 14
wherein the loss function also includes a balance loss sub-function that measures a distribution of the number of +1 and -1 binary bits in the binary hash codes for the training data set.
The claim further recites at a high level of generality a sub-function or formula/equation that makes up the previously recited loss function.
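As a hedged illustration of what a balance measure over +1 and -1 bits might look like, the sketch below averages each bit position across the data set and penalizes averages far from zero. The function name balance_loss and the squared-mean formula are assumptions made for this example; they are not taken from the claims or the specification.

```python
def balance_loss(codes):
    """Illustrative (assumed) balance measure for binary hash codes.

    codes: list of equal-length lists whose entries are +1 or -1.
    Returns the mean, over bit positions, of the squared average bit value.
    A value of 0.0 means every bit position has as many +1 as -1 entries.
    """
    if not codes:
        return 0.0
    n = len(codes)
    k = len(codes[0])
    total = 0.0
    for bit in range(k):
        # Average of this bit position across all codes in the data set.
        mean = sum(code[bit] for code in codes) / n
        total += mean * mean
    return total / k


balanced = [[1, -1], [-1, 1]]   # each bit position is half +1, half -1
skewed = [[1, 1], [1, 1]]       # every bit is +1
print(balance_loss(balanced))
print(balance_loss(skewed))
```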

Regarding Claim 15
a convolutional neural network is applied to image data items (applying a specific type of neural network to a particular modality of data items is merely applying a mathematical formula or equation to the data items) and
a long-short term memory neural network or multi-layer perceptron is applied to text data items (applying a specific type of neural network to a particular modality of data items is merely applying a mathematical formula or equation to the data items).

Step 2A Prong Two (No)
The additional elements are directed to the use of a computer system and the modalities of data items being processed by the mathematical concept as being text and images (Claims 10-12). However, these elements fail to integrate the abstract idea into a practical application because they fail to provide an improvement to the functioning of a computer or to any other technology or technical field, fail to apply the exception with a particular machine, fail to apply the judicial exception to effect a particular treatment or prophylaxis for a disease or medical condition, fail to effect a transformation of a particular article to a different state or thing, and fail to apply/use the abstract idea in a meaningful way beyond generally linking the use of the judicial exception to a particular technological environment. Furthermore, these elements have been fully considered; however, they are directed to the use of generic computing elements to perform the abstract idea, which is not sufficient to amount to a practical application.
Accordingly, because the Step 2A Prong One and Prong Two analysis resulted in the conclusion that the claims are directed to an abstract idea, additional analysis under Step 2B of the eligibility inquiry must be conducted in order to determine whether any claim element or combination of elements amount to significantly more than the judicial exception.

Step 2B (No):
It has been determined that the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional limitation(s) is/are directed to a computer system and the modalities of data items being processed by the mathematical concept as being text and images, though at a very high level of generality and without imposing meaningful limitation on the scope of the claims. Such generic, high-level, and nominal involvement of a computer or computer-based elements for carrying out the invention merely serves to tie the abstract idea to a particular technological environment, which is not enough to render the claims patent-eligible, as noted at pg. 74624 of Federal Register/Vol. 79, No. 241, citing Alice, which in turn cites Mayo. See, e.g., Alice Corp. Pty. Ltd. v. CLS Bank Int'l, 134 S. Ct. 2347, 2359-60, 110 USPQ2d 1976, 1984 (2014). See also OIP Techs. v. Amazon.com, 788 F.3d 1359, 1364, 115 USPQ2d 1090, 1093-94 (Fed. Cir. 2015) ("Just as Diehr could not save the claims in Alice, which were directed to 'implement[ing] the abstract idea of intermediated settlement on a generic computer', it cannot save OIP's claims directed to implementing the abstract idea of price optimization on a generic computer.") (citations omitted). See also Affinity Labs of Texas LLC v. DirecTV LLC, 838 F.3d 1253, 1257-1258 (Fed. Cir. 2016) (mere recitation of a GUI does not make a claim patent-eligible); Intellectual Ventures I LLC v. Capital One Bank, 792 F.3d 1363, 1370 (Fed. Cir. 2015) ("the interactive interface limitation is a generic computer element").
The additional elements are broadly applied to the abstract idea(s) at a high level of generality ("similar to how the recitation of the computer in the claims in Alice amounted to mere instructions to apply the abstract idea of intermediated settlement on a generic computer," as explained in MPEP §2106.05(f)) and they operate in well-understood, routine, and conventional manners. Furthermore, generally transmitting, analyzing, and outputting (e.g., displaying) data are examples of insignificant extra-solution activity. The recitation that routing, moving, and identifying are performed by an apparatus/device is the epitome of "mere instructions to implement an abstract idea on a computer".
MPEP § 2106.05(d)(II) sets forth the following:
The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity.
• Receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec ... ; TLI Communications LLC v. AV Auto. LLC ... ; OIP Techs., Inc., v. Amazon.com, Inc ... ; buySAFE, Inc. v. Google, Inc ... ;
• Performing repetitive calculations, Flook ... ; Bancorp Services v. Sun Life ... ;
• Electronic recordkeeping, Alice Corp ... ; Ultramercial ... ;
• Storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc ... ;
• Electronically scanning or extracting data from a physical document, Content Extraction and Transmission, LLC v. Wells Fargo Bank ... ; and
• A web browser's back and forward button functionality, Internet Patent Corp. v. Active Network, Inc ...
. . . Courts have held computer-implemented processes not to be significantly more than an abstract idea (and thus ineligible) where the claim as a whole amounts to nothing more than generic computer functions merely used to implement an abstract idea, such as an idea that could be done by a human analog (i.e., by hand or by merely thinking) ...

In addition, when taken as an ordered combination, the elements add nothing that is not already present when the elements are taken individually. There is no indication that the combination of elements integrates the abstract idea into a practical application. Their collective functions merely provide conventional computer implementation. Therefore, when viewed as a whole, these additional claim elements do not provide meaningful limitations that transform the abstract idea into a practical application, and the ordered combination does not amount to significantly more than the abstract idea itself.



Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 10-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Non-Patent Literature Cao et al., "Deep Visual-Semantic Hashing for Cross-Modal Retrieval", KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, Pages 1445–1454 (Year: 2016), hereinafter referred to as Cao.

Regarding Claim 10
Cao teaches a method, performed by a computer system, for training a system to perform cross-modal retrieval of a database item having similar semantic meaning to a query item of a different modality, the method comprising:
(a) obtaining a training data set, wherein the training data set includes pairs of data items
Cao teaches obtaining a training set of bimodal objects (Page 1447, Col. 2, 1st paragraph)
wherein each pair includes a data item of a first modality and a data item of a second modality,
Cao teaches the bimodal objects comprise an image modality and a text modality (Page 1447, Col. 2, 1st paragraph).
wherein the first and second modalities are different,
Cao teaches the modalities of the bimodal objects as being different by teaching one modality as an image modality and another as a text modality (Page 1447, Col. 2, 1st paragraph).
wherein a first subset of pairs have data items of similar semantic meaning and a second subset of pairs of have data items of dissimilar semantic meaning, and
Cao teaches “some pairs of the bimodal objects are associated with similarity labels sij, where sij = 1 implies oi and oj are similar and sij = −1 indicates oi and oj are dissimilar” (Page 1447, Col. 2, 1st paragraph). Cao further teaches “in supervised cross-modal hashing, S = {sij} is constructed from semantic labels of data points” (Page 1447, Col. 2, 1st paragraph).
wherein each pair is labeled based on the similarity or dissimilarity of semantic meaning of the data items in the pair;
Cao teaches “some pairs of the bimodal objects are associated with similarity labels sij, where sij = 1 implies oi and oj are similar and sij = −1 indicates oi and oj are dissimilar” (Page 1447, Col. 2, 1st paragraph). Cao further teaches “in supervised cross-modal hashing, S = {sij} is constructed from semantic labels of data points” (Page 1447, Col. 2, 1st paragraph).
(b) applying a first neural network to the data items of the first modality in the training data set to generate a feature vector for each of the data items of the first modality;
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).
(c) applying a second neural network to the data items of the second modality in the training data set to generate a feature vector for the each of the data items of the second modality;
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).
(d) generating a binary hash code and a quantization code for each of the data items in the training data set based on the feature vectors for the data items,
Cao teaches “generating binary hash codes…in the joint embedding space” for each of the data items “to enable efficient cross-modal retrieval” (Page 1447, Col. 2, 2nd Paragraph). Cao further teaches motivation “to craft two or more hashing networks for directly learning the modality-specific hashing functions” (Page 1449, Section 4.2) where “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446, Section 2). Therefore, Cao teaches generating a quantization code along with binary hash codes.
wherein the binary hash codes are generated using a binary hashing algorithm and
Cao teaches “two modality-specific hashing networks for learning hash functions to generate compact binary codes” (Abstract) thereby teaching using a binary hashing function to generate the binary hash codes.
the quantization codes are generated using a quantization algorithm;
Cao teaches “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446 Section 2).
(e) calculating a loss value using a loss function that measures an extent to which the feature vectors, binary hash codes, and the quantization codes preserve semantic similarity correlations in the training data pairs and that enables the system to simultaneously optimize the feature vectors, the binary hash codes, and the quantization codes;
Cao teaches “DVSH is a hybrid deep architecture that constitutes a visual-semantic fusion network for learning joint embedding space of images and sentences, and two modality-specific hashing networks for learning hash functions to generate compact binary codes” and “effectively unifies joint multimodal embedding and cross-modal hashing, which is based on a seamless combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates all things together to enable the learning of similarity-preserving and high-quality hash codes” (Abstract). Cao further teaches calculating a cosine max-margin loss value by “integrating…loss functions in a joint optimization problem that is taken over the deep visual-semantic hashing (DVSH) network [with]… penalty parameters for trading off the relative importance of the bitwise max-margin loss and modality-specific squared loss” (Page 1449, Col. 2, Section 4.3).
(f) updating parameters of the first and second neural networks, the binary hash algorithm, and the quantization algorithm to reduce the loss value; and
Cao teaches “we adopt the LSTM as our sequence model, which maps an input yit of each sequence (a sentence in our case) at timestep t and a hidden state hyi(t−1) of previous timestep (t−1) to an output zyit and updates hidden state hyit. Therefore, inference must be run sequentially (i.e. from top to bottom in Figure 3), by computing the activations in order using Equation (3), that is, updating the t-th state based on the (t − 1)-th state.” (Page 1448, Col. 1, 2nd Paragraph).
Cao further teaches minimizing the max-margin loss (Page 1449, Section 4.1.2), thereby teaching reducing the loss value to a minimum.
(g) repeating steps (b)-(f) for a number of iterations;
Cao teaches “we adopt the LSTM as our sequence model, which maps an input yit of each sequence (a sentence in our case) at timestep t and a hidden state hyi(t−1) of previous timestep (t−1) to an output zyit and updates hidden state hyit. Therefore, inference must be run sequentially (i.e. from top to bottom in Figure 3), by computing the activations in order using Equation (3), that is, updating the t-th state based on the (t − 1)-th state.” (Page 1448, Col. 1, 2nd Paragraph). Therefore, Cao teaches performing at least one repeated iteration of the steps.
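The similarity labels sij quoted above for limitation (a) can be illustrated with a short sketch. It assumes, purely for this example, that two objects count as similar when they share at least one semantic label; the function name similarity_labels and the sample labels are hypothetical and are not drawn from Cao.

```python
def similarity_labels(semantic_labels):
    """Build pairwise similarity labels s_ij from per-object semantic labels.

    semantic_labels: list of sets, one set of label strings per object.
    Returns a dict mapping (i, j) with i != j to +1 (similar) or -1 (dissimilar).
    Assumption for this sketch: sharing at least one label means "similar".
    """
    s = {}
    for i, labels_i in enumerate(semantic_labels):
        for j, labels_j in enumerate(semantic_labels):
            if i == j:
                continue
            s[(i, j)] = 1 if labels_i & labels_j else -1
    return s


# Hypothetical objects: 0 and 1 share the label "dog"; 2 shares nothing.
objects = [{"dog"}, {"dog", "park"}, {"car"}]
S = similarity_labels(objects)
print(S[(0, 1)], S[(0, 2)])
```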

Regarding Claim 11
Cao further teaches:
wherein the first modality is text, and the second modality is images.
Cao teaches the modalities of the bimodal objects as being different by teaching one modality as an image modality and another as a text modality (Page 1447, Col. 2, 1st paragraph). Cao further teaches “cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa” (Abstract).

Regarding Claim 12
Cao further teaches:
wherein the first modality is images, and the second modality is text.
Cao teaches the modalities of the bimodal objects as being different by teaching one modality as an image modality and another as a text modality (Page 1447, Col. 2, 1st paragraph). Cao further teaches “cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa” (Abstract).

Regarding Claim 13
Cao further teaches:
wherein the loss function comprises:
a similarity loss sub-function that measures similarities between feature vectors of the data items in each of the pairs of training data items;
Cao teaches a cosine max-margin loss function where “for each pair of objects (oi, oj , sij ), if sij = 1, indicating that oi and oj are similar, then their hash codes ui and vj must be similar across different modalities (image and sentence), which is equivalent to requiring that their joint visual-semantic embeddings hi and hj should be similar. Correspondingly, if sij = −1, indicating that oi and oj are dissimilar, then their joint visual-semantic embeddings hi and hj should be dissimilar. We use the cosine similarity…for measuring the closeness between hi and hj” (Section 4.1.1, Pages 1448-1449).
a hash loss sub-function that measures binary code error; and
Cao teaches a bitwise max-margin loss function where “minimizing the bitwise max-margin loss will lead to lower quantization error when binarizing the continuous embeddings…to binary hash codes, which allows us to learn high-fidelity binary codes” (Page 1449, Section 4.1.2).
a quantization loss sub-function that measures quantization error;
Cao teaches a bitwise max-margin loss function where “minimizing the bitwise max-margin loss will lead to lower quantization error when binarizing the continuous embeddings…to binary hash codes, which allows us to learn high-fidelity binary codes” (Page 1449, Section 4.1.2).
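To make the cosine-similarity passage quoted for the similarity loss more concrete, the sketch below shows a generic cosine-margin hinge over a labeled pair. It is an illustration under stated assumptions and not necessarily Cao's exact formulation: the function names, the default margin of 0.5, and the hinge form max(0, margin − sij·cos) are choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_margin_loss(h_i, h_j, s_ij, margin=0.5):
    """Generic hinge on cosine similarity (assumed form, for illustration).

    s_ij = +1 pushes the embeddings toward high cosine similarity;
    s_ij = -1 pushes them toward low (negative) cosine similarity.
    """
    return max(0.0, margin - s_ij * cosine_similarity(h_i, h_j))


# Identical embeddings: no penalty if labeled similar, large penalty if dissimilar.
print(cosine_margin_loss([1.0, 0.0], [1.0, 0.0], 1))
print(cosine_margin_loss([1.0, 0.0], [1.0, 0.0], -1))
```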

Regarding Claim 14
Cao further teaches:
wherein the loss function also includes a balance loss sub-function that measures a distribution of the number of +1 and -1 binary bits in the binary hash codes for the training data set.
Cao teaches “With the trained fusion network and hashing networks, we can obtain K-bit binary hash codes by simple sign thresholding sgn(u) and sgn(v) for each modality, where sgn(·) is the element-wise sign function that for i = 1, . . . , K, sgn(zi) = 1 if zi > 0, otherwise sgn(zi) = −1.” (Page 1450, Col. 1, 1st Paragraph). Therefore, Cao teaches measuring a distribution of the binary bits in the binary hash codes for the training data.
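The element-wise sign function in the passage quoted above can be written directly from its stated definition (sgn(zi) = 1 if zi > 0, otherwise −1). The helper name binarize and the sample embedding are hypothetical; note that, under the quoted definition, a value of exactly zero maps to −1.

```python
def sgn(z):
    """Sign function as defined in the quoted passage: 1 if z > 0, otherwise -1."""
    return 1 if z > 0 else -1


def binarize(u):
    """Apply sgn element-wise to a continuous K-dimensional embedding."""
    return [sgn(z) for z in u]


# Hypothetical 3-dimensional embedding; the zero entry maps to -1.
print(binarize([0.7, -0.2, 0.0]))  # [1, -1, -1]
```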

Regarding Claim 15
Cao further teaches:
a convolutional neural network is applied to image data items and
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).
a long-short term memory neural network or multi-layer perceptron is applied to text data items.
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

	
Claims 1-7, 9, 16-19, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Cao and further in view of An et al. (U.S. Pre-Grant Publication No. 2015/0154229).

Regarding Claim 1
Cao teaches a method, performed by a computer system, for cross-modal retrieval of a database item having similar semantic meaning to a query item of a different modality, the method comprising:
performing the following with respect to a training phase:
(a) obtaining a training data set, wherein the training data set includes pairs of data items
Cao teaches obtaining a training set of bimodal objects (Page 1447, Col. 2, 1st paragraph)
wherein each pair includes a data item of a first modality and a data item of a second modality,
Cao teaches the bimodal objects comprise an image modality and a text modality (Page 1447, Col. 2, 1st paragraph).
wherein the first and second modalities are different,
Cao teaches the modalities of the bimodal objects as being different by teaching one modality as an image modality and another as a text modality (Page 1447, Col. 2, 1st paragraph).
wherein a first subset of pairs have data items of similar semantic meaning and a second subset of pairs of have data items of dissimilar semantic meaning, and
Cao teaches “some pairs of the bimodal objects are associated with similarity labels sij, where sij = 1 implies oi and oj are similar and sij = −1 indicates oi and oj are dissimilar” (Page 1447, Col. 2, 1st paragraph). Cao further teaches “in supervised cross-modal hashing, S = {sij} is constructed from semantic labels of data points” (Page 1447, Col. 2, 1st paragraph).
wherein each pair is labeled based on the similarity or dissimilarity of semantic meaning of the data items in the pair;
Cao teaches “some pairs of the bimodal objects are associated with similarity labels sij, where sij = 1 implies oi and oj are similar and sij = −1 indicates oi and oj are dissimilar” (Page 1447, Col. 2, 1st paragraph). Cao further teaches “in supervised cross-modal hashing, S = {sij} is constructed from semantic labels of data points” (Page 1447, Col. 2, 1st paragraph).
(b) applying a first neural network to the data items of the first modality in the training data set to generate a feature vector for each of the data items of the first modality;
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).
(c) applying a second neural network to the data items of the second modality in the training data set to generate a feature vector for the each of the data items of the second modality;
Cao teaches applying a first and second neural network to the data items of the first and second modality respectively to generate respective feature vectors of the modalities by teaching “the proposed cross-modal deep hashing approach (DVSH) in Figure 3 is an end-to-end deep architecture for crossmodal hashing, which comprises both convolutional neural network (AlexNet) for learning image representations and recurrent neural network (LSTM) for learning text representations” (Page 1447, Col. 2, 3rd Paragraph).
(d) generating a binary hash code and a quantization code for each of the data items in the training data set based on the feature vectors for the data items,
Cao teaches “generating binary hash codes…in the joint embedding space” for each of the data items “to enable efficient cross-modal retrieval” (Page 1447, Col. 2, 2nd Paragraph). Cao further teaches motivation “to craft two or more hashing networks for directly learning the modality-specific hashing functions” (Page 1449, Section 4.2) where “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446, Section 2). Therefore, Cao teaches generating a quantization code along with binary hash codes.
wherein the binary hash codes are generated using a binary hashing algorithm and
Cao teaches “two modality-specific hashing networks for learning hash functions to generate compact binary codes” (Abstract) thereby teaching using a binary hashing function to generate the binary hash codes.
the quantization codes are generated using a quantization algorithm;
Cao teaches “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446 Section 2).
(e) calculating a loss value using a loss function that measures an extent to which the feature vectors, binary hash codes, and the quantization codes preserve semantic similarity correlations in the training data pairs and that enables the system to simultaneously optimize the feature vectors, the binary hash codes, and the quantization codes;
Cao teaches “DVSH is a hybrid deep architecture that constitutes a visual-semantic fusion network for learning joint embedding space of images and sentences, and two modality-specific hashing networks for learning hash functions to generate compact binary codes” and “effectively unifies joint multimodal embedding and cross-modal hashing, which is based on a seamless combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates all things together to enable the learning of similarity-preserving and high-quality hash codes” (Abstract). Cao further teaches calculating a cosine max-margin loss value by “integrating…loss functions in a joint optimization problem that is taken over the deep visual-semantic hashing (DVSH) network [with]… penalty parameters for trading off the relative importance of the bitwise max-margin loss and modality-specific squared loss” (Page 1449, Col. 2, Section 4.3).
(f) updating parameters of the first and second neural networks, the binary hash algorithm, and the quantization algorithm to reduce the loss value; and
Cao teaches “we adopt the LSTM as our sequence model, which maps an input yit of each sequence (a sentence in our case) at timestep t and a hidden state hyi(t−1) of previous timestep (t−1) to an output zyit and updates hidden state hyit. Therefore, inference must be run sequentially (i.e. from top to bottom in Figure 3), by computing the activations in order using Equation (3), that is, updating the t-th state based on the (t − 1)-th state.” (Page 1448, Col. 1, 2nd Paragraph).
Cao further teaches minimizing the max-margin loss (Page 1449, Section 4.1.2), thereby teaching reducing the loss value to a minimum.
(g) repeating steps (b)-(f) for a number of iterations;
Cao teaches “we adopt the LSTM as our sequence model, which maps an input yit of each sequence (a sentence in our case) at timestep t and a hidden state hyi(t−1) of previous timestep (t−1) to an output zyit and updates hidden state hyit. Therefore, inference must be run sequentially (i.e. from top to bottom in Figure 3), by computing the activations in order using Equation (3), that is, updating the t-th state based on the (t − 1)-th state.” (Page 1448, Col. 1, 2nd Paragraph). Therefore, Cao teaches performing at least one repeated iteration of the steps.
performing the following with respect to a prediction phase:
(h) accessing a database with a plurality of database items of the first modality;
Cao teaches “in cross-modal retrieval systems, the database consists of objects from one modality and the query consists of objects from another modality” (Page 1447, Section 4). Cao further teaches “encod[ing] each image x and sentence y from database and query to compact binary hash codes…in the joint embedding space H to enable efficient cross-modal retrieval” (Page 1447, Col. 2, 2nd Paragraph) thereby teaching accessing the database with a plurality of database items.
(i) applying the first neural network to the database items to generate a feature vector for each of the database items;
Cao teaches “cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa” (Abstract) thereby teaching accessing a database for either an image or text. Cao further teaches performing image hashing and sentence hashing which are similar to the CNN module of the fusion network for image hashing and similar to the LSTM module of the fusion network for sentence hashing (Page 1449, Sections 4.2.1-4.2.2). Therefore, Cao teaches applying the first neural network to the database items for generating a feature vector.
(j) generating a binary hash code and a quantization code for each of the database items using the feature vectors for the database items, the binary hash algorithm, and the quantization algorithm;
Cao teaches generating hash codes for images and sentences using the modality-specific hashing network (Figure 3 Retrieval Procedure) and motivation “to craft two or more hashing networks for directly learning the modality-specific hashing functions” (Page 1449, Section 4.2) where “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446 Section 2). Therefore, Cao teaches generating a quantization code along with binary hash codes using a binary hashing algorithm and a quantization algorithm.
(k) receiving a query item of the second modality;
Cao teaches receiving a query for at least one of the modalities (Figure 3 Retrieval Procedure) and “in cross-modal retrieval systems, the database consists of objects from one modality and the query consists of objects from another modality” (Page 1447, Section 4).
(l) applying the second neural network to the query item to generate a feature vector for the query item;
Cao teaches “cross-modal hashing, which enables efficient retrieval of images in response to text queries or vice versa” (Abstract) thereby teaching accessing a database for either an image or text. Cao further teaches performing image hashing and sentence hashing which are similar to the CNN module of the fusion network for image hashing and similar to the LSTM module of the fusion network for sentence hashing (Page 1449, Sections 4.2.1-4.2.2). Therefore, Cao teaches applying the second neural network to the query for generating a feature vector.
(m) generating a binary hash code for the query item based on the feature vector for the query item;
Cao teaches generating hash codes for images and sentences using the modality-specific hashing network (Figure 3 Retrieval Procedure) and motivation “to craft two or more hashing networks for directly learning the modality-specific hashing functions” (Page 1449, Section 4.2) where “unsupervised hashing methods learn hash functions that can encode input data points to binary codes using the unlabeled training data” and “typical learning criteria include…quantization error minimization as correlation quantization” (Page 1446 Section 2). Therefore, Cao teaches generating a binary hash code for the query item.

Cao explicitly teaches all of the elements as recited above except:
(n) calculating a distance between the query item and each of the database items based on the binary hash codes of the query item and the database items;
(o) selecting a subset of closest database items to the query item based on the calculated distances;
(p) computing a quantization distance between the query item and each of the database items in the subset using the quantization codes for each of the database items in the subset; and
(q) retrieving the database item in the subset with the closest quantization distance to the query item.

However, in the related field of endeavor of a multi-modal search system, An et al. teaches:
(n) calculating a distance between the query item and each of the database items based on the binary hash codes of the query item and the database items;
An et al. teaches calculating a distance between a query item and database items by retrieving candidate responses from the database using a retrieval algorithm (Paras. [0148]-[0149]). An et al. further teaches using hash codes where “during retrieval, the binary codes efficiently use Hamming distance calculations” (Para. [0174]).
(o) selecting a subset of closest database items to the query item based on the calculated distances;
An et al. teaches sending the candidate responses for re-ranking (Para. [0149] & Fig. 11) where “a candidate list is obtained of images in the searchable database that are similar to the query image based at least in part on the similarity measure” (Para. [0063]) thereby teaching selecting a subset as candidate responses based on the calculated measure of distance between the query and the candidate responses.
(p) computing a quantization distance between the query item and each of the database items in the subset using the quantization codes for each of the database items in the subset; and
An et al. teaches “re-ranking of the images might cause some of the candidates to fall out of a top-X group of images, such as the top-10 images” (Para. [0149]) where the image re-ranking scheme is “based on a similarity function combining the low level features similarity and attributes-enhanced feature similarity” (Para. [0163]). An et al. further teaches “for scalability, binary quantization is still extremely useful due to the very efficient calculation of the Hamming distance for retrieval” and “the re-ranking (RR) process after retrieval using AFB boosts the retrieval performance for both datasets” (Para. [0228]).
(q) retrieving the database item in the subset with the closest quantization distance to the query item.
An et al. teaches “re-ranking of the images might cause some of the candidates to fall out of a top-X group of images, such as the top-10 images” (Para. [0149]) thereby teaching retrieving the database item in the candidate list with the top re-ranking.
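
For illustration only, the coarse-to-fine retrieval recited in steps (n)-(q) can be sketched as below. This is a minimal sketch under stated assumptions and does not reproduce the implementation of Cao, An et al., or the application: binary hash codes are assumed to be stored as Python integers, the "quantization distance" is assumed to be a squared Euclidean distance between the query feature vector and the vector each database quantization code decodes to, and the names two_stage_retrieve, db_recon, and k are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two binary hash codes differ."""
    return bin(a ^ b).count("1")


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def two_stage_retrieve(
    query_hash: int,
    query_vec: Sequence[float],
    db_hashes: List[int],
    db_recon: List[Sequence[float]],  # assumed: vector decoded from each quantization code
    k: int,
) -> int:
    """Return the index of the retrieved database item."""
    # Steps (n)-(o): rank all items by Hamming distance and keep the k closest.
    by_hamming = sorted(
        range(len(db_hashes)),
        key=lambda i: hamming_distance(query_hash, db_hashes[i]),
    )
    subset = by_hamming[:k]
    # Steps (p)-(q): re-rank only the subset by the finer quantization distance.
    return min(subset, key=lambda i: squared_distance(query_vec, db_recon[i]))


# Hypothetical toy data: four database items with 4-bit hash codes.
db_hashes = [0b1010, 0b1011, 0b0101, 0b1110]
db_recon = [[1.0, 0.0], [0.2, 0.9], [0.0, 0.0], [0.9, 0.1]]
best = two_stage_retrieve(0b1010, [0.25, 1.0], db_hashes, db_recon, k=3)
```

In this toy data the item with the smallest Hamming distance (index 0) is not the item finally retrieved, which shows why the second, finer-grained pass over the subset can change the result.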

Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of An et al. and Cao at the time that the claimed invention was effectively filed, to have combined the subsequent re-ranking of the candidate results, as taught by An et al., with the systems and methods for deep visual-semantic hashing for cross-modal retrieval, as taught by Cao.
One would have been motivated to make such a combination because An et al. teaches re-ranking the top candidate results “to improve the precision” of the results (Para. [0150]) and teaches that re-ranking results reduces false positives (Para. [0162]). It would have been obvious to a person having ordinary skill in the art that improving the precision of query results and reducing false positives would improve the reliability and quality of the final result set returned in response to a query.

Regarding Claim 2
All of the elements herein are similar to some or all of the elements recited in Claim 11.

Regarding Claim 3
All of the elements herein are similar to some or all of the elements recited in Claim 12.

Regarding Claim 4
All of the elements herein are similar to some or all of the elements recited in Claim 13.

Regarding Claim 5
All of the elements herein are similar to some or all of the elements recited in Claim 14.

Regarding Claim 6
All of the elements herein are similar to some or all of the elements recited in Claim 15.

Regarding Claim 7
Cao and An et al. further teach:
wherein the distances calculated based on the binary hash codes are Hamming distances.
Cao teaches “a novel Deep Visual-Semantic Hashing (DVSH) approach to cross-modal retrieval, which learns end-to-end (1) a bimodal fusion function…which maps images and texts into a K-dimensional joint Hamming embedding space H so that the embeddings of each image-sentence pair are tightly fused to bridge different modalities whilst the similarity information conveyed in given bimodal object pairs S is preserved” (Page 1447, Col. 2, 2nd Paragraph). Cao further teaches “the derived joint visual-semantic embeddings hi not only captures the spatial dependencies over images and temporal dynamics over sentences using CNN and LSTM respectively, but also captures the cross-modal relationship in a multimodal Hamming embedding space” (Page 1448, Col. 2, 1st Paragraph). Therefore, Cao teaches calculating distances as Hamming distances.

Regarding Claim 9
Cao and An et al. further teach:
wherein a quantization code is also generated for the query item, and
An et al. teaches using quantization for retrieval (Para. [0148]) thereby teaching generating a quantization code for the query.
the quantization distance is calculated using the quantization code of the query item and the quantization code of the database item.
An et al. teaches “aspects also described herein include retrieval of images from a large-scale database of images based on a query image, by accessing a low level feature transformation, a low dimensional projection into a semantic attribute subspace, and a distance metric” where “similarity of the semantic attribute projection is measured by the distance metric” (Para. [0069]).

Regarding Claim 16
All of the elements herein are similar to some or all of the elements recited in Claim 1.

Regarding Claim 17
All of the elements herein are similar to some or all of the elements recited in Claim 11.

Regarding Claim 18
All of the elements herein are similar to some or all of the elements recited in Claim 12.

Regarding Claim 19
All of the elements herein are similar to some or all of the elements recited in Claim 7.

Regarding Claim 21
All of the elements herein are similar to some or all of the elements recited in Claim 9.

Regarding Claim 22
Some of the elements herein are similar to some or all of the elements recited in Claim 1.

Cao and An et al. further teach:
one or more processors (Para. [0241] of An et al.);
one or more memory units coupled to the one or more processors, wherein the one or more memory units store instructions that, when executed by the one or more processors, cause the system to perform operations (Paras. [0241]-[0243] of An et al.).


Claims 8 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cao and An et al., and further in view of Non-Patent Literature Yang et al., "Shared Predictive Cross-Modal Deep Quantization", 14 February 2018, IEEE Transactions on Neural Networks and Learning Systems (Volume: 29, Issue: 11, November 2018) (Year: 2018), hereinafter referred to as Yang.

Regarding Claim 8
Cao and An et al. explicitly teach all of the elements as recited above except:
wherein the quantization distance is an asymmetric quantizer distance (AQD) calculated using the feature vector of the query item and the quantization code of the database item.

However, in the related field of endeavor of shared predictive cross-modal deep quantization, Yang teaches:
wherein the quantization distance is an asymmetric quantizer distance (AQD) calculated using the feature vector of the query item and the quantization code of the database item.
Yang teaches, in the search process, using asymmetric quantizer distance (AQD) to calculate the distance between the query and the text point in the database (Page 5296, Col. 2, Section D).
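
For illustration only, an asymmetric distance of the kind recited in Claim 8 can be sketched as below: the query side uses its unquantized feature vector, while the database side uses only its quantization code. This is a minimal sketch and not the formulation of Yang or of the application. It assumes an additive codebook form (the code is one codeword index per codebook and the decoded vector is the sum of the selected codewords) and a squared Euclidean distance; the names reconstruct, asymmetric_quantizer_distance, and codebooks are hypothetical.

```python
from typing import List, Sequence

Codebook = List[List[float]]  # assumed layout: codebook[index] -> codeword vector


def reconstruct(code: Sequence[int], codebooks: List[Codebook]) -> List[float]:
    """Decode a quantization code by summing one selected codeword per codebook."""
    dim = len(codebooks[0][0])
    approx = [0.0] * dim
    for book, idx in zip(codebooks, code):
        for d in range(dim):
            approx[d] += book[idx][d]
    return approx


def asymmetric_quantizer_distance(
    query_vec: Sequence[float],
    code: Sequence[int],
    codebooks: List[Codebook],
) -> float:
    """Squared distance between a raw query vector and a decoded database code."""
    approx = reconstruct(code, codebooks)
    return sum((q - a) ** 2 for q, a in zip(query_vec, approx))


# Hypothetical toy codebooks: two codebooks, two codewords each, dimension 2.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
near = asymmetric_quantizer_distance([1.0, 0.5], (1, 1), codebooks)
far = asymmetric_quantizer_distance([1.0, 0.5], (0, 0), codebooks)
```

The distance is "asymmetric" in the sense that only the database item is represented by its code; the query is never quantized, so no quantization error is introduced on the query side.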

Thus, it would have been obvious to one of ordinary skill in the art, having the teachings of Yang, An et al., and Cao at the time that the claimed invention was effectively filed, to have combined the ***, as taught by Yang, with the subsequent re-ranking of the candidate results, as taught by An et al., and the systems and methods for deep visual-semantic hashing for cross-modal retrieval, as taught by Cao.
One would have been motivated to make such a combination because Yang teaches explicitly and jointly modeling “a private subspace for each modality and a shared subspace between different modalities” (Page 5294, Col. 2, 2nd Paragraph) where “the private subspaces are used to capture modality specific properties, whereas the shared subspace is used to capture the representations shared by multiple modalities. Fig. 1 shows the difference between the traditional common subspace learning methods and the proposed method. Compared with traditional common subspace learning methods…, by finding a shared subspace that is independent of the private subspaces, our proposed model can capture intrinsic semantic information shared between multimodal data more efficiently.” (Page 5293, Col. 1, 2nd Paragraph).

Regarding Claim 20
All of the elements herein are similar to some or all of the elements recited in Claim 8.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wan et al., "Discriminative Latent Semantic Regression for Cross-Modal Hashing of Multimedia Retrieval", 21 October 2018, 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM) (Year: 2018) teaches integrating high-level feature extractions and learning of the common data representations based on latent semantic regression for hashing multimedia data in a unified optimization framework where the proposed latent semantic regression approach results in a discriminative solution maximizing the inter-modal correlation while preserving the intra-modal similarity of high-level features.
Foreign Patent Document CN-110309331-A teaches a cross-mode combined hash searching method based on self-monitoring comprising the following steps: step 1: the image modality data is processed with a deep convolutional neural network to data of image modality for feature extraction. the picture data for hash learning, the deep convolutional neural network of the last layer node number is fully connected layer is the length of the hash code, step 2, processing for text mode data using a word bag model for modelling the text data. establishing a fully connected neural network of a two-layer of data of the text mode for feature extraction, the input of the neural network is represented by the word vector using the word model a first fully connected layer node of the data connected with the second full layer node of the data and the hash code of the same length; the step 3: the category label neural network processing: adopting extracting semantic features from tag data from supervised training mode; step 4: the distance between the semantic feature of the feature image and text with the tag network to minimize network extracted, the hash model of image and text network can more fully learning of semantic characteristic between different modes.
Foreign Patent Document CN-110019652-A teaches a cross-modal hash searching method based on deep learning comprising the following steps: the objective function (1) using the obtained based on deep learning technical design of shared image mode and a text mode of binary hash code, image modality and the text modality of deep neural network parameter and the image mode and the text mode of the projection matrix and (2) using an alternative means of updating of unknown variables in the objective function. , and, (3) based on the image mode and the text mode obtained by the solving of deep neural network parameter, and the projection matrix and; (4) based on the generated binary hash code calculating inquiry sample concentrated on the Hamming distance of each sample to search sample, (5) using the searcher based on an approximate nearest neighbor search for cross-mode completes the retrieval of the query sample. The method effectively improves the cross-property of the modal hash searching.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT F MAY whose telephone number is (571)272-3195. The examiner can normally be reached Monday-Friday 9:30am to 6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hosain Alam can be reached on 571-272-3978. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT F MAY/Examiner, Art Unit 2154    8/13/2022

/HOSAIN T ALAM/Supervisory Patent Examiner, Art Unit 2154