DETAILED ACTION
1.	This action is in response to the communications filed on 12/01/2020, in which claims 1, 2, 11, 12, 16, and 17 are amended and claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-5, 7-8, 11-13, 15, 16-18 and 20 are rejected under 35 U.S.C. 102 (a)(1) as being anticipated by Wang ("Effective deep learning-based multi-modal retrieval").
In regard to claims 1, 11 and 16, Wang teaches: A method of facilitating neural-network-based mapping of multi-data-type items in a vector space and search of the vector space, (Wang, p.79 "Effective deep learning-based multi-modal retrieval"; p.79 Abstract "Multi-modal retrieval [search for multi-data-type items] is emerging as a new search paradigm that enables seamless information retrieval from various types of media... The mainstream solution to the problem is to learn a set of mapping functions that project data from different modalities into a common metric space [a vector space] in which conventional indexing schemes for high-dimensional space can be applied... we exploit deep learning [neural-network-based] techniques to learn effective mapping functions...")
the method being implemented by a computer system that comprises one or more processors executing computer program instructions that, when executed, perform the method, the method comprising: (Wang, p.88 "All experiments are conducted on CentOS 6.4 using CUDA 5.5 with NVIDIA [processor] (GeForce GTX TITAN). The size of main memory is 64GB and the size of GPU memory is 6GB. The code [program instructions] and hyper parameter settings… ")
obtaining at least 1000 content items, (Wang, p.95 8.3.1 Datasets "Supervised training requires input image–text pairs [multiple data types] to be associated with additional semantic labels... we use NUS-WIDE dataset to evaluate the performance of supervised training. We extract 203,400 labeled pairs, among which 150,000 are used for training. [> 1000 content items]") each of the 1000 content items comprising multiple data types, (Wang, p.80 2 Problem statements "In our data model, the database D consists of objects from multiple modalities.")
the multiple data types comprising three or more of text, metadata, image, audio, or video; (Wang, p.80 "Each modality represents one type of media such as text, image or video")
causing, based on the 1000 content items, multiple neural networks to be trained to map data in a vector space by (Wang, p.85 5 Supervised approach: MDNN "In this section, we propose a supervised learning algorithm called multi-modal deep neural network (MDNN) based on a deep convolutional neural network (DCNN) model and a neural language model (NLM) [multiple neural networks] to learn mapping functions for the image modality and the text modality, respectively. The model is shown in Fig. 7."; "The trained DCNN (or skip-gram + MLP) maps input data into latent features."; p. 81, "In step 2, objects from different modalities are first mapped into the common space Z [vector space] by function fm.") providing at least a first portion of each of the 1000 content items as input to at least one of the multiple neural networks and (Wang, p.80 "In our data model, the database D consists of objects from multiple modalities."; p. 85, "Fig. 7 Model of MDNN, which consists of one DCNN for image modality [first portion] …"; input image x) providing at least a second portion of each of the 1000 content items as input to at least another one of the multiple neural networks, (Wang, p.85 "and one skip-gram + MLP for text modality. [second portion]"; input text y)
each of the first portions corresponding to a first data type, and each of the second portions corresponding to a second data type different than the first data type, (Wang, p.85 "we propose a supervised learning algorithm called multi-modal deep neural network (MDNN) based on a deep convolutional neural network (DCNN) model and a neural language model (NLM) to learn mapping functions for the image modality [first data type] and the text modality [second data type], respectively."; Image and text are different data types.) a first dimension of the vector space corresponds to the first data type, a second dimension of the vector space corresponds to the second data type, and a third dimension of the vector space corresponds to a third data type, and (Wang, p.80 "For ease of presentation, we use images and text as two sample modalities to explain our idea, i.e., we assume that D = DI ∪ DT. An image (resp. a text document) is represented by a feature vector x ϵ DI (resp. y ϵ DT)... A common approach is to learn a set of mapping functions that project the original feature vectors into a common latent space such that semantically relevant objects (e.g., image and its tags) are located close."; p. 81 "By mapping objects from different high-dimensional feature spaces into a low-dimensional latent space..."; because the original high-dimensional feature vectors, such as x and y, each correspond to a data type, the low-dimensional latent feature vectors mapped from them inherently correspond to the respective data types) the first data type, the second data type, and the third data type respectively corresponds to a different one of text, metadata, image, audio, or video; (Wang, p.80 "Each modality represents one type of media such as text, image or video")
wherein the multiple neural networks are configured to share a common set of activations such that inputs for the first and second data types provided to the multiple neural networks are mapped to the common set of activations; (Wang, p.81 "Definition 1 Common Latent Space Mapping… The common latent space [a common set of activations] mapping provides a unified approach to measuring distance of objects from different modalities [the first and second data types]. As long as all objects can be mapped into the same latent space, they become comparable"; p. 87 "Given a set of [the first and second data types], high-dimensional raw features (e.g., bag-of-visual-words or RGB feature for images) are extracted from each source and mapped into a common latent space [a common set of activations] using the learned mapping functions"; p.82 Fig. 2 "In multi-modal training, objects of same shape from all modalities are moving close to each other.")
obtaining a search request for one or more multi-data-type content items, the search request comprising one or more search parameters, a multi-data-type content item comprising at least some of the multiple data types; (Wang, p.80 "Cross-modal search enables users to explore relevant resources from different modalities [multi-data-type items]. For example, a user can use a tweet [e.g. search parameters] to retrieve relevant photographs and videos from other heterogeneous data sources, or search relevant textual descriptions or videos by submitting an interesting image [e.g. search parameters] as a query [search request].")
predicting, via at least one of the multiple neural networks and the common set of activations, (Wang, p.85 "… multi-modal deep neural network (MDNN) based on a deep convolutional neural network (DCNN) model and a neural language model (NLM)…" [multiple neural networks]; p.81 "… The common latent space [the common set of activations] mapping provides a unified approach… ") a region within the vector space for satisfying the search request, the region being predicted based on the one or more search parameters; and (Wang, p.81 "Definition 2 Multi-Modal Search: Given a query object Q ∈ Dq and a target domain Dt (q, t ∈ {I, T }) [search parameters], find a set O ⊂ Dt with k objects [a region including k objects] such that ∀o ∈ O and o'  ∈ Dt /O, distZ( fq (Q), ft (o')) ≥ distZ( fq (Q), ft (o)).")
providing, as a response to the search request, information indicating one or more content items mapped to the predicted region of the vector space. (Wang, p.81 "When a query Q ϵ Dm comes, it is first mapped into using its modal-specific mapping function fm. Based on the query type, k nearest neighbors [items in the region] are retrieved from the index built for the target modality and returned to the user. [a response]")
Claims 11 and 16 recite substantially the same limitations as claim 1; therefore, the rejection applied to claim 1 also applies to claims 11 and 16. In addition, Wang teaches: predict one or more locations within the vector space based on the one or more parameters; and (Wang, p.81 "Definition 2 Multi-Modal Search: Given a query object Q ∈ Dq and a target domain Dt (q, t ∈ {I, T }) [search parameters], find a set O ⊂ Dt with k objects [a region/location including k objects] such that ∀o ∈ O and o'  ∈ Dt /O, distZ( fq (Q), ft (o')) ≥ distZ( fq (Q), ft (o)).")
provide, as a response to the request, information indicating one or more content items mapped to the predicted one or more locations of the vector space or to one or more other locations proximate the predicted one or more locations. (Wang, p.81 "When a query Q ϵ Dm comes, it is first mapped into using its modal-specific mapping function fm. Based on the query type, k nearest neighbors [items in the region/location] are retrieved from the index built for the target modality and returned to the user. [a response]")
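For illustration only, the k-nearest-neighbor retrieval recited in Wang's Definition 2 may be sketched as follows. This sketch is not taken from Wang or from the application: the function names, the item identifiers, and the two-dimensional latent vectors are hypothetical, and the query is assumed to have already been mapped into the common latent space by its modality-specific mapping function.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest(query_vec, index, k):
    """Return the ids of the k indexed items closest to query_vec.

    index is a list of (item_id, latent_vector) pairs for the target modality.
    A linear scan is used here for clarity; a real system would use an index.
    """
    ranked = sorted(index, key=lambda entry: euclidean(query_vec, entry[1]))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical latent vectors for three indexed images.
index = [("img_a", (0.0, 0.0)), ("img_b", (1.0, 1.0)), ("img_c", (5.0, 5.0))]
# Hypothetical latent vector of a text query after mapping.
query = (0.9, 1.2)
result = k_nearest(query, index, k=2)
```

Under these assumed values, the returned set contains the two items whose latent vectors are no farther from the query than any item left out, which is the condition stated in Definition 2.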
In regard to claims 2, 12 and 17, Wang teaches: The method of claim 1, further comprising: processing, via the multiple neural networks, the 1000 content items to generate location predictions with respect to the vector space, each of the location predictions being a location to which at least a portion of a content item of the 1000 content items is predicted to correspond; (Wang, p.90 "Fig. 9 Visualization of latent features [location predictions] after projecting them into 2D space"; p.97 "Fig. 17 Visualization of latent features learned by MDNN [neural networks] for the test dataset of NUS-WIDE-a (features represented by the same shapes and colors are annotated with the same label). a Image latent feature. b Text latent feature [a portion of content items]")
obtaining a reference feedback set, the reference feedback set comprising i) reference locations with respect to the vector space, each of the reference locations being a location to which at least a portion of a content item of the 1000 content items is confirmed to correspond, and (Wang, p.86 "By minimizing prediction error, we require the learned high-level feature vectors f(x) to be discriminative in predicting labels. Images with similar labels shall have similar feature vectors. [reference locations] In this way, the intra modal semantics are preserved... We follow the general learning objective in Eq. 1 and realize LI and LT using Eqs. 12 and 15, respectively.")
(ii) reference indications of outputs that are not to be derived from a machine learning model's processing of the at least a portion of a content item of the 1000 content items; and (Wang, p. 80 "The supervised approach requires additional labels for the media objects… LSCMR [25] uses training examples, each of which consists of a list of objects ranked according to their relevance (based on manual labels) to the first one."; p. 88 " NUS-WIDE… We refer to the image and its tags as an image–text pair. There are 81 ground truth labels manually annotated for evaluation."; p. 95, 8.3 Experimental study of supervised approach "Supervised training requires input image–text pairs to be associated with additional semantic labels… we use NUS-WIDE dataset to evaluate the performance of supervised training"; Manual labels are reference indications of outputs that are not to be derived from a machine learning model's processing, based on spec. [0032], user indications that outputs are inaccurate are examples of reference indications of outputs not derived from machine learning model.)
updating the multiple neural networks based on the location predictions and the reference feedback set. (Wang, p.86 "By minimizing the distance of latent features for an image–text pair, we require their latent features to be closer in the latent space. In this way, the intermodal semantics are preserved… All training is conducted by back-propagation [updating neural network] using mini batch SGD (see “Appendix”) to minimize the objective loss (Eq. 1).")
Claims 12 and 17 recite substantially the same limitations as claim 2; therefore, the rejection applied to claim 2 also applies to claims 12 and 17. In addition, Wang teaches: via the multiple prediction models (Wang, p.85 5 Supervised approach: MDNN [multiple prediction models]; MDNN comprises multiple neural networks, including the DCNN and the NLM)
In regard to claims 3, 13 and 18, Wang teaches: The method of claim 2, wherein a neural network of the multiple neural networks (i) determines similarities or differences between the location predictions and their corresponding reference locations and (Wang, p.86 "Euclidean distance is used to measure the difference [similarities or differences] of the latent features for an image–text pair, i.e., LI,T is defined similarly as in Eq. 8.")
(ii) updates the neural network based on the determined similarities or differences. (Wang, p.86 "By minimizing the distance of latent features for an image–text pair, we require their latent features to be closer in the latent space. In this way, the intermodal semantics are preserved… All training is conducted by back-propagation [updating neural network] using mini batch SGD (see “Appendix”) to minimize the objective loss (Eq. 1).")
Claims 13 and 18 recite substantially the same limitations as claim 3; therefore, the rejection applied to claim 3 also applies to claims 13 and 18. In addition, Wang teaches: a prediction model of the multiple prediction models (Wang, p.85 5 Supervised approach: MDNN [multiple prediction models]; MDNN comprises multiple neural networks, including the DCNN and the NLM)
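For illustration only, the idea of measuring the Euclidean difference between two latent features and updating so as to reduce it may be sketched as follows. This is a simplified assumption, not Wang's training procedure: Wang back-propagates the loss through the network parameters with mini-batch SGD, whereas this sketch applies a single gradient step directly to two hypothetical latent vectors. The function names and the learning rate are also hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sgd_step(u, v, lr):
    """One gradient-descent step on squared_distance(u, v).

    The gradient with respect to u is 2(u - v) and with respect to v is
    -2(u - v), so each vector moves toward the other.
    """
    new_u = [a - lr * 2 * (a - b) for a, b in zip(u, v)]
    new_v = [b + lr * 2 * (a - b) for a, b in zip(u, v)]
    return new_u, new_v

# Hypothetical latent features of an image and its paired text.
image_latent = [1.0, 0.0]
text_latent = [0.0, 1.0]

before = squared_distance(image_latent, text_latent)
image_latent, text_latent = sgd_step(image_latent, text_latent, lr=0.1)
after = squared_distance(image_latent, text_latent)
```

With a sufficiently small learning rate, the distance after the step is smaller than before, which is the sense in which the paired latent features are required "to be closer in the latent space."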
In regard to claim 4, Wang teaches: The method of claim 1, wherein predicting the region comprises: determining a location within the vector space based on the one or more search parameters, the determined location corresponding to a predetermined relevance threshold for the search request; and (Wang, p.81 "When a query Q ϵ Dm comes, it is first mapped into using its modal-specific mapping function fm."; "Given an image x ∈ DI and a text document y ∈ DT , find two mapping functions f I : DI → Z, and fT : DT → Z, such that if x and y are semantically relevant, the distance between f I (x) and fT (y) in the common latent space Z, denoted by distZ( f I (x), fT (y)) [relevance threshold], is small."; mapping function fm determines a location in a latent space based on the search query)
predicting the region within the vector space based on the determined location. (Wang, "Based on the query type, k nearest neighbors [a region including k objects] are retrieved from the index built for the target modality and returned to the user."; the region is based on a location generated by the mapping function fm)
In regard to claim 5, Wang teaches: The method of claim 4, wherein predicting the region comprises: obtaining a predetermined distance threshold; and predicting the region within the vector space based on the determined location and the predetermined distance threshold. (Wang, p.81 "Definition 2 Multi-Modal Search: Given a query object Q ∈ Dq and a target domain Dt (q, t ∈ {I, T }), find a set O ⊂ Dt with k objects such that ∀o ∈ O and o'  ∈ Dt /O, distZ( fq (Q), ft (o')) ≥ distZ( fq (Q), ft (o)) [distance threshold].")
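For illustration only, one possible reading of a region defined by a determined location and a predetermined distance threshold may be sketched as follows. This sketch uses a fixed radius around the query location, which is an assumption made for clarity and differs from the k-nearest-neighbor formulation quoted from Wang; the function name, item identifiers, and threshold value are hypothetical.

```python
import math

def region_within(center, items, max_distance):
    """Return ids of items whose latent vectors lie within max_distance of center.

    items is a list of (item_id, latent_vector) pairs.
    """
    return [
        item_id
        for item_id, vec in items
        if math.dist(center, vec) <= max_distance
    ]

# Hypothetical mapped content items and a hypothetical query location.
items = [("a", (0.0, 0.0)), ("b", (3.0, 4.0)), ("c", (6.0, 8.0))]
center = (0.0, 0.0)
in_region = region_within(center, items, max_distance=5.0)
```

Here the threshold is inclusive, so an item lying exactly at the threshold distance is treated as inside the region.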
In regard to claim 7, Wang teaches: The method of claim 1, further comprising: adding a dimension to the vector space subsequent to at least some content items being mapped in the vector space, the added dimension corresponding to a given data type. (Wang, p. 81 "Given an image x ∈ DI and a text document y ∈ DT , find two mapping functions f I : DI → Z, and fT : DT → Z, such that if x and y are semantically relevant, the distance between f I (x) and fT (y) in the common latent space Z... "; "A common approach is to learn a set of mapping functions that project the original feature vectors into a common latent space... "; "we can obtain a mapping function fm : Dm → Z for each modality m ∈ {I, T }."; Because a mapping function is learned for each respective modality, the latent feature vectors fT(y) may be added subsequent to x being mapped by fI(x), the added dimension fT(y) corresponding to the text document. This concept can also apply to heterogeneous data sources such as video.)
In regard to claims 8, 15 and 20, Wang teaches: The method of claim 1, further comprising: obtaining a first content item comprising at least some of the multiple data types; (Wang, p. 81 "In step 1, relevant image–text pairs [content items] are used as input training data for learning the mapping functions. For example, image–text [multiple data types] pairs can be collected from Flickr where the text features are extracted from tags and descriptions for images.")
processing, via a first neural network, at least a first portion of the first content item to generate a first vector corresponding to at least the first portion of the first content item; (see mapping below) processing, via a second neural network, at least a second portion of the first content item to generate a second vector corresponding to at least the second portion of the first content item; and
(Wang, p.85 "we propose a supervised learning algorithm called multi-modal deep neural network (MDNN) based on a deep convolutional neural network (DCNN) model and a neural language model (NLM) to learn mapping functions for the image modality [first portion of the first content items] and the text modality [second portion of the first content items], respectively."; p.81 "Definition 1 Common Latent Space Mapping: Given an image x ∈ DI and a text document y ∈ DT , find two mapping functions f I : DI → Z, and fT : DT → Z, such that if x and y are semantically relevant, the distance between f I (x) [first vector] and fT (y) [second vector] in the common latent space Z")
mapping the first content item in the vector space based on the first vector and the second vector. (Wang, p.81 "mapping functions f I : DI → Z, and fT : DT → Z"; mapping functions are mapping the item based on fI(x) and fT(y) to a common latent space Z)
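For illustration only, mapping an image portion and a text portion of one content item into a common latent space Z may be sketched as follows. The two linear maps stand in for the trained networks and are an assumption, as are the feature dimensions, the weight values, and all names; Wang's mapping functions are a DCNN and a skip-gram + MLP, not fixed matrices.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# Hypothetical stand-ins for the learned mapping functions f_I and f_T.
# Both project into the same 2-dimensional latent space Z.
W_IMAGE = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]           # 3-dim image features -> Z
W_TEXT = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # 4-dim text features -> Z

def map_item(image_features, text_features):
    """Return the (first vector, second vector) pair locating the item in Z."""
    first_vector = matvec(W_IMAGE, image_features)
    second_vector = matvec(W_TEXT, text_features)
    return first_vector, second_vector

# Hypothetical raw features of one image-text pair.
first, second = map_item([2.0, 1.0, 3.0], [4.0, 0.0, 1.0, 3.0])
```

In this contrived example the weights were chosen so that both portions of the item land on the same point of Z, which is the outcome training aims for when the image and text are semantically relevant.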
Claim Rejections - 35 USC § 103
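In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.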
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 6, 9, 14 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Moataz (US 20150193486 A1).
In regard to claims 6, 14 and 19, Wang fails to teach, but Moataz teaches: The method of claim 1, wherein the search request comprises one or more logical operators, and the one or more search parameters comprises multiple search parameters, the method further comprising: (Moataz, [0028] "According to one embodiment of the invention, the search query [search request] comprises search keywords [search parameters], each search keyword being associated with one negative or one positive operator [logical operators]...")
generating multiple vectors based on the multiple search parameters, at least one of the multiple vectors being generated based on at least one of the multiple search parameters, and at least another one of the multiple vectors being generated based on at least another one of the multiple search parameters; (Moataz, [0032] "each search keyword associated with a positive operator in an vector, called 'positive vector', in the span of the orthonormal basis"; [0033] "each search keyword associated with a negative operator in an vector, called 'negative vector'")
performing vector summation or negation on the multiple vectors based on the one or more logical operators to generate a resulting vector; and (Moataz, [0034] "Determining the search query as a matrix where each row corresponds to one conjunctive clause and each row is a vector based on the positive and negative vectors, each row comprising a first sum in which the positive vectors are gathered [vector summation] and a second sum in which the negative vectors are gathered.")
predicting the region within the vector space based on the resulting vector. (Moataz, [0045] "To know if there is any document that matches the query, the multiplication between the search query expressed as a matrix and the resultant vector corresponding to each of the documents is performed."; abstract "determining a general result based on the result of the multiplication between the query matrix and the resultant vector"; That is, the region containing the documents that match the query is determined from the resultant vector.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang to incorporate the teachings of Moataz by including Boolean search, such as conjunctive, disjunctive and negative searches. Doing so would enable a system to perform all the search options at the same time. (Moataz, [0003] "it would be advantageous to be able to perform all kind of encrypted searches on a collection of encrypted documents, including any Boolean search, such as conjunctive, disjunctive and negative searches."; [0005] "Currently, there is no solution enabling to perform Boolean search on encrypted data, i.e. conjunctive, disjunctive and negation search option at the same time.")
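For illustration only, combining multiple search-parameter vectors by summation or negation according to their operators may be sketched as follows. This is a simplified assumption and does not reproduce Moataz's scheme, which builds a query matrix over an orthonormal basis and operates on encrypted data; the operator symbols, vector values, and function name are hypothetical.

```python
def combine(terms):
    """Combine keyword vectors into one resulting vector.

    terms is a list of (vector, operator) pairs, where operator is "+" for a
    positive keyword (vector is added) or "-" for a negative keyword (vector
    is negated before being added).
    """
    dim = len(terms[0][0])
    result = [0.0] * dim
    for vec, op in terms:
        sign = 1.0 if op == "+" else -1.0
        for i, x in enumerate(vec):
            result[i] += sign * x
    return result

# Hypothetical vectors for two positive keywords and one negative keyword.
resulting = combine([
    ([1.0, 0.0, 2.0], "+"),
    ([0.5, 1.0, 0.0], "+"),
    ([0.0, 1.0, 1.0], "-"),
])
```

The resulting vector could then serve as the location from which a region of the vector space is predicted.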
In regard to claim 9, Wang and Moataz teach: The method of claim 8, wherein mapping the first content item in the vector space comprises: performing vector summation on the first vector and the second vector to generate a resulting vector; and (Moataz, [0011] "Then to each document we associate a resultant vector resulting of a linear combination of all vectors representing the keywords characterizing the document."; [0034] "... each row comprising a first sum in which the positive vectors are gathered [vector summation]")
mapping the first content item in the vector space based on the resulting vector. (Wang, p. 81, "In step 2, objects [first content item] from different modalities are first mapped into the common space Z [vector space] by function fm [the resulting vector generated by fm].")
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang to incorporate the teachings of Moataz by including Boolean search, such as conjunctive, disjunctive and negative searches. Doing so would enable a system to perform all the search options at the same time. 
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Sainath (US 20160099010 A1).
In regard to claim 10, Wang teaches: The method of claim 8, wherein mapping the first content item in the vector space comprises: … mapping the first content item in the vector space based on the resulting vector. (Wang, p. 81, "In step 2, objects [first content item] from different modalities are first mapped into the common space Z [vector space] by function fm [the resulting vector generated by fm].")
Wang fails to teach, but Sainath teaches: …processing, via a third neural network, the first vector and the second vector to generate a resulting vector; and… (Snth., [0021] "FIG. 1 shows a block diagram of an example system 100 that represents an acoustic model having CNN, LSTM, and DNN layers. The system 100 processes each input in a sequence of inputs to generate an output for each [third neural network]."; the concept of using a third neural network (DNN) is borrowed here to process the outputs of the LSTM and CNN, which can be used to generate outputs such as the first vector and the second vector.)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Wang to incorporate the teachings of Sainath by including three neural networks such as CNNs, LSTM, and DNNs. Doing so would allow a system to utilize their respective modeling capabilities and provide a better overall performance. (Snth., [0021] "In general, CNNs, LSTM, and DNNs are complementary in their modeling capabilities and may be combined into one acoustic model that provides a better overall performance (e.g., lower word-error-rate).")
Response to Arguments
Applicant's arguments filed on 12/01/2020 with respect to the rejection of the claims under 35 U.S.C. 103 have been fully considered but they are moot:
Applicant argues: (see p. 11 middle, claim 1): “Nothing in Wang, however, discloses or suggests mapping inputs for different data types to a common set of activations…” 
Examiner respectfully disagrees: the arguments do not apply to the new citation in the reference (Wang) being used in the current rejection.
Conclusion
The art made of record and not relied upon is considered pertinent to applicant's disclosure.  
(1) Wang ("MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition") teaches commonly learned features that represent both "modalities" of the input. 
(2) Ngiam ("Multimodal Deep Learning") teaches a shared representation between modalities.

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SU-TING CHUANG whose telephone number is (408)918-7519.  The examiner can normally be reached on Monday - Thursday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571)272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ERIC NILSSON/Primary Examiner, Art Unit 2122