DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This communication is responsive to the application filed 07/01/2020.
Claims 1-20 are pending with claims 1, 14, and 20 as independent claims.
This action is made Non-Final.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 07/01/2020 was filed on the same date as the application. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Forsyth et al. (US 2020/0311798, filed 03/25/2019, hereinafter as Forsyth) in view of Amer et al. (US 2019/0303404, filed 10/21/2018, hereinafter as Amer).


As per claim 1, a computer processing system for cross-modal data retrieval, comprising:
a neural network having a time series encoder and text encoder which are jointly trained based on a triplet loss, the triplet loss relating to two different modalities of (i) time series and (ii) free-form text comments, which respectively correspond to a training set of time series and a training set of free-form text comments; (Forsyth discloses in [0056-0059] “the visual semantic embedder 124 may include a text embedder 450 and an image embedder 460, which the visual semantic embedder 124 may use to encode each image in a general embedding space of feature vectors… the visual semantic embedder 124 scrapes metadata from or associated with each of the images 410A, 410B, and 410C selected from the online site(s) to retrieve text that is descriptive of each of the images… The visual semantic embedder 124 may then input the descriptive text into the text imbedder 450. The text embedder 450 may, based on analysis of these text descriptions in relation to the characteristic terms, generate a second set of vectors, one for each of the images 410A, 410B, and 410C that corresponds to the text lexicon. The text embedder 450 may output the second set of vectors (e.g., a text embedding) to the full connected layer 464 of the neural networking processing.” And in [0060] “The general embedding may therefore be a third set of vectors that results from this training that is output from the fully connected layer 464 into a projection actuator 455. The full connected layer 464 may train a general visual semantic embedding model over time with respect to many different fashion items.” The visual semantic embedder 124 has a text embedder/encoder 450 and an image embedder/encoder 460, wherein the image embedder is configured to be associated with visual semantic loss 454 via a fully connected layer 464. See fig. 4B)
a database for storing the training sets with feature vectors extracted from encodings of the training sets, the encodings obtained by encoding the time series in the training set of time series using the time series encoder and encoding the free-form text comments in the training set of free-form text comments using the text encoder; (Forsyth discloses in [0060] “The visual-semantic loss 454 may determine differences between prediction values for the characteristics terms of the first set of vectors compared to the second set of vectors. The visual semantic embedder 124 may then train a general embedding using the visual-semantic loss (e.g., difference) values between the image embedding and text embedding for the characteristic terms of each corresponding item… The general embedding may therefore be a third set of vectors that results from this training that is output from the fully connected layer 464 into a projection actuator 455.” The text embedder/encoder 450 produces text vector and the image embedder/encoder 460 produces image vector such that both vectors may be input to the visual semantic loss 454. See fig. 4B) and 
a hardware processor for retrieving the feature vectors corresponding to at least one of the two different modalities from the database for insertion into a feature space together with at least one feature vector corresponding to a testing input relating to at least one of a testing time series and a testing free-form text comment, determining a set of nearest neighbors from among the feature vectors in the feature space based on distance criteria, and outputting testing results for the testing input based on the set of nearest neighbors; (Forsyth discloses in [0061-…] “…retrieves a type 456 with which to choose a type-specific projection 458, which is to act on the third set of vectors of the FC layer 464 to generate individual type-specific embeddings. The type 456 may be determined from text associated with each of the images 410A, 410B, and 410C retrieved from the online site(s)… the projection actuator 455 may force each vector of the third set of vectors (from the full connected layer 464) to be projected onto one of multiple type-specific embedding spaces, each of which is a sub-space of the general embedding space for the lexicon of characteristic terms… the first type-specific embedding space (Embed_1A) may be top-pants space, the second type-specific embedding space (Embed_1B) may be a top-skirt space, and a third type-specific embedding space (Embed_1C) may be a pants-skirt space.” The visual semantic embedder may retrieve a third vector based on the text vector and/or the image vector such that the image may be embedded into a specific embedding space. See fig. 4B)
Forsyth discloses in [0169] “The logical function or system element described may, among other functions, process and/or convert an analog data source such as an analog electrical, audio, or video signal, or a combination thereof, to a digital data source for audio-visual purposes or other digital processing purposes such as for compatibility for computer processing.” However, Forsyth does not provide details.
Forsyth does not explicitly disclose time series. However, Amer, in an analogous art, discloses in ([0070-0074 and 0150-0152] “system 100C, which is shown converting a set of data including textual scene or story information 112, non-text story information 113, user input 114, and/or non-text user input 115 into animation 180…non-text processor 119 may detect input in the form of one or more instances of non-text scene …”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claims 2 and 15, the rejection of the computer processing system of claim 1 is incorporated, further, wherein the triplet loss is for triplets from both of the two different modalities such that a first and a second triplet value are from a same semantic class and a third triplet value is from a different semantic class from among a plurality of semantic classes in which various one of the two different modalities are characterized; (Forsyth discloses in [0064] “Note that the prediction value of the top within top-pants space (Embed_A) is close to the prediction value of the pants in top-pants space, and thus the top and pants would be considered compatible to…” Table A shows prediction values from triplet loss determiner 480 such that triplet 410A and triplet 410B are close (same class) and therefore may be compatible to be worn together. However, triplet 410B and triplet 410C are far apart (different class) and therefore may be incompatible. See Table A).
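
For reference, the triplet relationship recited in claims 2 and 15 can be sketched as follows. This is a minimal illustration of a generic margin-based triplet loss and is not drawn from Forsyth, Amer, or the application; the function names, the Euclidean distance, the margin of 1.0, and the example vectors are all assumptions chosen for readability.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (assumed form, not the applicant's).

    The loss is zero when the anchor is closer to the positive (same
    semantic class) than to the negative (different class) by at least
    `margin`; otherwise it equals the amount of the violation.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings: one time series and two text comments.
ts_anchor = [0.0, 0.0]       # encoded time series
text_same = [0.0, 1.0]       # encoded comment, same semantic class
text_other = [3.0, 4.0]      # encoded comment, different semantic class

loss_satisfied = triplet_loss(ts_anchor, text_same, text_other)  # 1 - 5 + 1 < 0, so 0.0
loss_violated = triplet_loss(ts_anchor, text_other, text_same)   # 5 - 1 + 1 = 5.0
```

In this sketch the first two triplet values share a semantic class and the third does not, which mirrors the claim language; the two modalities enter only through which encoder (hypothetically) produced each vector.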

As per claims 3 and 16, the rejection of the computer processing system of claim 1 is incorporated, further, wherein the hardware processor performs the insertion into the feature space by applying a sampling method to triplets corresponding to at least one of the training set of time series and the training set of free-form text comments, the sampling method only selecting particular ones of the feature vectors that are outside a pre-specified margin separating at least two different semantic classes in a given tuple by less than a threshold margin violation amount; (Forsyth discloses in [0060] “The general embedding may therefore be a third set of vectors that results from this training that is output from the fully connected layer 464 into a projection actuator 455.” And in [0065] “a small difference means that the two type-specific embeddings are close and a large difference means that the two type-specific embeddings are far apart. Closeness of prediction values may be established as within a percentage threshold of each other, or within a set range, or the like.” The third set of vectors may result from the training and be output from the fully connected layer 464. See fig. 4B. The threshold value may be determined to be within a percentage margin/value).

As per claims 4 and 17, the rejection of the computer processing system of claim 1 is incorporated, further, wherein the time series encoder and the text encoder are jointly trained by learning transforms such that after an application of the transforms to instances of a same semantic class from the training sets, the instances of the same semantic class remain close in the feature space within a given threshold distance while instances of different semantic classes are separated in the feature space by at least a specified margin distance different than the given threshold distance; (Forsyth discloses in [0064-0066] “Closeness of prediction values may be established as within a percentage threshold of each other, or within a set range, or the like.” Table A indicates that the set range of threshold values would allow for different classes to be compatible: triplet class 410A is close to triplet class 410C (values 0.64 and 0.76), triplet class 410A is even closer to triplet class 410B (values 0.87 and 0.92), whereas triplet class 410B is far from triplet class 410C (values 0.20 and 0.45). See Table A).

As per claims 5 and 18, the rejection of the computer processing system of claim 4 is incorporated, further, wherein the hardware processor performs the insertion into the feature space by applying a sampling method to triplets corresponding to at least one of the training sets, the sampling method only selecting particular ones of the feature vectors that are outside the pre-specified margin distance by less than a threshold margin violation amount; (Forsyth discloses in [0066-0069] “the visual semantic embedder 124 may further include a second fully connected layer 474 that employs a generalized distance metric 468 with which to train a type-specific embeddings model to quantify the distance (e.g., closeness or separateness) of the items based on these differences in the prediction values across the different specific embedding spaces (Embed_1A, Embed_1B, and Embed_1C). The learning may occur over iterations of items based on the characteristic terms in the text lexicon as applied to different items.”). 

As per claims 6 and 19, the rejection of the computer processing system of claim 1 is incorporated, further, Forsyth does not explicitly disclose wherein the testing input is an input time series of arbitrary length applied to the time series encoder to obtain the testing results as an explanation of the input time series in a form of one or more free-form text comments. However, Amer, in an analogous art, discloses in ([0132, 0151, 0155-0156, and 0159] “animation 380 may be generated for the purpose of identifying, in an explainable way, videos that are relevant to a text query detected as input at computing device 370…the generated exemplars (e.g., animations 620) represent how the generative ranking model interprets the query. These animations 620 therefore serve as an explanation when presented to the user (i.e., a global explanation) of the decisions made by module 630 when identifying relevant videos.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claim 7, the rejection of the computer processing system of claim 1 is incorporated, further, Forsyth does not explicitly disclose wherein the testing input is an input free-form text comment of arbitrary length applied to the text encoder to obtain the testing results as one or more time series having a same semantic class as the input free-form text comment. However, Amer, in an analogous art, discloses in ([0063-0066, 0150-0156] “Neural networks have been successfully applied to NLP problems, such as in sequence-to-sequence or (sequence-to-vector) models applied to machine translation and word-to-vector approaches. In some examples, techniques in accordance with one or more aspects of the present disclosure combine those approaches with supplemental structural information, such as sentence length in a textual description of a scene. Such an approach may model local information and global sentence structure… The input nodes are encoded as a fixed sequence of identical length and the output are labels of the provided structure.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claim 8, the rejection of the computer processing system of claim 1 is incorporated, further, Forsyth does not explicitly disclose wherein the testing input comprise both an input time series of arbitrary length applied to the time series encoder to obtain a first vector for the insertion into the feature space and an input free-form text comment of arbitrary length applied to the text encoder to obtain a second vector for the insertion into the feature space. However, Amer, in an analogous art, discloses in ([0009, 0072 and 0151] “comparison module 530 uses a deep CNN-RNN, as further described below and illustrated in FIG. 5B, to encode a fixed-length video clip from database 505 using a CNN. Comparison module 530 encodes text query 502 using a RNN and computes a score of how well each video clip in data store 505 matches the query. In some examples, a long video is partitioned into multiple fixed-length clips and the scores are averaged over the entire video… Non-text processor 119 may process non-text scene information 113 and output the processed information to composition graph module 121.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claim 9, the rejection of the computer processing system of claim 1 is incorporated, further, wherein the triplet loss is optimized by updating parameters of the neural network using stochastic gradient descent; (Forsyth discloses in [0048 and 0051] “The Adam optimizer is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.”). 
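
The optimization recited in claim 9 can be illustrated with a single gradient step on one sampled triplet. This is a hedged sketch only: it treats the three embedding vectors themselves as the trainable parameters (in an actual network the gradients would flow back into the encoder weights), uses squared Euclidean distance for a simple closed-form gradient, and assumes a margin of 1.0 and a learning rate of 0.1. None of these names or values come from Forsyth, Amer, or the application.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss_sq(a, p, n, margin):
    """Triplet loss on squared distances: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, sq_dist(a, p) - sq_dist(a, n) + margin)

def sgd_step(a, p, n, margin=1.0, lr=0.1):
    """One stochastic gradient descent step on a single sampled triplet.

    Assumption: the embeddings are updated directly. When the loss is
    zero the gradient is zero and nothing changes.
    """
    if triplet_loss_sq(a, p, n, margin) == 0.0:
        return a, p, n
    # Gradients of d(a,p) - d(a,n) + margin with respect to each vector.
    grad_a = [2 * (ni - pi) for pi, ni in zip(p, n)]
    grad_p = [2 * (pi - ai) for ai, pi in zip(a, p)]
    grad_n = [2 * (ai - ni) for ai, ni in zip(a, n)]

    def step(x, g):
        return [xi - lr * gi for xi, gi in zip(x, g)]

    return step(a, grad_a), step(p, grad_p), step(n, grad_n)

# Hypothetical triplet in which the negative sits closer to the anchor than the positive.
a, p, n = [0.0, 0.0], [1.0, 0.0], [0.5, 0.0]
before = triplet_loss_sq(a, p, n, 1.0)
a2, p2, n2 = sgd_step(a, p, n)
after = triplet_loss_sq(a2, p2, n2, 1.0)
```

The step pulls the positive toward the anchor and pushes the negative away, so the loss on this triplet decreases after the update.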

As per claim 10, the rejection of the computer processing system of claim 1 is incorporated, further, Forsyth does not explicitly disclose wherein the testing input comprises a tuple of a text segment, a time series segment, and another text segment. However, Amer, in an analogous art, discloses in ([0070] “FIG. 1C illustrates system 100C, which is shown converting a set of data including textual scene or story information 112, non-text story information 113, user input 114, and/or non-text user input 115 into animation 180.” The non-text input may be image, motion, and/or speech/audio data input to be processed by dialogue manager 122 and/or gesture and gaze tracking 124).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claim 11, the rejection of the computer processing system of claim 1 is incorporated, further, wherein multiple convolutional layers of the neural network capture local contexts and a transformed network of the neural network captures long term context dependencies relative to the local contexts; (Forsyth discloses in [0038, 0044, 0091-0094, and 0101] “the search engine server 120 leverages these pre-trained layers to train one additional neuron per characteristic, allowing the disclosed model to capture fashion characteristics with only few representative examples in the training set. This architecture also makes the model extensible: as tastes change and fashion evolves, new characteristics can be added without having to retrain the entire network.”). 

As per claim 12, the rejection of the computer processing system of claim 1 is incorporated, further, Forsyth does not explicitly disclose wherein the testing input comprises a given time series data at least one hardware sensor for anomaly detection of a hardware system. However, Amer, in an analogous art, discloses in ([0070] “Sensor 171 may detect or sense images, motion, and/or speech and may be a camera and/or a behavior analytics system capable of detecting movements, gestures, poses, and other information.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Forsyth with the teaching of Amer that would modify the teaching of Forsyth because “Artificial intelligence techniques, such as machine learning, are starting to be applied to disciplines that involve video, and such techniques may have implications for interactive storytelling, video search, surveillance applications, user interface design, and other fields.” See Amer [background].

As per claim 13, the rejection of the computer processing system of claim 12 is incorporated, further, wherein the hardware processor controls the hardware system responsive to testing results; (Forsyth discloses in [0045] “To create the training and test sets, the dataset was randomly split to contain 55,670 (85%) and 9212 (15%) items respectively. The final lexicon includes 1108 unigrams, 134 bi-grams, 45 tri-grams, 11 quad-grams, one penta-gram, and one septa-gram, and contained 136 types, 120 materials, 305 brands, 418 styles, 68 colors, 37 patterns, 84 trims and 151 shapes.”). 

As per claim 14, a computer-implemented method for cross-modal data retrieval, comprising: 
jointly training a neural network having a time series encoder and text encoder based on a triplet loss, the triplet loss relating to two different modalities of (i) time series and (ii) free-form text comments, which respectively correspond to a training set of time series and a training set of free-form text comments; (rejected based on rationale used in rejection of claim 1)
storing, in a database, the training sets with feature vectors extracted from encodings of the training sets, the encodings obtained by encoding the time series in the training set of time series using the time series encoder and encoding the free-form text comments in the training set of free-form text comments using the text encoder; (rejected based on rationale used in rejection of claim 1)
retrieving the feature vectors corresponding to at least one of the two different modalities from the database for insertion into a feature space together with at least one feature vector corresponding to a testing input relating to at least one of a testing time series and a testing free-form text comment; (rejected based on rationale used in rejection of claim 1) and 
determining, by a hardware processor, a set of nearest neighbors from among the feature vectors in the feature space based on distance criteria, and outputting testing results for the testing input based on the set of nearest neighbors; (rejected based on rationale used in rejection of claim 1). 
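
The retrieving and determining steps of claim 14 can be sketched as a nearest-neighbor lookup over stored feature vectors. This is an illustrative sketch under stated assumptions, not the applicant's implementation or any teaching of the cited art: the in-memory list standing in for the database, the Euclidean distance criterion, the value k = 2, and the example comments and vectors are all hypothetical.

```python
import math

def k_nearest(query, database, k=2):
    """Return the labels of the k stored vectors closest to `query`.

    `database` is a list of (label, feature_vector) pairs; distance is
    Euclidean, which is an assumed choice of distance criterion.
    """
    ranked = sorted(database, key=lambda item: math.dist(query, item[1]))
    return [label for label, _ in ranked[:k]]

# Hypothetical stored encodings of free-form text comments.
database = [
    ("comment: steady load", [0.0, 0.1]),
    ("comment: sudden spike", [0.9, 1.0]),
    ("comment: slow drift", [0.5, 0.4]),
]

# Hypothetical encoding of a testing time series placed in the same feature space.
query_vec = [1.0, 1.0]
results = k_nearest(query_vec, database, k=2)
# results == ["comment: sudden spike", "comment: slow drift"]
```

Because both modalities are embedded in one feature space, the same lookup would serve a text query against stored time series encodings; only the contents of the stored pairs would change.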

As per claim 20, a computer program product for cross-modal data retrieval, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:
jointly training a neural network having a time series encoder and text encoder based on a triplet loss, the triplet loss relating to two different modalities of (i) time series and (ii) free-form text comments, which respectively correspond to a training set of time series and a training set of free-form text comments; (rejected based on rationale used in rejection of claim 1)
storing, in a database, the training sets with feature vectors extracted from encodings of the training sets, the encodings obtained by encoding the time series in the training set of time series using the time series encoder and encoding the free-form text comments in the training set of free-form text comments using the text encoder; (rejected based on rationale used in rejection of claim 1)
retrieving the feature vectors corresponding to at least one of the two different modalities from the database for insertion into a feature space together with at least one feature vector corresponding to a testing input relating to at least one of a testing time series and a testing free-form text comment; (rejected based on rationale used in rejection of claim 1) and 
determining, by a hardware processor of the computer, a set of nearest neighbors from among the feature vectors in the feature space based on distance criteria, and outputting testing results for the testing input based on the set of nearest neighbors; (rejected based on rationale used in rejection of claim 1).

Conclusion
This action is made Non-Final.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See form 892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHAMED I NAZAR whose telephone number is (571)270-3174. The examiner can normally be reached 10 am to 7 pm Mon-Fri. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Hong can be reached on 571-272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.





/AHAMED I NAZAR/Examiner, Art Unit 2178    10/11/2021

/SHAHID K KHAN/Examiner, Art Unit 2178