DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement filed July 6, 2022 fails to comply with 37 CFR 1.98(a)(3)(i) because it does not include a concise explanation of the relevance, as it is presently understood by the individual designated in 37 CFR 1.56(c) most knowledgeable about the content of the information, of each reference listed that is not in the English language.  The IDS also fails to comply with 37 CFR 1.98(a)(3)(ii) because no copy of an English translation was provided for each reference listed that is not in the English language.  It has been placed in the application file, but the information referred to therein has not been considered.  See MPEP 609.01(B)(3), sections (a) and (b), for more detail.
Specifically, the IDS filed July 6, 2022 includes a listing for Non-Patent Literature document: “Japanese Office Action for Japanese Application No. 2021-002043, dated June 28, 2022, 5 pages”.  No explanation of relevance appears to have been provided for this document.  No English translation appears to have been provided for this document.  The document has been placed in the application file, but the information referred to therein has not been considered.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

With regard to claims 1, 9, and 17, claim 1 recites the limitation “the unified space feature of the query text and the unified space feature of the candidate video”.  Claims 9 and 17 appear to recite substantially similar claim limitations and are rejected based upon the same rationale.  This claim limitation lacks antecedent basis.  Each unique claim element is expected to refer to a unique claim label.  Within this claim limitation, applicant appears to define two distinct claim elements using the same claim label, which renders the meaning of the claim unclear.  For examination purposes this claim limitation has been construed to mean -- the text unified space feature of the query text and the video unified space feature of the candidate video--.

With regard to claims 2-7 and 10-15, claim 2 recites the limitation "wherein the determining, according to a query text and a candidate video, a unified space feature of the query text and a unified space feature of the candidate video based on a conversion relationship between a text semantic space and a video semantic space comprises:".  Claim 2 depends from claim 1, which recites “determining, according to a query text and a candidate video, a unified space feature of the query text and a unified space feature of the candidate video based on a conversion relationship between a text semantic space and a video semantic space”.
There is insufficient antecedent basis for the claimed query text, candidate video, either unified space feature, conversion relationship, text semantic space and video semantic space.  It is unclear if applicant is attempting to define new claim elements or refer to the previously recited claim elements.  When referring to previously defined claim elements, the claim should refer to “the” element, not define “a” new element.  It is suggested that the claims be amended to clearly refer to previously existing elements.  Claims 3-8, and 10-15 appear to suffer from a similar issue, albeit referencing distinct claim limitations.  For examination purposes each of these claims has been interpreted as referring to the previously defined claim elements instead of defining new claim elements.

With regard to claims 2, 8, 10 and 16, claim 2 recites “the unified space features”.  This claim limitation lacks antecedent basis.  No such element has been defined in the claims.  Each unique claim element is expected to have a unique claim label.  For examination purposes this claim limitation has been construed as --the unified text space feature and unified video space feature--.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 9-10, and 17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Peking [CN104166684].  Note that the citations and quotes are made to the Machine Translation provided. 

With regard to claim 1 Peking teaches A method for retrieving (Peking, Page 13, “In view of the deficiencies of the prior art, the present invention proposes a cross-media retrieval method …”) a video as the retrieval data including video media (Peking, Page 14, “Further, in the above-mentioned cross-media retrieval method based on unified sparse representation, the multiple media types in step (1) are five media times, including text, image, video, audio and 3D… the cross-media unified retrieval in step (5) refers to submitting any media type as a query, and the retrieval result includes all media types data in the text set”), comprising: 
determining, according to a query text as the query may be text media (Id) and a candidate video as the result may be video media (Id), a unified space feature of the query text as the feature map matrix of uniform sparse representation for text (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) and a unified space feature of the candidate video as the feature map matrix of uniform sparse representation for video (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) based on a conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between a text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and a video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data), and determining a similarity (Peking, Page 14 “the cross-media similarity calculation method in step (4) takes the probability that two pieces of media data belong to the same category as their similarity”) between the query text as a first piece of media data (Id) and the candidate video as the second piece of media data (Id) according to the unified space feature of the query text and the unified space feature of the candidate video as the same category (Id); and
selecting a target video from the candidate video according to the similarity as sorting the result according to the similarity (Peking, Page 14, “the retrieval results includes all media type data in the text set; the After calculating the similarity, the steps are sorted according to the similarity to output the final cross-media retrieval result”), and using the target video as a query result as the retrieval result (Id).  
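
For illustration only, the claim 1 mapping above (per-media mapping matrices into a shared c-dimensional space, a similarity between the mapped features, and a similarity-sorted result) can be sketched as follows.  This is a minimal sketch under stated assumptions and is not drawn from Peking or from the application: the function names, the example matrices, and the use of cosine similarity are all hypothetical stand-ins.  In particular, Peking is cited as taking the probability that two pieces of media data belong to the same category as the similarity, which this sketch does not implement.

```python
import math

def project(features, mapping):
    """Map a d-dimensional feature vector into c dimensions using a d x c matrix."""
    d, c = len(mapping), len(mapping[0])
    if len(features) != d:
        raise ValueError("feature length must match the number of matrix rows")
    return [sum(features[i] * mapping[i][j] for i in range(d)) for j in range(c)]

def cosine(a, b):
    """Cosine similarity; a hypothetical stand-in for the cited similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_videos(text_features, videos, p_text, p_video):
    """Project the query text and each candidate video, then sort by similarity."""
    query = project(text_features, p_text)
    scored = [(name, cosine(query, project(feats, p_video)))
              for name, feats in videos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical example values: a 3-dimensional text feature, 2-dimensional
# video features, and a shared 2-dimensional unified space.
P_TEXT = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
P_VIDEO = [[1.0, 0.0], [0.0, 1.0]]
CANDIDATES = {"video_a": [0.9, 0.1], "video_b": [0.1, 0.9]}

ranking = rank_videos([1.0, 0.0, 0.0], CANDIDATES, P_TEXT, P_VIDEO)
target_video = ranking[0][0]  # highest-similarity candidate is used as the query result
```

With these hypothetical values the text query projects to the first axis of the unified space, so the candidate whose projected feature lies closest to that axis is ranked first.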

With regard to claims 2 and 10 Peking further teaches wherein the determining, according to a query text as the query may be text media (Id) and a candidate video as the result may be video media (Id), a unified space feature of the query text as the feature map matrix of uniform sparse representation for text (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) and a unified space feature of the candidate video as the feature map matrix of uniform sparse representation for video (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) based on a conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between a text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and a video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) comprises: 
determining a text space feature of the query text as extracting the elements that form the vector (Peking, Page 13, “(1) establish a cross-media database comprising multiple media types, … extract the feature vector of each media type data”) based on the text semantic space as the original feature vector for the text media type (Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space”; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data); 
determining a video space feature of the candidate video as extracting the elements that form the vector (Peking, Page 13, “(1) establish a cross-media database comprising multiple media types, … extract the feature vector of each media type data”) based on the video semantic space as the original feature vector for the video media type (Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space”; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data); and 
performing a space unification (Peking, Page 14, “unified modeling … through the unified modeling of data of multiple media types”) on the text space feature and the video space feature based on the conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between the text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) to obtain the unified space features (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”).  

With regard to claim 9 Peking teaches An electronic device, comprising: 
at least one processor as is inherently required to enable the execution of the disclosed cross-media retrieval method on the Internet (Peking, Page 13 “the present invention proposes a cross-media retrieval method based on a unified sparse representation, which can fully consider the correlation between multiple media types and learn the sparse feature representations of multiple media types at the same time thereby effectively the noise in the feature representation is filtered, and different media data can be corrected each other, which further improves the effectiveness of the unified feature representation and improves the accuracy of cross-media retrieval”) wherein the method is envisioned as improving a computerized search engine such as Google and Baidu (Peking, Page 13 “With the advent of the era of big data, multimedia data on the Internes has grown rapidly, including various media data such as text, images, video, and audio.  However, existing search engines such as Google and Baidu still rely on keyword-based retrieval”); and 
a storage device, communicatively connected with the at least one processor, wherein the storage device stores an instruction executable by the at least one processor as is inherently required to enable the execution of the disclosed cross-media retrieval method on the Internet (Id), and 
the instruction is executed by the at least one processor, to cause the at least one processor to perform operations as is inherently required to enable the execution of the disclosed cross-media retrieval method on the Internet (Id), the operations comprising:
determining, according to a query text as the query may be text media (Id) and a candidate video as the result may be video media (Id), a unified space feature of the query text as the feature map matrix of uniform sparse representation for text (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) and a unified space feature of the candidate video as the feature map matrix of uniform sparse representation for video (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) based on a conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between a text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and a video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data), and determining a similarity (Peking, Page 14 “the cross-media similarity calculation method in step (4) takes the probability that two pieces of media data belong to the same category as their similarity”) between the query text as a first piece of media data (Id) and the candidate video as the second piece of media data (Id) according to the unified space feature of the query text and the unified space feature of the candidate video as the same category (Id); and 
selecting a target video from the candidate video according to the similarity as sorting the result according to the similarity (Peking, Page 14, “the retrieval results includes all media type data in the text set; the After calculating the similarity, the steps are sorted according to the similarity to output the final cross-media retrieval result”), and using the target video as a query result as the retrieval result (Id).  

With regard to claim 17 Peking teaches A non-transitory computer readable storage medium, storing a computer instruction, wherein the computer instruction is used to cause a computer to perform operations as is inherently required to enable the execution of the disclosed cross-media retrieval method on the Internet (Peking, Page 13 “the present invention proposes a cross-media retrieval method based on a unified sparse representation, which can fully consider the correlation between multiple media types and learn the sparse feature representations of multiple media types at the same time thereby effectively the noise in the feature representation is filtered, and different media data can be corrected each other, which further improves the effectiveness of the unified feature representation and improves the accuracy of cross-media retrieval”) wherein the method is envisioned as improving a computerized search engine such as Google and Baidu (Peking, Page 13 “With the advent of the era of big data, multimedia data on the Internes has grown rapidly, including various media data such as text, images, video, and audio.  However, existing search engines such as Google and Baidu still rely on keyword-based retrieval”), the operations comprising:
determining, according to a query text as the query may be text media (Id) and a candidate video as the result may be video media (Id), a unified space feature of the query text as the feature map matrix of uniform sparse representation for text (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) and a unified space feature of the candidate video as the feature map matrix of uniform sparse representation for video (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) based on a conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between a text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and a video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data), and determining a similarity (Peking, Page 14 “the cross-media similarity calculation method in step (4) takes the probability that two pieces of media data belong to the same category as their similarity”) between the query text as a first piece of media data (Id) and the candidate video as the second piece of media data (Id) according to the unified space feature of the query text and the unified space feature of the candidate video as the same category (Id); and 
selecting a target video from the candidate video according to the similarity as sorting the result according to the similarity (Peking, Page 14, “the retrieval results includes all media type data in the text set; the After calculating the similarity, the steps are sorted according to the similarity to output the final cross-media retrieval result”), and using the target video as a query result as the retrieval result (Id).  

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3-8, and 11-15 are rejected under 35 U.S.C. 103 as being unpatentable over Peking in view of Liu [Computer Vision and Image Understanding].

With regard to claims 3 and 11, Peking further teaches wherein the determining a video space feature of the candidate video as extracting the elements that form the vector (Peking, Page 13, “(1) establish a cross-media database comprising multiple media types, … extract the feature vector of each media type data”) based on the video semantic space as the original feature vector for the video media type (Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space”; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) comprises: …
Peking does not explicitly teach determining a target feature of a target entity in a candidate video frame; determining a dense feature of the candidate video according to appearance information of the target entity and the target feature; and combining at least one of position information of the target entity in the candidate video frame, an area of the target entity or an occurrence order of the candidate video frame, and the dense feature, to obtain the video space feature of the candidate video.  
Liu teaches determining a target feature (Liu, Page 61, Section 3.2.2 “In this paper, four types of image features, i.e. geometric, shape, texture, and color, will be used to represent the weibo image after image processing”) of a target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) in a candidate video frame as the image (Id); 
determining a dense feature of the candidate video (Liu, Page 61, Section 3.2.2 “The density can be obtained by divided the weibo image area with the square of ROI perimeter”) according to appearance information of the target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) and the target feature (Liu, Page 61, Section 3.2.2 “In this paper, four types of image features, i.e. geometric, shape, texture, and color, will be used to represent the weibo image after image processing”); and 
combining at least one of position information of the target entity in the candidate video frame (Liu, Page 61, Section 3.2.2 “Our model extracts geometric features of Region of Interest (ROI) in the weibo image, including perimeter, area, circularity, rectangularity, density, slenderness, and centroid”), an area of the target entity as an area (Id) or an occurrence order of the candidate video frame, and the dense feature as the density (Id), to obtain the video space feature of the candidate video as the image feature (Liu, Page 61, Section 3.2.2 “In this paper, four types of image features, i.e. geometric, shape, texture and color, will be used to represent the weibo image after image processing”).  
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the feature vector extraction for images/video type media taught by Peking, using the Visual feature extraction techniques taught by Liu as it yields the predictable results of extracting visual features from visual media.  Both devices are extracting the features, explicitly to be used for feature space mapping.  One of ordinary skill in the art would reasonably expect the features extracted by the techniques taught by Liu to be readily usable within the feature space mapping techniques taught and used by Peking.  The proposed combination qualifies as a simple substitution of one known feature extraction technique for another feature extraction technique to obtain the predictable results of generating a feature vector for the media.
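
For illustration only, the geometric ROI features cited from Liu (perimeter, area, density, centroid) can be computed along the following lines.  This is a minimal sketch under stated assumptions and is not taken from Liu: the ROI is assumed to be a simple polygon given as a list of (x, y) vertices, the function name roi_features is hypothetical, and density is computed here as the ROI area divided by the square of the ROI perimeter, which is one reading of the quoted passage.

```python
import math

def roi_features(vertices):
    """Return perimeter, area, density and centroid of a simple polygon ROI."""
    n = len(vertices)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    signed_area2 = 0.0   # twice the signed area (shoelace formula)
    cx_sum = cy_sum = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        signed_area2 += cross
        cx_sum += (x0 + x1) * cross
        cy_sum += (y0 + y1) * cross
        perimeter += math.dist((x0, y0), (x1, y1))
    if signed_area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    area = abs(signed_area2) / 2.0
    centroid = (cx_sum / (3.0 * signed_area2), cy_sum / (3.0 * signed_area2))
    return {
        "perimeter": perimeter,
        "area": area,
        "density": area / perimeter ** 2,  # assumed reading: area over perimeter squared
        "centroid": centroid,
    }

def combine(features):
    """Concatenate position (centroid), area and density into one feature vector."""
    cx, cy = features["centroid"]
    return [cx, cy, features["area"], features["density"]]

# Hypothetical ROI: a unit square.
unit_square = roi_features([(0, 0), (1, 0), (1, 1), (0, 1)])
feature_vector = combine(unit_square)
```

The combine helper only shows how position, area, and density values might be concatenated into a single vector; how such a vector would feed the mapping of Peking is outside the scope of the sketch.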

With regard to claims 4 and 12, the proposed combination further teaches wherein the determining a target feature (Liu, Page 61, Section 3.2.2 “In this paper, four types of image features, i.e. geometric, shape, texture, and color, will be used to represent the weibo image after image processing”) of a target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) in a candidate video frame as the image (Id) comprises: 
determining candidate features of the target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) in the candidate video frame as the image (Id); 
clustering as determining semantic correlations between image and text features (Liu, Page 63, line “The purpose of the genetic algorithm is to optimize the mapping matrix to improve the accuracy and efficiency of semantic correlation recognition… rn denotes whether the nth image-text weibo instance holds semantic correlation between image and text or not”) the determined candidate features as the features of the object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) to associate the determined candidate features with the target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”); and 
determining the target feature as determining the relevant similar features (Peking, Page 14 “then obtain a unified cross-media retrieval according to the similarity.  As a result, the retrieval result contains all relevant media type data”) wherein the similarity may be calculated using the fitness mapping taught by Liu (Liu, Page 61, Section 3.2.2 “In this paper, four types of image features, i.e. geometric, shape, texture, and color, will be used to represent the weibo image after image processing”; Page 63, “The fitness shows the metrics of individual, denoting weight the solution should be to evolve or not.  The purpose of the genetic algorithm is to optimize the mapping matrix to improve the accuracy and efficiency of semantic correlation recognition in this paper”) of the target entity as an object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) from the candidate features associated with the target entity as the features of the object (Liu, Page 61, Section 3.2.2 “The shape of an object is the importance characteristic to distinguish the object and the image shape features also play a very fundamental and important role in object recognition”) based on confidence levels of the candidate features as the fitness of the feature vectors (Liu, Page 63, “The fitness function determines the direction of convergence in the genetic algorithm”; See Formula 16).  
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the device taught by Peking using the feature space mapping taught by Liu as it yields the predictable results of mapping heterogeneous features into a single unified space (Liu, Page 62, Section 3.3 “The three types of features are represented as the feature vectors in different vector spaces, and they hold the natural heterogeneity and cannot be compared with each other directly.  Therefore, we need to select one feature space as the unified one, and then map the three types of features into the unified feature space”).  Note that the genetic algorithm, of which the mapped Formula 16 is a part, is applied in mapping matrix optimization (Liu, Page 62, Section 3.3.1 “As mentioned above, the feature space mapping problem is regarded as the one to obtain the optimal mapping matrices, i.e. MT and MS.  In this paper, we applied the generic algorithm in mapping matrix optimization”).

With regard to claims 5 and 13, Peking further teaches wherein the performing a space unification (Peking, Page 14, “unified modeling … through the unified modeling of data of multiple media types”) on the text space feature and the video space feature based on the conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between the text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) to obtain the unified space features (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”) comprises: 
… based on the conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between the text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data); and/or 
… based on the conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between the text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data).  
Peking uses the conversion relationship to determine a unified space feature across media (Peking, Page 15 “a feature map matrix of uniform sparse representation across media is learned for each media type”).  Peking does not explicitly teach projecting the text space feature to the video semantic space or projecting the video space feature to the text semantic space.
Liu teaches projecting the text space feature to the video semantic space … projecting the video space feature to the text semantic space as the system can project any of the feature spaces into the other (Liu, Page 62, Section 3.3 “The three types of features are represented as the feature vectors in different vector spaces, and they hold the natural heterogeneity and cannot be compared with each other directly.  Therefore, we need to select one feature space as the unified one, and then map the three types of features into the unified feature space”; Section 3.3.1 “In this paper, the visual feature space has been regarded as the unified one and the other two types of features, i.e. textural-linguistic and social features, will be mapped in to the visual feature space”).
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains at the time the invention was filed to have implemented the device taught by Peking using the feature space mapping taught by Liu as it yields the predictable results of mapping heterogeneous features into a single unified space (Liu, Page 62, Section 3.3 “The three types of features are represented as the feature vectors in different vector spaces, and they hold the natural heterogeneity and cannot be compared with each other directly.  Therefore, we need to select one feature space as the unified one, and then map the three types of features into the unified feature space”).
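For illustration only, the mapping relied upon above (a per-media-type matrix that takes an original d(i)-dimensional feature vector to a c-dimensional unified space) can be sketched as follows.  This is a minimal sketch and is not taken from Peking or Liu: the function name, the matrix values, and the feature values are hypothetical, and the matrices are assumed to be already learned.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # d rows by c columns


def project_to_unified_space(feature: Vector, mapping: Matrix) -> Vector:
    """Map a d-dimensional feature vector into c dimensions with a d x c matrix."""
    if len(feature) != len(mapping):
        raise ValueError("feature length must equal the number of matrix rows")
    c = len(mapping[0])
    return [
        sum(feature[i] * mapping[i][j] for i in range(len(feature)))
        for j in range(c)
    ]


# Hypothetical values: a 3-dimensional text feature and a 2-dimensional
# video feature, both mapped into the same 2-dimensional unified space.
text_feature = [1.0, 2.0, 3.0]
text_mapping = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
video_feature = [2.0, 4.0]
video_mapping = [[0.5, 0.0], [0.0, 0.5]]              # 2 x 2

text_unified = project_to_unified_space(text_feature, text_mapping)
video_unified = project_to_unified_space(video_feature, video_mapping)
```

Once both features lie in the same space they can be compared directly, which is the result the combination is relied upon to show.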

With regard to claims 6 and 14, the proposed combination further teaches wherein the projecting the text space feature to the video semantic space as the system can project any of the feature spaces into the other (Liu, Page 62, Section 3.3 “The three types of features are represented as the feature vectors in different vector spaces, and they hold the natural heterogeneity and cannot be compared with each other directly.  Therefore, we need to select one feature space as the unified one, and then map the three types of features into the unified feature space”; Section 3.3.1 “In this paper, the visual feature space has been regarded as the unified one and the other two types of features, i.e. textural-linguistic and social features, will be mapped in to the visual feature space”) based on the conversion relationship (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”) between the text semantic space as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video semantic space as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) comprises: 
calculating a semantic distribution as the semantic correlation recognition (Liu, Page 63, Section 3.3.2 “The Wn represents the weight of the nth image-text weibo instance in the semantic correlation recognition and can be calculated by the following formula”) of a query word in the query text as the text feature (Id; See Formula 16, Tun-In) under the video semantic space as the video feature (Id; See Formula 16, Tun-In) based on the conversion relationship between the text semantic space and the video semantic space as the mapping between the features (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”; Liu, Page 63 “Tum and In mean the corresponding textual-linguistic feature vector in the unified feature space after the feature space mapping and the visual feature extracted from the nth image-text weibo instance”) and according to the text space feature as the original feature vector for the text media type (Peking; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video space feature as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data).  

With regard to claims 7 and 15, the proposed combination further teaches wherein the calculating a semantic distribution as the semantic correlation recognition (Liu, Page 63, Section 3.3.2 “The Wn represents the weight of the nth image-text weibo instance in the semantic correlation recognition and can be calculated by the following formula”) of a query word in the query text as the text feature (Id; See Formula 16, Tun-In) under the video semantic space as the video feature (Id; See Formula 16, Tun-In) based on the conversion relationship between the text semantic space and the video semantic space as the mapping between the features (Peking, Page 15, “where P(1),…,P(s) is the mapping matrix of all s media types in the cross-media database, where the superscript (i) represents the mapping matrix of the ith media type, and the dimension of the matrix is d(1)xc, the original feature vector can be mapped from the d(i)-dimensional space to a unified c-dimensional unified space.”; Liu, Page 63 “Tum and In mean the corresponding textual-linguistic feature vector in the unified feature space after the feature space mapping and the visual feature extracted from the nth image-text weibo instance”) and according to the text space feature as the original feature vector for the text media type (Peking; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) and the video space feature as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) comprises: 
using the text space feature as the original feature vector for the text media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) as an input feature as the query example (Peking, Page 14, “Use each data in the test set as a query example, and the entire test set as a query target to the query”), using the video space feature as the original feature vector for the video media type (Id; Page 14 defines use of Latent Direkley distribution feature vector for text data, and word bag feature vector for video data) as an output feature as the test set (Peking, Page 14, “Use each data in the test set as a query example, and the entire test set as a query target to the query”), and inputting the input feature and the output feature into a pre-trained converter model (Liu, Page 61, Section 3.1 (4) “The three types of features from the training image-text weibo dataset will be fed into the SVM based semantic correlation recognition model to train the SVM classifier, and the three types of features… will be used to predict the class of semantic correlation… the recognition result will be obtained”), to output the semantic distribution as the semantic correlation recognition (Id) of the query word in the query text as the input query example text feature (Peking, Page 14; Liu Page 61) under the video semantic space as the test set including the video features (Peking, Page 14; Liu Page 61).  

With regard to claims 8 and 16, Peking further teaches wherein the determining a similarity (Peking, Page 14 “the cross-media similarity calculation method in step (4) takes the probability that two pieces of media data belong to the same category as their similarity”) between the query text as a first piece of media data (Id) and the candidate video as the second piece of media data (Id) according to the unified space feature of the query text and the unified space feature of the candidate video as the same category (Id) comprises: 
calculating word similarities (Peking, Page 14, “calculate the similarity between the query sample and the media data in the query target set, and then obtain a unified cross-media retrieval according to the similarity as a result”) between query words in the query text as the query sample (Id) and the candidate video as the media data (Id) based on the unified space features as the same category (Peking, Page 16 “The similarity calculation between any two media data is: … Here Oip is the unified sparse feature representation of the data p of the ith medium.”); 
Peking does not explicitly teach determining, according to degrees of importance of the query words in a retrieval input text, weights of the words; and performing a weighted summation on the word similarities according to the determined weights to obtain the similarity between the query text and the candidate video.  
Liu teaches determining, according to degrees of importance as the positive and negative instances (Liu, Page 63 Section 3.3.2 “where Np and Nn are the numbers of positive and negative instances in the image-text weibo training dataset separately”) of the query words in a retrieval input text as the text (Id), weights of the words (Liu, Page 63 Section 3.3.2 “The Wn represents the weight of the nth image-text weibo instance in the semantic correlation recognition and can be calculated by the following formula (17)”); and
performing a weighted summation (Liu, See Formula 16, summation symbol) on the word similarities according to the determined weights (Liu, See Formula 16, Wn) to obtain the similarity between the query text and the candidate video (Liu, See Formula 16, Tum-In; “Tun and In mean the corresponding textual-linguistic feature vector in the unified feature space after the feature space mapping and the visual feature vectors”).  
It would have been obvious to one of ordinary skill in the art to which said subject matter pertains at the time the invention was filed to have implemented the device taught by Peking using the feature space mapping taught by Liu as it yields the predictable results of mapping heterogeneous features into a single unified space (Liu, Page 62, Section 3.3 “The three types of features are represented as the feature vectors in different vector spaces, and they hold the natural heterogeneity and cannot be compared with each other directly.  Therefore, we need to select one feature space as the unified one, and then map the three types of features into the unified feature space”).  Note that the genetic algorithm, of which the mapped Formula 16 is a part, is applied in mapping matrix optimization (Liu, Page 62, Section 3.3.1 “As mentioned above, the feature space mapping problem is regarded as the one to obtain the optimal mapping matrices, i.e. MT and MS.  In this paper, we applied the generic algorithm in mapping matrix optimization”).
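For illustration only, the claimed weighted summation of word similarities can be sketched as follows.  This is a minimal sketch of the claim language as mapped above, not a reproduction of Liu's Formula 16 or Formula 17: the function name, the similarity values, and the weights are hypothetical.

```python
from typing import Sequence


def weighted_similarity(word_sims: Sequence[float], weights: Sequence[float]) -> float:
    """Sum per-word similarities, each scaled by the weight of its query word."""
    if len(word_sims) != len(weights):
        raise ValueError("one weight is required for each word similarity")
    return sum(w * s for w, s in zip(weights, word_sims))


# Hypothetical values: two query words, each with a similarity to the
# candidate video and a weight reflecting its degree of importance.
word_sims = [0.5, 0.25]
weights = [2.0, 4.0]

overall_similarity = weighted_similarity(word_sims, weights)
```

A word with a larger weight contributes proportionally more to the overall similarity between the query text and the candidate video.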

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Xu Changsheng [CN107480194] teaches data mining to build a machine learning model for use in multi-modal searches.
Wang [Facilitating Image Search With a Scalable and Compact Semantic Mapping] teaches an image searching method based on a compact semantic embedding.  The system explicitly maps concepts and images into a unified latent semantic space for representation of the semantic concepts, then a learner embedding matrix is learned that maps the images into the same space.  This enables the system to perform cross-modality image searches of dynamic image repositories.
Please note that one of ordinary skill in the art at the time the invention was filed would recognize that techniques used to analyze images may, generally, be used to analyze frames within a video.  One of ordinary skill in the art would recognize that the use of such image analysis techniques would generally be expected to enable a system to perform similar analytics on video files.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMANDA WILLIS whose telephone number is (571)270-7691. The examiner can normally be reached Monday-Friday 8am-2pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tamara Kyle, can be reached on 571-272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AMANDA L WILLIS/Primary Examiner, Art Unit 2156