DETAILED ACTION
This action is in response to communications filed on 03/08/2021, in which claims 1, 5-7, 9-12, 14-18, 23, and 25 are amended; claims 8 and 19-22 are cancelled; and claims 1-7, 9-18, and 23-25 remain pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed US Provisional Application 62/254,143, filed November 11, 2015, is acknowledged.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/29/2020 is being considered by the examiner.
Drawings
The drawings were received on 08/26/2019. These drawings are acceptable.
Response to Arguments
	Applicant’s remarks and arguments filed 03/08/2021, pgs. 8-19 have been fully considered.

Applicant’s arguments regarding the rejection under 35 USC § 112(b), see pgs. 8-11 of remarks have been fully considered. 
Regarding the applicant’s argument directed to the element recited in amended claims 1 and 12, “…using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques”, the examiner respectfully disagrees. The recitation is unclear and renders the claim indefinite. Specifically, the claims recite the use of a consensus based on a majority agreement without providing the metric/standard for making the claimed determination. The specification provides no standard or process for determining the majority agreement as recited by applicant’s amended claim limitation; thus the claim language is unclear and renders the claim indefinite, see full rejection below.
Regarding the rejection of claim 2 under 35 USC § 112(b), the problematic language has been removed and the rejection made in the previous office action has been withdrawn.
Regarding the rejection of claim 24 under 35 USC § 112(d), the problematic language has been revised; thus the rejection made in the previous office action has been withdrawn.

Applicant’s arguments, see pgs. 11-19 of the filed response, with respect to the 35 USC § 103 rejection of the claims have been fully considered. Below are the examiner’s remarks.
Applicant argues the limitations directed to “determining a correction using a consensus reached by majority agreement between the plurality of text features from different extraction techniques” and removing at least one said text feature of the plurality of the text features from the training data using a consensus reached by majority agreement.
Examiner is not persuaded by applicant’s arguments and notes that claim language is given its broadest reasonable interpretation (BRI) in light of the specification, see MPEP 2111. In this case the amended limitations are not supported by the applicant’s original disclosure. Specifically, the specification and claims fail to disclose “consensus reached by majority agreement of text features from the plurality of different extraction techniques”; instead, the previous claim language refers to a consensus by partial agreement, and the specification is silent regarding how the consensus is reached as recited by the amended claim limitations, see the 112(a) rejection below. Thus, the original specification and disclosure do not adequately describe the recited amended claim elements in claim 1. The amended claim limitation, therefore, recites additional elements not disclosed as part of the applicant’s original disclosure. Thus, the rejection has been maintained. Examiner interprets the use of multiple extraction techniques for producing extracted features as within the scope of the amended claim limitations.
The examiner notes that the term “consensus” has been given the proper BRI in light of the applicant’s specification at paragraph [0052]: … Additionally, in some instances a plurality of tuple extraction techniques may be applied to the same image caption and consensus used among the techniques to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples… A similar technique may be employed in which a tuple extraction technique is used to perform tuple extraction jointly on a set of captions for the same image and consensus used to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples…, and in light of the original claim 6 limitation, which, as argued, recited “wherein the extracting includes: extracting a plurality of the text features of the structured semantic knowledge using of a plurality of different tuple extraction techniques; and based on at least partial consensus in the plurality of the text features of the structured semantic knowledge extracted using the plurality of different tuple extraction techniques to: determine whether to correct at least one of the text features of the structured semantic knowledge…”
The term “consensus” includes any information retrieval for detecting/making a correction and assigning a degree of confidence using a plurality of natural language processing techniques. Thus the recitations in the Jin and Lee references are within the scope of the claim limitation in light of this BRI.
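For illustration only, the following minimal Python sketch shows one hypothetical way that “consensus” among several tuple extraction techniques, as discussed in paragraph [0052], could be used to remove tuples and assign confidences. It is not drawn from the specification, Jin, or Lee: the function name `consensus_over_extractions`, the strict-majority threshold, the use of an agreement fraction as the confidence, and the sample tuples are all assumptions.

```python
from collections import Counter

def consensus_over_extractions(extractions, threshold=0.5):
    """Vote over tuples produced by several extraction techniques.

    extractions: list of sets of (subject, predicate, object) tuples,
                 one set per technique (hypothetical input format).
    threshold:   fraction of techniques that must be exceeded for a tuple
                 to be kept (assumed strict majority by default).
    """
    n = len(extractions)
    # Count how many techniques produced each tuple.
    votes = Counter(t for tuples in extractions for t in set(tuples))
    # Assumed confidence: fraction of techniques that agree on the tuple.
    confidence = {t: count / n for t, count in votes.items()}
    kept = {t for t, conf in confidence.items() if conf > threshold}
    removed = set(confidence) - kept
    return kept, removed, confidence

# Hypothetical outputs of three techniques applied to the same caption.
outputs = [
    {("man", "riding", "horse"), ("horse", "in", "field")},
    {("man", "riding", "horse")},
    {("man", "riding", "house"), ("horse", "in", "field")},
]
kept, removed, confidence = consensus_over_extractions(outputs)
```

Under these assumptions, the tuple produced by only one of the three techniques falls below the threshold and is removed, while the two tuples produced by two techniques each are kept. A different threshold or confidence measure would give a different result.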
Jin teaches using a plurality of extraction techniques to generate text features and assigning a likelihood, as the claimed degree of confidence, using a plurality of machine learning models, as disclosed in pgs. 7-8 and 14, where the trained models are used to generate image captions. Jin does not expressly disclose removing features. Lee discloses the removal of extracted features using a set of rules, as the claimed determined consensus, for making modifications to the extracted text features, in pg. 1038. One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to use detection patterns for detecting and resolving conflicts in the annotation of images in a multi-annotator environment (Lee, Abstract).
Dependent claims 24 and 25, which depend on the claims above, are not allowable because the independent claims are rejected under 35 USC § 103 for the reasons discussed above.
See full rejection below.


Claim Rejections - 35 USC § 112- Description requirement and new matter situations 
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-7, 9-18, and 23-24 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Regarding claim 1, the claim recites the limitation “a correction to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques” (emphasis added), which is not supported by the applicant’s original specification. Specifically, the specification describes a process for correcting mistakes in tuples using a consensus among extraction techniques, as recited in the original specification filed 12/22/2015 at [0052]: … Additionally, in some instances a plurality of tuple extraction techniques may be applied to the same image caption and consensus used among the techniques to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples… A similar technique may be employed in which a tuple extraction technique is used to perform tuple extraction jointly on a set of captions for the same image and consensus used to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples…, and in the limitations of original claims 6 and 7, which, as argued, recited “wherein the extracting includes: extracting a plurality of the text features of the structured semantic knowledge using of a plurality of different tuple extraction techniques; and based on at least partial consensus in the plurality of the text features of the structured semantic knowledge extracted using the plurality of different tuple extraction techniques to: determine whether to correct at least one of the text features of the structured semantic knowledge;…”.
The original specification is silent regarding the amended claim limitation requiring “a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques”, and the limitations of original claims 6 and 7 are also silent regarding this amended limitation. Thus, the original specification and disclosure do not adequately describe the recited amended claim elements in claim 1. The amended claim limitation, therefore, recites additional elements not disclosed as part of the applicant’s original disclosure.

	
Regarding claim 23, the claim recites the limitation “assigning … a degree of confidence to the plurality of text features using a consensus reached by the plurality of text features resulting from the plurality of different extraction techniques;” (emphasis added), which is not supported by the applicant’s original specification. Specifically, the specification describes a process for correcting mistakes in tuples using a consensus among extraction techniques, as recited in the original specification filed 12/22/2015 at [0052]: … Additionally, in some instances a plurality of tuple extraction techniques may be applied to the same image caption and consensus used among the techniques to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples… A similar technique may be employed in which a tuple extraction technique is used to perform tuple extraction jointly on a set of captions for the same image and consensus used to correct mistakes in tuples, remove bad tuples, and identify high confidence tuples or assign confidences to tuples…, and in the original claim 6 limitation, which, as argued, recited “wherein the extracting includes: extracting a plurality of the text features of the structured semantic knowledge using of a plurality of different tuple extraction techniques; and based on at least partial consensus in the plurality of the text features of the structured semantic knowledge extracted using the plurality of different tuple extraction techniques to: determine whether to correct at least one of the text features of the structured semantic knowledge;…”. The original specification is silent regarding the amended claim limitation requiring “using a consensus reached by the plurality of text features resulting from the plurality of different extraction techniques”, and the limitations of original claims 6 and 7 are also silent regarding this amended limitation.
Thus, the original specification and disclosure do not adequately describe the recited amended claim elements in claim 23.
Regarding dependent claims 2-7 and 9-11 that depend on claim 1, claims 13-18 that depend on claim 12, and claims 24-25 that depend on claim 23, the claims do not resolve the deficiencies noted in the respective independent claims and are appropriately rejected.

Claim Rejections - 35 USC § 112 - Indefiniteness
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-7, 9-18, and 23-25 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant regards as the invention.
Regarding claim 1, the claim recites the limitation “…using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques” that renders the claim indefinite. Specifically, it is unclear how extracted features reach a majority consensus, or how a plurality of extraction techniques reach a majority consensus. What is the consensus determined from? Are two features automatically considered a consensus? Is the consensus based on the number of features extracted by each technique, or is a collection of extraction techniques used to output a set of features considered a consensus? The specification provides no standard or process for determining the majority agreement as recited by applicant’s amended claim limitation; thus the claim language is unclear and renders the claim indefinite. The examiner interprets that any set of information processing techniques used to produce the plurality of extracted features (e.g. using two or more techniques) has reached a consensus as required by applicant’s recited claim limitation.
Regarding claim 12, the claim recites limitations similar to the claim 1 limitation noted above, and is rejected under similar rationale.
Regarding claim 1, the claim recites the limitation “determining … a correction to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques” that renders the claim indefinite. Specifically, it is unclear how extraction techniques determine a correction. Are the extracted features considered corrections because they have been extracted? A step appears to be missing: the claim does not identify a metric or process by which techniques that are deemed feature extraction tools determine a correction. Typically an extraction tool extracts (i.e., retrieves) something; it is not clear how this process results in a corrective output. The examiner interprets any feature extracted to be a determined extracted feature, and an information processing technique is used to determine a correction on the extracted data.

Regarding claim 1, the claim recites the limitation “correcting, …, the at least one said text feature in the training data using the correction thereby generating corrected training data; and training, …, a model including the corrected at least one said text feature in the corrected training data” that renders the claim indefinite. Specifically, how does one correct a correction? It would appear that one can only correct an error, and how a correction is determined to be corrected is unclear from the claim limitation or applicant’s specification. Thus the claim limitation renders the claim indefinite. Examiner assumes that a modification to training data is within the scope of correcting using a correction as recited by the claim limitation.

Regarding claim 23, the claim recites the limitation “assigning … a degree of confidence to the plurality of text features using a consensus reached by the plurality of text features resulting from the plurality of different extraction techniques” that renders the claim indefinite. Specifically, it is unclear how a consensus is reached by features. Do the features have some ability to compare themselves? If so, this process/function would be needed to clarify how the outputs of a set of extraction techniques (i.e. the recited plurality of features) reach a consensus. What is the consensus determined from? Are two features automatically considered a consensus? Is the consensus based on the number of features extracted by each technique, or is a collection of extraction techniques used to output a set of features considered a consensus? The specification provides no standard or process for determining the consensus reached by a plurality of extracted features as recited by applicant’s amended claim limitation; thus the claim language is unclear and renders the claim indefinite. The examiner interprets that any set of information processing techniques used to produce the plurality of extracted features (e.g. using two or more techniques) has reached a consensus as required by applicant’s recited claim limitation.

Regarding dependent claims 2-7 and 9-11 that depend on claim 1, claims 13-18 that depend on claim 12, and claims 24-25 that depend on claim 23, the claims do not resolve the deficiencies noted in the respective independent claims and are appropriately rejected.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



Claims 1-7, 9-15, 17-18, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over Jin et al. (NPL: “Aligning where to see and what to tell: image caption with region-based attention and scene factorization”, hereinafter ‘Jin’) in view of Lee et al. (NPL: “The conflict detection and resolution in knowledge merging for image annotation”, hereinafter ‘Lee’) and in further view of Cheng et al. (US Pub. No. 2014/0328570, hereinafter ‘Che’).

Regarding independent claim 1 limitations, Jin teaches in a digital medium environment a method implemented by at least one computing device, the method comprising:
obtaining, by the at least one computing device, training data including images and associated text; (Jin teaches obtaining training data including images and associated text as all captions associated with the images in a training data corpus, in Pg. 7: Sec. Scene Specific context: …we use Latent Dirichlet Allocation(LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO…)
extracting, by the at least one computing device, a plurality of text features resulting from natural language processing of the associated text of the training data, the plurality of text features resulting, respectively, from a plurality of different extraction techniques, the plurality of text features corresponding to an object within a respective said image of the training data, (Jin teaches extracting a plurality of text features using natural language processing for producing image captions, in Pg. 2: 2nd full ¶: …we propose an image caption system that follows this modeling idea, and exploits the parallel structures between images and sentences...; where the plurality of techniques includes using scene-specific LSTM neural network models to extract the plurality of text features as words associated with image regions as the object associated with a respective image of the training data, in Pg. 3: Sec. Approach: …an LSTM-based neural network that models the attention dynamics of focusing on those regions as well as generating sequentially the words (section 3.2), a visual scene model that adjusts the LSTM to specific scenes (section 3.3)…; where the extraction techniques include latent alignment and corresponding words detection, in Pg. 5: Sec. Comparison to other systems using regions: While detecting objects in the image, [6, 8] focus on deriving the latent alignment between the detected regions and the words [claimed extracted plurality of text features] in the training sentences. Their purpose is to use the alignments to train a recurrent neural network generator of word sequences where the training data have become the aligned regions and corresponding words…; and Jin further discloses the latent process as a natural language learning technique for extracting a plurality of text features as words, including an Attention-based Multi-Model LSTM, in Pg. 5: Sec. 3.2: Attention-based Multi-Model LSTM Decoder: …We hypothesize there is a latent process {ht} of “abstract meaning”, governing the transitions from one concept to another. When this process is used to drive the generation of words, it yields a textual form of the abstract meanings encoded in the image...; and also a plurality of text features as extracted words used to compute a topic vector using a natural language processing technique that includes Latent Dirichlet Allocation (LDA), in Pg. 7: Sec. 3.3: …we use Latent Dirichlet Allocation(LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO [16] [claimed plurality of text features corresponding to an object within a respective said image of the training data]. For the second step, we train a multilayer perceptron to predict the topic vector, computed by LDA, from each image’s visual feature vector. Note that this predictive model allows to extract topic vectors for images without captions. We call the topic vectors as scene vectors. Details are in the Suppl. Material; and extracting using the plurality of scene-specific LSTM models, in Pg. 8: Sec. Adapt LSTMs to be scene-specific: The LSTMs (as described in section 3.2) encodes the language model how the words should be sequentially selected from the vocabulary... Specifically, given an image and its associated scene vector s, we use “personalized” LSTMs for that image to generate caption...)
determining, by the at least one computing device, a … to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques; (Jin teaches the process for determining at least one text feature in the training set images using extracted features using the recited extraction techniques including the plurality of LSTM and MLPs models, in pg. 7: Predict next word Given the updated abstract meaning ht, we predict the next word wt with an one-hidden-layer neural network with W softmax output units. Speciﬁcally, 
    [Equation image in source: media_image1.png]
 where fw parameterizes the neural network mapping function up to the softmax layer. Other details and comparison to other decoder methods: v0 is initialized as the averaged region features. The LSTMs’ c0(1), h0(1), c0(2), h0(2) are initialized using 4 independently trained MLPs [claimed consensus reached by the majority agreement of the different techniques initialized to obtain data as the claimed determined data] which take v0 as input and have 1 hidden layer with the same size as v0… Our goal is to obtain a scene vector for each image. For the purpose of using this vector for better captioning, this scene vector should be informative of textual descriptions [claimed determined correction as textual description scene vectors inferred by the set of models] and also needs to be inferable from visual appearance. We achieve these goals with two steps: unsupervised clustering of captions into “scene” categories and supervised learning of a classifier to predict the scene categories from the visual appearance…; where the determined text feature is reached by the consensus of personalized LSTMs, in Pg. 8: Sec. Adapt LSTMs to be scene-specific: The LSTMs (as described in section 3.2) encodes the language model how the words should be sequentially selected from the vocabulary. To inject scene vectors and thus adapt the sentence generation process to be scene-specific,... Specifically, given an image and its associated scene vector s, we use “personalized” LSTMs for that image to generate caption…; and training the classifiers and neural network models using the training data, in Pg. 14: Sec. C: …For the second step, we train a multilayer perceptron to predict the scene vector when presented with an image. The training samples for this classifier are the images from the same training dataset of MSCOCO with the target outputs being the LDA-inferred scene vectors [claimed generated corrected training data]. We use an MLP with two hidden layers with the sizes of 1024 and 512. ... We represent the training images with global feature vectors computed on the whole image. While it is possible to use any CNN trained on object recognition tasks, we use the CNN from the Places-205 CNN [30]. Places-205 CNN is based on AlexNet [11], but optimized under a 2.4 million database to predict the locations of the images. We use the computed features at the outputs of the last fully-connected layer...)
…, by the at least one computing device, the at least one said text feature in the training data using the … thereby generating corrected training data; and training, by the at least one computing device, a model including the corrected at least one said text feature in the corrected training data and the image features of the object as part of machine learning, the model once trained is configured to correlate the image features of the object within an input image with the plurality of text features. (Jin teaches selecting a set of text features based on the determining step for training a CNN model which, once trained, is configured to use the trained scene classifier to correlate the determined image features of the object in a categorized new input image that has no caption with the text features, in Pg. 14: Sec. C: …For the second step, we train a multilayer perceptron to predict the scene vector when presented with an image. The training samples for this classifier are the images from the same training dataset of MSCOCO [claimed training data] with the target outputs being the LDA-inferred [claimed machine learning to correlate image features with object with the input images] scene vectors [claimed training data including claimed corrected data]. We use an MLP with two hidden layers with the sizes of 1024 and 512. ... We represent the training images with global feature vectors computed on the whole image. While it is possible to use any CNN trained on object recognition tasks, we use the CNN from the Places-205 CNN [30]. Places-205 CNN is based on AlexNet [11], but optimized under a 2.4 million database to predict the locations of the images. We use the computed features at the outputs of the last fully-connected layer. Note that, representing images with global feature vectors and using the scene classifier provide an effective way to categorize test images where captions are not available (thus scene vectors cannot be inferred from LDA). Specifically, when generating captions for new images, scene vectors predicted from MLP are used.)
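For illustration only, the next-word prediction step quoted from Jin at pg. 7 (a one-hidden-layer neural network with W softmax output units) can be sketched in Python as follows. This is a sketch under stated assumptions and is not Jin’s equation, which appears only as an image in the source: the tanh activation, the weight matrices `W_hidden` and `W_out`, the function names, and the toy vocabulary are all hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_word(h_t, W_hidden, W_out, vocab):
    """One hidden layer (assumed tanh) followed by a softmax over the vocabulary."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_t))) for row in W_hidden]
    scores = [sum(w * x for w, x in zip(row, hidden)) for row in W_out]
    probs = softmax(scores)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs

# Hypothetical toy values: a 2-dimensional state and a 3-word vocabulary.
vocab = ["horse", "field", "car"]
h_t = [1.0, 0.0]
W_hidden = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
word, probs = predict_next_word(h_t, W_hidden, W_out, vocab)
```

With these toy weights the highest score belongs to the first vocabulary entry, so that word is selected.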

Jin does not expressly teach the comparison process for detecting differences for data modification as disclosed in the following claim limitations:
determining, by the at least one computing device, a correction to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques;
correcting, …, the at least one said text feature… using the correction
Lee expressly teaches the comparison process for detecting differences for data modification as disclosed in the following claim limitations:
determining, by the at least one computing device, a correction to at least one said text feature of the plurality of text features of the training data using a consensus reached by majority agreement between the plurality of text features from the plurality of different extraction techniques; (Lee teaches determining textual differences as detected data conflicts, in Pg. 1038: Sec. 4.3: … In Table 2, O1 Ξ O2 indicates that O1 and O2 are synonyms, O2 =- O1 indicates O1 and O2 are antonyms. And the possible resolution actions that the system may take to resolve the detected data conﬂicts are shown in Table 3....; and detecting a correction in Pg. 1038: Sec. 4.3: ...In order to detect the data conﬂicts, several simple detecting rules are shown in Table 2 [claimed detection based on consensus reached by the majority agreement of the claimed plurality of different extraction techniques used to generate the rules]. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conﬂict can be detected [claimed consensus for majority agreement to identify any potential conflict]…)
correcting, …, the at least one said text feature… using the correction (Lee teaches resolution actions for correcting determined textual differences as detected data conflicts, in Pg. 1038: Sec. 4.3: … In Table 2, O1 Ξ O2 indicates that O1 and O2 are synonyms, O2 =- O1 indicates O1 and O2 are antonyms. And the possible resolution actions that the system may take to resolve the detected data conflicts [claimed correcting using the correction as the detected conflict] are shown in Table 3. Basically, the actions that the system can take are: abandon the new one, add the new one, replace the old one with the new one, replace the old one with the modified new one, and update...

    [Table image in source: media_image2.png]

and in Pg. 1046: Sec. 5.4: …In order to prevent the mistake during conflict resolutions, we adopted two separate queues as Removing Queue and Adding Queue to keep the annotations to be removed and added respectively…)
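For illustration only, the following Python sketch shows how a rule check of a new annotation triple against previous triples, followed by queued removals and additions, might look. It is not Lee’s implementation: the contents of Lee’s Tables 2 and 3 are not reproduced in this action, so the two rules, the `SYNONYMS` and `ANTONYMS` sets, the chosen resolution actions, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical lexical relations; a real system would consult a thesaurus or ontology.
SYNONYMS = {frozenset({"big", "large"})}
ANTONYMS = {frozenset({"big", "small"})}

def detect_conflict(new, old):
    """Compare a new triple <S1,P1,O1> with a previous triple <S2,P2,O2>."""
    (s1, p1, o1), (s2, p2, o2) = new, old
    if (s1, p1) != (s2, p2):
        return "none"
    pair = frozenset({o1, o2})
    if o1 == o2 or pair in SYNONYMS:
        return "redundant"       # assumed action: abandon the new one
    if pair in ANTONYMS:
        return "contradiction"   # assumed action: replace the old one with the new one
    return "none"

def merge_annotation(new, previous):
    """Check the new triple against each previous triple, then apply the queues."""
    removing, adding = deque(), deque()
    for old in previous:
        kind = detect_conflict(new, old)
        if kind == "redundant":
            return list(previous)    # nothing is added or removed
        if kind == "contradiction":
            removing.append(old)
    adding.append(new)
    # Changes are applied only after every rule check has finished.
    merged = [a for a in previous if a not in removing]
    merged.extend(adding)
    return merged

previous = [("dog", "is", "big"), ("cat", "on", "mat")]
replaced = merge_annotation(("dog", "is", "small"), previous)
unchanged = merge_annotation(("dog", "is", "large"), previous)
```

Deferring the removals and additions to separate queues, as the quoted passage describes, keeps the set of previous annotations stable while the rules are being checked.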

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for correcting a training knowledge base using natural language techniques for processing extracted text features, as disclosed by Lee, with the method for developing information retrieval and processing using machine learning algorithms, as disclosed by Jin.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to use detection patterns for detecting and resolving conflicts in the annotation of images in a multi-annotator environment (Lee, Abstract). Doing so will improve the annotation accuracy in annotated data sets in information extraction and retrieval tasks (Lee, Abstract).
 	
While Jin and Lee teach the use of machine learning functions that are performed by a computing device, Jin and Lee do not expressly disclose the use of a computing device for processing computing instructions as recited by the claim limitation:
at least one computing device… and …the at least one computing device…
Che teaches the use of a computing device for processing computing instructions as recited by the claim limitation:
at least one computing device… and …the at least one computing device… (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…)
The Jin, Lee, and Che references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information retrieval and processing using machine learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the system for processing machine learning functions in a computing device that executes instructions by one or more processors, as disclosed by Che, with the method for developing information retrieval and processing using machine learning algorithms, as collectively disclosed by Jin and Lee.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to enable natural language descriptions of multimedia content using a computing system (Che, Abstract). Doing so would allow computer vision mathematical techniques to detect elements from images using machine learning algorithms (Che, [0004]).
                                                           
Regarding claim 2, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 1, 
further comprising generating a descriptive summarization of the object of the input image using the model.  (Jin teaches using the trained model as the adapted model that is configured to generate text features as words used in generating a caption, which is considered a descriptive summarization of the input image, as depicted in Fig. 5, by computing probabilities that the extracted features are associated with a scene category of the image object as a soft scene membership assignment, in pg. 8: Sec. “Adapt LSTMs to be scene-specific”: 2nd para.)

Regarding claim 3, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 1,
wherein the associated text is free form. (Jin teaches that the associated text is free form, depicted in Fig. 6 as free form text associated with the topic scenes, in pg. 10, Sec. Effect of scene factors on caption generations.)

Regarding claim 4, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 1,
wherein the associated text is a caption or metadata of the respective said image. (Jin teaches that the associated text is a caption associated with the respective said image scene and topic index metadata, depicted in Fig. 6 as free form text associated with the topic scenes, in pg. 10, Sec. Effect of scene factors on caption generations.)

Regarding claim 5, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 1,
wherein the plurality of text features are in a form of <subject, predicate, object>. (Jin teaches training machine learning models using structured semantic knowledge to extract topic vectors, included in the claimed plurality of text features, from the images to make predictions for images without captions given an image scene and its associated scene vector, including a subject “a baby”, predicate “is eating” and object “a slice of pizza” as depicted in Fig. 6, in pgs. 7-8; Sec. “Scene Specific context” & Sec. “Adapt LSTMs to be scene-specific”)
Additionally, Lee teaches the use of triplets to represent and process extracted text features in a form of <subject, predicate, object> as recited by the claim limitation:
wherein the plurality of text features are in a form of <subject, predicate, object> (Lee, in Pg. 1038: … A single annotation (triple) of an annotator conflicts with the existing knowledge pieces in the value (Object) slot of the triples… in Pg. 1038: Sec. 4.3: ...In order to detect the data conﬂicts, several simple detecting rules are shown in Table 2. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conﬂict can be detected…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin and Lee for the same reasons disclosed above.

	
Regarding claim 6, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 1, further comprising:
removing at least one of the plurality of the text features from use as part of the training. (Jin teaches removing the words of the text features  used as part of the topic index training based on the partial consensus of the topic predicted to correspond to the image text features as depicted in Fig. 6, in pg. 10; Sec. “Effect of scene factors on caption generation”)

Regarding claim 7, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che teaches the method as described in claim 1, further comprising: 
identifying a degree of confidence in the extracting. (Jin teaches identifying a log likelihood as a degree of confidence in extracting words associated with caption text features, based on a comparison of extracted words associated with the descriptive sentences of an image, in Pg. 14 Sec. D.1: …Objective function The objective function of our system is the log likelihood of all the captions given image. w_{0:t−1}^{(n)} represents the previous words before w_t^{(n)}. Note that w_0^{(n)} is a special token #BEGIN# inserted before every sentence. T_n is the length of captions n.
  
[Equation image: media_image3.png]
)
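For illustration only, the log likelihood quoted from Jin can be read as a sum, over captions and word positions, of the log probability of each word given the words that precede it. The sketch below assumes that reading. The function names and the per-word probabilities are hypothetical and are not taken from Jin.

```python
import math


def caption_log_likelihood(step_probs):
    """Sum of log p(w_t | previous words, image) over one caption.

    step_probs holds one probability per word position. The values
    are assumed inputs; how a model produces them is outside this sketch.
    """
    return sum(math.log(p) for p in step_probs)


def corpus_log_likelihood(captions):
    """Log likelihood of all captions: the sum over every caption n."""
    return sum(caption_log_likelihood(probs) for probs in captions)


# Two hypothetical captions with made-up per-word probabilities.
example = [[0.5, 0.5], [0.25]]
total = corpus_log_likelihood(example)
```

A value closer to zero indicates that the model assigns higher probability to the observed caption words, which is the sense in which the rejection treats the log likelihood as a degree of confidence.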
Additionally, Lee teaches the claim 7 limitation, where the comparison is part of a corrective process for the text features in the annotated image data:
identifying a degree of confidence in the extracting. (Lee teaches identifying a degree of confidence as the accuracy measure based on the use of the system for replacing annotations in dealing with conflicts, in Pg. 1047: Sec. 6.2: … Fig. 7 shows the average accuracy of annotations for the 10 images in each group, where the y-coordinate is the average accuracy of annotation that falls in the range [1, −1] for each image and x-coordinate is the image number. The accuracy of annotation is calculated using following formula: Accuracy = (# of correct annotations − # of incorrect annotation) / total # of annotations. Since we need to increase the chance of conflicts in the experiments, the scope of annotation must be restricted. Therefore, the average accuracies of control and experiment groups are very low, they are 0.1854 and 0.3092 respectively. As shown in Fig. 7, the performance of automatic data conflict resolution methods improves 12.38% from 18.54% of accuracy in comparison to a naive annotation system that simply replaces the old annotation by new ones in dealing with conflicts…)
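As a worked illustration of the accuracy measure quoted from Lee, the short sketch below computes (correct − incorrect) / total and checks the arithmetic behind the quoted group averages. The function name and the sample counts are hypothetical. Only the formula and the figures 0.1854 and 0.3092 come from the quoted passage.

```python
def annotation_accuracy(num_correct, num_incorrect, num_total):
    """Accuracy = (# correct - # incorrect) / total # of annotations."""
    if num_total <= 0:
        raise ValueError("total number of annotations must be positive")
    return (num_correct - num_incorrect) / num_total


# Hypothetical counts showing the two ends of the range.
all_correct = annotation_accuracy(10, 0, 10)
all_incorrect = annotation_accuracy(0, 10, 10)

# Difference between the quoted experiment and control averages.
improvement = round(0.3092 - 0.1854, 4)
```

The difference of 0.1238 between the two quoted averages corresponds to the 12.38% improvement stated in the passage, which is a difference in accuracy points rather than a relative change.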


Regarding claim 9, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che teaches the method as described in claim 1,
wherein the training includes adapting the plurality of  text features or the image features one to another, within a vector space. (Jin teaches using the trained model as the adapted model that forms the words, as the plurality of text features, in a caption descriptive summarization of the input image, as depicted in Fig. 5, such that similar structured concepts have related representations in the predicted scene topics, as depicted in Fig. 6, by associating structured tags, as the extracted features, with the scene categories, including computing probabilities that the extracted features are associated with a scene category as a soft scene membership assignment used to adapt the text features of the image information, as depicted in Fig. 6, within a vector space associated with the scene vectors, in pg. 8: Sec. “Adapt LSTMs to be scene-specific”: Adapt LSTMs to be scene-specific The LSTMs (as described in section 3.2) encodes the language model how the words should be sequentially selected from the vocabulary. To inject scene vectors and thus adapt the sentence generation process to be scene-specific, we factorize the parameters in the LSTMs. Specifically, given an image and its associated scene vector s, we use “personalized” LSTMs for that image to generate caption. Concretely, for all gates, the affine transformations will be reparameterized as Fig. 4. To avoid notation cluttering, assume we have a linear transformation matrix W to be applied to …)


wherein the model explicitly correlates the image features of the input image with the plurality of text features such that at least one of the image features is explicitly correlated with a first one of the plurality of text features but not a second one of the plurality of text features. (Jin teaches that the model for image and caption retrieval tasks is correlated with a first one of the extracted plurality of text features to infer the correlated image with the text feature, as depicted in Fig. 1, pgs. 1-2: Sec: Introduction; 3rd and 4th paras., but not with a text of the topic index, that is, a second text feature, not selected as text for generating the caption sentence for the image, as depicted in Fig. 6, in pg. 10; Sec. “Effect of scene factors on caption generation”)

Regarding claim 11, the rejection of claim 1 is incorporated and Jin in combination with Lee and Che teaches the method as described in claim 1,
wherein the plurality of text features are explicitly correlated to the image features. (Jin teaches the model for image and caption, with the caption sentences explicitly correlated to the extracted text features to infer the correlated image with the text feature, as depicted in Fig. 1, pgs. 1-2: Sec: Introduction; 3rd and 4th paras., used to generate a new caption from texts associated with the test images associated with the topic and scene vectors, as explicit correlations with the image captions, as depicted in Fig. 6, in pg. 10; Sec. “Effect of scene factors on caption generation”)

Regarding independent claim 12 limitations, Jin teaches in a digital medium environment 
an extractor module to extract a plurality of text features from text associated with images in training data using natural language processing, the plurality of text features extracted, respectively, using a plurality of different extraction techniques, (Jin teaches extracting a plurality of text features using natural language processing, including using an LSTM neural network to extract the plurality of text features as words associated with image regions as the object associated with a respective image of the training data, in pg. 3: Sec. Approach: …an LSTM-based neural network that models the attention dynamics of focusing on those regions as well as generating sequentially the words (section 3.2), a visual scene model that adjusts the LSTM to specific scenes (section 3.3)…; where the extraction processing for extracting words from the training data includes other natural language processing techniques including latent alignment and corresponding words detection, in pg. 5: Sec. Comparison to other systems using regions: While detecting objects in the image, [6, 8] focus on deriving the latent alignment between the detected regions and the words in the training sentences. Their purpose is to use the alignments to train a recurrent neural network generator of word sequences where the training data have become the aligned regions and corresponding words… and further discloses the latent process as a natural language learning technique for extracting a plurality of text features as words that have abstract meaning encoded in the image, in pg. 3: Sec. 3.2: Attention-based Multi-Model LSTM Decoder: …We hypothesize there is a latent process {ht} of “abstract meaning”, governing the transitions from one concept to another. 
When this process is used to drive the generation of words, it yields a textual form of the abstract meanings encoded in the image...; and also a plurality of text features as extracted words used to compute a topic vector using the natural language processing technique Latent Dirichlet Allocation (LDA), in pg. 7: Sec. 3.3: …we use Latent Dirichlet Allocation(LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO [16]. For the second step, we train a multilayer perceptron to predict the topic vector, computed by LDA, from each image’s visual feature vector. Note that this predictive model allows to extract topic vectors for images without captions. We call the topic vectors as scene vectors. Details are in the Suppl. Material.)
the extractor module configured to …at least one said text feature of the plurality of the text features from the training data based on a using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques  as corresponding to an object within a respective said image; and (Jin teaches using an extraction process for selecting training data content from annotated training images that involves comparing the plurality of text features to each other based on the given image caption, as w_t and w_{t−1} of all the T_n sentences, using an objective function for processing the words associated with an image with the sequence of words in the training set for captioning the image, in Pg. 14 Sec. D.1: …Objective function The objective function of our system is the log likelihood of all the captions given image. w_{0:t−1}^{(n)} represents the previous words before w_t^{(n)}. Note that w_0^{(n)} is a special token #BEGIN# inserted before every sentence. T_n is the length of captions n.

[Equation image: media_image3.png]

And comparing the training samples using a feature vector to identify training samples that are similar and associated images from the same dataset based on the comparing of the scene vectors, in Pg. 14: Sec. C:  Concretely, for the first step, we use Latent Dirichlet Allocation(LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO. For each image, we obtain a 80-dimensional topic vector that “softly” assigns its caption into the memberships of 80 categories. We call the topic vectors the “scene vectors”. Note that the scene vectors [claimed features corresponding to an image object] are purely inferred from captions. For the second step, we train a multilayer perceptron to predict the scene vector when presented with an image [claimed using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques]. The training samples for this classifier are the images from the same training dataset of MSCOCO with the target outputs being the LDA-inferred scene vectors...)
a model training module to train a model using the training data having the at least one said text feature … and image features as part of machine learning, (Jin teaches training machine learning models using word features of an image and the image caption to extract topic vectors from the images to make predictions for images without captions, considered part of the machine learning, given an image scene and its associated scene vector, in pgs. 7-8; Sec. “Scene Specific context” & Sec. “Adapt LSTMs to be scene-specific”)
the model configured for determining probabilities of how well image features of an input image correlate to the plurality of text features. (Jin teaches using the trained model as the adapted model that is configured to generate an image representation as a caption descriptive summarization of an input image, as depicted in Fig. 5, by associating structured tags, as the extracted features, with the scene categories, including computing probabilities that the extracted features are associated with a scene category as a soft scene membership assignment, in pg. 8: Sec. “Adapt LSTMs to be scene-specific”: 2nd para., and based on the conditional probability correlating the image with the text feature, as depicted in Fig. 1, pgs. 1-2: Sec: Introduction; 3rd and 4th paras.)
While Jin does disclose a process for automated generation of image captions using natural language processing techniques to extract text features from a training corpus, as discussed above, where the training corpus can include text data that can include annotations from human-generated captions associated with each image, in pg. 14 Sec. D.1, Jin does not expressly teach the comparison process for detecting differences for data modification as disclosed in the following claim limitations:
remove at least one said text feature of the plurality of the text features…  based on a using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques…; 
… having the at least one said text feature removed…
Lee expressly teaches the comparison process for detecting differences for data modification as disclosed in the following claim limitations:
remove at least one said text feature of the plurality of the text features … based on a using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques…; (Lee teaches comparing text features from the annotation data associated with annotated images for detecting differences as data conflicts between a received annotation triple and all previous annotation triples (e.g. extracted word features), in Pg. 1038: Sec. 4.3: ...In order to detect the data conflicts, several simple detecting rules are shown in Table 2. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules [claimed using a consensus reached by majority agreement between the plurality of text features resulting from the plurality of different extraction techniques ] in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conflict can be detected…; where resolving conflicts includes removing annotations, as at least one text feature, from the knowledge base, in Pg. 1046: Sec. 5.4: …In order to prevent the mistake during conflict resolutions, we adopted two separate queues as Removing Queue and Adding Queue to keep the annotations to be removed and added respectively. Those annotations that are waiting to be added to the knowledge base are kept in the Adding Queue while those annotations that are to be removed (no matter due to a conflict with the adding annotations or due to the logical inference from annotations that are in conflict with the adding annotations) from the knowledge base are kept in the Removing Queue…

[Image: media_image2.png]

)
… having the at least one said text feature removed… (Lee teaches using text features including the corrected annotations for resolving conflicts, as removing the said feature associated with the conflict in the knowledge base, in Pg. 1046: Sec. 5.4: …In order to prevent the mistake during conflict resolutions, we adopted two separate queues as Removing Queue and Adding Queue to keep the annotations to be removed and added respectively. Those annotations that are waiting to be added to the knowledge base are kept in the Adding Queue while those annotations that are to be removed (no matter due to a conflict with the adding annotations or due to the logical inference from annotations that are in conflict with the adding annotations) from the knowledge base are kept in the Removing Queue…)
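The triple comparison and the two queues described in the quoted passages from Lee can be pictured with the minimal sketch below. It is offered under stated assumptions: the single conflict rule shown (same subject and predicate, different object) is hypothetical and stands in for the rules of Lee's Table 2, which are not reproduced in this action, and every conflict is resolved in favor of the new annotation, which is the naive baseline mentioned in Sec. 6.2 rather than Lee's resolution method. All names are illustrative.

```python
from collections import deque
from typing import NamedTuple


class Triple(NamedTuple):
    """An annotation in <subject, predicate, object> form."""
    subject: str
    predicate: str
    obj: str


def conflicts(new, prev):
    # Hypothetical rule, NOT taken from Lee's Table 2: two triples
    # conflict when they disagree only in the object slot.
    return (new.subject == prev.subject
            and new.predicate == prev.predicate
            and new.obj != prev.obj)


def queue_annotation(new, knowledge_base):
    """Check the new triple against each previous triple, one by one."""
    adding = deque([new])
    removing = deque(p for p in knowledge_base if conflicts(new, p))
    return adding, removing


def apply_queues(knowledge_base, adding, removing):
    """Drop queued removals, then append queued additions."""
    to_remove = set(removing)
    result = [t for t in knowledge_base if t not in to_remove]
    result.extend(a for a in adding if a not in result)
    return result


kb = [Triple("img1", "hasObject", "dog"),
      Triple("img1", "hasColor", "brown")]
new = Triple("img1", "hasObject", "cat")
adding, removing = queue_annotation(new, kb)
updated = apply_queues(kb, adding, removing)
```

Keeping removals and additions in separate queues, as the quoted passage describes, means the knowledge base is only changed once all conflicts for the incoming annotation have been collected.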
The Jin and Lee references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information retrieval and processing using machine learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for correcting a training knowledge base using natural language techniques for processing extracted text features, as disclosed by Lee, with the method for developing information retrieval and processing using machine learning algorithms, as disclosed by Jin.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to apply detection patterns for detecting and resolving conflicts in the annotation of images in a multi-annotator environment (Lee, Abstract). Doing so would improve the annotation accuracy of annotated data sets in information extraction and retrieval tasks (Lee, Abstract).
 	
While Jin and Lee teach the use of machine learning functions that are performed by a computing device, Jin and Lee do not expressly disclose the use of a computing device for processing computing instructions as recited by the claim limitation:
at least one computing device,… and recited claim modules
Che teaches the use of a computing device for processing computing instructions as modules for performing operations as recited by the claim limitations:
at least one computing device,… and recited claim modules (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]-[0085]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…[0085] Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation…)
The Jin, Lee, and Che references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information retrieval and processing using machine learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the system for processing machine learning functions in a computing device that executes instructions by one or more processors, as disclosed by Che, with the method for developing information retrieval and processing using machine learning algorithms, as collectively disclosed by Jin and Lee.


Regarding claim 13, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
wherein the associated text is unstructured. (Jin teaches that the associated text is free form, depicted in Fig. 6 as free form text associated with the topic scenes, in pg. 10, Sec. Effect of scene factors on caption generations.)

Regarding claim 14, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
wherein the plurality of text features are in a form of a <subject, predicate, object> tuple. (Jin teaches training machine learning models using structured semantic knowledge to extract topic vectors from the images to make predictions for images without captions given an image scene and its associated scene vector, including a subject “a baby”, predicate “is eating” and object “a slice of pizza” as depicted in Fig. 6, in pgs. 7-8; Sec. “Scene Specific context” & Sec. “Adapt LSTMs to be scene-specific”)
Additionally, Lee teaches the use of triplets to represent and process extracted text features in a form of <subject, predicate, object> as recited by the claim limitation:
wherein the plurality of text features are in a form of a <subject, predicate, object> tuple. (Lee, in Pg. 1038: … A single annotation (triple) of an annotator conflicts with the existing knowledge pieces in the value (Object) slot of the triples… in Pg. 1038: Sec. 4.3: ...In order to detect the data conﬂicts, several simple detecting rules are shown in Table 2. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conﬂict can be detected…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin and Lee for the same reasons disclosed above.

Regarding claim 15, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
wherein the extractor module is configured to localize at least part of the plurality of text features as corresponding to respective portions within respective said images and as not corresponding to other portions within respective said images. (Jin teaches the trained model as the adapted model that is configured to form structured words, as the plurality of text features extracted from an image caption as depicted in Fig. 5, with an associated localized portion within the scene context associated with the captioned input image, with at least one extracted text feature in the image, such as baby and tooth brush included in the topics, as depicted in Fig. 5 as extracted words modeled using the LDA to predict the scene categories, in pg. 10, Sec. “Effect of scene factors on caption generations”; where the localized part corresponds to a respective portion in the object region associated with the image annotation as depicted in Fig. 5.)

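Jin teaches using computer instructions and retrieving data from a server for processing as the recited modules in applicant claim limitation for performing the recited functions, in pg. 8: Last two paras. One of ordinary skill in the art would recognize these activities in addition to the machine learning functions associated with the claim limitation were executed/performed using a computing device that is an inherent means for executing the disclosed machine learning and information retrieval task in computer vision, in pg. 3; Sec. 2; 1st para.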
In addition, Che teaches the use of a computing device for processing computing instructions as modules for performing operations as recited by the claim limitations:
at least one computing device,… and recited claim modules (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]-[0085]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…[0085] Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin, Lee, and Che for the same reasons disclosed above.

Regarding claim 17, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
further comprising a module configured to generate a caption for the input image based on the plurality of text features. (Jin teaches generating a caption as a descriptive summarization of the input image, as depicted in Fig. 5, by associating structured tags, as the extracted features, with the scene categories (that is, based on the captions of the extracted plurality of text features associated with the index topics as depicted in Fig. 6), in pg. 8: Sec. “Adapt LSTMs to be scene-specific”: 2nd para. & in pg. 10, Sec. “Effect of scene factors on caption generations”.)
Jin teaches using computer instructions and retrieving data from a server for processing as the recited modules in applicant claim limitation for performing the recited functions, in pg. 8: Last two paras. One of ordinary skill in the art would recognize these activities in addition to the machine learning functions associated with the claim limitation were executed/performed using a computing device that is an inherent means for executing the disclosed machine learning and information retrieval task in computer vision, in pg. 3; Sec. 2; 1st para.
In addition, Che teaches the use of a computing device for processing computing instructions as modules for performing operations as recited by the claim limitations:
at least one computing device,… and recited claim modules (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]-[0085]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…[0085] Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin, Lee, and Che for the same reasons disclosed above.

Regarding claim 18, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
further comprising a use module configured to deduce, based on the plurality of text features, scene properties of the input image. (Jin teaches the module used to deduce and infer information based on the word text features of the image representation, as the extracted words associated with a caption as depicted in Fig. 5, that are also associated with a localized portion within the scene context as the inferred properties of the input image being processed by the model system, as depicted in Fig. 5 as extracted words modeled using the LDA to predict the scene categories, which are considered scene properties of the input image, in pg. 10, Sec. Effect of scene factors on caption generations.)

In addition, Che teaches the use of a computing device for processing computing instructions as modules to perform operations as recited by the claim limitations:
at least one computing device,… and recited claim modules (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]-[0085]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…[0085] Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation…)


Regarding independent claim 23 limitations, Bel teaches in a digital medium environment a method implemented by at least one computing device, the method comprising:
obtaining, by the at least one computing device, training data including images and associated text; (Jin teaches obtaining training data including images and associated text as all captions associated with the images in a training data corpus, in pg. 7: Sec. Scene Specific context: …we use Latent Dirichlet Allocation(LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO…)
extracting, by the at least one computing device, a plurality of text features using natural language processing from the associated text resulting from a plurality of different extraction techniques, the plurality of text features corresponding to image features of an object within a respective said image of the training data; (Jin teaches extracting a plurality of text features using natural language processing, including using an LSTM neural network to extract the plurality of text features as words associated with image regions as the object associated with a respective image of the training data, in pg. 3: Sec. Approach: …an LSTM-based neural network that models the attention dynamics of focusing on those regions as well as generating sequentially the words (section 3.2), a visual scene model that adjusts the LSTM to specific scenes (section 3.3)…; where the extraction processing for extracting words from the training data includes other natural language processing techniques including latent alignment and corresponding words detection, in pg. 5: Sec. Comparison to other systems using regions: While detecting objects in the image, [6, 8] focus on deriving the latent alignment between the detected regions and the words in the training sentences. Their purpose is to use the alignments to train a recurrent neural network generator of word sequences where the training data have become the aligned regions and corresponding words… and further discloses the latent process as a natural language learning technique for extracting a plurality of text features as words that have abstract meaning encoded in the image, in pg. 3: Sec. 3.2: Attention-based Multi-Model LSTM Decoder: …We hypothesize there is a latent process {h_t} of “abstract meaning”, governing the transitions from one concept to another. When this process is used to drive the generation of words, it yields a textual form of the abstract meanings encoded in the image...; and also a plurality of text features as extracted words used to compute a topic vector using the natural language processing technique Latent Dirichlet Allocation (LDA), in pg. 7: Sec. 3.3: …we use Latent Dirichlet Allocation (LDA) [2] to model the corpus of all the captions in the training dataset of MSCOCO [16]. For the second step, we train a multilayer perceptron to predict the topic vector, computed by LDA [claimed natural language processing from the associated text resulting from a plurality of different extraction techniques], from each image’s visual feature vector. Note that this predictive model allows to extract topic vectors for images without captions. We call the topic vectors as scene vectors. Details are in the Suppl. Material.)
assigning, by the at least one computing device, degree of confidence to the plurality of text features using a consensus reached by the plurality of text features resulting from the plurality of different extraction techniques; and (Jin teaches assigning a log likelihood as a degree of confidence in extracting words associated with caption text features based on a comparison of extracted words associated with image descriptive sentences, in Pg. 14 Sec. D.1: …Objective function The objective function of our system is the log likelihood [claimed degree of confidence to the plurality of text features using a consensus reached] of all the captions given image. w_{0:t-1}^(n) represents the previous words before w_t^(n). Note that w_0^(n) is a special token #BEGIN# inserted before every sentence. T_n is the length of caption n.
  
[Equation image: media_image3.png]

Implementation details We use the ADAM algorithm [9] [using the claimed plurality of different extraction techniques], a variant of SGD with adaptive learning rate, to optimize our model. ADAM is advantageous as the effective step size is invariant to the scale of gradients. This invariance is especially important to our model as our scene-factorized LSTMs have multiplicative parameters (i.e., A, B and F) to be optimized jointly [using the claimed plurality of different extraction techniques].)
training, by the at least one computing device, a model using the plurality of text features, the image features of the object, and the degree of confidence as part of machine learning, (Jin teaches training a model from a constructed training dataset by using a log likelihood as a degree of confidence in extracting words associated with caption text features based on a comparison of extracted words associated with image descriptive sentences used for training a model for identifying the set of vocabulary from the training set of image and caption pairs, in Pg. 14 Sec. D.1: …The remaining part in “train2014” which has 408,915 pairs of (image, caption) constructs our training set. For Flickr8K, official split is available leading to 6,000 images for training, 1,000 images for validation and 1,000 images for evaluation. … We use Stanford PTBTokenizer [17] (also used in MSCOCO API), to tokenize the captions in MSCOCO. For Flickr8K and Flickr30K, tokenization are already done by the dataset releaser. Words in the training set are used to construct the vocabulary and those whose frequency less than 20 are discarded. Three special tokens: #BEGIN#, #END# and #OOV# are also taken into consideration, denoting the starting, the ending of a sentence as well as a universal replacement for out-of-vocabulary words. The final vocabulary sizes are 895, 3,544 and 4,523 for Flickr8K, Flickr30K and MSCOCO respectively. Objective function The objective function of our system is the log likelihood of all the captions given image. w_{0:t-1}^(n) represents the previous words before w_t^(n). Note that w_0^(n) is a special token #BEGIN# inserted before every sentence. T_n is the length of caption n.
  
[Equation image: media_image3.png]

Implementation details We use the ADAM algorithm [9] [using the plurality of text features, the image features of the object, and the degree of confidence as part of machine learning], a variant of SGD with adaptive learning rate, to optimize our model. ADAM is advantageous as the effective step size is invariant to the scale of gradients. This invariance is especially important to our model as our scene-factorized LSTMs have multiplicative parameters (i.e., A, B and F) to be optimized jointly [claimed using the plurality of text features, the image features of the object, and the degree of confidence as part of machine learning])
the model once trained is configured to correlate the image features of the object within input image with the plurality of text features. (Jin teaches using the selected training data set of text features for training a CNN model that, once trained, is configured to correlate, using the trained scene classifiers, the determined image features of the object in a categorized new input image that has no caption with the text features, in Pg. 14 Sec. C: …For the second step, we train a multilayer perceptron to predict the scene vector when presented with an image. The training samples for this classifier are the images from the same training dataset of MSCOCO with the target outputs being the LDA-inferred scene vectors. We use an MLP with two hidden layers with the sizes of 1024 and 512. ... We represent the training images with global feature vectors computed on the whole image. While it is possible to use any CNN trained on object recognition tasks, we use the CNN from the Places-205 CNN [30]. Places-205 CNN is based on AlexNet [11], but optimized under a 2.4 million database to predict the locations of the images. We use the computed features at the outputs of the last fully-connected layer. Note that, representing images with global feature vectors and using the scene classifier provide an effective way to categorize test images where captions are not available (thus scene vectors cannot be inferred from LDA). Specifically, when generating captions for new images [claimed trained model configured to correlate the image features of the object within input image with the plurality of text features], scene vectors predicted from MLP are used.)
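For illustration only, the vocabulary construction quoted from Jin, Sec. D.1 (discarding words whose frequency is less than 20 and adding the three special tokens) can be sketched in Python as follows. This is a minimal sketch, not Jin's implementation: the function names `build_vocabulary` and `encode`, the token ordering, and the appending of #END# to every encoded caption are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Special tokens named in the quoted passage of Jin, Sec. D.1.
SPECIAL_TOKENS = ["#BEGIN#", "#END#", "#OOV#"]


def build_vocabulary(captions: Iterable[List[str]], min_freq: int = 20) -> Dict[str, int]:
    """Map each kept word to an integer id (hypothetical helper).

    Words whose frequency is less than min_freq are discarded,
    mirroring the threshold of 20 stated in the quoted passage.
    """
    counts = Counter(word for caption in captions for word in caption)
    kept = sorted(word for word, count in counts.items() if count >= min_freq)
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + kept)}


def encode(caption: List[str], vocab: Dict[str, int]) -> List[int]:
    """Encode a tokenized caption, replacing unknown words with #OOV#."""
    oov = vocab["#OOV#"]
    body = [vocab.get(word, oov) for word in caption]
    # Appending #END# is an assumption of this sketch.
    return [vocab["#BEGIN#"]] + body + [vocab["#END#"]]


# Toy corpus: "a" appears 21 times, "dog" 20 times, "cat" once.
captions = [["a", "dog"]] * 20 + [["a", "cat"]]
vocab = build_vocabulary(captions)
encoded = encode(["a", "cat"], vocab)
```

In this toy corpus "cat" falls below the threshold, so it is encoded with the #OOV# id.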
While Jin does disclose a process for automated generation of image captions using natural language processing techniques to extract text features from a training corpus, as discussed above, where the training corpus can include text data such as annotations from human-generated captions associated with each image, in pg. 14 Sec. D.1, Jin does not expressly teach the comparison process using a natural language process that compares triplets as object, predicate, and subject vectors.
Lee expressly teaches the claimed comparison process using a natural language process that compares triplets as object, predicate, and subject vectors. (Lee teaches comparing text features from the annotation data associated with annotated images for detecting differences as data conflicts between a received annotation triple and all previous annotation triples (e.g. extracted word features), in Pg. 1038: Sec. 4.3: ...In order to detect the data conflicts, several simple detecting rules are shown in Table 2. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conflict can be detected…)
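For illustration only, the kind of one-by-one check of a new annotation triple against previous triples described in the Lee passage can be sketched in Python as follows. The single rule used here (same subject and predicate, different object) is an assumption chosen for the example; Lee's actual detecting rules are in its Table 2, which is not reproduced in this action, and the names `Triple` and `find_conflicts` are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """An annotation triple <S, P, O>."""
    subject: str
    predicate: str
    obj: str


def find_conflicts(new: Triple, previous: List[Triple]) -> List[Triple]:
    """Return previous triples that conflict with the new one.

    Assumed rule for this sketch: two triples conflict when they share
    a subject and predicate but differ in the object slot.
    """
    return [
        prev
        for prev in previous
        if prev.subject == new.subject
        and prev.predicate == new.predicate
        and prev.obj != new.obj
    ]


previous_annotations = [
    Triple("baby", "is eating", "pizza"),
    Triple("baby", "is wearing", "hat"),
]
new_annotation = Triple("baby", "is eating", "cake")
conflicts = find_conflicts(new_annotation, previous_annotations)
```

Under the assumed rule, only the earlier "is eating" triple is reported as a potential conflict.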
Additionally, Lee teaches the use of a confidence measure as recited by the claim limitation:
assigning, by the at least one computing device, degree of confidence to the plurality of text features (Lee teaches assigning a degree of confidence as the accuracy measure based on the use of the system for replacing annotations in dealing with conflicts, in Pg. 1047: Sec. 6.2: … Fig. 7 shows the average accuracy of annotations for the 10 images in each group, where the y-coordinate is the average accuracy of annotation that falls in the range [1, −1] for each image and x-coordinate is the image number. The accuracy of annotation is calculated using following formula: Accuracy = (# of correct annotations − # of incorrect annotations) / total # of annotations. Since we need to increase the chance of conflicts in the experiments, the scope of annotation must be restricted. Therefore, the average accuracies of control and experiment groups are very low, they are 0.1854 and 0.3092 respectively. As shown in Fig. 7, the performance of automatic data conflict resolution methods improves 12.38% from 18.54% of accuracy in comparison to a naive annotation system that simply replaces the old annotation by new ones in dealing with conflicts…)
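For illustration only, the accuracy formula quoted from Lee, Sec. 6.2 can be written out in Python as follows. This is a minimal sketch: the function name `annotation_accuracy` is hypothetical, and it assumes every annotation is counted as either correct or incorrect, so that the total is their sum.

```python
def annotation_accuracy(correct: int, incorrect: int) -> float:
    """Accuracy = (# correct - # incorrect) / total # of annotations.

    Assumes total = correct + incorrect, so the result lies in [-1, 1].
    """
    total = correct + incorrect
    if total == 0:
        raise ValueError("at least one annotation is required")
    return (correct - incorrect) / total


# Average accuracies reported in the quoted passage.
control_accuracy = 0.1854
experiment_accuracy = 0.3092

# Difference in percentage points between the two groups.
improvement_points = (experiment_accuracy - control_accuracy) * 100
```

The difference between the two reported averages is 12.38 percentage points, consistent with the improvement stated in the quoted passage.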
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin and Lee for the same reasons disclosed above.


It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for correcting a training knowledge base using natural language techniques for processing extracted text features as disclosed by Lee with the method for developing information retrieval and processing using machine learning algorithms as disclosed by Jin.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to use detection patterns for detecting and resolving conflicts in the annotation of images in a multi-annotator environment (Lee, Abstract). Doing so will improve the annotation accuracy in annotated data sets in information extraction and retrieval tasks (Lee, Abstract).
 	
While Jin and Lee teach the use of machine learning functions that are performed by a computing device, Jin and Lee do not expressly disclose the use of a computing device for processing computing instructions as recited by the claim limitation:
at least one computing device,… and …the at least one computing device,…
Che teaches the use of a computing device for processing computing instructions as recited by the claim limitation:
at least one computing device,… and …the at least one computing device,… (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…)
The Jin, Lee, and Che references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information retrieval and processing using machine learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the system for processing machine learning functions in a computing device that executes instructions by one or more processors, as disclosed by Che, with the method for developing information retrieval and processing using machine learning algorithms as collectively disclosed by Jin and Lee.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to enable natural language descriptions of multimedia content using a computing system (Che, Abstract). Doing so will allow computer vision mathematical techniques to detect elements from images using machine learning algorithms (Che, [0004]).

Regarding claim 24, the rejection of claim 23 is incorporated and Jin in combination with Lee and Che further teaches the method as described in claim 23,
wherein the associated text is free form. (Jin teaches that the associated text is free form, as depicted in Fig. 6 as free form text associated with the topic scenes, in pg. 10, Sec. Effect of scene factors on caption generations.)


wherein the plurality of features are in a form of <subject, predicate, object>  tuple. (Jin teaches training machine learning models using structured semantic knowledge to extract topic vectors from the images and to make predictions for images without captions, given an image scene and its associated scene vector, including a subject “a baby”, predicate “is eating”, and object “a slice of pizza”, as the recited form of <subject, predicate, object> tuple, as depicted in Fig. 6, in pgs. 7-8: Sec. “Scene Specific context” & Sec. “Adapt LSTMs to be scene-specific”)
Additionally, Lee teaches the use of triplet tuples to represent and process extracted text features in a form of <subject, predicate, object> as recited by the claim limitation:
wherein the plurality of text features are in a form of <subject, predicate, object> tuple. (Lee, in Pg. 1038: … A single annotation (triple) of an annotator conflicts with the existing knowledge pieces in the value (Object) slot of the triples… in Pg. 1038: Sec. 4.3: ...In order to detect the data conflicts, several simple detecting rules are shown in Table 2. When the system receives a new annotation triple NA: <S1,P1,O1> from an annotator, the system will check with the rules in Table 2 against previous annotation PA: <S2,P2,O2> one by one if any potential conflict can be detected…)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Jin and Lee for the same reasons disclosed above.

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Jin et al. (NPL: “Aligning where to see and what to tell: image caption with region-based attention and scene factorization.” Hereinafter ‘Jin’) in view of Lee et al. (NPL: “The conflict detection and resolution in knowledge merging .

Regarding claim 16, the rejection of claim 12 is incorporated and Jin in combination with Lee and Che teaches the system as described in claim 12,
further comprising a module configured to use the plurality of text features to locate the input image as part of an image search as part of a determination of how well a caption generated using the plurality of text features corresponds to a search … of the image search based on the determined probabilities. (Jin teaches the search to locate the input image as part of a search to determine how well the structured image representation corresponds to a search query of languages to generate the captions of the input image with other topics as indexed with other images, as depicted in Fig. 6, in pg. 10, Sec. “Effect of scene factors on caption generations”; where generating a caption is based on the determined probabilities for generating an image representation as a descriptive caption summarizing an input image, as depicted in Fig. 5, by associating structured tags as the extracted features with the scene categories, including computing probabilities that the extracted features are associated with a scene category as a soft scene membership assignment, in pg. 8: Sec. “Adapt LSTMs to be scene-specific”: 2nd para, and based on the conditional probability correlating the image with the text features, as depicted in Fig. 1, pgs. 1-2: Sec. Introduction: 3rd and 4th paras.)

In addition, Che teaches the use of a computing device for processing computing instructions as modules to perform operations as recited by the claim limitations:
at least one computing device,… and recited claim modules (Che teaches a computing device as the computing system for providing multimedia content understanding, in [0006]: FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module…; where the system includes a processor and memory for executing instructions, in [0084]-[0085]: …Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a "virtual machine" running on one or more computing devices)…[0085] Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation…)

Jin, Lee, and Che do not expressly teach claim 16 limitations:
… locate the input image as part of an image search … to a search query of the image search ...
Mao teaches the claim 16 limitation:
… locate the input image as part of an image search … to a search query of the image search ... (Mao teaches the image retrieval process that is part of a search query and that corresponds to a search query of the query input image, in pg. 5: Last para. – pg. 6: 1st para.)
The Jin, Lee, Che, and Mao references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information retrieval and processing using machine learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing recognition and information retrieval using learning models as disclosed by Mao with the method for developing information retrieval and processing using machine learning algorithms as collectively disclosed by Jin, Lee, and Che.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods in order to enable configurations of machine learning functions in a computer system to retrieve images in response to an image query (Mao, pg. 5: Last three paras.). Doing so improves the performance of information retrieval models for generating novel image captions (Mao, Abstract).


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure as listed below:
Zhang (US Pub. No. 2013/0064460): teaches the use of clustering algorithms to reach a consensus of images that are in the same cluster based on one or more similar features in the images; and using mutual information as a measure of consensus.
Johnson et al. (NPL: “Image retrieval using Scene Graphs”): teaches use of subject, predicate, object tuples to retrieve images for a search query based on objects localized in the images. 	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516.  The examiner can normally be reached on Monday-Friday, 8:00am-5:00pm EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/O.O.A./Examiner, Art Unit 2126  
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126