DETAILED ACTION
This action is in response to claims filed 23 October 2019 for application 16660908 filed 23 October 2019. Currently claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Regarding claims 1, 8 and 16,
In step 1, the claim is directed to the statutory categories of a method, a system and an apparatus respectively.
In step 2a prong 1, claims 1, 8 and 16 recite, in part: applying automatically … a neural network to interpret one or more images…, determining automatically…a first syntactical element…, determining automatically a first probability that represents a confidence level…, generating automatically a first semantic chain…, determining automatically a second probability…, generating automatically a natural language-based communication…. The limitations describe a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “computer implemented”, “processor-based device”, “camera” and “processor”, in the context of the claims, the limitations encompass a person assigning words to elements of an image, combining the words to form phrases and requesting confirmation from another person. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
In step 2a prong 2, this judicial exception is not integrated into a practical application. In particular, the claims recite the additional elements of “computer implemented neural network”, “processor-based device”, “camera” and “processor”. The computer components in the claims are recited at a high level of generality (i.e., as a generic processor performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Please see MPEP §2106.04(a)(2)(III)(C). The claims also recite the additional elements of “for delivery to a user” and “receive a first set of information from the one or more cameras”. These limitations amount to mere insignificant extra-solution activity. Please see MPEP §2106.05(g).
In step 2b, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, either alone or in combination. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of “computer implemented neural network”, “processor-based device”, “camera” and “processor” to perform the steps of the claims amount to no more than mere instructions to apply the exception using a generic computer component or perform mere insignificant extra-solution activity. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Please see MPEP §2106.05(b) and (g). The claim is not patent eligible.
In step 2a prong 1, claims 2-7, 9-15 and 17-20 recite, in part: a convolutional neural network, a first probability determined from a feature detection node, linking a plurality of semantic chains, further details of the second probability, the communication comprising syntactical elements, and the communication being expected to result in receiving information. The limitations describe a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind but for the recitation of generic computer components. The limitations amount to the same abstract idea as stated above. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
In step 2a prong 2, this judicial exception is not integrated into a practical application. The claims recite the same additional elements as recited above. The computer components in the claims are recited at a high level of generality (i.e., as a generic processor performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Please see MPEP §2106.04(a)(2)(III)(C) and MPEP §2106.05(g).
In step 2b, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, either alone or in combination. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements to perform the steps of the claims amount to no more than mere instructions to apply the exception using a generic computer component or perform mere insignificant extra-solution activity. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Please see MPEP §2106.05(b) and (g). The claims are not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-15 are rejected under 35 U.S.C. 103 as being unpatentable over Karpathy et al. (Deep Visual-Semantic Alignments for Generating Image Descriptions) in view of Mao et al. (Deep Captioning with Multimodal Recurrent Neural Networks (M-RNN)).

Regarding claims 1 and 8, Karpathy discloses: A computer-implemented method comprising: 
applying automatically a computer-implemented neural network to interpret one or more images, wherein the one or more images comprise a first plurality of pixels (“We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval-based baselines, and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations” p2 ¶2); 
determining automatically in accordance with the interpretation by the computer-implemented neural network of the one or more images a first syntactical element that corresponds to a first subset of the first plurality of pixels (Fig 1 shows subsets of the image (subset of pixels) with a corresponding syntactical element, “We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval-based baselines, and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations” p2 ¶2, “Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.”); 
determining automatically a first probability that represents a confidence level of the accuracy of the correspondence of the first syntactical element to the first subset of the first plurality of pixels (“Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.”, “Consider an image from the training set and its corresponding sentence. We can interpret the quantity v_i^T s_t as the unnormalized log probability of the t-th word describing any of the bounding boxes in the image.” P4 §3.1.4 ¶1; note: a syntactical element (word) is assigned to each box with a probability representing a confidence); 
generating automatically a first semantic chain that is based, at least in part, upon the first syntactical element (“The RNN predicts a sentence as follows: We compute the representation of the image bv, set h0 = 0, x1 to the embedding of the word “the”, and compute the distribution over the first word y1. We sample from the distribution (or pick the argmax), set its embedding vector as x2, and repeat this process until the END token is generated.” P5 §RNN at test time); 
… represents a confidence level that the first semantic chain reflects objective reality (Fig 7, Table 3, note: human agreement is interpreted as reflecting objective reality);
generating automatically a natural language-based communication for delivery to a user, wherein the communication comprises syntactical elements that are in accordance with the first semantic chain and the second probability (“We first verify that our multimodal RNN is rich enough to support sentence generation for full images. In this experiment, we trained the RNN to generate sentences on full images from Flickr8K, Flickr30K, and MSCOCO datasets. Then at test time, we use the first four out of five sentences as references and the fifth one to evaluate human agreement.” P7 §Our Multimodal RNN outperforms retrieval baseline, Fig 1).
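
Note (illustrative only; not drawn from the cited references): the test-time procedure quoted above from Karpathy (compute a distribution over the next word, pick the argmax, feed the chosen word back in, and repeat until the END token is generated) can be sketched roughly as follows. The lookup table `TOY_MODEL`, its words and its probabilities are hypothetical stand-ins for a trained RNN, and the sketch conditions only on the previous word rather than on a hidden state or an image representation.

```python
# Hypothetical stand-in for a trained RNN: maps the previous word to a
# distribution over possible next words. All values are made up.
TOY_MODEL = {
    "the": {"dog": 0.6, "cat": 0.3, "END": 0.1},
    "dog": {"runs": 0.7, "END": 0.3},
    "cat": {"sits": 0.8, "END": 0.2},
    "runs": {"END": 1.0},
    "sits": {"END": 1.0},
}


def greedy_decode(model, first_input="the", max_len=10):
    """Pick the argmax next word repeatedly until END is generated.

    Returns the generated words (excluding the initial input and END)
    and the probability the model assigned to each choice, END included.
    """
    words = []
    probs = []
    current = first_input
    for _ in range(max_len):
        dist = model[current]
        next_word = max(dist, key=dist.get)  # argmax over the distribution
        probs.append(dist[next_word])
        if next_word == "END":
            break
        words.append(next_word)
        current = next_word
    return words, probs


sentence, word_probs = greedy_decode(TOY_MODEL)
```

In this toy run the per-word probabilities are retained alongside the generated words, which is the kind of intermediate value a later confidence computation could draw on.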

However, Karpathy does not explicitly disclose: determining automatically a second probability that is based, at least in part, on the first probability, wherein the second probability.

Mao teaches: determining automatically a second probability that is based, at least in part, on the first probability, wherein the second probability ([equation image: media_image1.png], Mao p5 §5 ¶4).

Karpathy and Mao both teach neural networks for generating syntactic chains from images and are analogous art. Karpathy teaches a convolutional neural network (CNN) and a recurrent neural network (RNN) for generating a syntactic element and stringing elements together to form a syntactic chain using scores. Mao teaches generating elements and chains using probabilities derived from other probabilities. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the system of Karpathy to use the probabilities of Mao to yield predictable results, because probabilities are intuitive to use as scores or ranks.
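
Note (illustrative only): one generic way a probability can be derived from other probabilities is the chain rule, under which the probability of a word sequence is the product of the per-word conditional probabilities. The sketch below is an assumption offered for orientation; it does not reproduce the equation in the cited image from Mao, and the function names and numbers are hypothetical.

```python
import math


def sequence_log_prob(word_probs):
    """Sum of log per-word probabilities (chain rule in log space)."""
    for p in word_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return sum(math.log(p) for p in word_probs)


def sequence_prob(word_probs):
    """Sequence-level probability derived from per-word probabilities."""
    return math.exp(sequence_log_prob(word_probs))


# Hypothetical per-word probabilities for a three-word phrase.
first_probabilities = [0.5, 0.5, 0.8]
second_probability = sequence_prob(first_probabilities)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.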

Regarding claims 2 and 9, Karpathy discloses: The method of claim 1, further comprising: applying automatically the computer-implemented neural network, wherein the computer-implemented neural network is a convolutional neural network (Fig 4, note: the model is a combination of a CNN and an RNN).

Regarding claims 3 and 10, Karpathy discloses: The method of claim 1, further comprising: determining automatically the first probability, wherein the first probability is determined in accordance with information that is accessed from one or more feature detection nodes of the computer-implemented neural network (“Figure 3. Diagram for evaluating the image-sentence score S_kl. Object regions are embedded with a CNN (left). Words (enriched by their context) are embedded in the same multimodal space with a BRNN (right). Pairwise similarities are computed with inner products (magnitudes shown in grayscale) and finally reduced to image-sentence score with Equation 16.”, “Consider an image from the training set and its corresponding sentence. We can interpret the quantity v_i^T s_t as the unnormalized log probability of the t-th word describing any of the bounding boxes in the image.” P4 §3.1.4 ¶1).
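
Note (illustrative only): the quoted passage treats the inner product of a region vector and a word vector (written v_i^T s_t in Karpathy) as an unnormalized log probability. The sketch below shows, with hypothetical two-dimensional vectors, how such inner-product scores could be computed and turned into a normalized distribution over regions. The softmax normalization is an assumption made for illustration; the quoted text refers to MRF smoothing, which is not shown here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Normalize unnormalized log probabilities into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: three image regions and one word.
region_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
word_vector = [2.0, 0.0]

scores = [dot(v, word_vector) for v in region_vectors]
region_probs = softmax(scores)
best_region = max(range(len(scores)), key=scores.__getitem__)
```

Here `best_region` is the index of the highest-scoring region for the word, and `region_probs` gives a confidence-like value for each region.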

Regarding claims 4 and 11, Karpathy discloses: The method of claim 1, further comprising: generating automatically the first semantic chain, wherein the first semantic chain is further based upon a linking of a plurality of semantic chains (Figures 5 and 7, single words are generated and then words are strung together into phrases (semantic chains)).

Regarding claims 5 and 12, Karpathy discloses: The method of claim 1, … that each represent a confidence level of accuracy of a correspondence between a syntactical element and a subset of the first plurality of pixels, wherein each of the correspondences is in accordance with an interpretation of each of the subsets of the first plurality of pixels by the computer-implemented neural network (“We first verify that our multimodal RNN is rich enough to support sentence generation for full images. In this experiment, we trained the RNN to generate sentences on full images from Flickr8K, Flickr30K, and MSCOCO datasets. Then at test time, we use the first four out of five sentences as references and the fifth one to evaluate human agreement.” P7 §Our Multimodal RNN outperforms retrieval baseline, Fig 1).

However, Karpathy does not explicitly disclose: further comprising: determining automatically the second probability, wherein the second probability is based on a plurality of probabilities.

Mao teaches: further comprising: determining automatically the second probability, wherein the second probability is based on a plurality of probabilities ([equation image: media_image1.png], Mao p5 §5 ¶4).

Regarding claims 6 and 13, The method of claim 1, further comprising: generating automatically the natural language-based communication comprising the syntactical elements, wherein the syntactical elements are further in accordance with a linking of the first semantic chain with a second semantic chain (“We first verify that our multimodal RNN is rich enough to support sentence generation for full images. In this experiment, we trained the RNN to generate sentences on full images from Flickr8K, Flickr30K, and MSCOCO datasets. Then at test time, we use the first four out of five sentences as references and the fifth one to evaluate human agreement.” P7 §Our Multimodal RNN outperforms retrieval baseline, Fig 1).

Regarding claims 7 and 14, The method of claim 1, further comprising: generating automatically the natural language-based communication, wherein the communication is expected to result in receiving information that will influence the confidence level that the first semantic chain reflects objective reality (Figure 7, Table 3, note: human agreement is also interpreted as receiving information. “wherein the communication is expected to result…” does not hold any patentable weight as it is an intended result and is not positively recited. For compact prosecution a citation has been provided.).

Claims 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Karpathy et al. (Deep Visual-Semantic Alignments for Generating Image Descriptions) in view of Cohen (US 20120324565) and Mao et al. (Deep Captioning with Multimodal Recurrent Neural Networks (M-RNN)).

Regarding claim 16, Karpathy discloses: An apparatus comprising: … one or more processors configured to: … wherein the first set of information comprises a first plurality of pixels (“We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval-based baselines, and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations” p2 ¶2); 
apply automatically a computer-implemented neural network to interpret the first plurality of pixels (Fig 1 shows subsets of the image (subset of pixels) with a corresponding syntactical element, “We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval-based baselines, and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations” p2 ¶2, “Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.”); 
determine automatically in accordance with the interpretation of the first plurality of pixels by the computer-implemented neural network a first syntactical element that corresponds to a first subset of the first plurality of pixels (“Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.”, “Consider an image from the training set and its corresponding sentence. We can interpret the quantity v_i^T s_t as the unnormalized log probability of the t-th word describing any of the bounding boxes in the image.” P4 §3.1.4 ¶1; note: a syntactical element (word) is assigned to each box with a probability representing a confidence); 
determine automatically a first probability that represents a confidence level of the accuracy of the correspondence of the first syntactical element to the first subset of the first plurality of pixels (“Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.”, “Consider an image from the training set and its corresponding sentence. We can interpret the quantity v_i^T s_t as the unnormalized log probability of the t-th word describing any of the bounding boxes in the image.” P4 §3.1.4 ¶1; note: a syntactical element (word) is assigned to each box with a probability representing a confidence); 
provide the first syntactical element and the first probability to a computer-implemented function that generates automatically a first semantic chain that is based, at least in part, upon the first syntactical element and … represents a confidence level that the first semantic chain reflects objective reality (“The RNN predicts a sentence as follows: We compute the representation of the image bv, set h0 = 0, x1 to the embedding of the word “the”, and compute the distribution over the first word y1. We sample from the distribution (or pick the argmax), set its embedding vector as x2, and repeat this process until the END token is generated.” P5 §RNN at test time, Fig 7, Table 3, note: human agreement is interpreted as reflecting objective reality); and 
deliver automatically a natural language-based communication to a user, wherein the communication is automatically generated and comprises syntactical elements that are in accordance with the first semantic chain and the second probability (“We first verify that our multimodal RNN is rich enough to support sentence generation for full images. In this experiment, we trained the RNN to generate sentences on full images from Flickr8K, Flickr30K, and MSCOCO datasets. Then at test time, we use the first four out of five sentences as references and the fifth one to evaluate human agreement.” P7 §Our Multimodal RNN outperforms retrieval baseline, Fig 1).

However, Karpathy does not explicitly disclose: one or more cameras; receive a first set of information from the one or more cameras; and determines automatically a second probability that is based, at least in part, on the first probability, wherein the second probability.

Cohen teaches: one or more cameras; receive a first set of information from the one or more cameras (“After the received picture has been filtered 108, the camera can then capture the filtered picture and a clean copy can be generated 106 from the captured filtered picture. The method then sends 110 the clean copy of the filtered picture to a network, such as a neural or computer network.” [0027]).

Karpathy and Cohen both teach analyzing images using neural networks and are analogous art. Karpathy discloses generating syntactic elements and chains from images using a neural network. Cohen teaches obtaining images to analyze using a camera. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network image processing as taught by Karpathy with the image capturing by camera as taught by Cohen. One would have been motivated to do so because a camera is able to capture new images to be analyzed.

Mao teaches: determines automatically a second probability that is based, at least in part, on the first probability, wherein the second probability ([equation image: media_image1.png], Mao p5 §5 ¶4).

Karpathy, Cohen and Mao teach neural networks for analyzing images and are analogous art. Karpathy teaches a convolutional neural network (CNN) and a recurrent neural network (RNN) for generating a syntactic element and stringing elements together to form a syntactic chain using scores. Cohen teaches obtaining images to analyze using a camera. Mao teaches generating elements and chains using probabilities derived from other probabilities. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the system of Karpathy and Cohen to use the probabilities of Mao to yield predictable results, because probabilities are intuitive to use as scores or ranks.

Regarding claim 17, Karpathy discloses: The apparatus of claim 16, further comprising the one or more processors configured to: determine automatically the first probability, wherein the first probability is determined in accordance with information that is accessed from one or more feature detection nodes of the computer-implemented neural network (“Figure 3. Diagram for evaluating the image-sentence score S_kl. Object regions are embedded with a CNN (left). Words (enriched by their context) are embedded in the same multimodal space with a BRNN (right). Pairwise similarities are computed with inner products (magnitudes shown in grayscale) and finally reduced to image-sentence score with Equation 16.”, “Consider an image from the training set and its corresponding sentence. We can interpret the quantity v_i^T s_t as the unnormalized log probability of the t-th word describing any of the bounding boxes in the image.” P4 §3.1.4 ¶1).
	
Regarding claim 18, Karpathy discloses: The apparatus of claim 16, further comprising the one or more processors configured to: provide the first syntactical element and the first probability to the computer-implemented function that generates automatically the first semantic chain that is based, at least in part, upon the first syntactical element and … that each represent a confidence level of accuracy of a correspondence between a syntactical element and a subset of the first plurality of pixels, wherein each of the correspondences is in accordance with an interpretation of each of the subsets of the first plurality of pixels by the computer-implemented neural network.

However, Karpathy does not explicitly disclose: determines automatically the second probability, wherein the second probability is based on a plurality of probabilities.

Mao teaches: determines automatically the second probability, wherein the second probability is based on a plurality of probabilities ([equation image: media_image1.png], Mao p5 §5 ¶4).

Regarding claim 19, Karpathy discloses: The apparatus of claim 16, further comprising the one or more processors configured to: generate automatically the natural language-based communication comprising the syntactical elements, wherein the syntactical elements are further in accordance with a linking of the first semantic chain with a second semantic chain (“We first verify that our multimodal RNN is rich enough to support sentence generation for full images. In this experiment, we trained the RNN to generate sentences on full images from Flickr8K, Flickr30K, and MSCOCO datasets. Then at test time, we use the first four out of five sentences as references and the fifth one to evaluate human agreement.” P7 §Our Multimodal RNN outperforms retrieval baseline, Fig 1).

Regarding claim 20, Karpathy discloses: The apparatus of claim 16, further comprising the one or more processors configured to: generate automatically the natural language-based communication, wherein the communication is expected to result in receiving information that will influence the confidence level that the first semantic chain reflects objective reality (Figure 7, Table 3, note: human agreement is also interpreted as receiving information. “wherein the communication is expected to result…” does not hold any patentable weight as it is an intended result and is not positively recited. For compact prosecution a citation has been provided.).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246. The examiner can normally be reached M-F: 7-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ERIC NILSSON/Primary Examiner, Art Unit 2122