DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
The applicant’s claim to priority to provisional application 62/595,037, filed on 12/05/2017, is acknowledged.

Information Disclosure Statement
The applicant filed an information disclosure statement (IDS) on each of 12/12/2019, 11/17/2021 and 8/29/2022. Each has been annotated and considered.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 1, the limitation “determining a query image, the query image capturing a target object to be interacted with by an end effector of the robot” is indefinite. First, it is unclear what “determining a query image” means; for example, it is unclear whether this means capturing an image or choosing from previously available query images. Second, it is unclear how an image is actively “capturing” a target object. (Note: A suggested change is “capturing a query image, the query image including a target object…”.)

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-7 and 10 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Levine et al. (US 20170334066 hereinafter Levine).
The applied reference has a common assignee and inventor with the instant application. Based upon the earlier effectively filed date of the reference, it constitutes prior art under 35 U.S.C. 102(a)(2). This rejection under 35 U.S.C. 102(a)(2) might be overcome by: (1) a showing under 37 CFR 1.130(a) that the subject matter disclosed in the reference was obtained directly or indirectly from the inventor or a joint inventor of this application and is thus not prior art in accordance with 35 U.S.C. 102(b)(2)(A); (2) a showing under 37 CFR 1.130(b) of a prior public disclosure under 35 U.S.C. 102(b)(2)(B) if the same invention is not being claimed; or (3) a statement pursuant to 35 U.S.C. 102(b)(2)(C) establishing that, not later than the effective filing date of the claimed invention, the subject matter disclosed in the reference and the claimed invention were either owned by the same person or subject to an obligation of assignment to the same person or subject to a joint research agreement.

Regarding claim 1, Levine teaches a method of servoing an end effector of a robot, comprising (See at least: Figs. 1 and 7): 
determining a query image, the query image capturing a target object to be interacted with by an end effector of the robot (See at least: [0105] via “At block 754, the system identifies an image that captures one or more environmental objects in an environment of the robot.”); 
generating an action prediction based on processing the query image, a scene image, and a previous action representation using a neural network model, wherein the scene image is captured by a vision component associated with the robot and captures the target object and the end effector of the robot (See at least: [0105] via “At block 754, the system identifies an image that captures one or more environmental objects in an environment of the robot. In some implementations, such as a first iteration of method 700 for a portion of a candidate movement, the image is a current image. In some implementations, the system also identifies an additional image that at least partially omits the end effector and/or other robot components, such as an additional image of the environmental objects that was captured by a vision sensor when the end effector was at least partially out of view of the vision sensor. In some implementations, the system concatenates the image and the additional image to generate a concatenated image. In some implementations, the system optionally performs processing of the image(s) and/or concatenated image (e.g., to size to an input of the neural network).” Note: The current image teaches the scene image while the additional image teaches the query image.), and 
wherein the neural network model includes one or more recurrent layers each including a plurality of memory units (See at least: Figs. 6A and 6B; [0095] via “The example neural network 600 of FIGS. 6A and 6B is an example of the CDNA motion prediction model. The example neural network 600 can be the neural network 125 of FIG. 1, and is one of the three proposed motion prediction models described herein. In the neural network 600 of FIGS. 6A and 6B, convolutional layer 661, convolutional LSTM layers 672-677, and convolutional layer 662 are utilized to process an image 601 (e.g., a camera captured image in an initial iteration, and a most recently predicted image in subsequent iterations).”);
controlling the end effector of the robot based on the action prediction (See at least: [0116] via “As described herein, in some implementations all or aspects of the control commands generated by control system 860 in moving one or more components of a robot may be based on predicted image(s) generated utilizing predicted transformation(s) determined via a trained neural network. For example, a vision sensor of the sensors 842a-m may capture a current image, and the robot control system 860 may generate candidate robot movement parameters. The robot control system 860 may provide the current image and candidate robot movement parameters to a trained neural network, generate a predicted transformation based on the applying, may generate a predicted image based on the predicted transformation, and utilize the predicted image to generate one or more end effector control commands for controlling the movement of the robot.”); 
generating an additional action prediction immediately subsequent to generating the action prediction, the immediately subsequent action prediction generated based on processing the query image, an additional scene image, and the action prediction using the neural network model, wherein the additional scene image is captured by the vision component after controlling the end effector based on the action prediction and captures the target object and the end effector; and controlling the end effector of the robot based on the additional action prediction (See citations above, as this is a repetition of all the previous steps disclosed.). 
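
For orientation only, the iterative structure recited in claim 1 can be sketched as a closed loop in which each action prediction is fed back as the previous action representation for the next iteration. The sketch below is a minimal illustration under stated assumptions and is not the implementation of Levine or of the instant application: the names `servo`, `capture_scene`, `policy` and `apply_action` are hypothetical, and the policy is passed in as an arbitrary callable rather than a trained neural network.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical action type: a 3-D displacement of the end effector.
Action = Tuple[float, float, float]


def servo(
    query_image: Sequence[float],
    capture_scene: Callable[[], Sequence[float]],
    policy: Callable[[Sequence[float], Sequence[float], Action], Action],
    apply_action: Callable[[Action], None],
    steps: int,
    initial_action: Action = (0.0, 0.0, 0.0),
) -> List[Action]:
    """Run `steps` iterations and return the actions in the order applied."""
    previous_action = initial_action
    applied: List[Action] = []
    for _ in range(steps):
        # A fresh scene image is captured on every iteration.
        scene_image = capture_scene()
        # The same query image is reused; only the scene and the
        # previous action change between iterations.
        action = policy(query_image, scene_image, previous_action)
        # Control the end effector based on the prediction.
        apply_action(action)
        applied.append(action)
        # The prediction becomes the previous action for the next iteration.
        previous_action = action
    return applied
```

In this sketch the second pass through the loop corresponds to the "additional action prediction": it receives the same query image, a newly captured scene image, and the first action prediction as its previous action input.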

Regarding claim 2, Levine teaches wherein generating the action prediction based on processing the query image, the scene image, and the previous action representation using the neural network model comprises: processing the query image and the scene image using a plurality of visual layers of a visual portion of the neural network model to generate visual layers output; processing the previous action representation using one or more action layers of an action portion of the neural network model to generate action output; combining the visual layers output and the action output and processing the combined visual layers output and action output using a plurality of policy layers of the neural network model, the policy layers including the one or more recurrent layers (See at least: Figs. 6a-6b; [0095]-[0098]). 

Regarding claim 3, Levine teaches wherein the plurality of memory units of the one or more recurrent layers comprise long short-term memory units (See at least: [0097] via “convolutional LSTM layers 671-677”).

Regarding claim 4, Levine teaches wherein processing the query image and the scene image using the plurality of visual layers of the visual portion of the neural network model to generate visual layers output comprises: processing the query image over a first convolutional neural network portion of the visual layers to generate a query image embedding; processing the scene image over a second convolutional neural network portion of the visual layers to generate a scene image embedding; and generating the visual layers output based on the query image embedding and the scene image embedding (See at least: Figs. 6a-6b; [0095]-[0098]).

Regarding claim 5, Levine teaches wherein generating the visual layers output based on the query image embedding and the scene image embedding comprises processing the query image embedding and the scene image embedding over one or more additional layers of the visual layers (See at least: Figs. 6a-6b; [0095]-[0098]).

Regarding claim 6, Levine teaches wherein the action prediction represents a velocity vector for displacement of the end effector in a robot frame of the robot (See at least: [0048] via “Some implementations of the technology described herein are directed to training a neural network, such as a neural network including stacked long short-term memory (LSTM) layers, to enable utilization of the trained neural network to predict a transformation that will occur to an image of a robot's environment in response to particular movement of the robot in the environment. In some implementations, the trained neural network accepts an image (I.sub.t) generated by a vision sensor and accepts candidate robot movement parameters (p.sub.t), such as parameters that define a current robot state and/or one or more candidate actions to be performed to cause the current robot state to transition to a different robot state. In some implementations, the current robot state may be the pose of an end effector of the robot (e.g., a pose of a gripping end effector) and each candidate action may each be (or indicate) a subsequent pose of the end effector. Accordingly, in some of those implementations, the candidate actions may each indicate a motion vector to move from a pose of the end effector to a subsequent pose of the end effector.”; [0103] via “The movement parameters may include, for example, joint-space motion vectors (e.g., joint angle movements) to accomplish the portion of the candidate movement, the transformation of the pose of the end effector over the portion of the candidate movement, joint-space torque vectors to accomplish the portion of the candidate movement, and/or other parameters. It is noted that the particular movement parameters and/or the form of the movement parameters will be dependent on the input parameters of the trained neural network utilized in further blocks.”).

Regarding claim 7, Levine teaches wherein the determining the query image is based on user interface input from a user (See at least: [0047] via “For example, a human can utilize user interface input to define a goal state for an object. For instance, the human can manipulate the object through an interface that displays an image captured by the robot, where the image includes the object and the manipulation enables adjustment of the pose of the object.”).

Regarding claim 10, Levine teaches wherein the query image is generated based on an image captured by the vision component of the robot (Refer to claim 1 for reasoning and rationale).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 8-9 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Levine in view of Sivic et al. (“Efficient Visual Search for Objects in Videos”). 

Regarding claim 8, Levine fails to teach wherein the user interface input is typed or spoken user interface input, and wherein determining the query image based on user interface input from the user comprises: selecting the query image, from a plurality of stock images, based on data, associated with the selected query image, matching one or more terms determined based on the user interface input.
However, Sivic teaches wherein the user interface input is typed or spoken user interface input, and wherein determining the query image based on user interface input from the user comprises: selecting the query image, from a plurality of stock images, based on data, associated with the selected query image, matching one or more terms determined based on the user interface input (See at least: Fig. 7 and 11-13; Pg. 552 via “In our case, the query vector is given by the frequencies of visual words contained in a user specified subpart of an image, weighted by the inverse document frequencies computed on the entire database of frames. Retrieved frames are ranked according to the similarity of their weighted vectors to this query vector.”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Levine in view of Sivic such that the user interface input is typed or spoken user interface input, and such that determining the query image based on user interface input from the user comprises: selecting the query image, from a plurality of stock images, based on data, associated with the selected query image, matching one or more terms determined based on the user interface input, so that user input can be used to narrow down the area of the image(s) that are being used to modify the control of the robot as desired by the user.
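
For orientation only, the ranking described in the quoted Sivic passage (visual-word frequencies weighted by inverse document frequency, with frames ranked by similarity to the query vector) can be sketched as follows. This is a minimal illustration under stated assumptions, not Sivic's implementation: the logarithmic IDF and the cosine similarity are assumptions, since the quoted passage does not specify either, and the integer visual-word identifiers and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def idf_weights(frames: Sequence[Sequence[int]]) -> Dict[int, float]:
    """Inverse document frequency of each visual word over all frames."""
    n = len(frames)
    doc_freq = Counter(word for frame in frames for word in set(frame))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def weighted_vector(words: Sequence[int], idf: Dict[int, float]) -> Dict[int, float]:
    """Word frequencies weighted by IDF; unseen words get weight 0."""
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_frames(query_words: Sequence[int], frames: Sequence[Sequence[int]]) -> List[int]:
    """Return frame indices ordered from most to least similar to the query."""
    idf = idf_weights(frames)
    query = weighted_vector(query_words, idf)
    scores = [cosine(query, weighted_vector(frame, idf)) for frame in frames]
    return sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
```

Here `query_words` stands for the visual words found in the user-specified subpart of an image, and each entry of `frames` stands for the visual words of one database frame.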

Regarding claim 9, Levine fails to teach wherein determining the query image based on user interface input from the user comprises: causing the scene image or a previous scene image to be presented to the user via a computing device; wherein the user interface input is received via the computing device and indicates a subset of the presented scene image or previous scene image; and generating the query image based on a crop of the scene image or the previous scene image, wherein the crop is determined based on the user interface input.
However, Sivic teaches wherein determining the query image based on user interface input from the user comprises: causing the scene image or a previous scene image to be presented to the user via a computing device; wherein the user interface input is received via the computing device and indicates a subset of the presented scene image or previous scene image; and generating the query image based on a crop of the scene image or the previous scene image, wherein the crop is determined based on the user interface input (See at least: Fig. 7 and 11-13; Pg. 552 via “In our case, the query vector is given by the frequencies of visual words contained in a user specified subpart of an image, weighted by the inverse document frequencies computed on the entire database of frames. Retrieved frames are ranked according to the similarity of their weighted vectors to this query vector.”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Levine in view of Sivic such that determining the query image based on user interface input from the user comprises: causing the scene image or a previous scene image to be presented to the user via a computing device; wherein the user interface input is received via the computing device and indicates a subset of the presented scene image or previous scene image; and generating the query image based on a crop of the scene image or the previous scene image, wherein the crop is determined based on the user interface input, so that user input can be used to narrow down the area of the image(s) that are being used to modify the control of the robot as desired by the user.
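
For orientation only, generating a query image from a user-indicated subset of a presented scene image can be sketched as a bounding-box crop. This is a minimal illustration under stated assumptions and is not taken from either reference: the image is represented as a list of pixel rows, the user selection is assumed to arrive as a rectangular box, and the name `crop_query` is hypothetical.

```python
from typing import List, Tuple

# Stand-in image type: rows of pixel values.
Image = List[List[int]]


def crop_query(scene: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image for box = (top, left, bottom, right), end-exclusive."""
    height = len(scene)
    width = len(scene[0]) if scene else 0
    top, left, bottom, right = box
    # Clamp the user-selected box to the image bounds.
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)
    if top >= bottom or left >= right:
        raise ValueError("selected region is empty")
    return [row[left:right] for row in scene[top:bottom]]
```

A selection that extends past the image edge is clamped rather than rejected, which is a design choice of this sketch; an empty selection raises an error so that no empty query image is produced.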

Regarding claim 11, Levine fails to teach wherein the query image, the scene image, and the additional scene image are each two dimensional images.
However, Sivic teaches wherein the query image, the scene image, and the additional scene image are each two dimensional images (See at least: Fig. 7 and 11-13; Pg. 552 via “In our case, the query vector is given by the frequencies of visual words contained in a user specified subpart of an image, weighted by the inverse document frequencies computed on the entire database of frames. Retrieved frames are ranked according to the similarity of their weighted vectors to this query vector.”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Levine in view of Sivic such that the query image, the scene image, and the additional scene image are each two dimensional images, so that user input can be used to narrow down the area of the image(s) that are being used to modify the control of the robot as desired by the user.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Harry Oh, whose telephone number is (571) 270-5912. The examiner can normally be reached on Monday-Thursday, 9:00-3:00.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abby Lin, can be reached on (571) 270-3976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HARRY Y OH/
Primary Examiner, Art Unit 3666