DETAILED ACTION
Remarks
This non-Final Office action is in response to the continuation (CON) application filed on 07/19/2021.
Application 17/379,091 is a CON application of Application 15/862,514, which has issued as US Patent No. 11,097,418.
Claims 1-17 are pending and examined below. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6-9 and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over US 2017/0252924 (“Vijayanarasimhan”) in view of US 2016/0221187 (“Bradski”).
Regarding claim 1, Vijayanarasimhan discloses a method implemented by one or more processors (see [0019], where “The method further includes training, by one or more of the processors, a semantic convolutional neural network based on the training examples.”), comprising:
receiving a group of three-dimensional (3D) data points generated by a vision component of a robot, the group of 3D data points capturing an object in an environment of a robot (see [0050], where “Vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the sensors. The vision sensors 184A and 184B may be, for example, monographic cameras, stereographic cameras, and/or 3D laser scanners.”; see also [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; the vision sensors generate images of an object in the environment that is considered for grasping.);
applying vision data as input to at least one trained machine learning model (see [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; see also [0048], where “FIG. 1A illustrates an example environment in which grasp attempts may be performed by robots (e.g., robots 180A, 180B, and/or other robots), data associated with the grasp attempts may be utilized to generate training examples, and/or the training examples may be utilized to train various networks 125, 126, and/or 127 of the semantic grasping model 124.”), the vision data being based on the group of 3D data points, or being generated by an additional vision component of the robot, the vision data capturing the object in the environment of the robot (see [0131], where “At block 754, the system identifies a current image that captures the end effector and one or more environmental objects. In some implementations, the system also identifies an additional image that at least partially omits the end effector, such as an additional image of the environmental objects that was captured by a vision sensor when the end effector was at least partially out of view of the vision sensor. In some implementations, the system concatenates the image and the additional image to generate a concatenated image. In some implementations, the system optionally performs processing of the image(s) and/or concatenated image (e.g., to size to an input of the grasp CNN).”; the system identifies an image that captures the end effector and sizes or otherwise processes it as input to the grasp CNN);
processing the vision data using the trained machine learning model to generate output defining one or more grasp regions and, for each of the one or more grasp regions, a corresponding semantic indication (see [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the concatenated image and the candidate end effector motion vector to generate end effector motion output. Output is generated by applying image to CNN (processing the data and generating output). see also [0005], where “For example, a user may provide user interface input (e.g., spoken, typed) that indicates a desire to grasp an object having one or more particular object feature(s) and the robot may utilize the trained networks to attempt a grasp”; see also [0019], where “Each of the training examples further include training example output that includes: at least one grasped object label indicating a semantic feature of an object grasped by the corresponding grasp attempt.”);
selecting a grasp region, of the one or more grasp regions, based on the grasp region corresponding to the object and the object being selected for grasping (see [0066], where “A grasp measure 177 and STN parameters 178 are generated over the grasp CNN 125 based on the applied image(s) 161 and end effector motion vector 162.”; see also [0068], where “The spatial transformation 179 is applied as input to the semantic CNN 127 and semantic feature(s) 180 are generated over the semantic CNN 127 based on the applied spatially transformed image 179. For example, the semantic feature(s) 180 may indicate to which of one or more classes an object in the spatially transformed image 179 belongs such as classes of "eraser", "scissors", "comb", "shovel", "torch", "toy", "hairbrush", and/or other class( es) of greater or lesser granularity.”; the STN parameters correlate to transformation parameters for a cropped (i.e. specified, grasp, etc.) region and provide an additional step for semantic classification by filtering the unnecessary section of input image.); 
selecting, based on the semantic indication of the grasp region, a particular grasp strategy of a plurality of candidate grasp strategies (see fig 7, where a flowchart is showing a semantic grasping model, block 762 is generating semantic features over semantic CNN. See also [0138], where “the system may generate one or more additional candidate end effector motion vectors …and generate: measures of successful grasps…In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”);
determining an end effector pose for interacting with the object to grasp the object (see [0136], where “At block 764, the system generates an end effector command based on the measure of a successful grasp of block 758 and the semantic feature(s) of block 762. Generally, at block 764, the system generates an end effector command that seeks to achieve (through one or more iterations of method 700) a successful grasp that is of an object that has desired object semantic features.”; see also [0139], where “For example, if one or more comparisons of the current measure of successful grasp to the measure of successful grasp determined at block 758 fail to satisfy a threshold, and the current semantic feature(s) indicate the desired object semantic features, then the end effector motion command may be a "grasp command" that causes the end effector to attempt a grasp (e.g., close digits of an impactive gripping end effector). For instance, if the result of the current measure of successful grasp divided by the measure of successful grasp determined at block 758 for the candidate end effector motion vector that is most indicative of successful grasp is greater than or equal to a first threshold (e.g., 0.9), the end effector command may be a grasp command (under the rationale of stopping the grasp early if closing the gripper is nearly as likely to produce a successful grasp as moving it).”; see also [0092], where “block 460, the system generates an end effector motion vector for the instance based on the pose of the end effector at the instance and the pose of the end effector at the final instance of the grasp attempt.”; the system generates command for end effector that seeks to achieve a successful grasp or pose at object grasp.), wherein determining the end effector pose comprises: 
selecting a grasp candidate (see [0138], where “the system may generate one or more additional candidate end effector motion vectors at block 752, and generate: measures of successful grasps for those additional candidate end effector motion vectors at additional iterations of block 758 (based on applying the current image and the additional candidate end effector motion vectors to the grasp CNN); and semantic feature(s) for those additional candidate end effector motion vectors at additional iterations of block 762. The additional iterations of blocks 758 and 762 may optionally be performed in parallel by the system. In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”; a grasp candidate is selected by iterating candidate end effector motion vectors through the CNN.); and
providing, to actuators of the robot, commands that cause an end effector of the robot to traverse to the end effector pose in association with attempting a grasp of the object (see fig 7, block 768, where a grasp command is implemented for the end effector to attempt a grasp.).
Vijayanarasimhan does not disclose the following limitations:
wherein determining the end effector pose comprises: 
selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region;
selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and
in response to selecting the first surface normal, determining the end effector pose based on the first surface normal.
However, Bradski discloses a method wherein determining the end effector pose comprises:
selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region defined by output generated by a computing device (see [0098], where “grasp point (which may also be referred to as pick points) may be identified by finding surface normals”; see also [0122], where “In particular, this is demonstrated by the surface normals shown in FIG. 6B, where only surface normals (i.e., potential grasp points) that can be reached by an approach path of the robotic arm 602 are shown”; see also [0136], where “Planning a collision-free path may involve determining the "virtual" location of objects and surfaces in the environment.”; see also [0156], where “virtual grasps”; see also [0031], where “In many cases, a robotic arm maybe equipped with various sensors that allow for analysis of the object and the environment, as well as 3D ("virtual") reconstruction of the object and the environment.”; sensors attached to the robotic device generate a 3D view of the object. A grasp point is then identified by finding surface normals of the object. One surface normal is picked from multiple surface normals based on a collision-free grasp path. Thus, a grasp point (3D point) is selected from multiple 3D points (grasp points).);
selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region (see fig 6A-C, where one grasp point is selected from multiple grasp points. see also [0119], where “potential grasp points on the physical object may be identified by finding spatial clusters of coherently oriented surface normals. In particular, the system may make use of regular organization of surface normals as a good grasp point for suction or electrostatic grippers to pick up an object. Such areas of coherently oriented surface normals are often points of low surface curvature (also referred to as "flat spots") that work well for suction or electrostatic grabbing devices.”; see also [0150], where “the selection of a grasp point is shown in FIG. 6B. FIG. 6B shows an object in the bin 610 together with surface normals on that object produced by processing depth data from a depth sensing device.”); and 
in response to selecting the first surface normal, determining the end effector pose based on the first surface normal (see [0120], where “As shown in FIG. 6B, the robotic arm 602 is on an approach path towards the object 608, where the object 608 exhibits surface normals that represent flat spots on the object 608.”; see also [0121], where “the surface normals (i.e., the graspable features) may allow for estimation of approach trajectories.”). 
Vijayanarasimhan teaches a method for applying vision data of an object to be grasped as input to a CNN (trained machine learning model) and selecting a grasp candidate (see citations above). Bradski teaches a method for selecting, by a computing device, a grasp point on an object from multiple potential grasp points (surface normals) that can be reached by a robotic arm, and selecting a grasp point for a collision-free grasp (see citations above).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan to incorporate the teachings of Bradski by including the above features, namely: selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region; selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and in response to selecting the first surface normal, determining the end effector pose based on the first surface normal, in order to avoid collisions with nearby objects and to avoid damage to the object and the end effector during grasping.
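For illustration only, the limitation of selecting a first surface normal in lieu of a second surface normal based on conformance to a grasp approach direction may be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of the application, of Vijayanarasimhan, or of Bradski: the strategy names ("top", "side"), the target directions, the 20-degree tolerance, and the function names are all hypothetical, and conformance is assumed to be measured as the angle between a candidate normal and the direction associated with the selected strategy.

```python
import math

# Hypothetical mapping from a grasp strategy to the surface-normal direction
# that conforms to its approach direction. Names and vectors are assumptions.
STRATEGY_NORMALS = {
    "top": (0.0, 0.0, 1.0),   # approach from above: surface should face up
    "side": (1.0, 0.0, 0.0),  # approach along -x: surface should face +x
}


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def angle_between(a, b):
    """Angle in radians between vectors a and b."""
    a, b = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot)))


def select_conforming_normal(candidates, strategy,
                             max_angle_rad=math.radians(20)):
    """Pick the (point, normal) pair whose normal best matches the strategy.

    candidates: list of (point, normal) tuples for 3D points that lie
    within the selected grasp region. Returns None when no normal is
    within the (assumed) angular tolerance.
    """
    target = STRATEGY_NORMALS[strategy]
    best, best_angle = None, max_angle_rad
    for point, normal in candidates:
        angle = angle_between(normal, target)
        if angle <= best_angle:
            best, best_angle = (point, normal), angle
    return best


if __name__ == "__main__":
    region = [
        ((0.10, 0.20, 0.50), (0.0, 0.0, 1.0)),   # first 3D point, faces up
        ((0.10, 0.25, 0.45), (1.0, 0.0, 0.0)),   # second 3D point, faces +x
    ]
    print(select_conforming_normal(region, "top"))
```

Under these assumptions, a "top" strategy selects the normal of the first 3D point in lieu of that of the second, and an end effector pose could then be derived from the selected point and normal.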
Regarding claim 6, Vijayanarasimhan further discloses a method wherein the output is generated over a single model of the at least one trained machine learning model (see fig 5, where a method of training of a grasping model based on training examples is shown. See also [0060], where “The training engine 120 trains one or more of the networks 125, 126, and 127 of semantic grasping model 124 based on the training examples of training examples database 117.”; see also [0088], where “generate output over the trained semantic CNN, and utilize the output to determine one or more grasped object features of the object being grasped.”; see also fig 7, where a grasping model is utilized for grasping.).
Regarding claim 7, Vijayanarasimhan further discloses a method wherein the vision data lacks a depth channel (see [0050], where “the vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s)”; see also [0057], where “In some implementations, the current image may include multiple channels, such as a red channel, a blue channel, a green channel, and/or a depth channel.”; therefore, per the teachings of Vijayanarasimhan, a variety of image data, including at least said shape, color, and/or depth, may be derived from one or more sensors. As such, the Examiner contends that said image data (i.e., vision data) need not include depth data, and therefore the teachings of Vijayanarasimhan satisfy the claim limitation.).
Regarding claim 8, Vijayanarasimhan further discloses a method wherein the vision data processed using the single model comprises the group of 3D points (see fig 1A, where images (vision data) are input to the grasp CNN. See also [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the concatenated image and the candidate end effector motion vector to generate end effector motion output. Output is generated by applying image to CNN (processing the data and generating output).) without the depth channel (see [0050], where “the vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s)”; See also [0057], where “In some implementations, the current image may include multiple channels, such as a red channel, a blue channel, a green channel, and/or a depth channel.”; Therefore, gleaning from the teachings of Vijayanarasimhan, a variety of image data including at least said shape, color and/or depth, may be derived from one or more sensors. As such, Examiner contends wherein said image data (i.e. vision data) may not include depth data, and therefore the teachings of Vijayanarasimhan satisfy the currently provided claim limitation.).  
Regarding claim 9, Vijayanarasimhan further discloses a robot (see fig 1A, where robots (180A, 180B) are shown), comprising: 
actuators (see [0072], where “the system may generate one or more motion commands to cause one or more of the actuators that control the pose of the end effector to actuate, thereby changing the pose of the end effector.”); 
a vision component (see fig 1A, where 184A/184B are vision sensors);
an end effector (see fig 1A, where 182A/182B are end-effectors); 
memory storing instructions (see fig 9, block 925 is memory. See also [0152], where “These software modules are generally executed by processor 914 alone or in combination with other processors. Memory 925 used in the storage subsystem 924 can include a number of memories including a main random-access memory (RAM) 930 for storage of instructions and data during program execution and a read only memory (ROM) 932 in which fixed instructions are stored.”); - 41 -Attorney Docket No. ZU236-21089 
one or more processors (see [0019], where “The method further includes training, by one or more of the processors, a semantic convolutional neural network based on the training examples.”), executing the instructions, to cause the one or more processors to: 
receive a group of three-dimensional (3D) data points generated by the vision component, the group of 3D data points capturing an object in an environment of a robot (see [0050], where “Vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the sensors. The vision sensors 184A and 184B may be, for example, monographic cameras, stereographic cameras, and/or 3D laser scanners.”; see also [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; the vision sensors generate images of an object in the environment that is considered for grasping.);
apply vision data as input to at least one trained machine learning model (see [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; see also [0048], where “FIG. 1A illustrates an example environment in which grasp attempts may be performed by robots (e.g., robots 180A, 180B, and/or other robots), data associated with the grasp attempts may be utilized to generate training examples, and/or the training examples may be utilized to train various networks 125, 126, and/or 127 of the semantic grasping model 124.”), the vision data being based on the group of 3D data points, or being generated by an additional vision component of the robot, the vision data capturing the object in the environment of the robot (see [0131], where “At block 754, the system identifies a current image that captures the end effector and one or more environmental objects. In some implementations, the system also identifies an additional image that at least partially omits the end effector, such as an additional image of the environmental objects that was captured by a vision sensor when the end effector was at least partially out of view of the vision sensor. In some implementations, the system concatenates the image and the additional image to generate a concatenated image. In some implementations, the system optionally performs processing of the image(s) and/or concatenated image (e.g., to size to an input of the grasp CNN).”; the system identifies an image that captures the end effector and sizes or otherwise processes it as input to the grasp CNN);
process the vision data using the trained machine learning model to generate output defining one or more grasp regions and, for each of the one or more grasp regions, a corresponding semantic indication (see [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the concatenated image and the candidate end effector motion vector to generate end effector motion output. Output is generated by applying image to CNN (processing the data and generating output). see also [0005], where “For example, a user may provide user interface input (e.g., spoken, typed) that indicates a desire to grasp an object having one or more particular object feature(s) and the robot may utilize the trained networks to attempt a grasp”; see also [0019], where “Each of the training examples further include training example output that includes: at least one grasped object label indicating a semantic feature of an object grasped by the corresponding grasp attempt.”); 
select a grasp region, of the one or more grasp regions, based on the grasp region corresponding to the object and the object being selected for grasping (see [0066], where “A grasp measure 177 and STN parameters 178 are generated over the grasp CNN 125 based on the applied image(s) 161 and end effector motion vector 162.”; see also [0068], where “The spatial transformation 179 is applied as input to the semantic CNN 127 and semantic feature(s) 180 are generated over the semantic CNN 127 based on the applied spatially transformed image 179. For example, the semantic feature(s) 180 may indicate to which of one or more classes an object in the spatially transformed image 179 belongs such as classes of "eraser", "scissors", "comb", "shovel", "torch", "toy", "hairbrush", and/or other class( es) of greater or lesser granularity.”; the STN parameters correlate to transformation parameters for a cropped (i.e. specified, grasp, etc.) region and provide an additional step for semantic classification by filtering the unnecessary section of input image.); 
select, based on the semantic indication of the grasp region, a particular grasp strategy of a plurality of candidate grasp strategies (see fig 7, where a flowchart is showing a semantic grasping model, block 762 is generating semantic features over semantic CNN. See also [0138], where “the system may generate one or more additional candidate end effector motion vectors …and generate: measures of successful grasps…In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”); 
determine an end effector pose for interacting with the object to grasp the object (see [0136], where “At block 764, the system generates an end effector command based on the measure of a successful grasp of block 758 and the semantic feature(s) of block 762. Generally, at block 764, the system generates an end effector command that seeks to achieve (through one or more iterations of method 700) a successful grasp that is of an object that has desired object semantic features.”; see also [0139], where “For example, if one or more comparisons of the current measure of successful grasp to the measure of successful grasp determined at block 758 fail to satisfy a threshold, and the current semantic feature(s) indicate the desired object semantic features, then the end effector motion command may be a "grasp command" that causes the end effector to attempt a grasp (e.g., close digits of an impactive gripping end effector). For instance, if the result of the current measure of successful grasp divided by the measure of successful grasp determined at block 758 for the candidate end effector motion vector that is most indicative of successful grasp is greater than or equal to a first threshold (e.g., 0.9), the end effector command may be a grasp command (under the rationale of stopping the grasp early if closing the gripper is nearly as likely to produce a successful grasp as moving it).”; see also [0092], where “block 460, the system generates an end effector motion vector for the instance based on the pose of the end effector at the instance and the pose of the end effector at the final instance of the grasp attempt.”; the system generates command for end effector that seeks to achieve a successful grasp or pose at object grasp.), wherein in determining the end effector pose one or more of the processors are to: 
select a grasp candidate (see [0138], where “the system may generate one or more additional candidate end effector motion vectors at block 752, and generate: measures of successful grasps for those additional candidate end effector motion vectors at additional iterations of block 758 (based on applying the current image and the additional candidate end effector motion vectors to the grasp CNN); and semantic feature(s) for those additional candidate end effector motion vectors at additional iterations of block 762. The additional iterations of blocks 758 and 762 may optionally be performed in parallel by the system. In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”; a grasp candidate is selected by iterating candidate end effector motion vectors through the CNN.); and
provide, to the actuators of the robot, commands that cause the end effector of the robot to traverse to the end effector pose in association with attempting a grasp of the object (see fig 7, block 768, where a grasp command is implemented for the end effector to attempt a grasp.).
Vijayanarasimhan does not disclose the following limitations:
wherein in determining the end effector pose one or more of the processors are to:
select, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region; 
select a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and
in response to selecting the first surface normal, determine the end effector pose based on the first surface normal.
However, Bradski further discloses a system wherein, in determining the end effector pose, one or more of the processors are to:
select, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region defined by output generated by a computing device (see [0098], where “grasp point (which may also be referred to as pick points) may be identified by finding surface normals”; see also [0122], where “In particular, this is demonstrated by the surface normals shown in FIG. 6B, where only surface normals (i.e., potential grasp points) that can be reached by an approach path of the robotic arm 602 are shown”; see also [0136], where “Planning a collision-free path may involve determining the "virtual" location of objects and surfaces in the environment.”; see also [0156], where “virtual grasps”; see also [0031], where “In many cases, a robotic arm maybe equipped with various sensors that allow for analysis of the object and the environment, as well as 3D ("virtual") reconstruction of the object and the environment.”; sensors attached to the robotic device generate a 3D view of the object. A grasp point is then identified by finding surface normals of the object. One surface normal is picked from multiple surface normals based on a collision-free grasp path. Thus, a grasp point (3D point) is selected from multiple 3D points (grasp points).);
select a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region (see fig 6A-C, where one grasp point is selected from multiple grasp points. see also [0119], where “potential grasp points on the physical object may be identified by finding spatial clusters of coherently oriented surface normals. In particular, the system may make use of regular organization of surface normals as a good grasp point for suction or electrostatic grippers to pick up an object. Such areas of coherently oriented surface normals are often points of low surface curvature (also referred to as "flat spots") that work well for suction or electrostatic grabbing devices.”; see also [0150], where “the selection of a grasp point is shown in FIG. 6B. FIG. 6B shows an object in the bin 610 together with surface normals on that object produced by processing depth data from a depth sensing device.”); and
in response to selecting the first surface normal, determine the end effector pose based on the first surface normal (see [0120], where “As shown in FIG. 6B, the robotic arm 602 is on an approach path towards the object 608, where the object 608 exhibits surface normals that represent flat spots on the object 608.”; see also [0121], where “the surface normals (i.e., the graspable features) may allow for estimation of approach trajectories.”).
Vijayanarasimhan teaches a system for applying vision data of an object to be grasped as input to a CNN (trained machine learning model) and selecting a grasp candidate (see citations above). Bradski teaches a system for selecting, by a computing device, a grasp point on an object from multiple potential grasp points (surface normals) that can be reached by a robotic arm, and selecting a grasp point for a collision-free grasp (see citations above).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan to incorporate the teachings of Bradski by including the above features, namely: select, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region; select a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and in response to selecting the first surface normal, determine the end effector pose based on the first surface normal, in order to avoid collisions with nearby objects and to avoid damage to the object and the end effector during grasping.
Regarding claim 14, Vijayanarasimhan further discloses a robot wherein the output is generated over a single locally stored model of the at least one trained machine learning model (see fig 5, where a method of training of a grasping model based on training examples is shown. See also [0057], where “The data generated by sensor(s) associated with a robot and/or the data derived from the generated data may be stored in one or more non-transitory computer readable media local to the robot”; See also [0060], where “The training engine 120 trains one or more of the networks 125, 126, and 127 of semantic grasping model 124 based on the training examples of training examples database 117.”; see also [0088], where “generate output over the trained semantic CNN, and utilize the output to determine one or more grasped object features of the object being grasped.”; see also fig 7, where a grasping model is utilized for grasping.).
Regarding claim 15, Vijayanarasimhan further discloses a robot wherein the vision data lacks a depth channel (see [0050], where “the vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s)”; See also [0057], where “In some implementations, the current image may include multiple channels, such as a red channel, a blue channel, a green channel, and/or a depth channel.”; Therefore, gleaning from the teachings of Vijayanarasimhan, a variety of image data including at least said shape, color and/or depth, may be derived from one or more sensors. As such, Examiner contends wherein said image data (i.e. vision data) may not include depth data, and therefore the teachings of Vijayanarasimhan satisfy the currently provided claim limitation.).
Regarding claim 16, Vijayanarasimhan further discloses a robot wherein the vision data processed using the single model comprises the group of 3D points (see fig 1A, where images (vision data) are input to the grasp CNN. See also [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the concatenated image and the candidate end effector motion vector to generate end effector motion output. Output is generated by applying image to CNN (processing the data and generating output).) without the depth channel (see [0050], where “the vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s)”; See also [0057], where “In some implementations, the current image may include multiple channels, such as a red channel, a blue channel, a green channel, and/or a depth channel.”; Therefore, gleaning from the teachings of Vijayanarasimhan, a variety of image data including at least said shape, color and/or depth, may be derived from one or more sensors. As such, Examiner contends wherein said image data (i.e. vision data) may not include depth data, and therefore the teachings of Vijayanarasimhan satisfy the currently provided claim limitation.).  
Regarding claim 17, Vijayanarasimhan further discloses a non-transitory computer readable storage medium storing instructions executable by a processor to perform a method (see [0030], where “Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor ( e.g., a central processing unit (CPU) or graphics processing unit (GPU)) to perform a method such as one or more of the methods described above and/or elsewhere herein.”), the method comprising: 
receiving a group of three-dimensional (3D) data points generated by a vision component of a robot, the group of 3D data points capturing an object in an environment of a robot (see [0050], where “Vision sensors 184A and 184B are sensors that can generate images related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the sensors. The vision sensors 184A and 184B may be, for example, monographic cameras, stereographic cameras, and/or 3D laser scanners.”; see also [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; vision sensors are generating images of an object in the environment considered for grasping.);
applying vision data as input to at least one trained machine learning model (see [0065], where “One or more images 161 are applied as input to the grasp CNN 125”; see also [0048], where “FIG. 1A illustrates an example environment in which grasp attempts may be performed by robots (e.g., robots 180A, 180B, and/or other robots), data associated with the grasp attempts may be utilized to generate training examples, and/or the training examples may be utilized to train various networks 125, 126, and/or 127 of the semantic grasping model 124.”), the vision data being based on the group of 3D data points, or being generated by an additional vision component of the robot, the vision data capturing the object in the environment of the robot (see [0131], where “At block 754, the system identifies a current image that captures the end effector and one or more environmental objects. In some implementations, the system also identifies an additional image that at least partially omits the end effector, such as an additional image of the environmental objects that was captured by a vision sensor when the end effector was at least partially out of view of the vision sensor. In some implementations, the system concatenates the image and the additional image to generate a concatenated image. In some implementations, the system optionally performs processing of the image(s) and/or concatenated image (e.g., to size to an input of the grasp CNN).”; the system identifies an image that captures the end effector and sizes or otherwise processes it for input to the grasp CNN);
processing the vision data using the trained machine learning model to generate output defining one or more grasp regions and, for each of the one or more grasp regions, a corresponding semantic indication (see [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the concatenated image and the candidate end effector motion vector to generate end effector motion output. Output is generated by applying image to CNN (processing the data and generating output). see also [0005], where “For example, a user may provide user interface input (e.g., spoken, typed) that indicates a desire to grasp an object having one or more particular object feature(s) and the robot may utilize the trained networks to attempt a grasp”; see also [0019], where “Each of the training examples further include training example output that includes: at least one grasped object label indicating a semantic feature of an object grasped by the corresponding grasp attempt.”); 
selecting a grasp region, of the one or more grasp regions, based on the grasp region corresponding to the object and the object being selected for grasping (see [0066], where “A grasp measure 177 and STN parameters 178 are generated over the grasp CNN 125 based on the applied image(s) 161 and end effector motion vector 162.”; see also [0068], where “The spatial transformation 179 is applied as input to the semantic CNN 127 and semantic feature(s) 180 are generated over the semantic CNN 127 based on the applied spatially transformed image 179. For example, the semantic feature(s) 180 may indicate to which of one or more classes an object in the spatially transformed image 179 belongs such as classes of "eraser", "scissors", "comb", "shovel", "torch", "toy", "hairbrush", and/or other class( es) of greater or lesser granularity.”; the STN parameters correlate to transformation parameters for a cropped (i.e. specified, grasp, etc.) region and provide an additional step for semantic classification by filtering the unnecessary section of input image.); 
selecting, based on the semantic indication of the grasp region, a particular grasp strategy of a plurality of candidate grasp strategies (see fig 7, where a flowchart is showing a semantic grasping model, block 762 is generating semantic features over semantic CNN. See also [0138], where “the system may generate one or more additional candidate end effector motion vectors …and generate: measures of successful grasps…In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”); 
determining an end effector pose for interacting with the object to grasp the object (see [0136], where “At block 764, the system generates an end effector command based on the measure of a successful grasp of block 758 and the semantic feature(s) of block 762. Generally, at block 764, the system generates an end effector command that seeks to achieve (through one or more iterations of method 700) a successful grasp that is of an object that has desired object semantic features.”; see also [0139], where “For example, if one or more comparisons of the current measure of successful grasp to the measure of successful grasp determined at block 758 fail to satisfy a threshold, and the current semantic feature(s) indicate the desired object semantic features, then the end effector motion command may be a "grasp command" that causes the end effector to attempt a grasp (e.g., close digits of an impactive gripping end effector). For instance, if the result of the current measure of successful grasp divided by the measure of successful grasp determined at block 758 for the candidate end effector motion vector that is most indicative of successful grasp is greater than or equal to a first threshold (e.g., 0.9), the end effector command may be a grasp command (under the rationale of stopping the grasp early if closing the gripper is nearly as likely to produce a successful grasp as moving it).”; see also [0092], where “block 460, the system generates an end effector motion vector for the instance based on the pose of the end effector at the instance and the pose of the end effector at the final instance of the grasp attempt.”; the system generates command for end effector that seeks to achieve a successful grasp or pose at object grasp.), wherein determining the end effector pose comprises: 
selecting a grasp candidate (see [0138], where “the system may generate one or more additional candidate end effector motion vectors at block 752, and generate: measures of successful grasps for those additional candidate end effector motion vectors at additional iterations of block 758 (based on applying the current image and the additional candidate end effector motion vectors to the grasp CNN); and semantic feature(s) for those additional candidate end effector motion vectors at additional iterations of block 762. The additional iterations of blocks 758 and 762 may optionally be performed in parallel by the system. In some of those implementations, the system may generate the end effector command at block 764 based on analysis of all generated measures of successful grasp and corresponding semantic feature(s).”; a grasp candidate is selected by iterating through the CNN.); and
providing, to actuators of the robot, commands that cause an end effector of the robot to traverse to the end effector pose in association with attempting a grasp of the object (see fig 7, block 768, where grasp command is implemented for end effector to attempt a grasp.).
Vijayanarasimhan does not disclose the following limitations:
wherein determining the end effector pose comprises: 
selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region; 
selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and
 in response to selecting the first surface normal, determining the end effector pose based on the first surface normal.
However, Bradski further discloses a method, wherein determining the end effector pose comprises: 
selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region defined by output generated by a computing device (see [0098], where “grasp point (which may also be referred to as pick points) may be identified by finding surface normals”; see also [0122], where “In particular, this is demonstrated by the surface normals shown in FIG. 6B, where only surface normals (i.e., potential grasp points) that can be reached by an approach path of the robotic arm 602 are shown”; see also [0136], where “Planning a collision-free path may involve determining the "virtual" location of objects and surfaces in the environment.”; see also [0156], where “virtual grasps”; see also [0031], where “In many cases, a robotic arm maybe equipped with various sensors that allow for analysis of the object and the environment, as well as 3D ("virtual") reconstruction of the object and the environment.”; sensors attached to the robotic device generate a 3D view of the object. A grasp point is then identified by finding surface normals of the object. One surface normal is picked from multiple surface normals based on a collision-free grasp path. Thus, a grasp point (3D point) is selected from multiple 3D points (grasp points).);
selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region (see fig 6A-C, where one grasp point is selected from multiple grasp points. see also [0119], where “potential grasp points on the physical object may be identified by finding spatial clusters of coherently oriented surface normals. In particular, the system may make use of regular organization of surface normals as a good grasp point for suction or electrostatic grippers to pick up an object. Such areas of coherently oriented surface normals are often points of low surface curvature (also referred to as "flat spots") that work well for suction or electrostatic grabbing devices.”; see also [0150], where “the selection of a grasp point is shown in FIG. 6B. FIG. 6B shows an object in the bin 610 together with surface normals on that object produced by processing depth data from a depth sensing device.”); and
 in response to selecting the first surface normal, determining the end effector pose based on the first surface normal (see [0120], where “As shown in FIG. 6B, the robotic arm 602 is on an approach path towards the object 608, where the object 608 exhibits surface normals that represent flat spots on the object 608.”; see also [0121], where “the surface normals (i.e., the graspable features) may allow for estimation of approach trajectories.”). 
Vijayanarasimhan teaches a method for applying vision data of an object to be grasped as input to a CNN (trained machine learning model) and selecting a grasp candidate (see citation above). Bradski teaches a method in which a computing device selects a grasp point on an object from multiple potential grasp points (surface normals) that can be reached by a robotic arm, the selected grasp point allowing a collision-free grasp (see citation above).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan to incorporate the teachings of Bradski by including the above features, namely: selecting, from the group of 3D points, at least a first 3D point and a second 3D point, based on the first 3D point and the second 3D point being within the grasp region; selecting a first surface normal, that is determined based on the first 3D point, in lieu of a second surface normal, that is determined based on the second 3D point, wherein selecting the first surface normal is based on the first surface normal conforming to a grasp approach direction defined by the particular grasp strategy that is selected based on the semantic indication of the grasp region; and in response to selecting the first surface normal, determining the end effector pose based on the first surface normal, for avoiding collisions with nearby objects and avoiding damage to the object and end effector during grasping.
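For illustration only, the combined limitation addressed in this rejection (selecting one surface normal in lieu of another because it conforms to a grasp approach direction, then determining an end effector pose from it) can be sketched as follows. The sketch is not taken from Vijayanarasimhan, Bradski, or the present application. The function names, the angular tolerance, the standoff distance, and the convention that a conforming normal points opposite to the approach direction are all assumptions made for this example.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(c / length for c in v)

def angle_between(a, b):
    """Angle in radians between two 3D vectors."""
    ua, ub = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(ua, ub))
    return math.acos(max(-1.0, min(1.0, dot)))

def select_conforming_normal(candidates, approach_direction, max_angle_rad):
    """Pick the (point, normal) pair whose normal best opposes the approach.

    Assumption: a normal "conforms" when it is within max_angle_rad of the
    negated approach direction. Returns None if no candidate conforms.
    """
    target = tuple(-c for c in approach_direction)
    best, best_angle = None, None
    for point, normal in candidates:
        angle = angle_between(normal, target)
        if angle <= max_angle_rad and (best_angle is None or angle < best_angle):
            best, best_angle = (point, normal), angle
    return best

def pose_from_normal(point, normal, standoff):
    """Hypothetical pose: a position offset along the normal, plus the
    axis along which the end effector would move toward the surface."""
    n = normalize(normal)
    position = tuple(p + standoff * c for p, c in zip(point, n))
    approach_axis = tuple(-c for c in n)
    return position, approach_axis
```

Under these assumptions, a top-down strategy with approach direction (0, 0, -1) would select an upward-facing normal over a sideways-facing one, and the pose would be placed a standoff distance above the selected 3D point.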

Claim(s) 2 and 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 2017/0252924 (“Vijayanarasimhan”), in view of US 2016/0221187 (“Bradski”), as applied to claims 1 and 9 above, and further in view of US 2018/0364731 (“Liu”).
Regarding claim 2, Vijayanarasimhan further discloses a method wherein the vision data processed using the trained machine learning model to generate the output comprises (see [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the image to CNN to generate end effector motion output.). 
Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data. 
However, Liu discloses a method wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data (see [0173], where “At step 1260, the extracted new features are matched with features from the previous frame and reprojected feature positions from the 3D map onto a 2D view from a perspective of the propagated pose, producing a list of matching features.”; 3D map is converted to 2D map. see also fig 10, where image is input to a processor).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Liu by including the above feature, wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data, for improving positional accuracy by converting the sensor data into 2D data.
Regarding claim 10, Vijayanarasimhan further discloses a robot wherein the vision data processed using the trained machine learning model to generate the output comprises (see [0132], where “At block 756, the system applies the current image and the candidate end effector motion vector to a trained grasp CNN. For example, the system may apply the concatenated image, that includes the current image and the additional image, to an initial layer of the trained grasp CNN.”; the system applies the image to CNN to generate end effector motion output.). 
Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data. 
However, Liu further discloses a system wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data (see [0173], where “At step 1260, the extracted new features are matched with features from the previous frame and reprojected feature positions from the 3D map onto a 2D view from a perspective of the propagated pose, producing a list of matching features.”; 3D map is converted to 2D map. see also fig 10, where image is input to a processor).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Liu by including the above feature, wherein the vision data processed…to generate the output comprises two-dimensional (2D) vision data, for improving positional accuracy by converting the sensor data into 2D data.
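For illustration only, the conversion of 3D positions into a 2D view relied upon above can be sketched with a simple pinhole projection. Liu does not state which camera model is used, so the pinhole model, the intrinsic parameters (fx, fy, cx, cy), the function names, and the assumption that the 3D points are already expressed in the camera frame are all assumptions made for this example.

```python
def project_point(point_3d, fx, fy, cx, cy):
    """Project one camera-frame 3D point to 2D pixel coordinates.

    Assumes a pinhole model with focal lengths fx, fy and principal
    point (cx, cy). Points at or behind the camera plane return None.
    """
    x, y, z = point_3d
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def project_points(points_3d, fx, fy, cx, cy):
    """Project a group of 3D points, skipping those that are not visible."""
    projected = []
    for point in points_3d:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is not None:
            projected.append(pixel)
    return projected
```

The resulting list of pixel coordinates is one possible form of 2D vision data derived from a group of 3D points.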

Claim(s) 3-5 and 11-13 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 2017/0252924 (“Vijayanarasimhan”), and in view of US 2016/0221187 (“Bradski”), as applied to claims 1 and 9 above, and further in view of US 2019/0361672 (“Odhner”).
Regarding claim 3, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object. 
However, Odhner discloses a method wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes applied force data. See also [0102], where “how much force should be generated by the end effector 212 to grasp the item”; see also [0141], where “which grasping technique (s) or strategies are at least likely to produce a successful grasp attempt on a particular item”).
Odhner teaches a method for assessing a robotic grasping technique to determine whether the robotic manipulator successfully grasps the item, so as to improve techniques implemented by robotic manipulators (see abstract and [0010]). Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
Regarding claim 4, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a grasp type to be performed by the end effector. 
However, Odhner further discloses a method wherein the particular grasp strategy defines a grasp type to be performed by the end effector (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes how the robotic manipulator executes the grasp attempt. See also [0071], where “In some embodiments, such as in the photograph 100 of FIG. 1, the end effector 212 may be configured as a hand device with a plurality of "finger'' portions for grasping or otherwise interacting with an item.”; See also [0072] where “In other embodiments, the end effector 212 may be configured as a suction device.”; finger or suction is interpreted as grasp type.).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a grasp type to be performed by the end effector, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
Regarding claim 5, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a grasp type to be performed by the end effector. 
However, Odhner further discloses a method wherein the particular grasp strategy defines a grasp type to be performed by the end effector (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes how the robotic manipulator executes the grasp attempt. See also [0071], where “In some embodiments, such as in the photograph 100 of FIG. 1, the end effector 212 may be configured as a hand device with a plurality of "finger'' portions for grasping or otherwise interacting with an item.”; See also [0072] where “In other embodiments, the end effector 212 may be configured as a suction device.”; finger or suction is interpreted as grasp type.).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a grasp type to be performed by the end effector, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
Regarding claim 11, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object.
However, Odhner further discloses a robot wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes applied force data. See also [0102], where “how much force should be generated by the end effector 212 to grasp the item”; see also [0141], where “which grasping technique (s) or strategies are at least likely to produce a successful grasp attempt on a particular item”).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a degree of force to apply in attempting the grasp of the object, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
Regarding claim 12, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a grasp type to be performed by the end effector.  
However, Odhner further discloses a robot wherein the particular grasp strategy defines a grasp type to be performed by the end effector (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes how the robotic manipulator executes the grasp attempt. See also [0071], where “In some embodiments, such as in the photograph 100 of FIG. 1, the end effector 212 may be configured as a hand device with a plurality of "finger'' portions for grasping or otherwise interacting with an item.”; See also [0072] where “In other embodiments, the end effector 212 may be configured as a suction device.”; finger or suction is interpreted as grasp type.).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a grasp type to be performed by the end effector, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
Regarding claim 13, Vijayanarasimhan in view of Bradski does not disclose the following limitation:
wherein the particular grasp strategy defines a grasp type to be performed by the end effector.  
However, Odhner further discloses a robot wherein the particular grasp strategy defines a grasp type to be performed by the end effector (see [0100], where “In the context of the present application, "grasping strategy" or "grasp strategy" may refer to the steps, movements, applied force(s), approach, and other characteristics that define how the robotic manipulator 204 executes a grasping attempt.”; grasp strategy includes how the robotic manipulator executes the grasp attempt. See also [0071], where “In some embodiments, such as in the photograph 100 of FIG. 1, the end effector 212 may be configured as a hand device with a plurality of "finger'' portions for grasping or otherwise interacting with an item.”; See also [0072] where “In other embodiments, the end effector 212 may be configured as a suction device.”; finger or suction is interpreted as grasp type.).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to have modified Vijayanarasimhan in view of Bradski to incorporate the teachings of Odhner by including the above feature, wherein the particular grasp strategy defines a grasp type to be performed by the end effector, to improve techniques for successful grasp implemented by robotic manipulators by avoiding any damage to the object and end effector.
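For illustration only, a grasp strategy that defines a degree of force, a grasp type, and an approach direction, and that is selected from a semantic indication, can be sketched as a small lookup structure. The sketch is not taken from Odhner, Vijayanarasimhan, Bradski, or the present application. The semantic labels ("handle", "flat_surface"), the force values, the field names, and the fallback strategy are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from enum import Enum

class GraspType(Enum):
    FINGER = "finger"    # hand-like end effector with finger portions
    SUCTION = "suction"  # suction end effector

@dataclass(frozen=True)
class GraspStrategy:
    grasp_type: GraspType
    max_force_newtons: float   # hypothetical degree of force
    approach_direction: tuple  # unit vector in the robot frame

# Hypothetical mapping from semantic indication to grasp strategy.
CANDIDATE_STRATEGIES = {
    "handle": GraspStrategy(GraspType.FINGER, 15.0, (-1.0, 0.0, 0.0)),
    "flat_surface": GraspStrategy(GraspType.SUCTION, 5.0, (0.0, 0.0, -1.0)),
}

# Fallback used when the semantic indication is not recognized (an assumption).
DEFAULT_STRATEGY = GraspStrategy(GraspType.FINGER, 10.0, (0.0, 0.0, -1.0))

def select_strategy(semantic_indication):
    """Return the strategy for a semantic indication, or the default."""
    return CANDIDATE_STRATEGIES.get(semantic_indication, DEFAULT_STRATEGY)
```

In such a structure, the selected strategy carries the force limit and grasp type together with the approach direction against which candidate surface normals could be compared.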
Examiner Note
The following references are not relied upon in the current rejection but are relevant to the claimed invention:
US 9,669,543 (“Stubbs”) discloses a robotic grasp management system.
US 2020/0094405 (“Davidson”) discloses a method for robotic grasp generation using CNN.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOHANA TANJU KHAYER whose telephone number is (408)918-7597.  The examiner can normally be reached on Monday - Thursday, 7 am-5.30 pm, PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abby Lin, can be reached at 571-270-3976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SOHANA TANJU KHAYER/             Examiner, Art Unit 3664