DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
Claims 1-20, as amended, are currently pending and have been considered below.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because each claim limitation uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function, and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: "sensor module configured to receive," "visual module configured to generate," "tactile module configured to generate" and "pose module configured to estimate" in claims 1 and 5.
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
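For illustration only, the three-prong test of MPEP § 2181, subsection I, as summarized above can be sketched as a simple conjunction of the three prongs. The class name, field names, and function name below are hypothetical and are not part of any USPTO tool; the inputs for the example limitation reflect the findings stated in this Office action, not an independent analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limitation:
    """Hypothetical record of the findings made for one claim limitation."""
    text: str
    uses_means_or_placeholder: bool  # prong (A): "means", "step", or a generic placeholder
    functional_language: bool        # prong (B): modified by functional language
    sufficient_structure: bool       # prong (C) is met only when this is False


def interpreted_under_112f(lim: Limitation) -> bool:
    """Return True when all three prongs are met for the limitation."""
    return (
        lim.uses_means_or_placeholder
        and lim.functional_language
        and not lim.sufficient_structure
    )


# Findings as stated in this action: "module" is treated as a generic
# placeholder, "configured to" links it to a function, and no sufficient
# structure is recited.
sensor = Limitation("sensor module configured to receive", True, True, False)
print(interpreted_under_112f(sensor))
```

Under this sketch, a limitation that recites sufficient structure for performing the claimed function fails prong (C), which corresponds to option (1) or (2) available to applicant above.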


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-3, 7-9, 13-16 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee MA, Zhu Y, Zachares P, Tan M, Srinivasan K, Savarese S, Fei-Fei L, Garg A, Bohg J. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics. 2020 June 4;36(3):582-96, hereinafter, “Lee”, and further in view of Bimbo, Joao, et al. "Object pose estimation and tracking by fusing visual and tactile information." 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). IEEE, 2012, hereinafter, “Bimbo”.

As per claim 1, Lee discloses a system for visuo-tactile object pose estimation, comprising: 
a sensor module configured to: 
receive image data about an object in an environment (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); 
receive depth data about the object in the environment (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); and 
receive tactile data about the object in the environment (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: ... raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce)); 
a visual module configured to generate a visual estimate of the object based on the image data and the depth data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For visual feedback, we normalize the image pixels and use a six-layer convolutional neural network (CNN) to encode 128 × 128 × 3 RGB images. For depth feedback, we use an eighteen-layer CNN with 3 × 3 convolutional filters of increasing depths similar to VGG-16 to encode 128 × 128 × 1 depth images. We add a single fully connected layer to the end of both the depth and RGB encoders to transform the final activation maps into a 2 × d-dimensional variational parameter vector); 
a tactile module configured to generate a tactile estimate of the object based on the tactile data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a 32 × 6 time series and perform five-layer causal convolutions with stride 2 to transform the force readings into a 2 × d-dimensional variational parameter vector); and 
a pose module configured to estimate a pose of the robot manipulator / end effector / peg object based on the visual estimate and the tactile estimate (Lee, page 585, Figure 2, end-effector pose predictor, end effector pose; Lee, page 586, B. Modality Encoders, vectors that represent each modality are fused into one vector; Lee, pages 586-587, D. Self-Supervised Predictions and Decoder Architecture, Given the next robot action and the compact representation of the current sensory data, the model has to predict ... the future end-effector 4-DoF pose … we are also predicting action-conditional end-effector positions … The contact predictor is a one-layer MLP and performs binary classification. The end-effector prediction network is a four-layer MLP that predicts the next-step position and roll angle of the end-effector).
Lee further discloses tracking the peg’s position and orientation relative to the peg hole (Lee, page 589, C. Reward Design, The stages are reaching (r), aligning (a), inserting (i), and completed (c) ... where s = (sx, sy, sz) denotes the peg’s current relative position to the peg hole and sψ is the current relative orientation along the z-axis of the peg in relation to the peg hole), but does not explicitly disclose the following limitation as further recited. However, Bimbo discloses:
estimate a pose of the object based on the visual estimate and the tactile estimate (Bimbo, Abstract, This paper presents a method to estimate a grasped object’s 6D pose by fusing sensor data from vision, tactile sensors and joint encoders. Given an initial pose acquired by the vision system and the contact locations on the fingertips, an iterative process optimises the estimation of the object pose by finding a transformation that fits the grasped object to the finger tip).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Lee so that the multi-modal sensing architecture of Lee estimates the pose of the object being manipulated by the robot, as distinct from the pose of the robot manipulator, depending on the robot task, as taught by Bimbo, in order to further refine the knowledge of the object’s location within the robotic manipulator (Bimbo, Abstract).
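For illustration only, the arithmetic behind the haptic encoder quoted from Lee above (a 32 × 6 F/T time series passed through five causal convolution layers with stride 2) can be sketched as follows. The sketch assumes left-only (causal) padding of kernel size minus one, so that each layer's output length is the input length divided by the stride, rounded up; Lee's kernel size and padding scheme are not quoted in this action, and the function names are hypothetical.

```python
import math


def causal_conv_out_len(length: int, stride: int) -> int:
    """Output length of a causal 1-D convolution.

    Assumption: the input is left-padded by (kernel_size - 1), so the
    unstrided length is preserved and only the stride shortens the series.
    """
    return math.ceil(length / stride)


def haptic_encoder_lengths(window: int = 32, layers: int = 5, stride: int = 2) -> list[int]:
    """Time-series length after each of the stacked stride-2 layers."""
    lengths = [window]
    for _ in range(layers):
        lengths.append(causal_conv_out_len(lengths[-1], stride))
    return lengths


print(haptic_encoder_lengths())  # [32, 16, 8, 4, 2, 1]
```

Under these assumptions, the 32-reading window collapses to a single time step after the fifth layer, which is consistent with the quoted passage stating that the force readings are transformed into a single 2 × d-dimensional variational parameter vector.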

As per claim 2, Lee and Bimbo disclose the system of claim 1, wherein the image data is RGB data from a first optical sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera; Lee, page 586, B. Modality Encoders, a fixed RGB-D camera).

As per claim 3, Lee and Bimbo disclose the system of claim 1, wherein the depth data is received from a second sensor that is a ranging sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera).

As per claim 7, Lee discloses a method for visuo-tactile object pose estimation, comprising: 
receiving image data about an object in an environment (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); 
receiving depth data about the object (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); 
generating a visual estimate of the object based on the image data and the depth data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For visual feedback, we normalize the image pixels and use a six-layer convolutional neural network (CNN) to encode 128 × 128 × 3 RGB images. For depth feedback, we use an eighteen-layer CNN with 3 × 3 convolutional filters of increasing depths similar to VGG-16 to encode 128 × 128 × 1 depth images. We add a single fully connected layer to the end of both the depth and RGB encoders to transform the final activation maps into a 2 × d-dimensional variational parameter vector); 
receiving tactile data about the object (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: ... raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce)); 
generating a tactile estimate of the object based on the tactile data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a 32 × 6 time series and perform five-layer causal convolutions with stride 2 to transform the force readings into a 2 × d-dimensional variational parameter vector); and 
estimating a pose of the robot manipulator / end effector / peg object based on the visual estimate and the tactile estimate (Lee, page 585, Figure 2, end-effector pose predictor, end effector pose; Lee, page 586, B. Modality Encoders, vectors that represent each modality are fused into one vector; Lee, pages 586-587, D. Self-Supervised Predictions and Decoder Architecture, Given the next robot action and the compact representation of the current sensory data, the model has to predict ... the future end-effector 4-DoF pose … we are also predicting action-conditional end-effector positions … The contact predictor is a one-layer MLP and performs binary classification. The end-effector prediction network is a four-layer MLP that predicts the next-step position and roll angle of the end-effector).
Lee further discloses tracking the peg’s position and orientation relative to the peg hole (Lee, page 589, C. Reward Design, The stages are reaching (r), aligning (a), inserting (i), and completed (c) ... where s = (sx, sy, sz) denotes the peg’s current relative position to the peg hole and sψ is the current relative orientation along the z-axis of the peg in relation to the peg hole), but does not explicitly disclose the following limitation as further recited. However, Bimbo discloses:
estimating a pose of the object based on the visual estimate and the tactile estimate (Bimbo, Abstract, This paper presents a method to estimate a grasped object’s 6D pose by fusing sensor data from vision, tactile sensors and joint encoders. Given an initial pose acquired by the vision system and the contact locations on the fingertips, an iterative process optimises the estimation of the object pose by finding a transformation that fits the grasped object to the finger tip).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Lee so that the multi-modal sensing architecture of Lee estimates the pose of the object being manipulated by the robot, as distinct from the pose of the robot manipulator, depending on the robot task, as taught by Bimbo, in order to further refine the knowledge of the object’s location within the robotic manipulator (Bimbo, Abstract).

As per claim 8, Lee and Bimbo disclose the method of claim 7, wherein the image data is RGB data from a first optical sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera; Lee, page 586, B. Modality Encoders, a fixed RGB-D camera).

As per claim 9, Lee and Bimbo disclose the method of claim 7, wherein the depth data is received from a second sensor that is a ranging sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera).

As per claim 13, Lee and Bimbo disclose the method of claim 7, wherein the pose of the object defines a location of the object in a three-dimensional space of the environment (Bimbo, page 65, A. Vision, The vision system provides the 3D shape and the initial pose of the target object).

As per claim 14, Lee discloses a non-transitory computer readable storage medium storing instructions that when executed by a computer having a processor (Lee, page 589, D. Implementation Details, The representation models are trained for 20 epochs on a Quadro P5000 GPU before starting policy learning) to perform a method for visuo-tactile object pose estimation, the method comprising: 
receiving image data about an object in an environment (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); 
receiving depth data about the object (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera); 
generating a visual estimate of the object based on the image data and the depth data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For visual feedback, we normalize the image pixels and use a six-layer convolutional neural network (CNN) to encode 128 × 128 × 3 RGB images. For depth feedback, we use an eighteen-layer CNN with 3 × 3 convolutional filters of increasing depths similar to VGG-16 to encode 128 × 128 × 1 depth images. We add a single fully connected layer to the end of both the depth and RGB encoders to transform the final activation maps into a 2 × d-dimensional variational parameter vector); 
receiving tactile data about the object (Lee, page 585, Figure 2. The network takes data from four different sensors as input: RGB images, depth map, F/T readings over a 32 ms window, and end-effector position, orientation, and velocity; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: ... raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce)); 
generating a tactile estimate of the object based on the tactile data (Lee, page 585, IV. Multimodal Representation Model, Fig. 2 visualizes the proposed representation learning model, which uses neural network encoders to learn features from raw sensory inputs and neural network decoders to predict; Lee, page 586, B. Modality Encoders, Our model encodes four types of sensory data available to the robot: RGB (oRGB) and depth images (odepth) from a fixed RGB-D camera, raw haptic feedback from a wrist-mounted force–torque (F/T) sensor (oforce), and proprioceptive data ... The heterogeneous nature of this data requires domain-specific encoders to capture the unique characteristics of each modality … By using domain-specific encoders, we can avoid engineering how to normalize and combine the raw sensory inputs and features from each modality … For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a 32 × 6 time series and perform five-layer causal convolutions with stride 2 to transform the force readings into a 2 × d-dimensional variational parameter vector); and 
estimating a pose of the robot manipulator / end effector / peg object based on the visual estimate and the tactile estimate (Lee, page 585, Figure 2, end-effector pose predictor, end effector pose; Lee, page 586, B. Modality Encoders, vectors that represent each modality are fused into one vector; Lee, pages 586-587, D. Self-Supervised Predictions and Decoder Architecture, Given the next robot action and the compact representation of the current sensory data, the model has to predict ... the future end-effector 4-DoF pose … we are also predicting action-conditional end-effector positions … The contact predictor is a one-layer MLP and performs binary classification. The end-effector prediction network is a four-layer MLP that predicts the next-step position and roll angle of the end-effector).
Lee further discloses tracking the peg’s position and orientation relative to the peg hole (Lee, page 589, C. Reward Design, The stages are reaching (r), aligning (a), inserting (i), and completed (c) ... where s = (sx, sy, sz) denotes the peg’s current relative position to the peg hole and sψ is the current relative orientation along the z-axis of the peg in relation to the peg hole), but does not explicitly disclose the following limitation as further recited. However, Bimbo discloses:
estimating a pose of the object based on the visual estimate and the tactile estimate (Bimbo, Abstract, This paper presents a method to estimate a grasped object’s 6D pose by fusing sensor data from vision, tactile sensors and joint encoders. Given an initial pose acquired by the vision system and the contact locations on the fingertips, an iterative process optimises the estimation of the object pose by finding a transformation that fits the grasped object to the finger tip).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Lee so that the multi-modal sensing architecture of Lee estimates the pose of the object being manipulated by the robot, as distinct from the pose of the robot manipulator, depending on the robot task, as taught by Bimbo, in order to further refine the knowledge of the object’s location within the robotic manipulator (Bimbo, Abstract).

As per claim 15, Lee and Bimbo disclose the non-transitory computer readable storage medium of claim 14, wherein the image data is RGB data from a first optical sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera; Lee, page 586, B. Modality Encoders, a fixed RGB-D camera).

As per claim 16, Lee and Bimbo disclose the non-transitory computer readable storage medium of claim 14, wherein the depth data is received from a second sensor that is a ranging sensor (Lee, page 589, B. Robot Environment Setup, On the real robot, we use the Kinect v2 camera).

As per claim 20, Lee and Bimbo disclose the non-transitory computer readable storage medium of claim 14, wherein the pose of the object defines a location of the object in a three-dimensional space of the environment (Bimbo, page 65, A. Vision, The vision system provides the 3D shape and the initial pose of the target object).


Claim(s) 4, 10 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee MA, Zhu Y, Zachares P, Tan M, Srinivasan K, Savarese S, Fei-Fei L, Garg A, Bohg J. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics. 2020 June 4;36(3):582-96, hereinafter, “Lee”, in view of Bimbo, Joao, et al. "Object pose estimation and tracking by fusing visual and tactile information." 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). IEEE, 2012, hereinafter, “Bimbo” as applied to claims 1, 7 and 14 above, and further in view of Calandra, Roberto, et al. "More than a feeling: Learning to grasp and regrasp using vision and touch." IEEE Robotics and Automation Letters 3.4 (2018): 3300-3307, hereinafter, “Calandra”.

As per claim 4, Lee and Bimbo disclose the system of claim 1, wherein the visual module employs a first convolutional neural network (CNN), wherein the tactile module employs a second CNN (Lee, page 586, B. Modality Encoders, For visual feedback, we normalize the image pixels and use a six-layer convolutional neural network (CNN) similar to FlowNet to encode 128 × 128 × 3 RGB images … We add a single fully connected layer to the end of both the depth and RGB encoders to transform the final activation maps into a 2 × d-dimensional variational parameter vector. For haptic feedback, we take the last 32 readings from the six-axis F/T sensor as a 32 × 6 time series and perform five-layer causal convolutions with stride 2 to transform the force readings into a 2 × d-dimensional variational parameter vector).
Lee and Bimbo do not explicitly disclose the following limitation as further recited. However, Calandra discloses:
wherein the pose module employs a fully connected CNN layer (Calandra, page 3303, A. End-to-End Outcome Prediction, Network Design: We process each image using a convolutional network ... we pass the neural network the difference of the GelSight images before and after contact. The action network is a multi-layer perceptron consisting of two fully-connected layers with 1024 hidden units each. This network takes as input vector representations of the action and pose).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Lee and Bimbo to include the fully connected layer as taught by Calandra in order to enable the receipt of various inputs into the pose module including vectors, images and signals (Calandra, page 3303, A. End-to-End Outcome Prediction).

Regarding claim(s) 10 and 17: 
Reasoning corresponding to that given earlier (see the rejection of claim 4) applies, mutatis mutandis, to the subject matter of claims 10 and 17, which are therefore also rejected on the grounds given in the rejection of claim 4.


Claim(s) 5, 6, 11, 12, 18 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee MA, Zhu Y, Zachares P, Tan M, Srinivasan K, Savarese S, Fei-Fei L, Garg A, Bohg J. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics. 2020 June 4;36(3):582-96, hereinafter, “Lee”, in view of Bimbo, Joao, et al. "Object pose estimation and tracking by fusing visual and tactile information." 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). IEEE, 2012, hereinafter, “Bimbo” as applied to claims 1, 7 and 14 above, and further in view of Gkanatsios N, Chalvatzaki G, Maragos P, Peters J. Orientation attentive robot grasp synthesis. arXiv preprint arXiv:2006.05123. 2020 Jun 9, hereinafter, “Gkanatsios”.

As per claim 5, Lee and Bimbo disclose the system of claim 1, and Bimbo further discloses creating regions on the object surface near the fingertip contact location (Bimbo, page 68, B. Computational Remarks, In order to enable the method to work in real-time, different strategies were employed to reduce the computational effort of the algorithm. The first simplification was based on the assumption that the first estimation from the vision system is near enough to allow the creation of regions on the object surface that are near to the fingertip contact location), but Lee and Bimbo do not explicitly disclose the following limitation as further recited. However, Gkanatsios discloses wherein the visual module is further configured to determine a region of interest (RoI) based on the image data (Gkanatsios, page 5, C. ORANGE: Orientation-attentive grasp synthesis, The proposed framework, ORANGE is depicted in Fig. 2. ORANGE is model-agnostic; it suffices to employ any CNN-based model that has the capacity to segment regions of interest. Assuming such a model, an initial depth image is processed to output an augmented grasp map G).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Lee and Bimbo to include grasp regions of interest as taught by Gkanatsios in order to improve robot grasp accuracy by estimating the grasping points over different object orientations (Gkanatsios, Abstract).

As per claim 6, Lee, Bimbo and Gkanatsios disclose the system of claim 5, wherein the RoI is determined based on an object segmentation neural network (Gkanatsios, page 5, C. ORANGE: Orientation-attentive grasp synthesis, The proposed framework, ORANGE is depicted in Fig. 2. ORANGE is model-agnostic; it suffices to employ any CNN-based model that has the capacity to segment regions of interest).

Regarding claim(s) 11 and 18: 
Reasoning corresponding to that given earlier (see the rejection of claim 5) applies, mutatis mutandis, to the subject matter of claims 11 and 18, which are therefore also rejected on the grounds given in the rejection of claim 5.

Regarding claim(s) 12 and 19: 
Reasoning corresponding to that given earlier (see the rejection of claim 6) applies, mutatis mutandis, to the subject matter of claims 12 and 19, which are therefore also rejected on the grounds given in the rejection of claim 6.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571)270-5189. The examiner can normally be reached M-F, 9:30AM TO 6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TRACY MANGIALASCHI/Examiner, Art Unit 2668

/VU LE/Supervisory Patent Examiner, Art Unit 2668