DETAILED ACTION
This action is responsive to the Application filed on 08/05/2019. Claims 1-20 are pending in the case. Claims 1, 10 and 17 are independent claims.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1-6 are rejected under 35 U.S.C. § 103 as being unpatentable over Lee et al., “Cross-Domain Image-Based 3D Shape Retrieval by View Sequence Learning” (hereinafter Lee), further in view of Nasiri et al., “Image-based deep learning automated sorting of date fruit” (hereinafter Nasiri).

As to independent claim 1
Lee teaches, A convolutional neural network (CNN) method comprising: generating, an anchor vector in a semantic space in response to an anchor image being provided to a first CNN, wherein the anchor image is a two-dimensional RGB image, ( Figure 1 caption “We propose a cross-domain image-based 3D shape retrieval. Given an input query image, our system automatically returns a list of similar 3D shapes by L2 distance. Our method learns a joint image and 3D shape embedding space.”  And Figure 2
    [Image: media_image1.png]
    [Image: media_image2.png]
 as shown in Figure 1 and Figure 2, an anchor image is provided to an Image CNN the output of the CNN is an image representation or feature vector which is a learnt descriptive vector in embedding space, this corresponds to an “anchor vector in a semantic space”.) generating a positive vector and negative vector in the semantic space in response to a negative image and a positive image being provided to a second CNN,  wherein the negative image is a first three-dimensional CAD image and the positive image is a second three-dimensional CAD image (Figure 2 and Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space…. The three streams in the triplet network are anchor image, positive shape, and negative shape (see Figure 2).” As can be seen in Figure 2 a positive CAD shape and negative CAD shape are provided to the network, the output includes a positive shape representation and negative shape representation, which corresponds to the positive vector and negative vector.) and applying a cross-domain deep metric learning algorithm that is operable to extract image features in the semantic space using the anchor vector, positive vector, and negative vector. (Introduction ¶02 “We propose a new cross-domain learning model to better generate the image and shape representations in a joint embedding space. Therefore, the similarities between images and 3D shapes can be effectively computed by the distances in this space.” As shown in figure 2, the 3d shapes are a positive and negative shape. Each input results in a shape or image vector embedding representation.)
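For illustration only, the retrieval behavior quoted from Lee's Figure 1 caption (returning 3D shapes ordered by L2 distance to a query image in a joint embedding space) can be sketched as follows. This is a minimal sketch, not Lee's implementation: the function names, the two-dimensional vectors, and the shape identifiers are hypothetical, and the embeddings are assumed to have already been produced by the respective CNNs.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_shapes(anchor_vec, shape_vecs):
    """Return shape ids ordered from nearest to farthest from the anchor vector."""
    return sorted(shape_vecs, key=lambda sid: l2_distance(anchor_vec, shape_vecs[sid]))


# Hypothetical embeddings: one query image (anchor) and three 3D shapes.
anchor = [0.0, 1.0]
shapes = {
    "chair_a": [0.1, 0.9],
    "table_b": [1.0, 0.0],
    "chair_c": [0.0, 0.5],
}
ranking = rank_shapes(anchor, shapes)
```

Under these assumed values the shape whose vector lies closest to the anchor vector is listed first, which is the sense in which similarity is "computed by the distances in this space."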
Lee does not explicitly teach, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer:
Nasiri however when addressing convolutional neural networks for image processing teaches, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer: (pg 136 “VGGNet consists of five various blocks which are set homogeneously and sequentially so that the output of each block is defined as the input of the next block (Fig. 2). By this architecture, the network extracts powerful features from the input images such as texture, shape, and color” 

    [Image: media_image3.png]
as shown in the figure, the CNN, which extracts features from a multi-channel image, contains each of the claimed layers.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Lee to include each of the layers described by Nasiri.  One would have been motivated to make such a combination because both Lee and Nasiri describe the use of a CNN for extracting features from images. The network used by Nasiri is based on the VGG network architecture which is common and performs well in image processing domains, Nasiri notes, “There are various dominant pre-trained structures of CNN which have been successfully trained by a major dataset of labeled images such as ImageNet with 1000 different classes… Common pre-trained CNN consists… VGG… The very depth VGGNet significantly outperforms the architectures which achieved the best results” (Section 2.2 Nasiri)
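As an illustrative aid, the claimed ordering of layers (convolution, max pooling, flattening, dropout, fully connected) can be traced by following how each layer changes the shape of its input. The sketch below is an assumption-laden example rather than the network of Nasiri or Lee: the 3x224x224 input, the 64 filters, the 3x3 kernel with padding 1, the 2x2 pooling window, and the 128 output features are all hypothetical values chosen for illustration.

```python
def conv2d_shape(shape, out_channels, kernel=3, stride=1, padding=1):
    """Output shape (channels, height, width) of a 2D convolution."""
    _, h, w = shape
    h_out = (h + 2 * padding - kernel) // stride + 1
    w_out = (w + 2 * padding - kernel) // stride + 1
    return (out_channels, h_out, w_out)


def max_pool_shape(shape, kernel=2, stride=2):
    """Output shape of a max pooling layer; the channel count is unchanged."""
    c, h, w = shape
    return (c, (h - kernel) // stride + 1, (w - kernel) // stride + 1)


def flatten_shape(shape):
    """Flattening collapses (channels, height, width) into one dimension."""
    c, h, w = shape
    return (c * h * w,)


def dropout_shape(shape):
    """Dropout zeroes some activations during training but keeps the shape."""
    return shape


def fully_connected_shape(shape, out_features):
    """A fully connected layer maps a flat vector to out_features values."""
    return (out_features,)


def trace_shapes(input_shape):
    """Record the shape after each layer in the claimed order."""
    trace = [("input", input_shape)]
    shape = conv2d_shape(input_shape, out_channels=64)
    trace.append(("conv", shape))
    shape = max_pool_shape(shape)
    trace.append(("max_pool", shape))
    shape = flatten_shape(shape)
    trace.append(("flatten", shape))
    shape = dropout_shape(shape)
    trace.append(("dropout", shape))
    shape = fully_connected_shape(shape, out_features=128)
    trace.append(("fully_connected", shape))
    return trace


trace = trace_shapes((3, 224, 224))
```

The final entry of the trace is the flat feature vector that would serve as the image representation in the combination discussed above.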

claim 2
Lee/Nasiri teaches claim 1
Further Lee teaches, wherein the cross-domain deep metric learning algorithm is a triplet loss algorithm that is operable to decrease a first distance between the anchor vector and the positive vector in the semantic space and increase a second distance between the anchor vector and the negative vector in the semantic space. (Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space. We use triplet neural network architecture and propose a fast triplet architecture to speed up the training. The goal of triplet neural network is to enforce the anchor negative distances at least farther than the anchor-positive distances by a certain margin: … where dpos is the anchor-positive distance and dneg is the anchor-negative distance. The three streams in the triplet network are anchor image, positive shape, and negative shape The triplet loss is defined as: …
    [Image: media_image4.png]
”  the loss function is minimized through training such that dneg is maximized and dpos is minimized, thus corresponding to increasing the negative distance and decreasing the positive distance.)
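By way of illustration, the relationship between the loss, dpos, and dneg can be sketched with the commonly used margin-based (hinge) form of the triplet loss. This form is an assumption: Lee's exact equation appears only in the image that is not reproduced here, and Lee may, for example, use squared distances. The margin value and the sample vectors below are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-form triplet loss: zero once d_neg exceeds d_pos by at least the margin."""
    d_pos = l2_distance(anchor, positive)  # anchor-positive distance
    d_neg = l2_distance(anchor, negative)  # anchor-negative distance
    return max(0.0, margin + d_pos - d_neg)


# Negative already far from the anchor: the constraint is satisfied, loss is zero.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Positive far and negative near: the constraint is violated, loss is positive.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Because the loss decreases only when dpos shrinks or dneg grows, minimizing it during training moves the positive vector toward the anchor vector and the negative vector away from it.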
claim 3
Lee/Nasiri teaches claim 1
Further Lee/Nasiri teaches, wherein the one or more first convolutional layers and one or more second convolutional layers are operable to apply one or more activation functions (Nasiri pg 136 column 1 ¶02 “VGG-16 is one of the two VGG architectures… which contains 13 convolutional layers… Features map is the output of each convolution or pooling layer. The activation function for each convolutional layers is the ReLU (Rectified Linear Unit) function” A feature map is the output of a convolutional layer, and each convolutional layer has an activation function, a ReLU. Lee describes at least two CNNs containing corresponding first and second layers.)

claim 4
Lee/Nasiri teaches claim 3
Further Lee/Nasiri teaches, wherein the one or more activation functions are implemented using a rectified linear unit. (Nasiri pg 136 column 1 ¶02 “VGG-16 is one of the two VGG architectures… which contains 13 convolutional layers… Features map is the output of each convolution or pooling layer. The activation function for each convolutional layers is the ReLU (Rectified Linear Unit) function” A feature map is the output of a convolutional layer, and each convolutional layer has an activation function, a ReLU. Lee describes at least two CNNs containing corresponding first and second layers.)
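For reference, a rectified linear unit replaces each negative value in a feature map with zero and passes non-negative values through unchanged. The following is a minimal sketch with a hypothetical 2x2 feature map; the function names are illustrative and are not taken from Nasiri.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_map(feature_map):
    """Apply ReLU element-wise to a 2D feature map given as a list of rows."""
    return [[relu(v) for v in row] for row in feature_map]


# Hypothetical convolution output before activation.
activated = relu_map([[-1.0, 2.0], [0.5, -3.0]])
```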

claim 5
Lee/Nasiri teaches claim 3
Further Lee teaches, wherein the first CNN and second CNN further include one or more normalization layers. (see figure 2 
    [Image: media_image5.png]
as shown, each CNN is followed by an L2 normalization layer)
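As a brief illustration, an L2 normalization layer rescales a feature vector so that its Euclidean length is one, which places the image and shape vectors on a common scale before distances are compared. The sketch below assumes a plain list of floats as the feature vector; the small epsilon guard against division by zero is an added assumption and is not drawn from Lee.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Hypothetical CNN output before normalization.
unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```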

claim 6
Lee/Nasiri teaches claim 3
Further Lee teaches, wherein the second CNN is designed using a Siamese network. (see figure 2, the second view-CNN network is connected to the cross-view convolution network, the weights of the negative and positive side are shared. Section 3.1 “Each rendered view is fed into a View-CNN,… The parameters from different View-CNNs are shared” because the parameters are shared the second CNN is a Siamese network)
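To illustrate what parameter sharing means in a Siamese arrangement, the sketch below passes the positive and negative inputs through one and the same encoder object, so both branches necessarily use identical weights. The single-matrix encoder is a hypothetical stand-in for a View-CNN, and the weights and input vectors are illustrative values only.

```python
class LinearEncoder:
    """Tiny stand-in for a View-CNN: one weight matrix applied to an input vector."""

    def __init__(self, weights):
        self.weights = weights

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


def siamese_forward(encoder, positive_input, negative_input):
    """Encode both inputs with the same encoder, i.e. with shared parameters."""
    return encoder(positive_input), encoder(negative_input)


shared = LinearEncoder([[1.0, 0.0], [0.0, 2.0]])
pos_vec, neg_vec = siamese_forward(shared, [1.0, 1.0], [2.0, 0.5])

# Updating the shared weights changes the output of both branches at once.
shared.weights = [[2.0, 0.0], [0.0, 2.0]]
pos_vec_after, neg_vec_after = siamese_forward(shared, [1.0, 1.0], [2.0, 0.5])
```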


Claim(s) 7 is rejected under 35 U.S.C. § 103 as being unpatentable over Lee/Nasiri, further in view of Yamashita et al., “multiple skip connections of dilated convolution network for semantic segmentation” (hereinafter Yamashita).

claim 7
Lee/Nasiri teaches claim 1
Lee/Nasiri does not explicitly teach, wherein the first CNN and second CNN employ a skip-connection architecture.
Yamashita when addressing skip connections for convolutional neural networks teaches, [a CNN] employ a skip-connection architecture (Related works ¶01 pg 1593 “The network of this task is based on these object recognition network structures. Fully Convolutional Network (FCN) employ pretrained VGG16 which is learned using Imagenet and fine-tune the network… It is adopted skip connections to capture global information of the entire image and local information of each class” a VGG16 is employed with additional skip connections corresponding to the claim language which uses CNNs which employ skip connections.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Lee/Nasiri to include skip connections as discussed in Yamashita.  One would have been motivated to make such a combination because both Lee/Nasiri and Yamashita employ a VGG-style convolutional neural network for extracting features from images. Yamashita reports that a VGG-based architecture modified with appropriate skip connections is competitive with state-of-the-art methods, noting “compared to the state of the art methods, our network achieves equivalent accuracy” (Conclusion, Yamashita).
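For illustration, a skip connection combines a layer's output with a signal taken from earlier in the network so that earlier information bypasses the intervening layer. The sketch below uses the simple additive identity form, which is an assumption; Yamashita and the cited Fully Convolutional Network may fuse feature maps from different depths in a different manner. The layer function and values are hypothetical.

```python
def skip_connection(layer_fn, x):
    """Add the layer's input back to the layer's output (additive identity skip)."""
    y = layer_fn(x)
    return [yi + xi for yi, xi in zip(y, x)]


def double(values):
    """Hypothetical layer that doubles every element of its input."""
    return [2.0 * v for v in values]


# Output is the layer result (2x) plus the skipped input (x), i.e. 3x.
combined = skip_connection(double, [1.0, -2.0])
```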

Claim(s) 8-9 are rejected under 35 U.S.C. § 103 as being unpatentable over Lee/Nasiri, further in view of Buras et al., US 10636323 B2 (hereinafter Buras).
claim 8
Lee/Nasiri teaches claim 1
Lee/Nasiri does not explicitly teach, performing step recognition by analyzing the image features extracted in the semantic space.
Buras when addressing using an image processing neural network system to perform a task based on automatic scene analysis teaches, performing step recognition by analyzing the image features extracted in the semantic space. (Column 10 line 66-column 11 line 8 “For example, the MLM 600 may cause the ARUI 300 to display a virtual image or video instructing the novice user to change the orientation of a probe to match a desired reference (e.g., expert) orientation, or may display a correct motion path to be taken by the novice user in repeating a prior reference motion, with color-coding to indicate portions of the novice user's prior path that were erroneous or sub-optimal. In some embodiments, the MLM 600 may cause the ARUI 300 to display only portions of the novice user's motion that must be corrected” The MLM displays corrective action to a user through an augmented reality display, the ARUI. Column 11 line 9-25 “MLM 600 also includes a fourth module that receives real-time data from the medical equipment system 200 itself (e.g., via an interface with computer 700) during a medical procedure performed by the novice user, and a fifth module that compares that data to stored reference outcome data from library 500. For example, the MLM 600 may receive image data from an ultrasound machine during use by a novice user... The MLM 600 further includes a sixth module that generates real-time outcome-based feedback based on the comparison performed in the fifth module” Column 17 line 63-66 “MLM 600 provides outcome-based feedback by comparing novice user ultrasound images and reference ultrasound images using a neural network.” Image features are extracted using a neural network. The features extracted by a neural network are descriptive of the content of the image and are thus extracted “in semantic space” as claimed.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to utilize the CNN discussed in Lee/Nasiri for the augmented reality guidance system disclosed by Buras.  One would have been motivated to make such a combination because both Lee/Nasiri and Buras employ convolutional neural networks for image feature extraction. Buras notes that “Classification is a common machine learning problem… Applicants have discovered that a number of specific steps are advisable to enable MLM 600 to have good performance in classifying ultrasound images to generate 3D AR feedback guidance that is useful for guiding novice users. These include care in selecting both the training set and the validation data set for the neural network, and specific techniques for optimizing the neural network's learning parameters” (column 19 line 8-18 Buras)

claim 9
Lee/Nasiri teaches claim 1
Lee/Nasiri does not explicitly teach, determining if an invalid repair sequence has occurred based on an analysis of the image features in the semantic space.
Buras when addressing using an image processing neural network system to perform a task based on automatic scene analysis teaches, determining if an invalid repair sequence has occurred based on an analysis of the image features in the semantic space. (Column 1 line 30-35 “In many medical situations, diagnostic or treatment of medical conditions, which may include life-saving care, must be provided by persons without extensive medical training. This may occur because trained personnel are either not present or are unable to respond. For example, temporary treatment of broken bones” The guidance system aids a user in performing repair sequences; in this context, a repair sequence includes steps to treat/repair human ailments. Column 9 line 34-44 “This feedback enables the novice user to correct mistakes or incorrect usage of the medical equipment and achieve an outcome similar to that of the expert user… the real-time 3D AR feedback may include… tactile information (e.g., vibrations or pulses when the novice user is in the correct or incorrect position)” The system is able to indicate invalid steps taken: when the system detects an incorrect position, it may vibrate or pulse. Column 16 line 8-12 “MLM 600 in the embodiment of FIG. 2 also provides outcome-based feedback based on comparing the ultrasound images generated in real-time by the novice user 50 to reference images” Column 18 line 8-10 “neural networks used in MLM 600 preferably include at least one convolutional layer, because image processing is the primary basis for outcome-based feedback.” Such feedback is based on the image analysis performed by the convolutional network, corresponding to the image features as claimed.)
For the reasons to combine Lee/Nasiri with Buras, see the rejection of claim 8.

Claim(s) 10, 12-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Buras/Lee/Nasiri.

As to independent claim 10
	Buras teaches, An augmented reality system comprising: (Abstract “A medical guidance system providing real-time, three-dimensional (3D) augmented reality (AR) feedback guidance to a novice user of medical equipment”) a visualization device operable to acquire one or more RGB images; (Column 18 line 40-44 “images from ultrasound system 210 must be converted to a standard format usable by the neural network …For example, ultrasound images captured by one type of ultrasound machine (FUS) are in the RGB24 image format” the ultrasound device captures RGB images to be processed by a neural network) and a controller operable to [process neural network features from images], (Figure 8 and Figure 2: the figures describe a method and system including a controller that receives images and processes them with a neural network.)
	Buras does not explicitly teach, responsive to an anchor image being provided to a first CNN, generating an anchor vector in a semantic space, wherein the anchor image is a two-dimensional RGB image, wherein the first CNN includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer; responsive to a negative image and positive image being provided to a second CNN, generating a positive vector and negative vector in the semantic space, wherein the negative image is a first three-dimensional CAD image and the positive image is a second three-dimensional CAD image, wherein the second CNN includes one or more second convolutional layers, one or more second max pooling layers, a second flattening layer, a second dropout layer, and a second fully connected layer; and apply a cross-domain deep metric learning algorithm that is operable to extract image features in the semantic space using the anchor vector, positive vector, and negative vector.
	Lee however when addressing the use of a convolutional neural network system for image processing and feature extraction teaches, responsive to an anchor image being provided to a first CNN, generating an anchor vector in a semantic space, wherein the anchor image is a two-dimensional RGB image, (Figure 1 caption “We propose a cross-domain image-based 3D shape retrieval. Given an input query image, our system automatically returns a list of similar 3D shapes by L2 distance. Our method learns a joint image and 3D shape embedding space.”  And Figure 2
    [Image: media_image1.png]
    [Image: media_image2.png]
 as shown in Figure 1 and Figure 2, an anchor image is provided to an Image CNN the output of the CNN is an image representation or feature vector which is a learnt descriptive vector in embedding space, this corresponds to an “anchor vector in a semantic space”.) responsive to a negative image and positive image being provided to a second CNN, generating a positive vector and negative vector in the semantic space, wherein the negative image is a first three-dimensional CAD image and the positive image is a second three-dimensional CAD image, (Figure 2 and Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space…. The three streams in the triplet network are anchor image, positive shape, and negative shape (see Figure 2).” As can be seen in Figure 2 a positive CAD shape and negative CAD shape are provided to the network, the output includes a corresponding positive shape representation and negative shape representation.) and apply a cross-domain deep metric learning algorithm that is operable to extract image features in the semantic space using the anchor vector, positive vector, and negative vector. (Introduction ¶02 “We propose a new cross-domain learning model to better generate the image and shape representations in a joint embedding space. Therefore, the similarities between images and 3D shapes can be effectively computed by the distances in this space.” As shown in figure 2, the 3d shapes are a positive and negative shape. Each input results in a shape or image vector representation.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Buras with the convolutional neural network described by Lee.  One would have been motivated to make such a combination because both Lee and Buras describe the use of a CNN for extracting features from images and blending 3d environments with 2d images. Lee notes “experiments show that there exists a large domain gap between images and 3D shape views that cannot be bridged with simple CNN features”; to remedy this, Lee proposes to “augment the original MVCNN with triplet network… our triplet MVCNN explicitly learns the cross domain image-shape pairs and improves the mAP to 40.85%” (Lee pg 263-264).
Buras/Lee does not explicitly teach, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer;
Nasiri however when addressing convolutional neural networks for image processing teaches, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer; (pg 136 “VGGNet consists of five various blocks which are set homogeneously and sequentially so that the output of each block is defined as the input of the next block (Fig. 2). By this architecture, the network extracts powerful features from the input images such as texture, shape, and color”

    [Image: media_image3.png]
as shown in the figure, the CNN, which extracts features from a multi-channel image, contains each of the claimed layers.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Lee to include each of the layers described by Nasiri.  One would have been motivated to make such a combination because both Lee and Nasiri describe the use of a CNN for extracting features from images. The network used by Nasiri is based on the VGG network architecture which is common and performs well in image processing domains, Nasiri notes, “There are various dominant pre-trained structures of CNN which have been successfully trained by a major dataset of labeled images such as ImageNet with 1000 different classes… Common pre-trained CNN consists… VGG… The very depth VGGNet significantly outperforms the architectures which achieved the best results” (Section 2.2 Nasiri)

claim 12
Buras/Lee/Nasiri teaches claim 10
Further Lee teaches, wherein the controller is further operable to decrease a first distance between the anchor vector and the positive vector in the semantic space and increase a second distance between the anchor vector and the negative vector in the semantic space ( Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space. We use triplet neural network architecture and propose a fast triplet architecture to speed up the training. The goal of triplet neural network is to enforce the anchor negative distances at least farther than the anchor-positive distances by a certain margin: … where dpos is the anchor-positive distance and dneg is the anchor-negative distance. The three streams in the triplet network are anchor image, positive shape, and negative shape The triplet loss is defined as: … 
    [Image: media_image4.png]
”  the loss function is minimized through training such that dneg is maximized and dpos is minimized, thus corresponding to increasing the negative distance and decreasing the positive distance.)

claim 13
Buras/Lee/Nasiri teaches claim 10
Further Lee teaches, wherein the controller is further operable to apply a post-processing image algorithm to the one or more RGB images. (See figure 2 and Section 3.3 “We use Image-CNN with the adaption layer for the anchor image stream” the adaption layer corresponds to the post-processing image algorithm, after the image is processed by the CNN the adaption layer further processes the anchor stream.)

claim 14
Buras/Lee/Nasiri teaches claim 10
Further Buras teaches, wherein the controller is further operable to determine a current step of a work procedure. (Column 10 line 66-column 11 line 8 “For example, the MLM 600 may cause the ARUI 300 to display a virtual image or video instructing the novice user to change the orientation of a probe to match a desired reference (e.g., expert) orientation, or may display a correct motion path to be taken by the novice user in repeating a prior reference motion, with color-coding to indicate portions of the novice user's prior path that were erroneous or sub-optimal. In some embodiments, the MLM 600 may cause the ARUI 300 to display only portions of the novice user's motion that must be corrected” The MLM displays corrective action to a user through an augmented reality display, the ARUI, and that corrective action reflects the real-time, or current, step in a work procedure.)

claim 15
Buras/Lee/Nasiri teaches claim 10
Further Buras teaches, wherein the controller is further operable to display instructions to the visualization device based on the current step of the work procedure. (Column 10 line 66-column 11 line 8 “For example, the MLM 600 may cause the ARUI 300 to display a virtual image or video instructing the novice user to change the orientation of a probe to match a desired reference (e.g., expert) orientation, or may display a correct motion path to be taken by the novice user in repeating a prior reference motion, with color-coding to indicate portions of the novice user's prior path that were erroneous or sub-optimal. In some embodiments, the MLM 600 may cause the ARUI 300 to display only portions of the novice user's motion that must be corrected” The MLM displays corrective action to a user through an augmented reality display, the ARUI, and that corrective action reflects the real-time, or current, step in a work procedure.)

claim 16
Buras/Lee/Nasiri teaches claim 10
Further Lee teaches, wherein the second CNN is designed using a Siamese network. (see figure 2, the second view-CNN network is connected to the cross-view convolution network, the weights of the negative and positive side are shared. Section 3.1 “Each rendered view is fed into a View-CNN,… The parameters from different View-CNNs are shared” because the parameters are shared the second CNN is a Siamese network)

As to independent claim 17
	Buras teaches, An augmented reality method comprising: ( Abstract “A medical guidance system providing real-time, three-dimensional (3D) augmented reality (AR) feedback guidance to a novice user of medical equipment”)
	Buras does not explicitly teach, generating an anchor vector in a semantic space in response to an anchor image being provided to a first CNN, wherein the anchor image is a two-dimensional RGB image, wherein the first CNN includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer; generating a positive vector and negative vector in the semantic space in response to a negative image and positive image being provided to a second CNN, wherein the negative image is a first three-dimensional CAD image and the positive image is a second three-dimensional CAD image, wherein the second CNN includes one or more second convolutional layers, one or more second max pooling layers, a second flattening layer, a second dropout layer, and a second fully connected layer; and extracting one or more image features from different modalities using the anchor vector, positive vector, and negative vector.
	Lee however when addressing the use of a convolutional neural network system for image processing and feature extraction teaches, generating an anchor vector in a semantic space in response to an anchor image being provided to a first CNN, wherein the anchor image is a two-dimensional RGB image, (Figure 1 caption “We propose a cross-domain image-based 3D shape retrieval. Given an input query image, our system automatically returns a list of similar 3D shapes by L2 distance. Our method learns a joint image and 3D shape embedding space.”  And Figure 2
    [Image: media_image1.png]
    [Image: media_image2.png]
 as shown in Figure 1 and Figure 2, an anchor image is provided to an Image CNN the output of the CNN is an image representation or feature vector which is a learnt descriptive vector in embedding space, this corresponds to an “anchor vector in a semantic space”.) generating a positive vector and negative vector in the semantic space in response to a negative image and positive image being provided to a second CNN, wherein the negative image is a first three-dimensional CAD image and the positive image is a second three-dimensional CAD image. (Figure 2 and Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space…. The three streams in the triplet network are anchor image, positive shape, and negative shape (see Figure 2).” As can be seen in Figure 2 a positive CAD shape and negative CAD shape are provided to the network, the output includes a corresponding positive shape representation and negative shape representation.) and extracting one or more image features from different modalities using the anchor vector, positive vector, and negative vector. (Introduction ¶02 “We propose a new cross-domain learning model to better generate the image and shape representations in a joint embedding space. Therefore, the similarities between images and 3D shapes can be effectively computed by the distances in this space.” As shown in figure 2, the 3d shapes are a positive and negative shape. Each input results in a shape or image vector representation. A 2d image and a 3d shape are different modalities.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Buras with the convolutional neural network described by Lee.  One would have been motivated to make such a combination because both Lee and Buras describe the use of a CNN for extracting features from images and blending 3d environments with 2d images. Lee notes “experiments show that there exists a large domain gap between images and 3D shape views that cannot be bridged with simple CNN features”; to remedy this, Lee proposes to “augment the original MVCNN with triplet network… our triplet MVCNN explicitly learns the cross domain image-shape pairs and improves the mAP to 40.85%” (Lee pg 263-264).
Buras/Lee does not explicitly teach, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer; 
Nasiri however when addressing convolutional neural networks for image processing teaches, wherein [a CNN] includes one or more first convolutional layers, one or more first max pooling layers, a first flattening layer, a first dropout layer, and a first fully connected layer; (pg 136 “VGGNet consists of five various blocks which are set homogeneously and sequentially so that the output of each block is defined as the input of the next block (Fig. 2). By this architecture, the network extracts powerful features from the input images such as texture, shape, and color”

    PNG
    media_image3.png
    325
    941
    media_image3.png
    Greyscale
as shown in the figure the CNN which extracts features from a multi-channel image contains ach of the claimed layers.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Buras/Lee to include each of the layers described by Nasiri. One would have been motivated to make such a combination because both Buras/Lee and Nasiri describe the use of a CNN for extracting features from images. The network used by Nasiri is based on the VGG network architecture, which is common and performs well in image processing domains. Nasiri notes, “There are various dominant pre-trained structures of CNN which have been successfully trained by a major dataset of labeled images such as ImageNet with 1000 different classes… Common pre-trained CNN consists… VGG… The very depth VGGNet significantly outperforms the architectures which achieved the best results” (Section 2.2 Nasiri).
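For illustration only, the claimed ordering of layers (convolution, max pooling, flattening, dropout, fully connected) can be traced as a sequence of tensor shapes. This is a minimal sketch and not the network of Nasiri or of the application: the input size, channel count, unit count, and all function names are hypothetical, and only shapes are computed (no weights or training).

```python
def conv_same(shape, out_channels):
    """Convolution with 'same' padding: height/width kept, channel count changed."""
    h, w, _ = shape
    return (h, w, out_channels)

def max_pool_2x2(shape):
    """2x2 max pooling with stride 2: height and width are halved."""
    h, w, c = shape
    return (h // 2, w // 2, c)

def flatten(shape):
    """Flattening: collapse height, width, and channels into one dimension."""
    h, w, c = shape
    return (h * w * c,)

def dropout(shape):
    """Dropout zeroes activations during training but never changes the shape."""
    return shape

def fully_connected(shape, units):
    """Fully connected layer: maps the flat vector to `units` outputs."""
    return (units,)

# Hypothetical 224x224 RGB input passed through the layers in the claimed order.
trace = [("input", (224, 224, 3))]
trace.append(("conv", conv_same(trace[-1][1], 64)))
trace.append(("max_pool", max_pool_2x2(trace[-1][1])))
trace.append(("flatten", flatten(trace[-1][1])))
trace.append(("dropout", dropout(trace[-1][1])))
trace.append(("fully_connected", fully_connected(trace[-1][1], 128)))

for name, shape in trace:
    print(name, shape)
```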

claim 18
Buras/Lee/Nasiri teaches claim 17
Further, Lee teaches, wherein the cross-domain deep metric learning algorithm is a triplet loss algorithm that is operable to decrease a first distance between the anchor vector and the positive vector in the semantic space and increase a second distance between the anchor vector and the negative vector in the semantic space. (Section 3.3 “The CNNs for images (Image-CNN) and views (ViewCNN) are learned jointly to construct a joint embedding space. We use triplet neural network architecture and propose a fast triplet architecture to speed up the training. The goal of triplet neural network is to enforce the anchor negative distances at least farther than the anchor-positive distances by a certain margin: … where dpos is the anchor-positive distance and dneg is the anchor-negative distance. The three streams in the triplet network are anchor image, positive shape, and negative shape … The triplet loss is defined as: …
[Equation image: media_image4.png]
” The loss function is minimized through training such that dneg is maximized and dpos is minimized, which corresponds to increasing the anchor-negative distance and decreasing the anchor-positive distance.)
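For illustration only, the relationship between the loss and the two distances can be sketched as follows. This is a minimal sketch assuming the commonly used hinge form of a triplet loss, max(0, margin + dpos - dneg), with Euclidean distance; the exact expression used by Lee is the one defined in Lee Section 3.3 and is not reproduced here. The function names, the margin value, and the example vectors are hypothetical.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-form triplet loss (assumed form, not quoted from Lee)."""
    d_pos = l2_distance(anchor, positive)  # anchor-positive distance
    d_neg = l2_distance(anchor, negative)  # anchor-negative distance
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(0.0, margin + d_pos - d_neg)

anchor = [0.0, 0.0]

# Violating triplet: the positive (distance 5) is farther than the negative (distance 2).
loss_violated = triplet_loss(anchor, [3.0, 4.0], [0.0, 2.0])

# Satisfied triplet: the positive (distance 1) is much closer than the negative (distance 10).
loss_satisfied = triplet_loss(anchor, [0.0, 1.0], [6.0, 8.0])

print(loss_violated, loss_satisfied)
```

Under this assumed form, the loss is positive only while the anchor-negative distance fails to exceed the anchor-positive distance by the margin, so minimizing it pushes dneg up and dpos down.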


claim 19
Buras/Lee/Nasiri teaches claim 17
Buras teaches, performing step recognition by analyzing the image features extracted in the semantic space. (Column 10 line 66 to column 11 line 8 “For example, the MLM 600 may cause the ARUI 300 to display a virtual image or video instructing the novice user to change the orientation of a probe to match a desired reference (e.g., expert) orientation, or may display a correct motion path to be taken by the novice user in repeating a prior reference motion, with color-coding to indicate portions of the novice user's prior path that were erroneous or sub-optimal. In some embodiments, the MLM 600 may cause the ARUI 300 to display only portions of the novice user's motion that must be corrected” The MLM displays corrective action to the user through an augmented reality display (ARUI). Column 11 line 9-25 “MLM 600 also includes a fourth module that receives real-time data from the medical equipment system 200 itself (e.g., via an interface with computer 700) during a medical procedure performed by the novice user, and a fifth module that compares that data to stored reference outcome data from library 500. For example, the MLM 600 may receive image data from an ultrasound machine during use by a novice user... The MLM 600 further includes a sixth module that generates real-time outcome-based feedback based on the comparison performed in the fifth module” Column 17 line 63-66 “MLM 600 provides outcome-based feedback by comparing novice user ultrasound images and reference ultrasound images using a neural network.” Image features are extracted using a neural network. The features extracted by a neural network are descriptive of the image, and are thus extracted “in semantic space” as claimed.)

claim 20
Buras/Lee/Nasiri teaches claim 17
Buras teaches, determining if an invalid repair sequence has occurred based on an analysis of the image features in the semantic space. (Column 1 line 30-35 “In many medical situations, diagnostic or treatment of medical conditions, which may include life-saving care, must be provided by persons without extensive medical training. This may occur because trained personnel are either not present or are unable to respond. For example, temporary treatment of broken bones” The guidance system aids a user in performing repair sequences; in this context, a repair sequence includes steps to treat or repair human ailments. Column 9 line 34-44 “This feedback enables the novice user to correct mistakes or incorrect usage of the medical equipment and achieve an outcome similar to that of the expert user… the real-time 3D AR feedback may include… tactile information (e.g., vibrations or pulses when the novice user is in the correct or incorrect position)” The system is able to indicate invalid steps taken; when the system detects an incorrect position, it may vibrate or pulse. Column 16 line 8-12 “MLM 600 in the embodiment of FIG. 2 also provides outcome-based feedback based on comparing the ultrasound images generated in real-time by the novice user 50 to reference images” Column 18 line 8-10 “neural networks used in MLM 600 preferably include at least one convolutional layer, because image processing is the primary basis for outcome-based feedback.” Such feedback is based on the image analysis performed by the convolutional network, and is therefore based on the image features as claimed.)



Claim(s) 11 is rejected under 35 U.S.C. § 103 as being unpatentable over Buras/Lee/Nasiri, further in view of Georgakis et al “Learning Local RGB-to-CAD Correspondences for Object Pose Estimation” hereinafter Georgakis.

claim 11
Buras/Lee/Nasiri teaches claim 10
Buras/Lee/Nasiri does not explicitly teach, wherein the controller is further operable to determine a pose of an image object within the one or more RGB images
Georgakis, however, when addressing determination of a pose within an RGB image, teaches, wherein the controller is further operable to determine a pose of an image object within the one or more RGB images (Figure 1 caption “We present a new method that matches RGB images to depth renderings of CAD models for object pose estimation” The system described matches an RGB image to a pose based on a CAD model, i.e., a 3D pose.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the CNN discussed in Buras/Lee/Nasiri, which matches 3D images with 2D images, to determine a pose of an object in the image as described by Georgakis. One would have been motivated to make such a combination because both Buras/Lee/Nasiri and Georgakis describe the use of a CNN for extracting features from images. Georgakis presents an approach for learning poses without pose annotations, noting “the proposed method on unseen testing data compared to supervised approaches, suggesting that it is possible to learn generalizable models without depending on pose annotations.” (Conclusion Georgakis)

Conclusion
Prior art: 
Feng et al “2D3D-MatchNet: Learning to Match Keypoints Across 2D Image and 3D Point Cloud” discloses a cross-domain image processing neural network that takes 2D images and 3D point clouds as input and is trained to match them via triplet loss.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached M-F 7:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.R.G./
Examiner, Art Unit 2122                                                    
/KAKALI CHAKI/            Supervisory Patent Examiner, Art Unit 2122