DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/07/2021 and 02/05/2020 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3, 6, 7, 8, 10, 13-15 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Mahboob, US PG-Pub (US 20190354919 A1), in view of Vo et al., US PG-Pub (US 20200302168 A1).
Regarding Claim 1, Mahboob teaches a device comprising: an image sensor configured to capture images(¶[0046] In FIG. 1B, an image capturing device (such as a camera) of smartphone 102 may be used to capture an image of package label 104.); storage configured to store image data representing the captured images(¶[0103] The Mobile device 1100 may also include Central Processing Unit (CPU) 1104 and a memory 1106 to process data, such as the collected environmental data, inputted data, or data retrieved from a storage device. The examiner interprets that since the mobile device is capturing the images that since it has a memory the captured images are being stored in the memory of the mobile device.)and a processor communicatively coupled to at least the storage(¶[0103] The Mobile device 1100 may also include Central Processing Unit (CPU) 1104 and a memory 1106 to process data, such as the collected environmental data, inputted data, or data retrieved from a storage device. The CPU 1104 may include one or more processors configured to execute computer program instructions to perform various processes and methods.), derive a converged ROI based on at least a portion of the ROI of at least one of the captured training images of the set of captured training images and generate an anchor model based on a combination of the converged ROI and the common set of visual features, wherein: the common set of visual features defines an anchor, a location of each visual feature is specified relative to the converged ROI; (¶[0034] a deep learning-based approach may involves using object detection algorithm Faster R-CNN for the purpose of drawing a bounding box around a Region of Interest (ROI) on a package label. The ROIs may include sender info region, receiver info region, barcode region, courier info region, etc. 
In an exemplary embodiment Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. The examiner interprets that the bounding box around the package label would derive a converged ROI of the image. The prior art uses an R-CNN which uses the bounding box of the package label to serve as the anchor and will derive the location of each visual feature of that label, such as the sender info region, receiver info region and barcode region.);
and the anchor model is to be used by the processor during an operating mode to derive a location of a candidate ROI relative to the anchor in a captured image (¶[0067] Exemplary systems and methods may be utilized to extract information from a package label, generically, independent of database. Additionally, exemplary systems and methods may accurately parse information of sender and receiver. Coalescence of techniques like Pattern matching, NER, and localization of ROI (Region of Interest) may be leveraged to extract essential pieces of information like name, phone and business/organization of sender and recipient, with high accuracy. The examiner interprets that the processor in the prior art is using the ROI of the package label to extract essential pieces of information such as name, phone and business/location)
Mahboob does not explicitly teach the processor configured, during a training mode, to perform operations comprising: receive a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and a region of interest (ROI); receive manually input indications of a location of the ROI within each captured training image of the set of captured training images; perform at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images; 
Vo teaches the processor configured, during a training mode, to perform operations comprising: receive a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and a region of interest (ROI) (¶[0022] a wager object region proposal network (RPN) to receive image data from captured images of the gaming table. ¶[0092] The RPNs 920 and 940 may take an image as an input and as an output produce one or more object proposals. Each object proposal may comprise the co-ordinates on an image that may define a rectangular boundary of a region of interest with the detected object, and an associated objectness score, which reflects the likelihood that one of a class of objects may be present in the region of interest. The examiner interprets the prior art is using captured image data from the camera system to train the neural network to produce a region of interest in the captured image) ; receive manually input indications of a location of the ROI within each captured training image of the set of captured training images(¶[0094], The training data set may comprise several images in which boundaries of regions of interest and the identity of the object in every region of interest may have been manually identified and recorded); perform at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images(¶[0132], flowchart 1400 includes steps for image segmentation and object detection for non-wager objects. 
At step 1410, the regions of interest identified in step 1110 are processed through a region of interest alignment neuron layer to improve the alignment of the boundaries of the identified regions of interest in order to improve the subsequent step of image segmentation or masking process. At step 1420 the aligned regions of interest are processed through a trained Mask R-CNN, output in the form of a binary segmentation mask for each non-wager object identified at step 1114 is produced at step 1430. The output may be in the form of a binary segmentation mask for each identified object, wherein each binary segmentation mask represents a set of pixels in the captured image that are associated with an identified object. The examiner interprets that the prior art is performing a translational transformation since the alignment neuron layer is aligning the boundaries of the ROI in order to segment the image and perform object detection.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to use the images captured by the image sensor to train the neural network and to perform a transformation on the captured images. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
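For illustration only, the translational alignment recited in claim 1 can be sketched as follows. This is a minimal Python sketch under stated assumptions and is not taken from the application, Mahboob, or Vo: the function names `centroid` and `align_by_translation` and the sample coordinates are hypothetical. It assumes each training image contributes a list of matched feature points and an ROI rectangle given as (x_min, y_min, x_max, y_max), and it shifts the features and ROI of each image so that the feature centroids coincide with those of the first image.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def centroid(points: List[Point]) -> Point:
    """Mean position of a non-empty list of feature points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def align_by_translation(
    features: List[List[Point]], rois: List[Box]
) -> Tuple[List[List[Point]], List[Box]]:
    """Translate each image's features and ROI into the first image's frame.

    Hypothetical helper: only a translational transform is applied; rotation,
    scaling and shear are outside the scope of this sketch.
    """
    ref_x, ref_y = centroid(features[0])
    aligned_features: List[List[Point]] = []
    aligned_rois: List[Box] = []
    for pts, (x0, y0, x1, y1) in zip(features, rois):
        cx, cy = centroid(pts)
        dx, dy = ref_x - cx, ref_y - cy  # offset that moves this image onto the reference
        aligned_features.append([(x + dx, y + dy) for x, y in pts])
        aligned_rois.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return aligned_features, aligned_rois


# Hypothetical sample data: two training images whose content is offset by (5, 2).
features = [[(10.0, 10.0), (20.0, 10.0)], [(15.0, 12.0), (25.0, 12.0)]]
rois = [(12.0, 14.0, 18.0, 20.0), (17.0, 16.0, 23.0, 22.0)]
aligned_features, aligned_rois = align_by_translation(features, rois)
```

After alignment, features that land at the same coordinates in every image are candidates for the common set of visual features; the sketch stops short of that matching step.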
Regarding Claim 3, the combination of Mahboob and Vo teaches the device of claim 1, comprising an input device, wherein the processor is caused to operate at least the input device to provide a user interface (UI) by which the indications of a location of the ROI within each captured training image of the set of captured training images is manually provided (Vo, ¶[0112] In order to prepare a substantial data set for training the machine learning or neural network algorithms, regions of interest may be manually drawn or identified in images captured from games on the gaming table. The regions of interest may be manually tagged with relevant identifiers, such as wager objects, persons, cards or other game objects, for example using an annotation or tagging tool as illustrated in FIG. 4C. An example of a suitable annotation tool is the “Labellmg” tool accessible through GitHub that provides an annotation XML file of each file in Pascal VOC format. Further, additional parameters, for example relating to a difficult object or segmented object may also be identified by manual tagging using a tagging tool with respect to a region of interest. As seen in Fig. 4C of the prior art, the examiner interprets that a user interface is provided on a device in which a user can specify the location of the ROI of interest in the training images.) 
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to provide a user interface to indicate the location of the ROI in the training images. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
Regarding Claim 6, the combination of Mahboob and Vo teaches the device of claim 1, wherein the derivation of the converged ROI comprises a performance, by the processor, of operations comprising reshape an area overlapped by the ROIs of all of the captured training images of the set of captured training images to give the converged ROI a rectangular shape. (Vo, ¶[0123] The regions of interest identified by Wager Object RPN 940 may only be rectangular in shape and the edges of the rectangle must be parallel to the edges of the input image. As isolation of objects to be identified in the proposed regions of interest is vital for accuracy in object detection. To overcome this, instead of treating the entire edge pattern of a wager object as a target for object detection, the Wager Object RPN 940 is trained to identify ends of each visible edge pattern at step 1128. For example, the regions of interest 1310 and 1315 identified in the image frame 1300 bound or cover only one part of an edge pattern on wager object. Such edge patterns are distinct and spaced around the circumference of the chip and are separated by non-patterned edge regions. The examiner interprets that the prior art is being trained to reshape and isolate overlap regions in the image and that the converged ROI is rectangular in shape.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to reshape regions such that the converged ROI has a rectangular shape. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
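For illustration only, the claim 6 limitation of deriving a rectangular converged ROI from the area overlapped by all ROIs can be sketched as follows. This is a minimal Python sketch under stated assumptions, not an implementation from the application or from either reference: the function name `converged_roi` and the sample coordinates are hypothetical, and the ROIs are assumed to be already aligned, axis-parallel rectangles given as (x_min, y_min, x_max, y_max).

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def converged_roi(rois: List[Box]) -> Optional[Box]:
    """Return the rectangle overlapped by every ROI, or None if there is none.

    Hypothetical helper: the overlap of axis-parallel rectangles is itself a
    rectangle, so no further reshaping is needed under this assumption.
    """
    if not rois:
        return None
    x_min = max(r[0] for r in rois)  # right-most left edge
    y_min = max(r[1] for r in rois)  # lowest top edge
    x_max = min(r[2] for r in rois)  # left-most right edge
    y_max = min(r[3] for r in rois)  # highest bottom edge
    if x_min >= x_max or y_min >= y_max:
        return None  # the ROIs share no area
    return (x_min, y_min, x_max, y_max)


# Hypothetical sample data: three manually indicated ROIs after alignment.
sample_rois = [
    (12.0, 14.0, 18.0, 20.0),
    (13.0, 15.0, 19.0, 21.0),
    (11.0, 13.0, 17.0, 19.0),
]
overlap = converged_roi(sample_rois)
disjoint = converged_roi([(0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)])
```

If the manually indicated ROIs were rotated or non-rectangular, the overlapped area would generally not be rectangular and an additional reshaping step would be required; the sketch does not cover that case.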
Regarding Claim 7, the combination of Mahboob and Vo teaches the device of claim 1, 
Mahboob teaches wherein use of the anchor model by the processor during the operating mode to derive a location of a candidate ROI in the captured image comprises a performance, by the processor, of operations comprising: search within the captured image for the common set of visual features of the anchor (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc. The examiner interprets that the R-CNN of the prior art is searching for features of the package label.); and in response to the common set of visual features being found within the captured image, use an indication of the location of at least one visual feature of the common set of visual features relative to the converged ROI to derive the location of the candidate ROI within the captured image (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc., by using an exemplary deep learning based NER module. An architecture of an exemplary NER module may contains combination of Bidirectional LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network) and CRF (Conditional Random Field) for labelling the aforementioned entities present in the localized text. The examiner interprets that the R-CNN will locate regions pertaining to the textual features of the image in order to derive the locations of bounding boxes around those features.);
wherein the processor is configured to perform operations comprising: perform at least one machine vision task within the candidate ROI(¶[0051], an image of a package label 104 may be sent for OCR. An exemplary OCR engine may be utilized for OCR, such as OCR engine 106 that may comprise an open-source solutions like Tesseract or a commercial option such as like ABBYY Finereader, Google Vision, Nuance OmniPage, the examiner interprets OCR is being performed on the candidate region of the package label ); and transmit data output by the performance of the at least one machine vision task to another device via a network(¶[0104] The mobile device 1100 may include an I/O Unit 1108 for sending data over a network or any other medium. For example, I/O Unit 1100 may send data over a network, point-to-point, and/or point-to-multipoint connection either wirelessly or over a cable.)
Regarding Claim 8, Mahboob teaches a machine vision system comprising: a camera, wherein: the camera comprises an image sensor configured to capture images (¶[0046] In FIG. 1B, an image capturing device (such as a camera) of smartphone 102 may be used to capture an image of package label 104) and the camera is configured to use an anchor model, during an operating mode of the machine vision system, to derive a location of a candidate region of interest (ROI) relative to an anchor in a captured image(¶[0034] In an exemplary embodiment, a deep learning-based approach may involves using object detection algorithm Faster R-CNN for the purpose of drawing a bounding box around a Region of Interest (ROI) on a package label. The ROIs may include sender info region, receiver info region, barcode region, courier info region, etc. In an exemplary embodiment Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. The examiner interprets the anchor model to be a package label and the R-CNN of the prior art is deriving a candidate region of interest by drawing a bounding box around the package label.); derive a converged ROI based on at least a portion of the ROI of at least one of the captured training images of the set of captured training image and generate the anchor model based on a combination of the converged ROI and the common set of visual features, wherein:  the common set of visual features defines the anchor and a location of each visual feature is specified relative to converged ROI (¶[0034] a deep learning-based approach may involves using object detection algorithm Faster R-CNN for the purpose of drawing a bounding box around a Region of Interest (ROI) on a package label. The ROIs may include sender info region, receiver info region, barcode region, courier info region, etc. 
In an exemplary embodiment Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. The examiner interprets the bounding box around the package label would derive a converged ROI of the image. The prior art uses an R-CNN which uses the bounding box of the package label to serve as the anchor and will derive the location of each visual features of that label such as sender info region, receiver info region and barcode region.)
Mahboob does not explicitly teach a manual input device and a control device communicatively coupled to at least the camera and the input device and comprising a processor configured, during a training mode of the machine vision system, to perform operations comprising: receive, from the camera, a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and an ROI; receive, from the input device, manually input indications of a location of the ROI within each captured training image of the set of captured training images; perform at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images.
Vo teaches a manual input device([0112], The regions of interest may be manually tagged with relevant identifiers, such as wager objects, persons, cards or other game objects, for example using an annotation or tagging tool as illustrated in FIG. 4C. An example of a suitable annotation tool is the “Labellmg” tool accessible through GitHub that provides an annotation XML file of each file in Pascal VOC format. Further, additional parameters, for example relating to a difficult object or segmented object may also be identified by manual tagging using a tagging tool with respect to a region of interest. As seen in Figure 4C of the prior art, there is a user interface for a user to annotate regions of interest and the examiner interprets that it would be obvious that the interface has to have a device associated with the annotation tool.) and a control device communicatively coupled to at least the camera and the input device (¶[0059],The cameras may communicate the captured images to the computing device 130 through a communication link 107, which may be in the form of a USB cable or a wireless communication link. An example of a suitable camera for each of cameras 120 and 320 is the BRIO 4k Webcam camera from Logitech.), and comprising a processor configured, during a training mode of the machine vision system, to perform operations comprising: receive, from the camera, a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and an ROI(¶[0022] a wager object region proposal network (RPN) to receive image data from captured images of the gaming table. ¶[0092] The RPNs 920 and 940 may take an image as an input and as an output produce one or more object proposals. 
Each object proposal may comprise the co-ordinates on an image that may define a rectangular boundary of a region of interest with the detected object, and an associated objectness score, which reflects the likelihood that one of a class of objects may be present in the region of interest. The examiner interprets the prior art is using captured image data from the camera system to train the neural network to produce a region of interest in the captured image); 
receive, from the input device, manually input indications of a location of the ROI within each captured training image of the set of captured training images (¶[0094], The training data set may comprise several images in which boundaries of regions of interest and the identity of the object in every region of interest may have been manually identified and recorded); perform at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images (¶[0132], flowchart 1400 includes steps for image segmentation and object detection for non-wager objects. At step 1410, the regions of interest identified in step 1110 are processed through a region of interest alignment neuron layer to improve the alignment of the boundaries of the identified regions of interest in order to improve the subsequent step of image segmentation or masking process. At step 1420 the aligned regions of interest are processed through a trained Mask R-CNN, output in the form of a binary segmentation mask for each non-wager object identified at step 1114 is produced at step 1430. The output may be in the form of a binary segmentation mask for each identified object, wherein each binary segmentation mask represents a set of pixels in the captured image that are associated with an identified object. The examiner interprets that the prior art is performing a translational transformation since the alignment neuron layer is aligning the boundaries of the ROI in order to segment the image and perform object detection.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to use the images captured by the image sensor to train the neural network and to perform a transformation on the captured images. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
Regarding Claim 10, the combination of Mahboob and Vo teaches the machine vision system of claim 8, comprising a display, wherein the processor is caused to operate at least the input device and the display to provide a user interface (UI) to prompt an operator to operate at least the input device to manually provide the indications of a location of an ROI within each captured training image of the set of captured training images (Vo, ¶[0112] In order to prepare a substantial data set for training the machine learning or neural network algorithms, regions of interest may be manually drawn or identified in images captured from games on the gaming table. The regions of interest may be manually tagged with relevant identifiers, such as wager objects, persons, cards or other game objects, for example using an annotation or tagging tool as illustrated in FIG. 4C. An example of a suitable annotation tool is the “Labellmg” tool accessible through GitHub that provides an annotation XML file of each file in Pascal VOC format. Further, additional parameters, for example relating to a difficult object or segmented object may also be identified by manual tagging using a tagging tool with respect to a region of interest. As seen in Fig. 4C of the prior art, the examiner interprets that a user interface is provided on a device in which a user can specify the location of the ROI of interest in the training images.) 
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to provide a user interface to indicate the location of the ROI in the training images. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
Regarding Claim 13, the combination of Mahboob and Vo teaches the machine vision system of claim 8, wherein derivation of the converged ROI comprises a performance, by the processor, of operations comprising reshape an area overlapped by the ROIs of all of the captured training images of the set of captured training images to give the converged ROI a rectangular shape. (Vo, ¶[0123] The regions of interest identified by Wager Object RPN 940 may only be rectangular in shape and the edges of the rectangle must be parallel to the edges of the input image. As isolation of objects to be identified in the proposed regions of interest is vital for accuracy in object detection. To overcome this, instead of treating the entire edge pattern of a wager object as a target for object detection, the Wager Object RPN 940 is trained to identify ends of each visible edge pattern at step 1128. For example, the regions of interest 1310 and 1315 identified in the image frame 1300 bound or cover only one part of an edge pattern on wager object. Such edge patterns are distinct and spaced around the circumference of the chip and are separated by non-patterned edge regions. The examiner interprets that the prior art is being trained to reshape and isolate overlap regions in the image and that the converged ROI is rectangular in shape.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to reshape regions such that the converged ROI has a rectangular shape. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
Regarding Claim 14, the combination of Mahboob and Vo teaches the machine vision system of claim 8, 
Mahboob teaches wherein use of the anchor model by the camera during the operating mode to derive a location of a candidate ROI in the captured image comprises a performance, by the camera, of operations comprising: search within the captured image for the common set of visual features of the anchor (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc. The examiner interprets that the R-CNN of the prior art is searching for features of the package label.); and in response to the common set of visual features being found within the captured image, use an indication of the location of at least one visual feature of the common set of visual features relative to the converged ROI to derive the location of the candidate ROI within the captured image (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc., by using an exemplary deep learning based NER module. An architecture of an exemplary NER module may contains combination of Bidirectional LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network) and CRF (Conditional Random Field) for labelling the aforementioned entities present in the localized text. The examiner interprets that the R-CNN will locate regions pertaining to the textual features of the image in order to derive the locations of bounding boxes around those features.);
wherein the processor is configured to perform operations comprising: perform at least one machine vision task within the candidate ROI(¶[0051], an image of a package label 104 may be sent for OCR. An exemplary OCR engine may be utilized for OCR, such as OCR engine 106 that may comprise an open-source solutions like Tesseract or a commercial option such as like ABBYY Finereader, Google Vision, Nuance OmniPage, the examiner interprets OCR is being performed on the candidate region of the package label ); and transmit data output by the performance of the at least one machine vision task to another device via a network(¶[0104] The mobile device 1100 may include an I/O Unit 1108 for sending data over a network or any other medium. For example, I/O Unit 1100 may send data over a network, point-to-point, and/or point-to-multipoint connection either wirelessly or over a cable.)
Regarding Claim 15, Mahboob teaches a method comprising, deriving, by the processor, a converged ROI based on at least a portion of the ROI of at least one of the captured images of the set of captured images; and generating, by the processor, an anchor model based on a combination of the converged ROI and the common set of visual features, wherein: the common set of visual features defines an anchor; a location of each visual feature is specified relative to the converged ROI; and the anchor model is to be used by the processor during an operating mode to derive a location of a candidate ROI relative to the anchor in a captured image. (¶[0034] a deep learning-based approach may involves using object detection algorithm Faster R-CNN for the purpose of drawing a bounding box around a Region of Interest (ROI) on a package label. The ROIs may include sender info region, receiver info region, barcode region, courier info region, etc. In an exemplary embodiment Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. The examiner interprets the bounding box around the package label would derive a converged ROI of the image. The prior art uses an R-CNN which uses the bounding box of the package label to serve as the anchor and will derive the location of each visual features of that label such as sender info region, receiver info region and barcode region.)
Mahboob does not explicitly teach during a training mode, performing operations comprising: receiving, at a processor and from an image sensor, a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and a region of interest (ROI); receiving, at the processor, manually input indications of a location of the ROI within each captured training image of the set of captured training images; performing, by the processor, at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images; 
Vo teaches during a training mode, performing operations comprising: receiving, at a processor and from an image sensor, a set of training images captured by the image sensor, wherein each captured training image of the set of captured training images comprises multiple visual features and a region of interest (ROI) (¶[0022] a wager object region proposal network (RPN) to receive image data from captured images of the gaming table. ¶[0092] The RPNs 920 and 940 may take an image as an input and as an output produce one or more object proposals. Each object proposal may comprise the co-ordinates on an image that may define a rectangular boundary of a region of interest with the detected object, and an associated objectness score, which reflects the likelihood that one of a class of objects may be present in the region of interest. The examiner interprets the prior art is using captured image data from the camera system to train the neural network to produce a region of interest in the captured image); receiving, at the processor, manually input indications of a location of the ROI within each captured training image of the set of captured training images; (¶[0094], The training data set may comprise several images in which boundaries of regions of interest and the identity of the object in every region of interest may have been manually identified and recorded); performing, by the processor, at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of at least one captured training image of the set of captured training images to at least partially align the multiple visual features and ROIs among all of the captured training images of the set of captured training images to identify a common set of visual features that are present within all of the captured training images; (¶[0132], flowchart 1400 includes steps for image segmentation and object detection for non-wager objects. 
At step 1410, the regions of interest identified in step 1110 are processed through a region of interest alignment neuron layer to improve the alignment of the boundaries of the identified regions of interest in order to improve the subsequent step of image segmentation or masking process. At step 1420 the aligned regions of interest are processed through a trained Mask R-CNN, output in the form of a binary segmentation mask for each non-wager object identified at step 1114 is produced at step 1430. The output may be in the form of a binary segmentation mask for each identified object, wherein each binary segmentation mask represents a set of pixels in the captured image that are associated with an identified object. The examiner interprets that the prior art is performing a translational transformation, since the alignment neuron layer aligns the boundaries of the ROI in order to segment the image and perform object detection.);
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to use the captured images by the image sensor to train the neural network and perform a transformation on the captured images. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
Regarding Claim 19, the combination of Mahboob and Vo teaches the method of claim 15, wherein the derivation of the converged ROI comprises a performance, by the processor, of operations comprising reshape an area overlapped by the ROIs of all of the captured training images of the set of captured training images to give the converged ROI a rectangular shape. (Vo, ¶[0123] The regions of interest identified by Wager Object RPN 940 may only be rectangular in shape and the edges of the rectangle must be parallel to the edges of the input image. As isolation of objects to be identified in the proposed regions of interest is vital for accuracy in object detection. To overcome this, instead of treating the entire edge pattern of a wager object as a target for object detection, the Wager Object RPN 940 is trained to identify ends of each visible edge pattern at step 1128. For example, the regions of interest 1310 and 1315 identified in the image frame 1300 bound or cover only one part of an edge pattern on wager object. Such edge patterns are distinct and spaced around the circumference of the chip and are separated by non-patterned edge regions. The examiner interprets that the prior art is being trained to reshape and isolate overlapping regions in the image and that the converged ROI is rectangular in shape).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Vo to Mahboob in order to reshape regions such that the converged ROI has a rectangular shape. One skilled in the art would have been motivated to modify Mahboob in this manner in order to improve the performance or speed of object detection. (Vo, ¶[0056])
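For illustration only, the claim 19 limitation of reshaping the area overlapped by all of the ROIs into a rectangular converged ROI can be sketched as the intersection of axis-aligned rectangles. This is a minimal sketch under stated assumptions and is not taken from the application, Mahboob or Vo: the `Rect` type, the `converged_roi` function and the assumption that the ROIs are already aligned and axis-parallel are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Rect(NamedTuple):
    """Hypothetical axis-aligned ROI, given by its corner coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def converged_roi(rois: Sequence[Rect]) -> Optional[Rect]:
    """Return the rectangle overlapped by every ROI, or None if there is none.

    Assumes the ROIs have already been aligned to a common coordinate frame.
    """
    if not rois:
        return None
    # The overlap of axis-aligned rectangles is itself a rectangle:
    # take the innermost edge on each side.
    x_min = max(r.x_min for r in rois)
    y_min = max(r.y_min for r in rois)
    x_max = min(r.x_max for r in rois)
    y_max = min(r.y_max for r in rois)
    if x_min >= x_max or y_min >= y_max:
        return None  # the ROIs share no common area
    return Rect(x_min, y_min, x_max, y_max)
```

Under these assumptions the result is rectangular by construction; ROIs that are rotated relative to one another would need a different approach.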
Regarding Claim 20, the combination of Mahboob and Vo teaches the method of claim 15, Mahboob teaches wherein use of the anchor model by the processor during the operating mode to derive a location of a candidate ROI in the captured image comprises a performance, by the processor, of operations comprising: search within the captured image for the common set of visual features of the anchor (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc. The examiner interprets that the R-CNN of the prior art is searching for features of the package label.); and in response to the common set of visual features being found within the captured image, use an indication of the location of at least one visual feature of the common set of visual features relative to the converged ROI to derive the location of the candidate ROI within the captured image (¶[0034], Faster R-CNN locates regions on the package label image by outputting the name of that region as well as the coordinates of the bounding box which encapsulates that specific region. Subsequently, text contained in that specific region may be parsed into components including, but not limited to, name, organization, street address, city, state, country, phone number, etc., by using an exemplary deep learning based NER module. An architecture of an exemplary NER module may contains combination of Bidirectional LSTM (Long Short-Term Memory), CNN (Convolutional Neural Network) and CRF (Conditional Random Field) for labelling the aforementioned entities present in the localized text. The examiner interprets that the R-CNN will locate regions pertaining to the textual features of the image in order to derive the locations of bounding boxes around those features.)
wherein the processor is configured to perform operations comprising: perform at least one machine vision task within the candidate ROI (¶[0051], an image of a package label 104 may be sent for OCR. An exemplary OCR engine may be utilized for OCR, such as OCR engine 106 that may comprise an open-source solutions like Tesseract or a commercial option such as like ABBYY Finereader, Google Vision, Nuance OmniPage. The examiner interprets that OCR is being performed on the candidate region of the package label.); and transmit data output by the performance of the at least one machine vision task to another device via a network (¶[0104] The mobile device 1100 may include an I/O Unit 1108 for sending data over a network or any other medium. For example, I/O Unit 1100 may send data over a network, point-to-point, and/or point-to-multipoint connection either wirelessly or over a cable.)
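For illustration only, the operating-mode limitation of claim 20 (finding the anchor's visual features and using their stored locations relative to the converged ROI to derive the candidate ROI) can be sketched as follows. This is a minimal, translation-only sketch under stated assumptions and does not represent the applicant's implementation or Mahboob's Faster R-CNN: the `AnchorModel` class, its fields and the `locate_candidate_roi` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


@dataclass(frozen=True)
class AnchorModel:
    """Hypothetical anchor model: feature locations stored relative to the ROI."""
    # feature name -> (dx, dy) offset of the feature from the ROI's top-left corner
    feature_offsets: Dict[str, Point]
    roi_size: Tuple[float, float]  # (width, height) of the converged ROI


def locate_candidate_roi(model: AnchorModel, found: Dict[str, Point]) -> Optional[Box]:
    """Derive the candidate ROI from the features found in a captured image.

    Returns None unless every feature of the anchor was found.
    """
    if not model.feature_offsets:
        return None
    if any(name not in found for name in model.feature_offsets):
        return None
    # Each found feature "votes" for the ROI origin; average the votes.
    xs = [found[n][0] - off[0] for n, off in model.feature_offsets.items()]
    ys = [found[n][1] - off[1] for n, off in model.feature_offsets.items()]
    x0 = sum(xs) / len(xs)
    y0 = sum(ys) / len(ys)
    width, height = model.roi_size
    return (x0, y0, x0 + width, y0 + height)
```

The sketch ignores rotation and scale between training and operation; handling those would require estimating a full transform from the matched features.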
Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mahboob US PG-Pub (US 20190354919 A1) in view of Vo et al. US PG-Pub (US 20200302168 A1) in view of Tombari et al. US Patent (US 9495607 B2, as cited by applicant in IDS filed on 02/05/2020).
Regarding Claim 2, the combination of Mahboob and Vo teaches the device of claim 1, but does not explicitly teach wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images.
Tombari teaches wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments (Col 2, Lines 23-25, During operation, the computer system may optionally receive (or access) an image that includes the object, and may optionally extract line segments aligned with edge pixels associated with the object.); and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images (Col 2, Lines 26-32, the computer system may optionally receive the extracted line segments. Then, the computer system may determine orientations for the line segments. Moreover, the computer system may identify one or more subsets of the line segments, where a given subset includes k of the line segments that are proximate to a given line segment in the line segments. The examiner interprets that the prior art is capable of determining a subset of line segments of the original line segment in the image.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Tombari to Mahboob and Vo in order for the visual features of the image to comprise line segments. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to describe objects in images using descriptors based on features associated with edge pixels. (Tombari, Col 1, Lines 15-17)
Regarding Claim 9, the combination of Mahboob and Vo teaches the machine vision system of claim 8, but does not explicitly teach wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images.
Tombari teaches wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments (Col 2, Lines 23-25, During operation, the computer system may optionally receive (or access) an image that includes the object, and may optionally extract line segments aligned with edge pixels associated with the object.); and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images (Col 2, Lines 26-32, the computer system may optionally receive the extracted line segments. Then, the computer system may determine orientations for the line segments. Moreover, the computer system may identify one or more subsets of the line segments, where a given subset includes k of the line segments that are proximate to a given line segment in the line segments. The examiner interprets that the prior art is capable of determining a subset of line segments of the original line segment in the image.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Tombari to Mahboob and Vo in order for the visual features of the image to comprise line segments. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to describe objects in images using descriptors based on features associated with edge pixels. (Tombari, Col 1, Lines 15-17)
Regarding Claim 16, the combination of Mahboob and Vo teaches the method of claim 15, but does not explicitly teach wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images.
Tombari teaches wherein: the multiple visual features of each captured training image of the set of captured training images comprises multiple line segments (Col 2, Lines 23-25, During operation, the computer system may optionally receive (or access) an image that includes the object, and may optionally extract line segments aligned with edge pixels associated with the object.); and the common set of visual features comprises the line segments of the multiple line segments within each of the captured training images that are present within all of the captured training images (Col 2, Lines 26-32, the computer system may optionally receive the extracted line segments. Then, the computer system may determine orientations for the line segments. Moreover, the computer system may identify one or more subsets of the line segments, where a given subset includes k of the line segments that are proximate to a given line segment in the line segments. The examiner interprets that the prior art is capable of determining a subset of line segments of the original line segment in the image.)
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Tombari to Mahboob and Vo in order for the visual features of the image to comprise line segments. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to describe objects in images using descriptors based on features associated with edge pixels. (Tombari, Col 1, Lines 15-17)
Claims 4, 5, 11, 12, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mahboob US PG-Pub (US 20190354919 A1) in view of Vo et al. US PG-Pub (US 20200302168 A1) in view of Melikian US Patent (US 7831098 B2, as cited by applicant in IDS filed on 02/05/2020).
Regarding Claim 4, the combination of Mahboob and Vo teaches the device of claim 1, but does not explicitly teach wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image.
Melikian teaches wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image (Col 1, Lines 6-10, the present invention is directed to a system and method for pattern identification of learned image (or learned pattern) in a target image, wherein the learned image and the target image have linear features. The examiner interprets that the learned image is the reference image used to perform matching.); and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image(Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art is performing various transformations in order to align all the features of the learned object and the target object in the image.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to use a reference image to align the visual features of the first and second image. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
Regarding Claim 5, the combination of Mahboob, Vo and Melikian teaches the device of claim 4, wherein the processor is further configured to cease the performance of the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform in response to reaching a predetermined limit selected from a group consisting of: a minimum number of corresponding visual features aligned between the first captured training image and the second captured training image (Melikian, Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; and determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art determines the minimum amount of transformation needed for the line segment of the learned object to match the line segment of the target object); and a maximum number of performances of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform (Melikian, Col 2, Lines 55-61, determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that once the learned object and the target object are the same in size, the object matching has reached its maximum and the performance of the transformation ceases).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to cease the transformation of the visual features once a minimum and maximum threshold has been reached. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
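For illustration only, the limitations of claims 4 and 5 (transforming the features of a second image to maximize the number aligned with a reference image, and ceasing on a predetermined limit) can be sketched as a search over candidate affine transforms. This is a minimal sketch under stated assumptions and is neither Melikian's method nor the applicant's: the function names, the index-based feature correspondence, the distance tolerance and the externally supplied list of candidate transforms are all hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]
# Affine transform (a, b, tx, c, d, ty): x' = a*x + b*y + tx, y' = c*x + d*y + ty.
# Rotation, translation, scaling and shear are all special cases of this form.
Affine = Tuple[float, float, float, float, float, float]
IDENTITY: Affine = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)


def apply_affine(t: Affine, p: Point) -> Point:
    a, b, tx, c, d, ty = t
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def count_aligned(reference: Sequence[Point], moved: Sequence[Point], tol: float) -> int:
    """Count features lying within tol of their counterpart in the reference image."""
    return sum(1 for r, m in zip(reference, moved) if math.dist(r, m) <= tol)


def align(reference: Sequence[Point], second: Sequence[Point],
          candidates: Iterable[Affine], min_aligned: int, max_tries: int,
          tol: float = 1.0) -> Tuple[Affine, int, int]:
    """Try candidate transforms on the second image's features, keeping the best.

    Stops on either limit: enough features aligned, or too many attempts.
    Returns (best transform, aligned feature count, number of attempts).
    """
    best_t, best_count, tries = IDENTITY, count_aligned(reference, second, tol), 0
    for t in candidates:
        if best_count >= min_aligned or tries >= max_tries:
            break
        tries += 1
        moved = [apply_affine(t, p) for p in second]
        n = count_aligned(reference, moved, tol)
        if n > best_count:
            best_t, best_count = t, n
    return best_t, best_count, tries
```

A practical system would generate candidate transforms from feature matches rather than from a fixed list; the list here only keeps the two stopping conditions easy to see.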
Regarding Claim 11, the combination of Mahboob and Vo teaches the machine vision system of claim 8, but does not explicitly teach wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image.
Melikian teaches wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image (Col 1, Lines 6-10, the present invention is directed to a system and method for pattern identification of learned image (or learned pattern) in a target image, wherein the learned image and the target image have linear features. The examiner interprets that the learned image is the reference image used to perform matching.); and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image(Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art is performing various transformations in order to align all the features of the learned object and the target object in the image.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to use a reference image to align the visual features of the first and second image. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
Regarding Claim 12, the combination of Mahboob, Vo and Melikian teaches the machine vision system of claim 11, wherein the processor is further configured to cease the performance of the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform in response to reaching a predetermined limit selected from a group consisting of: a minimum number of corresponding visual features aligned between the first captured training image and the second captured training image (Melikian, Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; and determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art determines the minimum amount of transformation needed for the line segment of the learned object to match the line segment of the target object); and a maximum number of performances of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform (Melikian, Col 2, Lines 55-61, determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that once the learned object and the target object are the same in size, the object matching has reached its maximum and the performance of the transformation ceases).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to cease the transformation of the visual features once a minimum and maximum threshold has been reached. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
Regarding Claim 17, the combination of Mahboob and Vo teaches the method of claim 15, but does not explicitly teach wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image.
Melikian teaches wherein the performance of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform comprises a performance, by the processor, of operations comprising: select a first captured training image of the set of captured training images to serve as a reference image (Col 1, Lines 6-10, the present invention is directed to a system and method for pattern identification of learned image (or learned pattern) in a target image, wherein the learned image and the target image have linear features. The examiner interprets that the learned image is the reference image used to perform matching.); and perform the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform on the multiple visual features and ROI of a second captured training image of the set of captured training images to maximize a number of visual features of the multiple of visual features present within the second captured training image that are aligned with corresponding visual features of the multiple visual features present within the first captured training image(Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art is performing various transformations in order to align all the features of the learned object and the target object in the image.).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to use a reference image to align the visual features of the first and second image. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
Regarding Claim 18, the combination of Mahboob, Vo and Melikian teaches the method of claim 17, wherein the processor is further configured to cease the performance of the at least one of a rotational transform, a translational transform, a scaling transform or a shear transform in response to reaching a predetermined limit selected from a group consisting of: a minimum number of corresponding visual features aligned between the first captured training image and the second captured training image (Melikian, Col 2, Lines 52-61, determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object; and determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that the prior art determines the minimum amount of transformation needed for the line segment of the learned object to match the line segment of the target object); and a maximum number of performances of at least one of a rotational transform, a translational transform, a scaling transform or a shear transform (Melikian, Col 2, Lines 55-61, determining if the learned object matches the target object based at least in part on the step of determining the amount of translation, rotation, and scaling needed to transform the line segment of the learned object to have one or more lines substantially the same size as lines on the target object. The examiner interprets that once the learned object and the target object are the same in size, the object matching has reached its maximum and the performance of the transformation ceases).
It would have been obvious at the time of filing to one of ordinary skill in the art to add the teaching of Melikian to Mahboob and Vo in order to cease the transformation of the visual features once a minimum and maximum threshold has been reached. One skilled in the art would have been motivated to modify Mahboob and Vo in this manner in order to determine if the learned object matches the target image. (Melikian, Col 3, Lines 60-62)
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG whose telephone number is (571)272-4344.  The examiner can normally be reached on Monday-Friday 8-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire X. Wang can be reached on (571) 270-1051.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/HAN HOANG/Examiner, Art Unit 2663
/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663