DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 19 July 2022 has been entered.

Response to Amendment
Claims 1 and 9 have been amended.  Claims 1-17 are currently pending and have been considered below.

Response to Arguments
Applicant’s arguments filed 19 July 2022 with respect to claims 1-17 have been carefully considered but are moot in view of the new grounds of rejection necessitated by Applicant’s amendments.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 2, 5, 9, 10, 13 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Tonioni, Alessio, Eugenio Serra, and Luigi Di Stefano. "A deep learning pipeline for product recognition on store shelves." 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS). IEEE, 2018, hereinafter, “Tonioni”, in view of Liu et al., U.S. Publication No. 2020/0293830, hereinafter, “Liu”, and further in view of Ba, Jimmy, Volodymyr Mnih, and Koray Kavukcuoglu. "Multiple object recognition with visual attention." arXiv preprint arXiv:1412.7755v2 (2015), hereinafter, “Ba”.

As per claim 1, Tonioni discloses a method for identifying an item, comprising: 
acquiring an item image of a to-be-identified item (Tonioni, page 25, Introduction, Query images for product recognition are taken in the store with cheap equipment (e.g., a smartphone)); 
setting initial position coordinates of the to-be-identified item on the item image (Tonioni, page 25, Fig. 1: Illustration of query images (a) and reference images (b) for the product recognition task. Bounding boxes overlapped to (a) shows correct detection colored according to recognized class; Tonioni, page 25, Introduction, Given a shelf image, we first perform a class-agnostic object detection to extract region proposals enclosing individual items; Tonioni, page 26, A. Detection, Given a query image featuring several items displayed in a store shelf, the first stage of our pipeline aims at obtaining a set of bounding boxes to be used as region proposals); and 
executing following identifying: 
inputting the item image and the initial position coordinates into a pre-trained module to output an item feature of the to-be-identified item (Tonioni, page 26, Section III. Proposed Approach, Fig. 2 shows an overview of our proposed pipeline. In the first step ... a Detector extracts region proposals from the query image. Then ... each region proposal is encoded by an Embedder into ad-hoc image descriptors); 
inputting the item feature into a pre-trained machine learning network to output a predicted category and predicted position coordinates of the to-be-identified item (Tonioni, page 27, B. Recognition, Starting from the candidate regions delivered by the Detector, we perform recognition by means of K-NN similarity search between a global descriptor computed on each candidate region and a database of similar descriptors (one for each product); Tonioni, page 28, C. Refinement, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly);
determining whether a preset condition is satisfied (Tonioni, page 28, C. Refinement, The aim of the final refinement is to remove false detections and re-rank the first K-NN found in the previous step in order to fix possible recognition mistakes … both the Query and each of the first K-NN reference images are described by a set of local features F1, F2, ..., Fk, each consisting in a spatial position (xi, yi) within the image and a compact descriptor fi. Given these features, we look for similarities between descriptors extracted from query and reference images, to compute a set of matches. Matches are then weighted based on the distance in the descriptor space, d(fi; fj) and a geometric consistency criterion relying on the unit-norm vector from the spatial location of a feature to the image center ... Finally, the first K-NN are re-ranked according to the sum of the weights Wij computed for the matches between the local features ... A simple additional refinement step consists in filtering out wrong recognitions by the distance ratio criterion (i.e., by thresholding the ratio of the distances in feature space between the query descriptor and its 1-NN and 2-NN). If the ratio is above a threshold, the recognition is deemed as ambiguous and discarded); and 
determining, in response to the preset condition being satisfied, a predicted category of the to-be-identified item outputted by the machine learning network a last time for use as a final category of the to-be-identified item (Tonioni, page 28, C. Refinement, Finally, we propose a re-ranking and filtering method specific to the grocery domain where ... products belonging to the same macro category are typically displayed close one to another on the shelf. In particular, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly). 
Tonioni does not explicitly disclose the following limitations as further recited; however, Liu discloses:
inputting the item feature into a pre-trained long short-term memory network to output a predicted category (Liu, ¶0017, the first sub-model uses images of a detected article that are obtained at different angles and generated in time order as inputs, to obtain feature processing results of the images, and outputs the feature processing results to the second sub-model; and the second sub-model performs time series analysis on the feature processing results of the images to determine a damage detection result; Liu, ¶0020, The first sub-model can be any machine learning model, and an advantageous result usually can be achieved by using an algorithm that is suitable for feature extraction and processing, for example, a deep convolutional neural network (DCNN). The second sub-model can be any machine learning model that can perform time series analysis, for example, a recurrent neural network (RNN), a long short-term memory (LSTM) network; Liu, ¶0042, the first sub-model performs feature extraction; Liu, ¶0063, The deep convolutional neural network sub-model first performs feature extraction on each image).
Tonioni and Liu are analogous art because both are concerned with image processing and recognition via extraction of features from images, in which the extracted features are input into a machine learning model in order to output a predicted category.  It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to substitute the LSTM of Liu for the machine learning algorithm of Tonioni in order to provide an alternate means to output the predicted category of the features extracted from the input image (Tonioni, page 28, C. Refinement; Liu, ¶0017). 
Tonioni and Liu do not explicitly disclose the following limitations as further recited; however, Ba discloses:
selecting one point on the item image, and using coordinates of the selected point in a pre-established rectangular coordinate system on the item image as initial position coordinates of the to-be-identified item on the item image (Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln ... The model can be broken down into a number of sub-components, each mapping some input into a vector output ... Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, xn and its location tuple ln , where ln = (xn, yn), as input and outputs a vector gn); and 
inputting the item image and the initial position coordinates into a pre-trained attention module to output an item feature of the to-be-identified item, wherein the attention module increases a weight of an area centered on the selected point indicated by the initial position coordinates on the item image, such that an identification focus is concentrated on the area centered on the point indicated by the initial position coordinates; inputting the item feature into a pre-trained long short-term memory network (Ba, Abstract, an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image; Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln ... The model can be broken down into a number of sub-components, each mapping some input into a vector output ... Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, xn and its location tuple ln , where ln = (xn, yn), as input and outputs a vector gn. The job of the glimpse network is to extract a set of useful features from location ln of the raw visual input. We will use Gimage (xn | Wimage) to denote the output vector from function Gimage() that takes an image patch xn and is parameterized by weights Wimage … Recurrent network: The glimpse feature vector gn from the glimpse network is supplied as input to the recurrent network at each time step ... We use Long-Short-Term Memory units ... Emission network: The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network … Classification network: The classification network outputs a prediction for the class label y based on the final feature vector).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Tonioni and Liu to include the attention model algorithm as taught by Ba in order to provide an alternate means to recognize multiple items or objects in images while using fewer parameters (Ba, Abstract). 
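For illustration only, the control flow recited in claim 1, as mapped above, can be sketched as a loop: weight the image around a point, update a recurrent state, emit a category and a predicted position, and stop once a preset condition holds. The sketch below is not the applicant's implementation and not Ba's model. The Gaussian weighting, the running-average state standing in for an LSTM, the iteration-count stopping condition, and the names `attend`, `identify` and `preset_iterations` are all assumptions made for the example.

```python
import math


def attend(image, point, sigma=2.0):
    """Toy attention step (assumed): weight pixels by a Gaussian centered on `point`.

    Returns a scalar feature (weighted mean intensity) and a predicted position
    (weighted centroid of the image intensities).
    """
    px, py = point
    weighted_sum = 0.0
    weight_total = 0.0
    cx = cy = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            w = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
            weight_total += w
            weighted_sum += w * value
            cx += w * value * x
            cy += w * value * y
    if weighted_sum == 0.0:
        # Nothing visible under the weighting: keep the current position.
        return 0.0, point
    return weighted_sum / weight_total, (cx / weighted_sum, cy / weighted_sum)


def identify(image, start, preset_iterations=3):
    """Repeat the identifying step until the preset iteration count is reached."""
    point, state, iterations = start, 0.0, 0
    while True:
        iterations += 1
        feature, predicted_point = attend(image, point)
        # Running average as a stand-in for a recurrent (LSTM) state update.
        state = 0.5 * state + 0.5 * feature
        category = "item" if state > 0.0 else "empty"
        if iterations >= preset_iterations:
            # Preset condition satisfied: the last predicted category is final.
            return category, predicted_point, iterations
        # Otherwise the predicted position becomes the next initial position.
        point = predicted_point
```

With a 9x9 image containing a single bright pixel at (6, 6) and a starting point of (4, 4), the predicted position moves onto the bright pixel after the first pass and the loop returns after the preset number of passes.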

As per claim 2, Tonioni, Liu and Ba disclose the method according to claim 1, wherein the method further comprises: using, in response to determining the preset condition not being satisfied, the predicted position coordinates of the to-be-identified item as the initial position coordinates, and continuing executing the identifying (Tonioni, page 28, C. Refinement, The aim of the final refinement is to remove false detections and re-rank the first K-NN found in the previous step in order to fix possible recognition mistakes … re-ranking of the first K-NN may be achieved by looking at peculiar image details that may ... be crucial to differentiate a product from others looking very similar. Thus, both the Query and each of the first K-NN reference images are described by a set of local features F1, F2, ..., Fk, each consisting in a spatial position (xi, yi) within the image and a compact descriptor fi. Given these features, we look for similarities between descriptors extracted from query and reference images, to compute a set of matches; Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln. The model uses the observation to update its internal state and outputs the location ln+1 to process at the next time-step … Emission network: The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network). 

As per claim 5, Tonioni, Liu and Ba disclose the method according to claim 1, wherein the preset condition comprises: a number of iterations of executing the identifying being greater than or equal to a preset number of iterations (Tonioni, page 28, B. Recognition, when a query image is processed, the same embedding is computed on each of the candidate regions, ipq, cropped from the query image, iq, so to get E(ipq). Finally, for each ipq we compute the distance in the embedding space with respect to each reference descriptor, denoted as d(E(ipq), E(ir)), in order to sift-out the first K-NN of E(ipq) in the reference database; Ba, page 5, 3.2 Multi-object / Sequential Classification as a Visual Attention Task, The deep recurrent attention model then learns to predict one object at a time as it explores the image in a sequential manner. We can utilize a simple fixed number of glimpses for each target in the sequence. In addition, a new class label for the “end-of-sequence” symbol is included to deal with variable numbers of objects in an image. We can stop the recurrent attention model once a terminal symbol is predicted). 

As per claim 9, Tonioni discloses an apparatus for identifying an item, comprising: 
acquiring an item image of a to-be-identified item (Tonioni, page 25, Introduction, Query images for product recognition are taken in the store with cheap equipment (e.g., a smartphone)); 
setting initial position coordinates of the to-be-identified item on the item image (Tonioni, page 25, Fig. 1: Illustration of query images (a) and reference images (b) for the product recognition task. Bounding boxes overlapped to (a) shows correct detection colored according to recognized class; Tonioni, page 25, Introduction, Given a shelf image, we first perform a class-agnostic object detection to extract region proposals enclosing individual items; Tonioni, page 26, A. Detection, Given a query image featuring several items displayed in a store shelf, the first stage of our pipeline aims at obtaining a set of bounding boxes to be used as region proposals); and 
executing following identifying: 
inputting the item image and the initial position coordinates into a pre-trained module to output an item feature of the to-be-identified item (Tonioni, page 26, Section III. Proposed Approach, Fig. 2 shows an overview of our proposed pipeline. In the first step ... a Detector extracts region proposals from the query image. Then ... each region proposal is encoded by an Embedder into ad-hoc image descriptors); 
inputting the item feature into a pre-trained machine learning network to output a predicted category and predicted position coordinates of the to-be-identified item (Tonioni, page 27, B. Recognition, Starting from the candidate regions delivered by the Detector, we perform recognition by means of K-NN similarity search between a global descriptor computed on each candidate region and a database of similar descriptors (one for each product); Tonioni, page 28, C. Refinement, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly); 
determining whether a preset condition is satisfied (Tonioni, page 28, C. Refinement, The aim of the final refinement is to remove false detections and re-rank the first K-NN found in the previous step in order to fix possible recognition mistakes … both the Query and each of the first K-NN reference images are described by a set of local features F1, F2, ..., Fk, each consisting in a spatial position (xi, yi) within the image and a compact descriptor fi. Given these features, we look for similarities between descriptors extracted from query and reference images, to compute a set of matches. Matches are then weighted based on the distance in the descriptor space, d(fi; fj) and a geometric consistency criterion relying on the unit-norm vector from the spatial location of a feature to the image center ... Finally, the first K-NN are re-ranked according to the sum of the weights Wij computed for the matches between the local features ... A simple additional refinement step consists in filtering out wrong recognitions by the distance ratio criterion (i.e., by thresholding the ratio of the distances in feature space between the query descriptor and its 1-NN and 2-NN). If the ratio is above a threshold, d, the recognition is deemed as ambiguous and discarded); and 
determining, in response to the preset condition being satisfied, a predicted category of the to-be-identified item outputted by the machine learning network a last time for use as a final category of the to-be-identified item (Tonioni, page 28, C. Refinement, Finally, we propose a re-ranking and filtering method specific to the grocery domain where ... products belonging to the same macro category are typically displayed close one to another on the shelf. In particular, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly). 
Tonioni does not explicitly disclose the following limitations as further recited; however, Liu discloses:
at least one processor; and a memory storing instructions, wherein the instructions when executed by the at least one processor, cause the at least one processor to perform operations (Liu, ¶0071, a computing device includes one or more central processing units (CPUs), input/output interfaces, network interfaces, and memories), the operations comprising:
inputting the item feature into a pre-trained long short-term memory network to output a predicted category (Liu, ¶0017, the first sub-model uses images of a detected article that are obtained at different angles and generated in time order as inputs, to obtain feature processing results of the images, and outputs the feature processing results to the second sub-model; and the second sub-model performs time series analysis on the feature processing results of the images to determine a damage detection result; Liu, ¶0020, The first sub-model can be any machine learning model, and an advantageous result usually can be achieved by using an algorithm that is suitable for feature extraction and processing, for example, a deep convolutional neural network (DCNN). The second sub-model can be any machine learning model that can perform time series analysis, for example, a recurrent neural network (RNN), a long short-term memory (LSTM) network; Liu, ¶0042, the first sub-model performs feature extraction; Liu, ¶0063, The deep convolutional neural network sub-model first performs feature extraction on each image).
Tonioni and Liu are analogous art because both are concerned with image processing and recognition via extraction of features from images, in which the extracted features are input into a machine learning model in order to output a predicted category.  It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to substitute the LSTM of Liu for the machine learning algorithm of Tonioni in order to provide an alternate means to output the predicted category of the features extracted from the input image (Tonioni, page 28, C. Refinement; Liu, ¶0017).
Tonioni and Liu do not explicitly disclose the following limitations as further recited; however, Ba discloses:
selecting one point on the item image, and using coordinates of the selected point in a pre-established rectangular coordinate system on the item image as initial position coordinates of the to-be-identified item on the item image (Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln ... The model can be broken down into a number of sub-components, each mapping some input into a vector output ... Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, xn and its location tuple ln , where ln = (xn, yn), as input and outputs a vector gn); and
inputting the item image and the initial position coordinates into a pre-trained attention module to output an item feature of the to-be-identified item, wherein the attention module increases a weight of an area centered on the point indicated by the initial position coordinates on the item image, such that an identification focus is concentrated on the area centered on the point indicated by the initial position coordinates; inputting the item feature into a pre-trained long short-term memory network (Ba, Abstract, an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image; Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln ... The model can be broken down into a number of sub-components, each mapping some input into a vector output ... Glimpse network: The glimpse network is a non-linear function that receives the current input image patch, or glimpse, xn and its location tuple ln , where ln = (xn, yn), as input and outputs a vector gn. The job of the glimpse network is to extract a set of useful features from location ln of the raw visual input. We will use Gimage (xn | Wimage) to denote the output vector from function Gimage() that takes an image patch xn and is parameterized by weights Wimage … Recurrent network: The glimpse feature vector gn from the glimpse network is supplied as input to the recurrent network at each time step ... We use Long-Short-Term Memory units ... Emission network: The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network … Classification network: The classification network outputs a prediction for the class label y based on the final feature vector).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Tonioni and Liu to include the attention model algorithm as taught by Ba in order to provide an alternate means to recognize multiple items or objects in images while using fewer parameters (Ba, Abstract).
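For illustration only, the distance ratio criterion quoted from Tonioni for the "preset condition" limitation (discarding a recognition when the ratio of the query's distance to its 1-NN over its distance to its 2-NN is above a threshold) can be sketched as follows. The Euclidean metric, the threshold value of 0.8, the function name `ratio_test` and the sample descriptors are assumptions made for the example and are not taken from Tonioni.

```python
import math


def ratio_test(query, references, max_ratio=0.8):
    """Return the 1-NN label, or None when the match is ambiguous.

    `references` maps a product label to its descriptor; at least two entries
    are required so that both a 1-NN and a 2-NN exist.
    """
    if len(references) < 2:
        raise ValueError("need at least two reference descriptors")
    ranked = sorted(
        (math.dist(query, descriptor), label)
        for label, descriptor in references.items()
    )
    (d1, label1), (d2, _) = ranked[0], ranked[1]
    if d2 == 0.0:
        # Two references coincide with the query: cannot disambiguate.
        return None
    # A ratio above the threshold means the two nearest neighbours are about
    # equally close, so the recognition is discarded.
    return label1 if d1 / d2 <= max_ratio else None


references = {"cola": (0.0, 0.0), "soda": (10.0, 0.0)}
clear_match = ratio_test((1.0, 0.0), references)      # 1-NN is much closer
ambiguous_match = ratio_test((5.0, 0.0), references)  # equidistant
```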

As per claim 10, Tonioni, Liu and Ba disclose the apparatus according to claim 9, wherein the operations further comprise: using, in response to determining the preset condition not being satisfied, the predicted position coordinates of the to-be-identified item as the initial position coordinates, and continuing executing the identifying (Tonioni, page 28, C. Refinement, The aim of the final refinement is to remove false detections and re-rank the first K-NN found in the previous step in order to fix possible recognition mistakes … re-ranking of the first K-NN may be achieved by looking at peculiar image details that may ... be crucial to differentiate a product from others looking very similar. Thus, both the Query and each of the first K-NN reference images are described by a set of local features F1, F2, ..., Fk, each consisting in a spatial position (xi, yi) within the image and a compact descriptor fi. Given these features, we look for similarities between descriptors extracted from query and reference images, to compute a set of matches; Ba, pages 2-3, 3 Deep Recurrent Visual Attention Model, At each step n, the model receives a location ln along with a glimpse observation xn taken at location ln. The model uses the observation to update its internal state and outputs the location ln+1 to process at the next time-step … Emission network: The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network). 

As per claim 13, Tonioni, Liu and Ba disclose the apparatus according to claim 9, wherein the preset condition comprises: a number of iterations of executing the identifying being greater than or equal to a preset number of iterations (Tonioni, page 28, B. Recognition, when a query image is processed, the same embedding is computed on each of the candidate regions, ipq, cropped from the query image, iq, so to get E(ipq). Finally, for each ipq we compute the distance in the embedding space with respect to each reference descriptor, denoted as d(E(ipq), E(ir)), in order to sift-out the first K-NN of E(ipq) in the reference database; Ba, page 5, 3.2 Multi-object / Sequential Classification as a Visual Attention Task, The deep recurrent attention model then learns to predict one object at a time as it explores the image in a sequential manner. We can utilize a simple fixed number of glimpses for each target in the sequence. In addition, a new class label for the “end-of-sequence” symbol is included to deal with variable numbers of objects in an image. We can stop the recurrent attention model once a terminal symbol is predicted). 

As per claim 17, Tonioni, Liu and Ba disclose a non-transitory computer readable medium, storing a computer program thereon, wherein the computer program, when executed by a processor, implements the method according to claim 1 (Liu, ¶0073, The computer storage medium can be configured to store information that can be accessed by the computing device).


Claims 3, 4, 6-8, 11, 12 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Tonioni, Alessio, Eugenio Serra, and Luigi Di Stefano. "A deep learning pipeline for product recognition on store shelves." 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS). IEEE, 2018, hereinafter, “Tonioni”, in view of Liu et al., U.S. Publication No. 2020/0293830, hereinafter, “Liu”, in view of Ba, Jimmy, Volodymyr Mnih, and Koray Kavukcuoglu. "Multiple object recognition with visual attention." arXiv preprint arXiv:1412.7755v2 (2015), hereinafter, “Ba”, as applied to claims 1, 5, 9 and 13 above, and further in view of Dugar et al., U.S. Publication No. 2021/0012272, hereinafter, “Dugar”.

As per claim 3, Tonioni, Liu and Ba disclose the method according to claim 1, but do not explicitly disclose the following limitations as further recited; however, Dugar discloses wherein the acquiring an item image of a to-be-identified item comprises: 
acquiring a shelf image before a user takes or places the to-be-identified item from or on a shelf, and a shelf image after the user takes or places the to-be-identified item from or on the shelf (Dugar, ¶0040, FIG. 3 shows a ground truth (GT) image 300 of a retail facility shelf location, such as may be stored in planogram 122 … Locations for each of the items is annotated on GT image 300, for example showing locations (1,1) through (3,7); Dugar, ¶0041, FIG. 4 shows a real time (RT) image 400 corresponding to GT image 300 that is collected for the anomaly detection. RT image 400 has an annotated empty location 402. In some examples, RT image 400 is captured by CV component 126); and 
comparing the shelf image before the user takes or places the to-be-identified item from or on the shelf, and the shelf image after the user takes or places the to-be-identified item from or on the shelf, to segment the item image of the to-be-identified item (Dugar, ¶0041, initial anomaly detection is performed that identifies any overall anomalous behavior using a comparison of RT image 400 with GT image 300 … The image embedding is extracted from the current planogram image for which the anomalous condition (if present) is to be detected. Some examples use transfer learning with a pre-trained CNN-based architecture in order to compare the image embedding between RT image 400 with GT image 300. If there is a sufficient difference from majority of the planogram images (e.g., GT image 300 and other planogram images corresponding to the same shelf unit location), such as a difference exceeding a threshold, an overall anomalous indicator value is set; Dugar, ¶0042, This permits detection of first level anomalies such as empty (blank) shelf space).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to include the comparison of shelf images as taught by Dugar in the system of Tonioni, Liu and Ba in order to provide a means to detect anomalous conditions such as empty shelves, broken items, overcrowding and items in incorrect locations (Dugar, ¶0036).
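For illustration only, the embedding comparison described in the cited passage of Dugar (flagging a difference between the ground truth image and the real time image when it exceeds a threshold) can be sketched as follows. Dugar is not quoted as naming a particular distance measure, so the cosine distance, the threshold value of 0.2 and the names `cosine_distance` and `is_anomalous` are assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def is_anomalous(gt_embedding, rt_embedding, threshold=0.2):
    """Flag the real time image when its embedding differs enough from ground truth."""
    return cosine_distance(gt_embedding, rt_embedding) > threshold


unchanged = is_anomalous([1.0, 0.0], [1.0, 0.0])  # identical embeddings
changed = is_anomalous([1.0, 0.0], [0.0, 1.0])    # orthogonal embeddings
```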

As per claim 4, Tonioni, Liu, Ba and Dugar disclose the method according to claim 3, wherein the comparing the shelf image before the user takes or places the to-be-identified item from or on the shelf, and the shelf image after the user takes or places the to-be-identified item from or on the shelf, to segment the item image of the to-be-identified item comprises: 
inputting the shelf image before the user takes or places the to-be-identified item from or on the shelf, and the shelf image after the user takes or places the to-be-identified item from or on the shelf into a pre-trained target detection model, to output position information of the to-be-identified item (Dugar, ¶0041, initial anomaly detection is performed that identifies any overall anomalous behavior using a comparison of RT image 400 with GT image 300 … The image embedding is extracted from the current planogram image for which the anomalous condition (if present) is to be detected. Some examples use transfer learning with a pre-trained CNN-based architecture in order to compare the image embedding between RT image 400 with GT image 300); and 
segmenting the item image of the to-be-identified item from the shelf image before the user takes or places the to-be-identified item from or on the shelf, or the shelf image after the user takes or places the to-be-identified item from or on the shelf based on the position information of the to-be-identified item (Dugar, Figure 3, ground truth image, Figure 4, real time image, item 402, annotated empty location; Dugar, ¶0045, FIG. 6 shows a detected edge image 600 corresponding to RT image 400 … In some examples, a neural net architecture is created and deployed to identify crossing points in an image (e.g., RT image 400), which will become aid in marking boundaries around the items … crossing point detection algorithm assists with segmenting the planogram image (e.g., RT image 400) into various items). 
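For illustration only, segmenting an item image from a shelf image based on position information may be sketched as a bounding-box crop. The sketch assumes the position information is a rectangle given as (left, top, right, bottom) pixel coordinates and that the image is a row-major list of pixel rows; the function name crop_item is hypothetical and is not taken from the claims or from Dugar.

```python
def crop_item(image, box):
    """Crop a rectangular region from a row-major image (a list of rows).

    `box` is (left, top, right, bottom) in pixel coordinates, with right and
    bottom exclusive. Coordinates are clamped to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        return []  # empty or degenerate box
    return [row[left:right] for row in image[top:bottom]]
```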

As per claim 6, Tonioni, Liu and Ba disclose the method according to claim 5, wherein the preset number of iterations is determined by: 
inputting the sample item feature into the long short-term memory network, to output a predicted sample category and predicted sample position coordinates of the sample item (Tonioni, page 27, B. Recognition, Starting from the candidate regions delivered by the Detector, we perform recognition by means of K-NN similarity search between a global descriptor computed on each candidate region and a database of similar descriptors (one for each product); Tonioni, page 28, C. Refinement, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly; Liu, ¶0017, the first sub-model uses images of a detected article that are obtained at different angles and generated in time order as inputs, to obtain feature processing results of the images, and outputs the feature processing results to the second sub-model; and the second sub-model performs time series analysis on the feature processing results of the images to determine a damage detection result; Liu, ¶0020, The first sub-model can be any machine learning model, and an advantageous result usually can be achieved by using an algorithm that is suitable for feature extraction and processing, for example, a deep convolutional neural network (DCNN). The second sub-model can be any machine learning model that can perform time series analysis, for example, a recurrent neural network (RNN), a long short-term memory (LSTM) network); 
determining whether a duration of executing the determining exceeds a preset duration (Tonioni, page 28, C. Refinement, A simple additional refinement step consists in filtering out wrong recognitions by the distance ratio criterion (i.e., by thresholding the ratio of the distances in feature space between the query descriptor and its 1-NN and 2-NN). If the ratio is above a threshold, d, the recognition is deemed as ambiguous and discarded); and
statisticizing, in response to the identification accuracy rate being not lower than the preset accuracy rate, a number of iterations of the determining, for use as the preset number of iterations (Tonioni, page 28, B. Recognition, when a query image is processed, the same embedding is computed on each of the candidate regions, ipq, cropped from the query image, iq, so to get E(ipq). Finally, for each ipq we compute the distance in the embedding space with respect to each reference descriptor, denoted as d(E(ipq), E(ir)), in order to sift-out the first K-NN of E(ipq) in the reference database). 
Tonioni, Liu and Ba do not explicitly disclose the following limitations as further recited; however, Dugar discloses:
acquiring a sample, wherein the sample includes a sample item image and a sample category tag of a sample item (Dugar, ¶0004, receive a real time (RT) image of a shelf unit corresponding to at least a first portion of a planogram; detect, within the RT image, item boundaries for a plurality of items on the shelf unit and tag boundaries for a plurality of tags associated with the shelf unit); 
setting initial sample position coordinates of the sample item on the sample item image (Dugar, ¶0004, receive a real time (RT) image of a shelf unit corresponding to at least a first portion of a planogram; detect, within the RT image, item boundaries for a plurality of items on the shelf unit); and executing following determining: 
inputting the sample item image and the initial sample position coordinates into the attention module, to output a sample item feature of the sample item (Dugar, ¶0032, An attribute extraction component 128 is operable to extract attributes, from RT image 400, for at least one of tags 108a-108h and at least of items 106a-106h. Some examples of attribute extraction component 128 use long short-term memory (LSTM) processes, Tesseract LSTM optical character recognition (OCR) processes, and convolutional neural networks (CNNs)); and 
determining an identification accuracy rate based on the predicted sample category and the sample category tag, in response to the duration of executing the determining failing to exceed the preset duration (Dugar, ¶0041, examples use transfer learning with a pre-trained CNN-based architecture in order to compare the image embedding between RT image 400 with GT image 300. If there is a sufficient difference from majority of the planogram images (e.g., GT image 300 and other planogram images corresponding to the same shelf unit location), such as a difference exceeding a threshold, an overall anomalous indicator value is set); 
determining whether the identification accuracy rate is not lower than a preset accuracy rate (Dugar, ¶0041, examples use transfer learning with a pre-trained CNN-based architecture in order to compare the image embedding between RT image 400 with GT image 300. If there is a sufficient difference from majority of the planogram images (e.g., GT image 300 and other planogram images corresponding to the same shelf unit location), such as a difference exceeding a threshold, an overall anomalous indicator value is set).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to modify the teachings of Tonioni, Liu and Ba to include the sample image and sample tag as taught by Dugar in order to provide an additional means to detect anomalous conditions such as empty shelves, broken items, overcrowding and items in incorrect locations and to validate the determined category via the correspondence between the image and the tag (Dugar, ¶0029; ¶0036).
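For illustration only, the K-NN similarity search and distance ratio criterion cited above from Tonioni (pages 27-28) may be sketched as follows. The sketch assumes Euclidean distance between descriptors and a caller-supplied ratio threshold; the names euclidean and recognize, and the sample labels used with them, are hypothetical and do not come from Tonioni.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(query, references, ratio_threshold):
    """Return the label of the nearest reference descriptor, or None when the
    recognition is ambiguous under the distance ratio criterion.

    `references` is a list of (label, descriptor) pairs.
    """
    if len(references) < 2:
        raise ValueError("the ratio test needs at least two references")
    ranked = sorted((euclidean(query, desc), label) for label, desc in references)
    (d1, label1), (d2, _) = ranked[0], ranked[1]
    if d2 == 0.0:
        return None  # the two nearest references both coincide with the query
    if d1 / d2 > ratio_threshold:
        return None  # 1-NN and 2-NN are too close to tell apart: discard
    return label1
```

A ratio near 1 means the first and second nearest neighbours are almost equally distant from the query, which is the condition under which the cited passage discards the recognition.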

As per claim 7, Tonioni, Liu, Ba and Dugar disclose the method according to claim 6, wherein the determining the preset number of iterations further comprises: using, in response to determining the identification accuracy rate being lower than the preset accuracy rate, the predicted sample position coordinates as the initial sample position coordinates, and continuing executing the determining (Tonioni, page 27, B. Recognition, Starting from the candidate regions delivered by the Detector, we perform recognition by means of K-NN similarity search between a global descriptor computed on each candidate region and a database of similar descriptors (one for each product); Tonioni, page 28, C. Refinement, given the candidate regions extracted from the query image and their corresponding sets of K-NN, we consider the 1-NN of the region proposals extracted with a high confidence (> 0.1) by the Detector in order to find the main macro category of the image. Then, in case the majority of detections votes for the same macro category, it is safe to assume that the pictured shelf contains almost exclusively items of that category thus filter the K-NN for all candidate regions accordingly; Liu, ¶0017, the first sub-model uses images of a detected article that are obtained at different angles and generated in time order as inputs, to obtain feature processing results of the images, and outputs the feature processing results to the second sub-model; and the second sub-model performs time series analysis on the feature processing results of the images to determine a damage detection result; Liu, ¶0020, The first sub-model can be any machine learning model, and an advantageous result usually can be achieved by using an algorithm that is suitable for feature extraction and processing, for example, a deep convolutional neural network (DCNN). The second sub-model can be any machine learning model that can perform time series analysis, for example, a recurrent neural network (RNN), a long short-term memory (LSTM) network). 

As per claim 8, Tonioni, Liu, Ba and Dugar disclose the method according to claim 7, wherein the determining the preset number of iterations further comprises: statisticizing, in response to determining the duration of executing the determining exceeding the preset duration, a number of iterations of executing the determining, for use as the preset number of iterations (Tonioni, page 28, B. Recognition, when a query image is processed, the same embedding is computed on each of the candidate regions, ipq, cropped from the query image, iq, so to get E(ipq). Finally, for each ipq we compute the distance in the embedding space with respect to each reference descriptor, denoted as d(E(ipq), E(ir)), in order to sift-out the first K-NN of E(ipq) in the reference database). 
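For illustration only, the procedure recited in claims 6-8 for determining the preset number of iterations may be sketched as the loop below. The sketch is an interpretation of the claim language and not an implementation disclosed by the applicant or by any cited reference. The attention module and the long short-term memory network are represented by caller-supplied callables; the function name determine_preset_iterations, its parameter names, and the injectable clock are hypothetical.

```python
import time


def determine_preset_iterations(samples, initial_coords, attention, lstm,
                                preset_accuracy, preset_duration,
                                clock=time.monotonic):
    """Count iterations of the "determining" step until a stop condition.

    samples:        list of (sample_item_image, sample_category_tag)
    initial_coords: one initial position per sample
    attention:      callable (image, coords) -> sample item feature
    lstm:           callable (feature) -> (predicted_category, predicted_coords)
    """
    if not samples:
        raise ValueError("at least one sample is required")
    start = clock()
    coords = list(initial_coords)
    iterations = 0
    while True:
        iterations += 1
        predictions = [
            lstm(attention(image, c)) for (image, _), c in zip(samples, coords)
        ]
        # Claim 8: stop and report the count once the preset duration is exceeded.
        if clock() - start > preset_duration:
            return iterations
        # Claim 6: compare predicted categories with the sample category tags.
        correct = sum(
            1 for (_, tag), (cat, _) in zip(samples, predictions) if cat == tag
        )
        if correct / len(samples) >= preset_accuracy:
            return iterations
        # Claim 7: feed the predicted coordinates back in as the initial ones.
        coords = [pos for _, pos in predictions]
```

The clock parameter exists only so that the duration check can be exercised deterministically; by default the sketch uses time.monotonic from the Python standard library.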

Regarding claim 11:
The reasoning given above in the rejection of claim 3 applies, mutatis mutandis, to the subject matter of claim 11, which is therefore rejected on the grounds given in the rejection of claim 3.

Regarding claim 12:
The reasoning given above in the rejection of claim 4 applies, mutatis mutandis, to the subject matter of claim 12, which is therefore rejected on the grounds given in the rejection of claim 4.

Regarding claim 14:
The reasoning given above in the rejection of claim 6 applies, mutatis mutandis, to the subject matter of claim 14, which is therefore rejected on the grounds given in the rejection of claim 6.

Regarding claims 15 and 16:
The reasoning given above in the rejections of claims 7 and 8 applies, mutatis mutandis, to the subject matter of claims 15 and 16, respectively, which are therefore rejected on the grounds given in the rejections of claims 7 and 8.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI, whose telephone number is (571) 270-5189. The examiner can normally be reached Monday through Friday, 9:30 AM to 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TRACY MANGIALASCHI/Examiner, Art Unit 2668
/VU LE/Supervisory Patent Examiner, Art Unit 2668