Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office action is in response to the claims filed on 09/17/2019. Claims 11-13, 16, 25-33 were canceled. 
Claims 1-24 are presented for examination.
Information Disclosure Statement
The information disclosure statement (IDS) filed 12/10/2019 is in compliance with the provisions of 37 CFR 1.97 and 1.98. Accordingly, the information disclosure statement is being considered by the examiner.
Priority
The following claimed benefit is acknowledged: the instant application, filed 09/17/2019, claims priority from provisional application 62533535, filed 07/17/2017.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
 The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-8, 10 are rejected under 35 U.S.C. 103 as being unpatentable over Shankar et al. (NPL: Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce– Flipkart Internet Pvt. Ltd., Bengaluru, India- hereinafter, Shankar) in view of Kumar et al. (NPL: Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions- The University of Adelaide and Australian Centre for Robotic Vision- hereinafter, Kumar).
Regarding claim 1, Shankar teaches a computer-implemented method for generating a unified machine learning computing model using a neural network on a data processing apparatus, the method comprising: (Shankar, [Abstract], “In this paper, we present a unified end-to-end approach to build a large scale Visual Search and Recommendation system for ecommerce. Previous works have targeted these problems in isolation. We believe a more effective and elegant solution could be obtained by tackling them together. We propose a unified Deep Convolutional Neural Network architecture, called VisNet 1, to learn embeddings to capture the notion of visual similarity, across several semantic granularities. We demonstrate the superiority of our approach for the task of image retrieval, by comparing against the state-of-the-art on the Exact Street2Shop [14] dataset. We then share the design decisions and trade-offs made while deploying the model to power Visual Recommendations across a catalog of 50M products, supporting 2K queries a second at Flipkart, India’s largest e-commerce company.”)
determining, by the data processing apparatus and for the neural network, respective learning targets for each of a plurality of object verticals, wherein each object vertical defines a distinct category for an object that belongs to the vertical (Shankar, [Sec.1, Figs1-4 and Section 3.1], “…A related problem is that of image-based recommendations or Visual Recommendations. A user interested in buying a particular item from the catalog may want to browse through visually similar items before finalizing the purchase. These could be items with similar colors, patterns and shapes. Traditional recommender systems [18, 25] that are based on collaborative filtering techniques fail to capture such details since they rely only on user click / purchase activity and completely ignore the image content. Further, they suffer from the ‘cold start’ problem - newly introduced products do not have sufficient user activity data for meaningful recommendations. In this paper, we address the problems of both Visual Recommendations (retrieving a ranked list of catalog images similar to another catalog image) and Visual Search (retrieving a ranked list of catalog images similar to a “wild” (user-uploaded) image). the core task, common to both, is quantitative estimation of visual similarity between two images containing fashion items. this is fraught with several challenges, as outlined below. We start with the challenges in visual recommendation, where we deal only with catalog images…Our model, VisNet, is a Convolutional Neural Network (CNN) trained using the triplet based deep ranking paradigm proposed in [33]. It contains a deep CNN modelled after the VGG-16 network [27], coupled with parallel shallow convolution layers in order to capture both high level and low-level image details simultaneously (see Section 3.1). through extensive experiments on the Exact Street2Shop dataset created by Kiapour et al. 
in [14], we demonstrate the superiority of VisNet over previous state-of-the-art. We also present a semi-automatic training data generation methodology that is critical to training VisNet. Initially, the network is trained only on catalog images. Candidate training data is generated programmatically with a set of Basic Image Similarity Scorers from which final training data is selected via human vetting. This network is further fine-tuned for the task of Visual Search…” Examiner’s note, the neural network identifies or ranks similar items from the specific category based on the query image (uploaded by the user), therefore, the item is considered as the object that belong to specific category, the category is considered as the vertical. Shankar further teaches a learning target, as it can be seen at [Section 3.1, Fig.3], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space… We use two different types of triplets, in-class triplets and out of class triplets (see Figure 3 for an example). The in-class triplets help in teaching the network to pay attention to nuanced differences, like the thickness of stripes, using hard negatives. These are images that can be deemed similar to the query image in a broad sense but are less similar to the query image when compared to the positive image due to fine-grained distinctions. This enables the network to learn robust embeddings that are sensitive to subtle distinctions in colors and patterns. 
The out-of-class triplets contain easy negatives and help in teaching the network to make coarse-grained distinctions. During training, the images in the triplet are fed to 3 sub-networks with shared weights (see Figure 2a). Each sub-network generates an embedding or feature vector, thus 3 embedding vectors, q, p and n are generated - and fed into a Hinge Loss function:

[Image: media_image1.png]

”
Examiner’s note: the convolutional neural network (CNN) generates embedding outputs that capture visual similarity. The three images of a triplet (query image, positive image, and negative image) are fed into the respective sub-networks (each comprising a 16-layer VGG net and Conv Layers 1 and 2) in order to find the item matching the search query image. Therefore, the training of the sub-networks is considered as the learning targets);
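For illustration only, the triplet ranking objective described in the quoted passage (Euclidean distance between embeddings, fed into a hinge loss) might be sketched as follows. This is a minimal sketch of the standard triplet hinge form, not Shankar's implementation; the function names and the `margin` value are assumptions, and the exact expression shown in the cited figure is not reproduced here.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(q, p, n, margin=1.0):
    """Hinge loss over one triplet of embeddings (q, p, n).

    The loss is zero once the positive embedding p is closer to the query q
    than the negative embedding n by at least `margin` (an assumed value).
    """
    return max(0.0, margin + euclidean_distance(q, p) - euclidean_distance(q, n))
```

A correctly ranked triplet with a wide gap yields zero loss, while a triplet whose negative sits closer to the query than the positive is penalized in proportion to the violation.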
[…]
the neural network to identify data associated with each of the plurality of object verticals, where the neural network is trained using the respective learning targets [Section 3.1, Fig.3], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space… We use two different types of triplets, in-class triplets and out of class triplets (see Figure 3 for an example). The in-class triplets help in teaching the network to pay attention to nuanced differences, like the thickness of stripes, using hard negatives. These are images that can be deemed similar to the query image in a broad sense but are less similar to the query image when compared to the positive image due to fine-grained distinctions. This enables the network to learn robust embeddings that are sensitive to subtle distinctions in colors and patterns. The out-of-class triplets contain easy negatives and help in teaching the network to make coarse-grained distinctions. During training, the images in the triplet are fed to 3 sub-networks with shared weights (see Figure 2a). Each sub-network generates an embedding or feature vector, thus 3 embedding vectors, q, p and n are generated - and fed into a Hinge Loss function:

[Image: media_image1.png]

” Examiner’s note: the convolutional neural network (CNN) generates embedding outputs that capture visual similarity. The three images of a triplet (query image, positive image, and negative image) are fed into the respective sub-networks (each comprising a 16-layer VGG net and Conv Layers 1 and 2) in order to find the item matching the search query image. Therefore, the training of the sub-networks is considered as the learning targets, which are used to identify the item of the specific category based on the input image uploaded by the user.)
and generating, by the data processing apparatus, a unified machine learning model configured to identify items that are included in the data associated with each of the plurality of object verticals  (Shankar, [abstract], “We propose a unified Deep Convolutional Neural Network architecture, called VisNet 1, to learn embeddings to capture the notion of visual similarity, across several semantic granularities.” and  (Shankar, [section 1, page 2, the left column, and section 3.1], “The main contribution of this paper is an end-to-end solution for large scale Visual Recommendations and Search. We share the details of our model architecture, training data generation pipeline as well as the architecture of our deployed system. Our model, VisNet, is a Convolutional Neural Network (CNN) trained using the triplet based deep ranking paradigm proposed in [33]. It contains a deep CNN modelled after the VGG-16 network [27], coupled with parallel shallow convolution layers in order to capture both high level and low-level image details simultaneously (see Section 3.1)”. for further clarification, see [section 3.1] “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space. Our CNN is modelled after [33]. We replace AlexNet with the 16- layered VGG network [27] in our implementation. this significantly improves our recall numbers (see section 4). Our conjecture is that the dense (stride 1) convolution with small receptive field (3 X 3) digests pattern details much better than the sparser AlexNet. 
these details may be even more important in the product similarity problem than the object recognition problem. Each training data element is a triplet of 3 images, < q;p;n >, a query image (q), a positive image (p) and a negative image (n). It is expected that the pair of images ¹q;pº are more visually similar compared to the pair of images ¹q;nº. Using triplets enables us to train a network to directly rank images instead of optimizing for binary/discriminatory decisions (as done in Siamese networks). It should be noted that here the training data needs to be labeled only for relative similarity…” Examiner’s note, Unified Deep Convolutional Neural Network architecture (VisNet) using the CNN to generate the identify the particular item of a specific category based on the query inputted image, therefore, the CNN is considered as the unified learning model. For further detail, see [Section 4, fig. 2, 4, 5,], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. thee clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…Algorithms were evaluated on what percentage of these triplets were correctly ranked by them (i.e., given a triplet < q;p;n >, if D¹q;pº < D¹q;nº, score 1, else 0). the results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. 
The different BISSs that were used are described in Section 3.2…” Examiner’s note: the image triplets are considered as the unified learning model, which were generated to identify the search item belonging to a particular category (vertical). However, the claim limitation “a unified machine learning model configured to identify items that are included in the data associated with each of the plurality of object verticals” is an intended use limitation; the unified machine learning model is not required to identify the items included in the data associated with each of the plurality of object verticals.).
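For illustration only, the evaluation rule quoted from Shankar (a triplet scores 1 if the query is closer to the positive than to the negative, else 0, reported as a percentage) might be sketched as follows. The function names and the use of plain Python lists as embeddings are assumptions made for this sketch.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_accuracy(triplets):
    """Percentage of (q, p, n) triplets ranked correctly.

    A triplet counts as correct when D(q, p) < D(q, n), matching the
    scoring rule in the quoted passage.
    """
    triplets = list(triplets)
    if not triplets:
        raise ValueError("at least one triplet is required")
    correct = sum(1 for q, p, n in triplets if distance(q, p) < distance(q, n))
    return 100.0 * correct / len(triplets)
```

For example, a set containing one correctly ranked triplet and one incorrectly ranked triplet would score 50 percent.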
Shankar discloses the neural network to identify data associated with each of the plurality of object verticals, where the neural network is trained using the respective learning targets.
However, Shankar does not teach training, by the data processing apparatus and based on a first loss function, and using the neural network trained based on the first loss function.
On the other hand, Kumar teaches training, by the data processing apparatus and based on a first loss function, (Kumar, [Abstract], “Recent innovations in training deep convolutional neural network (ConvNet) models have motivated the design of new methods to automatically learn local image descriptors. The latest deep ConvNets proposed for this task consist of a siamese network that is trained by penalizing misclassification of pairs of local image patches. Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, current Siamese and triplet networks have been trained with stochastic gradient descent that computes the gradient from individual pairs or triplets of local image patches, which can make them prone to overfitting. In this paper, we first propose the use of triplet networks for the problem of local image descriptor learning. Furthermore, we also propose the use of a global loss that minimizes the overall classification error in the training set, which can improve the generalization capability of the model. Using the UBC benchmark dataset for comparing local image descriptors, we show that the triplet network produces a more accurate embedding than the siamese network in terms of the UBC dataset errors.” Examiner’s note, the machine learning system using the deep convolutional neural network (ConvNet) models to generate the triplet network based on the global loss function (first loss function) to minimize the misclassification error, for further detail see Kumar, [Fig.1, section 1, 3, 3.1. 3.2], “For instance, the triplet network [33, 14, 26, 35] (see Fig. 
1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalize better… 

[Image: media_image2.png]
 
…” Examiner’s note: Fig. 1-(d) shows the triplet network, which uses the global loss function to classify the input data into a particular class. Therefore, the triplet network (neural network) is trained based on the global loss function, wherein the global loss function is considered as the first loss function.);
and using the neural network trained based on the first loss function  (Kumar, [Fig.1, section 1, 3, 3.1. 3.2], “  For instance, the triplet network [33, 14, 26, 35] (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalize better… 
  
…” Examiner’s note: Fig. 1-(d) shows the triplet network, which uses the global loss function to classify the input data into a particular class. Therefore, the triplet network (neural network) is trained based on the global loss function, wherein the global loss function is considered as the first loss function.),
Shankar and Kumar are analogous art because they are in the same field of endeavor, namely training machine learning models based on a loss function to minimize classification error.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object vertical) taught by Shankar, in view of Kumar, to include training, by the data processing apparatus and based on a first loss function, and using the neural network trained based on the first loss function. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the accuracy of classification and minimize classification errors (Kumar, [Abstract], “Recent innovations in training deep convolutional neural network (ConvNet) models have motivated the design of new methods to automatically learn local image descriptors. The latest deep ConvNets proposed for this task consist of a siamese network that is trained by penalizing misclassification of pairs of local image patches. Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, current Siamese and triplet networks have been trained with stochastic gradient descent that computes the gradient from individual pairs or triplets of local image patches, which can make them prone to overfitting. In this paper, we first propose the use of triplet networks for the problem of local image descriptor learning. Furthermore, we also propose the use of a global loss that minimizes the overall classification error in the training set, which can improve the generalization capability of the model. Using the UBC benchmark dataset for comparing local image descriptors, we show that the triplet network produces a more accurate embedding than the siamese network in terms of the UBC dataset errors.”).
Regarding claim 2, Shankar teaches the method of claim 1, wherein determining respective learning targets for the neural network further comprises: training, by the data processing apparatus and based on a second loss function, at least one other neural network to identify data associated with each of the plurality of object verticals (Shankar, [Section 3.1, 3.2 and 4.1], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space. Our CNN is modelled after [33]. We replace AlexNet with the 16- layered VGG network [27] in our implementation. this significantly improves our recall numbers (see section 4). Our conjecture is that the dense (stride 1) convolution with small receptive field (3 X 3) digests pattern details much better than the sparser AlexNet. these details may be even more important in the product similarity problem than the object recognition problem. Each training data element is a triplet of 3 images, < q;p;n >, a query image (q), a positive image (p) and a negative image (n). It is expected that the pair of images ¹q;pº are more visually similar compared to the pair of images ¹q;nº. Using triplets enables us to train a network to directly rank images instead of optimizing for binary/discriminatory decisions (as done in Siamese networks). It should be noted that here the training data needs to be labeled only for relative similarity…

[Image: media_image3.png]

…” Examiner’s note: the query image (q), positive image (p), and negative image (n) are input to the Hinge loss function (second loss function) to identify the item that includes the data associated with the object verticals (such as the back shirt with long sleeves in Fig. 3, which is associated with a shirt category));
in response to training, generating, by the data processing apparatus, two or more embedding outputs, where each embedding output indicates a particular learning target and includes a vector of parameters that correspond to the data associated with a particular object vertical (Shankar, [section 3.1, Figs. 2, 3], “
Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space. Our CNN is modelled after [33]. We replace AlexNet with the 16- layered VGG network [27] in our implementation. this significantly improves our recall numbers (see section 4). Our conjecture is that the dense (stride 1) convolution with small receptive field (3 X 3) digests pattern details much better than the sparser AlexNet. these details may be even more important in the product similarity problem than the object recognition problem. Each training data element is a triplet of 3 images, < q;p;n >, a query image (q), a positive image (p) and a negative image (n). It is expected that the pair of images ¹q;pº are more visually similar compared to the pair of images ¹q;nº. Using triplets enables us to train a network to directly rank images instead of optimizing for binary/discriminatory decisions (as done in Siamese networks). It should be noted that here the training data needs to be labeled only for relative similarity…

[Image: media_image4.png]

…” Examiner’s note: the triplet images (positive image, negative image, query image) are considered as a plurality of embedding outputs, which are generated by the CNN. Each sub-network generates an embedding feature vector in order to identify the particular object vertical (category of the item/image). Therefore, each embedding feature vector indicates a sub-network, wherein each sub-network is considered as the machine learning target.);
and generating, by the data processing apparatus and using the at least one other neural network trained based on the second loss function, respective machine learning models, each machine learning model being configured to use a particular embedding output (Shankar, [Section 4, Figs. 2, 4, 5], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…Algorithms were evaluated on what percentage of these triplets were correctly ranked by them (i.e., given a triplet <q, p, n>, if D(q, p) < D(q, n), score 1, else 0). The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the items/products are grouped into similar categories, and a separate network model is trained for each category. For example, a particular network is trained on the images of one category, wherein the category images (clothing category) include a plurality of object verticals such as t-shirts, shirts, and tops.).
Regarding claim 3, Shankar teaches the method of claim 2, wherein determining respective learning targets for the neural network further comprises: providing, for training the neural network, the respective learning targets generated from respective separate models (Shankar, [Sec. 3.1-3.2], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space…

[Image: media_image5.png]

Figure 2b shows the content of each sub-network. It has the following parallel paths
16-Layer VGG net without the final loss layer: the output of the last layer of this net captures abstract, high level characteristics of the input image
Shallow Conv Layers 1 and 2: capture fine-grained details of the input image.
This parallel combination of deep and shallow network is essential for the network to capture both the high level and low level details needed for visual similarity estimation, leading to better results. During inference, any one of the sub-networks (they are all the same, since the weights are shared) takes an image as input and generates an embedding. Finding similar items then boils down to the task of nearest neighbor search in the embedding space. We grouped the set of product items into related categories, e.g., clothing (which includes shirts, t-shirts, tops, etc), footwear, and trained a separate deep ranking NN for each category.
…” Examiner’s note: each sub-network generates a particular embedding output, wherein each sub-network is considered as the machine learning target. A separate network is trained for each category.).
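For illustration only, the inference step described in the quoted passage (an image is mapped to an embedding, and similar items are found by nearest neighbor search in the embedding space) might be sketched as follows. This is a brute-force sketch under stated assumptions: the `catalog` dictionary, the item identifiers, and the default `k` are hypothetical, and the sketch does not represent the search method of Shankar's deployed system.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query_embedding, catalog, k=3):
    """Return the ids of the k catalog items closest to the query, nearest first.

    `catalog` maps a hypothetical item id to its embedding vector.
    """
    return heapq.nsmallest(
        k,
        catalog,
        key=lambda item_id: euclidean_distance(query_embedding, catalog[item_id]),
    )
```

Because one network is trained per category in the quoted passage, such a lookup would presumably be run against the catalog embeddings of a single category at a time.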
Regarding claim 4, Shankar teaches the method of claim 2, wherein each of the plurality of object verticals corresponds to a particular category of items (Shankar, [Section 3.1], “

[Image: media_image5.png]

Figure 2b shows the content of each sub-network. It has the following parallel paths
16-Layer VGG net without the final loss layer: the output of the last layer of this net captures abstract, high level characteristics of the input image
Shallow Conv Layers 1 and 2: capture fine-grained details of the input image.
This parallel combination of deep and shallow network is essential for the network to capture both the high level and low level details needed for visual similarity estimation, leading to better results. During inference, any one of the sub-networks (they are all the same, since the weights are shared) takes an image as input and generates an embedding. Finding similar items then boils down to the task of nearest neighbor search in the embedding space. We grouped the set of product items into related categories, e.g., clothing (which includes shirts, t-shirts, tops, etc), footwear, and trained a separate deep ranking NN for each category.” Examiner’s note: the plurality of items (shirts, t-shirts, and tops) are grouped into an object vertical (such as clothing) that corresponds to a particular category.);
and the data associated with each of the plurality of object verticals includes image data of an item in the particular category of items (Shankar, [Section 4, Figs. 2, 4, 5], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…Algorithms were evaluated on what percentage of these triplets were correctly ranked by them (i.e., given a triplet <q, p, n>, if D(q, p) < D(q, n), score 1, else 0). The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the input data (query image) is associated with a particular object vertical, such as dress or t-shirt, and will be classified into a specific category.).
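For illustration only, the grouping of product items into related categories described in the quoted passages (e.g., shirts, t-shirts, tops, and dresses under clothing, with footwear as another category) might be sketched as follows. The mapping table, the `"shoe"` item type, and the item identifiers are hypothetical and are not taken from Shankar.

```python
from collections import defaultdict

# Hypothetical item-type-to-category table, loosely following the examples
# in the quoted passage; the "shoe" entry is an assumption for this sketch.
ITEM_TYPE_TO_CATEGORY = {
    "dress": "clothing",
    "t-shirt": "clothing",
    "shirt": "clothing",
    "top": "clothing",
    "shoe": "footwear",
}


def group_by_category(items):
    """Group (item_id, item_type) pairs by category.

    Raises KeyError for an item type that has no assigned category.
    """
    groups = defaultdict(list)
    for item_id, item_type in items:
        groups[ITEM_TYPE_TO_CATEGORY[item_type]].append(item_id)
    return dict(groups)
```

Each resulting group would then correspond to the image data on which one per-category network is trained.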
Regarding claim 5, Shankar teaches the method of claim 4, wherein the particular category is an apparel category and items of the particular category include at least one of: handbags, shoes, dresses, pants, or outerwear; and wherein the image data indicates an image of at least one of: a particular handbag, a particular shoe, a particular dress, a particular pant, or particular outerwear (Shankar, [Section 4.1, Fig. 5, Table 2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…”).
Regarding claim 6, Shankar teaches the method of claim 5, wherein: each of the respective machine learning models are configured to identify data associated with a particular object vertical and within a first degree of accuracy (Shankar, [Section 4.1 and 4.2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture… The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: Table 1 shows the accuracy results of a plurality of training methods for in-class triplets and out-of-class triplets. However, the claim does not define what a first degree of accuracy is; therefore, the examiner interprets the in-class triplet accuracy as the first degree of accuracy. The limitation “machine learning models are configured to identify data associated with a particular object vertical and within a first degree of accuracy” is an intended-use limitation; the machine learning models are not required to identify data associated with a particular object vertical and within a first degree of accuracy.);
and the unified machine learning model is configured to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy (Shankar, [Section 4.1 and 4.2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture… The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the claim does not clarify what the second degree of accuracy is; therefore, the examiner interprets the out-of-class triplet accuracy as the second degree of accuracy, which exceeds the first degree of accuracy in Table 1. However, the limitation “unified machine learning model is configured to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy” is an intended-use limitation; the unified machine learning model is not required to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy.).
Regarding claim 7, Shankar teaches the method of claim 2, wherein determining the respective learning targets for each of the plurality of object verticals, comprises: analyzing the two or more embedding outputs, each embedding output corresponding to a particular object vertical of the plurality of object verticals; and based on the analyzing, determining the respective learning targets for each of the plurality of object verticals (Shankar, [Sec. 3.1-3.2], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space…

[Image: media_image5.png]
”
Examiner’s note: a plurality of embedding outputs (positive image, negative image) of the plurality of object verticals are identified based on the query image, and each sub-network generates an embedding output. Each sub-network is considered a learning target. Therefore, each learning target is determined based on a particular embedding output.).
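For illustration only, the k-Nearest-Neighbor search in the embedding space described in the quoted passage can be sketched as follows. This is a brute-force sketch under stated assumptions, not Shankar's implementation; the catalog identifiers, embedding values, and function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, catalog, k):
    """Return the ids of the k catalog embeddings closest to the query embedding."""
    ranked = sorted(catalog.items(), key=lambda item: euclidean(query, item[1]))
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical catalog of item id -> embedding (illustrative values only).
catalog = {
    "shirt_a": (0.1, 0.9),
    "shirt_b": (0.2, 0.8),
    "shoe_a": (0.9, 0.1),
}
similar_items = nearest_neighbors((0.0, 1.0), catalog, 2)
```

A production system would typically replace the full sort with an approximate nearest-neighbor index; the exhaustive scan is used here only to keep the sketch short.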
Regarding claim 8, Shankar as modified in view of Kumar teaches the method of claim 2, wherein the first loss function is an L2-loss function and generating the unified machine learning model includes (Kumar, [Sections 1, 3, 3.1, 3.2], “For instance, the triplet network [33, 14, 26, 35] (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalize better…

[Image: media_image6.png]
”
Fig. 1-(d) shows the triplet network, which is trained with a loss function to classify the input data into a class; the global loss function is considered to be the L2-loss function.): 
generating a particular unified machine learning model that minimizes a computational output associated with the L2-loss function (Kumar, [Section 1, page 5385-5386, right column], “For instance, the triplet network [33, 14, 26, 35] (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalize better… (Fig. 1-(d)) and a new global loss function to train local image descriptor learning models that can be applied to the siamese and triplet networks (Fig. 1-(b),(d)). The global loss to produce a feature embedding minimises the variance of the distance between descriptors (in the embedded space) belonging to the same and different classes, minimises the mean distance between descriptors belonging to the same class and maximises the mean distance between descriptors belonging to different classes”).
Shankar and Kumar are analogous art because they are in the same field of endeavor of generating a machine learning model based on a loss function to minimize classification error.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object verticals) taught by Shankar, further in view of Kumar, such that the first loss function is an L2-loss function and generating the unified machine learning model includes generating a particular unified machine learning model that minimizes a computational output associated with the L2-loss function. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the accuracy of classification and minimize classification errors (Kumar, [Abstract], “Recent innovations in training deep convolutional neural network (ConvNet) models have motivated the design of new methods to automatically learn local image descriptors. The latest deep ConvNets proposed for this task consist of a siamese network that is trained by penalizing misclassification of pairs of local image patches. Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, current Siamese and triplet networks have been trained with stochastic gradient descent that computes the gradient from individual pairs or triplets of local image patches, which can make them prone to overfitting. In this paper, we first propose the use of triplet networks for the problem of local image descriptor learning. Furthermore, we also propose the use of a global loss that minimizes the overall classification error in the training set, which can improve the generalization capability of the model. Using the UBC benchmark dataset for comparing local image descriptors, we show that the triplet network produces a more accurate embedding than the siamese network in terms of the UBC dataset errors.”).
Regarding claim 10, Shankar teaches the method of claim 2, wherein the second loss function is a triplet loss function and generating the respective machine learning models includes: generating a particular machine learning model based on associations between an anchor image, a positive image, and a negative image (Shankar, [Section 3.1, 3.2, Fig. 3], “Catalog Image Triplets: Here, the query image (q), positive image (p) and negative image (n) are all catalog images. During training, in order to generate a triplet <q, p, n>, q is randomly sampled from the set of catalog images. Candidate positive images are programmatically selected in a bootstrapping manner by a set of basic image similarity scoring techniques, described below. It should be noted that a “Basic Image Similarity Scorer” (BISS) need not be highly accurate. It need not have great recall (get most images highly similar to query) nor great precision (get nothing but highly similar images). The principal expectation from the BISS set is that, between all of them, they more or less identify all the images that are reasonably similar to the query image. As such, each BISS could focus on a sub-aspect of similarity, e.g., one BISS can focus on color, another on pattern etc. Each BISS programmatically identifies the 1000 nearest neighbors to q from the catalog. The union of top 200 neighbors from all the BISSs form the sample space for p…”).
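For illustration only, a triplet (hinge) ranking loss over a query image embedding, a positive image embedding, and a negative image embedding can be sketched as follows. This is a generic sketch and not Shankar's exact formulation; the `margin` parameter, its default value, the function names, and the embedding values are assumptions made for the example, and Euclidean distance is assumed for D.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(q, p, n, margin=1.0):
    """Hinge loss max(0, margin + D(q, p) - D(q, n)) for one triplet.

    The loss is zero once the negative is farther from the query than the
    positive by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, margin + euclidean(q, p) - euclidean(q, n))


# Hypothetical 2-D embeddings (illustrative values only).
query, positive, negative = (0.0, 0.0), (1.0, 0.0), (3.0, 4.0)
well_ranked_loss = triplet_hinge_loss(query, positive, negative)  # 1 + 1 - 5 < 0
mis_ranked_loss = triplet_hinge_loss(query, negative, positive)   # 1 + 5 - 1 = 5
```

Training on such a loss only requires relative similarity labels for each triplet, which is consistent with the quoted passages describing how positive and negative images are sampled.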
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Shankar et al. (NPL: Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce– Flipkart Internet Pvt. Ltd., Bengaluru, India- hereinafter, Shankar) in view of Kumar et al. (NPL: Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions- The University of Adelaide and Australian Centre for Robotic Vision- hereinafter, Kumar) and further in view of Arpit et al. (NPL: Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks- hereinafter, Arpit).
Regarding claim 9, Shankar as modified in view of Kumar teaches the method of claim 2, wherein the neural network includes a plurality of neural network layers that receive multiple layer inputs, and where training the neural network based on the first loss function includes: performing batch normalization to normalize layer inputs to a particular neural network layer (Kumar, [Section 3, 3.1, 3.2], “


[Image: media_image6.png]

[Image: media_image7.png]

[Image: media_image8.png]

…”);
However, Shankar and Kumar do not teach minimizing covariate shift in response to performing the batch normalization.
On the other hand, Arpit teaches minimizing covariate shift in response to performing the batch normalization (Arpit, [Abstract], “While the authors of Batch Normalization (BN) identify and address an important problem involved in training deep networks– Internal Covariate Shift– the current solution has certain drawbacks. For instance, BN depends on batch statistics for layer wise input normalization during training which makes the estimates of mean and standard deviation of input (distribution) to hidden layers inaccurate due to shifting parameter values (especially during initial training epochs). Another fundamental problem with BN is that it cannot be used with batch-size 1 during training. We address these drawbacks of BN by proposing a non-adaptive normalization technique for removing covariate shift, that we call Normalization Propagation.”).
Shankar, Kumar and Arpit are analogous art because they are in the same field of endeavor of generating a machine learning model based on a loss function to minimize classification error.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object verticals) taught by Shankar, further in view of Kumar and Arpit, by minimizing covariate shift in response to performing the batch normalization. The modification would have been obvious because one of ordinary skill in the art would have been motivated to improve the accuracy of classification (Arpit, [Section 56], “We want to verify the following: a) performance comparison of NormProp when using Global Data Normalization vs. Batch Data Normalization; b) NormProp alleviates the problem of Internal Covariate Shift more accurately compared to BN; c) thus, convergence stability of NormProp is better than BN; d) effect of batch-size on the behavior of NormProp, especially batch-size 1 (BN not applicable). Finally we report classification result on various datasets using NormProp and BN…”).
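For illustration only, the batch normalization of layer inputs referred to in the claim limitation can be sketched as follows. This is the generic textbook form (normalize by the batch mean and variance, then scale and shift), not code from Kumar or Arpit and not Arpit's Normalization Propagation; the names `gamma`, `beta`, and `eps` and the input values are assumptions made for the example.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar layer inputs to roughly zero mean, unit variance.

    gamma and beta are the (assumed) learnable scale and shift; eps avoids
    division by zero when the batch variance is very small.
    """
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


# Hypothetical layer inputs for one mini-batch (illustrative values only).
normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
```

Because the statistics come from the batch itself, a batch of size 1 has zero variance, which is consistent with the drawback noted in the quoted Arpit abstract.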
Claims 14-21, 23-24 are rejected under 35 U.S.C. 103 as being unpatentable over Shankar et al. (NPL: Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce– Flipkart Internet Pvt. Ltd., Bengaluru, India- hereinafter, Shankar) in view of Kumar et al. (NPL: Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions- The University of Adelaide and Australian Centre for Robotic Vision- hereinafter, Kumar) further in view of Lan et al. (Pub. No.: US2015/0294192, hereinafter, Lan).
Claim 14 is rejected for the same reasons as claim 1.
Additionally, Shankar teaches the system for generating a unified machine learning model using a neural network, the system comprising: a data processing apparatus configured to implement the neural network, the data processing apparatus including one or more processing devices (Shankar, [Abstract], “In this paper, we present a unified end-to-end approach to build a large scale Visual Search and Recommendation system for ecommerce. Previous works have targeted these problems in isolation. We believe a more effective and elegant solution could be obtained by tackling them together. We propose a unified Deep Convolutional Neural Network architecture, called VisNet, to learn embeddings to capture the notion of visual similarity, across several semantic granularities.” Examiner’s note: the unified deep convolutional neural network architecture (VisNet) generates output matched images based on the input data (query image) entered by the user. Therefore, VisNet will run on a computer, wherein the computer system will have at least one processor.);
However, Shankar and Kumar do not teach one or more non-transitory machine-readable storage devices storing instructions that are executable by the one or more processing devices to cause performance of operations comprising:
On the other hand, Lan teaches one or more non-transitory machine-readable storage devices storing instructions that are executable by the one or more processing devices to cause performance of operations comprising (Lan, [Claim 10], “A non-transitory computer-readable storage medium storing a program, which, when executed by a processor performs operations for detecting objects in an input image, the operations comprising: receiving a set of training images and associated annotations, the annotations labeling categories and locations of objects which appear in the images;”):
Shankar, Kumar and Lan are analogous art because they are in the same field of endeavor of generating a machine learning model to identify a category for input images.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object vertical) taught by Shankar, further in view of Kumar and Lan, by having one or more non-transitory machine-readable storage devices storing instructions that are executable by the one or more processing devices to cause performance of operations. The modification would have been obvious because one of ordinary skill in the art would have been motivated to implement a processing unit that executes the program stored on the computer readable medium (Lan, [Par. 0049], “Aspects of the present disclosure may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware aspect, an entirely software aspect (including firmware, resident software, micro-code, etc.) or an aspect combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.”).
Claim 24 is rejected for the same reasons as claim 14.
Regarding claim 15, Shankar teaches the system of claim 14, wherein determining respective learning targets for the neural network further comprises: training, by the data processing apparatus and based on a second loss function, at least one other neural network to identify data associated with each of the plurality of object verticals (Shankar, [Section 3.1, 3.2 and 4.1], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space. Our CNN is modelled after [33]. We replace AlexNet with the 16-layered VGG network [27] in our implementation. This significantly improves our recall numbers (see section 4). Our conjecture is that the dense (stride 1) convolution with small receptive field (3 X 3) digests pattern details much better than the sparser AlexNet. These details may be even more important in the product similarity problem than the object recognition problem. Each training data element is a triplet of 3 images, <q, p, n>, a query image (q), a positive image (p) and a negative image (n). It is expected that the pair of images (q, p) are more visually similar compared to the pair of images (q, n). Using triplets enables us to train a network to directly rank images instead of optimizing for binary/discriminatory decisions (as done in Siamese networks). It should be noted that here the training data needs to be labeled only for relative similarity…

[Image: media_image3.png]

…” Examiner’s note: the query image (q), positive image (p) and negative image (n) are input into the hinge loss function (second loss function) to identify the item that includes the data associated with the object verticals (such as the back shirt with long sleeves in Fig. 3, which is associated with a shirt category)); 
in response to training, generating, by the data processing apparatus, two or more embedding outputs, where each embedding output indicates a particular learning target and includes a vector of parameters that correspond to the data associated with a particular object vertical (Shankar, [Section 3.1, Figs. 2, 3], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space. Our CNN is modelled after [33]. We replace AlexNet with the 16-layered VGG network [27] in our implementation. This significantly improves our recall numbers (see section 4). Our conjecture is that the dense (stride 1) convolution with small receptive field (3 X 3) digests pattern details much better than the sparser AlexNet. These details may be even more important in the product similarity problem than the object recognition problem. Each training data element is a triplet of 3 images, <q, p, n>, a query image (q), a positive image (p) and a negative image (n). It is expected that the pair of images (q, p) are more visually similar compared to the pair of images (q, n). Using triplets enables us to train a network to directly rank images instead of optimizing for binary/discriminatory decisions (as done in Siamese networks). It should be noted that here the training data needs to be labeled only for relative similarity…

[Image: media_image4.png]

…” Examiner’s note: the triplet images (positive image, negative image, query image) are considered a plurality of embedding outputs, which are generated by the CNN. Each sub-network generates an embedding feature vector in order to identify the particular object vertical (category of the item/image); therefore, an embedding feature vector indicates each sub-network, wherein each sub-network is considered a machine learning target.);
and generating, by the data processing apparatus and using the at least one other neural network trained based on the second loss function, respective machine learning models, each machine learning model being configured to use a particular embedding output (Shankar, [Section 4, Figs. 2, 4, 5], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…Algorithms were evaluated on what percentage of these triplets were correctly ranked by them (i.e., given a triplet <q, p, n>, if D(q, p) < D(q, n), score 1, else 0). The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the items/products are grouped into similar categories, and each category is trained on a separate network model. For example, a particular network is trained on the category images, wherein the category images (clothing category) include a plurality of object verticals such as t-shirts, shirts and tops.).
Regarding claim 17, Shankar teaches the system of claim 15, wherein each of the plurality of object verticals corresponds to a particular category of items (Shankar, [Section 3.1], “

[Image: media_image5.png]

Figure 2b shows the content of each sub-network. It has the following parallel paths
•	16-Layer VGG net without the final loss layer: the output of the last layer of this net captures abstract, high level characteristics of the input image
•	Shallow Conv Layers 1 and 2: capture fine-grained details of the input image.
This parallel combination of deep and shallow network is essential for the network to capture both the high level and low level details needed for visual similarity estimation, leading to better results. During inference, any one of the sub-networks (they are all the same, since the weights are shared) takes an image as input and generates an embedding. Finding similar items then boils down to the task of nearest neighbor search in the embedding space. We grouped the set of product items into related categories, e.g., clothing (which includes shirts, t-shirts, tops, etc), footwear, and trained a separate deep ranking NN for each category.” Examiner’s note: the plurality of items (shirts, t-shirts, and tops) are grouped into an object vertical (such as clothing) that corresponds to a particular category.).
and the data associated with each of the plurality of object verticals includes image data of an item in the particular category of items (Shankar, [Section 4, Figs. 2, 4, 5], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…Algorithms were evaluated on what percentage of these triplets were correctly ranked by them (i.e., given a triplet <q, p, n>, if D(q, p) < D(q, n), score 1, else 0). The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the input data (query image) is associated with a particular object vertical, such as a dress or t-shirt, that will be classified into a specific category.).
Regarding claim 18, Shankar teaches the system of claim 17, wherein the particular category is an apparel category and items of the particular category include at least one of: handbags, shoes, dresses, pants, or outerwear; and wherein the image data indicates an image of at least one of: a particular handbag, a particular shoe, a particular dress, a particular pant, or particular outerwear (Shankar, [Section 4.1, Fig. 5, Table 2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture, shown in Figure 2. The clothing network was trained on 1.5 million Catalog Image Triplets and 1.5 million Wild Image Triplets. Catalog Image Triplets were generated from 250K t-shirts, 150K shirts, 30K tops, altogether about 500K dress items. Wild Image Triplets were generated from the Exact Street2Shop ([14]) dataset which contains around 170K dresses, 68K tops and 35K outerwear items…”).
Regarding claim 19, Shankar teaches the system of claim 18, wherein: each of the respective machine learning models are configured to identify data associated with a particular object vertical and within a first degree of accuracy (Shankar, [Section 4.1 and 4.2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture… The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: Table 1 shows the accuracy results of a plurality of training methods for in-class triplets and out-of-class triplets. However, the claim does not define what a first degree of accuracy is; therefore, the examiner interprets the in-class triplet accuracy as the first degree of accuracy. The limitation “machine learning models are configured to identify data associated with a particular object vertical and within a first degree of accuracy” is an intended-use limitation; the machine learning models are not required to identify data associated with a particular object vertical and within a first degree of accuracy.);
and the unified machine learning model is configured to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy (Shankar, [Section 4.1 and 4.2], “For highest quality, we grouped our products into similar categories (e.g., dresses, t-shirts, shirts, tops, etc. in one category called clothing, footwear in another etc.) and trained separate networks for each category. In this paper, we are mostly reporting the clothing network numbers. All networks had the same architecture… The results are shown in Table 1. We compare the accuracy numbers of our best model, VisNet, with those of the individual BISSs that were used for candidate training triplet generation. The different BISSs that were used are described in Section 3.2…” Examiner’s note: the claim does not define what the second degree of accuracy is; therefore, the examiner interprets the out-of-class triplet accuracy as the second degree of accuracy, which exceeds the first degree of accuracy in Table 1. However, the limitation “unified machine learning model is configured to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy” is an intended-use limitation; the unified machine learning model is not required to identify data associated with each of the plurality of object verticals and within a second degree of that exceeds the first degree of accuracy.).

Regarding claim 20, Shankar teaches the system of claim 15, wherein determining the respective learning targets for each of the plurality of object verticals, comprises: analyzing the two or more embedding outputs, each embedding output corresponding to a particular object vertical of the plurality of object verticals; and based on the analyzing, determining the respective learning targets for each of the plurality of object verticals (Shankar, [Sec. 3.1-3.2], “Our core approach consists of training a Convolutional Neural Network (CNN) to generate embeddings that capture the notion of visual similarity. These embeddings serve as visual descriptors, capturing a complex combination of colors and patterns. We use a triplet based approach with a ranking loss to learn embeddings such that the Euclidean distance between embeddings of two images measures the (dis)similarity between the images. Similar images can be then found by k-Nearest-Neighbor searches in the embedding space…

[Image: media_image5.png]
”
Examiner’s note: Shankar identifies a plurality of embedding outputs (positive image, negative image) for the plurality of object verticals based on the query image, and each sub-network generates an embedding output. Each sub-network is considered a learning target. Therefore, each learning target is determined based on its particular embedding output).
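For illustration only, the triplet-based approach quoted from Shankar above can be sketched as follows. The quoted passage states only that a ranking loss is used and that Euclidean distance between embeddings measures (dis)similarity; the hinge form of the loss, the `margin` value, the function names, and the toy embeddings below are assumptions introduced for this sketch and are not taken from the reference.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_ranking_loss(q, p, n, margin=1.0):
    """Assumed hinge-style ranking loss over a (query, positive, negative) triplet.

    The loss is zero once the negative embedding is farther from the query
    than the positive embedding by at least `margin` (an assumed value).
    """
    return max(0.0, margin + euclidean(q, p) - euclidean(q, n))


def k_nearest(query, catalog, k):
    """Return the k catalog keys whose embeddings are closest to `query`."""
    return sorted(catalog, key=lambda name: euclidean(query, catalog[name]))[:k]


# Toy two-dimensional embeddings (hypothetical values, not from Shankar).
catalog = {"dress_a": [0.0, 0.1], "dress_b": [0.2, 0.0], "shoe_a": [3.0, 3.0]}
query = [0.0, 0.0]
nearest = k_nearest(query, catalog, 2)
```

In this sketch, similar items are retrieved by sorting the catalog by distance to the query embedding, which corresponds to the k-Nearest-Neighbor search described in the quoted passage.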
Regarding claim 21, Shankar as modified in view of Kumar teaches the system of claim 15, wherein the first loss function is an L2-loss function and generating the unified machine learning model includes (Kumar, [Section 1, 3, 3.1, 3.2], “For instance, the triplet network [33, 14, 26, 35] (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalise better…

[Image: media_image6.png]
”
Fig. 1-(d) shows the triplet network, which uses the loss function to classify the input data into a particular class; the global loss function is considered the L2-loss function.): 
generating a particular unified machine learning model that minimizes a computational output associated with the L2-loss function (Kumar, [Section 1, page 5385-5386, right column], “For instance, the triplet network [33, 14, 26, 35] (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalize better… (Fig. 1-(d)) and a new global loss function to train local image descriptor learning models that can be applied to the siamese and triplet networks (Fig. 1-(b),(d)). The global loss to produce a feature embedding minimises the variance of the distance between descriptors (in the embedded space) belonging to the same and different classes, minimises the mean distance between descriptors belonging to the same class and maximises the mean distance between descriptors belonging to different classes”).
Shankar and Kumar are analogous art because they are in the same field of endeavor of generating a machine learning model based on a loss function to minimize the classification error.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object verticals) taught by Shankar, in view of Kumar, such that the first loss function is an L2-loss function and generating the unified machine learning model includes generating a particular unified machine learning model that minimizes a computational output associated with the L2-loss function. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the accuracy of classification and minimize the classification errors (Kumar, [Abstract], “Recent innovations in training deep convolutional neural network (ConvNet) models have motivated the design of new methods to automatically learn local image descriptors. The latest deep ConvNets proposed for this task consist of a siamese network that is trained by penalizing misclassification of pairs of local image patches. Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, current Siamese and triplet networks have been trained with stochastic gradient descent that computes the gradient from individual pairs or triplets of local image patches, which can make them prone to overfitting. In this paper, we first propose the use of triplet networks for the problem of local image descriptor learning. Furthermore, we also propose the use of a global loss that minimizes the overall classification error in the training set, which can improve the generalization capability of the model. Using the UBC benchmark dataset for comparing local image descriptors, we show that the triplet network produces a more accurate embedding than the siamese network in terms of the UBC dataset errors.”).
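For illustration only, the global loss described in the Kumar passage quoted above (reduce the variance of descriptor distances within the same-class and different-class groups, reduce the mean same-class distance, and increase the mean different-class distance) might be sketched as follows. The quoted passage does not give a formula; the hinge on the difference of means, the `margin` and `weight` parameters, and the function name are assumptions made for this sketch.

```python
from statistics import fmean, pvariance


def global_loss(same_class_dists, diff_class_dists, margin=1.0, weight=1.0):
    """Assumed batch-level loss over two groups of descriptor distances.

    variance term: spread of distances within each group (smaller is better)
    mean term:     hinge that is zero once the mean different-class distance
                   exceeds the mean same-class distance by `margin`
    """
    variance_term = pvariance(same_class_dists) + pvariance(diff_class_dists)
    mean_term = max(0.0, fmean(same_class_dists) - fmean(diff_class_dists) + margin)
    return variance_term + weight * mean_term


# Hypothetical distance values: a well-separated batch and an overlapping batch.
separated = global_loss([1.0, 1.0], [5.0, 5.0])
overlapping = global_loss([2.0, 4.0], [3.0, 3.0])
```

Under these assumptions the well-separated batch yields a lower loss than the overlapping batch, which matches the qualitative behavior described in the quoted passage.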
Regarding claim 23, Shankar teaches the system of claim 15, wherein the second loss function is a triplet loss function and generating the respective machine learning models includes: generating a particular machine learning model based on associations between an anchor image, a positive image, and a negative image (Shankar, [Section 3.1, 3.2, Fig. 3], “Catalog Image Triplets: Here, the query image (q), positive image (p) and negative image (n) are all catalog images. During training, in order to generate a triplet < q;p;n >, q is randomly sampled from the set of catalog images. Candidate positive images are programmatically selected in a bootstrapping manner by a set of basic image similarity scoring techniques, described below. It should be noted that a “Basic Image Similarity Scorer” (BISS) need not be highly accurate. It need not have great recall (get most images highly similar to query) nor great precision (get nothing but highly similar images). The principal expectation from the BISS set is that, between all of them, they more or less identify all the images that are reasonably similar to the query image. As such, each BISS could focus on a sub-aspect of similarity, e.g., one BISS can focus on color, another on pattern etc. Each BISS programmatically identifies the 1000 nearest neighbors to q from the catalog. The union of top 200 neighbors from all the BISSs form the sample space for p…”).
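For illustration only, the candidate positive selection described in the Shankar passage quoted above (the union of the top neighbors returned by several basic image similarity scorers) can be sketched as follows. The scorer functions, item names, attribute values, and the `top_k` parameter are hypothetical; the quoted passage uses the top 200 neighbors from each BISS and is truncated before negative selection is described, so negative sampling is not sketched here.

```python
def positive_sample_space(query_id, scorers, catalog_ids, top_k=200):
    """Union of each scorer's top_k most similar catalog items to the query.

    Each scorer is a callable (query_id, candidate_id) -> similarity score,
    where a higher score means more similar (an assumed convention).
    """
    space = set()
    candidates = [c for c in catalog_ids if c != query_id]
    for score in scorers:
        ranked = sorted(candidates, key=lambda c: score(query_id, c), reverse=True)
        space.update(ranked[:top_k])
    return space


# Toy catalog: each item has a (color, pattern) attribute pair (hypothetical).
items = {"q": (0.0, 0.0), "a": (0.1, 5.0), "b": (5.0, 0.1), "c": (9.0, 9.0)}


def color_scorer(x, y):
    """Similarity on the color attribute only."""
    return -abs(items[x][0] - items[y][0])


def pattern_scorer(x, y):
    """Similarity on the pattern attribute only."""
    return -abs(items[x][1] - items[y][1])


space = positive_sample_space("q", [color_scorer, pattern_scorer], list(items), top_k=1)
```

Here each scorer focuses on one sub-aspect of similarity, so item "a" enters the sample space through color and item "b" through pattern, while item "c" is selected by neither.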
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Shankar et al. (NPL: Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce– Flipkart Internet Pvt. Ltd., Bengaluru, India- hereinafter, Shankar) in view of Kumar et al. (NPL: Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions- The University of Adelaide and Australian Centre for Robotic Vision- hereinafter, Kumar) further in view of Lan et al. (Pub. No.: US2015/0294192, hereinafter, Lan) and further in view of Arpit et al. (NPL: Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks- hereinafter, Arpit).
Regarding claim 22, Shankar as modified in view of Kumar and Lan teaches the system of claim 15, wherein the neural network includes a plurality of neural network layers that receive multiple layer inputs, and where training the neural network based on the first loss function includes: performing batch normalization to normalize layer inputs to a particular neural network layer (Kumar, [Section 3, 3.1, 3.2], “


[Images: media_image9.png, media_image7.png, media_image8.png]

…”);
However, Shankar, Kumar and Lan do not teach minimizing covariate shift in response to performing the batch normalization.
On the other hand, Arpit teaches minimizing covariate shift in response to performing the batch normalization (Arpit, [Abstract], “While the authors of Batch Normalization (BN) identify and address an important problem involved in training deep networks– Internal Covariate Shift– the current solution has certain drawbacks. For instance, BN depends on batch statistics for layer wise input normalization during training which makes the estimates of mean and standard deviation of input (distribution) to hidden layers inaccurate due to shifting parameter values (especially during initial training epochs). Another fundamental problem with BN is that it cannot be used with batch-size 1 during training. We address these drawbacks of BN by proposing a non-adaptive normalization technique for removing covariate shift, that we call Normalization Propagation.”).
Shankar, Kumar, Lan and Arpit are analogous art because they are in the same field of endeavor of generating a machine learning model based on a loss function to minimize the classification error.
Accordingly, it would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the identification of the data object into a particular category (object verticals) taught by Shankar, in view of Kumar, Lan and Arpit, by minimizing covariate shift in response to performing the batch normalization. The modification would have been obvious because one of ordinary skill in the art would be motivated to improve the accuracy of classification (Arpit, [Section 56], “We want to verify the following: a) performance comparison of NormProp when using Global Data Normalization vs. Batch Data Normalization; b) NormProp alleviates the problem of Internal Covariate Shift more accurately compared to BN; c) thus, convergence stability of NormProp is better than BN; d) effect of batch-size on the behavior of NormProp, especially batch-size 1 (BN not applicable). Finally we report classification result on various datasets using NormProp and BN…”).
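For illustration only, normalizing layer inputs using batch statistics, as referred to in the claim limitation and in the Arpit passages quoted above, can be sketched as follows. This is a minimal sketch of the normalization step alone; the function name and the `eps` value are assumptions, and the learnable scale and shift parameters commonly used with batch normalization are omitted.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each input feature to zero mean and unit variance over a batch.

    `batch` is a list of equal-length feature vectors (one per example).
    `eps` is a small assumed constant that avoids division by zero.
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


# Two hypothetical layer-input vectors whose features have very different scales.
normalized = batch_normalize([[1.0, 10.0], [3.0, 30.0]])

# With a single example, every feature equals its own batch mean.
single = batch_normalize([[4.0, 7.0]])
```

In the single-example case the normalized output carries no information about the input, which is consistent with the statement in the Arpit abstract that BN cannot be used with batch-size 1 during training.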
Conclusion
The prior art made of record and not relied upon, which is considered pertinent to applicant’s disclosure, is provided below.
Zhang et al. (NPL: IMPROVING TRIPLET-WISE TRAINING OF CONVOLUTIONAL NEURAL NETWORK FOR VEHICLE RE-IDENTIFICATION- CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China- hereinafter, Zhang) teaches minimizing the loss by using an improved triplet loss function. 
Akata et al. (NPL: Evaluation of Output Embeddings for Fine-Grained Image Classification- Computer Vision and Multimodal Computing, Max Planck Institute for Informatics, Saarbrucken, Germany- hereinafter, Akata) teaches using SVM classification to reduce misclassification or loss based on the loss function. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EM N TRIEU whose telephone number is (571) 272-5747. The examiner can normally be reached on 7:30 - 5:00 M-Th. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez Rivas, can be reached on (571) 272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/E.T./Examiner, Art Unit 2128       

/OMAR F FERNANDEZ RIVAS/Supervisory Patent Examiner, Art Unit 2128