DETAILED ACTION
1.	This Office action is in response to Application No. 15900351, filed on 02/20/2018. Claims 1-20 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


3.	Claims 1-4, 6-13, 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over You et al. (Image Captioning with Semantic Attention, arXiv:1603.03925v1 [cs.CV] 12 Mar 2016) in view of Wang et al. (US10755082) and further in view of Song et al. (WO2017168125).

	Regarding claim 1, You teaches a system for training attention controlled neural networks to generate attribute-modulated-feature vectors using attribute attention projections (The training data for each image consist of input image features v, {Ai} and output caption words sequence {Yt}. Our goal is to learn all the attention model parameters ΘA = {U, V, W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set. The loss of one training example is defined as the total negative log-likelihood of all the words combined with regularization terms on attention scores {α i t} and {β i}, pg. 4, left col, last para.) comprising:
	a plurality of training images, (We extract key words as the visual attributes for our model from a large image dataset, pg. 5, left col, second para.) and 
	generate at least one attribute attention projection, of training images of the plurality of training images (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: projection matrix is interpreted as attribute attention projection”)
	utilize the at least one attribute attention projection to generate at least one attribute-modulated-feature vector for at least one training image of the training images (Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: vector w is the attribute-modulated-feature vector”);
	by inserting the at least one attribute attention projection between at least one set of layers of the attention controlled neural network (The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, Fig. 2, pg. 3, left para.)
	and jointly learn at least one updated attribute attention projection and updated parameters of the attention controlled neural network by minimizing a loss from a loss function based on the at least one attribute-modulated-feature vector (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.; Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 3, right col, last para., pg. 4, left col, first para.)
	You does not explicitly teach “at least one processor; at least one non-transitory computer memory comprising an attention controlled neural network, instructions that, when executed by at least one processor, cause the system to:” or “for at least one attribute category.”
	Wang teaches at least one processor; at least one non-transitory computer memory comprising an attention controlled neural network, instructions that, when executed by at least one processor, cause the system to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of training images and to improve CNN efficacy (Wang, col 3, lines 58-63).
	Song teaches for at least one attribute category (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified You to incorporate the teachings of Song for the benefit of extracting edge maps from bounding box areas of the images (Song, [0091]).
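
For illustration only, the attribute term of Equation (7) as quoted from You, diag(w^{x,A}) ∑_i α_t^i E y^i, can be sketched in plain Python. The function name, the toy embeddings, and the attention scores are hypothetical and do not come from You, Wang, or Song; the sketch shows only how attention scores weight the embedded attributes before the per-dimension scaling by w^{x,A}, and it leaves out the previous-word term and the projection.

```python
def attended_attribute_term(alpha, attr_embeddings, w):
    """Return diag(w) applied to sum_i alpha[i] * attr_embeddings[i].

    alpha            -- one attention score per attribute (hypothetical values)
    attr_embeddings  -- one embedded attribute vector (E y^i) per attribute
    w                -- per-dimension importance vector (w^{x,A})
    """
    if len(alpha) != len(attr_embeddings):
        raise ValueError("one attention score is required per attribute")
    dim = len(w)
    summed = [0.0] * dim
    for score, emb in zip(alpha, attr_embeddings):
        if len(emb) != dim:
            raise ValueError("embedding and importance vector sizes differ")
        for k in range(dim):
            summed[k] += score * emb[k]  # attention-weighted sum of attributes
    # Multiplying by diag(w) is an elementwise scaling.
    return [w[k] * summed[k] for k in range(dim)]


# Toy example: two attributes embedded in a two-dimensional word space.
alpha = [0.75, 0.25]
attr_embeddings = [[1.0, 0.0], [0.0, 2.0]]
w = [2.0, 0.5]
term = attended_attribute_term(alpha, attr_embeddings, w)
print(term)
```

In this toy case the first attribute dominates the result because it carries the larger attention score and the larger importance weight.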

	Regarding claim 2, Modified You teaches the system of claim 1, You teaches generate the at least one attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: projection matrix is interpreted as attribute attention projection”)
	based on at least one attribute code; of the training images (With all the information useful for predicting Yt captured by the current state ht, the score βit for each attribute Ai is measured with respect to ht: 
	β_t^i ∝ exp(h_t^T V σ(E y^i)), (8)
where V ∈ R^{n×d} is the bilinear parameter matrix. σ denotes the activation function connecting input node to hidden state in RNN, which is used here to ensure the same nonlinear transform is applied to the two feature vectors before they are compared. Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution p; “Examiner notes: the score β_t^i is the attribute code”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3).
	Song teaches for the at least one attribute category (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified You to incorporate the teachings of Song for the benefit of extracting edge maps from bounding box areas of the images (Song, [0091]).
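
As a hedged illustration of the score in Equation (8) quoted from You, β_t^i ∝ exp(h_t^T V σ(E y^i)), the following Python sketch computes one score per attribute. Two points are assumptions made here and are not stated in the quoted passage: σ is taken to be tanh, and the proportionality is resolved by normalizing the scores to sum to one over the attributes. The function name and all numeric values are hypothetical.

```python
import math


def attribute_attention_scores(h, V, embedded_attrs, activation=math.tanh):
    """Scores proportional to exp(h^T V sigma(e_i)), normalized to sum to 1.

    h               -- hidden state vector of length n
    V               -- bilinear parameter matrix with n rows and d columns
    embedded_attrs  -- list of attribute embeddings e_i, each of length d
    activation      -- stands in for sigma (assumed to be tanh in this sketch)
    """
    logits = []
    for e in embedded_attrs:
        s = [activation(x) for x in e]  # sigma applied elementwise
        v_s = [sum(row[c] * s[c] for c in range(len(s))) for row in V]
        logits.append(sum(h[r] * v_s[r] for r in range(len(h))))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [wt / total for wt in weights]


# Toy example with n = d = 2 and an identity bilinear matrix.
h = [1.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0]]
embedded_attrs = [[1.0, 0.0], [-1.0, 0.0]]
scores = attribute_attention_scores(h, V, embedded_attrs)
print(scores)
```

The attribute whose transformed embedding aligns with the hidden state receives the larger score, which is the sense in which the quoted passage says the scores modulate attention over the attributes.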
	
	Regarding claim 3, Modified You teaches the system of claim 1, generate the at least one attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: projection matrix is interpreted as attribute attention projection”) by:
	updating, in a first training iteration, a first attribute attention projection; of a first set of training images from the training images; (In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”; We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations, Pg. 1, right col, Fig. 1) and
	updating, in a second training iteration, a second attribute attention projection; of a second set of training images from the training images (In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization:
	p_t ∝ exp(E^T W^{Y,h} (h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))), (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para.; “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”; We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations, Pg. 1, right col, Fig. 1)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3)
	Song teaches for a first attribute category (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022]. “Examiner notes: a plurality of object categories includes a first attribute category of object”)
	for the at least one attribute category of the training images (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022].)
	for a second attribute category (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022]. “Examiner notes: a plurality of object categories includes a second attribute category of object”)
	The same motivation to combine as dependent claim 2 applies here.

	Regarding claim 4, Modified You teaches the system of claim 3, You teaches to insert the at least one attribute attention projection between the at least one set of layers in part by: (The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, Fig. 2, pg. 3, left para.)
	utilizing the attention controlled neural network in the first training iteration to: (Our goal is to learn all the attention model parameters ΘA = {U, V, W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set. The loss of one training example is defined as the total negative log-likelihood of all the words combined with regularization terms on attention scores {α i t} and {β i}, pg. 4, left col, last para.; We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations, Pg. 1, right col, Fig. 1)
	utilizing the attention controlled neural network in the second training iteration to: (Our goal is to learn all the attention model parameters ΘA = {U, V, W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set. The loss of one training example is defined as the total negative log-likelihood of all the words combined with regularization terms on attention scores {α i t} and {β i}, pg. 4, left col, last para.; We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations, Pg. 1, right col, Fig. 1)
	apply the first attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3)
	generate a first feature map based on a first training image of the first set of training images (The input images 210 are of pixel size of 230×80×3, and are firstly passed through 64 learned filters of size 7×7×3. Then, the resulting feature maps are passed through a max pooling kernel of size 3×3×3 with stride 3. Finally, these feature maps are passed through a rectified linear unit (ReLU) to introduce non-linearities, col 4, lines 20-25);
	first feature map between a first set of layers of the attention controlled neural network to generate a first discriminative feature map for the first training image (The first part of system 200 is the global sub-network 210, which includes a convolutional layer and max pooling layer. These layers are used to extract the low-level features of the input images, providing multi-level feature representations to be discriminately learned in the following part sub-network, col 4, lines 15-20)
	generate a second feature map based on a second training image of the second set of training images (The second part of system 200 is the local sub-network 220, which includes four teams of convolutional layers and max pooling layers. The input feature maps are divided into four equal horizontal patches across the height channel, which introduces 4×64 local feature maps of different body parts. Then, each local feature map is passed through a convolutional layer, which has 32 learned filters, each of size 3×3. Afterwards, the resulting feature maps are passed through max pooling kernels of size 3×3 with stride 1. Finally, a rectified linear unit (ReLU) is provided after each max pooling layer, col 4, lines 26-36);
	and apply the second attribute attention projection to the second feature map between a second set of layers of the attention controlled neural network to generate a second discriminative feature map for the second training image (In order to learn the feature representations of different body parts discriminately, col 4, lines 36-39)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of training images and to improve CNN efficacy (Wang, col 3, lines 58-63).

	Regarding claim 6, Modified You teaches the system of claim 3, You teaches jointly learn the at least one updated attribute attention projection and the updated parameters of the attention controlled neural network by: (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.;
Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 3, right col, last para., pg. 4, left col, first para.)
	jointly updating, in the first training iteration, the first attribute attention projection and parameters of the attention controlled neural network (In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	determining, in the first training iteration, a first triplet loss from a triplet-loss function based on a comparison of attribute-modulated-feature vectors for a first anchor image, a first positive image, and a first negative image from the first set of training images; (Given a set of triplet training samples {X_i^a, X_i^p, X_i^n}_{i=1}^N, the symmetric triplet loss improves the ranking accuracy by jointly keeping the similarity of positive pairs and dissimilarity of negative pairs. The hinge-like form of symmetric triplet loss can be formulated as follows: …, col 5, lines 15-25; Given a set of triplet units X = {x_i^a, x_i^p, x_i^n}_{i=1}^N, where x_i^a and x_i^p are the positive pairs and x_i^a and x_i^n represent the negative pairs, col 8, lines 40-44)
	based on the first triplet loss (Given a set of triplet training samples {X_i^a, X_i^p, X_i^n}_{i=1}^N, the symmetric triplet loss improves the ranking accuracy by jointly keeping the similarity of positive pairs and dissimilarity of negative pairs, col 5, lines 16-19)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of maximizing the relative distance between positive pairs and negative pairs in order to improve the ability to distinguish different individuals (Wang, col 8, lines 64-66).

	Regarding claim 7, Modified You teaches the system of claim 6, You teaches jointly learn the at least one updated attribute attention projection and the updated parameters of the attention controlled neural network by: (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.;
Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 3, right col, last para., pg. 4, left col, first para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.)
	jointly updating, in the second training iteration, the second attribute attention projection and the parameters of the attention controlled neural network (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.; Again, {βit} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h} (h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))), (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations, Pg. 1, right col, Fig. 1; “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	determining, in the second training iteration, a second triplet loss from the triplet-loss function based on a comparison of attribute-modulated-feature vectors for a second anchor image, a second positive image, and a second negative image from the second set of training images; (Given a set of triplet training samples {X_i^a, X_i^p, X_i^n}_{i=1}^N, the symmetric triplet loss improves the ranking accuracy by jointly keeping the similarity of positive pairs and dissimilarity of negative pairs, col 5, lines 16-19) and
	based on the second triplet loss (Given a set of triplet training samples {X_i^a, X_i^p, X_i^n}_{i=1}^N, the symmetric triplet loss improves the ranking accuracy by jointly keeping the similarity of positive pairs and dissimilarity of negative pairs, col 5, lines 16-19)
	The same motivation to combine as dependent claim 6 applies here.

	Regarding claim 8, Modified You teaches the system of claim 7, You teaches update the first attribute attention projection and the second attribute attention projection in multiple training iterations (In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para.; The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”; Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization:
	p_t ∝ exp(E^T W^{Y,h} (h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))), (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para.; “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	to comprise relatively similar values, wherein the relatively similar values indicate a correlation between the first attribute category and the second attribute category; or
update the first attribute attention projection and the second attribute attention projection in multiple training iterations to comprise relatively dissimilar values, wherein the relatively dissimilar values indicate a discorrelation between the first attribute category and the second attribute category (This is generally illustrated in FIG. 5, with self-paced deep ranking model 500 developed by relative similarity comparison. Relative similarity comparison can be formulated as follows:
	
	[media_image1.png] = max{[media_image2.png] + ∥f(x_i^a) − f(x_i^p)∥_2^2 − ∥f(x_i^a) − f(x_i^n)∥_2^2, 0}
where [media_image2.png] is the margin between positive pairs and negative pairs in the feature space, and the f(⋅) is the learned feature mapping function. As a result, the relative distance between positive pairs and negative pairs are maximized, which improves the ability to distinguish different individuals, col 8, lines 55-67)
	The same motivation to combine as dependent claim 6 applies here.
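
As a minimal sketch of the relative similarity comparison quoted from Wang in the rejection of claim 8, the hinge form max{margin + ∥f(x^a) − f(x^p)∥² − ∥f(x^a) − f(x^n)∥², 0} can be written in plain Python. The loss and margin symbols appear only as images in the quoted passage, so the names `relative_similarity_loss` and `margin` are placeholders chosen here, and the feature vectors are made-up values standing in for outputs of the learned mapping f(⋅).

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def relative_similarity_loss(f_anchor, f_positive, f_negative, margin):
    """Hinge on margin + d(anchor, positive) - d(anchor, negative)."""
    gap = squared_l2(f_anchor, f_positive) - squared_l2(f_anchor, f_negative)
    return max(margin + gap, 0.0)


# The negative is already far from the anchor: 1 + 1 - 9 < 0, so no loss.
loss_easy = relative_similarity_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0], margin=1.0)

# The negative is closer than the positive: 1 + 4 - 1 = 4, so the loss is positive.
loss_hard = relative_similarity_loss([0.0, 0.0], [2.0, 0.0], [0.0, 1.0], margin=1.0)

print(loss_easy, loss_hard)
```

The loss is zero only when the negative pair is farther apart than the positive pair by at least the margin, which corresponds to the quoted statement that the relative distance between positive pairs and negative pairs is maximized.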

	Regarding claim 9, You teaches generate an attribute attention projection
(The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: projection matrix is interpreted as attribute attention projection”)
	based on an attribute code (With all the information useful for predicting Yt captured by the current state ht, the score βit for each attribute Ai is measured with respect to ht: 
	β_t^i ∝ exp(h_t^T V σ(E y^i)), (8)
where V ∈ R^{n×d} is the bilinear parameter matrix. σ denotes the activation function connecting input node to hidden state in RNN, which is used here to ensure the same nonlinear transform is applied to the two feature vectors before they are compared. Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution p, pg. 4, left col, second para.; “Examiner notes: the score β_t^i is the attribute code”)
	utilize an attention controlled neural network to generate an attribute-modulated-feature vector for the digital input image by inserting the attribute attention projection between at least one set of layers of the attention controlled neural network; (Figure 2. The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, pg. 3, left col; Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: vector w is the attribute-modulated-feature vector”); and
	perform a task based on the digital input image and the attribute-modulated-feature vector (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR, pg. 4, left col, last para.; Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i, (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 3, right col, last para., pg. 4, left col, first para.)
	You does not explicitly teach “a non-transitory computer readable medium storing instructions thereon that, when executed by at least one processor, cause a computing device to:” or “for an attribute category of a digital input image.”
	Wang teaches a non-transitory computer readable medium storing instructions thereon that, when executed by at least one processor, cause a computing device to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of training images and to improve CNN efficacy (Wang, col 3, lines 58-63).
	Song teaches for an attribute category of a digital input image (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022]).
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified You to incorporate the teachings of Song for the benefit of extracting edge maps from bounding box areas of the images (Song, [0091])
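For illustration only, the input-attention computation of equation (7) quoted from You above can be sketched in Python/NumPy as follows. The function name input_attention, the argument names, and the column-per-word layout of E are assumptions made for this sketch and do not appear in You. Because the quoted dimensions are a projection matrix in R^{m×d} and a weight vector in R^d, the sketch applies the projection matrix to the sum of both embedded terms so that the shapes agree; that grouping is likewise an assumption and is not shown in the quotation.

```python
import numpy as np

def input_attention(W_xY, E, prev_word_id, w_xA, alpha_t, attr_ids):
    """Sketch of Eq. (7): fuse the previous word with attention-weighted attributes.

    Assumed shapes: W_xY (m, d), E (d, vocab), w_xA (d,), alpha_t (k,), attr_ids (k,).
    """
    prev_embed = E[:, prev_word_id]        # E y_{t-1}, shape (d,)
    attr_embed = E[:, attr_ids]            # one column E y^i per attribute, shape (d, k)
    attr_sum = attr_embed @ alpha_t        # sum_i alpha_t^i E y^i, shape (d,)
    # diag(w) @ v equals the elementwise product w * v.
    return W_xY @ (prev_embed + w_xA * attr_sum)


rng = np.random.default_rng(0)
m, d, vocab = 3, 4, 6                      # hypothetical sizes
W_xY = rng.normal(size=(m, d))
E = rng.normal(size=(d, vocab))
w_xA = rng.normal(size=d)
attr_ids = np.array([1, 4])                # two detected attributes (hypothetical ids)
alpha_t = np.array([0.7, 0.3])             # their attention scores
x_t = input_attention(W_xY, E, 2, w_xA, alpha_t, attr_ids)
print(x_t.shape)                           # (3,)
```

In this sketch the vector w^{x,A} only rescales each dimension of the attribute sum, while the attention scores decide how much each attribute contributes.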

	Regarding claim 10, Modified You teaches the non-transitory computer readable medium of claim 9, You teaches utilize the attention controlled neural network to generate the attribute-modulated-feature vector based on parameters of the attention controlled neural network (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.; Again, {β i t} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))),  (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para. In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the computing device to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of training images and to improve CNN efficacy (Wang, col 3, lines 58-63).
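For illustration only, the output distribution of equation (9) quoted from You above can be sketched in Python/NumPy as follows. The function name word_distribution, the argument names, and the use of tanh for the activation σ are assumptions of this sketch. The quoted dimensions only add up when the RNN state size n equals the embedding size d, so the sketch assumes n == d.

```python
import numpy as np

def word_distribution(E, W_Yh, h_t, w_YA, beta_t, attr_ids, sigma=np.tanh):
    """Sketch of Eq. (9).

    Assumed shapes: E (d, vocab), W_Yh (d, n), h_t (n,), w_YA (n,), with n == d.
    """
    attr_act = sigma(E[:, attr_ids])       # sigma(E y^i) per attribute, shape (d, k)
    context = attr_act @ beta_t            # sum_i beta_t^i sigma(E y^i), shape (d,)
    fused = h_t + w_YA * context           # h_t + diag(w^{Y,A}) (...), needs n == d
    logits = E.T @ (W_Yh @ fused)          # E^T W^{Y,h} (...), shape (vocab,)
    logits = logits - logits.max()         # stabilise the exponentials
    p = np.exp(logits)
    return p / p.sum()                     # softmax normalisation


rng = np.random.default_rng(0)
d = n = 4                                  # hypothetical sizes, n == d by assumption
vocab = 6
E = rng.normal(size=(d, vocab))
W_Yh = rng.normal(size=(d, n))
h_t = rng.normal(size=n)
w_YA = rng.normal(size=n)
p_t = word_distribution(E, W_Yh, h_t, w_YA, np.array([0.6, 0.4]), np.array([0, 3]))
```

The result p_t is a probability vector over the vocabulary, which is what the quoted "linear transform followed by a softmax normalization" produces.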

	Regarding claim 11, Modified You teaches the non-transitory computer readable medium of claim 9, You teaches utilize an additional neural network to generate the attribute attention projection based on the attribute code (With all the information useful for predicting Yt captured by the current state ht, the score βit for each attribute Ai is measured with respect to ht: 
	β_t^i ∝ exp(h_t^T V σ(E y^i)),  (8)
where V ∈ R^{n×d} is the bilinear parameter matrix. σ denotes the activation function connecting input node to hidden state in RNN, which is used here to ensure the same nonlinear transform is applied to the two feature vectors before they are compared. Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to h_t in determining the distribution p_t, “Examiner notes: the score β_t^i is the attribute code”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the computing device to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	The same motivation to combine as dependent claim 10 applies here.
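For illustration only, the output attention scores of equation (8) quoted from You above can be sketched in Python/NumPy as follows. The function name output_attention_scores, the argument names, and the use of tanh for σ are assumptions of this sketch; resolving the proportionality by normalising over the attributes is also an assumption.

```python
import numpy as np

def output_attention_scores(h_t, V, E, attr_ids, sigma=np.tanh):
    """Sketch of Eq. (8). Assumed shapes: h_t (n,), V (n, d), E (d, vocab)."""
    attr_act = sigma(E[:, attr_ids])       # sigma(E y^i) per attribute, shape (d, k)
    logits = h_t @ V @ attr_act            # h_t^T V sigma(E y^i) for each i, shape (k,)
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()         # normalise so the k scores sum to 1


rng = np.random.default_rng(0)
n, d, vocab = 5, 4, 6                      # hypothetical sizes
h_t = rng.normal(size=n)
V = rng.normal(size=(n, d))
E = rng.normal(size=(d, vocab))
beta_t = output_attention_scores(h_t, V, E, np.array([0, 2, 5]))
```

Each score measures, through the bilinear form, how well one attribute matches the current RNN state.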

	Regarding claim 12, Modified You teaches the non-transitory computer readable medium of claim 9, You teaches wherein the attribute attention projection comprises a channel-wise scaling vector or a channel-wise projection matrix (The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; “Examiner notes: projection matrix is interpreted as attribute attention projection”)
	Wang teaches channel-wise matrix (The input feature maps are divided into four equal horizontal patches across the height channel, which introduces 4×64 local feature maps of different body parts. Then, each local feature map is passed through a convolutional layer, which has 32 learned filters, each of size 3×3, col 4, lines 28-33)
	The same motivation to combine as dependent claim 10 applies here.

	Regarding claim 13, Modified You teaches the non-transitory computer readable medium of claim 9, You teaches insert the attribute attention projection between the at least one set of layers in part by utilizing the attention controlled neural network to: (The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, Fig. 2, pg. 3, left para.)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the computing device to (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	generate a first feature map from the digital input image; (The input images 210 are of pixel size of 230×80×3, and are firstly passed through 64 learned filters of size 7×7×3. Then, the resulting feature maps are passed through a max pooling kernel of size 3×3×3 with stride 3. Finally, these feature maps are passed through a rectified linear unit (ReLU) to introduce non-linearities, col 4, lines 20-25)
	apply the attribute attention projection to the first feature map between a first set of layers of the attention controlled neural network to generate a first discriminative feature map for the digital input image; (The first part of system 200 is the global sub-network 210, which includes a convolutional layer and max pooling layer. These layers are used to extract the low-level features of the input images, providing multi-level feature representations to be discriminately learned in the following part sub-network, col 4, lines 15-20, Fig. 2)
	generate a second feature map based on the digital input image; (The second part of system 200 is the local sub-network 220, which includes four teams of convolutional layers and max pooling layers. The input feature maps are divided into four equal horizontal patches across the height channel, which introduces 4×64 local feature maps of different body parts. Then, each local feature map is passed through a convolutional layer, which has 32 learned filters, each of size 3×3. Afterwards, the resulting feature maps are passed through max pooling kernels of size 3×3 with stride 1. Finally, a rectified linear unit (ReLU) is provided after each max pooling layer, col 4, lines 26-36, Fig. 2) and
	apply the attribute attention projection to the second feature map between a second set of layers of the attention controlled neural network to generate a second discriminative feature map for the digital input image (the local feature maps of different body parts are discriminately learned by following two fully-connected layers in each team. The dimension of the fully-connected layer is 100, and a rectified linear unit (ReLU) is added between the two fully connected layers. Then, the discriminately learned local feature representations of the first four fully connected layers are concatenated to be summarized by adding another fully connected layer, whose dimension is 400., col 4, lines 42-49)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of learning the feature representations of different parts discriminately (Wang, col 4, lines 36-39).

	Regarding claim 15, Modified You teaches the non-transitory computer readable medium of claim 9, You teaches generate a second attribute attention projection (Again, {β i t} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))),  (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para. In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”)
	based on a second attribute code (The loss of one training example is defined as the total negative log-likelihood of all the words combined with regularization terms on attention scores {αit} and {βit}, “Examiner notes: αit  is the second attribute code”)
	utilize the attention controlled neural network to generate a second attribute-modulated-feature vector for the digital input image (Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para. “Examiner notes: vector w is the attribute-modulated-feature vector”);
	by inserting the second attribute attention projection between the at least one set of layers of the attention controlled neural network; (The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, Fig. 2, pg. 3, left para.)
	utilize the attention controlled neural network (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set, pg. 4, left col, last para.;) 
	by inserting the third attribute attention projection between the at least one set of layers of the attention controlled neural network; (The framework of the proposed image captioning system. Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, Fig. 2, pg. 3, left para.) and
	perform the task based on the digital input image, the attribute-modulated-feature vector, the second attribute-modulated-feature vector, and the third attribute-modulated-feature vector (Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, pg. 3, Fig. 2)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the computing device to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3)
	Song teaches for a second attribute category of the digital input image; (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022].)
	based on a third attribute code (21 and 15 binary attributes for shoes and chairs respectively were selected and all 1,432 images were annotated with ground-truth attribute vectors)
	for a third attribute category of the digital input image; (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022] “Examiner notes: plurality of categories of object include first category, second category, third category and so on”)
	generate a third attribute attention projection (In particular, each image was represented by its annotated attribute vector, concatenated with a data driven representation obtained by feeding the image into an existing well-trained deep neural network [0082])
	to generate a third attribute-modulated-feature vector for the digital input image (During training, there are three branches in the network of the invention, and each corresponds to one of the atoms in the triplet: query sketch s, positive photo p+ and negative photo p- (see Fig. 9). The weights of the two photo branches should always be shared, [0087] “Examiner notes: f0 is a weight which is the third attribute-modulated-feature vector”) 
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified You to incorporate the teachings of Song for the benefit of extracting edge maps from bounding box areas of the images (Song, [0091]).

	Regarding claim 16, Modified You teaches the non-transitory computer readable medium of claim 15, Wang teaches wherein: a first relative value difference separates the attribute attention projection and the second attribute attention projection, (A symmetry regularization term is used to revise the asymmetric gradient back-propagation of relative similarity comparison metric, so as to jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit, col 8, lines 37-40)
	the first relative value difference indicating a correlation between the attribute category and the second attribute category; (This is generally illustrated in FIG. 5, with self-paced deep ranking model 500 developed by relative similarity comparison. Relative similarity comparison can be formulated as follows:

[image: media_image1.png] = max{[image: media_image2.png] + ∥f(x_i^a) − f(x_i^p)∥_2^2 − ∥f(x_i^a) − f(x_i^n)∥_2^2, 0}
where [image: media_image2.png] is the margin between positive pairs and negative pairs in the feature space, and the f(⋅) is the learned feature mapping function. As a result, the relative distance between positive pairs and negative pairs are maximized, which improves the ability to distinguish different individuals, col 8, lines 55-67) and
	a second relative value difference separates the attribute attention projection and the third attribute attention projection, the second relative value difference indicating a discorrelation between the attribute category and the third attribute category (A symmetry regularization term is used to revise the asymmetric gradient back-propagation of relative similarity comparison metric, so as to jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit.
Given a set of triplet units X = {x_i^a, x_i^p, x_i^n}_{i=1}^N, where x_i^a and x_i^p are the positive pairs and x_i^a and x_i^n represent the negative pairs, a self-paced ranking can be developed as follows:….,
where μ=[μ1, . . . , μN]T are the weights of all samples, λ, ϑ are the model age parameters, ζ is the weight parameter. Use of this method allows jointly pulling the positive pairs and pushing the negative pairs in each triplet unit. In effect, the first term maximizes the relative distances between the positive pairs and negative pairs, col 8, lines 37-50)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Wang for the benefit of learning the feature representations of different parts discriminately (Wang, col 4, lines 36-39).
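For illustration only, the relative similarity comparison quoted from Wang above (the max{margin + squared positive distance − squared negative distance, 0} expression) can be sketched in Python/NumPy as follows. The function name relative_similarity_loss and the parameter name margin are assumptions of this sketch; the margin symbol appears only as an image in the quoted text, and the feature vectors stand in for the outputs f(x) of the learned mapping.

```python
import numpy as np

def relative_similarity_loss(f_a, f_p, f_n, margin):
    """Hinge on the gap between anchor-positive and anchor-negative squared distances."""
    d_pos = np.sum((f_a - f_p) ** 2)       # ||f(x^a) - f(x^p)||_2^2
    d_neg = np.sum((f_a - f_n) ** 2)       # ||f(x^a) - f(x^n)||_2^2
    return max(margin + d_pos - d_neg, 0.0)


anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
far_negative = np.array([3.0, 0.0])        # already well separated: loss is zero
near_negative = np.array([0.0, 1.0])       # as close as the positive: loss is positive
loss_far = relative_similarity_loss(anchor, positive, far_negative, margin=1.0)
loss_near = relative_similarity_loss(anchor, positive, near_negative, margin=1.0)
print(loss_far, loss_near)                 # 0.0 1.0
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which corresponds to the quoted statement that the relative distance between positive pairs and negative pairs is maximized.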

4.	Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over You et al. (Image Captioning with Semantic Attention, arXiv:1603.03925v1 [cs.CV] 12 Mar 2016) in view of Wang et al (US10755082), in view of Song et al (WO2017168125), and further in view of Dijkman et al (US20170011281)

	Regarding claim 5, Modified You teaches the system of claim 4, You teaches to apply the first attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”)
	to apply the second attribute attention projection (Again, {β i t} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a compliment to ht in determining the distribution pt. Specifically, the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))),  (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para. In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{Y,h} ∈ R^{d×n} is the second attribute projection”)
	Wang teaches further comprising instructions that, when executed by the at least one processor, cause the system to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3)
	to the first feature map between the first set of layers (The input images 210 are of pixel size of 230×80×3, and are firstly passed through 64 learned filters of size 7×7×3. Then, the resulting feature maps are passed through a max pooling kernel of size 3×3×3 with stride 3. Finally, these feature maps are passed through a rectified linear unit (ReLU) to introduce non-linearities, col 4, lines 20-25)
	 to the second feature map between the second set of layers (The second part of system 200 is the local sub-network 220, which includes four teams of convolutional layers and max pooling layers. The input feature maps are divided into four equal horizontal patches across the height channel, which introduces 4×64 local feature maps of different body parts. Then, each local feature map is passed through a convolutional layer, which has 32 learned filters, each of size 3×3. Afterwards, the resulting feature maps are passed through max pooling kernels of size 3×3 with stride 1. Finally, a rectified linear unit (ReLU) is provided after each max pooling layer, col 4, lines 26-36)
	Modified You does not explicitly teach utilize a first gradient modulator in the first training iteration; utilize a second gradient modulator in the second training iteration
	Dijkman teaches utilize a first gradient modulator in the first training iteration (The network architecture 900 may be trained utilizing training module 971 [0083] …utilizing back-propagation of gradients to modulate scores for prior boxes in the attention path 908… This technique may be used to modulate the scores for prior boxes in the attention path [0104], “Examiner notes: training module 971 as first modulator”); and
	utilize a second gradient modulator in the second training iteration (The network architecture 900 may be trained utilizing training module 972 [0083] …utilizing back-propagation of gradients to modulate scores for prior boxes in the attention path 908… This technique may be used to modulate the scores for prior boxes in the attention path [0104], “Examiner notes: training module 972 as second modulator”)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of modified You to incorporate the teachings of Dijkman for the benefit of a neural network comprising an attention path that outputs a signal indicating whether an object of interest is present in a corresponding window (Dijkman [0080]).

	Regarding claim 14, Modified You teaches the non-transitory computer readable medium of claim 13, You teaches to apply the attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”)
	to apply the attribute attention projection (The weighted sum of all attributes is mapped from word embedding space to the input space of xt together with the previous word: 
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{x,Y} ∈ R^{m×d} is the first attribute attention projection”)
	Wang teaches to the first feature map between a first convolutional layer and a second convolutional layer of the attention controlled neural network; (The input images 210 are of pixel size of 230×80×3, and are firstly passed through 64 learned filters of size 7×7×3. Then, the resulting feature maps are passed through a max pooling kernel of size 3×3×3 with stride 3. Finally, these feature maps are passed through a rectified linear unit (ReLU) to introduce non-linearities, col 4, lines 20-25, “Examiner notes: the first feature map which is passed to the pooling layer is between the convolutional layers of the global sub-network 210 and the convolutional layers of the local sub-network 220, Fig. 2”)
	to the second feature map between a third convolutional layer and a fully-connected layer of the attention controlled neural network. (The second part of system 200 is the local sub-network 220, which includes four teams of convolutional layers and max pooling layers. The input feature maps are divided into four equal horizontal patches across the height channel, which introduces 4×64 local feature maps of different body parts. Then, each local feature map is passed through a convolutional layer, which has 32 learned filters, each of size 3×3. Afterwards, the resulting feature maps are passed through max pooling kernels of size 3×3 with stride 1. Finally, a rectified linear unit (ReLU) is provided after each max pooling layer, col 4, lines 26-36, “Examiner notes: the local feature map, which has 32 learned filters, each of size 3×3 is between the convolution layer in local sub-network 220 and fusion sub-network 230, Fig. 2”)
	 further comprising instructions that, when executed by the at least one processor, cause the computing device to: (Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions, col 11-12, lines 66-67 and 1-3):
	Dijkman teaches utilize a first gradient modulator (The network architecture 900 may be trained utilizing training module 971 [0083] …utilizing back-propagation of gradients to modulate scores for prior boxes in the attention path 908… This technique may be used to modulate the scores for prior boxes in the attention path [0104], “Examiner notes: training module 971 as first modulator”); and
	utilize a second gradient modulator (The network architecture 900 may be trained utilizing training module 972 [0083] …utilizing back-propagation of gradients to modulate scores for prior boxes in the attention path 908… This technique may be used to modulate the scores for prior boxes in the attention path [0104], “Examiner notes: training module 972 as second modulator”)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of modified You to incorporate the teachings of Dijkman for the benefit of a neural network comprising an attention path that outputs a signal indicating whether an object of interest is present in a corresponding window (Dijkman [0080]).

5.	Claims 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over You et al. (Image Captioning with Semantic Attention, arXiv:1603.03925v1 [cs.CV] 12 Mar 2016) in view of Song et al (WO2017168125)

	Regarding claim 17, You teaches a method for training and applying attention controlled neural networks (The training data for each image consist of input image features v, {Ai} and output caption words sequence {Yt}. Our goal is to learn all the attention model parameters ΘA = {U, V , W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR by minimizing a loss function over training set. The loss of one training example is defined as the total negative log-likelihood of all the words combined with regularization terms on attention scores {αit} and {βit}, pg. 4, left col, last para.) comprising:
	performing a step for training an attention controlled neural network to generate attribute-modulated-feature vectors using attribute attention projections; (Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para. “Examiner notes: vector w is the attribute-modulated-feature vector, and W^{x,Y} ∈ R^{m×d} is the attribute attention projection”); and
	performing a step for generating an attribute-modulated-feature vector for a digital input image using an attribute attention projection and the trained attention controlled neural network; (Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 4, left col, first para.; Visual features of CNN responses v and attribute detections {Ai} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, pg. 3, Fig. 2 “Examiner notes: vector w is the attribute-modulated-feature vector”) and
	performing a task based on the digital input image and the attribute-modulated-feature vector for the digital input image. (the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) ∑_i β_t^i σ(E y^i))),  (9)
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para., “Examiner notes: vector w, as the attribute-modulated-feature vector, is implemented in the softmax normalization”)
	You does not explicitly teach attribute categories.
	Song teaches attribute categories (An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object, [0022])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of You to incorporate the teachings of Song for the benefit of extracting edge maps from bounding box areas of the images (Song, [0091]).

	Regarding claim 18, Modified You teaches the method of claim 17, Song teaches wherein the attribute categories comprise facial-feature categories or product-feature categories (Attribute Annotation: first an ontology of attributes for shoes and chairs is defined based on existing UT-Zap50K attributes and product tags on online shopping websites [0081])
	The same motivation to combine as set forth for claim 17 applies here.

	Regarding claim 19, Modified You teaches the method of claim 17, You teaches wherein performing the task based on the digital input image and the attribute-modulated-feature vector for the digital input image (Our goal is to learn all the attention model parameters ΘA = {U,V ,W∗,∗ , w∗,∗} jointly with all RNN parameters ΘR, pg. 4, left col, last para.; Once calculated, the attention scores are used to modulate the strength of attention on different attributes. The weighted sum of all attributes is mapped from word embedding space to the input space of x_t together with the previous word:
	x_t = W^{x,Y} E y_{t−1} + diag(w^{x,A}) ∑_i α_t^i E y^i,  (7)
where W^{x,Y} ∈ R^{m×d} is the projection matrix, diag(w) denotes a diagonal matrix constructed with vector w, and w^{x,A} ∈ R^d models the relative importance of visual attributes in each dimension of the word space, pg. 3, right col, last para., pg. 4, left col, first para.) comprises
	Song teaches retrieving, from an image database, a digital output image corresponding to the digital input image, (The fine-grained retrieval engine 2 comprises a ranking model, e.g. a deep ranking model, trained from a database of sketches and photos. It can then be used non-interactively to retrieve photos similar to an input sketch [0057])
	the digital output image including an output attribute that corresponds to an input attribute of the digital input image (The final layer has 250 output units corresponding to 250 categories (the number of unique classes in the TU-Berlin sketch dataset), upon which we place a softmax loss [00110]).
	The same motivation to combine dependent claim 17 applies here.

	Regarding claim 20, Modified You teaches the method of claim 17, You teaches further comprising: generating an additional attribute-modulated-feature vector for the digital input image using an additional attribute attention projection and the trained attention controlled neural network; (Our goal is to learn all the attention model parameters Θ_A = {U, V, W^{∗,∗}, w^{∗,∗}} jointly with all RNN parameters Θ_R by minimizing a loss function over training set, pg. 4, left col, last para.;
Again, {β_t^i} are used to modulate the attention on all the attributes, and the weighted sum of their activations is used as a complement to h_t in determining the distribution p_t. Specifically, the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) Σ_i β_t^i σ(E y_i))), (9) 
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para.; In training, we use RMSProp algorithm to do model updating with a mini-batch size of 256, pg. 5, right col, fourth para. “Examiner notes: W^{Y,h} ∈ R^{d×n} is the additional attribute projection (Visual features of CNN responses v and attribute detections {A_i} are injected into RNN (dashed arrows) and get fused together through a feedback loop (blue arrows). Attention on attributes is enforced by both input model φ and output model ϕ, pg. 3, Fig. 2); vector w is the additional attribute-modulated-feature vector”) and
	performing the task based on the digital input image, the attribute-modulated-feature vector, and the additional attribute-modulated-feature vector (the distribution is generated by a linear transform followed by a softmax normalization: 
	p_t ∝ exp(E^T W^{Y,h}(h_t + diag(w^{Y,A}) Σ_i β_t^i σ(E y_i))), (9) 
where W^{Y,h} ∈ R^{d×n} is the projection matrix and w^{Y,A} ∈ R^n models the relative importance of visual attributes in each dimension of the RNN state space, pg. 4, left col, second to the last para., “Examiner notes: vector w, as the attribute-modulated-feature vector, is implemented in the softmax normalization”)
	Song teaches by retrieving, from an image database, a digital output image corresponding to the digital input image, (The fine-grained retrieval engine 2 comprises a ranking model, e.g. a deep ranking model, trained from a database of sketches and photos. It can then be used non-interactively to retrieve photos similar to an input sketch [0057])
	the digital output image including a first output attribute and a second output attribute respectively corresponding to a first input attribute and a second attribute of the digital input image. (The final layer has 250 output units corresponding to 250 categories (the number of unique classes in the TU-Berlin sketch dataset), upon which we place a softmax loss [00110].)
	The same motivation to combine dependent claim 17 applies here.

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.G./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121