DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
The Examiner attempted multiple times to reach Jiaping Liu by telephone to propose an Examiner’s Amendment that would put the Application in condition for allowance. The outstanding issue is that claim 5 depends on cancelled claim 4.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 5 depends on cancelled claim 4. Because the parent claim no longer exists, there is insufficient antecedent basis for the limitations of claim 5 that refer back to it. Appropriate correction is required.

Allowable Subject Matter
Claims 1-3, 5-13, and 15-20 would be allowable if the Applicant overcomes the 35 U.S.C. 112 rejection set forth above.
The following is a statement of reasons for the indication of allowable subject matter:  Li et al. (KR 102352128) in view of Hori et al. (US 2021/0082398): Li teaches a method for performing a visual dialogue task by a neural network model, the method comprising: receiving, at a visual dialogue neural network language model, an image input and text input, wherein the text input comprises a dialogue history between the visual dialogue neural network language model and a human user and a current question by the human user ([pg. 1; para. 3-4] image-based conversation system using deep image understanding; input processing unit includes a language processing unit which generates a language feature by fusing the question features extracted from the questioner’s question about the input image and one or more dialog features extracted from the past conversation history with respect to the input image); 
generating, from the image input and text input and using a transformer encoder network in the visual dialogue neural network language model, a unified contextualized representation, wherein the unified contextualized representation includes a token level encoding of the image input and text input ([pg. 1; para. 4-5] the input processing unit detects an object in an image given as an input for an image-based conversation and recognizes attribute information of the detected object; the deep learning algorithm is used to extract visual features, question features and dialog features; the context generator generates a context feature by fusing the final visual feature of the image processor and the language feature of the language processor); 
generating, from the unified contextualized representation, using a plurality of visual encoding layers in the visual dialogue neural network language model, an encoded visual dialogue input, wherein the encoded visual dialogue input includes a position level encoding and a segment type encoding ([pg. 2 last para.] [pg. 3 para. 1] [pg. 4 para. 2-3] the visual feature extraction unit extracts visual features from the input image using a convolutional neural network; the encoder extracts attribute information in the input image; the encoder uses language feature vectors to focus attention on the most relevant regions in the overall image; each human area is detected in a visual feature map; the visual feature of each human region obtained goes through a Person Attribute Recognition stage; the encoder encodes the features using Long Short Term Memory (LSTM) layers, which is a word embedding and a recurrent neural network (RNN)); 
generating, from the encoded visual dialogue input and using a first self-attention mask associated with discriminative settings of the transformer encoder network ([pg. 5 para. 1] the discriminative decoder of the system is the fused feature information obtained from the encoder; based on the list of answers, choose the most appropriate answer; identifying a list of incoming answers of each candidate answer) or a second self-attention mask associated with generative settings of the transformer encoder network, an answer prediction ([pg. 4 para. 5] the encoder uses an attention mechanism to extract the current question from the input image; the correlation between the visual feature vector and the linguistic feature vector is calculated through the dot product; the calculated dot product value is used as a weight value through a Softmax layer); and 
providing the answer prediction as a response to the current utterance of the human user ([pg. 2 para. 3] selecting an appropriate answer to the question from the candidate answers).
Hori teaches an utterance by the human user ([0002] a human-machine interface that can process spoken dialogs).
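For illustration only, the attention step attributed to Li above (the dot product between the visual feature vector and the linguistic feature vector, used as a weight value through a Softmax layer) can be sketched as follows. This is a minimal sketch and not Li's implementation: the function names, the two-dimensional features, and the toy values are hypothetical and are not taken from Li or from the application.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(region_features, language_feature):
    """Weight each image region by its dot product with the language feature.

    region_features: list of per-region visual feature vectors (hypothetical).
    language_feature: one linguistic feature vector of the same length.
    """
    scores = [
        sum(r * q for r, q in zip(region, language_feature))
        for region in region_features
    ]
    return softmax(scores)


# Hypothetical toy data: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
weights = attention_weights(regions, question)
```

In this toy case the first and third regions have the same dot product with the question vector, so they receive equal weight, and the second region receives the smallest weight.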
The difference between the prior art and the claimed invention is that the above prior art does not explicitly teach wherein the generating comprises setting a first subset of generative settings associated with the second self-attention mask to zero values that allows each token in a context sequence to be visible for attending to each other with one or more attention layers in the transformer encoder network.
Therefore, it would not have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of the above prior art to include wherein the generating comprises setting a first subset of generative settings associated with the second self-attention mask to zero values that allows each token in a context sequence to be visible for attending to each other with one or more attention layers in the transformer encoder network. Therefore, the claimed invention is deemed novel. 
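For illustration only, one way to read the distinguishing limitation is as an additive self-attention mask in which a zero entry leaves a token visible and a negative-infinity entry blocks it. The sketch below rests on that assumption, which is not stated in the claim or in the cited art: the additive convention, the function name, and the treatment of answer tokens (each answer token sees the context and earlier answer tokens only) are hypothetical choices made for the example.

```python
NEG_INF = float("-inf")


def generative_mask(context_len, answer_len):
    """Build an additive self-attention mask for a generative setting.

    Assumed convention (hypothetical): mask[i][j] == 0.0 means token i may
    attend to token j; mask[i][j] == -inf means token j is hidden from i.
    The first context_len tokens are the context sequence; the remaining
    answer_len tokens are the answer being generated.
    """
    n = context_len + answer_len
    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < context_len:
                # Zero values: every context token is visible, so context
                # tokens can all attend to each other.
                mask[i][j] = 0.0
            elif i >= context_len and j <= i:
                # An answer token sees itself and earlier answer tokens.
                mask[i][j] = 0.0
    return mask


# Hypothetical sizes: 3 context tokens followed by 2 answer tokens.
mask = generative_mask(context_len=3, answer_len=2)
```

Under this reading, the upper-left context block of the mask is all zeros, while context tokens cannot attend to answer tokens and no answer token can attend to a later answer token.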

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656