DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 9, 11-13 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (KR 102352128) in view of Hori et al. (US 2021/0082398).

Claim 1,
Li teaches a method for performing a visual dialogue task by a neural network model, the method comprising: receiving, at a visual dialogue neural network language model, an image input and text input, wherein the text input comprises a dialogue history between the visual dialogue neural network language model and a human user and a current question by the human user ([pg. 1; para. 3-4] image-based conversation system using deep image understanding; input processing unit includes a language processing unit which generates a language feature by fusing the question features extracted from the questioner’s question about the input image and one or more dialog features extracted from the past conversation history with respect to the input image); 
generating, from the image input and text input and using a transformer encoder network in the visual dialogue neural network language model, a unified contextualized representation, wherein the unified contextualized representation includes a token level encoding of the image input and text input ([pg. 1; para. 4-5] the input processing unit detects an object in an image given as an input for an image-based conversation and recognizes attribute information of the detected object; the deep learning algorithm is used to extract visual features, question features and dialog features; the context generator generates a context feature by fusing the final visual feature of the image processor and the language feature of the language processor); 
generating, from the unified contextualized representation, using a plurality of visual encoding layers in the visual dialogue neural network language model, an encoded visual dialogue input, wherein the encoded visual dialogue input includes a position level encoding and a segment type encoding ([pg. 2 last para.] [pg. 3 para. 1] [pg. 4 para. 2-3] the visual feature extraction unit extracts visual features from the input image using a convolutional neural network; the encoder extracts attribute information in the input image; the encoder uses language feature vectors to focus attention on the most relevant regions in the overall image; each human area is detected in a visual feature map; the visual feature of each human region obtained goes through a Person Attribute Recognition stage; encoder encodes the features using Long Short Term Memory (LSTM) layers, which is a word embedding and a recurrent neural network (RNN)); 
generating, from the encoded visual dialogue input and using a first self-attention mask associated with discriminative settings of the transformer encoder network ([pg. 5 para. 1] the discriminative decoder of the system is the fused feature information obtained from the encoder; based on the list of answers, choose the most appropriate answer; identifying a list of incoming answers of each candidate answer) or a second self-attention mask associated with generative settings of the transformer encoder network, an answer prediction ([pg. 4 para. 5] the encoder uses an attention mechanism to extract the current question from the input image; the correlation between the visual feature vector and the linguistic feature vector is calculated through the dot product; the calculated dot product value is used as a weight value through a Softmax layer); and 
providing the answer prediction as a response to the current utterance of the human user ([pg. 2 para. 3] selecting an appropriate answer to the question from the candidate answer).
The difference between the prior art and the claimed invention is that Li does not explicitly teach utterance by the human user.
Hori teaches utterance by the human user ([0002] human-machine interface that can process spoken dialogs).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Li with the teachings of Hori by modifying the system for visual dialog using deep visual understanding as taught by Li to include utterance by the human user as taught by Hori for the benefit of processing multimodal sensory inputs and generating suitable responses in conversations (Hori [0002]).
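
For reference, the attention computation cited from Li above (dot product between the visual feature vector and the linguistic feature vector, passed through a Softmax layer to obtain weight values) can be sketched as follows. This is a minimal sketch only: the function names, the plain-list feature vectors and the weighted-sum step are assumptions made for illustration and are not taken from Li or from the claims.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(region_features, language_feature):
    """Dot product of each image-region vector with the language vector, then softmax."""
    scores = [
        sum(v * q for v, q in zip(region, language_feature))
        for region in region_features
    ]
    return softmax(scores)


def attend(region_features, language_feature):
    """Weighted sum of region vectors (hypothetical fusion step, assumed for illustration)."""
    weights = attention_weights(region_features, language_feature)
    dim = len(region_features[0])
    return [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy visual features
    question = [2.0, 0.0]                           # toy linguistic feature
    print(attention_weights(regions, question))
    print(attend(regions, question))
```

In this toy input the first region has the largest dot product with the question vector, so it receives the largest weight.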

Claim 2,
Li further teaches the method of claim 1, wherein the unified contextualized representation comprises a concatenation of the image input and the text input that comprises a caption of the image input, a dialogue history of a user-machine interaction and a user question of the current utterance ([pg. 2 para. 4] [pg. 3 para. 4] generating the context feature by fusing the final visual feature of the image processor and the language feature processor; the past conversation history includes a pair of questions and answers of each round made before the current question with respect to the input image, and a caption that is a short explanatory text for the input image).

Claim 9,
Li further teaches the method of claim 1, wherein generating the answer prediction comprises: generating, by the transformer encoder network, a plurality of answer candidates; and providing, by a ranking module of the visual dialogue neural network language model, a plurality of dense annotations that specify a plurality of relevance scores for the plurality of answer candidates ([pg. 3 para. 6] the answer selection unit extracts answer features for each candidate answer included in the answer list; the answer selector calculates the dot product of the context feature and each answer feature to obtain an inner product value, then converts it into a score of the corresponding candidate answer, and selects a candidate with a relatively high score among the converted scores).
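
The candidate scoring cited from Li above (inner product of the context feature with each answer feature, converted into a score, with a high-scoring candidate selected) can be sketched as follows. The cited passage does not say how the inner product is converted into a score; a softmax is assumed here, and all names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_candidates(context_feature, answer_features):
    """Inner product per candidate, converted to scores with an assumed softmax."""
    inner = [dot(context_feature, answer) for answer in answer_features]
    peak = max(inner)
    exps = [math.exp(v - peak) for v in inner]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(context_feature, answer_features):
    """Return candidate indices ordered from highest to lowest score."""
    scores = score_candidates(context_feature, answer_features)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    context = [1.0, 2.0]                              # toy context feature
    answers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy answer features
    print(score_candidates(context, answers))
    print(rank_candidates(context, answers))
```

Because the assumed conversion is monotonic, the ranking is the same as ranking by the raw inner product values.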

Claim 11,
Li further teaches the method of claim 9, further comprising: generating a sampled answer candidate list that comprises the plurality of relevance scores based at least in part on first relevance score candidates with non-zero values having a first priority and second relevance score candidates with zero values having a second priority, wherein the first priority is greater than the second priority (the answer selection unit 500 extracts answer features for each candidate answer included in the answer list; converts it into a score of the corresponding candidate answer, and selects a candidate with a relatively high score among the converted scores; selects and prints the answer as an answer to the question; the priority in the limitation describes a ranking system which Li teaches (ranking candidate answer)).
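
One way to read the priority limitation recited in claim 11 is sketched below: candidates with non-zero relevance scores are placed in the sampled list first, and candidates with zero relevance scores fill any remaining slots. This is an assumed interpretation for illustration only; the list size `k`, the random ordering within each priority group and the function name are hypothetical and appear in neither the claim nor Li.

```python
import random


def sample_candidates(relevance, k, seed=0):
    """Pick up to k (index, relevance) pairs, non-zero relevance scores first."""
    rng = random.Random(seed)
    nonzero = [i for i, r in enumerate(relevance) if r != 0]  # first priority
    zero = [i for i, r in enumerate(relevance) if r == 0]     # second priority
    rng.shuffle(nonzero)  # random order within each priority group (assumption)
    rng.shuffle(zero)
    picked = (nonzero + zero)[:k]
    return [(i, relevance[i]) for i in picked]


if __name__ == "__main__":
    relevance_scores = [0.0, 0.5, 0.0, 1.0, 0.0]  # toy dense annotations
    print(sample_candidates(relevance_scores, k=3))
```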


Claim 12,
Li teaches a system comprising: a non-transitory memory; and one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising: receiving, at a visual dialogue module neural network language model, an image input and text input, wherein the text input comprises a dialogue history between the visual dialogue module neural network language model and a human user and a current question by the human user ([pg. 1; para. 3-4] image-based conversation system using deep image understanding; input processing unit includes a language processing unit which generates a language feature by fusing the question features extracted from the questioner’s question about the input image and one or more dialog features extracted from the past conversation history with respect to the input image); 
generating, using a plurality of image encoding layers in the visual dialogue module neural network language model, an encoded image input, wherein the encoded image input includes a token level encoding and at least one of a position level encoding or a segment level encoding ([pg. 1; para. 4-5] [pg. 3 para. 1] the input processing unit detects an object in an image given as an input for an image-based conversation and recognizes attribute information of the detected object; the deep learning algorithm is used to extract visual features, question features and dialog features; the context generator generates a context feature by fusing the final visual feature of the image processor and the language feature of the language processor; the visual features are extracted using a CNN algorithm (it is inherent that a neural network includes layers)); 
generating, using a plurality of text encoding layers in the visual dialogue module neural network language model, an encoded text input, wherein the encoded text input includes a token level encoding and at least one of a position level encoding or a segment level encoding ([pg. 2 last para.] [pg. 3 para. 1] [pg. 4 para. 2-3] the visual feature extraction unit extracts visual features from the input image using a convolutional neural network; the encoder extracts attribute information in the input image; the encoder uses language feature vectors to focus attention on the most relevant regions in the overall image; each human area is detected in a visual feature map; the visual feature of each human region obtained goes through a Person Attribute Recognition stage; encoder encodes the features using Long Short Term Memory (LSTM) layers, which is a word embedding and a recurrent neural network (RNN)); 
concatenating the encoded image input and the encoded text input into a single input sequence ([pg. 2 para. 4] the context generator generates a context feature by fusing the final visual feature of the image processor and the language feature of the language processor); and 
generating, from the single input sequence and using a pre-trained language model with one or more self-attention mask layers in the visual dialogue module neural network language model, a response to the current utterance of the human user ([pg. 2 para. 3] [pg. 4 para. 5] [pg. 5 para. 1] the discriminative decoder of the system is the fused feature information obtained from the encoder; based on the list of answers, choose the most appropriate answer; identifying a list of incoming answers of each candidate answer; the encoder uses an attention mechanism to extract the current question from the input image; the correlation between the visual feature vector and the linguistic feature vector is calculated through the dot product; the calculated dot product value is used as a weight value through a Softmax layer; selecting an appropriate answer to the question from the candidate answer).
The difference between the prior art and the claimed invention is that Li does not explicitly teach utterance by the human user.
Hori teaches utterance by the human user ([0002] human-machine interface that can process spoken dialogs).
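Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Li with the teachings of Hori by modifying the system for visual dialog using deep visual understanding as taught by Li to include utterance by the human user as taught by Hori for the benefit of processing multimodal sensory inputs and generating suitable responses in conversations (Hori [0002]).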

Claim 13,
Li further teaches the system of claim 12, wherein a segment level encoding layer from the plurality of image encoding layers generates the segment level encoding from the image input, wherein the segment level encoding identifies visual information type of the image input ([pg. 2 last para.] [pg. 4 para. 2-3] the visual feature extraction unit extracts visual features from the input image; the encoder extracts the visual feature vectors from the input image; the visual features (Cropped Regions) of each human region obtained through such a person detection stage go through a Person Attribute Recognition stage).

Claim 17,
A non-transitory, machine-readable medium having stored thereon machine-readable instructions executable to cause a system to perform operations comprising: receiving, at a visual dialogue neural network language model, an image input and text input, wherein the text input comprises a dialogue history between the visual dialogue neural network language model and a human user and a current utterance by the human user; generating, from the image input and text input, using a plurality of visual encoding layers in the visual dialogue neural network language model, an encoded visual dialogue input, wherein the encoded visual dialogue input includes a position level encoding and a segment type encoding; generating, from the encoded visual dialogue input and using a transformer (Claim 17 contains subject matter similar to claim 1, and thus is rejected under similar rationale)

Claims 3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (KR 102352128) in view of Hori et al. (US 2021/0082398) and further in view of Reisswig et al. (US 2020/0258498).

Claim 3,
Li and Hori teach all the limitations in claim 1. The difference between the prior art and the claimed invention is that Li and Hori do not explicitly teach masking, at the visual dialogue neural network language model, in a random order, a subset of tokens including special tokens in a text segment of the text input; and replacing the subset of tokens including the special tokens with a mask token using masked language modeling.
Reisswig teaches masking, at the visual dialogue neural network language model, in a random order, a subset of tokens including special tokens in a text segment of the text input; and replacing the subset of tokens including the special tokens with a mask token using masked language modeling ([0023-0024] [0043] the language model training service 304 masks one or more strings of the language model training sample to generate a masked language model training sample; masking can include replacing a string with a random string and/or replacing the characters of the string with randomly-selected characters).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Li and Hori with teachings of Reisswig by modifying the system for visual dialog using deep visual understanding as taught by Li to include masking, at the visual dialogue neural network language model, in a random order, a subset of tokens including special tokens in a text segment of the text input; and replacing the subset of tokens including the special tokens with a mask token using masked language modeling as taught by Reisswig for the benefit of training a convolutional autoencoder language model with unlabeled data in an unsupervised manner (Reisswig [0023-0024]).
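
The masking limitation recited in claim 3 (selecting a random subset of tokens, special tokens included, and replacing them with a mask token) can be sketched as follows. This is a minimal sketch under stated assumptions: the `[MASK]` symbol and the 15% default ratio follow the common BERT convention and are not taken from the claims or from Reisswig, and the function name is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol (BERT convention)


def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction targets).
    Special tokens such as "[SEP]" are eligible for masking.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)  # copy so the input is left unchanged
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


if __name__ == "__main__":
    text_segment = ["is", "the", "dog", "brown", "?", "[SEP]", "yes", "[SEP]"]
    masked, labels = mask_tokens(text_segment, mask_ratio=0.25, seed=1)
    print(masked)
    print(labels)
```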

Claim 18,
The non-transitory, machine-readable medium of claim 17, wherein generating the answer prediction comprises: masking, at the visual dialogue neural network language model, in a random order, a subset of tokens including special tokens in a text segment of the text input; and replacing the subset of tokens including the special tokens with a mask token using masked language modeling. (Claim 18 contains subject matter similar to claim 3, and thus is rejected under similar rationale)

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (KR 102352128) in view of Hori et al. (US 2021/0082398) and further in view of Qiu et al. (CN 110209898).

Claim 6,

Qiu teaches generating, by the visual dialogue neural network language model, a prediction indicating a likelihood of whether an appended answer in the text input is correct or not based on the encoded visual dialogue input using a next-sentence prediction (NSP) operation ([pg. 6 para. 3] selecting the error dialog and correct dialog as input dialog of the algorithm (e.g., the algorithm can adopt BERT-next-sentence-Prediction etc.) to obtain the training model; the step of using the training model to judge the matching degree of the question sentence and the answer sentence in the input dialog, and matching the dialog whose matching degree is lower than the threshold value).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Li and Hori with teachings of Qiu by modifying the system for visual dialog using deep visual understanding as taught by Li to include generating, by the visual dialogue neural network language model, a prediction indicating a likelihood of whether an appended answer in the text input is correct or not based on the encoded visual dialogue input using a next-sentence prediction (NSP) operation as taught by Qiu for the benefit of obtaining the matching answer in the correct combination (Qiu [pg. 6 para. 3]).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (KR 102352128) in view of Hori et al. (US 2021/0082398) and further in view of Raiman et al. (US 2018/0300312).

Claim 10,

Raiman teaches combining the plurality of relevance scores for each of the plurality of answer candidates to form a combined set of relevance scores; and normalizing the combined set of relevance scores into a probability distribution to fine tune the plurality of dense annotations ([0060] the answer scores are exponentialized for each candidate span (or potential answer) and a partition function is created by summing all exponentialized answer scores; the partition function is used to globally normalize each exponentialized answer score to get a globally normalized probability for each candidate span).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Li and Hori with the teachings of Raiman by modifying the system for visual dialog using deep visual understanding as taught by Li to include combining the plurality of relevance scores for each of the plurality of answer candidates to form a combined set of relevance scores; and normalizing the combined set of relevance scores into a probability distribution to fine tune the plurality of dense annotations as taught by Raiman for the benefit of improving the performance of GNR models, which is of independent interest for a variety of natural language processing (NLP) tasks (Raiman [Abstract]).
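
The global normalization cited from Raiman [0060] above (exponentiating each answer score, summing the exponentiated scores into a partition function, and dividing each exponentiated score by that sum) can be written as follows. The symbols $s_i$, $Z$, $p_i$ and $N$ are introduced here for illustration and do not appear in Raiman or in the claims.

```latex
\[
Z = \sum_{j=1}^{N} \exp(s_j), \qquad
p_i = \frac{\exp(s_i)}{Z}, \qquad
\sum_{i=1}^{N} p_i = 1 .
\]
```

As a worked example with assumed scores $s = (2, 1, 0)$: $Z \approx 7.389 + 2.718 + 1 = 11.107$, giving $p \approx (0.665, 0.245, 0.090)$, which sums to 1.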

Allowable Subject Matter
Claims 4-5, 7-8, 14-16 and 19-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record is considered pertinent to applicant's disclosure.
Das et al. – “Visual Dialog” – We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial contains 1 dialog (10 question-answer pairs) on 140k images from the COCO dataset, with a total of 1.4M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders (Late Fusion, Hierarchical Recurrent Encoder and Memory Network) and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response. We quantify gap between machine and human performance on the Visual Dialog task via human studies.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656