DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The disclosure is objected to because of the following informalities: paragraphs 9 and 149 in the originally filed Specification recite “vide” but should recite “video”. Appropriate correction is required.

Claim Objections
Claims 1 and 10, and therefore claims 2-9 and 11-18 which depend therefrom, are objected to because of the following informalities: claims 1 and 10 both recite “the vide data” but should recite “the video data”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 4 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos et al., (US 2021/0012769 A1, herein “Vasconcelos”).
Regarding claim 1, Vasconcelos teaches an artificial intelligence (AI) device comprising (Vasconcelos figs. 3B and 6, paras. [0110] and [0087], neural speech processing system using neural network layers (thus AI based) where the system is comprised of components within a common device such as a mobile computing device 355 in fig. 3B, which corresponds to client device 350 in Fig. 3B): 
a content output interface configured to output video data contained in content and voice data contained in the content (Vasconcelos paras. [0083], [0088]-[0089], client device including cameras and microphones (content output interface) to provide a common video recording (content) in the client device, from which video and audio data channels are generated and output to be processed into feature tensors); and 
a processor (Vasconcelos paras. [0090]-[0093], transmitter and receiver 418 and 428, which could be a transceiver, where para. [0157] teaches that processors perform steps of the methods described therein) configured to acquire a voice recognition result by providing, to a voice recognition model, content extraction information including at least one of video information acquired from the vide data in the content or tag information of the content and the voice data (Vasconcelos paras. [0089]-[0092], visual feature tensors (video information from the video data) and audio feature tensors (voice data) generated from the common video recording are transmitted to the server device, where the server processes the tensors in a linguistic model LM comprised of an acoustic model and a pronunciation model (voice recognition model) to generate and parse text data resulting from the speech recognition of the audio data, and mapping the text data to a voice command, then execute the voice command to obtain a response to send back to the user (voice recognition result)), and control the content output interface to output the voice recognition result (Vasconcelos paras. [0091]-[0093] and [0083], output of the linguistic model is processed to determine the response to the client device, which in turn is processed by the client device and a response to the user (voice recognition result) is output by the client device through the user interface via the display screen (part of the content output interface)).
While Vasconcelos teaches generally that processors perform steps of the methods described therein, Vasconcelos does not teach that the transmitter, receiver and other functions disclosed as performed in the client device are necessarily performed by a processor. However, it would have been obvious to one of ordinary skill in the art to have used a processor for the transmitter and receiver functions disclosed. See MPEP § 2143(I)(A).
Regarding claim 2, Vasconcelos teaches wherein the voice recognition model includes a recurrent neural network (RNN) (Vasconcelos para. [0103], visual and audio feature extractor comprise a recurrent neural network), and wherein the processor is configured to: set, for the RNN, an initial hidden state corresponding to the content extraction information (Vasconcelos para. [0116], audio and video feature tensors are used to set an initial hidden state of a recurrent neural network).
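For illustration only, the limitation of setting an initial hidden state corresponding to content extraction information may be pictured with the sketch below. The sketch is not taken from Vasconcelos or from the application. The helper project_to_hidden, the vector context, and the matrix W_init are hypothetical names and values assumed here, and the tanh projection is merely one possible way of deriving a state from a context vector.

```python
import math

def project_to_hidden(context, weights):
    """Map a content-derived context vector to an initial hidden state.

    Hypothetical helper: each hidden unit is tanh of a weighted sum of the
    context entries. This is an assumed mapping, not the cited reference's.
    """
    return [math.tanh(sum(w * c for w, c in zip(row, context)))
            for row in weights]

# Assumed 3-dimensional context vector (standing in for information
# derived from video data or tag information of the content).
context = [1.0, 0.0, -1.0]

# Assumed 2x3 projection matrix; a trained system would learn these values.
W_init = [[0.5, 0.1, -0.5],
          [0.0, 0.2, 0.0]]

# The RNN would start from h0 instead of from an all-zero state.
h0 = project_to_hidden(context, W_init)
```

Under these assumptions, two pieces of content with different context vectors would start the same RNN from different hidden states before any voice data is provided.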
Regarding claim 4, Vasconcelos teaches wherein the processor is configured to: set, for an RNN, a first initial hidden state corresponding to video information extracted from a first scene of the content (Vasconcelos para. [0116], video feature tensors are used to set an initial hidden state of a recurrent neural network); and provide first voice data, which is output from the first scene, to the RNN having the first initial hidden state (Vasconcelos paras. [0113]-[0114], [0116], in the language model processing (which uses an RNN and outputs text corresponding to the audio input), the output symbol of the RNN which is set to the initial hidden state by the visual feature tensor is fed back (provided) to the RNN as a second input which has the initial hidden state set), to acquire a voice recognition result corresponding to the first voice data (Vasconcelos para. [0116], the output of the RNN with the fed back input (first voice data), then is the voice recognition result provided by the language model).
Regarding claim 9, Vasconcelos teaches wherein the RNN is: trained by using first training voice data corresponding to first content extraction information and a language labeled on the first training voice data (Vasconcelos para. [0118], training data comprising a data triple of image data, audio data (first training voice data) and ground truth text data (language labeled)), in a state that a first initial hidden state corresponding to the first content extraction information is set (Vasconcelos para. [0118], audio data is supplied to the audio feature extractor and in a forward pass the output text data is generated as described with reference to fig. 6, where para. [0116] describes fig. 6 as setting an initial state of the recurrent neural network to the audio feature tensor output from the audio feature extractor (see para. [0112])); and
trained by using second training voice data corresponding to second content extraction information and a language labeled on the second training voice data (Vasconcelos paras. [0118] and [0119], training performed using batching, thus in batches of training data, and thus considering a second batch of training data comprising a data triple of image data, audio data (second training voice data) and ground truth text data (language labeled)), in a state that a second initial hidden state corresponding to the second content extraction information is set (Vasconcelos para. [0118], audio data is supplied to the audio feature extractor and in a forward pass the output text data is generated as described with reference to fig. 6, where para. [0116] describes fig. 6 as setting an initial state of the recurrent neural network to the audio feature tensor output from the audio feature extractor (see para. [0112])).
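For illustration only, a forward pass over training triples in which each sample sets its own initial hidden state may be sketched as follows. This is an assumed, simplified arrangement and is not taken from Vasconcelos or from the application. The function forward_loss, the parameter tuple params, and all numeric values are hypothetical, and the gradient update that would complete a training step is omitted.

```python
import math

def forward_loss(context, audio_frames, label_ids, params):
    """Mean cross-entropy for one (context, audio, text-label) triple.

    Hypothetical sketch: the initial hidden state is derived from the
    sample's own context vector before any audio frame is consumed.
    """
    W_init, W_h, W_x, W_o = params
    # Initial hidden state set from this sample's context vector.
    h = [math.tanh(sum(w * c for w, c in zip(row, context))) for row in W_init]
    loss = 0.0
    for x, target in zip(audio_frames, label_ids):
        # Recurrent update from the previous hidden state and current frame.
        h = [math.tanh(sum(a * b for a, b in zip(W_h[i], h)) +
                       sum(a * b for a, b in zip(W_x[i], x)))
             for i in range(len(h))]
        # Linear readout followed by a softmax cross-entropy term.
        logits = [sum(a * b for a, b in zip(row, h)) for row in W_o]
        log_norm = math.log(sum(math.exp(v) for v in logits))
        loss += log_norm - logits[target]
    return loss / len(label_ids)

# Assumed parameters: 2 hidden units, 1-dimensional frames, 2 output symbols.
params = ([[1.0], [-1.0]],            # W_init (context -> initial state)
          [[0.5, 0.0], [0.0, 0.5]],   # W_h (state -> state)
          [[1.0], [-1.0]],            # W_x (frame -> state)
          [[1.0, 0.0], [0.0, 1.0]])   # W_o (state -> symbol scores)

# Two assumed training triples with different context vectors.
batch = [([0.5], [[1.0], [-1.0]], [0, 1]),
         ([-0.5], [[1.0], [-1.0]], [0, 1])]

batch_loss = sum(forward_loss(c, a, y, params) for c, a, y in batch) / len(batch)
```

In this sketch the two triples share identical audio frames and labels, so any difference between their losses arises solely from the different initial hidden states.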
Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos, as set forth above regarding claim 1 from which claim 3 depends, further in view of Li et al., (US 2020/0175335 A1, herein “Li”). Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos in view of Veeramani, as set forth below regarding claim 11 from which claim 12 depends, further in view of Li.
Regarding claims 3 and 12, Vasconcelos teaches wherein the voice recognition model: calculates a hidden state and a voice recognition result at a time step "t" by using the initial hidden state and voice data at the time step "t" (Vasconcelos paras. [0110], [0114]-[0116], in the linguistic model which is a recurrent neural network having multiple layers, outputs over a sequence of time steps (thus including a time step t), an initial hidden state is set to an audio feature tensor (voice data) and any further hidden layers between the input and output layers would calculate state based on the previous input layer which is set to the audio feature tensor (voice data) at a first time step sequence, and where the output layer (voice recognition result) outputs over this first step sequence). 
While Vasconcelos teaches using an RNN for the linguistic model (which performs voice recognition), Vasconcelos does not explicitly teach all of the details of the RNN processing. Therefore, Vasconcelos does not explicitly teach calculates a hidden state and a voice recognition result at a time step "t+1" by using the hidden state at the time step "t" and voice data at the time step "t+1".
Li teaches calculates a hidden state and a voice recognition result at a time step "t+1" by using the hidden state at the time step "t" and voice data at the time step "t+1" (Li fig. 5, paras. [0034]-[0035] and [0038], in a multi-layer neural network comprised of time neural network units, where recurrent neural networks are used as the time neural network units, the neural network processing connects hidden states together such that the output of the output layer (for a voice recognition result at a time t+1) is driven by the output of the layer processing blocks (including hidden state outputs for previous time steps – time step t), and where paras. [0024] and [0033] teach that a classifier as shown in fig. 5 is used for speech recognition, and can receive as input utterances (voice data) and output a recognition result).
Therefore, taking the teachings of Vasconcelos and Li together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the output of Vasconcelos with the details of an RNN classifier as disclosed in Li at least because doing so would provide significant increases in classification task (including speech recognition) accuracy (see Li paras. [0017], [0019]).
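For illustration only, the time-step relationship recited in claims 3 and 12 may be pictured with a generic recurrence, in which the hidden state at step "t" is computed from the initial hidden state and the voice data at step "t", and the hidden state at step "t+1" is computed from the hidden state at step "t" and the voice data at step "t+1". The sketch below assumes a simple Elman-style update and is not taken from Li or Vasconcelos. The functions rnn_step, readout, and run_rnn and all numeric values are hypothetical.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One recurrent update: new state from previous state and current input."""
    return [math.tanh(sum(a * b for a, b in zip(W_h[i], h_prev)) +
                      sum(a * b for a, b in zip(W_x[i], x)))
            for i in range(len(h_prev))]

def readout(h, W_o):
    """Index of the highest-scoring output symbol for the given hidden state."""
    scores = [sum(a * b for a, b in zip(row, h)) for row in W_o]
    return max(range(len(scores)), key=scores.__getitem__)

def run_rnn(h0, frames, W_h, W_x, W_o):
    """Return the hidden state and the recognition result at every time step."""
    h = h0
    hidden_states, results = [], []
    for x in frames:
        h = rnn_step(h, x, W_h, W_x)   # the state carries forward to the next step
        hidden_states.append(h)
        results.append(readout(h, W_o))
    return hidden_states, results

# Assumed toy values: 2 hidden units, 1-dimensional frames, 2 output symbols.
h0 = [0.0, 0.0]
frames = [[1.0], [0.5], [-1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
W_o = [[1.0, 0.0], [0.0, 1.0]]

hidden_states, results = run_rnn(h0, frames, W_h, W_x, W_o)
```

The first entry of hidden_states depends only on h0 and the first frame, and each later entry depends only on the preceding hidden state and the current frame.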
Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos, as set forth above regarding claim 2 from which claims 7 and 8 depend, further in view of Li et al., (US 2017/0262995 A1, herein “Li2”). Claims 16 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos in view of Veeramani, as set forth below regarding claim 11 from which claims 16 and 17 depend, further in view of Li2.
Regarding claims 7 and 16, Vasconcelos teaches wherein the video information includes: at least one of (Vasconcelos paras. [0105]-[0107], visual feature extractor classifies the input frames of the video, thus provides a description of some kind, but not specifically of a scene or an object in the frame or text in the frame).
Vasconcelos does not explicitly teach an object in the video data, a text in the video data, or description information of a scene.
Li2 teaches an object in the video data (Li2 paras. [0057]-[0058], RNN classifier predicts a classification label for a frame of video including an object in the frame or sequence of frames).
Therefore, taking the teachings of Vasconcelos and Li2 together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the output of Vasconcelos with the labeling of the RNN classifier as disclosed in Li2 at least because doing so would improve modeling of sequences of temporal data (see Li2 para. [0032]).
Regarding claims 8 and 17, Vasconcelos does not explicitly teach the limitation of claims 8 and 17.
Li2 teaches wherein the tag information includes: at least one of a title of the content, a subject of the content, or description of the content (Li2 paras. [0057]-[0058], RNN classifier predicts a classification label for a frame of video including an object (subject of the content) in the frame or sequence of frames).
Therefore, taking the teachings of Vasconcelos and Li2 together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the output of Vasconcelos with the labeling of the RNN classifier as disclosed in Li2 at least because doing so would improve modeling of sequences of temporal data (see Li2 para. [0032]).
Claims 10, 11, 13 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Vasconcelos, in view of Veeramani et al., (US 2018/0277142 A1, herein “Veeramani”).
Regarding claim 10, Vasconcelos teaches an operating method of an AI device, the operating method comprising (Vasconcelos figs. 4A, 3B and 6, paras. [0110] and [0087], operations of a neural speech processing system using neural network layers (thus AI based)):
acquiring content extraction information including at least one of video information acquired from vide data contained in content or tag information of the content (Vasconcelos paras. [0089]-[0090], visual feature tensors (video information from the video data) and audio feature tensors (voice data) generated from the common video recording are transmitted to the server device);
acquiring a voice recognition result by providing, to a voice recognition model, the content extraction information and voice data contained in the content (Vasconcelos paras. [0089]-[0092], the server processes the tensors in a linguistic model LM comprised of an acoustic model and a pronunciation model (voice recognition model) to generate and parse text data resulting from the speech recognition of the audio data, and mapping the text data to a voice command, then execute the voice command to obtain a response to send back to the user (voice recognition result)); and 
outputting the voice recognition result (Vasconcelos paras. [0091]-[0093] and [0083], a response to the user (voice recognition result) is output by the client device through the user interface via the display screen (part of the content output interface)).
Vasconcelos does not explicitly teach outputting the video data contained in the content and the voice data contained in the content.
Veeramani teaches outputting the video data contained in the content and the voice data contained in the content (Veeramani fig. 4, paras. [0037]-[0038], in an audio/video stream (content) the audio portion is speech recognized, then a text overlay is generated from the recognized speech in the audio portion and is incorporated (output) into the video output, and the audio from which the speech is recognized is also output).
Therefore, taking the teachings of Vasconcelos and Veeramani together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the output of Vasconcelos with the output of video and audio data with voice as disclosed in Veeramani at least because doing so would provide greater effective participation between a user and multi-media applications (see Veeramani paras. [0004], [0038]).
Regarding claim 11, Vasconcelos teaches wherein the voice recognition model includes a recurrent neural network (RNN) (Vasconcelos para. [0103], visual and audio feature extractor comprise a recurrent neural network), and wherein the acquiring of the voice recognition result includes: setting, for the RNN, an initial hidden state corresponding to the content extraction information (Vasconcelos paras. [0113] and [0116], in the linguistic model used for the speech recognition that is an RNN, the audio and video feature tensors are used to set an initial hidden state of a recurrent neural network).
Regarding claim 13, Vasconcelos teaches wherein the acquiring of the voice recognition result includes: setting, for an RNN, a first initial hidden state corresponding to video information extracted from a first scene of the content (Vasconcelos para. [0116], video feature tensors are used to set an initial hidden state of a recurrent neural network); and providing first voice data, which is output from the first scene, to the RNN having the first initial hidden state (Vasconcelos paras. [0113]-[0114], [0116], in the language model processing (which uses an RNN and outputs text corresponding to the audio input), the output symbol of the RNN which is set to the initial hidden state by the visual feature tensor is fed back (provided) to the RNN as a second input which has the initial hidden state set), to acquire a voice recognition result corresponding to the first voice data (Vasconcelos para. [0116], the output of the RNN with the fed back input (first voice data), then is the voice recognition result provided by the language model).
Regarding claim 18, Vasconcelos teaches wherein the RNN is: trained by using first training voice data corresponding to first content extraction information and a language labeled on the first training voice data (Vasconcelos para. [0118], training data comprising a data triple of image data, audio data (first training voice data) and ground truth text data (language labeled)), in a state that a first initial hidden state corresponding to the first content extraction information is set (Vasconcelos para. [0118], audio data is supplied to the audio feature extractor and in a forward pass the output text data is generated as described with reference to fig. 6, where para. [0116] describes fig. 6 as setting an initial state of the recurrent neural network to the audio feature tensor output from the audio feature extractor (see para. [0112])); and
trained by using second training voice data corresponding to second content extraction information and a language labeled on the second training voice data (Vasconcelos paras. [0118] and [0119], training performed using batching, thus in batches of training data, and thus considering a second batch of training data comprising a data triple of image data, audio data (second training voice data) and ground truth text data (language labeled)), in a state that a second initial hidden state corresponding to the second content extraction information is set (Vasconcelos para. [0118], audio data is supplied to the audio feature extractor and in a forward pass the output text data is generated as described with reference to fig. 6, where para. [0116] describes fig. 6 as setting an initial state of the recurrent neural network to the audio feature tensor output from the audio feature extractor (see para. [0112])).


Allowable Subject Matter
Claims 5-6 and 14-15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and if the informalities noted above are overcome by an appropriate amendment. These claims are allowable as they all recite setting a further (second or third) initial hidden state corresponding specifically to tag information of the content and further recite providing voice data to the RNN with the further initial hidden state to acquire a voice recognition result. The closest cited art of record includes Vasconcelos and Li2. Vasconcelos is directed towards processing video including the visual component and audio component, using an RNN to perform speech recognition. However, Vasconcelos does not provide much detail on the specific RNN processes beyond what was cited and referenced in the rejection rationale above for various claims. In particular, Vasconcelos does not disclose setting an initial hidden state corresponding to tag information, and then providing voice data to the RNN with the set initial hidden state corresponding to the tag. Li2 discloses hidden states in an RNN that 


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Jain et al., US 2017/0262996 A1. Jain is directed towards using recurrent neural networks to analyze video. Jain discusses hidden states and layers in a video data classifier, including labels for the video data. Jain does not appear to teach aspects related to audio and video in its RNN processing.
Wang et al., US 2020/0219517 A1. Wang is directed towards receiving an utterance of speech and segmenting the speech according to speakers using a speech model. While Wang uses an RNN for its speech model, it does not consider video or image data for use in the speech processing.
Lee et al., US 2020/0380976 A1. Lee is directed towards an artificial intelligence system that displays an image including at least one object, receives a voice, and uses an AI algorithm to identify an object related to the voice and acquire tag information. Lee does not explicitly teach video data or temporal aspects of processing voice associated with video.



Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908.  The examiner can normally be reached on Monday-Friday, 09:30-18:30 EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


MICHELLE M. KOETH
Primary Examiner
Art Unit 2656



/MICHELLE M KOETH/Primary Examiner, Art Unit 2656