DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed 09/28/2022 has been entered. Claims 1-20 remain pending in the application.

Response to Arguments
Applicant’s arguments with respect to claims 1, 8, and 15 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-14 are rejected under 35 U.S.C. 103 as being unpatentable over Tian et al., "Audio-Visual Event Localization in Unconstrained Videos" (2018, previously cited in IDS), hereinafter referred to as Tian, in view of Ramaswamy et al., “See the Sound, Hear the Pixels” (March 2020), hereinafter referred to as Ramaswamy.

Regarding claim 1, Tian discloses a system (Fig. 3) comprising: 
a hardware processor (Section 7, NVIDIA GPU); 
a memory (Section 6, Experiments, the experiments were done with a computer that uses the NVIDIA GPU, and a computer has memory) coupled with the hardware processor (Section 7, NVIDIA GPU);
the hardware processor (Section 7, NVIDIA GPU) configured to: 
receive a video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”) for audio-visual event localization (Abstract, “audio-visual event localization in unconstrained videos”); 
based on a combination of extracted audio features and video features (Fig. 3, Vt is the video and At is the audio, they are combined and fed into the audio-guided visual attention) of the video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”), determine informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) by running a first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network);
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware video features by running a second neural network (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively);
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware audio features by running a third neural network (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively);
obtain a dual-modality representation based on the relation-aware video features and the relation-aware audio features by running a fourth neural network (Fig. 3, fusion network uses the outputs of the two LSTMs as an input, Section 4.3, Audio-Visual Feature Fusion, “we introduce a Dual Multimodal Residual Network (DMRN)”, “Given audio and visual features from LSTMs, the DMRN will compute the updated audio and visual features”);
input the dual-modality representation to a classifier to identify an audio-visual event in the video feed (Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”).

Tian discloses that the second and third neural network are LSTMs, but Tian does not explicitly disclose that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed.
	However, Ramaswamy teaches that the second neural network (Fig. 2, the inputs are both the video and the audio, the second neural network is the Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware video features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from a video feature (Fig. 2, input is a video for the SWAB on the pink box) and key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention) and the third neural network (Fig. 2, the inputs are both the video and the audio, the third neural network is the other Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware audio features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from an audio feature (Fig. 2, input from the video for the SWAB on the purple box) and the key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention).
	Tian and Ramaswamy are both considered to be analogous to the claimed invention because they are in the same field of audio-visual event localization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by Tian to incorporate the teachings of Ramaswamy that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that it provides efficient fusion of audio and visual information from unconstrained videos by also providing segment-wise attention, leading to superior performance (Ramaswamy, Section 2, p. 2961).

Regarding claim 2, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the hardware processor (Section 7, NVIDIA GPU) is further configured to run a first convolutional neural network (Fig. 3, CNN) with at least a video portion of the video feed to extract the video features (Fig. 3, the video Vt is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding claim 3, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the hardware processor (Section 7, NVIDIA GPU) is further configured to run a second convolution neural network (Fig. 3, CNN) with at least an audio portion of the video feed to extract the audio features (Fig. 3, the audio At is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding claim 4, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the dual-modality representation (Fig. 3, fusion network uses the outputs of the two LSTMs as an input, Section 4.3, Audio-Visual Feature Fusion, “we introduce a Dual Multimodal Residual Network (DMRN)”, “Given audio and visual features from LSTMs, the DMRN will compute the updated audio and visual features”) is used as a last layer of the classifier in identifying the audio-visual event (Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”, the fully connected layer is the last layer).

Regarding claim 5, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the classifier identifying the audio-visual event in the video feed (Section 4.1, “The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment”) includes identifying a location in the video feed where the audio-visual event is occurring (Section 6.3, “When the rat appears in the 5th frame but is not making any sound, the attention does not focus on the rat. When the rat sound becomes audible, the attention focuses on the sounding rat. This observation validates that the audio-guided attention mechanism is helpful to distinguish audio-visual unrelated videos, and is not just to capture a saliency map with objects.”, the framework of Tian can identify in which frames the rat appears) and a category of the audio-visual event (Section 4.1, “This joint audiovisual representation is used to output event category for each video segment”).

Regarding claim 6, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the second neural network (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the video features and cross-modality information between the video features and the audio features (Fig. 3, the temporal information in the video features and the output of the audio-guided visual attention model or the cross-modality information are the input for the first LSTM, which is above the other one) in determining the relation-aware video features (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively).

Regarding claim 7, the combination of Tian in view of Ramaswamy discloses the system of claim 1 (Fig. 3), wherein the third neural network (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the audio features and cross-modality information between the video features and the audio features (Fig. 3, the temporal information in the audio features and the output of the audio-guided visual attention model or the cross-modality information are the input for the second LSTM, which is below the other one) in determining the relation-aware audio features (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively).

Regarding claim 8, Tian discloses a computer-implemented method (Section 4, Methods for Audio-Visual Event Localization) comprising:
receiving a video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”) for audio-visual event localization (Abstract, “audio-visual event localization in unconstrained videos”);
based on a combination of extracted audio features and video features (Fig. 3, Vt is the video and At is the audio, they are combined and fed into the audio-guided visual attention) of the video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”), determine informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) by running a first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network);
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware video features by running a second neural network (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively);
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware audio features by running a third neural network (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively);
obtain a dual-modality representation based on the relation-aware video features and the relation-aware audio features by running a fourth neural network (Fig. 3, fusion network uses the outputs of the two LSTMs as an input, Section 4.3, Audio-Visual Feature Fusion, “we introduce a Dual Multimodal Residual Network (DMRN)”, “Given audio and visual features from LSTMs, the DMRN will compute the updated audio and visual features”);
input the dual-modality representation to a classifier to identify an audio-visual event in the video feed (Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”).

Tian discloses that the second and third neural network are LSTMs, but Tian does not explicitly disclose that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed.
	However, Ramaswamy teaches that the second neural network (Fig. 2, the inputs are both the video and the audio, the second neural network is the Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware video features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from a video feature (Fig. 2, input is a video for the SWAB on the pink box) and key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention) and the third neural network (Fig. 2, the inputs are both the video and the audio, the third neural network is the other Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware audio features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from an audio feature (Fig. 2, input from the video for the SWAB on the purple box) and the key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention).
	Tian and Ramaswamy are both considered to be analogous to the claimed invention because they are in the same field of audio-visual event localization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by Tian to incorporate the teachings of Ramaswamy that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that it provides efficient fusion of audio and visual information from unconstrained videos by also providing segment-wise attention, leading to superior performance (Ramaswamy, Section 2, p. 2961).

Regarding claim 9, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), further comprising running a first convolutional neural network (Fig. 3, CNN) with at least a video portion of the video feed to extract the video features (Fig. 3, the video Vt is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding claim 10, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), further comprising running a second convolution neural network (Fig. 3, CNN) with at least an audio portion of the video feed to extract the audio features (Fig. 3, the audio At is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding claim 11, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), wherein the dual-modality representation (Fig. 3, fusion network uses the outputs of the two LSTMs as an input, Section 4.3, Audio-Visual Feature Fusion, “we introduce a Dual Multimodal Residual Network (DMRN)”, “Given audio and visual features from LSTMs, the DMRN will compute the updated audio and visual features”) is used as a last layer of the classifier in identifying the audio-visual event (Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”, the fully connected layer is the last layer).

Regarding claim 12, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), wherein the classifier identifying the audio-visual event (Section 4.1, “The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment”) includes identifying a location in the video feed where the audio-visual event is occurring (Section 6.3, “When the rat appears in the 5th frame but is not making any sound, the attention does not focus on the rat. When the rat sound becomes audible, the attention focuses on the sounding rat. This observation validates that the audio-guided attention mechanism is helpful to distinguish audio-visual unrelated videos, and is not just to capture a saliency map with objects.”, the framework of Tian can identify in which frames the rat appears) and a category of the audio-visual event (Section 4.1, “This joint audiovisual representation is used to output event category for each video segment”).

Regarding claim 13, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), wherein the second neural network (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the video features and cross-modality information between the video features and the audio features (Fig. 3, the temporal information in the video features and the output of the audio-guided visual attention model or the cross-modality information are the input for the first LSTM, which is above the other one) in determining the relation-aware video features (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM, which is an artificial neural network, to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively).

Regarding claim 14, the combination of Tian in view of Ramaswamy discloses the method of claim 8 (Section 4, Methods for Audio-Visual Event Localization), wherein the third neural network (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the audio features and cross-modality information between the video features and the audio features (Fig. 3, the temporal information in the audio features and the output of the audio-guided visual attention model or the cross-modality information are the input for the second LSTM, which is below the other one) in determining the relation-aware audio features (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively).

Claims 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Tian in view of Marcheret et al. (US 20180025729 A1), hereinafter referred to as Marcheret in further view of Ramaswamy.

Regarding  claim 15, Tian discloses a computer program product (Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4): 
receive a video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”) for audio-visual event localization (Abstract, “audio-visual event localization in unconstrained videos”), 
based on a combination of extracted audio features and video features (Fig. 3, Vt is the video and At is the audio, they are combined and fed into the audio-guided visual attention) of the video feed (Fig. 3, audio and video, Section 3 (Cross-Modality Localization), “given a segment of one modality (auditory/visual)”), determine informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) by running a first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network); 
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware video features by running a second neural network (Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively); 
based on the informative features and regions in the video feed (Section 4.1, “We use an audio-guided visual attention model to generate a context vector”, Section 4.2, Audio-Guided Visual Attention, “It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal.”) determined by the first neural network (Fig. 3, Section 4.2, the audio-guided visual attention model is the first neural network, based on the Specification in para. 0045, “The audio-guided visual attention module 212 can include a neural network (for example, referred to as a first neural network for explanation or illustration)”, so the audio-guided visual attention model of Tian is the same as the disclosed first neural network), determine relation-aware audio features by running a third neural network (Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively); 
input the dual-modality representation to a classifier to identify an audio-visual event in the video feed (Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”).

Tian does not explicitly disclose a computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions readable/executable by a device.
	However, Marcheret discloses a computer program product (para. 0095, “data 718 and 724 are a computer program product”, Marcheret also teaches a classification or prediction of the speech of both the video component and audio component using multiple neural networks as seen in Fig. 3 and Fig. 6) comprising a computer readable storage medium having program instructions (para. 0095, “including a computer-readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions”) embodied therewith, the program instructions readable/executable by a device (para. 0096, “Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device”).
Tian and Marcheret are both considered to be analogous to the claimed invention because they are in the same field of audio-visual classification. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the computer program product as taught by Tian to incorporate the teachings of Marcheret of a computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions readable/executable by a device. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to use the device to perform the particular tasks of the computer program product (Marcheret, para. 0096).

Tian discloses that the second and third neural networks are LSTMs, but Tian does not explicitly disclose that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed.
	However, Ramaswamy teaches that the second neural network (Fig. 2, the inputs are both the video and the audio, the second neural network is the Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware video features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from a video feature (Fig. 2, input is a video for the SWAB on the pink box) and key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention) and the third neural network (Fig. 2, the inputs are both the video and the audio, the third neural network is the other Segment-Wise Attention Block (SWAB) (Section 1, p. 2960), Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”) is configured to implement attention mechanism (Section 3.4, attention mechanism) and learn the relation-aware audio features (Section 1, “We propose a Segment-Wise Attention Block (SWAB) which combines global information of the two modalities with audio-assisted visual features and audio features correspondingly such that it weighs the segments in the video according to the importance of segments in the audio”, Section 3, “Give the outputs of the two LSTMs to their respective Segment-Wise Attention Block (SWAB) to ensure that attention is given not only to the spatial region in each segment, but also to the segments of both the modalities themselves”, both modalities being audio and video) using at least one query derived from an audio feature (Fig. 2, input from the video for the SWAB on the purple box) and the key-value pairs derived from both video and audio features associated with the video feed (Fig. 2, audio-assisted video, Section 1, SWAB uses the audio-assisted visual features coming from the fusion block (AVFB) and the audio features, along with the global information from the respective modalities, to localize sound source in the scene by providing segment-wise attention).
	Tian and Ramaswamy are both considered to be analogous to the claimed invention because they are in the same field of audio-visual event localization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the computer program product as taught by Tian to incorporate the teachings of Ramaswamy that the second neural network is configured to implement attention mechanism and learn the relation-aware video features using at least one query derived from a video feature and key-value pairs derived from both video and audio features associated with the video feed and the third neural network is configured to implement attention mechanism and learn the relation-aware audio features using at least one query derived from an audio feature and the key-value pairs derived from both video and audio features associated with the video feed. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that it provides efficient fusion of audio and visual information from unconstrained videos by also providing segment-wise attention, leading to superior performance (Ramaswamy, Section 2, p. 2961).

Regarding claim 16, the combination of Tian in view of Marcheret in further view of Ramaswamy discloses the computer program product of claim 15 (Tian, Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4, Marcheret, para. 0095, “data 718 and 724 are a computer program product”), wherein the device (Marcheret, para. 0096, “Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device”) is further caused to run a first convolutional neural network (Tian, Fig. 3, CNN) with at least a video portion of the video feed to extract the video features (Tian, Fig. 3, the video Vt is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding claim 17, the combination of Tian in view of Marcheret in further view of Ramaswamy discloses the computer program product of claim 15 (Tian, Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4, Marcheret, para. 0095, “data 718 and 724 are a computer program product”), wherein the device (Marcheret, para. 0096, “Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device”) is further caused to run a second convolutional neural network (Tian, Fig. 3, CNN) with at least an audio portion of the video feed to extract the audio features (Tian, Fig. 3, the audio At is used as an input for one of the CNNs, Section 4.1, “The feature extraction module utilizes pre-trained CNNs to extract visual features and audio features from each Vt and At”).

Regarding  claim 18, the combination of Tian in view of Marcheret in further view of Ramaswamy discloses the computer program product of claim 15 (Tian, Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4, Marcheret, para. 0095, “data 718 and 724 are a computer program product”), wherein the dual-modality representation (Fig. 3, fusion network uses the outputs of the two LSTMs as an input, Section 4.3, Audio-Visual Feature Fusion, “we introduce a Dual Multimodal Residual Network (DMRN)”, “Given audio and visual features from LSTMs, the DMRN will compute the updated audio and visual features”) is used as a last layer of the classifier in identifying the audio-visual event (Tian, Fig. 3, Section 4.1, “To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict probability distribution over C event categories for the input segment and the whole network can be trained with a multi-class cross-entropy loss.”, the fully connected layer is the last layer).

Regarding claim 19, the combination of Tian in view of Marcheret in further view of Ramaswamy discloses the computer program product of claim 15 (Tian, Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4, Marcheret, para. 0095, “data 718 and 724 are a computer program product”), wherein the classifier identifying the audio-visual event in the video feed (Tian, Section 4.1, “The audiovisual representation is learned by a multimodal fusion network with audio and visual hidden state output vectors as inputs. This joint audiovisual representation is used to output event category for each video segment”) includes identifying a location in the video feed where the audio-visual event is occurring (Tian, Section 6.3, “When the rat appears in the 5th frame but is not making any sound, the attention does not focus on the rat. When the rat sound becomes audible, the attention focuses on the sounding rat. This observation validates that the audio-guided attention mechanism is helpful to distinguish audio-visual unrelated videos, and is not just to capture a saliency map with objects.”; the framework of Tian can identify the frames in which the rat appears) and a category of the audio-visual event (Tian, Section 4.1, “This joint audiovisual representation is used to output event category for each video segment”).

Regarding claim 20, the combination of Tian in view of Marcheret in further view of Ramaswamy discloses the computer program product of claim 15 (Tian, Fig. 3, it is the computer framework of the method for Audio-Visual Event Localization explained in Section 4, Marcheret, para. 0095, “data 718 and 724 are a computer program product”), wherein the second neural network (Tian, Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the video features and cross-modality information between the video features and the audio features (Tian, Fig. 3, the temporal information in the video features and the output of the audio-guided visual attention model or the cross-modality information are the input for the first LSTM which is above the other one) in determining the relation-aware video features (Tian, Fig. 3, the audio-visual event localization framework uses a Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware video features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively), and the third neural network (Tian, Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively) takes both temporal information in the audio features and cross-modality information between the video features and the audio features (Tian, Fig. 3, the temporal information in the audio features and the output of the audio-guided visual attention model or the cross-modality information are the input for the second LSTM, which is below the other one) in determining the relation-aware audio features (Tian, Fig. 3, the audio-visual event localization framework uses another Long Short-Term Memory or LSTM which is an artificial neural network to determine relation-aware audio features, Section 4.1, two separate LSTMs take the output of the audio-guided visual attention model as inputs to model temporal dependencies in the two modalities respectively).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENISE G ALFONSO whose telephone number is (571)272-1360. The examiner can normally be reached Monday - Friday 7:30 - 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire Wang, can be reached at 571-270-1051. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/DENISE G ALFONSO/Examiner, Art Unit 2663
/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663