DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 7/27/2022 has been entered.
Response to Amendment
The amendments, filed 6/27/2022, have been entered and made of record. Claims 1, 11, and 16 have been amended. Claims 1-20 are pending.
Response to Arguments
Applicant’s arguments in the Remarks filed on 6/27/2022 have been considered but are moot in view of the new ground(s) of rejection.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Aides in view of Crieri
Claims 1, 2, 5-11, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over Aides et al.(USPubN 2020/0076988; hereinafter Aides) in view of Crieri(USPubN 2017/0177972).
As per claim 1, Aides teaches a neural network system implemented by one or more computers, wherein the neural network system identifies one or more blocks in a media clip that include audio data that is misaligned with corresponding video data(“In deep neural networks, the attention mechanism may focus the processing to a selected part of the input—either in the time domain or the spatial domain. In embodiments, a time frame processed in the audio domain and a time frame processed in the visual domain may be associated using a novel application of the attention mechanism” in Para.[0015], “an audio-visual pair of streams may be classified into a positive pair if they contain a synchronized recording of a speaker, or the pair of streams may be classified into a false pair if the video and audio are not synchronized” in Para.[0016]), wherein the neural network system comprises: 
a convolutional subnetwork that generates, for each block included in a plurality of blocks in the media clip, a corresponding feature map derived from one or more audio features and one or more video features of the block of the media clip(“At 308, audio stack 110 may be input to, and processed by, audio processing network 114 and video stack 112 may be input to, and processed by, video processing network 116. For example, audio processing network 114 may be fed with an audio stack 110 including a representation of the audio. For example, audio stack 110 may include mel-frequency cepstral coefficients (MFCCs) representing 20 time frames from an audio stream at 100 fps. The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound. The MFCCs are coefficients that collectively make up an MFC. Likewise, for example, video processing network 116 may be fed with video stack 112 including 5 frames of 120×120 pixels of a video stream at 25 fps.” in Para.[0018], “The audio stack may comprise mel-frequency cepstral coefficients generated from the audio information and the video stack comprises a plurality of frames of video information. The processing may comprise processing the audio information using a machine learning method modeling the context-dependent time shift and processing the video information using a machine learning method modeling the context-dependent time shift” in Para.[0006], [0017], [0019], [0020]); 
an attention module that computes at least one weight value for each feature map generated for the plurality of blocks in the media clip, wherein the at least one weight value computed for each feature map indicates a likelihood that a given block of the media clip corresponding to the feature map includes audio data that is misaligned with corresponding video data(“the representation of the audio signal may be obtained using several consecutive outputs of the audio GRU network. For example, the consecutive outputs may be combined using a weighted function summed to one such that a different weight is given to each one of the consecutive audio frames. The weights may be content based and they may be implemented as a soft max layer leading to an improvement of the GRU architecture with negligible effect on the evaluation time” in Para.[0023], “the Siamese recurrent networks may be improved by introducing an attention mechanism, for which the new output of the network, which may be denoted by α.sub.n, may be given by a weighted sum of several consecutive frames ... The weights ω.sub.l are learned during training, and the use of a softmax layer makes sure that the weights of the embeddings sum to one. This allows training of the architecture with the attention mechanism in an end to end manner. In addition, since the weights are obtained as an output of the softmax layer, they are based on the content of the recording so that the network implicitly learns different types of misalignments” in Para.[0026], “The audio machine learning method may use a gated recurrent units network that uses a plurality of consecutive outputs of the audio gated recurrent units network combined using a weighted function summed to one such that a different weight is given to each one of the consecutive audio frames, the video machine learning method uses a gated recurrent units network that uses a plurality of consecutive outputs of the video gated recurrent units network combined using a weighted function summed to one such that a different weight is given to each one of the consecutive video frames, and the attention mechanism uses a weighted sum of a plurality of audio frames and video frames and weights of the attention mechanism are based on a content and context of the audio information and on a content and context of the video information” in Para.[0006]); and 
an output layer that identifies, based on the at least one weight value computed for each feature map, a first block included in the plurality of blocks that includes first audio data that is misaligned with corresponding first video data(“At 310, output streams 118, 120 may be fed to pair mapping processing 122, which may determine the mapping of audio and video pairs in output streams 118, 120. Pair mapping processing may map audio and video pairs to identify synchronized (true) pairs and unsynchronized (false) pairs. In output streams 118, 120, synchronized (true) pairs may map close to each other, while the false pairs are mapped distantly. In this example, the context-dependent time shift between the audio and visual streams may be explicitly modeled using an attention mechanism to detect synchronized (true) pairs and unsynchronized (false) pairs. In embodiments, using attention for modeling the fine temporal correspondence between audio and visual streams may be utilized, for example, for synchrony detection and for synthetic lip-syncing. For example, given input audio, a synchronized synthetic video may be generated based on the temporal features detected in the audio by generating matching visual features that provide the appropriate temporal correspondence” in Para.[0020], “The pairs of audio and video features may be identified as being true (synchronized) features or false (unsynchronized) features. The method may further comprise generating synthetic video information that is synchronized to the received audio information based on temporal features detected in the audio by generating matching visual features that provide temporal correspondence as synchronized features” in Para.[0006]).
Aides is silent about computing at least one weight value based on the feature map.
Crieri teaches computing at least one weight value based on the feature map(“determining a weight for each feature map” in Abs, “For each of said one or more feature maps, a weight is determined” in Para.[0043]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Aides with the above teachings of Crieri in order to efficiently improve timing synchronization between video and sound.
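The weighting quoted above from Aides Para.[0023] and [0026] (softmax weights that sum to one, applied as a weighted sum over several consecutive frames) can be illustrated with a minimal sketch. The function names, the example embeddings, and the per-frame scores below are hypothetical and are not taken from either reference; the sketch shows only the general softmax-weighted-sum technique.

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frame_embeddings, scores):
    """Combine consecutive frame embeddings using softmax weights.

    frame_embeddings: list of equal-length vectors, one per consecutive frame.
    scores: one hypothetical content-based score per frame (an assumption).
    """
    weights = softmax(scores)
    dim = len(frame_embeddings[0])
    combined = [
        sum(w * emb[i] for w, emb in zip(weights, frame_embeddings))
        for i in range(dim)
    ]
    return weights, combined

# Three consecutive frames with equal scores receive equal weights.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
weights, combined = attend(embeddings, scores)
```

Because the weights come from a softmax, they are non-negative and sum to one regardless of the scores supplied, which is the property the quoted passages rely on.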
As per claim 2, Aides and Crieri teach all of the limitations of claim 1.
Aides teaches wherein the convolutional subnetwork comprises: a plurality of three-dimensional convolutional networks that share a first set of weights, wherein each three-dimensional convolutional network included in the plurality of three-dimensional convolutional networks includes one or more audio feature extraction layers, one or more video feature extraction layers, and one or more joint audio/video extraction layers(“a new representation of the audio information, including information relating to features of the audio information, and a new representation of the video information, including information relating to features of the video information, and mapping features of the audio information and features of the video information using an attention mechanism to identify pairs of audio and video features” in Para.[0007], “the audio machine learning method uses a gated recurrent units network that uses a plurality of consecutive outputs of the audio gated recurrent units network combined using a weighted function summed to one such that a different weight is given to each one of the consecutive audio frames; the video machine learning method uses a gated recurrent units network that uses a plurality of consecutive outputs of the video gated recurrent units network combined using a weighted function summed to one such that a different weight is given to each one of the consecutive video frames; and the attention mechanism uses a weighted sum of a plurality of audio frames and video frames and weights of the attention mechanism are based on a content and context of the audio information and on a content and context of the video information” in Claim 5).
As per claim 5, Aides and Crieri teach all of the limitations of claim 1.
Aides teaches wherein the output layer includes one or more fully connected layers that perform a binary classification based on the at least one weight value computed for each feature map to generate a first classification for the first block, wherein the first classification indicates that the first audio data is misaligned with the corresponding first video data(“The pairs of audio and video features may be identified as being true (synchronized) features or false (unsynchronized) features.” in Para.[0006], “a synchronized (true) stack, in which the audio and the video correspond with one another, and an unsynchronized (false) stack in which the audio and the video do not correspond with one another.” in Para.[0025]).
As per claim 6, Aides and Crieri teach all of the limitations of claim 1.
Aides teaches wherein a first feature map derived from the one or more audio features and the one or more video features includes a first joint audio/visual feature that is derived from a first audio feature and a first video feature(“mapping features of the audio information and features of the video information using an attention mechanism to identify pairs of audio and video features” in Para.[0007]).
As per claim 7, Aides and Crieri teach all of the limitations of claim 6.
Aides teaches wherein the first audio feature corresponds to a first sound that is played during playback of the media clip, the first video feature corresponds to a first event that is depicted during playback of the media clip, and the first sound is played back in conjunction with the first event in the absence of misalignment between audio data and corresponding video data(“true audio-visual pairs are shown at 204” in Para.[0021], Fig. 2, [0020]).
As per claim 8, Aides and Crieri teach all of the limitations of claim 6.
Aides teaches wherein the first audio feature corresponds to a first sound that is played during playback of the media clip, the first video feature corresponds to a first event that is depicted during playback of the media clip, and the first sound is not played back in conjunction with the first event in the presence of misalignment between audio data and corresponding video data(“false audio-visual pairs are shown at 202” in Para.[0021], Fig. 2, Para.[0020]).
As per claim 9, Aides and Crieri teach all of the limitations of claim 6.
Aides teaches wherein the first video feature corresponds to a first intersection between a first set of pixels and a second set of pixels, and the first audio feature corresponds to a sound associated with the first intersection(“using attention for modeling the fine temporal correspondence between audio and visual streams may be utilized, for example, for synchrony detection and for synthetic lip-syncing. For example, given input audio, a synchronized synthetic video may be generated based on the temporal features detected in the audio by generating matching visual features that provide the appropriate temporal correspondence” in Para.[0020], “Exemplary histograms of the time offsets (distances) between audiovisual representations of synchronized (true) and unsynchronized (false) pairs in output streams 118, 120 are shown in FIG. 2. In this example, false audio-visual pairs are shown at 202 and true audio-visual pairs are shown at 204” in Para.[0021], Fig. 2).
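The mapping behavior quoted from Aides Para.[0020] and [0021], in which synchronized (true) pairs map close to each other while false pairs map distantly, can be pictured as a simple distance test. The sketch below is illustrative only: the Euclidean metric, the threshold value, and the example embeddings are assumptions and are not disclosed in the reference.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two equal-length embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_pair(audio_embedding, video_embedding, threshold=0.5):
    """Return True (synchronized) when the two embeddings lie within
    the hypothetical distance threshold, and False (unsynchronized) otherwise."""
    return euclidean_distance(audio_embedding, video_embedding) <= threshold

# Hypothetical embeddings: one nearby pair and one distant pair.
close = classify_pair([0.1, 0.2], [0.1, 0.3])
far = classify_pair([0.1, 0.2], [0.9, 0.9])
```

Under these assumed values the nearby pair is labeled synchronized and the distant pair unsynchronized, mirroring the separation between the two histograms described for FIG. 2.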
As per claim 10, Aides and Crieri teach all of the limitations of claim 1.
Aides teaches wherein at least one of the convolutional subnetwork, the attention module, or the output layer is trained based on a first set of media clips that do not include misalignment between audio data and corresponding video data and a second set of media clips that include misalignment between audio data and corresponding video data(“audio processing network 114 and video processing network 116 may use two similar networks, which may be trained using the Siamese networks procedure. For training, the networks, for example, may be fed with the audio and the video signals, which are collected into stacks of, for example, 20 and 5 frames, respectively, such that each stack represents a sequence of ~200 ms. Two types of stacks may be considered: a synchronized (true) stack, in which the audio and the video correspond with one another, and an unsynchronized (false) stack in which the audio and the video do not correspond with one another” in Para.[0025]).
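The stack sizes quoted from Aides Para.[0018] and [0025] can be checked with simple arithmetic: 20 audio frames at 100 fps and 5 video frames at 25 fps each span 200 ms. The sketch below also shows one hypothetical way to label synchronized (true) and unsynchronized (false) training pairs by shifting the video stacks relative to the audio stacks; the helper names and the shifting scheme are assumptions for illustration and are not taken from the reference.

```python
def stack_duration_ms(num_frames, frames_per_second):
    # Duration covered by a stack of frames, in milliseconds.
    return 1000.0 * num_frames / frames_per_second

audio_ms = stack_duration_ms(20, 100)  # 20 MFCC frames at 100 fps
video_ms = stack_duration_ms(5, 25)    # 5 video frames at 25 fps

def make_pairs(audio_stacks, video_stacks, shift):
    """Build labeled pairs: label 1 for index-aligned (true) pairs and
    label 0 for pairs whose video stack is offset by `shift` positions."""
    n = len(audio_stacks)
    if n != len(video_stacks):
        raise ValueError("audio and video stack lists must have equal length")
    if shift % n == 0:
        raise ValueError("shift must move the video stacks off alignment")
    true_pairs = [(audio_stacks[i], video_stacks[i], 1) for i in range(n)]
    false_pairs = [
        (audio_stacks[i], video_stacks[(i + shift) % n], 0) for i in range(n)
    ]
    return true_pairs + false_pairs

# Placeholder stack identifiers stand in for real feature arrays.
pairs = make_pairs(["a0", "a1", "a2"], ["v0", "v1", "v2"], shift=1)
```

The two durations agree, which is why an audio stack and a video stack of these sizes can be compared as covering the same time window.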
As per claim 11, the limitations of claim 11 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 14, the limitations of claim 14 have been discussed in the rejection of claim 6 and are rejected under the same rationale.
As per claim 15, the limitations of claim 15 have been discussed in the rejection of claim 7 and are rejected under the same rationale.
As per claim 16, Aides teaches a non-transitory computer-readable medium storing program instructions that, when executed by a processor(“a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor” in Para.[0007]), and the other limitations of claim 16 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 17, the limitations of claim 17 have been discussed in the rejection of claim 7 and are rejected under the same rationale.
As per claim 18, the limitations of claim 18 have been discussed in the rejection of claim 8 and are rejected under the same rationale.
As per claim 19, the limitations of claim 19 have been discussed in the rejection of claim 9 and are rejected under the same rationale.
As per claim 20, the limitations of claim 20 have been discussed in the rejection of claim 10 and are rejected under the same rationale.
Allowable Subject Matter
Claims 3, 4, 12, and 13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SUNGHYOUN PARK whose telephone number is (571)270-1333. The examiner can normally be reached M - Thur 6:00 am - 4 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI Q TRAN, can be reached at (571)272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SUNGHYOUN PARK/Examiner, Art Unit 2484