DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/27/2021 has been entered.

Priority
This application claims priority to foreign application KR10-2018-0141961, filed on 11/16/2018.

Response to Amendment
Claims 1, 5-7, 10-11, 16-17 and 19 are amended. Claims 2-3 and 12-13 are cancelled. Claims 1, 4-11 and 14-20 are presented for examination.
Response to Arguments
Applicant’s arguments filed on 10/27/2021 have been reviewed. Applicant’s arguments are persuasive in light of the amendments; hence the rejection under 35 U.S.C. 103 as being unpatentable over Han (Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification) and further in view of Sharath (Sound Event Detection In Multichannel Audio Using Spatial and Harmonic Features) is withdrawn. However, upon further consideration, a new ground(s) of rejection is given over Han (Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification) and further in view of Sharath ( Sound 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 4-5, 7, 9-11, 14, 15 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Han (Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification) and further in view of Sharath (Sound Event Detection In Multichannel Audio Using Spatial and Harmonic Features) and further in view of Briand (US Pub: 20180063662).

Regarding claim 1, Han teaches a method of recognizing an audio scene (acoustic scene classification, Abstract), the method comprising:

obtaining a mid channel and a side channel from an input audio signal; recognizing, according to each of the mid channel and the side channel (separating into channels, Under 2.1.1, Fig 2; common to record audios in stereo, it is usual to make it monaural first by averaging signals prior to processing, as in our previous work [10]. However, we decided to use left-right (LR) and mid-side (MS) pairs in this work, because these contain richer spatial information than mono. For instance, if a car passes in front of a microphone, the sound moves from L to R or R to L, while it is just amplitude change in mono. In addition, the MS representation emphasizes the time difference between the sounds reaching each side of the stereo microphone. Use of binaural information have shown superior results in the previous DCASE challenge as in [15] as well. The Mid channel is defined as L + R and the side channel is defined as L − R which is a difference between two channels. For LR and MS, we used 2-conv. model for the analysis, 2.1.1), at least one audio scene from the input audio signal by using a plurality of neural networks trained to recognize an audio scene, wherein the plurality of neural networks are used separately on each of the mid channel and the side channel, to output audio scene recognition results for the mid channel and the side channel, respectively (The former is used for single melspectrogram input such as BS, and the latter was used for paired input such as LR, MS, and HPSS. 2-conv. model is similar to 1-conv. model, but processes two channels individually and concatenated before the last fully-connected layer. For both models, we used the same convolution block as illustrated in Fig. 3. We employed batch normalization (BN) [24] and rectified linear unit (ReLU) which are de facto standard for modern ConvNets, Under 2.2 Network architecture; multiple conv model, Fig 2; recognizing audio scene as described in Fig 4); and

identifying, based on the recognizing of the at least one audio scene, at least one audio scene included in audio content by using a neural network trained to combine the audio scene recognition results for the mid channel and the side channel (Fig 4; As a result, the accuracy of the 2.conv-models was 0.87, and BS with various settings (1-conv. models) was generally not as good as 2-conv. models. By combining the results from all the models, it was possible to improve the mean accuracy to 0.917, and ensemble selection slightly pushed it up to 0.919. Because of page limitations, we could not present all class-specific results. However, BS results showed quite different confusion between classes, depending on median filtering size, which is the main reason for the performance improvement of the ensemble. For instance, although the result of BS (0.5 s, 1) are relatively poor compared to other methods, it showed about 16% higher accuracy than the LR for “bus” scene. The confusion matrix of ensemble selection model result is presented in Fig. 4, and it can be observed that the confusion is relatively focused in the home and office, park, and residential area, Under 4. Cross validation results),

wherein the plurality of neural networks comprises: a neural network trained to recognize the audio scene based on a time-frequency shape of an audio signal (mel spectrogram, Under 2.1), a neural network trained to recognize the audio scene based on a shape of a spectral envelope of the audio signal (Fourier transform gives the spectral envelope, Under 2.1), and a neural network trained to recognize the audio scene based on a feature vector extracted from the audio signal (spectrogram, Under 2.1, Fig 2, Fig 4),

and wherein the recognizing the at least one audio scene further comprises determining to perform a recognition of a first audio scene in the mid channel and a recognition of a second audio scene in the side channel (Fig 2, using mid channel and side channel to recognize scene, Under 2.1 and 2.2) and performing the recognition of the first audio scene in the mid channel and the recognition of the second audio scene in the side channel (recognition of the audio scene in the mid and side channel, Under 2.1.1).
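For illustration only, the mid/side definition quoted from Han in the claim 1 mapping (Mid = L + R, Side = L − R) can be sketched as follows. The function name mid_side and the sample values are hypothetical and are not taken from Han; the sketch shows only the channel arithmetic, not Han's mel-spectrogram preprocessing or network.

```python
from typing import List, Tuple


def mid_side(left: List[float], right: List[float]) -> Tuple[List[float], List[float]]:
    """Return (mid, side), with mid = L + R and side = L - R per sample."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


# Hypothetical stereo samples (illustrative values only).
left = [0.5, 0.25, -0.5]
right = [0.5, -0.25, 0.25]
mid, side = mid_side(left, right)
```

When the two channels carry the same sample value, the side channel is zero at that sample, which is consistent with the quoted statement that the side channel is the difference between the two channels.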
While Han does not explicitly teach a first neural network trained to recognize the audio scene based on a time-frequency shape of an audio signal, a second neural network trained to recognize the audio scene based on a shape of a spectral envelope of the audio signal, and a third neural network trained to recognize the audio scene based on a feature vector extracted from the audio signal, Sharath teaches these limitations (log mel band, harmonic feature and TDOA, Under 2.1, 2.2, 2.3; also Fig 2 and Tables 1-4; as understood, a different RNN can be used for each feature, and RNNs are mentioned throughout the article, e.g., In SED, RNNs can be used to predict probabilities for each class to be active in a given frame at timestep t. The input to the network is a sequence of feature vectors x(t); the network computes hidden activations for each hidden layer, and at the output layer a vector of predictions for each class y(t). A sigmoid activation function is used at the output layer in order to allow several classes to be predicted as active simultaneously. By thresholding the predictions at the output layer it is possible to obtain a binary activity matrix. Under Section 3; and a class can be predicted based on the log mel, TDOA and harmonic features, refer to Fig 2).

Han teaches the base concept of detecting a scene by preprocessing the audio and using the results in ConvNets to detect the scenes in the audio. Han differs from the claimed invention in the concept of using different neural networks for different feature values. Sharath teaches that concept, which results in multiple overlapping audio scenes. Sharath's architecture can be combined with Han's architecture, and the results would have been predictable: obtaining multiple overlapping scenes (Fig 2, Sharath).
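As a hedged illustration of the output-layer step quoted from Sharath (a sigmoid activation followed by thresholding to obtain a binary activity matrix), a minimal sketch is given below. The 0.5 threshold, the function names, and the pre-activation values are assumptions made for the example; the quoted passage does not state a threshold value.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a pre-activation to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_activity(pre_activations: List[List[float]],
                    threshold: float = 0.5) -> List[List[int]]:
    """Rows are time frames, columns are classes.

    Each class is thresholded independently, so several classes can be
    active in the same frame. The 0.5 default is an assumed value.
    """
    return [[1 if sigmoid(v) >= threshold else 0 for v in frame]
            for frame in pre_activations]


# Hypothetical output-layer pre-activations: 2 frames x 3 classes.
pre_activations = [[2.0, -1.0, 0.5],
                   [-3.0, 1.5, 4.0]]
activity = binary_activity(pre_activations)
```

Because each class is thresholded on its own sigmoid output rather than through a softmax, a single frame can mark more than one class as active, which is the overlapping-detection behavior relied upon above.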
Han modified by Sharath does not explicitly teach the concept of obtaining, according to a predetermined criterion, a scene for the mid and side channels, and wherein the recognizing the at least one audio scene further comprises determining to perform a recognition of a predetermined first audio scene in the mid channel and a recognition of a predetermined second audio scene in the side channel, the predetermined second audio scene being different from the predetermined first audio scene.
However, Briand teaches obtaining, according to a predetermined criterion, a scene for the mid and side channels, and wherein the recognizing the at least one audio scene further comprises determining to perform a recognition of a predetermined first audio scene in the mid channel and a recognition of a predetermined second audio scene in the side channel, the predetermined second audio scene being different from the predetermined first audio scene, and performing the recognition of the predetermined first audio scene in the mid channel and the recognition of the predetermined second audio scene in the side channel (one channel carries dialogue and the other carries music, Para 0027; the center channel (mid channel) carries the dialogue of the scene and the left and right channels carry ambient sound (music), Para 0017; types of audio can be determined based on the frequency, Para 0011, 0018).
It would have been obvious, having the teachings of Han and Sharath, to further include the concept of Briand of having different channels for different audio scenes, since a device may provide less sound externalization for dialogue audio on the center channel so that the dialogue is perceived as close to the user, thereby enhancing the listening experience. In other embodiments, music associated with the left surround sound channel, right surround sound channel, and low frequency effects channel may be provided with more sound externalization. Hence it is advantageous for a device to detect the audio content type on each audio content channel of the audio content in six channel sound format and to determine the amount of sound externalization when rendering the audio content in a binaural audio format (Para 0017-0018, 0027, 0029, 0037, Briand).



Regarding claim 4, Han as above in claim 1 teaches wherein the obtaining comprises preprocessing the input audio signal into an input data format of each of the plurality of neural networks trained to recognize the audio scene (input vectors, Page 3, Fig 2).

Regarding claim 5, Han as above in claim 4 teaches wherein the preprocessing comprises processing the input audio signal into the input data format of the first neural network and the input data format of the third neural network by downsampling the input audio signal and converting the downsampled audio signal into a time and frequency-based spectrogram (multiple versions of the spectrogram, dimension reduced, Under Audio processing, Page 1-2).

Regarding claim 7, Han as above in claim 1 teaches wherein the recognizing the at least one audio scene comprises calculating a probability for each of the recognized at least one audio scene according to each of the obtained channels (probabilities, Fig 4, also Fig 2).


Regarding claim 9, Han as above in claim 1 teaches wherein the feature vector comprises at least one of a dominant vector, a mean spectrum power, monophony, or a spectral zero-crossing rate (mean spectrum, power calculations, 2.1. Audio Preprocessing, Page 1-2).

Regarding claim 10, Han as above in claim 1 teaches wherein the identifying the at least one audio scene comprises calculating a probability for each of the at least one audio scene included in the audio content based on the probability of each of the at least one audio scene, calculated for each of the channels that are separated into a mid channel and a side channel (calculated for each channel, Fig 2; channel based on the audio type, Para 0011, 0017-0018, Briand).

Regarding claim 11, arguments analogous to claim 1 are applicable. In addition, Han teaches an electronic device for recognizing an audio scene, the electronic device comprising: a memory storing at least one instruction; and at least one processor configured to execute the at least one instruction to implement: a preprocessing module to perform the method of claim 1 (we demonstrate how we applied convolutional neural network for DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic percussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set, Abstract).
Regarding claim 14, arguments analogous to claim 4, are applicable. 
Regarding claim 15, arguments analogous to claim 5, are applicable. 
Regarding claim 17, arguments analogous to claim 7, are applicable. 
Regarding claim 18, arguments analogous to claim 9, are applicable. 
Regarding claim 19, arguments analogous to claim 10, are applicable. 

Regarding claim 20, Han teaches a non-transitory computer-readable recording medium having recorded thereon a program executable by at least one processor to perform the method of claim 1 (we demonstrate how we applied convolutional neural network for DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic percussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set, Abstract).


Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Han (Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification) and further in view of Sharath (Sound Event Detection In Multichannel Audio Using Spatial and Harmonic Features) and further in view of Briand (US Pub: 20180063662) and further in view of Xue (US Pub: 20150120291).


Regarding claim 8, Han teaches the concept wherein the recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene by applying optimal weights to data preprocessed into an input data format of the second neural network (2.3 Network ensemble, Fig 4); however, Han does not explicitly mention wherein the recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene based on a spectral envelope of a size adjusted by applying a predetermined weight to a spectral envelope preprocessed into an input data format of the second neural network.
 
However, Xue teaches wherein recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene based on a spectral envelope of a size adjusted by applying a predetermined weight to a spectral envelope preprocessed into an input data format of the second neural network (After the feature vector is extracted, this group of 13-dimensional feature vectors, as parameters, is then transmitted to the classification recognition algorithm. A probability neural network structure is adopted (as shown in FIG. 7), wherein, there are d input layer units, n mode layer units and c classification layer units. Each mode layer unit is able to make the inner product of normalized sample connection x and its weight vector, to obtain z = w^t x and then map it to exp[(z-1)/σ^2], Para 0085, Fig 7, wherein features are extracted from Fourier transform, Para 0075).
It would have been obvious, having the teachings of Han, Sharath and Briand, to further include the concept of Xue before the effective filing date in order to recognize the scenes based on the known weights stored in the database (Para 0010, Xue).
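For illustration only, the mode (pattern) layer computation quoted from Xue, an inner product z of the normalized sample and a weight vector mapped to exp[(z-1)/σ^2], can be sketched as below. The function names, the two-dimensional vectors, and the σ value are hypothetical, and normalizing the weight vector as well as the sample is an assumption made so that z stays at or below 1; the quoted passage uses 13-dimensional feature vectors.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def mode_unit_activation(x: List[float], w: List[float], sigma: float) -> float:
    """Compute exp[(z - 1) / sigma^2] with z the inner product of unit vectors.

    Normalizing w in addition to x is an assumption of this sketch.
    """
    xn, wn = normalize(x), normalize(w)
    z = sum(a * b for a, b in zip(xn, wn))
    return math.exp((z - 1.0) / sigma ** 2)


# Hypothetical 2-dimensional sample and weight vectors.
x = [3.0, 4.0]
aligned = mode_unit_activation(x, [3.0, 4.0], sigma=0.5)      # z is about 1
orthogonal = mode_unit_activation(x, [-4.0, 3.0], sigma=0.5)  # z is 0
```

The activation is largest when the sample points in the same direction as the stored weight vector and decays as the two diverge, with σ controlling how quickly it decays.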

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Han (Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification) and further in view of Sharath (Sound Event Detection In Multichannel Audio Using Spatial and Harmonic Features) and further in view of Briand (US Pub: 20180063662) and further in view of Marcheret (US Pub: 20170061966).

Regarding claim 6, Han as above in claim 4 does not explicitly teach wherein the preprocessing comprises processing the input audio signal into the input data format of the second neural network by reducing a dimensionality of the shape of the spectral envelope of the input audio signal to a low dimension.
However, Marcheret teaches wherein the preprocessing comprises processing the input audio signal into the input data format of the second neural network by reducing a dimensionality of the shape of the spectral envelope of the input audio signal to a lower dimension (characterizing the spatial spectral energy projected on scaled and rotated wavelet kernels ψ_λ for at least a portion of the frame of video. This vector of visual scattering features which are in a high dimensional space (6400 dimensions, in one implementation) may then be projected to a lower dimensional space (60 dimensions for example) in such a way to assist in the discrimination of the audio context dependent phonemes (in the example of AV-ASR applications), Para 0061).

Para 0061, Marcheret ) 
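As a hedged illustration of projecting a high-dimensional feature vector to a lower-dimensional space, as in the passage quoted from Marcheret (6400 dimensions to, for example, 60), a minimal sketch follows. The quoted passage does not state how the projection is computed; the linear map, the function name, and the 4-to-2 toy dimensions below are assumptions made for the example.

```python
from typing import List


def project(features: List[float], matrix: List[List[float]]) -> List[float]:
    """Apply a linear projection; matrix has one row per output dimension.

    A linear map is an assumed choice, not a method stated in Marcheret.
    """
    for row in matrix:
        if len(row) != len(features):
            raise ValueError("each matrix row must match the feature length")
    return [sum(m * f for m, f in zip(row, features)) for row in matrix]


# Hypothetical 4-dimensional feature vector reduced to 2 dimensions.
features = [1.0, 2.0, 3.0, 4.0]
matrix = [[0.5, 0.5, 0.0, 0.0],
          [0.0, 0.0, 0.5, 0.5]]
reduced = project(features, matrix)
```

Here each output dimension averages a pair of input dimensions, so the reduced vector has half the length of the input.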

Regarding claim 16, arguments analogous to claim 6, are applicable. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHA MISHRA whose telephone number is (571)272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu can be reached on (571)272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.