DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office action is sent in response to Applicant’s communication received on 11/15/2019 for application number 16685479. The Office hereby acknowledges receipt of the following, placed of record in the file: Specification, Abstract, Oath/Declaration, and claims.

Claims 1-20 are presented for examination.  

Priority
This application claims priority to foreign application KR10-2018-0141961, filed on 11/16/2018.
Information Disclosure Statement
The information disclosure statements submitted on 11/15/2019 and 5/28/2020 were filed before the mailing date of the first Office action. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-5, 7, 9-15 and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Han ("Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification").


Regarding claim 1, Han teaches a method of recognizing an audio scene (acoustic scene classification, Abstract), the method comprising:

separating, according to a predetermined criterion, an input audio signal into channels (separating into channels, Section 2.1.1, Fig. 2: "... common to record audios in stereo, it is usual to make it monaural first by averaging signals prior to processing, as in our previous work [10]. However, we decided to use left-right (LR) and mid-side (MS) pairs in this work, because these contain richer spatial information than mono. For instance, if a car passes in front of a microphone, the sound moves from L to R or R to L, while it is just amplitude change in mono. In addition, the MS representation emphasis the time difference between the sounds reaching each side of the stereo microphone. Use of binaural information have shown superior results in the previous DCASE challenge as in [15] as well. The Mid channel is defined as L + R and the side channel is defined as L − R which is a difference between two channels. For LR and MS, we used 2-conv. model for the analysis");

recognizing, according to each of the separated channels, at least one audio scene from the input audio signal by using a plurality of neural networks trained to recognize an audio scene (multiple conv. models, Fig. 2; recognizing the audio scene as described in Fig. 4); and

determining, based on a result of the recognizing of the at least one audio scene, at least one audio scene included in audio content by using a neural network trained to combine audio scene recognition results for respective channels (Fig. 4 and Section 4, Cross validation results: "As a result, the accuracy of the 2.conv-models was 0.87, and BS with various settings (1-conv. models) was generally not as good as 2-conv. models. By combining the results from all the models, it was possible to improve the mean accuracy to 0.917, and ensemble selection slightly pushed it up to 0.919. Because of page limitations, we could not present all class-specific results. However, BS results showed quite different confusion between classes, depending on median filtering size, which is the main reason for the performance improvement of the ensemble. For instance, although the result of BS (0.5 s, 1) are relatively poor compared to other methods, it showed about 16% higher accuracy than the LR for “bus” scene. The confusion matrix of ensemble selection model result is presented in Fig. 4, and it can be observed that the confusion is relatively focused in the home and office, park, and residential area."),

wherein the plurality of neural networks comprises: a first neural network trained to recognize the audio scene based on a time-frequency shape of an audio signal (mel spectrogram, Section 2.1), a second neural network (Fourier transform gives the spectral envelope, Section 2.1), and a third neural network trained to recognize the audio scene based on a feature vector extracted from the audio signal (spectrogram, Section 2.1, Fig. 2, Fig. 4).
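
For illustration only, the channel separation quoted from Han, Section 2.1.1, can be sketched in Python as follows. The sketch applies the quoted definitions (mid = L + R, side = L − R) to a stereo pair and, for comparison, forms the monaural average. The function names `to_mid_side` and `to_mono` and the sample values are hypothetical and appear neither in Han nor in the application.

```python
from typing import List, Sequence, Tuple


def to_mid_side(
    left: Sequence[float], right: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (mid, side) per the quoted definitions: mid = L + R, side = L - R."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def to_mono(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Monaural downmix obtained by averaging the two channels."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    return [(l + r) / 2 for l, r in zip(left, right)]


# Hypothetical four-sample excerpt of a source moving from left to right.
left = [1.0, 0.75, 0.25, 0.0]
right = [0.0, 0.25, 0.75, 1.0]

mid, side = to_mid_side(left, right)
mono = to_mono(left, right)
```

In this hypothetical excerpt the monaural average is constant over the four samples, while the side channel changes sign as the source moves, which is consistent with the quoted observation that the LR and MS pairs carry spatial information that is lost in mono.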


Regarding claim 2, Han, as applied to claim 1 above, teaches wherein the separating comprises separating the input audio signal into a mid channel and a side channel (left-right and mid-side channels, Section 2.1, page 3).

Regarding claim 3, Han, as applied to claim 1 above, teaches wherein the separating comprises configuring recognition of a predetermined audio scene in each of the separated channels (Fig. 2, Fig. 4, recognition in each channel).

Regarding claim 4, Han, as applied to claim 1 above, teaches wherein the separating comprises preprocessing the input audio signal into an input data format of each of the plurality of neural networks trained to recognize the audio scene (input vectors, page 3, Fig. 2).

Regarding claim 5, Han, as applied to claim 4 above, teaches wherein the preprocessing comprises processing the input audio signal into the input data format of the first neural network and the input data format of the third neural network by downsampling the input audio signal and converting the downsampled audio signal into a time and frequency-based spectrogram (multiple versions of the spectrogram, dimension reduced, Section 2.1, Audio preprocessing, pages 1-2).

Regarding claim 7, Han, as applied to claim 1 above, teaches wherein the recognizing the at least one audio scene comprises calculating a probability for each of the recognized at least one audio scene according to each of the separated channels (probabilities, Fig. 4; see also Fig. 2).


Regarding claim 9, Han, as applied to claim 1 above, teaches wherein the feature vector comprises at least one of a dominant vector, a mean spectrum power, monophony, or a spectral zero-crossing rate (mean spectrum, power calculations, Section 2.1, Audio Preprocessing, pages 1-2).

Regarding claim 10, Han, as applied to claim 1 above, teaches wherein the determining the at least one audio scene comprises calculating a probability for each of the at least one audio scene included in the audio content based on the probability of each of the at least one audio scene, calculated for each of the channels that are separated into a mid channel and a side channel (calculated for each channel, Fig. 2).
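
For illustration only, per-channel scene probabilities of the kind discussed for claims 7 and 10 could be combined as sketched below in Python. The sketch uses a plain weighted average, which is an assumption made for brevity; it is neither the ensemble procedure of Han nor the trained combining network recited in the claims. The function name `combine_channel_probabilities`, the scene labels, and all numeric values are hypothetical.

```python
from typing import Dict, Optional


def combine_channel_probabilities(
    per_channel: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Weighted average of per-channel scene probabilities (illustrative assumption)."""
    if not per_channel:
        raise ValueError("at least one channel is required")
    channels = list(per_channel)
    if weights is None:
        # Default: every channel contributes equally.
        weights = {channel: 1.0 for channel in channels}
    total = sum(weights[channel] for channel in channels)
    if total <= 0:
        raise ValueError("channel weights must sum to a positive value")
    scenes = set(per_channel[channels[0]])
    for channel in channels:
        if set(per_channel[channel]) != scenes:
            raise ValueError("all channels must score the same set of scenes")
    return {
        scene: sum(weights[c] * per_channel[c][scene] for c in channels) / total
        for scene in scenes
    }


# Hypothetical per-channel outputs for two scene labels.
per_channel = {
    "mid": {"bus": 0.5, "park": 0.5},
    "side": {"bus": 0.25, "park": 0.75},
}
combined = combine_channel_probabilities(per_channel)
best_scene = max(combined, key=combined.get)
```

With equal weights, the combined probability for each scene is simply the mean of the mid-channel and side-channel probabilities, and the scene with the largest combined probability is selected.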

Regarding claim 11, arguments analogous to those for claim 1 are applicable. In addition, Han teaches an electronic device for recognizing an audio scene, the electronic device comprising: a memory storing at least one instruction; and at least one processor configured to execute the at least one instruction to implement: a preprocessing module to perform the method of claim 1 (Abstract: "we demonstrate how we applied convolutional neural network for DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic percussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set").
Regarding claim 12, arguments analogous to those for claim 2 are applicable.
Regarding claim 13, arguments analogous to those for claim 3 are applicable.
Regarding claim 14, arguments analogous to those for claim 4 are applicable.
Regarding claim 15, arguments analogous to those for claim 5 are applicable.
Regarding claim 17, arguments analogous to those for claim 7 are applicable.
Regarding claim 18, arguments analogous to those for claim 9 are applicable.
Regarding claim 19, arguments analogous to those for claim 10 are applicable.

Regarding claim 20, Han teaches a non-transitory computer-readable recording medium having recorded thereon a program executable by at least one processor to perform the method of claim 1 (Abstract: "we demonstrate how we applied convolutional neural network for DCASE 2017 task 1, acoustic scene classification. We propose a variety of preprocessing methods that emphasise different acoustic characteristics such as binaural representations, harmonic percussive source separation, and background subtraction. We also present a network structure designed for paired input to make the most of the spatial information contained in the stereo. The experimental results show that the proposed network structures and the preprocessing methods effectively learn acoustic characteristics from the audio recordings, and their ensemble model significantly reduces the error rate further, exhibiting an accuracy of 0.917 for 4-fold cross-validation on the development. The proposed system achieved second place in DCASE 2017 task 1 with an accuracy of 0.804 on the evaluation set").


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
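3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.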

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Han ("Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification") in view of Xue (US Pub. No. 20150120291).


Regarding claim 8, Han teaches the concept wherein the recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene by applying weights to data preprocessed into an input data format of the second neural network (optimal weights, Section 2.3, Network ensemble; Fig. 4). However, Han does not explicitly teach wherein the recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene based on a spectral envelope of a size adjusted by applying a predetermined weight to a spectral envelope preprocessed into an input data format of the second neural network.
 
However, Xue teaches wherein recognizing the at least one audio scene comprises calculating a probability of being the at least one audio scene based on a spectral envelope of a size adjusted by applying a predetermined weight to a spectral envelope preprocessed into an input data format of the second neural network ("After the feature vector is extracted, this group of 13-dimensional feature vectors, as parameters, is then transmitted to the classification recognition algorithm. A probability neural network structure is adopted (as shown in FIG. 7), wherein, there are d input layer units, n mode layer units and c classification layer units. Each mode layer unit is able to make the inner product of normalized sample connection x and its weight vector, to obtain z = w^t x and then map it to exp[(z-1)/σ^2]", Para 0085, Fig. 7, wherein features are extracted from a Fourier transform, Para 0075).
It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Han to further include the concept of Xue in order to recognize the scenes based on the known weights stored in the database (Xue, Para 0010).
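
For illustration only, the mode-layer mapping quoted from Xue, Para 0085 (inner product z of a normalized sample with a weight vector, mapped to exp[(z-1)/σ^2]) can be sketched in Python as follows. Summing the activations per class and selecting the largest sum follows the conventional probabilistic neural network structure and is an assumption here, as are the function names, the two-dimensional vectors (Xue describes 13-dimensional feature vectors), the labels, and the value of sigma.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def mode_activation(x: Sequence[float], w: Sequence[float], sigma: float) -> float:
    """Compute exp((z - 1) / sigma^2), where z is the inner product of unit-length x and w."""
    z = sum(a * b for a, b in zip(normalize(x), normalize(w)))
    return math.exp((z - 1.0) / sigma ** 2)


def classify(
    x: Sequence[float],
    patterns: Sequence[Tuple[Sequence[float], str]],
    sigma: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Sum mode-layer activations per label and return the best label with all scores."""
    scores: Dict[str, float] = {}
    for weight_vector, label in patterns:
        scores[label] = scores.get(label, 0.0) + mode_activation(x, weight_vector, sigma)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical stored patterns and a hypothetical input feature vector.
patterns = [([1.0, 0.0], "speech"), ([0.0, 1.0], "music")]
label, scores = classify([0.9, 0.1], patterns)
```

The activation equals 1 when the input is aligned with a stored weight vector (z = 1) and decays as the two vectors diverge, with sigma controlling how quickly it decays.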

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Han ("Convolutional Neural Networks With Binaural Representations And Background Subtraction for Acoustic Scene Classification") in view of Marcheret (US Pub. No. 20170061966).

Regarding claim 6, Han, as applied to claim 4 above, does not explicitly teach wherein the preprocessing comprises processing the input audio signal into the input data format of the second neural network by reducing a dimensionality of the shape of the spectral envelope of the input audio signal to a low dimension.
However, Marcheret teaches wherein the preprocessing comprises processing the input audio signal into the input data format of the second neural network by reducing a dimensionality of the shape of the spectral envelope of the input audio signal to a low dimension ("characterizing the spatial spectral energy projected on scaled and rotated wavelet kernels ψ_λ for at least a portion of the frame of video. This vector of visual scattering features which are in a high dimensional space (6400 dimensions, in one implementation) may then be projected to a lower dimensional space (60 dimensions for example) in such a way to assist in the discrimination of the audio context dependent phonemes (in the example of AV-ASR applications)", Para 0061).

It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Han to further include the concept of Marcheret in order to assist in the discrimination of the audio context dependent phonemes (in the example of AV-ASR applications) (Marcheret, Para 0061).

Regarding claim 16, arguments analogous to those for claim 6 are applicable.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHA MISHRA whose telephone number is (571)272-5357.  The examiner can normally be reached on M-T 7AM - 5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu, can be reached at (571)272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/RICHA MISHRA/ Primary Examiner, Art Unit 2674