Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-20 are pending. Claims 1, 3 and 4 are independent.  A preliminary amendment filed 5/11/2021 amended the Claims and added new ones.
This Application was published as U.S. 2021/0272587.
Apparent priority: 13 November 2018.
Claim 1: 2, 5-9.
Claim 3: 10-15
Claim 4: 16-20
Claim Suggestions
Claim 7 can benefit from improvement: 
7. The non-verbal sound detection device according to claim 1, 
wherein the non-verbal sound likeliness includes a first group and a second group, 
wherein the first group includes a sound generated through a human oral cavity with no linguistic meaning, and 
wherein the second group includes a sound with a linguistic meaning and one or more ambient sounds.

[0007] Even non-verbal sounds contain sounds which can be transcribed in text (that is, sounds that can mostly be specified as phonemes) in some cases. When a non-verbal sound is considered to be used in sound recognition, such a sound is an important clue. However, in detection of a non-verbal sound in the related art, such sounds have not been used.
[0020] In the embodiment, in the non-verbal sound likeliness, a range is assumed to be equal to or greater than 0.0 and equal to or less than 1.0. For the non-verbal sound likeliness, a sound spoken through a human oral cavity with no linguistic meaning (for example, including a cough, a sneeze, laughter, and an artificial sound) is assumed to provide a value close to 1.0. Conversely, a sound with a linguistic meaning and a non-vocal sound (for example, noise of a vehicle or noise such as the sound of feet climbing stairs) is assumed to provide a value close to 0.0.
[0028] As described in the above-described embodiment, the following advantages can be expected when the bottleneck feature value in the phoneme state output by the acoustic model is used in detection of a non-verbal sound. First, an improvement in detection precision can be expected when phoneme information of a non-verbal sound is used to perform estimation. Second, by using a model ascertaining a relation between previous and subsequent feature values, a non-verbal sound can be easily estimated using a sound which is the same result as the result obtained using text. That is, according to the non-verbal sound detection technology of the embodiment, information of sound content can be used in detection of a non-verbal sound and detection precision of the non-verbal sound is improved.
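For illustration only, the scoring convention of paragraph [0020] can be pictured with the short sketch below. It is not taken from the Specification: the function name `label_frames`, the label strings, and the 0.5 decision threshold are all assumptions chosen for the example, and the Specification does not state any particular threshold.

```python
# Illustrative sketch only. The function name, the label strings and the
# 0.5 threshold are assumptions; none of them appears in the Specification.

def label_frames(likeliness, threshold=0.5):
    """Map per-frame non-verbal sound likeliness values in [0.0, 1.0] to labels."""
    labels = []
    for value in likeliness:
        if not 0.0 <= value <= 1.0:
            raise ValueError("likeliness must lie in [0.0, 1.0]")
        # Values near 1.0: oral-cavity sound with no linguistic meaning (e.g. a cough).
        # Values near 0.0: speech, or a non-vocal sound (e.g. vehicle noise).
        labels.append("non-verbal" if value >= threshold else "other")
    return labels


print(label_frames([0.93, 0.08, 0.51]))
```

Under this assumed convention, speech and ambient noise both fall on the low side of the scale, which is the grouping addressed in the suggestion for Claim 7 above.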
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: extractor, estimator, detector, in Claims 1 and 3-4. These limitations are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory or to a combination of software and hardware.
PLEASE NOTE: This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “processor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 3-6, 8, 11-12, 14, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Lesso (U.S. 20190348064) in view of Divakaran (U.S. 20170160813).
Regarding Claim 1, Lesso teaches:
1. A non-verbal sound detection device comprising: [Lesso, Figure 4, 52: “determine whether the sound comprises a non-verbal sound ….” Figures 1-3, 5 and 7-8 all show variations on the device.  Figures 4 and 6 are process flowcharts.  Figure 2 in particular shows the processor 16 and memory 14 and the microphones 12, 12a, 12b.  See [0019]-[0021] for the device.]
an acoustic model store configured to store an acoustic model that is configured by a deep neural network with a bottleneck structure, [Lesso, Figure 3, “sound analysis 32” includes a “model” which detects non-verbal sounds in a received “user utterance 22.”  Figure 5 shows the “feature extractor 74” which extracts “acoustic features” and includes a “deep neural network (DNN)” which can extract “bottleneck or tandem features”:  “[0038] Thus, the sound analysis block 32 may be provided with models that represent the specific non-verbal sounds that it is concerned with, such as coughs, sneezes, snoring or other audible breathing sounds, passing wind (either by belching or flatulence), or hiccups. The pre-processed signal may then be compared in the sound analysis block 32 with these models, in order to determine whether one of these sounds has been detected.”  “[0064] The features that are extracted by the feature extraction block 74 include features representing the input audio that can be used for performing speaker recognition. For example, the features that are extracted could include Mel Frequency Cepstral Coefficients (MFCCs). Other features that can be extracted from the speech signal include Perceptual Linear Prediction (PLP) features, Linear Prediction (LPC) features, and bottleneck or tandem features extracted with a deep neural network (DNN). Acoustic features such as MFCCs, PLP features and LPC features can be combined with the DNN-extracted features.”  Figure 6 is a flowchart of both generating a model and using it:  “[0074] In step 96 of the process shown in FIG. 6, any obtained health markers are stored, together with data indicating the identified speaker. For example, as shown in FIG. 5, the data may be stored in a memory 80….”]
estimates a phoneme state from an input sound feature value, and outputs the phoneme state; [Lesso teaches developing a speaker recognition model from the “speech” of the user but does not specifically teach speech recognition.]
a non-verbal sound model store configured to store a non-verbal sound model that estimates a posterior probability of a non-verbal sound likeliness from the input sound feature value and a bottleneck feature value and outputs the posterior probability; [Lesso, Figure 3, “sound analysis 32” to “biometric analysis 34” and Figure 5, “speaker recognition 76” to “health markers 78” store the “non-verbal sound model” of the Claim and “feature extract 74” extracts the features that are used for both developing the models and then using them.  “[0064] The features that are extracted by the feature extraction block 74 include features representing the input audio that can be used for performing speaker recognition. For example, the features that are extracted could include Mel Frequency Cepstral Coefficients (MFCCs). Other features that can be extracted from the speech signal include Perceptual Linear Prediction (PLP) features, Linear Prediction (LPC) features, and bottleneck or tandem features extracted with a deep neural network (DNN). Acoustic features such as MFCCs, PLP features and LPC features can be combined with the DNN-extracted features.”  See [0067] for using the extracted feature to develop a model.  “[0067] In one example described below, it is assumed that there is at least one registered user of the system, who has been through an enrolment process. The process of enrolment typically involves the user speaking some predetermined words or phrases, and extracted features of the user's speech then being used to form a model of the user's speech.”]
a sound feature value extractor configured to extract a sound feature value of each frame from an input sound signal; [Lesso, Figure 5, “Feature extract 74.”  “[0060] …The signal may also be divided into frames, for example of 20 ms duration, such that each frame can be considered separately.”]
a bottleneck feature value estimator configured to input the sound feature value of each frame extracted from the sound signal to the acoustic model and obtain an output of a bottleneck layer of the acoustic model as a bottleneck feature value of each frame; and [Lesso, Figure 5, the “feature extraction block 74” extracts the “bottleneck feature values” of input speech as well:  “[0064] … Other features that can be extracted from the speech signal include Perceptual Linear Prediction (PLP) features, Linear Prediction (LPC) features, and bottleneck or tandem features extracted with a deep neural network (DNN)…..”]
a non-verbal sound detector configured to input the sound feature value of each frame extracted from the sound signal and the bottleneck feature value of each frame obtained from the sound feature value to the non-verbal sound model and obtain the posterior probability of the non-verbal sound likeliness of each frame output by the non-verbal sound model. [Lesso, Figure 3, “sound analysis 32” using “biometric analysis 34” as input and Figure 5, “health markers 78” using the extracted features from the “feature extraction unit 74” teach the “non-verbal sound model” of the Claim which finds the posterior probability of an input sound belonging to a category of non-verbal sounds.  As provided in “[0064] … Acoustic features such as MFCCs, PLP features and LPC features can be combined with the DNN-extracted features.”  Where, “[0064] … bottleneck or tandem features extracted with a deep neural network (DNN)….”  See also Figure 6, “[0070] … the process passes to step 94, in which the features that are extracted by the feature extraction block 74 are passed to a health markers block 78, in which health marker features are obtained, based on the features that were extracted by the feature extraction block 74 from the speech represented by the audio signal….”]
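For orientation, the data flow recited in the limitations of Claim 1 (per-frame feature extraction, a bottleneck feature value per frame, and a posterior obtained from both) may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the Applicant, Lesso, or Divakaran: every function name, the toy features (log energy and zero-crossing rate standing in for MFCC-type features), the single tanh layer standing in for an acoustic model's bottleneck layer, the logistic scorer, the weights, and the 16 kHz sampling rate are hypothetical.

```python
import math

# Illustrative sketch only. All names, weights and the assumed 16 kHz sampling
# rate are hypothetical; no cited reference discloses this code.

def split_frames(samples, frame_len):
    """Divide a signal into consecutive non-overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def sound_feature(frame):
    """Toy per-frame feature vector [log energy, zero-crossing rate], standing in for MFCC-type features."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1e-10), crossings / (len(frame) - 1)]

def bottleneck_feature(feature, weights):
    """Stand-in for the bottleneck-layer output of an acoustic model: a single tanh layer."""
    return [math.tanh(sum(w * f for w, f in zip(row, feature))) for row in weights]

def nonverbal_posterior(feature, bottleneck, weights, bias):
    """Logistic score computed over the concatenated [sound feature, bottleneck feature] vector."""
    joint = feature + bottleneck
    z = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))

signal = [math.sin(0.3 * n) for n in range(640)]    # synthetic input, not real audio
BN_W = [[0.1, -0.2], [0.05, 0.3]]                   # hypothetical bottleneck weights
DET_W = [0.2, -0.5, 0.7, 0.1]                       # hypothetical detector weights

posteriors = []
for frame in split_frames(signal, 320):             # 320 samples = 20 ms at the assumed 16 kHz
    f = sound_feature(frame)
    b = bottleneck_feature(f, BN_W)
    posteriors.append(nonverbal_posterior(f, b, DET_W, 0.0))

print(posteriors)
```

The point of the sketch is only that the detector receives two inputs per frame, the sound feature value and the bottleneck feature value derived from it, which is the combination addressed by the cited passage of Lesso at [0064].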

Lesso teaches developing a speaker recognition model from the “speech” of the user but does not specifically teach speech recognition which includes derivation of phonemes.
Divakaran teaches:
an acoustic model store configured to store an acoustic model that is configured by a deep neural network with a bottleneck structure, estimates a phoneme state from an input sound feature value, and outputs the phoneme state; [Divakaran, Figure 18, “Acoustic Model 1816” includes a “bottleneck features 1817” extractor which outputs a “model of current speech 1824” including phonemes.  “[0214] In various implementations, the speech recognizer may include a neural network-based acoustic model 1816. The acoustic model 1816 may include a deep neural network that can be trained for automatic speech recognition using acoustic features derived from input speech samples 1830. Once trained, the deep neural network can be used to associate a input sample 1830 with phonetic content. The deep neural network can produce bottleneck features 1817. Bottleneck features are generally generated by a multi-layer perceptrons that has been trained to predict context-independent monophone states. Bottleneck features can improve the accuracy of automatic speech recognition systems.”  “[0215] The speech recognizer 1814 in this example combines the bottleneck features 1817 with cepstral features 1818 that are separately derived from an input sample 1830. The combined bottleneck features 1817 and cepstral features 1818 can be used to create a joint speaker and content model of the current speech 1824, which is provided to the back end module 1820. The combination of bottleneck features 1817 and cepstral features 1818 can be used to generate a phonetic model (such as an i-vector), which can be used for both speaker identification and phonetic or text identification.”  “[0221] …Specifically, the analyzer 1828 can compare the phonemic, phonetic, and/or lexical content (e.g. at the phone or tri-phone level) as produced by a specific speaker. In this, the analyzer 1828 does not rely on traditional acoustic features alone. 
For example, the command/speaker recognizer 1822 may use a probabilistic linear discriminant analysis (PLDA) to compare one or more phones or phonemic characteristics of the current phonetic model to one or more similar phones or phonemic characteristics of the stored phonetic model(s).”  See also Figure 16 and “[0179] … The training engine 1607 is also provided the prescribed text, and can associate features extracted from the audio input signal with phones and/or phoneme identified in the text.”  “[0180] A phoneme is the smallest structural unit that distinguishes meaning in a language, while a phone is an instance of a phoneme in actual utterances. …”] [Divakaran is directed to detecting non-verbal sounds as well.  See Figure 4 and “[0076] … In various implementations, the audio understanding 414 component can also extract non-verbal information from audio input, such as onomatopoetic utterances and voice biometrics.  For example, the audio understanding 414 component can identify a particular sound as “laughter” or maybe even “ironic laughter.” … ”    “[0040] Multi-modality describes the practice of communicating using textual, aural, linguistic, spatial, and visual resources, each of which may be called a “mode”. A multi-modal virtual personal assistant can accept audio input, including natural language and non-verbal sounds such as grunts or laughter….”]
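The notion of a bottleneck feature relied upon above, namely the activations of a narrow hidden layer of a network whose output layer predicts phoneme states, can be illustrated with the toy forward pass below. It is a sketch under assumptions and is not Divakaran's acoustic model 1816: the layer sizes, the randomly drawn untrained weights, the tanh activations, and the names `dense`, `softmax`, `make_layer`, and `forward` are all hypothetical.

```python
import math
import random

# Illustrative sketch only. Layer sizes, random (untrained) weights and all
# function names are assumptions and are not drawn from Divakaran.

def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_layer(rng, n_in, n_out):
    """Random weights and zero biases for a layer mapping n_in inputs to n_out outputs."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers, bottleneck_index):
    """Run the network; return (softmax over phoneme states, bottleneck-layer activations)."""
    bottleneck = None
    h = x
    for i, (weights, biases) in enumerate(layers):
        h = dense(h, weights, biases)
        if i < len(layers) - 1:
            h = [math.tanh(v) for v in h]       # hidden layers use tanh
        if i == bottleneck_index:
            bottleneck = h                      # narrow layer tapped as the feature
    return softmax(h), bottleneck

rng = random.Random(0)
sizes = [4, 3, 2, 3, 5]                         # input, hidden, bottleneck (2), hidden, 5 states
layers = [make_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

states, bn = forward([0.2, -0.1, 0.4, 0.0], layers, bottleneck_index=1)
print(len(states), len(bn))
```

In such an arrangement the phoneme-state output and the bottleneck activations come from the same forward pass; the bottleneck vector is the quantity that a downstream model would consume alongside cepstral-type features.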
Lesso and Divakaran both pertain to detection of sounds, including non-verbal sounds (both include laughter as an example). It would have been obvious to modify the system of Lesso, which already includes speaker recognition based on the voice of the speaker, with the system of Divakaran, which expressly includes speech recognition together with speaker recognition, in order to obtain phonemes corresponding to the input sound and to arrive at the system of the instant Claim, because speech recognition and speaker recognition normally go hand in hand and one is performed in the service of the other.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 3 is a method Claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
3. A non-verbal sound detection method comprising: 
storing, by an acoustic model store, an acoustic model that is configured by a deep neural network with a bottleneck structure, estimates a phoneme state from an input sound feature value, and outputs the phoneme state; 
storing, by a non-verbal sound model storage, a non-verbal sound model that estimates a posterior probability of a non-verbal sound likeliness from the input sound feature value and a bottleneck feature value and outputs the posterior probability; 
extracting, by a sound feature value extractor, a sound feature value of each frame from an input sound; 
inputting, by a bottleneck feature value estimator, the sound feature value of each frame extracted from the sound signal to the acoustic model and obtaining an output of a bottleneck layer of the acoustic model as a bottleneck feature value of each frame; and 
inputting, by a non-verbal sound detector, the sound feature value of each frame extracted from the sound signal and the bottleneck feature value of each frame obtained from the sound feature value to the non-verbal sound model and obtaining the posterior probability of the non-verbal sound likeliness of each frame output by the non-verbal sound model.

Claim 4 is a computer-readable recording medium Claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale.
4. A computer-readable non-transitory recording medium storing a computer-executable program instructions that when executed by a processor cause causing a computer system to: 
store, by an acoustic model store, an acoustic model that is configured by a deep neural network with a bottleneck structure, estimates a phoneme state from an input sound feature value, and outputs the phoneme state; 
store, by a non-verbal sound model storage, a non-verbal sound model that estimates a posterior probability of a non-verbal sound likeliness from the input sound feature value and a bottleneck feature value and outputs the posterior probability; 
extract, by a sound feature value extractor, a sound feature value of each frame from an input sound; 
input, by a bottleneck feature value estimator, the sound feature value of each frame extracted from the sound signal to the acoustic model and obtaining an output of a bottleneck layer of the acoustic model as a bottleneck feature value of each frame; and 
input, by a non-verbal sound detector, the sound feature value of each frame extracted from the sound signal and the bottleneck feature value of each frame obtained from the sound feature value to the non-verbal sound model and obtaining the posterior probability of the non-verbal sound likeliness of each frame output by the non-verbal sound model by a non-verbal sound detection unit.

Regarding Claim 5, Lesso teaches:
5. The non-verbal sound detection device according to claim 1, wherein the non-verbal sound model models a non-verbal sound, and wherein the non-verbal sound includes at least one of a cough, a sneeze, a breathing sound during a telephone conversation, or laughter. [Lesso, Figure 3, “sound analysis 32” includes a “model” which detects non-verbal sounds in a received “user utterance 22.”  “[0038] Thus, the sound analysis block 32 may be provided with models that represent the specific non-verbal sounds that it is concerned with, such as coughs, sneezes, snoring or other audible breathing sounds, passing wind (either by belching or flatulence), or hiccups….”]

Regarding Claim 6, Lesso teaches:
6. The non-verbal sound detection device according to claim 1, wherein the sound feature value includes one or more of a Mel filter bank slope (MFS) or Mel frequency cepstral coefficients (MFCC). [Lesso, Figure 5, “feature extract 74”:  “[0064] The features that are extracted by the feature extraction block 74 include features representing the input audio that can be used for performing speaker recognition. For example, the features that are extracted could include Mel Frequency Cepstral Coefficients (MFCCs)….”]

Regarding Claim 8, Lesso teaches:
8. The non-verbal sound detection device according to claim 1, wherein the first group includes one or more of: a cough, a sneeze, laughter, or an artificial sound. [Lesso, Figure 3, “sound analysis 32” includes a “model” which detects non-verbal sounds in a received “user utterance 22.”  “[0038] Thus, the sound analysis block 32 may be provided with models that represent the specific non-verbal sounds that it is concerned with, such as coughs, sneezes, snoring or other audible breathing sounds, passing wind (either by belching or flatulence), or hiccups….”]

Claim 11 is a method claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.
Claim 12 is a method claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
Claim 14 is a method claim with limitations corresponding to the limitations of Claim 8 and is rejected under similar rationale.

Claim 17 is a computer-readable recording medium claim with limitations corresponding to the limitations of Claim 5 and is rejected under similar rationale.
Claim 18 is a computer-readable recording medium claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.

Claims 2, 10, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Lesso and Divakaran and further in view of Wang (U.S. 20200321008).
Regarding Claim 2, Lesso does not teach that its DNN is an RNN or LSTM, and neither does Divakaran.  However, RNN and RNN-LSTM are the types of DNN that are usually used for time series analysis such as speech recognition.
Wang teaches:
2. The non-verbal sound detection device according to claim 1, wherein the non-verbal sound model is configured by a recurrent neural network treating a long-distance context. [Wang, Figure 1, teaches combining both spectral features and Bottleneck feature of a DNN to arrive at the acoustic features that are used for speaker identification.  The DNN is an RNN-LSTM.  “[0006] According to a first aspect, a voiceprint recognition method is provided, including: extracting a first spectral feature from speaker audio; inputting the speaker audio to a memory deep neural network (DNN), and extracting a bottleneck feature from a bottleneck layer of the memory DNN, the memory DNN including at least one temporal recurrent layer and the bottleneck layer,… forming an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature …  performing speaker recognition by using a classification model and based on the identity authentication vector.”  “[0007] In an implementation, the first spectral feature includes a Mel frequency cepstral coefficient (MFCC) feature, and a first order difference feature and a second order difference feature of the MFCC feature.”  “[0008] In a possible design, the at least one temporal recurrent layer includes a hidden layer based on a long-short term memory (LSTM) model, or a hidden layer based on an LSTMP model, where the LSTMP model is an LSTM model with a recurrent projection layer.”]
Lesso/Divakaran and Wang pertain to speaker recognition and use a DNN to perform the processing and it would have been obvious to substitute the RNN-LSTM of Wang which is normally used in speech processing for the general DNN of the combination.  See Wang: “[0040] Recurrent neural network (RNN) is a temporal recurrent neural network that can be used to process sequence data. In the RNN, the current output of a sequence is associated with its previous output….”  Speech is a time-series/sequence type of data; so RNN is the type of DNN suited for speech processing.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
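The property relied upon here, that a recurrent network's output for the current frame depends on earlier frames, can be seen in the toy example below. It is only a sketch: it uses a plain Elman-style recurrence with a single scalar state rather than the LSTM or LSTMP layers described by Wang, and the function name and all weight values are assumptions.

```python
import math

# Illustrative sketch only. A plain Elman-style recurrence with one scalar
# state, NOT the LSTM/LSTMP layers of Wang; all weights are hypothetical.

def run_recurrent(frames, w_in, w_rec, w_out):
    """Return one sigmoid score per frame; the hidden state carries earlier frames forward."""
    h = 0.0
    scores = []
    for x in frames:
        h = math.tanh(w_in * x + w_rec * h)     # current state depends on the previous state
        scores.append(1.0 / (1.0 + math.exp(-w_out * h)))
    return scores

with_event = run_recurrent([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.9, w_out=2.0)
silent = run_recurrent([0.0, 0.0, 0.0], w_in=1.0, w_rec=0.9, w_out=2.0)

# The third frame is identical (0.0) in both sequences, yet the scores differ
# because the first sequence's earlier frame is still reflected in the state.
print(with_event[2] > silent[2])
```

A feed-forward DNN applied to each frame in isolation would score the two third frames identically, which is the distinction drawn in the rationale above.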

Claim 10 is a method claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.
Claim 16 is a computer-readable recording medium claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.

Claims 7, 9, 13, 15, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lesso and Divakaran and further in view of Daniel (U.S. 20180012120).
Regarding Claim 7, Lesso teaches:
7. The non-verbal sound detection device according to claim 1, 
wherein the non-verbal sound likeliness includes a first group and a second group, [Lesso teaches detecting several different types of sounds from the speaking user which can be used for both speaker recognition (Figure 5, 76) and for determining speaker health (Figure 5, 78).  Thus, at least two groupings are identified in the sounds detected from the speaker.  “[0004] According to a first aspect of the invention, there is provided a method of obtaining information relevant to a user's health, the method comprising: detecting a sound; determining whether the detected sound comprises a non-verbal sound containing health information; if it is determined that the detected sound comprises a non-verbal sound containing health information, determining whether the non-verbal sound was produced by the user; and if it is determined that the non-verbal sound was produced by the user, storing data relating to said non-verbal sound.”]
wherein the first group includes a sound spoken through a human oral cavity with no linguistic meaning, and [Lesso teaches detecting coughing and sneezing and breathing sounds during speech all of which are generated through the same oral cavity as speech but have no meaning.  Figure 5. “Health Markers 78.”]
wherein the second group includes a sound with a linguistic meaning and non-vocal sound. [Lesso, “[0031] Embodiments described herein relate to non-speech sounds produced by a user, and to the user's speech.”  Sound with a linguistic meaning is speech.  Figure 5, “Speaker recognition 76.”]
Lesso does not teach detecting or classifying the background/ambient/environmental sounds, which is the intent of this Claim.  (Language of this Claim requires improvement as provided above.)
Neither does Divakaran.
Daniel teaches:
wherein the second group includes a sound with a linguistic meaning and non-vocal sound. [Daniel teaches training a neural network for detecting various time series patterns including speech and different types of noise including car noise.  The neural network is capable of detecting the types of sounds for which it is trained which include speech and non-speech sounds.  “According to a first aspect of the present disclosure, a method for facilitating the detection of one or more time series patterns is conceived, comprising building one or more artificial neural networks, wherein, for at least one time series pattern to be detected, a specific one of said artificial neural networks is built….”  Abstract.  “[0027] FIG. 1 shows an illustrative embodiment of a pattern detection facilitation method 100. The method 100 comprises, at 102, selecting a time series pattern to be detected. For instance, the selected time series pattern may be an audio pattern, in particular user-specific speech, voiced speech (vowels), unvoiced speech (consonants), contextual sound (e.g., a running car) or a sound event (e.g., starting a car). Furthermore, the method 100 comprises, at 104, building an ANN for the selected time series pattern. Then, at 106, it is checked whether more time series patterns should be detected. If so, the method 100 repeats steps 102 and 104 for each further time series pattern to be detected. If there are no more patterns to detect, the method 100 ends.”  “[0030] In one or more embodiments, each time series pattern to be detected represents a class of a pattern detection task. Thus, more specifically, a separate ANN may be evolved for each class of the detection task; the ANN thus effectively constitutes a model of the class. 
… This means that for instance, in an audio context recognition task, class “car” is distinguished from class “office” within the same feature space, in a speaker authentication task, speaker A and speaker B are authenticated within the same feature space….” ]
Lesso/Divakaran and Daniel pertain to classification of sound and speech and use neural networks to perform the classification.  It would have been obvious to combine the method of Daniel, which trains its neural network for different types of sound and then uses the trained network to detect these sounds, with the system of the Lesso/Divakaran combination, which classifies verbal and non-verbal sounds, to provide a more granular classification system.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 9, Lesso does teach that one of its microphones detects ambient sounds:  “[0018] FIG. 1 illustrates a smartphone 10, having a microphone 12 for detecting ambient sounds….”  Lesso is directed to detection of human non-verbal sounds that indicate the health situation of the speaker and, while it teaches collecting ambient/environmental sounds, it does not discuss analyzing those sounds.
Neither does Divakaran.
Daniel teaches:
9. The non-verbal sound detection device according to claim 1, wherein the second group includes one or more of: noise of a vehicle or a sound of feet climbing stairs. [Daniel, “[0053] As mentioned above, the presently disclosed method and system are particularly useful for facilitating the detection of audio patterns. For example, the following use cases of the presently disclosed method and system are envisaged: audio context recognition (e.g., car, office, park), predefined audio pattern recognition (e.g. baby cry, glass breaking, fire alarm), speaker authentication/recognition, voice activity detection (i.e., detection of the presence of speech in a signal), and voicing probability (i.e., vowel/consonant distinction in a speech signal).”]
The rationale for combination is as provided for Claim 7.

Claim 13 is a method claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale.
Claim 15 is a method claim with limitations corresponding to the limitations of Claim 9 and is rejected under similar rationale.

Claim 19 is a computer program product system claim with limitations corresponding to the limitations of Claim 7 and is rejected under similar rationale.
Claim 20 is a computer program product system claim with limitations corresponding to the limitations of Claims 8 and 9 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI, whose telephone number is (571)270-1499. The examiner can normally be reached 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659