DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on November 1, 2021 is being considered by the examiner.

Drawings
The drawings are objected to because FIGS. 1A and 10 appear to include copyrighted work without permission. FIG. 1A includes the watermark “123RF” and FIG. 10 includes the watermark “CanStock,” which indicates that these graphics are copyrighted by 123RF LLC, of Chicago, Illinois, USA, and CAN STOCK PHOTO INC, of Halifax, Nova Scotia, Canada, respectively. However, applicant has not provided an indication that the drawings are used with permission of the respective companies, nor proper attribution to the copyright holders as a derivative work. Therefore, the drawings are objected to.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 5-10, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sun (U.S. Pat. No. 11,302,329, hereinafter Sun) in view of Sun2 (U.S. Pat. App. Pub. No. 2021/0358497, hereinafter Sun2).

Regarding claim 1, Sun discloses an audio processing system, comprising (The system and method described with reference to “user device 110”; Sun, Col. 3, lines 64-67): an input interface configured to receive an audio signal (“The user device 110 may instead or in addition process the input audio data 211 to determine whether speech is represented therein.”; Sun, Col. 8, lines 54-56); a memory configured to store a neural network (“The event classifier may be a classifier trained to distinguish between different acoustic events and other sounds” where “...trained classifiers include...neural networks”; Sun, Col. 8, lines 26-29) trained to determine different types of attributes of multiple concurrent audio events of different origins, (The neural network is “trained to distinguish between different acoustic events and other sounds” including a plurality of feature encoders and further disclosing that “A first type of feature extraction may be suitable for identifying features for a first acoustic event, while a second type of feature extraction may be suitable for identifying features for a second acoustic event.” Thus, the system discloses identifying features {trained to determine different types of attributes...} for a first acoustic event and for a second acoustic event, the second acoustic event being different from the first acoustic event. Further, the audio events are derived from the same audio file (“...may include one or more microphone(s) 920 that detect audio and create audio data 211”), thus the audio events are concurrently occurring {...of multiple concurrent audio events of different origins}; Sun, Col. 8, lines 25-26; Col. 15, lines 21-24; Col. 14, lines 37-38) wherein the different types of attributes include time-dependent and time-agnostic attributes of speech and non-speech audio events, (The user device 110 includes an orchestrator component 240 “which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components should receive and/or process the audio data 211”, an AED 222, and an ASM component, where the AED 222 “includes a classifier that is trained using the audio data to detect a new class of events corresponding to the acoustic event” thus classifying the audio as speech or non-speech and classifying non-speech based on the event (e.g., classifying a sound as “the sound of a pot of water boiling over onto a stove” or a “doorbell”) {time-agnostic attributes of speech and non-speech audio events}. Further, the “profile storage 270 may further include data that shows an interaction history of a user, including commands and times of receipt of commands” {time-dependent attributes...of speech and non-speech audio events}; Sun, Col. 3, lines 8-14 and lines 44-47; Col. 10, lines 32-37; Col. 13, lines 35-40)…; a processor configured to process the audio signal with the neural network to produce metadata of the audio signal (“upon detection of the event, the user device 110 may cause (144) a notification of an acoustic event to be sent,” where causing a notification based on a detection of the event implicitly discloses that information about the event has been produced {produce metadata of the audio signal} and “Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium... and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure {a processor configured to process the audio signal with the neural network}”; Sun, Col. 5, lines 35-40), the metadata including one or multiple attributes of one or multiple audio events in the audio signal (“The notification-determination component 850 may receive one or more of the event output(s) 326” and can determine “data identifying the event and... a corresponding user preference (as stored in, for example, the profile storage 270) for receiving notifications corresponding to the event.”; Sun, Col. 36, lines 50-56; Col. 36, line 64 - Col. 37, line 2); and an output interface configured to output the metadata of the audio signal (“the user device 110 may send a notification of the detected acoustic event to notification system(s) 121 which may cause a notification to be sent to (and/or cause an action to be performed by) another device, for example device 112 or a different device.”; Sun, Col. 5, lines 40-45). However, Sun fails to expressly recite wherein a model of the neural network shares at least some parameters for determining both types of the attributes.
Sun2 teaches “systems and methods for sharing one or more components and/or data between WW and AED models.” (Sun2, ¶ [0028]). Regarding claim 1, Sun2 teaches wherein a model of the neural network shares at least some parameters for determining both types of the attributes (“the WW model and AED model that perform the processing share the output of a feature-extraction model, which may include one or more neural-network layers.”; Sun2, ¶ [0034]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein a model of the neural network shares at least some parameters for determining both types of the attributes. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).
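For illustration only, the parameter sharing relied upon in the combination above can be sketched as follows. This is a minimal sketch under stated assumptions: the toy linear layers, the weight values, and the names SHARED_WEIGHTS, WW_HEAD, AED_HEAD, shared_features, and detect are hypothetical and do not appear in Sun or Sun2. It shows only that one feature-extraction layer may be evaluated once and its output consumed by both a wakeword (WW) head and an acoustic event detection (AED) head.

```python
# Illustrative sketch only. All names and weight values are hypothetical
# and are not taken from Sun or Sun2.

def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


# Shared feature-extraction layer: these parameters serve both tasks.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

# Task-specific heads that consume the same shared features.
WW_HEAD = [[1.0, -1.0]]                 # one wakeword score
AED_HEAD = [[0.2, 0.4], [-0.3, 0.9]]    # scores for two acoustic-event classes


def shared_features(acoustic_features):
    """Run the shared layer over the acoustic feature vector."""
    return relu(linear(SHARED_WEIGHTS, acoustic_features))


def detect(acoustic_features):
    """Compute the shared features once, then score both tasks from them."""
    features = shared_features(acoustic_features)
    return {
        "wakeword": linear(WW_HEAD, features),
        "acoustic_events": linear(AED_HEAD, features),
    }


result = detect([1.0, 2.0, 3.0])
```

In this sketch the shared layer is evaluated a single time per input, which corresponds to the resource saving that Sun2 attributes to sharing components between the WW and AED models.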

Regarding claim 2, the rejection of claim 1 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the audio signal carries multiple audio events including a speech event and a non-speech event, and wherein the processor determines a speech attribute of the speech event and a non-speech attribute of the non-speech event using the neural network to produce the metadata.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 2, Sun2 teaches wherein the audio signal carries multiple audio events including a speech event and a non-speech event (the audio signal can include multiple audio events, including “the representation of the wakeword {speech event} and/or acoustic event” where acoustic events can include “a baby crying, a glass shattering, or a car honking. {non-speech event}”; Sun2, ¶ [0024]), and wherein the processor determines a speech attribute of the speech event (the system processes the audio signal to “determine that part of the audio—e.g., one or more frames of audio data—include at least part of a representation of the wakeword {a speech attribute of the speech event} and/or acoustic event.”; Sun2, ¶ [0024]) and a non-speech attribute of the non-speech event (“As part of determining that the audio includes the representation of the wakeword and/or acoustic event, the models may determine that part of the audio—e.g., one or more frames of audio data—include at least part of a representation of the...acoustic event. {a non-speech attribute of the non-speech event}” where the “shared AED and wakeword processing component 204 may include a component that processes the audio data 202 to determine acoustic feature data”; Sun2, ¶¶ [0024], [0036]) using the neural network to produce the metadata (“the WW and AED models may receive output from one or more neural-network models that process the acoustic feature vectors first.”; Sun2, ¶ [0029]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the audio signal carries multiple audio events including a speech event and a non-speech event, and wherein the processor determines a speech attribute of the speech event and a non-speech attribute of the non-speech event using the neural network to produce the metadata. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 5, the rejection of claim 1 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the model of the neural network includes an encoder and a decoder, and wherein the parameters shared for determining different types of the attributes include parameters of the encoder.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 5, Sun2 teaches wherein the model of the neural network includes an encoder and a decoder (“the WW model and AED model that perform the processing share the output of a feature-extraction model, which may include one or more neural-network layers {wherein the model of the neural network...}” where, as explained with reference to FIG. 13, “a shared feature-extraction model 1304 processes acoustic feature data 1302 {... includes an encoder...} and sends its output to the above-described combined wakeword-detection and AED model 1306.” {...and a decoder}; Sun2, ¶¶ [0034], [0073]), and wherein the parameters shared for determining different types of the attributes include parameters of the encoder (“a shared feature-extraction model 1304 processes acoustic feature data 1302 {...include parameters of the encoder}” for both the WW detection model and the AED model {and wherein the parameters shared for determining different types of the attributes…}; Sun2, ¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the model of the neural network includes an encoder and a decoder, and wherein the parameters shared for determining different types of the attributes include parameters of the encoder. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 6, the rejection of claim 5 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the parameters shared for determining different types of the attributes include parameters of the decoder.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 6, Sun2 teaches wherein the parameters shared for determining different types of the attributes include parameters of the decoder (“a shared feature-extraction model 1304 processes acoustic feature data 1302 and sends its output to the above-described combined wakeword-detection and AED model 1306,” where “combined” indicates that the parameters are shared, and the parameters of the combined wakeword-detection and AED model are parameters of the decoder for determining different types of the attributes; Sun2, ¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the parameters shared for determining different types of the attributes include parameters of the decoder. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 7, the rejection of claim 5 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the parameters shared for determining different types of the attributes include parameters of the encoder and the decoder.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 7, Sun2 teaches wherein the parameters shared for determining different types of the attributes include parameters of the encoder (“a shared feature-extraction model 1304 processes acoustic feature data 1302 {...include parameters of the encoder}” for both the WW detection model and the AED model {and wherein the parameters shared for determining different types of the attributes…}; Sun2, ¶ [0073]) and the decoder (“...and sends its output to the above-described combined wakeword-detection and AED model 1306,” where “combined” indicates that the parameters are shared, and the parameters of the combined wakeword-detection and AED model are parameters of the decoder for determining different types of the attributes; Sun2, ¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the parameters shared for determining different types of the attributes include parameters of the encoder and the decoder. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 8, the rejection of claim 5 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the processor is configured to process the audio signal with the encoder of the neural network to produce an encoding and process the encoding multiple times with the decoder initialized to different states corresponding to the different types of the attributes to produce different decodings of the attributes of different audio events.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 8, Sun2 teaches wherein the processor is configured to process the audio signal with the encoder of the neural network to produce an encoding (“the WW model and AED model that perform the processing share the output of a feature-extraction model, which may include one or more neural-network layers {wherein the model of the neural network...}” where, as explained with reference to FIG. 13, “a shared feature-extraction model 1304 processes acoustic feature data 1302 {... includes an encoder...} and sends its output to the above-described combined wakeword-detection and AED model 1306.” {...and a decoder}; Sun2, ¶¶ [0034], [0073]) and process the encoding multiple times with the decoder initialized to different states (“the wakeword-detection model 508 may receive, as input, a relatively smaller number of acoustic feature vectors (e.g., 80), while the AED model 808 may receive, as input, a relatively higher number of acoustic feature vectors (e.g., 1000).” where receiving different numbers of feature vectors indicates processing the encoding multiple times and where, to receive a different number of feature vectors, the decoder is initialized to different states; Sun2, ¶ [0060]) corresponding to the different types of the attributes to produce different decodings of the attributes of different audio events (The different states described above correspond to the wakeword detection and the acoustic event detection (AED), respectively, thus corresponding to the different types of the attributes to produce different decodings of the attributes of different audio events (speech recognition and AED, respectively); Sun2, ¶ [0060]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the processor is configured to process the audio signal with the encoder of the neural network to produce an encoding and process the encoding multiple times with the decoder initialized to different states corresponding to the different types of the attributes to produce different decodings of the attributes of different audio events. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).
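For illustration only, the limitation recited in claim 8 (one encoding, decoded multiple times from different initial decoder states) can be sketched as follows. This is a minimal sketch of the claim language as recited, not of any implementation disclosed in Sun2 or in the application: the running-average encoder, the scalar recurrent decoder, the recurrence weight of 0.5, and the names encode, decode, process, and INITIAL_STATES are all hypothetical.

```python
# Illustrative sketch only. The toy encoder, toy decoder, and all names
# are hypothetical and are not taken from Sun2 or from the application.

def encode(audio_frames):
    """Toy encoder: running average of the frames seen so far."""
    encoding, total = [], 0.0
    for count, frame in enumerate(audio_frames, start=1):
        total += frame
        encoding.append(total / count)
    return encoding


# Hypothetical initial decoder states, one per attribute type.
INITIAL_STATES = {"speech": 1.0, "non_speech": -1.0}


def decode(encoding, initial_state):
    """Toy recurrent decoder; only the initial state differs between calls."""
    state = initial_state
    outputs = []
    for value in encoding:
        state = 0.5 * state + value  # same recurrence weight on every call
        outputs.append(state)
    return outputs


def process(audio_frames):
    """Encode once, then decode once per attribute type."""
    encoding = encode(audio_frames)
    return {task: decode(encoding, state)
            for task, state in INITIAL_STATES.items()}


decodings = process([2.0, 4.0])
```

The same decoder function and the same encoding are used for both tasks; the two decodings differ only because the decoder starts from a different state.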

Regarding claim 9, the rejection of claim 1 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the neural network is trained jointly to perform multiple different transcription tasks using the shared parameters for performing each of the transcription tasks.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 9, Sun2 teaches wherein the neural network is trained jointly to perform multiple different transcription tasks using the shared parameters for performing each of the transcription tasks (“a shared feature-extraction model 1304 processes acoustic feature data 1302 and sends its output to the above-described combined wakeword-detection and AED model 1306.” Thus, the neural network is trained to perform both acoustic event detection and wakeword detection {trained jointly to perform multiple different transcription tasks...} using shared parameters as incorporated into the combined model {...using the shared parameters for performing each of the transcription tasks}; Sun2, ¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the neural network is trained jointly to perform multiple different transcription tasks using the shared parameters for performing each of the transcription tasks. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 10, the rejection of claim 9 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the transcription tasks include an automatic speech recognition (ASR) task and an acoustic event detection (AED) task.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 10, Sun2 teaches wherein the transcription tasks include an automatic speech recognition (ASR) task and an acoustic event detection (AED) task (“a shared feature-extraction model 1304 processes acoustic feature data 1302 and sends its output to the above-described combined wakeword-detection {wherein the transcription tasks include an automatic speech recognition (ASR) task...} and AED model 1306 {wherein the transcription tasks include...an acoustic event detection (AED) task}”; Sun2, ¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the transcription tasks include an automatic speech recognition (ASR) task and an acoustic event detection (AED) task. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Regarding claim 19, the rejection of claim 1 is incorporated. Sun discloses all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the audio signal includes multiple audio events associated with multiple audio sources, and wherein the processor is further configured to: determine, using the neural network, at least one attribute of at least one audio event of the multiple audio sources; compare the at least one attribute of the at least one audio event with a predetermined at least one attribute of the at least one audio event; and determine anomaly in the audio source based on a result of the comparison.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 19, Sun2 teaches wherein the audio signal includes multiple audio events associated with multiple audio sources (the AED processes the audio data which can include “one or more representations of one or more acoustic events {multiple audio events associated with multiple audio sources}.”; Sun2, ¶ [0051]), and wherein the processor is further configured to: determine, using the neural network, at least one attribute of at least one audio event of the multiple audio sources (The AED includes “determin[ing]” using a neural network (e.g., “one or more recurrent nodes, such as LSTM nodes”) “one or more probabilities that the audio data 202 includes one or more representations {at least one attribute...} of one or more acoustic events {of at least one audio event of the multiple audio sources}.”; Sun2, ¶ [0051]); compare the at least one attribute of the at least one audio event with a predetermined at least one attribute of the at least one audio event (“AED may be performed by comparing input audio data {compare the at least one attribute of the at least one audio event} to an audio signature corresponding to the acoustic event {with a predetermined at least one attribute of the at least one audio event}”; Sun2, ¶ [0025]); and determine anomaly in the audio source based on a result of the comparison (“and, if there is a sufficient match between the signature and the input audio data, the system may determine that an acoustic event has occurred and take action accordingly.”; Sun2, ¶ [0025]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the audio signal includes multiple audio events associated with multiple audio sources, and wherein the processor is further configured to: determine, using the neural network, at least one attribute of at least one audio event of the multiple audio sources; compare the at least one attribute of the at least one audio event with a predetermined at least one attribute of the at least one audio event; and determine anomaly in the audio source based on a result of the comparison. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).
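For illustration only, the signature comparison cited from Sun2, ¶ [0025] (“a sufficient match between the signature and the input audio data”) can be sketched as follows. This is a minimal sketch under stated assumptions: Sun2 does not specify a similarity measure or a threshold, so the cosine similarity, the value of MATCH_THRESHOLD, the feature vectors, and the names cosine_similarity, matches_signature, and GLASS_BREAK_SIGNATURE are all hypothetical.

```python
# Illustrative sketch only. The similarity measure, the threshold, the
# vectors, and all names are hypothetical and are not taken from Sun2.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


MATCH_THRESHOLD = 0.9  # hypothetical level for a "sufficient match"


def matches_signature(observed, signature, threshold=MATCH_THRESHOLD):
    """Return True when the observed features sufficiently match the signature."""
    return cosine_similarity(observed, signature) >= threshold


# Hypothetical stored signature for one acoustic event.
GLASS_BREAK_SIGNATURE = [0.9, 0.1, 0.4]

is_match = matches_signature([0.88, 0.12, 0.41], GLASS_BREAK_SIGNATURE)
is_mismatch = matches_signature([0.1, 0.9, 0.0], GLASS_BREAK_SIGNATURE)
```

The result of the comparison is a single yes-or-no determination, upon which a system could then “take action accordingly.”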

Regarding claim 20, Sun discloses an audio processing method, comprising (The system and method described with reference to “user device 110”; Sun, Col. 3, lines 64-67): accepting the audio signal via an input interface (“The user device 110 may instead or in addition process the input audio data 211 to determine whether speech is represented therein.”; Sun, Col. 8, lines 54-56); determining, via a neural network (“The event classifier may be a classifier trained to distinguish between different acoustic events and other sounds” where “...trained classifiers include...neural networks”; Sun, Col. 8, lines 26-29) different types of attributes of multiple concurrent audio events of different origins in the audio signal, (The neural network is “trained to distinguish between different acoustic events and other sounds” including a plurality of feature encoders and further disclosing that “A first type of feature extraction may be suitable for identifying features for a first acoustic event, while a second type of feature extraction may be suitable for identifying features for a second acoustic event.” Thus, the system discloses identifying features {trained to determine different types of attributes...} for a first acoustic event and for a second acoustic event, the second acoustic event being different from the first acoustic event. Further, the audio events are derived from the same audio file (“...may include one or more microphone(s) 920 that detect audio and create audio data 211”), thus the audio events are concurrently occurring {...of multiple concurrent audio events of different origins}; Sun, Col. 8, lines 25-26; Col. 15, lines 21-24; Col. 14, lines 37-38) wherein the different types of attributes include time-dependent and time-agnostic attributes of speech and non-speech audio events, (The user device 110 includes an orchestrator component 240 “which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components should receive and/or process the audio data 211”, an AED 222, and an ASM component, where the AED 222 “includes a classifier that is trained using the audio data to detect a new class of events corresponding to the acoustic event” thus classifying the audio as speech or non-speech and classifying non-speech based on the event (e.g., classifying a sound as “the sound of a pot of water boiling over onto a stove” or a “doorbell”) {time-agnostic attributes of speech and non-speech audio events}. Further, the “profile storage 270 may further include data that shows an interaction history of a user, including commands and times of receipt of commands” {time-dependent attributes...of speech and non-speech audio events}; Sun, Col. 3, lines 8-14 and lines 44-47; Col. 10, lines 32-37; Col. 13, lines 35-40)…; processing, via the processor, the audio signal with the neural network to produce metadata of the audio signal (“upon detection of the event, the user device 110 may cause (144) a notification of an acoustic event to be sent,” where causing a notification based on a detection of the event implicitly discloses that information about the event has been produced {produce metadata of the audio signal} and “Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium... and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure {a processor configured to process the audio signal with the neural network}”; Sun, Col. 5, lines 35-40), the metadata including one or multiple attributes of one or multiple audio events in the audio signal (“The notification-determination component 850 may receive one or more of the event output(s) 326” and can determine “data identifying the event and... a corresponding user preference (as stored in, for example, the profile storage 270) for receiving notifications corresponding to the event.”; Sun, Col. 36, lines 50-56; Col. 36, line 64 - Col. 37, line 2); and outputting, via an output interface, the metadata of the audio signal (“the user device 110 may send a notification of the detected acoustic event to notification system(s) 121 which may cause a notification to be sent to (and/or cause an action to be performed by) another device, for example device 112 or a different device.”; Sun, Col. 5, lines 40-45). However, Sun fails to expressly recite wherein a model of the neural network shares at least some parameters for determining both types of the attributes.
The relevance of Sun2 is described above with respect to claim 1. Regarding claim 20, Sun2 teaches wherein a model of the neural network shares at least some parameters for determining both types of the attributes (“the WW model and AED model that perform the processing share the output of a feature-extraction model, which may include one or more neural-network layers.”; Sun2, ¶ [0034]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein a model of the neural network shares at least some parameters for determining both types of the attributes. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]).

Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Sun and Sun2 as applied to claim 1 above, and further in view of Banna (U.S. Pat. App. Pub. No. 2017/0323653, hereinafter Banna).

Regarding claim 3, the rejection of claim 1 is incorporated. Sun and Sun2 disclose all of the elements of the current invention as stated above. Sun further discloses wherein audio signal carries multiple audio events having at least one time-dependent attribute and at least one time-agnostic attribute, (The user device 110 includes an orchestrator component 240 “which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components should receive and/or process the audio data 211”, an AED 222, and an ASM component, where the AED 222 “includes a classifier that is trained using the audio data to detect a new class of events corresponding to the acoustic event” thus classifying the audio as speech or non-speech and classifying non-speech based on the event (e.g., classifying a sound as “the sound of a pot of water boiling over onto a stove” or a “doorbell”) {time-agnostic attributes of speech and non-speech audio events}. Further, the “profile storage 270 may further include data that shows an interaction history of a user, including commands and times of receipt of commands.” {time dependent attributes...of speech and non-speech audio events}.; Sun, ¶¶ Col. 3, lines 8-14 and lines 44-47; Col. 10, lines 32-37; Col. 13, lines 35-40) wherein the time-dependent attribute includes one or combination of a detection of a speech event, and a transcription of speech of the speech event, (“Each speech-processing system 292 may include an ASR component 250 {wherein the time dependent attribute includes...}, which may transcribe the input audio data 211 into text data {transcription of speech of the speech event}” where transcription of a speech event implicitly discloses detection of a speech event.; Sun, ¶¶ Col. 10, lines 56-58) wherein the time-agnostic attribute includes tagging… [the transcription] with a label or with an audio caption describing the audio scene using a natural language sentence (“The NER component 562 may perform semantic tagging {wherein the time-agnostic attribute includes tagging [the transcription]…}, which is the labeling of a word or combination of words {...with a label or with an audio caption describing the audio scene...} according to their type/semantic meaning {...using a natural language sentence}”; Sun, ¶¶ Col. 23, lines 52-54). However, Sun and Sun2 fail to expressly recite wherein the time-agnostic attribute includes tagging an audio signal with a label or with an audio caption describing the audio scene using a natural language sentence.
Banna teaches systems and methods for speech enhancement and audio event detection. (Banna, ¶ [0002]). Regarding claim 3, Banna teaches wherein the time-agnostic attribute includes tagging an audio signal with a label or with an audio caption describing the audio scene using a natural language sentence (the system “is configured to select the audio event data 390, which most likely corresponds to the extracted audio features… [and] provide different audio labels and likelihood scores.”; Banna, ¶¶ [0047]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2 to incorporate the teachings of Banna to include wherein the time-agnostic attribute includes tagging an audio signal with a label or with an audio caption describing the audio scene using a natural language sentence. The “speech enhancement and audio event detection system” of Banna “is configured to provide clean speech and/or text from the audio input, even if background noise is present,” thus improving overall audio quality. (Banna, ¶ [0079]).

Regarding claim 4, the rejection of claim 1 is incorporated. Sun and Sun2 disclose all of the elements of the current invention as stated above. Sun further discloses wherein audio signal carries multiple audio events having at least one time-dependent attribute and at least one time-agnostic attribute, (The user device 110 includes an orchestrator component 240 “which may be used to determine which, if any, of the ASR 250, NLU 260, and/or TTS 280 components should receive and/or process the audio data 211”, an AED 222, and an ASM component, where the AED 222 “includes a classifier that is trained using the audio data to detect a new class of events corresponding to the acoustic event” thus classifying the audio as speech or non-speech and classifying non-speech based on the event (e.g., classifying a sound as “the sound of a pot of water boiling over onto a stove” or a “doorbell”) {time-agnostic attributes of speech and non-speech audio events}. Further, the “profile storage 270 may further include data that shows an interaction history of a user, including commands and times of receipt of commands.” {time dependent attributes...of speech and non-speech audio events}.; Sun, ¶¶ Col. 3, lines 8-14 and lines 44-47; Col. 10, lines 32-37; Col. 13, lines 35-40) wherein the time-dependent attribute includes one or combination of a transcription of speech, and a detection of a temporal position of the multiple audio events (“Each speech-processing system 292 may include an ASR component 250 {wherein the time dependent attribute includes...}, which may transcribe the input audio data 211 into text data {transcription of speech of the speech event}”; Sun, ¶¶ Col. 10, lines 56-58). However, Sun and Sun2 fail to expressly recite wherein the time-agnostic attribute includes tagging the audio signal with one or more of a label or an audio caption describing the audio scene using a natural language sentence.
The relevance of Banna is described above with relation to claim 3. Regarding claim 4, Banna teaches wherein the time-agnostic attribute includes tagging the audio signal with one or more of a label or an audio caption describing the audio scene using a natural language sentence (the system “is configured to select the audio event data 390, which most likely corresponds to the extracted audio features… [and] provide different audio labels and likelihood scores.”; Banna, ¶¶ [0047]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2 to incorporate the teachings of Banna to include wherein the time-agnostic attribute includes tagging the audio signal with one or more of a label or an audio caption describing the audio scene using a natural language sentence. The “speech enhancement and audio event detection system” of Banna “is configured to provide clean speech and/or text from the audio input, even if background noise is present,” thus improving overall audio quality. (Banna, ¶ [0079]).

Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Sun and Sun2 as applied to claim 10 above, and further in view of Non-Patent Literature to Çakır (E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen and T. Virtanen, “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291-1303, June 2017, doi: 10.1109/TASLP.2017.2690575, hereinafter Çakır).

Regarding claim 11, the rejection of claim 10 is incorporated. Sun and Sun2 disclose all of the elements of the current invention as stated above. However, Sun fails to expressly recite wherein the CTC-based model is configured to produce temporal information for one or more of the ASR or the AED transcription task.
The relevance of Sun2 is described above with relation to claim 1. Regarding claim 11, Sun2 teaches wherein the transcription tasks include automatic speech recognition (ASR) and... [an acoustic event detection (AED) task] (“a shared feature-extraction model 1304 processes acoustic feature data 1302 and sends its output to the above-described combined wakeword-detection {wherein the transcription tasks include an automatic speech recognition (ASR) task...} and AED model 1306 {wherein the transcription tasks include...an acoustic event detection (AED) task},”; Sun2, ¶¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the transcription tasks include automatic speech recognition (ASR) and... [an acoustic event detection (AED) task]. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]). However, Sun and Sun2 fail to expressly recite wherein the acoustic event detection (AED) task is audio tagging (AT).
Çakır teaches “multi-label convolutional recurrent neural network for polyphonic, scene-independent sound event detection in real-life recordings.” (Çakır, pg. 1292, para 6). Regarding claim 11, Çakır teaches wherein the acoustic event detection (AED) task is audio tagging (AT) (discloses that “sound event classification, sound event recognition {acoustic event detection}, or sound event tagging {audio tagging}, all refer to labeling an audio recording with the sound event classes present, regardless of the onset/offset times.”; Çakır, ¶¶ pg. 1291, para 2).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2 to incorporate the teachings of Çakır to include wherein the acoustic event detection (AED) task is audio tagging (AT). The proposed neural network of Çakır provides a considerable performance improvement for AED over prior art methods when processing “everyday sound events,” as recognized by Çakır. (Çakır, Abstract).

Regarding claim 12, the rejection of claim 10 is incorporated. Sun and Sun2 disclose all of the elements of the current invention as stated above. However, Sun fails to expressly recite the remaining elements of claim 12.
The relevance of Sun2 is described above with relation to claim 1. Regarding claim 12, Sun2 teaches wherein the transcription tasks include one or more of automatic speech recognition (ASR), [and] acoustic event detection (AED) (“a shared feature-extraction model 1304 processes acoustic feature data 1302 and sends its output to the above-described combined wakeword-detection {wherein the transcription tasks include an automatic speech recognition (ASR) task...} and AED model 1306 {wherein the transcription tasks include...an acoustic event detection (AED) task},”; Sun2, ¶¶ [0073]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun to incorporate the teachings of Sun2 to include wherein the transcription tasks include one or more of automatic speech recognition (ASR), [and] acoustic event detection (AED). “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]). However, Sun and Sun2 fail to expressly recite wherein the transcription tasks include… an audio tagging (AT).
The relevance of Çakır is described above with relation to claim 11. Regarding claim 12, Çakır teaches wherein the transcription tasks include… an audio tagging (AT) (discloses that “sound event classification, sound event recognition {acoustic event detection}, or sound event tagging {audio tagging}, all refer to labeling an audio recording with the sound event classes present, regardless of the onset/offset times,” thus, acoustic event detection (AED) and audio tagging (AT) are disclosed by the AED model 1306.; Çakır, ¶¶ pg. 1291, para 2).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2 to incorporate the teachings of Çakır to include wherein the transcription tasks include… an audio tagging (AT). The proposed neural network of Çakır provides a considerable performance improvement for AED over prior art methods when processing “everyday sound events,” as recognized by Çakır. (Çakır, Abstract).
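For illustration only, the distinction drawn in the quoted passage of Çakır between labeling a recording “regardless of the onset/offset times” and detecting events with timing can be sketched as follows. This is a minimal sketch, assuming per-frame activity flags for each sound class as input; the function names and the input format are hypothetical and are not taken from Çakır.

```python
def detect_events(frame_activity, frame_seconds):
    """Sound event detection: return (label, onset, offset) for each active run."""
    events = []
    for label, frames in frame_activity.items():
        start = None
        # A trailing False acts as a sentinel that closes a run ending at the last frame.
        for i, active in enumerate(list(frames) + [False]):
            if active and start is None:
                start = i
            elif not active and start is not None:
                events.append((label, start * frame_seconds, i * frame_seconds))
                start = None
    return events

def tag_clip(frame_activity):
    """Audio tagging: labels present anywhere in the clip, with no timing."""
    return sorted(label for label, frames in frame_activity.items() if any(frames))
```

Under these assumptions, `tag_clip` discards the temporal information that `detect_events` retains, so tagging yields a time-agnostic attribute while detection yields a time-dependent one.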

Claims 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sun, Sun2, and Çakır as applied to claim 12 above, and further in view of Miao (H. Miao, G. Cheng, C. Gao, P. Zhang and Y. Yan, “Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6084-6088, doi: 10.1109/ICASSP40776.2020.9053165, hereinafter Miao).

Regarding claim 17, the rejection of claim 12 is incorporated. Sun, Sun2, and Çakır disclose all of the elements of the current invention as stated above. However, Sun and Çakır fail to expressly recite wherein the neural network is trained with a multi-objective cost function including a weight factor to control weighting between a transformer objective function and a CTC objective function, for performing jointly the ASR, AED, and AT transcription tasks.
The relevance of Sun2 is described above with relation to claim 1. Regarding claim 17, Sun2 teaches wherein the neural network is trained with a multi-objective cost function including a weight factor (“Other data that may be used to train a model may include training parameters such as error functions, weights, or other data that can be used to guide the training of a model.”; Sun2, ¶¶ [0033]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2, and as modified by the CRNN for sound event detection of Çakır, to further incorporate the teachings of Sun2 to include wherein the neural network is trained with a multi-objective cost function including a weight factor to control weighting between a transformer objective function and a CTC objective function, for performing jointly the ASR, AED, and AT transcription tasks. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]). However, Sun, Sun2, and Çakır fail to expressly recite a multi-objective cost function including a weight factor to control weighting between a transformer objective function and a CTC objective function, for performing jointly the ASR, AED, and AT transcription tasks.
Miao teaches “Transformer-based online CTC/attention E2E ASR architecture.” (Miao, Abstract). Regarding claim 17, Miao teaches a multi-objective cost function including a weight factor to control weighting between a transformer objective function and a CTC objective function, for performing jointly the ASR, AED, and AT transcription tasks (discloses a “Transformer-based online CTC/attention E2E ASR architecture” which is a hybrid architecture including a transformer objective function and a CTC objective function where “During training, we introduce the CTC objective as an auxiliary task, and the loss function” incorporates “Ldec and Lctc [which] are loss functions {multi-objective cost function} from the decoder and CTC {to control weighting between a transformer objective function and a CTC objective function}”; Miao, ¶¶ pg. 6084, para 5, para 7; pg. 6085, para 1).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2, and as modified by the CRNN for sound event detection of Çakır, to incorporate the teachings of Miao to include a multi-objective cost function including a weight factor to control weighting between a transformer objective function and a CTC objective function, for performing jointly the ASR, AED, and AT transcription tasks. The E2E ASR architecture of Miao “achieves significant improvement” over prior art “Long Short-Term Memory (LSTM) based online E2E models,” as recognized by Miao. (Miao, Abstract).
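For illustration only, a multi-objective cost function with a weight factor between a decoder objective and a CTC objective can be sketched as follows. This is a minimal sketch, assuming the convex-combination form commonly used in hybrid CTC/attention training; the exact formula of Miao is not reproduced in this action, and the function and parameter names are hypothetical.

```python
def multi_objective_loss(loss_ctc, loss_dec, weight):
    """Blend the CTC loss and the decoder loss using a single weight factor.

    weight = 1.0 uses only the CTC objective; weight = 0.0 uses only the
    decoder (transformer) objective. The convex form is an assumption.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * loss_ctc + (1.0 - weight) * loss_dec
```

For example, with a CTC loss of 2.0, a decoder loss of 4.0, and a weight of 0.25, this sketch returns 0.25 × 2.0 + 0.75 × 4.0 = 3.5.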

Regarding claim 18, the rejection of claim 17 is incorporated. Sun, Sun2, Çakır, and Miao disclose all of the elements of the current invention as stated above. However, Sun, Çakır, and Miao fail to expressly recite wherein the neural network is trained using a set of ASR samples, a set of AED samples, and a set of AT samples.
The relevance of Sun2 is described above with relation to claim 1. Regarding claim 18, Sun2 teaches wherein the neural network is trained using a set of ASR samples (“The training data may include audio samples of utterances of the wakeword by different speakers and under different conditions...”; Sun2, ¶¶ [0048]), a set of AED samples (“The training data may include audio samples of acoustic events under different conditions.”; Sun2, ¶¶ [0052]), and a set of AT samples (“The training data may further include non-wakeword words and annotation data indicating which words are wakewords and which words are non-wakeword words” as well as “representations of other events and annotation data indicating which events are of interest and which events are not of interest.”; Sun2, ¶¶ [0048], [0052]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the acoustic event detection system of Sun as modified by the shared wakeword and AED models of Sun2, as modified by the CRNN for sound event detection of Çakır, and as modified by the Transformer-based online CTC/attention E2E ASR architecture of Miao, to further incorporate the teachings of Sun2 to include wherein the neural network is trained using a set of ASR samples, a set of AED samples, and a set of AT samples. “[S]haring one or more components and/or data between WW and AED models…reduces the need for system resources and thereby reduces power consumption,” as recognized by Sun2. (Sun2, ¶ [0028]). 

Allowable Subject Matter
Claims 13-16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is an examiner’s statement of reasons for indicating allowable subject matter: 
Regarding claim 13, the closest prior art of record, Sun and Sun2, teaches all elements as described with relation to claims 1, 9, and 10. However, Sun and Sun2 do not specifically teach wherein the model of neural network includes a transformer model and a connectionist temporal classification (CTC) based model, wherein the transformer model includes an encoder configured to encode the audio signal and a decoder configured to execute ASR decoding, AED decoding, and AT decoding to produce a decoder output for the encoded audio signal, and wherein the CTC-based model is configured to execute the ASR decoding and the AED decoding, for the encoded audio signal, to produce a CTC output for the encoded audio signal, and wherein the decoder output and the CTC output of the ASR decoding and the AED decoding are jointly scored to produce a joint decoding output.
Miao does teach wherein the model of neural network includes a transformer model and a connectionist temporal classification (CTC) based model (“we stream the Transformer {a transformer model} and integrate it into the CTC/attention E2E ASR architecture {connectionist temporal classification (CTC) based model}.”; Miao, pg. 6084, para 4).
However, none of the prior art references of record, either alone or in combination, teaches, suggests, or makes obvious the combination of limitations as recited in claim 13, including all intervening claims.
More specifically, the limitation of “wherein the transformer model includes an encoder configured to encode the audio signal and a decoder configured to execute ASR decoding, AED decoding, and AT decoding to produce a decoder output for the encoded audio signal, and wherein the CTC-based model is configured to execute the ASR decoding and the AED decoding, for the encoded audio signal, to produce a CTC output for the encoded audio signal, and wherein the decoder output and the CTC output of the ASR decoding and the AED decoding are jointly scored to produce a joint decoding output” when read in light of independent claim 1, including all intervening limitations, is not taught by the prior art of record.
Regarding claims 14-16, claims 14-16 are allowable at least in light of their dependency from an allowable base claim.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
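For illustration only, the joint scoring of a decoder output and a CTC output recited in claim 13 can be sketched as follows. This is a minimal sketch, assuming a weighted sum of log-probabilities over a fixed list of candidate hypotheses, which is one common way to combine the two scores; it is not taken from the application or from the prior art of record, and all names are hypothetical.

```python
def joint_score(log_p_dec, log_p_ctc, ctc_weight):
    """Combine decoder and CTC log-probabilities for one hypothesis."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_dec

def best_hypothesis(hypotheses, ctc_weight):
    """Pick the text with the highest joint score.

    hypotheses: list of (text, log_p_dec, log_p_ctc) tuples.
    """
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]
```

In this sketch, a hypothesis favored by the decoder alone can be overruled once the CTC score is taken into account, which is the practical effect of scoring the two outputs jointly.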
	
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Non-Patent Literature to Fujimura et al. (H. Fujimura, M. Nagao and T. Masuko, “Simultaneous Speech Recognition and Acoustic Event Detection Using an LSTM-CTC Acoustic Model and a WFST Decoder,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5834-5838, doi: 10.1109/ICASSP.2018.8461916) discloses simultaneous speech recognition and acoustic event detection of spontaneous speech based on one-pass decoding without rescoring using an LSTM-CTC acoustic model and a WFST decoder.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard whose telephone number is (313)446-6627. The examiner can normally be reached 07:00-17:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at (571) 272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/Sean E Serraguard/Patent Examiner, Art Unit 2657      

/LAMONT M SPOONER/Primary Examiner, Art Unit 2657
8/29/2022