DETAILED ACTION

Introduction
This office action is in response to Applicant’s submission filed on 4/2/2021. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 4/2/2021 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Drawings
The drawings filed on 4/2/2021 are accepted and considered by the Examiner.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 12-15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US Patent No.: US 10304475 B1) hereinafter as Wang, in view of Moghimi et al. (US Patent Application Publication No: US 20180218747 A1) hereinafter as Moghimi, and further in view of Yoshioka et al. (WO 2020222935 A1) hereinafter as Yoshioka.
		
Regarding claim 1, Wang discloses: A method implemented by one or more processors, the method comprising: receiving audio data that captures a spoken utterance of a user, the audio data being generated by two or more microphones of a computing device of the user; ([col. 4, lines 20-26] FIG. 1 illustrates a device 100 configured to capture audio, perform beamforming, and perform beam-level trigger word detection. As shown, the device 100 may include a microphone array 102 as well as other components, such as those discussed below. The device 100 may receive (170) input audio data corresponding to audio captured by the microphone array 102.)
determining, based on processing the audio data, that a first audio data segment of the audio data includes one or more particular words or phrases; ([col. 4, lines 37-47] The device 100 may process (174) each beam into one or more feature vector(s) corresponding to the beam. For example, one feature vector may correspond to a single audio frame for audio data of a particular beam. Typical audio frames may be 10 ms or 25 ms each. An audio frame for one beam may correspond to a same time period or a different time period as a different audio frame for another beam. The feature vectors determined may include values for features that may be considered by a trained model configured to detect a trigger word (or portion thereof) in audio data. [col. 3, lines 17-20] For speech processing enabled systems, the wakeword may be the only trigger word recognized by the system and all other words are processed using typical speech processing.)
initializing, based on the estimated spatial correlation matrix, a beamformer; ([col. 4, lines 4-19] To improve wakeword detection, and trigger word detection generally, offered is a device that can perform low power beam-based trigger word detection for initial beam selection, with potential trigger word confirmation by higher power downstream trigger word detection component. In the present device, an individual neural network or other trained model may process the output of each beam. Each such model operates independently and provides a confidence score corresponding to whether a portion (or the whole) of a trigger word is detected in the particular beam. The beam that indicates a strongest presence or earliest presence of the trigger word according to the trained models may be selected for further processing. The beam may be used until a user command is over or a desired beam may be switched during speech capture depending on changes to the acoustic environment.)
and causing the beamformer to be utilized in processing of at least a second audio data segment of the audio data, the second audio data segment including one or more terms that follow the one or more particular words or phrases. ([col. 4, line 62 to col. 5, line 1] The device 100 may also process (178) second audio data for the second beam using a second trained model to determine a second score. The second audio data may be post-beamformed audio data (e.g., audio data for the second beam output by the beamformer) or may be one or more feature vectors for the second beam, such as those determined in step 174.)
Wang does not explicitly disclose, but Moghimi discloses: obtaining a preceding audio data segment that precedes the first audio data segment, the preceding audio data segment being generated by the two or more microphones of the computing device; ([0004] In one aspect, an audio device includes a plurality of spatially-separated microphones that are configured into a microphone array, wherein the microphones are adapted to receive sound. There is a processing system in communication with the microphone array and configured to derive a plurality of audio signals from the plurality of microphones, use prior audio data to operate a filter topology that processes audio signals so as to make the array more sensitive to desired sounds than to undesired sounds, categorize received sounds as one of desired sounds or undesired sounds, and use the categorized received sounds and the categories of the received sounds to modify the filter topology. In one non-limiting example, desired and undesired sounds modify the filter topology differently.)
Wang and Moghimi are considered analogous art because they are both in the related art of beamforming/audio filtering. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang with the teachings of Moghimi to incorporate the above-mentioned claim limitations, because the combination of the disclosures would make the array more sensitive to desired sounds than to undesired sounds, categorize received sounds as one of desired sounds or undesired sounds, and use the categorized received sounds and the categories of the received sounds to modify the filter topology (Moghimi, summary).
Wang in view of Moghimi does not explicitly disclose, but Yoshioka discloses: estimating, based on the first audio data segment and based on the preceding audio data segment, a spatial correlation matrix; ([0075] In one embodiment, an approach called geometry-agnostic beamforming, or blind beamforming, is used to perform beamforming for distributed recording devices. Given M microphone devices, corresponding to M audio channels, an M-dimensional spatial covariance matrices of speech and background noise are directly estimated. The matrices capture spatial statistics of the speech and the noise, respectively. To form an acoustic beam, the M-dimensional spatial covariance matrices are inverted.)
Wang, Moghimi, and Yoshioka are considered analogous art because they are in the related art of beamforming/audio filtering. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, with the teachings of Yoshioka to incorporate the above-mentioned claim limitations, because the combination of the disclosures may improve the accuracy of downstream speech processing, such as speech recognition and speaker diarization (Yoshioka, [0073]).
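For illustration only, and without characterizing the implementation of any cited reference, the following minimal Python sketch shows one common way an M-dimensional spatial covariance matrix can be estimated from M microphone channels. The function name, the array layout (one row per channel), and the random two-channel input are assumptions made for this example; NumPy is assumed to be available.

```python
import numpy as np


def spatial_covariance(frames: np.ndarray) -> np.ndarray:
    """Estimate an M x M spatial covariance matrix.

    frames: complex array of shape (M, T), one row per microphone channel
    (for example, the coefficients of one frequency bin over T frames).
    """
    num_frames = frames.shape[1]
    # Average of the outer products x_t x_t^H over the T frames.
    return frames @ frames.conj().T / num_frames


rng = np.random.default_rng(0)
# Hypothetical two-microphone input: 2 channels, 200 frames.
x = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
R = spatial_covariance(x)
```

The result is Hermitian and positive semi-definite by construction, which is what permits the inversion and eigen-decomposition steps discussed in the cited passages.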

Regarding claim 12, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 1, 
Wang further discloses: wherein the preceding audio data segment is obtained from an audio data buffer. ([col. 2, lines 65-68, col. 3, lines 1-6] The device, recognizing the wakeword “Alexa” would understand the subsequent audio (in this example, “play some music”) to include a command of some sort and would send audio data corresponding to that subsequent audio (as well as potentially to the wakeword and some buffered audio prior to the wakeword) to a remote device (or maintain it locally) to perform speech processing on that audio to determine what the command is for execution.  Also see Fig. 14, Buffer (702))
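For illustration only, a fixed-length audio data buffer of the kind referred to above can be sketched in Python with a bounded double-ended queue. The class name, method names, and buffer length are hypothetical and are not taken from Wang or from the application.

```python
from collections import deque


class PreRollBuffer:
    """Keeps only the most recent max_frames audio frames."""

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen discards the oldest frame when full.
        self._frames = deque(maxlen=max_frames)

    def push(self, frame) -> None:
        self._frames.append(frame)

    def snapshot(self) -> list:
        """Return the buffered frames, oldest first."""
        return list(self._frames)


buffer = PreRollBuffer(max_frames=3)
for frame_id in range(5):  # integers stand in for audio frames
    buffer.push(frame_id)

# Frames that precede the point of detection, oldest first.
preceding_segment = buffer.snapshot()
```

In this sketch, the frames held in the buffer at the moment a trigger word is detected play the role of the audio that precedes the trigger word.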

Regarding claim 13, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 12, 
Yoshioka further discloses: wherein the preceding audio data segment captures ambient noise of an environment of the computing device of the user. ([0031] The audio received from nearby devices will have an audio signature based on a combination of ambient noise and/or any sound generated near the device.  A buffer is also disclosed in [0066])

Regarding claim 14, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 1, 
Yoshioka further discloses: wherein the one or more processors are executed locally at the computing device of the user. ([0024] The functions or algorithms described herein may be implemented in software in one embodiment. The software may consist of computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.)

	Regarding claim 15, Wang discloses: A computing device comprising: at least one processor; ([col. 22, lines 56-57] The device 100 may include one or more controllers/processors 1404, …)
	at least two microphones; ([col. 4, lines 22-24] the device 100 may include a microphone array 102 as well as other components, such as those discussed below. The device 100 may receive (170) input audio data corresponding to audio captured by the microphone array 102.)
	and memory storing instructions that, when executed, cause the at least one processor to: ([col. 22, lines 56-60] The device 100 may include one or more controllers/processors 1404, that may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory 1406 for storing data and instructions.)
	The remaining elements of the claim recite the same elements as claim 1; therefore, the rationale in rejecting claim 1 also applies to claim 15.

	Regarding claim 20, Wang discloses: A non-transitory computer-readable storage medium storing instructions locally at a computing device that, when executed, cause at least one processor to: ([col. 24, lines 10-22] Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media. Some or all of the adaptive beamformer 160, beamformer 190, etc. may be implemented by a digital signal processor (DSP).)
	The remaining elements of the claim recite the same elements as claim 1; therefore, the rationale in rejecting claim 1 also applies to claim 20.

Claims 2 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Moghimi, further in view of Yoshioka, and furthermore in view of Donley et al. (US Patent No: US 10638252 B1) hereinafter as Donley.

Regarding claim 2, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 1, 
Yoshioka further discloses: wherein estimating the spatial correlation matrix based on the first audio data segment and based on the preceding audio data segment comprises: determining a first audio data segment spatial covariance associated with the first audio data segment; ([0075] In one embodiment, an approach called geometry-agnostic beamforming, or blind beamforming, is used to perform beamforming for distributed recording devices. Given M microphone devices, corresponding to M audio channels, an M-dimensional spatial covariance matrices of speech and background noise are directly estimated. The matrices capture spatial statistics of the speech and the noise, respectively. To form an acoustic beam, the M-dimensional spatial covariance matrices are inverted.)
determining a preceding audio data segment spatial covariance associated with the preceding audio data segment; ([0075] In one embodiment, an approach called geometry-agnostic beamforming, or blind beamforming, is used to perform beamforming for distributed recording devices. Given M microphone devices, corresponding to M audio channels, an M-dimensional spatial covariance matrices of speech and background noise are directly estimated. The matrices capture spatial statistics of the speech and the noise, respectively. To form an acoustic beam, the M-dimensional spatial covariance matrices are inverted.)
Wang in view of Moghimi, and further in view of Yoshioka does not explicitly disclose, but Donley discloses: and estimating the spatial covariance matrix based on a difference between the first audio data segment spatial covariance and the preceding audio data segment spatial covariance. ([col. 14, lines 10-20] In some embodiments, the covariance buffer module 365 updates the covariance buffer with a generated spatial covariance matrix based on a comparison of the generated spatial covariance matrix with matrices already in the covariance buffer. The covariance buffer module 365 may update the covariance buffer when the microphone array 310 detects, with a high confidence, a single audio signal from a single audio source. Such a detection may be determined using a singular-value decomposition or by comparing the relative contributions determined for each audio source to a threshold contribution level.)
Wang, Moghimi, Yoshioka, and Donley are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, further in view of Yoshioka, with the teachings of Donley to incorporate the above-mentioned claim limitations, because the combination of the disclosures may enhance signal data through adapting and updating of the signal enhancement filters (Donley, background/summary).
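For illustration only, the difference recited in claim 2 can be sketched in Python as a subtraction of two per-segment covariance estimates. This is a minimal sketch of the claim limitation as worded, not a reproduction of any reference's implementation; the function names, the synthetic two-channel signals, and the steering vector are assumptions made for this example, and NumPy is assumed to be available.

```python
import numpy as np


def covariance(frames: np.ndarray) -> np.ndarray:
    """Spatial covariance of frames with shape (M, T)."""
    return frames @ frames.conj().T / frames.shape[1]


def covariance_difference(first_segment: np.ndarray,
                          preceding_segment: np.ndarray) -> np.ndarray:
    """First-segment covariance minus preceding-segment covariance."""
    return covariance(first_segment) - covariance(preceding_segment)


rng = np.random.default_rng(1)
# Hypothetical preceding segment: ambient noise only, 2 channels.
noise_only = rng.standard_normal((2, 100)) + 1j * rng.standard_normal((2, 100))
# Hypothetical first segment: a source seen through an assumed
# steering vector, plus the same kind of noise.
steering = np.array([1.0, 1.0j])
source = rng.standard_normal(100)
hotword_plus_noise = steering[:, None] * source[None, :] + noise_only

estimate = covariance_difference(hotword_plus_noise, noise_only)
```

Each segment is normalized by its own frame count, so the two segments need not have the same length.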

Regarding claim 16, although different in scope from claim 2, the claim recites the elements of the method of claim 2 as a computing device. Thus, the analysis in rejecting claim 2 is equally applicable to claim 16.

Claims 3-4, 10, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima (JP 2020003751 A) hereinafter as Kagoshima.

Regarding claim 3, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 1, 
Wang in view of Moghimi, and further in view of Yoshioka does not explicitly disclose, but Kagoshima discloses: wherein initializing the beamformer based on the estimated spatial correlation matrix comprises: determining a principal eigenvector of the estimated spatial correlation matrix; ([pg. 15, 5th para] Specifically, the coefficient deriving unit 20G derives the eigenvector F_SNR(f, n) corresponding to the maximum eigenvalue of a matrix represented by a product of a first spatial correlation matrix φ_xx(f, n) and an inverse matrix of the second spatial correlation matrix φ_NN(f, n). Then, the coefficient deriving unit 20G derives the eigenvector F_SNR(f, n) as a spatial filter coefficient F(f, n) (F(f, n) = F_SNR(f, n)).)
and initializing, based on the principal eigenvector, a plurality of coefficients for the beamformer. ([pg. 15, 3rd 5th-6th para] The coefficient deriving unit 20G reads the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) from the first correlation storage unit 20E and the second correlation storage unit 20F, and may use them to derive the spatial filter coefficient F(f, n). Here, at the stage of the steady processing, the first spatial correlation matrix φ_xx(f, n) and the second spatial correlation matrix φ_NN(f, n) stored in the first correlation storage unit 20E and the second correlation storage unit 20F are spatial correlation matrices updated by the correlation deriving unit 20D. That is, these spatial correlation matrices are spatial correlation matrices updated by the correlation deriving unit 20D using the target sound section detected based on the emphasized sound signal. Therefore, the coefficient deriving unit 20G derives the spatial filter coefficient F(f, n) based on the emphasized sound signal.  Note that the coefficient deriving unit 20G may add a post filter w(f, n) for improving the sound quality by adjusting the power of each frequency bin, and may derive the spatial filter coefficient F(f, n) using the following equation (9).  [pg. 28, last para] ..., a plurality of coefficient deriving units 30G, ...)
Wang, Moghimi, Yoshioka, and Kagoshima are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, further in view of Yoshioka, with the teachings of Kagoshima to incorporate the above-mentioned claim limitations, because the combination of the disclosures may emphasize a target sound signal with high precision (Kagoshima, abstract).
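For illustration only, determining a principal eigenvector and using it to initialize beamformer coefficients can be sketched in Python as follows. The sketch takes the principal eigenvector of a single Hermitian matrix, following the wording of claim 3; the cited Kagoshima passage instead takes the eigenvector of a product of one spatial correlation matrix with the inverse of another, which is not reproduced here. The function names, the unit-norm scaling, and the example matrix are assumptions; NumPy is assumed to be available.

```python
import numpy as np


def principal_eigenvector(matrix: np.ndarray) -> np.ndarray:
    """Eigenvector of a Hermitian matrix with the largest eigenvalue."""
    # numpy.linalg.eigh returns eigenvalues in ascending order, so the
    # last column of the eigenvector matrix is the principal one.
    _, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, -1]


def init_beamformer_coefficients(spatial_correlation: np.ndarray) -> np.ndarray:
    """Use the unit-norm principal eigenvector as the initial coefficients."""
    vector = principal_eigenvector(spatial_correlation)
    return vector / np.linalg.norm(vector)


# Hypothetical rank-one-plus-noise matrix for two microphones.
direction = np.array([1.0, 1.0j]) / np.sqrt(2.0)
correlation = np.outer(direction, direction.conj()) + 0.01 * np.eye(2)

coefficients = init_beamformer_coefficients(correlation)
```

An eigenvector is defined only up to a complex scale factor, so the coefficients recovered here match the assumed direction up to a phase rotation.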

Regarding claim 4, Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima discloses: The method of claim 3, 
Wang further discloses: wherein causing the beamformer to be utilized in processing of the second audio data segment comprises: processing, using the beamformer and based on the plurality of coefficients for the beamformer, the second audio data segment to generate a filtered second audio data segment; ([col. 4, line 62 to col. 5, line 1] The device 100 may also process (178) second audio data for the second beam using a second trained model to determine a second score. The second audio data may be post-beamformed audio data (e.g., audio data for the second beam output by the beamformer) or may be one or more feature vectors for the second beam, such as those determined in step 174.  See Figs. 4-5 for filtering process.  [plurality of coefficients already discussed in claim 3 with the Kagoshima disclosure])
and processing, using an acoustic machine learning (ML) model, the filtered second audio data segment to generate predicted output associated with the one or more terms. ([col. 21, lines 62-67, col. 22, lines 1-18] Various machine learning techniques may be used to perform the training of the trigger scorer 730, trained model/neural network 856 or other components. Models may be trained and operated according to various machine learning techniques. Such techniques may include, for example, inference engines, trained classifiers, etc. Examples of trained classifiers include conditional random fields (CRF) classifiers, Support Vector Machines (SVMs), neural networks (such as deep neural networks and/or recurrent neural networks), decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests. Focusing on CRF as an example, CRF is a class of statistical models used for structured predictions. In particular, CRFs are a type of discriminative undirected probabilistic graphical models. A CRF can predict a class label for a sample while taking into account contextual information for the sample. CRFs may be used to encode known relationships between observations and construct consistent interpretations. A CRF model may thus be used to label or parse certain sequential data, like query text as described above. Classifiers may issue a “score” indicating which category the data most closely matches. The score may provide an indication of how closely the data matches the category.)

Regarding claim 10, Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima discloses: The method of claim 4, 
Yoshioka further discloses: wherein the acoustic ML model is a speaker identification model, and wherein processing the filtered second audio data segment to generate the predicted output comprises: processing, using the speaker identification model, the filtered second audio data segment to identify the user that provided the spoken utterance. ([0094] The output from combination module 1060 is the result of a third fusion, referred to as a late fusion, to produce text and speaker identification for generation of a speaker-attributed transcript of the meeting. Note that the first two fusion steps at beamforming module 1020 and acoustic model score fusion module 1035, respectively, are optional in various embodiments. In some embodiments, one or more audio channels may be provided directly to an acoustic model scoring module 1065 without beamforming or speech separation. Speech recognition is then performed on such one or more audio channels via SR decoder 1070, followed by speaker diarization module 1075, with the output provided directly to combination module 1060.)

Regarding claim 17, although different in scope from claim 3, the claim recites the elements of the method of claim 3 as a computing device. Thus, the analysis in rejecting claim 3 is equally applicable to claim 17.

Regarding claim 18, although different in scope from claim 4, the claim recites the elements of the method of claim 4 as a computing device. Thus, the analysis in rejecting claim 4 is equally applicable to claim 18.

Claims 5 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, and furthermore in view of Hughs et al. (US Patent Application Publication No: US 20200066263 A1) hereinafter as Hughs.
Regarding claim 5, Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima discloses: The method of claim 4, 
Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima does not explicitly disclose, but Hughs discloses: wherein processing the second audio data segment to generate the filtered second audio data segment using the beamformer and based on the plurality of coefficients for the beamformer comprises: processing, using one or more first coefficients, of the plurality of coefficients for the beamformer, a first channel of the second audio data segment to generate a first channel of the filtered second audio data segment, the first channel of the second audio data segment being generated by a first microphone of the two or more microphones; ([0090] In some implementations, the audio data frames of the stream include at least a first channel based on a first microphone of the one or more microphones and a second channel based on a second microphone of the one or more microphones.  [coefficients are discussed in [0043], [0045], and [0071], as well as in the Yoshioka and Kagoshima disclosures])
processing, using one or more second coefficients, of the plurality of coefficients for the beamformer, a second channel of the second audio data segment to generate a channel of the filtered second audio data segment, the second channel of the second audio data segment being generated by a second microphone of the two or more microphones; ([0090] In some implementations, the audio data frames of the stream include at least a first channel based on a first microphone of the one or more microphones and a second channel based on a second microphone of the one or more microphones.)
and generating, based on the first channel of the filtered second audio data segment and based on the second channel of the filtered second audio data segment, the filtered second audio data. ([0090] Generating filtered data frames based on processing of the plurality of the audio data frames in the buffer at the second instance using the noise reduction filter as adapted at least in part in response to the determination at the first instance, can include: using both the first channel and the second channel of the plurality of the audio data frames in generating the filtered data frames.)
Wang, Moghimi, Yoshioka, Kagoshima, and Hughs are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, with the teachings of Hughs to incorporate the above-mentioned claim limitations, because the combination of the disclosures may result in more robust and/or more accurate detections of features of a stream of audio data frames in various situations, such as in environments with strong background noise (Hughs, summary).
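For illustration only, the per-channel processing recited in claim 5 can be sketched in Python as a filter-and-sum operation: each microphone channel is weighted by its own coefficient, and the weighted channels are then combined. The function name, the use of a single coefficient per channel, the conjugation convention, and the sample values are assumptions made for this example; NumPy is assumed to be available.

```python
import numpy as np


def apply_beamformer(coefficients: np.ndarray,
                     channels: np.ndarray) -> np.ndarray:
    """Filter each channel with its own coefficient, then sum.

    coefficients: shape (M,), one coefficient per microphone.
    channels: shape (M, T), one row per microphone channel.
    Returns the combined output of shape (T,).
    """
    # Per-channel filtering: row m is scaled by the conjugate of
    # coefficient m (a common filter-and-sum convention).
    filtered_channels = coefficients.conj()[:, None] * channels
    # Combine the filtered channels into one output signal.
    return filtered_channels.sum(axis=0)


# Hypothetical two-microphone segment with three samples per channel.
weights = np.array([0.5, 0.5])
segment = np.array([[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0]])
filtered = apply_beamformer(weights, segment)
```

With these equal weights the output is the sample-wise average of the two channels, here 2.0 for every sample.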

Regarding claim 19, although different in scope from claim 5, the claim recites the elements of the method of claim 5 as a computing device. Thus, the analysis in rejecting claim 5 is equally applicable to claim 19.

Claims 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, and furthermore in view of Sundaram (US Patent No: US 9972339 B1) hereinafter as Sundaram.

Regarding claim 6, Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima discloses: The method of claim 4, 
Wang in view of Moghimi, further in view of Yoshioka, and furthermore in view of Kagoshima does not explicitly disclose, but Sundaram discloses: wherein the acoustic ML model is an automatic speech recognition (ASR) model, and wherein processing the filtered second audio data segment to generate the predicted output comprises: processing, using the ASR model, the filtered second audio data segment to generate one or more recognized terms corresponding to the one or more terms. ([col. 6, lines 61-67, col. 7, lines 1-4] The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or a confidence score representing the likelihood that a particular set of words matches those spoken in the utterance. The confidence score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model 253 stored in an ASR Models Storage 252), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language or grammar model).)
Wang, Moghimi, Yoshioka, Kagoshima, and Sundaram are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, with the teachings of Sundaram to incorporate the above-mentioned claim limitations, because the combination of the disclosures may improve human-computer interactions (Sundaram, background/summary).

Regarding claim 7, Wang in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, and furthermore in view of Sundaram discloses: The method of claim 6, 
Wang further discloses: wherein determining that the first audio data segment includes one or more of the particular words or phrases comprises: processing, using a hotword detection model, the audio data to determine the first segment audio data includes one or more of the particular words or phrases. ([col. 2, lines 52-65] The waking command (which may be referred to as a wakeword), may include an indication for the system to perform further processing. The local device may continually listen for the wakeword and may disregard any audio detected that does not include the wakeword. Typically, systems are configured to detect a wakeword, and then process any subsequent audio following the wakeword (plus perhaps a fixed, but short amount of audio pre-wakeword) to detect any commands in the subsequent audio. As an example, a wakeword may include a name by which a user refers to a device. Thus, if the device was named “Alexa,” and the wakeword was “Alexa,” a user may command a voice controlled device to play music by saying “Alexa, play some music.”)

Regarding claim 8, Wang in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, and furthermore in view of Sundaram discloses: The method of claim 7, 
Wang further discloses: wherein one or more of particular words or phrases invoke an automated assistant, and wherein the automated assistant performs an automated assistant action based on the one or more recognized terms. ([col. 2, lines 52-65] The waking command (which may be referred to as a wakeword), may include an indication for the system to perform further processing. The local device may continually listen for the wakeword and may disregard any audio detected that does not include the wakeword. Typically, systems are configured to detect a wakeword, and then process any subsequent audio following the wakeword (plus perhaps a fixed, but short amount of audio pre-wakeword) to detect any commands in the subsequent audio. As an example, a wakeword may include a name by which a user refers to a device. Thus, if the device was named “Alexa,” and the wakeword was “Alexa,” a user may command a voice controlled device to play music by saying “Alexa, play some music.”)

Regarding claim 9, Wang in view of Moghimi, further in view of Yoshioka, furthermore in view of Kagoshima, and furthermore in view of Sundaram discloses: The method of claim 6, 
Yoshioka further discloses: further comprising: causing a transcription of the spoken utterance to be visually rendered for presentation to the user via a display of the computing device, wherein the transcription of the spoken utterances includes the one or more recognized terms. ([0034] The translated transcript is provided to the distributed device of the user. In example embodiments, the translated transcript is provided in real-time (or near real-time) as the meeting is occurring. The translated transcript can be provided via text (e.g., displayed on a device of the user) or outputted as audio (e.g., via a speaker, hearing aid, earpiece).)

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Moghimi, further in view of Yoshioka, furthermore in view of Donley, and furthermore in view of Ahgren et al. (WO 2013049739 A2) hereinafter as Ahgren.
Regarding claim 11, Wang in view of Moghimi, and further in view of Yoshioka discloses: The method of claim 1, 
Wang in view of Moghimi, and further in view of Yoshioka does not explicitly disclose, but Donley discloses: further comprising: receiving additional audio data that captures an additional spoken utterance of the user, the additional audio data being generated by two or more microphones of a computing device of the user, and the additional audio data that captures the additional spoken utterance of the user being received subsequent to receiving the audio data that captures the spoken utterance of the user; ([col. 8, lines 12-25] FIG. 2 illustrates an example audio assembly 200 within a local area 210, according to one or more embodiments. The local area 205 includes a user 210 operating the audio assembly 200, and three audio sources 220, 230, and 240. The audio source 220 (e.g., a person) emits an audio signal 250. A second audio source 230 (e.g., a second person), emits an audio signal 260. A third audio source 240 (e.g., an A/C unit or another audio source associated with background noise in the local area 205) emits an audio signal 270. In alternate embodiments, the user 210 and the audio sources 220, 230, and 240 may be positioned differently within the local area 205. In alternate embodiments, the local area 205 may include additional or fewer audio sources or users operating audio assemblies.)
processing, using the beamformer, the additional audio data to generate filtered additional audio data; ([col. 8, lines 50-62]  Depending on the type into which they are categorized by the audio assembly 200, signals received from each audio source may be enhanced to different degrees using different signal enhancement filters. For example, the audio source 220 and the audio source 230 may be users communicating with the user 210 operating the user assembly 200, categorized as human type audio. Accordingly, the audio assembly 200 enhances the audio signals 250 and 260 using signal enhancement techniques described below. In comparison, the audio source 240 is an air conditioning unit, categorized as non-human type audio. Accordingly, the audio assembly 200 identifies audio signal 270 as a signal which need not be enhanced.)
Wang, Moghimi, Yoshioka, and Donley are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi and further in view of Yoshioka, with the teachings of Donley to incorporate the above-mentioned claim limitations, because the combination of the disclosures may enhance signal data by adapting and updating the signal enhancement filters (Donley, background/summary).
Wang in view of Moghimi, and further in view of Yoshioka does not explicitly disclose, but Ahgren discloses: and transmitting, over one or more networks, the filtered additional audio data to an additional computing device of an additional user. ([background] For example, where the received audio signals are speech signals received from a user, the speech signals may be processed by the device for use in a communication event, e.g. by transmitting the speech signals over a network to another device which may be associated with another user of the communication event.)
Wang, Moghimi, Yoshioka, Donley, and Ahgren are considered analogous art because they are in the related art of beamforming/audio filtering/signal enhancement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Wang, in view of Moghimi, further in view of Yoshioka, and furthermore in view of Donley, with the teachings of Ahgren to incorporate the above-mentioned claim limitations, because the combination of the disclosures may provide more precise control for how fast and in what way changes in the desired beamformer behaviour are realized than what is provided by the data-adaptive beamformers of the prior art (Ahgren, summary).



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Masnadi-Shirazi et al. (US Patent Application Publication No: US 20210390952 A1) hereinafter as Masnadi-Shirazi.  Masnadi-Shirazi discloses a method and system for improving audio signal processing in noisy environments.  “[0014] In the present disclosure systems and methods are described that robustly estimate the TDOA/DOA of one or more concurrent speakers when a stronger dominant noise/interference source (e.g., loud TV noise) is consistently present. In some embodiments, the system works by employing some features of the Generalized Eigenvalue (GEV) beamformer, which allows for the estimate of the target speaker's unique spatial fingerprint or Relative Transfer Function (RTF). The target RTF is estimated by effectively nulling the dominant noise source. By applying a modified TDOA/DOA estimation method that uses the RTF as an input, the systems described herein can obtain a robust localization estimate of the target speaker. If multiple target speakers are active in the presence a stronger noise source (e.g., stronger than the target speakers), with proper tuning the RTF of each source can be estimated intermittently and fed to a multi-source tracker, leading to a robust VAD for each source separately that can drive the multi-stream voice enhancement system.”
(D. K., P. R. and M. M. P., "Real-time Multi Source Speech Enhancement for Voice Personal Assistant by using Linear Array Microphone based on Spatial Signal Processing," 2019 International Conference on Communication and Signal Processing (ICCSP), 2019, pp. 0965-0967, doi: 10.1109/ICCSP.2019.8698030.) hereinafter as Dinesh.  Dinesh discloses an algorithm for suppressing noise in multi-sourced audio signals for real-time voice personal assistant application.  Beamforming, voice activity detection and voice assistant applications are discussed.
(R. Haeb-Umbach et al., "Speech Processing for Digital Home Assistants: Combining Signal Processing With Deep-Learning Techniques," in IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 111-124, Nov. 2019, doi: 10.1109/MSP.2019.2918706.) hereinafter as Haeb-Umbach.  Haeb-Umbach discloses a speech algorithm that enables reliable, hands-free interaction with digital home assistants by combining signal processing with deep learning methods.  Wake-word detectors, end-of-query detectors, second-turn, device-directed speech classifiers, speaker identification modules and multichannel acoustic echo cancellation (MAEC), including beamforming, are discussed in detail, as well as unsupervised and supervised speech presence probability estimation, which discusses how the beamformer coefficients of the most common beamformers can be calculated from the covariance matrices.
(H. Taherian, Z. -Q. Wang, J. Chang and D. Wang, "Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1293-1302, 2020, doi: 10.1109/TASLP.2020.2986896.) hereinafter as Taherian.  Taherian discloses a technique for speaker recognition based on single and multi-channel speech enhancement.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Phillip H Lam whose telephone number is (571)272-1721. The examiner can normally be reached 10 AM-6 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/Examiner, Art Unit 2656

/HUYEN X VO/Primary Examiner, Art Unit 2656