Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-2, 4, 6, 8, 10, 13-15, 17, and 20 are pending.  Claims 1, 8, and 15 are independent, and Claims 1 and 15 are both system Claims.  Claim 19 has been canceled and most of the remaining Claims have been amended.
This Application was published as 2020-0211540.
Apparent priority: 27 December 2018.
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
	Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/17/2020 has been entered.
Response to Arguments
Applicant’s arguments are moot in view of the new grounds of rejection.  Note the reference Sauk (U.S. 2008/0008342).  Sauk, which was and remains applied to Claim 4, includes more detail regarding consideration of the acoustic parameters of both the source and target environments in recreating the audio scene.




1. A computing system comprising:
a first computing device comprising one or more processing units to execute processor-executable program code to cause the first computing device to: 
receive text data and sonic properties of input speech audio signals of a first user from which the text data was generated; 
input the text data into a trained network to generate output speech audio signals based on the text data, and the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user;  
determine acoustic characteristics of a playback environment; and  
process the output speech audio signals based on the sonic properties and the acoustic characteristics; and 
a speaker system to playback the processed output speech audio signals in the playback environment. 

2.    A computing system according to Claim 1, further comprising:
a second computing device comprising one or more second processing units to execute second processor-executable program code to cause the second computing system to: 
receive the input speech signals from the first user in the recording environment, the input speech audio signals exhibiting the sonic properties; 
generate the text data based on the received input speech audio signals; 
transmit the text data and the sonic properties to the first computing device.

Figure 1 of the instant Application shows the overall system: 
	

[Image omitted: Figure 1 of the instant Application]

	Figure 6 shows the processes occurring at the headset 145 (or 500 as it is called in Figure 5).  Figure 6 is described with respect to the flowchart of Figure 2.
[0052] According to some embodiments, device 500 executes S230 through S250 of process 200. FIG. 6 is an internal block diagram of some of the components of device 500 according to some embodiments. Each component may be implemented using any combination of hardware and software.
Published Application.



	Step S240: Process the Synthesized Speech Audio Signal Based on Contextual Information.  The “contextual information” includes both the “sender context” and the “receiver context.”
[0044] The synthesized speech audio is processed based on contextual information at S240. As described with respect to FIG. 1, the contextual information may include reproduction characteristics of a loudspeaker within an intended playback environment, an impulse response of the playback environment, an impulse response of an environment in which the original speech audio signals were captured, an impulse response of another environment, and/or spatial information associated with signal capture or with a virtual position within the playback environment. S240 may include application of signal processing effects intended to increase perception of the particular audio signals synthesized at S230.
Published Application.
As evidenced by the relevant portions of the Disclosure of the instant Application, reproducing, at the playback environment, the speech of a speaker located in the recording environment involves first synthesizing the speech from the text of the speech and then processing the synthesized speech by taking into consideration the acoustic/sonic properties of both the recording and playback environments.
The system of Figure 1 provides: 1) converting the voice of the user to text at a source location; 2) generating audio from text at the target location by 3) using a TTS model that is 4) trained with the voice of the speaker and 5) after the voice is synthesized 6) post-processing the synthesized voice by taking into account 7) acoustic characteristics of the recipient/listener/target environment and 8) acoustic/sonic characteristics/properties of the sender/speaker/source environment where the audio originates together.
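For illustration only, post-processing steps 6) through 8) above can be sketched as follows. This is a minimal sketch and is not taken from the Application: it assumes that the acoustic characteristics of each environment are available as an impulse response and that they are applied by linear convolution, which is one conventional technique. The names `convolve`, `apply_context`, `recording_ir`, and `playback_ir` are hypothetical, and the sample values are made up.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two non-empty sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def apply_context(synth, recording_ir, playback_ir):
    """Shape synthesized speech with both environments (assumed approach).

    The synthesized samples are convolved with the recording-environment
    impulse response, then with the playback-environment impulse response,
    and the result is scaled so that its peak magnitude is 1.
    """
    shaped = convolve(convolve(synth, recording_ir), playback_ir)
    peak = max((abs(x) for x in shaped), default=0.0)
    return [x / peak for x in shaped] if peak > 0 else shaped


# Hypothetical data: a unit impulse stands in for the synthesized speech.
synth = [1.0, 0.0, 0.0, 0.0]
recording_ir = [1.0, 0.5]        # direct path plus one echo
playback_ir = [1.0, 0.0, 0.25]   # direct path plus one later reflection
out = apply_context(synth, recording_ir, playback_ir)
```

The sketch only shows that both the sender-side and receiver-side characteristics enter the processing after synthesis; it says nothing about how the Application or any cited reference actually implements that processing.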
The Claims include the above features in a mix and match type of claiming.  For example, in the initial set the speech recognition step is separated from the speech synthesis in Claims 1 and 2.  In the other independent Claims 8 and 15, the speech recognition step is included.  Claim 8 mentions the processing with acoustics and sonics indirectly and Claim 15 has a separate step directed to this feature.  Some of the dependent Claims bring in second and third speeches that are recognized and subjected to the same process.

Taking into account the acoustic characteristics of the target environment is taught by Erten and by other references that pertain to generating audio and that must take into account at least the noise level of the target.  (See also Fado.)  Taking into account the acoustics/sonics of the source environment is taught by Sauk and Domville and by references that pertain to gaming and attempt to recreate the sound scene of the origin.  Speech-to-Speech (S2S) translation systems at times train the synthesizer to sound like the speaker who spoke in a different language.  Gaming references that create virtual sound environments cover this feature.  Chen, which pertains to speech-to-speech translation, is added for teaching a synthesizer trained with the voice of the speaker.
Also note Fado (U.S. 6,988,068), filed 25 March 2003, which, similar to Erten, is a TTS application except for the consideration of the sonic properties of the recording environment.  Title:  “Compensating for ambient noise levels in TTS applications.”



See also Gao (U.S. 2016/0147740) as another example of cross-lingual model adaptation where the source and target are apart.


Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 4 and 17 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claim 4 was amended to define “sonic properties” as “spatial location.”
4. A computing system according to Claim 1, 
wherein the sonic properties comprise a spatial location of a first user in the recording environment, and 
wherein the acoustic characteristics comprise a spatial location of a second user in the playback environment.
However, the Disclosure does not support this definition.  The phrase “sonic properties” occurs once in the Specification, at [0021] (provided below), and is not defined to include the “spatial location of a first user in the recording environment.”  On the other hand, the “spatial location” of the speaker is described numerous times in the Specification as part of the “context information,” such as the “playback context information 140” of Figure 1, which is provided to the “playback control 135.”

Claims 4 and 17 have similar language and are rejected for similar reasons.

See the following supporting material:
The phrase “sonic properties” occurs once in the Specification:
[0021] System 100 includes microphone 105 located within physical environment 110. Microphone 105 may comprise any system for capturing audio signals, and may be separate from or integrated with a computing system (not shown) to any degree as is known. Physical environment 110 represents the acoustic environment in which microphone 105 resides, and which affects the sonic properties of audio acquired by microphone 110. In one example, physical properties of environment 110 may generate echo which affects the speech audio captured by microphone 105. 
Published Application.  
The “spatial information” in the recording “environment 110” is discussed in the following passages:
[0020] FIG. 1 illustrates system 100 according to some embodiments. System 100 may provide efficient generation of particularly-suitable speech audio at a receiving system based on speech audio input at a sending system. Generally, and according to some embodiments, input speech audio is converted to text data at a sending system and speech audio data is generated from the text data at a receiving system. The generated speech data may reflect any vocal characteristics on which the receiving system has been trained, and may be further processed to reflect the context in which it will be played back within the receiving system. This context may include an impulse response of the playback room, spatial information associated with the speaker (i.e., sending user), desired processing effects (reverb, noise reduction), and any other context information.
…
[0030] Playback control component 135 processes the speech audio output by text-to-speech component 120 to reflect any desirable playback context information 140. Playback context information 140 may include reproduction characteristics of headset (i.e., loudspeaker) 145 within playback environment 150, an impulse response of playback environment 150, an impulse response of recording environment 110, spatial information associated with microphone 105 within recording environment 110 or associated with a virtual position of microphone 105 within playback environment 150, signal processing effects intended to increase perception of the particular audio signal output by component 120, and any other context information. 
…
[0044] The synthesized speech audio is processed based on contextual information at S240. As described with respect to FIG. 1, the contextual information may include reproduction characteristics of a loudspeaker within an intended playback environment, an impulse response of the playback environment, an impulse response of an environment in which the original speech audio signals were captured, an impulse response of another environment, and/or spatial information associated with signal capture or with a virtual position within the playback environment. S240 may include application of signal processing effects intended to increase perception of the particular audio signals synthesized at S230.
Published Application.
The “spatial information” of the speaker is used to create a virtual reality atmosphere for the listener:
[0059] FIG. 9 depict a similar scene in which device 500 receives text data of two remote users 920 and 940, who may also be remote from one another. Context information of each remote user may also be received, as well as context information associated with environment 910. Each of users 920 and 940 may be associated with a respective trained network, which is used to synthesize speech audio signals based on the text data of its respective user.   



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6, 8, 10, 13-15, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Erten (U.S. 2003/0061049) in view of Sauk (U.S. 2008/0008342) and further in view of Chen (U.S. 2010/0198577).
Erten is directed to an “Environmentally Aware Speech Synthesis System”:  “[0025] … The first aspect may be referred to as Environmentally Aware Speech Synthesis System (EASSS). EASSS integrates the method of the invention into the speech synthesis process itself. This implies that the speech synthesis is occurring during the delivery of the synthesized speech. The second aspect may be referred to as Environmentally Aware Synthesized Speech Delivery (EASSD). EASSD integrates the method of the invention after speech has been synthesized.”


Figure 2 shows two paths, 52 and 70.  In path 70, the speech is generated at the server and enhanced at the recipient device.  In path 52, text is downloaded to the recipient device, and the speech is generated from that text and enhanced there.

Regarding Claim 1, Erten teaches:
1. A computing system comprising:
a first computing device comprising one or more processing units to execute processor-executable program code to cause the first computing device to: [Erten, Figure 2, “Remote Server 24” or any of the “PDAs 50” can perform the steps of this Claim and all include processors and memories with programs being executed on them.  “[0029] … In FIG. 2, Internet ready personal digital assistant (PDA) 50 is shown as the link to remote server 24. … a handheld portable communication device such as, for example, a cellular phone, personal digital assistants (PDA), handheld computers, or the like….”]
receive text data and sonic properties of input speech audio signals of a first user from which the text data was generated; [Erten, Figure 2, “Text to be translated to speech 22” is received by the user at the PDA 50 of Figure 2.  “[0030] The EASSS, shown generally by 52, receives a text file 22. In this embodiment, wireless transmitter 36 sends text file 22 to wireless receiver 50, where text file 22 is stored in memory 54….”]
input the text data into a trained network to generate output speech audio signals based on the text data, and the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user; [Erten, Figure  2, “TTS 30/56” generating speech that is output as “audio file 32/60.”  “[0030] … Text-to-speech (TTS) converter 56 reads text file 22 from memory 54 and generates a speech signal which is filtered by speech enhancer 58 to produce audio signal 60. Audio signal 60 is played into environment 28, such as a vehicle interior cavity, through speakers 61.”]  
determine acoustic characteristics of a playback environment; and [Erten, Figure 2, “voice detection and noise analysis 64” determines the noise/ “acoustic characteristics” in the playback environment.  The main focus of Erten is to adapt to the noise conditions of the playback environment.  See Background paragraph [0007].  “[0031] … Voice detection and noise analysis unit 64 receives a sound signal from transducer 62 and generates one or more parameters 66 indicative of noise in environment 28. These parameters may be used to affect speech enhancer filter 58, TTS converter 56, or both. …”  See also Figure 3, “synthesis parameters 102” are obtained from “noise analysis 94.”  “Synthesized speech is enhanced by listening to the acoustic background into which the synthesized speech is delivered and adjusting parameters of the synthesized speech accordingly.…”  Abstract.]
process the output speech audio signals based on the sonic properties and the acoustic characteristics; and [Erten, Figure 2, “speech enhancing filters 58” or “speech enhancer 58” receive the “acoustic characteristics” / noise in the recipient environment as determined by the “voice detection and noise analysis 64” to enhance the speech that is to be generated by the “TTS 56.”   “… In one embodiment, text is synthesized into speech based on at least one noise parameter determined from the environment into which the synthesized speech is delivered. In another embodiment, parameters for a filter modifying the synthesized signal are determined based environmental noise.”  Abstract.  “[0031] Synthesized speech signal 60 is greatly enhanced through the use of sound transducer 62 in environment 28. Voice detection and noise analysis unit 64 receives a sound signal from transducer 62 and generates one or more parameters 66 indicative of noise in environment 28. These parameters may be used to affect speech enhancer filter 58, TTS converter 56, or both. In addition, parameters 66 may be used to generate commands that are read by TTS converter 56. These commands may be written into memory 54.”  See Figures 3 and 4 for two types of enhancement of the synthesized speech both of which are applied to the already complete speech (speech synthesis 106 in Figure 3 or streaming audio 72 in Figure 4) and are based on the voices and noises in the playback environment.]
a speaker system to playback the processed output speech audio signals in the playback environment. [Erten, Figure 2, “loudspeakers 61” playing the synthesized speech.  “[0030] … Audio signal 60 is played into environment 28, such as a vehicle interior cavity, through speakers 61.”]

Erten teaches evaluating the audio environment of the recipient in which the speech is to be output and considering the noises and voices in the playback/recipient environment to enhance synthesized speech in that environment.
Erten does not teach taking into consideration the acoustics of the recording environment in which the text was generated.
Erten does not specify that its TTS model is trained with the speech of the speaker.  
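For illustration only, the kind of noise-adaptive enhancement attributed to Erten above can be sketched as a gain rule. This is a minimal sketch under stated assumptions and is not Erten's method: it assumes the noise parameter is an RMS level measured from the playback-environment microphone and that enhancement is a simple bounded gain. The names `rms`, `noise_aware_gain`, `target_snr_db`, and `max_gain` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def noise_aware_gain(speech, noise, target_snr_db=10.0, max_gain=4.0):
    """Gain that lifts speech to target_snr_db above the measured noise.

    The gain never drops below 1.0 (speech is not attenuated) and never
    exceeds max_gain (to limit clipping); both bounds are assumptions.
    """
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0.0 or noise_rms == 0.0:
        return 1.0
    needed = noise_rms * 10 ** (target_snr_db / 20.0) / speech_rms
    return min(max(needed, 1.0), max_gain)


# Hypothetical samples: synthesized speech and ambient noise from the microphone.
speech = [0.1, -0.1, 0.1, -0.1]
gain_moderate = noise_aware_gain(speech, [0.05, -0.05])
gain_quiet = noise_aware_gain(speech, [0.001])
gain_loud = noise_aware_gain(speech, [1.0])
```

The sketch uses only playback-side measurements, which matches the observation above that Erten does not consider the acoustics of the recording environment.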

Sauk pertains to “virtual sound environments,” sound scenes, or sound fields where the sound is generated for a listener taking into account the locations of the speakers and the listener in the scene or field of sound and teaches:  
1. A computing system comprising:
a first computing device comprising one or more processing units to execute processor-executable program code to cause the first computing device to: [Sauk, Figure 5, “Sound Environment Manager 494” that provides data to the “Headset 108.”  “[0036] The sound environment manager 494 can be implemented by means of a general purpose computer or microprocessor programmed with a suitable set of instructions for implementing the various processes as described herein, and one or more digital signal processors….”]
receive text data and sonic properties of input speech audio signals of a first user from which the text data was generated;  [Sauk teaches that the speech / enunciated data and two types of metadata associated with speech are collected from the speaker environment and sent to the listener.  See Figures 6A and 6B for the types of “information” and types of “metadata” that are associated with the captured audio on the source side. The two types of metadata include non-spatial data pertaining to the recording environment and spatial data that includes the location of the speaker. “Method and apparatus for producing, combining, and customizing virtual sound environments. …. The first type of metadata includes information which identifies a characteristic of the enunciated data exclusive of spatial position information.  The second type of metadata identifies a spatial position information associated with the enunciated data.”  Abstract.  See [0033] for the first type of metadata.]
input the text data into a trained network to generate output speech audio signals based on the text data, and the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user;  [Sauk, Figure 5, “Audio Mixer 484” generating the output speech for the “headphone 108” that is worn by the listener.  The generated audio can be “synthesized.”  “[0016] It should be understood that "enunciated data" as used herein will include a wide variety of different types of audio information that is available for presentation to a user. For example, the various types of enunciated data include live voice data as generated by a person, data which specifies one or more words which are then synthesized or machine reproduced for a user. Such synthesized or machine reproduction can include generating one or more words using stored audio data as specified by the enunciated data. It should also be understood that the term enunciated data as used herein includes data which specifies one or more different types of audio tones which are audibly reproduced for a user.”]
determine acoustic characteristics of a playback environment; and  [Sauk takes into consideration the receiving/listener environment as well, in particular the location of the listener in the playback environment.  For example, in Figure 5, some of the data and metadata are received and some other types are sensed from the immediate environment of the listener.  “[0002] The inventive arrangements relate to the field of audio processing and presentation and, in particular, to combining and customizing multiple audio environments to give the user a preferred illusion of sound (or sounds) located in a three dimensional space surrounding the listener.”]
process the output speech audio signals based on the sonic properties and the acoustic characteristics; and [Sauk combines the information from both the speaker side and the listener side to create the sound scene for the listener: “[0034] The second type of metadata 604-2 identifies spatial position information associated with the enunciated data that is used to create a 3-D binaural audio effect. For example, the spatial position information can include one or more of the following: a real world location of a source of the enunciated data, a virtual or apparent location of a source of enunciated data, a real world location of a target, and/or a real world location of a destination…..” ] 
a speaker system to playback the processed output speech audio signals in the playback environment. [Sauk, Figures 4 and 5.  “[0028] The binaural sound environment provided to the listener's ears with right speaker 104 and with left speaker 106 will change in decibel level and/or quality as the real-world position of the listener's head 110 moves or changes orientation relative to the position of the sound source 204. ….” ]
Erten and Sauk pertain to providing an audio communication to a listener and it would have been obvious to combine the virtual reality audio scene generation of Sauk, which takes into account the spatial information of both the speaker and the listener, with the system of Erten in order to provide a sound scene for the listener of Erten.  (Sauk, “[0004] Binaural audio is sound that is processed to provide the listener with a three dimensional virtual audio environment. This type of audio allows the listener to be virtually immersed into any environment to simulate a more realistic experience. Having binaural sound emanating from different spatial locations outside the listener's head is different from stereophonic sound and it is different from monophonic audio.”)  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Erten and Sauk do not discuss that the synthesizer is trained with the voice of the speaker.
Chen teaches:
inputting the text data into a trained network to generate second speech audio signals based on the text data, the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user; [Chen, Figures 1 and 11.  Title:  “State Mapping for Cross-Language Speaker Adaptation.”  Figure 11, “speaker adaptation module 111” providing input to the “speech synthesis module.”  See Figure 9 and Figure 12 for the HMM networks trained to synthesize speech with the voice of the speaker 102.  Figure 12, VsLs samples 1202 teach the training set.  See claims 7 and 11 of Chen.]
Erten, Sauk, and Chen pertain to audio generation and speech synthesis and it would have been obvious to augment the combination with the speaker-specific trained deep neural network speech synthesizer of Chen so that the output voice sounds the same as the voice of the speaker.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.



Regarding Claim 2, Erten begins with text and does not teach where this text comes from.  Sauk does not discuss text.
Chen teaches or suggests:
2.    A computing system according to Claim 1, further comprising:
a second computing device comprising one or more second processing units to execute second processor-executable program code to cause the second computing system to: [Chen Figure 11 shows both operations of ASR and TTS occurring at the same device.  Thus, the “second computing device” in this limitation is suggested by Chen considering that the two operations can be separated.  “Processor 1102.”  “Memory 1106.”]
receive the input speech signals from the first user in the recording environment, the input speech audio signals exhibiting the sonic properties; [Chen Figure 11, Speaker 102 “Hello” is the input speech signal.  The “sonic properties” are mapped to the characteristics of the voice of the speaker 102.]
generate the text data based on the received input speech audio signals; and [Chen, Figure 11, “Speech Recognition Module 1110.”]
transmit the text data and the sonic properties to the first computing device. [Chen teaches in Figure 11 that the text recognized from the input speech is provided to the “speech synthesis module 1120”.  This step is also suggested because both the “speech recognition module 1110” and “speech synthesis module 1120” are shown as part of the same system.]
Erten/Sauk and Chen pertain to generation of voice audio output.  Erten takes text as input and generates voice that takes into account the acoustics of the output environment.  Chen takes the speaker’s voice as input, converts it to text, modifies the text, and outputs it as voice again.  Chen trains the synthesizer module to generate speech that sounds like the speaker’s voice.  It would have been obvious to combine the voice input of Chen and the trained synthesizer of Chen with the system of Erten/Sauk in order to provide for the generation of the text that is used by Erten/Sauk from an initial input voice.  This is a tandem combination, and voice and text are highly interchangeable when it comes to content.  Alternatively, Chen could have been modified by Erten to provide for separation of the recognition and synthesis processes in two devices.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 4, Erten, Figures 2 or 3 teach that the conditions of the listening environment are considered:  “[0007] …What is needed is to increase intelligibility by making the synthesis system aware of environmental conditions, such as noise parameters and environmental acoustics.”  “[0031] Synthesized speech signal 60 is greatly enhanced through the use of sound transducer 62 in environment 28. Voice detection and noise analysis unit 64 receives a sound signal from transducer 62 and generates one or more parameters 66 indicative of noise in environment 28. …”  See also [0035] to [0036].  Erten does not teach that the “acoustic characteristics” consider the “spatial location” of the listener, although the “sound transducer 62” is a mic in the environment of the listener/second user that would pick up any voice generated by this user.
Erten, which was cited for taking into account the acoustics of the playback location, does not teach taking into account a spatial location of a second user as part of acoustic characteristics of the playback environment.
Sauk teaches:
4. A computing system according to Claim 1, 
wherein the sonic properties comprise a spatial location of a first user in the recording environment, and [Sauk pertains to “virtual sound environments,” sound scenes, or sound fields where the sound is generated for a listener taking into account the locations of the speakers and the listener in the scene or field of sound.  “Method and apparatus for producing, combining, and customizing virtual sound environments. …. The second type of metadata identifies a spatial position information associated with the enunciated data.”  Abstract.  “[0009] …For example, the metadata typically includes spatial location information of the source of the particular audio information. This spatial location information can then be used to produce a binaural audio signal that simulates the desired spatial location of the source. …”  “2. The method according to claim 1, further comprising selecting said second information to further include at least a second metadata which indicates a spatial position information of said enunciated data in said binaural audio environment.”  This is the location of the source/speaker in the recording environment.]
wherein the acoustic characteristics comprise a spatial location of a second user in the playback environment. [Sauk, “4. The method according to claim 3, further comprising selecting at least one of said BRIR and said reverb filter in accordance with a relative spatial distance of said user with respect to a remote location associated with a source of said enunciated data.”  This is the location of the listener in the playback environment.  “[0034] The second type of metadata 604-2 identifies spatial position information associated with the enunciated data that is used to create a 3-D binaural audio effect. For example, the spatial position information can include one or more of the following: a real world location of a source of the enunciated data, a virtual or apparent location of a source of enunciated data, a real world location of a target, and/or a real world location of a destination…..”  “[0028] The binaural sound environment provided to the listener's ears with right speaker 104 and with left speaker 106 will change in decibel level and/or quality as the real-world position of the listener's head 110 moves or changes orientation relative to the position of the sound source 204. ….” ]
Erten and Sauk pertain to generation of sound/speech and it would have been obvious to augment the combination with the virtual sound environment of Sauk which tracks both the source (speaker) and target (listener) locations and generates a virtual sound environment for the listener that is based on the location and motion of the source in order to provide an enhanced listening experience.  (Sauk, “[0002] The inventive arrangements relate to the field of audio processing and presentation and, in particular, to combining and customizing multiple audio environments to give the user a preferred illusion of sound (or sounds) located in a three dimensional space surrounding the listener.”)

Regarding Claim 6, Erten teaches:
6. A computing system according to Claim 2, further comprising:
a third computing device comprising one or more third processing units to execute third processor-executable program code to cause the third computing system to:
receive second input speech audio signals from a second user in a second recording environment, the second input speech audio signals and exhibiting second sonic properties;
generate second text data based on the received second input speech audio signals; and
transmit the second text data and the second sonic properties to the first computing device,
the first computing device comprising one or more processing units to further execute processor-executable program code to cause the first computing device to: [Erten, Figure 2, either the “remote server 24” or any of the “PDAs 50” can perform any of these operations multiple times with first, second, third, etc. text data.]
receive the second text data and the second sonic properties; [Erten, Figure 2, “text to be translated to speech 22” received from communications link 36 at the PDA 50.]
input the text data into a second trained network to generate second output speech audio signals based on the second text data, and the second network having been trained to generate audio signals from text data based on a training set of speech audio signals of the second user; and [Erten, Figure 2, audio signal 60 /speech generated by TTS 56.]
process the second output speech audio signals based on the acoustic characteristics and the second sonic properties, and [Erten, Figure 2, TTS 56 output is processed by the “speech enhancer 58” which gets its parameters from the acoustic characteristics of the environment by “voice detection and noise analysis 64.”]
the speaker system to playback the processed output speech audio signals and the processed second output speech audio signals in the playback environment. [Erten, Figure 2, output of the synthesized speech from loudspeakers 61.]
Erten does not teach taking into account acoustic/sonic characteristics of the recording end.
Sauk teaches:
process the second output speech audio signals based on the acoustic characteristics and the second sonic properties, and [Sauk as applied to Claim 1 teaches that it applies the metadata pertaining to spatial characteristics of both the speaker and listener environments to the output audio in order to provide a virtual reality sound scene that mimics the side of generation of the sound for the listener.]
Rationale for combination as provided for Claim 1.
Erten and Sauk do not teach the speech recognition end and the synthesizer trained with the voice of the speaker.
Chen teaches or suggests:
6. A computing system according to Claim 2, further comprising:
a third computing device comprising one or more third processing units to execute third processor-executable program code to cause the third computing system to: [Chen, Figure 11, Processor 1102, and Memory 1108.  A third device is not taught but is suggested.  If the device of Chen is combined as the second device with the device of Erten, several of the devices of Chen can be combined and there is no limitation on second, third, etc.  At any rate, adding devices does not change any function.]
receive second input speech audio signals from a second user in a second recording environment, the second input speech audio signals and exhibiting second sonic properties; [Chen, Figures 1 or 11, “Hello” spoken by Speaker 102 with the voice /sonic properties of speaker 102.]
generate second text data based on the received second input speech audio signals; and [Chen, Figure 11, “Speech Recognition Module 1110” operating on input speech.]
transmit the second text data and the second sonic properties to the first computing device, [Chen, Figure 11, the text generated by ASR 1110 is provided to the “Speech Synthesis Module 1120.”  The “Speaker Adaptation Module 1114” provides the characteristics of the voice of the speaker 102 /sonic properties to the Speech Synthesis Module 1120.]
the first computing device comprising one or more processing units to further execute processor-executable program code to cause the first computing device to: [Chen: this is not taught but suggested by having separate modules that can work independently.]
receive the second text data and the second sonic properties; [Chen, Figure 11, the “Speech Synthesis Module 1120” receives the text out of the “Text Translation Module 1112” and the voice characteristics from the “Speaker Adaptation Module 1114.”]
input the text data into a second trained network to generate second output speech audio signals based on the second text data, and the second network having been trained to generate audio signals from text data based on a training set of speech audio signals of the second user; and [Chen, Figures 11 and 12.  The “Speech Synthesis Module 1120” is working with the speech synthesis model that is adapted to sound like the voice of the Speaker 102.  Figure 12, VsLs Samples 1202 are used for the training.]
process the second output speech audio signals based on the acoustic characteristics and the second sonic properties, and [Chen: the output speech sounds like the voice of the Speaker 102 and thus includes the sonic properties of the recording side.]
the speaker system to playback the processed output speech audio signals and the processed second output speech audio signals in the playback environment. [Chen, output of “Hola 112” to the Listener 106.]
Rationale for combination as provided for Claim 1.

Regarding Claim 8, Erten teaches:
8. A computer-implemented method comprising: 
capturing first speech audio signals of a first user in a recording environment, the captured first speech audio signals exhibiting sonic properties; 
converting the first speech audio signals into text data;
inputting the text data into a trained network to generate second speech audio signals based on the text data, the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user; [Erten, Figure 2, “Text to be translated to speech 22” is received at the “remote server 24” from a user such as the PDA 50 of Figure 2. ] 
processing the second speech audio signals based on the sonic properties and on acoustic characteristics of a playback environment; and [Erten, Figure 2, “speech enhancing filters 58 and speech enhancer 58” receive the “acoustic characteristics” / noise in the environment as determined by the “voice detection and noise analysis 64” to enhance the speech that is to be generated by the “TTS 56” from the “text to be translated to speech 54.” ]  
playing the processed second speech audio signals in the playback environment. [Erten, Figure 2, “loudspeakers 61” playing the synthesized speech.  “[0030] … Audio signal 60 is played into environment 28, such as a vehicle interior cavity, through speakers 61.”]
Erten does not teach generation of text from speech.
Erten does not teach that the speech synthesizer is trained with the voice of the speaker.
Erten does not teach that the generated speech is post-processed with acoustics of source and target locations.
Sauk as applied to Claim 1 and under same rationale teaches that the output speech is post-processed with acoustics of source and target locations.
Chen as applied to Claims 1 and 2 and under similar rationale teaches that the initial input is speech and text that is generated from speech is converted back to speech and also teaches that the speech synthesizer is trained to sound like the initial speaker.

Claim 10 is a method Claim with limitations similar to Claim 4 and is rejected under similar rationale.

Claim 13 is similar to Claim 6 and is rejected under similar rationale.
13. A computer-implemented method according to Claim 8, further comprising:
capturing third speech audio signals of a second user in a second recording environment, the captured third speech audio signals exhibiting second sonic properties; [There is no limitation on the number of speakers and Sauk in Figure 4 shows the combining of the voices from different speakers.]
converting the third speech audio signals into second text data; [Chen, Figure 11, Speech Recognition Module.]
inputting the second text data into a second trained network to generate fourth speech audio signals based on the second text data, the second network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the second user; [Erten shows the TTS process.  Chen, Figure 11, Speech Synthesis Module that is trained with the voice of the Speaker.]
processing the fourth speech audio signals based on the sonic properties and the acoustic characteristics of the playback environment; and [Sauk teaching the processing of the audio output to include spatial acoustics of both source and target locations.]
playing the processed fourth speech audio signals in the playback environment. [Erten, Sauk, and Chen show the output of the audio to the listener through headset or speakers.]

Regarding Claim 14, Erten does not distinguish who or what generates the text. Sauk teaches:
14. A computer-implemented method according to Claim 13, wherein the second processed speech audio signals and the fourth speech audio signals are played in a same user session of the playback environment. [Sauk, Figure 4, shows that several speakers and listeners may be participating and their voices would be combined.  “[0029] FIG. 4 is a system overview diagram that is useful for understanding an arrangement of operation of a binaural sound system as disclosed herein. A plurality of users 109-1, 109-2, . . . 109-n are each equipped with a binaural sound system (BSS) 400. Each BSS 400 is connected to a set of headphones 108 or other sound reproducing device….”  “[0030] According to an embodiment of the invention, the BSS 400 units can be designed to operate in conjunction with one or more remote sensing devices 401….”]
Chen teaches that the speech input by a speaker is output to the listener as generated by a TTS model trained with the voice of the speaker.  There is no limitation on the number of speakers.

Regarding Claim 15, Erten teaches:
15. A computing system to:
receive first speech audio signals of a first user in a recording environment, the received first speech audio signals exhibiting sonic properties;
convert the first speech audio signals into text data;
input the text data into a trained network to generate second speech audio signals based on the text data, the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user; [Erten, Figure 2, speech that is generated by the “TTS 56” from the “text to be translated to speech 54.”]  
process the second speech audio signals based on the sonic properties and on acoustic characteristics of a playback environment; and [Erten, Figure 2, “speech enhancing filters 58 and speech enhancer 58” receive the “acoustic characteristics of a playback environment” / noise in the environment as determined by the “voice detection and noise analysis 64” to enhance the speech that is to be generated by the “TTS 56” from the “text to be translated to speech 54.”  See also Figure 3, “synthesis parameters 102” are obtained from “noise analysis 94.”]  
transmit the processed second speech audio signals to the playback environment. [Erten, Figure 2, “loudspeakers 61” playing the synthesized speech.  “[0030] … Audio signal 60 is played into environment 28, such as a vehicle interior cavity, through speakers 61.”  However, transmission through the link 36 is also taught in Figure 2.  “[0030] The EASSS, shown generally by 52, receives a text file 22. In this embodiment, wireless transmitter 36 sends text file 22 to wireless receiver 50, where text file 22 is stored in memory 54….”]
Erten does not teach where the initial text comes from.  Erten does not teach that the acoustics/sonics of the recording environment are given effect in the production of audio.
Sauk teaches:
process the second speech audio signals based on the sonic properties and on acoustic characteristics of a playback environment; [Sauk as applied to Claim 1 teaches processing the output audio by taking into consideration the spatial acoustics of both the sender and listener sides to create a virtual reality sound scene for the listener.]
Rationale for combination as provided for Claim 1.
Erten and Sauk do not teach speech recognition or a speech synthesizer trained with the voice of the speaker.
Chen teaches:
receive first speech audio signals of a first user in a recording environment, the received first speech audio signals exhibiting sonic properties; [Chen, Figures 1 and 11.  Input of voice of the Speaker 102 and the characteristics of his voice.]
convert the first speech audio signals into text data; [Chen, Figure 11, “Speech Recognition Module 1110.”]
input the text data into a trained network to generate second speech audio signals based on the text data, the network having been trained to generate audio signals from text data based at least on a training set of speech audio signals of the first user; [Chen, Figure 11, “Speech Synthesis Module 1120” with input from the “Speaker Adaptation Module 1114” generating output voice that sounds like the Speaker 102 using the samples of his voice and using the networks of HMMs shown in Figure 12.]
Rationale for combination as provided for Claim 1.

Claim 17 is a Claim with limitations similar to Claim 4 and is rejected under similar rationale.
Claim 20 is a Claim with limitations similar to Claim 6 and is rejected under similar rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Barra (U.S. 20200365137) for TTS trained for any voice. (Amazon Alexa).  Barra is directed to a speech synthesis model (text to speech) which is trained and can be retrained for different vocal attributes and uses neural networks.  “[0019] … As explained in further detail below, the speech model may include a sample model, a conditioning model, and/or an output model--which may also be referred to as a sample network, conditioning network, and/or output network, respectively--and may use causal convolutions to predict output audio ….”  This synthesizer can be trained for any voice desired which at the least suggests training for the voice of the speaker.  “[0033] The TTS storage module 295 may be customized for an individual user based on his/her individualized desired speech output. In particular, the speech unit stored in a unit database may be taken from input audio data of the user speaking. For example, to create the customized speech output of the system, the system may be configured with multiple voice inventories 278a-278n, where each unit database is configured with a different "voice" to match desired speech qualities. Such voice inventories may also be linked to user accounts….”  Figure 9, and “[0060] … The training audio 902 may be captured using a human voice, and the training text 904 may be generated using a speech-to-text system and/or by a human transcriber.”  See [0057] for use of LSTM and [0058] for use of QRNN.
For the use of “impulse responses” of both farfield and nearfield acoustic environments, see Li (U.S. 20190043482).  
Riggs (U.S. 2016/0269849) for spatial sound.
Raghavendra (U.S. 2019/0392815) for NN trained speech synthesizers.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659