DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 3/21/22 have been fully considered but they are not persuasive. 
Regarding the 35 U.S.C. 102 rejection of claim 1 over Gao, as well as the 35 U.S.C. 103 rejection of similar independent claims 10 and 16 over at least Gao, Applicant argues that Gao merely describes enabling the transfer of emotion-related characteristics of a speech signal by extracting and recombining the content code of source speech and the style of a target emotion, without a disclosure or a suggestion of an autoencoder; that fig. 4 and other portions of Gao fail to disclose the limitation “combining an output of the content encoder and an output of the target speaker encoder as input data into a decoder of the style transfer autoencoder, an output of the decoder providing the content information of the input source speech data as adapted to a style of the target speaker”; and that the claims are not anticipated by Gao (Amendment, pg. 7, section A – pg. 10, first full para.). Examiner respectfully disagrees.
First, contrary to Applicant’s assertion that Gao does not describe the use of an encoder, i.e., the previously claimed “style transfer autoencoder system”, Gao discloses an architecture based on style transfer autoencoders including the use of at least one encoder/autoencoder (fig. 1; sec. 4.2; sec. 5), and corresponding to the currently 1 and x2 (i.e., source and target utterances) that are each respectively decomposable/disentangled into emotion-invariant content codes c1 and c2 and emotion-dependent style codes s1 and s2 (sec. 1; sec. 3.1; sec. 3.2), where, in order to transfer the emotional content of the target utterance to the source utterance (Abstract; sec. 1), source content ci is extracted from the decomposed/disentangled first utterance using a content encoder (figure 4), target style sj is extracted from the decomposed/disentangled target utterance using a style encoder (figure 4), and the extracted source content ci and the target style sj are fed into the Decoder (see figure 4; sec. 4.2), corresponding to the claimed limitation “combining an output of the content encoder and an output of the target speaker encoder as input data into a decoder of the style transfer autoencoder, an output of the decoder providing the content information of the input source speech data as adapted to a style of the target speaker” as required by claim 1 and similar claims 10 and 16.
Applicant also argues that Gao merely shows a speech encoder model with partially shared latent space and that it is not clear how the cited portions of Gao describe source speaker disentanglement, and as such, argues that Gao fails to disclose limitation “wherein the source speaker disentanglement results from a predetermined design of an information bottleneck in the transfer autoencoder system” as required in claim 4 (Amendment, pg. 10-11, sec. 2).
Examiner respectfully disagrees. Gao discloses a network architecture including a content encoder/autoencoder, a style encoder/autoencoder, a decoder and a discriminator (fig. 4), where the architecture performs speech conversion by receiving wherein the source speaker disentanglement results from a predetermined design of an information bottleneck in the transfer autoencoder system”. Furthermore, because the hidden layer/bottleneck/interior of Gao’s autoencoder disentangles the source speech into content information and style information, it meets the claimed predetermined design of the information bottleneck and, as a result, the argued limitation.
Applicant further argues that the rejection based on anticipation (i.e., the rejection of claim 1) is improper as Gao does not provide an identical invention to the claimed invention (Amendment, pg. 11, sec. 3).
Examiner respectfully disagrees for the reasons presented above. Furthermore, a recitation of the intended use of the claimed invention must result in a structural difference between the claimed invention and the prior art in order to patentably distinguish the claimed invention from the prior art. Since Gao is capable of performing the intended use as argued above and as presented in the rejection of at least claim 1, it meets the claim language, and as such, Examiner maintains the rejection of the claim.
Regarding the 35 U.S.C. 103 rejection of claims 10 and 16 as well as the rest of the dependent claims with additional references Jia and Narayanan, Applicant argues that Jia and Narayanan fail to cure the above alleged deficiency of Gao for claim 1, and as  (Amendment, pg. 11, sec. 4 – pg. 12, sec. B).
Examiner respectfully disagrees, as neither Jia nor Narayanan was applied to teach the above alleged deficiencies of Gao in claims 10 or 16, even if Jia describes a subset or more of the language required in the independent claims (see rejection of at least claim 2). Absent any argument as to why the cited portions of the references fail to disclose the limitations recited in the dependent claims, Examiner maintains that the rejections of the claims are appropriate.

Response to Amendment
The prior objection to claim 8 (12/20/21) is hereby withdrawn in light of amendments to the claim.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



1.        Claims 1, 4 and 5 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gao et al., “Nonparallel Emotional Speech Conversion” (“Gao”).
         Per Claim 1, Gao discloses a method of voice conversion capable of a zero-shot voice conversion with non-parallel data, the method comprising: 
              receiving source speaker speech data as input data into a content encoder of a transfer autoencoder system, the content encoder providing a source speaker disentanglement of the source speaker speech data by reducing speaker related information of the input source speech data while retaining content information (fig. 4; sec. 1; sec. 2.3; sec. 3.1; sec. 3.2; The autoencoders take 24-dimentional MCEPs as input and learn disentangled representations of content and style. In the content encoder, instance normalization (IN) [31] removes the original feature mean and variance that represent emotional style information…, sec. 4.2; sec. 5, content encoder (fig. 4) as claimed content encoder, source speaker style information removed from source speech that includes speaker style information and speaker content as reducing speaker related information and retaining content information);
            receiving target speaker input speech as input data into a target speaker encoder (Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion …, Abstract; sec. 1; Let x1 ∈ X1 and x2 ∈ X2 be utterances drawn from two different emotional categories…, sec. 3.1; sec. 3.2; Style encoder, fig. 4; The autoencoders take 24-dimentional MCEPs as input …In the style encoder, the emotional characteristics are encoded by a 3-layer MLP that outputs channel-wise mean and variance µ(s); σ(s). Then they are fed into the decoder to reconstruct MCEP features. The desired emotion is added through an style encoder (fig. 4) as target speaker encoder, emotional utterance in different domains as including target speaker input speech); and 
            combining an output of the content encoder and an output of the target speaker encoder as input data into a decoder of the transfer autoencoder (Decoder, fig. 4; In conversion stage, we extract content code of the source speech and recombine it with style code of the target emotion…., sec. 1; Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion…, Abstract; sec. 4.2; output ci of content encoder (fig. 4) and output sj of style encoder/target speaker encoder (fig. 4) as combined at Decoder (fig. 4)), 
             an output of the decoder providing the content information of the input source speech data as adapted to the target speaker (fig. 4; It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker’s identity and linguistic content…., Abstract; sec. 4.2).
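The instance normalization and recombination steps quoted from Gao in the mapping above can be sketched as follows. This is a minimal illustration in plain Python, assuming features arranged as a list of frames with one value per channel; the function names `instance_norm` and `apply_style` and the numeric values are hypothetical and are not taken from Gao, whose encoders and decoder are trained neural networks rather than fixed formulas.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Remove the per-channel mean and variance computed over time.

    frames: list of T frames, each a list of C channel values.
    Returns (normalized_frames, channel_means, channel_stds).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        values = [frame[c] for frame in frames]
        mean = sum(values) / num_frames
        var = sum((v - mean) ** 2 for v in values) / num_frames
        means.append(mean)
        stds.append(math.sqrt(var + eps))
    normalized = [
        [(frame[c] - means[c]) / stds[c] for c in range(num_channels)]
        for frame in frames
    ]
    return normalized, means, stds


def apply_style(content, style_means, style_stds):
    """Re-scale normalized content with channel-wise style statistics."""
    return [
        [style_stds[c] * frame[c] + style_means[c] for c in range(len(frame))]
        for frame in content
    ]


# Hypothetical 3-frame, 2-channel source features.
source = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
content, _, _ = instance_norm(source)

# Hypothetical style statistics standing in for a style encoder's output.
converted = apply_style(content, style_means=[0.0, 100.0], style_stds=[2.0, 1.0])
```

After normalization each channel of `content` has approximately zero mean and unit variance, so the source statistics no longer appear in it; `converted` carries the statistics supplied as the style instead.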
         Per Claim 4, Gao discloses the method of claim 1,
             Gao discloses wherein the source speaker disentanglement results from a predetermined design of an information bottleneck in the transfer autoencoder system (fig. 1; fig. 4; The autoencoders take 24-dimentional MCEPs as input and learn disentangled representations of content and style. In the content encoder, instance normalization (IN) [31] removes the original feature mean and variance that represent emotional style information…, sec. 4.2). 
          Per Claim 5, Gao discloses the method of claim 4, 
downsampling performed in hidden layer/bottleneck of network architecture of fig. 4, downsampling defined in Applicant’s original specification as dimension reduction along the temporal axis and dimension reduction along the channel axis).
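The notion of downsampling as dimension reduction along the temporal axis and along the channel axis can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: frame striding and channel-group averaging stand in for whatever learned layers an actual encoder uses, and the helper names `downsample_time` and `reduce_channels` are hypothetical.

```python
def downsample_time(frames, stride):
    """Temporal-axis reduction: keep every `stride`-th frame."""
    return frames[::stride]


def reduce_channels(frames, out_channels):
    """Channel-axis reduction: average adjacent groups of channels.

    Assumes the channel count is divisible by out_channels.
    """
    group = len(frames[0]) // out_channels
    return [
        [sum(frame[i * group:(i + 1) * group]) / group for i in range(out_channels)]
        for frame in frames
    ]


# Hypothetical input: 8 frames with 4 channels each.
features = [[float(t + c) for c in range(4)] for t in range(8)]

# Narrow the representation along both axes to form a bottleneck code.
code = reduce_channels(downsample_time(features, stride=2), out_channels=2)
```

Here an 8 x 4 input becomes a 4 x 2 code, so less information can pass through the bottleneck than entered it.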

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

2.      Claims 2, 3 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Gao in view of Jia et al., “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” (“Jia”).
Per Claim 2, Gao discloses the method of claim 1, wherein the content encoder comprises a first neural network, the target speaker encoder comprises a second neural network, and the decoder comprises a third neural network (The encoders and decoders are implemented with 1D-CNNs to capture the temporal dependencies…, sec. 4.2; fig. 4), the method further comprising: 
subsequently training the first neural network of the content encoder in combination with the third neural network of the decoder in a self-reconstruction training 
Gao does not explicitly disclose initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker or the target speaker encoder that has been pre-trained using the target speaker speech information
However, these features are taught by Jia:
initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker (sec. 2.1); and 
the target speaker encoder that has been pre-trained using the target speaker speech information (sec. 2.1)
            It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jia with the method of Gao in arriving at “initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker” and “the target speaker encoder that has been pre-trained using the target speaker speech information”, because such combination would have resulted in conditioning a synthesis network on a reference speech signal from a desired target speaker (Jia, sec. 2.1).
Per Claim 3, Gao in view of Jia discloses the method of claim 2, 

Per Claim 16, Gao discloses a method for transferring a style of voice utterances, as capable of a zero-shot voice conversion with non-parallel data, the method comprising: 
operating an autoencoder system first in a training mode, the autoencoder system comprising a content encoder comprising a second neural network that compresses original input data from an input layer into a shorter code and a decoder comprising a third neural network that learns to un-compress the shorter code to closely match the original input data (fig. 4; sec. 1; sec. 3.1; sec. 3.2; sec. 4.2; sec. 5), 
    the training mode comprising a self-reconstruction training using speech inputs from a source speaker into the content encoder and into the target speaker encoder, the self-reconstruction training thereby training the second neural network and the third neural network to adapt to a style of the target speaker (Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion…, Abstract; sec. 2.2; fig. 4; sec. 4.2; sec. 5); and 
            operating the autoencoder system in a conversion mode in which utterances of a source speaker provide source speech utterances in a style of the target speaker (To 
Gao does not explicitly disclose preliminarily training a first neural network in a target speaker encoder, using speech information of a target speaker, the first neural network being trained to maximize an embedding similarity among different utterances of the target speaker and minimize similarities with other speakers or the target speaker encoder that has been preliminarily trained using target speaker speech information
 However, these features are taught by Jia:
 preliminarily training a first neural network in a target speaker encoder, using speech information of a target speaker, the first neural network being trained to maximize an embedding similarity among different utterances of the target speaker and minimize similarities with other speakers (sec. 2.1)
the target speaker encoder that has been preliminarily trained using target speaker speech information (sec. 2.1), 
            It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Jia with the method of Gao in arriving at “preliminarily training a first neural network in a target speaker encoder, using speech information of a target speaker, the first neural network being trained to maximize an 
Per Claim 17, Gao in view of Jia discloses the method of claim 16, 
                 Gao discloses wherein the content encoder is configured to provide at least a source speaker disentanglement of input source speech data by reducing speaker style information of the input source speech data as a predetermined result of the configuring (fig. 1; fig. 4; sec. 1; A basic idea is to find disentangled representations that can independently model image content and style..., sec. 2.3; The autoencoders take 24-dimentional MCEPs as input and learn disentangled representations of content and style. In the content encoder, instance normalization (IN) [31] removes the original feature mean and variance that represent emotional style information…, sec. 4.2)
Per Claim 18, Gao in view of Jia discloses the method of claim 17, 
    Gao discloses wherein the reducing of speaker style information results from a specifically-designed bottleneck implemented into a configuration of the autoencoder system (fig. 1; fig. 4; sec. 4.2). 

3.        Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Gao in view of Narayanan US 2019/0304480 A1 (“Narayanan”).
          Per Claim 10, Gao discloses a style transfer autoencoder system, comprising: 
content encoder (fig. 4) as claimed content encoder);
a target speaker encoder for receiving target speaker speech information, the target speaker encoder comprising a second neural network that is trainable (fig. 1; We jointly train the encoders, decoders and GAN’s discriminators with multiple losses…., sec. 3.3; Style encoder, fig. 4; sec. 4.2; sec. 5, style encoder (fig. 4) as target speaker encoder); and
            a decoder receiving output data from the content encoder and output data from the target speaker encoder, the decoder providing as output speech information as comprising a content of a source speech utterance in a style of the target speaker, the decoder comprising a third neural network that is trainable, wherein the content encoder is configured with parameter settings in a dimension axis and in a temporal axis so as to achieve a speaker disentanglement of the received source speech information (fig. 1; To address these issues, we propose a nonparallel training method. Instead of learning one-to-one mapping between paired emotional utterances (x1; x2), we switch to training a conversion model between two emotional domains (X1;X2)…In conversion stage, we extract content code of the source speech and recombine it with style code of the target emotion…., sec. 1; We jointly train the encoders, decoders and GAN’s discriminators with multiple losses…., sec. 3.3; Decoder, fig. 4, downsampling performed in hidden layer/bottleneck of network architecture of fig. 4, downsampling defined in Applicant’s original specification as dimension reduction along the temporal axis and dimension reduction along the channel axis, output ci of content encoder (fig. 4) and output sj of style encoder/target speaker encoder (fig. 4) as combined at Decoder (fig. 4)), 
            the speaker disentanglement meaning that a style aspect of a source speech utterance is limited by a bottleneck caused by the parameter settings, leaving thereby a content aspect of the source speech utterance to be input data into the decoder (fig. 1; fig. 4; sec. 1; A basic idea is to find disentangled representations that can independently model image content and style..., sec. 2.3; The autoencoders take 24-dimentional MCEPs as input and learn disentangled representations of content and style. In the content encoder, instance normalization (IN) [31] removes the original feature mean and variance that represent emotional style information…, sec. 4.2, source speaker style information removed from source speech that includes speaker style information and speaker content as limiting style aspect/information and retaining content aspect/information)
 Gao does not explicitly disclose a processor or a memory accessible to the processor that stores machine-readable instructions permitting the processor to implement the style transfer autoencoder system as comprising: 
However, these features are taught by Narayanan:
 a processor (para. [0023]); and 
 a memory accessible to the processor that stores machine-readable instructions permitting the processor to implement the style transfer autoencoder system as comprising (Abstract; para. [0041]; para. [0073]-[0074])
             It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Narayanan with the system of Gao in .

4.        Claims 8, 9, 10 and 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Gao in view of Jia and Narayanan 
Per Claim 8, Gao discloses the method of claim 1, wherein the content encoder comprises a first neural network, the target speaker encoder comprises a second neural network, and the decoder comprises a third neural network (fig. 4; sec. 2.3; The encoders and decoders are implemented with gated CNN…, sec. 3.2; The encoders and decoders are implemented with 1D-CNNs to capture the temporal dependencies…, sec. 4.2), the method further comprising: 

Gao does not explicitly disclose initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker or the target speaker encoder that has been pre-trained using the target speaker speech information
However, these features are taught by Jia:
initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker (sec. 2.1); and 
the target speaker encoder that has been pre-trained using the target speaker speech information (sec. 2.1)
            It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Jia with the method of Gao in arriving at “initially pre-training the second neural network of the target speaker encoder using speech information of the target speaker” and “the target speaker encoder that has been pre-trained using the target speaker speech information”, because such 
   Gao in view of Jia does not explicitly disclose the method as embodied in a set of machine-readable instructions implementing the transfer autoencoder system
  However, this feature is taught by Narayanan (Abstract; para. [0041]; para. [0073]-[0074]) 
             It would have been obvious to one of ordinary skill in the art before the effective filing of the invention to combine the teachings of Narayanan with the method of Gao in arriving at “the method as embodied in a set of machine-readable instructions implementing the transfer autoencoder system”, because such combination would have resulted in preventing a complete overhaul/update of the existing system when new data is available for measurement, by easily removing previous data and introducing new data and preventing a recompilation of the entire computing system, thereby allowing for algorithms that can be used by multiple applications.
      Per Claim 9, Gao in view of Jia and Narayanan discloses the method of claim 8, 
 Narayanan discloses the method as implemented on a server on a network (Abstract; para. [0041]-[0042]; para. [0073]-[0074])
Per Claim 11, Gao in view of Narayanan discloses the style transfer autoencoder system of claim 10, 
Gao discloses wherein the content encoder and the decoder comprise an autoencoder system in which the content encoder compresses original input data from an input layer into a short code and the decoder learns to un-compress that short code Autoencoders as performing encoding/compression and decoding/un-compression learning). 
Gao in view of Narayanan does not explicitly disclose wherein the target speaker encoder is preliminarily trained to maximize embedding similarities among different utterances of the target speaker and minimize similarities with other speakers
However, this feature is taught by Jia (sec. 2.1)
            It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jia with the system of Gao in view of Narayanan in arriving at “wherein the target speaker encoder is preliminarily trained to maximize embedding similarities among different utterances of the target speaker and minimize similarities with other speakers”, because such combination would have resulted in conditioning a synthesis network on a reference speech signal from a desired target speaker (Jia, sec. 2.1).
   Per Claim 12, Gao in view of Narayanan discloses the style transfer autoencoder system of claim 10, 
   Gao in view of Narayanan does not explicitly disclose wherein the target speaker encoder is preliminarily trained to maximize embedding similarities among different utterances of the target speaker and minimize similarities with other speakers 
 However, this feature is taught by Jia (The network is trained to optimize a generalized end-to-end speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity, while those of utterances from different speakers are far apart in the embedding space…, sec. 2.1).
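The cosine-similarity property described in the quoted passage of Jia can be sketched as follows. This is a minimal illustration in plain Python: the three-dimensional embedding vectors are invented for the example and are far smaller than any real speaker embedding, and the code only measures similarity; it does not implement the speaker verification loss itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: two utterances of speaker A, one of speaker B.
speaker_a_utt1 = [0.9, 0.1, 0.0]
speaker_a_utt2 = [0.8, 0.2, 0.1]
speaker_b_utt1 = [0.0, 0.2, 0.9]

same_speaker = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
different_speaker = cosine_similarity(speaker_a_utt1, speaker_b_utt1)
```

A speaker encoder trained as Jia describes would be expected to yield a high value for `same_speaker` and a low value for `different_speaker`, which is the relationship these example vectors are constructed to show.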

Per Claim 13, Gao in view of Narayanan and Jia discloses the style transfer autoencoder system of claim 12, 
  Gao discloses the style transfer autoencoder system as selectively operable first in a training mode and second in a conversion mode, wherein: in the training mode, the first neural network in the content encoder and the third network in the decoder are trained in a self-reconstruction training that uses speech inputs from the source speaker into the content encoder and into the target speaker encoder, the self-reconstruction training thereby training the combination of the content encoder and the decoder to adapt to a style of the target speaker (fig. 1; fig. 2; fig. 4; sec. 4.2; sec. 4.3); and in the conversion mode, the decoder converts the content aspect of the source speech utterance as an utterance with a style aspect of the target speaker (fig. 1; fig. 2; fig. 4; sec. 4.2; sec. 4.3).
Jia discloses the target speaker encoder that has been pre-trained using the target speaker speech information (sec. 2.1).

Allowable Subject Matter
Claims 6, 7, 14, 15, 19 and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO 892 form.
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUJIMI A ADESANYA whose telephone number is (571)270-3307.  The examiner can normally be reached on 8:30-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on 571-272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/OLUJIMI A ADESANYA/Primary Examiner, Art Unit 2658