Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 2, and 5-9 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhou (20200410976).

As per claim 1, Zhou (20200410976) teaches an apparatus for processing a voice signal comprising (as voice/speech synthesizer – para 0018):
 a receiver configured to receive a voice signal of a user (as receiving input from the first speech of a first speaker – para 0019); 
a memory configured to store a trained voice age conversion model (as, performing voice conversion – also defined as speech style transfer – para 0038; calculating a model via neural network – para 0041); 
and a processor configured to apply the trained voice age conversion model to the received voice signal of the user and generate a target voice signal estimated to be a voice of the user of a pre-inputted desired age (as using target speaker data – para 0050, and deriving a style from the speaker via training – para 0050; and one of the parameters is a target voice age – para 0043 – performing the voice synthesis at a second age; and the trained model applied to speaker B – see fig. 2); 
wherein the trained voice age conversion model is pretrained in a training step to receive identification information of a plurality of trainees (paragraph 0019 shows a first speaker and a second speaker; see figure 6, source speaker and a target speaker; with para 0050, multiple speakers; para 0130 – “for each of the speakers”), gender information of the plurality of trainees, and acoustic characteristic information corresponding to a voice signal of a first age of each of the plurality of trainees (as age being one of the speaker parameters – para 0076, 0077, and other speaker descriptors – para 0043 – examiner notes that it is old and notoriously well known in the art of speaker characteristics to use gender as an identifier; which would be represented by one of the 8, 16, 32, 64, 128 vectors – para 0114) and to output acoustic characteristic information corresponding to a voice signal of a second age (as voice age conversion of a first voice in a first age range and a second voice in a second age range – para 0042, 0043 back on para 0017), wherein the acoustic characteristic information comprises at least one of tone information, tone color information, fundamental frequency information, and pitch information extracted from a voice of each of the plurality of trainees (as operating on the fundamental frequency and pitch of the input voice – para 0040; see citations above showing ‘plurality of trainees’).

As per claim 2, Zhou (20200410976) teaches the apparatus according to claim 1, wherein the trained voice age conversion model is trained using supervised learning (para 0114, where the embodiment uses loss calculation in a supervised environment by using loss thresholds) and wherein the voice age conversion model is a trained model that is trained in advance in a training step by pair information that comprises acoustic characteristic information corresponding to a voice signal of a first age of each of a plurality of trainees and acoustic characteristic information corresponding to a voice signal of a second age (as voice age conversion of a first voice in a first age range and a second voice in a second age range – para 0042, 0043 back on para 0017).

As per claim 5, Zhou (20200410976) teaches the apparatus according to claim 1, wherein the voice age conversion model is a model that is based on unsupervised learning (as performing unsupervised learning based on an autoencoder – para 0102).

As per claim 6, Zhou (20200410976) teaches the apparatus according to claim 5, wherein the voice age conversion model is a variational auto encoder (VAE) model or a generative adversarial network (GAN) model (as using a variable autoencoder – para 0093). 

As per claim 7, Zhou (20200410976) teaches the apparatus according to claim 1, further comprising a speaker, wherein the processor is further configured to output the target voice signal through the speaker (as generating an output synthesized speech – fig. 1, fig. 2 – output waveform is synthesized speech).

Claims 8 and 9 are method claims whose steps are performed by the apparatus of claims 1-7 above; as such, claims 8 and 9 are similar in scope and content to claims 1-7 and are rejected under similar rationale as presented against claims 1-7 above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 10-14 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Tokuchi (20190149490) in view of Zhou (20200410976).

As per claim 10, Tokuchi (20190149490) teaches an apparatus for processing a voice signal comprising: a display configured to display an image of a user or a character corresponding to the user (as displaying an image of the speaker – para 0018, 0019, referring to an image of a speaker);
and a processor configured to, based on changing an age of the user or the character displayed on the display, control the display such that the display displays the user or the character corresponding to the changed age, wherein the processor is further configured to generate a voice signal corresponding to the user or the character of the changed age by using the voice age conversion model (as changing both the voice and the image, based on the age range setting – para 0099, 0100 – an older image is selected if the age range is older, as well as the voice is changed to lower and more stable, with a higher age range);
Tokuchi (20190149490) discusses generating a voice based on an age range, as well as the image, as discussed above, but does not explicitly teach voice signal models to modify the style of the voice/speech; however, Zhou (20200410976) teaches using target speaker data – para 0050, and deriving a style from the speaker via training – para 0050; and one of the parameters is a target voice age – para 0043 – performing the voice synthesis at a second age; with a speaker configured to output a voice signal of the user; a memory configured to store a trained voice age conversion model, and output the generated voice signal through the speaker (as generating an output synthesized speech – fig. 1, fig. 2 – output waveform is synthesized speech; using transducers – para 0020); as well as, wherein the trained voice age conversion model is pretrained in a training step to receive identification information of a plurality of trainees (paragraph 0019 shows a first speaker and a second speaker; see figure 6, source speaker and a target speaker; with para 0050, multiple speakers; para 0130 – “for each of the speakers”), gender information of the plurality of trainees, and acoustic characteristic information corresponding to a voice signal of a first age of each of the plurality of trainees (as age being one of the speaker parameters – para 0076, 0077, and other speaker descriptors – para 0043 – examiner notes that it is old and notoriously well known in the art of speaker characteristics to use gender as an identifier; which would be represented by one of the 8, 16, 32, 64, 128 vectors – para 0114) and to output acoustic characteristic information corresponding to a voice signal of a second age (as voice age conversion of a first voice in a first age range and a second voice in a second age range – para 0042, 0043 back on para 0017), wherein the acoustic characteristic information comprises at least one of tone information, tone color information, fundamental frequency information, and pitch information extracted from a voice of each of the plurality of trainees (as operating on the fundamental frequency and pitch of the input voice – para 0040; see citations above showing ‘plurality of trainees’).
Therefore, it would have been obvious to one of ordinary skill in the art of speech/image age altering to enhance the voice processing of Tokuchi (20190149490) with target speech models adjustable by age, from one speaker to another speaker, and model the information, as taught by Zhou (20200410976) because it would advantageously provide more intelligible and natural sounding speech (see Zhou, para 0050).

As per claim 11, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 10, further comprising a microphone, and wherein the processor is further configured to: 
determine a first age that is a current age of the user or the character based on the voice signal of the user inputted through the microphone (Zhou (20200410976), determining a first voice in a first age range and a second voice in a second age range – para 0042, 0043 back on para 0017);
display, on the display, an image of the user or the character corresponding to the first age (Tokuchi (20190149490) – para 0099-0100); 
and output, through the speaker, a voice signal of the user corresponding to the first age (Zhou (20200410976) – para 0020 – using transducers to output synthesized speech). 

As per claim 12, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 11, wherein the processor is further configured to: launch a predetermined application that causes the image of the user or the character to be outputted on the display; and set a second age, through the predetermined application, wherein the second age is a desired age of the user or the character (as setting/determining the second age – para 0042, 0017, through an interface display – para 0052).

As per claim 13, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 12, wherein the processor is further configured to, based on a command to change the age of the user or the character from the first age to the second age being inputted through the microphone (Tokuchi (20190149490), para 0099, responding to the user input/command) or the predetermined application: control the display such that the display displays an image corresponding to the second age; and control the speaker such that the speaker outputs a voice signal corresponding to the second age (Tokuchi (20190149490), controlling the display image and the output voice, based on the age range – para 0099-0100).

As per claim 14, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 13, wherein the trained voice age conversion model is trained using supervised learning (Zhou (20200410976), para 0114, where the embodiment uses loss calculation in a supervised environment by using loss thresholds).

As per claim 17, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 13, wherein the voice age conversion model is trained using unsupervised learning (Zhou (20200410976), as performing unsupervised learning based on an autoencoder – para 0102).

As per claim 18, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 17, wherein the voice age conversion model is a variational auto encoder (VAE) model or a generative adversarial network (GAN) model (Zhou (20200410976), as using a variable autoencoder – para 0093).

As per claim 19, the combination of Tokuchi (20190149490) in view of Zhou (20200410976) teaches the apparatus according to claim 18, wherein the processor is further configured to: extract the acoustic characteristic information by performing a discrete wavelet transform (DWT) on the voice signal of the first age of each of the plurality of trainees; and convert the acoustic characteristic information of the second age of each of the plurality of trainees into the voice signal of the second age using a Griffin-Lim algorithm (Zhou (20200410976) – see para 0070 – Zhou opens the type of recognition to well-known algorithms – examiner notes that the use of the discrete wavelet transform and the Griffin-Lim algorithm are conventional alternatives in the art).
 
Response to Arguments

Applicant's arguments filed 4/28/2022 have been fully considered but they are not persuasive.  As per applicant's arguments from the bottom of p. 9 of the response to the first 14 lines of p. 11 of the response, arguing that “Zhou’s input speech is of a target speaker, and…not the same as….receiving…of a plurality of trainees”, examiner disagrees and argues that 1) the referred-to paragraph 0019 shows a first speaker and a second speaker; 2) figure 6 shows a source speaker and a target speaker; 3) Zhou shows a plurality of speakers (para 0050, multiple speakers; para 0130 – “for each of the speakers”); and 4) applicant's own “trainees” are target speakers as well – it is a voice age conversion of an age of a speaker to a different age of the same speaker – hence, applicant's ‘target speaker’ reads on Zhou’s speakers.
As per applicant's arguments on p. 11, last paragraph, to p. 12 of the response, arguing that “at best, Zhou describes another person’s speech (speaker B’s) is input to the same content extraction block that is described above with reference to Fig. 1….acoustic characteristic information corresponding to a voice signal of a first age…to a voice signal of a second age”, examiner argues (and notes, as an aside, that this argument contradicts the previous arguments in applicant's response, which were focused on ‘same target speaker’) that, on the same drawing page as fig. 1, fig. 2 shows the use of the trained vocal model representing person A to change the input voice of B to the voice of person A.  Hence, Zhou does teach pre-training as shown in Fig. 1, and alters the voice of person B in figure 2 to the tone/pitch/age of person A.  Zhou further contemplates altering the age of the voice, as shown in para 0008/0077.  These passages show that the models are trained to be at various ages of the person, and figure 2 shows that differing voice ages can be applied to a different speaker.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  Please see related art listed on the PTO-892 form.
The following references were found towards speech style/changing synthesis:
Hirose (7912719) teaches altering voice characteristics according to age/gender, etc.:
Description Paragraph - DETX (37): For example, the voice characteristics designation unit 105 is configured by a GUI (Graphical User Interface), as shown in FIG. 3. A slider is arranged with respect to a reference axis (for example, age, gender, emotion, and the like) that can be changed for the voice characteristic of the synthesized sound, and the control value of each reference axis is designated by the position of the slider. The number of reference axes is not particularly limited.

Mori (20170076714) teaches voice synthesizing (para 0031-0032) using a fundamental frequency and age and gender (para 0035).
Engel (10068557) teaches synthesis:
(Audio synthesis is important for a large range of applications including text-to-speech (TTS) systems and music generation. Certain existing audio generation algorithms, known as vocoders in TTS and synthesizers in music, respond to higher-level control signals to create fine-grained audio waveforms. Synthesizers have a long history of being hand-designed instruments, accepting control signals such as `pitch`, `velocity`, and filter parameters to shape the tone, timbre, and dynamics of a sound. In spite of their limitations, or perhaps because of them, synthesizers have had a profound effect on the course of music and culture in the past half century.)

With the well-known Griffin-Lim algorithm, only the magnitude is modeled, and 1000 iterations of an iterative technique are used to estimate the phase – fig. 8.

Latorre (8407053) teaches the use of wavelets:
 Description Paragraph - DETX (33):
The first parameter generating unit 2114 applies a linear transform to each segment of the Log F0 obtained by the segmenting unit 2113, and outputs the parameters to the second parameterizing unit 212 and the parameter combining unit 213 that are positioned downstream. The linear transform is performed by using an invertible operator such as a discrete cosine transform, a Fourier transform, a wavelet transform, a Taylor expansion, and a polynomial expansion, e.g. Legendre polynomials. The linear-transform parameterization is generally expressed by equation (1): $PP_s = T_s^{-1} \log F0_s$ (1)
for speech synthesis:
 (A speech synthesizing device, which synthesizes speech from a text, includes three main processing units: a text analyzing unit, a prosody generating unit, and a speech signal generating unit. The text analyzing unit analyzes an input text (containing latin characters, kanji (Chinese characters), kana (Japanese characters or any other type of characters)) by using a dictionary or the like, and outputs linguistic information defining how to pronounce the text, where to put a stress, how to segment the sentence (into accentual phrases), and the like. Based on the linguistic information, the prosody generating unit outputs phonetic and prosodic information, such as a voice pitch (fundamental frequency) pattern (hereinafter, "pitch contour") and the length of each phoneme. The speech signal generating unit selects speech units in accordance with the arrangement of phonemes, connects the units together while modifying them in accordance with the prosodic information, and thereby outputs synthesized speech. It is well known that, among those three processing units, the prosody generating units that generates the pitch contour has a significant influence on the quality and naturalness of the synthesized speech.)

Zadeh (20180204111) teaches age based image progression/change (para 1827) as well as speech synthesis (para 2305).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
06/14/2022