DETAILED ACTION
This communication is in response to the Amendments and Arguments filed 07/06/2021.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Change of Examiner
The Examiner of Record for this Application has changed from Mark Hennings to Paras Shah.

Examiner Note
The Applicant’s Representative was contacted with respect to ASM through the incorporation of claims 7 and 15 on 8/09/2021. However, the Applicant requested an Office Action be sent.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3 and 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Nakai (JP 2011123141 A) in view of Celin (T. A. Mariya Celin, G. Anushiya Rachel, T. Nagarajan and P. Vijayalakshmi, "A Weighted Speaker-Specific Confusion Transducer-Based Augmentative and Alternative Speech Communication Aid for Dysarthric Speakers," in IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 2, pp. 187-197, Feb. 2019, doi: 10.1109/TNSRE.2018.2887089), Lewis (EP 1028410 A1), and Ahara (“Individuality-Preserving Voice Conversion for Articulation Disorders Using Locality-constrained NMF”).
	As per independent claim 1, Nakai teaches a device comprising: 
a phoneme database configured to store normal phoneme data (see Nakai translation page 4, fourth paragraph, which notes a consonant library 12, a vowel library 14, and a common library 16; and see Nakai translation page 4, paragraph 5, which notes the consonant library 12 stores waveform data for each type of consonant part, the vowel library 14 stores waveform data for each type of vowel part, and the common library 16 stores predetermined/normal sample waveform data for each type of consonant part, where the sample waveform data of the consonant part stored in the common library 16 is classified into male, female, child, adult and the like);
a syllable detector (see Nakai FIG. 5, which shows an SD (sound deformation) controller 30 that includes a voice discriminating unit 36) configured to 
receive the voice signal (see Nakai translation page 4, paragraph 7, which notes in (1) consonant-only replacement mode, the masky [maskee] H ′ (t) collected by the microphone Mic is converted into an audio signal, and the audio signal is input to the A / D unit 20 via a microphone amplifier (not shown). The A / D unit 20 converts an audio/voice signal that is an analog signal into a digital signal.  The voice discriminating unit 36 discriminates a consonant part and a vowel part of the voice signal by comparing the waveform of the voice signal digitized by the A / D unit 20 with a past speech voice waveform. The consonant extraction unit 32 extracts a consonant part/signal using the determination result),  
detect a position of the consonant signal (see Nakai translation, page 6 last paragraph—page 7 first full paragraph, which notes FIG. 7 is a waveform diagram showing the waveform of an audio signal representing the maskee H ′ (t). The waveform in FIG. 7 is obtained by converting the original voice “ANO KARETOWA SO-TONAGAINDAYONE ZITSUWA” into a voice signal with the microphone Mic. The vertical axis in FIG. 7 represents signal intensity in arbitrary units, and the horizontal axis represents time. In FIG. 7, each region divided by vertical broken lines corresponds to a phoneme, and the corresponding phoneme is clearly shown in Roman letters. “-” Represents a voice pause unit. The energy envelope 102 is shown as a solid line. Here, the energy envelope is obtained by multiplying a voice sample by a time constant of several tens of milliseconds in the square sound pressure region and taking a square root.   Table 1 shows vowels, consonants, and silences in FIG. A certain time before the start of voice is defined as the time origin (t = 0)), and 
generate position data based on the position of the consonant signal (see Nakai [original document] Table I of paragraph (0054) which shows starting and ending positions, sound, and timing for various consonants of the consonant signal; and see Nakai translation, page 7, second full paragraph, which notes the distinction between consonants, vowels, and silence can be determined by energy, the number of zero crossings, the first coefficient (spectral slope) of PARCOR (PARtial auto-CORrelation), and the like); and 
a voice processor (see Nakai translation, page 4, paragraph 4, which notes the SD controller unit SD includes a partial extraction unit 30, a partial change unit 90, the partial extraction unit 30, a consonant extraction unit 32, and a vowel extraction unit 34. The partial change unit 90 includes a consonant processing unit 40 and a vowel processing unit 50) electrically connected to the syllable detector (see Nakai translation, page 4, paragraph 4, which notes the SD controller unit SD includes a partial extraction unit 30, the partial extraction unit 30 includes a voice discrimination unit 36; and see Nakai page 4, paragraph 3, which notes the consonant processing unit 40 processes the consonant part extracted by the consonant extraction unit 32 in the audio signal. The consonant processing unit 40 replaces the consonant part extracted by the consonant extraction unit 32 with another consonant part having substantially the same length selected from the consonant library 12; and see Nakai page 3, paragraph 3, which notes FIG. 5 is a block diagram showing the function and configuration of the SD controller unit SD of FIG. Each block shown here can be realized in hardware by an element such as a CPU (central processing unit) or a mechanical device, and in software by a computer program or the like. Describes functional blocks realized by collaboration. Accordingly, it is understood by those skilled in the art who have touched this specification that these functional blocks can be realized in various forms by a combination of hardware and software), wherein the voice processor is in communication with the phoneme database (see Nakai FIG. 5, which shows the consonant processing unit 40 and the vowel processing unit 50 of the partial change unit 90 both in communication with the common library 16)
receive the voice signal, the position data (see Nakai translation page 3, last paragraph, which notes a customer 6 who has a conversation with a counselor is a speaker. The speaker's maskee H ′ (t) is collected by a microphone Mic provided at or near the counter portion. The maskee H ′ (t) collected by the microphone Mic is converted into an audio signal and sent to the SD controller unit SD. The consonant part of the audio signal is changed, deleted, or replaced by the SD controller unit SD. The audio signal that has undergone processing in the SD controller section SD is output as a masker H (t) from the speaker SP to the left and right adjacent booths 2 'via the power amplifier PA; see Nakai [original document] Table I of paragraph (0054) which shows starting and ending positions, sound, and timing for various consonants of the consonant signal; and see Nakai translation, page 5, paragraph 3, which notes, during this utterance start period, the consonant processing unit 40 selects the corresponding consonant part from the common library 16 and replaces it with the consonant part extracted by the consonant extraction part 32), 
search from the phoneme database the normal phoneme data (see Nakai translation page 5, first paragraph, which notes the consonant processing unit 40 processes the consonant part extracted by the consonant extraction unit 32 in the audio signal. The consonant processing unit 40 replaces the consonant part extracted by the consonant extraction unit 32 with another consonant part having substantially the same length selected from the consonant library 12);
cooperate with the syllable detector to detect a position of a normal consonant signal of the normal voice signal (see Nakai [original document] Table I of paragraph (0054) which shows starting and ending positions, sound, and timing for various consonants of the consonant signal; and see Nakai translation page 5, first paragraph, which notes the consonant processing unit 40 processes the consonant part extracted by the consonant extraction unit 32 in the audio signal. The consonant processing unit 40 replaces the consonant part extracted by the consonant extraction unit 32 with another consonant part having substantially the same length/positions selected from the consonant library 12), and 
replace the consonant signal with the normal consonant signal based on the position of the normal consonant signal and the position of the consonant signal, thereby synchronously converting the voice signal into a synthesized voice signal (see Nakai translation page 3, third and fourth full paragraphs, which notes in the embodiment of the present invention, paying attention to such aspects of speech recognition / understanding, particularly the consonant part of the original speech is changed / deleted / replaced. Further, in order to further reduce the overall volume of the processed voice (hereinafter referred to as a masker) added to the original voice (hereinafter referred to as a maskee), the following combination / ingenuity is possible.  (I) In generating a masker, the processed consonant part is output at the original timing; and see Nakai page 5, last paragraph, which notes In order to conceal information by synthesizing Masky H '(t) and Masker H (t) at the listener's 8 position, the SD processing in the SD controller unit SD must be performed in real time or near real time).
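For illustration only, the signal operations described in the cited Nakai passages (an energy envelope obtained by smoothing squared samples with a time constant of a few tens of milliseconds and taking a square root, zero-crossing counts used to help distinguish consonants from vowels and silence, and replacement of a consonant part by another part of the same length at the original timing) might be sketched as below. This is a hypothetical sketch and not Nakai's implementation: the function names, the one-pole smoothing filter, and the default time constant are assumptions.

```python
import math

def energy_envelope(samples, sample_rate, time_constant_s=0.02):
    """Smooth the squared samples, then take the square root.

    Assumption: a one-pole low-pass filter stands in for "a time constant
    of several tens of milliseconds"; the reference does not name a filter.
    """
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * sample_rate))
    smoothed = 0.0
    envelope = []
    for x in samples:
        smoothed = (1.0 - alpha) * smoothed + alpha * x * x
        envelope.append(math.sqrt(smoothed))
    return envelope

def zero_crossings(samples):
    """Count sign changes between adjacent samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def replace_segment(signal, start, end, replacement):
    """Replace signal[start:end] with a same-length segment, preserving timing."""
    if len(replacement) != end - start:
        raise ValueError("replacement must match the segment length")
    return list(signal[:start]) + list(replacement) + list(signal[end:])
```

Here `start` and `end` play the role of the position data (sample indices of a detected consonant), so the output keeps the same overall length and timing as the input.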
Nakai fails to specifically teach a device for generating synchronous corpus receiving a dysarthria voice signal having a dysarthria consonant signal, wherein the voice signal is a dysarthria voice signal; wherein the voice processor is a voice synthesizer configured to receive script data, search from the script data text corresponding to the dysarthria voice signal, search from the phoneme database the normal phoneme data corresponding to the text, convert the text into a normal voice signal based on the normal phoneme data corresponding to the text, and wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model.
	However, Celin does teach:
a device for generating synchronous corpus receiving a dysarthria voice signal having a dysarthria consonant signal, wherein the voice signal is a dysarthria voice signal (see Celin, page 189, col. 1, second full paragraph, which notes dysarthric speakers, who could not produce connected speech with ease, uttered sentences with a maximum of up to 6 words in sequences of 2 to 3 words, some words of a sentence were uttered in isolation. The corpus includes time-aligned word and phonetic transcriptions. These phonetic transcriptions are initially derived using forced Viterbi alignment procedure, as described in [19], and then manually corrected. For severe speakers, phone-level segmentation is performed based on intelligible consonants in the utterance; and see Celin page 190, col. 1 first paragraph, which notes (ii) context-dependent substitutions: Phones of this category are substituted by different phones in different contexts and in some contexts they also retain their own identity. The most common type of substitution that occurs in both corpora is substitution of a consonant in a CV (consonant-vowel) cluster by its adjacent vowel unit), and the device comprising:
a voice synthesizer (see Celin, page 195, col. 2, third full paragraph, which notes an HMM-based text-to-speech synthesis system (HTS) is used to synthesize the error corrected text from the WFST) is configured to 
search from the text corresponding to the voice signal (see Celin, page 192, col. 2, last paragraph, which notes dysarthric speech is initially recognized by a speaker-dependent continuous DSR system. The erroneous text from the DSR system is then corrected using a speaker-specific weighted finite state transducer (WFST), whose weights are computed based on the speaker-specific error analysis, discussed in Section III),
search from the phoneme database the normal phoneme data corresponding to the text (see Celin, page 194, col. 1, first full paragraph, which notes a phone confusion transducer must correct the insertion, deletion, and substitution errors that are expected to appear in a recognized text, leaving the phones (symbols) that are recognized correctly unchanged), 
convert the text into a normal voice signal based on the normal phoneme data corresponding to the text (see Celin, page 195, col. 2, third full paragraph, which notes an HMM-based text-to-speech synthesis system (HTS) is used to synthesize the error corrected text from the WFST; and see Celin, page 194, col. 1, first full paragraph, which notes a phone confusion transducer must correct the insertion, deletion, and substitution errors that are expected to appear in a recognized text, leaving the phones (symbols) that are recognized correctly unchanged),
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Nakai with speaker-specific error correction system of Celin (and selection thereof) in order to correct the speech of a specific speaker based on weighting determined by an analysis of pronunciation errors of the speaker (see Celin, page 192, col. 2, last paragraph, which notes an augmentative and alternative speech communication (AASC) aid consists of a speaker-dependent dysarthric speech recognition (DSR) system, an error-correction system, and a text-to-speech synthesis system, as shown in Fig. 3. Dysarthric speech is initially recognized by a speaker-dependent continuous DSR system. The erroneous text from the DSR system is then corrected using a speaker-specific weighted finite state transducer (WFST), whose weights are computed based on the speaker-specific error analysis, discussed in Section III. Finally, the error-corrected text is synthesized using a text-to-speech synthesis system).
The combination of Nakai with Celin includes predictable results, such as the synthesis of the corrected speech of a speaker.
The combination of Nakai with Celin fails to specifically teach a voice synthesizer is configured to receive script data, wherein the text is the script data text, and wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model.
However, Lewis does teach:
a voice synthesizer is configured to receive script data, wherein the text is the script data text (see Lewis [0025], which notes a prerequisite step in any enrolment process is preparing an enrolment script for use. In general, the enrolment script should include a thorough sampling of sounds and sound combinations. Various schemes, such as successively highlighting words as they are spoken, can be used to guide users through reading the enrolment script from a display. For non-readers and for users without access to display devices, other factors must be taken into consideration. Text for the script must be selected or composed with the variety of sounds that are helpful for initial training of the speech recognition engine. Each sentence in the script must be divided into its constituent or component phrases. Each text phrase should correspond to a linguistically complete unit, so each phrase will be easy for the user to remember. Each phrase should contain no more than one or two units to avoid exceeding user short-term memory limits. Units are linguistic components, such as prepositional phrases). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai and Celin with the script of Lewis to help ensure a thorough sampling of sounds and sound combination with which to initially train the speech recognition engine.  (see Lewis [0025], which notes a prerequisite step in any enrolment process is preparing an enrolment script for use. In general, the enrolment script should include a thorough sampling of sounds and sound combinations. Various schemes, such as successively highlighting words as they are spoken, can be used to guide users through reading the enrolment script from a display. For non-readers and for users without access to display devices, other factors must be taken into consideration. Text for the script must be selected or composed with the variety of sounds that are helpful for initial training of the speech recognition engine. Each sentence in the script must be divided into its constituent or component phrases. Each text phrase should correspond to a linguistically complete unit, so each phrase will be easy for the user to remember. Each phrase should contain no more than one or two units to avoid exceeding user short-term memory limits. Units are linguistic components, such as prepositional phrases).
The combination of Nakai and Celin with Lewis includes predictable results, such as training a speech recognition engine by an enrollee.

However, Ahara does teach wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model (see Figure 3, where source and target features provided to conversion on right side), and the voice conversion training system includes: a speech framing circuit electrically connected to the voice synthesizer and configured to receive and frame the synthesized voice signal and the dysarthria voice signal to generate synthesized speech frames and dysarthria speech frames (see Figure 3, left hand Constructing Dictionary, where the source speaker training speech is aligned with the target speaker speech in terms of frames, where each V C represents vowel and consonant frames (see sect. 2.2, 1st paragraph)); a speech feature retriever electrically connected to the speech framing circuit and configured to receive the dysarthria speech frames and the synthesized speech frames to retrieve dysarthria speech features and corresponding synthesized speech features (see Figure 3, left hand Constructing Dictionary, where source and target features are determined); and a voice conversion model trainer electrically connected to the speech feature retriever and configured to receive the dysarthria speech features and the corresponding synthesized speech features to train the voice conversion model (see Figure 3, right side Conversion where the source and target features are added to the source and target dictionary (i.e. training)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai and Celin with Lewis with the training of Ahara to improve the listening intelligibility of words uttered by a person with an articulation disorder (see Ahara, sect. 4).
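
For illustration only, the framing and feature-pairing steps mapped above (framing the synthesized and dysarthria voice signals, retrieving corresponding features, and supplying the pairs for training) might be sketched as below. This is a hypothetical sketch and does not reproduce Ahara's method: the function names and frame sizes are assumptions, log energy is a stand-in feature, and the two signals are assumed to be already time-synchronized so that frame i of one corresponds to frame i of the other.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def log_energy(frame):
    """Stand-in per-frame feature (assumption); a small offset avoids log(0)."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def paired_features(source, target, frame_len=4, hop=2):
    """Pair frame-level features of two time-synchronized signals.

    The resulting (source, target) pairs are the kind of parallel data a
    conversion model trainer would consume; the trainer itself is omitted.
    """
    src_frames = frame_signal(source, frame_len, hop)
    tgt_frames = frame_signal(target, frame_len, hop)
    return [(log_energy(s), log_energy(t)) for s, t in zip(src_frames, tgt_frames)]
```

Because the pairing is purely index-based, it only makes sense when the two signals share the same timing, which is the property the synchronous replacement step is relied upon to provide.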

As per independent claim 9, Nakai teaches a method comprising: 
receiving a voice signal having a consonant signal (see Nakai translation page 4, paragraph 7, which notes in (1) consonant-only replacement mode, the masky [maskee] H ′ (t) collected by the microphone Mic is converted into an audio signal, and the audio signal is input to the A / D unit 20 via a microphone amplifier (not shown). The A / D unit 20 converts an audio/voice signal that is an analog signal into a digital signal.  The voice discriminating unit 36 discriminates a consonant part and a vowel part of the voice signal by comparing the waveform of the voice signal digitized by the A / D unit 20 with a past speech voice waveform. The consonant extraction unit 32 extracts a consonant part/signal using the determination result),  
detecting a position of the consonant signal (see Nakai translation, page 6 last paragraph—page 7 first full paragraph, which notes FIG. 7 is a waveform diagram showing the waveform of an audio signal representing the maskee H ′ (t). The waveform in FIG. 7 is obtained by converting the original voice “ANO KARETOWA SO-TONAGAINDAYONE ZITSUWA” into a voice signal with the microphone Mic. The vertical axis in FIG. 7 represents signal intensity in arbitrary units, and the horizontal axis represents time. In FIG. 7, each region divided by vertical broken lines corresponds to a phoneme, and the corresponding phoneme is clearly shown in Roman letters. “-” Represents a voice pause unit. The energy envelope 102 is shown as a solid line. Here, the energy envelope is obtained by multiplying a voice sample by a time constant of several tens of milliseconds in the square sound pressure region and taking a square root.   Table 1 shows vowels, consonants, and silences in FIG. A certain time before the start of voice is defined as the time origin (t = 0)), and 
searching normal phoneme data (see Nakai translation page 5, first paragraph, which notes the consonant processing unit 40 processes the consonant part extracted by the consonant extraction unit 32 in the audio signal. The consonant processing unit 40 replaces the consonant part extracted by the consonant extraction unit 32 with another consonant part having substantially the same length selected from the consonant library 12);
detecting a position of a normal consonant signal of the normal voice signal (see Nakai [original document] Table I of paragraph (0054) which shows starting and ending positions, sound, and timing for various consonants of the consonant signal; and see Nakai translation page 5, first paragraph, which notes the consonant processing unit 40 processes the consonant part extracted by the consonant extraction unit 32 in the audio signal. The consonant processing unit 40 replaces the consonant part extracted by the consonant extraction unit 32 with another consonant part having substantially the same length/positions selected from the consonant library 12), and 
replacing the consonant signal with the normal consonant signal based on the position of the normal consonant signal and the position of the consonant signal thereby synchronously converting the voice signal into a synthesized voice signal (see Nakai translation page 3, third and fourth full paragraphs, which notes in the embodiment of the present invention, paying attention to such aspects of speech recognition / understanding, particularly the consonant part of the original speech is changed / deleted / replaced. Further, in order to further reduce the overall volume of the processed voice (hereinafter referred to as a masker) added to the original voice (hereinafter referred to as a maskee), the following combination / ingenuity is possible.  (I) In generating a masker, the processed consonant part is output at the original timing; and see Nakai page 5, last paragraph, which notes In order to conceal information by synthesizing Masky H '(t) and Masker H (t) at the listener's 8 position, the SD processing in the SD controller unit SD must be performed in real time or near real time).
Nakai fails to specifically teach a method for generating a synchronous corpus comprising: receiving a dysarthria voice signal having a dysarthria consonant signal, wherein the voice signal is a dysarthria voice signal; receiving script data, wherein the script data have text corresponding to the dysarthria voice signal, searching normal phoneme data corresponding to the text, converting the text into a normal voice signal based on the normal phoneme data corresponding to the text, and wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model.
	However, Celin does teach a method for generating synchronous corpus comprising:
receiving a dysarthria voice signal having a dysarthria consonant signal, wherein the voice signal is a dysarthria voice signal (see Celin, page 189, col. 1, second full paragraph, which notes dysarthric speakers, who could not produce connected speech with ease, uttered sentences with a maximum of up to 6 words in sequences of 2 to 3 words, some words of a sentence were uttered in isolation. The corpus includes time-aligned word and phonetic transcriptions. These phonetic transcriptions are initially derived using forced Viterbi alignment procedure, as described in [19], and then manually corrected. For severe speakers, phone-level segmentation is performed based on intelligible consonants in the utterance; and see Celin page 190, col. 1 first paragraph, which notes (ii) context-dependent substitutions: Phones of this category are substituted by different phones in different contexts and in some contexts they also retain their own identity. The most common type of substitution that occurs in both corpora is substitution of a consonant in a CV (consonant-vowel) cluster by its adjacent vowel unit), 
searching normal phoneme data corresponding to the text (see Celin, page 194, col. 1, first full paragraph, which notes a phone confusion transducer must correct the insertion, deletion, and substitution errors that are expected to appear in a recognized text, leaving the phones (symbols) that are recognized correctly unchanged), 
converting the text into a normal voice signal based on the normal phoneme data corresponding to the text (see Celin, page 195, col. 2, third full paragraph, which notes an HMM-based text-to-speech synthesis system (HTS) is used to synthesize the error corrected text from the WFST; and see Celin, page 194, col. 1, first full paragraph, which notes a phone confusion transducer must correct the insertion, deletion, and substitution errors that are expected to appear in a recognized text, leaving the phones (symbols) that are recognized correctly unchanged),
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Nakai with speaker-specific error correction system of Celin (and selection thereof) in order to correct the speech of a specific speaker based on weighting determined by an analysis of pronunciation errors of the speaker (see Celin, page 192, col. 2, last paragraph, which notes an augmentative and alternative speech communication (AASC) aid consists of a speaker-dependent dysarthric speech recognition (DSR) system, an error-correction system, and a text-to-speech synthesis system, as shown in Fig. 3. Dysarthric speech is initially recognized by a speaker-dependent continuous DSR system. The erroneous text from the DSR system is then corrected using a speaker-specific weighted finite state transducer (WFST), whose weights are computed based on the speaker-specific error analysis, discussed in Section III. Finally, the error-corrected text is synthesized using a text-to-speech synthesis system).
The combination of Nakai with Celin includes predictable results, such as the synthesis of the corrected speech of a speaker.
The combination of Nakai with Celin fails to specifically teach receiving script data, wherein the script data have text corresponding to the dysarthria voice signal, and wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model.
However, Lewis does teach:
receiving script data, the script data have text corresponding to the voice signal (see Lewis [0025], which notes a prerequisite step in any enrolment process is preparing an enrolment script for use. In general, the enrolment script should include a thorough sampling of sounds and sound combinations. Various schemes, such as successively highlighting words as they are spoken, can be used to guide users through reading the enrolment script from a display. For non-readers and for users without access to display devices, other factors must be taken into consideration. Text for the script must be selected or composed with the variety of sounds that are helpful for initial training of the speech recognition engine. Each sentence in the script must be divided into its constituent or component phrases. Each text phrase should correspond to a linguistically complete unit, so each phrase will be easy for the user to remember. Each phrase should contain no more than one or two units to avoid exceeding user short-term memory limits. Units are linguistic components, such as prepositional phrases). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai and Celin with the script of Lewis to help ensure a thorough sampling of sounds and sound combination with which to initially train the speech recognition engine.  (see Lewis [0025], which notes a prerequisite step in any enrolment process is preparing an enrolment script for use. In general, the enrolment script should include a thorough sampling of sounds and sound combinations. Various schemes, such as successively highlighting words as they are spoken, can be used to guide users through reading the enrolment script from a display. For non-readers and for users without access to display devices, other factors must be taken into consideration. Text for the script must be selected or composed with the variety of sounds that are helpful for initial training of the speech recognition engine. Each sentence in the script must be divided into its constituent or component phrases. Each text phrase should correspond to a linguistically complete unit, so each phrase will be easy for the user to remember. Each phrase should contain no more than one or two units to avoid exceeding user short-term memory limits. Units are linguistic components, such as prepositional phrases).
The combination of Nakai and Celin with Lewis includes predictable results, such as training a speech recognition engine by an enrollee.
The combination of Nakai and Celin with Lewis fails to specifically teach The combination of Nakai and Celin with Lewis fails to specifically teach wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model and wherein the synthesized voice signal and the dysarthria voice signal are received by a voice conversion training system to train the voice conversion model, and the voice conversion training system includes: a speech framing circuit electrically connected to the voice synthesizer and configured to receive and frame the synthesized voice signal and the dysarthria voice signal to generate synthesized speech frames and dysarthria speech frames; a speech feature retriever electrically connected to the speech framing circuit and configured to receive the dysarthria speech frames and the synthesized speech frames to retrieve dysarthria speech features and corresponding synthesized speech features; and a voice conversion model trainer electrically connected to the speech feature retriever and configured to receive the dysarthria speech features and the corresponding synthesized speech features to train the voice conversion model
However, Ahara does teach wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model (see Ahara, Figure 3, where source and target features are provided to conversion on right side) (e.g. where Celin’s synthesized signal has been interpreted as the target signal in Ahara as both are technically speech signals), and the voice conversion training system includes: a speech framing circuit electrically connected to the voice synthesizer and configured to receive and frame the synthesized voice signal and the dysarthria voice signal to generate synthesized speech frames and dysarthria speech frames (see Figure 3, left hand Constructing Dictionary, where the source speaker training speech is aligned with the target speaker speech in terms of frames, where each V C represents vowel and consonant frames (see sect. 2.2, 1st paragraph)); a speech feature retriever electrically connected to the speech framing circuit and configured to receive the dysarthria speech frames and the synthesized speech frames to retrieve dysarthria speech features and corresponding synthesized speech features (see Figure 3, left hand Constructing Dictionary, where source and target features are determined); and a voice conversion model trainer electrically connected to the speech feature retriever and configured to receive the dysarthria speech features and the corresponding synthesized speech features to train the voice conversion model (see Figure 3, right side Conversion where the source and target features are added to the source and target dictionary (i.e. training)).
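For illustration only, the framing, feature retrieval, and pairing steps recited in the limitation above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Ahara, Celin, or the Applicant: the function names, the frame length, the hop size, and the log-energy feature are all hypothetical stand-ins, and the two signals are assumed to be equally long and already time-aligned.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """Hypothetical stand-in feature: log of the mean squared amplitude."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-12)  # small offset avoids log(0) on silence

def build_parallel_pairs(source, target, frame_len=4, hop=2):
    """Pair per-frame features of two time-aligned signals by frame index."""
    src_frames = frame_signal(source, frame_len, hop)
    tgt_frames = frame_signal(target, frame_len, hop)
    n = min(len(src_frames), len(tgt_frames))
    return [(log_energy(src_frames[i]), log_energy(tgt_frames[i])) for i in range(n)]

# Toy signals (assumed data): the target is the source scaled by two.
source = [0.0, 0.1, -0.1, 0.2, -0.2, 0.1, 0.0, -0.1]
target = [0.0, 0.2, -0.2, 0.4, -0.4, 0.2, 0.0, -0.2]
pairs = build_parallel_pairs(source, target)
```

Each element of `pairs` holds one source feature and the corresponding target feature, which is the kind of parallel data a voice conversion model trainer would consume.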
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai and Celin with Lewis with the training of Ahara to improve the listening intelligibility of words uttered by a person with an articulation disorder (see Ahara, sect. 4).

As per claims 2 and 10, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claims 1 and 9 above. 
Nakai does not specifically teach the device for generating synchronous corpus according to claim 1, wherein the voice synthesizer is configured to convert the text into the normal voice signal using a text to speech (TTS) technology.
However, Celin does teach the device for generating synchronous corpus according to claim 1, wherein the voice synthesizer is configured to convert the text into the normal voice signal using a text to speech (TTS) technology (see Celin FIG. 3 (page 193), which shows a text-to-speech system for converting error corrected text (with normal consonants) into ordered synthesized speech).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by Nakai with speaker-specific error correction system of Celin (and selection thereof) in order to correct the speech of a specific speaker based on weighting determined by an analysis of pronunciation errors of the speaker (see Celin, page 192, col. 2, last paragraph, which notes an augmentative and alternative speech communication (AASC) aid consists of a speaker-dependent dysarthric speech recognition (DSR) system, an error-correction system, and a text-to-speech synthesis system, as shown in Fig. 3. Dysarthric speech is initially recognized by a speaker-dependent continuous DSR system. The erroneous text from the DSR system is then corrected using a speaker-specific weighted finite state transducer (WFST), whose weights are computed based on the speaker-specific error analysis, discussed in Section III. Finally, the error-corrected text is synthesized using a text-to-speech synthesis system).
The combination of Nakai with Celin includes predictable results, such as the synthesis of the corrected speech of a speaker.

As per claims 3 and 11, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claims 1 and 9 above. 
Nakai teaches the device for generating synchronous corpus according to claim 1, wherein the phoneme database is a consonant database and the normal phoneme data are normal consonant data (see Nakai translation page 4, fourth paragraph, which notes a consonant library 12, a vowel library 14, and a common library 16; and see Nakai translation page 4, paragraph 5, which notes the consonant library 12 stores waveform data for each type of consonant part, the vowel library 14 stores waveform data for each type of vowel part, and the common library 16 stores predetermined/normal sample waveform data for each type of consonant part, where the sample waveform data of the consonant part stored in the common library 16 is classified into male, female, child, adult and the like).


Claims 4 and 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Nakai in view of Celin, Lewis, and Ahara and in further view of He (He, L., Wang, X., Zhang, J. et al. Automatic detection of consonant omission in cleft palate speech. Int J Speech Technol 22, 59–65 (2019). https://doi.org/10.1007/s10772-018-09570-w).
As per claim 4, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claim 1 above.  As noted above with respect to claim 1, Celin does teach the consonant signal is a dysarthria consonant signal.
Nakai in view of Celin, Lewis, and Ahara does not specifically teach wherein the syllable detector is configured to detect the positions of the normal consonant signal and the dysarthria consonant signal using an autocorrelation function or a deep neural network (DNN).
However, He does teach wherein the syllable detector is configured to detect the positions of the normal consonant signal and the consonant signal using an autocorrelation function or a deep neural network (DNN) (see He, FIG. 1 (page 61), which shows syllables with voiced initials/consonant and which shows consonant omission/dysarthric consonants; see He, page 60, paragraph spanning col. 1-2, which notes consonant omission is one of the most common articulation disorders. In Mandarin, each syllable contains two parts: initial (consonant) and final. For the CP patients, their defective soft palates result in the lack of air pressure for the pronunciation of initials, therefore the consonant omission occurs; and see He, page 60, col. 1, third full paragraph, which notes Calculation of short-time autocorrelation: The short-time autocorrelation waveform can represent the similar periodicity of voiced speech frame. Thus, short-time autocorrelation waveform is adopted to distinguish voiced initials and finals).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the voiced/unvoiced ZCR-based initials detection of He in order to efficiently classify initials as either voiced or unvoiced (see He, page 64, col. 1, second full paragraph, which notes when the consonant omission occurs, only the final exists in a speech syllable. In Mandarin, all the finals and only a few of initials are voiced, while most initials are unvoiced.  The voiced/unvoiced classification could be efficiently achieved by calculating ZCRs. Thus, the proposed methods efficiently classify the syllables into two types firstly through calculating ZCRs: Category I is the syllables with unvoiced initials; Category II is the syllables with voiced initials or consonant omission. In this step, the threshold T is chosen as 50. The effect of threshold T to the detection accuracy is tested. The accuracies of consonant omission detection are calculated, when the threshold T is chosen in a certain range 30–70. The experiment results show that as long as T is chosen in a certain range, the accuracy of this system will not decrease. Since in the following processing steps, a time domain waveform difference analysis is applied, which could separate unvoiced initials from finals as well, by calculating their waveform differences; and see He, page 64, col. 1, second full paragraph, which notes the classification between voiced initials and finals is a difficulty in Mandarin speech processing researches).
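For illustration only, the two measurements discussed in the passages of He cited above (zero-crossing counts for voiced/unvoiced separation, and short-time autocorrelation for periodicity) can be sketched as follows. This is a minimal sketch under stated assumptions, not He's implementation: the function names, the lag range, and the synthetic test frames are hypothetical, and no decision threshold from He is reproduced here.

```python
import math
import random

def zero_crossing_count(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def autocorrelation(frame, lag):
    """Short-time autocorrelation of one frame at a given lag."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

def normalized_peak(frame, min_lag, max_lag):
    """Largest autocorrelation over a lag range, divided by the lag-0 energy."""
    energy = autocorrelation(frame, 0)
    if energy == 0:
        return 0.0
    peak = max(autocorrelation(frame, lag) for lag in range(min_lag, max_lag + 1))
    return peak / energy

# Synthetic frames (assumed data, not taken from He):
# a periodic frame stands in for voiced speech, a noise frame for unvoiced speech.
periodic = [math.sin(2 * math.pi * n / 20 + 0.3) for n in range(200)]
rng = random.Random(0)
noisy = [rng.uniform(-1.0, 1.0) for _ in range(200)]

for name, frame in (("periodic", periodic), ("noisy", noisy)):
    print(name, zero_crossing_count(frame), round(normalized_peak(frame, 10, 30), 2))
```

The periodic frame yields few zero crossings and a strong autocorrelation peak near its period, while the noise frame yields many zero crossings and a weak peak, which is the contrast the cited passages rely on.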


As per claim 12, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claim 9 above.  As noted above with respect to claim 9, Celin does teach the consonant signal is a dysarthria consonant signal.
Nakai in view of Celin, Lewis, and Ahara does not specifically teach wherein in the step of detecting the position of the consonant signal, an autocorrelation function or a deep neural network (DNN) is used to detect the position of the consonant signal.
However, He does teach wherein the syllable detector is configured to detect the positions of the normal consonant signal and the consonant signal using an autocorrelation function or a deep neural network (DNN) (see He, FIG. 1 (page 61), which shows syllables with voiced initials/consonant and which shows consonant omission/dysarthric consonants; see He, page 60, paragraph spanning col. 1-2, which notes consonant omission is one of the most common articulation disorders. In Mandarin, each syllable contains two parts: initial (consonant) and final. For the CP patients, their defective soft palates result in the lack of air pressure for the pronunciation of initials, therefore the consonant omission occurs; and see He, page 60, col. 1, third full paragraph, which notes Calculation of short-time autocorrelation: The short-time autocorrelation waveform can represent the similar periodicity of voiced speech frame. Thus, short-time autocorrelation waveform is adopted to distinguish voiced initials and finals).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the voiced/unvoiced ZCR-based initials detection of He in order to efficiently classify initials as either voiced or unvoiced (see He, page 64, col. 1, second full paragraph, which notes when the consonant omission occurs, only the final exists in a speech syllable. In Mandarin, all the finals and only a few of initials are voiced, while most initials are unvoiced.  The voiced/unvoiced classification could be efficiently achieved by calculating ZCRs. Thus, the proposed methods efficiently classify the syllables into two types firstly through calculating ZCRs: Category I is the syllables with unvoiced initials; Category II is the syllables with voiced initials or consonant omission. In this step, the threshold T is chosen as 50. The effect of threshold T to the detection accuracy is tested. The accuracies of consonant omission detection are calculated, when the threshold T is chosen in a certain range 30–70. The experiment results show that as long as T is chosen in a certain range, the accuracy of this system will not decrease. Since in the following processing steps, a time domain waveform difference analysis is applied, which could separate unvoiced initials from finals as well, by calculating their waveform differences; and see He, page 64, col. 1, second full paragraph, which notes the classification between voiced initials and finals is a difficulty in Mandarin speech processing researches).


As per claim 13, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claim 9 above.  As noted above with respect to claim 9, Celin does teach the consonant signal is a dysarthria consonant signal.
Nakai in view of Celin, Lewis, and Ahara does not specifically teach wherein in the step of detecting the position of the normal consonant signal, an autocorrelation function or a deep neural network (DNN) is used to detect the position of the normal consonant signal.
However, He does teach wherein in the step of detecting the position of the normal consonant signal, an autocorrelation function or a deep neural network (DNN) is used to detect the position of the normal consonant signal (see He, FIG. 1 (page 61), which shows syllables with voiced initials/consonant and which shows consonant omission/dysarthric consonants; see He, page 60, paragraph spanning col. 1-2, which notes consonant omission is one of the most common articulation disorders. In Mandarin, each syllable contains two parts: initial (consonant) and final. For the CP patients, their defective soft palates result in the lack of air pressure for the pronunciation of initials, therefore the consonant omission occurs; and see He, page 60, col. 1, third full paragraph, which notes Calculation of short-time autocorrelation: The short-time autocorrelation waveform can represent the similar periodicity of voiced speech frame. Thus, short-time autocorrelation waveform is adopted to distinguish voiced initials and finals).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the voiced/unvoiced ZCR-based initials detection of He in order to efficiently classify initials as either voiced or unvoiced (see He, page 64, col. 1, second full paragraph, which notes when the consonant omission occurs, only the final exists in a speech syllable. In Mandarin, all the finals and only a few of initials are voiced, while most initials are unvoiced.  The voiced/unvoiced classification could be efficiently achieved by calculating ZCRs. Thus, the proposed methods efficiently classify the syllables into two types firstly through calculating ZCRs: Category I is the syllables with unvoiced initials; Category II is the syllables with voiced initials or consonant omission. In this step, the threshold T is chosen as 50. The effect of threshold T to the detection accuracy is tested. The accuracies of consonant omission detection are calculated, when the threshold T is chosen in a certain range 30–70. The experiment results show that as long as T is chosen in a certain range, the accuracy of this system will not decrease. Since in the following processing steps, a time domain waveform difference analysis is applied, which could separate unvoiced initials from finals as well, by calculating their waveform differences; and see He, page 64, col. 1, second full paragraph, which notes the classification between voiced initials and finals is a difficulty in Mandarin speech processing researches).

Claims 5-6 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Nakai in view of Celin, Lewis, and Ahara and in further view of Kerimovska (US 8340966 B2).
As per claims 5 and 14, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claims 1 and 9 above.  As noted above with respect to claims 1 and 9, Celin teaches the voice signal is a dysarthria voice signal.
Ahara further teaches:
wherein the synthesized voice signal and the voice signal are provided to train a voice conversion model (see Figure 3, where source and target features provided to conversion on right side).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai and Celin with Lewis with the training of Ahara to improve the listening intelligibility of words uttered by a person with an articulation disorder (see Ahara, sect. 4).
Nakai in view of Celin, Lewis, and Ahara does not specifically teach further comprising a voice smoothing circuit electrically connected to the voice synthesizer and configured to receive the synthesized voice signal and filter out noise of the synthesized voice signal.
However, Kerimovska does teach a voice smoothing circuit electrically connected to the voice synthesizer and configured to receive the synthesized voice signal and filter out noise of the synthesized voice signal (see Kerimovska, col. 4, lines 31-37, which notes the TTS circuit receives data to be read through its input port, e.g. ASCII characters, converts it into spoken audio and sends it to an analog output. A typical circuit comprises a text processor, a smoothing filter and multilevel memory storage array. The voice and audio signals are stored in the memory in their natural, uncompressed form, which provides a good voice reproduction quality, so that the synthesized voice signal is “a filtered the synthesized voice signal”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the filtered storage of speech output of the TTS circuit of Kerimovska in a natural, uncompressed form in order to provide a good voice reproduction quality (see Kerimovska, col. 4, lines 31-37, which notes the TTS circuit receives data to be read through its input port, e.g. ASCII characters, converts it into spoken audio and sends it to an analog output. A typical circuit comprises a text processor, a smoothing filter and multilevel memory storage array. The voice and audio signals are stored in the memory in their natural, uncompressed form, which provides a good voice reproduction quality).
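For illustration only, one common way to smooth a sampled signal is a moving-average filter, sketched below. Kerimovska does not specify how its smoothing filter is built, so this is a generic example under stated assumptions and not Kerimovska's circuit: the function name, the odd window size, and the edge handling are hypothetical choices.

```python
def moving_average(samples, window=3):
    """Smooth a sample sequence with a centered moving average.

    The window is assumed to be odd; near the edges the average is taken
    over the samples that are actually available.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed

# A single spike (assumed data) is spread out and reduced in amplitude.
spike = [0, 0, 3, 0, 0]
print(moving_average(spike, window=3))
```

A wider window removes more high-frequency noise at the cost of blurring fast transitions such as consonant onsets.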


As per claim 6, the combination of Nakai, Celin, Lewis, Ahara, and Kerimovska above teaches all of the limitations of claim 5 above.  As noted above with respect to claim 5, Kerimovska teaches wherein the voice smoothing circuit is a filter signal (see Kerimovska, col. 4, lines 31-37, which notes the TTS circuit receives data to be read through its input port, e.g. ASCII characters, converts it into spoken audio and sends it to an analog output. A typical circuit comprises a text processor, a smoothing filter and multilevel memory storage array. The voice and audio signals are stored in the memory in their natural, uncompressed form, which provides a good voice reproduction quality, so that the synthesized voice signal is “a filtered the synthesized voice signal”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the filtered storage of speech output of the TTS circuit of Kerimovska in a natural, uncompressed form in order to provide a good voice reproduction quality (see Kerimovska, col. 4, lines 31-37, which notes the TTS circuit receives data to be read through its input port, e.g. ASCII characters, converts it into spoken audio and sends it to an analog output. A typical circuit comprises a text processor, a smoothing filter and multilevel memory storage array. The voice and audio signals are stored in the memory in their natural, uncompressed form, which provides a good voice reproduction quality).
The combination of Nakai, Celin, Lewis, and Ahara with Kerimovska includes predictable results, such as the output of synthesized speech that is of a good voice reproduction quality.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Nakai in view of Celin, Lewis, and Ahara and in further view of Foss (US 20090245695 A1).
As per claim 8, Nakai in view of Celin, Lewis, and Ahara teaches all of the limitations of claim 1 above.
Nakai in view of Celin, Lewis, and Ahara does not specifically teach further comprising a text scanner electrically connected to the voice synthesizer and configured to scan a script to generate the script data.
However, Foss does teach further comprising a text scanner electrically connected to the voice synthesizer and configured to scan a script to generate the script data (see Foss [0015], which notes FIG. 2 shows a block diagram of a reading apparatus 110 to scan an image 201 from a multiplicity of images to be read to the user in accordance with some embodiments. Reader 110 generally comprises a processor 204, user interface 206, camera 208, camera control logic (CCL) 209, memory 210, and an auditory output device 212, coupled together as shown; see Foss [0019], which notes processor 204, CCL 209, and memory 210 may comprise any suitable combination of memory and processing circuits, components, or combinations of the same to implement processing engines to control the reader 110. For example, the memory could comprise read only memory (ROM) components, random access memory (RAM) components and non-volatile RAM such as flash memory or one or more hard drive devices. In some embodiments, CCL (camera control logic) employing separate processing logic, e.g., using a programmable logic device, separate from the processor 204 may be used to provide increased processing capability to control the camera and to appropriately transfer captured images to the processor. It may also function or assist in providing viewed images or image portions to the processor, e.g., in furtherance of a multiple image (or bulk) capture routine to determine if an image is ready; and see Foss [0020], which notes the memory 210 comprises device control (DC) software code 211 to control the reader 110 and execute its various functions such as text-to-speech (TTS), optical character recognition (OCR)/scanning script, characterization, reading navigation, system functionality, user interface control, and the like. With relevance to this disclosure, it also may comprise a bulk capture (BC) module 213 for controlling the capture of multiple images, as discussed herein. (It should be appreciated that the BC functionality may be performed via software, by processor 204 and/or by another processor, or it may be performed in whole or in part using separate logic such as CCL 209. In addition, there may be more modules and in some embodiments, the modules may not necessarily inter-relate with each other as shown)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed inventions to modify the systems and methods as taught by the combination of Nakai, Celin, Lewis, and Ahara with the scanning of Foss in order to capture multiple images from a text source in a convenient manner for the user (see Foss [0014], which notes the reading device 110 has a user interface comprising a display 114 along with sensors, transducers, and/or other instruments to allow a user to control the device to scan (capture) the one or more images from the text source in the image area 103. For example, the reading device 110 has buttons to allow a user to initiate a bulk capture operational mode to capture multiple images in a convenient manner for the user).
The combination of Nakai, Celin, Lewis, and Ahara with Foss includes predictable results, such as the convenient scanning of bulk material.

Allowable Subject Matter
Claims 7 and 15 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  The closest prior art of Ahara discloses a voice conversion system as cited above. However, Ahara teaches away from use of a GMM model. Therefore, none of the cited art, either alone or in combination, teaches the limitations as recited in claims 7 and 15.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARAS D SHAH whose telephone number is (571) 270-1650. The examiner can normally be reached on Monday-Thursday 7:00 am-5:00 pm. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre-Louis Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




08/09/2021