DETAILED ACTION
Introduction
	This office action is in response to Applicant’s submission filed on 9/29/2020. Claims 1-20 are pending in this application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant cannot rely upon the certified copy of the foreign priority application to overcome this rejection because a translation of said application has not been made of record in accordance with 37 CFR 1.55. See MPEP §§ 215 and 216.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 9/29/2020 and 7/2/2021 were filed. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-8, 10-17, 19, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang (US 10192163 B2) (hereinafter “Wang”).

Regarding Claim 1, Wang teaches an electronic apparatus, comprising: a memory configured to store a first artificial intelligence model (Wang Column 13 Lines 12-21 and Figure 6 - As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU) 601, which may execute various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 602 or a program loaded into a random access memory (RAM) 603 from a storage portion 608. The RAM 603 also stores various programs and data required by operations of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.);
and a processor connected to the memory and configured to (Wang Column 14 Lines 5-15 - The units or modules involved in the embodiments of the present application may be implemented by means of software or hardware. The described units or modules may also be provided in a processor, for example, described as: a processor, comprising a first converting unit, an extracting unit, a determining unit, and a second converting unit, where the names of these units or modules do not in some cases constitute a limitation to such units or modules themselves. For example, the first converting may also be described as “a unit for converting a to-be-processed audio to a to-be-processed picture.”):
based on receiving an input audio signal, obtain an input frequency spectrum image representing a frequency spectrum of the input audio signal (Wang Column 5 Lines 64-67 and Column 6 Lines 1-8 -  In the present embodiment, an electronic device (e.g., the terminal device or server as illustrated in FIG. 1) on which the audio processing method is operated may convert a to-be-processed audio to a to-be-processed picture. The to-be-processed audio may be recorded by a user through a terminal with a recording function, or may be an excerpt of audio that has been stored locally or in the cloud. The to-be-processed picture may be an audiogram, a spectrum, or a spectrogram of the to-be-processed audio, or a picture obtained by performing graphic transformation on the audiogram, the spectrum, or the spectrogram. The picture may be obtained by using digital audio editors.),
input the input frequency spectrum image to the first artificial intelligence model (Wang Column 3 Lines 6-11 - the first converting unit comprises: a dividing subunit, configured to divide the to-be-processed audio into audio clips at a preset interval; and a to-be-processed picture determining subunit, configured to determine an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
obtain an output frequency spectrum image from the first artificial intelligence model (Wang Column 3 Lines 11-20 - the extracting unit comprises: an input subunit, configured to input the to-be-processed picture into a pre-trained convolutional neural network, the convolutional neural network being used for extracting an image characteristic; and a content characteristic determining subunit, configured to determine a matrix output by at least one convolutional layer in the convolutional neural network as the content characteristic of the to-be-processed picture),
and obtain an output audio signal based on the output frequency spectrum image (Wang Column 6 Lines 29-34 - inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture.),
wherein the first artificial intelligence model is trained based on a target learning image (Wang Column 7 Lines 23-30 -  During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.),
and wherein the target learning image represents a target frequency spectrum of a specific style, and is obtained from a second artificial intelligence model based on a random value (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.).

Regarding Claim 2, Wang teaches all of the limitations of claim 1. Wang also teaches that the target learning image is obtained from the second artificial intelligence model based on the random value and a condition value corresponding to the specific style (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.),
wherein the second artificial intelligence model is trained to obtain a plurality of target learning images representing a plurality of target frequency spectrums of a plurality of styles based on a plurality of condition values corresponding to the plurality of styles (Wang Column 10 Lines 47-59 -  In order to make the difference between different styles more prominent, the Gram matrix of the style characteristic of the template picture and the Gram matrix of the style characteristic of the initial target picture may be respectively determined. The Gram matrix of the style characteristic may be the inner product of different convolution slices in the given convolutional layer. The loss function is then determined based on the determined Gram matrix. Likewise, the distance between the Gram matrices of the initial target picture and the to-be-processed picture output from a plurality of convolutional layers may also be weighted average to determine the loss function, and the specific weight may be set as needed.).
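For illustration only, the Gram-matrix comparison described in the cited passage of Wang (Column 10 Lines 47-59) can be sketched as follows. This is a minimal sketch, not Wang's implementation: the function names `gram_matrix` and `style_loss`, the flattened list-of-channels layout, and the use of a mean squared difference as the distance are all assumptions made for the example.

```python
def gram_matrix(feature_maps):
    """Return the Gram matrix of a list of flattened feature maps.

    feature_maps is a list of C lists, each holding the N activations of
    one convolution slice (channel). Entry [i][j] of the result is the
    inner product of slice i and slice j.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]


def style_loss(template_features, target_features):
    """Mean squared difference between the two Gram matrices (assumed metric)."""
    g_template = gram_matrix(template_features)
    g_target = gram_matrix(target_features)
    n = len(g_template) * len(g_template[0])
    return sum((gt - gx) ** 2
               for row_t, row_x in zip(g_template, g_target)
               for gt, gx in zip(row_t, row_x)) / n


# Hypothetical 2-channel activations for a template picture and a target picture.
template = [[1.0, 0.0], [0.0, 2.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
print(gram_matrix(template))         # [[1.0, 0.0], [0.0, 4.0]]
print(style_loss(template, target))  # 2.25
```

In this toy case the two pictures differ only in the energy of the second channel, so the loss comes entirely from the lower-right Gram entry.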

Regarding Claim 3, Wang teaches all of the limitations of claim 1. Wang also teaches that the memory is further configured to store a plurality of first artificial intelligence models (Wang Column 9 Lines 66-67 and Column 10 Line 1 - Since there are many convolutional kernels in a CNN, there will be many output matrices. Likewise, there are many convolutional layers.),
wherein the plurality of first artificial intelligence models are trained based on different target learning images (Wang Column 6 Lines 66-67 and Column 7 Lines 1-13 - In some alternative implementations of the present embodiment, the style characteristic is determined through the following steps: inputting the template picture into a pre-trained CNN, the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the style characteristic of the template picture. The CNN to which the template picture is input may be identical to or different from the CNN to which the to-be-processed picture is input. After multi-layer convolution abstraction, the picture will loss the pixel-level characteristic, while retaining an advanced painting style. That is, the output of the high convolutional layer is more abstract compared to the output of the low convolutional layer, thus may be used to extract the style characteristic.),
and wherein the different target learning images are obtained from the second artificial intelligence model (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.).

Regarding Claim 4, Wang teaches all of the limitations of claim 3. Wang also teaches that the different target learning images are obtained by modifying a weight value of at least one layer from among a plurality of layers included in the second artificial intelligence model (Wang Column 10 Lines 62-67 and Column 11 Lines 1-3 - In the present embodiment, the electronic device may determine the total loss function based on the content loss function determined in step 404 and the style loss function determined in step 405. The total loss function may be obtained based on the weighted sum of the content loss function and the style loss function. By adjusting the weight of the content loss function and the weight of the style loss function, it may be determined that whether the target picture is more style-emphasized or content-emphasized.).
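For illustration only, the weighted-sum total loss described in the cited passage of Wang (Column 10 Lines 62-67 and Column 11 Lines 1-3) can be sketched as follows. The names `mse` and `total_loss`, the weight values, and the sample numbers are assumptions for the example. The mean square error is used for the content loss because Wang names it as one option (Column 10 Lines 17-28).

```python
def mse(a, b):
    """Mean square error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(content_loss, style_loss, content_weight, style_weight):
    """Weighted sum of the content loss and the style loss."""
    return content_weight * content_loss + style_weight * style_loss


# Hypothetical content characteristics of the to-be-processed picture
# and of the initial target picture.
content = mse([1.0, 2.0], [1.0, 4.0])   # (0 + 4) / 2 = 2.0
style = 0.5                             # assumed style-loss value

print(total_loss(content, style, content_weight=1.0, style_weight=10.0))  # 7.0
print(total_loss(content, style, content_weight=10.0, style_weight=1.0))  # 20.5
```

Raising `style_weight` relative to `content_weight` makes the style term dominate the sum, which corresponds to the style-emphasized case described in the passage, and the reverse holds for the content-emphasized case.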

Regarding Claim 5, Wang teaches all the limitations of claim 4. Wang also teaches that the modified weight value is obtained by multiplying a feature vector with the weight value of the at least one layer (Wang Column 10 Lines 62-67 and Column 11 Lines 1-3 - In the present embodiment, the electronic device may determine the total loss function based on the content loss function determined in step 404 and the style loss function determined in step 405. The total loss function may be obtained based on the weighted sum of the content loss function and the style loss function. By adjusting the weight of the content loss function and the weight of the style loss function, it may be determined that whether the target picture is more style-emphasized or content-emphasized.).

Regarding Claim 6, Wang teaches all the limitations of claim 1. Wang also teaches that the first artificial intelligence model comprises a Convolutional Neural Network (CNN) (Wang Column 6 Lines 27-45 - In some alternative implementations of the present embodiment, the extracting a content characteristic of the to-be-processed picture may comprise: inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture. The CNN is a feedforward neural network whose artificial neurons may respond to surrounding units within a part of the coverage area, and has an excellent performance at large-scale image processing. It includes a convolutional layer and a pooling layer. The CNN may complete object identification by extracting an abstract characteristic of an object by multi-layer convolution. Therefore, the content characteristic of the to-be-processed picture may be extracted by the CNN. The pre-trained CNN may use a Visual Graphics Generator (VGG) model, a Deep Residual Network (ResNet) model, etc. as a model for extracting the image characteristic.),
and wherein the second artificial intelligence model comprises a Generative Adversarial Network (GAN) (Wang Column 7 Lines 14-30 - In some alternative implementations of the present embodiment, the determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture may comprise: importing the content characteristic of the to-be-processed picture to a preset style transfer model, and acquiring an output of the style transfer model as the target picture. The style transfer model may be a Generative Adversarial Network (GAN) model. The GAN includes a generation model and a discrimination model. During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.).

Regarding Claim 7, Wang teaches all the limitations of claim 1. Wang also teaches that the specific style is classified according to at least one from among a type of instrument, a type of emotion, or a processing method of an image (Wang Column 8 Line 67 and Column 9 Lines 1-10 - Since the effect of the audio processing is difficult to display, here the picture processing is used to denote the audio processing to produce an intuitive visual effect. FIG. 3A is a to-be-processed picture, i.e., a picture providing a content characteristic. FIG. 3B is a template picture, i.e., a picture providing a style characteristic. FIG. 3C is a target picture, i.e., a picture after style transfer. The content characteristic of the target picture is similar to the content characteristic of the to-be-processed picture, and the style characteristic of the target picture is similar to the style characteristic of the template picture.).

Regarding Claim 8, Wang teaches all the limitations of claim 1. Wang also teaches that the processor is further configured to: divide the input audio signal into a plurality of sections having a predetermined length (Wang Column 6 Lines 9-14 - In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
obtain a plurality of input frequency spectrum images representing a plurality of frequency spectrums corresponding to the plurality of sections (Wang Column 6 Lines 9-14 - In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
input the plurality of input frequency spectrum images to the first artificial intelligence model (Wang Column 10 Lines 17-28 - The content loss function may be obtained based on the mean square error of the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture, and may also be obtained based on other computational methods that can represent the difference between the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture. Since the CNN divides the picture into a number of convolutional slices when extracting the characteristic, the determining the content loss function is to be performed on the slices at the given position in the initial target picture and the to-be-processed picture.),
obtain a plurality of output frequency spectrum images from the first artificial intelligence model (Wang Column 10 Lines 17-28 - The content loss function may be obtained based on the mean square error of the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture, and may also be obtained based on other computational methods that can represent the difference between the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture. Since the CNN divides the picture into a number of convolutional slices when extracting the characteristic, the determining the content loss function is to be performed on the slices at the given position in the initial target picture and the to-be-processed picture.),
obtain a final output image by stitching the plurality of output frequency spectrum images (Wang Column 6 Lines 4-14 - The to-be-processed picture may be an audiogram, a spectrum, or a spectrogram of the to-be-processed audio, or a picture obtained by performing graphic transformation on the audiogram, the spectrum, or the spectrogram. The picture may be obtained by using digital audio editors. In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
and obtain the output audio signal based on the final output image (Wang Column 8 Lines 37-45 - In the present embodiment, the electronic device may convert the target picture determined in step 203 to a processed audio. As similar to step 201, the electronic device may also convert the target picture to a processed audio by using some digital audio editors. In addition, the electronic device may store the processed audio locally, upload the processed audio to the cloud or send the processed audio to other electronic devices, and may also directly output the processed audio.).
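For illustration only, the sequence recited in claim 8 (dividing the audio into sections of a predetermined length, obtaining a frequency spectrum image per section, and stitching the per-section images) can be sketched as follows. This is a minimal sketch under stated assumptions: every function name is hypothetical, a naive discrete Fourier transform stands in for whatever transform is actually used, the model step between input and output images is omitted, and the conversion of the final image back to audio is not shown.

```python
import cmath
import math


def split_into_clips(samples, clip_len):
    """Divide samples into consecutive clips of clip_len, dropping a short tail."""
    return [samples[i:i + clip_len]
            for i in range(0, len(samples) - clip_len + 1, clip_len)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def clip_to_image(clip, frame_len):
    """Spectrogram 'image' of one clip: one spectrum column per frame."""
    return [magnitude_spectrum(f) for f in split_into_clips(clip, frame_len)]


def stitch_images(images):
    """Join per-clip images end to end along the time axis."""
    return [column for image in images for column in image]


# Hypothetical input: 16 samples of a cosine with a period of 4 samples.
signal = [math.cos(2 * math.pi * t / 4) for t in range(16)]
clips = split_into_clips(signal, clip_len=8)
images = [clip_to_image(c, frame_len=4) for c in clips]
final_image = stitch_images(images)
print(len(clips), len(final_image), len(final_image[0]))  # 2 4 3
```

Each 4-sample frame holds exactly one period of the cosine, so the energy of every column falls in bin 1, and the stitched image preserves the time order of the original clips.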

Regarding Claim 10, Wang teaches a control method of an electronic apparatus, the method comprising: based on receiving an input audio signal, obtaining an input frequency spectrum image representing a frequency spectrum of the input audio signal (Wang Column 5 Lines 64-67 and Column 6 Lines 1-8 -  In the present embodiment, an electronic device (e.g., the terminal device or server as illustrated in FIG. 1) on which the audio processing method is operated may convert a to-be-processed audio to a to-be-processed picture. The to-be-processed audio may be recorded by a user through a terminal with a recording function, or may be an excerpt of audio that has been stored locally or in the cloud. The to-be-processed picture may be an audiogram, a spectrum, or a spectrogram of the to-be-processed audio, or a picture obtained by performing graphic transformation on the audiogram, the spectrum, or the spectrogram. The picture may be obtained by using digital audio editors.),
inputting the input frequency spectrum image to the first artificial intelligence model stored in the electronic apparatus (Wang Column 3 Lines 6-11 - the first converting unit comprises: a dividing subunit, configured to divide the to-be-processed audio into audio clips at a preset interval; and a to-be-processed picture determining subunit, configured to determine an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.);
obtaining an output frequency spectrum image from the first artificial intelligence model (Wang Column 3 Lines 11-20 - the extracting unit comprises: an input subunit, configured to input the to-be-processed picture into a pre-trained convolutional neural network, the convolutional neural network being used for extracting an image characteristic; and a content characteristic determining subunit, configured to determine a matrix output by at least one convolutional layer in the convolutional neural network as the content characteristic of the to-be-processed picture);
and obtaining an output audio signal based on the output frequency spectrum image (Wang Column 6 Lines 29-34 - inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture.),
wherein the first artificial intelligence model is trained based on a target learning image (Wang Column 7 Lines 23-30 -  During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.),
and wherein the target learning image represents a target frequency spectrum of a specific style, and is obtained from a second artificial intelligence model based on a random value (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.).

Regarding Claim 11, Wang teaches all of the limitations of claim 10. Wang also teaches that the target learning image is obtained from the second artificial intelligence model based on the random value and a condition value corresponding to the specific style (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.),
wherein the second artificial intelligence model is trained to obtain a plurality of target learning images representing a plurality of target frequency spectrums of a plurality of styles based on a plurality of condition values corresponding to the plurality of styles (Wang Column 10 Lines 47-59 -  In order to make the difference between different styles more prominent, the Gram matrix of the style characteristic of the template picture and the Gram matrix of the style characteristic of the initial target picture may be respectively determined. The Gram matrix of the style characteristic may be the inner product of different convolution slices in the given convolutional layer. The loss function is then determined based on the determined Gram matrix. Likewise, the distance between the Gram matrices of the initial target picture and the to-be-processed picture output from a plurality of convolutional layers may also be weighted average to determine the loss function, and the specific weight may be set as needed.).

Regarding Claim 12, Wang teaches all the limitations of claim 10. Wang also teaches that the memory is further configured to store a plurality of first artificial intelligence models (Wang Column 9 Lines 66-67 and Column 10 Line 1 - Since there are many convolutional kernels in a CNN, there will be many output matrices. Likewise, there are many convolutional layers.),
wherein the plurality of first artificial intelligence models are trained based on different target learning images (Wang Column 6 Lines 66-67 and Column 7 Lines 1-13 - In some alternative implementations of the present embodiment, the style characteristic is determined through the following steps: inputting the template picture into a pre-trained CNN, the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the style characteristic of the template picture. The CNN to which the template picture is input may be identical to or different from the CNN to which the to-be-processed picture is input. After multi-layer convolution abstraction, the picture will loss the pixel-level characteristic, while retaining an advanced painting style. That is, the output of the high convolutional layer is more abstract compared to the output of the low convolutional layer, thus may be used to extract the style characteristic.),
and wherein the different target learning images are obtained from the second artificial intelligence model (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.).

Regarding Claim 13, Wang teaches all of the limitations of claim 12. Wang also teaches that the different target learning images are obtained by modifying a weight value of at least one layer from among a plurality of layers included in the second artificial intelligence model (Wang Column 10 Lines 62-67 and Column 11 Lines 1-3 - In the present embodiment, the electronic device may determine the total loss function based on the content loss function determined in step 404 and the style loss function determined in step 405. The total loss function may be obtained based on the weighted sum of the content loss function and the style loss function. By adjusting the weight of the content loss function and the weight of the style loss function, it may be determined that whether the target picture is more style-emphasized or content-emphasized.).

Regarding Claim 14, Wang teaches all the limitations of claim 13. Wang also teaches that the modified weight value is obtained by multiplying a feature vector with the weight value of the at least one layer (Wang Column 10 Lines 62-67 and Column 11 Lines 1-3 - In the present embodiment, the electronic device may determine the total loss function based on the content loss function determined in step 404 and the style loss function determined in step 405. The total loss function may be obtained based on the weighted sum of the content loss function and the style loss function. By adjusting the weight of the content loss function and the weight of the style loss function, it may be determined that whether the target picture is more style-emphasized or content-emphasized.).
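For illustration, the weighted-sum relationship described in the cited passage of Wang (Column 10 Lines 62-67 and Column 11 Lines 1-3) may be sketched as follows. This is a minimal sketch only. The function name total_loss, the parameter names, and the numeric loss values are hypothetical placeholders that do not appear in Wang.

```python
def total_loss(content_loss, style_loss, content_weight=1.0, style_weight=1.0):
    """Weighted sum of a content loss and a style loss.

    All names here are assumptions made for illustration; Wang only states
    that the total loss may be based on such a weighted sum.
    """
    return content_weight * content_loss + style_weight * style_loss


# Placeholder loss values (assumed, not taken from Wang).
content_loss = 2.0
style_loss = 5.0

# Raising the style weight makes the style term dominate the total loss,
# which corresponds to a more style-emphasized target picture.
style_emphasized = total_loss(content_loss, style_loss,
                              content_weight=1.0, style_weight=10.0)

# Raising the content weight has the opposite effect.
content_emphasized = total_loss(content_loss, style_loss,
                                content_weight=10.0, style_weight=1.0)
```

In this sketch, only the relative size of the two weights matters for whether the style term or the content term dominates.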

Regarding Claim 15, Wang teaches all the limitations of claim 10. Wang also teaches that the first artificial intelligence model comprises a Convolutional Neural Network (CNN) (Wang Column 6 Lines 27-45 - In some alternative implementations of the present embodiment, the extracting a content characteristic of the to-be-processed picture may comprise: inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture. The CNN is a feedforward neural network whose artificial neurons may respond to surrounding units within a part of the coverage area, and has an excellent performance at large-scale image processing. It includes a convolutional layer and a pooling layer. The CNN may complete object identification by extracting an abstract characteristic of an object by multi-layer convolution. Therefore, the content characteristic of the to-be-processed picture may be extracted by the CNN. The pre-trained CNN may use a Visual Graphics Generator (VGG) model, a Deep Residual Network (ResNet) model, etc. as a model for extracting the image characteristic.),
and wherein the second artificial intelligence model comprises a Generative Adversarial Network (GAN) (Wang Column 7 Lines 14-30 - In some alternative implementations of the present embodiment, the determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture may comprise: importing the content characteristic of the to-be-processed picture to a preset style transfer model, and acquiring an output of the style transfer model as the target picture. The style transfer model may be a Generative Adversarial Network (GAN) model. The GAN includes a generation model and a discrimination model. During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.).
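For illustration, the passage of Wang cited above (Column 6 Lines 27-45), in which a matrix output by a convolutional layer is taken as the content characteristic, may be sketched as follows. This is a minimal sketch assuming a single hand-written kernel in place of a pre-trained CNN. The function conv2d_valid, the toy picture, and the kernel values are hypothetical and do not appear in Wang.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D array without padding (cross-correlation),
    which is the basic operation of a convolutional layer."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 4x4 "to-be-processed picture" (assumed values).
picture = np.arange(16, dtype=float).reshape(4, 4)

# Assumed kernel that responds to left-to-right increases in intensity.
kernel = np.array([[-1.0, 1.0]])

# The layer output after a ReLU non-linearity serves as the feature matrix.
feature_map = np.maximum(conv2d_valid(picture, kernel), 0.0)
```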

Regarding Claim 16, Wang teaches all the limitations of claim 10. Wang also teaches that the specific style is classified according to at least one from among a type of instrument, a type of emotion, or a processing method of an image (Wang Column 8 Line 67 and Column 9 Lines 1-10 - Since the effect of the audio processing is difficult to display, here the picture processing is used to denote the audio processing to produce an intuitive visual effect. FIG. 3A is a to-be-processed picture, i.e., a picture providing a content characteristic. FIG. 3B is a template picture, i.e., a picture providing a style characteristic. FIG. 3C is a target picture, i.e., a picture after style transfer. The content characteristic of the target picture is similar to the content characteristic of the to-be-processed picture, and the style characteristic of the target picture is similar to the style characteristic of the template picture.).

Regarding Claim 17, Wang teaches all the limitations of claim 10. Wang also teaches dividing the input audio signal into a plurality of sections having a predetermined length (Wang Column 6 Lines 9-14 - In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
obtaining a plurality of input frequency spectrum images representing a plurality of frequency spectrums corresponding to the plurality of sections (Wang Column 6 Lines 9-14 - In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
inputting the plurality of input frequency spectrum images to the first artificial intelligence model (Wang Column 10 Lines 17-28 - The content loss function may be obtained based on the mean square error of the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture, and may also be obtained based on other computational methods that can represent the difference between the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture. Since the CNN divides the picture into a number of convolutional slices when extracting the characteristic, the determining the content loss function is to be performed on the slices at the given position in the initial target picture and the to-be-processed picture.),
obtaining a plurality of output frequency spectrum images from the first artificial intelligence model (Wang Column 10 Lines 17-28 - The content loss function may be obtained based on the mean square error of the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture, and may also be obtained based on other computational methods that can represent the difference between the content characteristic of the to-be-processed picture and the content characteristic of the initial target picture. Since the CNN divides the picture into a number of convolutional slices when extracting the characteristic, the determining the content loss function is to be performed on the slices at the given position in the initial target picture and the to-be-processed picture.),
obtaining a final output image by stitching the plurality of output feature spectrum images (Wang Column 6 Lines 4-14 - The to-be-processed picture may be an audiogram, a spectrum, or a spectrogram of the to-be-processed audio, or a picture obtained by performing graphic transformation on the audiogram, the spectrum, or the spectrogram. The picture may be obtained by using digital audio editors. In some alternative implementations of the present embodiment, the converting a to-be-processed audio to a to-be-processed picture may comprise: dividing the to-be-processed audio into audio clips at a preset interval; and determining an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
and obtaining the output audio signal based on the final output image (Wang Column 8 Lines 37-45 - In the present embodiment, the electronic device may convert the target picture determined in step 203 to a processed audio. As similar to step 201, the electronic device may also convert the target picture to a processed audio by using some digital audio editors. In addition, the electronic device may store the processed audio locally, upload the processed audio to the cloud or send the processed audio to other electronic devices, and may also directly output the processed audio.).
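For illustration, the steps recited in claim 17 (dividing the audio into sections of a predetermined length, obtaining a spectrum per section, stitching the per-section results, and converting back to audio) may be sketched as follows. This is a minimal sketch under stated assumptions: the model step is replaced by an identity placeholder, complex spectra are kept so that the conversion back to audio is exact, and every function name and numeric value is hypothetical rather than taken from Wang or from the claims.

```python
import numpy as np


def split_into_sections(signal, section_len):
    """Divide a 1-D signal into sections of section_len samples,
    zero-padding the final section if needed."""
    signal = np.asarray(signal, dtype=float)
    n_sections = -(-len(signal) // section_len)  # ceiling division
    padded = np.zeros(n_sections * section_len)
    padded[:len(signal)] = signal
    return padded.reshape(n_sections, section_len)


def section_spectrum(section):
    """Frequency spectrum of one section (complex values, one row)."""
    return np.fft.rfft(section)


def stitch(spectra):
    """Stack per-section spectra, in order, into one 2-D array."""
    return np.stack(spectra, axis=0)


def image_to_audio(final_image, section_len, original_len):
    """Invert each row back to the time domain and concatenate the rows."""
    sections = np.fft.irfft(final_image, n=section_len, axis=1)
    return sections.reshape(-1)[:original_len]


# Assumed test signal: 2500 samples of a 440 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(2500) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

section_len = 1024  # assumed predetermined section length
sections = split_into_sections(audio, section_len)
spectra = [section_spectrum(s) for s in sections]

# Placeholder for the model: each spectrum is passed through unchanged.
outputs = [spectrum for spectrum in spectra]

final_image = stitch(outputs)
restored = image_to_audio(final_image, section_len, len(audio))
```

Because the placeholder leaves each spectrum unchanged, the restored signal matches the input, which confirms that sectioning and stitching lose no samples in this sketch.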

Regarding Claim 19, Wang teaches a non-transitory computer-readable recording medium configured to store instructions which, when executed by at least one processor of an electronic apparatus, cause the at least one processor to (Wang Column 14 Lines 26-31 - In another aspect, the present application further provides a non-volatile computer-readable storage medium. The non-volatile computer-readable storage medium may be the non-volatile computer storage medium included in the apparatus in the above described embodiments, or a stand-alone non-volatile computer-readable storage medium not assembled into the apparatus. The non-volatile computer-readable storage medium stores one or more programs. The one or more programs, when executed by a device, cause the device to: convert a to-be-processed audio to a to-be-processed picture; extract a content characteristic of the to-be-processed picture; determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture, the style characteristic being obtained from a template picture converted from a template audio; and convert the target picture to a processed audio.):
based on receiving an input audio signal, obtain an input frequency spectrum image representing a frequency spectrum of the input audio signal (Wang Column 5 Lines 64-67 and Column 6 Lines 1-8 - In the present embodiment, an electronic device (e.g., the terminal device or server as illustrated in FIG. 1) on which the audio processing method is operated may convert a to-be-processed audio to a to-be-processed picture. The to-be-processed audio may be recorded by a user through a terminal with a recording function, or may be an excerpt of audio that has been stored locally or in the cloud. The to-be-processed picture may be an audiogram, a spectrum, or a spectrogram of the to-be-processed audio, or a picture obtained by performing graphic transformation on the audiogram, the spectrum, or the spectrogram. The picture may be obtained by using digital audio editors.),
input the input frequency spectrum image to the first artificial intelligence model (Wang Column 3 Lines 6-11 - the first converting unit comprises: a dividing subunit, configured to divide the to-be-processed audio into audio clips at a preset interval; and a to-be-processed picture determining subunit, configured to determine an audiogram, a spectrum, or a spectrogram of the audio clips as the to-be-processed picture.),
obtain an output frequency spectrum image from the first artificial intelligence model (Wang Column 3 Lines 11-20 - the extracting unit comprises: an input subunit, configured to input the to-be-processed picture into a pre-trained convolutional neural network, the convolutional neural network being used for extracting an image characteristic; and a content characteristic determining subunit, configured to determine a matrix output by at least one convolutional layer in the convolutional neural network as the content characteristic of the to-be-processed picture),
and obtain an output audio signal based on the output frequency spectrum image (Wang Column 6 Lines 29-34 - inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture.),
wherein the first artificial intelligence model is trained based on a target learning image (Wang Column 7 Lines 23-30 - During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.),
and wherein the target learning image represents a target frequency spectrum of a specific style, and is obtained from a second artificial intelligence model based on a random value (Wang Column 11 Lines 51-56, Column 6 Lines 50-65, and Figure 2 - In the present embodiment, the specific processing of the first converting unit 510, the extracting unit 520, the determining unit 530 and the second converting unit 540 may refer to the detailed descriptions to the steps 201, 202, 203, and 204 in the corresponding embodiment in FIG. 2, detailed description thereof will be omitted. Step 203, determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture. In the present embodiment, the electronic device may determine a target picture based on a style characteristic and the content characteristic of the to-be-processed picture extracted in step 202. The style characteristic is obtained from a template picture converted from a template audio, and the template audio may be preset. The user may choose according to his preference, for example, the template audio may be an excerpt of a speech by a star, or an excerpt of a speech by a cartoon character. The template audio may also be an excerpt of user-defined audio. The target picture may be a picture that synthesizes the style characteristic of the template picture and the content characteristic of the to-be-processed picture.).

Regarding Claim 20, Wang teaches all the limitations of claim 19. Wang also teaches that the first artificial intelligence model comprises a Convolutional Neural Network (CNN) (Wang Column 6 Lines 27-45 - In some alternative implementations of the present embodiment, the extracting a content characteristic of the to-be-processed picture may comprise: inputting the to-be-processed picture into a pre-trained Convolutional Neural Network (CNN), the CNN being used for extracting an image characteristic; and determining a matrix output by at least one convolutional layer in the CNN as the content characteristic of the to-be-processed picture. The CNN is a feedforward neural network whose artificial neurons may respond to surrounding units within a part of the coverage area, and has an excellent performance at large-scale image processing. It includes a convolutional layer and a pooling layer. The CNN may complete object identification by extracting an abstract characteristic of an object by multi-layer convolution. Therefore, the content characteristic of the to-be-processed picture may be extracted by the CNN. The pre-trained CNN may use a Visual Graphics Generator (VGG) model, a Deep Residual Network (ResNet) model, etc. as a model for extracting the image characteristic.),
and wherein the second artificial intelligence model comprises a Generative Adversarial Network (GAN) (Wang Column 7 Lines 14-30 - In some alternative implementations of the present embodiment, the determining a target picture based on a style characteristic and the content characteristic of the to-be-processed picture may comprise: importing the content characteristic of the to-be-processed picture to a preset style transfer model, and acquiring an output of the style transfer model as the target picture. The style transfer model may be a Generative Adversarial Network (GAN) model. The GAN includes a generation model and a discrimination model. During the training, one of the models is fixed, the parameters of the other model are updated, and such is performed alternatively and by iteration. The loss function for model training may be determined based on the content characteristic of the to-be-processed picture and the style characteristic of the template picture. The style transfer model may also be implemented based on the style transfer algorithm such as the Ashikhmin algorithm.).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Johnston (US 20150205570 A1) (Further referred to as “Johnston”).

Regarding Claim 9, Wang teaches all of the limitations of claim 1. Johnston further teaches that the processor is further configured to: obtain the input frequency spectrum image from the input audio signal using a Fast Fourier Transformation (FFT) (Johnston Paragraph 50 - The system performs 606 a Fast Fourier transform ("FFT") to extract the frequency components of a vertical slice of the audio data over a time corresponding to the block width. The Fourier transform separates the individual frequency components of the audio data (e.g., from zero hertz to the Nyquist frequency). The system applies 608 the window function of the block to the FFT results. Because of the window function, frequency components outside of the block are zero valued. Thus, combining the FFT results with the window function removes any frequency components of the audio data that lie outside of the defined block.),
and obtain the output audio signal from the output frequency spectrum image using an Inverse Fast Fourier Transformation (IFFT) (Johnston Paragraph 51 - The system performs 610 an inverse FFT on the extracted frequency components for the block to reconstruct the time domain audio data solely from within the each block. However, since the frequency components external to the bock were removed by the window function, the inverse FFT generates isolated time domain audio data result that corresponds only to the audio components within the block.).
Wang and Johnston are both considered to be analogous to the claimed invention because both relate to audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Wang to implement the FFTs and IFFTs taught by Johnston in order to more effectively reduce background noise (Johnston Paragraphs 50 and 51 - Thus, combining the FFT results with the window function removes any frequency components of the audio data that lie outside of the defined block. The inverse FFT generates isolated time domain audio data result that corresponds only to the audio components within the block.).

Regarding Claim 18, Wang teaches all of the limitations of claim 10. Johnston further teaches obtaining the input frequency spectrum image from the input audio signal using a Fast Fourier Transformation (FFT) (Johnston Paragraph 50 - The system performs 606 a Fast Fourier transform ("FFT") to extract the frequency components of a vertical slice of the audio data over a time corresponding to the block width. The Fourier transform separates the individual frequency components of the audio data (e.g., from zero hertz to the Nyquist frequency). The system applies 608 the window function of the block to the FFT results. Because of the window function, frequency components outside of the block are zero valued. Thus, combining the FFT results with the window function removes any frequency components of the audio data that lie outside of the defined block.),
and obtaining the output audio signal from the output frequency spectrum image using an Inverse Fast Fourier Transformation (IFFT) (Johnston Paragraph 51 - The system performs 610 an inverse FFT on the extracted frequency components for the block to reconstruct the time domain audio data solely from within the each block. However, since the frequency components external to the bock were removed by the window function, the inverse FFT generates isolated time domain audio data result that corresponds only to the audio components within the block.).
Wang and Johnston are both considered to be analogous to the claimed invention because both relate to audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Wang to implement the FFTs and IFFTs taught by Johnston in order to more effectively reduce background noise (Johnston Paragraphs 50 and 51 - Thus, combining the FFT results with the window function removes any frequency components of the audio data that lie outside of the defined block. The inverse FFT generates isolated time domain audio data result that corresponds only to the audio components within the block.).
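For illustration, the FFT, window function, and inverse FFT sequence described in the cited passages of Johnston (Paragraphs 50 and 51) may be sketched as follows. This is a minimal sketch assuming a rectangular frequency window, i.e., a mask that is one inside the block's frequency range and zero outside it. The function isolate_band, its parameters, and the test tones are hypothetical and do not appear in Johnston.

```python
import numpy as np


def isolate_band(samples, sample_rate, low_hz, high_hz):
    """Keep only frequency components between low_hz and high_hz."""
    samples = np.asarray(samples, dtype=float)

    # FFT: separate the signal into its frequency components.
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    # Assumed rectangular window: components outside the band become zero.
    window = ((freqs >= low_hz) & (freqs <= high_hz)).astype(float)

    # Inverse FFT: reconstruct time-domain audio from the retained components.
    return np.fft.irfft(spectrum * window, n=len(samples))


# Assumed test signal: one second of a 200 Hz tone plus a 1500 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 1500 * t)

isolated = isolate_band(low_tone + high_tone, sample_rate, 100.0, 400.0)
```

In this sketch, the 1500 Hz tone lies outside the 100-400 Hz band and is removed, so the reconstructed signal contains only the 200 Hz tone.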

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Xing (US 20180276540 A1), Engel et al. (US 10068557 B1), Y. Gao, R. Singh and B. Raj, "Voice Impersonation Using Generative Adversarial Networks," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2506-2510, doi: 10.1109/ICASSP.2018.8462018. (Year: 2018), and J. Hall, W. O’Quinn and R. J. Haddad, "An Efficient Visual-Based Method for Classifying Instrumental Audio using Deep Learning," 2019 SoutheastCon, 2019, pp. 1-4, doi: 10.1109/SoutheastCon42311.2019.9020571. (Year: 2019).
Xing (US 20180276540 A1) discloses “methods and systems are provided for detecting and cataloging qualities in music” (Xing - Abstract).
Engel et al. (US 10068557 B1) discloses “systems and methods that include or otherwise leverage a machine-learned neural synthesizer model” (Engel – Abstract).
Y. Gao, R. Singh and B. Raj, "Voice Impersonation Using Generative Adversarial Networks," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2506-2510, doi: 10.1109/ICASSP.2018.8462018. (Year: 2018) discloses “a novel neural network based speech quality-and style-mimicry framework for the synthesis of impersonated voices” (Gao – Abstract).
J. Hall, W. O’Quinn and R. J. Haddad, "An Efficient Visual-Based Method for Classifying Instrumental Audio using Deep Learning," 2019 SoutheastCon, 2019, pp. 1-4, doi: 10.1109/SoutheastCon42311.2019.9020571. (Year: 2019) discloses “an efficient method for classifying and identifying instrumental audio is proposed via utilizing a deep learning image classification algorithm” (Hall – Abstract).
Please see additional references in form PTO-892 for more details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI whose telephone number is (571)272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/UTHEJ KUNAMNENI/
Examiner, Art Unit 2656

/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656