DETAILED ACTION
This office action is in response to Applicant’s submission filed on 12/4/2020. Claims 1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. KR10-2019-0160008, filed on 12/4/2019.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 12/4/2020 and 5/24/2021 have been considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh et al. (US20190138598A1) (herein "Albouyeh"), Aggarwal et al. (US20200380027A1) (herein "Aggarwal"), and Lu et al. (US20220084163A1) (herein "Lu").

Regarding claims 1, 19, and 20, Albouyeh teaches [A device for improving output content through iterative generation, the device comprising: a memory storing instructions; and at least one processor configured to execute the instructions to - claim 1], [A method of improving output content through iterative generation, the method comprising - claim 19], and [A non-transitory computer-readable storage medium comprising instructions which, when executed by at least one processor, causes the at least one processor to - claim 20] (Albouyeh, Par. 0006: “… a system/apparatus is provided. The system/apparatus may comprise one or more processors and a memory coupled to the one or more processors. The memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones of, and combinations of, the operations ....”, and Par. 0049: “The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, … specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random-access memory [RAM], a read-only memory [ROM], an erasable programmable read-only memory [EPROM or Flash memory], …. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, ...”).

Albouyeh further teaches set a target area in base content based on a first user input, determine input content based on the user intention information or a second user input, generate output content related to the base content based on the input content, the target area, and the user intention information by using a neural network (NN) model, generate a caption for the output content by using an image captioning model, calculate a similarity between text of the natural language input and the generated output content (Albouyeh, Par. 0043: “… when generating the NLU image description for an identified image, natural language description engine 306 may simply utilize a caption associated with the image if such a caption exists or generate an image specific description based on an analysis of the features within the identified image. However, in a second embodiment, natural language description engine 306 may determine that the caption associated with the identified image is not comprehensive. For example, a caption associated with an image may read “A man sitting at a desk,” which may be a general description when the text of document 318 is taken into consideration. Thus, natural language description engine 306 may utilize the caption associated with the image and generate an image specific description based on an analysis of the features within the identified image. For example, “An image is provided representing a team member sitting at a desk.” If natural language description engine 306 provide two different NLU descriptions to graphical element integration engine 308, graphical element integration engine 308 compares each NLU image description to the sentences within the text of document 318 using soft cosine similarity and/or ontological mapping to generate a relatedness score of each sentence to each NLU image description using the soft cosine similarity value and/or the ontological mapping value. Utilizing the relatedness scores for each sentence/NLU image description pair, graphical element integration engine 308 presents screen reader engine 302 the sentence with the highest relatedness score as well as the associated NLU image description.”).
Albouyeh fails to explicitly disclose, however, Aggarwal teaches receive a natural language input, obtain user intention information based on the natural language input by using a natural language understanding (NLU) model, (Aggarwal, Par. 0122:” … In this way, the machine-learning training module 130 is configured to capture user intention from the query-based training dataset 1322 regarding association of text queries with respective digital images and also create embeddings for long sequences of text [e.g., sentences] using the title-based training dataset 1324.”, and Par. 0039:” Other examples of features that may be supported by the functionality described herein include machine translation, text retrieval, speech recognition, text summarization, natural language understanding, and so forth as further described in relation to FIG. 3.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh in view of Aggarwal to receive a natural language input and obtain user intention information based on the natural language input by using a natural language understanding (NLU) model, in order to discriminate and improve model accuracy, as evidenced by Aggarwal (See Par. 0059).
Albouyeh, and Aggarwal fail to explicitly disclose, however, Lu teaches iterate generation of the output content based on the similarity. (Lu, Par. 0056:” An example in which a result of the consistency detection is the similarity is used, and the server may determine, according to the similarity outputted by the first discriminator, whether to stop training the first model. In any iteration process, when the similarity outputted by the first discriminator is greater than or equal to a first similarity threshold, it indicates that the third sample parsed image generated by the first model and the second sample parsed image are similar enough, the first training parameter of the first model has converged to an appropriate value, and the first model can generate an image that looks like the real image, and therefore the server stops training the first model.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh and Aggarwal in view of Lu to iterate generation of the output content based on the similarity, in order to improve the quality of a generated image, as evidenced by Lu (See Par. 0005).
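For illustration only, the limitation "iterate generation of the output content based on the similarity" can be pictured as a loop that regenerates output until the similarity between the request text and a caption of the output meets a threshold. The following is a minimal sketch under stated assumptions: it is not taken from the application or from any cited reference, and the function names, the canned outputs, the word-overlap measure, and the threshold value are all hypothetical stand-ins.

```python
from typing import Callable, Tuple


def iterate_until_similar(
    generate: Callable[[int], str],
    caption: Callable[[str], str],
    similarity: Callable[[str, str], float],
    request_text: str,
    threshold: float = 0.8,
    max_iterations: int = 10,
) -> Tuple[str, float, int]:
    """Regenerate output until its caption is similar enough to the request.

    Returns (output, similarity score, attempts used). If no attempt reaches
    the threshold, the best-scoring output seen so far is returned.
    """
    best_output, best_score = "", -1.0
    for attempt in range(1, max_iterations + 1):
        output = generate(attempt)
        score = similarity(request_text, caption(output))
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            return output, score, attempt
    return best_output, best_score, max_iterations


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a toy similarity measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical stand-ins: a "generator" that returns canned outputs and a
# "captioner" that simply echoes the output as its own caption.
CANNED_OUTPUTS = ["a dog on grass", "a cat on grass", "a cat on a sofa"]


def toy_generate(attempt: int) -> str:
    return CANNED_OUTPUTS[min(attempt, len(CANNED_OUTPUTS)) - 1]


def toy_caption(output: str) -> str:
    return output


result = iterate_until_similar(
    toy_generate, toy_caption, word_overlap, "a cat on a sofa", threshold=0.9
)
```

In this toy run the first two attempts fall below the threshold and the loop stops on the third, which is the stopping behavior the similarity-threshold passage of Lu, Par. 0056 is cited for.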

Regarding claim 17, Albouyeh as modified by Aggarwal above discloses the claimed caption and the text of the natural language input, but does not disclose the remainder of the limitations of claim 17. However, Lu teaches wherein the iterative generation of the output content continues until the output content is generated having the [caption] similar to [the text of the natural language input]. (Lu, Par. 0056:” An example in which a result of the consistency detection is the similarity is used, and the server may determine, according to the similarity outputted by the first discriminator, whether to stop training the first model. In any iteration process, when the similarity outputted by the first discriminator is greater than or equal to a first similarity threshold, it indicates that the third sample parsed image generated by the first model and the second sample parsed image are similar enough, the first training parameter of the first model has converged to an appropriate value, and the first model can generate an image that looks like the real image, and therefore the server stops training the first model.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh and Aggarwal in view of Lu such that the iterative generation of the output content continues until the output content is generated having the caption similar to the text of the natural language input, in order to improve the quality of a generated image, as evidenced by Lu (See Par. 0005).

Regarding claim 18, Aggarwal further teaches wherein the similarity is a vector similarity, and wherein the text of the natural language input and the caption is encoded using a semantic vector to generate corresponding vectors, and a similarity between the vectors is calculated. (Aggarwal, Par. 0029:” Accordingly, machine-learning techniques and systems are described in which a model is trained to support a visually guided machine-learning embedding space that supports visual intuition as to ‘what’ is represented by text. In one example, training of the model begins with a fixed image embedding space, to which, a text encoder is then trained using digital images and text associated with the digital images. This causes text describing similar visual concepts to be clustered together in the visually guided language embedding space supported by the model. In this way, the text and digital image embeddings are usable directly within the visually guided machine learning embedding space and as such are directly comparable to each other [e.g., without further modification] to determine similarity. For example, a text embedding generated based on text is usable to determine similarity to a digital image embedding generated based on a digital image without further modification, e.g., through use of comparison metrics based on respective vectors as described below. As a result, similarity of the text to the digital image may be readily determined in real time, which is not possible using conventional techniques.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh and Lu in view of Aggarwal such that the similarity is a vector similarity, the text of the natural language input and the caption are encoded using a semantic vector to generate corresponding vectors, and a similarity between the vectors is calculated, in order to discriminate and improve model accuracy, as evidenced by Aggarwal (See Par. 0059).
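For illustration only, a "vector similarity" between encoded text and an encoded caption can be computed as the cosine of the angle between the two vectors. The sketch below assumes a simple bag-of-words encoding as a stand-in for a learned semantic encoder; it is neither Aggarwal's embedding space nor the soft cosine similarity mentioned in Albouyeh, and the function names are hypothetical.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Encode text as a bag-of-words count vector (toy stand-in for an encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Example pair borrowed from the captions quoted from Albouyeh, Par. 0043.
text_vector = encode("a man sitting at a desk")
caption_vector = encode("a team member sitting at a desk")
score = cosine_similarity(text_vector, caption_vector)
```

A score near 1 indicates that the two texts share most of their weighted terms, and a score of 0 indicates no shared terms.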


Claims 2-3 are rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Karpathy et al. (“Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015, pp. 3128-3137) (herein “Karpathy”).

Regarding claim 2, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Karpathy teaches wherein the base content, the input content, and the output content are images, and wherein the output content is generated by compositing the input content into the target area of the base content. (Karpathy, Abstract:” We present a model that generates natural language descriptions of images [base content] and their regions [target area]. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”, and Figure 1: “depicts various output content composed of input content and potential bounding [target] regions.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Karpathy such that the base content, the input content, and the output content are images, and the output content is generated by compositing the input content into the target area of the base content, in order to improve ranking performance, as evidenced by Karpathy (See section 3.1, page 3129).

Regarding claim 3, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Karpathy teaches wherein the base content comprises a plurality of areas, and wherein the target area comprises an area selected from among the plurality of areas by the first user input. (Karpathy, Abstract:” We present a model that generates natural language descriptions of images [base content] and their regions [target area]. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions.”, and Figure 1: “depicts various output content composed of input content and potential bounding [target] regions.”). Note, figure 1 depicts the base content which comprises a plurality of areas shown with bounding depiction, also the target areas can be selected from the plurality of areas shown.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Karpathy such that the base content comprises a plurality of areas, and the target area comprises an area selected from among the plurality of areas by the first user input, in order to improve ranking performance, as evidenced by Karpathy (See section 3.1, page 3129).

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Brown et al. (US20200143481A1) (herein “Brown”).

Regarding claim 4, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Brown teaches wherein the natural language input comprises a voice input, and wherein the voice input is converted into the text of the natural language input by using an automatic speech recognition (ASR) model. (Brown, Par. 0064: “For example, the service provider system 102 may perform speech recognition to convert audio [e.g., speech input] into text or other data, natural language understanding with text or other data to understand the text/other data [e.g., determining meaning, intent, etc.], natural language generation to generate a response for input, task formation to generate a task for input [which may include generating a response], and so on.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Brown such that the natural language input comprises a voice input, and the voice input is converted into the text of the natural language input by using an automatic speech recognition (ASR) model, in order to improve the natural language reporting component and the image processing component, as evidenced by Brown (See Par. 0199).

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Dong-Hui et al. (CN112542163A, with reference to the provided English Machine Translation, herein “Dong-Hui”).

Regarding claim 5, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Dong-Hui teaches wherein the input content is determined based on content information included in the user intention information. (Dong-Hui, Abstract: “ … performing voice recognition to the voice request input by the user, obtaining the voice recognition result; performing semantic understanding of the voice recognition result, identifying the intention of the user; if the intention of the user needs to depend on the image input, then extracting the obtained user interest content in the first image, the first image is obtained by shooting the object placed in the designated shooting area of the user”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Dong-Hui such that the input content is determined based on content information included in the user intention information, in order to improve the accuracy of the intelligent voice interaction result, as evidenced by Dong-Hui (See Abstract).

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, Lu, and Dong-Hui, and further in view of Karpathy.

Regarding claim 6, Albouyeh, Aggarwal, Lu and Dong-Hui fail to explicitly disclose, however, Karpathy teaches wherein the input content is further determined from a plurality of pieces of content corresponding to the content information, and the plurality of pieces of content have different attributes from each other. (Karpathy, section 3.1.1: "we observe that sentence descriptions make frequent references to objects and their attributes."). Note: Figures 1 and 2 depict input content composed of a plurality of different pieces of content corresponding to the base content, where each piece of content has an attribute different from the others.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, Lu, and Dong-Hui in view of Karpathy such that the input content is further determined from a plurality of pieces of content corresponding to the content information, and the plurality of pieces of content have different attributes from each other, in order to improve ranking performance, as evidenced by Karpathy (See section 3.1, page 3129).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Moustafa et al. (US20220126864A1) (herein “Moustafa”).

Regarding claim 7, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Moustafa teaches wherein an attribute of the input content comprises at least one of a pose, a facial expression, make-up, hair, apparel, or an accessory, and wherein the attribute of the input content is determined based on content attribute information included in the user intention information. (Moustafa, Par. 0647:” FIG. 92 shows example disguised images 9204 generated by using a StarGAN based model to modify different facial attributes of an input image 9202. The attributes used to modify input image 9202 include hair color [e.g., black hair, blond hair, brown hair] and gender [e.g., male, female]. A StarGAN based model could also be used to generate images with other modified attributes such as age [e.g., looking older] and skin color [e.g., pale, brown, olive, etc.]. In addition, combinations of these attributes could also be used to modify an image including H+G [e.g., hair color and gender], H+A [e.g., hair color and age], G+A [e.g., gender and age], and H+G+A [e.g., hair color, gender, and age]. Other existing GAN models can offer attribute modifications such as reconstruction [e.g., change in face structure], baldness, bangs, eye glasses, heavy makeup, and a smile. One or more of these attribute transformations can be applied to test images, and the transformed [or disguised images] can be evaluated to determine the optimal target domain to be used to configure a GAN model for use in a vehicle, as previously described herein.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Moustafa such that an attribute of the input content comprises at least one of a pose, a facial expression, make-up, hair, apparel, or an accessory, and the attribute of the input content is determined based on content attribute information included in the user intention information, in order to enrich the data set to improve classification and object identification performance, as evidenced by Moustafa (See Par. 0451).


Claims 8 and 10-12 are rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Zhang et al. (“StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”, arXiv:1612.0324, pp. 5907-5915) (herein “Zhang”).

Regarding claim 8, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Zhang teaches wherein the NN model is related to a generated adversarial network (GAN) model, and wherein the output content is generated by a generator of the GAN model. (Zhang, Section 3.1, Page 5909:” Generative Adversarial Networks (GAN) [8] are composed of two models that are alternatively trained to compete with each other. The generator G is optimized to reproduce the true data distribution pdata by generating images that are difficult for the discriminator D to differentiate from real images. Meanwhile, D is optimized to distinguish real images and synthetic images generated by G.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Zhang such that the NN model is related to a generated adversarial network (GAN) model, and the output content is generated by a generator of the GAN model, in order to augment photo-realistic image synthesis with a novel sketch-refinement process, as evidenced by Zhang (See Conclusion, Page 5914).

Regarding claim 10, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Zhang teaches wherein the NN model is related to a generated adversarial network (GAN) model, and wherein a discriminator of the GAN model identifies the output content as fake content when the similarity does not satisfy a predetermined condition. (Zhang, Section 3.1, Page 5909:” Generative Adversarial Networks (GAN) [8] are composed of two models that are alternatively trained to compete with each other. The generator G is optimized to reproduce the true data distribution pdata by generating images that are difficult for the discriminator D to differentiate from real images. Meanwhile, D is optimized to distinguish real images and synthetic images generated by G.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Zhang such that the NN model is related to a generated adversarial network (GAN) model, and a discriminator of the GAN model identifies the output content as fake content when the similarity does not satisfy a predetermined condition, in order to augment photo-realistic image synthesis with a novel sketch-refinement process, as evidenced by Zhang (See Conclusion, Page 5914).

Regarding claim 11, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Zhang teaches wherein the output content is first output content, and wherein the at least one processor is further configured to, when the similarity does not satisfy a predetermined condition, execute the instructions to: generate second output content different from the first output content based on the input content, the target area, and the user intention information by using the NN model. (Zhang, Section 3 - Page 5909: “To generate high-resolution images with photo-realistic details, we propose a simple yet effective Stacked Generative Adversarial Networks. It decomposes the text-to-image generative process into two stages (see Figure 2). Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image. Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high-resolution photo-realistic image.”). Note: The image generated in Stage-I does not satisfy a predetermined condition because, per Zhang, Stage-I only sketches a primitive shape. The subsequent Stage-II image has complete details because it is a high-resolution image.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Zhang such that the output content is first output content, and the at least one processor is further configured to, when the similarity does not satisfy a predetermined condition, execute the instructions to generate second output content different from the first output content based on the input content, the target area, and the user intention information by using the NN model, in order to augment photo-realistic image synthesis with a novel sketch-refinement process, as evidenced by Zhang (See Conclusion, Page 5914).

Regarding claim 12, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Zhang teaches wherein the input content is first input content, and the output content is first output content, and wherein the at least one processor is further configured to, when the similarity does not satisfy a predetermined condition, execute the instructions to: determine second input content different from the first input content, and generate second output content different from the first output content based on the second input content and the target area by using the NN model, when the similarity does not satisfy the predetermined condition. (Zhang, Section 3 - Page 5909: “To generate high-resolution images with photo-realistic details, we propose a simple yet effective Stacked Generative Adversarial Networks. It decomposes the text-to-image generative process into two stages (see Figure 2). Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image. Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high-resolution photo-realistic image.”). Note: In the system, a given input (call it the first input, or x) creates an output related to that input; consequently, input x produces output x (this is only a naming convention). The image generated in Stage-I does not satisfy a predetermined condition because, per Zhang, Stage-I only sketches a primitive shape. The subsequent Stage-II image has complete details because it is a high-resolution image.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Zhang such that the input content is first input content, the output content is first output content, and the at least one processor is further configured to, when the similarity does not satisfy a predetermined condition, execute the instructions to determine second input content different from the first input content, and generate second output content different from the first output content based on the second input content and the target area by using the NN model, in order to augment photo-realistic image synthesis with a novel sketch-refinement process, as evidenced by Zhang (See Conclusion, Page 5914).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Ikhlef et al. (US20200390414A1) (herein “Ikhlef”).

Regarding claim 9, Albouyeh, Aggarwal, and Lu fail to explicitly disclose, however, Ikhlef teaches wherein probability distribution of the output content or the base content including the output content corresponds to probability distribution of real content. (Ikhlef, Par. 0067:” For example, G is a network that produces an image based on a random noise z it receives, with the noise written as G(z); and D is a discriminator network that determines whether an image is “real” or not based on an input parameter, x. Therein, x represents an image and output D(x) represents the probability that x is a real image, in which where the value is 1, it indicates that the image is 100% a real image, and where the output is 0, it indicates that the image is absolutely not a real image.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Ikhlef such that the probability distribution of the output content, or of the base content including the output content, corresponds to the probability distribution of real content, in order to improve the quality of image reconstruction, as evidenced by Ikhlef (See Par. 0081).
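For illustration only, the discriminator output D(x) described in the quoted passage of Ikhlef can be pictured as a value between 0 and 1 that is read as the probability that x is a real image. The sketch below assumes a single logistic unit over a hypothetical feature vector; it is not Ikhlef's network, and the weights, bias, feature values, and 0.5 cutoff are arbitrary assumptions.

```python
import math
from typing import Sequence


def discriminator_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Return D(x) in (0, 1): a toy logistic score read as P(x is real)."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    # Numerically stable sigmoid for both signs of the logit.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def is_judged_real(probability: float, cutoff: float = 0.5) -> bool:
    """Treat the sample as real when D(x) meets the (assumed) cutoff."""
    return probability >= cutoff


# Hypothetical feature vectors and parameters, chosen only for the example.
weights = [2.0, -1.0]
likely_real = discriminator_probability([3.0, 0.5], weights, bias=0.0)
likely_fake = discriminator_probability([-2.0, 1.0], weights, bias=0.0)
```

Under this reading, values approaching 1 correspond to "real" and values approaching 0 correspond to "not real", matching the interpretation of D(x) in the quoted paragraph.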

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, Lu, and in further view of Christos Margiolas (US20190138511A1)(herein “Margiolas”).

Regarding claim 13, Albouyeh, Aggarwal, and Lu fail to explicitly disclose the following limitation; however, Margiolas teaches wherein the at least one processor is further configured to execute the instructions to: receive user feedback regarding a part of the output content, and modify the part of the output content by using the NN model (Margiolas, Par. 0024: “In an example implementation, the targeted content engine receives content associated with a user's digital activities in a source form; applies a conversion framework based on the source form that outputs the content as data in a development form; determines contextual terms of the data in the development form; gathers environmental information associated with the content and user's digital activities; generates a primary indexer and secondary indexer to map a set of individual identifiers to the data; applies machine learning to generate characterization data and contextual data based on the environmental information, wherein the machine learning; utilizes the primary indexer and secondary indexer in a neural network to output a live model; updates the live model with feedback; processes the data with the live model and feedback to preform real-time analysis about the data;”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Margiolas such that the at least one processor is further configured to execute the instructions to: receive user feedback regarding a part of the output content, and modify the part of the output content by using the NN model, in order to improve the effectiveness of supplemental content, as evidenced by Margiolas (see Par. 0019).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Hogan et al. (US20140164992A1) (herein “Hogan”).

Regarding claim 14, Albouyeh, Aggarwal, and Lu fail to explicitly disclose the following limitation; however, Hogan teaches wherein the base content comprises a workspace of an application, and wherein the input content comprises a work object located in the workspace (Hogan, Par. 0003: “In various computer-implemented applications, such as productivity programs, it may be possible for a user to set-up or design a displayed work space to convey particular information. Examples of such applications may include word processing programs, spreadsheet programs, presentation programs, and so forth, where a user can create or modify contents displayed on a work space [such as a document, a spreadsheet, a slide of a presentation]. In the creation or modification of such work spaces, a user may add various objects to the work space, such as tables of cells, graphics [pointers, arrows, lines, shapes], text boxes, images [e.g., pictures], and so forth.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Hogan such that the base content comprises a workspace of an application, and wherein the input content comprises a work object located in the workspace, in order to allow a user to place, move, or modify objects on such a displayed work space, as evidenced by Hogan (see Par. 0029).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, Lu, and Hogan, and further in view of Saito Yutaka (JP3762243B2) (herein “Yutaka”).

Regarding claim 15, Albouyeh, Aggarwal, Lu, and Hogan fail to explicitly disclose the following limitation; however, Yutaka teaches wherein the output content comprises an animation related to the work object, wherein the animation is generated based on the work object, the user intention information, and an application programming interface (API) of the application, and wherein the caption for the output content comprises a caption for the animation (Yutaka, Par. 0184: “Captions are displayed by overlaying characters such as product names [work object] and supplementary explanations on arbitrary positions such as on the surface, left and right. Specifically, the text to be displayed in the setting information may be described in advance, or the link destination (path name, file name) is specified in the setting information, and these are read when forming a solid. It can also be displayed during rollover. These captions can be set in a conventional HTML format to set hyperlinks.”, and Par. 0186: “As with rollover, [1] Sound is played, [2] Animation and video are played, [3] Caption is displayed.”, and Par. 0199: “Further, in the present embodiment, it is possible to attract the user's attention by a visual effect rich in changes in which only the frontmost surface rotates front and back [FIGS. 7, 14, and 19]. In the present embodiment, smooth information browsing according to the user's intention and interest is facilitated by rotating the front and back [animation] when a predetermined operation is performed. In this embodiment, when the front and back are rotated, the surface is displayed in an enlarged manner [FIG. 9], so that it is easy to grasp that the front and back are rotated. The display can effectively attract the user's interest.”, and Par. 0123: “Also, various data such as web data such as images, video, audio, text, HTML, and external application program routines are used as necessary.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, Lu, and Hogan in view of Yutaka such that the output content comprises an animation related to the work object, wherein the animation is generated based on the work object, the user intention information, and an application programming interface (API) of the application, and wherein the caption for the output content comprises a caption for the animation, in order to make it easier to browse and operate information, as evidenced by Yutaka (see Par. 0166).

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Albouyeh, Aggarwal, and Lu, and further in view of Su et al. (US20190340469A1) (herein “Su”).

Regarding claim 16, Albouyeh, Aggarwal, and Lu fail to explicitly disclose the following limitation; however, Su teaches wherein the NLU model, the NN model, and the image captioning model are stored in the memory (Su, Par. 0014: “The disclosed techniques can be implemented, for example, in a computing system or a software product executable or otherwise controllable by such systems, although other embodiments will be apparent. The system or product is configured to perform topic-guided captioning on an image. In accordance with an embodiment, a methodology to implement these techniques includes generating image feature vectors, for an image to be captioned, based on application of a convolutional neural network [CNN] to the image. The method further includes generating the caption based on application of a recurrent neural network [RNN] to the image feature vectors. In some example embodiments, the RNN is configured as a “long short-term memory” [LSTM] RNN, as will be explained in greater detail below. Other natural language processing techniques can be used as well, as will be appreciated in light of this disclosure. The method further includes training the LSTM RNN with training images and associated training captions. The training is based on a combination of: feature vectors calculated from the training image; feature vectors calculated from the associated training caption; and an MCB pooling of the training caption feature vectors and an estimated topic of the training image. The estimated topic is generated by an application of the CNN to the training image.”). Note that, during the training process, local memory is used for storage.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Albouyeh, Aggarwal, and Lu in view of Su such that the NLU model, the NN model, and the image captioning model are stored in the memory, in order to improve image captioning independent of the platform used, as evidenced by Su (see Par. 0015).



Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Kong et al. (US-20200175975A1) teaches, at Par. 0030: “According to some aspects of the present disclosure, techniques for modifying visual data based on audio data are disclosed. In one embodiment, an image may be received by a computer system. The image may be associated with one or more photo editing applications that may be executed on the computer system. The computer system may identify one or more segments within the image. A segment may be a modifiable or operable portion of the image. For example, a segment may be a background in the image, one or more objects in the image (such as people, mountains, the sun, the moon, a table, and the like), color levels within the image, brightness levels within the image, contrast levels within the image, shadows within the image, lighting within the image, saturation levels within the image, tones within the image, blurriness within the image, text within the image, layers within the image, and the like. In one embodiment, a segment of an image may be one or more characteristics or properties of the image.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI, whose telephone number is (408) 918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DARIOUSH AGAHI/             Examiner, Art Unit 2656

/MICHELLE M KOETH/             Primary Examiner, Art Unit 2656