DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Information Disclosure Statement
3.	The information disclosure statements (IDS) submitted on the following dates are in compliance with the provisions of 37 CFR 1.97 and are being considered by the Examiner: 12/16/2020; 06/29/2021.

Specification
4.	The abstract of the disclosure is objected to because the acronym “K-NN graph” must be expanded.  Correction is required.  See MPEP § 608.01(b).
5.	The disclosure is objected to because of the following informalities: 
Paragraph [0030] reads “K-NN graph”; this acronym must be expanded the first time it is used.  Examiner interprets “K-NN graph” as “k-nearest neighbor graph”.

Double Patenting
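6.	Applicant is advised that should claim 7 be found allowable, claim 14 will be objected to under 37 CFR 1.75 as being a substantial duplicate thereof.  When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim.  See MPEP § 608.01(m).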


Claim Rejections - 35 USC § 103
7.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1.	Determining the scope and contents of the prior art.
2.	Ascertaining the differences between the prior art and the claims at issue.
3.	Resolving the level of ordinary skill in the pertinent art.
4.	Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
8.	Claims 1, 7-8 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Dimtrva et al. (“Dimtrva”) [US-2006/0290699-A1] in view of “A semantic feature for human motion retrieval” by Tian Qi et al. (“Qi”).
Regarding claim 1, Dimtrva discloses a method (¶0001, method for synthesizing audio-visual content in a video image processor), by one or more computing systems (Fig. 1 and ¶0027, a block diagram illustrating display unit 110 (having a display screen 115) and an exemplary computer 120 that comprises a content synthesis application processor 190), comprising:
receiving one or more non-video inputs (Dimtrva- ¶0006, extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking; Claim 1, receiving audio-visual input signals that represent a speaker who is speaking), wherein the one or more non-video inputs comprises at least one of a text input, an audio input, or an expression input (Dimtrva- ¶0006, extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking; ¶0009, The processor extracts audio features of the speaker's speech […] provides an audio-visual representation of the speaker's face synchronized with the speaker's speech; ¶0029, synchronizes the animated version of the face of the speaker with the speaker's speech; ¶0046-0047, A predefined number of sentences (e.g., two hundred sentences) are selected from the text corpus of a speech database […] audio data samples. For each speech segment, a selection of different audio coefficients are calculated as the audio features; ¶0075, Depending upon the audio expression classification, speaking face animation and synchronization module 380 can modify the animated facial parameters to accentuate certain features to more correctly express the facial animation of the speaker's face);
a particular semantic context of a plurality of semantic contexts (Dimtrva- Claim 1, “performs a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face”);
processing the one or more non-video inputs to identify one or more semantic contexts, respectively, that relate to the one or more non-video inputs (Dimtrva- ¶0006, extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking; Claim 1, receiving audio-visual input signals that represent a speaker who is speaking; Claim 1, “performs a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face”);
determining one or more actions to be performed by a digital avatar based on the one or more identified semantic contexts (Dimtrva- ¶0007, The processor then synchronizes the facial movements of the animated version of the face of the speaker with a plurality of the audio logical units that represent the speaker's speech; ¶0009, The processor extracts audio features of the speaker's speech and finds corresponding video representations for the audio features using a semantic association procedure. The processor then matches the corresponding video representations with the audiovisual speaking face movement components; ¶0060, A method to detect the semantic correlation between visual faces and associated speech; ¶0065, Analysis of new incoming audio may be performed by a semantic association method to find the matching video and the most likely facial movements.);
generating, in real-time responsive to receiving the one or more non-video inputs and based on the determined one or more actions, a video output of the digital avatar comprising one or more human characteristics corresponding to the one or more identified semantic contexts (Dimtrva- ¶0009, In a recognition phase the processor analyzes a new input video. The processor extracts audio features of the speaker's speech and finds corresponding video representations for the audio features using a semantic association procedure. The processor then matches the corresponding video representations with the audiovisual speaking face movement components […] The processor then creates a computer generated animated face for each selected audiovisual speaking face movement component and synchronizes each computer generated animated face with the speaker's speech. The final result is an output that provides an audio-visual representation of the speaker's face synchronized with the speaker's speech; ¶0012, displaying realistic facial gestures for a computer generated animated human face; ¶0079, Content synthesis application processor 190 synchronizes each computer generated animated face of the speaker with the speaker's speech (step 740). This creates an audio-visual representation of the speaker's face that is synchronized with the speaker's speech); and
sending, to a client device, instructions to present the video output of the digital avatar (Dimtrva- ¶0002, presenting computer output to a computer user in the form of a computer generated visual image of a person who is speaking; ¶0009, The final result is an output that provides an audio-visual representation of the speaker's face synchronized with the speaker's speech; ¶0011, displaying a realistic audio-visual representation of a speaker who is speaking; ¶0074, The output of facial animation for selected parameters module 370 is then sent to speaking face animation and synchronization module 380; ¶0079, The audio-visual representation of the speaker's face is then output to display unit 110).
Dimtrva fails to explicitly disclose accessing a K-NN graph comprising a plurality of sets of nodes, wherein each set of nodes corresponds to a particular semantic context of a plurality of semantic contexts; and processing the one or more non-video inputs using the K-NN graph to identify one or more semantic contexts corresponding to one or more sets of nodes, respectively, that relate to the one or more non-video inputs.
However, Qi discloses
accessing a K-NN graph comprising a plurality of sets of nodes, wherein each set of nodes corresponds to a particular semantic context of a plurality of semantic contexts (Qi- Figure 6 shows sparse coding (SC) and K nearest neighbor (KNN) method; page 399, section 1. INTRODUCTION, right column, 2nd paragraph, First, we use k-means clustering algorithm to obtain the key frame of each clip in the database. Second, the key-pose model is constructed from all the key frames of every motion class by Gaussian mixture model (GMM). Then, the frame-based and clip-based semantic features can be extracted for all the motions in the motion database; page 404, section 5.3. Motion Retrieval, right column, 1st paragraph, “clip-based semantic feature is capable to retrieve similar motion effectively and efficiently”, where the retrieval method includes SC and KNN);
processing the one or more non-video inputs using the K-NN graph to identify one or more semantic contexts corresponding to one or more sets of nodes, respectively, that relate to the one or more non-video inputs (Qi- Figure 6 shows sparse coding (SC) and K nearest neighbor (KNN) method; page 399, section 1. INTRODUCTION, right column, 2nd paragraph, First, we use k-means clustering algorithm to obtain the key frame of each clip in the database. Second, the key-pose model is constructed from all the key frames of every motion class by Gaussian mixture model (GMM). Then, the frame-based and clip-based semantic features can be extracted for all the motions in the motion database; page 404, section 5.3. Motion Retrieval, right column, 1st paragraph, “clip-based semantic feature is capable to retrieve similar motion effectively and efficiently”, where the retrieval method includes SC and KNN).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva to incorporate the teachings of Qi, and to apply the K-NN graph in the semantic association procedure performed on the audiovisual input vectors, for accessing a K-NN graph comprising a plurality of sets of nodes, wherein each set of nodes corresponds to a particular semantic context of a plurality of semantic contexts; and processing the one or more non-video inputs using the K-NN graph to identify one or more semantic contexts corresponding to one or more sets of nodes, respectively, that relate to the one or more non-video inputs.
Doing so would provide a novel approach for human motion retrieval.

Regarding claim 7, Dimtrva in view of Qi, discloses the method of Claim 1, and further discloses wherein the video output comprises a rendering of a sequence of actions performed by the digital avatar based on the determined one or more actions (Dimtrva- ¶0007, The processor then synchronizes the facial movements of the animated version of the face of the speaker with a plurality of the audio logical units that represent the speaker's speech; ¶0012-0013, displaying realistic facial gestures for a computer generated animated human face […] synchronizing the facial movements of an animated version of the face of a speaker with a plurality of the audio logical units that represent the speaker's speech; ¶0082, The audio-visual representation of the speaker's face is then output to display unit 110; Qi- Figure 2 shows a sequence of actions; page 401, section 3.1. Key Frame Extraction, right column, 3rd paragraph, First, k-means algorithm is applied to segment the input motion clip into many short subsequences, as illustrated in Figure 2. Then, one frame of each subsequence is selected as the key frame).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva to incorporate the teachings of Qi, and to apply the subsequences of actions to the facial movements of the animated version of the face of the speaker, as taught by Dimtrva, so that the video output comprises a rendering of a sequence of actions performed by the digital avatar based on the determined one or more actions.
The same motivation that was utilized in the rejection of claim 1 applies equally to this claim.

The system of claim 8 is similar in scope to the functions performed by the method of claim 1 and therefore claim 8 is rejected under the same rationale.

Regarding claim 8, Dimtrva in view of Qi, discloses an apparatus (Dimtrva- Figs. 1-2, Computer 120 and ¶0027, an exemplary computer 120 that comprises a content synthesis application processor 190) comprising: one or more non-transitory computer-readable storage media including instructions; and one or more processors coupled to the storage media (Dimtrva- Figs. 1-2 and ¶0028, Computer 120 comprises a central processing unit (CPU) 150 and memory 160. Memory 160 comprises operating system software 170 and application programs 180. Computer 120 also comprises content synthesis application processor 190), the one or more processors configured to execute the instructions to perform the functions of claim 1.

Regarding claim 14, the same basis and rationale for claim rejection as applied to claim 7 is applied.

Regarding claim 15, all claim limitations are set forth as claim 1 in a computer program product having a non-transitory medium storing a set of instructions and rejected as per discussion for claim 1.

Regarding claim 15, Dimtrva in view of Qi, discloses computer-readable non-transitory storage media comprising instructions executable by a processor (Dimtrva- Figs. 1-2 and ¶0028, Computer 120 comprises a central processing unit (CPU) 150 and memory 160. Memory 160 comprises operating system software 170 and application programs 180. Computer 120 also comprises content synthesis application processor 190) to perform the functions of claim 1.


9.	Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Dimtrva in view of Qi, further in view of Liu et al. (“Liu”) [US-2019/0164327-A1].
Regarding claim 2, Dimtrva in view of Qi, discloses the method of Claim 1, and further discloses wherein each of the plurality of semantic contexts are indicative of an expression (Dimtrva- Claim 1, “performs a semantic association procedure on the audiovisual input vectors to obtain an association between phonemes that represent the speaker's speech and visemes that represent the speaker's face”; ¶0075, Depending upon the audio expression classification, speaking face animation and synchronization module 380 can modify the animated facial parameters to accentuate certain features to more correctly express the facial animation of the speaker's face), and wherein each node of a set of nodes that correspond to the respective semantic context (Qi- Figure 6 shows sparse coding (SC) and K nearest neighbor (KNN) method; page 399, section 1. INTRODUCTION, right column, 2nd paragraph, First, we use k-means clustering algorithm to obtain the key frame of each clip in the database. Second, the key-pose model is constructed from all the key frames of every motion class by Gaussian mixture model (GMM). Then, the frame-based and clip-based semantic features can be extracted for all the motions in the motion database; page 404, section 5.3. Motion Retrieval, right column, 1st paragraph, “clip-based semantic feature is capable to retrieve similar motion effectively and efficiently”, where the retrieval method includes SC and KNN).
The prior art fails to explicitly disclose wherein each node of a set of nodes that correspond to the respective semantic context is associated with an intensity of the expression.
However, Liu discloses
the respective semantic context is associated with an intensity of the expression (Liu- Fig. 4 shows a first relationship table 200 with intensity of the expression such as “Happy” or “Sad” and ¶0023, The first relationship table 200 includes a number of preset context and a plurality of preset animated images, and the first relationship table 200 defines a relationship between the number of preset contexts and the number of preset animated images).
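It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva/Qi to incorporate the teachings of Liu, so that each node of a set of nodes that correspond to the respective semantic context is associated with an intensity of the expression.
Doing so would reflect the user's emotions with vividness.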

The system of claim 9 is similar in scope to the functions performed by the method of claim 2 and therefore claim 9 is rejected under the same rationale.

Regarding claim 16, all claim limitations are set forth as claim 2 in a computer program product having a non-transitory medium storing a set of instructions and rejected as per discussion for claim 2.


10.	Claims 3-6, 10-13 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Dimtrva in view of Qi, further in view of Wang et al. (“Wang”) [US-2013/0230255-A1].
Regarding claim 3, Dimtrva in view of Qi, discloses the method of Claim 1, but fails to explicitly disclose wherein the K-NN graph is generated based on identified relationships between different modalities of previous inputs and the corresponding outputs, and wherein the relationships are mapped to the K-NN graph.
However, Wang discloses 
the K-NN graph is generated based on identified relationships between different modalities of previous inputs and the corresponding outputs (Wang- Claim 1, “connecting each data point with the nearest-neighboring data points in a subset in which the data points represent nodes to form a subgraph […]”), and
wherein the relationships are mapped to the K-NN graph (Wang- Claim 1, “connecting each data point with the nearest-neighboring data points in a subset in which the data points represent nodes to form a subgraph; forming multiple subgraphs, which are to be combined to form a base approximate k-NN graph; creating additional base approximate k-NN graphs, which are merged to create the approximate k-NN graph; retrieving images similar in appearance to the image query by identifying best NN data points from the approximate k-NN graph”; ¶0005, The process connects each data point with its nearest-neighboring data points in a subset in which the data points represent nodes to form an approximate neighborhood subgraph. The process combines multiple approximate neighborhood subgraphs to create a base approximate k-NN graph. The process repeats this procedure as described to construct multiple base approximate k-NN graphs. The process further combines the multiple base approximate k-NN graphs to form an approximate k-NN graph, which merges neighbors of the multiple base approximate k-NN graphs together and keeps best k-NN data points as the new k-NN data points; ¶0037, The first phase 202 is to construct the approximate k-NN graph. For instance, the graph application 110 receives a plurality of data points that may be from digital images of people, places, or things with foregrounds and backgrounds, photographs, medical images, fingerprint images, facial features, and the like. Based on the plurality of data points received, the graph application 110 applies a multiple random divide-and-conquer approach to construct approximate neighborhood subgraphs and base approximate k-NN graphs. The graph application 110 eventually merges the base approximate k-NN graphs to form the approximate k-NN graph; ¶0060-0064, the graph application 110 constructs the approximate k-NN graph 116 as a nearest-neighbors search (NNS) problem. Every data point may be considered to be a query […] 604 with additional base approximate k-NN graphs to create the approximate k-NN graph 116).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva/Qi to incorporate the teachings of Wang, and to apply the partitioning of data points into subsets to the K-NN graph taught by Dimtrva/Qi, so that the K-NN graph is generated based on identified relationships between different modalities of previous inputs and the corresponding outputs, and wherein the relationships are mapped to the K-NN graph.
Doing so would provide the best-performing searches for images similar to an image query by using an approximate k-Nearest Neighbor (k-NN) graph.

Regarding claim 4, Dimtrva in view of Qi, discloses the method of Claim 1, and further discloses one or more machine-learning models that identify relationships between two or more modalities (Dimtrva- Fig. 8 and ¶0080, Learning module 330 receives audiovisual input vectors and creates audiovisual speaking face movement components (SFMCs) using Hidden Markov Models (step 810). Learning module 330 receives audiovisual input vectors and creates audiovisual speaking face movement components (SFMCs) and uses semantic association to obtain an association (i.e., a mapping) between phonemes and visemes (step 820)), but fails to explicitly disclose that the K-NN graph is generated using the one or more machine-learning models.

However, Wang discloses 
the K-NN graph is generated using one or more machine-learning models that identify relationships between two or more modalities (Wang- Claim 1, “connecting each data point with the nearest-neighboring data points in a subset in which the data points represent nodes to form a subgraph; forming multiple subgraphs, which are to be combined to form a base approximate k-NN graph; creating additional base approximate k-NN graphs, which are merged to create the approximate k-NN graph; retrieving images similar in appearance to the image query by identifying best NN data points from the approximate k-NN graph”; ¶0002, These neighborhood graphs are often used in computer vision and machine learning tasks, such as image retrieval, nearest neighbor search, manifold learning and dimension reduction, semi-supervised learning, manifold ranking, and clustering; ¶0005, The process connects each data point with its nearest-neighboring data points in a subset in which the data points represent nodes to form an approximate neighborhood subgraph. The process combines multiple approximate neighborhood subgraphs to create a base approximate k-NN graph. The process repeats this procedure as described to construct multiple base approximate k-NN graphs. The process further combines the multiple base approximate k-NN graphs to form an approximate k-NN graph, which merges neighbors of the multiple base approximate k-NN graphs together and keeps best k-NN data points as the new k-NN data points).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva/Qi to incorporate the teachings of Wang, and to apply machine learning tasks and modalities to the K-NN graph taught by Dimtrva/Qi, so that the K-NN graph is generated using one or more machine-learning models that identify relationships between two or more modalities.


Regarding claim 5, Dimtrva in view of Qi, discloses the method of Claim 1, and further discloses wherein processing the one or more non-video inputs (Dimtrva- ¶0006, extracts audio features and video features from audio-visual input signals that represent a speaker who is speaking; Claim 1, receiving audio-visual input signals that represent a speaker who is speaking), but fails to explicitly disclose identifying one or more nodes of a plurality of nodes of the K-NN graph, each of the one or more nodes associated with the one or more sets of nodes, that correspond to the one or more non-video inputs.
However, Wang discloses 
identifying one or more nodes of a plurality of nodes of the K-NN graph, each of the one or more nodes associated with the one or more sets of nodes, that correspond to the one or more non-video inputs (Wang- ¶0005, The process connects each data point with its nearest-neighboring data points in a subset in which the data points represent nodes to form an approximate neighborhood subgraph […] The process further combines the multiple base approximate k-NN graphs to form an approximate k-NN graph, which merges neighbors of the multiple base approximate k-NN graphs together and keeps best k-NN data points as the new k-NN data points; ¶0080, At 902, the graph application 110 determines node a 802 and node d 808 share a common tag in textual information. For instance, the graph application 110 updates 3-NN points for node a 802 by using the textual information).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Dimtrva/Qi to incorporate the teachings of Wang, and to apply nodes and textual information to the K-NN graph taught by Dimtrva/Qi, for identifying one or more nodes of a plurality of nodes of the K-NN graph, each of the one or more nodes associated with the one or more sets of nodes, that correspond to the one or more non-video inputs.
Doing so would provide the best-performing searches for images similar to an image query by using an approximate k-Nearest Neighbor (k-NN) graph.

Regarding claim 6, Dimtrva in view of Qi, discloses the method of Claim 5, and further discloses wherein determining the one or more actions to be performed (Dimtrva- ¶0007, The processor then synchronizes the facial movements of the animated version of the face of the speaker with a plurality of the audio logical units that represent the speaker's speech; ¶0065, Analysis of new incoming audio may be performed by a semantic association method to find the matching video and the most likely facial movements) but fails to explicitly disclose determining the one or more actions that correspond to the identified one or more nodes of the plurality of nodes of the K-NN graph.
However, Wang discloses 
determining the one or more actions that correspond to the identified one or more nodes of the plurality of nodes of the K-NN graph (Wang- ¶0006, the process may propagate the approximate k-NN graph by expanding from an immediate neighborhood to farther neighborhoods to locate additional nearest-neighboring nodes for each node in a best-first manner. This expansion of the approximate k-NN graph to other areas retrieves additional nearest-neighboring nodes that are true neighbors, and finally identifies the best k NN points as the refined k-NN points for each point; ¶0083, The graph application 110 propagates the visual information from one node to the other nodes to identify similar images in a best-first manner).

The systems of claims 10-13 are similar in scope to the functions performed by the methods of claims 3-6 and therefore claims 10-13 are rejected under the same rationale.

Regarding claims 17-20, all claim limitations are set forth as claims 3-6 in a computer program product having a non-transitory medium storing a set of instructions and rejected as per discussion for claims 3-6.


Conclusion
11.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US-2020/0357382-A1 to Ogawa et al., Oral, facial and gesture communication devices and computing architecture for interacting with digital media content teaches an oral communication device and related computing architectures and methods for processing data and outputting digital media content, such as via audio or visual media, or both. In another aspect, the following generally relates to computing architectures and machine intelligence to ingest large volumes of data from many different data sources, and to output digital media content (¶0002).
US-9,336,268-B1 to Moudy et al., Relativistic sentiment analyzer teaches sentiment analyzer systems may include feedback analytics servers configured to receive and analyze feedback data from various client devices. Feedback data may be received and analyzed to determine feedback context and sentiment scores (Abstract). Moudy further teaches a specific content feedback data item may represent a user's expression (e.g., via a discussion post, review, evaluation, etc.) of positive sentiment regarding a reading assignment in an eLearning course (col 53, lines 17-20).

12.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL LE whose telephone number is (571)272-5330. The examiner can normally be reached 9am-5pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MICHAEL LE/Primary Examiner, Art Unit 2619