DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 1 to 16 and 21 to 24 are objected to because of the following informalities:  
Independent claims 1, 9, and 21 set forth a limitation of “wherein the one or more context tags are determined based on (i) events occurring within the video game . . . , and (ii) attributes of the speaker”, where there is no antecedent basis for “the speaker”.  Here, “(ii) attributes of the speaker” should be “(ii) attributes of a speaker”.
Independent claims 9 and 21 set forth a limitation of “wherein the one or more context tags are determined based on (i) events occurring within the video game in the virtual context”, where there is no express antecedent basis for “the video game”.
Independent claims 9 and 21 set forth a limitation of “using the AI speech markup model to generate a structured version . . . based at least in part on the events occurring within the video game at the video game state”, where “the video game state” does not have express antecedent basis.  These independent claims provide antecedent basis for “the video game state” in a subsequent limitation of “synchronize the audio output with video output generated for a video game state”.  That is, Applicant’s amendment reverses a proper location of the antecedent basis for “a video game state”.  These claims should be amended to set forth “based at least in part on the events occurring . . . at a video game state”.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 16 and 21 to 24 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia (U.S. Patent No. 10,600,404) in view of Bala et al. (U.S. Patent Publication 2018/0071636) and Kingsbury et al. (U.S. Patent Publication 2019/0043474).
Concerning independent claims 1, 9, and 21, Garcia discloses a method and apparatus for automatic speech imitation, comprising:
“access a speech script including a sequence of words to be spoken” – user interface 22 may enable modification of media content that includes text content that is transformed/transcribed from audio content (column 3, lines 4 to 20); text content is provided to speech data modifier 40 before it is transformed into speech (column 12, lines 4 to 19: Figure 1); media content is obtained which includes speech data of a transcript from existing text, video, or audio (column 14, lines 16 to 23); a development tool enables a user to review media content, e.g., transcribed text from audio content, 
“obtain one or more context tags describing a virtual context in which the sequence of words will be spoken” – user interface 22 may enable a configurable value to be set for imitating speech, and provide a modifiable field that might be specific to a speech object, e.g., specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); emulator 20 includes a media content tagger 28 to modify media content with a tag; media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content (column 7, lines 35 to 44: Figure 1); 
“wherein the one or more context tags are determined based on (i) events occurring within the video game in the virtual context, and (ii) attributes of the speaker” – speech pattern identifier 24 and/or context behavior identifier 26 may identify a virtual reality character from game content utilizing media content metadata, e.g., scene data (“based on (i) events occurring within a video game”); speech pattern identifier 24 and/or context behavior identifier 26 may utilize media content metadata to identify respective speech identification information, where media content metadata may include information describing a character that is speaking, e.g., a virtual reality e.g., episode number (column 3, line 55 to column 4, line 21: Figure 1); speech imitation system 10 may generate imitated speech that more closely resembles speech from a particular geographic region, speech from a particular socio-economic background, speech from a historical figure, and speech from a celebrity (column 2, lines 1 to 7); context may be recognized and tagged using a <Context> tag to indicate an emotional state of a character, e.g., a character is being humorous, funny, silly, and/or hilarious (“based on . . . (ii) attributes of the speaker”) (column 17, lines 26 to 48); a programmable speech rule is configured to define a character for which a speech pattern, e.g., stuttering, is to be applied; a probability is applied that a selected character stutters under standard/baseline conditions, e.g., happy, excited, or in a triggered state, e.g., nervous (column 20, line 51 to column 21, line 3); here, media content metadata can be scene data for a video game, which is equivalent to “(i) events occurring within a video game”; broadly, “attributes of the speaker” can be construed to be a particular style of speaking of any given character name, an emotional state of a character, stuttering by a character, or imitating speech from a particular socio-economic background as defined by a speech context behavior tag;
“process the speech script and the one or more context tags using a machine-learning-based artificial intelligence (“AI”) speech markup model, wherein the artificial intelligence speech markup model is [a neural network] trained with inputs including at least a plurality of marked speech scripts and a plurality of context tags that respectively correspond to the plurality of marked speech scripts” – text content may be rendered in a markup language format that may include speech synthesis markup language (SSML) e.g., text content to be converted to speech, to mimic data, e.g., modified text content to be converted to imitated speech in a response; machine learner 32 may utilize a speech pattern tag to learn a speech pattern and/or may utilize a speech context behavior tag to learn a speech context behavior at a general-object granularity; machine learner 32 may use a speech pattern tag to learn a syntax and/or a composition of a speech pattern in media content that is customary for a group of people; machine learner 32 may utilize a speech pattern tag to learn a speech pattern and/or may utilize a context behavior tag to learn a speech context behavior at a specific-object granularity (column 7, line 51 to column 8, line 24: Figure 1); 
“wherein the artificial intelligence speech markup model is configured to generate markup tags for the speech script based at least in part on the one or more context tags” – speech emulator 20 includes a media content tagger 28 to modify media content with a tag that can include a markup language speech tag (column 7, lines 35 to 40: Figure 1); speech data modifier 40 may utilize a trained speech model to modify speech data and generate mimic data; speech data modifier 40 may apply a trained speech model to generate a markup language instruction, e.g., a SSML tag, that is to instruct 
“use the AI speech markup model to generate a structured version of the speech script that includes a plurality of markup tags added to the speech script . . .” – speech data modifier 40 may utilize a trained speech model (“use the AI speech markup model”) to modify speech data and generate mimic data; intelligent personal assistant device may provide text content to speech data modifier 40 before it is transformed into speech by applying a trained speech model to modify the text content and generate mimic data; speech data modifier 40 may apply a trained speech model to generate a markup language instruction, e.g., a SSML tag, that is to instruct speech device 12 how to modify the text content for generating mimicked speech, e.g., text content with the modification (column 11, line 32 to column 12, line 19: Figure 1); media content is modified with a tag, e.g., a speech pattern tag and/or a speech context behavior tag, to generate tagged data that is to be used as a training data set (column 14, lines 36 to 40: Figure 2: Step 60); here, using tags in a speech synthesis markup language to train a speech model provides “the AI speech markup model”; a markup language instruction is “a structured version of the speech script”;
“based at least in part on the events occurring within the video game at the video game state and the attributes of the speaker, each markup tag including a speech attribute variation” – tagged data for a stuttering speech pattern for ‘Elmer Fudd’ indicates a <Stutter> tag; tagged data for <Context> tag to indicate an emotional state may be determined from media content metadata (column 16, line 48 to column 17, line 26); here, <Stutter> tags indicating a stuttering characteristic for ‘Elmer Fudd’ is “a 
“synthesize an audio recording of the structured version of the speech script, wherein the audio recording is adjusted according to the markup tag added to the speech script” – mimic data may be forwarded to a speech device to output the mimicked speech in real-time (column 22, lines 45 to 49: Figure 3: Step 96); that is, speech (“an audio recording”) is subsequently synthesized using tags of speech synthesis markup language to modify the text content to produce mimicked speech at speech output device 14 of speech device 12 (Figure 1).
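For illustration only, the general idea recited in the limitations above (adding markup tags to a speech script based on context tags) can be sketched as follows. This sketch is not drawn from Garcia or from Applicant’s Specification; the lookup table PROSODY_BY_TAG and the function mark_up are hypothetical names, and a fixed lookup table stands in for a trained speech markup model. The <speak> and <prosody> elements are standard SSML.

```python
from xml.sax.saxutils import escape

# Hypothetical lookup standing in for a trained markup model:
# each context tag maps to attributes of an SSML <prosody> element.
PROSODY_BY_TAG = {
    "nervous": {"rate": "fast", "pitch": "high"},
    "calm": {"rate": "slow", "pitch": "low"},
}


def mark_up(script, context_tags):
    """Return an SSML string whose prosody markup is chosen from context tags."""
    attrs = {}
    for tag in context_tags:
        # Unknown context tags contribute no markup.
        attrs.update(PROSODY_BY_TAG.get(tag, {}))
    body = escape(script)  # keep the script text valid inside XML
    if attrs:
        attr_text = " ".join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        body = f"<prosody {attr_text}>{body}</prosody>"
    return f"<speak>{body}</speak>"


print(mark_up("Who goes there?", ["nervous"]))
```

Under these assumptions, a script with no recognized context tag is returned wrapped only in <speak>, while a recognized tag adds one <prosody> element around the text.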
Concerning independent claims 1, 9, and 21, Garcia discloses the main concept of these independent claims as directed to training an artificial intelligence model using machine learning to generate tags for markup language scripts to modify speech synthesis according to context that can include characteristics of a speaker and metadata of scenes in a video game.  Broadly, Garcia can be construed to disclose that “context tags are determined based on (i) events occurring within the video game in a virtual context, and (ii) attributes of the speaker”.  Applicant’s Specification, ¶[0032], ¶[0058], ¶[0064], ¶[0071], describes a broad range of embodiments for ‘speaker attributes’ including a speaker’s name, gender, age, location, place of origin, race, species, or “any other attribute”.  These ‘speaker attributes’, then, can be construed to simply include an identity of a given named character, e.g., ‘Elmer Fudd’, or an emotional state of a character.  The only elements not clearly disclosed by Garcia are that a Garcia discloses game content of a video game that includes video content and audio content, and one skilled in the art might understand it to be almost implicit that there is buffering to synchronize the audio and video components so that spoken audio matches up with a video animation.  Still, Bala et al. teaches these limitations directed to synchronizing audio and video streams using a buffer and Kingsbury et al. teaches using an artificial neural network to perform machine learning to generate an audio rendering by speech synthesis of a character model based on age, sex, race, and nationality.
Concerning independent claims 1, 9, and 21, Bala et al. teaches a video game that includes an audio-video stream combined with game graphics and game sounds.  (Abstract)  Program instructions determine game states and selection of audio and video data for presentation.  (¶[0017]: Figure 1)  Generation of game graphics and/or game sounds is synchronized to playback of audio and video data, and circuitry is provided for reading audio and video data from a local memory.  (¶[0019])  Specifically, processor 211 generally uses program instructions to control game play, and audio and video data is read from a video disc and temporarily stored in a buffer 205 (“buffering at Bala et al., then, teaches the limitations of “synchronization comprises buffering at least one of the audio output or the video output until both outputs are ready to be output”, so as to “synchronize the audio output with video output generated for the video game state during runtime of the video game”.  An objective is to generate game graphics and Bala et al. with video game content of game scene metadata that is determined by artificial intelligence of Garcia for a purpose of generating game graphics and game sounds during playback.
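For illustration only, the claimed “buffering at least one of the audio output or the video output until both outputs are ready to be output” can be sketched as follows. This sketch is not taken from Bala et al. or from Applicant’s Specification; the class AvSynchronizer and its methods are hypothetical names, and audio clips and video frames are represented by plain strings.

```python
from collections import deque


class AvSynchronizer:
    """Hold audio and video items until a matching pair is ready."""

    def __init__(self):
        self._audio = deque()  # audio clips waiting for their video
        self._video = deque()  # video frames waiting for their audio

    def push_audio(self, clip):
        self._audio.append(clip)
        return self._drain()

    def push_video(self, frame):
        self._video.append(frame)
        return self._drain()

    def _drain(self):
        # Release pairs in arrival order only while both buffers hold an item.
        pairs = []
        while self._audio and self._video:
            pairs.append((self._audio.popleft(), self._video.popleft()))
        return pairs


sync = AvSynchronizer()
print(sync.push_audio("line-1.wav"))  # audio is held; nothing is released yet
print(sync.push_video("scene-1"))     # the pair is released together
```

Under these assumptions, whichever output arrives first is held in its buffer, and nothing is released until its counterpart arrives.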
Concerning independent claims 1, 9, and 21, Kingsbury et al. teaches generating audio renderings from textual content based on character models, where character models simulate different characters by one or more of age, sex, race, nationality, or personality trait.  Context segments may represent sections of spoken dialogue by different individual speakers, wherein machine learning assigns different character models for individual speakers.  (¶[0006])  Specifically, ‘textual machine learning’ and ‘machine learning’ refer to an artificial intelligence algorithm, and may include deep learning and an artificial neural network (“wherein the artificial intelligence speech markup model is a neural network trained with inputs”).  (¶[0028])  Here, deep learning is defined as ‘a family of machine learning methods based on artificial neural networks’.  Textual analysis could be used to automatically generate a Speech Synthesis Markup Language (SSML) audio rendering.  (¶[0038])  Characters associated with each of the different character models are classified based on age, sex, race, and nationality.  (¶[0060])  An audio rendering may represent a file stored in various formats, e.g., SSML.  (¶[0079]: Figure 5)  Kingsbury et al., then, expressly teaches that a machine learning based artificial intelligence speech markup model can be “a neural network” and that context is determined “based on (ii) attributes of the speaker”.  An objective is Kingsbury et al. to perform machine learning in speech imitation of Garcia for a purpose of providing an improved method to generate audio renderings from textual content.  

Concerning claims 2, 10, and 22, Garcia discloses that user interface 22 may provide a modifiable field, e.g., a dropdown menu, a click menu, etc., that might be specific to a speech object, specific to a speech pattern, or specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content; media content tagger 28 may automatically add a speech pattern tag to the media content or automatically add a speech context behavior tag to the media content (“receive user input that adds, deletes, or modifies at least one markup tag in the structured version of the speech script”) (column 7, lines 35 to 50: Figure 1); a development tool may enable a user to review media content and determine a distribution of speech pattern and/or a speech context behavior in the media content to generate a programmed rule; block 70 may provide a character input field for a configurable speech object value defining a speech object for which one or more of a selected speech pattern or a selected speech context behavior is to be applied (“adjust the AI speech markup model using the received user input as feedback”) (column 19, line 66 to column 20, line 15: Figure 2); here, a development tool that enables a user to 
Concerning claims 3, 11, and 24, Garcia discloses that user interface 22 may enable modification of media content including game content, educational content, social application content, text content, and video content; user interface 22 may enable a configurable value to be set for properly imitating speech as a specific context behavior (“display respective context tags for a plurality of speech scripts to one or more users”) (column 3, lines 4 to 26: Figure 1); media content is obtained; context data is obtained corresponding to the media content; media content is modified with a tag to generate tagged data that is to be used as a training data set (“receive, from the one or more users, the plurality of marked speech scripts”) (column 14, lines 17 to 40: Figure 2); implicitly, media content including a text transcript is displayed for selection on user interface 22, and media content modified with tags by a user is received from user interface 22.
Concerning claims 4 and 12, Garcia discloses that user interface 22 may provide a modifiable field, e.g., a dropdown menu, a click menu, etc., that might be specific to a speech object, specific to a speech pattern, or specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content; media content tagger 28 may automatically add a speech pattern tag to the media content or automatically add a speech context behavior tag to the media content (column 7, lines 35 to 50: Figure 1); here, tagging a part of media content is at least 
Concerning claims 5 to 6 and 13 to 14, Kingsbury et al. teaches that a machine learning algorithm is adjusted over multiple iterations based on data by supervised machine learning, unsupervised machine learning, and/or reinforcement learning (“wherein the artificial intelligence speech markup model is generated using an AI training system comprising at least one of a supervised machine learning system . . . and an unsupervised machine learning system”).  (¶[0028])  
Concerning claims 7, 15, and 23, Garcia broadly discloses an application to a gaming console, that a character may be a virtual reality (VR) character from game content, and using metadata from scene content, e.g., an episode number.  (Column 1, Line 58; Column 3, Line 55 to Column 4, Line 21; Column 5, Lines 38 to 45)  Arguably, Garcia discloses “dynamically generating the one or more context tags in a video game based on a video game state that indicates a context in which the speech script is configured to be read” and “dynamically generate, during video game runtime, the one or more context tags in a video game based on a video game state in which the speech script is configured to be read.”  That is, Garcia discloses characters from video games whose speech obtained from transcribed text is modified with tags based on metadata from a given scene, where a given scene of a video game can be construed as “a video game state”.  Even if context tags are not dynamically generated “based on a video game state” by Garcia, Bala et al. teaches that a video game state is used to determine selection of audio and video from a PlayItem and a PlayList.  (¶[0041] - ¶[0044]: Figure Bala et al.’s PlayList, then, corresponds to a ‘script’ of a video game that establishes ‘a video game state’ and can be used to generate ‘context tags’ of Garcia.  
Concerning claims 8 and 16, Garcia broadly discloses an application to a gaming console, that a character may be a virtual reality (VR) character from game content, and using metadata from scene content, e.g., an episode number.  (Column 1, Line 58; Column 3, Line 55 to Column 4, Line 21; Column 5, Lines 38 to 45)  Moreover, each character may speak according to a respective personality, e.g., a southern drawl or voicing tonality.  (Column 12, Lines 19 to 41)  Arguably, Garcia discloses “the context tags include at least two of: a video game title or series . . . a speaker attribute”.  However, even if two of these tags are omitted by Garcia, then at least one additional tag is taught by Bala et al.  Specifically, Bala et al. teaches that a process establishes an initial game state including a difficulty level (“a video game level”) and game player status.  (¶[0031]; ¶[0038]; ¶[0047]: Figures 3 to 5)  Moreover, a game state is broadly equivalent to “a location in the video game”.

Response to Arguments
Applicant’s arguments filed 15 June 2021 have been considered but are moot in view of new grounds of rejection as necessitated by amendment.
Applicant’s amendments overcome the prior claim objections and objections to the Specification.  However, Applicant’s claim amendments necessitate new claim objections as directed to the limitations of “the speaker”, “the video game”, and “the video game state”.
Garcia (U.S. Patent No. 10,600,404) and Bala et al. (U.S. Patent Publication 2018/0071636).  However, these arguments merely cite the claim language of the independent claims, but do not provide any significant reasoning that the claims are allowable over the prior art of record.
New grounds of rejection are set forth as directed to independent claims 1, 9, and 21 as being obvious under 35 U.S.C. §103 over Garcia (U.S. Patent No. 10,600,404) in view of Bala et al. (U.S. Patent Publication 2018/0071636) and Kingsbury et al. (U.S. Patent Publication 2019/0043474).  Here, these new grounds of rejection simply bring up Kingsbury et al., which was applied to certain dependent claims, to be applied to the independent claims.  The Office Action now relies upon only  Garcia, where these modifications include commonly-known and conventional variations, e.g., buffering to synchronize audio and video and using a neural network to perform machine learning.
Specifically, Bala et al. and Kingsbury et al. teach whatever limitations might be omitted by Garcia.  Here, Bala et al. teaches synchronizing audio and video using a buffer for a video game and determining game states for selection and presentation of the audio and video.  Kingsbury et al. teaches generating audio renderings from textual content using character models generated by a machine learning algorithm that uses a neural network and ‘speaker attributes’ that include age, sex, race, and nationality of characters.  Compare Specification, ¶[0032], ¶[0058], ¶[0064], ¶[0071], describing ‘speaker attributes’ as including a speaker’s name, gender, age, location, place of origin, race, species, or “any other attribute”.  These teachings somewhat overlap with those of Garcia.  Specifically, Garcia appears to disclose modifying speech of a character according to video game content that includes media content metadata of a game scene (“(i) events occurring within the video game at the video game state”) and according to a name of a character, e.g., ‘Elmer Fudd’, emotional state of a character, or propensity of a character to stutter under certain conditions (“(ii) attributes of the speaker”).  Even if Garcia does not disclose buffering to synchronize audio and video or a neural network to perform machine learning, these are conventional features.
Many of these new limitations are disclosed by Garcia.  Applicant’s “wherein the one or more context tags are determined based on (i) events occurring with the video Garcia.  Similarly, Applicant’s “(ii) attributes of the speaker” is disclosed because there are various embodiments that meet a definition of a ‘speaker attribute’ including a name of a character, an emotion of a character, or if a character is likely to stutter in a given scene in Garcia.  Broadly, Garcia’s speech pattern could be understood to correspond to Applicant’s ‘speaker attribute’ and Garcia’s speech context behavior could be understood to correspond to Applicant’s ‘video game state’.  That is, a speech pattern can be understood to correspond to a characteristic of speech for any given character and a speech context behavior can be understood to correspond to how a speech characteristic might change in a given context of a video according to media content metadata.  Moreover, Applicant’s new limitation of “wherein the artificial intelligence speech markup model is configured to generate markup tags for the speech script based at least in part on the one or more context tags” appears to be at least somewhat repetitive of a limitation of “using the AI speech markup model to generate a structured version of the speech script that includes a plurality of markup tags added to the speech script”.
Any new grounds of rejection are necessitated by amendment.  This Office Action is NON-FINAL.

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant’s disclosure.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MARTIN LERNER/Primary Examiner
Art Unit 2657
August 16, 2021