DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The disclosure is objected to because of the following informalities:
In ¶[0054], “video game code 136” should be “video game code 162”.  See Figure 1C.
In ¶[0143], “game media 12” (one occurrence) is not illustrated in Figure 8.
Appropriate correction is required.

Claim Objections
Claims 7, 9 to 16 and 21 to 24 are objected to because of the following informalities:
Independent claims 9 and 21 set forth “the video game state”, which lacks express antecedent basis.  Independent claim 1 sets forth a limitation of “a game state of the video game”, but there is no corresponding limitation set forth by independent claims 9 and 21.
Claims 7, 15, and 23 set forth “a video game based on a video game state”, which should be “the video game based on the video game state” because “a video game” and “a video game state” are already recited by independent claims 1, 9, and 21.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 4, 7 to 12, 15 to 16, and 21 to 24 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia (U.S. Patent No. 10,600,404) in view of Bala et al. (U.S. Patent Publication 2018/0071636).
Concerning independent claims 1, 9, and 21, Garcia discloses a method and apparatus for automatic speech imitation, comprising:
“access a speech script including a sequence of words to be spoken” – user interface 22 may enable modification of media content that includes text content that is transformed/transcribed from audio content (column 3, lines 4 to 20); text content is provided to speech data modifier 40 before it is transformed into speech (column 12, lines 4 to 19: Figure 1); media content is obtained which includes speech data of a e.g., transcribed text from audio content, and determine a distribution of speech pattern and/or speech context behavior in the media content (column 20, lines 1 to 6); here, a text transcript or transcribed text that is to be transformed into speech is “a speech script” and text that is to be transformed into speech is “a sequence of words to be spoken”; Table 6 provides various ‘speech scripts’ with “a sequence of words” of “I’m happier than a tornado in a trailer park!” and “This is better than tractor tipping!” (column 10, line 60 to column 11, line 15: Table 6);
“obtain one or more context tags describing a virtual context in which the sequence of words will be spoken” – user interface 22 may enable a configurable value to be set for imitating speech, and provide a modifiable field that might be specific to a speech object, e.g., specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); speech pattern identifier 24 and/or context behavior identifier 26 may utilize media content metadata, e.g., scene data, to identify when a particular speech object is talking, and utilize the media content itself to identify when a particular speech object is talking (column 3, line 63 to column 4, line 6: Figure 1); emulator 20 includes a media content tagger 28 to modify media content with a tag; media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content (column 7, lines 35 to 44: Figure 1);
“process the speech script and the one or more context tags using an artificial intelligence (“AI”) speech markup model, wherein the artificial intelligence speech markup model is trained with inputs including at least a plurality of marked speech e.g., text content to be converted to speech, to mimic data, e.g., modified text content to be converted to imitated speech in a response; machine learner 32 may utilize a speech pattern tag to learn a speech pattern and/or may utilize a speech context behavior tag to learn a speech context behavior at a general-object granularity; machine learner 32 may use a speech pattern tag to learn a syntax and/or a composition of a speech pattern in media content that is customary for a group of people; machine learner 32 may utilize a speech pattern tag to learn a speech pattern and/or may utilize a context behavior tag to learn a speech context behavior at a specific-object granularity (column 7, line 51 to column 8, line 24: Figure 1); a machine learning model is equivalent to “an artificial intelligence (‘AI’) speech markup model”; 
“use the AI speech markup model to generate a structured version of the speech script that includes at least one markup tag added to the speech script, the markup tag including a speech attribute variation” – speech data modifier 40 may utilize a trained e.g., a SSML tag, that is to instruct speech device 12 how to modify the text content for generating mimicked speech, e.g., text content with the modification (column 11, line 32 to column 12, line 19: Figure 1); media content is modified with a tag, e.g., a speech pattern tag and/or a speech context behavior tag, to generate tagged data that is to be used as a training data set (column 14, lines 36 to 40: Figure 2: Step 60); here, using tags in a speech synthesis markup language to train a speech model provides “the AI speech markup model”; a markup language instruction is “a structured version of the speech script”;
“synthesize an audio recording of the structured version of the speech script, wherein the audio recording is adjusted according to the markup tag added to the speech script” – mimic data may be forwarded to a speech device to output the mimicked speech in real-time (column 22, lines 45 to 49: Figure 3: Step 96); that is, speech (“an audio recording”) is subsequently synthesized using tags of speech synthesis markup language to modify the text content to produce mimicked speech at speech output device 14 of speech device 12 (Figure 1).
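For illustration only, the notion of “a structured version of the speech script” carrying a markup tag with “a speech attribute variation” can be sketched as follows.  This sketch is not taken from Garcia or from the present application.  The function name, the context tag names, and the fixed lookup table are hypothetical, and the table merely stands in for the output of a trained model.  The only markup assumed is the standard SSML `speak` and `prosody` elements.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping from a context tag to SSML prosody attributes.
# A trained model would predict these values; a fixed table stands in here.
PROSODY_BY_CONTEXT = {
    "excited": {"rate": "fast", "pitch": "high"},
    "somber": {"rate": "slow", "pitch": "low"},
}


def mark_up_script(script, context_tags):
    """Return an SSML string wrapping `script` in a prosody tag chosen from context."""
    attrs = {}
    for tag in context_tags:
        # Unknown context tags contribute no attributes.
        attrs.update(PROSODY_BY_CONTEXT.get(tag, {}))
    body = escape(script)  # escape &, <, > so the script is valid XML text
    if attrs:
        attr_text = " ".join(
            f"{name}={quoteattr(value)}" for name, value in sorted(attrs.items())
        )
        body = f"<prosody {attr_text}>{body}</prosody>"
    return f"<speak>{body}</speak>"
```

A speech synthesizer that accepts SSML could then adjust rate and pitch according to the added tag, which is the sense in which the audio is “adjusted according to the markup tag”.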
Concerning independent claims 1, 9, and 21, the only elements omitted by Garcia are “synchronizing the audio output with video output generated for the video game state during runtime of the video game, wherein synchronizing comprises buffering at least one of the audio output or the video output until both outputs are ready to be output”.  Garcia discloses game content of a video game that includes video content and audio content, and one skilled in the art might understand it to be almost implicit that there is buffering to synchronize the audio and video components so that spoken audio matches up with a video animation.  Moreover, Garcia discloses utilizing metadata for media content, where this metadata may be ‘scene data’ identifying a scene, to determine that a particular scene object is talking.  (Column 3, Line 55 to Column 4, Line 6; Column 5, Lines 40 to 45)  Broadly, ‘scene data’ can be construed to provide “a video game state during runtime of the video game” so that metadata of a scene in a video game corresponds to ‘a video game state’.  Still, Bala et al. teaches whatever limitations might not be expressly disclosed by Garcia.
Concerning independent claims 1, 9, and 21, Bala et al. teaches a video game that includes an audio-video stream combined with game graphics and game sounds.  (Abstract)  Program instructions determine game states and selection of audio and video data for presentation.  (¶[0017]: Figure 1)  Generation of game graphics and/or game sounds is synchronized to playback of audio and video data, and circuitry is provided for reading audio and video data from a local memory.  (¶[0019])  Specifically, processor 211 generally uses program instructions to control game play, and audio and video data is read from a video disc and temporarily stored in a buffer 205 (“buffering at least one of the audio output or the video output”).  The buffer passes the audio and video data to media decoder 207 as audio and video streams, and decoded audio and Bala et al., then, teaches the limitations of “synchronization comprises buffering at least one of the audio output or the video output until both outputs are ready to be output”, so as to “synchronize the audio output with video output generated for the video game state during runtime of the video game”.  An objective is to generate game graphics and game sounds synchronized to commanded playback of audio and video data.  (¶[0019])  It would have been obvious to one having ordinary skill in the art to synchronize output of audio and video based on a game state in a video game by buffering as taught by Bala et al. with video game content that is determined by artificial intelligence of Garcia for a purpose of generating game graphics and game sounds during playback.
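For illustration only, the general concept of “buffering at least one of the audio output or the video output until both outputs are ready to be output” can be sketched as follows.  This sketch does not reproduce buffer 205 or any other circuitry of Bala et al.  The class and method names are hypothetical, and pairing items in arrival order is a simplifying assumption in place of timestamp matching.

```python
from collections import deque


class AvSynchronizer:
    """Holds audio and video items until a matching pair is available."""

    def __init__(self):
        self._audio = deque()  # audio chunks waiting for their video frame
        self._video = deque()  # video frames waiting for their audio chunk

    def push_audio(self, chunk):
        """Buffer an audio chunk and return any pairs that are now ready."""
        self._audio.append(chunk)
        return self._release()

    def push_video(self, frame):
        """Buffer a video frame and return any pairs that are now ready."""
        self._video.append(frame)
        return self._release()

    def _release(self):
        # Output only when both sides have an item; otherwise keep buffering.
        ready = []
        while self._audio and self._video:
            ready.append((self._audio.popleft(), self._video.popleft()))
        return ready
```

Whichever output arrives first is held in its queue, and nothing is released until its counterpart arrives.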

Concerning independent claim 1, this independent claim is almost identical to independent claims 9 and 21, but includes an additional limitation directed to generating context tags “based on a game state of the video game in which the sequence of words will be spoken”.  Here, Garcia broadly discloses an application to a gaming console, that a character may be a virtual reality (VR) character from game content, and using metadata from scene content, e.g., an episode number.  (Column 1, Line 58; Column 3, Line 55 to Column 4, Line 21; Column 5, Lines 38 to 45)  A scene of a video game broadly corresponds to “a game state”.  Moreover, Bala et al. teaches a video game with PlayList and PlayItem, which corresponds to a ‘script of a video game’, and 

Concerning claims 2, 10, and 22, Garcia discloses that user interface 22 may provide a modifiable field, e.g., a dropdown menu, a click menu, etc., that might be specific to a speech object, specific to a speech pattern, or specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content; media content tagger 28 may automatically add a speech pattern tag to the media content or automatically add a speech context behavior tag to the media content (“receive user input that adds, deletes, or modifies at least one markup tag in the structured version of the speech script”) (column 7, lines 35 to 50: Figure 1); a development tool may enable a user to review media content and determine a distribution of speech pattern and/or a speech context behavior in the media content to generate a programmed rule; block 70 may provide a character input field for a configurable speech object value defining a speech object for which one or more of a selected speech pattern or a selected speech context behavior is to be applied (“adjust the AI speech markup model using the received user input as feedback”) (column 19, line 66 to column 20, line 15: Figure 2); here, a development tool that enables a user to review and modify output of a trained speech model uses “received user input as feedback.” 
Garcia discloses that user interface 22 may enable modification of media content including game content, educational content, social application content, text content, and video content; user interface 22 may enable a configurable value to be set for properly imitating speech as a specific context behavior (“display respective context tags for a plurality of speech scripts to one or more users”) (column 3, lines 4 to 26: Figure 1); media content is obtained; context data is obtained corresponding to the media content; media content is modified with a tag to generate tagged data that is to be used as a training data set (“receive, from the one or more users, the plurality of marked speech scripts”) (column 14, lines 17 to 40: Figure 2); implicitly, media content including a text transcript is displayed for selection on user interface 22, and media content modified with tags by a user is received from user interface 22.
Concerning claims 4 and 12, Garcia discloses that user interface 22 may provide a modifiable field, e.g., a dropdown menu, a click menu, etc., that might be specific to a speech object, specific to a speech pattern, or specific to a speech context behavior (column 3, lines 20 to 27: Figure 1); media content tagger 28 may utilize input from user interface 22 to select a part of media content identified as a particular speech pattern and/or identified as a particular context pattern behavior and tag that part of the media content; media content tagger 28 may automatically add a speech pattern tag to the media content or automatically add a speech context behavior tag to the media content (column 7, lines 35 to 50: Figure 1); here, tagging a part of media content is at least “receive the plurality of context tags from user input” (“receive the plurality of context 
Concerning claims 7, 15, and 23, Garcia broadly discloses an application to a gaming console, that a character may be a virtual reality (VR) character from game content, and using metadata from scene content, e.g., an episode number.  (Column 1, Line 58; Column 3, Line 55 to Column 4, Line 21; Column 5, Lines 38 to 45)  Arguably, Garcia discloses “dynamically generating the one or more context tags in a video game based on a video game state that indicates a context in which the speech script is configured to be read” and “dynamically generate, during video game runtime, the one or more context tags in a video game based on a video game state in which the speech script is configured to be read.”  That is, Garcia discloses characters from video games whose speech obtained from transcribed text is modified with tags based on metadata from a given scene, where a given scene of a video game can be construed as “a video game state”.  Even if context tags are not dynamically generated “based on a video game state” by Garcia, Bala et al. teaches that a video game state is used to determine selection of audio and video from a PlayItem and a PlayList.  (¶[0041] - ¶[0044]: Figure 4)  Bala et al.’s PlayList, then, corresponds to a ‘script’ of a video game that establishes ‘a video game state’ and can be used to generate ‘context tags’ of Garcia.  
Concerning claims 8 and 16, Garcia broadly discloses an application to a gaming console, that a character may be a virtual reality (VR) character from game content, and using metadata from scene content, e.g., an episode number.  (Column 1, Line 58; Column 3, Line 55 to Column 4, Line 21; Column 5, Lines 38 to 45)  Moreover, each character may speak according to a respective personality, e.g., a southern drawl or Garcia discloses “the context tags include at least two of: a video game title or series . . . a speaker attribute”.  However, even if two of these tags are omitted by Garcia, then at least one additional tag is taught by Bala et al.  Specifically, Bala et al. teaches that a process establishes an initial game state including a difficulty level (“a video game level”) and game player status.  (¶[0031]; ¶[0038]; ¶[0047]: Figures 3 to 5)  Moreover, a game state is broadly equivalent to “a location in the video game”.

Claims 5 to 6 and 13 to 14 are rejected under 35 U.S.C. 103 as being unpatentable over Garcia (U.S. Patent No. 10,600,404) in view of Bala et al. (U.S. Patent Publication 2018/0071636) as applied to claims 1 and 9 above, and further in view of Kingsbury et al. (U.S. Patent Publication 2019/0043474).
Generally, Garcia discloses training a model using text marked up by context for machine learning, but does not expressly disclose that the speech markup model is generated using at least one of “a supervised machine learning model element”, “a semi-supervised machine learning model element”, and “an unsupervised machine learning model element”.  However, it is commonly known to train models in speech processing by supervised or unsupervised training.  That is, supervised training is known in the art to use training data that is annotated or labeled by a user and unsupervised training is known in the art to use training data that is not annotated or labeled by a user.  Inherently, Garcia would appear to be using “a supervised machine learning model element” because a user may apply tags to media content through a user interface, and this tagged media content is then used as training data to train a model.  Garcia additionally states that a media content tagger may automatically add a speech pattern tag to the media content, and then use that training data to generate a trained speech model, where this automatically tagged media content would appear to be characteristic of unsupervised machine learning.  (Column 7, Lines 35 to 64: Figure 1)  Garcia appears, then, at least, to disclose a supervised machine learning system.  In any event, even if supervised and unsupervised machine learning are not disclosed by Garcia, then these are expressly taught by Kingsbury et al.  Generally, Kingsbury et al. teaches generating audio rendering from textual content based on character models using textual machine learning to designate character models.  (Abstract)  Specifically, a machine learning algorithm is adjusted over multiple iterations based on data by supervised machine learning, unsupervised machine learning, and/or reinforcement learning.  (¶[0028])  An objective is to provide an improved method to generate audio renderings from textual content.
It would have been obvious to one having ordinary skill in the art to use supervised machine learning or unsupervised machine learning to generate character models as taught by Kingsbury et al. to perform machine learning in speech imitation of Garcia for a purpose of providing an improved method to generate audio renderings from textual content.

Response to Arguments
Applicant’s arguments filed 01 March 2021 have been considered but are moot in view of new grounds of rejection as necessitated by amendment.
Applicant’s amendments overcome the objections to the drawings and the rejection for non-statutory subject matter under 35 U.S.C. §101.

Applicant amends independent claims 1, 9, and 21 to set forth a new limitation directed to “synchronizing the audio output with video output generated for the video game state during runtime of the video game, wherein synchronizing comprises buffering at least one of the audio output or the video output until both outputs are ready to be output.”  Generally, Applicant’s argument then is that this new limitation is not disclosed in the prior rejection of anticipation under 35 U.S.C. §102(a)(2) by Garcia (U.S. Patent No. 10,600,404), or disclosed and taught in the obviousness rejection under 35 U.S.C. §103 over Garcia (U.S. Patent No. 10,600,404) in view of Ishikawa et al. (U.S. Patent Publication 2017/0178622).
Applicant’s amendments necessitate new grounds of rejection for independent claims 1, 9, and 21 as being obvious under 35 U.S.C. §103 over Garcia (U.S. Patent No. 10,600,404) in view of Bala et al. (U.S. Patent Publication 2018/0071636).  Applicant’s amendment, then, overcomes any anticipation rejection under 35 U.S.C. §102(a)(2), but there remains a new obviousness rejection under 35 U.S.C. §103.  The rejection no longer relies upon Ishikawa et al.  The rejection of some of the dependent claims continues to rely upon Kingsbury et al. (U.S. Patent Publication 2019/0043474).  Bala et al. is maintained to teach the new limitations of the independent claims.  
Buffering of audio and video to enable synchronization is well known.  One skilled in the art could almost allege that this is implicit for Garcia, as it is well known that whenever audio and video might be received or processed at different rates, then buffering to enable synchronization is, at best, an obvious expedient.  Moreover, Garcia could broadly be understood to disclose a “game state” of “a video game” because an application to a video game is noted in various passages and media content metadata for scenes can be construed to provide a game state.  (See, e.g., Column 3, Lines 5 to 20 and Column 3, Line 58 to Column 4, Line 6)  Even if these limitations are not implicit for Garcia, they are expressly taught by Bala et al.
Specifically, Bala et al. teaches a video game that includes an audio-video stream, and determining game states for selection of audio and video data.  (¶[0017])  Generation of game graphics and game sounds is synchronized by reading audio and video data from a local memory that can include buffer 205.  (¶[0019]; ¶[0023]: Figure 2)  PlayLists and PlayItems signal decoding of audio and video at beginning times and ending times.  (¶[0029])  Updating of a game state provides for synchronization in progression of audio-video background including game sounds according to timestamps.  A processor controlling a movie player may wait for playback of an audio-video background to reach a certain time stamp before commanding display of an updated overlay.  (¶[0034] - ¶[0035]; ¶[0041]; ¶[0051] - ¶[0052]: Figures 3 to 5)  Bala et al., then, teaches “synchronizing the audio output with video output generated for the video game state during runtime of the video game, wherein synchronizing comprises buffering at least one of the audio output or the video output until both outputs are ready to be output.”  Bala et al.’s PlayLists and PlayItems provide a ‘script’ of video game code corresponding to “a game state of the video game”.
These new grounds of rejection are necessitated by amendment.  Accordingly, this rejection is properly FINAL.

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant’s disclosure.
Kadikario et al., Bruzzo et al., Kosai et al., Perry, and Payzer et al. disclose related prior art.
Applicant's amendment necessitated the new grounds of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP §706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MARTIN LERNER/Primary Examiner
Art Unit 2657
March 10, 2021