DETAILED ACTION
This action is responsive to the Amendment filed on 04/06/2022. Claims 1, 9-11, 13, and 17 have been amended. Claims 1-20 remain pending in the case. Claims 1, 11, and 17 are independent claims.

Claim Objections
Claims 11-16 are objected to because of the following informalities:
Claim 11:
Line 11 recites “is dependent first user upon input” where “is dependent upon first user input” was apparently intended.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 9-15, and 17-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Merrill et al. (US Patent No. 6,181,351, hereinafter “Merrill”) in view of Chew et al. (US Patent Application Pub. No. 2017/0083620, hereinafter “Chew”).

As to independent claim 1, Merrill shows a system [e.g. figs. 1 or 2] comprising:
one or more processors [e.g. processing unit 21 (fig. 1; col. 4, line 56 – col. 5, line 67)];
an application engine [e.g. a “linguistically enhanced sound file,” its concomitant “enhanced sound file player” (col. 6, lines 32-40), and/or “speech recognition engine 212” (col. 12, line 23)] stored in memory [e.g. memory 22 (fig. 1)] and executable to generate a dynamic user-influenced media experience [e.g. “computer-generated animation, and more specifically to synchronizing animation with recorded speech” (col. 1, lines 6-8)];
and a development tool [e.g. “sound file tool” (col. 6, lines 35-36) and/or “linguistic information and sound editing tool 208” (col. 12, line 21)] for adding electronically-driven effects [e.g. animations and/or other effects (col. 6, lines 32-40)] to the dynamic user-influenced media experience, the development tool being stored in the memory [e.g. memory 22 (fig. 1)] and executable by the one or more processors to:
receive {…} an audio trigger [e.g. “linguistic events are used to synchronize some action in the animation” (col. 12, lines 27-28) and/or the “bookmarks” defining which notifications to send upon detecting audio triggering occurrences (col. 15, lines 01-17)] corresponding to one or more words or phrases appearing in a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience [e.g. the one or more words or phrases appearing in a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience, as illustrated in fig. 6 and/or col. 16, lines 27-46.];
 receive second developer input further defining an event to be executed by the application engine in temporal association with an audible occurrence of the audio trigger during the dynamic user-influenced media experience [e.g. receiving developer input (like directly manipulating markers 384 and/or 390 in fig. 6) to further define an event to be executed by the application engine in temporal association with an audible occurrence of the audio trigger during the dynamic user-influenced media experience (for further context, see also col. 13, line 19 – col. 14, line 7)],
wherein execution of the event is dependent upon first user input provided by an end user during execution of the dynamic user-influenced media experience [For further context into the “developer” aspect of the second input, see how “[…] a game program may present an animated character for entertainment, or an educational program may include an animated teacher character. In addition, animated characters are a useful part of social interfaces that present an interactive interface with human qualities. For instance, an animated character may appear on a computer display to help a user having difficulty completing a function or to answer questions. The character's creators may give it certain human traits reflected in gestures and other behavior, and the character may be programmed to react to actions by the user.” (col. 1, lines 15-25) and how “[… a] common arrangement is to create the linguistically enhanced sound file on a development computer, test the file using a player, and then distribute the file to computers with access to a player. […]” (col. 7, lines 33-36)
For even further evidence of how the execution of the event is dependent upon first user input provided by an end user during execution of the dynamic user-influenced media experience, see also how: 
 “The sound file tool 108 acquires the text string 104 and the speech sound data stream 106 at step 152 (FIG. 3). The text string 104 is a textual version of what is spoken in the speech sound data stream 106. For example, the text string 104 might be an ASCII text string and the speech sound data stream 106 might be a sound file produced by digitally sampling (e.g., with a microphone) a person speaking the words of the text string 104.” (col. 6, lines 51-58)
 “The linguistic information and sound editing tool 208 acquires the speech sound data at step 252 (FIG. 5). In the illustrated embodiment, the speech sound data 206 is of the familiar WAV sound format (also known as RIFF format). The data 206 is acquired by opening a saved file or by sampling an input device such as the microphone 62 (FIG. 1) or some other sound input device. […]” (col. 9, lines 32-40)
“In the final stages of development, a linguistically enhanced sound file 512 can be created by recording a human voice (e.g., professional vocal talent) and incorporated into the character animation 508 with a minimum of changes to the programming code in the application 502. In this way, the resulting application presents high quality animation while avoiding some of the development costs associated with using a human voice. In both cases, the character animation 508 presents an animation in which the character's mouth (and optionally, a word balloon) are synchronized with the speech sound output. However, the linguistically enhanced sound file 512 provides a superior animation with more realistic speech sound output.” (col. 19, lines 18-30)]; and 
output a metadata file [e.g. “linguistically enhanced sound file 232” (fig. 4)] including at least one timestamp defined relative to a start of the audio content stream, the at least one timestamp being associated in the metadata file with the defined event [“{…} appropriate member functions of the ISRResGraph programming interface 220 are employed to generate the word break information 216 and the phoneme information 218 from the speech recognition results object 214. The word break information 216 is a list of words and time values indicating when they occur within the speech sound data 206. The phoneme information 218 is a list of phoneme codes associated with the International Phonetic Alphabet and time values indicating when the phonemes occur in the speech sound data 206. The time values are represented by a start and stop offset indicating a number of bytes from the start of the speech sound data 206.
For example, the word break information 216 might contain a list of 10 words, the first of which being “Ha.” The start and stop offsets would indicate the number of bytes from the beginning of the speech sound data 206 the word “Ha” started and stopped. {…}
At step 266, the speech sound data 206 is annotated with the word break information 216 and the phoneme information 218 to create a linguistically enhanced sound file 232. In the illustrated embodiment, the linguistic information and sound editing tool 208 combines the speech sound data 206, the word break information 216, and the phoneme information 218 into a single file 232 containing an audio chunk 234, a word marking list 236, and a phoneme marking list 238. The audio chunk is a part of the file 232 (e.g., a set of bytes) containing audio data. Typically, the audio chunk 234 is of the same format (e.g., WAV) as the speech sound data 206, but can be of some other format. The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). The phoneme marking list 238 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of a string of hex codes corresponding to individual IPA phonemes in the form of 0xhhhh, where each “h” denotes a single hex digit. For example, a string might be “0x00f0,” which represents the English phoneme /ð/ (which is pronounced as the “th” in “they”). The lists could be implemented in other ways. For example, the file could be divided into frames, and the phoneme and word break data scattered throughout the file in the frames.” (col. 11, line 56 – col. 12, line 64) | For even further context/examples, see e.g. the “linguistically enhanced sound file 232” (fig. 4) and/or also col. 15, line 56 – col. 16, line 46.], the application engine being adapted to:
synchronize a read pointer for the metadata file with a playback pointer for the audio content stream [“{…} the synchronization data chunk 115 includes a phoneme type (or a word) and a timing reference used to synchronize playback of the phoneme (or word) with the animation. {…}” (col. 7, lines 22-25) | For further context/examples, see also col. 18, lines 30-53 and/or the other mappings provided herein.]; 
modify timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience [see, e.g. how second user input received during the dynamic user-influenced media experience modifying time-associated edges 386 or 388 correspondingly modifies the timing of the audio content stream (fig. 6; col. 13, line 63 – col. 14, line 4)]; and 
based on the modified timing and the synchronization of the read pointer with the playback pointer, execute the defined event when the playback pointer for the audio content stream reaches a position originally associated with the timestamp while concurrently presenting the dynamic user-influenced media experience [“At step 458, the audio player 424 plays the audio segments in the audio stream to send a decompressed audio data stream to the sound output device 420. When it encounters a bookmark in the audio stream, the audio player 424 sends a notification back to the sound file player 414 using the callback mechanism set up during step 450. The notification includes information in the bookmark indicating how to process the notification.
At step 460, the sound file player 414, having received a notification from the audio player 424, sends a notification to an appropriate interface of the animation server, as determined by information from the bookmark (e.g., a next word interface or a phoneme interface) {…} to maintain synchronicity with the sound output from the sound output device 420.
As the linguistically enhanced sound file player traverses the audio chunk 406, it reiterates steps 456-460 until it reaches the end of the audio chunk 406. At such time, other linguistically enhanced sound files 404 can be provided for additional utterances.
When the interface of the animation server 422 for next word notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8B. At step 472, the animation server 422 displays the next word in the utterance in the word balloon animation module 434.
When the interface of the animation server 422 for phoneme notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8C. As part of the notification, a phoneme code is provided. At step 482, the animation server 422 maps the phoneme code to one of seven mouth shapes using the phoneme mapping table 416. An alternative implementation could be constructed without the phoneme mapping table 416, if, for example, the phoneme marking list 410 contained mouth shape values instead of phoneme values. Such an arrangement could be accomplished by performing the mapping while creating the linguistically enhanced sound file 404. Alternatively, the linguistically enhanced sound file player 414 could compute mouth shape values internally and send the mouth shape values to the animation server 422, rather than sending phoneme values. The animation server 422 then displays the mouth shape in the mouth animation module 432 at step 484.
In the illustrated embodiment, the notifications are processed immediately by the animation server. In an alternative embodiment, time information could be included in the notification, and the animation server 422 could use the time information to determine when to process the notifications. Yet another embodiment could send a list of notifications, each element of the list containing a start and stop time value and either a word or a phoneme value. In addition, start and stop time values might not be necessary in every instance. Instead, a single time (e.g., a start time) value might suffice.” (col. 15, lines 05-65) 
“The mouth animation module 432 typically provides a choice of seven different mouth shapes that can be displayed for a character. Typically, the mouth shapes are loaded from a mouth data file containing a set of bitmap images that can be customized for the particular character being presented. {…}
The word balloon animation module 434 places the word balloon in an appropriate position with respect to the animated character and displays an indicated word in the balloon upon being sent a message or notification. The module also manages the size and shape of the balloon and places words in the balloon. A feature allows the word balloon to be disabled, enabled with all the words appearing at once, or enabled with words appearing as they are spoken.
As a result of executing the steps indicated above, the animation elements generated by the word balloon and mouth animation modules 434 and 432 are synchronized with the audio chunk 406 as presented by the sound output device 420, presenting the illusion that an animated character is speaking. However, the features in the above description could be used for other purposes, such as controlling animation color or triggering some event in a computer presentation. For example, a window could be colored red upon detecting a word (e.g., “angry”) or a slide show presentation could advanced to the next slide upon detecting a word (e.g., “next”).” (col. 16, lines 15-46) | For even further context/examples, see e.g. col. 11, line 56 – col. 12, line 64.].

As shown above, Merrill shows the ability to receive and/or process multiple triggers upon which several different events are based. For example, Merrill shows multiple user-selectable words (fig. 6) whose temporal occurrences trigger respective event executions, but the words appear to have been populated directly from a speech-to-text transcribing process. In other words, even though Merrill is certainly able to respond to and/or receive word-associated triggering criteria to execute corresponding events, these triggers do not appear to be defined as a direct result of “first developer input” (at least as apparently intended). Rather than simply relying on the considerable breadth of the phrase “receive first developer input defining” as currently recited and/or the spectrum of possible mappings its broadest reasonable interpretation would cover, it is potentially conceded that Merrill does not appear to explicitly recite receiving a developer input for the purpose of defining an audio trigger itself, as apparently intended. In an analogous art, Chew shows:
receive first developer input defining an audio trigger corresponding to one or more words or phrases appearing in a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience; {…} and output a metadata file including at least one timestamp defined relative to a start of the audio content stream, the at least one timestamp being associated in the metadata file with the defined event, the application engine being adapted to: synchronize a read pointer for the metadata file with a playback pointer for the audio content stream; modify timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience; and based on the modified timing and the synchronization of the read pointer with the playback pointer, execute the defined event when the playback pointer for the audio content stream reaches a position originally associated with the timestamp while concurrently presenting the dynamic user-influenced media experience [“{…} Upon selection of a keyword in the tag cloud, the system can present cue points along the time line of the video player to indicate the time index within the media where the keyword appears. This can assist the learner in skipping to the section of the media that is mentions the keyword. {…}” (Chew: ¶ 21)
“{…} Tag cloud 810 can present a plurality of keywords associated with the media file. Each of the keywords can be selectable. Upon selection of a keyword, a cue point function (within UI framework 120, post process functions 135, or database functions 142) can generate cue points to be presented on timeline 805 of media player 750. {…}
In one embodiment, the cue point function can determine the keyword within tag cloud 810 that has been selected. In response to the selection, the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. A touch gesture detected at or near a cue point can result in the media player skipping to a part of the media file where the keyword is mentioned. In some examples, the media player can slightly rewind the media so that the learner can determine the context in which the keyword is being used. For example, the media player can rewind a few seconds or to the beginning of the sentence so that the learner. As shown here, keyword 815 has been selected. Upon selection of keyword 815, cue points 812, 814, and 816 appear along timeline 805. Thus, the keyword is used three times in the media. Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶¶ 43-44)]; 

One of ordinary skill in the art, having the teachings of Merrill and Chew before them prior to the effective filing date of the claimed invention, would have been motivated to adapt Merrill to allow a developer to deliberately define the triggers to which it already responds by executing respective trigger-associated events, as taught by Chew. The rationale for doing so would have been that Chew’s approach “can assist the [user] in quickly finding relevant media” (Chew: Abstract), and thus one of ordinary skill would have been motivated to modify Merrill to “also include features which enhance the manner in which the media file can be consumed […] such as cue points and hot zones” (Chew: ¶ 42). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Merrill and Chew (hereinafter, the “Merrill-Chew” combination) in order to obtain the invention as recited in claim 1.

As to dependent claim 2, Merrill-Chew further shows:
wherein the first developer input defines a select location within the textual transcript [“{…} the time values may be implemented as a unit of time (e.g., milliseconds) or as a pointer to a particular location in the speech sound data 206.” (Merrill: col. 12, lines 18-20)
“A word marker 384 and a phoneme marker 390 represent the linguistic information on the user interface. The markers indicate where a particular linguistic event (e.g. a word or phoneme) begins and ends with respect to the speech sound data 382 by their size and position. {…}” (Merrill: col. 13, lines 19-23)
“In one embodiment, the cue point function can determine the keyword within tag cloud 810 that has been selected. In response to the selection, the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. A touch gesture detected at or near a cue point can result in the media player skipping to a part of the media file where the keyword is mentioned. {…} As shown here, keyword 815 has been selected. Upon selection of keyword 815, cue points 812, 814, and 816 appear along timeline 805. Thus, the keyword is used three times in the media. Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶ 44)].

As to dependent claim 3, Merrill-Chew further shows:
wherein the development tool is further adapted to define a timestamp location within the audio content stream that temporally correlates with the audible occurrence of the audio trigger [“{…} The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). {…}” (Merrill: col. 12, lines 42-51)
“{…} time information could be included in the notification, and the animation server 422 could use the time information to determine when to process the notifications. Yet another embodiment could send a list of notifications, each element of the list containing a start and stop time value and either a word or a phoneme value. In addition, start and stop time values might not be necessary in every instance. Instead, a single time (e.g., a start time) value might suffice.” (Merrill: col. 15, lines 58-65)
“In one embodiment, the cue point function can determine the keyword within tag cloud 810 that has been selected. In response to the selection, the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. A touch gesture detected at or near a cue point can result in the media player skipping to a part of the media file where the keyword is mentioned. {…} As shown here, keyword 815 has been selected. Upon selection of keyword 815, cue points 812, 814, and 816 appear along timeline 805. Thus, the keyword is used three times in the media. Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶ 44)].

As to dependent claim 4, Merrill-Chew further shows:
wherein the metadata associates the defined timestamp location with a defined event name identifying the event [“{…} The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). {…}” (Merrill: col. 12, lines 42-51)
“To specify an utterance under the human speech player arrangement, the application 502 specifies a text string 510 and a reference to a linguistically enhanced sound file 512 in a speak command (e.g., ‘speak “This is a test.”, test.lwv’). The reference could alternatively be something other than a file name (e.g., a uniform resource locator for specifying a file on the world wide web). {…}” (Merrill: col. 17, lines 54-60)].

As to dependent claim 5, Merrill-Chew further shows:
wherein the application engine is further configured to: read the metadata while playing the audio content stream and rendering graphics to a display within the dynamic user-influenced media experience [e.g. rendering graphics to a display while reading the metadata and playing the audio content stream | Merrill: col. 16, lines 15-46; Chew: ¶¶ 43-44];
and initiate the event according to a timestamp specified by the metadata relative to a start of the audio content stream [e.g. initiating the event according to a timestamp specified by the metadata relative to a start of the audio content stream | Merrill: col. 12, lines 42-51 and col. 15, lines 58-65 | Chew: ¶¶ 43-44].

As to dependent claim 6, Merrill-Chew further shows:
wherein the application engine is a game engine [To merely redefine the application engine as a “game engine” would appear to be drawn to an intended field-of-use and/or result for the otherwise application-agnostic “application engine,” and thus would appear to lack considerable patentable weight for purposes of prior art analysis. Furthermore, the Office does not rely solely on this interpretation, because “gaming” as a field of use for an engine is already taught by the prior art in at least Merrill: col. 1, lines 13-25 & col. 5, lines 51-57 and/or Chew: ¶ 01.].

As to dependent claim 7, Merrill-Chew further shows:
wherein the execution of the defined event launches a sub-animation [e.g. any of the sub-animations in Merrill: col. 16, lines 15-46].

As to dependent claim 9, Merrill-Chew further shows:
wherein the execution of the defined event launches one or more ancillary events that differ based on the first user input within the dynamic user-influenced media experience [e.g. how different end-user-provided inputs within the dynamic user-influenced media experience may trigger the execution of different/contextualized ancillary events | Merrill: col. 16, lines 15-46 | Chew: ¶¶ 43-44].

As to dependent claim 10, Merrill-Chew further shows:
a web-based service adapted to: analyze the audio content stream and the textual transcript; and based on the analysis, output timestamp data indicating a time at which each word of the textual transcript is spoken within the audio content stream, wherein the development tool accepts the output timestamp data as a third developer input [e.g. the actual transcription process (e.g. the attribution of timestamp data to word occurrences in an audio content stream) may be performed by a remote web-based service and received as input rather than being actively transcribed by the development tool | Merrill: col. 6, lines 05-30; col. 17, lines 58-64; col. 18, lines 49-53 | Chew: ¶¶ 22-27, 31, 38, & 52].

As to independent claim 11, Merrill shows a method for adding electronically-driven effects to a dynamic user-influenced media experience [“The invention is directed to methods and systems for synchronizing the animation of a speaking character with recorded speech {…}” (col. 4, lines 38-40)], the method comprising:
 receiving {…} an audio trigger [e.g. “linguistic events are used to synchronize some action in the animation” (col. 12, lines 27-28) and/or the “bookmarks” defining which notifications to send upon detecting audio triggering occurrences (col. 15, lines 01-17)] corresponding to one or more words or phrases within a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience [e.g. the one or more words or phrases appearing in a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience, as illustrated in fig. 6.];
receiving as a second developer input a selection of a defined event that is to be executed in temporal association with an audible occurrence of the audio trigger during the dynamic user-influenced media experience [e.g. receiving developer input (like directly manipulating markers 384 and/or 390 in fig. 6) to further define an event to be executed by the application engine in temporal association with an audible occurrence of the audio trigger during the dynamic user-influenced media experience (for further context, see also col. 13, line 19 – col. 14, line 7)];
generating a metadata file [e.g. outputting a “linguistically enhanced sound file 232” (fig. 4)] including metadata temporally associating the defined event with a timestamp of the audible occurrence of the audio trigger within the audio content stream [“{…} appropriate member functions of the ISRResGraph programming interface 220 are employed to generate the word break information 216 and the phoneme information 218 from the speech recognition results object 214. The word break information 216 is a list of words and time values indicating when they occur within the speech sound data 206. The phoneme information 218 is a list of phoneme codes associated with the International Phonetic Alphabet and time values indicating when the phonemes occur in the speech sound data 206. The time values are represented by a start and stop offset indicating a number of bytes from the start of the speech sound data 206.
For example, the word break information 216 might contain a list of 10 words, the first of which being “Ha.” The start and stop offsets would indicate the number of bytes from the beginning of the speech sound data 206 the word “Ha” started and stopped. {…}
At step 266, the speech sound data 206 is annotated with the word break information 216 and the phoneme information 218 to create a linguistically enhanced sound file 232. In the illustrated embodiment, the linguistic information and sound editing tool 208 combines the speech sound data 206, the word break information 216, and the phoneme information 218 into a single file 232 containing an audio chunk 234, a word marking list 236, and a phoneme marking list 238. The audio chunk is a part of the file 232 (e.g., a set of bytes) containing audio data. Typically, the audio chunk 234 is of the same format (e.g., WAV) as the speech sound data 206, but can be of some other format. The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). The phoneme marking list 238 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of a string of hex codes corresponding to individual IPA phonemes in the form of 0xhhhh, where each “h” denotes a single hex digit. For example, a string might be “0x00f0,” which represents the English phoneme /ð/ (which is pronounced as the “th” in “they”). The lists could be implemented in other ways. For example, the file could be divided into frames, and the phoneme and word break data scattered throughout the file in the frames.” (col. 11, line 56 – col. 12, line 64) | For even further context/examples, see e.g. the “linguistically enhanced sound file 232” (fig. 4) and/or also col. 15, line 56 – col. 16, line 46.], wherein execution of the defined event is dependent first user upon input provided by an end user during execution of the dynamic user-influenced media experience [For further context into the “developer” aspect of the second developer input, see how “[…] a game program may present an animated character for entertainment, or an educational program may include an animated teacher character. In addition, animated characters are a useful part of social interfaces that present an interactive interface with human qualities. For instance, an animated character may appear on a computer display to help a user having difficulty completing a function or to answer questions. The character's creators may give it certain human traits reflected in gestures and other behavior, and the character may be programmed to react to actions by the user.” (col. 1, lines 15-25) and how “[… a] common arrangement is to create the linguistically enhanced sound file on a development computer, test the file using a player, and then distribute the file to computers with access to a player. […]” (col. 7, lines 33-36)
For even further evidence of how the execution of the event is dependent upon input provided by an end user during execution of the dynamic user-influenced media experience, see also how: 
 “The sound file tool 108 acquires the text string 104 and the speech sound data stream 106 at step 152 (FIG. 3). The text string 104 is a textual version of what is spoken in the speech sound data stream 106. For example, the text string 104 might be an ASCII text string and the speech sound data stream 106 might be a sound file produced by digitally sampling (e.g., with a microphone) a person speaking the words of the text string 104.” (col. 6, lines 51-58)
 “The linguistic information and sound editing tool 208 acquires the speech sound data at step 252 (FIG. 5). In the illustrated embodiment, the speech sound data 206 is of the familiar WAV sound format (also known as RIFF format). The data 206 is acquired by opening a saved file or by sampling an input device such as the microphone 62 (FIG. 1) or some other sound input device. […]” (col. 9, lines 32-40)
“In the final stages of development, a linguistically enhanced sound file 512 can be created by recording a human voice (e.g., professional vocal talent) and incorporated into the character animation 508 with a minimum of changes to the programming code in the application 502. In this way, the resulting application presents high quality animation while avoiding some of the development costs associated with using a human voice. In both cases, the character animation 508 presents an animation in which the character's mouth (and optionally, a word balloon) are synchronized with the speech sound output. However, the linguistically enhanced sound file 512 provides a superior animation with more realistic speech sound output.” (col. 19, lines 18-30)]; 
synchronizing a read pointer for the metadata file with a playback pointer for the audio content stream [“{…} the synchronization data chunk 115 includes a phoneme type (or a word) and a timing reference used to synchronize playback of the phoneme (or word) with the animation. {…}” (col. 7, lines 22-25) | For further context/examples, see also col. 18, lines 30-53 and/or the other mappings provided herein.]; 
modifying timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience [see, e.g. how user input received during the dynamic user-influenced media experience modifying time-associated edges 386 or 388 correspondingly modifies the timing of the audio content stream (fig. 6; col. 13, line 63 – col. 14, line 4)]; and 
as a result of the modified timing, the synchronization of the read pointer with the playback pointer, and the playback pointer for the audio content stream reaching a position originally associated with the timestamp in the audio content stream, executing the defined event during the dynamic user-influenced media experience [“At step 458, the audio player 424 plays the audio segments in the audio stream to send a decompressed audio data stream to the sound output device 420. When it encounters a bookmark in the audio stream, the audio player 424 sends a notification back to the sound file player 414 using the callback mechanism set up during step 450. The notification includes information in the bookmark indicating how to process the notification.
At step 460, the sound file player 414, having received a notification from the audio player 424, sends a notification to an appropriate interface of the animation server, as determined by information from the bookmark (e.g., a next word interface or a phoneme interface) {…} to maintain synchronicity with the sound output from the sound output device 420.
As the linguistically enhanced sound file player traverses the audio chunk 406, it reiterates steps 456-460 until it reaches the end of the audio chunk 406. At such time, other linguistically enhanced sound files 404 can be provided for additional utterances.
When the interface of the animation server 422 for next word notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8B. At step 472, the animation server 422 displays the next word in the utterance in the word balloon animation module 434.
When the interface of the animation server 422 for phoneme notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8C. As part of the notification, a phoneme code is provided. At step 482, the animation server 422 maps the phoneme code to one of seven mouth shapes using the phoneme mapping table 416. An alternative implementation could be constructed without the phoneme mapping table 416, if, for example, the phoneme marking list 410 contained mouth shape values instead of phoneme values. Such an arrangement could be accomplished by performing the mapping while creating the linguistically enhanced sound file 404. Alternatively, the linguistically enhanced sound file player 414 could compute mouth shape values internally and send the mouth shape values to the animation server 422, rather than sending phoneme values. The animation server 422 then displays the mouth shape in the mouth animation module 432 at step 484.
In the illustrated embodiment, the notifications are processed immediately by the animation server. In an alternative embodiment, time information could be included in the notification, and the animation server 422 could use the time information to determine when to process the notifications. Yet another embodiment could send a list of notifications, each element of the list containing a start and stop time value and either a word or a phoneme value. In addition, start and stop time values might not be necessary in every instance. Instead, a single time (e.g., a start time) value might suffice.” (col. 15, lines 05-65) 
“The mouth animation module 432 typically provides a choice of seven different mouth shapes that can be displayed for a character. Typically, the mouth shapes are loaded from a mouth data file containing a set of bitmap images that can be customized for the particular character being presented. {…}
The word balloon animation module 434 places the word balloon in an appropriate position with respect to the animated character and displays an indicated word in the balloon upon being sent a message or notification. The module also manages the size and shape of the balloon and places words in the balloon. A feature allows the word balloon to be disabled, enabled with all the words appearing at once, or enabled with words appearing as they are spoken.
As a result of executing the steps indicated above, the animation elements generated by the word balloon and mouth animation modules 434 and 432 are synchronized with the audio chunk 406 as presented by the sound output device 420, presenting the illusion that an animated character is speaking. However, the features in the above description could be used for other purposes, such as controlling animation color or triggering some event in a computer presentation. For example, a window could be colored red upon detecting a word (e.g., “angry”) or a slide show presentation could advanced to the next slide upon detecting a word (e.g., “next”).” (col. 16, lines 15-46) | For even further context/examples, see e.g. col. 11, line 56 – col. 12, line 64.].

As shown above, Merrill shows an operability to receive and/or process multiple triggers upon which several different events are based. For example, Merrill shows multiple user-selectable words (fig. 6) whose temporal occurrences trigger respective event executions, but the words appear to have been populated directly from a speech-to-text transcribing process. In other words, even though Merrill is certainly able to respond to and/or receive word-associated triggering criteria to execute corresponding events, these triggers do not appear to be defined as a direct result of “first developer input” (at least as apparently intended). In lieu of simply pointing to the considerable breadth of the terms to “receive first developer input defining” as currently recited and/or the spectrum of possible mappings its broadest reasonable interpretation would cover, it is potentially conceded that Merrill does not appear to explicitly recite receiving a developer input for the purposes of defining an audio trigger itself as apparently intended. In an analogous art, Chew shows:
receiving as a first developer input an audio trigger corresponding to one or more words or phrases within a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience; {…} generating a metadata file including metadata temporally associating the defined event with a timestamp of the audible occurrence of the audio trigger within the audio content stream, wherein execution of the defined event is dependent first user upon input provided by an end user during execution of the dynamic user-influenced media experience; synchronizing a read pointer for the metadata file with a playback pointer for the audio content stream; modifying timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience; and as a result of the modified timing, the synchronization of the read pointer with the playback pointer, and the playback pointer for the audio content stream reaching a position originally associated with the timestamp in the audio content stream, executing the defined event during the dynamic user-influenced media experience [“{…} Upon selection of a keyword in the tag cloud, the system can present cue points along the time line of the video player to indicate the time index within the media where the keyword appears. This can assist the learner in skipping to the section of the media that is mentions the keyword. {…}” (Chew: ¶ 21)
“{…} Tag cloud 810 can present a plurality of keywords associated with the media file. Each of the keywords can be selectable. Upon selection of a keyword, a cue point function (within UI framework 120, post process functions 135, or database functions 142) can generate cue points to be presented on timeline 805 of media player 750. {…}
In one embodiment, the cue point function can determine the keyword within tag cloud 810 that has been selected. In response to the selection, the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. A touch gesture detected at or near a cue point can result in the media player skipping to a part of the media file where the keyword is mentioned. In some examples, the media player can slightly rewind the media so that the learner can determine the context in which the keyword is being used. For example, the media player can rewind a few seconds or to the beginning of the sentence so that the learner. As shown here, keyword 815 has been selected. Upon selection of keyword 815, cue points 812, 814, and 816 appear along timeline 805. Thus, the keyword is used three times in the media. Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶¶ 43-44)].
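
For illustration only, the cue point behavior quoted from Chew above can be sketched as follows. The sketch appears in neither reference; the names TranscriptWord, find_cue_points, and seek_position, the per-word time stamps, and the two-second rewind are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptWord:
    """One transcript word and the time (in seconds) at which it is heard.

    Hypothetical structure; Chew does not specify a transcript format.
    """
    text: str
    start_seconds: float


def find_cue_points(transcript, keyword):
    """Return the time stamps at which the selected keyword is heard."""
    wanted = keyword.casefold()
    return [w.start_seconds for w in transcript if w.text.casefold() == wanted]


def seek_position(cue_point, rewind_seconds=2.0):
    """Rewind slightly before the cue point, never before the media start."""
    return max(0.0, cue_point - rewind_seconds)


transcript = [
    TranscriptWord("welcome", 0.0),
    TranscriptWord("budget", 1.5),
    TranscriptWord("the", 11.5),
    TranscriptWord("budget", 12.0),
    TranscriptWord("Budget", 40.25),
]

cue_points = find_cue_points(transcript, "budget")
```

In this sketch, selecting the keyword yields three cue points, which parallels the three cue points 812, 814, and 816 described for keyword 815.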

One of ordinary skill in the art, having the teachings of Merrill and Chew before them prior to the effective filing date of the claimed invention, would have been motivated to adapt Merrill to allow a developer to deliberately define the triggers to which it already responds in executing respective trigger-associated events, as taught by Chew. The rationale for doing so would have been that Chew’s approach “can assist the [user] in quickly finding relevant media” (Chew: Abstract), and thus one of ordinary skill would have been motivated to modify Merrill to “also include features which enhance the manner in which the media file can be consumed […] such as cue points and hot zones” (Chew: ¶ 42). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Merrill and Chew (hereinafter, the “Merrill-Chew” combination) in order to obtain the invention as recited in claim 11.

As to dependent claim 12, Merrill-Chew further shows:
wherein generating the metadata further comprises: storing a name of the defined event with a timestamp identifying a location of the audible occurrence of the audio trigger in the audio content stream [“{…} The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). {…}” (Merrill: col. 12, lines 42-51)
“To specify an utterance under the human speech player arrangement, the application 502 specifies a text string 510 and a reference to a linguistically enhanced sound file 512 in a speak command (e.g., ‘speak “This is a test.”, test.lwv’). The reference could alternatively be something other than a file name (e.g., a uniform resource locator for specifying a file on the world wide web). {…}” (Merrill: col. 17, lines 54-60)].
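
For illustration only, one way to store an entry of the kind quoted from Merrill above (a start offset and a stop offset written as 8-byte unsigned integers, followed by ASCII text) is sketched here. Merrill does not give the header layout, the byte order, or how the text length is delimited, so the little-endian order and the 2-byte length field are assumptions, as are the names pack_marker and unpack_markers.

```python
import struct

# Assumed record prefix: start offset (8-byte unsigned), stop offset
# (8-byte unsigned), then a 2-byte text length. Only the two 8-byte
# offsets come from the quoted passage; the rest is hypothetical.
_RECORD_PREFIX = struct.Struct("<QQH")


def pack_marker(start, stop, text):
    """Serialize one time-stamped string (a word or a phoneme hex code)."""
    encoded = text.encode("ascii")
    return _RECORD_PREFIX.pack(start, stop, len(encoded)) + encoded


def unpack_markers(data):
    """Parse consecutive records back into (start, stop, text) tuples."""
    markers = []
    pos = 0
    while pos < len(data):
        start, stop, length = _RECORD_PREFIX.unpack_from(data, pos)
        pos += _RECORD_PREFIX.size
        text = data[pos:pos + length].decode("ascii")
        pos += length
        markers.append((start, stop, text))
    return markers


blob = pack_marker(0, 4410, "Ha") + pack_marker(4410, 9000, "0x00f0")
markers = unpack_markers(blob)
```

The offsets in the example are arbitrary byte positions; the text values mirror the word “Ha” and the phoneme code “0x00f0” used as examples in the quoted passage.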

As to dependent claim 13, Merrill-Chew further shows:
wherein interpreting the generated metadata further comprises: reading the metadata while playing the audio content stream and rendering graphics to a display within the dynamic user-influenced media experience [e.g. rendering graphics to a display while reading the metadata and playing the audio content stream | Merrill: col. 16, lines 15-46; Chew: ¶¶ 43-44];
and initiating the execution of the defined event according to the timestamp specified by the metadata relative to a start of the audio content stream [e.g. initiating the event according to a timestamp specified by the metadata relative to a start of the audio content stream | Merrill: col. 12, lines 42-51 and col. 15, lines 58-65 | Chew: ¶¶ 43-44].
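
For illustration only, reading metadata while audio plays and initiating events according to their time stamps can be sketched as follows. The sketch is not drawn from Merrill or Chew; the class MetadataReader, its advance method, and the event names are hypothetical, and offsets are assumed to be measured from the start of the audio content stream.

```python
class MetadataReader:
    """Keeps a read index into time-stamped events, moved by playback position."""

    def __init__(self, entries):
        # entries: iterable of (offset_from_stream_start, event_name) tuples
        self._entries = sorted(entries)
        self._read_index = 0

    def advance(self, playback_offset):
        """Return the names of events whose time stamp has now been reached."""
        fired = []
        while (self._read_index < len(self._entries)
               and self._entries[self._read_index][0] <= playback_offset):
            fired.append(self._entries[self._read_index][1])
            self._read_index += 1
        return fired


reader = MetadataReader([(9000, "next_slide"), (4410, "show_word")])
first = reader.advance(1000)    # playback has not reached any time stamp
second = reader.advance(5000)   # playback passed offset 4410
third = reader.advance(5000)    # no event is fired twice
fourth = reader.advance(10000)  # playback passed offset 9000
```

Each event fires once, when the playback position first reaches or passes its time stamp, which is the sense in which the read index stays aligned with playback in this sketch.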

As to dependent claim 14, Merrill-Chew further shows:
wherein a game engine generates the dynamic user-influenced media experience and interprets the generated metadata [To merely redefine the application engine as a “game engine” (and/or to generally rely on the “game” aspect of an otherwise application-agnostic “engine” as claimed) would appear to be drawn to an intended field-of-use and/or result, and thus would appear to lack considerable patentable weight for purposes of prior art analysis. Furthermore, the Office does not solely rely on this interpretation due to “gaming” as a field of use for an engine being already taught by the prior art in at least Merrill: col. 1, lines 13-25 & col. 5, lines 51-57 and/or Chew: ¶ 01.].

As to dependent claim 15, Merrill-Chew further shows:
wherein initiating the defined event according to the timestamp further comprises: launching a sub-animation within the dynamic user-influenced media experience according to the timestamp [e.g. launching any of the sub-animations in accordance with the timestamp | Merrill: col. 16, lines 15-46].

As to independent claim 17, Merrill shows one or more tangible computer-readable storage media devices encoding computer-executable instructions [“computer-readable media” (fig. 1; col. 5, lines 06-37)] for executing a computer process that adds electronically-driven effects to a dynamic user-influenced media experience [“The invention is directed to methods and systems for synchronizing the animation of a speaking character with recorded speech {…}” (col. 4, lines 38-40)], the computing process comprising:
 receiving {…} an audio trigger [e.g. “linguistic events are used to synchronize some action in the animation” (col. 12, lines 27-28) and/or the “bookmarks” defining which notifications to send upon detecting audio triggering occurrences (col. 15, lines 01-17)] associated with a select location within an audio content stream [“{…} the time values may be implemented as a unit of time (e.g., milliseconds) or as a pointer to a particular location in the speech sound data 206.” (col. 12, lines 18-20)
“A word marker 384 and a phoneme marker 390 represent the linguistic information on the user interface. The markers indicate where a particular linguistic event (e.g. a word or phoneme) begins and ends with respect to the speech sound data 382 by their size and position. {…}” (col. 13, lines 19-23)];
 receiving a second developer input defining an event that is to be executed when a playback pointer reaches the select location within the audio content stream during the dynamic user-influenced media experience [e.g. receiving input (like directly manipulating markers 384 and/or 390 in fig. 6) by a developer for defining an event that is to be executed when a playback pointer reaches the select location within the audio content stream during the dynamic user-influenced media experience. For further context, see also col. 13, line 19 – col. 14, line 7 and/or the one or more words or phrases appearing in a textual transcript of an audio content stream to be presented as part of the dynamic user-influenced media experience, as illustrated in fig. 6 and/or col. 16, lines 27-46.];
generating a metadata file [e.g. outputting a “linguistically enhanced sound file 232” (fig. 4)] including metadata temporally associating the defined event with the select location within the audio content stream [“{…} appropriate member functions of the ISRResGraph programming interface 220 are employed to generate the word break information 216 and the phoneme information 218 from the speech recognition results object 214. The word break information 216 is a list of words and time values indicating when they occur within the speech sound data 206. The phoneme information 218 is a list of phoneme codes associated with the International Phonetic Alphabet and time values indicating when the phonemes occur in the speech sound data 206. The time values are represented by a start and stop offset indicating a number of bytes from the start of the speech sound data 206.
For example, the word break information 216 might contain a list of 10 words, the first of which being “Ha.” The start and stop offsets would indicate the number of bytes from the beginning of the speech sound data 206 the word “Ha” started and stopped. {…}
At step 266, the speech sound data 206 is annotated with the word break information 216 and the phoneme information 218 to create a linguistically enhanced sound file 232. In the illustrated embodiment, the linguistic information and sound editing tool 208 combines the speech sound data 206, the word break information 216, and the phoneme information 218 into a single file 232 containing an audio chunk 234, a word marking list 236, and a phoneme marking list 238. The audio chunk is a part of the file 232 (e.g., a set of bytes) containing audio data. Typically, the audio chunk 234 is of the same format (e.g., WAV) as the speech sound data 206, but can be of some other format. The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). The phoneme marking list 238 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of a string of hex codes corresponding to individual IPA phonemes in the form of 0xhhhh, where each “h” denotes a single hex digit. For example, a string might be “0x00f0,” which represents the English phoneme /ð/ (which is pronounced as the “th” in “they”). The lists could be implemented in other ways. For example, the file could be divided into frames, and the phoneme and word break data scattered throughout the file in the frames.” (col. 11, line 56 – col. 12, line 64) | For even further context/examples, see e.g. the “linguistically enhanced sound file 232” (fig. 4) and/or also col. 15, line 56 – col. 16, line 46.], wherein execution of the defined event is dependent upon first user input provided by an end user during execution of the dynamic user-influenced media experience [For further context into the “developer” aspect of the second developer input, see how “[…] a game program may present an animated character for entertainment, or an educational program may include an animated teacher character. In addition, animated characters are a useful part of social interfaces that present an interactive interface with human qualities. For instance, an animated character may appear on a computer display to help a user having difficulty completing a function or to answer questions. The character's creators may give it certain human traits reflected in gestures and other behavior, and the character may be programmed to react to actions by the user.” (col. 1, lines 15-25) and how “[… a] common arrangement is to create the linguistically enhanced sound file on a development computer, test the file using a player, and then distribute the file to computers with access to a player. […]” (col. 7, lines 33-36)
For even further evidence of how the execution of the event is dependent upon input provided by an end user during execution of the dynamic user-influenced media experience, see also how: 
 “The sound file tool 108 acquires the text string 104 and the speech sound data stream 106 at step 152 (FIG. 3). The text string 104 is a textual version of what is spoken in the speech sound data stream 106. For example, the text string 104 might be an ASCII text string and the speech sound data stream 106 might be a sound file produced by digitally sampling (e.g., with a microphone) a person speaking the words of the text string 104.” (col. 6, lines 51-58)
 “The linguistic information and sound editing tool 208 acquires the speech sound data at step 252 (FIG. 5). In the illustrated embodiment, the speech sound data 206 is of the familiar WAV sound format (also known as RIFF format). The data 206 is acquired by opening a saved file or by sampling an input device such as the microphone 62 (FIG. 1) or some other sound input device. […]” (col. 9, lines 32-40)
“In the final stages of development, a linguistically enhanced sound file 512 can be created by recording a human voice (e.g., professional vocal talent) and incorporated into the character animation 508 with a minimum of changes to the programming code in the application 502. In this way, the resulting application presents high quality animation while avoiding some of the development costs associated with using a human voice. In both cases, the character animation 508 presents an animation in which the character's mouth (and optionally, a word balloon) are synchronized with the speech sound output. However, the linguistically enhanced sound file 512 provides a superior animation with more realistic speech sound output.” (col. 19, lines 18-30)]; 
synchronizing a read pointer for the metadata file with a playback pointer for the audio content stream [“{…} the synchronization data chunk 115 includes a phoneme type (or a word) and a timing reference used to synchronize playback of the phoneme (or word) with the animation. {…}” (col. 7, lines 22-25) | For further context/examples, see also col. 18, lines 30-53 and/or the other mappings provided herein.]; 
modifying timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience [see, e.g. how user input received during the dynamic user-influenced media experience modifying time-associated edges 386 or 388 correspondingly modifies the timing of the audio content stream (fig. 6; col. 13, line 63 – col. 14, line 4)]; and 
as a result of the modified timing, the synchronization of the read pointer with the playback pointer, and the playback pointer for the audio content stream reaching a position originally associated with the timestamp in the audio content stream, executing the defined event during the dynamic user-influenced media experience [“At step 458, the audio player 424 plays the audio segments in the audio stream to send a decompressed audio data stream to the sound output device 420. When it encounters a bookmark in the audio stream, the audio player 424 sends a notification back to the sound file player 414 using the callback mechanism set up during step 450. The notification includes information in the bookmark indicating how to process the notification.
At step 460, the sound file player 414, having received a notification from the audio player 424, sends a notification to an appropriate interface of the animation server, as determined by information from the bookmark (e.g., a next word interface or a phoneme interface) {…} to maintain synchronicity with the sound output from the sound output device 420.
As the linguistically enhanced sound file player traverses the audio chunk 406, it reiterates steps 456-460 until it reaches the end of the audio chunk 406. At such time, other linguistically enhanced sound files 404 can be provided for additional utterances.
When the interface of the animation server 422 for next word notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8B. At step 472, the animation server 422 displays the next word in the utterance in the word balloon animation module 434.
When the interface of the animation server 422 for phoneme notifications receives a notification from the sound file player 414, it proceeds as shown in FIG. 8C. As part of the notification, a phoneme code is provided. At step 482, the animation server 422 maps the phoneme code to one of seven mouth shapes using the phoneme mapping table 416. An alternative implementation could be constructed without the phoneme mapping table 416, if, for example, the phoneme marking list 410 contained mouth shape values instead of phoneme values. Such an arrangement could be accomplished by performing the mapping while creating the linguistically enhanced sound file 404. Alternatively, the linguistically enhanced sound file player 414 could compute mouth shape values internally and send the mouth shape values to the animation server 422, rather than sending phoneme values. The animation server 422 then displays the mouth shape in the mouth animation module 432 at step 484.
In the illustrated embodiment, the notifications are processed immediately by the animation server. In an alternative embodiment, time information could be included in the notification, and the animation server 422 could use the time information to determine when to process the notifications. Yet another embodiment could send a list of notifications, each element of the list containing a start and stop time value and either a word or a phoneme value. In addition, start and stop time values might not be necessary in every instance. Instead, a single time (e.g., a start time) value might suffice.” (col. 15, lines 05-65) 
“The mouth animation module 432 typically provides a choice of seven different mouth shapes that can be displayed for a character. Typically, the mouth shapes are loaded from a mouth data file containing a set of bitmap images that can be customized for the particular character being presented. {…}
The word balloon animation module 434 places the word balloon in an appropriate position with respect to the animated character and displays an indicated word in the balloon upon being sent a message or notification. The module also manages the size and shape of the balloon and places words in the balloon. A feature allows the word balloon to be disabled, enabled with all the words appearing at once, or enabled with words appearing as they are spoken.
As a result of executing the steps indicated above, the animation elements generated by the word balloon and mouth animation modules 434 and 432 are synchronized with the audio chunk 406 as presented by the sound output device 420, presenting the illusion that an animated character is speaking. However, the features in the above description could be used for other purposes, such as controlling animation color or triggering some event in a computer presentation. For example, a window could be colored red upon detecting a word (e.g., “angry”) or a slide show presentation could advanced to the next slide upon detecting a word (e.g., “next”).” (col. 16, lines 15-46) | For even further context/examples, see e.g. col. 11, line 56 – col. 12, line 64.].

As shown above, Merrill shows an operability to receive and/or process multiple triggers upon which several different events are based. For example, Merrill shows multiple user-selectable words (fig. 6) whose temporal occurrences trigger respective event executions, but the words appear to have been populated directly from a speech-to-text transcribing process. In other words, even though Merrill is certainly able to respond to and/or receive word-associated triggering criteria to execute corresponding events, these triggers do not appear to be defined as a direct result of “first developer input” (at least as apparently intended). In lieu of simply pointing to the considerable breadth of the terms to “receive first developer input defining” as currently recited and/or the spectrum of possible mappings its broadest reasonable interpretation would cover, it is potentially conceded that Merrill does not appear to explicitly recite receiving a developer input for the purposes of defining an audio trigger itself as apparently intended. In an analogous art, Chew shows:
receiving a first developer input designating an audio trigger associated with a select location within an audio content stream; {…} generating a metadata file including metadata temporally associating the defined event with the select location within the audio content stream, wherein execution of the defined event is dependent upon first user input provided by an end user during execution of the dynamic user-influenced media experience; synchronizing a read pointer for the metadata file with a playback pointer for the audio content stream; modifying timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience; and as a result of the modified timing, the synchronization of the read pointer with the playback pointer, and the playback pointer for the audio content stream reaching a position originally associated with the timestamp in the audio content stream, executing the defined event during the dynamic user-influenced media experience [“{…} Upon selection of a keyword in the tag cloud, the system can present cue points along the time line of the video player to indicate the time index within the media where the keyword appears. This can assist the learner in skipping to the section of the media that is mentions the keyword. {…}” (Chew: ¶ 21)
“{…} Tag cloud 810 can present a plurality of keywords associated with the media file. Each of the keywords can be selectable. Upon selection of a keyword, a cue point function (within UI framework 120, post process functions 135, or database functions 142) can generate cue points to be presented on timeline 805 of media player 750. {…}
In one embodiment, the cue point function can determine the keyword within tag cloud 810 that has been selected. In response to the selection, the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. A touch gesture detected at or near a cue point can result in the media player skipping to a part of the media file where the keyword is mentioned. In some examples, the media player can slightly rewind the media so that the learner can determine the context in which the keyword is being used. For example, the media player can rewind a few seconds or to the beginning of the sentence so that the learner. As shown here, keyword 815 has been selected. Upon selection of keyword 815, cue points 812, 814, and 816 appear along timeline 805. Thus, the keyword is used three times in the media. Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶¶ 43-44)]

One of ordinary skill in the art, having the teachings of Merrill and Chew before them prior to the effective filing date of the claimed invention, would have been motivated to adapt Merrill to allow for a developer to deliberately define the triggers to which it already responds in order to execute respective trigger-associated events, as taught by Chew. The rationale for doing so would have been that Chew’s approach “can assist the [user] in quickly finding relevant media” (Chew: Abstract), and thus one of ordinary skill would have been motivated to modify Merrill to “also include features which enhance the manner in which the media file can be consumed […] such as cue points and hot zones” (Chew: ¶ 42). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Merrill and Chew (hereinafter, the “Merrill-Chew” combination) in order to obtain the invention as recited in claim 17.

As to dependent claim 18, Merrill-Chew further shows:
generating the metadata by storing a name of the defined event with a timestamp identifying the select location in the audio content stream [“{…} The word marking list 236 is implemented as a header, followed by a list of time-stamped strings. The strings contain a start offset, a stop offset, and ASCII text. The start and stop offsets are byte offsets from the start of the speech sound data, written as 8-byte unsigned integers. The ASCII text consists of the word itself (e.g., “Ha”). {…}” (Merrill: col. 12, lines 42-51)
“To specify an utterance under the human speech player arrangement, the application 502 specifies a text string 510 and a reference to a linguistically enhanced sound file 512 in a speak command (e.g., ‘speak “This is a test.”, test.lwv’). The reference could alternatively be something other than a file name (e.g., a uniform resource locator for specifying a file on the world wide web). {…}” (Merrill: col. 17, lines 54-60)].

As to dependent claim 19, Merrill-Chew further shows:
reading the metadata while playing the audio content stream and rendering graphics to a display as part of the dynamic user-influenced media experience [e.g. rendering graphics to a display while reading the metadata and playing the audio content stream | Merrill: col. 16, lines 15-46; Chew: ¶¶ 43-44];
and initiating the defined event according to the timestamp specified by the metadata relative to a start of the audio content stream [e.g. initiating the event according to a timestamp specified by the metadata relative to a start of the audio content stream | Merrill: col. 12, lines 42-51 and col. 15, lines 58-65 | Chew: ¶¶ 43-44].

As to dependent claim 20, Merrill-Chew further shows:
wherein reading the metadata and initiating the defined event is performed by a game engine [To merely redefine the application engine as a “game engine” would appear to be drawn to an intended field-of-use and/or result for the otherwise application-agnostic “application engine,” and thus would appear to lack considerable patentable weight for purposes of prior art analysis. Furthermore, the Office does not solely rely on this interpretation due to “gaming” as a field of use for an engine being already taught by the prior art in at least Merrill: col. 1, lines 13-25 & col. 5, lines 51-57 and/or Chew: ¶ 01.].

Claims 8 and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over Merrill-Chew in further view of Rosenberg et al. (US Patent Application Pub. No. 2004/0160415, hereinafter “Rosenberg”).

As to dependent claim 8, Merrill-Chew further shows how the execution environment of the defined event may already be in association with a user controller (Merrill: col. 5, line 52). Nonetheless, Merrill-Chew does not appear to explicitly recite providing “tactile sensory feedback” per se. In an analogous art, Rosenberg also shows a development tool (e.g. “design interface tool” (Abstract)), and further shows:
wherein the execution of the defined event provides tactile sensory feedback to a user controller [“Effects are force sensations that are closely correlated with discrete temporal events during game play. For example, a shuttlecraft is blasted by an alien laser, the user feels a physical blast that is synchronized with graphics and sound that also represent the event. {…} Effects are best thought of as predefined functions of time such as vibrations and jolts that can be “overlaid” on top of the background conditions described above as foreground sensations. In other words, effects are forces that are defined and “played back” over time when called.” (Rosenberg: ¶ 75)
For further context into how providing tactile sensory feedback to a user controller as part of defined event executions was already well-known and established in the prior art, see also Rosenberg: ¶¶ 02-09, 71-75, & 87.].

One of ordinary skill in the art, having the teachings of Merrill-Chew and Rosenberg before them prior to the effective filing date of the claimed invention, would have been motivated to add tactile sensory feedback to a user controller to the Merrill-Chew combination when executing its defined events, as taught by Rosenberg. The rationale for doing so would have been that the Merrill-Chew combination already accounted for incorporating a user controller into its user experience (see, e.g. Merrill: col. 5, line 52), and Rosenberg confirms that much like providing “visual and audio feedback to the user utilizing the display screen and audio speakers” (Rosenberg: ¶ 03), providing tactile sensory feedback to a user controller “in conjunction and coordinated with displayed events and interactions by sending control signals or commands to {…} convey physical force sensations to the user in conjunction with other supplied feedback as the user is grasping or contacting the joystick or other object of the interface device” (Rosenberg: ¶ 04) was also well known and established before the effective filing date of the claimed invention. Furthermore, it would have been obvious to incorporate Rosenberg’s approach into the Merrill-Chew combination because it “advantageously provides a simple, easy-to-use design interface tool for designing force feedback sensations {and} the design interface tool of the present invention meets the needs of force sensation designers that wish to create force sensations as close to their needs as possible. The graphical design interface of the present invention allows a force sensation programmer or developer to easily and intuitively design force sensations, conveniently experience the designed force sensations, and visually understand the effect of changes to different aspects of the force sensations.” (Rosenberg: ¶ 09). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Merrill, Chew, and Rosenberg in order to obtain the invention as recited in claim 8.

As to dependent claim 16, Merrill-Chew further shows initiating the defined event according to the timestamp with a plurality of effects within the dynamic user-influenced media experience according to the timestamp (Merrill: col. 16, lines 15-46). Merrill-Chew also shows how the dynamic user-influenced media experience may already be in association with a user controller (Merrill: col. 5, line 52). Nonetheless, Merrill-Chew does not appear to explicitly recite providing “tactile sensory feedback” per se. In an analogous art, Rosenberg also shows a development tool (e.g. “design interface tool” (Abstract)), and further shows:
wherein initiating the defined event according to the timestamp further comprises: providing tactile sensory feedback to a user controller within the dynamic user-influenced media experience according to the timestamp [“Effects are force sensations that are closely correlated with discrete temporal events during game play. For example, a shuttlecraft is blasted by an alien laser, the user feels a physical blast that is synchronized with graphics and sound that also represent the event. {…} Effects are best thought of as predefined functions of time such as vibrations and jolts that can be “overlaid” on top of the background conditions described above as foreground sensations. In other words, effects are forces that are defined and “played back” over time when called.” (Rosenberg: ¶ 75)
For further context into how providing tactile sensory feedback to a user controller as part of defined event executions was already well-known and established in the prior art, see also Rosenberg: ¶¶ 02-09, 71-75, & 87.].

One of ordinary skill in the art, having the teachings of Merrill-Chew and Rosenberg before them prior to the effective filing date of the claimed invention, would have been motivated to add tactile sensory feedback to a user controller to the Merrill-Chew combination when executing its defined events, as taught by Rosenberg. The rationale for doing so would have been that the Merrill-Chew combination already accounted for incorporating a user controller into its user experience (see, e.g. Merrill: col. 5, line 52), and Rosenberg confirms that much like providing “visual and audio feedback to the user utilizing the display screen and audio speakers” (Rosenberg: ¶ 03), providing tactile sensory feedback to a user controller “in conjunction and coordinated with displayed events and interactions by sending control signals or commands to {…} convey physical force sensations to the user in conjunction with other supplied feedback as the user is grasping or contacting the joystick or other object of the interface device” (Rosenberg: ¶ 04) was also well known and established before the effective filing date of the claimed invention. Furthermore, it would have been obvious to incorporate Rosenberg’s approach into the Merrill-Chew combination because it “advantageously provides a simple, easy-to-use design interface tool for designing force feedback sensations {and} the design interface tool of the present invention meets the needs of force sensation designers that wish to create force sensations as close to their needs as possible. The graphical design interface of the present invention allows a force sensation programmer or developer to easily and intuitively design force sensations, conveniently experience the designed force sensations, and visually understand the effect of changes to different aspects of the force sensations.” (Rosenberg: ¶ 09). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Merrill, Chew, and Rosenberg in order to obtain the invention as recited in claim 16.

Response to Arguments
Applicant’s arguments have been fully considered but they are not persuasive. Applicant argues:
“    […] the fact that Merrill's linguistically-enhanced sound file includes timestamps associated with word breaks or phonemes (words or sounds) does not disclose or suggest "at least one timestamp.... being associated in the metadata file with the defined event" where the event is one that is to be executed "in association with the audible occurrence of the audio trigger." Merrill's linguistically-enhanced sound file 232 (the purported "metadata file") does not include any timestamp that is associated, in the sound file, with an event that is to be executed in association with an audio trigger. Phonemes, word breaks, and timestamps do not disclose or suggest events that are to be executed in associated with audio triggers.
In further support of the above interpretation of the linguistically-enhanced sound file as lacking "events" that are to be executed, Merrill discusses that a speech recognition engine 212 that uses a phoneme mapping table 222 to dynamically map - during playback of the sound file- particular animations (mouth shapes) to the different phonemes in the file. See, col. 13, lines 36- 42 (discussing that the mapping table is used "during playback" of the sound file to perform the mapping via the mapping table between the phonemes and the corresponding animations). Thus, animations that are executed based on the timestamps are selected dynamically based on the phonemes in the linguistically-enhanced sound file, but neither the animations nor any other events to be executed are identified within the linguistically-enhanced sound file. 
Accordingly, Merrill's linguistically-enhanced sound file cannot be reasonably interpreted as a metadata file that includes "at least one timestamp.... being associated in the metadata file with the defined event [i.e., the event that is to executed "in association with the audible occurrence of the audio trigger"], as recited in claim 1.”


The Office respectfully disagrees with Applicant’s assessment/interpretation. The “phonemes, word breaks, and timestamps” to which Applicant refers do not exist in a vacuum in Merrill’s (metadata) file 232, but are instead explicitly and deliberately “associated” with corresponding events (for example, corresponding animation and/or sound output events to be executed at a corresponding timestamp). The Office would also respectfully emphasize the significant breadth in scope of the term “event” as currently recited, which covers a wide array of output possibilities (including those cited in Merrill), and how the claims only require an equally broad condition that the timestamp be in at least some manner “associated” with the defined event (“associated” being another significantly broad term; Applicant themselves acknowledge in their remarks that the timestamps are associated with animations/mouth movements).

“    Chew also fails to disclose or suggest a "metadata file" that includes "at least one timestamp.... being associated in the metadata file with the defined event [i.e., an event that is to executed "in association with the audible occurrence of the audio trigger"]. 
The cited portions of Chew generally disclose a video player timeline that presents "cue points" that can be selected by a user to skip to various locations within a media file, such as locations where particular keywords are spoken. See, e.g., paragraph [0021], paragraph [0043], [044]. However, the presentation of "cue points" that allow a user to skip to different portions in a media file does not disclose or suggest a "metadata file" that includes a timestamp associated "with [a] defined event" that is to be executed "in association with [a] audible occurrence of the audio trigger." For example, Chew does not appear to disclose any metadata file that identifies events that are to be "executed" in association with audio triggers. 
Therefore, Chew cannot be relied on as disclosing or suggesting: "at least one timestamp.... being associated in the metadata file with the defined event [i.e., the event that is to executed "in association with the audible occurrence of the audio trigger"], as recited in claim 1.”

The Office respectfully disagrees. Not only do the breadth issues indicated above also apply to Chew’s applicability/mappability to the claims, but also Chew explicitly recites how “the cue point function can analyze the transcript of the media file to determine time stamps within the media file where the keyword is heard. The cue point function can then generate cue points along timeline 805 where the keywords are heard. The cue points can be a visual indicator such as highlighting which is used to visually indicate to the learner where the keywords appear in the media file. […] Selection of any of these cue points can start playback of the media at or near when the keyword is used.” (Chew: ¶ 44).

“As explained above with respect to argument (I) and page 9 of the Office Action, the Office's current claim mapping characterizes Merrill's linguistically-enhanced sound file as the "metadata file" of claim 1. Therefore, the Office's position appears to be that the synchronization of a specific location in a sound file (purportedly "the metadata file" - per the Office's interpretation on pg. 9 of the Office Action) with a soundless character animation discloses or suggests "synchroniz[ing] a read pointer for the metadata file with a playback pointer for the audio content stream." The Applicant disagrees and respectfully submits that it is unreasonable to map Merrill's soundless character animation to the "playback pointer for the audio content stream" of claim 1.
A synchronization between a linguistically-enhanced audio file and a sound-free animation does not suggest a synchronization between a metadata file playback pointer and an audio stream playback pointer. Accordingly, Merrill fails to disclose or suggest at least "synchroniz[ing] a read pointer for the metadata file with a playback pointer for the audio content stream" as recited in claim 1. 
On page 15 of the Office action, the Office argues, in the alterative, that Chew discloses or suggests "synchronize a read pointer for the metadata file with a playback pointer for the audio content stream." The Applicant disagrees. As discussed above, the cited portions of Chew (e.g., paragraphs [0021], [0043]-[0044]) disclose a video player timeline that can present "cue points" that can be selected by a user to skip to various locations within a media file, such as locations where particular keywords are spoken. The Office has not proposed any mapping for the claim term "metadata file" to particular features of Chew, and Chew does not appear to disclose or suggest any "metadata file" let alone, a metadata file with a playback pointer that is synchronized with a playback pointer for an audio content stream, as generally recited in claim 1. Therefore, the Office has not met is burden of establishing, prima facie, that Chew discloses or suggests: "synchronize a read pointer for the metadata file with a playback pointer for the audio content stream."”

The Office respectfully disagrees. It appears that Applicant relies on the arguments presented above, and thus their corresponding answers are maintained herein. Moreover, with respect to Applicant’s arguments that the mappings to Merrill are somehow invalid because Merrill limits itself to soundless mouth movements, not only would such a characterization still appear to reasonably read on the claims due to their persisting breadth and the way the “metadata file” was broadly described, but also Merrill was not limited to said alleged soundless movements (see the many other event alternatives throughout Merrill: col. 16, lines 15-46). Moreover, as shown above and maintained herein, Chew does provide its own (synchronized) “metadata file” teachings.

“    Thus, although the positioning of markers 386, 388 in Merrill's sound editing tool may be manipulated to change which portions of the sound file are mapped to the different mouth shapes, Merrill does not disclose or suggest that the markers can be used to "modify timing of the audio content stream." For example, the markers cannot be used to speed up slow down playback of the sound file. Accordingly, Merrill does not disclose or suggest "modify timing of the audio content stream." 
Moreover, even if the markers 386, 388 could be used to modify the audio content stream timing (a fact the Applicant does not concede), any user manipulation of the markers 386, 388 occurs prior to playback of the file to the end user (and not "during the dynamic user-influenced media experience"). The sound editing tool shown in Merrill's FIG. 3 is intended to provide the user with a "preview" (see col. 14, line 3) of how the mouth shapes will be mapped to the different phonemes in the sound file. Thus, the tool is intended to be used prior to playback of any portion of the file or animation to an end user. Accordingly, Merrill does not disclose or suggest "modify timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience.”

The Office respectfully disagrees. In response to Applicant’s arguments that the references fail to show certain features of Applicant’s invention, it is noted that the features upon which Applicant relies (that “the markers […] be used to speed up slow down playback of the sound file”) are not recited in the rejected claims. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 U.S.P.Q.2d 1057 (Fed. Cir. 1993). Furthermore, regardless of Applicant’s opinion with regard to Merrill’s alleged intentions, it is still respectfully submitted that the marker manipulations conceded by the Applicant above would still result in a modification to a timing of the audio content stream (regardless of said stream being a “preview” or a live/full reproduction).

“On page 15 of the Office action, the Office argues, in the alterative, that Chew discloses or suggests "modify timing of the audio content stream responsive to second user input received during the dynamic user-influenced media experience" The Applicant disagrees. As discussed above, the cited portions of Chew (e.g., paragraphs [0021], [0043]-[0044]) disclose a video player timeline that can present "cue points" that can be selected by a user to skip to various locations within a media file, such as locations where particular keywords are spoken. Clicking cue points at different locations in a media file timeline does not disclose or suggest "modify timing of the audio content stream." For example, the audio content stream is not sped up or slowed down.
Further, playback of Chew's media file cannot be reasonably characterized as a "dynamic user-influenced media experience" (see, e.g., Applicant's paragraph [0011], providing: "[a]s used herein, the term 'dynamic user-influenced media experience' refers to an audio/visual experience that is adapted to change (e.g., vary in visual/audio content or other sensory effects) based on user inputs received while a media content stream is being presented to a user). It follows that Chew does not disclose or suggest any "dynamic user-influenced media experience" and therefore cannot disclose any modification of playback timing that is "responsive to [a] user input received during the dynamic user-influenced media experience."”

The Office respectfully disagrees. In response to Applicant’s arguments that the references fail to show certain features of Applicant’s invention, it is noted that the features upon which Applicant relies (that the audio content stream be “sped up or slowed down”) are not recited in the rejected claims. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 U.S.P.Q.2d 1057 (Fed. Cir. 1993). Moreover, clicking cue points in Chew causes the timing of the audio content stream to be modified at least in the sense that playback timing jumps to the timestamp associated with the corresponding cue point.

Therefore, the Office respectfully asserts that the cited art sufficiently teaches the limitations recited in the amended claims.

Conclusion
THIS ACTION IS MADE FINAL.  Applicants are reminded of the extension of time policy as set forth in 37 C.F.R. § 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 C.F.R. § 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
The prior art made of record and not relied upon is considered pertinent to Applicant’s disclosure.  Applicants are required under 37 C.F.R. § 1.111(c) to consider these references fully when responding to this action.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way.  A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALVARO R CALDERON IV whose telephone number is (571) 272-1818.  The examiner can normally be reached on Monday - Friday (9:30am - 6:00pm).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kieu D. Vu, can be reached on (571) 272-4057.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ALVARO R. CALDERON IV/
Examiner
Art Unit 2173



/KIEU D VU/
Supervisory Patent Examiner, Art Unit 2173