DETAILED ACTION

This communication is in response to the Application filed on 23 March 2020. Claims 1-20 are pending and have been examined.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Objections
Claim 1 is objected to because of the following informalities:  typographical error in the limitation “detecting an indication of a selected text-based media of \ at least one text-based media displayed on a client device, wherein the selected text-based media is to be utilized in generation of an audio story”. Need to remove “\”. Appropriate correction is required.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 1, 6, and 17 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 20190005959, hereinafter referred to as Cameron et al. (1).


Regarding claim 1, Cameron et al. (1) discloses a computer-implemented method for generating an audio story (“It is an object of at least some embodiments of the invention to provide an improved system and/or method for the creation and/or playback of soundtrack-enhanced audiobooks, or to at least provide the public with a useful choice,” Cameron et al. (1), para [0005]. The audiobook is an audio story.), the method comprising: 

detecting an indication of a selected text-based media of \ at least one text-based media displayed on a client device, wherein the selected text-based media is to be utilized in generation of an audio story (“Referring to FIG. 3A, in an embodiment the system is provided with a speech-to-text mapping engine 100. The mapping engine 100 receives the digital audiobook data file or files 102, typically audio files in mp3 or other audio formats or similar, and the e-book data file 104 or other electronic text representing the narrated words in the audiobook (e.g. sourced from any digital text source or generated by a speech-to-text converter or engine),” Cameron et al. (1), para [0141]. And, “FIG. 4B is a schematic diagram of an electronic user device or hardware system operable to display electronic media content,” Cameron et al. (1), para [0093].); 

converting the selected text-based media into an audio representation of the selected text-based media as an audio file, wherein a time position of each portion of the audio file corresponds to a portion of the selected text-based media (“FIG. 3C is a graphical representation of the 1:1 resolution mapping data of FIG. 3B depicted relative to a text position axis and an audiobook playback timeline, also showing an overlay of the audio regions of a soundtrack,” Cameron et al. (1), para [0080]. And, “FIG. 3E is a graphical representation of the marker-determined resolution mapping data of FIG. 3D depicted relative to a text position axis and an audiobook playback timeline, also showing an overlay of the audio regions of a soundtrack,” Cameron et al. (1), para [0082].); 

modifying the audio file to incorporate supplemental media content at a series of time positions of the audio file, the supplemental media content differing from that of the audio representation of the selected text-based media, the modified audio file comprising the audio story (“FIG. 3E is a graphical representation of the marker-determined resolution mapping data of FIG. 3D depicted relative to a text position axis and an audiobook playback timeline, also showing an overlay of the audio regions of a soundtrack,” Cameron et al. (1), para [0082]. Here, the supplemental media content is the soundtrack. And, “The audio regions in the soundtrack may comprise any one or more of different types of audio regions including, but not limited to, music, ambience, or sound effects,” Cameron et al. (1), para [0110].); and 

providing the audio story to the client device, wherein the client device is configured to playback the audio story responsive to identifying an indication to playback the audio story (“FIG. 4B is a schematic diagram of an electronic user device or hardware system operable to display electronic media content, playback an audiobook, and co-ordinate the synchronised playback of a soundtrack of the type described with reference to FIGS. 1 and 2 based on the user's reading position with the displayed text or alternatively based on the audiobook playback position,” Cameron et al. (1), para [0093].).  
As to claim 17, CRM claim 17 and method claim 1 are related as method and CRM of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 17 is similarly rejected under the same rationale as applied above with respect to method claim 1. Also, Cameron et al. (1), para [0043] teaches a processor and memory, and para [0055] teaches CRM.

Regarding claim 6, Cameron et al. (1) discloses the computer-implemented method of claim 1, further comprising: 

providing an instruction to implement an authoring dashboard on an author device, the authoring dashboard incorporating the audio file (“In one form, the narration speed data comprises a plurality of narration speed values each corresponding to a respective segment or portion of the audiobook playback duration, which is manually created by a user listening to the audiobook and marking the audiobook to words that are time markers in the audiobook which can then be used to both determine narration speed data and accurately reference soundtrack layers from position in the text to the position in the audiobook,” Cameron et al. (1), para [0029]. The user (author) is able to incorporate the audio file and make modifications.); and 

modifying the audio file based on a series of supplemental audio effects added at various time positions of the selected text-based media in the authoring dashboard (“In one form, the narration speed data comprises a plurality of narration speed values each corresponding to a respective segment or portion of the audiobook playback duration, which is manually created by a user listening to the audiobook and marking the audiobook to words that are time markers in the audiobook which can then be used to both determine narration speed data and accurately reference soundtrack layers from position in the text to the position in the audiobook,” Cameron et al. (1), para [0029]. The soundtrack is a supplemental audio effect added by the user (author) at various time positions in the text.).  


Claim(s) 9 and 14 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US 20200135158, hereinafter referred to as Yao et al.

Regarding claim 9, Yao et al. discloses a method performed by a network-accessible device for generating an audio story that is specific to a first client (“The image acquisition device is configured to acquire an image of the user's real-time reading content,” Yao et al., para [0005].), the method comprising: 

causing display of a series of text-based media (“Further, the image acquisition device includes a camera and/or a text capturing tool of a smart reading device, and the text capturing tool includes a screenshot tool, a text memory reading tool or an invoking tool of an application programming interface (API) for a reading application,” Yao et al., para [0006].); 

detecting an indication of a selected text-based media of the series of text-based media, wherein the selected text-based media is to be utilized in generation of an audio story (“The image acquisition device is configured to acquire an image of the user's real-time reading content. The processing device includes a transmission unit, a memory unit, and an audio unit, and an operation unit for controlling the transmission unit, the memory unit, and the audio unit to perform transmission, memory, and audio synthesis, respectively, The operation unit includes: an image extraction module configured to receive an input signal of the image acquisition device, and then to convert the image into an image signal; and a word recognition module configured to process the image signal to make it clear and easy to recognize, and to identify by the image signal. The recognized word is stored in a cached text file, and classifies the word in the text file. The semantic analysis module is used to identify the semantics of the classified word, to extract the environmental semantic words and the emotional semantic words respectively, and then to retrieve an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library. The audio synthesis module is configured to perform audio synthesis and sound enhancement on the basis of the background music,” Yao et al., para [0005]. And, “FIG. 11 shows an embodiment of the word acquiring process according to the method of the present disclosure. The time domain control and audio synthesis process in the present disclosure are exemplified below with the background words and emotional words identified from the article as shown,” Yao et al., para [0072].);

generating an audio representation of the selected text-based media as an audio file, wherein a time position of each portion of the audio file corresponds to a portion of the selected text-based media (“Further, the audio synthesis module includes: a time domain recorder for recording at least one reading time node according to a text change in a reading target area of the acquired image, recording at least one emotional time node if the accumulated emotional score value exceeds a preset threshold, each emotional time node corresponding to a position of the emotional word in the text segment, and generating a time domain control bar by integrating the reading time node and the emotional time node; and a mixer for superimposing audio signals of the background music and the sound effect music in time domain by a saturator having an attenuation factor, by means of the time domain control bar,” Yao et al., para [0010].); 

retrieving a first series of characteristics relating to the first client (“Further, the audio synthesis module includes: a time domain recorder for recording at least one reading time node according to a text change in a reading target area of the acquired image, recording at least one emotional time node if the accumulated emotional score value exceeds a preset threshold, each emotional time node corresponding to a position of the emotional word in the text segment, and generating a time domain control bar by integrating the reading time node and the emotional time node; and a mixer for superimposing audio signals of the background music and the sound effect music in time domain by a saturator having an attenuation factor, by means of the time domain control bar,” Yao et al., para [0010]. Here, the emotional characteristics relate to the client.); 

inspecting text included in the selected text-based media to identify a series of keywords in the selected text-based media (“The semantic analysis module is used to identify the semantics of the classified word,” Yao et al., para [0005]. And, “S4, identifying the semantics of the classified word, and extracting environmental semantic words and emotional semantic words respectively,” Yao et al., para [0015]. The classified word is interpreted as a keyword. The inspecting of text is done via semantic analysis.); 

comparing the series of keywords and the first series of characteristics relating to the first client with a listing of known supplemental audio types to identify a first supplemental audio type that corresponds to the series of keywords and the first series of characteristics relating to the first client (Yao et al., para [0005]. And, “S5, retrieving an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library,” Yao et al., para [0016]. The background music is a supplemental audio type.); 

modifying the audio file to add supplemental audio effects included in the first supplemental audio type at time positions corresponding with each of the identified series of keywords, the modified audio file comprising the audio story (Yao et al., para [0005]. And, “S5, retrieving an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library,” Yao et al., para [0016].); and 

causing playback of the audio story responsive to identifying an indication to playback the audio story (“S6, performing audio synthesis and sound enhancement on the basis of background music, and playing the synthesized audio by the audio output device,” Yao et al., para [0017].).  

Regarding claim 14, Yao et al. discloses the method of claim 9, further comprising: 

detecting a second indication of the selected text-based media of the series of text-based media by a second client (“The image acquisition device is configured to acquire an image of the user's real-time reading content. The processing device includes a transmission unit, a memory unit, and an audio unit, and an operation unit for controlling the transmission unit, the memory unit, and the audio unit to perform transmission, memory, and audio synthesis, respectively, The operation unit includes: an image extraction module configured to receive an input signal of the image acquisition device, and then to convert the image into an image signal; and a word recognition module configured to process the image signal to make it clear and easy to recognize, and to identify by the image signal. The recognized word is stored in a cached text file, and classifies the word in the text file. The semantic analysis module is used to identify the semantics of the classified word, to extract the environmental semantic words and the emotional semantic words respectively, and then to retrieve an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library. The audio synthesis module is configured to perform audio synthesis and sound enhancement on the basis of the background music,” Yao et al., para [0005]. And, “FIG. 11 shows an embodiment of the word acquiring process according to the method of the present disclosure. The time domain control and audio synthesis process in the present disclosure are exemplified below with the background words and emotional words identified from the article as shown,” Yao et al., para [0072]. The method described in these passages may be applied to a second client.); 

retrieving a second series of characteristics relating to the second client (“Further, the audio synthesis module includes: a time domain recorder for recording at least one reading time node according to a text change in a reading target area of the acquired image, recording at least one emotional time node if the accumulated emotional score value exceeds a preset threshold, each emotional time node corresponding to a position of the emotional word in the text segment, and generating a time domain control bar by integrating the reading time node and the emotional time node; and a mixer for superimposing audio signals of the background music and the sound effect music in time domain by a saturator having an attenuation factor, by means of the time domain control bar,” Yao et al., para [0010]. Here, the emotional characteristics relate to the client.); 

comparing the selected text-based media and the second series of characteristics with the listing of known supplemental audio types to identify a second supplemental audio type that corresponds to the selected text-based media and the second series of characteristics (Yao et al., para [0005]. Also, “The semantic analysis module is used to identify the semantics of the classified word,” Yao et al., para [0005]. And, “S4, identifying the semantics of the classified word, and extracting environmental semantic words and emotional semantic words respectively,” Yao et al., para [0015]. The classified word is interpreted as a keyword. The inspecting of text is done via semantic analysis. And, “S5, retrieving an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library,” Yao et al., para [0016]. The background music is a supplemental audio type.); and 

modifying the audio file to add supplemental audio effects included in the second supplemental audio type at a series of time positions of the selected text-based media, the modified audio file comprising the audio story (“S6, performing audio synthesis and sound enhancement on the basis of background music, and playing the synthesized audio by the audio output device,” Yao et al., para [0017]. And, “Further, the step S6 further includes:…and superimposing audio signals of the background music and the sound effect music in time domain by a saturator having an attenuation factor, by means of the time domain control bar,” Yao et al., para [0020].).  


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20190005959, hereinafter referred to as Cameron et al. (1), in view of US 20180032610, hereinafter referred to as Cameron et al. (2).

Regarding claim 2, Cameron et al. (1) discloses the computer-implemented method of claim 1, wherein said modifying the audio file to incorporate supplemental media content at the series of time positions of the audio file further comprises: 

inspecting words included in the selected text-based media (“The soundtrack generation may also be partially or fully automated using semantic analysis of the text to identify mood or other characteristics of the narration and automatically configure suitable audio regions with suitable audio tracks,” Cameron et al. (1), para [0196]. The semantic analysis is used to inspect the words of the text.); 

processing 

comparing the derived predicted emotion with a listing of known supplemental audio types to identify a first supplemental audio type that corresponds to the derived predicted emotion of the selected text-based media (Cameron et al. (1), para [0196]. Here, suitable audio tracks correspond to a listing of known supplemental audio types. Also, “The audio regions in the soundtrack may comprise any one or more of different types of audio regions including, but not limited to, music, ambience, or sound effects,” Cameron et al. (1), para [0110]. Thus, a first supplemental audio type may be, say, music.); and

at a series of time positions throughout a duration of the selected-text based media, modifying the audio story to add a supplemental audio effect included in the first supplemental audio type (“FIG. 4B is a schematic diagram of an electronic user device or hardware system operable to display electronic media content, playback an audiobook, and co-ordinate the synchronised playback of a soundtrack of the type described with reference to FIGS. 1 and 2 based on the user's reading position with the displayed text or alternatively based on the audiobook playback position,” Cameron et al. (1), para [0093].).

Cameron et al. (1), though, does not explicitly describe generating a content model representing features of the selected text-based media.

Cameron et al. (2) is cited to disclose generating a content model representing features of the selected text-based media (“(c) applying semantic analysis to a series of text segments of the processed text data based on a continuous emotion model defined by a predefined number of emotional category identifiers each representing an emotional category in the model, the semantic analysis being configured to parse the processed text data to generate, for each text segment, a segment emotional data profile based on the continuous emotion model,” Cameron et al. (2), para [0011]. Here, semantic analysis is used to inspect words included in the selected text-based media, and the emotion model is a content model.). Cameron et al. (2) benefits Cameron et al. (1) by generating an emotional profile for each text segment in the context of a continuous emotion model (Cameron et al. (2), Abstract), thereby providing for the assignment of an emotional sound attribute to the text. Therefore, it would be obvious for one skilled in the art to combine the teachings of Cameron et al. (1) with those of Cameron et al. (2) to enhance the audio playback of the audiobook system described by Cameron et al. (1). 
As to claim 18, CRM claim 18 and method claim 2 are related as method and CRM of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 18 is similarly rejected under the same rationale as applied above with respect to method claim 2. Also, Cameron et al. (1), para [0043] teaches a processor and memory, and para [0055] teaches CRM.

Claims 3 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20190005959, hereinafter referred to as Cameron et al. (1), in view of US 20180032610, hereinafter referred to as Cameron et al. (2), and further in view of US 20140223462, hereinafter referred to as Aimone et al. 

Regarding claim 3, Cameron et al. (1), as modified by Cameron et al. (2), discloses the computer-implemented method of claim 2, further comprising: 

identifying a series of internal characteristics and a series of external characteristics relating to the client, the series of internal characteristics indicative of past interactions by the client (“However, in other configurations or modes, the music may be selected by the system to counteract the emotion or mood associated with the live speech audio (e.g. if an angry or aggressive mood or emotion is identified in the live speech audio, the system may be configured to select calming music to counteract that mood). In further configurations, the music may be selected based on a moving average of the emotional profiles of all or at least a portion of the past processed text regions,” Cameron et al. (2), para [0447].);

identifying a second supplemental audio type of the listing of known supplemental audio types that corresponds to the set of features associated with the client (Cameron et al. (1), para [0110]. Here, a second supplemental audio type may be, say, sound effects.); and 

at a series of time positions throughout a duration of the selected-text based media, modifying the audio story to add a supplemental audio effect included in the second supplemental audio type (“FIG. 4B is a schematic diagram of an electronic user device or hardware system operable to display electronic media content, playback an audiobook, and co-ordinate the synchronised playback of a soundtrack of the type described with reference to FIGS. 1 and 2 based on the user's reading position with the displayed text or alternatively based on the audiobook playback position,” Cameron et al. (1), para [0093].).  

Neither Cameron et al. (1) nor Cameron et al. (2), though, discloses that the series of external characteristics are indicative of environmental features detected by the client device; generating a prediction model based on the identified series of internal characteristics and the series of external characteristics relating to the client; and processing the prediction model to derive a set of features that are associated with the client.

Aimone et al. is cited to disclose that the series of external characteristics are indicative of environmental features detected by the client device (“Sensors can be biological sensors or other kind of sensors hosted by the client device (e.g. outdoor air temperature thermometers, other environmental sensors, accelerometers, light, environmental molecules, ambient sound, wind speed, water temperature, etc.),” Aimone et al., para [0264].); 

generating a prediction model based on the identified series of internal characteristics and the series of external characteristics relating to the client (“Feature extraction is also a form of signal processing, however the goal may be to extract features that are useful for machine learning to build prediction models,” Aimone et al., para [0135]. And, Aimone et al., para [0169].); and

processing the prediction model to derive a set of features that are associated with the client (Aimone et al., para [0135] and para [0169].). Aimone et al. benefits Cameron et al. (1) by modulating content presentation using the user’s brain-state data, thereby improving the evocative or engaging nature of the communication platform (Aimone et al., para [0007]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Cameron et al. (1) with those of Aimone et al. to enhance the audiobook creation of Cameron et al. (1).
As to claim 19, CRM claim 19 and method claim 3 are related as method and CRM of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 19 is similarly rejected under the same rationale as applied above with respect to method claim 3. Also, Cameron et al. (1), para [0043] teaches a processor and memory, and para [0055] teaches CRM.

Claims 4-5 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20190005959, hereinafter referred to as Cameron et al. (1), in view of US 20180032610, hereinafter referred to as Cameron et al. (2), and further in view of US 20180032611, hereinafter referred to as Cameron et al. (3).

Regarding claim 4, Cameron et al. (1), as modified by Cameron et al. (2), discloses the computer-implemented method of claim 2, but not further comprising: 

retrieving a listing of advertising content entries, each advertising content entry including characteristics relating to advertising content; 

comparing the derived predicted emotion of the selected text-based media with the listing of advertising content entries to identify a first advertising content entry that corresponds to the derived predicted emotion of the selected text-based media; and 

modifying the audio story to add the first advertising content entry to the audio story.

Cameron et al. (3) is cited to disclose retrieving a listing of advertising content entries, each advertising content entry including characteristics relating to advertising content (“The soundtrack generation system can generate text data and mood data relating to the live conversation, and based on this data advertising may be selected and targeted appropriately in the soundtrack (e.g. between songs) or alternatively visual advertisements may be cued for presentation or playback on any associated visual display device. The text data about the topic of conversation and mood enable effective advertising targeting. The advertising selected may also be based on other supplementary data or typical advertising targeting data, such as user profile information, demographic information, user preferences, location and the like,” Cameron et al. (3), para [0383].); 
comparing the derived predicted emotion of the selected text-based media with the listing of advertising content entries to identify a first advertising content entry that corresponds to the derived predicted emotion of the selected text-based media (Cameron et al. (3), para [0383].); and 
modifying the audio story to add the first advertising content entry to the audio story (Cameron et al. (3), para [0383].). Cameron et al. (3) benefits Cameron et al. (1) by using emotion prediction to determine appropriate advertisements to present to a user (Cameron et al. (3), para [0383]), thereby providing the user with recommendations more likely to be of interest to the user. Therefore, it would be obvious for one skilled in the art to combine the teachings of Cameron et al. (1) with those of Cameron et al. (3) to extend the audiobook creation of Cameron et al. (1) to an advertising service.
As to claim 20, CRM claim 20 and method claim 4 are related as method and CRM of using same, with each claimed element’s function corresponding to the method step. Accordingly, claim 20 is similarly rejected under the same rationale as applied above with respect to method claim 4. Also, Cameron et al. (1), para [0043] teaches a processor and memory, and para [0055] teaches CRM.

Regarding claim 5, Cameron et al. (1), as modified by Cameron et al. (2) and Cameron et al. (3), discloses the computer-implemented method of claim 4, further comprising: 

identifying a series of internal characteristics and a series of external characteristics relating to the client, the series of internal characteristics indicative of past interactions by the client (“However, in other configurations or modes, the music may be selected by the system to counteract the emotion or mood associated with the live speech audio (e.g. if an angry or aggressive mood or emotion is identified in the live speech audio, the system may be configured to select calming music to counteract that mood). In further configurations, the music may be selected based on a moving average of the emotional profiles of all or at least a portion of the past processed text regions,” Cameron et al. (2), para [0447].), the series of external characteristics indicative of environmental features detected by the client device (“the sensors may be configured to also sense background and other noise or sounds in the environment in combination with the speech audio,” Cameron et al. (3), para [0186].); 
comparing the series of internal characteristics and the series of external characteristics relating to the client with the listing of advertising content entries to identify a second advertising content entry that corresponds to the series of internal characteristics and the series of external characteristics relating to the client (“The soundtrack generation system generates mood data relating to captured live speech audio, and this may be collected and analyzed as a group of aggregated data on a number of levels to generate data indicative of the mood of a room, part of town, city, county, etc. This mood data or mood meter may enable personal and commercial decisions to be made from where is a happy place to go on holiday to what is the mood in your workplace today. This mood data may also be used to enhance targeted electronic advertising,” Cameron et al. (3), para [0384].); and
modifying the audio story to add the second advertising content entry to the audio story (“In an embodiment, the method further comprises selected advertising content based on the emotional profile data generated and serving targeted audio and/or visual advertising to the participants of the live conversation at least partly based on the emotional profile data generated. In one configuration, the advertising content may be audio advertising served between one or music tracks in the live conversation,” Cameron et al. (3), para [0117].).  

Claim 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20190005959, hereinafter referred to as Cameron et al. (1), in view of US 20180032610, hereinafter referred to as Cameron et al. (2), and further in view of US 20110153047, hereinafter referred to as Cameron et al. (4).

Regarding claim 7, Cameron et al. (1), as modified by Cameron et al. (2), discloses the computer-implemented method of claim 2, but not further comprising:

processing the content model of the selected text-based media to derive a geographic profile indicative of a primary geographic region identified in the selected text-based media;

comparing the geographic profile with the listing of known supplemental audio types to identify a third supplemental audio type that corresponds to the geographic profile; and

at the series of time positions throughout the duration of the selected-text based media, modifying the audio story to add a supplemental audio effect included in the third supplemental audio type.
Cameron et al. (4) is cited to disclose processing the content model of the selected text-based media to derive a geographic profile indicative of a primary geographic region identified in the selected text-based media (“For example, the music layer 22 may comprise desired background music such as orchestral music or band songs, the background layer 24 may comprise weather sounds, scene noise or the like, and the effects layer 26 may include sound effects such as gunshots, door-slamming, lightning etc, that are timed to synchronise with events occurring in the text source,” Cameron et al. (4), para [0117].); 
comparing the geographic profile with the listing of known supplemental audio types to identify a third supplemental audio type that corresponds to the geographic profile (“For example, the music layer 22 may comprise desired background music such as orchestral music or band songs, the background layer 24 may comprise weather sounds, scene noise or the like, and the effects layer 26 may include sound effects such as gunshots, door-slamming, lightning etc, that are timed to synchronise with events occurring in the text source,” Cameron et al. (4), para [0117]. The background layer is a third supplemental audio type.); and
at the series of time positions throughout the duration of the selected-text based media, modifying the audio story to add a supplemental audio effect included in the third supplemental audio type (“In step 1 of FIG. 2, the reader buys a text source, such as a book, e-book, an online text, audible book or any other publication with text,” Cameron et al. (4), para [0097]. And, “For example, the music layer 22 may comprise desired background music such as orchestral music or band songs, the background layer 24 may comprise weather sounds, scene noise or the like, and the effects layer 26 may include sound effects such as gunshots, door-slamming, lightning etc, that are timed to synchronise with events occurring in the text source,” Cameron et al. (4), para [0117]. Synchronise is interpreted as matching time positions to the audio of the text/story.). Cameron et al. (4) benefits Cameron et al. (1) by adding geographic-aware supplemental audio to the audio story (Cameron et al. (4), para [0117]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Cameron et al. (1) with those of Cameron et al. (4) to enhance the audiobook creation of Cameron et al. (1).

Claim 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20190005959, hereinafter referred to as Cameron et al. (1), in view of US 20100299149, hereinafter referred to as Kurzweil et al.

Regarding claim 8, Cameron et al. (1) discloses the computer-implemented method of claim 1, but not further comprising: 

subsequent to generation of the audio file, identifying a second text-based media that was published in response to the selected text-based media; 

converting the second text-based media into an audio representation of the second text-based media by comparing each word of the second text-based media with a corresponding entry in a listing of speech entries; and

modifying the audio file to incorporate the audio representation of the second text-based media into the audio file.

Kurzweil et al. is cited to disclose subsequent to generation of the audio file, identifying a second text-based media that was published in response to the selected text-based media (“In some additional examples, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process. For example, line 212 of the story shown in FIG. 9 recites "To which the pig answered `no, not by the hair of my chinny chin chin.",” Kurzweil et al., para [0056]. Here, a quotation in response to another audio quote.); 

converting the second text-based media into an audio representation of the second text-based media by comparing each word of the second text-based media with a corresponding entry in a listing of speech entries (“The computer system recognizes the quotation "no, not by the hair of my chinny chin chin" based on the text being enclosed in quotation marks. The system review the text leading up to or following the quotation for an indication of the speaker. In this example, the text leading up to the quotation states "To which the pig answered" as such, the system could recognize that the pig is the character speaking this quotation and associate the quotation with the voice model for the pig,” Kurzweil et al., para [0056]. Then, the quotation is assigned a voice different from that of the previous speaker.); and 

modifying the audio file to incorporate the audio representation of the second text-based media into the audio file (Kurzweil et al., para [0056]. The voice of the character matching the quotation is incorporated into the document narration (audio file).). Kurzweil et al. benefits Cameron et al. (1) by allowing each character of a document narration to be represented by a unique voice (Kurzweil et al., Abstract). Therefore, it would be obvious for one skilled in the art to combine the teachings of Cameron et al. (1) with those of Kurzweil et al. to enhance the audiobook creation of Cameron et al. (1).

Claim 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20200135158, hereinafter referred to as Yao et al., in view of US 20180032610, hereinafter referred to as Cameron et al. (2).

Regarding claim 10, Yao et al. discloses the method of claim 9, wherein
the series of external characteristics are indicative of environmental features detected by the network-accessible device (“The semantic analysis module is used to identify the semantics of the classified word, to extract the environmental semantic words and the emotional semantic words respectively, and then to retrieve an environmental background music or an emotional background music by comparing the environmental semantic words or the emotional semantic words to an element in a background music library,” Yao et al., para [0005]. And, Yao et al., para [0076] teaches that the device may be network-accessible.).

Yao et al., though, does not disclose wherein the first series of characteristics include a series of internal characteristics and a series of external characteristics relating to the first client, the series of internal characteristics indicative of past interactions by the first client.

Cameron et al. (2) is cited to disclose wherein the first series of characteristics include a series of internal characteristics and a series of external characteristics relating to the first client, the series of internal characteristics indicative of past interactions by the first client (“However, in other configurations or modes, the music may be selected by the system to counteract the emotion or mood associated with the live speech audio (e.g. if an angry or aggressive mood or emotion is identified in the live speech audio, the system may be configured to select calming music to counteract that mood). In further configurations, the music may be selected based on a moving average of the emotional profiles of all or at least a portion of the past processed text regions,” Cameron et al. (2), para [0447].). Cameron et al. (2) benefits Yao et al. by using past emotional profile information to better predict a present emotion for a text portion (Cameron et al. (2), para [0447]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yao et al. with those of Cameron et al. (2) to enhance the reading environment of Yao et al.


Claim 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20200135158, hereinafter referred to as Yao et al., in view of US 20180032610, hereinafter referred to as Cameron et al. (2), and further in view of US 20150163561, hereinafter referred to as Grevers.

Regarding claim 11, Yao et al. discloses the method of claim 9, but not further comprising: 

retrieving a listing of advertising content entries, each advertising content entry including characteristics relating to advertising content;

comparing the series of keywords with the listing of advertising content entries to identify a first advertising content entry that corresponds to the series of keywords; and

modifying the audio story to add the first advertising content entry to the audio story.

Grevers is cited to disclose retrieving a listing of advertising content entries, each advertising content entry including characteristics relating to advertising content (“Still referring to FIG. 3A, a specific example is shown in which three keywords are selected from a spoken sentence that is translated into text. The sentence is "After dinner, we are going shopping and then to see a movie." The three keywords are selected based upon events and objects, while other words are deemed not to be relevant and are ignored. Once selected, keyword "Dinner" 310(1) may be provided to geo-targeting advertising service 110, in combination with geolocation information, to trigger the display of advertising material directed to nearby restaurants shown at reference numeral 315(1),” Grevers, para [0034].); 





comparing the series of keywords with the listing of advertising content entries to identify a first advertising content entry that corresponds to the series of keywords (“Still referring to FIG. 3A, a specific example is shown in which three keywords are selected from a spoken sentence that is translated into text. The sentence is "After dinner, we are going shopping and then to see a movie." The three keywords are selected based upon events and objects, while other words are deemed not to be relevant and are ignored. Once selected, keyword "Dinner" 310(1) may be provided to geo-targeting advertising service 110, in combination with geolocation information, to trigger the display of advertising material directed to nearby restaurants shown at reference numeral 315(1),” Grevers, para [0034].); and

modifying the audio story to add the first advertising content entry to the audio story (“Referring now to FIG. 3B and continuing with the example of FIG. 3A, a technique is shown for precisely managing the order or sequence in which advertisements are presented to a user of the endpoint device. In particular, advertisements are presented in a sequence (from top to bottom, bottom to top, left to right, right to left, etc.) that tracks the occurrence of a keyword obtained for captured audio. For example, the first advertisement may be displayed in the upper right portion of the display screen 140, where the first advertisement 315(1) corresponds to an advertisement related to "dinner", and "dinner" is the first identified keyword in the example of FIG. 3A,” Grevers, para [0037].). Grevers benefits Yao et al. by providing context aware geo-tagged advertisement in a communication session (Grevers, Abstract), thereby providing the user with recommendations more likely to be of interest to the user. Therefore, it would be obvious for one skilled in the art to combine the teachings of Yao et al. with those of Grevers to extend the reading environment of Yao et al. to an advertising service.

Claim 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20200135158, hereinafter referred to as Yao et al., in view of US 20100106498, hereinafter referred to as Morrison.

Regarding claim 12, Yao et al. discloses the method of claim 9, but not further comprising: 

detecting a geographic region indicator indicative of a geographic location of the network-accessible device; 

comparing the geographic region indicator with the listing of advertising content entries to identify a first audio content entry that includes audio content that corresponds to the geographic location of the network-accessible device;

modifying the audio story to add the audio content included in the first audio content entry to the audio story.

Morrison is cited to disclose detecting a geographic region indicator indicative of a geographic location of the network-accessible device (“The system receives from an advertiser an advertisement related to the identified at least one key phrase (208). Advertisements can include text, pictures, audio, video, coupons, and other advertising or promotional material. For example, based on the key phrase "new car", an advertiser such as Ford can send the system an audio advertisement for a Ford Explorer. Based on the key phrase "football", an advertiser such as the National Football League (NFL) can send the system a video advertisement for an upcoming Monday Night Football game. Based on the key phrase "chocolate", an advertiser such as Toblerone can send the system a text-based advertisement including a coupon code for 30% off. Based on the key phrase "let's go see a movie", an advertiser such as a local movie theater can send the system a blended advertisement including text of the movie titles and show times, audio movie reviews, pictures of movie posters, coupons for discounted popcorn, and even trailers for currently showing movies,” Morrison, para [0024]. This excerpt shows that geographic location information is used to determine movies playing locally. And, “The first device can be a converged voice and data communications device connected to a network,” Morrison, para [0013]. This excerpt explains that the user device is network accessible.); 

comparing the geographic region indicator with the listing of advertising content entries to identify a first audio content entry that includes audio content that corresponds to the geographic location of the network-accessible device (Morrison, para [0024]. This passage also shows that the advertisement may include audio (movie reviews and trailers). See also Morrison, fig. 2.); and 

modifying the audio story to add the audio content included in the first audio content entry to the audio story (Morrison, para [0024]. The advertisement audio is added to the audio story.). Morrison benefits Yao et al. by providing context aware geo-tagged advertisement in a communication session (Morrison, para [0024]), thereby providing the user with recommendations more likely to be of interest to the user. Therefore, it would be obvious for one skilled in the art to combine the teachings of Yao et al. with those of Morrison to extend the reading environment of Yao et al. to an advertising service.

Claim 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over US 20200135158, hereinafter referred to as Yao et al., in view of US 20100299149, hereinafter referred to as Kurzweil et al.

Regarding claim 13, Yao et al. discloses the method of claim 9, but not further comprising: 

subsequent to generation of the audio file, identifying a quote provided in a second text-based media provided in response to the selected text-based media;

converting the second text-based media into an audio representation of the second text-based media by comparing each word of the second text-based media with a corresponding entry in a listing of speech entries, wherein the audio representation of the second text-based media includes a voice type that is different than a voice type of the audio representation of the selected text-based media; and

modifying the audio file to incorporate the audio representation of the second text-based media into the audio file.

Kurzweil et al. is cited to disclose subsequent to generation of the audio file, identifying a quote provided in a second text-based media provided in response to the selected text-based media (“In some additional examples, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process. For example, line 212 of the story shown in FIG. 9 recites "To which the pig answered `no, not by the hair of my chinny chin chin.",” Kurzweil et al., para [0056]. Here, a quotation in response to another audio quote.); 

converting the second text-based media into an audio representation of the second text-based media by comparing each word of the second text-based media with a corresponding entry in a listing of speech entries, wherein the audio representation of the second text-based media includes a voice type that is different than a voice type of the audio representation of the selected text-based media (“The computer system recognizes the quotation "no, not by the hair of my chinny chin chin" based on the text being enclosed in quotation marks. The system review the text leading up to or following the quotation for an indication of the speaker. In this example, the text leading up to the quotation states "To which the pig answered" as such, the system could recognize that the pig is the character speaking this quotation and associate the quotation with the voice model for the pig,” Kurzweil et al., para [0056]. Then, the quotation is assigned a voice different from that of the previous speaker.); and 

modifying the audio file to incorporate the audio representation of the second text-based media into the audio file (Kurzweil et al., para [0056]. The voice of the character matching the quotation is incorporated into the document narration (audio file).). Kurzweil et al. benefits Yao et al. by allowing each character of a document narration to be represented by a unique voice (Kurzweil et al., Abstract). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yao et al. with those of Kurzweil et al. to enhance the reading environment of Yao et al.


Allowable Subject Matter

Claims 15 and 16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. None of the prior art describes the steps of claims 15 and 16. 


Conclusion
Other related prior art is listed in the attached PTO-892. Of particular interest is Alm et al., which describes a machine learning method for text-based emotion prediction.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on (571)272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2659