DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This communication is responsive to the original application filed on 12/31/2020. This action is Non-Final. Claims 1-20 are pending and have been examined.  
Drawings
The applicant’s drawings submitted are acceptable for examination purposes. 
Specification
The applicant’s specification submitted is acceptable for examination purposes. 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Karlsson et al., U.S. Patent Application Publication No. 2012/0297081 (hereinafter “Karlsson”), in view of Bloebaum et al., U.S. Patent Application Publication No. 2008/0086539 (hereinafter “Bloebaum”).
Regarding claim 1, Karlsson teaches a method for playing audio, applied to a terminal, comprising:
sending a response request corresponding to a voice input to a server in response to detecting the voice input (Karlsson [0042]: It is a binary that uses command line arguments to record a particular program based on either NTP time from the encoded stream or wallclock time.  In particular embodiments, this is configurable as part of the arguments and depends on the input stream.  When the fragment writer completes recording a program it exits.  For live streams, programs are artificially created to be short time intervals e.g. 5-15 minutes in length.);
receiving a response audio clip carrying position information sent by the server, the position information indicating a position of the response audio clip in a response audio corresponding to the response request (Karlsson [0058]: FIG. 7 illustrates one example of a technique for managing buffer configuration. At 701, a client device sends a request for a media stream. According to various embodiments, the client device provides information about the client device to a content server such as a fragment server. Information may include resolution, buffer size, processing capabilities, network throughput, average data transfer rates, location, etc. In particular embodiments, the content server already has information about the client device. The content server selects a stream with an appropriate quality level for delivery to the client device. At 703, the client device begins receiving content using an initial buffer configuration. When a playback threshold is reached at 705, playback begins.);
synthesizing, based on the position information carried by respective received response audio clips, adjacent response audio clips into a response audio packet (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.);
playing the synthesized response audio packet (Karlsson [0031]: According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded.), 
Karlsson does not clearly teach the limitation “until finishing playing the response audio.” However, Bloebaum [0016] teaches, “In one embodiment of the method, the audio content is played to the user and repeated to facilitate tagging in response to user input.”
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to incorporate the teaching of Bloebaum into Karlsson’s system by adding the feature of playing the response audio. The references (Karlsson and Bloebaum) are analogous art and are directed to the same field of endeavor, such as contextual data. An ordinarily skilled artisan would have been motivated to do so to provide Karlsson’s system with enhanced playback. (See Bloebaum [Abstract], [0007], [0023], [0038], [0055]). One of the biggest advantages of network machine learning database algorithms is their ability to improve over time. Machine learning technology typically improves efficiency and accuracy thanks to the ever-increasing amounts of data that are processed.
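For illustration only, the synthesizing limitation of claim 1 may be understood along the lines of the following minimal sketch. The sketch is not taken from Karlsson, Bloebaum, or the application; it assumes that the position information is an integer index, and the names Clip and group_adjacent are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clip:
    position: int  # assumption: integer index of the clip within the response audio
    data: bytes


def group_adjacent(clips: List[Clip]) -> List[List[Clip]]:
    """Group clips with consecutive positions into packets, in position order."""
    packets: List[List[Clip]] = []
    for clip in sorted(clips, key=lambda c: c.position):
        # Extend the current packet only when this clip directly follows its last clip.
        if packets and clip.position == packets[-1][-1].position + 1:
            packets[-1].append(clip)
        else:
            packets.append([clip])
    return packets
```

Under these assumptions, clips received at positions 2, 0, 1 and 4 would form one packet covering positions 0 to 2 and a separate packet for position 4.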
Regarding claim 2, the method according to claim 1, wherein said playing the synthesized response audio packet comprises:
playing, in response to determining that a number of the response audio clip included in the response audio packet is greater than a preset number, the response audio packet (Karlsson [0026]: Conventional MPEG-4 files require that a player parse the entire header before any of the data can be decoded. Parsing the entire header can take a notable amount of time, particularly on devices with limited network and processing resources. Consequently, the techniques and mechanisms of the present invention provide a fragmented MPEG-4 framework that allows playback upon receiving a first MPEG-4 file fragment. A second MPEG-4 file fragment can be requested using information included in the first MPEG-4 file fragment. According to various embodiments, the second MPEG-4 file fragment requested may be a fragment corresponding to a higher or lower bit-rate stream than the stream associated with the first file fragment.);
playing, in response to determining that the number of the response audio clip included in the response audio packet is less than the preset number and a duration between current moment and end time of playing of a last response audio packet reaches a first preset duration, the response audio packet (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded. A subsequent buffer configuration depicted in buffer 131 includes a higher playback start threshold 133. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 131 fills to the threshold 133, playback begins. According to various embodiments, a modified initial buffer configuration sets a playback start threshold 133 at a relatively high level now that adverse network conditions are known.).
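For illustration only, the two playback conditions recited in claim 2 may be sketched as a single decision function. The sketch is not drawn from either reference; the function and parameter names are hypothetical, durations are assumed to be in seconds, and "reaches" is read as "greater than or equal to". The claim does not address a clip count equal to the preset number, so the sketch arbitrarily returns False in that case.

```python
def should_play(clip_count: int,
                preset_number: int,
                seconds_since_last_packet_ended: float,
                first_preset_duration: float) -> bool:
    """Decide whether a synthesized response audio packet should be played now."""
    # First condition: the packet holds more clips than the preset number.
    if clip_count > preset_number:
        return True
    # Second condition: fewer clips than the preset number, but the wait since the
    # last packet finished playing has reached the first preset duration.
    if (clip_count < preset_number
            and seconds_since_last_packet_ended >= first_preset_duration):
        return True
    # Equal counts are not addressed by the claim; this sketch does not play.
    return False
```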
Regarding claim 3, the method according to claim 1, wherein the method further comprises:
playing, based on the position information, in response to determining that the positions of respective received response audio clips are not adjacent and a duration between current moment and end time of playing of a last response audio packet reaches a second preset duration, the respective received response audio clips; and discarding, during playing the respective received response audio clips, in response to determining that a response audio clip located before the position of the response audio clip that has been played or is being played is received, the received response audio clip (Bloebaum [0061]: In one embodiment, the tag for the start of the clip may be offset from the time of the corresponding user input to accommodate a lag between playback and user action.  For example, the start tag may be positioned relative to the audio content by about a half second to about one second before the point in the content when the user input to tag the beginning of the clip is received.  Similarly, the tag for the end of the clip may be offset from the time of the corresponding user input to assist in positioning the entire phrase between the start tag and the end tag, thereby accommodating premature user action.  For example, the end tag may be positioned relative to the audio content by about a half second to about one second after the point in the content when the user input to tag the end of the clip is received.).
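For illustration only, the discarding step of claim 3 may be sketched as a filter over arriving clip positions. The sketch is not drawn from Bloebaum or the application; it assumes integer positions, the name filter_late_clips is hypothetical, and a clip at the same position as the one being played is kept because the claim only recites clips located before that position.

```python
from typing import List, Tuple


def filter_late_clips(arrivals: List[int],
                      playing_position: int) -> Tuple[List[int], List[int]]:
    """Split arriving clip positions into (kept, discarded) lists."""
    kept: List[int] = []
    discarded: List[int] = []
    for position in arrivals:
        # A clip located before the clip being played arrives too late to be useful.
        if position < playing_position:
            discarded.append(position)
        else:
            kept.append(position)
    return kept, discarded
```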
Regarding claim 4, the method according to claim 2, wherein, in response to determining that a duration between a time in response to determining that a new response audio clip is received and an end time of playing of a last response audio packet is greater than a second preset duration, the method further comprises at least one of: extending the first preset duration; or increasing the preset number (Bloebaum [0050]: The mobile telephone 10 may also include a timer 40 for carrying out timing functions.  Such functions may include timing the durations of calls, generating the content of time and date stamps, etc. The mobile telephone 10 may include a camera 42 for taking digital pictures and/or movies.  Image and/or video files corresponding to the pictures and/or movies may be stored in the memory 16.  The mobile telephone 10 also may include a position data receiver 44, such as a global positioning system (GPS) receiver, Galileo satellite system receiver or the like.).
Regarding claim 5, the method according to claim 1, wherein the method further comprises:
requesting, in response to determining that the positions of the respective received response audio clips are not adjacent and an un-received response audio clip is a target audio clip, the target audio clip from the server, the target audio clip being a response audio clip carrying a keyword in the response audio (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time.); and
synthesizing, in response to determining that the target audio clip is received, the target audio clip and a response audio clip located adjacent to the target audio clip into the response audio packet, and playing the response audio packet (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.).
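For illustration only, the requesting step of claim 5 may be sketched as gap detection followed by a keyword filter. The sketch is not drawn from Karlsson or the application; it assumes integer positions and assumes the terminal already knows which positions carry a keyword (the hypothetical keyword_positions set), which the claim itself does not specify.

```python
from typing import Iterable, List, Set


def missing_positions(received: Iterable[int]) -> List[int]:
    """Return positions between the lowest and highest received that are absent."""
    got = set(received)
    if not got:
        return []
    return [p for p in range(min(got), max(got) + 1) if p not in got]


def clips_to_request(received: Iterable[int],
                     keyword_positions: Set[int]) -> List[int]:
    """Return only the missing positions assumed to carry a keyword."""
    return [p for p in missing_positions(received) if p in keyword_positions]
```

In this sketch, positions outside the received range are never requested, since a gap is only detectable between clips that have already arrived.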
Regarding claim 6, the method according to claim 1, wherein said synthesizing, based on the position information carried by the respective received response audio clips, adjacent response audio clips into the response audio packet comprises:
determining, based on the position information carried by the respective received response audio clips, a plurality of adjacent response audio clips; performing semantic analysis on the plurality of adjacent response audio clips, and synthesizing, in response to determining that the plurality of adjacent response audio clips constitute a short sentence, the plurality of adjacent response audio clips into the response audio packet (Bloebaum [0060]: The playback may be resumed so that the phrase may be replayed to the user.  During the replaying of the phrase, the phrase may be tagged in blocks 60 and 62 to identify the portion of the audio content for use as the audio clip.  For instance, user input in the form of a depression of a key from the keypad 18 may serve as a command input to tag the beginning of the clip and a second depression of the key may serve as a command input to tag the end of the clip.  In another embodiment, the depression of a button may serve as a command input to tag the beginning of the clip and the release of the button may serve as a command input to tag the end of the clip so that the clip corresponds to the audio content played while the button was depressed.  In another embodiment, user voice commands or any other appropriate user input action may be used to command tagging the start and the end of the desired audio clip.).
Regarding claim 7, the method according to claim 1, wherein, before said receiving the response audio clip carrying the position information sent by the server, the method further comprises:
sending a persistent connection request to the server to indicate that the terminal is ready to receive the response audio clip; receiving a persistent connection response sent by the server to establish a persistent connection with the server (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.); and
said receiving the response audio clip carrying the position information sent by the server comprises: receiving, through the persistent connection, the response audio clip carrying the position information sent by the server (Karlsson [0032]: FIG. 2 illustrates buffer configurations with multiple thresholds associated with a client device.  An initial buffer 201 includes a playback start threshold 203.  In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer.  When the buffer 201 fills to the threshold 203, playback begins.  According to various embodiments, an initial buffer configuration sets a playback start threshold 203 at a relatively low level based on typical network conditions.  If the buffer begins to deplete and reaches quality threshold 205, the media stream being transmitted to the device may be switched to a lower quality media stream.  In some examples, the lower quality media stream has a lower bit rate than the initial media stream.  A playback experience may continue uninterrupted, without any need to establish a new session. According to various embodiments, the media stream is quality shifted by a content server upon receiving a signal from the client that a quality threshold has been reached.  The content server may replace higher quality MPEG-4 fragments with lower quality MPEG-4 fragments all while maintaining timing and sequence number information.).
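For illustration only, the buffer behavior described in the quoted Karlsson [0032] passage may be sketched as follows. The class and method names are hypothetical and the buffer level is assumed to be counted in fragments; Karlsson does not disclose this code.

```python
from dataclasses import dataclass


@dataclass
class BufferState:
    start_threshold: int    # level at which playback begins
    quality_threshold: int  # level at which a lower quality stream is signaled
    level: int = 0
    playing: bool = False
    low_quality: bool = False

    def on_receive(self, fragments: int) -> None:
        """Add received fragments; start playback once the start threshold is reached."""
        self.level += fragments
        if not self.playing and self.level >= self.start_threshold:
            self.playing = True

    def on_consume(self, fragments: int) -> None:
        """Remove fragments for playback; flag a quality shift if the buffer depletes."""
        if not self.playing:
            return
        self.level = max(0, self.level - fragments)
        if self.level <= self.quality_threshold:
            self.low_quality = True
```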
Regarding claim 8, Karlsson teaches a method for playing audio, applied to a server, the method comprising:
receiving a response request sent by a terminal (Karlsson [0042]: It is a binary that uses command line arguments to record a particular program based on either NTP time from the encoded stream or wallclock time.  In particular embodiments, this is configurable as part of the arguments and depends on the input stream.  When the fragment writer completes recording a program it exits.  For live streams, programs are artificially created to be short time intervals e.g. 5-15 minutes in length.);
obtaining, based on the response request, a plurality of response audio data (Karlsson [0058]: FIG. 7 illustrates one example of a technique for managing buffer configuration. At 701, a client device sends a request for a media stream. According to various embodiments, the client device provides information about the client device to a content server such as a fragment server. Information may include resolution, buffer size, processing capabilities, network throughput, average data transfer rates, location, etc. In particular embodiments, the content server already has information about the client device. The content server selects a stream with an appropriate quality level for delivery to the client device. At 703, the client device begins receiving content using an initial buffer configuration. When a playback threshold is reached at 705, playback begins.);
sending, in response to determining that at least one piece of response audio data of the plurality of response audio data is synthesized into a response audio clip (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.),
the response audio clip to the terminal (Karlsson [0031]: According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded.), 
Karlsson does not clearly teach the limitation “until finishing sending the plurality of response audio data.” However, Bloebaum [0016] teaches, “In one embodiment of the method, the audio content is played to the user and repeated to facilitate tagging in response to user input.”
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to incorporate the teaching of Bloebaum into Karlsson’s system by adding the feature of playing the response audio. The references (Karlsson and Bloebaum) are analogous art and are directed to the same field of endeavor, such as contextual data. An ordinarily skilled artisan would have been motivated to do so to provide Karlsson’s system with enhanced playback. (See Bloebaum [Abstract], [0007], [0023], [0038], [0055]). One of the biggest advantages of network machine learning database algorithms is their ability to improve over time. Machine learning technology typically improves efficiency and accuracy thanks to the ever-increasing amounts of data that are processed.
Regarding claim 9, the method according to claim 8, wherein the method further comprises: 
receiving a request for a target audio clip sent by the terminal, the target audio clip being a response audio clip carrying a response keyword in the response audio (Karlsson [0026]: Conventional MPEG-4 files require that a player parse the entire header before any of the data can be decoded. Parsing the entire header can take a notable amount of time, particularly on devices with limited network and processing resources. Consequently, the techniques and mechanisms of the present invention provide a fragmented MPEG-4 framework that allows playback upon receiving a first MPEG-4 file fragment. A second MPEG-4 file fragment can be requested using information included in the first MPEG-4 file fragment. According to various embodiments, the second MPEG-4 file fragment requested may be a fragment corresponding to a higher or lower bit-rate stream than the stream associated with the first file fragment.); 
sending the target audio clip to the terminal, causing the terminal to synthesize the target audio clip and a response audio clip adjacent to the target audio clip into a response audio packet to be played, and playing the response audio packet (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded. A subsequent buffer configuration depicted in buffer 131 includes a higher playback start threshold 133. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 131 fills to the threshold 133, playback begins. According to various embodiments, a modified initial buffer configuration sets a playback start threshold 133 at a relatively high level now that adverse network conditions are known.).
Regarding claim 10, the method according to claim 8, wherein, before said sending the response audio clip to the terminal, the method further comprises: 
receiving a persistent connection request sent by the terminal, the persistent connection request indicating that the terminal is ready to receive the response audio clip; establishing, in response to the persistent connection request sent by the terminal, a persistent connection with the terminal (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.); and 
said sending the response audio clip to the terminal comprises: sending, through the persistent connection, the response audio clip to the terminal (Karlsson [0032]: FIG. 2 illustrates buffer configurations with multiple thresholds associated with a client device.  An initial buffer 201 includes a playback start threshold 203.  In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer.  When the buffer 201 fills to the threshold 203, playback begins.  According to various embodiments, an initial buffer configuration sets a playback start threshold 203 at a relatively low level based on typical network conditions.  If the buffer begins to deplete and reaches quality threshold 205, the media stream being transmitted to the device may be switched to a lower quality media stream.  In some examples, the lower quality media stream has a lower bit rate than the initial media stream.  A playback experience may continue uninterrupted, without any need to establish a new session. According to various embodiments, the media stream is quality shifted by a content server upon receiving a signal from the client that a quality threshold has been reached.  The content server may replace higher quality MPEG-4 fragments with lower quality MPEG-4 fragments all while maintaining timing and sequence number information.).
Regarding claim 11, Karlsson teaches a terminal, comprising:
a processor; a memory for storing processor-executable instructions, which, when executed by the processor, cause the processor to perform a method comprising (Karlsson [0060]: a processor 801, a memory 803):
sending a response request corresponding to a voice input to a server in response to detecting the voice input (Karlsson [0042]: It is a binary that uses command line arguments to record a particular program based on either NTP time from the encoded stream or wallclock time.  In particular embodiments, this is configurable as part of the arguments and depends on the input stream.  When the fragment writer completes recording a program it exits.  For live streams, programs are artificially created to be short time intervals e.g. 5-15 minutes in length.); 
receiving a response audio clip carrying position information sent by the server, the position information indicating a position of the response audio clip in a response audio corresponding to the response request (Karlsson [0058]: FIG. 7 illustrates one example of a technique for managing buffer configuration. At 701, a client device sends a request for a media stream. According to various embodiments, the client device provides information about the client device to a content server such as a fragment server. Information may include resolution, buffer size, processing capabilities, network throughput, average data transfer rates, location, etc. In particular embodiments, the content server already has information about the client device. The content server selects a stream with an appropriate quality level for delivery to the client device. At 703, the client device begins receiving content using an initial buffer configuration. When a playback threshold is reached at 705, playback begins.); 
synthesizing, based on the position information carried by respective received response audio clips, adjacent response audio clips into a response audio packet (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.);
playing the synthesized response audio packet (Karlsson [0031]: According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded.), 
Karlsson does not clearly teach the limitation “until finishing playing the response audio.” However, Bloebaum [0016] teaches, “In one embodiment of the method, the audio content is played to the user and repeated to facilitate tagging in response to user input.”
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to incorporate the teaching of Bloebaum into Karlsson’s system by adding the feature of playing the response audio. The references (Karlsson and Bloebaum) are analogous art and are directed to the same field of endeavor, such as contextual data. An ordinarily skilled artisan would have been motivated to do so to provide Karlsson’s system with enhanced playback. (See Bloebaum [Abstract], [0007], [0023], [0038], [0055]). One of the biggest advantages of network machine learning database algorithms is their ability to improve over time. Machine learning technology typically improves efficiency and accuracy thanks to the ever-increasing amounts of data that are processed.
Regarding claim 12, the terminal according to claim 11, wherein said playing the synthesized response audio packet comprises:
playing, in response to determining that a number of the response audio clip included in the response audio packet is greater than a preset number, the response audio packet (Karlsson [0026]: Conventional MPEG-4 files require that a player parse the entire header before any of the data can be decoded. Parsing the entire header can take a notable amount of time, particularly on devices with limited network and processing resources. Consequently, the techniques and mechanisms of the present invention provide a fragmented MPEG-4 framework that allows playback upon receiving a first MPEG-4 file fragment. A second MPEG-4 file fragment can be requested using information included in the first MPEG-4 file fragment. According to various embodiments, the second MPEG-4 file fragment requested may be a fragment corresponding to a higher or lower bit-rate stream than the stream associated with the first file fragment.);
playing, in response to determining that the number of the response audio clip included in the response audio packet is less than the preset number and a duration between current moment and end time of playing of a last response audio packet reaches a first preset duration, the response audio packet (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded. A subsequent buffer configuration depicted in buffer 131 includes a higher playback start threshold 133. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 131 fills to the threshold 133, playback begins. According to various embodiments, a modified initial buffer configuration sets a playback start threshold 133 at a relatively high level now that adverse network conditions are known.).
Regarding claim 13, the terminal according to claim 11, wherein the method further comprises:
playing, based on the position information, in response to determining that the positions of respective received response audio clips are not adjacent and a duration between current moment and end time of playing of a last response audio packet reaches a second preset duration, the respective received response audio clips; and discarding, during playing the respective received response audio clips, in response to determining that a response audio clip located before the position of the response audio clip that has been played or is being played is received, the received response audio clip (Bloebaum [0061]: In one embodiment, the tag for the start of the clip may be offset from the time of the corresponding user input to accommodate a lag between playback and user action.  For example, the start tag may be positioned relative to the audio content by about a half second to about one second before the point in the content when the user input to tag the beginning of the clip is received.  Similarly, the tag for the end of the clip may be offset from the time of the corresponding user input to assist in positioning the entire phrase between the start tag and the end tag, thereby accommodating premature user action.  For example, the end tag may be positioned relative to the audio content by about a half second to about one second after the point in the content when the user input to tag the end of the clip is received.).
Regarding claim 14, the terminal according to claim 12, wherein, in response to determining that a duration between a time in response to determining that a new response audio clip is received and an end time of playing of a last response audio packet is greater than a second preset duration, the method further comprises at least one of: extending the first preset duration; and increasing the preset number (Bloebaum [0050]: The mobile telephone 10 may also include a timer 40 for carrying out timing functions.  Such functions may include timing the durations of calls, generating the content of time and date stamps, etc. The mobile telephone 10 may include a camera 42 for taking digital pictures and/or movies.  Image and/or video files corresponding to the pictures and/or movies may be stored in the memory 16.  The mobile telephone 10 also may include a position data receiver 44, such as a global positioning system (GPS) receiver, Galileo satellite system receiver or the like.).
Regarding claim 15, the terminal according to claim 11, wherein the method further comprises:
requesting, in response to determining that the positions of the respective received response audio clips are not adjacent and an un-received response audio clip is a target audio clip, the target audio clip from the server, the target audio clip being a response audio clip carrying a keyword in the response audio (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time.); and
synthesizing, in response to determining that the target audio clip is received, the target audio clip and a response audio clip located adjacent to the target audio clip into the response audio packet, and playing the response audio packet (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.).
Regarding claim 16, the terminal according to claim 11, wherein said synthesizing, based on the position information carried by the respective received response audio clips, adjacent response audio clips into the response audio packet comprises:
determining, based on the position information carried by the respective received response audio clips, a plurality of adjacent response audio clips; performing semantic analysis on the plurality of adjacent response audio clips, and synthesizing, in response to determining that the plurality of adjacent response audio clips constitute a short sentence, the plurality of adjacent response audio clips into the response audio packet (Bloebaum [0060]: The playback may be resumed so that the phrase may be replayed to the user.  During the replaying of the phrase, the phrase may be tagged in blocks 60 and 62 to identify the portion of the audio content for use as the audio clip.  For instance, user input in the form of a depression of a key from the keypad 18 may serve as a command input to tag the beginning of the clip and a second depression of the key may serve as a command input to tag the end of the clip.  In another embodiment, the depression of a button may serve as a command input to tag the beginning of the clip and the release of the button may serve as a command input to tag the end of the clip so that the clip corresponds to the audio content played while the button was depressed.  In another embodiment, user voice commands or any other appropriate user input action may be used to command tagging the start and the end of the desired audio clip.).
Regarding claim 17, the terminal according to claim 11, wherein, before said receiving the response audio clip carrying the position information sent by the server, the method further comprises: 
sending a persistent connection request to the server to indicate that the terminal is ready to receive the response audio clip; receiving a persistent connection response sent by the server to establish a persistent connection with the server (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.); and 
said receiving the response audio clip carrying the position information sent by the server comprises: receiving, through the persistent connection, the response audio clip carrying the position information sent by the server (Karlsson [0032]: FIG. 2 illustrates buffer configurations with multiple thresholds associated with a client device.  An initial buffer 201 includes a playback start threshold 203.  In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer.  When the buffer 201 fills to the threshold 203, playback begins.  According to various embodiments, an initial buffer configuration sets a playback start threshold 203 at a relatively low level based on typical network conditions.  If the buffer begins to deplete and reaches quality threshold 205, the media stream being transmitted to the device may be switched to a lower quality media stream.  In some examples, the lower quality media stream has a lower bit rate than the initial media stream.  A playback experience may continue uninterrupted, without any need to establish a new session. According to various embodiments, the media stream is quality shifted by a content server upon receiving a signal from the client that a quality threshold has been reached.  The content server may replace higher quality MPEG-4 fragments with lower quality MPEG-4 fragments all while maintaining timing and sequence number information.).

Regarding claim 18, Karlsson teaches, a non-transitory computer-readable storage medium in which instructions, when executed by a processor of a terminal, cause the terminal to execute a method comprising: 
sending a response request corresponding to a voice input to a server in response to detecting the voice input (Karlsson [0042]: It is a binary that uses command line arguments to record a particular program based on either NTP time from the encoded stream or wallclock time.  In particular embodiments, this is configurable as part of the arguments and depends on the input stream.  When the fragment writer completes recording a program it exits.  For live streams, programs are artificially created to be short time intervals e.g. 5-15 minutes in length.); 
receiving a response audio clip carrying position information sent by the server, the position information indicating a position of the response audio clip in a response audio corresponding to the response request (Karlsson [0058]: FIG. 7 illustrates one example of a technique for managing buffer configuration. At 701, a client device sends a request for a media stream. According to various embodiments, the client device provides information about the client device to a content server such as a fragment server. Information may include resolution, buffer size, processing capabilities, network throughput, average data transfer rates, location, etc. In particular embodiments, the content server already has information about the client device. The content server selects a stream with an appropriate quality level for delivery to the client device. At 703, the client device begins receiving content using an initial buffer configuration. When a playback threshold is reached at 705, playback begins.);
synthesizing, based on the position information carried by respective received response audio clips, adjacent response audio clips into a response audio packet (Karlsson [0025]: In particular examples, a client establishes a session such as a Real-Time Streaming Protocol (RTSP) session. A server computer receives a connection for a media stream, establishes a session, and provides a media stream to a client device. The media stream includes packets encapsulating frames such as MPEG-4 frames. The MPEG-4 frames themselves may be key frames or differential frames. The specific encapsulation methodology used by the server depends on the type of content, the format of that content, the format of the payload, and the application and transmission protocols being used to send the data. After the client device receives the media stream, the client device decapsulates the packets to obtain the MPEG frames and decodes the MPEG frames to obtain the actual media data.);
playing the synthesized response audio packet (Karlsson [0031]: According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded.), 
Karlsson does not clearly teach “until finishing playing the response audio.” However, Bloebaum [0016] teaches, “In one embodiment of the method, the audio content is played to the user and repeated to facilitate tagging in response to user input.”
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to incorporate the teaching of Bloebaum into Karlsson’s system by adding the feature of playing response audio. The references (Karlsson and Bloebaum) are analogous art and are directed to the same field of endeavor, such as contextual data. An ordinarily skilled artisan would have been motivated to do so to provide Karlsson’s system with enhanced playback. (See Bloebaum [Abstract], [0007], [0023], [0038], [0055]). One of the biggest advantages of network machine learning database algorithms is their ability to improve over time. Machine learning technology typically improves efficiency and accuracy thanks to the ever-increasing amounts of data that are processed.
Regarding claim 19, the storage medium according to claim 18, wherein said playing the synthesized response audio packet comprises: 
playing, in response to determining that a number of the response audio clip included in the response audio packet is greater than a preset number, the response audio packet (Karlsson [0026]: Conventional MPEG-4 files require that a player parse the entire header before any of the data can be decoded. Parsing the entire header can take a notable amount of time, particularly on devices with limited network and processing resources. Consequently, the techniques and mechanisms of the present invention provide a fragmented MPEG-4 framework that allows playback upon receiving a first MPEG-4 file fragment. A second MPEG-4 file fragment can be requested using information included in the first MPEG-4 file fragment. According to various embodiments, the second MPEG-4 file fragment requested may be a fragment corresponding to a higher or lower bit-rate stream than the stream associated with the first file fragment.); 
playing, in response to determining that the number of the response audio clip included in the response audio packet is less than the preset number and a duration between current moment and end time of playing of a last response audio packet reaches a first preset duration, the response audio packet (Karlsson [0031]: FIG. 1 illustrates buffer configurations associated with a client device. According to various embodiments, buffer 101 includes a playback start threshold 103. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 101 fills to the threshold 103, playback begins. According to various embodiments, an initial buffer configuration sets a playback start threshold 103 at a relatively low level based on typical network conditions. The relatively low level allows a buffer 101 to reach the playback threshold 103 in a relatively brief period of time. Playback can begin quickly. Data is added to the buffer 101 as data is received and data is removed from the buffer 101 as it is obtained for processing and playback. According to various embodiments, when the buffer 101 is depleted, a new buffer configuration can be loaded. A subsequent buffer configuration depicted in buffer 131 includes a higher playback start threshold 133. In particular embodiments, data such as fragmented MPEG-4 packets are received and placed in the buffer. When the buffer 131 fills to the threshold 133, playback begins. According to various embodiments, a modified initial buffer configuration sets a playback start threshold 133 at a relatively high level now that adverse network conditions are known.).
Regarding claim 20, the storage medium according to claim 18, wherein the method further comprises: 
playing, based on the position information, in response to determining that the positions of respective received response audio clips are not adjacent and a duration between current moment and end time of playing of a last response audio packet reaches a second preset duration, the respective received response audio clips; and discarding, during playing the respective received response audio clips, in response to determining that a response audio clip located before the position of the response audio clip that has been played or is being played is received, the received response audio clip (Bloebaum [0061]: In one embodiment, the tag for the start of the clip may be offset from the time of the corresponding user input to accommodate a lag between playback and user action.  For example, the start tag may be positioned relative to the audio content by about a half second to about one second before the point in the content when the user input to tag the beginning of the clip is received.  Similarly, the tag for the end of the clip may be offset from the time of the corresponding user input to assist in positioning the entire phrase between the start tag and the end tag, thereby accommodating premature user action.  For example, the end tag may be positioned relative to the audio content by about a half second to about one second after the point in the content when the user input to tag the end of the clip is received.).  
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Boesen, US 2018/0060031, Voice Assistant for wireless earpieces
Liu, US 2019/0347063, Systems and Methods for voice-assisted media content selection
Milevski, US 2017/0257717, Location based tracking using a wireless earpiece device, system, and method
Whatmough, US 2004/0160445, System and Method of converting frame-based animations into interpolator-based animations

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD S ROSTAMI whose telephone number is (571)270-1980.  The examiner can normally be reached Mon-Fri from 9 a.m. to 5 p.m.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hosain Alam, can be reached on (571)272-3978. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SABA AHMED/
Examiner, Art Unit 2154

/MOHAMMAD S ROSTAMI/
Primary Examiner, Art Unit 2154