DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 6-9, and 11-16 are rejected under 35 U.S.C. 103 as being unpatentable over Basso (US 2011/0112821) in view of Wang (US 2017/0278525) and further in view of Bakker (US 6,423,013).
With respect to claim 1 (similarly claim 14 [0047]), Basso teaches a computer-implemented method of voice-to-text tagging (e.g. the method of Figs 2-3 [0020]) for transcription of a human voice signal (e.g. for transcription of "Hey you!" [0020]-[0025]) by one of an automatic speech recognition system (e.g. by the system of Fig 1 including server 104 [0020]-[0025]) or a natural conversation system, the method comprising: 
generating a speech-to-text verbal transcript of one or more verbal vocalizations of an audio signal (e.g. translating "Hey you" i.e. the verbal components [0024]-[0025] whereby translating "Hey you" is generating a speech-to-text verbal transcript of one or more verbal vocalizations of an audio signal) at a receipt timestamp of the audio signal (e.g. at a receipt timestamp at which server 104 received the audio signal, as suggested in [0011] i.e. real-time and Fig 2 S204 [0020]); 
generating a voice-to-text non-verbal transcript of one or more non-verbal vocalizations of the audio signal (e.g. deriving "!" i.e. the non-verbal components [0024]-[0025] whereby deriving "!" is generating a voice-to-text non-verbal transcript of one or more non-verbal vocalizations of the audio signal) at a receipt timestamp of the audio signal (e.g. at a receipt timestamp at which server 104 received the audio signal, as suggested in [0011] i.e. real-time and Fig 2 S204 [0020]); and 
generating an enhanced transcript (e.g. generating enhanced/capitalized "HEY YOU!" [0025], see also [0029] whereby the ability to convey the non-verbal components of the content may enhance the users' understanding of each other) by combining the non-verbal transcript and the verbal transcript (e.g. by combining the non-verbal transcript and the verbal transcript, as suggested in [0025]).
However, Basso fails to clearly show the generation at a verbal timestamp and at a non-verbal timestamp of the audio signal.
Wang teaches a generation/processing of speech and non-speech at a verbal timestamp and at a non-verbal timestamp of the audio signal (e.g. server 180 Fig 1 which processes a dialogue and a non-speech sound at a verbal timestamp and at a non-verbal timestamp of the audio signal, see [0046]).
Basso and Wang are analogous art because both pertain to processing verbal and non-verbal sounds according to the time/timestamp at which they are received. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Wang to include: generating a speech-to-text verbal transcript of one or more verbal vocalizations and a voice-to-text non-verbal transcript of one or more non-verbal vocalizations of the audio signal at a verbal timestamp and at a non-verbal timestamp of the audio signal, as suggested by Wang. The benefit of the modification would be to better synchronize the combination of verbal and non-verbal components of the audio signal.
Further, even though Basso teaches the verbal and non-verbal vocalizations, Basso fails to teach dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations.
Bakker teaches dividing an audio signal of a human vocal tract into one or more time-based segments (e.g. Figs 6-7 divide an audio signal of a human vocal tract i.e. user 102 of Fig 1 into one or more time-based (seconds) segments) based on non-verbal vocalizations (e.g. based on an integrated breathing signal representation 606 Figs 6-7) and verbal vocalizations (e.g. and a vocal tract sound signal representation 604 Figs 6-7, see col 9 ln 46-67 through col 11 ln 1-30).
Basso and Bakker are analogous art because both pertain to processing verbal and non-verbal sounds/vocal tract sound and breathing sound according to the time/timestamp at which they are received. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Bakker to include: dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations, as suggested by Bakker. The benefit of the modification would be to provide useful feedback on speech and breathing, Bakker col 2 ln 21-31.
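For illustration only, the combining step recited in claim 1 can be pictured as a merge of two timestamped lists, one verbal and one non-verbal, into a single time-ordered transcript. The following Python sketch is not taken from Basso, Wang, Bakker, or the application; the names `Segment` and `enhanced_transcript`, the bracketed tag format, and the sample timestamps are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One time-based segment of the audio signal (hypothetical structure)."""
    start: float  # seconds from the start of the audio signal
    end: float    # seconds from the start of the audio signal
    kind: str     # "verbal" or "non_verbal"
    text: str     # recognized words, or a tag such as "[shouted]"


def enhanced_transcript(segments):
    """Combine verbal and non-verbal entries, ordered by their timestamps."""
    verbal = [s for s in segments if s.kind == "verbal"]
    non_verbal = [s for s in segments if s.kind == "non_verbal"]
    merged = sorted(verbal + non_verbal, key=lambda s: (s.start, s.end))
    return " ".join(f"[{s.start:.1f}s] {s.text}" for s in merged)


# Assumed sample data: a non-verbal tag followed by the spoken words.
sample = [
    Segment(0.2, 1.0, "verbal", "Hey you"),
    Segment(0.0, 0.2, "non_verbal", "[sharp inhale]"),
]
print(enhanced_transcript(sample))
```

The sketch keeps a separate timestamp for each entry, which is the feature the rejection attributes to Wang rather than to Basso.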
With respect to claim 2 (similarly claim 15), Basso teaches the computer-implemented method of claim 1, further comprising determining whether a quality of the verbal vocalizations of the audio signal comprise at least one of being sung, shouted, whispered, or uttered with a creaky voice (e.g. the content may be an audio message recorded on a user's cellular phone in which the user shouts "Hey you!" in an angry voice [0020], see also [0024]).
With respect to claim 3 (similarly claim 16), Basso teaches the computer-implemented method of claim 1, further comprising: classifying the audio signal into one or more time segments (e.g. classifying "Hey you!" into one or more time segments, as suggested in [0011] and [0032] whereby the translation of the content/feedback is performed in real-time) that each include a respective portion of the non-verbal vocalizations and the verbal vocalizations (e.g. each time segment of the content reception and/or of the feedback includes a portion of "!" i.e. non-verbal vocalizations and "Hey you" i.e. the verbal vocalizations, as suggested in [0024]-[0025] and [0031]-[0032]), wherein: the generating of the non-verbal transcript includes identifying each of the non-verbal vocalizations and an associated time segment of the one or more time segments (e.g. deriving the non-verbal components includes identifying the non-verbal vocalizations and the associated time segment at which each is received at server 104, as suggested in [0011] i.e. real-time and Fig 2 S204 [0020], [0024]-[0025], see also Wang [0046]-[0047]), and the generating of the verbal transcript includes identifying each of the verbal vocalizations and an associated time segment of the one or more time segments (e.g. generating the verbal components includes identifying "Hey you" and the associated time segment at which it is received at server 104, as suggested in [0011] i.e. real-time and Fig 2 S204 [0020], [0024]-[0025], see also Wang [0046]-[0047]).
With respect to claim 4, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 3, further comprising sub-classifying the non-verbal vocalizations into one or more vocalization groups (Wang e.g. [0038] sub-classifies the non-verbal components into one or more vocalization groups).
With respect to claim 6, Basso teaches the computer-implemented method of claim 1, wherein the generating of the verbal transcript of the one or more verbal vocalizations and the generating of the non-verbal transcript of the one or more non-verbal vocalizations (e.g. the generation of the verbal transcript “Hey you” and the derivation of the non-verbal transcript “!” as suggested in [0020]-[0025]) are based on an overlapping time segment of the audio signal (e.g. are based on an overlapping time segment at which “Hey you!” is received at server 104 of Fig 1, as suggested in [0011], [0020], [0032], see also [0047] of Wang).
With respect to claim 7, Basso teaches the computer-implemented method of claim 1, wherein the generating of the verbal transcript of the one or more verbal vocalizations and the generating of the non-verbal transcript of the one or more non-verbal vocalizations (e.g. the generation of the verbal transcript “Hey you” and the derivation of the non-verbal transcript “!” as suggested in [0020]-[0025]) are based on consecutive time segments of the one or more time segments of the audio signal (e.g. are based on consecutive time segments at which “Hey you!” and feedback from the debate are received at server 104 of Fig 1, as suggested in [0011], [0020], [0032], see also [0046] of Wang).
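As a hedged illustration of the distinction between the overlapping time segment of claim 6 and the consecutive time segments of claim 7, the Python sketch below compares two segments given as (start, end) pairs in seconds. The function name `segment_relation` and the treatment of touching endpoints as "consecutive" are assumptions made for this example and are not drawn from the cited references or the application.

```python
def segment_relation(a, b):
    """Classify two (start, end) segments, in seconds, by how they meet.

    Assumption: segments whose endpoints touch exactly are 'consecutive'.
    """
    # Order the two segments so that the earlier one comes first.
    (first_start, first_end), (second_start, second_end) = sorted([a, b])
    if second_start < first_end:
        return "overlapping"
    if second_start == first_end:
        return "consecutive"
    return "disjoint"


# Assumed sample values for a verbal and a non-verbal segment.
print(segment_relation((0.0, 1.0), (0.5, 1.5)))
print(segment_relation((0.0, 1.0), (1.0, 2.0)))
```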
With respect to claim 8, Basso teaches the computer-implemented method of claim 1, further comprising: generating a video-to-text transcript of a video of a subject speaking the verbal vocalizations and the non-verbal vocalizations (e.g. [0020] discloses that the content is a video, which means that the video is processed just like the audio "Hey you!" to generate a video-to-text transcript of a video of a subject speaking the verbal vocalizations and the non-verbal vocalizations, as disclosed in [0020]-[0025] for the audio signal); and combining the video-to-text transcript with at least one of the speech-to-text transcript and the voice-to-text transcript (e.g. and combining the processed/translated video with at least one of the speech-to-text transcript and the voice-to-text transcript, as suggested in [0025], to generate an enhanced translated video just as is the case for capitalized "HEY YOU!").
With respect to claim 9, Basso teaches a computer-implemented method of voice-to-text tagging (e.g. the method of Figs 2-3 [0020]) for transcription of a natural conversation (e.g. for transcription of "Hey you!" [0020]-[0025] and/or "This guy is wrong" [0035]-[0042]), the method comprising: 
classifying the one or more non-verbal vocalizations according to the time-based segments of the audio signal, respectively (e.g. classifying "Hey you!" and/or "This guy is wrong" into one or more time segments, as suggested in [0011] and [0032] whereby the translation of the content/feedback is performed in real-time, see also the structured data [0023], [0038]); 
generating a voice-to-text non-verbal transcript indicating an occurrence of one or more of the non-verbal vocalizations (e.g. deriving "!" i.e. the non-verbal components [0024]-[0025] whereby deriving "!" is generating a voice-to-text non-verbal transcript indicating an occurrence of one or more of the non-verbal vocalizations); 
generating a speech-to-text verbal transcript indicating an occurrence of the verbal vocalizations (e.g. translating "Hey you" i.e. the verbal components [0024]-[0025] whereby translating "Hey you" is generating a speech-to-text verbal transcript indicating an occurrence of the verbal vocalizations) based on a timestamp (e.g. at a timestamp at which server 104 received the audio signal, as suggested in [0011] i.e. real-time and Fig 2 S204 [0020]); and 
generating an enhanced transcript (e.g. generating enhanced/capitalized "HEY YOU!" [0025], [0039]-[0040], see also [0029] whereby the ability to convey the non-verbal components of the content may enhance the users' understanding of each other) by combining an output of the voice-to-text transcript of the non-verbal vocalizations of the audio signal with an output of the speech-to-text transcript of the verbal vocalizations (e.g. by combining the non-verbal transcript and the verbal transcript, as suggested in [0025]), wherein the verbal vocalizations are classified by a natural language classifier (e.g. the verbal components are classified by a natural language classifier, as suggested in [0023]-[0025], [0038]-[0040]).
However, Basso fails to clearly show training a vocalization classifier with a training set of a plurality of voice-to-text non-verbal vocalizations, and the generation based on a verbal timestamp and a non-verbal timestamp of the audio signal.
Wang teaches training a vocalization classifier (e.g. training a classifier 225 Fig 2) with a training set of a plurality of voice-to-text non-verbal vocalizations (e.g. with training data which may have speech and non-speech captions for audio that has been created by trusted users [0056]); and a generation/processing of speech and non-speech at a verbal timestamp and at a non-verbal timestamp of the audio signal (e.g. server 180 Fig 1 which processes a dialogue and a non-speech sound at a verbal timestamp and at a non-verbal timestamp of the audio signal, see [0046]).
Basso and Wang are analogous art because both pertain to processing verbal and non-verbal sounds according to the time/timestamp at which they are received. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Wang to include: training a vocalization classifier with a training set of a plurality of voice-to-text non-verbal vocalizations; generating a speech-to-text verbal transcript of one or more verbal vocalizations and a voice-to-text non-verbal transcript of one or more non-verbal vocalizations of the audio signal at a verbal timestamp and at a non-verbal timestamp of the audio signal, as suggested by Wang. The benefit of the modification would be to better synchronize the combination of verbal and non-verbal components of the audio signal.
Further, even though Basso teaches the verbal and non-verbal vocalizations, Basso fails to teach dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations.
Bakker teaches dividing an audio signal of a human vocal tract into one or more time-based segments (e.g. Figs 6-7 divide an audio signal of a human vocal tract i.e. user 102 of Fig 1 into one or more time-based (seconds) segments) based on non-verbal vocalizations (e.g. based on an integrated breathing signal representation 606 Figs 6-7) and verbal vocalizations (e.g. and a vocal tract sound signal representation 604 Figs 6-7, see col 9 ln 46-67 through col 11 ln 1-30).
Basso and Bakker are analogous art because both pertain to processing verbal and non-verbal sounds/vocal tract sound and breathing sound according to the time/timestamp at which they are received. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Bakker to include: dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations, as suggested by Bakker. The benefit of the modification would be to provide useful feedback on speech and breathing, Bakker col 2 ln 21-31.
With respect to claim 11, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 9, wherein the combining of the voice-to-text transcript and the speech-to-text transcript includes: displaying of a voice-to-text description of the non-verbal vocalizations for each time segment of the audio signal (Wang e.g. [0046]-[0048] disclose that these two captions should be displayed together according to various formats).
With respect to claim 12, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 11, wherein the combining of the output of the voice-to-text transcript and the speech-to-text transcript includes a displaying of the voice-to-text description for an associated speech-to-text display for at least one time segment of the audio signal (Wang e.g. [0046]-[0048] suggest that these two captions should be displayed together according to various formats, for at least one time segment of the audio signal).
With respect to claim 13, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 9, further comprising generating a video-to-text transcript of a video of a subject speaking the verbal vocalizations and the non-verbal vocalizations (Basso e.g. [0025] suggests generating a video-to-text transcript of a video of a subject speaking the verbal vocalizations and the non-verbal vocalizations, especially when the content is a video file), and combining the video-to-text transcript with at least one of the speech-to-text transcript and the voice-to-text transcript (Basso e.g. combining the non-verbal transcript and the verbal transcript, as suggested in [0025], to generate an enhanced video file).

Claims 5, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Basso (US 2011/0112821) in view of Wang (US 2017/0278525), Bakker (US 6,423,013) and further in view of Stephanick (US 2006/0190256).
With respect to claim 5, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 4 including the vocalization groups.
However, Basso in view of Wang and Bakker fails to teach wherein the plurality of vocalization groups comprise at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click.
Stephanick teaches a plurality of vocalization groups comprising at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click (e.g. recognized utterances which may include mouth clicks and other non-verbal sounds [0064]).
Basso and Stephanick are analogous art because both pertain to recognizing verbal and non-verbal components/utterances. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Stephanick to include: wherein the plurality of vocalization groups comprise at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click, as disclosed by Stephanick [0064]. The benefit of the modification would be to enhance understanding between users by recognizing and translating verbal and non-verbal components in their interactions.
With respect to claim 10, Basso in view of Wang and Bakker teaches the computer-implemented method of claim 9, further comprising sub-classifying the non-verbal vocalizations in a plurality of vocalization groups (Wang e.g. [0038] sub-classifies the non-verbal components into one or more vocalization groups).
However, Basso fails to teach wherein the plurality of vocalization groups comprise at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click.
Stephanick teaches a plurality of vocalization groups comprising at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click (e.g. recognized utterances which may include mouth clicks and other non-verbal sounds [0064]).
Basso and Stephanick are analogous art because both pertain to recognizing verbal and non-verbal components/utterances. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Stephanick to include: one or more vocalization groups comprising at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click, as disclosed by Stephanick [0064]. The benefit of the modification would be to enhance understanding between users by recognizing and translating verbal and non-verbal components in their interactions.
With respect to claim 17, Basso in view of Wang and Bakker teaches the computer-readable storage medium according to claim 16 wherein the method further comprises sub-classifying the non-verbal vocalizations into one or more vocalization groups (Wang e.g. [0038] sub-classifies the non-verbal components into one or more vocalization groups).
However, Basso fails to teach wherein the one or more vocalization groups comprise at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click.
Stephanick teaches a plurality of vocalization groups comprising at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click (e.g. recognized utterances which may include mouth clicks and other non-verbal sounds [0064]).
Basso and Stephanick are analogous art because both pertain to recognizing verbal and non-verbal components/utterances. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Stephanick to include: one or more vocalization groups comprising at least one of a pre-verbal, a guttural, a breathy, a fricative, or a click, as disclosed by Stephanick [0064]. The benefit of the modification would be to enhance understanding between users by recognizing and translating verbal and non-verbal components in their interactions.

Claims 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Basso (US 2011/0112821) in view of Bakker (US 6,423,013).
With respect to claim 18, Basso teaches a voice-to-text tagging device (e.g. a general purpose computing device 400 Fig 4) comprising: 
a processor (e.g. a processor 402 Fig 4 [0046]); 
a memory coupled to the processor (e.g. a memory 404 coupled to processor 402 Fig 4 [0046]), the memory storing instructions to cause the processor to perform acts (e.g. the memory storing instructions to cause processor 402 to perform acts [0047]) comprising: 
classifying the one or more time-based segments of the audio signal into one or more verbal vocalizations and one or more non-verbal vocalizations (e.g. [0011], [0020], [0032] suggest classifying a time segment of “Hey you!” into a verbal component and a non-verbal component); 
classifying and converting the verbal vocalizations from a natural language classifier into a verbal transcript having a speech-to-text format (e.g. classifying and converting “Hey you” from a natural language classifier into a verbal transcript having a speech-to-text format, [0020]-[0025]); 
classifying and converting the non-verbal vocalizations into a non-verbal transcript having a voice-to-text format (e.g. classifying and converting “!” into a non-verbal transcript having a voice-to-text format [0023]-[0025]); 
generating an enhanced transcript (e.g. generating enhanced/capitalized "HEY YOU!" [0025], see also [0029] whereby the ability to convey the non-verbal components of the content may enhance the users' understanding of each other) by combining the speech-to-text-transcript and the voice-to-text transcript for each time segment of the audio signal (e.g. by combining the non-verbal transcript and the verbal transcript, as suggested in [0025] for the time segment of “Hey you!”); and 
outputting the combined speech-to-text-transcript and the voice-to-text transcript (e.g. the broker 104 outputs the verbal and non-verbal components of the translated content in the second modality [0026], [0041]).
However, even though Basso teaches the verbal and non-verbal vocalizations, Basso fails to teach dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations.
Bakker teaches dividing an audio signal of a human vocal tract into one or more time-based segments (e.g. Figs 6-7 divide an audio signal of a human vocal tract i.e. user 102 of Fig 1 into one or more time-based (seconds) segments) based on non-verbal vocalizations (e.g. based on an integrated breathing signal representation 606 Figs 6-7) and verbal vocalizations (e.g. and a vocal tract sound signal representation 604 Figs 6-7, see col 9 ln 46-67 through col 11 ln 1-30).
Basso and Bakker are analogous art because both pertain to processing verbal and non-verbal sounds/vocal tract sound and breathing sound according to the time/timestamp at which they are received. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify Basso with the teachings of Bakker to include: dividing an audio signal of a human vocal tract into one or more time-based segments of non-verbal vocalizations and verbal vocalizations, as suggested by Bakker. The benefit of the modification would be to provide useful feedback on speech and breathing, Bakker col 2 ln 21-31.
With respect to claim 19, Basso teaches the device according to claim 18, further comprising a microphone (e.g. a microphone [0046]) coupled to the memory and the processor, the microphone is configured for receiving a sound input and generating the audio signal (e.g. the microphone receiving a sound input and generating the audio signal, as suggested in [0020]).
With respect to claim 20, Basso teaches the device according to claim 19, further comprising a camera (e.g. a camera [0046]) coupled to the memory and the processor, the camera is configured for receiving a video input (e.g. the camera receiving a video input, as suggested in [0020]), wherein the processor is configured to generate a video-to-text transcript for combination with the speech-to-text transcript and the voice-to-text transcript (e.g. [0024]-[0025] suggest processor 402 Fig 4 generates a video-to-text transcript for combination with the speech-to-text transcript and the voice-to-text transcript).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IBRAHIM SIDDO whose telephone number is (571)272-4508. The examiner can normally be reached 9:00-5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached on 571-272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IBRAHIM SIDDO/Primary Examiner, Art Unit 2675