Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments, see page 6, filed 7/11/2022, with respect to claims 9 and 13 have been fully considered and are persuasive. The 35 USC 112(d) rejection of claims 9 and 13 has been withdrawn.
Applicant’s arguments, see page 7, filed 7/11/2022, with respect to claims 1, 4, 7, 8, 11, 14, and 15 have been fully considered and are persuasive. The 35 USC 101 rejection of claims 1, 4, 7, 8, 11, 14, and 15 has been withdrawn.
Regarding 35 USC 112(f), Applicant's arguments filed 7/11/2022 have been fully considered but they are not persuasive.
The applicant contends:
35 U.S.C. 112(f) INTERPRETATION 
The Applicant respectfully submits that the above amendments to claims 1, 8 and 15 remove the 35 U.S.C. 112(f) interpretation from "scene annotation module." The current claims recite a further structure to the scene annotation module. Specifically the current claims recite that "the scene annotation module includes a first neural network configured to generate a feature vector from the image frame and a second neural network configured to generate a caption describing elements within the image frame from the feature vector." As such the current claims recite structure to perform the function of the claims and therefore 35 U.S.C. 112(f) interpretation is not appropriate for the amended claims.
Thus the applicant respectfully requests that the examiner apply the ordinary interpretation standard to the amended claims. 

The examiner disagrees. The claim language recites “the scene annotation module includes a first neural network configured to generate a feature vector from the image frame and a second neural network configured to generate a caption describing elements within the image frame from the feature vector”. The highlighted portion of the recited limitation indicates that the module includes a structure, but the claim language fails to indicate that “a scene annotation module” is itself a structure. Furthermore, the applicant’s remarks are directed to the neural networks as structure, but that structure does not perform all of the functions recited for the scene annotation module. Instead, the claim language recites separate functions performed by the scene annotation module, such as “classify scene elements from an image frame received from a host system ….” and “detect a change in scene complexity … and generate the caption describing elements within the image frame when a change in scene complexity is detected”. These limitations recite functions of the scene annotation module rather than functions of the neural networks. For these reasons, the applicant’s remarks fail to show that the claim limitation is not written as a function to be performed and that the claim recites sufficient structure, material, or acts to perform that function, as is required to rebut the 35 USC 112(f) interpretation under MPEP 2181. Furthermore, the amended claim language does not support treating the scene annotation module as a structure rather than as means-plus-function language invoking 35 USC 112(f).
Regarding the prior art rejection, applicant’s arguments, see pages 7-9, filed 7/11/2022, with respect to the rejection(s) of claim(s) 1,3-8,10-15 under 35 USC 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Valliani et al (US Publication No.: 20170132821) in view of Lu et al (US Publication No.: 20180143966), further in view of Yurick et al (US Publication No.: 20070011012).

Claim Rejections - 35 USC § 112(f)
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “a scene annotation module configured to” in claim 1.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1,3,7,8,10,14,15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Valliani et al (US Publication No.: 20170132821) in view of Lu et al (US Publication No.: 20180143966), further in view of Yurick et al (US Publication No.: 20070011012).
Claim 1, Valliani et al discloses 
a scene annotation module configured to (Such recitation invokes 35 USC 112(f), and hence incorporates the apparatus as indicated in the specification. Page 4 of the applicant’s specification discloses the scene description module’s apparatus as hardware, software or a combination of hardware and software. Fig. 2 shows the system including 260 caption engine or scene annotation module. Fig. 6 shows the apparatus and paragraph 139 discloses software and paragraph 140 discloses a processor. Fig. 4 shows the method of generating a caption for a visual media.) classify scene elements from an image frame (Fig. 4, label 410, Paragraph 115 discloses “At step 410 an object in a visual media is identified. ... classifying the object into a known category, such as a person, a dog, a cat …”.) received from a host system (Paragraph 70 discloses “The image may be an active image displayed in an image application or other application on the user device. In one aspect, the image is specifically sent to a captioning application by the user …”. Fig. 1, label user device and server. Paragraph 51 discloses “the functions performed by components of system 200 are associated with one or more caption generation applications … In particular, such applications, services, or routines may operate on one or more user devices …, servers …”) and generate a caption describing the scene elements (Fig. 4, label 430),
wherein the scene annotation module includes a first neural network configured to generate a feature vector from the image frame (Fig. 2, label 262. Paragraph 75 discloses image classifier 262 as a neural network model classifier. Paragraph 74 discloses 262 generates a feature vector for classifying objects within images.).
Valliani et al fails to disclose a second neural network configured to generate a caption describing elements within the image frame from the feature vector.
Lu et al discloses an automated image captioning system comprising an encoder ResNet CNN that outputs image features and a second neural network (Fig. 7, label CNN as the first neural network CNN. The rest of the components are part of the second neural network.) configured to generate a caption describing elements within the image frame from the feature vector (Fig. 7, label image features are used to generate caption, label next caption word, describing elements within the image. Fig. 17 shows the images with generated captions that describe elements within the image such as a little girl sitting on a bench holding an umbrella.). It would have been obvious to one skilled in the art before the effective filing date of the application to simply substitute one well-known manner of generating captions as disclosed by Valliani et al with another well-known manner of generating captions using a second neural network as disclosed by Lu et al so as to obtain predictable results of captions for an image and to use a system that improves the performance of attention-based image captioning models, hence improving the performance of image captioning.
Valliani et al discloses the scene annotation module configured to generate captions describing scene elements (Fig. 2, label 260,262. Fig. 4 shows the method of generating caption for a visual media.), but fails to disclose wherein the scene annotation module is configured to detect a change in scene complexity and generate the caption describing elements within the image frame when a change in scene complexity is detected.
Yurick et al discloses scene caption text suggestions using language processing, or a scene annotation module (paragraph 43, Fig. 2 shows automatic captioning engine comprising speech recognition and OCR engines and Fig. 3 shows the multi-media analysis with captioning generation.) configured to detect a change in scene complexity and generate the caption describing elements within the image frame when a change in scene complexity is detected (Paragraph 44 discloses “In an operation 144, scene changes within the video portion of multi-media can be detected during multi-media analysis to provide caption segmentation suggestions. Segmentation is utilized in pop-on style captions (as opposed to scrolling captions) such that the captions are broken down in appropriate sentences or phrase for incremental presentation to the consumer. scene changes within the video portion. … In operation 157, caption segments are created based on the scene changes, periods of silence, and audio speaker identification.”). It would have been obvious to one skilled in the art before the effective filing date of the application to modify the scene annotation or captioning of media as disclosed by Valliani et al by incorporating scene change detection as disclosed by Yurick et al so as to provide users with relevant information that is accurate to the media, hence improving the user’s experience as the user views the media or video.
Claim 3, Lu et al discloses wherein the caption describing elements within the image frame is a sentence predicted by the second neural network (Fig. 7, label next caption word is a prediction of the next word in the caption sentence such as shown in Fig. 17. Fig. 6, label w1-end indicates the predicted words in the caption.).  
Claim 7, Valliani et al discloses wherein the image frame data is video game frame data (Paragraph 22 discloses “generate captions for visual media, such as a photograph or video”, wherein a video game is a video. Fig. 4, label 410 shows a visual media. Paragraph 115 discloses a user selects a portion of the image.).
Claim 8, Valliani et al discloses 
classifying scene elements from an image frame (Fig. 4, label 410, Paragraph 115 discloses “At step 410 an object in a visual media is identified. ... classifying the object into a known category, such as a person, a dog, a cat …”.) received from a host system (Paragraph 70 discloses “The image may be an active image displayed in an image application or other application on the user device. In one aspect, the image is specifically sent to a captioning application by the user …”. Fig. 1, label user device and server. Paragraph 51 discloses “the functions performed by components of system 200 are associated with one or more caption generation applications … In particular, such applications, services, or routines may operate on one or more user devices …, servers …”) and generating a caption describing the scene elements (Fig. 4, label 430),
wherein the scene annotation module includes a first neural network configured to generate a feature vector from the image frame (Fig. 2, label 262. Paragraph 75 discloses image classifier 262 as a neural network model classifier. Paragraph 74 discloses 262 generates a feature vector for classifying objects within images.).
Valliani et al fails to disclose a second neural network configured to generate a caption describing elements within the image frame from the feature vector.
Lu et al discloses an automated image captioning system comprising an encoder ResNet CNN that outputs image features and a second neural network (Fig. 7, label CNN as the first neural network CNN. The rest of the components are part of the second neural network.) configured to generate a caption describing elements within the image frame from the feature vector (Fig. 7, label image features are used to generate caption, label next caption word, describing elements within the image. Fig. 17 shows the images with generated captions that describe elements within the image such as a little girl sitting on a bench holding an umbrella.). It would have been obvious to one skilled in the art before the effective filing date of the application to simply substitute one well-known manner of generating captions as disclosed by Valliani et al with another well-known manner of generating captions using a second neural network as disclosed by Lu et al so as to obtain predictable results of captions for an image and to use a system that improves the performance of attention-based image captioning models, hence improving the performance of image captioning.
Valliani et al discloses the scene annotation module configured to generate captions describing scene elements (Fig. 2, label 260,262. Fig. 4 shows the method of generating caption for a visual media.), but fails to disclose wherein the scene annotation module is configured to detect a change in scene complexity and generate the caption describing elements within the image frame when a change in scene complexity is detected.
Yurick et al discloses scene caption text suggestions using language processing, or a scene annotation module (paragraph 43, Fig. 2 shows automatic captioning engine comprising speech recognition and OCR engines and Fig. 3 shows the multi-media analysis with captioning generation.) configured to detect a change in scene complexity and generate the caption describing elements within the image frame when a change in scene complexity is detected (Paragraph 44 discloses “In an operation 144, scene changes within the video portion of multi-media can be detected during multi-media analysis to provide caption segmentation suggestions. Segmentation is utilized in pop-on style captions (as opposed to scrolling captions) such that the captions are broken down in appropriate sentences or phrase for incremental presentation to the consumer. scene changes within the video portion. … In operation 157, caption segments are created based on the scene changes, periods of silence, and audio speaker identification.”). It would have been obvious to one skilled in the art before the effective filing date of the application to modify the scene annotation or captioning of media as disclosed by Valliani et al by incorporating scene change detection as disclosed by Yurick et al so as to provide users with relevant information that is accurate to the media, hence improving the user’s experience as the user views the media or video.
Claim 10, Lu et al discloses wherein the caption describing elements within the image frame is a sentence predicted by the second neural network (Fig. 7, label next caption word is a prediction of the next word in the caption sentence such as shown in Fig. 17. Fig. 6, label w1-end indicates the predicted words in the caption.).  
Claim 14, Valliani et al discloses wherein the image frame data is video game frame data (Paragraph 22 discloses “generate captions for visual media, such as a photograph or video”, wherein a video game is a video. Fig. 4, label 410 shows a visual media. Paragraph 115 discloses a user selects a portion of the image.).
Claim 15, Valliani et al discloses 
classifying scene elements from an image frame (Fig. 4, label 410, Paragraph 115 discloses “At step 410 an object in a visual media is identified. ... classifying the object into a known category, such as a person, a dog, a cat …”.) received from a host system (Paragraph 70 discloses “The image may be an active image displayed in an image application or other application on the user device. In one aspect, the image is specifically sent to a captioning application by the user …”. Fig. 1, label user device and server. Paragraph 51 discloses “the functions performed by components of system 200 are associated with one or more caption generation applications … In particular, such applications, services, or routines may operate on one or more user devices …, servers …”) with a scene annotation module (Fig. 2 shows the system including 260 caption engine or scene annotation module.) and generating a caption describing the scene elements (Fig. 4, label 430).  

Claims 4,5,11,12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Valliani et al (US Publication No.: 20170132821) in view of Lu et al (US Publication No.: 20180143966), further in view of Yurick et al (US Publication No.: 20070011012) and further in view of Kwon et al (US Patent No.: 9319566).
Claim 4, Valliani et al, Lu et al and Yurick et al fail to disclose a text to speech synthesis module coupled to the scene annotation module, wherein the text to speech synthesis module is configured to convert the caption to synthesized speech data describing the scene elements within the image frame.
Kwon et al discloses a text to speech synthesis module (Fig. 4, label 47, paragraph 52 discloses “the sound output section 47 may use a text-to-speech (TTS) method to convert the caption data having a text form into the sound.”) coupled to the scene annotation module (Fig. 4, label 42, Fig. 5 shows the scene annotation module generates captions for an input signal containing the frames of the image (paragraph 50). The processor is shown coupled to 47.), wherein the text to speech synthesis module is configured to convert the caption to synthesized speech data describing the scene elements within the image frame (paragraph 52 discloses “the sound output section 47 may use a text-to-speech (TTS) method to convert the caption data having a text form into the sound.”). It would have been obvious to one skilled in the art before the effective filing date of the application to modify Valliani et al in view of Lu et al, further in view of Yurick et al by incorporating text-to-speech conversion of the caption as disclosed by Kwon et al so as to provide the user, such as a blind person, with speech describing the image, hence improving the user’s experience and providing the user with needed information.
Claim 5, Lu et al discloses a controller (Fig. 25, label controller) and caption generation (Fig. 7 which is found in Fig. 25.), wherein the controller is configured to synchronize the output of the scene annotation module with one or more other neural network modules (Paragraph 168 discloses “The system comprises a controller (Fig. 25) for iterating the input preparer, the decoder, the attender, and the feed-forward neural network to generate the natural language caption for the image until the next caption word emitted is an end-of-caption token <end>. The iterations are performed by a controller shown in Fig. 25.” Such indicates the controller synchronizes the output of the scene annotation module (Fig. 7) with one or more other neural network modules such as the feed-forward neural network (Fig. 7, label MLP).), but fails to disclose a controller coupled to the host system and the scene annotation module, wherein the controller is configured to activate the scene annotation module in response to an input from a user.  
Kwon et al discloses a controller coupled to the host system (Fig. 4, label 42 as the host system, label 45 as the controller) and the scene annotation module (label 42 as the scene annotation module (Fig. 5 shows the processor generates captions for the input signal that includes image (paragraph 50).)), 
wherein the controller (Fig. 4, label controller) is configured to activate the scene annotation module in response to an input from a user (Paragraph 43 discloses “The connector may request the set-top box 4 to transmit the signal and receive the requested signal from the set-top box 4 under control of the controller.” Paragraph 44 discloses “The processor 42 processes a signal … input from the signal receiver 41.” Such paragraphs indicate when a user input or input signal is received, the processor is activated to generate captions for an image as shown in Fig. 5.). It would have been obvious to one skilled in the art before the effective filing date of the application to modify the controller of Valliani et al in view of Lu et al, further in view of Yurick et al as disclosed by Kwon et al so as to control the actions of the components generating captions, hence effectively providing captions for images and videos, which improves the user’s experience by providing the user with needed information.
Claim 11, Valliani et al, Lu et al and Yurick et al fail to disclose converting the caption to synthesized speech data describing the scene elements within the image frame with a speech synthesis module coupled to the scene annotation module.
Kwon et al discloses a speech synthesis module (Fig. 4, label 47, paragraph 52 discloses “the sound output section 47 may use a text-to-speech (TTS) method to convert the caption data having a text form into the sound.”) coupled to the scene annotation module (Fig. 4, label 42, Fig. 5 shows the scene annotation module generates captions for an input signal containing the frames of the image (paragraph 50). The processor is shown coupled to 47.), wherein the text to speech synthesis module is configured to convert the caption to synthesized speech data describing the scene elements within the image frame (paragraph 52 discloses “the sound output section 47 may use a text-to-speech (TTS) method to convert the caption data having a text form into the sound.”). It would have been obvious to one skilled in the art before the effective filing date of the application to modify Valliani et al in view of Lu et al, further in view of Yurick et al by incorporating text-to-speech conversion of the caption as disclosed by Kwon et al so as to provide the user, such as a blind person, with speech describing the image, hence improving the user’s experience and providing the user with needed information.
Claim 12, Lu et al discloses a controller (Fig. 25, label controller) and caption generation (Fig. 7 which is found in Fig. 25.), wherein the controller is configured to synchronize the output of the scene annotation module with one or more other neural network modules (Paragraph 168 discloses “The system comprises a controller (Fig. 25) for iterating the input preparer, the decoder, the attender, and the feed-forward neural network to generate the natural language caption for the image until the next caption word emitted is an end-of-caption token <end>. The iterations are performed by a controller shown in Fig. 25.” Such indicates the controller synchronizes the output of the scene annotation module (Fig. 7) with one or more other neural network modules such as the feed-forward neural network (Fig. 7, label MLP).), but fails to disclose activating the scene annotation module in response to an input from a user and a controller coupled to the host system and the scene annotation module.
Kwon et al discloses a controller coupled to the host system (Fig. 4, label 42 as the host system, label 45 as the controller) and the scene annotation module (label 42 as the scene annotation module (Fig. 5 shows the processor generates captions for the input signal that includes image (paragraph 50).)), 
wherein activating the scene annotation module in response to an input from a user with the controller (Fig. 4, label controller) (Paragraph 43 discloses “The connector may request the set-top box 4 to transmit the signal and receive the requested signal from the set-top box 4 under control of the controller.” Paragraph 44 discloses “The processor 42 processes a signal … input from the signal receiver 41.” Such paragraphs indicate when a user input or input signal is received, the processor is activated to generate captions for an image as shown in Fig. 5.). It would have been obvious to one skilled in the art before the effective filing date of the application to modify the controller of Valliani et al in view of Lu et al, further in view of Yurick et al as disclosed by Kwon et al so as to control the actions of the components generating captions, hence effectively providing captions for images and videos, which improves the user’s experience by providing the user with needed information.

Claims 6,13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Valliani et al (US Publication No.: 20170132821) in view of Lu et al (US Publication No.: 20180143966), further in view of Yurick et al (US Publication No.: 20070011012) and further in view of Kwon et al (US Patent No.: 9319566), and further in view of Kritt et al (US Publication No.: 20140178043).
Claim 6, Valliani et al discloses 
wherein the audio segment is synchronized to occur during presentation of the image frame (paragraph 92 discloses “presentation component 218 generates user interface features associated with a caption. Such features can include interface elements … audio prompts, alerts, alarms, vibrations …”.), and Lu et al discloses the one or more neural network modules (Fig. 7, label MLP, encoder, decoder and attender and paragraph 168.) but fails to disclose wherein the one or more other neural network modules includes an acoustic effect annotation module configured to classify primary acoustic effects occurring within an audio segment.
Kritt et al discloses extraction of features (paragraph 57 discloses one or more audio features are extracted.) and one or more neural network modules includes an acoustic effect annotation module configured to (Such invokes 35 USC 112(f), and hence incorporates the apparatus into the claim. Page 19 discloses the apparatus as a neural network. Paragraphs 46, 58, and 62 disclose such apparatus.) classify features as speech, silence, music, or other (primary acoustic effects) occurring within an audio segment (Paragraphs 58 and 62 disclose classification of audio feature values of the audio signal or audio features into different categories. Paragraph 46 discloses recognizing patterns is performed using well known training methods such as neural network. Although paragraph 46 indicates recognizing patterns of an image for example, paragraphs 58 and 62 disclose classification of audio features, wherein each audio feature classified as a sound effect may be compared with the library of characteristic features. This indicates recognizing patterns and is associated with paragraph 46’s disclosure of well-known training methods that can be used for recognizing patterns.).
It would have been obvious to one skilled in the art before the effective filing date of the application to modify Valliani et al in view of Lu et al’s caption generation with neural network modules by incorporating classification of the features of an input, wherein the input includes video and audio as disclosed by Kritt et al, so as to associate audio features with the image to provide captions (paragraph 66 discloses the audio transcript, generated from classified features as shown in Fig. 6, label 612, may be provided to the video in the form of a closed caption file.), hence improving the user’s experience in viewing video or image and audio.
Claim 13, Valliani et al discloses wherein the audio segment is synchronized to occur during presentation of the image frame (paragraph 92 discloses “presentation component 218 generates user interface features associated with a caption. Such features can include interface elements … audio prompts, alerts, alarms, vibrations …”.), and Lu et al discloses the one or more neural network modules (Fig. 7, label MLP, encoder, decoder and attender and paragraph 168.) but fails to disclose classifying primary acoustic effects occurring within an audio segment, and wherein the one or more neural network modules includes an acoustic effect annotation module configured to classify the primary acoustic effects occurring within the audio segment.
Kritt et al discloses extraction of features (paragraphs 58 and 62 disclose extraction of audio features for classification.) and one or more neural network modules including an acoustic effect annotation module configured to (This limitation invokes 35 USC 112, 6th paragraph, and hence incorporates the apparatus into the claim. Page 19 discloses the apparatus as a neural network. Paragraphs 46, 58 and 62 disclose such an apparatus.) classify features as speech, silence, music, or other (primary acoustic effects) occurring within an audio segment (paragraphs 58 and 62 disclose classification of audio feature values of the audio signal, or audio features, into different categories. Paragraph 46 discloses that recognizing patterns is performed using well-known training methods such as a neural network. Although paragraph 46 refers to recognizing patterns of an image as an example, paragraphs 58 and 62 disclose classification of audio features, wherein each audio feature classified as a sound effect may be compared with the library of characteristic features. This indicates pattern recognition and is associated with paragraph 46’s disclosure of well-known training methods that can be used for recognizing patterns.).
It would have been obvious to one skilled in the art before the effective filing date of the application to modify Valliani et al in view of Lu et al’s caption generation with neural network modules by incorporating classification of the features of an input, wherein the input includes video and audio as disclosed by Kritt et al, so as to associate audio features with the image to provide captions (paragraph 66 discloses that the audio transcript, generated from classified features as shown in Fig. 6, label 612, may be provided to the video in the form of a closed caption file.), hence improving the user’s experience in viewing video or image and audio.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LINDA WONG/Primary Examiner, Art Unit 2655