DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This Office Action is responsive to the communications filed on 28 October 2020. Claims 1-20 are pending.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 4, 6, 8, 9, 11, 13, 15, 16, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 2019/0155883 A1) in view of Denoue et al. (US 2009/0113278 A1).
Per claim 1, Wang discloses a method to extract and enrich slide presentations from multimodal content through cognitive computing (Abstract; paragraph [0006], “According to another aspect of the present disclosure, it is provided a method.  The method may comprise extracting a slide area from image or video information associated with slide, wherein the slide comprises text and non-text information; segmenting the slide area into a plurality of regions; classifying each of the plurality of regions into a text region or a non-text region; performing text recognition on the text region to obtain text information when a region is classified as the text region; and constructing an editable slide with the non-text region or the text information according to their locations in the slide area.  “), the method comprising:
automatically performing extraction of slides from multimodal content including audio-visual content in real-time (e.g., Block 201 as shown in Fig. 2; paragraph [0009]; paragraph [0035], “As shown in FIG. 2, the process 200 starts at block 201 where a slide area is extracted from image or video information associated with slide, wherein the slide comprises text and non-text information. The image or video information can be captured in real time or retrieved from a local or remote storage device. For example, when people are attending a business, a lecture, an academic meeting or any other suitable activities, they may record slide presentation with videos or images using smart phones and optionally share them with other people or upload them to a network location ...”);
automatically performing object extraction from each of the slides (e.g., Blocks 202-204 as shown in Fig. 2; paragraph [0039]; paragraphs [0043-0044]); 
allowing object substitution through semantics and concepts of the objects extracted (e.g., Block 205 as shown in Fig. 2; Abstract and paragraph [0005], “ …construct an editable slide with the non-text region or the text information according to their locations in the slide area (205). “; paragraph [0024], “… Therefore, it is desirable to provide a technical solution for recovering an editable slide (such as in .ppt or .pptx format) from such video or image, which may potentially be used in much more scenarios.  “; paragraph [0047]; paragraph [0051]; Examiner’s Note: Wang teaches creating a slide capable of being edited.  Therefore, Wang allows object substitution through semantics and concepts of the objects extracted such that the slide can be used in much more scenarios.); but does not expressly disclose:
processing audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base in a live stream, to provide annotations of the slides; 
processing the audio synchronized with the object being presented in each slide; 
curating for each step with human-machine interaction to provide a learning process by the system,
wherein each slide extraction is performed by processing the content and using interactive input and cognitive computing to automatically define slide transitions.
Denoue discloses:
processing audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base in a live stream, to provide annotations of the slides (Abstract; paragraph [0024]; paragraph [0031]; paragraph [0035]; paragraph [0047], “…The original video is thus segmented into units of time, each having a representing slide and associated audio segment…”; paragraph [0050], “In an embodiment of the inventive system, interactions in front of the display can be extracted by differencing the snapshots of the display. Cursor movement, marks, and annotations can be obtained more precisely from PowerPoint or using APIs of the operating system of the presenter's computer system 103.”; paragraphs [0057-0058]) according to semantics (paragraphs [0034]-[0037]; paragraph [0052]);
processing the audio synchronized with the object being presented in each slide (paragraph [0030], “The capture module 101 then transmits the captured presentation slides, captured audio and/or other content 109 as well as associated metadata 110 to a presentation analysis module 106.  The presentation analysis module 106, in turn, uses audio and visual features to find synchronized regions of interest, which are the regions in the complete original presentation that appear to be relevant to the user at a particular point in time, from the point of view of presentation flow.” ; paragraph [0031]).
curating for each step with human-machine interaction to provide a learning process by the system (e.g., Blocks 204-206 as shown in Fig. 2; paragraph [0033]; Examiner’s Note
wherein each slide extraction is performed by processing the content and using interactive input and cognitive computing to automatically define slide transitions (paragraph [0029], “Another exemplary setup is a room equipped with multiple cameras that detect and track the presenter's interactions with the slides on the room display, plus other capture appliances to record the slides and audio...”; paragraphs [0049-0054]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the system and method of Denoue in the presentation system of Wang for the purpose of composing a focused timed content representation of the presentation based on an identified sequence of regions of interest in the presentation and an identified temporal path of the presentation as suggested by Denoue.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Wang and Denoue to obtain the invention as specified in claim 1.
Per claim 2, Wang and Denoue disclose the method according to claim 1,
wherein the automatically performing extraction of slides in the live-stream in real-time includes finding and extracting slides based on the audio-visual content according to content semantics (Wang, e.g., Block 201 as shown in Fig. 2; paragraph [0035]), and
wherein processing audio synchronized with the slides being enriched further includes enriching from multimodal content through cognitive computing including slide extraction, slide transition extraction, object extraction, object animation extraction, and allowing object substitution (Wang, paragraph [0023]; paragraph [0035]).
Per claims 4, 11, and 18, Wang and Denoue disclose the method according to claim 1, the computer program product according to claim 8, and the system according to claim 15, respectively, wherein the object extraction from each slide includes using monitors and cognitive computing to automatically define object animations, using a search engine, cognitive computing and knowledge base to increase accuracy of extracted object and slides (Denoue, paragraph [0053]), and wherein the audio-visual content is processed to detect video content through regions of streamed images (Wang, paragraph [0035]).
Per claims 6 and 13, Wang and Denoue disclose the method according to claim 1 and the computer program product according to claim 8, respectively, wherein the audio processing includes speech-to-text and natural language understanding synchronized with the slides (Denoue, Abstract; paragraph [0027]).
Per claim 8,  Denoue discloses a computer program product comprising a computer readable storage medium having program instructions embodied therewith (paragraph [0060]), the program instructions and hardware descriptions readable and executable by a computer to cause the computer to:
automatically performing extraction of slides from multimodal content including audio-visual content (e.g., Block 201 as shown in Fig. 2; paragraph [0009]; paragraph [0035], “As shown in FIG. 2, the process 200 starts at block 201 where a slide area is extracted from image or video information associated with slide, wherein the slide comprises text and non-text information. The image or video information can be captured in real time or retrieved from a local or remote storage device. For example, when people are attending a business, a lecture, an academic meeting or any other suitable activities, they may record slide presentation with videos or images using smart phones and optionally share them with other people or upload them to a network location ...”);
automatically performing object extraction from each of the slides (e.g., Blocks 202-204 as shown in Fig. 2; paragraph [0039]; paragraphs [0043-0044]); 
allowing object substitution through semantics and concepts of the objects extracted (e.g., Block 205 as shown in Fig. 2; Abstract and paragraph [0005], “ …construct an editable slide with the non-text region or the text information according to their locations in the slide area (205). “; paragraph [0024], “… Therefore, it is desirable to provide a technical solution for recovering an editable slide (such as in .ppt or .pptx format) from such video or image, which may potentially be used in much more scenarios.  “; paragraph [0047]; paragraph [0051]; Examiner’s Note: Wang teaches creating a slide capable of being edited.  Therefore, Wang allows object substitution through semantics and concepts of the objects extracted such that the slide can be used in much more scenarios.); but does not expressly disclose:
processing audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base in a live stream, to provide annotations of the slides in real-time;
processing the audio synchronized with the object being presented in each slide. 
Denoue discloses:
processing audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base in a live stream, to provide annotations of the slides (Abstract; paragraph [0024]; paragraph [0031]; paragraph [0035]; paragraph [0047], “…The original video is thus segmented into units of time, each having a representing slide and associated audio segment…”; paragraph [0050], “In an embodiment of the inventive system, interactions in front of the display can be extracted by differencing the snapshots of the display. Cursor movement, marks, and annotations can be obtained more precisely from PowerPoint or using APIs of the operating system of the presenter's computer system 103.”; paragraphs [0057-0058]) according to semantics (paragraphs [0034]-[0037]; paragraph [0052]); and
processing the audio synchronized with the object being presented in each slide (paragraph [0030], “The capture module 101 then transmits the captured presentation slides, captured audio and/or other content 109 as well as associated metadata 110 to a presentation analysis module 106. The presentation analysis module 106, in turn, uses audio and visual features to find synchronized regions of interest, which are the regions in the complete original presentation that appear to be relevant to the user at a particular point in time, from the point of view of presentation flow.”; paragraph [0031]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the system and method of Denoue in the presentation system of Wang for the purpose of composing a focused timed content representation of the presentation based on an identified sequence of regions of interest in the presentation and an identified temporal path of the presentation as suggested by Denoue.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Wang and Denoue to obtain the invention as specified in claim 8.
Per claims 9 and 16, Wang and Denoue disclose the computer program product according to claim 8 and the system according to claim 15, respectively, wherein each slide extraction is performed by processing the content and using interactive input and cognitive computing to automatically define slide transitions (Denoue, paragraph [0029], “Another exemplary setup is a room equipped with multiple cameras that detect and track the presenter's interactions with the slides on the room display, plus other capture appliances to record the slides and audio…”; paragraphs [0049-0050]), further comprising processing the audio-visual content to detect specific objects from video content through regions of streamed images (Wang, paragraph [0035]).
Per claim 15, Denoue discloses a system (e.g., Fig. 1; paragraph [0029]), comprising:
a network (e.g., network link 1214 as shown in Fig. 12; paragraph [0067]);
a virtual computer connected to the network (e.g., computer platform 1201 as shown in Fig. 12), comprising:
a virtual memory storing computer instructions (e.g., memory 1207; paragraph [0060]; paragraph [0064]); a virtual processor executing the computer instructions and configured to:
automatically perform extraction of slides from multimodal content (e.g., slide 109 as shown in Fig. 1; paragraph [0029]; paragraph [0030], “The capture module 101 then transmits the captured presentation slides, captured audio and/or other content 109 as well as associated metadata 110 to a presentation analysis module 106…”; paragraph [0035]; paragraph [0047], “…For video files, the system detects slides as unit elements using frame differencing. The original video is thus segmented into units of time, each having a representing slide and associated audio segment…”; paragraph [0048]; paragraph [0056]);
automatically perform object extraction from each of the slides (e.g., associated metadata 110; paragraph [0036]; paragraph [0039]; paragraph [0047], “… The system then finds regions of interest on each unit (i.e. slide) using Optical Character Recognition, word bounding box and motion regions (e.g. a video clip playing within a slide or an animation). “; paragraph [0048]); 
allow object substitution through semantics and concepts of the objects extracted (paragraph [0035]; Examiner’s Note
process audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base, to provide annotations of the slides (Abstract, paragraph [0024]; paragraph [0035]; paragraph [0047], “ …The original video is thus segmented into units of time, each having a representing slide and associated audio segment… “; paragraphs [0057-0058]); 
process the audio synchronized with the object being presented in each slide (paragraph [0030], “The capture module 101 then transmits the captured presentation slides, captured audio and/or other content 109 as well as associated metadata 110 to a presentation analysis module 106.  The presentation analysis module 106, in turn, uses audio and visual features to find synchronized regions of interest, which are the regions in the complete original presentation that appear to be relevant to the user at a particular point in time, from the point of view of presentation flow.”). 
Per claim 20, Wang and Denoue disclose the system according to claim 15, wherein the audio processing includes speech-to-text and natural language understanding synchronized with the slides (Denoue, Abstract; paragraph [0027]), and wherein the learning process includes registering information in a database according to received feedback information (Marlowe, paragraphs [0031], [0038], [0040], [0046], [0050], [0054], [0088], [0104]).
Claims 3, 7, 10, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 2019/0155883 A1) in view of Denoue et al. (US 2009/0113278 A1), and further in view of Marlow et al. (Hereinafter, Marlow, US 2018/0359530 A1).
Per claim 3, Wang and Denoue disclose the method according to claim 1, wherein each slide extraction is performed by processing the content of video frames by detection, rotation, distortion and using monitors and cognitive computing to automatically define slide transitions (Denoue, paragraph [0029], “Another exemplary setup is a room equipped with multiple cameras that detect and track the presenter's interactions with the slides on the room display, plus other capture appliances to record the slides and audio…”; paragraph [0050]; Examiner’s Note: Denoue uses changes in snapshots of the display to determine when to extract content.), further comprising processing video of the audio-visual content by searching for regions in which there is an ongoing slide presentation to detect objects, presenters and presentation content in real-time along with the processing of the audio (Wang, paragraph [0035]).
Denoue does not expressly disclose wherein the processing of the audio synchronized with the object being presented in each slide is to enhance content semantics and understanding.
Marlowe discloses wherein the processing of the audio synchronized with the object being presented in each slide is to enhance semantics and understanding (Abstract; paragraph [0025]; paragraph [0028]; paragraph [0033]; paragraph [0040]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the curation engine of Marlowe in the presentation system of Denoue for the purpose of curating content from video based communication as suggested by Marlowe.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Denoue and Marlowe to obtain the invention as specified in claim 3.
Per claims 10 and 17, Wang and Denoue disclose the computer program product according to claim 8 and the system according to claim 15, respectively, wherein each slide extraction is performed by processing the content of video frames by detection, rotation, distortion and using monitors and cognitive computing to automatically define slide transitions (Denoue, paragraph [0029], “Another exemplary setup is a room equipped with multiple cameras that detect and track the presenter's interactions with the slides on the room display, plus other capture appliances to record the slides and audio…”; paragraph [0050]; Examiner’s Note: Denoue uses changes in snapshots of the display to determine when to extract content.).
Wang and Denoue do not expressly disclose wherein the processing of the audio synchronized with the object being presented in each slide is to enhance content semantics and understanding.
Marlowe discloses wherein the processing of the audio synchronized with the object being presented in each slide is to enhance semantics and understanding (Abstract; paragraph [0025]; paragraph [0028]; paragraph [0033]; paragraph [0040]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the curation engine of Marlowe in the presentation system of Wang and Denoue for the purpose of curating content from video based communication as suggested by Marlowe.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Wang, Denoue, and Marlowe to obtain the invention as specified in claims 10 and 17.  
Per claims 7 and 14, Wang and Denoue disclose the method according to claim 1 and the computer program product according to claim 8, respectively, but do not expressly disclose wherein the learning process includes registering information in a database according to received feedback information.
Marlowe discloses wherein the learning process includes registering information in a database according to received feedback information (Marlowe, paragraphs [0031], [0038], [0040], [0046], [0050], [0054], [0088], [0104]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the curation engine of Marlowe in the presentation system of Wang and Denoue for the purpose of curating content from video based communication as suggested by Marlowe.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Wang, Denoue, and Marlowe to obtain the invention as specified in claims 7 and 14.
Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (US 2019/0155883 A1) in view of Denoue et al. (US 2009/0113278 A1), and further in view of Mahapatra et al. (Hereinafter, Mahapatra, US 2018/0130496 A1).
Per claims 5, 12, and 19, Wang and Denoue disclose the method according to claim 1, the computer program product according to claim 8, and the system according to claim 15, respectively, but do not expressly disclose wherein the allowing of object substitution through semantics and concepts includes commanding a cognitive computing system to replace all image objects with related images according to a given specific licensing, such as Creative Commons.
Mahapatra discloses wherein the allowing of object substitution through semantics and concepts includes commanding a cognitive computing system to replace all image objects with related images according to a given specific licensing, such as Creative Commons (Abstract; paragraphs [0005-0007]; paragraphs [0025-0026]; paragraph [0052]; paragraph [0098]).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to use the image repository of Mahapatra in the presentation system of Wang and Denoue for the purpose of visual summary of multimedia content as suggested by Mahapatra.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Denoue, Wang, and Mahapatra to obtain the invention as specified in claims 5, 12, and 19.
Response to Arguments
Applicant's arguments filed 28 October 2020 have been fully considered but they are not persuasive. 
On pages 1-2 of the Applicant’s Response, Applicant argues that Denoue fails to teach or suggest (e.g., claim 1) “automatically performing extraction of slides from multimodal content including audio-visual content in real-time; automatically performing object extraction from each of the slides; allowing object substitution through semantics and concepts of the objects extracted.”
The Examiner respectfully disagrees with Applicant’s arguments, because Denoue was relied upon to disclose “automatically performing extraction of slides from multimodal content including audio-visual content in real-time; automatically performing object extraction from each of the slides.”   Moreover, Denoue discloses: 
processing audio synchronized with the slides enriched with cognitive computing, search engine, and knowledge base in a live stream, to provide annotations of the slides (Abstract; paragraph [0024]; paragraph [0031]; paragraph [0035]; paragraph [0047], “…The original video is thus segmented into units of time, each having a representing slide and associated audio segment…”; paragraph [0050], “In an embodiment of the inventive system, interactions in front of the display can be extracted by differencing the snapshots of the display. Cursor movement, marks, and annotations can be obtained more precisely from PowerPoint or using APIs of the operating system of the presenter's computer system 103.”; paragraphs [0057-0058]) according to semantics (paragraphs [0034]-[0037]; paragraph [0052]);
processing the audio synchronized with the object being presented in each slide (paragraph [0030], “The capture module 101 then transmits the captured presentation slides, captured audio and/or other content 109 as well as associated metadata 110 to a presentation analysis module 106.  The presentation analysis module 106, in turn, uses audio and visual features to find synchronized regions of interest, which are the regions in the complete original presentation that appear to be relevant to the user at a particular point in time, from the point of view of presentation flow.”; paragraph [0031] ).
Wang discloses a method to extract and enrich slide presentations from multimodal content through cognitive computing (Abstract; paragraph [0006], “According to another aspect of the present disclosure, it is provided a method.  The method may comprise extracting a slide area from image or video information associated with slide, wherein the slide comprises text and non-text information; segmenting the slide area into a plurality of regions; classifying each of the plurality of regions into a text region or a non-text region; performing text recognition on the text region to obtain text information when a region is classified as the text region; and constructing an editable slide with the non-text region or the text information according to their locations in the slide area.  “), the method comprising:
automatically performing extraction of slides from multimodal content including audio-visual content in real-time (e.g., Block 201 as shown in Fig. 2; paragraph [0009]; paragraph [0035], “As shown in FIG. 2, the process 200 starts at block 201 where a slide area is extracted from image or video information associated with slide, wherein the slide comprises text and non-text information. The image or video information can be captured in real time or retrieved from a local or remote storage device. For example, when people are attending a business, a lecture, an academic meeting or any other suitable activities, they may record slide presentation with videos or images using smart phones and optionally share them with other people or upload them to a network location ...”);
automatically performing object extraction from each of the slides (e.g., Blocks 202-204 as shown in Fig. 2; paragraph [0039]; paragraphs [0043-0044]); 
allowing object substitution through semantics and concepts of the objects extracted (e.g., Block 205 as shown in Fig. 2; Abstract and paragraph [0005], “ …construct an editable slide with the non-text region or the text information according to their locations in the slide area (205). “; paragraph [0024], “… Therefore, it is desirable to provide a technical solution for recovering an editable slide (such as in .ppt or .pptx format) from such video or image, which may potentially be used in much more scenarios.  “; paragraph [0047]; paragraph [0051]).
Wang teaches creating a slide capable of being edited. Therefore, Wang allows object substitution through semantics and concepts of the objects extracted such that the slide can be used in much more scenarios.
Moreover, the Applicant argues that the combination fails to teach or suggest (e.g., claim 1), “curating for each step with human-machine interaction to provide a learning process by the system, wherein each slide extraction is performed by processing the content and using interactive input and cognitive computing to automatically define slide transitions.”
The Examiner disagrees. Denoue discloses curating for each step with human-machine interaction to provide a learning process by the system (e.g., Blocks 204-206 as shown in Fig. 2; paragraph [0029], “Another exemplary setup is a room equipped with multiple cameras that detect and track the presenter's interactions with the slides on the room display, plus other capture appliances to record the slides and audio...”; paragraphs [0049-0054]). Denoue teaches that regions of interest that are relevant from the point of view of the presentation flow are organized temporally.
Therefore, Examiner submits that claims 1, 8 and 15 are not allowable.  The claims dependent on claims 1, 8 and 15 are not allowable by their dependence on rejected claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Brandt et al. (US 6,646,655 B1) - Brandt discloses a method to extract and enrich slide presentations from multimodal content through cognitive computing (e.g., Fig. 20 illustrates a generalized technique for generating a slide set from a video input; Abstract, “An apparatus and method for generating a slide from a video. A slide is automatically identified in video frames of a video input stream. A digitized representation of the slide is generated based on the video frames of the video input stream.”; column 2, lines 53-54, “A method and apparatus for generating a set of slides from a video is disclosed in various embodiments …”; column 18, lines 7-27).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARRIN HOPE whose telephone number is (571)270-
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kieu D Vu, can be reached on (571)272-4057. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


DARRIN HOPE
Examiner
Art Unit 2173


/TADESSE HAILU/Primary Examiner, Art Unit 2173