DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Cheng
Claims 1, 3-5, 8-11, 13, 15-18, 20-30 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Cheng et al.(USPubN 2014/0328570; hereinafter Cheng).
As per claim 1, Cheng teaches a computer-implemented method for performing post-production editing of digital video footages or digital multimedia footages, comprising: 
receiving one or more footages of an event(“As used herein, "multimedia input" may refer to, among other things, a collection of digital images, a video, a collection of videos, or a collection of images and videos (where a "collection" includes two or more images and/or videos). References herein to a "video" may refer to, among other things, a relatively short video clip, an entire full-length video production, or different segments within a video or video clip (where a segment includes a sequence of two or more frames of the video). Any video of the input 102 may include or have associated therewith an audio soundtrack and/or a speech transcript, where the speech transcript may be generated by, for example, an automated speech recognition (ASR) module of the computing system 100. Any video or image of the input 102 may include or have associated therewith a text transcript, where the text transcript may be generated by, for example, an optical character recognition (OCR) module of the computing system 100. References herein to an "image" may refer to, among other things, a still image (e.g., a digital photograph) or a frame of a video (e.g., a "key frame").” in Para.[0017], The multimedia input can be interpreted as footages of an event.); 
constructing, based on information about the event, a script to indicate a structure of multiple temporal units of the one or more footages, wherein a temporal unit comprises a shot or a scene(“The event description 106 semantically describes an event depicted by the multimedia input 102, as determined by the multimedia content understanding module 104. In the illustrative embodiments, the event description 106 is determined algorithmically by the computing system 100 analyzing the multimedia input 102. In other embodiments, the event description 106 may be user-supplied or determined by the system 100 based on meta data or other descriptive information associated with the input 102. The illustrative event description 106 generated by the understanding module 104 indicates an event type or category, such as "birthday party," "wedding," "soccer game," "hiking trip," or "family activity." The event description 106 may be embodied as, for example, a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag). Alternatively or in addition, the event description 106 may be embodied as structured data, e.g., a data type or data structure including semantics, such as "Party(retirement)," "Party(birthday)," "Sports_Event(soccer)," "Performance(singing)," or "Performance(dancing)."” in Para.[0019], “Dynamic visual features include features that are computed over x-y-t segments or windows of a video. Dynamic feature detectors can detect the appearance of actors, objects and scenes as well as their motion information. Some examples of dynamic feature detectors include MoSIFT, STIP (Spatio-Temporal Interest Point), DTF-HOG (Dense Trajectory based Histograms of Oriented Gradients), and DTF-MBH (Dense-Trajectory based Motion Boundary Histogram). The MoSIFT feature detector extends the SIFT feature detector to the time dimension and can collect both local appearance and local motion information, and identify interest points in the video that contain at least a minimal amount of movement. The STIP feature detector computes a spatio-temporal second-moment matrix at each video point using independent spatial and temporal scale values, a separable Gaussian smoothing function, and space-time gradients.” in Para.[0041]); 
extracting semantic meaning from the one or more footages based on a multimodal analysis comprising at least an audio analysis and a video analysis(“A multimedia content understanding module 104 of the computing system 100 is embodied as software, firmware, hardware, or a combination thereof. The multimedia content understanding module 104 applies a number of different feature detection algorithms 130 to the multimedia input 102, using a multimedia content knowledge base 132, and generates an event description 106 based on the output of the algorithms 130 … The illustrative multimedia content understanding module 104 executes different feature detection algorithms 130 on different parts or segments of the multimedia input 102 to detect different features, or the multimedia content understanding module 104 executes all or a subset of the feature detection algorithms 130 on all portions of the multimedia input 102.” in Para.[0018], “multimedia content understanding module 104 accesses one or more feature models 134 and/or concept models 136. The feature models 134 and the concept models 136 are embodied as software, firmware, hardware, or a combination thereof, e.g., a knowledge base, database, table, or other suitable data structure or computer programming construct. The models 134, 136 correlate semantic descriptions of features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low level features detected by the algorithms 130 with semantic descriptions of those sets of features (e.g., "object," "person," "face," "ball," "vehicle," etc.). Similarly, the concept model 136 may define relationships between sets of features detected by the algorithms 130 and higher-level "concepts," such as people, objects, actions and poses (e.g., "sitting," "running," "throwing," etc.). 
The semantic descriptions of features and concepts that are maintained by the models 134, 136 may be embodied as natural language descriptions and/or structured data. As described in more detail below with reference to FIG. 4, a mapping 140 of the knowledge base 132 indicates relationships between various combinations of features, concepts, events, and activities. As described below, the event description 106 can be determined using semantic reasoning in connection with the knowledge base 132 and/or the mapping 140. To establish "relationships" and "associations" as described herein, the computing system 100 may utilize, for example, a knowledge representation language or ontology” in Para.[0020], “multimedia content understanding module 104 includes a number of feature detection modules 202, including a visual feature detection module 212, an audio feature detection module 214, a text feature detection module 216, and a camera configuration feature detection module 218. The feature detection modules 202, including the modules 212, 214, 216, 218, are embodied as software, firmware, hardware, or a combination thereof. The various feature detection modules 212, 214, 216, 218 analyze different aspects of the multimedia input 102 using respective portions of the feature models 134. To enable this, the multimedia content understanding module 104 employs external devices, applications and services as needed in order to create, from the multimedia input 102, one or more image/video segments 204, an audio track 206, and a text/speech transcript 208” in Para.[0039]); 
adding editing instructions to the script based on the structure of the multiple temporal units and the semantic meaning extracted from the one or more footages(“The salient event criteria 138 may also include salient event criteria that is specified by or derived from user inputs 256. For example, the interactive storyboard module 124 may determine salient event criteria based on user inputs received by the editing module 126. As another example, the user preference learning module 148 may derive new or updated salient event criteria based on its analysis of the user feedback 146” in Para.[0050]); and 
performing editing operations based on the editing instructions to generate an edited multimedia content based on the one or more footages(“The output generator module 114 interfaces with an interactive storyboard module 124 to allow the end user to modify the (machine-generated) visual presentation 120 and/or the (machine-generated) NL description 122, as desired. The illustrative interactive storyboard module 124 includes an editing module 126, a sharing module 128, and an auto-suggest module 162. The interactive storyboard module 124 and its submodules 126, 128, 162 are each embodied as software, firmware, hardware, or a combination thereof. The editing module 126 displays the elements of the visual presentation 120 on a display device (e.g., a display device 642, FIG. 6) and interactively modifies the visual presentation 120 in response to human-computer interaction (HCI) received by a human-computer interface device (e.g., a microphone 632, the display device 642, or another part of an HCI subsystem 638). The interactive storyboard module 124 presents the salient event segments 112 using a storyboard format that enables the user to intuitively review, rearrange, add and delete segments of the presentation 120 (e.g. by tapping on a touchscreen of the HCI subsystem 638). When the user's interaction with the presentation 120 is complete, the interactive storyboard module 124 stores the updated version of the presentation 120 in computer memory (e.g., a data storage 620)” in Para.[0033]).
As per claim 3, Cheng teaches extracting information about time or location at which the event has been captured based on metadata embedded in the one or more footages(“Dynamic visual features include features that are computed over x-y-t segments or windows of a video. Dynamic feature detectors can detect the appearance of actors, objects and scenes as well as their motion information. Some examples of dynamic feature detectors include MoSIFT, STIP (Spatio-Temporal Interest Point), DTF-HOG (Dense Trajectory based Histograms of Oriented Gradients), and DTF-MBH (Dense-Trajectory based Motion Boundary Histogram). The MoSIFT feature detector extends the SIFT feature detector to the time dimension and can collect both local appearance and local motion information, and identify interest points in the video that contain at least a minimal amount of movement. The STIP feature detector computes a spatio-temporal second-moment matrix at each video point using independent spatial and temporal scale values, a separable Gaussian smoothing function, and space-time gradients. The DTF-HoG feature detector tracks two-dimensional interest points over time rather than three-dimensional interest points in the x-y-t domain, by sampling and tracking feature points on a dense grid and extracting the dense trajectories. The HoGs are computed along the dense trajectories to eliminate the effects of camera motion (which may be particularly important in the context of unconstrained or "in the wild" videos). The DTF-MBH feature detector applies the MBH descriptors to the dense trajectories to capture object motion information. The MBH descriptors represent the gradient of optical flow rather than the optical flow itself. Thus, the MBH descriptors can suppress the effects of camera motion, as well. However, HoF (histograms of optical flow) may be used, alternatively or in addition, in some embodiments.” in Para.[0041]).
As per claim 4, Cheng teaches assigning a time domain location for each of the multiple temporal units of the one or more footages; and aligning corresponding temporal units based on the time domain location(Para.[0041], “The relationships 246 may include, for example, temporal relations between actions, objects, and/or audio events (e.g., "is followed by"), compositional relationships (e.g., Person X is doing Y with object Z), interaction relationships (e.g., person X is pushing an object Y or Person Y is using an object Z), state relations involving people or objects (e.g., "is performing," "is saying"), co-occurrence relations (e.g., "is wearing," "is carrying"), spatial relations (e.g., "is the same object as"), temporal relations between objects (e.g., "is the same object as"), and/or other types of attributed relationships (e.g., spatial, causal, procedural, etc.)” in Para.[0047]).
As per claim 5, Cheng teaches identifying one or more characters or one or more gestures in the one or more footages; and refining the aligning of the corresponding temporal units based on the identified one or more characters or the identified one or more gestures(“the event detection module 228 may use one or more concept classifiers to analyze the low-level features 220, 222, 224, 226 and use the concept model 136 to classify the low-level features 220, 222, 224, 226 as representative of certain higher-level concepts such as scenes, actions, actors, and objects … The event detection module 228 may apply one or more event classifiers to the features 244, relationships 246, and/or concepts 238 to determine whether a combination of features 244, relationships 246, and/or concepts 238 evidences an event. The relationships 246 may include, for example, temporal relations between actions, objects, and/or audio events (e.g., "is followed by"), compositional relationships (e.g., Person X is doing Y with object Z), interaction relationships (e.g., person X is pushing an object Y or Person Y is using an object Z), state relations involving people or objects (e.g., "is performing," "is saying"), co-occurrence relations (e.g., "is wearing," "is carrying"), spatial relations (e.g., "is the same object as"), temporal relations between objects (e.g., "is the same object as"), and/or other types of attributed relationships (e.g., spatial, causal, procedural, etc.)” in Para.[0047]).
As per claim 8, Cheng teaches wherein the semantic meaning comprises an association between some of the one or more characters that is determined based on the video analysis of the one or more footages(“the event description 106 or the NL description 122 generated automatically by the system 100 can take the form of, e.g., meta data that can be used for indexing, search and retrieval, and/or for advertising (e.g., as ad words). The meta data can include keywords that are derived from the algorithmically performed complex activity recognition and other semantic video analysis (e.g. face, location, object recognition; text OCR; voice recognition), performed by the system 100 using the feature detection algorithms 130 as described above” in Para.[0063]). 
As per claim 9, Cheng teaches wherein the semantic meaning comprises an association between actions performed by some of the one or more characters that is determined based on the video analysis of the one or more footages(“detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text, in addition to the more typical visual features (e.g., actors, objects, scenes, actions)” in Para.[0043], [0047], [0048]).
As per claim 10, Cheng teaches wherein the extracting of the semantic meaning comprises: identifying one or more characters in the one or more footages; identifying, based on the one or more footages, one or more actions performed by the one or more characters; and establishing, using a neural network, an association between at least part of the one or more actions based on the information about the event(“Some embodiments of the computing system 100 can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text, in addition to the more typical visual features (e.g., actors, objects, scenes, actions). The illustrative audio feature detection module 214 analyzes the audio track of an input 102 using mathematical sound processing algorithms and uses the audio feature model 238 (e.g., an acoustic model) to detect and classify audio features 222. For example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model 238, classify the acoustic characteristic as indicating a "cheering" sound or "applause." Some examples of low level audio features that can be used to mathematically detect audio events in the input 102 include Mel frequency cepstral coefficients (MFCCs), spectral centroid (SC), spectral roll off (SRO), time domain zero crossing (TDZC), and spectral flux. The audio feature model 238 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236 except that the audio features of the training data are analyzed rather than the visual features, in order to develop the audio feature model 238. 
The audio feature detection module 214 identifies the detected audio features 222 to the event detection module 228” in Para.[0043], [0047], [0048], “a number of semantic content learning modules 152, including feature learning modules 154, a concept learning module 156, a salient event learning module 158, and a template learning module 160. The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection 150 and create and /or update portions of the knowledge base 132 and/or the presentation templates 142. For example, the learning modules 152 may be used to initially populate and/or periodically update portions of the knowledge base 132 and/or the templates 142, 144. The feature learning modules 154 analyze sample images and videos from the collection 150 and populate or update the feature models 134. For example, the feature learning modules 154 may, over time or as a result of analyzing portions of the collection 150, algorithmically learn patterns of computer vision algorithm output that evidence a particular feature, and update the feature models 134 accordingly. Similarly, the concept learning module 156 may, over time or as a result of analyzing portions of the collection 150, algorithmically learn combinations of low level features that evidence particular concepts, and update the concept model 136 accordingly” in Para.[0029]).
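For illustration only, two of the low-level audio features named in the passage quoted above from Para.[0043] of Cheng, time domain zero crossing (TDZC) and spectral centroid (SC), can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Cheng: the function names, the synthetic test tones, the 8000 Hz sample rate, and the naive discrete Fourier transform are hypothetical choices made only for this example.

```python
import cmath
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ (time-domain zero crossing)."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency in Hz, computed with a naive O(n^2) DFT."""
    n = len(samples)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


def make_tone(freq_hz, sample_rate=8000, n=256):
    """Hypothetical test signal: a pure sine tone standing in for an audio segment."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


low_tone = make_tone(250.0)    # assumed stand-in for a low-pitched sound
high_tone = make_tone(2000.0)  # assumed stand-in for a high-pitched sound

print("low tone :", zero_crossing_rate(low_tone), spectral_centroid(low_tone, 8000))
print("high tone:", zero_crossing_rate(high_tone), spectral_centroid(high_tone, 8000))
```

In this sketch the higher-pitched tone yields both a higher zero-crossing rate and a higher spectral centroid, which is the kind of low-level acoustic evidence that a trained audio feature model could then map to a label such as "cheering" or "applause." The classification step itself is not shown.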
As per claim 11, Cheng teaches wherein the extracting of the semantic meaning further comprises: adjusting the association between the at least part of the one or more actions using feedback from a user(“computing system 100 also includes a user preference learning module 148. The user preference learning module 148 is embodied as software, firmware, hardware, or a combination thereof. The user preference learning module 148 monitors implicit and/or explicit user interactions with the presentation 120 (user feedback 146) and executes, e.g., machine learning algorithms to learn user-specific specifications and/or preferences as to, for example, the types of activities that the user considers to be "salient" with respect to particular events, the user's specifications or preferences as to the ordering of salient events in various types of different presentations 120, and/or other aspects of the creation of the presentation 120 and/or the NL description 122. The user preference learning module 148 updates the templates 142, 144 and/or portions of the knowledge base 132 (e.g., the salient event criteria 138) based on its analysis of the user feedback 146” in Para.[0037]).
As per claim 13, Cheng teaches a post-production editing platform, comprising: a user interface and a processor(“user computing device 610 includes at least one processor 612 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 614, and an input/output (I/O) subsystem 616. The computing device 610 may be embodied as any type of computing device capable of performing the functions described herein, such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, wearable device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices. Although not specifically shown, it should be understood that the I/O subsystem 616 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports. The processor 612 and the I/O subsystem 616 are communicatively coupled to the memory 614. The memory 614 may be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory)” in Para.[0067]), and the other limitations of claim 13 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 15, the limitations of claim 15 have been discussed in the rejection of claim 3 and are rejected under the same rationale.
As per claim 16, Cheng teaches wherein the structure of the multiple temporal units specifies that a scene includes multiple shots, and wherein one or more clips from at least one device correspond to a same shot(Para.[0013]).
As per claim 17, the limitations of claim 17 have been discussed in the rejection of claim 4 and are rejected under the same rationale.
As per claim 18, the limitations of claim 18 have been discussed in the rejection of claim 5 and are rejected under the same rationale.
As per claim 20, the limitations of claim 20 have been discussed in the rejection of claim 8 and are rejected under the same rationale.
As per claim 21, the limitations of claim 21 have been discussed in the rejection of claim 9 and are rejected under the same rationale.
As per claim 22, the limitations of claim 22 have been discussed in the rejection of claim 10 and are rejected under the same rationale.
As per claim 23, the limitations of claim 23 have been discussed in the rejection of claim 11 and are rejected under the same rationale.
As per claim 24, the limitations of claim 24 have been discussed in the rejection of claim 12 and are rejected under the same rationale.
As per claim 25, Cheng teaches wherein at least part of which is implemented as a web service(“the content processing, e.g., the complex event recognition, is done on a server computer (e.g., by a proprietary video creation service), so captured video files are uploaded by the customer to the server, e.g. using a client application running on a personal electronic device or through interactive website. Interactive aspects, such as storyboard selection and editing of clip segments, may be carried out via online interaction between the customer's capture device (e.g., camera, smartphone, etc.), or other customer local device (e.g. tablet, laptop), and the video service server computer. Responsive to local commands entered on the customer's device, the server can assemble clip segments as desired, and redefine beginning and end frames of segments, with respect to the uploaded video content. Results can be streamed to the customer's device for interactive viewing. Alternatively or in addition, computer vision algorithms (such as complex event recognition algorithms) may be implemented locally on the user's capture device and the video creation service can be delivered as an executable application running on the customer's device” in Para.[0064]).
As per claim 26, Cheng teaches a multimedia content system, comprising: an input device that stores one or more video footages of an event; and one or more computer processors, computer servers or computer storage devices in communication with the input device via a network(Para.[0067]) and provide, via the network, the edited multimedia content to be retrieved for viewing or further processing(Para.[0065], Fig. 6), and the other limitations of claim 26 have been discussed in the rejection of claim 1 and are rejected under the same rationale.
As per claim 27, Cheng teaches further comprising a communication or computing device in communication with the network to interact with the one or more computer processors, computer servers or computer storage devices to retrieve the edited multimedia content for viewing or further processing(Fig. 6).
As per claim 28, Cheng teaches wherein the input device is operable to retrieve the edited multimedia content for viewing or further processing(Fig. 6).
As per claim 29, Cheng teaches wherein the input device includes a camera for capturing the one or more video footages and a processor for communicating the one or more computer processors, computer servers or computer storage devices(Fig. 6, Para.[0032]).
As per claim 30, Cheng teaches wherein the input device includes a computer(Fig. 6, Para.[0065]).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Cheng in view of Dunsmuir
Claims 2 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al.(USPubN 2014/0328570; hereinafter Cheng) in view of Dunsmuir(USPubN 2014/0082079).
As per claim 2, Cheng teaches all of the limitations of claim 1.
Cheng is silent about presenting, to a user via a user interface, the script and the edited multimedia content; receiving input from the user via the user interface to update at least part of the script in response to the input from the user; and generating a revised version of the edited multimedia content based on the updated script in an iterative manner.
Dunsmuir teaches presenting, to a user via a user interface, the script and the edited multimedia content; receiving input from the user via the user interface to update at least part of the script in response to the input from the user; and generating a revised version of the edited multimedia content based on the updated script in an iterative manner(“The MPA includes a Navigation Interface (301), which gives the user (300) access to a number of different views, each of which is associated with a specific function of the MPA. Views are selected via the Navigation Interface and each View has its own User Interface for interaction with the User. The Navigation Interface includes buttons and underlying menus, which are displayed to the user as the result of touch screen interactions on the phone. The Views include: the Local View (302), which allows the user to select and view content recorded locally (i.e. without a network connection between the MPA and the MS); a Settings View (310), which allows the user to adjust default settings for the MPA running of their phone; the Folder View (304) which allows the user to view and navigate folders and content recordings; the Player View (306), activated from the Folder View, which allows the user to playback and interact with individual recordings using the local media player (314); and the Camera View (308) which allows the user to make new recordings using the media recorder (316) and the phone camera (318). There is also an Upload Service (330), which runs in the background and is responsible for uploading items from the File store (320) to the Media Server. Each View and the Upload Service runs in a separate thread allowing a number of different tasks to proceed in parallel and allowing the user interface to be responsive to the user at all times. Those of ordinary skill in the art will appreciate the value of dividing the work of an application between multiple threads. 
The embodiment shown makes use of the interfaces and services provided by the underlying operating system to perform its tasks. These include external functions which provide their own user interface, such as: the Media Recorder (316), which is used to capture and record video and digital photographs; the Media Player (314) which is used to playback recordings; and other functions such as the email and text-messaging interfaces (332). When an external event takes place, such as a change to the location of the host device, an event message (330) is sent by the system to the MPA, where an event handler (322) updates global data (324) and notifies interested views of the message's arrival, as appropriate. The global data (324) is accessible to all functions of the MPA and this data is divided into two categories: persistent data, which is maintained across instantiations of the MPA; and volatile data, representing the state of the current MPA instantiation. This latter data is lost when the MPA terminates. In addition to the Views themselves a set of network interfaces (326) is provided to allow the Views to interface with the Media Server over a computer network (328) and the MPA also makes use of the local file-store (320) to store recordings and their meta-data before they are uploaded to the Media Server” in Para.[0068]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cheng with the above teachings of Dunsmuir in order to enhance the user experience with multimedia content.
As per claim 14, the limitations of claim 14 have been discussed in the rejection of claim 2 and are rejected under the same rationale.

Cheng in view of Stojancic
Claims 6 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al.(USPubN 2014/0328570; hereinafter Cheng) in view of Stojancic et al.(USPubN 2019/0356948; hereinafter Stojancic).
As per claim 6, Cheng teaches all of the limitations of claim 5.
Cheng teaches extracting text or background sound from the one or more footages based on the audio analysis(Para.[0039]).
Cheng is silent about modifying the script to include the text or the background sound.
Stojancic teaches modifying the script to include the text or the background sound(“updating metadata 224 for highlights 220 with in-frame real-time information. In a step 1110, a field to be processed is selected from character boundaries 204 of the characters present in card image 310. In a step 1120, a group of characters is extracted from a line field, and text strings are recognized and interpreted as described above” in Para.[0184]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cheng with the above teachings of Stojancic in order to allow for real-time generation of sophisticated programming content highlights accompanied by metadata.
As per claim 19, the limitations of claim 19 have been discussed in the rejection of claim 6, and claim 19 is rejected under the same rationale.

Cheng in view of Buford
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al.(USPubN 2014/0328570; hereinafter Cheng) in view of Buford et al.(USPubN 2017/0048492; hereinafter Buford).
As per claim 7, Cheng teaches all of the limitations of claim 1. 
Cheng and Stojancic are silent about replacing the background sound using an alternative sound determined based on the semantic meaning of the one or more footages.
Buford teaches replacing the background sound using an alternative sound determined based on the semantic meaning of the one or more footages(“The audio may be replaced with background noise (possibly recorded from prior in the communication) to ensure the silence is not as noticeable to participant 123 when hearing the audio from the media stream” in Para.[0038]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cheng and Stojancic with the above teachings of Buford in order to enhance the user experience of media content by desired media assets.

Cheng in view of Zavesky
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al.(USPubN 2014/0328570; hereinafter Cheng) in view of Zavesky et al.(USPubN 2019/0045194; hereinafter Zavesky).
As per claim 12, Cheng teaches all of the limitations of claim 11. 
Cheng is silent about packaging the edited multimedia content based on a target online media platform; and distributing the packaged multimedia content to the target online media platform.
Zavesky teaches packaging the edited multimedia content based on a target online media platform; and distributing the packaged multimedia content to the target online media platform(“one of the application servers 114 may store or receive video programs in accordance with the present disclosure, detect scene boundaries, identify themes in scenes, encode scenes using encoding strategies based upon the themes that are detected, and so forth. The video programs may be received in a raw/native format from server 149, or from another device within or external to core network 110. For instance, server 149 may comprise a server of a television or video programming provider. After encoding/compression, one of the application servers 114 may store the encoded version of the video program to one or more of the content servers 113 for later broadcasting via TV servers 112, streaming via interactive TV/VOD server 115, and so forth. In another example, the video programs may be received in a raw/native format from content servers 113 and stored back to content servers 113 after encoding. In still another example, the video program may be received from server 149 and stored back to server 149 after encoding. In addition, in some examples, one of application server 114 may also be configured to generate multiple target bitrate copies of the same video program, e.g., for adaptive bitrate streaming, for a selection of a different version of the video program depending upon the target platform or delivery mode (e.g., STB/DVR 162A and TV 163A via access network 120 versus mobile device 157B via wireless access network 150). Thus, multiple copies of the video program may be stored at server 149 and/or content servers 113” in Para.[0036]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cheng with the above teachings of Zavesky in order to enhance the user experience of media content by desired media platform.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SUNGHYOUN PARK whose telephone number is (571)270-1333.  The examiner can normally be reached on M - Thur 6:00 am - 4 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THAI Q TRAN, can be reached on (571)272-7382.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SUNGHYOUN PARK/Examiner, Art Unit 2484