DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 10/12/2021 have been fully considered but they are not persuasive.
On pages 2-3, Applicant argues that,
“… Herberger does not disclose or suggest counting the number of video clips or further determining a target audio based on the number of video clips. Furthermore, col. 6, lns. 58-60 of Herberger discloses letting the user to select one or more audio works to include into audio track, and fails to mention or suggest a number of clips. Therefore, Herberger does not disclose or suggest “acquiring a target audio suitable to video content based on the video content and the number of the at least one video clip.””

In response, Examiner respectfully disagrees for at least two reasons:
1) Herberger teaches selecting audio content suitable to cover all of the acquired video clips. For example, there are 4 acquired video clips in Figs. 3-4. The corresponding audio content is selected based on these 4 video clips so that the total length of the video clips is fully covered with corresponding audio content.
2) According to Herberger’s teachings, the target audio is acquired in such a way that the number of audio change points is greater than or equal to the number of the at least one video clip minus one, as described in the Office Action. That is, as described in column 2, lines 26-59 and Figs. 3-4 of Herberger, the target audio is acquired to provide a number of audio change points, e.g. points M1-M7, which is 7, that is greater than or equal to the number of video clips, which is 4, minus one. Herberger therefore teaches acquiring a target audio suitable to video content based on the number of the at least one video clip minus one. Since the acquiring is based on the number of the at least one video clip minus one, it is also based on the number of the at least one video clip.
Herberger is not relied upon to teach acquiring a target audio suitable to video content based on the video content.
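For illustration only, the count comparison in reason 2) above can be expressed as a minimal sketch. This is not code from Herberger or from the application; the function name has_enough_change_points is hypothetical, and only the counts (7 markers M1-M7 and 4 video clips) are taken from the discussion of Figs. 3-4.

```python
def has_enough_change_points(num_change_points: int, num_clips: int) -> bool:
    """Return True when the number of audio change points is at least
    the number of video clips minus one (hypothetical helper)."""
    if num_clips < 1:
        raise ValueError("at least one video clip is required")
    return num_change_points >= num_clips - 1


# Counts taken from the discussion of Figs. 3-4: markers M1-M7 and 4 clips.
markers = ["M1", "M2", "M3", "M4", "M5", "M6", "M7"]
num_clips = 4
print(has_enough_change_points(len(markers), num_clips))  # compares 7 with 4 - 1
```

Under these counts the comparison is 7 against 3, so the condition is met; with only 2 change points and 4 clips it would not be.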
On pages 3-4, Applicant argues that,
“… De Vos discloses generating audio content which is related to the captured image content items in the collection by using the time of captured content items and playback times indicated by the playback log. Therefore, the audio content must associated with the content items in the collection in view of the time of captured content items and playback. De Vos does not mention or suggest “acquiring a target audio suitable to video content based on the video content and the number of the at least one video clip” as recited in claim 1.”

In response, Examiner respectfully submits that, at least in [0021], De Vos states that,
[0021] The selection of audio content items may be made in dependence on various different parameters relating to the image content items. For example, the picture data of uploaded image content items may be analysed and categorised based on the analysis. The audio content items may then be preferentially selected to accompany the collection or sequence of image content items when they have a category which matches the categorisation of the image content items in the collection or sequence. For example, such image analysis might determine that an image or a set of images relate to a beach scene or a sunset, which might make the selection of a relaxing genre of music appropriate to accompany those images.
(emphasis added).
The cited passage thus teaches “acquiring a target audio suitable to video content based on the video content”.
De Vos is not relied upon to teach acquiring a target audio suitable to video content based on the number of the at least one video clip as this feature has been taught by Herberger.
Applicant’s arguments are therefore not persuasive.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9, 11-12, 14-15, and 33-34 are rejected under 35 U.S.C. 103 as being unpatentable over Herberger et al. (US 7,512,886 B1 – hereinafter Herberger) and De Vos et al. (US 2012/0251082 A1 – hereinafter De Vos).
Regarding claim 1, Herberger discloses a video synthesis method, comprising: acquiring at least one video clip (column 2, lines 14-25; column 6, lines 9-27 – acquiring one or more video clips selected by the user); acquiring a target audio suitable to video content based on the number of the at least one video clip (column 2, lines 26-59; column 6, lines 58-60 – acquiring audio data suitable to video content based on the number of the at least one video clip as desired and selected by the user), wherein a number of audio change points of the target audio is greater than or equal to a number of the at least one video clip minus one (column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4, minus one), and the audio change points comprise time points at which change in audio feature satisfies a preset condition (column 2, lines 26-59; column 9, lines 4-14; Figs. 3-4 – the audio change points comprise time points at which change in audio feature satisfies specific criteria, e.g. determination of the musical rhythm, such as beat/time signature, of the music, identification of changes in its volume, location within the audio track of a chorus or refrain, identification of changes in musical key, bar locations, strophe and refrain, etc.); and obtaining a video file by synthesizing the at least one video clip and the target audio based on the audio change points included in the target audio (column 9, lines 54-56; Fig. 5 – synthesizing a video file as an aligned video/audio work for subsequent playback or additional editing as shown at step 545 of Fig. 5).
However, Herberger does not disclose acquiring a target audio suitable to video content based on the video content (Herberger discloses acquiring a target audio based on parameters set by the user).
De Vos discloses acquiring a target audio suitable to video content based on the video content ([0021]-[0029]; [0072]-[0074]; [0084] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, extreme, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of De Vos into the method taught by Herberger to provide a better match between the audio and the video content.
	Regarding claim 2, see the teachings of Herberger and De Vos as discussed in claim 1 above. Herberger also discloses said acquiring the target audio comprising: determining the video content of each video clip of the at least one video clip by recognizing each video clip (column 2, lines 26-59; column 6, lines 58-60; Figs. 3-4 – recognizing each video clip, e.g. existence of boundaries T1, T2, and T3); and including the audio change points as the target audio (column 2, lines 26-59; column 6, lines 58-60 – including audio data with satisfactory markers as target audio as further described in at least column 10, lines 24-54).
	However, the method as proposed does not comprise the feature of “determining a style corresponding to the video content of each video clip, respectively, based on the video content of each video clip; and in response to determining that respective video contents of the at least one video clip correspond to an identical style, acquiring an audio having the identical style.”
	De Vos further teaches acquiring the target audio comprising: determining the video content of each video clip of the at least one video clip by recognizing each video clip ([0021]-[0029] – recognizing each video item for analysis); determining a style corresponding to the video content of each video clip, respectively, based on the video content of each video clip ([0021]-[0029]; [0072] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, extreme, etc., based on the analysis of the video content); and in response to determining that respective video contents of the at least one video clip correspond to an identical style, acquiring an audio having the identical style ([0021]-[0029]; [0072]-[0074]; [0084] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, extreme, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the further teachings of De Vos into the method taught by Herberger to provide a consistency of style between the audio and the video content.
	Regarding claim 3, see the teachings of Herberger and De Vos as discussed in claim 2 above, in which Herberger and De Vos also disclose said acquiring the audio having the identical style and including the audio change points as the target audio comprising: acquiring one audio (Herberger: column 6, lines 58-60 – one or more audio works or De Vos: [0028] – one or more audio items) having the identical style (De Vos: [0021]-[0029] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content) and including the audio change points whose number is greater than or equal to the number of the at least one video clip minus one as the target audio (Herberger: column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4); or acquiring multiple audios (Herberger: column 6, lines 58-60 – more than one audio work or De Vos: [0028] – more than one audio item) having the identical style (De Vos: [0021]-[0029] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content) and including the audio change points whose total number is greater than or equal to the number of the at least one video clip minus one as the target audio (Herberger: column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4).
	The motivation for combining Herberger and De Vos has been discussed in claim 2 above.
	Regarding claim 4, see the teachings of Herberger and De Vos as discussed in claim 2 above. De Vos also discloses after determining a style corresponding to the video content of each video clip, respectively, based on the video content of each video clip ([0021]-[0029]; [0072] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content), the method further comprising: in response to that respective video contents of the at least one video clip corresponds to multiple styles ([0021]-[0029]; [0072] – multiple styles: mood, activity, and effect as shown in Fig. 4), acquiring an audio having a style corresponding to the video content of a target video clip and including the audio change points as the target audio, wherein one video clip in the at least one video clip being determined as the target video clip ([0021]-[0029]; [0072]-[0074]; [0084] – acquiring an audio item). Herberger also discloses acquiring an audio including the audio change points as the target audio (column 2, lines 26-59; column 9, lines 4-14; Figs. 3-4 – the audio change points comprise time points at which change in audio feature satisfies specific criteria, e.g. determination of the musical rhythm, such as beat/time signature, of the music, identification of changes in its volume, location within the audio track of a chorus or refrain, identification of changes in musical key, bar locations, strophe and refrain, etc.).
The motivation for combining Herberger and De Vos has been discussed in claim 2 above.
Regarding claim 5, see the teachings of Herberger and De Vos as discussed in claim 4 above, in which Herberger and De Vos also disclose said acquiring the audio having the style corresponding to the video content of the target video and including the audio change points as the target audio comprising: acquiring one audio (Herberger: column 6, lines 58-60 – one or more audio works or De Vos: [0028] – one or more audio items) having the style corresponding to the video content of the target video clip (De Vos: [0021]-[0029] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, extreme, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content) and including the audio change points whose number is greater than or equal to the number of the at least one video clip minus one as the target audio (Herberger: column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4); or acquiring multiple audios (Herberger: column 6, lines 58-60 – more than one audio work or De Vos: [0028] – more than one audio item) having the style corresponding to the video content of the target video clip (De Vos: [0021]-[0029]; [0072] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, extreme, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content) and including the audio change points whose total number is greater than or equal to the number of the at least one video clip minus one as the target audio (Herberger: column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4).
	The motivation for combining Herberger and De Vos has been discussed in claim 2 above.
	Regarding claim 6, Herberger also discloses the process of the target video clip being determined comprising: determining a video clip having a longest duration in the at least one video clip as a target video clip; or determining a video clip having a largest weight in the at least one video clip as a target video clip, wherein the weight being (Fig. 4 – determining clip A, which has a longest duration in the at least one video clip as a target video clip).
	Regarding claim 7, see the teachings of Herberger and De Vos as discussed in claim 2 above. De Vos also discloses after determining a style corresponding to the video content of each video clip, respectively, based on the video content of each video clip ([0021]-[0029]; [0072] – analyzing video content to determine category, e.g. beach scene or sunset, time of day, night time, visual effects, or amount of motion, etc., or other styles, e.g. energetic, relaxed, etc., and, based on the analysis of the video content, selecting audio items suitable to the video content), the method further comprising: in response to that respective video contents of the at least one video clip corresponds to multiple styles, determining multiple video clip sets, and video contents of video clips in each of the video clip sets correspond to one of the multiple styles ([0021]-[0029] – multiple styles: mood, activity, and effect as shown in Fig. 4); acquiring an audio having a style corresponding to the video contents of video clips in the video clip set as the target audio for each video clip set ([0021]-[0029]; [0072]-[0074]; [0084] – acquiring an audio item); and determining a plurality of acquired audio as the target audio ([0021]-[0029]; [0072]-[0074]; [0084] – acquiring one or more audio items). Herberger also discloses acquiring an audio including the audio change points as the target audio (column 2, lines 26-59; column 9, lines 4-14; Figs. 3-4 – the audio change points comprise time points at which change in audio feature satisfies specific criteria, e.g. determination of the musical rhythm, such as beat/time signature, of the music, identification of changes in its volume, location within the audio track of a chorus or refrain, identification of changes in musical key, bar locations, strophe and refrain, etc.), wherein the number of the audio change points is greater than or equal to the number of video clips in the video clip set minus one (column 2, lines 26-59; Figs. 3-4 – the number of audio change points, e.g. points M1-M7, which is 7, is greater than or equal to the number of video clips, which is 4).
The motivation for combining Herberger and De Vos has been discussed in claim 2 above.
Regarding claim 8, see the teachings of Herberger and De Vos as discussed in claim 2 above. De Vos also discloses said determining the video content of the respective video clips of the at least one video clip comprising: recognizing the video clip and determining a recognized target object and/or recognized environmental information in the video clip as the video content of the video clip for each video clip ([0021]-[0029] – recognizing time of day, sunset scene, or beach scene).
The motivation for combining Herberger and De Vos has been discussed in claim 2 above.
Regarding claim 9, De Vos also discloses said recognizing the video clip comprising: outputting at least one of the target object and the environmental information of the video clip by inputting the video clip into a video recognition model, wherein the video recognition model is configured to output at least one of the target object and the environmental information based on input video clip ([0021]-[0029] – inputting the video clip for analysis and outputting information indicating time of day of clip or whether the clip is a sunset scene, or beach scene, etc.).

Regarding claim 11, see the teachings of Herberger and De Vos as discussed in claim 2 above. De Vos also discloses said respectively determining the style corresponding to video content of the respective video clips comprising: determining a style corresponding to the video content of the video clip based on the video content of the video clip and a rule between the video content and the corresponding style for each video clip ([0072]).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the further teachings of De Vos into the method taught by Herberger and De Vos proposed in claim 2 above to provide a consistency of style between the audio and the video content as discussed in claim 3 above in order to facilitate identifying a corresponding audio style ([0072] and [0077]).
Regarding claim 12, Herberger also discloses in response to the target audio including one audio (column 6, lines 58-60 – in response to one audio work being selected), the audio change points of the target audio are the audio change points included in the one audio (Figs. 3-4); in response to the target audio including a plurality of audios (column 6, lines 58-60 – in response to more than one audio works being selected), the audio change points of the target audios are the audio change points included in each audio of the plurality of audios (Figs. 3-4); for any one audio, acquiring the audio change points of the audio comprising: determining the audio change points of the audio based on amplitude information of the audio, wherein difference between (column 2, lines 45-59; column 9, lines 4-14 – difference in volume levels based on threshold), and the target time point in the audio being a time point whose time interval with the corresponding audio change point is less than a time threshold (column 2, lines 45-59; column 9, lines 4-14; Figs. 3-4 – at least less than the total duration of the corresponding audio clip, which corresponds to the recited time threshold); or, outputting the audio change points of the audio by an audio recognition model based on input audio (column 2, lines 45-59 – or outputting the audio change points by recognition of beat/time signature).
Regarding claim 14, Herberger also discloses said synthesizing the at least one video clip and the target audio to obtain the video file comprising: determining adjacent audio change points corresponding to respective video clips based on the audio change points of the target audio and a play sequence of the at least one video clip (column 3, lines 12-36); and marking the video clip and an audio clip corresponding to the video clip with a same timestamp based on the adjacent audio change points corresponding to the video clip for each video clip, and synthesizing the video file, wherein the audio clip is an audio clip, in the target audio, between the adjacent audio change points corresponding to the video clip, and the timestamp includes a start timestamp and an end timestamp (column 4, lines 38-47; column 6, lines 31-35; column 10, lines 10-54; Figs. 3-4; column 11, lines 41-49).
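For illustration only, the claim 14 limitation of pairing each video clip, in play sequence, with the audio clip between adjacent audio change points and marking both with the same start and end timestamps can be sketched as below. This is a simplified reading of the claim language, not code from Herberger or from the application. The function name assign_segments, the example times, and the choice to let the last clip run to the end of the audio are all assumptions.

```python
from typing import List, Tuple


def assign_segments(
    clips: List[str], change_points: List[float], audio_duration: float
) -> List[Tuple[str, float, float]]:
    """Pair each clip (in play order) with the audio segment between
    adjacent boundaries and return (clip, start_timestamp, end_timestamp)."""
    n = len(clips)
    # Keep only change points strictly inside the audio, in position order.
    points = sorted(p for p in change_points if 0.0 < p < audio_duration)
    if len(points) < n - 1:
        raise ValueError("need at least len(clips) - 1 audio change points")
    # Assumption: the first n - 1 change points are used as clip boundaries
    # and the last clip runs to the end of the audio.
    boundaries = [0.0] + points[: n - 1] + [audio_duration]
    return [(clip, boundaries[i], boundaries[i + 1]) for i, clip in enumerate(clips)]


# Hypothetical example: four clips and four change points (in seconds).
for clip, start, end in assign_segments(["A", "B", "C", "D"], [2.0, 5.5, 9.0, 12.0], 15.0):
    print(clip, start, end)
```

Each returned tuple carries one start timestamp and one end timestamp that apply to both the video clip and its corresponding audio clip.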
Regarding claim 15, Herberger also discloses said respectively determining adjacent audio change points corresponding to respective video clips comprising: in response to that the target audio comprises one audio (column 6, lines 58-60 – in response to one audio work being selected), determining the adjacent audio change points corresponding to respective video clips based on a position sequence of the audio change points of the target audio and the play sequence of the at least one video clip (column 3, lines 12-36); and in response to that the target audio comprises multiple audios (column 6, lines 58-60 – in response to more than one audio work being selected), determining the adjacent audio change points corresponding to each video clip respectively, based on an audio play sequence of the multiple audios, a position sequence of the audio change points of each of the multiple audios and the video sequence of the at least one video clip (column 3, lines 12-36).
	Claim 33 is rejected for the same reason as discussed in claim 1 above in view of Herberger also disclosing a terminal (column 5, lines 1-6 – a computer), comprising: a processor (column 5, lines 1-6 – a processor of a computer to execute a program stored memory and hard disk storage, enabling the computer to implement the method); and a memory for storing instructions executable by the processor; wherein, the processor is configured to perform the recited steps (column 5, lines 1-6 – a program memory and hard disk storage storing software instructions executed by the processor of the computer, enabling the computer to implement the steps of the method).
	Claim 34 is rejected for the same reason as discussed in claim 1 above in view of Herberger also disclosing a non-transitory computer-readable storage medium having a computer instruction stored thereon, when the computer instruction being executed by a processor of a terminal enable the terminal to implement the recited video synthesis (column 5, lines 1-6 – a program memory and hard disk storage storing software instructions executed by a processor of a user's computer, enabling the computer to implement the method).
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Herberger and De Vos as applied to claims 1-9, 11-12, 14-15, and 33-34 above, and further in view of Smith et al. (US 2019/0042900 A1 – hereinafter Smith).
	Regarding claim 10, see the teachings of Herberger and De Vos as discussed in claim 9 above. However, Herberger and De Vos do not explicitly disclose the video recognition model is acquired by: acquiring a plurality of sample video clips and annotation information of each sample video clip, wherein the annotation information includes at least one of the target object and the environmental information; and obtaining the video recognition model by training a neural network model based on the plurality of sample video clips and the respective annotation information.
	Smith discloses a video recognition model is acquired by: acquiring a plurality of sample video clips and annotation information of each sample video clip, wherein the annotation information includes at least one of the target object and the environmental information ([0612] – acquiring a plurality of sample training scene(s), which are video scenes as further described in [0588], where known objects are tagged using a tag vocabulary); and obtaining the video recognition model by training a neural network model based on the plurality of sample video clips and the respective annotation information ([0612]-[0613] - obtaining the video recognition model by training a neural network model based on the plurality of sample training video scenes and the respective tag information so that, subsequent to the training, objects in other clips can be recognized accordingly).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of Smith into the method taught by Herberger and De Vos to reduce the computing resources required by avoiding the labor-intensive training process that is typically required for every type of object or condition that needs to be recognized (see Smith [0003]).
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Herberger and De Vos as applied to claims 1-9, 11-12, 14-15, and 33-34 above, and further in view of Dimitriadis et al. (US 2018/0166067 A1 – hereinafter Dimitriadis).
Regarding claim 13, see the teachings of Herberger and De Vos as discussed in claim 12 above. However, Herberger and De Vos do not disclose the audio recognition model is acquired by following operations: acquiring a plurality of sample audios and the audio change points marked in each sample audio; and obtaining the audio recognition model by training a neural network model based on the plurality of sample audios and the corresponding audio change points.
Dimitriadis discloses an audio recognition model is acquired by following operations: acquiring a plurality of sample audios and audio change points marked in each sample audio ([0039]-[0041]; Fig. 5 – each change point is a transition from one of a number of speech features to a different one of the speech features); and obtaining the audio recognition model by training a neural network model based on the plurality of sample audios and the corresponding audio change points ([0039]-[0041]; Fig. 5).
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Herberger and De Vos as applied to claims 1-9, 11-12, 14-15, and 33-34 above, and further in view of Eppolito (US 2015/0113408 A1 – hereinafter Eppolito).
Regarding claim 16, see the teachings of Herberger and De Vos as discussed in claim 14 above. Herberger also discloses said marking the video clip and the audio clip corresponding to the video clip with the same timestamp comprising: in response to that a duration of the video clip is equal to a duration of the audio clip corresponding to the video clip, marking the video clip and the audio clip corresponding to the video clip with the same timestamp (column 4, lines 38-47; column 6, lines 31-35; column 10, lines 10-54; Figs. 3-4; column 11, lines 41-49); in response to that a duration of the video clip is greater than a duration of the audio clip corresponding to the video clip, trimming the video clip to obtain a trimmed video clip having the same duration as the audio clip, and marking the trimmed video clip and the audio clip corresponding to the video clip with the same timestamp (column 4, lines 38-47; column 6, lines 31-35; column 10, lines 10-54; Figs. 3-4; column 11, lines 41-49).
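However, Herberger and De Vos do not explicitly disclose in response to that a duration of the video clip is less than a duration of the audio clip corresponding to the video clip, trimming the audio clip corresponding to the video clip to obtain a trimmed audio clip having the same duration as the video clip, and marking the video clip and the trimmed audio clip corresponding to the video clip with the same timestamp.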
Eppolito discloses in response to that a duration of a video clip is less than a duration of an audio clip corresponding to the video clip, trimming the audio clip corresponding to the video clip to obtain a trimmed audio clip having the same duration as the video clip, and marking the video clip and the trimmed audio clip corresponding to the video clip with the same timestamp ([0006]; [0046]; [0054]).
One of ordinary skill in the art before the effective filing date of the claimed invention would have been motivated to incorporate the teachings of Eppolito into the method taught by Herberger and De Vos to allow the user an option to keep the length of the video clip unmodified, e.g. in case adjusting the length of the video clip may cause an unsatisfactory effect.
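For illustration only, the three duration cases addressed for claim 16 (equal durations, a longer video clip, and a longer audio clip) can be summarized in a minimal sketch. This is a simplified reading of the claim language, not code from Herberger, Eppolito, or the application. The function name match_durations and the action labels are hypothetical.

```python
from typing import Tuple


def match_durations(video_duration: float, audio_duration: float) -> Tuple[float, float, str]:
    """Return (video_duration, audio_duration, action) after trimming so that
    both clips have the same duration and can share the same timestamps."""
    if video_duration > audio_duration:
        # Video clip is longer: trim the video clip to the audio clip's duration.
        return audio_duration, audio_duration, "trim_video"
    if video_duration < audio_duration:
        # Audio clip is longer: trim the audio clip to the video clip's duration.
        return video_duration, video_duration, "trim_audio"
    # Durations already match: no trimming is needed.
    return video_duration, audio_duration, "none"


print(match_durations(8.0, 5.0))
print(match_durations(3.0, 5.0))
print(match_durations(5.0, 5.0))
```

In every case the two returned durations are equal, which is what allows the video clip and its audio clip to be marked with the same start and end timestamps.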
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Thai Q Tran, can be reached at 571-272-7382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/HUNG Q DANG/Primary Examiner, Art Unit 2484