DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Examiner has withdrawn the rejections under 35 U.S.C. 112, based on Applicant’s explanation.
Applicant's arguments filed 3/29/2022 with respect to the rejections under 35 U.S.C. 103 have been fully considered but they are not persuasive.
As provided in the rejection of claim 1 below, McCallum teaches determining acoustic feature groups for each beat of one or more beats within each temporal section of audio content ([0028], The example deep feature generator 122 forms a set of deep features 124 for each of the segments 202 formed by the segment extractor 204. Each set of the deep features 124 is placed in a column of a feature matrix 206 by an aggregator 208. [0029], Each of the segments 202 is passed into the example neural network 104 to form a set of deep features 124 for the beat associated with the segment 202. The example aggregator 208 forms the feature matrix 206 by placing the set of deep features 124 into a column for the beat associated with the segment 202. Thus, the feature matrix 206 has a column for each beat, and the data in each column represents the set of deep features 124 associated with the beat.).
Although McCallum teaches a neural network 104, including a distance calculator, for example, that can determine based on audio attributes such as deep features 124, whether the portions of the incoming digital audio are musically similar or musically dissimilar ([0022]), McCallum fails to teach providing the feature groups as input to the CNN to predict candidate cuepoint placements. 
Analogous art Attorre is directed toward using a neural network to accept audio and visual data from portions of the target digital content item as an input and to classify the input as either a positive or negative example of a transition or break between the scenes or stories in the target digital content item, or a positive or negative example of a sequence that leads up to a transition or break between the scenes or stories in the target digital content item ([0010]). [0011] In some embodiments, the computer program in the apparatus and/or the executable instructions in the non-transitory computer-readable medium are operable to cause a processor to detect or predict the host time from the target digital content item based on one or more of:… (7) combination methods including, but not limited to, machine learning models including neural network-based approaches that use models trained on attributes (e.g., audio data, visual data, metadata, or combinations thereof) from positive and negative examples of target digital content that embody or precede a transition in the scenes or stories in the target digital content item. [0202] In some embodiments, the host time identification module 106 is configured to identify and/or score host times or candidate host times in the target digital content by training one or more neural network models, where each neural network is used for a different purpose (e.g., theme recognition) and is trained on features extracted from (i.e., attributes extracted from the audio…). 
[0222] In some embodiments where host times are identified by inputting the visual or audio attributes and/or metadata regarding portions of the target digital content item into machine learning models in order to predict whether the portions of the target digital content item are positive or negative host times or to predict the probability that the input is a positive or negative host time, the resulting label, probability, or score may be included in the host time object as a candidate score or as host time metadata. 
Attorre further teaches determining a cuepoint placement in the media content item from among the candidate cuepoint placements received as output from the neural network ([0066], The host time defining data associated with the target digital content item can then be used by the content integration system or a computer system of a user to select appropriate host times in the target digital content item for inserting source digital content so as to minimize the impact on user experience. For example, the candidate host times can be ranked based on their candidate scores, and one or more highest ranked candidate host times can be selected for inserting source digital content. [0246], The module then chooses the host times within each segment with the highest weighted score.).
For at least these reasons, Examiner respectfully maintains that the prior art of record fully teaches the instant set of claims.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-15 are rejected under 35 U.S.C. 103 as being unpatentable over McCallum (US 2020/0074982) in view of Attorre et al. (US 2019/0035431).
Claim 1
McCallum teaches a method for placing a cuepoint in a media content item, the method comprising: 
receiving at least a portion of audio content of the media content item ([0025], FIG. 2 illustrates an example similarity analysis system 200 including the example incoming digital audio 106); 
normalizing the received audio content into a plurality of beats ([0018], The example beat detector 108 of FIG. 1 generates an example stream of beat markers 110 representing the detected beats (e.g., a stream, list, etc. of timestamps for the detected beats).); 
partitioning the plurality of beats into temporal sections ([0027], For instance, the segment extractor 204 generates a first segment 202 consisting of beats one to four inclusive, a second segment 202 of beats two to five inclusive, a third segment 202 of beats three to six inclusive, etc.); 
for one or more of the temporal sections:
 extracting one or more acoustic feature groups for each beat of one or more beats within the temporal section ([0028] The example deep feature generator 122 forms a set of deep features 124 for each of the segments 202 formed by the segment extractor 204. Each set of the deep features 124 is placed in a column of a feature matrix 206 by an aggregator 208. [0029], Each of the segments 202 is passed into the example neural network 104 to form a set of deep features 124 for the beat associated with the segment 202. The example aggregator 208 forms the feature matrix 206 by placing the set of deep features 124 into a column for the beat associated with the segment 202. Thus, the feature matrix 206 has a column for each beat, and the data in each column represents the set of deep features 124 associated with the beat.); and
providing the extracted acoustic feature groups for the one or more beats within the temporal section as input to a convolutional neural network (CNN) to predict candidate cuepoint placements ([0022] The example neural network 104 of FIG. 1 is any type, configuration, architecture of convolutional neural network (CNN). An example convolutional neural network architecture that can be used to implement the example neural network 104 is shown in FIG. 12. The neural network 104 has an example deep feature generator 122 that generates, develops, forms, computes, etc. so called deep features 124 that can be combined e.g., by a distance calculator 126 of some sort, that generates a distance metric that can be used to embed and/or classify audio, data, objects, information, etc. The deep features 124 computed by the deep feature generator 122 may represent classes and/or descriptors of audio, data, objects, information, etc.  [0030], similarity processor 212 determines similarity and/or dissimilarity of each portion of the incoming digital audio 106 with other portions of the incoming digital audio 106. [0043], Convolving with the checkerboard kernel 512 along the diagonal produces a one dimensional novelty function vector 520 which may in turn be used to identify audio segment boundaries. [0045], Peaks (e.g., a peak 706) in the plot 704 correspond to audio segment boundaries identified by the similarity processor 500 in the incoming digital audio 106.); and 
Although McCallum teaches “In some examples, if there are multiple peaks within a short time window (e.g., 8 or 16 beats), then only the peak with the highest novelty value is selected” [0046], McCallum may not explicitly detail determining a cuepoint placement in the media content item from among the candidate cuepoint placements received as output from the CNN.  
Attorre teaches determining a cuepoint placement in the media content item from among the candidate cuepoint placements received as output from the neural network ([0066], The host time defining data associated with the target digital content item can then be used by the content integration system or a computer system of a user to select appropriate host times in the target digital content item for inserting source digital content so as to minimize the impact on user experience. For example, the candidate host times can be ranked based on their candidate scores, and one or more highest ranked candidate host times can be selected for inserting source digital content. [0246], The module then chooses the host times within each segment with the highest weighted score.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate selection of cuepoint placements/host times as taught by Attorre with the audio segmentation method of McCallum, because doing so would have provided a way to classify the elements or frames, or segments thereof, comprising the audio aspect of the target digital content according to the types of sound they represent, such as human speech, music, or silence, and searching for successive pairs or sequences of elements or frames that are classified differently and thus indicate that there has been a transition in the story in the target digital content item ([0024] of Attorre).
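For illustration only, the boundary-detection passages of McCallum cited above ([0043], [0045], [0046]) can be sketched as follows. This is a minimal sketch, not McCallum's implementation: the function names, the kernel values of +1/-1, the kernel size, and the greedy suppression rule are all assumptions made for the example. It slides a checkerboard kernel along the diagonal of a beat-by-beat self-similarity matrix to obtain a one-dimensional novelty curve, then keeps only the highest peak within a short window of beats.

```python
def checkerboard_kernel(half):
    """Hypothetical kernel: +1 in the two on-diagonal quadrants, -1 elsewhere."""
    size = 2 * half
    return [[1 if (r < half) == (c < half) else -1 for c in range(size)]
            for r in range(size)]


def novelty_curve(ssm, half):
    """Slide the kernel along the diagonal of a square self-similarity matrix."""
    n = len(ssm)
    kernel = checkerboard_kernel(half)
    novelty = [0.0] * n
    # Position t means the kernel is centered on the boundary before beat t.
    for t in range(half, n - half + 1):
        total = 0.0
        for r in range(2 * half):
            for c in range(2 * half):
                total += kernel[r][c] * ssm[t - half + r][t - half + c]
        novelty[t] = total
    return novelty


def pick_peaks(novelty, window):
    """Keep local maxima; within `window` beats keep only the highest one."""
    peaks = [i for i in range(1, len(novelty) - 1)
             if novelty[i] > novelty[i - 1] and novelty[i] >= novelty[i + 1]]
    kept = []
    # Visit peaks from highest to lowest so a stronger peak suppresses neighbors.
    for i in sorted(peaks, key=lambda i: novelty[i], reverse=True):
        if all(abs(i - j) >= window for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy input (assumed): beats 0-3 resemble each other, beats 4-7 resemble each other.
toy_ssm = [[1.0 if (i < 4) == (j < 4) else 0.0 for j in range(8)] for i in range(8)]
toy_novelty = novelty_curve(toy_ssm, half=2)
toy_boundaries = pick_peaks(toy_novelty, window=4)
```

In the toy input the novelty curve peaks at beat 4, the point where the two blocks of mutually similar beats meet.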
Claim 2 
McCallum in view of Attorre further teaches the method of claim 1, wherein determining the cuepoint placement in the media content item comprises: receiving as output from the CNN (neural network of Fig. 2 of McCallum), for each of the one or more of the temporal sections, a probability that a beat immediately following the temporal section is a candidate cuepoint placement; comparing the received probability across the one or more of the temporal sections; and determining to place the cuepoint at the beat immediately following the temporal section having the highest probability based on the comparison ([0066] of Attorre, The host time defining data associated with the target digital content item can then be used by the content integration system or a computer system of a user to select appropriate host times in the target digital content item for inserting source digital content so as to minimize the impact on user experience. For example, the candidate host times can be ranked based on their candidate scores, and one or more highest ranked candidate host times can be selected for inserting source digital content; [0135], In some embodiments, the host time identification module 106 can be configured to, at any point during host time identification, assign to one or more times, frame numbers, or audio element indices of the target digital content item a score, ranking, or probability that indicates the likelihood that each host time, candidate host time, or time interval represents a host time or candidate host time or its relative attractiveness as a host time or candidate host time ("candidate score"). [0139], In some embodiments, the host time identification module 106 can be configured to select one or more candidate host times using one or more host time identification processes before passing the candidate host times to one or more other processes where the list is further refined or where final host times are selected from the candidate host times. [0246], The module then chooses the host times within each segment with the highest weighted score).
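For illustration only, the comparison step recited in claim 2 and the ranking described in [0066] of Attorre can be sketched as below. This is a minimal sketch under stated assumptions: the function names and the (beat index, probability) pair layout are hypothetical and appear in neither reference.

```python
def rank_candidates(section_scores):
    """Order (next_beat_index, probability) pairs from most to least likely."""
    return sorted(section_scores, key=lambda pair: pair[1], reverse=True)


def place_cuepoint(section_scores):
    """Return the beat index that follows the highest-probability section."""
    best_beat, _best_probability = max(section_scores, key=lambda pair: pair[1])
    return best_beat


# Assumed example: three temporal sections, each scored for the beat after it.
example_scores = [(4, 0.10), (5, 0.70), (6, 0.30)]
example_ranking = rank_candidates(example_scores)
example_cuepoint = place_cuepoint(example_scores)
```

Here the section preceding beat 5 carries the highest probability, so beat 5 is the selected placement.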
Claim 3 
McCallum in view of Attorre further teaches the method of claim 2, further comprising: automatically placing the cuepoint at the beat immediately following the temporal section having the highest probability based on the comparison ([0067] of Attorre, Techniques disclosed herein enable the seamless and unobtrusive integration of digital content, such as advertisements or informational messages, at host times inside a target digital content item in an automated or semi-automated fashion, allowing for the efficient placement of the advertisements or other augmentations at a high throughput.).  
Claim 4 
McCallum in view of Attorre further teaches the method of claim 1, wherein extracting the one or more acoustic feature groups for each beat within the temporal section comprises: extracting one or more of downbeat confidence, position in bar, loudness, timbre, and pitch ([0022] of McCallum, For example, the deep feature generator 122 may generate deep features 124 that are representative of pitch, melodies, chords, rhythms, timbre modulation, instruments, production methods and/or effects (e.g., filtering, compression, panning), vocalists, dynamics etc.).
Claim 5 
McCallum in view of Attorre further teaches the method of claim 1, further comprising:  26Attorney Docket No. 04777.0159US01Patent training the CNN with training data, the training data including media content items with previously labeled cuepoints (Fig. 1 of McCallum with training data generator 102; [0038] The training data generator 300 generates multiple triplet examples to form a batch to train the neural network 104 via an optimization algorithm, e.g., via stochastic gradient descent or the Adam adaptive moment optimization algorithm. In some examples the training data generator 300 will take examples from multiple songs or audio streams to form one batch. Additionally, and/or alternatively, it will take multiple examples from each of a set of individual songs or audio streams, where this set may consist of one or more elements. [0076] of Attorre, machine learning model or a neural network trained on past examples of preferable host times, that a portion of the target digital content item represents a positive example of host time or satisfies some prediction score for a host time or candidate host time.). 


Claim 6 
McCallum in view of Attorre further teaches the method of claim 5, wherein the training data includes a reference to an identifier of a respective media content item, a millisecond time stamp for a start cuepoint of the respective media content item, and a millisecond time stamp for an end cuepoint of the respective media content item ([0011] of Attorre, neural network-based approaches that use models trained on attributes (e.g., audio data, visual data, metadata, or combinations thereof) from positive and negative examples of target digital content that embody or precede a transition in the scenes or stories in the target digital content item. [0035], use a neural network model to detect a host time by ingesting the visual and/or audio features of portions of the target digital content item in order to predict; [0066], In this example, host time defining data is generated based on the candidate host times and the corresponding candidate scores, and is associated with the target digital content item, such as saved as metadata of the target digital content item.  The host time defining data associated with the target digital content item can then be used by the content integration system or a computer system of a user to select appropriate host times in the target digital content item for inserting source digital content so as to minimize the impact on user experience. See also [0067]; [0071], The host time or host frame can include, for example, a timestamp (e.g., with respect to the beginning of the digital content item), a frame number, an audio element index number, or any other indicator that identifies a specific time instant or moment in the target digital content item. [0076], machine learning model or a neural network trained on past examples of preferable host times, that a portion of the target digital content item represents a positive example of host time or satisfies some prediction score for a host time or candidate host time.).  

Claim 7 
McCallum in view of Attorre further teaches the method of claim 1, wherein partitioning the plurality of beats into temporal sections comprises: partitioning the plurality of beats into sliding window lengths comprised of N beats ([0027] of McCallum, For instance, the segment extractor 204 generates a first segment 202 consisting of beats one to four inclusive, a second segment 202 of beats two to five inclusive, a third segment 202 of beats three to six inclusive, etc.).
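For illustration only, the overlapping segmentation quoted from [0027] of McCallum (beats one to four, two to five, three to six) corresponds to a sliding window of N = 4 beats advanced one beat at a time. The sketch below is a minimal example; the function name is hypothetical, and the hop of one beat is inferred from the quoted example rather than stated as a general rule.

```python
def sliding_windows(beats, n):
    """Return every run of n consecutive beats, advancing one beat at a time."""
    return [beats[start:start + n] for start in range(len(beats) - n + 1)]


# Six beats numbered as in the quoted passage, with N = 4.
example_segments = sliding_windows([1, 2, 3, 4, 5, 6], 4)
```

With six beats and N = 4 this yields three segments, matching the first, second, and third segments 202 described in the passage.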
Claim 8 
McCallum in view of Attorre further teaches the method of claim 1, wherein the cuepoint is a start cuepoint or an end cuepoint (Fig. 4C of Attorre, illustrating host time/cuepoint).  
Claim 9 
McCallum in view of Attorre further teaches the method of claim 8, wherein if the cuepoint is the start cuepoint, further comprising: selecting the one or more of the temporal sections to include temporal sections comprising a first N beats of the media content item ([0072] of Attorre, In some embodiments, the host time defining data also includes metadata about the one or more host times, candidate host times, or time intervals, such as any visual or audio feature at or around each host time, candidate host time, or time interval. In some embodiments, the host time defining data also includes digital content transformation objects that can be used to transform or adjust the visual and/or audio content of the source digital content for more seamless integration of the source digital content into the target digital content item. [0218] of Attorre, indicating a fade-in, fade-out, or fade-in/fade-out sequence, with such frames being chosen as candidate host times; Examiner notes that the frames are interpreted to contain N beats in either the fade-in or the fade-out process. [0228], indicating a fade-in, fade-out, or fade-in/fade-out sequence, with such frames being chosen as candidate host times. In step 1104, the content integration system 100 is configured to accept the candidate host times output at step 1102 as input and pass the median frames from the intervals on both side of each candidate host time (e.g., the interval between each candidate host time and the next candidate host time) through one or more neural networks to determine the content vectors for the median frames, where determining the content vector for a median frame includes operations in steps 1106-1110.).
Claim 10 
McCallum in view of Attorre further teaches the method of claim 8, wherein if the cuepoint is the end cuepoint, further comprising: selecting the one or more of the temporal sections to include temporal sections comprising a last N beats of the media content item ([0218] of Attorre, indicating a fade-in, fade-out, or fade-in/fade-out sequence, with such frames being chosen as candidate host times; Examiner notes that the frames are interpreted to contain N beats in either the fade-in or the fade-out process. [0228], indicating a fade-in, fade-out, or fade-in/fade-out sequence, with such frames being chosen as candidate host times. In step 1104, the content integration system 100 is configured to accept the candidate host times output at step 1102 as input and pass the median frames from the intervals on both side of each candidate host time (e.g., the interval between each candidate host time and the next candidate host time) through one or more neural networks to determine the content vectors for the median frames, where determining the content vector for a median frame includes operations in steps 1106-1110.).
Claim 11 
McCallum in view of Attorre further teaches the method of claim 1, wherein normalizing the received audio content into the plurality of beats comprises: receiving the at least portion of the audio content in a raw audio format; and normalizing the raw audio format into the plurality of beats ([0018] of McCallum, To detect beats in incoming digital audio 106, the example training system 100 includes an example beat detector 108. The example beat detector 108 of FIG. 1 generates an example stream of beat markers 110 representing the detected beats (e.g., a stream, list, etc. of timestamps for the detected beats). In music, the beat is the basic unit of time or pulse of the music.).
Claim 12
This claim recites substantially the same limitations as those provided in claim 1 above, and therefore it is rejected for the same reasons.
McCallum further teaches a system for placing a cuepoint in a media content item, the system comprising: a convolutional neural network (CNN) (104 of Fig. 2 of McCallum); and a server communicatively coupled to the CNN, the server comprising at least one processing device and a memory coupled to the at least one processing device and storing instructions, that when executed by the at least one processing device, cause the at least one processing device ([0055], FIG. 11 is a block diagram of an example processor platform 1100 structured to execute the instructions of FIG. 10 to implement the training system 100, the similarity analysis system 200, the training data generator 300, and the similarity processor 500 of FIGS. 1-3 and 5. The processor platform 1100 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network)).
Claim 13 
McCallum in view of Attorre further teaches the system of claim 12, wherein the CNN is configured to: apply one or more convolutional layers to each of the extracted acoustic feature groups from a temporal section to learn features of each acoustic feature group; apply a final convolutional layer to combine the learned features from each acoustic feature group (1108 of Fig. 11 of Attorre); and determine a probability that a beat immediately following the temporal section is a candidate cuepoint placement based on the combined learned features ([0072] of Attorre, The host time defining data can also include a score, ranking, or probability that indicates the likelihood that each host time, candidate host time, or time interval represents a transition or break, or indicate the relative attractiveness of each specific time or moment as a host time or candidate host time. [0219] of Attorre, (N) neural networks including but not limited to: (i) convolutional neural networks as described in Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS (2012), which is herein incorporated by reference in its entirety (including the unique implementation where: the neurons in the network are grouped in different layers, each layer analyses windows of a frame and determines an output score for each pixel, the highest score pixels are the ones in windows that match a region of that frame that is suitable for hosting, in an aesthetically-pleasing and unobtrusive manner, source digital content, and the output scores are used to determine the coordinates of the regions of that frame that are best suitable for hosting, in an aesthetically-pleasing and unobtrusive manner, source digital content)).
Claim 14 
McCallum in view of Attorre further teaches the system of claim 13, wherein the CNN is configured to perform one or more of a rectified linear unit activation function, a batch normalization, and dropout after applying each convolution layer (1110 of Fig. 11 of Attorre; [0126], normalizing audio features; see also 1306 and 1314 with corresponding disclosure at [0230] of Attorre, averaging, resizing, etc.).


Claim 15 
McCallum in view of Attorre further suggests the system of claim 13, wherein the final convolutional layer is a dense layer followed by a sigmoid activation ([0219] of Attorre, (A) a linear classifier; (B) a Fisher's linear discriminant; (C) a logistic regression).
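For illustration only, the sketch below shows the computation performed by a single-output dense layer followed by a sigmoid activation: a weighted sum of the inputs plus a bias, passed through the logistic function. A logistic regression model evaluates the same expression. The function names and the example weights are hypothetical, and the sketch describes the arithmetic only; it does not reproduce any implementation disclosed in Attorre or in the application.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(features, weights, bias):
    """Single-output dense layer (weighted sum plus bias), then sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(weighted_sum)


# Assumed example values: the weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0.
example_probability = dense_sigmoid([1.0, 2.0], [0.5, -0.25], 0.0)
```

A weighted sum of zero maps to a probability of 0.5, the midpoint of the sigmoid.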


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS H MAUNG whose telephone number is (571)270-5690. The examiner can normally be reached Monday-Friday, 9am-6pm, EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin, can be reached at (571) 272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THOMAS H MAUNG/
Primary Examiner, Art Unit 2654