Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
1.	The following Office action is in response to communications filed on 3/2/2020.  Claims 1-2, 4-13, 15-19, 21-23 are currently pending within this application.

Information Disclosure Statement
2.	The information disclosure statement(s) filed on 3/2/2020 is/are in compliance with the provisions of 37 CFR 1.97, and has/have been considered and a copy/copies is/are enclosed with this Office action.
	NOTE:  Some of the citations were not considered due to the following: missing date of publication for non-patent literature documents.

Claim Rejections – 35 USC § 102

3.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

4.	Claims 1-2, 4-5, 7, 11-13, 15-16, 19, 21-23 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Paluri (US PGPub 2017/0046613) [hereafter Paluri].

5.	As to claim 1, Paluri discloses a method (as shown in Figure 4) for identifying an object within a video sequence (content item 202), wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images: using a first neural network (generalized convolutional neural network 104/204 as shown in Figures 1-2) to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks (subsequent neural networks 110-122 or 206-228) to identify the object determined as being depicted within the image (Paragraphs 0023-0028, 0031-0039, 0047).
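
For illustration only, the two-stage arrangement recited in claim 1 may be sketched as follows. This is a minimal Python sketch and is not taken from Paluri or from the present application. The names Frame, Detector, Identifier and identify_in_sequence are hypothetical, and the majority vote used to combine the outputs of the second networks is an assumption made only so that the example is complete.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

Frame = List[List[float]]            # hypothetical stand-in for one decoded image
Detector = Callable[[Frame], bool]   # first network: is an object of the type depicted?
Identifier = Callable[[Frame], str]  # one ensemble member: which object is it?


def identify_in_sequence(
    frames: Sequence[Frame],
    detector: Detector,
    ensemble: Sequence[Identifier],
) -> List[Optional[str]]:
    """Return one label per frame, or None where the detector finds nothing."""
    if not ensemble:
        raise ValueError("the ensemble must contain at least one identifier")
    results: List[Optional[str]] = []
    for frame in frames:
        # Stage 1: the first network gates the (costlier) second stage.
        if not detector(frame):
            results.append(None)
            continue
        # Stage 2: every ensemble member proposes a label; a majority vote
        # (an assumption of this sketch) selects the final identification.
        votes = Counter(member(frame) for member in ensemble)
        label, _count = votes.most_common(1)[0]
        results.append(label)
    return results
```

The second stage runs only on images for which the first stage reports a detection, which is the conditional relationship ("in response to") that the claim language requires.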

6.	As to claim 2, Paluri discloses the first neural network and/or one or more of the second neural networks is a convolutional neural network or a deep convolutional neural network (Paragraphs 0023-0024, 0031).

7.	As to claim 4, Paluri discloses generating a plurality of candidate images (image patches containing ROIs) from the image; using the first neural network to determine, for each of the candidate images, an indication of whether or not an object of the predetermined type is depicted in said candidate image; and using the indications to determine whether or not an object of the predetermined type is depicted within the image (Paragraphs 0026-0028, 0031-0032, 0034-0039).

8.	As to claim 5, Paluri discloses one or more of the candidate images is generated from the image by performing one or more geometric transformations on an area of the image (Paragraphs 0025-0026).
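
For illustration only, the candidate-image limitations of claims 4 and 5 may be sketched as follows. This is a minimal Python sketch under stated assumptions and does not reproduce Paluri's implementation. The helper names are hypothetical; the particular transformations (a horizontal flip and a 90-degree rotation of a cropped area) and the rule that combines the per-candidate indications (a minimum number of positive indications) are assumptions chosen for the example.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # hypothetical stand-in for pixel data


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Extract a rectangular area of the image."""
    return [row[left:left + width] for row in image[top:top + height]]


def flip_horizontal(patch: Image) -> Image:
    """Mirror the patch left to right."""
    return [list(reversed(row)) for row in patch]


def rotate_90(patch: Image) -> Image:
    """Rotate the patch 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def candidates(image: Image, area: Tuple[int, int, int, int]) -> List[Image]:
    """Generate candidate images from one area via geometric transformations."""
    patch = crop(image, *area)
    return [patch, flip_horizontal(patch), rotate_90(patch)]


def object_present(
    image: Image,
    area: Tuple[int, int, int, int],
    score: Callable[[Image], float],  # stand-in for the first neural network
    threshold: float = 0.5,
    min_hits: int = 1,
) -> bool:
    """Combine the per-candidate indications into one decision for the image."""
    indications = [score(c) >= threshold for c in candidates(image, area)]
    return sum(indications) >= min_hits
```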

9.	As to claim 7, Paluri discloses the predetermined type is a face or a person (Paragraph 0024).

10.	As to claim 11, the Paluri reference discloses all claimed subject matter as explained above with respect to the comments/citations of claim 1.

11.	As to claim 12, Paluri discloses the amount of content is one of: (a) an image; (b) an image of a video sequence that comprises a sequence of images; and (c) an audio snippet (Paragraph 0004).

12.	As to claims 13, 15-16, 19, 21-23, the Paluri reference discloses all claimed subject matter as explained above with respect to the comments/citations of claims 1-2, 4-5, and 12.

13.	Claims 1-2, 4-5, 7-8, 11-13, 15-19 and 21-23 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Chakraborty (US PGPub 2017/0185872) [hereafter Chakraborty].

14.	As to claim 1, Chakraborty discloses a method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images (see Abstract, a machine learning model is configured to detect objects from video images. A system monitors video images to identify particular objects. A deep learning process is utilized to learn a baseline pattern. A change due to movement within a field of view is autonomously detected using the deep learning processing. An action is performed based on the detected change, also page 4, paragraph, [0041] neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input);
	using a first neural network to determine whether or not an object of a predetermined type is depicted within the image (see page 2, paragraph, [0033] aspects of the present disclosure are directed to systems and methods utilizing deep learning processing for training a video camera system to automatically detect objects in an isolated network system (e.g., a network not connected to a cloud). The method and system utilize deep learning processing to learn baseline patterns and then operate autonomously to detect a change due to movement within the field of view. In other words, objects of interest are detected in a scene. Based on the detection, further action may be taken, such as storing images and attempting to identify objects previously learned. The detectable objects are learned from training. Also page 3, paragraph, [0041] neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input, also page 5, paragraph, [0062] aspects of the present disclosure are directed to utilizing deep learning processing to detect objects from video images. In a deep learning architecture, the network learns to recognize and extract features (including feature vectors) based on examples provided during a training phase. The training phase may include back propagation in deep neural networks);
	and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify the object determined as being depicted within the image (page 5, paragraphs, [0063-0064] in one aspect, the deep learning network trains a system to monitor video images to identify particular objects (e.g., a car, person, etc.). The detection of an object triggers an action (where the action may be pre-configured by a user). For example, when a change is detected, various actions may be performed. Examples of actions include, but are not limited to, storing an image for review by a user, and identifying an object detected within the field of view. The action may be a user-configured action. In one aspect, a camera controller is taught to determine the baseline pattern. Additionally, the camera controller is configured to learn the baseline pattern, rather than having the baseline programmed into the camera controller. For example, when video images show empty frames, the camera controller is taught how to determine when an object enters the frame, such as an adult, child, or cat walking into the frame. The camera controller is also taught to determine when there is no object movement in the field of view of the camera as compared to when movement occurs. The process is repeated during different times of the day to account for various lighting conditions. Also, paragraphs [0043-0043] disclose locally connected neural networks may be well suited to problems in which the spatial location of inputs is meaningful. For instance, a network 300 designed to recognize visual features from a car-mounted camera may develop high layer neurons with different properties depending on their association with the lower versus the upper portion of the image. Neurons associated with the lower portion of the image may learn to recognize lane markings, for example, while neurons associated with the upper portion of the image may learn to recognize traffic lights, traffic signs, and the like. A deep convolutional network (DCN) (second neural networks) may be trained with supervised learning. During training, a DCN may be presented with an image, such as a cropped image of a speed limit sign 326, and a "forward pass" may then be computed to produce an output 322. The output 322 may be a vector of values corresponding to features such as "sign," "60," and "100." The network designer may want the DCN to output a high score for some of the neurons in the output feature vector, for example the ones corresponding to "sign" and "60" as shown in the output 322 for a network 300 that has been trained. Before training, the output produced by the DCN is likely to be incorrect, and so an error may be calculated between the actual output and the target output. The weights of the DCN may then be adjusted so that the output scores of the DCN are more closely aligned with the target).

15.	As to claim 2, Chakraborty discloses the first neural network and/or one or more of the second neural networks is a convolutional neural network or a deep convolutional neural network (see page 3, paragraphs, [0043-0043] locally connected neural networks may be well suited to problems in which the spatial location of inputs is meaningful. For instance, a network 300 designed to recognize visual features from a car-mounted camera may develop high layer neurons with different properties depending on their association with the lower versus the upper portion of the image. Neurons associated with the lower portion of the image may learn to recognize lane markings, for example, while neurons associated with the upper portion of the image may learn to recognize traffic lights, traffic signs, and the like. A deep convolutional network (DCN) may be trained with supervised learning. During training, a DCN may be presented with an image, such as a cropped image of a speed limit sign 326, and a "forward pass" may then be computed to produce an output 322. The output 322 may be a vector of values corresponding to features such as "sign," "60," and "100." The network designer may want the DCN to output a high score for some of the neurons in the output feature vector, for example the ones corresponding to "sign" and "60" as shown in the output 322 for a network 300 that has been trained. Before training, the output produced by the DCN is likely to be incorrect, and so an error may be calculated between the actual output and the target output. The weights of the DCN may then be adjusted so that the output scores of the DCN are more closely aligned with the target).

16.	As to claim 4, Chakraborty discloses using a first neural network to determine whether or not an object of a predetermined type is depicted within the image comprises (see claim 1, also page 6, paragraph, [0072] the camera controller can learn an object when an object is detected and is present for M frames. For example, a family member, dog, or specific car type can be learned. Classification scores are assigned to each frame (e.g., Bayesian information criterion scores). The assigned scores are evaluated and possible candidates are determined);
	generating a plurality of candidate images from the image; using the first neural network to determine, for each of the candidate images, an indication of whether or not an object of the predetermined type is depicted in said candidate image; and using the indications to determine whether or not an object of the predetermined type is depicted within the image (see page 6, paragraphs, [0076-0077] FIG. 10 illustrates an example flow diagram 1000 utilized by the camera controller. In particular, the camera controller is trained to learn a baseline from idle images (e.g., images where no movement is detected), starting at block 1002. In one example, a first set of feature vectors may be extracted from a first video frame. The first set of feature vectors may be represented as first baseline scores. Additionally, a second set of feature vectors may be extracted from a second video frame. The second set of feature vectors may be represented as second baseline scores. For images that are very similar to each other, the first and second baseline scores may likewise be similar. A baseline pattern may be established when the first and second baseline scores are similar to one another. In some aspects, a final baseline score may be calculated based on the average of the first and second baseline scores. Once an object enters the frame, at block 1004, the camera controller determines whether an object is present, at block 1006. For example, when a person (or object) enters a frame, the first and second baseline scores are different. The larger the difference between the first and second baseline scores, then the bigger the change in the images).
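
For illustration only, the baseline-score comparison described in the cited paragraphs [0076-0077] may be sketched as follows. This is a minimal Python sketch and not Chakraborty's implementation. The function names are hypothetical, and two details are assumptions of the sketch: a feature vector is reduced to a score by taking its mean, and "similar" is taken to mean an absolute difference within a caller-supplied tolerance.

```python
from typing import Optional, Sequence


def baseline_score(features: Sequence[float]) -> float:
    """Reduce a feature vector to a single score (assumed here: the mean)."""
    return sum(features) / len(features)


def establish_baseline(
    first: Sequence[float], second: Sequence[float], tolerance: float
) -> Optional[float]:
    """Return the averaged baseline score if two idle frames agree, else None."""
    s1, s2 = baseline_score(first), baseline_score(second)
    if abs(s1 - s2) > tolerance:
        return None  # frames differ too much to serve as a baseline pattern
    return (s1 + s2) / 2


def change_detected(
    baseline: float, features: Sequence[float], tolerance: float
) -> bool:
    """Flag a frame whose score departs from the baseline by more than tolerance."""
    return abs(baseline_score(features) - baseline) > tolerance
```

Under these assumptions, a larger gap between a frame's score and the baseline corresponds to a larger change in the image, consistent with the quoted passage.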

17.	As to claim 5, Chakraborty discloses one or more of the candidate images is generated from the image by performing one or more geometric transformations (such as a rotation, (zoom-in or zoom-out), shear or scaling) on an area of the image (see page 5, paragraph, [0059] FIG. 5 is a block diagram illustrating the run-time operation 500 of an AI application on a smartphone 502. The AI application may include a pre-process module 504 that may be configured (using for example, the JAVA programming language) to convert the format of an image 506 and then crop and/or “resize the image” 508. The pre-processed image may then be communicated to a classify application 510 that contains a Scene Detect Backend Engine 512 that may be configured (using for example, the C programming language) to detect and classify scenes based on visual input. The Scene Detect Backend Engine 512 may be configured to further preprocess 514 the image by “scaling” 516 and cropping 518. For example, the image may be scaled and cropped so that the resulting image is 224 pixels by 224 pixels. These dimensions may map to the input dimensions of a neural network. The neural network may be configured by a deep neural network block 520 to cause various processing blocks of the SOC 100 to further process the image pixels with a deep neural network. The results of the deep neural network may then be thresholded 522 and passed through an exponential smoothing block 524 in the classify application 510. The smoothed results may then cause a change of the settings and/or the display of the smartphone 502).
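
For illustration only, the scaling, cropping, thresholding and exponential smoothing steps named in the cited paragraph [0059] may be sketched as follows. This is a minimal Python sketch and not Chakraborty's implementation. The function names are hypothetical, and several details are assumptions of the sketch: the image is scaled so that its shorter side equals 224 pixels, the crop is centered, scores below the cutoff are set to zero, and the smoothing uses the standard recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1).

```python
from typing import List, Sequence, Tuple


def scaled_size(width: int, height: int, target: int = 224) -> Tuple[int, int]:
    """Scale so the shorter side equals target, preserving aspect ratio."""
    scale = target / min(width, height)
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width: int, height: int, target: int = 224) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered target-by-target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def threshold(scores: Sequence[float], cutoff: float) -> List[float]:
    """Zero out network scores that fall below the cutoff."""
    return [s if s >= cutoff else 0.0 for s in scores]


def exponential_smoothing(values: Sequence[float], alpha: float) -> List[float]:
    """Smooth a sequence of per-frame results; the first value seeds the state."""
    smoothed: List[float] = []
    previous = None
    for value in values:
        previous = value if previous is None else alpha * value + (1 - alpha) * previous
        smoothed.append(previous)
    return smoothed
```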

18.	As to claim 7, Chakraborty discloses the predetermined type is a face or a person (see claim 6, also page 6, paragraphs, [0071-0072] FIG. 9 illustrates images where a person walks into the frames and then exits the frames. The graph below the images illustrates the change in the classification scores as the person enters and exits the frames. The camera controller can learn an object when an object is detected and is present for M frames. For example, a family member, dog, or specific car type can be learned. Classification scores are assigned to each frame (e.g., Bayesian information criterion scores). The assigned scores are evaluated and possible candidates are determined).

19.	As to claim 8, Chakraborty discloses associating metadata with the image based on the identified object (see claim 6, also page 6, paragraphs, [0074-0075] the objects can be grouped in bins for user assisted “labeling”. In one aspect, the grouping is based purely on time, where the frames are successive, to assist the user in labeling the objects properly. The labeling facilitates training. Once the objects are learned, similar objects are classified the same in the future. Additionally, for learned objects, their respective classification scores are compared. Upon recognizing an object, or upon the occurrence of a specific event, a user can be notified. Also paragraph, [0079] if the object is not new, at block 1010, an action may be triggered, at block 1012. For example, the classified objects may be compared against previously known objects to determine whether the object is not new. Once the objects are learned, similar objects are classified the same in the future. Additionally, for learned objects, their respective classification scores are compared. In one aspect, the objects can be grouped in bins for user assisted labeling. For example, the grouping can be based purely on time, where the frames are successive, to assist the user in labeling the objects properly. And page 7, paragraph, [0083] from block 1016, the process may branch into two simultaneous or nearly simultaneous actions. At block 1018, the process performs training on the saved images to learn the determined new object at block 1020. In one example, the camera controller can learn an object when an object is detected and is present for M frames. For example, a family member, dog, or specific car type can be learned. Classification scores are assigned to each frame (e.g., Bayesian information criterion scores)).

20.	As to claim 11, Chakraborty discloses using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content (see claim 1, also page 3, paragraphs, [0043-0043] locally connected neural networks may be well suited to problems in which the spatial location of inputs is meaningful. For instance, a network 300 designed to recognize visual features from a car-mounted camera may develop high layer neurons with different properties depending on their association with the lower versus the upper portion of the image. Neurons associated with the lower portion of the image may learn to recognize lane markings, for example, while neurons associated with the upper portion of the image may learn to recognize traffic lights, traffic signs, and the like. A deep convolutional network (DCN) may be trained with supervised learning. During training, a DCN may be presented with an image, such as a cropped image of a speed limit sign 326, and a "forward pass" may then be computed to produce an output 322. The output 322 may be a vector of values corresponding to features such as “sign,” "60," and "100." The network designer may want the DCN to output a high score for some of the neurons in the output feature vector, for example the ones corresponding to "sign" and "60" as shown in the output 322 for a network 300 that has been trained. Before training, the output produced by the DCN is likely to be incorrect, and so an error may be calculated between the actual output and the target output. The weights of the DCN may then be adjusted so that the output scores of the DCN are more closely aligned with the target).

21.	As to claim 12, Chakraborty discloses the amount of content is one of: (a) an image; (b) an image of a video sequence that comprises a sequence of images; and (c) an audio snippet (see page 3, paragraph, [0039] a deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases. Also page 4, paragraph, [0055] the deep convolutional network 350 may also include one or more fully connected layers (e.g., FC1 and FC2). The deep convolutional network 350 may further include a logistic regression (LR) layer. Between each layer of the deep convolutional network 350 are weights (not shown) that are to be updated. The output of each layer may serve as an input of a succeeding layer in the deep convolutional network 350 to learn hierarchical feature representations from input data (e.g., images, audio, video, sensor data and/or other input data) supplied at the first convolution block C1).

22.	As to claim 13, Chakraborty discloses the first neural network and/or one or more of the second neural networks is a convolutional neural network or a deep convolutional neural network (see page 3, paragraphs, [0043-0043] locally connected neural networks may be well suited to problems in which the spatial location of inputs is meaningful. For instance, a network 300 designed to recognize visual features from a car-mounted camera may develop high layer neurons with different properties depending on their association with the lower versus the upper portion of the image. Neurons associated with the lower portion of the image may learn to recognize lane markings, for example, while neurons associated with the upper portion of the image may learn to recognize traffic lights, traffic signs, and the like. A deep convolutional network (DCN) may be trained with supervised learning. During training, a DCN may be presented with an image, such as a cropped image of a speed limit sign 326, and a "forward pass" may then be computed to produce an output 322. The output 322 may be a vector of values corresponding to features such as "sign," "60," and "100." The network designer may want the DCN to output a high score for some of the neurons in the output feature vector, for example the ones corresponding to "sign" and "60" as shown in the output 322 for a network 300 that has been trained. Before training, the output produced by the DCN is likely to be incorrect, and so an error may be calculated between the actual output and the target output. The weights of the DCN may then be adjusted so that the output scores of the DCN are more closely aligned with the target).

23.	As to claim 15, Chakraborty discloses using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content comprises: generating a plurality of content candidates from the amount of content; using the first neural network to determine, for each of the content candidates, an indication of whether or not an object of the predetermined type is depicted in said content candidate; and using the indications to determine whether or not an object of the predetermined type is depicted within the amount of content (see claim 1, also page 6, paragraph, [0072] the camera controller can learn an object when an object is detected and is present for M frames. For example, a family member, dog, or specific car type can be learned. Classification scores are assigned to each frame (e.g., Bayesian information criterion scores). The assigned scores are evaluated and possible candidates are determined. Also page 6, paragraphs, [0076-0077] FIG. 10 illustrates an example flow diagram 1000 utilized by the camera controller. In particular, the camera controller is trained to learn a baseline from idle images (e.g., images where no movement is detected), starting at block 1002. In one example, a first set of feature vectors may be extracted from a first video frame. The first set of feature vectors may be represented as first baseline scores. Additionally, a second set of feature vectors may be extracted from a second video frame. The second set of feature vectors may be represented as second baseline scores. For images that are very similar to each other, the first and second baseline scores may likewise be similar. A baseline pattern may be established when the first and second baseline scores are similar to one another. In some aspects, a final baseline score may be calculated based on the average of the first and second baseline scores. Once an object enters the frame, at block 1004, the camera controller determines whether an object is present, at block 1006. For example, when a person (or object) enters a frame, the first and second baseline scores are different. The larger the difference between the first and second baseline scores, then the bigger the change in the images).

24.	As to claim 16, Chakraborty discloses one or more of the content candidates is generated from the amount of content by performing one or more geometric transformations (such as a rotation, (zoom-in or zoom-out), shear or scaling) on a portion of the amount of content (see page 5, paragraph, [0059] FIG. 5 is a block diagram illustrating the run-time operation 500 of an AI application on a smartphone 502. The AI application may include a pre-process module 504 that may be configured (using for example, the JAVA programming language) to convert the format of an image 506 and then crop and/or “resize the image” 508. The pre-processed image may then be communicated to a classify application 510 that contains a Scene Detect Backend Engine 512 that may be configured (using for example, the C programming language) to detect and classify scenes based on visual input. The Scene Detect Backend Engine 512 may be configured to further preprocess 514 the image by “scaling” 516 and cropping 518. For example, the image may be scaled and cropped so that the resulting image is 224 pixels by 224 pixels. These dimensions may map to the input dimensions of a neural network. The neural network may be configured by a deep neural network block 520 to cause various processing blocks of the SOC 100 to further process the image pixels with a deep neural network. The results of the deep neural network may then be thresholded 522 and passed through an exponential smoothing block 524 in the classify application 510. The smoothed results may then cause a change of the settings and/or the display of the smartphone 502).

25.	As to claim 17, Chakraborty discloses the amount of content is an audio snippet and the predetermined type is one of: a voice; a word; a phrase (see page 3, paragraph, [0039] a deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases).

26.	As to claims 18-19 and 21-23, the Chakraborty reference discloses all claimed subject matter as explained above with respect to the comments/citations of claims 8 and 11-12.

Claim Rejections – 35 USC § 103
	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


27.	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Paluri (US PGPub 2017/0046613) [hereafter Paluri] in view of Pereira (US PGPub 2018/0307942) [hereafter Pereira].

28.	As to claim 6, it is noted that Paluri fails to specifically disclose the predetermined type is a logo.
	On the other hand, Pereira discloses identifying an object of a predetermined type within a video sequence wherein the predetermined type is a logo (Abstract, Paragraphs 0034, 0040, 0044, 0054, 0063, 0072, 0112, 0128, 0133).
	It would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to include identifying an object of a predetermined type within a video sequence wherein the predetermined type is a logo as taught by Pereira with the method and device of Paluri because the cited prior art are directed towards using neural networks to determine specific content within received video content and because the claimed limitations are fully disclosed within the cited prior art references and would yield predictable results of achieving high robustness and accuracy of enabling detection of various logos placed within video content.

29.	Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Paluri (US PGPub 2017/0046613) [hereafter Paluri] in view of Shen (US PGPub 2016/0148079) [hereafter Shen].

30.	As to claims 8 and 18, it is noted that Paluri fails to specifically disclose associating metadata with the image based on the identified object.
	On the other hand, Shen discloses identifying an object of a predetermined type within a video sequence and associating metadata with the image based on the identified object (Paragraphs 0021-0024).
	It would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to include associating metadata with the image based on the identified object as taught by Shen with the method and device of Paluri because the cited prior art are directed towards using neural networks to determine specific content within received video content and because the claimed limitations are fully disclosed within the cited prior art references and would yield predictable results of generating specific metadata for the identified objects that can be used for future image retrieval and identification operations.

31.	Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Paluri (US PGPub 2017/0046613) [hereafter Paluri] and Pereira (US PGPub 2018/0307942) [hereafter Pereira], as applied to claim 6, and in further view of Shah (US PGPub 2016/0321167) [hereafter Shah].

32.	As to claim 9, it is noted that the combination of the Paluri and Pereira references fails to particularly disclose obtaining the video sequence from a source and determining unauthorized use of the video sequence based on identifying that the logo is depicted within one or more images of the video sequence.
	On the other hand, Shah discloses obtaining the video sequence from a source; and determining unauthorized use of the video sequence based on identifying that the logo is depicted within one or more images of the video sequence (Paragraphs 0058-0059, 0063, 0069).
	It would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to include obtaining the video sequence from a source and determining unauthorized use of the video sequence based on identifying that the logo is depicted within one or more images of the video sequence as taught by Shah with the method and device of Paluri and Pereira because the cited prior art references are directed towards using neural networks to detect specific logos within video content, because the claimed limitations are fully disclosed within the cited prior art references, and because the combination would yield predictable results of enabling the system to distinguish between authorized and non-authorized streams of content based on the logo and source.

33.	As to claim 10, Pereira discloses that the logo is one of a plurality of predetermined logos (Paragraphs 0009, 0040, 0044).

34.	Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Paluri (US PGPub 2017/0046613) [hereafter Paluri] in view of Chakraborty (US PGPub 2017/0185872) [hereafter Chakraborty].

35.	As to claim 17, it is noted that Paluri fails to specifically disclose that the amount of content is an audio snippet and that the predetermined type is one of: a voice; a word; a phrase.
	On the other hand, Chakraborty discloses that the amount of content is an audio snippet and that the predetermined type is one of: a voice; a word; a phrase (see page 3, paragraph [0039]: a deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases).
	It would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to include an amount of content that is an audio snippet and a predetermined type that is one of: a voice; a word; a phrase, as taught by Chakraborty, with the method and device of Paluri because the cited prior art references are directed towards using neural networks to determine specific content within received video content, because the claimed limitations are fully disclosed within the cited prior art references, and because the combination would yield predictable results of filtering and identifying specific content within audio content items.

Double Patenting
	The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees.   A nonstatutory obviousness-type double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and  In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the conflicting application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. 
Effective January 1, 1994, a registered attorney or agent of record may sign a terminal disclaimer. A terminal disclaimer signed by the assignee must fully comply with 37 CFR 3.73(b).


36.	Claims 1-2, 4-13, 15-19, 21-23 of the instant application are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-18 of US Patent 10417527 (hereafter ‘527).

37.	Although the conflicting claims are not identical, they are not patentably distinct from each other because the subject matter claimed in the instant application is an obvious variant of that claimed in the patented invention.  Claims 1-2, 4-13, 15-19, 21-23 of the instant application are anticipated by the limitations of claims 1-18 of cited patent ‘527. The claims of the instant application are broader than those of cited patent ‘527 and merely omit certain limitations in the claims of the cited patent ‘527.

Instant Application, Claim 1:
A method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises, for each of one or more images of the sequence of images: using a first neural network to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify the object determined as being depicted within the image.

US Patent ‘527, Claim 1:
A method for identifying an object within a video sequence, wherein the video sequence comprises a sequence of images, wherein the method comprises: obtaining the video sequence from a source; for each of one or more images of the sequence of images: using a first neural network to determine whether or not an object of a predetermined type is depicted within the image; and in response to the first neural network determining that an object of the predetermined type is depicted within the image, using an ensemble of second neural networks to identify which object of the predetermined type is depicted within the image; wherein one or both of: (a) the first neural network is a convolutional neural network or a deep convolutional neural network; and (b) one or more of the second neural networks is a convolutional neural network or a deep convolutional neural; and determining unauthorized use of the video sequence based on identifying that the object of the predetermined type is depicted within one or more images of the video sequence.

	Table 1	

Instant Application, Claim 11:
A method for identifying an object within an amount of content, the method comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.

US Patent ‘527, Claim 8:
A method for identifying an object within an amount of content, the method comprising: obtaining the amount of content from a source; using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify which object of the predetermined type is depicted within the amount of content; and determining unauthorized use of the amount of content based on identifying that the object of the predetermined type is depicted within the amount of content; wherein one or both of: (a) the first neural network is a convolutional neural network or a deep convolutional neural network; and (b) one or more of the second neural networks is a convolutional neural network or a deep convolutional neural.

	Table 2	

Instant Application, Claim 19:
An apparatus comprising one or more processors, the one or more processors being arranged to carry out identification of an object within an amount of content, said identification comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.

US Patent ‘527, Claim 15:
An apparatus comprising one or more processors, the one or more processors being arranged to carry out identification of an object within an amount of content, said identification comprising: obtaining the amount of content from a source; using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify which object of the predetermined type is depicted within the amount of content; and determining unauthorized use of the amount of content based on identifying that the object of the predetermined type is depicted within the amount of content; wherein one or both of: (a) the first neural network is a convolutional neural network or a deep convolutional neural network; and (b) one or more of the second neural networks is a convolutional neural network or a deep convolutional neural.

	Table 3	

Instant Application, Claim 21:
A non-transitory computer-readable medium storing a computer program which, when executed by one or more processors, causes the one or more processors to carry out identification of an object within an amount of content, said identification comprising: using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; and in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify the object determined as being depicted within the amount of content.

US Patent ‘527, Claim 17:
A non-transitory computer-readable medium storing a computer program which, when executed by one or more processors, causes the one or more processors to carry out identification of an object within an amount of content, said identification comprising: obtaining the amount of content from a source; using a first neural network to determine whether or not an object of a predetermined type is depicted within the amount of content; in response to the first neural network determining that an object of the predetermined type is depicted within the amount of content, using an ensemble of second neural networks to identify which object of the predetermined type is depicted within the amount of content; and determining unauthorized use of the amount of content based on identifying that the object of the predetermined type is depicted within the amount of content; wherein one or both of: (a) the first neural network is a convolutional neural network or a deep convolutional neural network; and (b) one or more of the second neural networks is a convolutional neural network or a deep convolutional neural.

	Table 4	
38.	As can be seen in Tables 1-4, each of the claimed limitations of independent claims 1, 11, 19, and 21 of the instant application is included within the claimed limitations of claims 1, 8, 15, and 17 of cited patent ‘527.  Additionally, claims 2, 4-10, 12-13, 15-18, 21-23 of the instant application are anticipated by claims 1-18 of cited patent ‘527.

Conclusion
39.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL S OSINSKI, whose telephone number is (571) 270-3949.  The examiner can normally be reached Monday - Friday, 10:00am - 6:00pm.  If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nay Maung, can be reached at 571-272-7882.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
	Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



MO
/MICHAEL S OSINSKI/
Primary Examiner, Art Unit 2664