DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Election/Restrictions
Applicants’ election with traverse of Invention I, Claims 1 to 10 and 16 to 20, in the reply filed on 21 June 2022 is acknowledged.  The traversal is on the grounds that the claims are amended so that they are no longer patentably distinct.  Specifically, Applicants argue that Invention II now includes a limitation that a binary object detector is trained via self-supervised training on speech extracted from raw and unlabeled videos.  This is not found persuasive because the inventions remain patentably distinct even after amendment.
Firstly, Applicants have not expressly traversed the restriction requirement, but have elected Invention I, amended the independent claims of Invention II, and asserted that the restriction requirement should be withdrawn.  This response is being treated as an election with traverse of the restriction requirement, even though a traverse is not expressly stated.  However, Invention I and Invention II would remain patentably distinct as subcombinations usable together after an amendment directed to detecting an object in an image via a binary object detector trained via self-supervised training on speech extracted from raw and unlabeled videos.  Applicants should recall that in order to support a restriction requirement based on subcombinations usable together, one only has to show that at least one of the inventions is separately usable.  Here, Invention II can be used without the details of Invention I directed to extracting positive and negative frames, region proposals, and clustering regions.  Similarly, Invention I can be used without the details of Invention II directed to receiving an image and detecting an object in the image, as Invention I has separate utility for classifying objects in videos, whereas Invention II classifies objects in images.  An image can be a still picture, and is not necessarily the same as a video.  Generally, Invention I is directed to training an object detector as a process of making, and Invention II is directed to using an object detector as a process of using, but an object detector of Invention II can be trained by a method patentably distinct from the method of training an object classifier of Invention I, so that the processes are patentably distinct.
Secondly, Applicants have amended their claims after restriction, and this is not strictly proper.  Mainly, Applicants’ response to a restriction requirement is limited to traversing or electing without traverse after the restriction requirement, as an amendment raises new issues.  Moreover, there would clearly be a significant burden if all of the claims of these two inventions were examined together due to their divergent claim limitations.  A reference for a first of the two inventions would not appear to render obvious any of the claims for a second of the two inventions, so that a rejection considering the two inventions together would impose a burden of complexity with different combinations of references.  Long et al. (U.S. Patent Publication 201/0340567) is prior art evidence that could anticipate or render obvious the independent claims of Invention II, but would not be relevant to rejection of any claims of Invention I.
The requirement is still deemed proper and is therefore made FINAL.
Claims 11 to 15 and 21 to 25 are withdrawn from further consideration pursuant to 37 CFR 1.142(b), as being drawn to a nonelected invention, there being no allowable generic or linking claim. Applicants timely traversed the restriction (election) requirement in the reply filed on 21 June 2022.

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 
The following title is suggested: Self-Supervised Object Detection Training Using Raw and Unlabeled Videos and Extracted Speech.
The disclosure is objected to because of the following informalities:
In ¶[0004], “The program code executable by the processor” should be “The program code is executable by the processor”.
In ¶[0006], “can include computer-readable storage medium” should be “can include a computer-readable storage medium”.
In ¶[0006], “The program code executable by the processor” should be “The program code is executable by the processor”.
In ¶[0017], “advantages maybe” should be “advantages may be”.
In ¶[0020], “As one examples” should be “As one example”.
In ¶[0021], it appears that “for” should not be lined through in “12. Run DSD per frame 
In ¶[0025], it appears that “regions” should not be lined through in “whether a given 
In ¶[0027], “the object name be ‘guitar’” should be “the object name can be ‘guitar’”.  
In ¶[0032], “a common characteristics” should be “a common characteristic”.
In ¶[0049], “a names” should be “a name” or “names”.
In ¶[0089], “the trainer module 822 can include code to can detect” should be “the trainer module 822 can include code to detect”.
In ¶[0089], “how many time” should be “how many times”.
Appropriate correction is required.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1 to 10 and 16 to 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Amrani et al. (“Learning to Detect and Retrieve Objects from Unlabeled Video”).
(Note: Amrani et al. is properly prior art under AIA 35 U.S.C. §102(a)(1) because it is a printed publication describing the invention before the effective filing date of the invention.  Amrani et al. has a publication date of 19 October 2019, and the invention has an effective filing date of 04 November 2019.  However, Applicants may be eligible for an exception under AIA 35 U.S.C. §102(b)(1) for disclosures made one year or less before their effective filing date if they can show that a disclosure was made by an inventor or joint inventor or by another who obtained the subject matter disclosed directly or indirectly from an inventor or joint inventor.  Applicants must file an affidavit/declaration in accordance with 37 CFR 1.130(b) to establish this exception under AIA 35 U.S.C. §102(b)(1).)
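For illustration only, the timing condition in the note above can be checked with a short date computation.  This is a minimal sketch and not a legal determination: the function name `within_grace_period` is hypothetical, the handling of a 29 February filing date is an assumption, and the check covers only the timing of the disclosure, not whether the disclosure originated from an inventor or joint inventor.

```python
from datetime import date


def within_grace_period(disclosure: date, effective_filing: date) -> bool:
    """Return True if the disclosure was made 1 year or less before filing.

    Hypothetical helper for illustration; it checks timing only.
    """
    if disclosure > effective_filing:
        # A later disclosure is not "before the effective filing date" at all.
        return False
    try:
        earliest = effective_filing.replace(year=effective_filing.year - 1)
    except ValueError:
        # Assumption: a 29 February filing date maps back to 28 February.
        earliest = effective_filing.replace(
            year=effective_filing.year - 1, day=28
        )
    return disclosure >= earliest


publication = date(2019, 10, 19)       # publication date of Amrani et al.
effective_filing = date(2019, 11, 4)   # effective filing date of the application

gap_days = (effective_filing - publication).days
eligible_on_timing = within_grace_period(publication, effective_filing)
```

Under these dates the publication precedes the effective filing date by 16 days, so the timing condition is met and the remaining question is the source of the disclosure.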
Regarding independent claims 1, 6 and 16, Amrani et al. discloses a system, method, and computer program for detecting objects from unlabeled videos, comprising:
“receiving raw and unlabeled videos” – Self-Supervised Object Detection and Retrieval (SSODR) receives unlabeled videos (“raw and unlabeled videos”); large-scale video data of YouTube-8M and How2 can be leveraged for this purpose (Page 1: I. Introduction, Left Column); 
“extracting speech from the raw and unlabeled videos” – given unlabeled videos, an audio channel can be used as a ‘free’ source of weak labels; by seeing and hearing many frames where the word ‘guitar’ is mentioned, it should be possible to detect the guitar due to its shared characteristics over the frames; an audio track is mapped to text using automatic transcription from a speech to text model (Page 1: I. Introduction, Left Column to Right Column); for a given object, a single frame from the temporal segment contains the object’s name in a transcription (Page 2: 3. Method: Left Column); here, a transcription of audio by a speech to text model from an audio channel is ‘speech extracted from the videos’;
“extracting positive frames and negative frames from the raw and unlabeled videos based on the extracted speech for each object to be detected” – for a given object, a single frame is extracted from each temporal segment containing the object’s name in the transcription (“based on the extracted speech for each object to be detected”), and these frames form a (noisy) positive set labeled Yl = 1 (“positive frames”); a balanced contrastive set, Yl = 0, is introduced as a negative set containing frames (“and negative frames”) randomly selected from disparate videos (Page 2: 3. Method: Left Column);
“extracting region proposals from the positive frames and negative frames” – next, N region proposals are extracted from the selected frames using an unsupervised method (Page 2: 3. Method: Left Column);
“extracting features based on the extracted region proposals” – each region is mapped to a feature space, represented by zli using a pre-trained Inception-ResNet-v2 CNN (Page 2: 3. Method: Left Column);
“clustering the region proposals and assign a potential score to each cluster” – clustering of regions in the embedding space is performed in learning to find a common theme across positive regions that is less likely to exist in negative counterparts; following feature extraction of region proposals, the proposals are clustered using a variation of deep embedded clustering (DEC) (Page 2: 3. Method: Left to Right Column); each cluster is assigned a potential score (Page 3: 3. Method: Left Column);
“training a binary object detector to detect objects based on positive samples selected based on potential score” – positive samples (regions) are extracted from high potential scores, and a detector is trained as a binary classifier to distinguish between regions that are likely to contain an object of interest and background that may include other objects (Page 3: 3. Method: Left Column).
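For illustration only, the frame-selection step mapped above (positive frames taken from segments whose transcription contains the object’s name, and a balanced negative set drawn at random from other videos) can be sketched as follows.  This is a sketch under stated assumptions and not the implementation of Amrani et al.: the `Segment` structure, the function `select_frames`, the use of the segment midpoint as the frame time, and the reading of “disparate videos” as videos with no mention of the object are all hypothetical.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcribed temporal segment of a video (hypothetical structure)."""
    video_id: str
    start_sec: float
    end_sec: float
    text: str


def select_frames(segments, object_name, seed=0):
    """Return (positives, negatives) as lists of (video_id, time_sec, label)."""
    rng = random.Random(seed)
    name = object_name.lower()

    def mentions(seg):
        # Whole-word match of a single-word object name in the transcript.
        return name in re.findall(r"[a-z']+", seg.text.lower())

    def midpoint(seg):
        # Assumption: take the frame at the middle of the segment.
        return (seg.start_sec + seg.end_sec) / 2.0

    # Noisy positive set (label 1): one frame per segment naming the object.
    positives = [(s.video_id, midpoint(s), 1) for s in segments if mentions(s)]

    # Assumption: negatives come only from videos that never name the object.
    positive_videos = {video_id for video_id, _, _ in positives}
    pool = [s for s in segments if s.video_id not in positive_videos]

    # Balanced contrastive set (label 0), sampled at random.
    k = min(len(positives), len(pool))
    negatives = [(s.video_id, midpoint(s), 0) for s in rng.sample(pool, k)]
    return positives, negatives


segments = [
    Segment("v1", 0.0, 4.0, "Here is my guitar."),
    Segment("v1", 4.0, 8.0, "Let us tune it"),
    Segment("v2", 10.0, 12.0, "A GUITAR has six strings"),
    Segment("v3", 0.0, 2.0, "Chop the onions"),
    Segment("v4", 0.0, 6.0, "Add some salt"),
    Segment("v4", 6.0, 8.0, "Stir well"),
]
positives, negatives = select_frames(segments, "guitar", seed=1)
```

The positive set is described as noisy because a spoken mention of an object does not guarantee that the object is visible in the selected frame.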

Regarding claims 2 to 5, Amrani et al. discloses searching for clusters satisfying three conditions: (1) high purity as defined as a percentage of positive samples in the cluster (“a positive ratio”), (2) low cluster variance (“a cluster variance”) for tendency to include a single object type, and (3) high video variety (“a cluster member variety”) prioritizing clusters that include regions from a higher variety of videos, as an object having common characteristics among various videos; these constraints are formalized with a softmax function Sk, which is referred to as a potential score (“wherein the potential score is based on . . .”), where Pk is a positive ratio (“a positive ratio”), Vk is a cluster distance variance (“a cluster variance”), and Uk denotes a number of unique videos incorporated into the cluster (“a cluster member variety”) (Page 2: 3. Method: Left to Right Column: Equation (1)).
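For illustration only, a potential score of the kind described for claims 2 to 5 can be sketched as a softmax over per-cluster raw scores.  The raw score used here, Pk · log(1 + Uk) / Vk, is an assumed stand-in chosen only because it rises with the positive ratio and the video variety and falls with the cluster variance; it is not represented to be Equation (1) of Amrani et al., and the function name `potential_scores` is hypothetical.

```python
import math


def potential_scores(positive_ratio, variance, unique_videos, eps=1e-8):
    """Softmax over clusters of an assumed raw score P_k * log(1 + U_k) / V_k.

    positive_ratio: P_k, fraction of positive samples in cluster k
    variance:       V_k, distance variance of cluster k
    unique_videos:  U_k, number of distinct videos contributing to cluster k
    """
    raw = [
        p * math.log(1.0 + u) / (v + eps)
        for p, v, u in zip(positive_ratio, variance, unique_videos)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


# Cluster 0: pure, tight, many videos. Cluster 1: less pure.
# Cluster 2: pure but spread out and drawn from few videos.
scores = potential_scores(
    positive_ratio=[0.9, 0.5, 0.9],
    variance=[0.5, 0.5, 2.0],
    unique_videos=[20, 20, 3],
)
```

Because the scores are normalized across clusters, they can be read as a distribution from which positive samples are drawn.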
Regarding claims 7 and 17, Amrani et al. discloses training an object detector using Dense Subgraph Discovery (DSD); high overlap regions are correlated with most connected nodes in the graph, and edges are connected between overlapping regions; remaining regions are ‘hard negative’ examples (“generating hard negative samples using dense subgraph discovery”); each cluster is assigned a potential score, and positive samples (regions) are extracted from high potential scores; a detector is trained as a binary classifier to distinguish between regions that are likely to contain the objects of interest and background that may include other objects; positive regions that satisfy dense subgraph discovery (DSD) criteria are sampled according to the cluster potential score distribution, and negatives are sampled uniformly from negative frames (“uniformly sampling the negative frames”), and are combined with rejected regions from dense subgraph discovery (DSD) as hard negatives (“based on the combined hard negative samples and sampled negative frames”) (Page 3: 3. Method: Left Column: Figure 2).
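For illustration only, the sampling described for claims 7 and 17 can be sketched as follows: positives are drawn from DSD-accepted regions in proportion to the cluster potential scores, negatives are drawn uniformly from regions of negative frames, and DSD-rejected regions are added as hard negatives.  This is a sketch under stated assumptions and not the implementation of Amrani et al.: the function `build_training_set` and its inputs are hypothetical, and the DSD step itself is assumed to have been run already.

```python
import random


def build_training_set(accepted_by_cluster, potential_scores, dsd_rejected,
                       negative_regions, n_pos, n_neg, seed=0):
    """Return a list of (region, label) pairs for a binary detector.

    accepted_by_cluster: cluster id -> regions that passed the DSD criteria
    potential_scores:    cluster id -> potential score of that cluster
    dsd_rejected:        regions rejected by DSD, used as hard negatives
    negative_regions:    regions extracted from negative frames
    """
    rng = random.Random(seed)

    # Only clusters with accepted regions and a nonzero score can be drawn.
    clusters = [
        k for k, regions in accepted_by_cluster.items()
        if regions and potential_scores.get(k, 0.0) > 0.0
    ]
    if not clusters:
        raise ValueError("no cluster has both accepted regions and a score")
    weights = [potential_scores[k] for k in clusters]

    # Positives: pick a cluster by potential score, then a region within it.
    positives = [
        (rng.choice(accepted_by_cluster[k]), 1)
        for k in rng.choices(clusters, weights=weights, k=n_pos)
    ]

    # Negatives: uniform sample from negative frames plus DSD hard negatives.
    uniform = rng.sample(negative_regions, min(n_neg, len(negative_regions)))
    negatives = [(r, 0) for r in uniform] + [(r, 0) for r in dsd_rejected]
    return positives + negatives


training_set = build_training_set(
    accepted_by_cluster={0: ["a0", "a1"], 1: ["b0"], 2: []},
    potential_scores={0: 0.7, 1: 0.0, 2: 0.3},
    dsd_rejected=["hard0"],
    negative_regions=["n0", "n1", "n2"],
    n_pos=5,
    n_neg=2,
    seed=3,
)
```

In this sketch positives are sampled with replacement, so a region from a high-scoring cluster can appear more than once in the training set.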
Regarding claims 8 and 18, Amrani et al. discloses that extracting N region proposals from selected frames uses an unsupervised method of a Selected Search (“wherein extracting the region proposals comprises using a selected search”) (Page 2: 3. Method: Left Column).
Regarding claims 9 and 19, Amrani et al. discloses that clustering uses a variation of deep embedded clustering (DEC) with a weighted student’s t-distribution as a similarity measure (“wherein clustering region proposals comprises performing a weighted deep embedded clustering”) (Page 2: 3. Method: Right Column: Figure 2).  Figure 2 illustrates ‘Weighted DEC’.
Regarding claims 10 and 20, Amrani et al. discloses that clusters are refined with new weights every I epochs, and cluster centroids are optimized (Page 2: 3. Method: Right Column: Figure 2).  Here, refining and optimizing clusters over a plurality of epochs is equivalent to “refining the clusters based on the potential score of the cluster.” 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center.  Unpublished application information in Patent Center is available to registered users.  To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.  Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.  For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MARTIN LERNER/
Primary Examiner, Art Unit 2657
August 1, 2022