DETAILED ACTION
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 4 recites the limitation "the method of claim 3".  There is insufficient antecedent basis for this limitation in the claim because claim 3 is canceled. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4, 8, 11, 15-16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gordon (US 20170323158) in view of Osotio et al (US 20190096105), and further in view of Shoemake et al (US 20150070516).
Regarding claim 1, Gordon discloses a method comprising:
receiving, at a server over a network from a remote device (¶58-59 & Fig. 5, the computing device 502 may communicate via the one or more networks 504 with an electronic device 506 associated with an individual 508), an image and spatial information about the image (¶51 & ¶101 the process 600 includes receiving input data including at least one of audible input, visual input, or sensor input and at 610, the process 600 includes determining that the input data corresponds to a request to identify the object; ¶156-159 the computing device architecture 1400 is applicable to any of the clients shown in FIGS. 1, 2, 5, 12, and 13; the processor 1402 may additionally or alternatively comprise a holographic processing unit (HPU) which is designed specifically to process and integrate data from multiple sensors of a head mounted computing device and to handle tasks such as spatial mapping, gesture recognition, and voice and speech recognition);
determining, by the server, a context of the image (¶77 a context of the object to be identified);
detecting, by the server using an object detection algorithm and the spatial information, one or more objects within the image (¶77 a description of other objects proximate to the object to be identified, one or more images of the object to be identified, one or more images of a scene including the object to be identified, or combinations thereof; ¶156-159 process and integrate data from multiple sensors of a head mounted computing device and to handle tasks such as spatial mapping, gesture recognition, and voice and speech recognition);
comparing, by the server, each of the one or more objects with the image context (¶78 the additional features included in the image may be used to identify a person or may be inappropriate for some individuals to view).
Gordon fails to specifically teach receiving an audio associated with the image, extracting from the audio, one or more keywords, determining, by the server, a context of the image, the context including the extracted keywords; selectively modifying, by the server, each of the one or more objects in the image that does not relate to the image context by replacing each of the one or more objects with a generic version of each of the one or more objects, each of the generic versions having a size and perspective comparable to their respective replaced objects.
Shoemake teaches receiving an audio associated with the image (¶149 the presence detection device might comprise a video input interface to receive video input from the local content source, an audio input interface to receive audio input from the local content source), extracting from the audio, one or more keywords (¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable; ¶200 Such identifying information can include raw or analyzed presence information, as well as information derived from the presence information, such as, to name some examples, extracted features from an image, audio segment, and/or video segment; an excerpted image, video, and/or audio segment; and/or the like); determining, by the server, a context of the image, the context including the extracted keywords (¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable; ¶180 For example, for video content, game content, or image content, image recognition or image identification techniques may be used to recognize or identify nudity, sexual content, gun violence, knife violence, gore, blood, violent acts, use of drugs, alcohol, and/or tobacco, and/or the like. For audio content, sound recognition, word identification, and/or similar techniques may be used to recognize or identify offensive words or phrases).
Osotio teaches selectively modifying, by the server, each of the one or more objects in the image that does not relate to the image context by replacing each of the one or more objects with a generic version of each of the one or more objects (¶87-88 a placeholder object can be presented, that lets the user know that augmented content is available and as the user focuses on the placeholder object, the augmented content is presented, at a selected rendering fidelity), each of the generic versions having a size and perspective comparable to their respective replaced objects (¶87-89 Selecting a rendering fidelity thus comprises selecting one or more characteristics such as transparency, color, resolution, how much of the content is rendered, size, and so forth that will be used to render the object. Which combination can be based on user preference, for example, the user may specify that they prefer resolution adjustment for initial presentation of augmented content or that they prefer transparency adjustment for initial presentation of augmented content).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of modifying, by the server, each of the one or more objects in the image that does not relate to the image context by replacing each of the one or more objects with a generic version of each of the one or more objects, each of the generic versions having a size and perspective comparable to their respective replaced objects from Osotio, and the teaching of receiving an audio associated with the image, extracting from the audio, one or more keywords, determining, by the server, a context of the image, the context including the extracted keywords from Shoemake, into the method as disclosed by Gordon. The motivation for doing this is to improve the user experience, thus improving the efficiency and effectiveness of the system and further to improve the effectiveness of future content filtering.

Regarding claim 4, the combination of Gordon, Osotio and Shoemake discloses the method of claim 3, wherein comparing each of the one or more objects with the image context comprises identifying each of the one or more objects, and comparing each of the one or more objects with each of the one or more keywords (Shoemake ¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable). 
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein comparing each of the one or more objects with the image context comprises identifying each of the one or more objects, and comparing each of the one or more objects with each of the one or more keywords from Shoemake into the method as disclosed by the combination of Gordon and Osotio. The motivation for doing this is to improve the effectiveness of future content filtering.
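For illustration only, the comparing step recited in claim 4 (identifying each object, then comparing it with each extracted keyword) might be sketched as below. This is a minimal sketch under stated assumptions and is not taken from Gordon, Osotio, Shoemake, or the application: the function name, the use of plain text labels for identified objects, and the case-insensitive exact-match rule are all hypothetical.

```python
def objects_unrelated_to_context(object_labels, keywords):
    """Return the identified object labels that match none of the keywords.

    Assumption: an object "relates to" the context when its label equals
    an extracted keyword, ignoring case and surrounding whitespace.
    """
    normalized_keywords = {k.strip().lower() for k in keywords}
    return [
        label
        for label in object_labels
        if label.strip().lower() not in normalized_keywords
    ]


# Hypothetical labels from an object detector and keywords from audio.
unrelated = objects_unrelated_to_context(["Car", "dog", "tree"], ["car", "Tree "])
```

Under these assumptions, only objects left in the returned list would be candidates for the selective modification recited in claim 1.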

Regarding claims 8 and 11 (drawn to a CRM):
The proposed combination of Gordon, Osotio and Shoemake, explained in the rejection of method claims 1 and 4, renders obvious the steps of the computer readable medium of claims 8 and 11 because these steps occur in the operation of the proposed combination as discussed above. Thus, arguments similar to those presented above for claims 1 and 4 are equally applicable to claims 8 and 11. See further Gordon ¶61.

Regarding claim 15, Gordon discloses an apparatus, comprising: 
an object detector (Fig. 5 processor 510);
a context determiner (Fig. 5 processor 510); and 
an object replacer (Fig. 5 processor 510), 
wherein: 
the object detector is to detect, using an object detection algorithm, one or more objects from a video and spatial information about the video (¶21 the system may capture visual input, such as an image or video; ¶51 & ¶101 the process 600 includes receiving input data including at least one of audible input, visual input, or sensor input and at 610, the process 600 includes determining that the input data corresponds to a request to identify the object; ¶156-159 the computing device architecture 1400 is applicable to any of the clients shown in FIGS. 1, 2, 5, 12, and 13; the processor 1402 may additionally or alternatively comprise a holographic processing unit (HPU) which is designed specifically to process and integrate data from multiple sensors of a head mounted computing device and to handle tasks such as spatial mapping, gesture recognition, and voice and speech recognition), 
the context determiner is to determine a context of the video from an associated audio (¶77 a context of the object to be identified), and 
the object replacer is to compare each of the one or more objects with the video context (¶78 the additional features included in the image may be used to identify a person or may be inappropriate for some individuals to view).
Gordon fails to specifically teach the context determiner is to extract, from an associated audio, one or more keywords, and determine a context of the video from the extracted keywords; and selectively modify each of the one or more objects in the video that does not relate to the video context by replacing each of the one or more objects with a generic version of each of the one or more objects, each of the generic versions having a size and perspective comparable to their respective replaced objects.
Shoemake teaches the context determiner is to extract, from an associated audio, one or more keywords (¶149 the presence detection device might comprise a video input interface to receive video input from the local content source, an audio input interface to receive audio input from the local content source; ¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable), and determine a context of the video from the extracted keywords (¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable);
Osotio teaches selectively modify each of the one or more objects in the video that does not relate to the video context by replacing each of the one or more objects with a generic version of each of the one or more objects (¶87-88 a placeholder object can be presented, that lets the user know that augmented content is available and as the user focuses on the placeholder object, the augmented content is presented, at a selected rendering fidelity), each of the generic versions having a size and perspective comparable to their respective replaced objects (¶87-89 Selecting a rendering fidelity thus comprises selecting one or more characteristics such as transparency, color, resolution, how much of the content is rendered, size, and so forth that will be used to render the object. Which combination can be based on user preference, for example, the user may specify that they prefer resolution adjustment for initial presentation of augmented content or that they prefer transparency adjustment for initial presentation of augmented content).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of the context determiner is to extract, from an associated audio, one or more keywords, and determine a context of the video from the extracted keywords from Shoemake, and the teaching of  selectively modify each of the one or more objects in the video that does not relate to the video context by replacing each of the one or more objects with a generic version of each of the one or more objects, each of the generic versions having a size and perspective comparable to their respective replaced objects from Osotio into the method as disclosed by Gordon. The motivation for doing this is to improve the user experience, thus improving the efficiency and effectiveness of the system and further to improve the effectiveness of future content filtering.

Regarding claim 16, Gordon discloses the apparatus of claim 15, wherein the apparatus is a mobile device (¶58-59). 

Regarding claim 20, Gordon discloses the apparatus of claim 15, wherein the object detector is to detect one or more objects from the video with reference to an object library (¶23 compare the characteristics of the object in the scene to characteristics of objects in the database). 

Claims 5-6, 12-13, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Gordon, Osotio and Shoemake as applied to claims 4, 11, and 15 above, and further in view of Delaney et al (US 20190139576).
Regarding claim 5, the combination of Gordon, Osotio and Shoemake discloses the method of claim 4, but fails to teach assigning a weight to each of the one or more keywords, and wherein comparing each of the one or more objects with the image context comprises assigning a weight to each of the one or more objects based upon the weight of each of the one or more keywords that are relevant to each of the one or more objects. 
Delaney teaches assigning a weight to each of the one or more keywords (¶33 the present system uses a confidence score between NLU processing and image recognition processing to assign a tag to video content. For example, as the video content progresses, the video content tagging device continually generates an audio and video confidence score), and wherein comparing each of the one or more objects with the image context comprises assigning a weight to each of the one or more objects based upon the weight of each of the one or more keywords that are relevant to each of the one or more objects (¶33-34 the present system uses a confidence score between NLU processing and image recognition processing to assign a tag to video content. For example, as the video content progresses, the video content tagging device continually generates an audio and video confidence score. Once the audio and video confidence score cross their respective threshold values, the video content tagging device assigns the tag to the video content. The video content tagging device de-assigns the tag to the video content when the audio and video confidence score falls below their respective threshold values.).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of assigning a weight to each of the one or more keywords, and wherein comparing each of the one or more objects with the image context comprises assigning a weight to each of the one or more objects based upon the weight of each of the one or more keywords that are relevant to each of the one or more objects from Delaney into the method as disclosed by the combination of Gordon, Osotio and Shoemake. The motivation for doing this is to improve corroborating video data with audio data to automatically tag video content.

Regarding claim 6, the combination of Gordon, Osotio, Shoemake and Delaney discloses the method of claim 5, wherein selectively modifying each of the one or more objects that does not relate to the image context comprises selectively modifying each of the one or more objects that has a weight that falls below a predetermined threshold (Delaney ¶33-35 The video content tagging device de-assigns the tag to the video content when the audio and video confidence score falls below their respective threshold values; de-assigning may refer to modifying the image of the car in the video image by removing the generic highlight, such as the box, around the image of the car).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein selectively modifying each of the one or more objects that does not relate to the image context comprises selectively modifying each of the one or more objects that has a weight that falls below a predetermined threshold from Delaney into the method as disclosed by the combination of Gordon, Osotio and Shoemake. The motivation for doing this is to improve corroborating video data with audio data to automatically tag video content.
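For illustration only, the weighting recited in claim 5 and the threshold test recited in claim 6 might be sketched as below. This is a sketch under stated assumptions, not the method of Delaney or of the application: the function names are hypothetical, and summing the weights of the relevant keywords is merely one assumed way of deriving an object weight "based upon" keyword weights.

```python
def weight_objects(object_keywords, keyword_weights):
    """Assign each object a weight from the keywords relevant to it.

    Assumption: an object's weight is the sum of the weights of its
    relevant keywords; keywords without an assigned weight count as 0.
    """
    return {
        obj: sum(keyword_weights.get(k, 0.0) for k in relevant)
        for obj, relevant in object_keywords.items()
    }


def objects_to_modify(object_weights, threshold):
    """Return, in sorted order, objects whose weight is below the threshold."""
    return sorted(obj for obj, w in object_weights.items() if w < threshold)


# Hypothetical data: keywords relevant to each object, and keyword weights.
weights = weight_objects(
    {"car": ["car", "road"], "poster": ["brand"]},
    {"car": 0.6, "road": 0.3},
)
to_modify = objects_to_modify(weights, threshold=0.5)
```

In this sketch the threshold value is supplied by the caller, consistent with the "predetermined threshold" language of claim 6; no particular value is suggested by the record.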

Regarding claims 12-13 (drawn to a CRM):
The proposed combination of Gordon, Osotio, Shoemake and Delaney, explained in the rejection of method claims 5-6, renders obvious the steps of the computer readable medium of claims 12-13 because these steps occur in the operation of the proposed combination as discussed above. Thus, arguments similar to those presented above for claims 5-6 are equally applicable to claims 12-13. See further Shoemake ¶124.

Regarding claim 18, the combination of Gordon, Osotio and Shoemake discloses the apparatus of claim 15, but fails to teach wherein the context determiner is to determine the context of the video from the associated audio with an automated speech recognition routine. 
Delaney teaches wherein the context determiner is to determine the context of the video from the associated audio with an automated speech recognition routine (¶42 Based on receiving the video stream 92, the audio analyzing module 72 analyzes the audio data 94 in the video stream 92 using NLU processing. The audio analyzing module 72 determines a candidate audio tag in the video stream 92 based on NLU processing of the audio data 94). 
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the context determiner is to determine the context of the video from the associated audio with an automated speech recognition routine from Delaney into the method as disclosed by the combination of Gordon, Osotio and Shoemake. The motivation for doing this is to improve corroborating video data with audio data to automatically tag video content.

Regarding claim 19, the combination of Gordon, Osotio, Shoemake and Delaney discloses the apparatus of claim 18, wherein the context determiner is to further determine the context of the video from the associated audio with a non-speech recognition routine (Gordon ¶34 the individual 104 may provide one or more sounds, one or more words, one or more gestures, or combinations thereof to indicate a request to identify an object within the scene 102. The computing device 112 may analyze the input from the individual 104 and determine that a request is being provided by the individual 104 to identify one or more objects within the scene). 

Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Gordon, Osotio and Shoemake as applied to claim 1 above, and further in view of Ham et al (US 20160062116).
Regarding claim 21, the combination of Gordon, Osotio and Shoemake discloses the method of claim 1, but fails to teach wherein determining the context of the image further comprises determining the context of the image based on a type of application capturing the image.
Ham teaches wherein determining the context of the image further comprises determining the context of the image based on a type of application capturing the image (¶88 an immersive application is executed and it is determined that there is no safety danger based on the context information; an immersive application is executed and it is determined that there is safety danger based on the context information).
Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein determining the context of the image further comprises determining the context of the image based on a type of application capturing the image from Ham into the method as disclosed by the combination of Gordon, Osotio and Shoemake. The motivation for doing this is to improve user interaction by displaying objects accordingly.
Response to Arguments
Applicant's arguments filed 09/07/2022 have been fully considered but they are not persuasive.
Regarding claim 1, the applicant argues that the prior art of record, alone or in combination, fails to teach at least “extracting from the audio, one or more keywords, determining, by the server, a context of the image, the context including the extracted keywords”.
Regarding the above argument, the examiner respectfully disagrees. Shoemake et al (US 20150070516) teaches “extracting from the audio, one or more keywords” in ¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable. Paragraph 200 further discloses “Such identifying information can include raw or analyzed presence information, as well as information derived from the presence information, such as, to name some examples, extracted features from an image, audio segment, and/or video segment; an excerpted image, video, and/or audio segment; and/or the like.” Therefore, the objectionable words are extracted from the audio.
Shoemake further teaches “determining, by the server, a context of the image, the context including the extracted keywords” in ¶158 analyzing the first media content to identify specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable and in ¶180 For example, for video content, game content, or image content, image recognition or image identification techniques may be used to recognize or identify nudity, sexual content, gun violence, knife violence, gore, blood, violent acts, use of drugs, alcohol, and/or tobacco, and/or the like. For audio content, sound recognition, word identification, and/or similar techniques may be used to recognize or identify offensive words or phrases. That is, Shoemake teaches identifying specific video content, image content, game content, audio content, etc. that are indicated in a database as being potentially objectionable. The example given in ¶180 discusses a violent video game with offensive words. The offensive words are found based on a match in a database.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN KY whose telephone number is (571)272-7648. The examiner can normally be reached Monday-Friday 9-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan Park, can be reached at 571-272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KEVIN KY/
Primary Examiner, Art Unit 2669