DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority Acknowledgment
2.	Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy of priority Application 10-2018-0009965, filed on 01/26/2018 in the Republic of Korea, has been filed.

Continued Examination Under 37 CFR 1.114
3.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant’s submission filed on 06/29/2022 has been entered.

Response to Arguments/Amendments
4.	With respect to the rejection under 35 U.S.C. 103, Applicant argued on page 7 of the Remarks that “As illustrated in FIGS. 13A-13B by way of a non-limiting example, the electronic apparatus receives a voice “the left is my son, Junseo.” The voice is used to both 1) identify the object in the image (based on, e.g., left), and 2) generate tag information according to a keyword in the voice, for example, adding a tag for the name of the child - Junseo.”
In response, Examiner respectfully notes that Applicant’s arguments in the last paragraph of page 7 appear to indicate that the voice information is used to identify an object in the image by its location (i.e., “on the left”). Applicant then argues that Rogers features tagging directed at one selected object. The claim, however, is broader than Applicant indicates: it does not require that the object be identified by location based upon the voice and feature information.
Instead, the claim requires only that the object be identified, in some general manner, “based” on the voice and the feature information. It is noted that the features upon which Applicant relies (i.e., identifying the object in the image based on, e.g., “left”) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
	The images in Rogers (e.g., Fig. 3) feature multiple objects (i.e., faces), wherein one object (i.e., a face) among the multiple objects can be identified by using facial feature recognition, as in paragraph 0046, together with voice information identifying the person’s face, as in paragraphs 0049 and 0013 (Rogers [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging, [0049] Visual feedback of activation can be provided for example by providing a surrounding border or other visual indication as shown in FIG. 4B, and/or audio or other tactile, audible or visual feedback can be provided. Once the icon is activated, audio recording can begin and a record progress bar may be displayed beneath the icon along with a “micon” (microphone icon) to indicate that the image display device IDD is now recording audio to identify the activated object (see FIG. 4C). Such recorded information could be, for example, the name of the person shown in the activated image portion (e.g., “Jerry Alonzo, my brother in law”), [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison).
 	Given the level of generality of the claimed identification, which need only be “based” on the voice and the feature information, Rogers clearly addresses the argued limitation “identifying an object among the plurality of objects in the image based on the voice and the feature information”. Applicant’s arguments are therefore not persuasive, and Examiner respectfully disagrees.
 	To differentiate the claimed invention from Rogers, Applicant should consider an amendment that identifies both the object and its location/position among the multiple objects based upon the feature information and the voice, provided there is sufficient support in the specification.

Claim Rejections - 35 USC § 103
5.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

6.	Claims 1, 4, 9-12, and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Rogers (US 2017/0075924 A1) in view of Bentley et al. (US 2017/0024388 A1).

	With respect to Claim 1, Rogers discloses 
 	A control method of an electronic apparatus, the method comprising:
 	displaying an image including a plurality of objects (Rogers [0008] if a user is looking at a photo on a display device, Fig. 2); 
 	receiving a voice (Rogers [0008] if a user is looking at a photo on a display device...and speak a voice tag, or utter a command and then say the voice tag. As one example, if the user is looking at a photo of Gerilynn on the screen and wished to tag the photo, ...and say “Gerilynn”, or alternatively just say “Tag Gerilynn.”);
 	obtaining feature information on the plurality of objects (Rogers [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging) by inputting the image to a first AI model; 
 	identifying an object among the plurality of objects in the image based on the voice and the feature information (Rogers [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging, [0049] Visual feedback of activation can be provided for example by providing a surrounding border or other visual indication as shown in FIG. 4B, and/or audio or other tactile, audible or visual feedback can be provided. Once the icon is activated, audio recording can begin and a record progress bar may be displayed beneath the icon along with a “micon” (microphone icon) to indicate that the image display device IDD is now recording audio to identify the activated object (see FIG. 4C). Such recorded information could be, for example, the name of the person shown in the activated image portion (e.g., “Jerry Alonzo, my brother in law”), [0008] The action identifies the people or objects in the photo and also applies a voice tag to the photo, [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison);
 	obtaining a keyword of the voice (Rogers [0009] voice command could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time...Any keyword used during the tagging operation(s) could be uttered to call up..., [0040] Since many of such devices have cameras, it may be possible to detect people who are looking at photos and to thereby verify that recorded voice has relevance to the photos being shown on the screen. The owners of the photos can then tag the watcher(s) of the photos and connect the voice comments to actual people, [0043] Image display device IDD may then selectively record and store such acquired audio and/or visual information in association with the image I for later recall and replay by the same or different viewers. Such audible and/or visual information becomes a “tag” that tags or otherwise identifies the image I and describes it for later listening or other access) by inputting the voice to a second AI model; 
generating tag information corresponding to the identified object based on the feature information and the keyword (Rogers [0008] if a user is looking at a photo on a display device and wishes to tag the photo...or utter a command and then say the voice tag. As one example, if the user is looking at a photo of Gerilynn on the screen and wishes to tag the photo, the user can touch the photo on the touch screen and say “Gerilynn”, or alternatively just say “Tag Gerilynn.” That photo has now been tagged. The action identifies the people or objects in the photo and also applies a voice tag to the photo); and 
 providing the tag information (Rogers [0009] voice command could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time, [0015] particular photo and/or video streams can be tagged as being associated with a particular person, time and event and made available for sharing over a communications network, [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of Fig. 1 providing additional functionality allowing tagging of individual people or objects shown in image I.).
 	Rogers discloses a method/system for voice tagging that applies the voice tag to the identified people/objects in the photo. Rogers fails to explicitly disclose utilizing AI models for obtaining feature information on the plurality of objects by inputting the image to a first AI model and for obtaining the keyword of the voice by inputting the voice to a second AI model (i.e., the bolded limitations in the claim).
	However, Bentley et al. teach
 	obtaining feature information on the plurality of objects by inputting the image to a first AI model (Bentley et al. [0041] the recognition analysis model may determine identities of objects (e.g., a person, a place, a thing, etc.) within content items. For example, an image recognition analysis model, such as one utilizing a deep convolutional neural net, may be utilized to evaluate a digital image content item to identify a feature of the digital image content item (e.g., curve, line, coloring, etc. associated with the digital image). The feature may be evaluated to determine the object, such as a dog, within the digital image content item); 
 	obtaining a keyword of the voice by inputting the voice to a second Al model (Bentley et al. [0041] The recognition analysis model may comprise an image recognition model (e.g., pattern recognition, sketch recognition, facial recognition, etc.), a video recognition analysis model (e.g., gate recognition, moving facial recognition from a live video stream), and audio recognition analysis model (e.g., processing audio of a content item to identify keywords)); 
Rogers and Bentley et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the image recognition model and audio recognition model taught by Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input (Bentley et al. [0041] audio recognition analysis model (e.g., processing audio of a content item to identify keywords)... an image recognition analysis model, such as one utilizing a deep convolutional neural net, may be utilized to evaluate a digital image content item to identify a feature of the digital image content item (e.g., curve, line, coloring, etc. associated with the digital image). The feature may be evaluated to determine the object, such as a dog, within the digital image content item.)

	With respect to Claim 4, Rogers discloses 
 	wherein the tag information further comprises information on the identified object among the feature information (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary—voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. In such implementations, the device could respond to additional voice commands such as “IPAD Gerilynn” by recognizing the word “Gerilynn” and start showing photos that had previously been tagged with “Gerilynn”. Any keyword used during the tagging operation(s) could be uttered to call up and cause display of items tagged with that particular keyword.)
 	With respect to Claim 9, Rogers in view of Bentley et al. teach
 	further comprising: based on the object associated with the voice being identified, displaying a UI element notifying that the identified object is a target object to be tagged (Rogers [0049] Once recording is successful (see FIG. 4D), a micon may be displayed in the upper right-hand corner or the image to indicate that an audible tag has successfully been recorded relative to the object.)

 	With respect to Claim 10, Rogers in view of Bentley et al. teach
 	wherein the obtaining the feature information comprises, based on a plurality of objects associated with the voice being identified from the image, obtaining tagging information for each of the plurality of objects based on the voice (Rogers [0009] voice command could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time, [0046] FIG. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging. As shown in FIG. 2, in one non-limiting illustration, software functionality may be employed to recognize the faces of the three subjects within the image I, and those faces may be highlighted or otherwise visually emphasized by for example placing a box or other visual indicator around them, changing the color and/or intensity of the display of the faces and/or the area surrounding the faces, or any other desired visual highlighting or emphasizing technique, [0008] The action identifies the people or objects in the photo and also applies a voice tag to the photo, [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison.)
  
 	With respect to Claim 11, Rogers in view of Bentley et al. teach 
 	further comprising: 
 	storing the tag information associated with the image (Rogers [0009] voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. In such implementations, the device could respond to additional voice commands such as “IPAD Gerilynn” by recognizing the word “Gerilynn” and start showing photos that had previously been tagged with “Gerilynn”.)

 	With respect to Claim 12, Rogers discloses 
 	An electronic apparatus comprising: 
 	a display (Rogers [0037] Image display device IDD displays image I so people looking at the image display device can visually perceive the image); 
 	a microphone (Rogers [0037] the image display device IDD also includes a camera C and a microphone M); 
 	a memory configured to store computer executable instructions (Rogers [0078] Storage 1008 can comprise for example an SD card, a build in flash memory or any type of non-transitory, non-volatile memory device under control of at least one processor); and 
 	a processor configured to execute the computer executable instructions (Rogers [0078] Storage 1008 can comprise for example an SD card, a build in flash memory or any type of non-transitory, non-volatile memory device under control of at least one processor) to:  
 	 	control the display to display an image including plurality of objects (Rogers [0008] if a user is looking at a photo on a display device, Fig. 2),  
 		receive a voice through the microphone (Rogers [0008] if a user is looking at a photo on a display device...and speak a voice tag, or utter a command and then say the voice tag. As one example, if the user is looking at a photo of Gerilynn on the screen and wished to tag the photo, ...and say “Gerilynn”, or alternatively just say “Tag Gerilynn.”, [0010] The digital photo frame includes a microphone), 
		obtain feature information on the plurality of objects (Rogers [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging), by inputting the image to a first AI model, 
		identify an object among the plurality of objects in the image based on the voice and the feature information (Rogers [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging, [0049] Visual feedback of activation can be provided for example by providing a surrounding border or other visual indication as shown in FIG. 4B, and/or audio or other tactile, audible or visual feedback can be provided. Once the icon is activated, audio recording can begin and a record progress bar may be displayed beneath the icon along with a “micon” (microphone icon) to indicate that the image display device IDD is now recording audio to identify the activated object (see FIG. 4C). Such recorded information could be, for example, the name of the person shown in the activated image portion (e.g., “Jerry Alonzo, my brother in law”), [0008] The action identifies the people or objects in the photo and also applies a voice tag to the photo, [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison), 
obtain a keyword of the voice (Rogers [0009] voice command could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time...Any keyword used during the tagging operation(s) could be uttered to call up..., [0040] Since many of such devices have cameras, it may be possible to detect people who are looking at photos and to thereby verify that recorded voice has relevance to the photos being shown on the screen. The owners of the photos can then tag the watcher(s) of the photos and connect the voice comments to actual people, [0043] Image display device IDD may then selectively record and store such acquired audio and/or visual information in association with the image I for later recall and replay by the same or different viewers. Such audible and/or visual information becomes a “tag” that tags or otherwise identifies the image I and describes it for later listening or other access) by inputting the voice to a second AI model, 
generate tag information corresponding to the identified object based on the feature information and the keyword (Rogers [0008] if a user is looking at a photo on a display device and wishes to tag the photo...or utter a command and then say the voice tag. As one example, if the user is looking at a photo of Gerilynn on the screen and wishes to tag the photo, the user can touch the photo on the touch screen and say “Gerilynn”, or alternatively just say “Tag Gerilynn.” That photo has now been tagged. The action identifies the people or objects in the photo and also applies a voice tag to the photo), and 
 	provide the tag information (Rogers [0009] voice command could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time, [0015] particular photo and/or video streams can be tagged as being associated with a particular person, time and event and made available for sharing over a communications network, [0046] Fig. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of Fig. 1 providing additional functionality allowing tagging of individual people or objects shown in image I.)
 	Rogers discloses a method/system for voice tagging that applies the voice tag to the identified people/objects in the photo. Rogers fails to explicitly disclose utilizing AI models to obtain feature information on the plurality of objects by inputting the image to a first AI model and to obtain a keyword of the voice by inputting the voice to a second AI model (i.e., the bolded limitations in the claim).
	However, Bentley et al. teach
 	obtain feature information on the plurality of objects, by inputting the image to a first AI model (Bentley et al. [0041] the recognition analysis model may determine identities of objects (e.g., a person, a place, a thing, etc.) within content items. For example, an image recognition analysis model, such as one utilizing a deep convolutional neural net, may be utilized to evaluate a digital image content item to identify a feature of the digital image content item (e.g., curve, line, coloring, etc. associated with the digital image). The feature may be evaluated to determine the object, such as a dog, within the digital image content item), 
 	obtain a keyword of the voice by inputting the voice to a second AI model (Bentley et al. [0041] The recognition analysis model may comprise an image recognition model (e.g., pattern recognition, sketch recognition, facial recognition, etc.), a video recognition analysis model (e.g., gate recognition, moving facial recognition from a live video stream), and audio recognition analysis model (e.g., processing audio of a content item to identify keywords)), 
Rogers and Bentley et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the image recognition model and audio recognition model taught by Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input (Bentley et al. [0041] audio recognition analysis model (e.g., processing audio of a content item to identify keywords)... an image recognition analysis model, such as one utilizing a deep convolutional neural net, may be utilized to evaluate a digital image content item to identify a feature of the digital image content item (e.g., curve, line, coloring, etc. associated with the digital image). The feature may be evaluated to determine the object, such as a dog, within the digital image content item.)

 	With respect to Claim 15, Rogers in view of Bentley et al. teach
 	wherein the tag information further comprises information on the identified object among the feature information obtained by inputting the image to the first AI model (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. In such implementations, the device could respond to additional voice commands such as “IPAD Gerilynn” by recognizing the word “Gerilynn” and start showing photos that had previously been tagged with “Gerilynn”. Any keyword used during the tagging operation(s) could be uttered to call up and cause display of items tagged with that particular keyword, [0046] FIG. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging, Bentley et al. [0041] the recognition analysis model may determine identities of objects (e.g., a person, a place, a thing, etc.) within content items. For example, an image recognition analysis model, such as one utilizing a deep convolutional neural net, may be utilized to evaluate a digital image content item to identify a feature of the digital image content item (e.g., curve, line, coloring, etc. associated with the digital image). The feature may be evaluated to determine the object, such as a dog, within the digital image content item.)
  	With respect to Claim 16, Rogers in view of Bentley et al. teach
 	wherein a plurality of keywords of the voice are identified by inputting the voice to the second AI model (Bentley et al. [0041] The recognition analysis model may comprise an image recognition model (e.g., pattern recognition, sketch recognition, facial recognition, etc.), a video recognition analysis model (e.g., gate recognition, moving facial recognition from a live video stream), and audio recognition analysis model (e.g., processing audio of a content item to identify keywords)), 
 	the object among the plurality of objects in the image is identified based on one or more first keywords of the plurality of keywords (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary—voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time, [0046] FIG. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging. As shown in FIG. 2, in one non-limiting illustration, software functionality may be employed to recognize the faces of the three subjects within the image I, and those faces may be highlighted or otherwise visually emphasized by for example placing a box or other visual indicator around them, changing the color and/or intensity of the display of the faces and/or the area surrounding the faces, or any other desired visual highlighting or emphasizing technique, [0008] The action identifies the people or objects in the photo and also applies a voice tag to the photo, [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison. The keyword Gerilynn is used to identify the object Gerilynn among the plurality of objects in the photo), and 
 	the tag information corresponding to the identified object is generated based on the feature information and one or more second keywords among the plurality of keywords (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary—voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. The tag information is generated in response to the voice command keyword “tag:”)  

 	With respect to Claim 17, Rogers in view of Bentley et al. teach
 	wherein the processor is configured to identify a plurality of keywords of the voice by inputting the voice to the second AI model (Bentley et al. [0041] The recognition analysis model may comprise an image recognition model (e.g., pattern recognition, sketch recognition, facial recognition, etc.), a video recognition analysis model (e.g., gate recognition, moving facial recognition from a live video stream), and audio recognition analysis model (e.g., processing audio of a content item to identify keywords)), 
 	identify the object among the plurality of objects in the image based on one or more first keywords of the plurality of keywords (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary—voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time, [0046] FIG. 2 shows a further enhancement of the example non-limiting operation of image display device IDD of FIG. 1 providing additional functionality allowing tagging of individual people or objects shown in image I. In the example shown, automated, semi-automated and/or manual techniques may be employed to recognize or otherwise select or delimit different objects (e.g., faces or head shots) and to highlight or otherwise indicate those objects as potential objects for tagging. As shown in FIG. 2, in one non-limiting illustration, software functionality may be employed to recognize the faces of the three subjects within the image I, and those faces may be highlighted or otherwise visually emphasized by for example placing a box or other visual indicator around them, changing the color and/or intensity of the display of the faces and/or the area surrounding the faces, or any other desired visual highlighting or emphasizing technique, [0008] The action identifies the people or objects in the photo and also applies a voice tag to the photo, [0013] off-line or on-line processing can be used to recognize uttered speech and store text, data or other information and store this information in association with images for later comparison. The keyword Gerilynn is used to identify the object Gerilynn among the plurality of objects in the photo), and 
 	generate the tag information corresponding to the identified object based on the feature information and one or more second keywords among the plurality of keywords (Rogers [0009] Thus, in some non-limiting arrangements, touching on the touch screen may not be necessary—voice commands could be used instead (e.g., “tag: Gerilynn”) and the voice tagging could automatically be applied to the item displayed at that time. The tag information is generated in response to the voice command keyword “tag:”)

7.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Rogers (US 2017/0075924 A1) in view of Bentley et al. (US 2017/0024388 A1) and Handa et al. (US 8,793,129 B2).

	With respect to Claim 5, Rogers in view of Bentley et al. teach all the limitations of Claim 1 upon which Claim 5 depends. Rogers in view of Bentley et al. fail to explicitly teach 
 	wherein the providing comprises displaying the keyword of the voice along with the image.  
	However, Handa et al. teach 
 	wherein the providing comprises displaying the keyword of the voice along with the image (Handa et al. col. 2 lines 35-46 a voice input control function that inputs a voice signal of a voice uttered by a viewer who is viewing a display image displayed on a display portion; an acquisition function that identifies at least one word from the voice uttered by the viewer based on the voice signal inputted by way of control processing of the voice input control function and acquires at least one word thus identified as a keyword; and a display control function that causes information including the keyword acquired by the acquisition function or information derived from the keyword to be displayed together with the display image on the display portion.)
 	Rogers, Bentley et al. and Handa et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the teaching of the image recognition model and audio recognition model of Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input, and using the teaching of Handa et al. of identifying at least one keyword from the voice uttered by the viewer, for the benefit of displaying the identified keyword(s) together with the display image (Handa et al. col. 2 lines 35-46 a voice input control function that inputs a voice signal of a voice uttered by a viewer who is viewing a display image displayed on a display portion; an acquisition function that identifies at least one word from the voice uttered by the viewer based on the voice signal inputted by way of control processing of the voice input control function and acquires at least one word thus identified as a keyword; and a display control function that causes information including the keyword acquired by the acquisition function or information derived from the keyword to be displayed together with the display image on the display portion.)

8.	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Rogers (US 2017/0075924 A1) in view of Bentley et al. (US 2017/0024388 A1), Handa et al. (US 8,793,129 B2) and Terrell, II et al. (US 2009/0228274 A1).

	With respect to Claim 6, Rogers, Bentley et al. and Handa et al. teach all the limitations of Claim 5 upon which Claim 6 depends. Rogers, Bentley et al. and Handa et al. fail to explicitly teach 
 	further comprising: 
 	displaying the keyword of a voice subsequently input along with the keyword previously displayed.  
 	However, Terrell, II et al. teach
 	further comprising: 
 	displaying the keyword of a voice subsequently input along with the keyword previously displayed (Terrell, II et al. Fig. 5, [0089] display the incremental results in an animated, real-time visual display, which then updates frequently as new information becomes available. The Examiner notes that Handa et al. teach displaying the keyword of the voice along with the image, and Terrell, II et al. teach displaying a transcription of a subsequently input voice along with a previously displayed transcription.) 
 	Rogers, Bentley et al., Handa et al. and Terrell, II et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the teaching of the image recognition model and audio recognition model of Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input; using the teaching of Handa et al. of identifying at least one keyword from the voice uttered by the viewer, for the benefit of displaying the identified keyword(s) together with the display image; and using the teaching of Terrell, II et al. of displaying the incremental results in an animated, real-time visual display, for the benefit of enabling the user to navigate all of the options (Terrell, II et al. [0089] Because the initial and intermediate results are likely to contain most or all of the transcription options that will be available in the final results, it makes sense to display the incremental results in an animated, real-time visual display, which then updates frequently as new information becomes available. In this way, the user 32 is exposed to most or all of the options that the ASR engine considered during transcription and can more easily navigate to those options, after transcription is complete, in order to select a transcription option different from the one chosen by the engine as having the highest confidence value.)

9.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Rogers (US 2017/0075924 A1) in view of Bentley et al. (US 2017/0024388 A1) and Somekh et al. (US 2013/0325462 A1).

 	With respect to Claim 7, Rogers in view of Bentley et al. teach all the limitations of Claim 1 upon which Claim 7 depends. Rogers in view of Bentley et al. fail to explicitly teach
further comprising: 
displaying a user interface (UI) element to delete the keyword of the voice from the tag information. 
	However, Somekh et al. teach
 	further comprising: 
 	displaying a user interface (UI) element to delete the keyword of the voice from the tag information (Somekh et al. [0070] the tag extraction module 145 performs speech to text conversion on the audio component to obtain words that may be tags, Claim 5 communicating, by the server computer, the textual tag and the image file to the client device for display and for enabling the user to approve, reject, and edit the textual tags.)
 	Rogers, Bentley et al. and Somekh et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the teaching of the image recognition model and audio recognition model of Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input, and using the teaching of Somekh et al. of displaying the textual tag and the image file, for the benefit of enabling the user to edit the textual tags (Somekh et al. [0070] the tag extraction module 145 performs speech to text conversion on the audio component to obtain words that may be tags, Claim 5 communicating, by the server computer, the textual tag and the image file to the client device for display and for enabling the user to approve, reject, and edit the textual tags.)

10.	Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Rogers (US 2017/0075924 A1) in view of Bentley et al. (US 2017/0024388 A1) and Solem et al. (US 2013/0346068 A1).

 	With respect to Claim 8, Rogers in view of Bentley et al. teach all the limitations of Claim 1 upon which Claim 8 depends. Rogers in view of Bentley et al. fail to explicitly teach
 	wherein the identifying the object comprises identifying a first object associated with the voice and obtaining tag information for the first object by referring to pre-generated tag information associated with a second object included in the image.  
 	However, Solem et al. teach
 	wherein the identifying the object comprises identifying a first object associated with the voice and obtaining tag information for the first object by referring to pre-generated tag information associated with a second object included in the image (Solem et al. [0017] In some implementations, the natural language processing includes identifying one of the one or more terms as a pronoun; and determining a noun to which the pronoun refers. In some implementations, the noun is a name of an entity, an activity, or a location identified in a previous speech input associated with a previously tagged digital photograph. In some implementations, the noun is a name of a person identified using a contact list associated with a user of the electronic device. In some implementations, the noun is a name of a person identified based on a previous speech input associated with a previously tagged digital photograph.)
 	Rogers, Bentley et al. and Solem et al. are analogous art because they are from a similar field of endeavor, namely signal processing techniques and applications. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the voice tagging steps taught by Rogers using the teaching of the image recognition model and audio recognition model of Bentley et al., for the benefit of determining the object within the digital image content item and identifying keywords in the voice input, and using the teaching of Solem et al. of referring to previously tagged digital photographs, for the benefit of disambiguating the user's present voice input during tagging (Solem et al. [0017] In some implementations, the natural language processing includes identifying one of the one or more terms as a pronoun; and determining a noun to which the pronoun refers. In some implementations, the noun is a name of an entity, an activity, or a location identified in a previous speech input associated with a previously tagged digital photograph. In some implementations, the noun is a name of a person identified using a contact list associated with a user of the electronic device. In some implementations, the noun is a name of a person identified based on a previous speech input associated with a previously tagged digital photograph.)

Conclusion
11.	The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. See PTO-892.
a.	Joshi (US 2019/0278797 A1). In this reference, Joshi discloses a method for generating a voice tag on a photographic image.
b. 	Mohan et al. (US 2019/0080708 A1). In this reference, Mohan et al. disclose a method for generating voice tags.
c.	Smadi (US 2014/0207466 A1). In this reference, Smadi discloses a method for generating a voice tag through user operation of a mobile device. 

12.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429. The examiner can normally be reached Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C. Flanders can be reached on (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THUYKHANH LE/Primary Examiner, Art Unit 2655