DETAILED ACTION
Response to Amendment
The amendment was received 9/17/21. Claims 1-21 are pending.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
Accordingly, 35 USC 112(f) is NOT invoked in claims 1-21. 

Accordingly the following definitions are “taken” via MPEP 2111.01 III. "PLAIN MEANING" REFERS TO THE ORDINARY AND CUSTOMARY MEANING GIVEN TO THE TERM BY THOSE OF ORDINARY SKILL IN THE ART, 3rd paragraph, emphasis added:
“It is also appropriate to look to how the claim term is used in the prior art, which includes prior art patents, published applications, trade publications, and dictionaries. Any meaning of a claim term taken from the prior art must be consistent with the use of the claim term in the specification and drawings. Moreover, when the specification is clear about the scope and content of a claim term, there is no need to turn to extrinsic evidence for claim interpretation. 3M Innovative Props. Co. v. Tredegar Corp., 725 F.3d 1315, 1326-28, 107 USPQ2d 1717, 1726-27 (Fed. Cir. 2013) (holding that "continuous microtextured skin layer over substantially the entire laminate" was clearly defined in the written description, and therefore, there was no need to turn to extrinsic evidence to construe the claim).”

The claimed “reproducing” (as in “reproducing a video” in claim 1, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein “representation” is “taken” under MPEP 2111.01 III:
reproduce
verb (used with object), re·pro·duced, re·pro·duc·ing.
1	to make a copy, representation, duplicate, or close imitation of:
to reproduce a picture.

wherein “representation” is defined:
representation
noun
11	the act of portrayal, picturing, or other rendering in visible form.

The claimed “frame” (as in “storing a plurality of frames of the reproduced video” in claim 1, lines 3-4) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein meanings 8 and 10 are “taken”:
frame
noun
8	Movies. one of the successive pictures on a strip of film.
10	Computers. the information or image on a screen or monitor at any one time.

The claimed “while” (as in “receiving a user voice input of a user while reproducing a first frame of the video” in claim 1, line 5) is interpreted in light of applicant’s disclosure as one of skill in the art would and definition thereof via Dictionary.com, wherein definitions 3-6 are equally applicable:
while
conjunction
3	during or in the time that.
4	throughout the time that; as long as.
5	even though; although:
While she appreciated the honor, she could not accept the position.
6	at the same time that (showing an analogous or corresponding action):
The floor was strewn with books, while magazines covered the tables.

The claimed “first” of the claimed “first frame” in claim 1, line 5 is interpreted in light of applicant’s disclosure and “drawings” thereof under MPEP 2111.01 III.
The claimed “-ing” (as in “identifying”) is interpreted in light of applicant’s disclosure and “drawings” of fig. 5:270: “FRAME RECOGNITION SERVER” and definition thereof under MPEP 2111.01 III via Dictionary.com, emphasis added, as “expressing the action of the verb” (recognize or identify) “or its result”, such as in the last two limitations of claim 1, “identifying an object” and “providing a search result”:
-ing
1	a suffix of nouns formed from verbs, expressing the action of the verb or its result, product, material, etc. (the art of building; a new building; cotton wadding). It is also used to form nouns from words other than verbs (offing; shirting). Verbal nouns ending in -ing are often used attributively (the printing trade) and in forming compounds (drinking song). In some compounds (sewing machine), the first element might reasonably be regarded as the participial adjective, -ing2, the compound thus meaning “a machine that sews,” but it is commonly taken as a verbal noun, the compound being explained as “a machine for sewing.”

The claimed “a” (as in “identifying a second frame” in claim 1, line 8) is interpreted in light of applicant’s disclosure as one of skill in the art would and definition thereof via Dictionary.com, wherein definitions 1-7 are equally applicable:
a1
indefinite article
1	not any particular or certain one of a class or group:
a man; a chemical; a house.
2	a certain; a particular:
one at a time; two of a kind; A Miss Johnson called.
3	another; one typically resembling:
a Cicero in eloquence; a Jonah.
4	one (used before plural nouns that are preceded by a quantifier singular in form): a hundred men (compare hundreds of men); a dozen times (compare dozens of times).
5	indefinitely or nonspecifically (used with adjectives expressing number):
a great many years; a few stars.
6	one (used before a noun expressing quantity):
a yard of ribbon; a score of times.
7	any; a single:
not a one.

The claimed “second” of the claimed “second frame” in claim 1, line 8 is interpreted in light of applicant’s disclosure and “drawings” thereof under MPEP 2111.01 III.
The claimed “trigger” (as in “the user voice input comprises a trigger voice” in claim 3, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein meaning 3 is “taken”:
trigger
noun
3	anything, as an act or event, that serves as a stimulus and initiates or precipitates a reaction or series of reactions.

Response to Arguments
Claim Objection
Applicant’s arguments, see remarks, page 10, filed 9/17/21, with respect to the objection to claims 11-20 have been fully considered and are persuasive.  The objection to claims 11-20 has been withdrawn. 
Rejections Under 35 USC 103
In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., “identifying an object in a previously-reproduced frame” via applicant’s remarks, page 13, emphasis added:
“Therefore, nothing in Sanchez discloses identifying a previously-reproduced frame based on a time point at which a user voice input is received, and then identifying an object in a previously-reproduced frame based on the previously-reproduced frame, and then obtaining a search result of information about the identified object.”)

are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
	In contrast, claim 1 states:
“the second frame is a part of …
identifying an object included in the second frame”

Applicants are implying:
“the second frame is one of the stored plurality of frames reproduced…
identifying an object included in the second frame”

	Thus, a person accesses (which comprises a transfer from one memory to another) a stored copy of content in fig. 9:916: “Media” to be reproduced via fig. 8:812: “Display”.
Applicant's arguments filed 9/17/21 have been fully considered but they are not persuasive. Applicants state in page 13:
“Accordingly, Sanchez fails to disclose or suggest "based on the user voice input comprising the request for information about the object being received, identifying a second frame based on a time point at which the user voice input is received, wherein the second frame is a part of the stored plurality of frames reproduced prior to the time point when the user voice input is started to be received; identifying an object included in the second frame based on the second frame and the user voice input; and providing a search result for the information about the object included in the second frame," as claimed inter alia in claim 1, and therefore fails to remedy the deficiencies of Hodge.”

	The examiner respectfully disagrees since Sanchez teaches:
identifying (or identifying or determining frames based on a user command such as “Repeat” via fig. 2:220: “Repeat the scene where the baker enters the cake contest”) a second frame (as shown in fig. 3:330: “QUIET”: an identified/determined frame showing a cake and faces to be identified) based on a time point (as shown in fig. 2 relative to fig. 3) at which the user voice input (via fig. 2:210: microphone: represented in fig. 10:1010: “Receive Summary Command”) is received, wherein the second frame is a part of the stored (via said media content server that provides said memory access of copies to a person) plurality of frames reproduced (via the display that reproduces video frames in visible form for the person) prior to the time point when the user voice input is started to be received; identifying an object (via fig. 16:1610: “Determine area of interest in current scene”: “identify…face… and a cake”: c.5,l.49 to c.6,l.2:“locate a cake”: c.38,ll. 9-13: represented in fig. 10:1050:“Determine a related scene based on the comparison”) included in the second frame (via fig. 3:300: “QUIET”: showing a cake and faces to be identified) based on the second frame and the user voice input (represented in fig. 10:1010: “Receive Summary Command”).
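For illustration only, the timing relationship recited in the disputed limitation can be sketched as follows. This is a minimal sketch, not applicant’s or Sanchez’s implementation: the function name `second_frame_index`, the list of frame timestamps, and the rule of picking the latest stored frame before the voice input started are all assumptions, since the claim language requires only that the second frame be a part of the stored frames reproduced prior to that time point.

```python
from bisect import bisect_left


def second_frame_index(frame_times, voice_start):
    """Return the index of the latest stored frame reproduced strictly
    before `voice_start`, or None if no such frame exists.

    `frame_times` is a hypothetical ascending list of reproduction
    timestamps (in seconds) for the stored frames.
    """
    # bisect_left finds the first frame whose timestamp is >= voice_start,
    # so the frame just before that position was reproduced earlier.
    i = bisect_left(frame_times, voice_start)
    return i - 1 if i > 0 else None


if __name__ == "__main__":
    stored = [0.0, 0.5, 1.0, 1.5]
    print(second_frame_index(stored, 1.2))
```

Any other rule that selects from the frames before the returned position would equally satisfy the "reproduced prior to the time point" wording; the sketch picks the nearest one only for concreteness.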
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Regarding inquiry 4, see Suggestions regarding claim 2.
Claims 1-9, 11-19, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271).
Note that claims 3 and 13 are also rejected under 35 USC 103 due to different meanings of the claimed “inquiry voice” in claim 3, line 4.
Regarding claim 1, Hodge teaches a method for controlling an electronic device comprising: 
reproducing (perceptually) a video (via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”); 
storing (via “buffer storing”, cited below: [0073]) a plurality of frames (via fig. 7:704: “PRESERVE DATA”) of the reproduced video (via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) for a specific time period (via “store video data for a predetermined and programmable set amount of time”, cited below: [0050]) while (via “monitored…while…continuously captured”, cited below: [0072]) reproducing (via said “monitored”) the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO” to be preserved/buffered); 
receiving a user (via “The video clip sharing request 800 may be user-generated”, cited below: [0089]) voice (via “voice commands… identifying objects being displayed on a video”, cited below: [0090]) input (via fig. 8:arrows being input) of a user (said via “The video clip sharing request 800 may be user-generated”) while reproducing (said via “The video clip sharing request 800 may be user-generated”) a first frame (“being displayed”:[0090]) of the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”), the user voice input (said via fig. 8:arrows being input) comprising a request (via “requested” “files”, cited below:[0076], corresponding to fig. 8:800: “SHARING REQUEST”) for information (comprised by said “files”) about an object (via fig. 4a: “Nice Spot!”) displayed (via fig. 4b:410: “CAMERAS”) in the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”); 
based on the user voice input (said via fig. 8:arrows being input) comprising the request for information about the object being received, identifying a (i.e., any) second (as indicated in fig. 7’s loops going back for a second time) frame (or “the video…image”, cited below: [0091], via “video data may be uploaded to the cloud system 103”, cited below: [0075], for said fig. 8: “SHARING REQUEST”) based on a time point (via fig. 7: “TAG EVENT?  YES”) at which the user voice input is received, wherein the second frame is a part of … to the (“displayed”) time point (during said via “voice commands… identifying objects being displayed on a video”) when the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input) is started to be received (as shown in fig. 8: any one arrow being input such as one of the arrows pointing to fig. 8:806: “END” relative to said “displayed”);
identifying an object (via fig. 8:805: “MATCH?”: “identify the…event”) included in the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video…image”) based on the second frame and the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input); and 
providing (via said arrows in fig. 8) a search result (via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) for the (matching) information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”, cited below: [0091]) included in the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image”, cited below: [0091], or “the video data in which the object was detected”, cited below: [0091] via:
“[0026] The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for streaming and playing back immersive video content.”

“[0050] According to one embodiment, client device 101 is always turned on as long as it has sufficient power to operate.  Cameras 214a and 214b are always turned on and recording video.  The video recorded by the cameras 214 is buffered in the memory device 203.  In one embodiment, memory device 203 is configured as a circular buffer.  For example, in one embodiment, memory device 203 may be a 32 Gb FLASH memory device.  Client device 101 manages the buffer in memory device 203 to store video data for a predetermined and programmable set amount of time.  For example, in one embodiment, memory device 203 buffers video data from two cameras 214a and 214b for the preceding 24 hours.”;

“[0072] Now referring to FIG. 7, a method for generating event-based video clips according to one embodiment is described.  Upon activation of the system, the method starts 700.  The various inputs are monitored 701 while video is continuously captured.  If no tagging event is detected 702, the system keeps monitoring.  If a tagging event is detected 702, the relevant video data in the buffer is identified and selected 703.  For example, once an event is detected 702, the video files for a predefined period of time before and after the event is identified in the buffer.  In one example, 15 seconds before and after the event time is used.  The amount of time, preferably between 10 and 30 seconds, may be pre-programmed or user selectable.  Further, two different time periods may be used, one for time before the event and the other for time after the event.  In one embodiment, the time periods may be different depending on the event detected.  For example, for some events the time periods may be 30 seconds before event and 1 or 2 minutes after while other events may be 15 seconds before and 15 seconds after.

[0073] The selected video data is marked for buffering 704 for a longer period of time.  For example, the video files for the selected time period are copied over to a second system buffer with a different buffering policy that retains the video for a longer period of time.  In one embodiment, the selected video data being in a buffer storing video for 24 hours is moved over to a second buffer storing video for 72 hours.”
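As an illustrative aside (not part of the quoted disclosure), the buffering described in [0050], [0072] and [0073] can be sketched as follows. This is a minimal sketch under stated assumptions: `Segment`, `RetentionBuffer`, and `preserve_event` are hypothetical names, and time-based eviction from a queue stands in for the circular buffer. Hodge does not disclose this code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for one buffered video file."""
    start: float  # capture time in seconds
    name: str


class RetentionBuffer:
    """Time-bounded buffer: segments older than `retention_s` are dropped."""

    def __init__(self, retention_s: float):
        self.retention_s = retention_s
        self.segments = deque()

    def add(self, segment: Segment, now: float) -> None:
        self.segments.append(segment)
        # Evict from the oldest end, much as a circular buffer overwrites old data.
        while self.segments and now - self.segments[0].start > self.retention_s:
            self.segments.popleft()

    def select(self, event_t: float, before_s: float, after_s: float) -> list:
        """Return segments captured within the window around the event time."""
        return [s for s in self.segments
                if event_t - before_s <= s.start <= event_t + after_s]


def preserve_event(short: RetentionBuffer, long: RetentionBuffer,
                   event_t: float, now: float,
                   before_s: float = 15.0, after_s: float = 15.0) -> list:
    """Copy the segments around a tagged event into the longer-lived buffer."""
    selected = short.select(event_t, before_s, after_s)
    for seg in selected:
        long.add(seg, now)
    return selected


if __name__ == "__main__":
    short = RetentionBuffer(24 * 3600)  # 24-hour buffer, as in [0050]
    long = RetentionBuffer(72 * 3600)   # 72-hour buffer, as in [0073]
    for t in range(0, 100, 5):
        short.add(Segment(float(t), f"seg{t}"), now=float(t))
    print(preserve_event(short, long, event_t=50.0, now=100.0))
```

The 15-second defaults mirror the example window given in [0072]; the separate `before_s` and `after_s` parameters reflect the statement that two different time periods may be used.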

“[0075] In one embodiment, video data objects are stored on the network-accessible buffer of the camera device and the playlist or manifest files for the generated event-based video clips identify the network addresses for the memory buffer memory locations storing the video data objects or files.  Alternatively, upon identifying and selecting 703 the relevant video data objects, in addition to or as an alternative to moving the video data to the longer buffer 704, the video data may be uploaded to the cloud system 103.  The clip generation 705 then identifies in the playlist or manifest file the network addresses for the video data stored in the cloud system 103.  A combination of these approaches may be used depending on storage capacity and network capabilities for the camera devices used in the system or according to other design choices of the various possible implementations.”;

“[0076] In one embodiment, other system components, such as the cloud system 103 or mobile device 104, are notified 706 of the event or event-based video clip. For example, in one embodiment a message including the GUID for the generated video clip is sent to the cloud system in a cryptographically signed message (as discussed above).  Optionally, the playlist or manifest file may also be sent in the message.  In one embodiment, the playlist or manifest files are maintained in the local memory of the camera device until requested.  For example, upon notification 706 of the clip generation, the cloud system may request the clip playlist or manifest file.  Optionally, the cloud system may notify 706 other system components and/or other users of the clip and other system components or users may request the clip either from the cloud system 103 or directly from the camera device.  For example, the clips pane 401a in the user's mobile app may display the clip information upon receiving the notification 706.  Given that the clip metadata is not a large amount of data, e.g., a few kilobytes, the user app can be notified almost instantaneously after the tag event is generated.  The larger amount of data associated with the video data for the clip can be transferred later, for example, via the cloud system or directly to the mobile device.  For example, upon detection of a "Baby/Animal in Parked Car" event or a "Location Discontinuity" event, the user's mobile device 104 may be immediately notified of the tag event using only tag metadata.  Subsequently, the user can use the video clip playlist to access the video data stored remotely, for example, for verification purposes.”;

“[0084] These combinations of events and inputs are illustrative only.  Some embodiments may provide a subset of these inputs and/or events.  Other embodiments may provide different combinations of inputs and/or different events.  The event detection algorithms may be implemented locally on the camera device (e.g., client device 101) or may be performed in cloud servers 102, with the input signals and event detection outputs transmitted over the wireless communication connection 107/108 from and to the camera device.  Alternatively, in some embodiments a subset of the detection algorithms may be performed locally on the camera device while other detection algorithms are performed on cloud servers 102, depending for example, on the processing capabilities available on the client device.  Further, in one embodiment, artificial intelligence ("AI") algorithms are applied to the multiple inputs to identify the most likely matching event for the given combination of inputs.  For example, a neural network may be trained with the set of inputs used by the system to recognize the set of possible tagging events.  Further, a feedback mechanism may be provided to the user via the mobile app to accept or reject proposed tagging results to further refine the neural network as the system is used.  This provides a refinement process that improves the performance of the system over time.  At the same time, the system is capable of learning to detect false positives provided by the algorithms and heuristics and may refine them to avoid incorrectly tagging events.”

“[0086] According to another aspect of the disclosure, in one embodiment, the detection process 702 is configured to detect a user-determined manual tagging of an event.  The user may provide an indication to the system of the occurrence of an event of interest to the user.  For example, in one embodiment, a user may touch the touchscreen of a client device 101 to indicate the occurrence of an event.  Upon detecting 702 the user "manual tag" input, the system creates an event-based clip as described above with reference to FIG. 7.  In an alternative embodiment, the user indication may include a voice command, a Bluetooth transmitted signal, or the like.  For example, in one embodiment, a user may utter a predetermined word or set of words (e.g., "Owl make a note").  Upon detecting the utterance in the audio input, the system may provide a cue to indicate the recognition.  For example, the client device 101 may beep, vibrate, or output speech to indicate recognition of a manual tag.  Optionally, additional user speech may be input to provide a name or descriptor for the event-based video clip resulting for the user manual tag input.  For example, a short description of the event may be uttered by the user.  The user's utterance is processed by a speech-to-text algorithm and the resulting text is stored as metadata associated with the video clip.  For example, in one embodiment, the name or descriptor provided by the user may be displayed on the mobile app as the clip descriptor 402 in the clips pane 401a of the mobile app. In another embodiment, the additional user speech may include additional commands.  For example, the user may indicate the length of the event for which the manual tag was indicated, e.g., "short" for a 30-second recording, "long" for a two-minute recording, or the like.  Optionally, the length of any video clip can be extended based on user input.  For example, after an initial event-based video clip is generated, the user may review the video clip and request additional time before or after and the associated video data is added to the playlist or manifest file as described with reference to FIG. 7.”

“[0089] Now referring to FIG. 8, a method for identifying and sharing event-based video clips is described.  In addition to the various options for sharing video clips identified above, in one embodiment, video clips may also be shared based on their potential relevance to events generated by different camera devices.  To do so, in one embodiment, a video clip sharing request is received 800.  The video clip sharing request 800 may be user-generated or automatically generated.  For example, in one embodiment, a map can be accessed displaying the location of camera devices for which a user may request shared access.  The user can select the camera device or devices it wants to request video from.  In an alternative embodiment, the user enters a location, date, and time for which video is desired to generate a sharing request.

[0090] In yet another embodiment, a user may select an object (e.g., a car, person, item, or the like) being displayed on the screen of a camera device.  For example, via a tap on a touchscreen of a client device 101 while video is being played, using voice commands, or other user input device capable of identifying objects being displayed on a video.  Optionally, an object of interest can also be identified on a video automatically.  For example, as part of the auto-tagging feature described above with reference to FIG. 7, some of the inputs monitored 701 may include objects of interest resulting from image processing techniques.  For example, if a tagging-event is determined to be a break-in and one of the monitored inputs includes a detected human face that is not recognized, the unrecognized face may be used as the selected object.”

“[0091] Image processing algorithms and/or computer vision techniques are applied to identify the selected object from the video and formulate an object descriptor query.  For example, the user input is applied to detect the region of interest in the image, e.g., the zoomed-in region.  The data for the relevant region is processed into a vector representation for image data around detected relevant points in the mage region.  From the vector or descriptor of the relevant region, feature descriptors are then extracted based on, for example, second-order statistics, parametric models, coefficients obtained from an image transform, or a combination of these approaches.  The feature-based representation of the object in the image is then used as a query for matching in other video data.  In one embodiment, a request for sharing video clips includes an image query for an object and metadata from the video data in which the object was detected.”).

	Thus, Hodge does not teach as a whole, as indicated in bold above, the claimed:
A.	identifying based on a time point at which the user voice input is received; and
B.	“identifying an object included in the second frame based on the second frame and the user voice input; and 
providing a search result for the information about the object included in the second frame”.
Accordingly, Sanchez teaches as a whole:
A.	identifying (expressing action or result of identify via “determine a frame”) based on a time point (or beginning that begins at fig. 10:1010: “Receive Summary Command”) at which the user voice input (via fig. 2:210: “microphone”) is received, wherein the second frame is a part of the time point when the user voice input is started to be received; and
B.	identifying an object (or a baker via fig. 2: “Repeat the scene where the baker enters the cake contest” corresponding to “identify…events…in an earlier point of the show”) included in the second (said via fig. 2: “Repeat” corresponding to said “earlier point of the show” in contrast to a current point of the show) frame (as shown in fig. 1 or fig. 3:330 or fig. 4:450,460 comprising or involving as a factor said “earlier point of the show” comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command”) based on the (compiled) second frame and the user voice input (said via fig. 2:210: “Repeat the scene where the baker enters the cake contest.”); and 
providing a search (via “search a catalogue”) result (as shown in figure 3) for the information (or “other information”) about the object (said baker corresponding to “identify…events…in an earlier point of the show”) included in the second (via fig. 2: “Repeat” corresponding to said “earlier point of the show”, comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command”, in contrast to a current point of the show) frame (said as shown in fig. 1 or fig. 3:330 or fig. 4:450,460 comprising or involving as a factor said “earlier point of the show” comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command” via:
c.2,l. 62 to c.3,l.34:
“After the media guidance application determines which scene or scenes it will include in the summarized content, the media guidance application compiles the summarized content. For example, the media guidance application compiles summary content of the related scene or scenes by analyzing the video content of the related scene or scenes and extracting pertinent video frames. The media guidance application may use machine vision algorithms to determine a frame when a new character enters the related scene. The media guidance application marks the identified frame, and in some cases a predetermined number of frames before the character entered the scene and/or a predetermined number of frames after the character entered the scene, for inclusion in the summarized content. Furthermore, the media guidance application may analyze motion vectors present in the digital representation of a scene, e.g., within an MPEG stream, to identify frames associated with a large amount of image motion suggesting large visual changes in the scene. The media guidance application may mark the frames with a large amount of image motion for inclusion in the summarized content. Still further, the media guidance application may identify key portions of an image frame, such as the portion of the image centered near a rule of thirds intersection points, are in focus. In one embodiment, the media guidance application extracts an A×B portion (e.g., 8 pixel by 8 pixel image block from a frame) coincident to a focal point and calculates the local maximum frequency of the image to make a determination whether the frame is in focus. Using focus information, the media guidance application may mark frames for inclusion based on a change in focus information. Still other examples may locate a first frame of the related scene or scenes and track when a focal point of the scene changes according to a pre-determined threshold to identify key frames for inclusion in the summarized content. 
In some embodiments, the media guidance application may rely on metadata correlated with the related scene or scenes to identify the key frames which are marked for inclusion in the summarized playback content. The media guidance application then compiles a collection of the marked frames as the summarized playback content.”

c. 3,l. 59 to c.4,l.13:
“In some embodiments, the media guidance application 100 will determine the related scene or scenes using information from the current scene.  For example, the media guidance application 100 may determine a current playback position of the current scene in a media asset being viewed in the first display.  The media guidance application 100 identifies information associated with the current scene based on the current playback position.  For example, the media guidance application 100 may identify that a character in a scene is talking to a second character about events that happened in an earlier point of the show or a related show.  The media guidance application 100 may compare the identifying information with other information associated with a plurality of relevant scenes.  For example, the media guidance application 100 may use the topics discussed by characters to search a catalogue of scenes from the current episode or other episodes from the current show.  The media guidance application 100 may then determine a related scene from other scenes of the show based on that comparison.  As discussed above, the media guidance application 100 compiles summarized playback content, wherein the summarized playback content is associated with the current scene and the related scene.”

Thus, one of ordinary skill in the art of television, as indicated in Hodge’s “a television transceiver”:
“[0119] One or more processors in association with software in a computer-based system may be used to implement methods of video data collection, cloud-based data collection and analysis of event-based data, generating event-based video clips, sharing event-based video, verifying authenticity of event-based video data files, and setting up client devices according to various embodiments, as well as data models for capturing metadata associated with a given video data object or file or for capturing metadata associated with a given event-based video clip according to various embodiments, all of which improves the operation of the processor and its interactions with other components of a computer-based system.  The camera devices according to various embodiments may be used in conjunction with modules, implemented in hardware and/or software, such as a cameras, a video camera module, a videophone, a speakerphone, a vibration device, a speaker, a microphone, a television transceiver, a hands free headset, a keyboard, a Bluetooth module, a frequency modulated (FM) radio 
unit, a liquid crystal display (LCD) display unit, an organic light-emitting diode (OLED) display unit, a digital music player, a media player, a video game player module, an Internet browser, and/or any wireless local area network (WLAN) module, or the like.”

can modify Hodge’s teaching of said fig. 8:805: “MATCH?”: “identify the…event” with Sanchez’s teaching of said fig. 2: “baker enters” corresponding to “identify…events…in an earlier point of the show” by:
a)	inserting Sanchez’s program of fig. 10:1000, comprising said fig. 2: “baker enters” corresponding to “identify…events…in an earlier point of the show”, into Hodge’s fig. 8: “SHARING REQUEST”;
b)	transmitting/receiving a television signal, such as said baking show, for said Hodge’s fig. 8: “SHARING REQUEST”; and
c)	recognizing that the modification is predictable or looked forward to because Hodge already teaches that “The camera devices according to various embodiments may be used in conjunction with…a television transceiver” (Hodge, cited above) and in addition “television” is “an electronically consumable user asset” that is a useful and desirable thing or is valuable or useful intended to be bought and used via Sanchez, c.17,ll. 19-47:
“Interactive media guidance applications may take various forms depending on the content for which they provide guidance.  One typical type of media guidance application is an interactive television program guide.  Interactive television program guides (sometimes referred to as electronic program guides) are well-known guidance applications that, among other things, allow users to navigate among and locate many types of content or media assets. Interactive media guidance applications may generate graphical user interface screens that enable a user to navigate among, locate and select content.  As referred to herein, the terms "media asset" and "content" should be understood to mean an electronically consumable user asset, such as television programming, as well as pay-per-view programs, on-demand programs (as in video-on-demand (VOD) systems), Internet content (e.g., streaming content, downloadable content, Webcasts, etc.), video clips, audio, content information, pictures, rotating images, documents, playlists, websites, articles, books, electronic books, blogs, chat sessions, social media, applications, games, and/or any other media or multimedia and/or combination of the same.  Guidance applications also allow users to navigate among and locate content.  As referred to herein, the term "multimedia" should be understood to mean content that utilizes at least two different content forms described above, for example, text, audio, images, video, or interactivity content forms.  Content 
may be recorded, played, displayed or accessed by user equipment devices, but 
can also be part of a live performance.”


Regarding claim 2, Hodge as combined teaches the method for controlling an electronic device of claim 1, further comprising: 
inputting the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into an artificial intelligence model trained (via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) through an artificial intelligence algorithm; and 
acquiring information (via fig. 8:805: “MATCH?”: “Yes”) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) included in the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on output (via fig. 8:805: “MATCH?”: “YES”) of the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?” via:

“[0084] These combinations of events and inputs are illustrative only.  Some embodiments may provide a subset of these inputs and/or events.  Other embodiments may provide different combinations of inputs and/or different events.  The event detection algorithms may be implemented locally on the camera device (e.g., client device 101) or may be performed in cloud servers 102, with the input signals and event detection outputs transmitted over the wireless communication connection 107/108 from and to the camera device.  Alternatively, in some embodiments a subset of the detection algorithms may be performed locally on the camera device while other detection algorithms are performed on cloud servers 102, depending for example, on the processing capabilities available on the client device.  Further, in one embodiment, artificial intelligence ("AI") algorithms are applied to the multiple inputs to identify the most likely matching event for the given combination of inputs.  For example, a neural network may be trained with the set of inputs used by the system to recognize the set of possible tagging events.  Further, a feedback mechanism may be provided to the user via the mobile app to accept or reject proposed tagging results to further refine the neural network as the system is used.  This provides a refinement process that improves the performance of the system over time.  At the same time, the system is capable of learning to detect false positives provided by the algorithms and heuristics and may refine them to avoid incorrectly tagging events.”).

Regarding claim 3, Hodge as combined teaches the method for controlling an electronic device of claim 2, 
wherein the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) comprises a trigger voice (comprising “ ‘trigger’ words… associated with particular events”) for initiating an inquiry (via fig. 8:803: “PROVIDE IMAGE QUERY”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the image region…used as a query”) included in the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) and an inquiry voice (said via “voice commands… identifying objects being displayed on a video” to “formulate an object descriptor query”, cited [0091]) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the image region…used as a query” via:
“[0082] Sound processing may also include speech recognition and natural language processing to recognize human speech, words, and/or commands. For example, certain "trigger" words may be associated with particular events. When the "trigger" word is found present in the audio data, the corresponding event may be determined. Similarly, the outputs of the available sensors may be received and processed to determine presence of patterns associated with events. For example, GPS signals, accelerator signals, gyroscope signals, magnetometer signals, and the like may be received and analyzed to detect the presence of events. In one embodiment, additional data received via wireless module 205, such as traffic information, weather information, police reports, or the like, is also used in the detection process. The detection process 702 applies algorithms and heuristics that associate combinations of all these potential inputs with possible events.”), and 

wherein the inputting (via said “multiple inputs”) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting (via said “multiple inputs”) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on the (said “displayed”) time point (said during said “voice commands… identifying objects being displayed on a video”) when (at fig. 8:805: “MATCH?”) the trigger voice (said comprising “ ‘trigger’ words… associated with particular events” of interest to a user or driver identifying said “Nice Spot!”) is received (via said “multiple inputs”) into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”).
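For illustration only, the association of “trigger” words with events described in Hodge [0082], quoted above, can be sketched as follows. This is a minimal sketch under stated assumptions and not Hodge's implementation: it assumes the speech has already been converted to text, and the function name detect_trigger_events, the trigger_map dictionary, and the event labels are hypothetical.

```python
def detect_trigger_events(transcript, trigger_map):
    """Return the events whose trigger word occurs in the transcript.

    transcript: text assumed to come from a separate speech recognizer.
    trigger_map: hypothetical mapping of lower-case trigger word to event label.
    Events are returned in order of first occurrence, without duplicates.
    """
    events = []
    for raw_word in transcript.split():
        word = raw_word.strip(".,!?\"'").lower()  # drop surrounding punctuation
        event = trigger_map.get(word)
        if event is not None and event not in events:
            events.append(event)
    return events


# Hypothetical trigger words and event labels.
triggers = {"help": "emergency", "crash": "vehicle_crash"}
print(detect_trigger_events("Help! There was a crash.", triggers))
```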

Regarding claim 4, Hodge as combined teaches the method for controlling an electronic device of claim 2, wherein the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) comprises an image frame (said “video…image”) and an audio frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio” via:
“[0036] In one embodiment, client device 101 also includes a touchscreen 211. In alternative embodiments, other user input devices (not shown) may be used, such a keyboard, mouse, stylus, or the like.  Touchscreen 211 may be a capacitive touch array controlled by touchscreen module 208 to receive touch input from a user.  Other touchscreen technology may be used in alternative embodiments of touchscreen 211, such as for example, force sensing touch screens, resistive touchscreens, electric-field tomography touch sensors, radio-frequency (RF) touch sensors, or the like.  In addition, user input may be received through one or more microphones 212.  In one embodiment, microphone 212 is a digital microphone connected to audio module 206 to receive user 
spoken input, such as user instructions or commands.  Microphone 212 may also be used for other functions, such as user communications, audio component of video recordings, or the like.  Client device may also include one or more audio output devices 213, such as speakers or speaker arrays.  In alternative embodiments, audio output devices 213 may include other components, such as an automotive speaker system, headphones, stand-alone "smart" speakers, or the like.
[0037] Client device 101 can also include one or more cameras 214, one or more sensors 215, and a screen 216.  In one embodiment, client device 101 includes two cameras 214a and 214b.  Each camera 214 is a high definition CMOS-based imaging sensor camera capable of recording video one or more video modes, including for example high-definition formats, such as 1440p, 1080p, 720p, and/or ultra-high-definition formats, such as 2K (e.g., 2048.times.1080 or similar), 4K or 2160p, 2540p, 4000p, 8K or 4320p, or similar video modes. Cameras 214 record video using variable frame rates, such for example, frame rates between 1 and 300 frames per second.  For example, in one embodiment cameras 214a and 214b are Omnivision OV-4688 cameras.  Alternative cameras 214 may be provided in different embodiments capable of recording video in any combinations of these and other video modes.  For example, other CMOS sensors or CCD image sensors may be used.  Cameras 214 are controlled by video module 207 to record video input as further described below.  A single client device 101 may include multiple cameras to cover different views and angles.  For 
example, in a vehicle-based system, client device 101 may include a front camera, side cameras, back cameras, inside cameras, etc.”; and

“[0081] According to another aspect of the disclosure, detection of tagging events 702 may be done automatically by the system.  For example, based on the monitored inputs, in different embodiments events such as a vehicle crash, a police stop, or a break in, may be automatically determined.  The monitored inputs 701 may include, for example, image processing signals, sound processing signals, sensor processing signals, speech processing signals, in any combination.  In one embodiment, image processing signals includes face recognition algorithms, body recognition algorithms, and/or object/pattern detection algorithms applied to the video data from one or more cameras.  For example, the face of the user may be recognized being inside a vehicle.  As another example, flashing lights from police, fire, or other emergency vehicles 
may be detected in the video data.  Another image processing algorithm detects the presence of human faces (but not of a recognized user), human bodies, or uniformed personnel in the video data.  Similarly, sound processing signals may be based on audio recorded by one or more microphones 212 in a camera device, (e.g., client device 101, auxiliary camera 106, or mobile device 104).  In one embodiment sound processing may be based on analysis of sound patterns or signatures of audio clips transformed to the frequency domain.  For example, upon detection of a sound above a minimum threshold level (e.g., a preset number of decibels), the relevant sound signal is recorded and a Fast Fourier Transform (FFT) is performed on the recorded time-domain audio signal as is known in the art.  The frequency-domain signature of the recorded audio signal is then compared to known frequency domain signatures for recognized events, such as, glass breaking, police sirens, etc. to determine if there is a match.  
For example, in one embodiment, pairs of points in the frequency domain signature of the recorded audio input are determined and the ratio between the selected points are compared to the ratios between similar points in the audio signatures of recognized audio events.”), 

wherein the inputting the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises matching (via fig. 8:805: “MATCH?” via “audio…match”, cited above: [0081]) the image frame (via said “video…image”) and the audio (said via “audio recorded…in a camera”) frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”), and
wherein the identifying the object included in the second frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”) based on output (via fig. 8:805: “MATCH?”: “YES”) of the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting the image frame (said “video…image”) and the audio (said via “audio recorded…in a camera”) frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”) into the artificial intelligence model (said via “trained” “artificial intelligence ("AI") algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) wherein said “video…frames” or the claimed “frame” comprises film further comprising “recording and reproduction of images” or “ recording and reproduction of both images and sound” via Dictionary.com:
film
noun
Movies.
a strip of transparent material, usually cellulose triacetate, covered with a photographic emulsion and perforated along one or both edges, intended for the recording and reproduction of images.
a similar perforated strip covered with an iron oxide emulsion (magfilm ), intended for the recording and reproduction of both images and sound.
a movie; motion picture: We decided to stay home and watch a Kurosawa film.
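For illustration only, the frequency-domain signature comparison described in Hodge [0081], quoted above, can be sketched as follows. This is a minimal sketch and not Hodge's implementation: a naive discrete Fourier transform is used in place of an FFT to keep the example dependency-free, and the function names, the caller-chosen bin pairs, and the relative tolerance are assumptions introduced here.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the non-negative-frequency bins of a naive DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


def ratio_signature(magnitudes, bin_pairs):
    """Ratio between the magnitudes at each selected pair of frequency bins."""
    return [magnitudes[a] / max(magnitudes[b], 1e-12) for a, b in bin_pairs]


def signatures_match(recorded, reference, bin_pairs, tolerance=0.1):
    """True when every ratio of the recording is within a relative tolerance
    (an assumed parameter) of the corresponding ratio of the reference."""
    rec = ratio_signature(dft_magnitudes(recorded), bin_pairs)
    ref = ratio_signature(dft_magnitudes(reference), bin_pairs)
    return all(abs(r - s) <= tolerance * max(abs(s), 1e-12) for r, s in zip(rec, ref))


# Hypothetical reference sound: two tones at bins 4 and 8 of a 64-sample window.
n = 64
reference = [math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 8 * t / n)
             for t in range(n)]
louder_copy = [3 * s for s in reference]
print(signatures_match(louder_copy, reference, bin_pairs=[(4, 8)]))
```

Because the comparison uses ratios between points rather than absolute magnitudes, a louder recording of the same sound yields the same signature in this sketch.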

Regarding claim 5, Hodge as combined teaches the method for controlling an electronic device of claim 4, further comprising:
matching (said via fig. 8:805: “MATCH?”: “Yes”) information on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) with the image frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) in which the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) appeared, and storing the (“matching”) information and the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected” such that “the user may access clips generated” via:
“[0095] Responses to the search request are received 804. If no matches are found 805, the sharing request process ends 806. For example, if the search request was initiated by a user, the user may be notified that no matching video clips were found. If matching video clips are found 805, an authorization request is sent 807 to the user of the camera device responding with a match. As discussed above with reference to FIG. 4a-c, the clips generated from camera devices of the user may be listed under the clips pane 401a. Thus, the user may access clips generated 705 from a client device 101, an auxiliary camera 106, a mobile device 104, without further authorization requirement. For example, in one embodiment, when the camera devices with video clips matching the same event, such as a break-in, are registered to the same user account, the user may directly access the shared video clips from one or more home auxiliary cameras 106 that captured the same break-in as the dash-mounted client device 101 from different vantage points. Thus, for example, a user may be able to provide related video clips to the authorities showing a perpetrator's face (from an IN-camera device), a "get-away" vehicle from an auxiliary home camera device located in a carport, and a license plate for the get-away vehicle from a driveway auxiliary camera device. The video clips for the break-in event could be automatically generated and associated as "related" clips from multiple camera devices integrated by the system according to one embodiment of the invention.”).
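For illustration only, the handling of search responses described in Hodge [0095], quoted above, can be sketched as follows. This is a minimal sketch and not Hodge's implementation: the function name, the (clip_id, owner_account) pair format, and the returned dictionary are assumptions introduced here.

```python
def process_sharing_responses(requesting_account, matches):
    """Split matching clips into directly accessible and authorization-pending.

    matches: hypothetical list of (clip_id, owner_account) pairs returned by
    the search. Clips from devices registered to the requesting account are
    accessible without further authorization; for the rest an authorization
    request would be sent to the owner of the responding camera device.
    """
    if not matches:
        # No matches found: the sharing request process ends.
        return {"status": "ended", "accessible": [], "authorization_requests": []}
    accessible = [clip for clip, owner in matches if owner == requesting_account]
    pending = [(clip, owner) for clip, owner in matches if owner != requesting_account]
    return {
        "status": "matched",
        "accessible": accessible,
        "authorization_requests": pending,
    }


# Hypothetical accounts and clip identifiers.
result = process_sharing_responses(
    "alice", [("dash-cam-clip", "alice"), ("driveway-clip", "bob")]
)
print(result["accessible"], result["authorization_requests"])
```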

Regarding claim 6, Hodge as combined teaches the method for controlling an electronic device of claim 2, further comprising: 
determining (via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) information (for matching) on an object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) corresponding to a user's voice instruction (or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”, cited below: [0086]) among the information (said comprised by said “files”) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”), and 
wherein the providing (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) comprises transmitting (represented in fig. 1 as dashed lines) the determined (said via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) corresponding to the user’s voice instruction (said or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”) to an external search server (or fig. 1:102:the cloud) and providing (via fig. 4a:403a-c & 402a-c) the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) received (represented in fig. 1 as dashed lines) from the external search server (said or fig. 1:102:the cloud via:

“[0086] According to another aspect of the disclosure, in one embodiment, the detection process 702 is configured to detect a user-determined manual tagging of an event.  The user may provide an indication to the system of the occurrence of an event of interest to the user.  For example, in one embodiment, a user may touch the touchscreen of a client device 101 to indicate the occurrence of an event.  Upon detecting 702 the user "manual tag" input, the system creates an event-based clip as described above with reference to FIG. 7.  In an alternative embodiment, the user indication may include a voice command, a Bluetooth transmitted signal, or the like.  For example, in one 
embodiment, a user may utter a predetermined word or set of words (e.g., "Owl make a note").  Upon detecting the utterance in the audio input, the system may provide a cue to indicate the recognition.  For example, the client device 101 may beep, vibrate, or output speech to indicate recognition of a manual tag.  Optionally, additional user speech may be input to provide a name or descriptor for the event-based video clip resulting for the user manual tag input.  For example, a short description of the event may be uttered by the user.  The user's utterance is processed by a speech-to-text algorithm and the resulting text is stored as metadata associated with the video clip.  For example, in one embodiment, the name or descriptor provided by the user may be displayed on the mobile app as the clip descriptor 402 in the clips pane 401a of the mobile app. In another embodiment, the additional user speech may include additional 
commands.  For example, the user may indicate the length of the event for which the manual tag was indicated, e.g., "short" for a 30-second recording, "long" for a two-minute recording, or the like.  Optionally, the length of any video clip can be extended based on user input.  For example, after an initial event-based video clip is generated, the user may review the video clip and request additional time before or after and the associated video data is added to the playlist or manifest file as described with reference to FIG. 7.”).
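For illustration only, the manual-tag voice input described in Hodge [0086], quoted above, can be sketched as follows. This is a minimal sketch and not Hodge's implementation: it assumes the utterance has already been converted to text, and the function name parse_manual_tag, the default_seconds parameter, and the returned dictionary are hypothetical. The wake phrase and the “short” (30-second) and “long” (two-minute) lengths are taken from the quoted paragraph.

```python
WAKE_PHRASE = "owl make a note"
CLIP_LENGTH_SECONDS = {"short": 30, "long": 120}  # lengths stated in [0086]


def parse_manual_tag(utterance, default_seconds=60):
    """Parse a manual-tag utterance into a clip length and a descriptor.

    Returns None when the utterance does not begin with the wake phrase.
    default_seconds is an assumed fallback; the quoted text does not give one.
    """
    text = utterance.lower().strip()
    if not text.startswith(WAKE_PHRASE):
        return None
    remainder = text[len(WAKE_PHRASE):].strip(" ,.")
    length = default_seconds
    descriptor_words = []
    for raw_word in remainder.split():
        word = raw_word.strip(".,!?")
        if word in CLIP_LENGTH_SECONDS:
            length = CLIP_LENGTH_SECONDS[word]  # additional length command
        elif word:
            descriptor_words.append(word)  # kept as clip descriptor metadata
    return {"length_seconds": length, "descriptor": " ".join(descriptor_words)}


print(parse_manual_tag("Owl make a note, long, nice spot"))
```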

Regarding claim 7, Hodge as combined teaches the method for controlling an electronic device of claim 6, wherein the determining (said via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) further comprises: 
displaying a user interface (UI) (via figures 4a,b,c) identifying (via fig. 7:706: “NOTIFICATIONS”) whether (said via fig. 8:805: “MATCH?”: “Yes”: “No”) the information (said for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) is information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”) corresponding to the user's voice instruction (said or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”) among the information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”), or identifying whether there is an additional inquiry for inquiring additional information.

Regarding claim 8, Hodge as combined teaches the method for controlling an electronic device of claim 1, wherein the providing (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) comprises: 
providing the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) and a (“zoomed-in region”, cited in the rejection of claim 1) frame corresponding to the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) in an area (said  “zoomed-in region”) of the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) while the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) is being reproduced (via fig. 4a,b,c).
Regarding claim 9, Hodge as combined teaches the method for controlling an electronic device of claim 1, comprising: 
transmitting (via the dashed lines in fig. 1) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103”) to an external server (fig. 1:102:the cloud) for acquiring (matching) information on frames; and 
acquiring (said via the dashed lines in fig. 1) information (for matching) on the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103”) from the external server (said fig. 1:102:the cloud).

Regarding claim 21, Hodge as combined teaches via Sanchez the method of claim 1, wherein the identifying of the object (via fig. 8:arrows being input: 800: “SHARING REQUEST” as modified via Sanchez’s fig. 10:1000) comprises obtaining a keyword (via Sanchez: fig. 12:1050A:1220: “Generate key phrase from dialogue”: represented in fig. 10:1050: “Determine a related scene based on the comparison”) corresponding to the object based on the second frame (represented in fig. 10:1040: “Compare information with relevant scenes”), and 
wherein the providing includes transmitting the keyword to an external search (said via “search a catalogue”) server, and 
receiving the search result from the external search server based on the keyword.
Thus, the combination does not teach, as indicated in bold above, the claimed:
“transmitting the keyword to an external search server, and 
receiving the search result from the external search server based on the keyword”.
Accordingly, Sanchez, as already combined above, further teaches:
transmitting (via fig. 9:920: a “communication path”, c. 26,ll. 31-36) the keyword to an external search (said via “search a catalogue”) server (or fig. 9:916: “Media Content Source”), and 
receiving (said via fig. 9:920: a “communication path”) the search result from the external search server based on the keyword.

Thus, one of ordinary skill in the art of asking questions or querying as taught by both references can modify the combination’s said fig. 8:arrows being input: 800: “SHARING REQUEST” as modified via Sanchez’s fig. 10:1000 with Sanchez’s further teaching of said fig. 9:920: a “communication path” by:
a)	making Hodge’s fig. 1:100, a network, be as Sanchez’s fig. 9:900: a network;
b)	installing a “search engine” (Sanchez, c.32,ll. 61-64) in each of Hodge’s fig. 1:101,105,104;
c)	searching Hodge’s fig. 1:102, a server system:
c1)	sending inquiring key phrases or words, such as:
“Red Wedding” (Sanchez, c.33, ll. 56-59); or 
“Nice car parking spots!” regarding Hodge:402c:“Nice Spot!”; and
c2)	receiving a result or a response to each question via the “communication path”; and
d)	recognizing that the modification is predictable or looked forward to because the modification allows one to search with “weight” (Sanchez, c.32,ll. 64-67) or relative importance regarding each word of the phrase, such as “Red (10% weight) Wedding (90% weight)”, so as to “extract pertinent features” (Sanchez, c. 33,ll. 11-16) from owned assets that are useful and desirable things or is valuable or useful intended to be bought and used, such as:
d1)	the book or video of “Game of Thrones” (Sanchez, c.33,ll. 11-16); or 
d2)	shared, via Hodge: fig. 8:800: “SHARING REQUEST”, video of parked cars.
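For illustration only, the weighted key-phrase search discussed in item d) above can be sketched as follows. This is a minimal sketch and not Sanchez's search engine: the function names, the catalogue dictionary, and its scene descriptions are hypothetical, and the 10%/90% weights are the example weights given in item d).

```python
def weighted_score(weights, description):
    """Sum the weights of the key-phrase terms present in a description."""
    tokens = set(description.lower().split())
    return sum(weight for term, weight in weights.items() if term.lower() in tokens)


def weighted_search(weights, catalogue):
    """Return catalogue titles ordered by descending weighted score.

    catalogue: hypothetical mapping of title to a text description.
    Entries that match no term are omitted from the result.
    """
    scored = [(weighted_score(weights, text), title) for title, text in catalogue.items()]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [title for score, title in ranked if score > 0]


# Example weights from item d); the catalogue entries are invented.
weights = {"red": 0.1, "wedding": 0.9}
catalogue = {
    "scene A": "the red door",
    "scene B": "a wedding feast",
    "scene C": "the red wedding",
    "scene D": "dragons over the sea",
}
print(weighted_search(weights, catalogue))
```

In this sketch the scene matching both terms ranks first, followed by the scene matching only the heavily weighted term.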
Regarding claim 11, claim 11 is rejected on the same grounds as claim 1. Thus, the argument presented for claim 1 is equally applicable to claim 11. Accordingly, Hodge teaches claim 11 of an electronic device comprising: 
a display (fig. 2:208: TOUCH SCREEN MODULE); 
a communicator (fig. 2:205: WIRELESS MODULE); 
a microphone (fig. 2:206: AUDIO MODULE); 
memory (fig. 2:203: MEMORY MODULE) storing at least one instruction (via figures 5,6a and 7-11); and 
a processor (fig. 2:201: PROCESSING MODULE) coupled to the display (said fig. 2:208: TOUCH SCREEN MODULE), the communicator (said fig. 2:205: WIRELESS MODULE), the microphone (said fig. 2:206: AUDIO MODULE) and the memory (said fig. 2:203: MEMORY MODULE), and controlling the electronic device (fig. 2:101), 
wherein the processor (said fig. 2:201: PROCESSING MODULE) is configured to execute the at least one instruction to: 
control the electronic device (said fig. 2:101) to store in the memory (said fig. 2:203: MEMORY MODULE) a plurality of frames (via fig. 2:207: VIDEO MODULE) of a video (via said fig. 2:208: TOUCH SCREEN MODULE) for a specific time period (said via “store video data for a predetermined and programmable set amount of time”, cited in the rejection of claim 1) while reproducing (via said fig. 2:208: TOUCH SCREEN MODULE) the video (said via fig. 2:207: VIDEO MODULE) on the display (said fig. 2:208: TOUCH SCREEN MODULE), 

based on the user voice input comprising the request for information about the object being received, identify based on a time point at which the user voice input is received, where the second frame is part of the time point when (at fig. 8:800: “SHARING REQUEST”) the user voice input (said identifying the “Nice Spot!”) is started to be received (via any one arrow in fig. 8), 
identify an object (via fig. 8:805: “MATCH?” identifying an event of interest to the user or driver or passenger) included in the (any) second frame (said “video…image”) based on the second frame and the user voice input (that identified the “Nice Spot!” for parking), and 

 provide a search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) for the (matching) information about the object (said nice parking spot or “image data around detected relevant points in the mage region…used as a query”) included in the (any) second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103”, cited in the rejection of claim 1, for matching in fig. 8:805: “MATCH?”).     
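For illustration only, and not as a characterization of either reference, the limitations of storing frames for a specific time period and identifying a frame from the time point at which the user voice input starts can be sketched as below. The names FrameBuffer, add, and frame_at, the five-second retention window, and the rule of picking the latest frame at or before the start time are assumptions made for this sketch; neither the claims nor Hodge specify them.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class FrameBuffer:
    """Keeps (timestamp, frame) pairs for a fixed retention window (hypothetical helper)."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._frames: Deque[Tuple[float, str]] = deque()

    def add(self, timestamp: float, frame: str) -> None:
        # Store the newest frame, then drop frames older than the window.
        self._frames.append((timestamp, frame))
        while self._frames and timestamp - self._frames[0][0] > self.retention_seconds:
            self._frames.popleft()

    def frame_at(self, voice_start: float) -> Optional[str]:
        # Return the latest stored frame at or before the voice-input start time.
        chosen = None
        for timestamp, frame in self._frames:
            if timestamp <= voice_start:
                chosen = frame
            else:
                break
        return chosen


# Assumed scenario: one frame per second while the video is reproduced.
buffer = FrameBuffer(retention_seconds=5.0)
for second in range(10):
    buffer.add(float(second), f"frame-{second}")

# The voice input is assumed to start 7.4 seconds into playback.
second_frame = buffer.frame_at(voice_start=7.4)
```

In this sketch a voice input that starts before the oldest retained frame yields no frame at all, which is one possible design choice among several.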
Thus as discussed above, Hodge does not teach, as indicated in bold above, the claimed:
A.	“identify based on a time point at which the user voice input is received”; and
B.	“identify an object included in the second frame based on the user voice input, and 
provide a search result for the information about the object included in the second frame”.

Accordingly, as discussed above, Sanchez teaches:
A.	identify based on a time point (via said beginning) at which the user voice input is received; and
B.	identify an object (via fig. 2: “baker enters” corresponding to “identify…events…in an earlier point of the show”) included in the second (via fig. 2: “Repeat” corresponding to said “earlier point of the show” in contrast to a current point of the show) frame (as shown in fig. 1 or fig. 2:330 or fig. 4:450,460 comprising said “earlier point of the show”) based on the user voice input (via fig. 2:210: “Repeat the scene where the baker enters the cake context.”), and 
provide a search (via “search a catalogue”) result (as shown in figure 3) for the information (or “other information”) about the object (said via fig. 2: “baker” corresponding to “identify…events…in an earlier point of the show”) included in the second (via fig. 2: “Repeat” corresponding to said “earlier point of the show” in contrast to a current point of the show) frame (said as shown in fig. 1 or fig. 2:330 or fig. 4:450,460 comprising said “earlier point of the show”).

Thus, as discussed above, one of ordinary skill in the art of television, as indicated by Hodge’s “a television transceiver”, can modify Hodge’s teaching of said fig. 8:805: “MATCH?”: “identify the…event” with Sanchez’s teaching of said fig. 2: “baker enters” corresponding to “identify…events…in an earlier point of the show” by:
a)	inserting Sanchez’s program of fig. 10:1000, comprising said “fig. 2: ‘baker enters’ corresponding to ‘identify…events…in an earlier point of the show’”, into Hodge’s fig. 8: “SHARING REQUEST”;
b)	transmitting/receiving a television signal, such as said baking show, for said Hodge’s fig. 8: “SHARING REQUEST”; and
c)	recognizing that the modification is predictable or looked forward to because Hodge already teaches that “The camera devices according to various embodiments may be used in conjunction with…a television transceiver” (Hodge, cited above) and in addition “television” is “an electronically consumable user asset” that is a useful and desirable thing or is valuable or useful intended to be bought and used via Sanchez.

Regarding claim 12, claim 12 is rejected the same as claim 2. Thus, argument presented in claim 2 is equally applicable to claim 12.
Regarding claim 13, claim 13 is rejected the same as claim 3. Thus, argument presented in claim 3 is equally applicable to claim 13.
Regarding claim 14, claim 14 is rejected the same as claim 4. Thus, argument presented in claim 4 is equally applicable to claim 14.
Regarding claim 15, claim 15 is rejected the same as claim 5. Thus, argument presented in claim 5 is equally applicable to claim 15.
Regarding claim 16, claim 16 is rejected the same as claim 6. Thus, argument presented in claim 6 is equally applicable to claim 16.
Regarding claim 17, claim 17 is rejected the same as claim 7. Thus, argument presented in claim 7 is equally applicable to claim 17.
Regarding claim 18, claim 18 is rejected the same as claim 8. Thus, argument presented in claim 8 is equally applicable to claim 18.
Regarding claim 19, claim 19 is rejected the same as claim 9. Thus, argument presented in claim 9 is equally applicable to claim 19.

Claims 3 and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271) as applied above further in view of Diamant et al. (US Patent App. Pub. No.: US 2019/0027147 A1). Note that claims 3 and 13 are twice rejected under 35 USC 103 because the claimed “inquiry voice” has multiple meanings thus multiple rejections.
Regarding claim 3, Hodge as combined teaches the method for controlling an electronic device of claim 2, 
wherein the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) comprises a trigger voice (comprising “ ‘trigger’ words… associated with particular events”) for initiating an inquiry (via fig. 8:803: “PROVIDE IMAGE QUERY”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query”) included in the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) and an inquiry voice (said via “voice commands… identifying objects being displayed on a video” to “formulate an object descriptor query”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query” via:

“[0082] Sound processing may also include speech recognition and natural language processing to recognize human speech, words, and/or commands.  For example, certain "trigger" words may be associated with particular events. When the "trigger" word is found present in the audio data, the corresponding event may be determined.  Similarly, the outputs of the available sensors may be received and processed to determine presence of patterns associated with events.  For example, GPS signals, accelerator signals, gyroscope signals, magnetometer signals, and the like may be received and analyzed to detect the presence of events.  In one embodiment, additional data received via wireless module 205, such as traffic information, weather information, police reports, or the like, is also used in the detection process.  The detection process 702 applies algorithms and heuristics that associate combinations of all these potential inputs with possible events.”), and 

wherein the inputting (via said “multiple inputs”) the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting (via said “multiple inputs”) the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on the (said “displayed”) time point (said during said via “voice commands… identifying objects being displayed on a video”) when (at fig. 8:805: “MATCH?”) the trigger voice (said comprising “ ‘trigger’ words… associated with particular events” of interest to a user or driver identifying said “Nice Spot!”) is received (via said “multiple inputs”) into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”).
Thus, Hodge as combined does not teach, as indicated in bold above, the “inquiry voice” meaning that the voice itself is the inquiry in contrast to an object-of-interest identifying voice serving as a basis of an inquiry descriptor as discussed in the rejection of claim 3 under 35 USC 102.

Accordingly, Diamant teaches:
an inquiry voice (via fig. 2A: 116: “Hey Ayeye, what is [this]?”).
Thus, one of ordinary skill in audio can modify Hodge’s trigger event words with Diamant’s teaching of fig. 2A: 116: “Hey Ayeye, what is [this]?” and recognize that the modification is predictable or looked forward to because Diamant’s teaching is “operative to perform intent understanding for identifying…information the user would like to obtain” such that “overall user experience…is enhanced” via Diamant:
“[0026] The intent system 126 is operative to receive the text translated from the received utterance 116 and the objects and text recognized from the captured image 136, and interpret the content of the image as part of the search query or command indicated in the utterance.  According to one aspect, the intent system 126 recognizes and replaces the trigger 134 in the text translated from the received utterance 116 with the identified object(s) and text from the captured image 136.  The intent system 126 is further operative to perform intent understanding for identifying an action the user 102 wants the client computing device 104 to take or information the user would like to obtain, conveyed in the spoken utterance 116.  According to an example, the intent system 126 is exposed as an API.

[0027] In some examples, the digital assistant 110 provides context information 138 to the image integrated query system 105.  Context data 138 can include, for example, time/date, the user's location, language, schedule, applications 108 installed on the client computing device 104, the user's preferences, the user's behaviors (in which such behaviors are monitored/tracked with notice to the user and the user's consent), stored contacts (including, in some cases, links to a local user's or remote user's social graph such as those maintained by external social networking services), call history, messaging history, browsing history, device type, device capabilities, and the like.  According to an aspect, the intent system 126 applies context data 138 that is available to it to enable interactions with the user 102 that are more natural and an overall user experience supported by the digital assistant 110 that is enhanced.  That is, the intent system 126 is operative to apply context data 138 provided to it by the digital assistant 110 to the combined text translated from the received utterance 116 and the objects and the text recognized from the captured image 136 for understanding the semantic intent of the search query or command indicated in the utterance 116.  According to examples, the intent system 126 uses natural language processing to process the combined text translated from the received utterance 116 and the objects and the text recognized from the captured image 136 in association with available context information 138.”
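As a hedged illustration of the substitution described in Diamant’s paragraph [0026], where the trigger in the translated utterance is replaced with the object(s) recognized in the captured image, a minimal sketch follows. The function name replace_trigger, the choice of “this” as the trigger, the example label, and the joining of several labels with “and” are assumptions made for this sketch and are not taken from Diamant.

```python
import re
from typing import List


def replace_trigger(utterance_text: str, trigger: str, recognized_labels: List[str]) -> str:
    """Substitute the trigger word with labels recognized in the captured image."""
    if not recognized_labels:
        return utterance_text  # nothing recognized, leave the query unchanged
    replacement = " and ".join(recognized_labels)
    # Match the trigger as a whole word, ignoring case; replace the first hit only.
    pattern = re.compile(rf"\b{re.escape(trigger)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda _match: replacement, utterance_text, count=1)


# Hypothetical recognition result for the captured image.
query = replace_trigger("Hey Ayeye, what is this?", "this", ["a golden retriever"])
```

The resulting text could then be passed to whatever intent-understanding step is used; that step is outside this sketch.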

Regarding claim 13, claim 13 is rejected the same as claim 3. Thus, argument presented in claim 3 is equally applicable to claim 13.

Claims 10 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271) as applied above further in view of Casper (US Patent Application No.: US 2015/0296250 A1).
Regarding claim 10, Hodge as combined teaches the method for controlling an electronic device of claim 9, wherein the external server (said fig. 1:102) recognizes a fingerprint included in the second frame (said via “video data may be uploaded to the cloud system 103”).
Thus, Hodge as combined does not teach “the external server recognizes a fingerprint included in the second frame”.
Accordingly, Casper teaches:
the external server (via figs. 6 and 7: “SERVER”) recognizes (via “one or more servers…are capable of…object…recognition”) a fingerprint (via fig. 3:360: “IDENTIFY… FINGERPRINT”) included in the second frame (via fig. 3:310: “VIDEO FRAME” via:
“[0105] Video processing server(s) 623 can include one or more servers that are capable of receiving, processing, storing, and/or delivering video content, performing object detection and/or recognition, receiving, processing, storing, and/or providing commerce information relating to merchandise items, searching for matching merchandise items, and/or performing any other suitable functions.”).

Thus, one of ordinary skill in the art of “Web-based…marketing materials” can modify Hodge’s fig. 1:102: “a server system” and uploading to the cloud, as shown in Hodge’s fig. 1:103: “cloud-based system”, corresponding to Hodge’s teaching of:
[0027] Referring now to FIG. 1, an exemplary vehicular video-based data capture and analysis system 100 according to one embodiment of the disclosure is provided.  Client device 101 is a dedicated data capture and recording system suitable for installation in a vehicle.  In one embodiment, client device 101 is a video-based dash camera system designed for installation on the dashboard or windshield of a car.  Client device 101 is connected to cloud-based system 103.  In one embodiment, cloud-based system 103 includes a server system 102 and network connections, such as for example, to Internet connections.  In one embodiment, cloud-based system 103 is a set of software services and programs operating in a public data center, such as an Amazon Web Services (AWS) data center, a Google Cloud Platform data center, or the like.  Cloud-based system 103 is accessible via mobile device 104 and web-based system 105.  In one embodiment, mobile device 104 includes a mobile device, such as an Apple iOS based device, including iPhones, iPads, or iPods, or an Android based device, like a Samsung Galaxy smartphone, a tablet, or the like.  Any such mobile device includes an application program or app running on a processor.  Web-based system 105 can be any computing device capable of running a Web browser, such as for example, a Windows™ PC or tablet, Mac Computer, or the like.  Web-based system 105 may provide access to information or marketing materials of a system operations for new or potential users.  In addition, Web-based system 105 may also optionally provide access to users via a software program or application similar to the mobile app further described below.  In one embodiment, system 100 may also include one or more auxiliary camera modules 106.  For example, one or more camera modules on a user's home, vacation home, or place of business.  Auxiliary camera module 106 may be implemented as a client device 101 and operate the same way.  In one embodiment, auxiliary camera module 106 is a version of client device 101 with a subset of components and functionality.  For example, in one embodiment, auxiliary camera module 106 is a single camera client device 101.

with Casper’s teaching of figs. 6 and 7: “SERVER” with “object…recognition” by:
a)	installing into Hodge’s fig. 1:102: “a server system” the object recognition/identification; and
b)	sending the marketing materials over the cloud to users/consumers/buyers such that the marketing materials are recognized/identified by Hodge’s fig. 1:102;
and thus said one of skill would recognize that the modification is predictable or looked forward to because Casper’s teaching uses the recognition/identification of “identified objects” to “provide a viewer…with an opportunity to purchase one or more merchandise items” via Casper:
“[0032] In some implementations, the mechanisms can be used in a variety of applications.  For example, the mechanisms can provide commerce information relating to merchandise items presented in video content.  More particularly, for example, the mechanisms can identify discrete objects in a video frame and match the discrete objects against products and other merchandise items that are available for sale in a product catalogue.  The mechanisms can then store commerce information relating to the merchandise items (e.g., prices, product names, sellers of the products, links to ordering information, etc.) in association with video frames of the video content (e.g., by timestamping the commerce information).  As another example, the mechanisms can provide commerce information relating to merchandise items presented in video content in a real-time manner.  In a more particular example, in response to receiving an indication that a viewer of the video content is interested in merchandise items presented in the video content (e.g., a user request to pause the playback of the video content), the mechanisms can retrieve commerce information relating to the merchandise items and present the commerce information to the viewer.  In this example, the mechanisms can provide a viewer that is consuming video content with an opportunity to purchase one or more merchandise items corresponding to identified objects in a video frame and/or an opportunity to place the one or more merchandise items in a queue for making a purchasing decision at a later time without leaving or navigating away from the presented video content.”
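For illustration only, the timestamped storage and pause-time retrieval described in Casper’s paragraph [0032] can be sketched as a lookup keyed by frame timestamp. The index layout, the function name commerce_info_at, the sample records, and the rule of returning the latest indexed frame at or before the pause time are assumptions made for this sketch; Casper does not specify them.

```python
from bisect import bisect_right
from typing import Dict, List

# Hypothetical index: frame timestamp in seconds -> merchandise records.
commerce_index: Dict[float, List[dict]] = {
    12.0: [{"name": "example lamp", "price": "19.99"}],
    47.5: [{"name": "example jacket", "price": "89.00"}],
}
timestamps = sorted(commerce_index)


def commerce_info_at(pause_time: float) -> List[dict]:
    """Return records stored for the latest indexed frame at or before pause_time."""
    position = bisect_right(timestamps, pause_time)
    if position == 0:
        return []  # paused before any indexed frame
    return commerce_index[timestamps[position - 1]]


# Assumed viewer action: playback paused 50 seconds into the video.
shown = commerce_info_at(50.0)
```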

Regarding claim 20, claim 20 is rejected the same as claim 10. Thus, argument presented in claim 10 is equally applicable to claim 20.

Suggestions
Applicant’s disclosure states:
“[0013] According to an embodiment of the disclosure as described above, a user becomes capable of searching information on an image content that the user is currently viewing more easily and intuitively through his or her voice, without stopping the reproduction of the image content.” 

Thus, applicant’s fig. 4:S455: “PROVIDING THE SEARCH RESULT” is “a user…searching…more easily and intuitively” because fig. 4:250: “AI SERVER” is doing all the heavy thinking, such as decision making regarding the search, for the user. The search is thus easy and intuitive for the user, who is freed from doing all the heavy thinking or lifting in formulating a search (such as thinking of all the synonyms, using wildcard terms such as “run*4” representative of the search term “running”, and determining whether to use the “with” sentence search operator versus the “same” paragraph search operator).
Claim 2 appears directed to freeing the user from doing all the heavy thinking in formulating a search; however, claim 2’s AI has no direct connection to claim 1’s searching. In contrast to claim 2, Sanchez (US 10,182,271) teaches a branch of artificial intelligence or machine learning or “natural language processing”, c.4,ll. 55-58, with respect to fig. 12:12210: “Extract dialogue associated with current scene”. Thus, applicant’s disclosed solution to conveniently searching as shown in applicant’s fig. 4 (or applicant’s figs. 4,5,12B being “interlocked” with AI) is an indication of non-obviousness in view of the rejection of claim 2.
Note that these suggestions are not provided with respect to overcoming 35 USC 101, 112, 102 and/or 103. These suggestions are mainly provided to seek out advantages in the disclosure regardless of 35 USC 101, 112, 102 and/or 103.
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENNIS ROSARIO whose telephone number is (571)272-7397. The examiner can normally be reached Monday-Friday, 9AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella can be reached on (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DENNIS ROSARIO/Examiner, Art Unit 2667    

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667