DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/24/2022 has been entered. Claims 1-22 are pending.
Claim Objections
Claims 1-22 are objected to because of the following informalities:
Regarding claim 1, claim 1 is objected to for not making a proper list of items (“items in a list”). Thus, claim 1 is interpreted as:
A method for controlling an electronic device comprising:
reproducing a video;
storing a plurality of frames of the reproduced video for a first time period while reproducing the video; 
receiving a user voice input of a user while reproducing a first frame of the video, the user voice input comprising a request for information about an object displayed in the video; 
based on the user voice input comprising the request for information about the object being received, identifying a second frame based on a time point at which the user voice input is received, wherein the second frame is a part of the stored plurality of frames, and is from a second time period, and is reproduced prior to the time point when the user voice input is started to be received; 
identifying an object included in the second frame based on the second frame and the user voice input; and 
providing a search result for the information about the object included in the second frame.  
wherein “comma” or “,” is defined via Dictionary.com:
comma
noun
the sign (,), a mark of punctuation used for indicating a division in a sentence, as in setting off a word, phrase, or clause, especially when such a division is accompanied by a slight pause or is to be noted in order to give order to the sequential elements of the sentence. It is also used to separate items in a list, to mark off thousands in numerals, to separate types or levels of information in bibliographic and other data, and, in Europe, as a decimal point.

 

Thus, claims 2-10, 21 and 22 are objected to for depending on claim 1.
	Regarding claim 11, claim 11 is objected to on the same basis as claim 1 for not making a proper list of items using the comma, “,”.
	Thus, claims 12-20 are objected to for depending on claim 11.

Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
Accordingly, 35 USC 112(f) is NOT invoked in claims 1-21. 






The following definitions are “taken” via MPEP 2111.01 III. "PLAIN MEANING" REFERS TO THE ORDINARY AND CUSTOMARY MEANING GIVEN TO THE TERM BY THOSE OF ORDINARY SKILL IN THE ART, 3rd paragraph, emphasis added:
“It is also appropriate to look to how the claim term is used in the prior art, which includes prior art patents, published applications, trade publications, and dictionaries. Any meaning of a claim term taken from the prior art must be consistent with the use of the claim term in the specification and drawings. Moreover, when the specification is clear about the scope and content of a claim term, there is no need to turn to extrinsic evidence for claim interpretation. 3M Innovative Props. Co. v. Tredegar Corp., 725 F.3d 1315, 1326-28, 107 USPQ2d 1717, 1726-27 (Fed. Cir. 2013) (holding that "continuous microtextured skin layer over substantially the entire laminate" was clearly defined in the written description, and therefore, there was no need to turn to extrinsic evidence to construe the claim).”

The claimed “reproducing” (as in “reproducing a video” in claim 1, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein “representation” is “taken” under MPEP 2111.01 III:
reproduce
verb (used with object), re·pro·duced, re·pro·duc·ing.
1	to make a copy, representation, duplicate, or close imitation of:
to reproduce a picture.

wherein “representation” is defined:
representation
noun
11	the act of portrayal, picturing, or other rendering in visible form.

The claimed “frame” (as in “storing a plurality of frames of the reproduced video” in claim 1, lines 3-4) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein meanings 8 and 10 are “taken”:
frame
noun
8	Movies. one of the successive pictures on a strip of film.
10	Computers. the information or image on a screen or monitor at any one time.

The claimed “while” (as in “receiving a user voice input of a user while reproducing a first frame of the video” in claim 1, line 5) is interpreted in light of applicant’s disclosure as one of skill in the art would and definition thereof via Dictionary.com, wherein definitions 3-6 are equally applicable:
while
conjunction
3	during or in the time that.
4	throughout the time that; as long as.
5	even though; although:
While she appreciated the honor, she could not accept the position.
6	at the same time that (showing an analogous or corresponding action):
The floor was strewn with books, while magazines covered the tables.

The claimed “first” of the claimed “first frame” in claim 1, line 5 is interpreted in light of applicant’s disclosure and “drawings” thereof under MPEP 2111.01 III.
The claimed “-ing” (as in “identifying a second frame” in claim 1, line 8) is interpreted in light of applicant’s disclosure and “drawings” of fig. 5:270: “FRAME RECOGNITION SERVER” and definition thereof under MPEP 2111.01 III via Dictionary.com (emphasis added), which describes the suffix as “expressing the action of the verb” (here, recognize or identify) “or its result”, as in the last two limitations of claim 1, “identifying an object” and “providing a search result”:
-ing
1	a suffix of nouns formed from verbs, expressing the action of the verb or its result, product, material, etc. (the art of building; a new building; cotton wadding). It is also used to form nouns from words other than verbs (offing; shirting). Verbal nouns ending in -ing are often used attributively (the printing trade) and in forming compounds (drinking song). In some compounds (sewing machine), the first element might reasonably be regarded as the participial adjective, -ing2, the compound thus meaning “a machine that sews,” but it is commonly taken as a verbal noun, the compound being explained as “a machine for sewing.”

The claimed “a” (as in “identifying a second frame” in claim 1, line 8) is interpreted in light of applicant’s disclosure as one of skill in the art would and definition thereof via Dictionary.com, wherein definitions 1-7 are equally applicable:
a1
indefinite article
1	not any particular or certain one of a class or group:
a man; a chemical; a house.
2	a certain; a particular:
one at a time; two of a kind; A Miss Johnson called.
3	another; one typically resembling:
a Cicero in eloquence; a Jonah.
4	one (used before plural nouns that are preceded by a quantifier singular in form): a hundred men (compare hundreds of men); a dozen times (compare dozens of times).
5	indefinitely or nonspecifically (used with adjectives expressing number):
a great many years; a few stars.
6	one (used before a noun expressing quantity):
a yard of ribbon; a score of times.
7	any; a single:
not a one.

The claimed “second” of the claimed “second frame” in claim 1, line 8 is interpreted in light of applicant’s disclosure and “drawings” thereof under MPEP 2111.01 III.








The claimed “point” (as in “identifying a second frame based on a time point” in claim 1, line 8) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein “a particular instant of time” or “a specific point or time” is “taken” as the meaning of “point” under MPEP 2111.01 III:
point
noun
19	a particular instant of time:
It was at that point that I told him he'd said enough.

BRITISH DICTIONARY DEFINITIONS FOR POINT
point
noun
10	a moment:
at that point he left the room
wherein “moment” is defined:
moment
noun
2	a specific instant or point in time:
at that moment the doorbell rang

wherein “point in time” is defined:
point in time

A particular moment, as in At no point in time had they decided to leave the country, or The exact point in time when he died has not been determined. Critics say this usage is wordy since in most cases either point or time will suffice. However, it has survived since the mid-1700s. Also see at this point.






The claimed “trigger” (as in “the user voice input comprises a trigger voice” in claim 3, line 2) is interpreted in light of applicant’s disclosure and definition thereof via Dictionary.com wherein meaning 3 is “taken”:
trigger
noun
3	anything, as an act or event, that serves as a stimulus and initiates or precipitates a reaction or series of reactions.

Response to Arguments
Claim Interpretation
Applicant states on page 10 of the remarks of 1/24/2022:
“Regarding the Examiner’s other interpretations of certain claim terms discussed on pages 4-6 of the Office Action, Applicant submits that the claims should be given their broadest reasonable interpretation in light of the original disclosure and all relevant case law, and does not acquiesce to any narrower interpretation of the claims.”

In response, the claimed “frame” has a broader interpretation in this Office action than in the previous Office action of 11/23/2021 in light of applicant’s disclosure:
“[0033] Hereinafter, various embodiments of the disclosure will be described with reference to the accompanying drawings. However, it should be noted that the various embodiments are not for limiting the technologies described in the disclosure to a specific embodiment, but should be interpreted to include all modifications, equivalents and/or alternatives of the embodiments of the disclosure. Meanwhile, with respect to the detailed description of the drawings, similar components may be designated by similar reference numerals.”

	Thus, the claimed “frame” encompasses in scope “the information or image on a screen or monitor at any one time”, as mentioned above for the definition of “frame”.
	Claim 1 appears to have been originally interpreted in the previous Office action of 11/23/2021 in light of applicant’s fig. 6:610-1,610-2,610-3,610-4,610-5 (like a film-strip); however, the original film-strip interpretation in the previous Office action of 11/23/2021 is not the broadest reasonable interpretation in view of paragraph [0033] of applicant’s disclosure.

Rejections Under 35 USC 103
Applicant's arguments filed 1/24/2022 have been fully considered but they are not persuasive. Applicant states on page 11:
“In the Office Action, the Examiner appears to be taking the position that the second frame can be "a part of" frames that were previously reproduced without being previously reproduced itself. Without any admissions, and solely in the interest of advancing prosecution, Applicant has amended the independent claims to explicitly specify that the second frame is "a part of the stored plurality of frames, and is from a second time period reproduced prior to the time point when the user voice input is started to be received".”

The examiner does not know or recall the basis of this position. Where does the position that the second frame can be "a part of" frames that were previously reproduced, without being previously reproduced itself, come from?
The examiner’s understanding is that the claimed “second frame is a part of the stored plurality of frames” includes in scope a second frame (for example, applicant’s fig. 6:610-1 at time 00:46) that was not itself stored (or reproduced as a copy) along with the rest of the frames; the second frame, though not reproduced or stored as a copy, is still a part of the overall movie of frames (fig. 6:610-1, 610-2, 610-3, 610-4, 610-5).

In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., “identifying a previously-reproduced frame” and “the identified object” via applicant’s remarks, page 13:
“Therefore, as discussed in Applicant's previous remarks, nothing in Sanchez discloses identifying a previously-reproduced frame based on a time point at which a user voice input is received, and then identifying an object in a previously-reproduced frame based on the previously-reproduced frame, and then obtaining a search result of information about the identified object.”

) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
	In contrast, claim 1 states:
	“wherein the second frame is…from a second time period reproduced prior to the time point when the user voice input is started to be received…
providing a search result for the information about the object included in the second frame”.







Applicant's arguments filed 1/24/2022 have been fully considered but they are not persuasive. Applicant states on pages 13-14:
“Accordingly, Sanchez fails to disclose or suggest "based on the user voice input comprising the request for information about the object being received, identifying a second frame based on a time point at which the user voice input is received, wherein the second frame is a part of the stored plurality of frames, and is from a second time period reproduced prior to the time point when the user voice input is started to be received; identifying an object included in the second frame based on the second frame and the user voice input; and providing a search result for the information about the object included in the second frame," as claimed inter alia in claim 1, and therefore fails to remedy the deficiencies of Hodge.”

	The examiner respectfully disagrees. While applicant has cited a portion (c. 10, l. 47 to c. 11, l. 7) of Sanchez in the context of a related scene, the examiner has already cited (in the Office action of 11/23/2021, page 17) to c. 2, l. 62 to c. 3, l. 34:
“After the media guidance application determines which scene or scenes it will include in the summarized content, the media guidance application compiles the summarized content. For example, the media guidance application compiles summary content of the related scene or scenes by analyzing the video content of the related scene or scenes and extracting pertinent video frames. The media guidance application may use machine vision algorithms to determine a frame when a new character enters the related scene. The media guidance application marks the identified frame, and in some cases a predetermined number of frames before the character entered the scene and/or a predetermined number of frames after the character entered the scene, for inclusion in the summarized content. Furthermore, the media guidance application may analyze motion vectors present in the digital representation of a scene, e.g., within an MPEG stream, to identify frames associated with a large amount of image motion suggesting large visual changes in the scene. The media guidance application may mark the frames with a large amount of image motion for inclusion in the summarized content. Still further, the media guidance application may identify key portions of an image frame, such as the portion of the image centered near a rule of thirds intersection points, are in focus. In one embodiment, the media guidance application extracts an A×B portion (e.g., 8 pixel by 8 pixel image block from a frame) coincident to a focal point and calculates the local maximum frequency of the image to make a determination whether the frame is in focus. Using focus information, the media guidance application may mark frames for inclusion based on a change in focus information. Still other examples may locate a first frame of the related scene or scenes and track when a focal point of the scene changes according to a pre-determined threshold to identify key frames for inclusion in the summarized content. 
In some embodiments, the media guidance application may rely on metadata correlated with the related scene or scenes to identify the key frames which are marked for inclusion in the summarized playback content. The media guidance application then compiles a collection of the marked frames as the summarized playback content.”
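
For illustration only, the frame marking described in the quoted passage can be sketched as follows. This is a minimal sketch under stated assumptions and is not Sanchez's implementation: the function name `mark_frames_for_summary`, the parameters `frames_before` and `frames_after`, and the representation of frames as integer indices are hypothetical. The detection of the frame at which a character enters the scene (machine vision in Sanchez) is taken as a given input.

```python
def mark_frames_for_summary(num_frames, entry_frames, frames_before=2, frames_after=2):
    """Return sorted indices of frames marked for inclusion in summarized content.

    entry_frames holds the indices at which a new character is assumed to have
    been detected entering the scene; the detection itself is outside this sketch.
    frames_before / frames_after stand in for the "predetermined number of frames"
    before and after the entry frame (hypothetical parameter names).
    """
    marked = set()
    for entry in entry_frames:
        # Clamp the window to the bounds of the scene.
        start = max(0, entry - frames_before)
        end = min(num_frames - 1, entry + frames_after)
        marked.update(range(start, end + 1))
    return sorted(marked)


if __name__ == "__main__":
    # A character is assumed to enter at frame 4 of a 10-frame scene.
    print(mark_frames_for_summary(10, [4]))
```

In this sketch the marked frames include the entry frame together with frames that precede it, which is the aspect of the passage relied upon below.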

Thus, Sanchez teaches marking or identifying (such as with check-marks) a frame when a character enters a related scene (via fig. 2:200: a voice-bubble: “Repeat the scene where the baker enters the cake contest”) and also marks or identifies (such as with more check-marks) other frames before the character enters the scene.


Thus, Sanchez teaches:
identifying a second frame (via fig. 10:1050: “Determine a related scene based on the comparison” comprising a frame wherein a character enters a scene such that the frame is marked or identified such as with checkmarks) based on a time point (or a specific time via current of a current command of fig. 10:1010: “Receive Summary Command”: said voice-bubble that is current with the corresponding frame of people with a cake) at which the user voice input is received (as represented as said voice-bubble), wherein the second frame is a part of the stored plurality of frames (via said scene), and is from a second (previous) time period (relative to a current time period) reproduced (via “playback of the current scene”, c.9,ll. 28-31) prior to the time point when the user voice input is started to be received (to command another playback “in parallel with playback of the current scene”, id., in order to see again via the other playback while the movie continues in parallel as shown in fig. 3: the large screen 105 is the first playback of a scene and the small screen 330 is the second playback of the same scene such that both screens are playing back in parallel such that the small screen lags or is delayed relative to the large screen); 
identifying an object (or “locate a cake”, c.38,ll. 11-13) included in the second frame (as shown in fig. 3:330: a cake in a frame) based on the second frame and the user voice input (said voice-bubble of fig. 2).
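
For illustration only, the sequence of limitations discussed above (storing frames for a time period, identifying a frame reproduced before the time point at which the voice input starts, and identifying an object in that frame) can be sketched as follows. This is a minimal sketch of the claim language as interpreted, not the method of Sanchez, Hodge, or applicant. The names `Frame`, `FrameBuffer`, `frame_before`, `identify_object`, and the `offset` parameter are hypothetical, and matching object labels against query words is an assumed stand-in for whatever recognition is actually used.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float  # playback time (seconds) at which the frame was reproduced
    objects: tuple    # hypothetical labels of objects visible in the frame


class FrameBuffer:
    """Keeps frames reproduced during the most recent `retention` seconds."""

    def __init__(self, retention):
        self.retention = retention
        self.frames = deque()

    def store(self, frame):
        self.frames.append(frame)
        # Drop frames older than the retention window (the "first time period").
        while self.frames and frame.timestamp - self.frames[0].timestamp > self.retention:
            self.frames.popleft()

    def frame_before(self, voice_start, offset):
        """Return the latest stored frame reproduced at or before voice_start - offset."""
        target = voice_start - offset
        candidates = [f for f in self.frames if f.timestamp <= target]
        return candidates[-1] if candidates else None


def identify_object(frame, query_words):
    """Return the first object label in the frame that the voice query mentions."""
    for label in frame.objects:
        if label in query_words:
            return label
    return None


if __name__ == "__main__":
    buffer = FrameBuffer(retention=10.0)
    for t, objs in [(0.0, ("cake",)), (1.0, ("cake", "baker")), (2.0, ("oven",)), (3.0, ("table",))]:
        buffer.store(Frame(t, objs))
    # Voice input assumed to start at t = 3.0 s; look back 1.5 s (hypothetical offset).
    second = buffer.frame_before(voice_start=3.0, offset=1.5)
    print(second.timestamp, identify_object(second, {"what", "is", "that", "cake"}))
```

The claim does not state how far before the time point the second frame lies, so the `offset` value here is an assumption chosen only to make the example concrete.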




Claims 2 and 12
Applicant states on page 14:
“As the Examiner notes in the Office Action, Sanchez discloses "natural language processing" (See Sanchez, col.4, lines. 55-58). However, Applicant submits that Sanchez does not disclose an artificial intelligence model which provides information based on an input of a frame such as the claimed second frame, or that such information is used to provide search results.”

The examiner agrees and relies upon said Hodge (US 2018/0220189) to teach claims 2 and 12.
New Claim
A new reference, Milstein (WO 2019/030551 A1), is applied to claim 22. Milstein teaches creating a keyword based on inputting a media-frame file into AI such that the keyword is made universally searchable, and is thus applied to the metadata aspects taught in both Hodge (US 2018/0220189) and Sanchez (US Patent 10,182,271) under 35 USC 103.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Regarding inquiry 4, see Suggestions.
Claims 1-9, 11-19 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271).
Note that claims 3 and 13 are also rejected under 35 USC 103 due to different meanings of the claimed “inquiry voice” in claim 3, line 4.
Regarding claim 1, Hodge discloses a method for controlling an electronic device comprising: 
reproducing (perceptually) a video (via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”); 
storing (via “buffer storing”, cited below: [0073]) a plurality of frames (via the screen of fig. 4b:410 of fig. 2:216 for fig. 7:704: “PRESERVE DATA”) of the reproduced video (via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) for a first time period (via “store video data for a predetermined and programmable set amount of time”, cited below: [0050]) while (via “monitored…while…continuously captured”, cited below: [0072]) reproducing (via said “monitored”) the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO” to be preserved/buffered); 
receiving a user (via “The video clip sharing request 800 may be user-generated”, cited below: [0089]) voice (via “voice commands… identifying objects being displayed on a video”, cited below: [0090]) input (via fig. 8:arrows being input, represented in fig. 2:212: microphone) of a user (said via “The video clip sharing request 800 may be user-generated”) while reproducing (said via “The video clip sharing request 800 may be user-generated”) a first frame (“being displayed”:[0090]) of the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”), the user voice input (said via fig. 8:arrows being input) comprising a request (via “requested” “files”, cited below:[0076], corresponding to fig. 8:800: “SHARING REQUEST”) for information (comprised by said “files”) about an object (via fig. 4a: “Nice Spot!”) displayed (via fig. 4b:410: “CAMERAS”) in the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”); 





based on the user voice input (said via fig. 8:arrows being input) comprising the request for information about the object being received, identifying (via “identifying objects being displayed on a video” “on a touchscreen”, [0090], 2nd S) a (i.e., any) second (as indicated in fig. 7’s loops going back for a second time) frame (or said “objects being displayed on a video” “on a touchscreen” based on “the video…image”, cited below: [0091], via “video data may be uploaded to the cloud system 103”, cited below: [0075], for said fig. 8: “SHARING REQUEST”) based on a time point (or a specific time, based on via fig. 7: “TAG EVENT?  YES”) at (“at” is used to indicate a location or position, as in time, on a scale, or in order: Dictionary.com) which the user voice input (via “using voice commands…identifying objects being displayed on a video” “on a touchscreen”, [0090], 2nd S) is received (via fig. 2:212: microphone used for said “select an object”), wherein the second (via said loop) frame (or said “objects being displayed on a video” “on a touchscreen”, represented in fig. 7: “TAG EVENT?”) is 
a part (or an excerpt via fig. 7:705: “GENERATE CLIP”) of the stored (via said fig. 7:704 “PRESERVE DATA” for said fig. 8: “SHARING REQUEST”) plurality of frames, and 
is from (i.e., in a period of time starting at, Dictionary.com) a second (via said loop) time period (or “ending…corresponding to…beginning”1 comprised by said tagged event1 that comprises “a particular interval of time”1) 
reproduced (via a “touchscreen… ‘manual tag’ input” “via a tap on a touchscreen of a client device 101 while video is being played” via fig. 7: “TAG EVENT?” based on fig. 7:701: “MONITOR INPUTS” for said via fig. 8: “SHARING REQUEST” and “playing back”, cited below: [0026]) prior (or before) to the (“displayed”) time point (during said via “voice commands… identifying objects being displayed on a video”) when the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input) is started to be received (as shown in fig. 8: any one arrow being input such as one of the arrows pointing to fig. 8:806: “END” relative to said “displayed”);







identifying an object (via fig. 8:805: “MATCH?”: “identify the…event”) included in the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video…image”) based on the second frame and the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input); and 
providing (via said arrows in fig. 8) a search result (via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) for the (matching) information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the image region…used as a query”, cited below: [0091]) included in the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image”, cited below: [0091], or “the video data in which the object was detected”, cited below: [0091] via:
“[0026] The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for streaming and playing back immersive video content.”

“[0050] According to one embodiment, client device 101 is always turned on as long as it has sufficient power to operate.  Cameras 214a and 214b are always turned on and recording video.  The video recorded by the cameras 214 is buffered in the memory device 203.  In one embodiment, memory device 203 is configured as a circular buffer.  For example, in one embodiment, memory device 203 may be a 32 Gb FLASH memory device.  Client device 101 manages the buffer in memory device 203 to store video data for a predetermined and programmable set amount of time.  For example, in one embodiment, memory device 203 buffers video data from two cameras 214a and 214b for the preceding 24 hours.”;




“[0072] Now referring to FIG. 7, a method for generating event-based video clips according to one embodiment is described.  Upon activation of the system, the method starts 700.  The various inputs are monitored 701 while video is continuously captured.  If no tagging event is detected 702, the system keeps monitoring.  If a tagging event is detected 702, the relevant video data in the buffer is identified and selected 703.  For example, once an event is detected 702, the video files for a predefined period of time before and after the event is identified in the buffer.  In one example, 15 seconds before and after the event time is used.  The amount of time, preferably between 10 and 30 seconds, may be pre-programmed or user selectable.  Further, two different time periods may be used, one for time before the event and the other for time after the event.  In one embodiment, the time periods may be different depending on the event detected.  For example, for some events the time periods may be 30 seconds before event and 1 or 2 minutes after while other events may be 15 seconds before and 15 seconds after.

[0073] The selected video data is marked for buffering 704 for a longer period of time.  For example, the video files for the selected time period are copied over to a second system buffer with a different buffering policy that retains the video for a longer period of time.  In one embodiment, the selected video data being in a buffer storing video for 24 hours is moved over to a second buffer storing video for 72 hours.”

“[0075] In one embodiment, video data objects are stored on the network-accessible buffer of the camera device and the playlist or manifest files for the generated event-based video clips identify the network addresses for the memory buffer memory locations storing the video data objects or files.  Alternatively, upon identifying and selecting 703 the relevant video data objects, in addition to or as an alternative to moving the video data to the longer buffer 704, the video data may be uploaded to the cloud system 103.  The clip generation 705 then identifies in the playlist or manifest file the network addresses for the video data stored in the cloud system 103.  A combination of these approaches may be used depending on storage capacity and network capabilities for the camera devices used in the system or according to other design choices of the various possible implementations.”;

“[0076] In one embodiment, other system components, such as the cloud system 103 or mobile device 104, are notified 706 of the event or event-based video clip. For example, in one embodiment a message including the GUID for the generated video clip is sent to the cloud system in a cryptographically signed message (as discussed above).  Optionally, the playlist or manifest file may also be sent in the message.  In one embodiment, the playlist or manifest files are maintained in the local memory of the camera device until requested.  For example, upon notification 706 of the clip generation, the cloud system may request the clip playlist or manifest file.  Optionally, the cloud system may notify 706 other system components and/or other users of the clip and other system components or users may request the clip either from the cloud system 103 or directly from the camera device.  For example, the clips pane 401a in the user's mobile app may display the clip information upon receiving the notification 706.  Given that the clip metadata is not a large amount of data, e.g., a few kilobytes, the user app can be notified almost instantaneously after the tag event is generated.  The larger amount of data associated with the video data for the clip can be transferred later, for example, via the cloud system or directly to the mobile device.  For example, upon detection of a "Baby/Animal in Parked Car" event or a "Location Discontinuity" event, the user's mobile device 104 may be immediately notified of the tag event using only tag metadata.  Subsequently, the user can use the video clip playlist to access the video data stored remotely, for example, for verification purposes.”;

“[0084] These combinations of events and inputs are illustrative only.  Some embodiments may provide a subset of these inputs and/or events.  Other embodiments may provide different combinations of inputs and/or different events.  The event detection algorithms may be implemented locally on the camera device (e.g., client device 101) or may be performed in cloud servers 102, with the input signals and event detection outputs transmitted over the wireless communication connection 107/108 from and to the camera device.  Alternatively, in some embodiments a subset of the detection algorithms may be performed locally on the camera device while other detection algorithms are performed on cloud servers 102, depending for example, on the processing capabilities available on the client device.  Further, in one embodiment, artificial intelligence ("AI") algorithms are applied to the multiple inputs to identify the most likely matching event for the given combination of inputs.  For example, a neural network may be trained with the set of inputs used by the system to recognize the set of possible tagging events.  Further, a feedback mechanism may be provided to the user via the mobile app to accept or reject proposed tagging results to further refine the neural network as the system is used.  This provides a refinement process that improves the performance of the system over time.  At the same time, the system is capable of learning to detect false positives provided by the algorithms and heuristics and may refine them to avoid incorrectly tagging events.”

“[0086] According to another aspect of the disclosure, in one embodiment, the detection process 702 is configured to detect a user-determined manual tagging of an event.  The user may provide an indication to the system of the occurrence of an event of interest to the user.  For example, in one embodiment, a user may touch the touchscreen of a client device 101 to indicate the occurrence of an event.  Upon detecting 702 the user "manual tag" input, the system creates an event-based clip as described above with reference to FIG. 7.  In an alternative embodiment, the user indication may include a voice command, a Bluetooth transmitted signal, or the like.  For example, in one embodiment, a user may utter a predetermined word or set of words (e.g., "Owl make a note").  Upon detecting the utterance in the audio input, the system may provide a cue to indicate the recognition.  For example, the client device 101 may beep, vibrate, or output speech to indicate recognition of a manual tag.  Optionally, additional user speech may be input to provide a name or descriptor for the event-based video clip resulting for the user manual tag input.  For example, a short description of the event may be uttered by the user.  The user's utterance is processed by a speech-to-text algorithm and the resulting text is stored as metadata associated with the video clip.  For example, in one embodiment, the name or descriptor provided by the user may be displayed on the mobile app as the clip descriptor 402 in the clips pane 401a of the mobile app. In another embodiment, the additional user speech may include additional 
commands.  For example, the user may indicate the length of the event for which the manual tag was indicated, e.g., "short" for a 30-second recording, "long" for a two-minute recording, or the like.  Optionally, the length of any video clip can be extended based on user input.  For example, after an initial event-based video clip is generated, the user may review the video clip and request additional time before or after and the associated video data is added to the playlist or manifest file as described with reference to FIG. 7.”

“[0089] Now referring to FIG. 8, a method for identifying and sharing event-based video clips is described.  In addition to the various options for sharing video clips identified above, in one embodiment, video clips may also be shared based on their potential relevance to events generated by different camera devices.  To do so, in one embodiment, a video clip sharing request is received 800.  The video clip sharing request 800 may be user-generated or automatically generated.  For example, in one embodiment, a map can be accessed displaying the location of camera devices for which a user may request shared access.  The user can select the camera device or devices it wants to request video from.  In an alternative embodiment, the user enters a location, date, and time for which video is desired to generate a sharing request.

[0090] In yet another embodiment, a user may select an object (e.g., a car, person, item, or the like) being displayed on the screen of a camera device.  For example, via a tap on a touchscreen of a client device 101 while video is being played, using voice commands, or other user input device capable of identifying objects being displayed on a video.  Optionally, an object of interest can also be identified on a video automatically.  For example, as part of the auto-tagging feature described above with reference to FIG. 7, some of the inputs monitored 701 may include objects of interest resulting from image processing techniques.  For example, if a tagging-event is determined to be a break-in and one of the monitored inputs includes a detected human face that is not recognized, the unrecognized face may be used as the selected object.”

“[0091] Image processing algorithms and/or computer vision techniques are applied to identify the selected object from the video and formulate an object descriptor query.  For example, the user input is applied to detect the region of interest in the image, e.g., the zoomed-in region.  The data for the relevant region is processed into a vector representation for image data around detected relevant points in the mage region.  From the vector or descriptor of the relevant region, feature descriptors are then extracted based on, for example, second-order statistics, parametric models, coefficients obtained from an image transform, or a combination of these approaches.  The feature-based representation of the object in the image is then used as a query for matching in other video data.  In one embodiment, a request for sharing video clips includes an image query for an object and metadata from the video data in which the object was detected.”

1	Dictionary.com: 
event
noun
3	something that occurs in a certain place during a particular interval of time. 

wherein “interval” is defined:
interval
noun
1	an intervening period of time:
an interval of 50 years.

wherein “period” is defined:
period
noun
3	a round of time or series of years by which time is measured.

wherein “round” is defined:
round
noun
22	Sometimes rounds . a completed course of time, series of events or operations, etc., ending at a point corresponding to that at the beginning:
We waited through the round of many years.).

Thus, Hodge does not teach as a whole, as indicated in bold above, the claimed:
A.	“a time point at which the user voice input is received”; and
B.	“the second frame is…reproduced prior to the time point when the user voice input is started to be received”
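
For illustration only, limitations A and B as interpreted above can be sketched as a lookup over buffered frames. This is a minimal Python sketch under stated assumptions, not the applicant's or either reference's implementation: the names FrameBuffer, store, and frame_before, the use of seconds as the time unit, and the assumption that frames are stored in non-decreasing time order are all introduced here for the example.

```python
from bisect import bisect_left
from collections import deque


class FrameBuffer:
    """Hypothetical buffer holding frames reproduced during the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.frames = deque()  # (reproduction_time, frame), oldest first

    def store(self, t, frame):
        # Store the frame while the video is reproduced, then drop frames
        # that fall outside the first time period (the buffer window).
        self.frames.append((t, frame))
        while self.frames and self.frames[0][0] < t - self.window:
            self.frames.popleft()

    def frame_before(self, voice_start):
        # Limitation A: the lookup is keyed on the time point of the voice input.
        # Limitation B: only frames reproduced strictly before that time point qualify.
        times = [t for t, _ in self.frames]
        i = bisect_left(times, voice_start)
        return self.frames[i - 1][1] if i > 0 else None
```

Under these assumptions, with one frame stored per second, a voice input that starts at 7.5 seconds selects the frame reproduced at 7 seconds, and a voice input that starts before any buffered frame selects nothing.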

Accordingly, Sanchez teaches as a whole:
A.	identifying (expressing action or result of identify via “determine a frame”) a second frame (resulting in an “identified frame” represented in fig. 10:1060: “Compile summarized playback content”) based on a time point (or a specific time via “current”1, comprising “at the moment in time at which an utterance is spoken”, of “voice request about the current video content”, c.16,ll. 42-47, based on beginning that begins at fig. 10:1010: “Receive Summary Command”) at which the user voice input (via fig. 2:210: “microphone”) is received, wherein the second frame is a part of the stored plurality of frames reproduced (from “a stored copy”, c.30,ll.3-8) prior to the time point when the user voice input is started to be received, wherein the second frame…is from a second time period (or an “earlier” “intervening period of time” comprised by an “earlier” “event”, c.3,l.66 to c.4,l.2, via Dictionary.com:
1 BRITISH DICTIONARY DEFINITIONS FOR CURRENT
current
adjective
1	of the immediate present; in progress:
current events

BRITISH DICTIONARY DEFINITIONS FOR PRESENT (1 OF 2)
present
adjective
1	(prenominal) in existence at the moment in time at which an utterance is spoken or written

event
noun
3	something that occurs in a certain place during a particular interval of time.

wherein “interval” is defined:
interval
noun
1	an intervening period of time:
an interval of 50 years.); 

and

B.	the second frame is…reproduced (via a first “playback of the current scene”, c.9,ll. 28-31) prior to the time point when the user voice input is started to be received (initiating the second “playback, in a second display”, id.);
identifying an object (or a baker via fig. 2: “Repeat the scene where the baker enters the cake contest” corresponding to “identify…events…in an earlier point of the show”) included in the second (said via fig. 2: “Repeat” corresponding to said “earlier point of the show” in contrast to a current point of the show) frame (as shown in fig. 1 or fig. 3:330 or fig. 4:450,460 comprising or involving as a factor said “earlier point of the show” comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command”) based on the (compiled) second frame and the user voice input (said via fig. 2:210: “Repeat the scene where the baker enters the cake contest.”); and 
providing a search (via “search a catalogue”) result (as shown in figure 3) for the information (or “other information”) about the object (said baker corresponding to “identify…events…in an earlier point of the show”) included in the second (via fig. 2: “Repeat” corresponding to said “earlier point of the show”, comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command”, in contrast to a current point of the show) frame (said as shown in fig. 1 or fig. 3:330 or fig. 4:450,460 comprising or involving as a factor said “earlier point of the show” comprised by a summary of what has happened via fig. 10:1010: “Receive Summary Command” via:

c.2,l. 62 to c.3,l.34:
“After the media guidance application determines which scene or scenes it will include in the summarized content, the media guidance application compiles the summarized content. For example, the media guidance application compiles summary content of the related scene or scenes by analyzing the video content of the related scene or scenes and extracting pertinent video frames. The media guidance application may use machine vision algorithms to determine a frame when a new character enters the related scene. The media guidance application marks the identified frame, and in some cases a predetermined number of frames before the character entered the scene and/or a predetermined number of frames after the character entered the scene, for inclusion in the summarized content. Furthermore, the media guidance application may analyze motion vectors present in the digital representation of a scene, e.g., within an MPEG stream, to identify frames associated with a large amount of image motion suggesting large visual changes in the scene. The media guidance application may mark the frames with a large amount of image motion for inclusion in the summarized content. Still further, the media guidance application may identify key portions of an image frame, such as the portion of the image centered near a rule of thirds intersection points, are in focus. In one embodiment, the media guidance application extracts an A×B portion (e.g., 8 pixel by 8 pixel image block from a frame) coincident to a focal point and calculates the local maximum frequency of the image to make a determination whether the frame is in focus. Using focus information, the media guidance application may mark frames for inclusion based on a change in focus information. Still other examples may locate a first frame of the related scene or scenes and track when a focal point of the scene changes according to a pre-determined threshold to identify key frames for inclusion in the summarized content. 
In some embodiments, the media guidance application may rely on metadata correlated with the related scene or scenes to identify the key frames which are marked for inclusion in the summarized playback content. The media guidance application then compiles a collection of the marked frames as the summarized playback content.”;


and

c. 3,l. 59 to c.4,l.13:
“In some embodiments, the media guidance application 100 will determine the related scene or scenes using information from the current scene.  For example, the media guidance application 100 may determine a current playback position of the current scene in a media asset being viewed in the first display.  The media guidance application 100 identifies information associated with the current scene based on the current playback position.  For example, the media guidance application 100 may identify that a character in a scene is talking to a second character about events that happened in an earlier point of the show or a related show.  The media guidance application 100 may compare the identifying information with other information associated with a plurality of relevant scenes.  For example, the media guidance application 100 may use the topics discussed by characters to search a catalogue of scenes from the current episode or other episodes from the current show.  The media guidance application 100 may then determine a related scene from other scenes of the 
show based on that comparison.  As discussed above, the media guidance application 100 compiles summarized playback content, wherein the summarized playback content is associated with the current scene and the related scene.”).

Thus, one of ordinary skill in the art of television, as indicated by Hodge’s “a television transceiver”:
“[0119] One or more processors in association with software in a computer-based system may be used to implement methods of video data collection, cloud-based data collection and analysis of event-based data, generating event-based video clips, sharing event-based video, verifying authenticity of event-based video data files, and setting up client devices according to various embodiments, as well as data models for capturing metadata associated with a given video data object or file or for capturing metadata associated with a given event-based video clip according to various embodiments, all of which improves the operation of the processor and its interactions with other components of a computer-based system.  The camera devices according to various embodiments may be used in conjunction with modules, implemented in hardware and/or software, such as a cameras, a video camera module, a videophone, a speakerphone, a vibration device, a speaker, a microphone, a television transceiver, a hands free headset, a keyboard, a Bluetooth module, a frequency modulated (FM) radio 
unit, a liquid crystal display (LCD) display unit, an organic light-emitting diode (OLED) display unit, a digital music player, a media player, a video game player module, an Internet browser, and/or any wireless local area network (WLAN) module, or the like.”

can modify Hodge’s teaching of said fig. 8:805: “MATCH?”: “identify the…event” with Sanchez’s teaching of said fig. 2: “baker enters” corresponding to “identify…events…in an earlier point of the show” by:
a)	inserting Sanchez’s program of fig. 10:1000, comprising said “fig. 2: ‘baker enters’ corresponding to ‘identify…events…in an earlier point of the show’”, into Hodge’s fig. 8: “SHARING REQUEST”;
b)	transmitting/receiving a television signal, such as said baking show, for said Hodge’s fig. 8: “SHARING REQUEST”; and
c)	recognizing that the modification is predictable or looked forward to because Hodge already teaches that “The camera devices according to various embodiments may be used in conjunction with…a television transceiver” (Hodge, cited above) and, in addition, “television” is “an electronically consumable user asset” that is a useful and desirable thing, or is valuable or useful and intended to be bought and used, via Sanchez, c.17,ll. 19-47:
“Interactive media guidance applications may take various forms depending on the content for which they provide guidance.  One typical type of media guidance application is an interactive television program guide.  Interactive television program guides (sometimes referred to as electronic program guides) are well-known guidance applications that, among other things, allow users to navigate among and locate many types of content or media assets. Interactive media guidance applications may generate graphical user interface screens that enable a user to navigate among, locate and select content.  As referred to herein, the terms "media asset" and "content" should be understood to mean an electronically consumable user asset, such as television programming, as well as pay-per-view programs, on-demand programs (as in video-on-demand (VOD) systems), Internet content (e.g., streaming content, downloadable content, Webcasts, etc.), video clips, audio, content information, pictures, rotating images, documents, playlists, websites, articles, books, electronic books, blogs, chat sessions, social media, applications, games, and/or any other media or multimedia and/or combination of the same.  Guidance applications also allow users to navigate among and locate content.  As referred to herein, the term "multimedia" should be understood to mean content that utilizes at least two different content forms described above, for example, text, audio, images, video, or interactivity content forms.  Content 
may be recorded, played, displayed or accessed by user equipment devices, but 
can also be part of a live performance.”


Regarding claim 2, Hodge as combined teaches the method for controlling an electronic device of claim 1, further comprising: 
inputting the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into an artificial intelligence model trained (via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) through an artificial intelligence algorithm; and 
acquiring the information (via fig. 8: 805: “MATCH ?”: “Yes”) about the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) included in the second (said as indicated in fig. 7’s loops going back for a second time) frame (said “video… image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on output (via fig. 8:805: “MATCH?”: “YES”) of the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?” via:

“[0084] These combinations of events and inputs are illustrative only.  Some embodiments may provide a subset of these inputs and/or events.  Other embodiments may provide different combinations of inputs and/or different events.  The event detection algorithms may be implemented locally on the camera device (e.g., client device 101) or may be performed in cloud servers 102, with the input signals and event detection outputs transmitted over the wireless communication connection 107/108 from and to the camera device.  Alternatively, in some embodiments a subset of the detection algorithms may be performed locally on the camera device while other detection algorithms are performed on cloud servers 102, depending for example, on the processing capabilities available on the client device.  Further, in one embodiment, artificial intelligence ("AI") algorithms are applied to the multiple inputs to identify the most likely matching event for the given combination of inputs.  For example, a neural network may be trained with the set of inputs used by the system to recognize the set of possible tagging events.  Further, a feedback mechanism may be provided to the user via the mobile app to accept or reject proposed tagging results to further refine the neural network as the system is used.  This provides a refinement process that improves the performance of the system over time.  At the same time, the system is capable of learning to detect false positives provided by the algorithms and heuristics and may refine them to avoid incorrectly tagging events.”).

Regarding claim 3, Hodge as combined teaches the method for controlling an electronic device of claim 2, 
wherein the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) comprises a trigger voice (comprising “ ‘trigger’ words… associated with particular events”) for initiating an inquiry (via fig. 8:803: “PROVIDE IMAGE QUERY”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query”) included in the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) and an inquiry voice (said via “voice commands… identifying objects being displayed on a video” to “formulate an object descriptor query”, cited [0091]) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query” via:
“[0082] Sound processing may also include speech recognition and natural language processing to recognize human speech, words, and/or commands.  For example, certain "trigger" words may be associated with particular events. When the "trigger" word is found present in the audio data, the corresponding event may be determined.  Similarly, the outputs of the available sensors may be received and processed to determine presence of patterns associated with events.  For example, GPS signals, accelerator signals, gyroscope signals, magnetometer signals, and the like may be received and analyzed to detect the presence of events.  In one embodiment, additional data received via wireless module 205, such as traffic information, weather information, police reports, or the like, is also used in the detection process.  The detection process 702 applies algorithms and heuristics that associate combinations of all these 
potential inputs with possible events.”), and 

wherein the inputting (via said “multiple inputs”) the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting (via said “multiple inputs”) the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on the (said “displayed”) time point (said during said via “voice commands… identifying objects being displayed on a video”) when (at fig. 8:805: “MATCH?”) the trigger voice (said comprising “ ‘trigger’ words… associated with particular events” of interest to a user or driver identifying said “Nice Spot!”) is received (via said “multiple inputs”) into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”).

Regarding claim 4, Hodge as combined teaches the method for controlling an electronic device of claim 2, wherein the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) comprises an image frame (said “video…image”) and an audio frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio” via:
“[0036] In one embodiment, client device 101 also includes a touchscreen 211. In alternative embodiments, other user input devices (not shown) may be used, such a keyboard, mouse, stylus, or the like.  Touchscreen 211 may be a capacitive touch array controlled by touchscreen module 208 to receive touch input from a user.  Other touchscreen technology may be used in alternative embodiments of touchscreen 211, such as for example, force sensing touch screens, resistive touchscreens, electric-field tomography touch sensors, radio-frequency (RF) touch sensors, or the like.  In addition, user input may be received through one or more microphones 212.  In one embodiment, microphone 212 is a digital microphone connected to audio module 206 to receive user 
spoken input, such as user instructions or commands.  Microphone 212 may also be used for other functions, such as user communications, audio component of video recordings, or the like.  Client device may also include one or more audio output devices 213, such as speakers or speaker arrays.  In alternative embodiments, audio output devices 213 may include other components, such as an automotive speaker system, headphones, stand-alone "smart" speakers, or the like.
[0037] Client device 101 can also include one or more cameras 214, one or more sensors 215, and a screen 216.  In one embodiment, client device 101 includes two cameras 214a and 214b.  Each camera 214 is a high definition CMOS-based imaging sensor camera capable of recording video one or more video modes, including for example high-definition formats, such as 1440p, 1080p, 720p, and/or ultra-high-definition formats, such as 2K (e.g., 2048×1080 or similar), 4K or 2160p, 2540p, 4000p, 8K or 4320p, or similar video modes. Cameras 214 record video using variable frame rates, such for example, frame rates between 1 and 300 frames per second.  For example, in one embodiment cameras 214a and 214b are Omnivision OV-4688 cameras.  Alternative cameras 214 may be provided in different embodiments capable of recording video in any combinations of these and other video modes.  For example, other CMOS sensors or CCD image sensors may be used.  Cameras 214 are controlled by video module 207 to record video input as further described below.  A single client device 101 may include multiple cameras to cover different views and angles.  For example, in a vehicle-based system, client device 101 may include a front camera, side cameras, back cameras, inside cameras, etc.”; and


“[0081] According to another aspect of the disclosure, detection of tagging events 702 may be done automatically by the system.  For example, based on the monitored inputs, in different embodiments events such as a vehicle crash, a police stop, or a break in, may be automatically determined.  The monitored inputs 701 may include, for example, image processing signals, sound processing signals, sensor processing signals, speech processing signals, in any combination.  In one embodiment, image processing signals includes face recognition algorithms, body recognition algorithms, and/or object/pattern detection algorithms applied to the video data from one or more cameras.  For example, the face of the user may be recognized being inside a vehicle.  As another example, flashing lights from police, fire, or other emergency vehicles 
may be detected in the video data.  Another image processing algorithm detects the presence of human faces (but not of a recognized user), human bodies, or uniformed personnel in the video data.  Similarly, sound processing signals may be based on audio recorded by one or more microphones 212 in a camera device, (e.g., client device 101, auxiliary camera 106, or mobile device 104).  In one embodiment sound processing may be based on analysis of sound patterns or signatures of audio clips transformed to the frequency domain.  For example, upon detection of a sound above a minimum threshold level (e.g., a preset number of decibels), the relevant sound signal is recorded and a Fast Fourier Transform (FFT) is performed on the recorded time-domain audio signal as is known in the art.  The frequency-domain signature of the recorded audio signal is then compared to known frequency domain signatures for recognized events, such as, glass breaking, police sirens, etc. to determine if there is a match.  
For example, in one embodiment, pairs of points in the frequency domain signature of the recorded audio input are determined and the ratio between the selected points are compared to the ratios between similar points in the audio signatures of recognized audio events.”), 

wherein the inputting the second frame (said “video…image” via said via “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises matching (via fig. 8:805: “MATCH?” via “audio…match”, cited above: [0081]) the image frame (via said “video…image”) and the audio (said via “audio recorded…in a camera”) frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”), and
wherein the identifying the object included in the second frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”) based on output (via fig. 8:805: “MATCH?”: “YES”) of the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting the image frame (said “video…image”) and the audio (said via “audio recorded…in a camera”) frame (said “video…image” comprising “audio component of video…frames” corresponding to said “video” with said “audio”) into the artificial intelligence model (said via “trained” “artificial intelligence ("AI") algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) wherein said “video…frames” or the claimed “frame” comprises film further comprising “recording and reproduction of images” or “ recording and reproduction of both images and sound” via Dictionary.com:
film
noun
Movies.
a strip of transparent material, usually cellulose triacetate, covered with a photographic emulsion and perforated along one or both edges, intended for the recording and reproduction of images.
a similar perforated strip covered with an iron oxide emulsion (magfilm ), intended for the recording and reproduction of both images and sound.
a movie; motion picture: We decided to stay home and watch a Kurosawa film.).

Regarding claim 5, Hodge as combined teaches the method for controlling an electronic device of claim 4, further comprising:
matching (said via fig. 8:805: “MATCH?”: “Yes”) information on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) with the image frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) in which the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) appeared, and storing the (“matching”) information and the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected” such that “the user may access clips generated” via:
“[0095] Responses to the search request are received 804.  If no matches are found 805, the sharing request process ends 806.  For example, if the search request was initiated by a user, the user may be notified that no matching video clips were found.  If matching video clips are found 805, an authorization request is sent 807 to the user of the camera device responding with a match.  As discussed above with reference to FIG. 4a-c, the clips generated from camera devices of the user may be listed under the clips pane 401a.  Thus, the user may access clips generated 705 from a client device 101, an auxiliary camera 106, a mobile device 104, without further authorization requirement.  For example, in one embodiment, when the camera devices with video clips matching the same event, such as a break-in, are registered to the same user account, the user may directly access the shared video clips from one or more home auxiliary cameras 106 that captured the same break-in as the dash-mounted client device 101 from different vantage points.  Thus, for example, a user may be able to provide related video clips to the authorities showing a perpetrator's face (from an IN-camera device), a "get-away" vehicle from an auxiliary home camera device located in a carport, and a license plate for the get-away vehicle from a driveway auxiliary camera device.  The video clips for the break-in event could be automatically generated and associated as "related" clips from multiple camera devices integrated by the system according to one embodiment of the invention.”).

Regarding claim 6, Hodge as combined teaches the method for controlling an electronic device of claim 2, further comprising: 
determining (via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) information (for matching) on an object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) corresponding to a user's voice instruction (or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”, cited below: [0086]) among the information (said comprised by said “files”) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”), and 
wherein the providing (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) comprises transmitting (represented in fig. 1 as dashed lines) the determined (said via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) corresponding to the user’s voice instruction (said or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”) to an external search server (or fig. 1:102:the cloud) and providing (via fig. 4a:403a-c & 402a-c) the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) received (represented in fig. 1 as dashed lines) from the external search server (said or fig. 1:102:the cloud via:

“[0086] According to another aspect of the disclosure, in one embodiment, the detection process 702 is configured to detect a user-determined manual tagging of an event.  The user may provide an indication to the system of the occurrence of an event of interest to the user.  For example, in one embodiment, a user may touch the touchscreen of a client device 101 to indicate the occurrence of an event.  Upon detecting 702 the user "manual tag" input, the system creates an event-based clip as described above with reference to FIG. 7.  In an alternative embodiment, the user indication may include a voice command, a Bluetooth transmitted signal, or the like.  For example, in one embodiment, a user may utter a predetermined word or set of words (e.g., "Owl make a note").  Upon detecting the utterance in the audio input, the system may provide a cue to indicate the recognition.  For example, the client device 101 may beep, vibrate, or output speech to indicate recognition of a manual tag.  Optionally, additional user speech may be input to provide a name or descriptor for the event-based video clip resulting for the user manual tag input.  For example, a short description of the event may be uttered by the user.  The user's utterance is processed by a speech-to-text algorithm and the resulting text is stored as metadata associated with the video clip.  For example, in one embodiment, the name or descriptor provided by the user may be displayed on the mobile app as the clip descriptor 402 in the clips pane 401a of the mobile app. In another embodiment, the additional user speech may include additional commands.  For example, the user may indicate the length of the event for which the manual tag was indicated, e.g., "short" for a 30-second recording, "long" for a two-minute recording, or the like.  Optionally, the length of any video clip can be extended based on user input.  For example, after an initial event-based video clip is generated, the user may review the video clip and request additional time before or after and the associated video data is added to the playlist or manifest file as described with reference to FIG. 7.”).

Regarding claim 7, Hodge as combined teaches the method for controlling an electronic device of claim 6, wherein the determining (said via fig. 7: “TAG EVENT?” for fig. 8:805: “MATCH?”: “Yes”: “No”) further comprises: 
displaying a user interface (UI) (via figures 4a,b,c) identifying (via fig. 7:706: “NOTIFICATIONS”) whether (said via fig. 8:805: “MATCH?”: “Yes”: “No”) the information (said for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) is information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”) corresponding to the user's voice instruction (said or “a voice command” or “speech…commands…e.g., ‘short’… ‘long’ ”) among the information (said comprised by said “files” for matching) on the object (said via fig. 4a: “Nice Spot!” or “image data around detected relevant points in the mage region…used as a query”), or identifying whether there is an additional inquiry for inquiring additional information.

Regarding claim 8, Hodge as combined teaches the method for controlling an electronic device of claim 1, wherein the providing (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) comprises: 
providing the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) and a (“zoomed-in region”, cited in the rejection of claim 1) frame corresponding to the search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) in an area (said “zoomed-in region”) of the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) while the video (said via fig. 7:701: “MONITOR INPUTS” and fig. 7:703: “VIDEO”) is being reproduced (via fig. 4a,b,c).
Regarding claim 9, Hodge as combined teaches the method for controlling an electronic device of claim 1, comprising: 
transmitting (via the dashed lines in fig. 1) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103”) to an external server (fig. 1:102:the cloud) for acquiring (matching) information on frames; and 
acquiring (said via the dashed lines in fig. 1) information (for matching) on the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103”) from the external server (said fig. 1:102:the cloud).

Regarding claim 21, Hodge as combined teaches via Sanchez the method of claim 1, wherein the identifying of the object (via fig. 8:arrows being input: 800: “SHARING REQUEST” as modified via Sanchez’s fig. 10:1000) comprises obtaining a keyword (via Sanchez: fig. 12:1050A:1220: “Generate key phrase from dialogue”: represented in fig. 10:1050: “Determine a related scene based on the comparison”) corresponding to the object based on the second frame (represented in fig. 10:1040: “Compare information with relevant scenes”), and 
wherein the providing includes transmitting the keyword to an external search (said via “search a catalogue”) server, and 
receiving the search result from the external search server based on the keyword.
Thus, the combination does not teach, as indicated in bold above, the claimed:
“transmitting the keyword to an external search server, and 
receiving the search result from the external search server based on the keyword”.
Accordingly, Sanchez, as already combined above, further teaches:
transmitting (via fig. 9:920: a “communication path”, c. 26,ll. 31-36) the keyword to an external search (said via “search a catalogue”) server (or fig. 9:916: “Media Content Source”), and 
receiving (said via fig. 9:920: a “communication path”) the search result from the external search server based on the keyword.

Thus, one of ordinary skill in the art of asking questions or querying as taught by both references can modify the combination’s said fig. 8:arrows being input: 800: “SHARING REQUEST” as modified via Sanchez’s fig. 10:1000 with Sanchez’s further teaching of said fig. 9:920: a “communication path” by:
a)	making Hodge’s fig. 1:100, a network, be as Sanchez’s fig. 9:900: a network;
b)	installing a “search engine” (Sanchez, c.32,ll. 61-64) in each of Hodge’s fig. 1:101,105,104;
c)	searching Hodge’s fig. 1:102, a server system:
c1)	sending inquiring key phrases or words, such as:
“Red Wedding” (Sanchez, c.33, ll. 56-59); or 
“Nice car parking spots!” regarding Hodge:402c:“Nice Spot!”; and
c2)	receiving a result or a response to each question via the “communication path”; and
d)	recognizing that the modification is predictable or looked forward to because the modification allows one to search with “weight” (Sanchez, c.32,ll. 64-67), or relative importance, assigned to each word of the phrase, such as “Red (10% weight) Wedding (90% weight)”, so as to “extract pertinent features” (Sanchez, c. 33,ll. 11-16) from owned assets that are useful, desirable, or valuable things intended to be bought and used, such as:
d1)	the book or video of “Game of Thrones” (Sanchez, c.33,ll. 11-16); or 
d2)	shared, via Hodge: fig. 8:800: “SHARING REQUEST”, video of parked cars.
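The weighting discussed in item d) can be pictured with a minimal sketch. It is an illustration only and is not taken from Sanchez or Hodge: the function names `score_entry` and `search_catalogue`, the catalogue entries, and the scoring rule (summing the weights of the query words found in an entry's description) are assumptions. Only the example weights (“Red” 10%, “Wedding” 90%) and the example phrases come from the discussion above.

```python
def score_entry(description, weights):
    """Sum the weights of the query words that occur in a description."""
    words = {w.strip(".,!?\"'").lower() for w in description.split()}
    return sum(wt for word, wt in weights.items() if word.lower() in words)


def search_catalogue(catalogue, weights):
    """Return (score, title) pairs with a nonzero score, best match first."""
    scored = [(score_entry(text, weights), title) for title, text in catalogue.items()]
    return sorted((s for s in scored if s[0] > 0), reverse=True)


# Hypothetical catalogue; titles and descriptions are illustrative only.
catalogue = {
    "Game of Thrones": "The Red Wedding episode",
    "Parked cars clip": "Nice Spot! Red car parked",
    "Cooking show": "A summer picnic",
}
weights = {"red": 0.1, "wedding": 0.9}  # weights from the example above
results = search_catalogue(catalogue, weights)
```

Under these assumed inputs, the entry containing both weighted words ranks ahead of the entry containing only the lightly weighted word, and the entry containing neither word is omitted.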
Regarding claim 11, claim 11 is rejected on the same grounds as claim 1. Thus, the argument presented for claim 1 is equally applicable to claim 11. Accordingly, Hodge as combined above in the rejection of claim 1 teaches the electronic device of claim 11, comprising: 
a display (fig. 2:208: TOUCH SCREEN MODULE); 
a communicator (fig. 2:205: WIRELESS MODULE); 
a microphone (fig. 2:206: AUDIO MODULE); 
memory (fig. 2:203: MEMORY MODULE) storing at least one instruction (via figures 5,6a and 7-11); and 
a processor (fig. 2:201: PROCESSING MODULE) coupled to the display (said fig. 2:208: TOUCH SCREEN MODULE), the communicator (said fig. 2:205: WIRELESS MODULE), the microphone (said fig. 2:206: AUDIO MODULE) and the memory (said fig. 2:203: MEMORY MODULE), and controlling the electronic device (fig. 2:101), 
wherein the processor (said fig. 2:201: PROCESSING MODULE) is configured to execute the at least one instruction to: 
control the electronic device (said fig. 2:101) to store in the memory (said fig. 2:203: MEMORY MODULE) a plurality of frames (via fig. 2:207: VIDEO MODULE) of a video (via said fig. 2:208: TOUCH SCREEN MODULE) for a first time period (said via “store video data for a predetermined and programmable set amount of time”, cited in the rejection of claim 1) while reproducing (via said fig. 2:208: TOUCH SCREEN MODULE) the video (said via fig. 2:207: VIDEO MODULE) on the display (said fig. 2:208: TOUCH SCREEN MODULE), 
receiving a user voice (via fig. 2:212) input (identifying the “Nice Spot!”) of a user (via said fig. 2:206: AUDIO MODULE) while a first frame of the video (said via fig. 2:207: VIDEO MODULE) is being reproduced (via said fig. 2:208: TOUCH SCREEN MODULE), the user voice input (via said identifying the “Nice Spot!”) comprising a request (via fig. 8:800: “SHARING REQUEST”) for information about an object (said “Nice Spot!”) displayed in the video, 
based on the user voice input comprising the request for information about the object being received, identify a (i.e., any) second frame (said “video...image” via said “video data may be uploaded to the cloud system 103”, cited in the rejection of claim 1, for matching in fig. 8:805: “MATCH?”) based on a time point at which the user voice input is received, where the second frame is part of the stored plurality of frames (said via fig. 2:207: VIDEO MODULE), and is from a second time period reproduced (or played back to allow the user or driver or passenger to identify said “Nice Spot!” for parking in said any “video…image” for said matching) prior to (or before) the time point when (at fig. 8:800: “SHARING REQUEST”) the user voice input (said identifying the “Nice Spot!”) is started to be received (via any one arrow in fig. 8), 
identify an object (via fig. 8:805: “MATCH?” identifying an event of interest to the user or driver or passenger) included in the (any) second frame (said “video…image”) based on the second frame and the user voice input (that identified the “Nice Spot!” for parking), and 
provide a search result (said via fig. 8:804: “RECEIVE QUERY RESPONSE(S)” and fig. 8:805: “MATCH?”: “Yes”: “No”) for the (matching) information about the object (said nice parking spot or “image data around detected relevant points in the mage region…used as a query”) included in the (any) second frame (said “video…image” via said “video data may be uploaded to the cloud system 103”, cited in the rejection of claim 1, for matching in fig. 8:805: “MATCH?”).
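The frame-selection limitation of claim 11, as interpreted above, can be pictured with a minimal sketch. It illustrates the claim language only and is not taken from Hodge, Sanchez, or Applicant's disclosure: the class name `FrameBuffer`, the 10-second retention period, the one-frame-per-second timestamps, and the choice of the most recent frame before the voice input are all assumptions.

```python
from collections import deque


class FrameBuffer:
    """Keeps frames reproduced during the most recent `period` seconds."""

    def __init__(self, period):
        self.period = period
        self.frames = deque()  # (timestamp, frame) pairs, oldest first

    def store(self, timestamp, frame):
        self.frames.append((timestamp, frame))
        # Drop frames older than the retention ("first time") period.
        while self.frames and timestamp - self.frames[0][0] > self.period:
            self.frames.popleft()

    def frame_before(self, voice_start):
        """Return the latest stored frame reproduced before voice_start, or None."""
        earlier = [f for t, f in self.frames if t < voice_start]
        return earlier[-1] if earlier else None


buf = FrameBuffer(period=10.0)  # assumed retention period, in seconds
for t in range(0, 15):          # one hypothetical frame per second
    buf.store(float(t), f"frame-{t}")

# The voice input is assumed to start at 12.5 s; the "second frame" is then
# taken from the stored frames reproduced before that time point.
second_frame = buf.frame_before(voice_start=12.5)
```

In this sketch the "second frame" is always drawn from the stored frames and always precedes the start of the voice input, which are the two conditions recited in the limitation.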

Regarding claim 12, claim 12 is rejected on the same grounds as claim 2. Thus, the argument presented for claim 2 is equally applicable to claim 12.
Regarding claim 13, claim 13 is rejected on the same grounds as claim 3. Thus, the argument presented for claim 3 is equally applicable to claim 13.
Regarding claim 14, claim 14 is rejected on the same grounds as claim 4. Thus, the argument presented for claim 4 is equally applicable to claim 14.
Regarding claim 15, claim 15 is rejected on the same grounds as claim 5. Thus, the argument presented for claim 5 is equally applicable to claim 15.
Regarding claim 16, claim 16 is rejected on the same grounds as claim 6. Thus, the argument presented for claim 6 is equally applicable to claim 16.
Regarding claim 17, claim 17 is rejected on the same grounds as claim 7. Thus, the argument presented for claim 7 is equally applicable to claim 17.
Regarding claim 18, claim 18 is rejected on the same grounds as claim 8. Thus, the argument presented for claim 8 is equally applicable to claim 18.
Regarding claim 19, claim 19 is rejected on the same grounds as claim 9. Thus, the argument presented for claim 9 is equally applicable to claim 19.

Claims 3 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271) as applied above, and further in view of Diamant et al. (US Patent App. Pub. No.: US 2019/0027147 A1). Note that claims 3 and 13 are twice rejected under 35 U.S.C. 103 because the claimed “inquiry voice” has multiple meanings and thus supports multiple rejections.
Regarding claim 3, Hodge as combined teaches the method for controlling an electronic device of claim 2, 
wherein the user (said via “The video clip sharing request 800 may be user-generated” as modified via the combination) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) comprises a trigger voice (comprising “ ‘trigger’ words… associated with particular events”) for initiating an inquiry (via fig. 8:803: “PROVIDE IMAGE QUERY”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query”) included in the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) and an inquiry voice (said via “voice commands… identifying objects being displayed on a video” to “formulate an object descriptor query”) for the information (said comprised by said “files”) about the object (said via fig. 4a: “Nice Spot!” or said “image data around detected relevant points in the mage region…used as a query” via:

“[0082] Sound processing may also include speech recognition and natural language processing to recognize human speech, words, and/or commands.  For example, certain "trigger" words may be associated with particular events. When the "trigger" word is found present in the audio data, the corresponding event may be determined.  Similarly, the outputs of the available sensors may be received and processed to determine presence of patterns associated with events.  For example, GPS signals, accelerator signals, gyroscope signals, magnetometer signals, and the like may be received and analyzed to detect the presence of events.  In one embodiment, additional data received via wireless module 205, such as traffic information, weather information, police reports, or the like, is also used in the detection process.  The detection process 702 applies algorithms and heuristics that associate combinations of all these potential inputs with possible events.”), and 

wherein the inputting (via said “multiple inputs”) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”), based on the user (said via “The video clip sharing request 800 may be user-generated”) voice (said via “voice commands… identifying objects being displayed on a video”) input (said via fig. 8:arrows being input via “voice commands, or other user input”) being received (said as shown in fig. 8: any one arrow being input), into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”) comprises inputting (via said “multiple inputs”) the second frame (said “video…image” via said “video data may be uploaded to the cloud system 103” corresponding to “the image” or “the video data in which the object was detected”) based on the (said “displayed”) time point (said during said via “voice commands… identifying objects being displayed on a video”) when (at fig. 8:805: “MATCH?”) the trigger voice (said comprising “ ‘trigger’ words… associated with particular events” of interest to a user or driver identifying said “Nice Spot!”) is received (via said “multiple inputs”) into the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”).
Thus, Hodge as combined does not teach, as indicated in bold above, the “inquiry voice” meaning that the voice itself is the inquiry in contrast to an object-of-interest identifying voice serving as a basis of an inquiry descriptor as discussed in the rejection of claim 3 under 35 USC 102.

Accordingly, Diamant teaches:
an inquiry voice (via fig. 2A: 116: “Hey Ayeye, what is [this]?”).
Thus, one of ordinary skill in audio can modify Hodge’s trigger event words with Diamant’s teaching of fig. 2A: 116: “Hey Ayeye, what is [this]?” and recognize that the modification is predictable or looked forward to because Diamant’s teaching is “operative to perform intent understanding for identifying…information the user would like to obtain” such that “overall user experience…is enhanced” via Diamant:
“[0026] The intent system 126 is operative to receive the text translated from the received utterance 116 and the objects and text recognized from the captured image 136, and interpret the content of the image as part of the search query or command indicated in the utterance.  According to one aspect, the intent system 126 recognizes and replaces the trigger 134 in the text translated from the received utterance 116 with the identified object(s) and text from the captured image 136.  The intent system 126 is further operative to perform intent understanding for identifying an action the user 102 wants the client computing device 104 to take or information the user would like to obtain, conveyed in the spoken utterance 116.  According to an example, the intent system 126 is exposed as an API.

[0027] In some examples, the digital assistant 110 provides context information 138 to the image integrated query system 105.  Context data 138 can include, for example, time/date, the user's location, language, schedule, applications 108 installed on the client computing device 104, the user's preferences, the user's behaviors (in which such behaviors are monitored/tracked with notice to the user and the user's consent), stored contacts (including, in some cases, links to a local user's or remote user's social graph such as those maintained by external social networking services), call history, messaging history, browsing history, device type, device capabilities, and the like.  According to an aspect, the intent system 126 applies context data 138 that is available to it to enable interactions with the user 102 that are more natural and an overall user experience supported by the digital assistant 110 that is enhanced.  That is, the intent system 126 is operative to apply context data 138 provided to it by the digital assistant 110 to the combined text translated from the received utterance 116 and the objects and the text recognized from the captured image 136 for understanding the semantic intent of the search query or command indicated in the utterance 116.  According to examples, the intent system 126 uses natural language processing to process the combined text translated from the received utterance 116 and the objects and the text recognized from the captured image 136 in association with available context information 138.”

Regarding claim 13, claim 13 is rejected on the same grounds as claim 3. Thus, the argument presented for claim 3 is equally applicable to claim 13.

Claims 10 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271) as applied above, and further in view of Casper (US Patent App. Pub. No.: US 2015/0296250 A1).
Regarding claim 10, Hodge as combined teaches the method for controlling an electronic device of claim 9, wherein the external server (said fig. 1:102) recognizes a fingerprint included in the second frame (said via “video data may be uploaded to the cloud system 103”).
Thus, Hodge as combined does not teach “the external server recognizes a fingerprint included in the second frame”.
Accordingly, Casper teaches:
the external server (via figs. 6 and 7: “SERVER”) recognizes (via “one or more servers…are capable of…object…recognition”) a fingerprint (via fig. 3:360: “IDENTIFY… FINGERPRINT”) included in the second frame (via fig. 3:310: “VIDEO FRAME” via:
“[0105] Video processing server(s) 623 can include one or more servers that are capable of receiving, processing, storing, and/or delivering video content, performing object detection and/or recognition, receiving, processing, storing, and/or providing commerce information relating to merchandise items, searching for matching merchandise items, and/or performing any other suitable functions.”).

Thus, one of ordinary skill in the art of “Web-based…marketing materials” can modify Hodge’s fig. 1:102: “a server system” and uploading to the cloud, as shown in Hodge’s fig. 1:103: “cloud-based system”, corresponding to Hodge’s teaching of:
“[0027] Referring now to FIG. 1, an exemplary vehicular video-based data capture and analysis system 100 according to one embodiment of the disclosure is provided.  Client device 101 is a dedicated data capture and recording system suitable for installation in a vehicle.  In one embodiment, client device 101 is a video-based dash camera system designed for installation on the dashboard or windshield of a car.  Client device 101 is connected to cloud-based system 103.  In one embodiment, cloud-based system 103 includes a server system 102 and network connections, such as for example, to Internet connections.  In one embodiment, cloud-based system 103 is a set of software services and programs operating in a public data center, such as an Amazon Web Services (AWS) data center, a Google Cloud Platform data center, or the like.  Cloud-based system 103 is accessible via mobile device 104 and web-based system 105.  In one embodiment, mobile device 104 includes a mobile device, such as an Apple iOS based device, including iPhones, iPads, or iPods, or an Android based device, like a Samsung Galaxy smartphone, a tablet, or the like.  Any such mobile device includes an application program or app running on a processor.  Web-based system 105 can be any computing device capable of running a Web browser, such as for example, a Windows.TM.  PC or tablet, Mac Computer, or the like.  Web-based system 105 may provide access to information or marketing materials of a system operations for new or potential users.  In addition, Web-based system 105 may also optionally provide access to users via a software program or application similar to the mobile app further described below.  In one embodiment, system 100 may also include one or more auxiliary camera modules 106.  For example, one or more camera modules on a user's home, vacation home, or place of business.  Auxiliary camera module 106 may be implemented as a client device 101 and operate the same way.  In one embodiment, auxiliary camera module 106 is a version of client device 101 with a subset of components and functionality.  For example, in one embodiment, auxiliary camera module 106 is a single camera client device 101.”

with Casper’s teaching of figs. 6 and 7: “SERVER” with “object…recognition” by:
a)	installing into Hodge’s fig. 1:102: “a server system” the object recognition/identification; and
b)	sending the marketing materials over the cloud to users/consumers/buyers such that the marketing materials are recognized/identified by Hodge’s fig. 1:102;
and thus said one of skill would recognize that the modification is predictable or looked forward to because Casper’s teaching uses the recognition/identification of “identified objects” to “provide a viewer…with an opportunity to purchase one or more merchandise items” via Casper:
“[0032] In some implementations, the mechanisms can be used in a variety of applications.  For example, the mechanisms can provide commerce information relating to merchandise items presented in video content.  More particularly, for example, the mechanisms can identify discrete objects in a video frame and match the discrete objects against products and other merchandise items that are available for sale in a product catalogue.  The mechanisms can then store commerce information relating to the merchandise items (e.g., prices, product names, sellers of the products, links to ordering information, etc.) in association with video frames of the video content (e.g., by timestamping the commerce information).  As another example, the mechanisms can provide commerce information relating to merchandise items presented in video content in a real-time manner.  In a more particular example, in response to receiving an indication that a viewer of the video content is interested in merchandise items presented in the video content (e.g., a user request to pause the playback of the video content), the mechanisms can retrieve commerce information relating to the merchandise items and present the commerce information to the viewer.  In this example, the mechanisms can provide a viewer that is consuming video content with an opportunity to purchase one or more merchandise items corresponding to identified objects in a video frame and/or an opportunity to place the one or more merchandise items in a queue for making a purchasing decision at a later time without leaving or navigating away from the presented video content.”

Regarding claim 20, claim 20 is rejected on the same grounds as claim 10. Thus, the argument presented for claim 10 is equally applicable to claim 20.

Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Hodge et al. (US Patent App. Pub. No.: US 2018/0220189 A1) in view of Sanchez et al. (US Patent 10,182,271) as applied above, and further in view of MILSTEIN (WO 2019/030551 A1).
Regarding claim 22, Hodge as combined teaches the method of claim 2, wherein the providing the search result (via said fig. 8:800: “SHARING REQUEST” as modified via the combination) comprises obtaining a keyword corresponding to the object (said via fig. 4a: “Nice Spot!”) based on the information about the object that is acquired based on the output (via fig. 8:805: “MATCH?”) of the artificial intelligence model (said via “trained” “artificial intelligence (‘AI’) algorithms are applied to the multiple inputs to…match” as shown in fig. 8:805: “MATCH?”), and 
wherein the search result is obtained based on the keyword.   
Sanchez before being combined teaches claim 22 of:
providing the search (or seek via “query”, c.39,ll. 17-20) result comprises obtaining a keyword (via fig. 626: magnifying-glass:612: “ ‘Kamp Krusty’ ”, a program meta-data description for the seeking query) corresponding to the object based on the information about the object that is acquired based on the output of the artificial intelligence model, and 
wherein the search result is obtained based on the keyword (said “ ‘Kamp Krusty’ ”).

The combination does not teach:
obtaining a keyword corresponding to the object based on the information about the object that is acquired based on the output of the artificial intelligence model, and 
wherein the search result is obtained based on the keyword.

Thus, one of ordinary skill in the art of meta-data (as indicated in Hodge, fig. 8: “OBTAIN METADATA FOR VIDEO” and in Sanchez, “the media guidance application may use metadata”, c.2,ll. 39-41, “ ‘Kamp Krusty’ ”) can modify Hodge’s fig. 8: “OBTAIN METADATA FOR VIDEO” with said Sanchez’s seeking “query” by:
a)	making Hodge’s fig. 6c:656: “tagTitle” be the program description “ ‘Kamp Krusty’ ”;
b)	making Hodge’s fig. 8:803: “PROVIDE IMAGE QUERY” also include said seeking “query”; and
c)	recognizing that the modification is predictable or looked forward to for the same reasons as in claim 1’s discussion of the television asset (comprising the program description “ ‘Kamp Krusty’ ”) being useful and desirable.

The second combination does not teach:
“obtaining a keyword corresponding to the object based on the information about the object that is acquired based on the output of the artificial intelligence model”.

Milstein teaches claim 22 of:
providing (via fig. 1:all arrows) the search result (or “metadata…to be discovered” or found by search via “make those keywords universally searchable” via: 
page 10, last paragraph:
“Adobe Premiere has a built in marker function for applying notes to video files. The inventors have realized that Adobe Premiere's marker function is suitable to embed keywords and information in a video's .xmp metadata fields, and as a result make those keywords and information universally searchable. The inventors have further recognized that marker function is suitable for carrying out Steps 110, 112, 114, 116 and 118 of the method according to the invention.”;

page 13, lines 6-8:
“The end result of the method according to the invention is an immersive (360-degree) video tagged with metadata that allows for it to be discovered by using frame- and location based metadata markers.”) 

comprises obtaining a (searchable) keyword corresponding to the object (fig. 2:tree) based on the information (via fig. 1:Step 100: “Providing immersive media file”) about the object that is acquired (via an output) based on the output (via output of fig. 1:Step 102: “Generating keywords by AI”) of the artificial intelligence model (via said fig. 1:Step 102: “Generating keywords by AI”), and 
wherein the search result is obtained (via said “to be discovered”) based on the keyword.   

Thus, one of ordinary skill in the art of searching with metadata can modify the combination’s fig. 6c:656: “tagTitle”, as already modified via the combination of Sanchez, with Milstein’s fig. 1:all arrows by:
a)	inputting fig. 6c:656: “tagTitle” into Milstein’s fig.1 to make “tagTitle” universally searchable without exception; and
b)	recognizing that the modification is predictable or looked forward to because the modification shows “the accurateness of the given keyword” or shows the useful state (via a facilitated threshold review: Milstein: fig. 1: Step 104: “Review of keywords by human”) of being accurate and thus being “universally searchable”, cited above, without exception or searchable in every case, via Milstein, 2nd text-section after the figure listing:
“In Step 102 the frames 12 are preferably processed by performing image recognition on the frames by an artificial intelligence. One suitable artificial intelligence is Clarifai (www.clarifai.com). Clarifai's visual recognition model processes the frames 12 of the video file 10 in real-time and returns predictions on what is in the still images coded in the frames. These predictions take the form of keywords with associated probabilities indicating the probability of the accurateness of the given keyword.”

Suggestions
Applicant’s disclosure states:
“[0013] According to an embodiment of the disclosure as described above, a user becomes capable of searching information on an image content that the user is currently viewing more easily and intuitively through his or her voice, without stopping the reproduction of the image content.”

Claim 2 is directed to being easy and intuitive due to AI; however, claim 2 does not claim applicant’s fig. 4:S430: “ACQUIRING INFORMATION ON THE OBJECTS INCLUDED IN THE VIDEO OF THE PREDETERMINED SECOND PERIOD”.
In contrast, Milstein (WO 2019/030551) as applied above, teaches AI: “Clarifai's visual recognition model processes the frames 12 of the video file 10 in real-time and returns predictions on what is in the still images coded in the frames.”, page 4, ll. 23-29.
Note that these suggestions are not provided with respect to overcoming 35 USC 101, 112, 102 and/or 103. These suggestions are mainly provided to seek out advantages in the disclosure regardless of 35 USC 101, 112, 102 and/or 103.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENNIS ROSARIO whose telephone number is (571)272-7397. The examiner can normally be reached Monday-Friday, 9AM-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached on (571)272-7778. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DENNIS ROSARIO/Examiner, Art Unit 2667

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667