DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 08/22/22 has been entered.

Response to Amendment
The amendment filed on 08/22/22 has been entered. Claims 1, 3, 5-12, and 14-20 remain pending in the application.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5-9, 12, 14-18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Hannuksela (US 2013/0226850) in view of Kaufman (US 2011/0320202), and further in view of Ady (US 2015/0127710), Geisner (US 2013/0177296), and Ko (US 2014/0006020).
Regarding claim 1, Hannuksela discloses:
A method of processing image data, the method comprising: receiving environmental data and associated capture time data from a sensor of a mobile computing device, the capture time data reflecting capture time of the environmental data, wherein the environmental data comprise audio data at least by ([0077] “In this regard reference is first made to FIG. 1 which shows a schematic block diagram of an exemplary apparatus or electronic device 50, which may incorporate a context recognizing and adapting module 100 according to an embodiment of the invention.” [0078] “The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system, a digital camera, a laptop computer etc.” [0091] “In FIG. 4 a some further details of an example embodiment of the apparatus 50 are depicted. The context recognizing and adapting module 100 may comprise one or more sensor inputs 101 for inputting sensor data from one or more sensors 110 a-110 e. The sensor data may be in the form of electrical signals, for example as analog or digital signals.” [0108] “FIG. 5 depicts some processing steps of a method according to an embodiment of the invention. The user captures 501 a media clip, such as takes a photo, records an audio clip, or shoots a video. If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone. The audio recording may start e.g. when the user presses the shutter button 610 (FIG. 6 a) to begin the auto-focus feature, and end after a predetermined time. 
Alternatively, the audio recording may take place continuously when the camera application is active and a predetermined window of time with respect to the shooting time of the image is selected to the short audio clip to be analyzed.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”) and when a user captures media, such as taking an image or video, audio is also recorded at the same time.
processing the environmental data to identify, based on the audio data, a particular physical location at which the environmental data was captured at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0132] “Audio context information may describe general characteristics of the captured audio, such as energy, loudness, or spectrum. Audio context information may also describe the type of environment where the audio was captured. Example audio environments may be ‘office’, ‘car’, ‘restaurant’ etc. Audio context information may also identify one or more audio events describing audible sounds present in the location at which the audio was captured.”).
wherein the identifying of the particular physical location comprises: filtering the audio data using a … filter… at least by ([0166]-[0167], which disclose the filtering of audio data using mel-scale bandpass filters);
based on the processing of the environmental data, identifying in the audio data a verbal description…, provided by a user of the mobile computing device during video capture, of subject matter contained in video data associated with the environmental data; in an automated operation, automatically generating a transcription of the verbal description at least by ([0108] “If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the verbal description of subject matter contained in the video data and the transcription of the verbal description are the tags recognized by the speech recognizer, automatically and without user intervention, that are applied to the audio clip that was captured during the capture of an image/video;
generating metadata based at least in part on: the particular physical location, and the transcription of the verbal description included in the environmental data, such that the metadata includes a collection of words from the transcription of the verbal description at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. 
The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the generating of the metadata is the generating/extracting of the tags (transcription of the verbal description) which are based on speech recognition and an audio-based context such as surrounding audio ambiance (identified physical location), such as “outdoors”, “street”, “restaurant”, and “pub”;
time stamping the metadata using the capture time data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
receiving, at a processor, the video data and video time data associated with the environmental data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.” [0172] “other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time”).
the video data comprising a plurality of video frames and the video time data reflecting record time of the video data at least by ([0084] “In some embodiments of the invention, the apparatus 50 comprises a camera 62 capable of recording or detecting individual frames or images which are then passed to an image processing circuitry 60 or controller 56 for processing.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
correlating the metadata to the video data using the capture time data and the video time data at least by ([0172] “In some embodiments of the invention, the similarity obtained based on the audio analysis may be combined with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time and focus details, as well as potentially a second analysis based on image content.” [0173] “In one embodiment of the invention, a generic audio similarity/distance measure may be used to find images with similar audio background…The similarity/distance measure may also be based on Euclidean distance, correlation distance, cosine angle, Bhattacharyya distance, the Bayesian information criterion, or on L1 distance (taxi driver's distance), and the features may be time-aligned for comparison”) and the correlating is the combining of the similarity obtained based on the audio analysis with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture.
receiving a search query, including a search criterion, at the processor at least by ([0125] “In the next step, a second user, who may but need not be the same as the first user, enters 703 a search query for searching of media clips. The search query may be keyword-based or example-based or a combination thereof.”).
Hannuksela fails to disclose “filtering the audio data using a low-pass filter that excludes voice frequencies, thereby extracting non-voice environmental audio data; accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations; and comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result”.
However, Kaufman teaches the following limitations: filtering the audio data using a … filter that excludes voice frequencies, thereby extracting non-voice environmental audio data at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0067] “Background noise may be filtered out and separately analyzed to identify location”);
accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0064] “One having skill in the art will recognize that templates may be stored and/or transmitted along with payload information such as user information, location information and time information” [0065] “Templates may be formed for both the speaker and the background in essence de-convoluting the sound and creating individual templates.”) and the multiple audio fingerprints are the templates formed for the background sounds which are stored along with the sound location information and are accessed in order to be compared to background information templates that were filtered out in order to identify a location based on the background noise.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Kaufman into the teaching of Hannuksela because both references disclose the processing of audio data. Consequently, one of ordinary skill in the art would be motivated to modify the system as in Hannuksela to further include the matching of filtered audio fingerprints with those associated with known locations, as in Kaufman.
Hannuksela and Kaufman fail to disclose “…a low pass filter…; comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result”.
However, Ady teaches …a low pass filter… at least by ([0022] “In accordance with the embodiments, each mobile device that may access the server 105 includes an always-operating audio detection system. The always-operating audio detection system is operative to detect voice commands” [0041] “The mobile device 200 includes one or more microphones 225 (such as a microphone array) and a speaker 223 that are operatively coupled to configuration and pre-processing logic 221. The configuration and pre-processing logic 221 may include analog-to-digital converters (ADCs), digital-to-analog converters (DACs), echo cancellation, high-pass filters, low-pass filters, band-pass filters, adjustable band filters, noise reduction filtering, automatic gain control (AGC) and other audio processing that may be applied to filter noise from audio received using the one or more microphones 225.” [0034] “the mobile device always-operating audio detection system is used to identify users engaged in a common event, or having a common interest, by monitoring ambient audio and looking for audio signature matches across devices.”);
comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location at least by ([0023] “An “audio signature” may be an acoustic fingerprint that enables audio database searching and identification of audio samples contained in the audio data” [0025] “The event signature database 107 contains audio signatures for various types of known events for which audio data has been previously collected. Acoustic fingerprints have thus been generated to facilitate searchable “event signatures.” For example, the event signature database 107 may contain audio signatures for events such as a football game (crowd noise or other characteristic audio), an outdoor music concert, and indoor music concert, public speaking event or various other such events for which audio signatures may be collected and stored.” [0049] “embodiments, the grouping application may also receive location information from the mobile devices along with the audio data as well as timestamp information and may thus make the inference that mobile devices are present at the Google I/O event. In other words the grouping application 120 may assume or infer that since the mobile devices are at or near the location coordinates of the Google I/O event, and have sent the matching audio signatures (such as crowd noise) that have timestamps at or during the known time of the event, such mobile devices are likely to be present at the Google I/O event.” [0058] “the server 105 receives audio samples from the various mobile devices. 
In operation block 703, server 105, and the grouping application 120 residing thereon, compares the received audio samples to audio signatures contained in the various databases such as, but not limited to, event signature database 107 or media signature database 109 or some other database containing audio signatures.”) and the audio samples are processed using high-pass, low-pass, and band-pass filters before they are sent to the server 105. The samples of audio data from each of the mobile devices are compared to samples of audio in the event signature database that are known to be associated with certain events; further, location data and a timestamp can also be used in addition to the comparing of the audio samples in order to determine if the mobile devices are at the same event/location (particular physical location), such as the Google I/O event as provided in the example.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ady into the teaching of Hannuksela and Kaufman because the references similarly disclose the processing of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the filtering as in the combination of references to further include low-pass, high-pass, and band-pass filters as in Ady in order to separate the vocal frequencies and the background frequencies to be used in determining if the users are attending the same event.
Hannuksela, Kaufman, and Ady fail to disclose “... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result”.
However, Geisner teaches identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result at least by ([0096] “Mobile device 210 may subsequently search the metadata tag file associated with life recorder 240 to find and/or download a particular life recording of interest from application server 250.” [0117] “During a search of one or more entries in metadata tag file 672, if one or more fields in an entry are satisfied, then that entry may be deemed to be satisfied…Alternatively, once a search has been found, then the portion of the life recording corresponding with the Start timestamp and the End timestamp, e.g., Start timestamp: 00:25:12 & End timestamp: 00:35:10 as specified in index entry 676, may be found and downloaded from the life recording.”) and the frame within the video data is the portion of the life recording while the including of the frame in a search result is the downloading of the matching portion of the life event recording based on the matching of times, as aforementioned, in response to the searching of the entries in the metadata tag file.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Geisner into the teaching of Hannuksela, Kaufman, and Ady because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the additional search feature of Geisner that allows for the identification of portions of video content, such as a frame.
Hannuksela, Kaufman, Ady, and Geisner fail to disclose “... a verbal description comprising a sentence”.
However, Ko teaches the above limitation at least by ([0030] “For example, each word of the transcripted text report may be correlated with a corresponding position within the audio file, each sentence within the transcripted text report may be associated with a corresponding position within the audio file, and/or each paragraph within the transcripted text report may be correlated with a corresponding position within the audio file.” [0034] “the processing circuitry 22, such as the voice recognition engine implemented by the processing circuitry, may compare subsequences from the collection of word and audio location pairs that have been identified from the audio file with sentences from the transcripted text report on a sentence-by-sentence basis in order to determine a correspondence therebetween”) and the verbal description comprising a sentence is the transcripted text of the audio which comprises sentences.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ko into the teaching of Hannuksela, Kaufman, Ady, and Geisner because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the verbal transcription of audio data comprising sentences as in Ko “in order to improve the efficiency with which the transcripted text may be reviewed in relation to the corresponding audio file” (Ko, [0025]).
As per claim 3, claim 1 is incorporated, Kaufman further discloses:
wherein identifying the particular physical location further comprises generating an audio fingerprint using at least a portion of the non-voice environmental audio data, and comparing the audio fingerprint to the multiple stored audio fingerprints at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0074] “At a step 516 a comparison is performed. This comparison includes creating one or more templates from the received audio of step 512 and comparing that template to those persisted in memory.”).
As per claim 5, claim 1 is incorporated, Kaufman further discloses:
wherein the extracted non-voice environmental audio data comprise background audio data and the method includes using the background audio data to identify the particular physical location at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0074] “At a step 516 a comparison is performed. This comparison includes creating one or more templates from the received audio of step 512 and comparing that template to those persisted in memory.”).
As per claim 6, claim 1 is incorporated, Hannuksela further discloses:
filtering the audio data using at least one of a band pass filter and a high pass filter… at least by ([0166]-[0168], which describe the use of bandpass mel-filters (band pass filter));
Kaufman further discloses:
wherein the processing of the environmental data further includes: filtering the audio data…, to extract voice-containing environmental audio data at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0067] “Background noise may be filtered out and separately analyzed to identify location”);
Geisner further discloses:
performing voice recognition on the voice-containing environmental audio data to identify a word, and generating the metadata to include the word at least by ([0057] “Processing unit 191 may include one or more processors for executing object, facial, and voice recognition algorithms. In one embodiment, processing engine 194 may apply object recognition and facial recognition techniques to image or video data. For example, object recognition may be used to detect particular objects (e.g., soccer balls, cars, or landmarks) and facial recognition may be used to detect the face of a particular person. Processing engine 194 may apply audio and voice recognition techniques to audio data. For example, audio recognition may be used to detect a particular sound or word being uttered and voice recognition may be used to detect the voice of a particular person. The particular faces, voices, sounds, and objects to be detected may be stored in one or more memories contained in memory unit 192.”).
As per claim 7, claim 1 is incorporated, Geisner further discloses:
wherein the environmental data comprises location data, the method further comprising processing of the environmental data to identify at least one place, object or event using the location data, and generating the metadata to include the at least one place, object or event at least by ([0066] “For example, a metadata tag <Greece> may be automatically generated for the life recording based on location information associated with the life recorder at the time of recording.”).
As per claim 8, claim 1 is incorporated, Geisner further discloses:
including performing image recognition with respect to the frame, and tagging the frame with a tag indicative of an object recognized within the frame, the identifying of the frame performed by matching the search criterion and the tag at least by ([0108] “In step 432, a particular situation associated with a video (or image) recording may be identified. In one embodiment, a particular situation associated with a video (or image) recording may be identified using object, pattern, and/or facial recognition techniques. For example, facial recognition may be used to identify a particular person and object recognition may be used to identify a particular object within a portion of a video recording. In one example, the particular situation identified may include detecting of a particular object (e.g., a soccer ball) and a particular person (e.g., a friend).” [0109] “In step 434, the particular situation identified in steps 430-432 may be stored for further processing and/or future use. For example, a particular situation identified from a video recording may be used to determine whether a tag event exists according to step 368 of FIG. 3B. In one embodiment, the particular situation identified in steps 430-432 may be stored locally in the life recorder itself or in a remote storage device (e.g., application server 250 of FIG. 1A).”).
As per claim 9, claim 1 is incorporated, Geisner further discloses:
including performing characteristic recognition with respect to the frame, and tagging the frame with a tag indicative of a characteristic recognized in the frame, the identifying of the frame being performed by matching the search criterion and the tag at least by ([0108] “In step 432, a particular situation associated with a video (or image) recording may be identified. In one embodiment, a particular situation associated with a video (or image) recording may be identified using object, pattern, and/or facial recognition techniques. For example, facial recognition may be used to identify a particular person and object recognition may be used to identify a particular object within a portion of a video recording. In one example, the particular situation identified may include detecting of a particular object (e.g., a soccer ball) and a particular person (e.g., a friend).” [0109] “In step 434, the particular situation identified in steps 430-432 may be stored for further processing and/or future use. For example, a particular situation identified from a video recording may be used to determine whether a tag event exists according to step 368 of FIG. 3B. In one embodiment, the particular situation identified in steps 430-432 may be stored locally in the life recorder itself or in a remote storage device (e.g., application server 250 of FIG. 1A).”) and the characteristic could be any of the detected situations using object, pattern, and/or facial recognition techniques, for example.
Regarding claim 12, Hannuksela discloses:
A system, comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, configure the system to: receive environmental data and associated capture time data from a sensor of a mobile computing device, the capture time data reflecting capture time of the environmental data, wherein the environmental data comprise audio data at least by ([0077] “In this regard reference is first made to FIG. 1 which shows a schematic block diagram of an exemplary apparatus or electronic device 50, which may incorporate a context recognizing and adapting module 100 according to an embodiment of the invention.” [0078] “The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system, a digital camera, a laptop computer etc.” [0091] “In FIG. 4 a some further details of an example embodiment of the apparatus 50 are depicted. The context recognizing and adapting module 100 may comprise one or more sensor inputs 101 for inputting sensor data from one or more sensors 110 a-110 e. The sensor data may be in the form of electrical signals, for example as analog or digital signals.” [0108] “FIG. 5 depicts some processing steps of a method according to an embodiment of the invention. The user captures 501 a media clip, such as takes a photo, records an audio clip, or shoots a video. If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone. The audio recording may start e.g. when the user presses the shutter button 610 (FIG. 6 a) to begin the auto-focus feature, and end after a predetermined time. 
Alternatively, the audio recording may take place continuously when the camera application is active and a predetermined window of time with respect to the shooting time of the image is selected to the short audio clip to be analyzed.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”) and when a user captures media, such as taking an image or video, audio is also recorded at the same time.
process the environmental data to identify, based on the audio data, a particular physical location at which the environmental data was captured at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0132] “Audio context information may describe general characteristics of the captured audio, such as energy, loudness, or spectrum. Audio context information may also describe the type of environment where the audio was captured. Example audio environments may be ‘office’, ‘car’, ‘restaurant’ etc. Audio context information may also identify one or more audio events describing audible sounds present in the location at which the audio was captured.”).
wherein the identifying of the particular physical location comprises: filtering the audio data using a … filter… at least by ([0166]-[0167], which disclose the filtering of audio data using mel-scale bandpass filters);
identify in the audio data a verbal description, provided by a user of the mobile computing device during video capture, of subject matter contained in video data associated with the environmental data; automatically generate a transcription of the verbal description at least by ([0108] “If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the verbal description of subject matter contained in the video data and the transcription of the verbal description are the tags recognized by the speech recognizer, automatically and without user intervention, that are applied to the audio clip that was captured during the capture of an image/video;
generate metadata based at least in part on: the particular physical location, and the transcription of the verbal description included in the environmental data, such that the metadata includes a collection of words from the transcription of the verbal description at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. 
The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the generating of the metadata is the generating/extracting of the tags (transcription of the verbal description) which are based on speech recognition and an audio-based context such as surrounding audio ambiance (identified physical location), such as “outdoors”, “street”, “restaurant”, and “pub”;
time stamping the metadata using the capture time data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
receiving, at a processor, the video data and video time data associated with the environmental data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.” [0172] “other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time”).
the video data comprising a plurality of video frames and the video time data reflecting record time of the video data at least by ([0084] “In some embodiments of the invention, the apparatus 50 comprises a camera 62 capable of recording or detecting individual frames or images which are then passed to an image processing circuitry 60 or controller 56 for processing.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
correlating the metadata to the video data using the capture time data and the video time data at least by ([0172] “In some embodiments of the invention, the similarity obtained based on the audio analysis may be combined with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time and focus details, as well as potentially a second analysis based on image content.” [0173] “In one embodiment of the invention, a generic audio similarity/distance measure may be used to find images with similar audio background…The similarity/distance measure may also be based on Euclidean distance, correlation distance, cosine angle, Bhattacharyya distance, the Bayesian information criterion, or on L1 distance (taxi driver's distance), and the features may be time-aligned for comparison”) and the correlating is the combining of the similarity obtained based on the audio analysis with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture.
receiving a search query, including a search criterion, at the processor at least by ([0125] “In the next step, a second user, who may but need not be the same as the first user, enters 703 a search query for searching of media clips. The search query may be keyword-based or example-based or a combination thereof.”).
Hannuksela fails to disclose “filtering the audio data using a low-pass filter that excludes voice frequencies, thereby extracting non-voice environmental audio data; accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations; and comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identify a frame within the video data by performing a search of the metadata using the search criterion; and include the identified frame in a search result”
However, Kaufman teaches the following limitations, filtering the audio data using a … filter that excludes voice frequencies, thereby extracting non-voice environmental audio data at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0067] “Background noise may be filtered out and separately analyzed to identify location”);
accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0064] “One having skill in the art will recognize that templates may be stored and/or transmitted along with payload information such as user information, location information and time information” [0065] “Templates may be formed for both the speaker and the background in essence de-convoluting the sound and creating individual templates.”) and the multiple audio fingerprints are the templates formed for the background sounds which are stored along with the sound location information and are accessed in order to be compared to background information templates that were filtered out in order to identify a location based on the background noise.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Kaufman into the teaching of Hannuksela because both references disclose the processing of audio data. Consequently, one of ordinary skill in the art would be motivated to modify the system as in Hannuksela to further include the matching of filtered audio fingerprints with those associated with known locations as in Kaufman.
Hannuksela and Kaufman fail to disclose “…a low pass filter…; comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identify a frame within the video data by performing a search of the metadata using the search criterion; and include the identified frame in a search result.”
However, Ady teaches …a low pass filter… at least by ([0022] “In accordance with the embodiments, each mobile device that may access the server 105 includes an always-operating audio detection system. The always-operating audio detection system is operative to detect voice commands” [0041] “The mobile device 200 includes one or more microphones 225 (such as a microphone array) and a speaker 223 that are operatively coupled to configuration and pre-processing logic 221. The configuration and pre-processing logic 221 may include analog-to-digital converters (ADCs), digital-to-analog converters (DACs), echo cancellation, high-pass filters, low-pass filters, band-pass filters, adjustable band filters, noise reduction filtering, automatic gain control (AGC) and other audio processing that may be applied to filter noise from audio received using the one or more microphones 225.” [0034] “the mobile device always-operating audio detection system is used to identify users engaged in a common event, or having a common interest, by monitoring ambient audio and looking for audio signature matches across devices.”);
comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location at least by ([0023] “An “audio signature” may be an acoustic fingerprint that enables audio database searching and identification of audio samples contained in the audio data” [0025] “The event signature database 107 contains audio signatures for various types of known events for which audio data has been previously collected. Acoustic fingerprints have thus been generated to facilitate searchable “event signatures.” For example, the event signature database 107 may contain audio signatures for events such as a football game (crowd noise or other characteristic audio), an outdoor music concert, and indoor music concert, public speaking event or various other such events for which audio signatures may be collected and stored.” [0049] “embodiments, the grouping application may also receive location information from the mobile devices along with the audio data as well as timestamp information and may thus make the inference that mobile devices are present at the Google I/O event. In other words the grouping application 120 may assume or infer that since the mobile devices are at or near the location coordinates of the Google I/O event, and have sent the matching audio signatures (such as crowd noise) that have timestamps at or during the known time of the event, such mobile devices are likely to be present at the Google I/O event.” [0058] “the server 105 receives audio samples from the various mobile devices. 
In operation block 703, server 105, and the grouping application 120 residing thereon, compares the received audio samples to audio signatures contained in the various databases such as, but not limited to, event signature database 107 or media signature database 109 or some other database containing audio signatures.”) and the audio samples are processed using high-pass, low-pass, and band-pass filters before they are sent to the server 105. The samples of audio data from each of the mobile devices are compared to samples of audio in the event signature database that are known to be associated with certain events; further, location data and a timestamp can also be used in addition to the comparing of the audio samples in order to determine if the mobile devices are at the same event/location (particular physical location), such as the Google I/O event as provided in the example.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ady into the teaching of Hannuksela and Kaufman because the references similarly disclose the processing of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the filtering as in the combination of references to further include low-pass, high-pass, and band-pass filters as in Ady in order to separate the vocal frequencies and the background frequencies to be used in determining if the users are attending the same event.
Hannuksela, Kaufman, and Ady fail to disclose “... a verbal description comprising a sentence; identify a frame within the video data by performing a search of the metadata using the search criterion; and include the identified frame in a search result.”
However, Geisner teaches identify a frame within the video data by performing a search of the metadata using the search criterion; and include the identified frame in a search result at least by ([0096] “Mobile device 210 may subsequently search the metadata tag file associated with life recorder 240 to find and/or download a particular life recording of interest from application server 250.” [0117] “During a search of one or more entries in metadata tag file 672, if one or more fields in an entry are satisfied, then that entry may be deemed to be satisfied…Alternatively, once a search has been found, then the portion of the life recording corresponding with the Start timestamp and the End timestamp, e.g., Start timestamp: 00:25:12 & End timestamp: 00:35:10 as specified in index entry 676, may be found and downloaded from the life recording.”) and the frame within the video data is the portion of the life recording while the including of the frame in a search result is the downloading of the matching portion of the life event recording based on the matching of times, as aforementioned, in response to the searching of the entries in the metadata tag file.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Geisner into the teaching of Hannuksela, Kaufman, and Ady because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the additional search feature of Geisner that allows for the identification of portions of video content, such as a frame.
Hannuksela, Kaufman, Ady, and Geisner fail to disclose “... a verbal description comprising a sentence.”
However, Ko teaches the above limitation at least by ([0030] “For example, each word of the transcripted text report may be correlated with a corresponding position within the audio file, each sentence within the transcripted text report may be associated with a corresponding position within the audio file, and/or each paragraph within the transcripted text report may be correlated with a corresponding position within the audio file.” [0034] “the processing circuitry 22, such as the voice recognition engine implemented by the processing circuitry, may compare subsequences from the collection of word and audio location pairs that have been identified from the audio file with sentences from the transcripted text report on a sentence-by-sentence basis in order to determine a correspondence therebetween”) and the verbal description comprising a sentence is the transcripted text of the audio which comprises sentences.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ko into the teaching of Hannuksela, Kaufman, Ady, and Geisner because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the verbal transcription of audio data comprising sentences as in Ko “in order to improve the efficiency with which the transcripted text may be reviewed in relation to the corresponding audio file” (Ko, [0025]).
As per claim 15, claim 12 is incorporated and Kaufman further discloses:
wherein identifying the particular physical location further comprises filtering the audio data to extract background audio data and the instructions, when executed by the at least one processor, configure the system to use the background audio data to identify the particular physical location at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0074] “At a step 516 a comparison is performed. This comparison includes creating one or more templates from the received audio of step 512 and comparing that template to those persisted in memory.”).
As per claim 16, claim 12 is incorporated and Hannuksela further discloses:
wherein the instructions, when executed by the at least one processor, further configure the system to perform operations comprising: filter the audio data using at least one of a band pass filter and a high pass filter… at least by ([0166]-[0168], which describe the use of bandpass mel-filters (band pass filter));
Kaufman further discloses:
filter the audio data…, to extract voice-containing environmental audio data at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0067] “Background noise may be filtered out and separately analyzed to identify location”);
Geisner further discloses:
performing voice recognition on the voice-containing environmental audio data to identify a word; and generate the metadata to include the word at least by ([0057] “Processing unit 191 may include one or more processors for executing object, facial, and voice recognition algorithms. In one embodiment, processing engine 194 may apply object recognition and facial recognition techniques to image or video data. For example, object recognition may be used to detect particular objects (e.g., soccer balls, cars, or landmarks) and facial recognition may be used to detect the face of a particular person. Processing engine 194 may apply audio and voice recognition techniques to audio data. For example, audio recognition may be used to detect a particular sound or word being uttered and voice recognition may be used to detect the voice of a particular person. The particular faces, voices, sounds, and objects to be detected may be stored in one or more memories contained in memory unit 192.”).
Regarding claim 20, Hannuksela discloses:
A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a computer, cause the computer to perform operations comprising: receiving environmental data and associated capture time data from a sensor of a mobile computing device, the capture time data reflecting capture time of the environmental data, wherein the environmental data comprise audio data at least by ([0077] “In this regard reference is first made to FIG. 1 which shows a schematic block diagram of an exemplary apparatus or electronic device 50, which may incorporate a context recognizing and adapting module 100 according to an embodiment of the invention.” [0078] “The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system, a digital camera, a laptop computer etc.” [0091] “In FIG. 4 a some further details of an example embodiment of the apparatus 50 are depicted. The context recognizing and adapting module 100 may comprise one or more sensor inputs 101 for inputting sensor data from one or more sensors 110 a-110 e. The sensor data may be in the form of electrical signals, for example as analog or digital signals.” [0108] “FIG. 5 depicts some processing steps of a method according to an embodiment of the invention. The user captures 501 a media clip, such as takes a photo, records an audio clip, or shoots a video. If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone. The audio recording may start e.g. when the user presses the shutter button 610 (FIG. 6 a) to begin the auto-focus feature, and end after a predetermined time. 
Alternatively, the audio recording may take place continuously when the camera application is active and a predetermined window of time with respect to the shooting time of the image is selected to the short audio clip to be analyzed.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”) and when a user captures media, such as taking an image or video, audio is also recorded at the same time.
processing the environmental data to identify, based on the audio data, a particular physical location at which the environmental data was captured at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0132] “Audio context information may describe general characteristics of the captured audio, such as energy, loudness, or spectrum. Audio context information may also describe the type of environment where the audio was captured. Example audio environments may be ‘office’, ‘car’, ‘restaurant’ etc. Audio context information may also identify one or more audio events describing audible sounds present in the location at which the audio was captured.”).
wherein the identifying of the particular physical location comprises: filtering the audio data using a … filter… at least by ([0166]-[0167], which disclose the filtering of audio data using mel-scale bandpass filters);
based on the processing of the environmental data, identifying in the audio data a verbal description, provided by a user of the mobile computing device during video capture, of subject matter contained in video data associated with the environmental data; in an automated operation, automatically generating a transcription of the verbal description at least by ([0108] “If a still image or a video is taken in step 501, an audio clip (e.g. 10 s for still images) may be recorded with the microphone.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the verbal description of subject matter contained in the video data and the transcription of the verbal description are the tags recognized by the speech recognizer, automatically and without user intervention, that are applied to the audio clip that was captured during the capture of an image/video;
generating metadata based at least in part on: the particular physical location, and the transcription of the verbal description included in the environmental data, such that the metadata includes a collection of words from the transcription of the verbal description at least by ([0109] “Before, during, and/or after the capture, the apparatus 50 runs the context recognizer 107, which recognizes 502 the context around the device during or related to the media capture and/or the activity of the user before, during, and/or after the media capture. In one embodiment, the context recognizer 107 is an audio-based context recognizer which produces information on the surrounding audio ambiance. The audio-based context recognizer may e.g. produce tags like “street”, “outdoors”, “nature”, “birds”, “music”, “people”, “vehicles”, “restaurant”, “pub”, and so on, each with an associated likelihood indicating how confident the context recognition is. In one embodiment, the activity context recognizer uses accelerometer, audio, and/or other sensors to determine the user's activity before media capture. The activity context recognizer may e.g. produce tags like “driving”, “walking”, and “cycling”, each with an associated likelihood indicating how confident the context recognition is. An example of a captured image 601 is shown in FIG. 6 a.” [0175] “a speech recognizer 152 is applied on the audio clip to extract tags uttered by the user to be associated to the image. The tags may be spoken one at a time, with a short pause in between them. 
The speech recognizer 152 may then recognize spoken tags from the audio clip, which has been converted into a feature representation (MFCCs for example).”) and the generating of the metadata is the generating/extracting of the tags (transcription of the verbal description) which are based on speech recognition and an audio-based context such as surrounding audio ambiance (identified physical location), such as “outdoors”, “street”, “restaurant”, and “pub”;
time stamping the metadata using the capture time data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
receiving the video data and video time data associated with the environmental data at least by ([0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.” [0172] “other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time”).
the video data comprising a plurality of video frames and the video time data reflecting record time of the video data at least by ([0084] “In some embodiments of the invention, the apparatus 50 comprises a camera 62 capable of recording or detecting individual frames or images which are then passed to an image processing circuitry 60 or controller 56 for processing.” [0121] “In addition to the tag, the time of the media capture to which the tag is related may be shared.”).
correlating the metadata to the video data using the capture time data and the video time data at least by ([0172] “In some embodiments of the invention, the similarity obtained based on the audio analysis may be combined with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture, and similar camera settings such as exposure time and focus details, as well as potentially a second analysis based on image content.” [0173] “In one embodiment of the invention, a generic audio similarity/distance measure may be used to find images with similar audio background…The similarity/distance measure may also be based on Euclidean distance, correlation distance, cosine angle, Bhattacharyya distance, the Bayesian information criterion, or on L1 distance (taxi driver's distance), and the features may be time-aligned for comparison”) and the correlating is the combining of the similarity obtained based on the audio analysis with other pieces of similarity obtained from image metadata, such as the same or similar textual tags, similar time of year and time of day and location of shooting a picture.
receiving a search query, including a search criterion at least by ([0125] “In the next step, a second user, who may but need not be the same as the first user, enters 703 a search query for searching of media clips. The search query may be keyword-based or example-based or a combination thereof.”).
Hannuksela fails to disclose “filtering the audio data using a low-pass filter that excludes voice frequencies, thereby extracting non-voice environmental audio data; accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations; and comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result”
However, Kaufman teaches the following limitations, filtering the audio data using a … filter that excludes voice frequencies, thereby extracting non-voice environmental audio data at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0067] “Background noise may be filtered out and separately analyzed to identify location”);
accessing stored audio fingerprint data that comprises multiple audio fingerprints associated with respective known locations at least by ([Abstract] “background information may be filtered out of the voice signal and separately compared against the database to assist in identifying a location based on the background noise” [0064] “One having skill in the art will recognize that templates may be stored and/or transmitted along with payload information such as user information, location information and time information” [0065] “Templates may be formed for both the speaker and the background in essence de-convoluting the sound and creating individual templates.”) and the multiple audio fingerprints are the templates formed for the background sounds which are stored along with the sound location information and are accessed in order to be compared to background information templates that were filtered out in order to identify a location based on the background noise.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Kaufman into the teaching of Hannuksela because both references disclose the processing of audio data. Consequently, one of ordinary skill in the art would be motivated to modify the system as in Hannuksela to further include the matching of filtered audio fingerprints with those associated with known locations as in Kaufman.
Hannuksela and Kaufman fail to disclose “…a low pass filter…; comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location; ... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result.”
However, Ady teaches …a low pass filter… at least by ([0022] “In accordance with the embodiments, each mobile device that may access the server 105 includes an always-operating audio detection system. The always-operating audio detection system is operative to detect voice commands” [0041] “The mobile device 200 includes one or more microphones 225 (such as a microphone array) and a speaker 223 that are operatively coupled to configuration and pre-processing logic 221. The configuration and pre-processing logic 221 may include analog-to-digital converters (ADCs), digital-to-analog converters (DACs), echo cancellation, high-pass filters, low-pass filters, band-pass filters, adjustable band filters, noise reduction filtering, automatic gain control (AGC) and other audio processing that may be applied to filter noise from audio received using the one or more microphones 225.” [0034] “the mobile device always-operating audio detection system is used to identify users engaged in a common event, or having a common interest, by monitoring ambient audio and looking for audio signature matches across devices.”);
comparing the non-voice environmental audio data to the multiple audio fingerprints to identify the particular physical location at least by ([0023] “An “audio signature” may be an acoustic fingerprint that enables audio database searching and identification of audio samples contained in the audio data” [0025] “The event signature database 107 contains audio signatures for various types of known events for which audio data has been previously collected. Acoustic fingerprints have thus been generated to facilitate searchable “event signatures.” For example, the event signature database 107 may contain audio signatures for events such as a football game (crowd noise or other characteristic audio), an outdoor music concert, and indoor music concert, public speaking event or various other such events for which audio signatures may be collected and stored.” [0049] “embodiments, the grouping application may also receive location information from the mobile devices along with the audio data as well as timestamp information and may thus make the inference that mobile devices are present at the Google I/O event. In other words the grouping application 120 may assume or infer that since the mobile devices are at or near the location coordinates of the Google I/O event, and have sent the matching audio signatures (such as crowd noise) that have timestamps at or during the known time of the event, such mobile devices are likely to be present at the Google I/O event.” [0058] “the server 105 receives audio samples from the various mobile devices. 
In operation block 703, server 105, and the grouping application 120 residing thereon, compares the received audio samples to audio signatures contained in the various databases such as, but not limited to, event signature database 107 or media signature database 109 or some other database containing audio signatures.”) and the audio samples are processed using high-pass, low-pass, and band-pass filters before they are sent to the server 105. The samples of audio data from each of the mobile devices are compared to samples of audio in the event signature database that are known to be associated with certain events; further, location data and a timestamp can also be used in addition to the comparing of the audio samples in order to determine if the mobile devices are at the same event/location (particular physical location), such as the Google I/O event as provided in the example.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ady into the teaching of Hannuksela, Kaufman because the references similarly disclose the processing of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the filtering as in the combination of references to further include low pass, high-pass, and band-pass filters as in Ady in order to separate the vocal frequencies and the background frequencies to be used in determining if the users are attending the same event.
Hannuksela, Kaufman, and Ady fail to disclose “... a verbal description comprising a sentence; identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result”
However, Geisner teaches identifying a frame within the video data by performing a search of the metadata using the search criterion; and including the identified frame in a search result at least by ([0096] “Mobile device 210 may subsequently search the metadata tag file associated with life recorder 240 to find and/or download a particular life recording of interest from application server 250.” [0117] “During a search of one or more entries in metadata tag file 672, if one or more fields in an entry are satisfied, then that entry may be deemed to be satisfied…Alternatively, once a search has been found, then the portion of the life recording corresponding with the Start timestamp and the End timestamp, e.g., Start timestamp: 00:25:12 & End timestamp: 00:35:10 as specified in index entry 676, may be found and downloaded from the life recording.”) and the frame within the video data is the portion of the life recording while the including of the frame in a search result is the downloading of the matching portion of the life event recording based on the matching of times, as aforementioned, in response to the searching of the entries in the metadata tag file.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Geisner into the teaching of Hannuksela, Kaufman, Ady because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the additional search feature of Geisner that allows for the identification of portions of video content, such as a frame.
Hannuksela, Kaufman, Ady, and Geisner fail to disclose “... a verbal description comprising a sentence”
However, Ko teaches the above limitation at least by ([0030] “For example, each word of the transcripted text report may be correlated with a corresponding position within the audio file, each sentence within the transcripted text report may be associated with a corresponding position within the audio file, and/or each paragraph within the transcripted text report may be correlated with a corresponding position within the audio file.” [0034] “the processing circuitry 22, such as the voice recognition engine implemented by the processing circuitry, may compare subsequences from the collection of word and audio location pairs that have been identified from the audio file with sentences from the transcripted text report on a sentence-by-sentence basis in order to determine a correspondence therebetween”) and the verbal description comprising a sentence is the transcripted text of the audio which comprises sentences.
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Ko into the teaching of Hannuksela, Kaufman, Ady, Geisner because the references similarly disclose the mining of captured audio data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the verbal transcription of audio data comprising sentences as in Ko “in order to improve the efficiency with which the transcripted text may be reviewed in relation to the corresponding audio file” (Ko, [0025]).
Claims 14, 17, 18 recite claim limitations equivalent to those of the method of claims 3, 7, 9, except that they set forth the claimed invention as a system; as such, they are rejected for the same reasons as applied hereinabove.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Hannuksela (US 2013/0226850) in view of Kaufman (US 2011/0320202) and Ady (US 2015/0127710) and Geisner (US 2013/0177296) and Ko (US 2014/0006020) and further in view of Archibong (US 2014/0067945).
As per claim 10, claim 1 is incorporated. Hannuksela, Kaufman, Ady, Geisner, and Ko fail to disclose “comprising receiving segments of the video data by a messaging system as part of respective video messages, and combining the segments to constitute the video data”
However, Archibong teaches the above limitation at least by ([0127] “Social TV dongle 810 may then decode incoming video stream 850 into a series of incoming video frames 1120. Social TV dongle 810 then overlays top frame 1130 onto incoming video frame 1120 to create a combined output frame 1110. Combined output frames 1110 are then sent as a modified video stream 860 for display on TV 830.”).
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Archibong into the teaching of Hannuksela, Kaufman, Ady, Geisner, Ko because the references similarly disclose the mining of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the generating of videos from segments as in Archibong so that the system can create dynamic and customized videos based on received segments.

Claims 11, 19 are rejected under 35 U.S.C. 103 as being unpatentable over Hannuksela (US 2013/0226850) in view of Kaufman (US 2011/0320202) and Ady (US 2015/0127710) and Geisner (US 2013/0177296) and Ko (US 2014/0006020) and Archibong (US 2014/0067945) and further in view of Shi (US 2016/0359778).
As per claim 11, claim 10 is incorporated. Hannuksela, Kaufman, Ady, Geisner, Ko, and Archibong fail to disclose “wherein the respective video messages are ephemeral messages”
However, Shi discloses the above limitation at least by ([0020] “If the second user accepts the first user's request, the application controller 101 establishes a communication channel between the two users, through which they can exchange messages.” [0043] “At step 302, the process 300 determines whether the message is an ephemeral message. If the message is an ephemeral message, the process 300 goes to step 304.”).
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Shi into the teaching of Hannuksela, Kaufman, Ady, Geisner, Ko, Archibong because the references similarly disclose the mining of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with messages including ephemeral messages as in Shi in order to protect private video data and conserve resources as a result of the temporary storage of the messages.
As per claim 19, claim 12 is incorporated. Hannuksela, Kaufman, Ady, Geisner, and Ko fail to disclose “further configure the system to receive segments of the video data by a messaging system as part of respective video messages, and to combine the segments to constitute the video data; and the respective video messages are ephemeral messages”
However, Archibong teaches further configure the system to receive segments of the video data by a messaging system as part of respective video messages, and to combine the segments to constitute the video data at least by ([0127] “Social TV dongle 810 may then decode incoming video stream 850 into a series of incoming video frames 1120. Social TV dongle 810 then overlays top frame 1130 onto incoming video frame 1120 to create a combined output frame 1110. Combined output frames 1110 are then sent as a modified video stream 860 for display on TV 830.”).
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Archibong into the teaching of Hannuksela, Kaufman, Ady, Geisner, Ko because the references similarly disclose the mining of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with the generating of videos from segments as in Archibong in order to create dynamic videos from received segments.
Hannuksela, Kaufman, Ady, Geisner, Ko, and Archibong fail to disclose “and the respective video messages are ephemeral messages”
However, Shi discloses the above limitation at least by ([0020] “If the second user accepts the first user's request, the application controller 101 establishes a communication channel between the two users, through which they can exchange messages.” [0043] “At step 302, the process 300 determines whether the message is an ephemeral message. If the message is an ephemeral message, the process 300 goes to step 304.”).
Therefore, it would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the teaching of Shi into the teaching of Hannuksela, Kaufman, Ady, Geisner, Ko, Archibong because the references similarly disclose the mining of captured audio/video data. Consequently, one of ordinary skill in the art would be motivated to modify the combination of references with messages including ephemeral messages as in Shi in order to protect private video data and conserve resources as a result of the temporary storage of the messages.

	
Response to Arguments
The following is in response to the amendment filed on 08/22/22.
Applicant’s arguments with respect to the prior art rejections have been considered but are moot because they do not apply to all of the references being used in the current rejection.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM P BARTLETT whose telephone number is (469)295-9085.  The examiner can normally be reached on M-Th 11:30-8:30, F 11-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Usmaan Saeed, can be reached at 571-272-4046.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/WILLIAM P BARTLETT/
Examiner, Art Unit 2169

/USMAAN SAEED/
Supervisory Patent Examiner, Art Unit 2169