DETAILED ACTION
This action is in response to the Amendment dated 08 August 2022.  Claims 1, 8, 9, 10, 12, 13 and 16 are amended.  No claims have been added or cancelled.  Claims 1-20 remain pending and have been considered below.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Based on applicant’s amendment, the claim objections to claims 8, 9 and 12 are withdrawn.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-10, 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2014/0314391 A1) in view of McCauley et al. (US 2017/0293461 A1).

As for independent claim 1, Kim teaches a method comprising:
receiving a first audio segment, wherein the first audio segment is non-spatialized, and wherein the first audio segment is associated with first video frames [(e.g. see Kim paragraph 0060) ”FIG. 4 illustrates an example of collecting an image and audio from a video in an electronic device according to an embodiment of the present invention. Referring to FIG. 4, a shot video 410 includes a video track 413 and audio track 415. The video track 413 includes a plurality of frames (frame #1, frame #2, frame #3, . . . frame #n). The electronic device generates image data 421 by extracting at least one image from the video track 413, and generates audio data 422 from the audio track 415”].
identifying visual objects in the first video frames [(e.g. see Kim paragraphs 0046, 0074) ”The electronic device separates human/thing through face recognition, and separates the human into male/female/child/young or old, based on which at least one image object is extracted. For example, the electronic device first separates an image object A 511 and an image object B 512 as human and separates an image object C 513 as sea … The image analysis operation 115 includes an identifying image objects within a taken image, and setting an area of each image object. The image object designates one of a specific subject (e.g., a human or a thing) and a gesture within an image, and is specified as a closed-loop area within the image. For this, the image analysis operation 115 can adopt a technique such as character recognition or face recognition”].
identifying auditory events in the first audio segment [(e.g. see Kim paragraphs 0046, 0074) ”The audio analysis operation 125 includes an identifying and extracting audio of each object from recorded one audio data … The electronic device then analyzes audio data. The electronic device separates a human voice/thing sound with a unique feature of a waveform through audio waveform analysis by duration. As a result, among the entire audio `AAA.about.BB.about.CCCCC.about.`, `AAA.about. [high-pitched tone]` is classified as the audio object A 521, `BB.about. [low-pitched tone]` is classified as the audio object B 522, and `CCCCC.about. [wave sound]` is classified as the audio object C 523”].
identifying a match between a visual object of the visual objects and an auditory event of the auditory events  [(e.g. see Kim paragraph 0075) ”The electronic device maps features of the classified image objects 511, 512, and 513 with features of the classified audio objects 521, 522, and 523. According to this, the image object A [female] 511 and the audio object A [high-pitched tone] 521 are mapped with each other, the image object B [male] 512 and the audio object B [low-pitched tone] 522 are mapped with each other, and the image object C [sea] 513 and the audio object C [wave sound] 523 are mapped with each other”].
assigning a spatial location to the auditory event based on a location of the visual object [(e.g. see Kim paragraphs 0086, 0157) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user … play at least one audio data having a speaker face image corresponding to the human. As with the metadata of the image, the metadata of the audio includes location and time information”].

Kim does not specifically teach wherein the spatial location indicates coordinates of a point within a bounding box surrounding the visual object or generating an audio output by upmixing the auditory event and the spatial location into a spatial representation, wherein the audio output conveys the location to a listener of the audio output.  However, in the same field of invention, McCauley teaches:
wherein the spatial location indicates coordinates of a point within a bounding box surrounding the visual object [(e.g. see McCauley paragraphs 0039, 0043, 0066) ”The audio source is identified by a graphical indicator 104 on the video frame 102. The graphical indicator 104 may correspond to the location of the audio source. The graphical indicator 104 may be an icon. In the example shown, the icon 104 is a dashed circle around the audio source (e.g., the head of the person speaking). In some embodiments, the icon may be an ellipse, a rectangle, a cross and/or a user-selected image … The pointer 106 may be used to select the audio source and place the icon 104 (e.g., the user clicks or taps the location of the audio source with the pointer 106 to place the icon 104 for the audio source) … To graphically represent the distance of an audio source, the icon 104′ may be centered at the audio source location. For example, the icon 104′ may be a symbol and/or a shape (e.g., an ellipse, a rectangle, a cross, etc.). The user may set the distance parameter 120 (e.g., by clicking and dragging, with a slider, scrolling a mouse wheel, by entering the distance manually as a text field, etc.). The size of the icon 104′ may represent the distance parameter 120. The shape of the icon 104′ may represent the direction parameter 130. In an example, with a closer audio source the icon 104′ may be larger. In another example, with a farther audio source the icon 104′ may be smaller …  The user may use the interface 100 to place the audio source relative to the 360° video (e.g., the video portion 102). The audio stream (e.g., the audio file parameter 122) may be associated with the placed audio source. The 3D position of the audio source may be represented using the coordinate parameters 124. For example, the coordinate parameters may be represented by xyz (e.g., Cartesian) or rθφ (e.g., polar) values. 
The polar system for the coordinate parameters 124 may have an advantage of the direction and distance being distinctly separate (e.g., when modifying the distance, only the parameter r changes, while in Cartesian, any or all values of x, y and z may change). The polar system for the coordinate parameters 124 may be used in the equations for placing the audio sources in ambisonics (B-format) and/or VBAP”].
generating an audio output by upmixing the auditory event and the spatial location into a spatial representation, wherein the audio output conveys the location to a listener of the audio output [(e.g. see McCauley paragraphs 0031, 0058, 0059, 0099, 0102) ”the audio source may be translated (e.g., converted) to a stereo sound track, multichannel sound tracks and/or an immersive sound field … the audio processing may be used to enable the audio stream playback to a user while viewing the spherical video to approximate the audio that would be heard from the point of view of the capture device 52 … the computing device 80 may be configured to process (e.g., encode) the audio streams (e.g., the audio file parameter 122). For example, the audio stream may be adjusted based on the placement (e.g., the coordinates parameter 124, the distance parameter 120 and/or the direction parameter 130) of the icon 104 on the video file 102 to identify the audio source”].
Therefore, considering the teachings of Kim and McCauley, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to add the limitations wherein the spatial location indicates coordinates of a point within a bounding box surrounding the visual object, and generating an audio output by upmixing the auditory event and the spatial location into a spatial representation, wherein the audio output conveys the location to a listener of the audio output, as taught by McCauley, to the teachings of Kim because doing so improves the listening experience for the end user (e.g. see McCauley paragraph 0102).

As for dependent claim 2, Kim and McCauley teach the method as described in claim 1 and Kim further teaches:
identifying an unmatched auditory event, wherein the unmatched auditory event is not matched to an identified visual object in the first video frames [(e.g. see Kim paragraph 0084) ”a correlation between the first image and the first audio is low. If the second audio includes "BBB" but the character A moves at a second image-taking, a correlation between the second image and the second audio is low”].
presenting the unmatched auditory event in a user interface [(e.g. see Kim paragraph 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user”].

As for dependent claim 3, Kim and McCauley teach the method as described in claim 2 and Kim further teaches:
receiving, from a user, an assignment of the unmatched auditory event to another visual object of the visual objects identified in the first video frames [(e.g. see Kim paragraph 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user”].

As for dependent claim 5, Kim and McCauley teach the method as described in claim 2 and Kim further teaches:
receiving, from a user, an indication to assign an unmatched auditory event of the unmatched auditory events as directional sound and a spatial direction for the unmatched auditory event [(e.g. see Kim paragraphs 0086, 0157) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user … play at least one audio data having a speaker face image corresponding to the human. As with the metadata of the image, the metadata of the audio includes location and time information”].

As for dependent claim 6, Kim and McCauley teach the method as described in claim 1, but Kim does not specifically teach wherein the first video frames are frames of a spherical video.  However, McCauley teaches:
wherein the first video frames are frames of a spherical video [(e.g. see McCauley paragraph 0024) ”A video source may be … a spherical video”].
The motivation to combine is the same as that used for claim 1.

As for dependent claim 7, Kim and McCauley teach the method as described in claim 1 and Kim further teaches:
wherein the first audio segment is monophonic [(e.g. see Kim paragraphs 0046, 0081) ”The audio recording operation 120 includes an making peripheral sound into data by means of a recording means provided in the electronic device, such as a microphone … an installed microphone is a directional microphone”].  Examiner notes that a directional microphone inherently outputs a single track (i.e. mono).

As for dependent claim 8, Kim and McCauley teach the method as described in claim 1 and Kim further teaches:
wherein identifying the auditory events in the first audio segment comprises: using blind source separation to identify the auditory events in the first audio segment by decomposing the first audio segment into multiple tracks, each corresponding to a respective auditory event [(e.g. see Kim paragraph 0074 and Fig. 5B) ”The electronic device separates a human voice/thing sound with a unique feature of a waveform through audio waveform analysis by duration. As a result, among the entire audio `AAA.about.BB.about.CCCCC.about.`, `AAA.about. [high-pitched tone]` is classified as the audio object A 521, `BB.about. [low-pitched tone]` is classified as the audio object B 522, and `CCCCC.about. [wave sound]` is classified as the audio object C 523”].

As for dependent claim 9, Kim and McCauley teach the method as described in claim 1 and Kim further teaches:
wherein identifying the visual objects in the first video frames comprises: using image recognition to identify the visual objects in the first video frames [(e.g. see Kim paragraphs 0046, 0074) ”The electronic device separates human/thing through face recognition, and separates the human into male/female/child/young or old, based on which at least one image object is extracted. For example, the electronic device first separates an image object A 511 and an image object B 512 as human and separates an image object C 513 as sea … The image analysis operation 115 includes an identifying image objects within a taken image, and setting an area of each image object. The image object designates one of a specific subject (e.g., a human or a thing) and a gesture within an image, and is specified as a closed-loop area within the image. For this, the image analysis operation 115 can adopt a technique such as character recognition or face recognition”].

As for dependent claim 10, Kim and McCauley teach the method as described in claim 1, but Kim does not specifically teach wherein the audio output is a multi-channel file, an Ambisonics file, in a stereophonic format, or a Binaural stereo file.  However, McCauley teaches:
wherein the audio output is a multi-channel file, an Ambisonics file, in a stereophonic format, or a Binaural stereo file [(e.g. see McCauley paragraphs 0031, 0051) ”the audio source may be translated (e.g., converted) to a stereo sound track, multichannel sound tracks and/or an immersive sound field … Associating the audio streams with the audio sources may be technology-agnostic. In one example, the audio sources may be placed on the spherical view 102 in ambisonic-based audio systems with B-format equations”].
The motivation to combine is the same as that used for claim 1.

As for independent claim 13, Kim teaches a method comprising:
demultiplexing the video to obtain an audio track and video frames [(e.g. see Kim paragraph 0060) ”FIG. 4 illustrates an example of collecting an image and audio from a video in an electronic device according to an embodiment of the present invention. Referring to FIG. 4, a shot video 410 includes a video track 413 and audio track 415. The video track 413 includes a plurality of frames (frame #1, frame #2, frame #3, . . . frame #n). The electronic device generates image data 421 by extracting at least one image from the video track 413, and generates audio data 422 from the audio track 415”].
assigning respective visual labels to visual objects in the video frames [(e.g. see Kim paragraphs 0046, 0074) ”The electronic device separates human/thing through face recognition, and separates the human into male/female/child/young or old, based on which at least one image object is extracted. For example, the electronic device first separates an image object A 511 and an image object B 512 as human and separates an image object C 513 as sea … The image analysis operation 115 includes an identifying image objects within a taken image, and setting an area of each image object. The image object designates one of a specific subject (e.g., a human or a thing) and a gesture within an image, and is specified as a closed-loop area within the image. For this, the image analysis operation 115 can adopt a technique such as character recognition or face recognition”].
separating the audio track into multiple tracks [(e.g. see Kim paragraph 0074 and Fig. 5B) ”The electronic device separates a human voice/thing sound with a unique feature of a waveform through audio waveform analysis by duration. As a result, among the entire audio `AAA.about.BB.about.CCCCC.about.`, `AAA.about. [high-pitched tone]` is classified as the audio object A 521, `BB.about. [low-pitched tone]` is classified as the audio object B 522, and `CCCCC.about. [wave sound]` is classified as the audio object C 523”].
assigning respective audio labels to the multiple tracks [(e.g. see Kim paragraphs 0046, 0074) ”The audio analysis operation 125 includes an identifying and extracting audio of each object from recorded one audio data … The electronic device then analyzes audio data. The electronic device separates a human voice/thing sound with a unique feature of a waveform through audio waveform analysis by duration. As a result, among the entire audio `AAA.about.BB.about.CCCCC.about.`, `AAA.about. [high-pitched tone]` is classified as the audio object A 521, `BB.about. [low-pitched tone]` is classified as the audio object B 522, and `CCCCC.about. [wave sound]` is classified as the audio object C 523”].
automatically matching some of the respective audio labels to some of the visual labels [(e.g. see Kim paragraph 0075) ”The electronic device maps features of the classified image objects 511, 512, and 513 with features of the classified audio objects 521, 522, and 523. According to this, the image object A [female] 511 and the audio object A [high-pitched tone] 521 are mapped with each other, the image object B [male] 512 and the audio object B [low-pitched tone] 522 are mapped with each other, and the image object C [sea] 513 and the audio object C [wave sound] 523 are mapped with each other”].
assigning respective spatial locations to the some of the respective audio labels based on respective locations of the some of the visual objects [(e.g. see Kim paragraphs 0086, 0157) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user … play at least one audio data having a speaker face image corresponding to the human. As with the metadata of the image, the metadata of the audio includes location and time information”].

Kim does not specifically teach generating an audio output using the respective spatial location, wherein the audio output conveys the respective spatial locations to a listener of the audio output.  However, in the same field of invention, McCauley teaches:
generating an audio output using the respective spatial location, wherein the audio output conveys the respective spatial locations to a listener of the audio output [(e.g. see McCauley paragraphs 0031, 0039, 0043, 0058, 0059, 0066, 0099, 0102) ”The user may use the interface 100 to place the audio source relative to the 360° video (e.g., the video portion 102). The audio stream (e.g., the audio file parameter 122) may be associated with the placed audio source. The 3D position of the audio source may be represented using the coordinate parameters 124 … the audio source may be translated (e.g., converted) to a stereo sound track, multichannel sound tracks and/or an immersive sound field … the audio processing may be used to enable the audio stream playback to a user while viewing the spherical video to approximate the audio that would be heard from the point of view of the capture device 52 … the computing device 80 may be configured to process (e.g., encode) the audio streams (e.g., the audio file parameter 122). For example, the audio stream may be adjusted based on the placement (e.g., the coordinates parameter 124, the distance parameter 120 and/or the direction parameter 130) of the icon 104 on the video file 102 to identify the audio source”].
Therefore, considering the teachings of Kim and McCauley, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to add generating an audio output using the respective spatial location, wherein the audio output conveys the respective spatial locations to a listener of the audio output, as taught by McCauley, to the teachings of Kim because it improves the listening experience for the end user (e.g. see McCauley paragraph 0102).

As for dependent claim 14, Kim and McCauley teach the method as described in claim 13 and Kim further teaches:
identifying residual tracks corresponding to unmatched audio labels [(e.g. see Kim paragraph 0084) ”a correlation between the first image and the first audio is low. If the second audio includes "BBB" but the character A moves at a second image-taking, a correlation between the second image and the second audio is low”].
displaying, to a user, the residual tracks in a display [(e.g. see Kim paragraph 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other. The electronic device can display a list of selectable audio objects, identify audio object selected by a user, and map the audio object with an identified image object. Alternately, when an image object is selected by a user, the electronic device can display a list of mappable audio objects, and map audio object selected by the user to the image object. For instance, the electronic device determines a corresponding relationship between the image object and audio object that are selected by the user”].

Claims 4 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2014/0314391 A1) in view of McCauley et al. (US 2017/0293461 A1), as applied to claim 2 above, and further in view of Grosvenor et al. (US 2005/0281410 A1).

As for dependent claim 4, Kim and McCauley teach the method as described in claim 2, but do not specifically teach receiving, from a user, an indication to assign the unmatched auditory event as diffuse sound.  However, in the same field of invention, Grosvenor teaches:
receiving, from a user, an indication to assign the unmatched auditory event as diffuse sound [(e.g. see Grosvenor paragraphs 0077, 0251) ”Aesthetic constraints may also be provided to determine how background or diffuse sound sources are to be used in a given editing session … A further consideration that may be taken into account is that a sound source may be diffuse and therefore an improved solution would regard the sound source as occupying a region rather than being a point source”].
Therefore, considering the teachings of Kim, McCauley and Grosvenor, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to add receiving, from a user, an indication to assign the unmatched auditory event as diffuse sound, as taught by Grosvenor, to the teachings of Kim and McCauley because setting background sound sources would add to the atmosphere (e.g. see Grosvenor paragraph 0049).

As for dependent claim 15, Kim and McCauley teach the method as described in claim 14 and Kim further teaches:
a second assignment of the residual track to an arbitrary spatial location of the video frames [(e.g. see Kim paragraphs 0075, 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other … image object A [female] 511 and the audio object A [high-pitched tone] 521 are mapped with each other”].
a third assignment of the residual track as an ambient sound [(e.g. see Kim paragraphs 0075, 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other … the image object C [sea] 513 and the audio object C [wave sound] 523 are mapped with each other”].
a fourth assignment of the residual track to a visual object in the video frames [(e.g. see Kim paragraphs 0075, 0086) ”the electronic device can provide a UI capable of setting the corresponding relationship by means of a user's instruction. In other words, the electronic device can provide the UI capable of mapping the image object and the audio object with each other … the image object B [male] 512 and the audio object B [low-pitched tone] 522 are mapped with each other”].

Kim does not specifically teach receiving, from a user, at least one of: a first assignment of a residual track of the residual tracks to a diffuse sound field.  However, Grosvenor teaches:
receiving, from a user, at least one of: a first assignment of a residual track of the residual tracks to a diffuse sound field [(e.g. see Grosvenor paragraphs 0077, 0251) ”Aesthetic constraints may also be provided to determine how background or diffuse sound sources are to be used in a given editing session … A further consideration that may be taken into account is that a sound source may be diffuse and therefore an improved solution would regard the sound source as occupying a region rather than being a point source”].
The motivation to combine is the same as that used for claim 4.

As for independent claim 16, Kim, McCauley and Grosvenor teach a system.  Claim 16 discloses substantially the same limitations as claims 1 and 4.  Therefore, it is rejected with the same rationale as claims 1 and 4.

As for dependent claim 17, Kim, McCauley and Grosvenor teach the system as described in claim 16; further, claim 17 discloses substantially the same limitations as claim 1.  Therefore, it is rejected with the same rationale as claim 1.

As for dependent claim 18, Kim, McCauley and Grosvenor teach the system as described in claim 17, but Kim does not specifically teach wherein the spatial location corresponds to a center of a bounding polygon of the visual object.  However, McCauley teaches:
wherein the spatial location corresponds to a center of a bounding polygon of the visual object [(e.g. see McCauley paragraph 0066) ”To graphically represent the distance of an audio source, the icon 104′ may be centered at the audio source location. For example, the icon 104′ may be a symbol and/or a shape (e.g., an ellipse, a rectangle, a cross, etc.)”].
The motivation to combine is the same as that used for claim 1.

As for dependent claim 19, Kim, McCauley and Grosvenor teach the system as described in claim 17; further, claim 19 discloses substantially the same limitations as claim 10.  Therefore, it is rejected with the same rationale as claim 10.

As for dependent claim 20, Kim, McCauley and Grosvenor teach the system as described in claim 17 and Kim further teaches:
wherein the instructions further include instructions to: generate an audio file that includes the auditory event and diffuse sound information related to the auditory event [(e.g. see Kim paragraph 0166) ”The electronic device proceeds to step 1709 and encodes a combination data set, which includes image data, audio data, and mapping data. For example, the image data includes an image itself, image object designation information, a corrected image, and indirect information for accessing the image data, and the audio data includes the recorded entire audio, a processed audio, at least one audio object, audio characteristic information, and indirect information for accessing the audio data, and the mapping data includes object identification information and corresponding relationship information. The combination data set can be one of a first form in which image data is inserted into an audio file, a second form in which audio data is inserted into an image file, a 3rd form being a video file whose image data is constructed as a video track and audio data is constructed as an audio track, and a 4th form of adding separate mapping information data in which an image file, an audio file, and a mapping information database exist separately, respectively”].

Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 2014/0314391 A1) in view of McCauley et al. (US 2017/0293461 A1), as applied to claim 1 above, and further in view of Campbell et al. (US 2016/0005435 A1).

As for dependent claim 11, Kim and McCauley teach the method as described in claim 1, but do not specifically teach the following limitations.  However, in the same field of invention, Campbell teaches:  
receiving a second audio segment, wherein the second audio segment includes the auditory event [(e.g. see Campbell paragraph 0028) ”sound originating from a subject shown in the output video may be heard more prominently than sound originating from a subject outside the field of view”].
receiving second video frames, wherein the second video frames do not include the visual object [(e.g. see Campbell paragraph 0027) ”the output video thus reduces the captured spherical content to a standard field of view video having the content of interest while eliminating extraneous data outside the targeted field of view”].
determining a motion vector of the visual object based at least in part on at least a subset of the first video frames [(e.g. see Campbell paragraph 0039) ”the video server 240 can automatically identify sub-frames of interest based on the spherical video content itself or its associated audio track. For example, facial recognition, object recognition, motion tracking, or other content recognition or identification techniques may be applied to the spherical video to identify sub-frames of interest”].
assigning an ambient spatial location to the auditory event of the auditory events based on the motion vector [(e.g. see Campbell paragraphs 0027, 0028) ”audio channels corresponding to different directions are variably weighted over time to provide an output audio track that has a variable directionality that approximately follows the person's path … a subject outside the field of view. In one embodiment, audio from different directions is weighted in order to create a realistic audio experience. For example, audio from directions other than where the viewer is focused may be present in the recreated audio, but the various channels may be weighted such that the audio in the viewing direction is most prominent”].
Therefore, considering the teachings of Kim, McCauley and Campbell, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to add the above identified limitations, as taught by Campbell, to the teachings of Kim and McCauley because it allows the output audio to be directionally weighted based on the sub-frame of the video presented (e.g. see Campbell paragraph 0028).

As for dependent claim 12, Kim and McCauley teach the method as described in claim 1, but do not specifically teach the following limitations.  However, Campbell teaches:
receiving a second audio segment, wherein the second audio segment includes the auditory event of the auditory events [(e.g. see Campbell paragraph 0028) ”sound originating from a subject shown in the output video may be heard more prominently than sound originating from a subject outside the field of view”].
receiving second video frames, wherein the second video frames do not include the visual object [(e.g. see Campbell paragraph 0027) ”the output video thus reduces the captured spherical content to a standard field of view video having the content of interest while eliminating extraneous data outside the targeted field of view”].
assigning, based on a time difference between the first video frames and the second video frames, one of an ambient spatial location or a diffuse location to the auditory event [(e.g. see Campbell paragraphs 0027, 0028) ”audio channels corresponding to different directions are variably weighted over time to provide an output audio track that has a variable directionality that approximately follows the person's path … a subject outside the field of view. In one embodiment, audio from different directions is weighted in order to create a realistic audio experience. For example, audio from directions other than where the viewer is focused may be present in the recreated audio, but the various channels may be weighted such that the audio in the viewing direction is most prominent”].
The motivation to combine is the same as that used for claim 11.

Response to Arguments
Applicant's arguments, filed 08 August 2022, have been fully considered but they are not persuasive.

Applicant argues that [“The Office cannot show that Kim includes all of the features of claim 1 as amended” (Page 8)].

The argument described above, in paragraph number 10, with respect to the newly added limitations to the independent claims has been considered, but is moot in view of the new grounds of rejection.

Citation of Pertinent Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
U.S. PGPub 2017/0215005 A1 issued to Hsu et al. on 27 July 2017.  The subject matter disclosed therein is pertinent to that of claims 1-20 (e.g. selecting an object in an interface and modifying the sound source based on directional audio techniques).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTOPHER J FIBBI whose telephone number is (571)-270-3358. The examiner can normally be reached Monday - Thursday (8am-6pm).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sherief Badawi, can be reached at (571)-272-9782. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/CHRISTOPHER J FIBBI/
Primary Examiner, Art Unit 2174