DETAILED ACTION
Claims 1-73 and 78-81 are pending in the present application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant's claim for foreign priority under 35 U.S.C. 119(a)-(d).  The certified copy of European patent application number EP17196259.0 filed on 10/12/2017 has been received and made of record.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 07/01/2020, 10/15/2020, 04/05/2021, 06/25/2021, and 07/14/2021 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Election/Restrictions
Restriction is required under 35 U.S.C. 121.
Group I: Claims 1-73 and 78-81 are directed towards a system, a method, and a non-transitory digital storage medium having a computer program stored thereon to perform the method for a virtual reality, VR, augmented reality, AR, mixed reality, MR, 
wherein first audio elements in the first audio streams are more relevant and/or more audible than second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams OR wherein the system is configured to control the request of the at least one audio stream to the server on the basis of a distance of the user's position from the boundaries of neighboring and/or adjacent video environments associated to different audio scenes.
Group II: Claims 74-77 are directed towards a server for delivering audio and video streams to a client for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment, the video and audio streams to be reproduced in a media consumption device, wherein the server comprises an encoder to encode and/or a storage to store video streams to describe a video environment, the video environment being associated to an audio scene; wherein the server further comprises 
wherein the request is based on a distance of the user's position from the boundaries of neighboring and/or adjacent video environments associated to different audio scenes OR wherein first audio elements in the first audio streams are more relevant and/or more audible than second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams.
In accordance with 37 CFR 1.499, applicant is required, in reply to this action, to elect a single invention to which the claims must be restricted. During a telephone conversation with Jae Youn Kim on 07/26/2021, a provisional election was made without traverse to prosecute the invention of Group I, claims 1-73 and 78-81.  Affirmation of this election must be made by applicant in replying to this Office action.  Claims 74-77 are withdrawn from further consideration by the examiner, 37 CFR 1.142(b), as being drawn to a non-elected invention.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-81 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Regarding claims 1-81, the phrase "and/or" renders the claims indefinite because it is unclear which combination of the listed alternatives defines the boundary of the claimed subject matter. The examiner suggests selecting either “and” or “or” to provide a clear definition.
Regarding claims 62-63, the phrase "e.g." renders the claim indefinite because it is unclear whether the limitation(s) following the phrase are part of the claimed invention.  See MPEP § 2173.05(d).
Claims 62-63 recite the limitation "at least the adaptation sets" in line 1.  There is insufficient antecedent basis for this limitation in the claim.

Claim Objections
Claims 62-63 and 68-69 are objected to because of the following informalities:  
Regarding claim 62:
Change:
“The system of claim 1, configured to create or use at least the adaptation sets so that: A number of Adaptation Sets are associated with one Audio Scene; and/or Additional information is provided that relates each Adaptation Set to one Viewpoint, or one Audio Scene; and/or Additional information is provided that include - Information about the boundaries of one Audio Scene and/or - Information about the relation between one Adaptation Set and one Audio Scene (e.g., Audio Scene is encoded in three streams that are encapsulated into three Adaptation Sets) and/or - Information about the connection between the boundaries of an audio scene and the multiple Adaptation Sets”
To 
“The system of claim 1, configured to create or use at least the adaptation sets so that: a number of adaptation sets are associated with one audio scene; and/or additional information is provided that relates each adaptation set to one viewpoint, or one audio scene; and/or additional information is provided that include - Information about the boundaries of one audio scene and/or - Information about the relation between one adaptation set and one audio scene (e.g., audio scene is encoded in three streams that are encapsulated into three adaptation sets) and/or - Information about the connection between the boundaries of an audio scene and the multiple adaptation sets.”

Regarding claim 63:
Change:
“The system of claim 2, configured to create or use at least the adaptation sets so that: A number of Adaptation Sets are associated with one Audio Scene; and/or Additional information is provided that relates each Adaptation Set to one Viewpoint, or one Audio Scene; and/or Additional information is provided that include - Information about the boundaries of one Audio Scene and/or - Information about the relation between one Adaptation Set and one Audio Scene (e.g., Audio Scene is encoded in three streams that are encapsulated into three Adaptation Sets) and/or - Information about the connection between the boundaries of an audio scene and the multiple Adaptation Sets”
To 
“The system of claim 2, configured to create or use at least the adaptation sets so that: a number of adaptation sets are associated with one audio scene; and/or additional information is provided that relates each adaptation set to one viewpoint, or one audio scene; and/or additional information is provided that include - Information about the boundaries of one audio scene and/or - Information about the relation between one adaptation set and one audio scene (e.g., audio scene is encoded in three streams that are encapsulated into three adaptation sets) and/or - Information about the connection between the boundaries of an audio scene and the multiple adaptation sets.”
Regarding claim 68, “Viewpoint” should be “viewpoint”.
Regarding claim 69, “Viewpoint” should be “viewpoint”.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5, 7, 9, 11, 13, 15, 17, 28, 30, 32, 34, 35, 37, 39, 45, 52, 56, 58, 62, 68, 72, 79, 81 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPub 2019/0005986 to Peters et al. in view of U.S. PGPub 2002/0103554 to Coles et al.

Regarding claim 1, Peters et al. teach a system for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment configured to receive video and audio streams to be reproduced in a media consumption device (par 0006, par 0025-0026, par 0070, “This disclosure relates generally to using auditory aspects of the user experience of computer-mediated reality systems, including virtual reality (VR), mixed reality (MR), augmented reality (AR), computer vision, and graphics systems, to enhance the visual aspects of the user experience. In some particular examples, aspects of this disclosure are directed to using the directionality of audio data to predict particular portions of corresponding video data that is to be output at a greater resolution (or “upsampled”) to enhance the user …”), wherein the system comprises: 

[Image: media_image1.png]

at least one media video decoder configured to decode video signals from video streams for the representation of VR, AR, MR or 360-degree video environments to a user (Figs 7 and 10A-10B, par 0055-0057, par 0080-0085, par 0089-0090, “The playback device may perform the prediction based on various criteria, including the direction of the predominant audio components in the soundfield representation being played back for the computer-mediated reality (e.g., VR) experience, and the current FoV …. the playback device may select high-resolution video data of the predicted viewport(s) from the locally-stored video data, prior to the FoV actually changing to any of the predicted viewport(s). In this way, the playback device of FIG. 10B may predictively select high-resolution video data for future FoV viewports and prepare the selected high-resolution video data for output via the corresponding viewport(s), while reducing or potentially eliminating the lag time experienced by a user when changing the FoV to a different viewport or a group of different viewport(s)” …disclose a video decoder based on the viewpoint), and 
at least one audio decoder configured to decode audio signals from audio streams for the representation of audio scenes (Figs 2, 7 and 10A-10B, par 0048-0051, par 0055-0057, par 0087, “The audio decoding device 24 may represent a device configured to decode ambisonic coefficients 15 from the bitstream 21. As such, the ambisonic coefficients 15 may be similar to a full set or a partial subset of the HOA coefficients 11, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel. The audio playback system 16 may, after decoding the bitstream 21 to obtain the Ambisonic coefficients 15 and render the Ambisonic coefficients 15 to output loudspeaker feeds 25” ..disclose an audio decoder), 
wherein the system is configured to request first audio streams and second audio streams and/or one audio element of an audio stream and/or one adaptation set to a server on the basis of at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data (par 0007-0008, par 0031-0035, par 0055-0057, par 0081, par 0085, par 0087-0090, par 0093-0094, “the processor is configured to identify one or more foreground audio objects of the soundfield using the audio spatial metadata stored to the memory device, and to select, based on the identified one or more foreground audio objects, one or more viewports associated with the video data”, “The audio decoding device 24 may use metadata (e.g., audio spatial metadata) available from a 3D HOA soundfield representation to determine predominant sounds of the 3D HOA soundfield. For instance, the audio decoding device 24 may use a V-vector main direction of the audio objects. In a streaming scenario, the audio decoding device 24 may use audio spatial metadata received in the bitstream(s) 21. In a local-storage scenario, the audio decoding device 24 may use audio spatial metadata from a 3D HOA soundfield representation stored locally at the content consumer device 14. The audio decoding device 24 may also use metadata associated with the audio objects (e.g., direction, distance, and priority information) to determine the energy-predominant sounds of the soundfield. In turn, the audio decoding device 24 may calculate a time-averaged (and optionally, metadata-weighted) histogram of the predominant sound directions of the 3D soundfield”, “the streaming client of FIG. 10A may determine the locations of the foreground audio objects of the soundfield representation received via the audio stream. For instance, the streaming client may use the energy of the various objects of the soundfield to determine which objects qualify as foreground audio objects. 
In turn, the streaming client may map the positions of the foreground audio objects in the soundfield to corresponding positions in the corresponding video data, such as to one or more of the tiles 116 illustrated in FIG. 3B. If the streaming client determines that foreground audio objects that are to be played back imminently map to positions in viewports (or tiles) that is different from the current FoV viewport (or tile), then the streaming client may predict that the user's FoV will change, to track the viewport positions that map to the position(s) of the soon-to-be-rendered foreground audio objects” ..obtain audio streams based on the position of the foreground audio objects (more than one)).
However, Peters et al. are silent as to wherein first audio elements in the first audio streams are more relevant and/or more audible than second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams.

[Image: media_image2.png]

In related endeavor, Coles et al. teach wherein first audio elements in the first audio streams are more relevant and/or more audible than second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams (abstract, par 0010-0011, par 0032, par 0038-0039, par 0043-0048, par 0052-0055, “The voice browser 28 also acts to control the bit-rate at which the first, second, fourth and fifth codecs 19, 20, 24 and 25 code and decode the audio components. The focus component (the second component) is coded and decoded at the highest bit-rate, whilst the non-focus component (the first component) is coded and decoded and the lowest bit-rate. This is done on the basis that the focus component will be the component which the user is most interested in hearing. Accordingly, in this embodiment at least, the focus component is positioned straight-ahead of the user and is coded and decoded 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. to include wherein first audio elements in the first audio streams are more relevant and/or more audible than second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams, as taught by Coles et al., in order to select an audio component as a focus component by using the user control device and to transmit it at a higher bit-rate than the non-focus components, so as to maintain the required bandwidth of the data link at a suitable level.
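
For illustration only, the bit-rate control described in the quoted passage of Coles et al. (focus component coded at the highest bit-rate, non-focus component at the lowest) can be sketched in Python as follows. The function name assign_bitrates, the component labels, and the numeric bit-rates are hypothetical and appear in neither reference; this is a minimal sketch under those assumptions, not the implementation of either reference.

```python
from typing import Dict, List


def assign_bitrates(components: List[str], focus: str,
                    high_kbps: int = 128, low_kbps: int = 32) -> Dict[str, int]:
    """Map each audio component to a bit-rate.

    Assumption: only two rate levels exist. The focus component gets the
    highest rate, every other component gets the lowest rate. The default
    values 128 and 32 kbit/s are placeholders chosen for this sketch.
    """
    if focus not in components:
        raise ValueError("focus component must be one of the components")
    return {name: (high_kbps if name == focus else low_kbps)
            for name in components}


if __name__ == "__main__":
    # The "second" component is selected as the focus component.
    print(assign_bitrates(["first", "second"], focus="second"))
```

Because only the focus component is carried at the high rate, the total bandwidth grows by the low rate for each additional non-focus component, which is consistent with the stated motivation of keeping the data-link bandwidth at a suitable level.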

Regarding claim 3, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach configured to provide the server with the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data so as to acquire the at least one audio stream and/or one audio element of an audio stream and/or one adaptation set from the server (Fig 7, par 0007-0008, par 0031-0035, par 0055-0057, par 0081, par 0085, par 0087-0090, par 0093-0094, obtain audio objects from streaming server based on user’s viewport or FOV or metadata).

Regarding claim 5, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach wherein at least one audio scene is 

Regarding claim 7, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach configured to decide whether at least one audio element of an audio stream and/or one adaptation set is to be reproduced for the current user's viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual position in an audio scene, and wherein the system is configured to request and/or to receive the at least one audio element at the current user's virtual position (Fig 10A, par 0007-0008, par 0031-0035, par 0055-0057, par 0080-0081, par 0085, par 0087-0090, par 0093-0094, “The processor is configured to identify one or more foreground audio objects of the soundfield using the audio spatial metadata stored to the memory device, and to select, based on the identified one or more foreground audio objects, one or more viewports associated with the video data”, “If the streaming client determines that foreground audio objects that are to be played 

Regarding claim 9, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach configured to predictively decide whether at least one audio element of an audio stream and/or one adaptation set will become relevant and/or audible based on at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data, wherein the system is configured to request and/or to receive the at least one audio element and/or audio stream and/or adaptation set at a particular user's virtual position before the predicted user's movement and/or interaction in an audio scene, and wherein the system is configured to reproduce the at least one audio element and/or audio stream, when received, at the particular user's virtual position after the user's movement and/or interaction in an audio scene (Fig 10A-10B and 11, par 0007-0008, par 0031-0035, par 0073-0076, par 0080-0087, par 0090-0093, “to audio spatial metadata of an HOA representation of a soundfield, the predictive viewport selection techniques of this disclosure can be performed based on other representations of a soundfield, as well. For example, a VR client can implement the techniques of this disclosure to predictively select one or more possibly-next viewports by using object metadata of an object-based representation of the soundfield. As such, the VR systems of this disclosure may perform predictive viewport selection for the 14 device may base the decision on a parsing of the directional audio parameters either from the compressed HOA bitstream (using the direction of a predominant sound and/or a direction of an HOA V-vector), or object-related metadata (e.g., direction, distance, object priority) of the soundfield representation. 
The content consumer device 14 may predict the likely-next viewport(s) using the information listed above, as being the data that the content consumer device 14 may need to obtain, for playback in the near future. In some examples, rather than base the decision on directional audio parameters available from the soundfield representation, the content consumer device 14 may compute and use a spatial energy distribution of the decoded spatial audio content for the viewport prediction”).

Regarding claim 11, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Coles et al. further teach configured to request and/or to receive the at least one audio element at a lower bitrate, at the user's virtual position before a user's interaction, the interaction resulting from either change in positional data in the same audio scene or entering a next scene separated from the current scene, wherein the system is configured to request and/or to receive the at least one audio element at a higher bitrate, at the user's virtual position after the user's interaction in an audio scene (abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, introduce an interactive audio system that provides a different bit-rate for different audio objects based on the user’s position or focus).

Regarding claim 13, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Coles et al. further teach wherein at least one audio element associated to at least one audio scene is associated to a position and/or area in the video environment associated to an audio scene, wherein the system is configured to request and/or receive streams at higher bitrate for audio elements closer to the user than for audio elements more distant from the user (abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, introduce an interactive audio system that provides a different bit-rate for different audio objects based on the user’s position or focus).

Regarding claim 15, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and further teach wherein at least one audio element is associated to at least one audio scene, the at last one audio element being associated to a position and/or area in the video environment associated to an audio scene (Peters et al.: abstract, par 0007-0008, par 0031-0035, par 0085, par 0087-0090, par 0093-0094,  Coles et al.: par 0012, generate sound field based on the sound objects), wherein the system is configured to request different streams at different bitrates for audio elements based on their relevance and/or auditability level at each user's virtual position in an audio scene, wherein the system is configured to request an audio stream at higher bitrate for audio elements which are more relevant and/or more audible at the current user's virtual position, and/or an audio stream at lower bitrate for audio elements which are less relevant and/or less audible at the current user's virtual position (Coles et al.: abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, 

Regarding claim 17, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and further teach wherein at least one audio element is associated to an audio scene, each audio element being associated to a position and/or area in the video environment associated to an audio scene, wherein the system is configured to periodically send to the server the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data (Peters et al.: par 0007-0008, par 0031-0035, par 0085, par 0087-0090, par 0093-0095, Coles et al.: par 0012, generate sound field based on the sound objects and request audio and video data based on updated field of view or viewport), so that: for a first position, a stream at higher bitrate is provided, from the server, and for a second position, a stream at lower bitrate is provided, from the server, wherein the first position is closer to the at least one audio element than the second position (Coles et al.: abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, introduce an interactive audio system that provides a different bit-rate for different audio objects based on the user’s position or focus).

Regarding claim 28, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Coles et al. further teach wherein a plurality of audio scenes is defined for multiple video environments, so that the system requests and/or acquires the audio streams associated to a current audio scene at a higher bitrate and the audio 

Regarding claim 30, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach wherein a plurality of N audio elements are defined, and, in case the user's distance to the position or area of these audio elements is larger than a predetermined threshold, the N audio elements are processed to acquire a smaller number M of audio elements associated to a position or area close to the position or area of the N audio elements, so as to provide the system with at least one audio stream associated to the N audio elements, in case the user's distance to the position or area of the N audio elements is smaller than a predetermined threshold, or to provide the system with at least one audio stream associated to the M audio elements, in case the user's distance to the position or area of the N audio elements is larger than a predetermined threshold (par 0088-0089, “The content consumer device 14 may implement threshold detection, to determine a number of sounds that are positioned outside of the current FoV viewport. If the number of detected sounds crosses the threshold value, then the content consumer device 14 may determine the ‘N’ most likely-next viewports. ‘N’ represents an integer value. In turn, if the number of sounds that fall outside the current viewport crosses the threshold value, then the content consumer device 14 may obtain upsampled video data for all of the ‘N’ most likely-next viewports. In a streaming scenario, the content consumer 14 may request the upsampled video data for all of the ‘N’ most likely-next viewports from the source device 12. In a local-storage scenario, the content consumer device 14 may retrieve the upsampled video data for all of the ‘N’ most likely-next viewports from local storage of the content consumer device 14”).
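
For illustration only, the threshold detection described in the quoted passage of Peters et al. can be sketched in Python as follows. The function name likely_next_viewports and the integer viewport indices are hypothetical. The quoted passage does not state how the ‘N’ most likely-next viewports are ranked or whether "crosses" means strictly greater than; the sketch assumes ranking by the number of sounds mapped to each viewport and a strict comparison.

```python
from collections import Counter
from typing import List, Sequence


def likely_next_viewports(sound_viewports: Sequence[int], current: int,
                          threshold: int, n: int) -> List[int]:
    """Return up to n candidate next viewports, or [] below the threshold.

    sound_viewports holds, for each detected sound, the index of the
    viewport its position maps to (hypothetical representation).
    """
    # Sounds positioned outside of the current FoV viewport.
    outside = [v for v in sound_viewports if v != current]
    # Assumption: "crosses the threshold" is read as strictly greater than.
    if len(outside) <= threshold:
        return []
    # Assumption: viewports containing more sounds are more likely next.
    counts = Counter(outside)
    return [viewport for viewport, _ in counts.most_common(n)]


if __name__ == "__main__":
    # Six of the seven sounds fall outside viewport 0.
    print(likely_next_viewports([0, 2, 2, 3, 2, 3, 1], current=0,
                                threshold=3, n=2))
```

In a streaming scenario the returned indices would identify the viewports for which upsampled video data is requested from the source device; in a local-storage scenario they would identify the data retrieved from local storage.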

Regarding claim 32, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and further teach wherein a plurality of N audio elements are defined, and, in case the user's distance to the position or area of these audio elements is larger than a predetermined threshold, the N audio elements are processed to acquire a smaller number M of audio elements associated to a position or area close to the position or area of the N audio elements, so as to provide the system with at least one audio stream associated to the N audio elements, in case the user's distance to the position or area of the N audio elements is smaller than a predetermined threshold, or to provide the system with at least one audio stream associated to the M audio elements, in case the user's distance to the position or area of the N audio elements is larger than a predetermined threshold (Peters et al.: par 0088-0089, “The content consumer device 14 may implement threshold detection, to determine a number of sounds that are positioned outside of the current FoV viewport”, Coles et al.: Fig. 4, abstract, par 0010-0011, par 0032, par 0038-0039, par 0043-0048, par 0052-0055, “The voice browser 28 also acts to control the bit-rate at which the first, second, fourth and fifth codecs 19, 20, 24 and 25 code and decode the audio components. The focus component (the second component) is coded and decoded at the highest bit-rate, whilst the non-focus component (the first component) is coded and decoded and the lowest 

Regarding claim 34, Peters et al. as modified by Coles et al. teach all the limitations of claim 17, and Peters et al. further teach wherein, in case the user's distance is lower than a predetermined distance threshold, or the relevance is lower than a predetermined relevance threshold, or the audibility level is lower than a predetermined distance threshold, than a predetermined threshold, different audio streams are acquired for the different audio elements (par 0088-0089, “The content consumer device 14 may implement threshold detection, to determine a number of sounds that are positioned outside of the current FoV viewport. If the number of detected sounds crosses the threshold value, then the content consumer device 14 may determine the ‘N’ most likely-next viewports. ‘N’ represents an integer value. In turn, if the number of sounds that fall outside the current viewport crosses the threshold value, then the content consumer device 14 may obtain upsampled video data for all of the ‘N’ most likely-next viewports. In a streaming scenario, the content consumer device 14 may request the upsampled video data for all of the ‘N’ most likely-next viewports from the source device 12. In a local-storage scenario, the content consumer device 14 may retrieve the upsampled video data for all of the ‘N’ most likely-next viewports from local storage of the content consumer device 14”).

Regarding claim 35, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and further teach configured to request and/or acquire the audio streams on the basis of the user's orientation and/or user's direction of movement and/or user's interactions in an audio scene (Peters et al.: par 0007-0008, par 0031-0035, par 0085, par 0087-0090, par 0093-0095, Coles et al.: par 0012, generate sound field based on the sound objects and request audio and video data based on updated field of view or viewport, Coles et al.: abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, introduce an interactive audio system that provides a different bit-rate for different audio objects based on the user’s position or focus).

Regarding claim 37, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach wherein the viewport is associated to the position and/or virtual position and/or movement data and/or head orientation (par 0026-0027, par 0090, viewport is selected based on the head position or FOV).

Regarding claim 39, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and further teach wherein different audio elements are provided at different viewports, wherein the system is configured to request and/or receive, in case one first audio element falls within a viewport, the first audio element at a higher bitrate than a second audio element which does not fall within the viewport (Peters et al.: Fig 7, 

Regarding claim 45, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach wherein the system comprises a metadata processor configured to manipulate metadata in at least one audio stream prior to the at least one audio decoder based on user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data (par 0007-0008, par 0031-0035, par 0055-0057, par 0081, par 0085, par 0087-0090, par 0093-0094, “the processor is configured to identify one or more foreground audio objects of the soundfield using the audio spatial metadata stored to the memory device, and to select, based on the identified one or more foreground audio objects, one or more viewports associated with the video data”, “The audio decoding device 24 may use metadata (e.g., audio spatial metadata) available from a 3D HOA soundfield representation to determine predominant sounds of the 3D HOA soundfield. For instance, the audio decoding device 24 may use a V-vector main direction of the audio objects. In a streaming scenario, the audio decoding device 24 may use audio spatial metadata received in the bitstream(s) 21. In a local-storage scenario, the audio decoding device 24 may use audio spatial metadata from a 3D HOA soundfield representation stored locally at the content consumer device 14. The audio decoding device 24 may also use metadata associated with the audio objects (e.g., direction, distance, and priority information) to determine the energy-predominant sounds of the soundfield. In turn, the audio decoding device 24 may calculate a time-averaged (and optionally, metadata-weighted) histogram of the predominant sound directions of the 3D soundfield”).

Regarding claim 52, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach that the system is configured to acquire and/or collect statistical or aggregated data on the user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data, so as to transmit the request to the server associated to the statistical or aggregated data (Fig 10, par 0048, par 0080: video and audio streams are requested from the server based on viewport data related to the user’s FOV, metadata, or head position).

Regarding claim 56, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach that the system is further configured to: manipulate metadata associated with a group of selected audio streams based on at least the user's current or estimated viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data, so as to: select and/or activate audio elements composing an audio scene to be reproduced; and/or merge all selected audio streams into a single audio stream (Fig 10, par 0048, par 0080, par 0093-0095, “Process 220 may begin when the playback device identifies, using a processor coupled to a memory, one or more foreground audio objects of a soundfield, using audio spatial metadata stored to the memory (222). In turn, the processor of the playback device may select, based on the identified one or more foreground audio objects, one or more viewports associated with video data stored to the memory device (224) … the processor of the playback device may determine a number of viewports associated with the identified one or more viewports based on the identified one or more foreground audio objects. In some examples, the processor of the playback device may upsample a portion of the stored video data that is associated with the identified one or more foreground audio objects … one or more loudspeakers (e.g., speaker hardware of the headset 200) may output an audio data format representative of the soundfield”; i.e., video and audio streams relating to audio objects are selected from the server based on viewport data related to the user’s FOV, metadata, or head position).

Regarding claim 58, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach wherein information is provided from the server, for each audio element or audio object, wherein the information includes descriptive information about the locations in which an audio scene or the audio elements are active (par 0080, par 0093-0095, “Process 220 may begin when the playback device identifies, using a processor coupled to a memory, one or more foreground audio objects of a soundfield, using audio spatial metadata stored to the memory (222). In turn, the processor of the playback device may select, based on the identified one or more foreground audio objects, one or more viewports associated with video data stored to the memory device (224)”).

Regarding claim 62, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach that the system is configured to create or use at least the adaptation sets so that: A number of Adaptation Sets are associated with one Audio Scene; and/or Additional information is provided that relates each Adaptation Set to one Viewpoint, or one Audio Scene; and/or Additional information is provided that include - Information about the boundaries of one Audio Scene and/or - Information about the relation between one Adaptation Set and one Audio Scene (e.g., Audio Scene is encoded in three streams that are encapsulated into three Adaptation Sets) and/or - Information about the connection between the boundaries of an audio scene and the multiple Adaptation Sets (par 0080, par 0093-0095, “Process 220 may begin when the playback device identifies, using a processor coupled to a memory, one or more foreground audio objects of a soundfield, using audio spatial metadata stored to the memory (222). In turn, the processor of the playback device may select, based on the identified one or more foreground audio objects, one or more viewports associated with video data stored to the memory device (224)”).

Regarding claim 68, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach that the system is configured to receive information about user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data and/or any information characterizing changes triggered by the user's actions; and receive information about the availability of adaptation sets and information describing an association of at least one adaptation set 24 may use metadata (e.g., audio spatial metadata) available from a 3D HOA soundfield representation to determine predominant sounds of the 3D HOA soundfield. For instance, the audio decoding device 24 may use a V-vector main direction of the audio objects. In a streaming scenario, the audio decoding device 24 may use audio spatial metadata received in the bitstream(s) 21. In a local-storage scenario, the audio decoding device 24 may use audio spatial metadata from a 3D HOA soundfield representation stored locally at the content consumer device 14. The audio decoding device 24 may also use metadata associated with the audio objects (e.g., direction, distance, and priority information) to determine the energy-predominant sounds of the soundfield. In turn, the audio decoding device 24 may calculate a time-averaged (and optionally, metadata-weighted) histogram of the predominant sound directions of the 3D soundfield”).

Regarding claim 72, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, and Peters et al. further teach that the system is configured to manipulate audio metadata associated with selected audio streams, based on at least the user's current 220 may begin when the playback device identifies, using a processor coupled to a memory, one or more foreground audio objects of a soundfield, using audio spatial metadata stored to the memory (222). In turn, the processor of the playback device may select, based on the identified one or more foreground audio objects, one or more viewports associated with video data stored to the memory device (224) … the processor of the playback device may determine a number of viewports associated with the identified one or more viewports based on the identified one or more foreground audio objects. In some examples, the processor of the playback device may upsample a portion of the stored video data that is associated with the identified one or more foreground audio objects … one or more loudspeakers (e.g., speaker hardware of the headset 200) may output an audio data format representative of the soundfield”; i.e., video and audio streams relating to audio objects are selected from the server based on viewport data related to the user’s FOV, metadata, or head position).

Regarding claim 79, method claim 79 is similar in scope to claim 1 and is rejected under the same rationale.

Regarding claim 81, Peters et al. teach a non-transitory digital storage medium having a computer program stored thereon to perform the method for a virtual reality, .

Claims 19, 21, 23, 26, 50, 60, 64, 66, and 70 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPub 2019/0005986 to Peters et al. in view of U.S. PGPub 2002/0103554 to Coles et al., and further in view of U.S. PGPub 2010/0040238 to Jang et al.

Regarding claim 19, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to wherein a plurality of audio scenes are defined for multiple video environments such as adjacent and/or neighboring video environments, so that first streams are provided associated to a first, current audio scene and, in case of user's transition to a second, further audio scene, to provide both the audio streams associated to the first audio scene and the second streams associated to the second audio scene.
In a related endeavor, Jang et al. teach wherein a plurality of audio scenes are defined for multiple video environments such as adjacent and/or neighboring video environments, so that first streams are provided associated to a first, current audio scene and, in case of user's transition to a second, further audio scene, to provide both the audio streams associated to the first audio scene and the second streams associated to the second audio scene (Figs 6-7, par 0086-0090, par 0093-0098, “basket ball court sound 605 including sound of people 601a to 601d and sound of a speaker 605 in the basket ball court 600 is first provided as shown in (a) of FIG. 6 and should be instantly converted to conference room sound 610 including sound of people 611a to 611f and sound of a television 613 in the conference room 610 as soon as the virtual space shift from the first space to the second space is detected”, “FIG. 8C illustrates a fading-out of the first space sensory output 831 and the fading in of the second space sensory output 833 that begins at a faster rate and tapers off to a slower rate. Such a process may be used to depict a slow transition from the first space to the second space. Alternatively, such a process may be used when a large number of sensory output producing objects or characters are detected in the non-focus areas relative to 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the feature wherein a plurality of audio scenes are defined for multiple video environments such as adjacent and/or neighboring video environments, so that first streams are provided associated to a first, current audio scene and, in case of user's transition to a second, further audio scene, to provide both the audio streams associated to the first audio scene and the second streams associated to the second audio scene, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a predetermined visual field and a non-focus area outside the predetermined visual field in a virtual reality space for sound sources, thereby providing a virtual reality that can give a higher sense of realism.

Regarding claim 21, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to wherein a plurality of audio scenes are defined for a first and a second video environments, the first and second video environments being adjacent and/or neighboring video environments, wherein first streams associated to the first audio scene are provided, from the server, for the reproduction of the first audio scene in case of the user's position or virtual position being in a first video environment associated to the first audio scene, second streams associated to the second audio scene are provided, from the server, for the reproduction of the second audio scene in case of the user's position or virtual position being in a second video environment associated to the second audio scene, and [Attorney Docket No.: PJK5283832] both first streams associated to the first audio scene and second streams associated to the second audio scene are provided in case of the user's position or virtual position being in a transitional position between the first audio scene and the second audio scene.
In FIG. 1B, the control unit 104 controls the general operation of the virtual reality server 100 and provides sound sources as well as image sources occurring in the virtual reality environments to the virtual reality apparatus 110”, “basket ball court sound 605 including sound of people 601a to 601d and sound of a speaker 605 in the basket ball court 600 is first provided as shown in (a) of FIG. 6 and should be instantly converted to conference room sound 610 including sound of people 611a to 611f and sound of a television 613 in the conference room 610 as soon as the virtual space shift from the first space to the second space is detected”, “FIG. 8C illustrates a fading-out of the first space sensory output 831 and the fading in of the second space sensory output 833 that begins at a faster rate and tapers off to a slower rate. Such a process may be used to depict a slow transition from the first space to the second space. Alternatively, such a 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the feature wherein a plurality of audio scenes are defined for a first and a second video environments, the first and second video environments being adjacent and/or neighboring video environments, wherein first streams associated to the first audio scene are provided, from the server, for the reproduction of the first audio scene in case of the user's position or virtual position being in a first video environment associated to the first audio scene, second streams associated to the second audio scene are provided, from the server, for the reproduction of the second audio scene in case of the user's position or virtual position being in a second video environment associated to the second audio scene, and both first streams associated to the first audio scene and second streams associated to the second audio scene are provided in case of the user's position or virtual position being in a transitional position between the first audio scene and the second audio scene, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a predetermined visual field and a non-focus area out of 

Regarding claim 23, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to wherein a plurality of audio scenes are defined for a first and a second video environments, which are adjacent and/or neighboring environments, wherein the system is configured to request and/or receive first streams associated to a first audio scene associated to the first environment, for the reproduction of the first audio scene in case of the user's virtual position being in the first environment, wherein the system is configured to request and/or receive second streams associated to the second audio scene associated to the second environment, for the reproduction of the second audio scene in case of the user's virtual position being in the second environment, and wherein the system is configured to request and/or receive both first streams associated to the first audio scene and second streams 
In a related endeavor, Jang et al. teach wherein a plurality of audio scenes are defined for a first and a second video environments, which are adjacent and/or neighboring environments, wherein the system is configured to request and/or receive first streams associated to a first audio scene associated to the first environment, for the reproduction of the first audio scene in case of the user's virtual position being in the first environment, wherein the system is configured to request and/or receive second streams associated to the second audio scene associated to the second environment, for the reproduction of the second audio scene in case of the user's virtual position being in the second environment, and wherein the system is configured to request and/or receive both first streams associated to the first audio scene and second streams associated to the second audio scene in case of the user's virtual position being in a transitional position between the first environment and the second environment (Figs 6-7 and 8A, par 0086-0090, par 0093-0098, “basket ball court sound 605 including sound of people 601a to 601d and sound of a speaker 605 in the basket ball court 600 is first provided as shown in (a) of FIG. 6 and should be instantly converted to conference room sound 610 including sound of people 611a to 611f and sound of a television 613 in the conference room 610 as soon as the virtual space shift from the first space to the second space is detected”, “FIG. 8C illustrates a fading-out of the first space sensory output 831 and the fading in of the second space sensory output 833 that begins at a faster rate and tapers off to a slower rate. Such a process may be used to depict a slow transition from the first space to the second space. Alternatively, such a process may be 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the feature wherein a plurality of audio scenes are defined for a first and a second video environments, which are adjacent and/or neighboring environments, wherein the system is configured to request and/or receive first streams associated to a first audio scene associated to the first environment, for the reproduction of the first audio scene in case of the user's virtual position being in the first environment, wherein the system is configured to request and/or receive second streams associated to the second audio scene associated to the second environment, for the reproduction of the second audio scene in case of the user's virtual position being in the second environment, and wherein the system is configured to request and/or receive both first streams associated to the first audio scene and second streams associated to the second audio scene in case of the user's virtual position being in a transitional position between the first environment and the second environment, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a 

Regarding claim 26, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to wherein a plurality of audio scenes is defined for multiple environments such as adjacent and/or neighboring environments, so that the system is configured to acquire the audio streams associated to a first current audio scene associated to a first, current environment, and, in case the distance of the user's position or virtual position from a boundary of an audio scene is below a predetermined threshold, the system further acquires audio streams associated to a second, adjacent and/or neighboring video environment associated to the second audio scene.
In a related endeavor, Jang et al. teach wherein a plurality of audio scenes is defined for multiple environments such as adjacent and/or neighboring environments, so that the system is configured to acquire the audio streams associated to a first current audio scene associated to a first, current environment, and, in case the distance of the user's position or virtual position from a boundary of an audio scene is below a predetermined threshold, the system further acquires audio streams associated to a second, adjacent and/or neighboring video environment associated to the second audio scene (Figs 6-7 and 8A, par 0086-0090, par 0093-0098, “First, if a space shift is detected, sound 731 of the first space in which the current character exists is gradually decreased. Then, sound 733 of the second space, which is the target of the shift, is 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as 

Regarding claim 50, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to merge at least one first audio stream associated to a current audio scene to at least one stream associated to a neighboring, adjacent and/or future audio scene.
In a related endeavor, Jang et al. further teach the system being configured to merge at least one first audio stream associated to a current audio scene to at least one stream associated to a neighboring, adjacent and/or future audio scene (Figs 6-7 and 8A, par 0086-0090, par 0093-0098, “basket ball court sound 605 including sound of people 601a to 601d 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the system being configured to merge at least one first audio stream associated to a current audio scene to at least one stream associated to a neighboring, adjacent and/or future audio scene, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a predetermined visual field and a non-focus area out of 

Regarding claim 60, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to choose between reproducing one audio scene and composing or mixing or muxing or superposing or combining at least two audio scenes on the basis of the current or future or viewport and/or head orientation and/or movement data and/or metadata and/or virtual position and/or a user's selection, the two audio scenes being associated to different neighboring and/or adjacent video environments.
In a related endeavor, Jang et al. further teach the system being configured to choose between reproducing one audio scene and composing or mixing or muxing or superposing or combining at least two audio scenes on the basis of the current or future or viewport and/or head orientation and/or movement data and/or metadata and/or virtual position and/or a user's selection, the two audio scenes being associated to different neighboring and/or adjacent video environments (Figs 6-7 and 8A, par 0086-0090, par 0093-0098, “basket ball court sound 605 including sound of people 601a to 601d and sound of a speaker 605 in the basket ball court 600 is first provided as shown in (a) of FIG. 6 and should be instantly converted to conference room sound 610 including sound of people 611a to 611f and sound of a television 613 in the conference room 610 as soon as the virtual space shift from the first space to the second space is detected”, “FIG. 8C illustrates a fading-out of the first space sensory output 831 and the fading in of the second space sensory output 833 that begins at a faster rate and tapers off to a 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the system being configured to choose between reproducing one audio scene and composing or mixing or muxing or superposing or combining at least two audio scenes on the basis of the current or future or viewport and/or head orientation and/or movement data and/or metadata and/or virtual position and/or a user's selection, the two audio scenes being associated to different neighboring and/or adjacent video environments, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a predetermined visual field and a non-focus area outside the predetermined visual field in a virtual reality space for sound sources, thereby providing a virtual reality that can give a higher sense of realism.

Regarding claim 64, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to: receive a stream for an audio scene associated to a neighboring or adjacent environment; start decoding and/or reproducing the audio stream for the neighboring or adjacent environment at the detection of the transition of a boundary between two environments.
In a related endeavor, Jang et al. further teach the system being configured to: receive a stream for an audio scene associated to a neighboring or adjacent environment; start decoding and/or reproducing the audio stream for the neighboring or adjacent environment at the detection of the transition of a boundary between two environments (Figs 4, 6-7 and 8A, par 0074-0078, par 0086-0090, par 0093-0098, “basket ball court sound 605 including sound of people 601a to 601d and sound of a speaker 605 in the basket ball court 600 is first provided as shown in (a) of FIG. 6 and should be instantly converted to conference room sound 610 including sound of people 611a to 611f and sound of a television 613 in the conference room 610 as soon as the virtual space shift from the first space to the second space is detected”, “FIG. 8C illustrates a fading-out of the first space sensory output 831 and the fading in of the second space sensory output 833 that begins at a faster rate and tapers off to a slower rate. Such a process may be used to depict a slow transition from the first space to the second space. Alternatively, such a process may be used when a large number of sensory output producing objects or characters are detected in the non-focus areas relative to the focus area. In such a case, the transition from the first space to the second case may be made smoother by rapidly outputting the sensory output 833a of the focus area of the second space and then gradually increasing the output 833b of the non-focus areas of the second space”).


Regarding claim 66, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to wherein the system is further configured to: request and/or receive at least one first adaptation set comprising at least one audio stream associated with at least one first audio scene; request and/or receive at least one second adaptation set comprising at least one second audio stream associated with at least two audio scenes, including the at least one first audio scene; and merge the at least one first audio stream and the at least one second audio stream into a new audio stream to be decoded, based on metadata available regarding user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data and/or information describing an association of the at least one 
In a related endeavor, Jang et al. further teach wherein the system is further configured to: request and/or receive at least one first adaptation set comprising at least one audio stream associated with at least one first audio scene; request and/or receive at least one second adaptation set comprising at least one second audio stream associated with at least two audio scenes, including the at least one first audio scene; and merge the at least one first audio stream and the at least one second audio stream into a new audio stream to be decoded, based on metadata available regarding user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data and/or information describing an association of the at least one first adaptation set to the at least one first audio scene and/or an association of the at least one second adaptation set to the at least one first audio scene (Figs 3-4, 6-7 and 8A, par 0051, par 0069-0072, par 0074-0078, par 0086-0090, par 0093-0098, “First, under the control of the control unit 114, the focused sound processor 1162 identifies the sound (focused sound) occurring in the focus area according to the division of the virtual reality space described above with reference to FIG. 2, i.e. the focus area within the visual field of the character selected by the user of the virtual reality space, and controls the volume and left-right balance of the focused sound. Further, under the control of the control unit 114, the non-focused sound processor 1164 identifies an exact location of sound (non-focused sound) occurring in the non-focus area and controls the volume and left-right balance of the non-focused sound …when the virtual reality space shifts from a first space to a second space, the sound volume controller 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the feature wherein the system is further configured to: request and/or receive at least one first adaptation set comprising at least one audio stream associated with at least one first audio scene; request and/or receive at least one second adaptation set comprising at least one second audio stream associated with at least two audio scenes, including the at least one first audio scene; and merge the at least one first audio stream and the at least one second audio stream into a new audio stream to be decoded, based on metadata available regarding user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data and/or information describing an association of the at least one first adaptation set to the at least one first audio scene and/or an association of the at least one second adaptation set to the at least one first audio scene, as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit which processes and produces sound output in sound areas divided into a focus area within a predetermined visual field and a non-

Regarding claim 70, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to decide if at least one audio element from at least one audio scene embedded in at least one audio stream and at least one additional audio element from at least one additional audio scene embedded in at least one additional audio stream are to be reproduced; and cause, in case of a positive decision, an operation of merging or composing or muxing or superposing or combining at the least one additional stream of the additional audio scene to the at least one stream of the at least one audio scene.
In a related endeavor, Jang et al. further teach the system being configured to decide if at least one audio element from at least one audio scene embedded in at least one audio stream and at least one additional audio element from at least one additional audio scene embedded in at least one additional audio stream are to be reproduced; and cause, in case of a positive decision, an operation of merging or composing or muxing or superposing or combining at the least one additional stream of the additional audio scene to the at least one stream of the at least one audio scene (Figs 3-4, 6-7 and 8A, par 0051, par 0069-0072, par 0074-0078, par 0086-0090, par 0093-0098, “First, under the control of the control unit 114, the focused sound processor 1162 identifies the sound (focused sound) occurring in the focus area according to the division of the virtual reality space described above with reference to FIG. 2, i.e. the focus area within the visual field of the character selected by the user of the virtual reality space, and controls 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the system being configured to decide if at least one audio element from at least one audio scene embedded in at least one audio stream and at least one additional audio element from at least one additional audio scene embedded in at least one additional audio stream are to be reproduced; and cause, in case of a positive decision, an operation of merging or composing or muxing or superposing or combining the at least one additional stream of the additional audio scene to the at least one stream of the at least one audio scene as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit to process and produce sound output in sound areas .

Claims 2, 4, 6, 8, 10, 20, 22, 24, 27, 31, 33, 36, 38, 46, 51, 53, 57, 59, 61, 63, 65, 67, 69, 71, 73, 78, and 80 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2019/0005986 to Peters et al. in view of U.S. PGPubs 2010/0040238 to Jang et al.

Regarding claim 2, Peters et al. teach a system for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment configured to receive video and audio streams to be reproduced in a media consumption device (par 0006, par 0025-0026, par 0070, “This disclosure relates generally to using auditory aspects of the user experience of computer-mediated reality systems, including virtual reality (VR), mixed reality (MR), augmented reality (AR), computer vision, and graphics systems, to enhance the visual aspects of the user experience. In some particular examples, aspects of this disclosure are directed to using the directionality of audio data to predict particular portions of corresponding video data that is to be output at a greater resolution (or “upsampled”) to enhance the user experience provided by the computer-mediated reality system”), wherein the system comprises: 

at least one media video decoder configured to decode video signals from video streams for the representation of VR, AR, MR or 360-degree video environments to a user (Figs 7 and 10A-10B, par 0055-0057, par 0080-0085, par 0089-0090, “The playback device may perform the prediction based on various criteria, including the direction of the predominant audio components in the soundfield representation being played back for the computer-mediated reality (e.g., VR) experience, and the current FoV …. the playback device may select high-resolution video data of the predicted viewport(s) from the locally-stored video data, prior to the FoV actually changing to any of the predicted viewport(s). In this way, the playback device of FIG. 10B may predictively select high-resolution video data for future FoV viewports and prepare the selected high-resolution video data for output via the corresponding viewport(s), while reducing or potentially eliminating the lag time experienced by a user when changing the FoV to a different viewport or a group of different viewport(s)” …disclose a video decoder based on the viewpoint), and 
at least one audio decoder configured to decode audio signals from audio streams for the representation of audio scenes (Figs 2, 7 and 10A-10B, par 0048-24 may represent a device configured to decode ambisonic coefficients 15 from the bitstream 21. As such, the ambisonic coefficients 15 may be similar to a full set or a partial subset of the HOA coefficients 11, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel. The audio playback system 16 may, after decoding the bitstream 21 to obtain the Ambisonic coefficients 15 and render the Ambisonic coefficients 15 to output loudspeaker feeds 25” ..disclose an audio decoder), 
wherein the system is configured to request first audio streams and second audio streams and/or one audio element of an audio stream and/or one adaptation set to a server on the basis of at least the user's current viewport and/or head orientation and/or movement data and/or interaction metadata and/or virtual positional data (par 0007-0008, par 0031-0035, par 0055-0057, par 0081, par 0085, par 0087-0090, par 0093-0094, “the processor is configured to identify one or more foreground audio objects of the soundfield using the audio spatial metadata stored to the memory device, and to select, based on the identified one or more foreground audio objects, one or more viewports associated with the video data“, “The audio decoding device 24 may use metadata (e.g., audio spatial metadata) available from a 3D HOA soundfield representation to determine predominant sounds of the 3D HOA soundfield. For instance, the audio decoding device 24 may use a V-vector main direction of the audio objects. In a streaming scenario, the audio decoding device 24 may use audio spatial metadata received in the bitstream(s) 21. In a local-storage scenario, the audio decoding device 24 may use audio spatial metadata from a 3D HOA soundfield representation stored locally at the content consumer device 14. 24 may also use metadata associated with the audio objects (e.g., direction, distance, and priority information) to determine the energy-predominant sounds of the soundfield. In turn, the audio decoding device 24 may calculate a time-averaged (and optionally, metadata-weighted) histogram of the predominant sound directions of the 3D soundfield”, “the streaming client of FIG. 10A may determine the locations of the foreground audio objects of the soundfield representation received via the audio stream. For instance, the streaming client may use the energy of the various objects of the soundfield to determine which objects qualify as foreground audio objects. 
In turn, the streaming client may map the positions of the foreground audio objects in the soundfield to corresponding positions in the corresponding video data, such as to one or more of the tiles 116 illustrated in FIG. 3B. If the streaming client determines that foreground audio objects that are to be played back imminently map to positions in viewports (or tiles) that is different from the current FoV viewport (or tile), then the streaming client may predict that the user's FoV will change, to track the viewport positions that map to the position(s) of the soon-to-be-rendered foreground audio objects” ..obtain audio streams based on the position of the foreground audio objects (more than one)).
But Peters et al. are silent as to wherein the system is configured to control the request of the at least one audio stream to the server on the basis of a distance of the user's position from the boundaries of neighboring and/or adjacent video environments associated to different audio scenes.
In related endeavor, Jang et al. teach wherein the system is configured to control the request of the at least one audio stream to the server on the basis of a distance of the user's position from the boundaries of neighboring and/or adjacent video environments associated to different audio scenes (Fig 2, par 0066, par 0069, par 0075, “the virtual reality apparatus may divide the virtual reality space into a focus area within the visual field of the character using the virtual reality apparatus and non-focus areas outside the visual field of the character. The focus area within the visual field of the character refers to an area 203 (shaded portion) adjacent to the character 200 in the front space within the visual field of the character 200. The non-focus areas beyond the visual field of the character include an area 201 relatively distanced far from the character 200 in the front space within the visual field of the character 200 and areas 205, 207, 209, and 211 located outside the visual field of the character 200. ….Further, at least one of the focus area and the non-focus areas may be divided into multiple areas in the virtual reality space according to a predetermined priority in controlling the volume of the sound or the left-right balance” ..obtain audio sources based on the identified area (focus area or non-focus areas based on the distance of the user and field of view of the user).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. to include wherein the system is configured to control the request of the at least one audio stream to the server on the basis of a distance of the user's position from the boundaries of neighboring and/or adjacent video environments associated to different audio scenes as taught by Jang et al., in order to shift between virtual spaces without having to portray the passage of distance or time, to allow the user to travel from one space to another, and to perform sound processing in a virtual reality system that includes a sound processing unit 

Regarding claims 4, 6, 8, 10, 20, 22, 24, 27, 31, 33, 36, 38, 46, 51, 53, 57, 59, 61, 63, 65, 67, 69, 71, and 73, Peters et al. as modified by Jang et al. teach all the limitations of claim 2. Claims 4, 6, 8, 10, 20, 22, 24, 27, 31, 33, 36, 38, 46, 51, 53, 57, 59, 61, 63, 65, 67, 69, 71, and 73 are similar in scope to claims 3, 5, 7, 9, 19, 21, 23, 26, 30, 32, 35, 37, 45, 50, 52, 56, 58, 60, 62, 64, 66, 68, 70, and 72, respectively, and are rejected under the same rationale.

Regarding claim 78, method claim 78 is similar in scope to claim 2 and is rejected under the same rationale.

Regarding claim 80, Peters et al. teach a non-transitory digital storage medium having a computer program stored thereon to perform the method for a virtual reality, VR, augmented reality, AR, mixed reality, MR, or 360-degree video environment configured to receive video and/audio streams to be reproduced in a media consumption device (par 0006, par 0010, par 0049). The remaining limitations of the claim are similar in scope to claim 2 and are rejected under the same rationale.

Claims 12, 14, 16, 18, 29, and 40-41 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2019/0005986 to Peters et al. in view of U.S. PGPubs 2010/0040238 to Jang et al., further in view of U.S. PGPubs 2002/0103554 to Coles et al.

Regarding claims 12, 14, 16, 18, 29, and 40, Peters et al. as modified by Jang et al. teach all the limitations of claim 2. Claims 12, 14, 16, 18, 29, and 40 are similar in scope to claims 11, 13, 15, 17, 28, and 39, respectively, and are rejected under the same rationale.

Regarding claim 41, Peters et al. as modified by Jang et al. teach all the limitations of claim 2, but are silent as to the system being configured so as to request and/or receive first audio streams and second audio streams, wherein the first audio elements in the first audio streams are more relevant and/or more audible than the second audio elements in the second audio streams, wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams.
In a related endeavor, Coles et al. teach the system being configured so as to request and/or receive first audio streams and second audio streams, wherein the first audio elements in the first audio streams are more relevant and/or more audible than the second audio elements in the second audio streams (Coles et al.: abstract, par 0010-0012, par 0032, par 0038-0039, par 0042-0048, par 0052-0055, introducing an interactive audio system that provides different rates for different audio objects based on the user’s position or focus), wherein the first audio streams are requested and/or received at a higher bitrate than the bitrate of the second audio streams (abstract, par 0010-0012, par 0032, par 0038-
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Jang et al. to include the system being configured so as to request and/or receive first audio streams and second audio streams, wherein the first audio elements in the first audio streams are more relevant and/or more audible than the second audio elements in the second audio streams as taught by Coles et al., in order to select an audio component as a focus component using the user control device and transmit it at a higher bit-rate than the non-focus components, so as to maintain the required bandwidth of the data link at a suitable level.

Claims 48 and 54 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2019/0005986 to Peters et al. in view of U.S. PGPubs 2002/0103554 to Coles et al., further in view of Boustead (P. Boustead and F. Safaei. Dice: Internet delivery of immersive voice communication for crowded virtual spaces. In VR ’05: Proceedings of the 2005 IEEE Conference 2005 on Virtual Reality, pages 35–41, Washington, DC, USA, 2005. IEEE Computer Society).

Regarding claim 48, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to disable the decoding of 
In a related endeavor, Boustead et al. teach the system being configured to disable the decoding of audio elements selected on the basis of the user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual position (Figs 2 and 3, sections 1 and 3, defining a hearing range for an avatar based on distance and viewpoint, to identify the mass of talking avatars and remove the sound from talking avatars that are outside the hearing range and view range).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the system being configured to disable the decoding of audio elements selected on the basis of the user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual position as taught by Boustead et al., in order to render a realistic crowded audio scene, including spatial rendering of the voices of surrounding avatars, for delivery over the Internet in a peer-to-peer manner, to reduce the computational load on the servers by performing only simple operations in the servers (including weighted mixing of audio streams), and to cope with access bandwidth restrictions of clients by reducing the access bandwidth requirements.

Regarding claim 54, Peters et al. as modified by Coles et al. teach all the limitations of claim 1, but are silent as to the system being configured to deactivate the decoding and/or reproduction of at least one stream on the basis of metadata associated to the at 
In a related endeavor, Boustead et al. teach the system being configured to deactivate the decoding and/or reproduction of at least one stream on the basis of metadata associated to the at least one stream and on the basis of the user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data (Figs 2 and 3, sections 1 and 3, defining a hearing range for an avatar based on distance and viewpoint, to identify the mass of talking avatars and remove the sound from talking avatars that are outside the hearing range and view range).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Peters et al. as modified by Coles et al. to include the system being configured to deactivate the decoding and/or reproduction of at least one stream on the basis of metadata associated to the at least one stream and on the basis of the user's current viewport and/or head orientation and/or movement data and/or metadata and/or virtual positional data as taught by Boustead et al., in order to render a realistic crowded audio scene, including spatial rendering of the voices of surrounding avatars, for delivery over the Internet in a peer-to-peer manner, to reduce the computational load on the servers by performing only simple operations in the servers (including weighted mixing of audio streams), and to cope with access bandwidth restrictions of clients by reducing the access bandwidth requirements.

Claims 49 and 55 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2019/0005986 to Peters et al. in view of U.S. PGPubs 2010/0040238 to Jang et al., further in view of Boustead (P. Boustead and F. Safaei. Dice: Internet delivery of immersive voice communication for crowded virtual spaces. In VR ’05: Proceedings of the 2005 IEEE Conference 2005 on Virtual Reality, pages 35–41, Washington, DC, USA, 2005. IEEE Computer Society).

Regarding claims 49 and 55, Peters et al. as modified by Jang et al. teach all the limitations of claim 2. Claims 49 and 55 are similar in scope to claims 48 and 54, respectively, and are rejected under the same rationale.

Allowable Subject Matter
Claims 25, 42-44, and 47 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: The cited prior art fails to teach the combination of elements recited in claim 25, including "wherein the first streams associated to the first audio scene are acquired at a higher bitrate when the user is in the first environment associated to the first audio scene, while the second streams associated to the second audio scene associated to the second environment are acquired at a lower bitrate when the user is in the beginning of a transitional position from the first audio scene to the second audio scene, and the first streams associated to the first audio scene are acquired at a lower bitrate .
The following is a statement of reasons for the indication of allowable subject matter: The cited prior art fails to teach the combination of elements recited in claims 42 and 43, including "wherein at least two visual environment scenes are defined, wherein at least one first and second audio elements are associated to a first audio scene associated to a first video environment, and at least one third audio element is associated to a second audio scene associated to a second video environment, wherein the system is configured to acquire interaction metadata describing that the at least one second audio element is additionally associated with the second video environment, wherein the system is configured to request and/or receive the at least one first and second audio elements in case the user's virtual position is in the first video environment, wherein the system is configured to request and/or receive the at least one second and third audio elements in case the user's virtual position is in the second video environment, and wherein the system is configured to request and/or receive the at least one first and second and third audio elements in case the user's virtual position is in transition between the first video environment and the second video environment".
The following is a statement of reasons for the indication of allowable subject matter: The cited prior art fails to teach the combination of elements recited in claim 47, including "wherein the metadata processor is configured to enable and/or disable at least one audio element in at least one audio stream prior to the at least one audio decoder based on user's current viewport and/or head orientation and/or movement .

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556.  The examiner can normally be reached on 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Devona Faulk can be reached on (571)272-7515.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/JIN GE/           Primary Examiner, Art Unit 2616