United States Patent and Trademark Office

Commissioner for Patents
United States Patent and Trademark Office
P.O. Box 1450
Alexandria, VA 22313-1450
www.uspto.gov











BEFORE THE PATENT TRIAL AND APPEAL BOARD


Application Number: 16/907,934
Filing Date: 22 Jun 2020
Appellant(s): QUALCOMM Incorporated



__________________
Matthew Gage
For Appellant


EXAMINER’S ANSWER





This is in response to the appeal brief filed June 21, 2022.

(1) Grounds of Rejection to be Reviewed on Appeal
Every ground of rejection set forth in the Office action dated 1/28/2022 from which the appeal is taken is being maintained by the examiner except for the grounds of rejection (if any) listed under the subheading “WITHDRAWN REJECTIONS.”  New grounds of rejection (if any) are provided under the subheading “NEW GROUNDS OF REJECTION.”
(2) Response to Argument
Applicant filed an Appeal Brief on 6/21/2022, appealing the final Office action and the arguments of the Advisory Action of 4/5/2022.

Regarding the application at hand, Claim 1 recites:
A device configured to encode scene-based audio data, the device comprising: 
a memory configured to store scene-based audio data; and 
one or more processors configured to: 
perform spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal; 
perform psychoacoustic audio encoding with respect to the foreground audio signal to obtain an encoded foreground audio signal; 
determine, when performing psychoacoustic audio encoding with respect to the foreground audio signal, a first bit allocation for the foreground audio signal; 
determine, based on the first bit allocation for the foreground audio signal, a second bit allocation for the spatial component; 
quantize, based on the second bit allocation for the spatial component, the spatial component to obtain a quantized spatial component; and 
specify, in a bitstream, the encoded foreground audio signal and the quantized spatial component.

Claim 13 recites:
A method of encoding scene-based audio data, the method comprising: 
performing spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal; 
performing psychoacoustic audio encoding with respect to the foreground audio signal to obtain an encoded foreground audio signal; 
determining, when performing psychoacoustic audio encoding with respect to the foreground audio signal, a first bit allocation for the foreground audio signal; 
determining, based on the first bit allocation for the foreground audio signal, a second bit allocation for the spatial component; 
quantizing, based on the second bit allocation for the spatial component, the spatial component to obtain a quantized spatial component; and 
specifying, in a bitstream, the encoded foreground audio signal and the quantized spatial component.

As presented in the final office action: 
Regarding claim 1, Atti teaches a device configured to encode scene-based audio data (Fig. 1; para. 4: scene-based audio format, coder/decoder (codec); para. 28: ambisonics; scene-based audio), the device comprising: 
a memory configured to store scene-based audio data (Fig. 6); and 
one or more processors (Fig. 6) configured to: 
perform spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal (Fig. 1; 21; 39: spatial characteristics; 45: spatial metadata; 66-68: configured to determine whether each stream corresponds to a background audio source or foreground audio source, higher priority to foreground sources, lower priority to…background sources; 100: spatial; 
linear invertible transform: 28: scene-based audio; Eigen-decomposed coefficients corresponding to the sound scene; where paragraph 37 of the application teaches that the LIT can be an Eigenvalue decomposition); 
perform psychoacoustic audio encoding with respect to the foreground audio signal to obtain an encoded foreground audio signal (21; 45-46: IVAS codec encodes the streams; 52; 68 – encoding foreground audio/higher priority); 
determine, when performing psychoacoustic audio encoding with respect to the foreground audio signal, a first bit allocation for the foreground audio signal (21: allocate more bits to higher priority; 36; 45-46; 52: However, because audio streams having higher priority are encoded with a higher bit rate, the decoded versions of the higher priority streams are typically higher-accuracy reproductions of the original audio streams than the decoded versions of the lower priority streams; 68); 
determine, based on the first bit allocation for the foreground audio signal, a second bit allocation for the spatial component (21; 36; 45-46; 68; 88: bit rate estimator and distribution; 100: spatial metadata 124. "For example, a quantized version of the spatial metadata 124 may be used where an amount of quantization for each IS stream is based on the priority of the IS stream. To illustrate, spatial metadata encoding for high-priority streams may use 4 bits for azimuth data and 4 bits for elevation data, and spatial metadata encoding for low-priority streams may use 3 bits or fewer for azimuth data and 3 bits or fewer for elevation data."); 
quantize, based on the second bit allocation for the spatial component, the spatial component to obtain a quantized spatial component (100 quantization); and 
specify, in a bitstream, the encoded foreground audio signal and the quantized spatial component (Figs. 1, 2; 20: generate a bitstream; 
21: The IVAS codec 102 includes a stream priority module 110 that is configured to determine a priority configuration for some or all of the received audio streams and to encode the audio streams based on the determined priorities (e.g., perceptually more important, more “critical” sound to the scene, background sound overlays on top of the other sounds in a scene, directionality relative to diffusiveness, etc.). 
In an example embodiment, the IVAS codec 102 may allocate more bits to streams having higher priority than to streams having lower priority).
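
For illustration only, the priority-dependent quantization of spatial metadata quoted from paragraph 100 of Atti (4 bits each for azimuth and elevation on high-priority streams, 3 bits on low-priority streams) can be sketched as a uniform scalar quantizer. The function names, the angle ranges (azimuth in [-180, 180] degrees, elevation in [-90, 90] degrees), and the uniform quantizer itself are assumptions made for this sketch; neither Atti nor the application is stated to use this particular scheme.

```python
def quantize_angle(angle_deg, lo, hi, bits):
    """Uniformly quantize an angle in [lo, hi] to an integer index of `bits` bits."""
    levels = (1 << bits) - 1                 # highest index, e.g. 15 for 4 bits
    clamped = min(max(angle_deg, lo), hi)    # keep the input inside the range
    return round((clamped - lo) / (hi - lo) * levels)


def dequantize_angle(index, lo, hi, bits):
    """Map an index back to the angle at the centre of its quantization cell."""
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)


def quantize_spatial_metadata(azimuth_deg, elevation_deg, high_priority):
    """Hypothetical helper: pick 4 bits (high priority) or 3 bits (low priority)."""
    bits = 4 if high_priority else 3         # bit counts taken from the quoted example
    return {
        "bits": bits,
        "azimuth": quantize_angle(azimuth_deg, -180.0, 180.0, bits),
        "elevation": quantize_angle(elevation_deg, -90.0, 90.0, bits),
    }
```

Under these assumptions, a source at 90 degrees azimuth and 45 degrees elevation maps to indices 11 and 11 at 4 bits and to indices 5 and 5 at 3 bits, so the lower-priority stream carries a coarser direction estimate.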

	Claim 13 (the method embodiment) recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.

Applicant has presented after-final arguments, reiterated in the Appeal Brief, that the cited prior art does not read on the limitations as claimed (specifically, that the cited prior art, Atti, does not teach application of a linear invertible transform with respect to the scene-based audio).
Applicant’s arguments filed have been fully considered but are not persuasive.

	Applicant argues using independent claim 13, which is the method embodiment and is similar to device claim 1.  Applicant argues that the cited prior art, Atti (2019/0103118), does not teach the limitations of claim 13, and more specifically does not teach the limitation:
“performing spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal”.
Applicant argues repeatedly on pages 7-14 of the Brief that the Examiner’s Advisory Action response “merely reiterate previously cited portions of Atti without providing any rebuttal that actually addresses Applicant’s arguments” (pages 7-8).  Applicant then proceeds to restate its after-final arguments of 3/28/2022, and goes through each of the Examiner’s Advisory Action arguments merely to state that the reference does not teach the disputed limitation.
Examiner respectfully disagrees with Applicant’s position that its arguments were not addressed.  Examiner has mapped the claim language to the appropriate teachings of the reference, and further elaborated on the reference to better explain and describe the components.

Overall, regarding the art, Applicant’s Brief argues that Atti does not specifically teach
“performing spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal”
because Atti only mentions “Eigen decomposed coefficients”, which cannot read on application of a linear invertible transform with respect to the scene-based audio, and does not explain how the spatial metadata is derived.
Applicant argues on pages 8-9 of the Brief (repeating Applicant’s March arguments) that in claim 13 the linear invertible transform is applied with respect to the scene-based audio data, while Atti describes a process of obtaining scene-based audio from audio signals output by a microphone, such that the Eigen-decomposed coefficients are derived from the audio signals output by a microphone.
Applicant further argues that it is equally likely that Atti is referring to application of an Eigenvalue decomposition to the audio signals to generate SBA, which would then include Eigen-decomposed coefficients.

Examiner respectfully disagrees.
Examiner notes that it is not sufficient to look at the cited phrases or even paragraphs in a vacuum; the entirety of the reference must be considered to properly understand its teachings.  The reference must also be considered in light of the state of the technology at the time the application was submitted, and what would be known by those of ordinary skill in the art.  Further, Applicant’s interpretation of the cited paragraph of Atti is based on Applicant’s own speculations and assumptions.  Nowhere does the sentence/paragraph state that Atti is referring to application of an Eigenvalue decomposition to the audio signals output by a microphone to generate scene-based audio (SBA).  Not only is this not stated in the reference, it is also not scientifically accurate. 

	Independent claims 1 and 13 were amended on 12/17/2021 to recite the limitations of previous dependent claim 6:
“performing spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal”.

Regarding the limitation, the application at hand teaches:
	[0037] The spatial audio encoding device 24 may be configured to compress the ambisonic coefficients 21. That is, the spatial audio encoding device 24 may compress the ambisonic coefficients 21 using a decomposition involving application of a linear invertible transform (LIT). One example of the linear invertible transform is referred to as a “singular value decomposition” (“SVD”), a principal component analysis (“PCA”), or an Eigenvalue decomposition, which may represent different examples of a linear invertible decomposition.
[0038] In this example, the spatial audio encoding device 24 may apply SVD to the ambisonic coefficients 21 to determine a decomposed version of the ambisonic coefficients 21. The decomposed version of the ambisonic coefficients 21 may include one or more of predominant audio signals and one or more corresponding spatial components describing spatial characteristics, e.g., a direction, shape, and width, of the associated predominant audio signals.

Atti teaches an audio encoder for encoding multiple streams based on priority.  Atti teaches ambisonics and scene-based audio, and deriving components using eigenvalue decomposition (28).
Atti teaches encoding of multiple audio signals (0002).  The audio signals may be processed into audio data streams according to a particular audio format, such as a two-channel stereo format, a multichannel format such as 5.1 or a 7.1 format, a scene-based audio format, or one or more other formats. The audio data streams may be encoded by an encoder, such as a coder/decoder (codec) that is designed to encode and decode audio data streams according to the audio format (0004). 
[0021] The IVAS codec 102 includes a stream priority module 110 that is configured to determine a priority configuration for some or all of the received audio streams and to encode the audio streams based on the determined priorities (e.g., perceptually more important, more “critical” sound to the scene,

Atti teaches:
[0028] In another implementation, the streams 131-133 are generated by the front end audio processor 104 to have a format based on ambisonics or scene-based audio (SBA) in which the channels may sometimes include Eigen-decomposed coefficients corresponding to the sound scene. 
A mapping of the relevant portions of the application and the prior art is presented below:

Application:
[0037] The spatial audio encoding device 24 may be configured to compress the ambisonic coefficients 21. That is, the spatial audio encoding device 24 may compress the ambisonic coefficients 21 using a decomposition involving application of a linear invertible transform (LIT). One example of the linear invertible transform is referred to as a “singular value decomposition” (“SVD”), a principal component analysis (“PCA”), or an Eigenvalue decomposition, which may represent different examples of a linear invertible decomposition.

Atti:
[0028] In another implementation, the streams 131-133 are generated by the front end audio processor 104 to have a format based on ambisonics or scene-based audio (SBA) in which the channels may sometimes include Eigen-decomposed coefficients corresponding to the sound scene. 


	Therefore, a comparison of the relevant portions of the application and the reference as presented in the table above shows that, in the application, an example of the LIT can be an Eigenvalue decomposition, which can be applied to the scene-based audio/ambisonics, and that Atti explicitly teaches this in paragraph [0028]. 

Applicant’s arguments represent a clear misunderstanding of the reference and of the spatial audio coding art, and confuse certain steps of the audio recording and processing process.  In audio coding, current innovations have strived to go beyond stereo and even 5.1 and 7.1 formats.  We have thus seen the emergence of ambisonics, 3D, and scene-based audio: audio formats/schemes that better recreate the recording environment to present a more immersive experience for listeners.  With (all) audio formats, the (audio capture) process begins with multiple microphones (an array) that are placed in the designated space for recording, and these streams/channels are passed to a mixer to obtain the format (where panning can place the sounds in the appropriate locations).  Decomposition is not performed in mixing audio channels/streams for the format.  For example, when using two microphones, the mics are run to a mixer and the output is a stereo format.  Decomposition is not performed in creating the stereo representation.  Applicant appears to be arguing that the teachings of Atti may apply to this audio capture step; however, this is not what is taught, nor is it in line with current and standard practice in the art.

Atti teaches microphones for capturing sound originating from various sources (24).  The front end audio processor 104 is configured to receive the audio signals 136-139 from the microphones 130 and to process the audio signals 136-139 to generate multi-stream formatted audio data (25).  The streams 131-133 include pulse-code modulation (PCM) data and have a format (26).  The format can include stereo (27), or a format based on ambisonics or scene-based audio (28).  
This therefore demonstrates the audio capturing portion of the process: receiving the acoustic signals, and capturing and obtaining a representation/format.  Specific NPL references provided in Applicant’s Information Disclosure Statements also teach this, which, as previously mentioned, is known to one of ordinary skill in the art: 

Peters, “Scene-Based Audio Implemented with Higher Order Ambisonics (HOA)”, presented in Applicant’s IDS of 12/27/2021

Scene-based audio uses a sound-field technology called “Higher Order Ambisonics” (HOA) to create holistic descriptions of both live-captured sound scenes
The audio can be carried as a set of PCM channels that contain predominant sounds and ambience in separate tracks.  Standard audio bandwidth compression techniques then can be applied to the PCM channels (abstract)

Pages 3-5, Scene-based Audio: it is possible to capture live immersive sound scenes…one can design scene-based content in a digital audio workstation by panning audio-objects to desired positions (page 4)


3rd Generation Partnership Project (3GPP) provided in Applicant’s IDS 12/15/2020
	Pg 21: 4.3.2 Audio Capture System
	4.3.2.2 Audio capture system for scene-based audio representation

	Once the audio format has been obtained, the industry has sought better ways to store, transport, and present this audio data, as spatial audio has begun to include more and more microphones and associated channels (and therefore more data).  Mathematical algorithms, namely Eigenvalue decomposition and singular value decomposition, have been incorporated to break down/decompose the scene-based or HOA audio data into various components, allowing better decisions and options on which data to compress, etc.  Eigenvalue decomposition and singular value decomposition are linear algebra techniques for breaking down matrices, and have been applied to various audio strategies.  When applied to spatial audio, they have been used to separate audio formats into components to allow for additional adjustments.  It is with this background that a reference such as Atti should be viewed (where this represents standard teachings to one of ordinary skill in the art for this technology).
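
For illustration only, the decomposition techniques named above can be written in generic linear-algebra notation. The symbols below ($H$, $U$, $\Sigma$, $V$, $M$, $N$) and the assignment of factors to audio signals and spatial components are assumptions introduced for this sketch, following one common convention; they are not taken from Atti or from the application.

```latex
% H holds M time samples (rows) of N ambisonic/SBA channels (columns), M >= N.
\begin{align}
  H &= U \Sigma V^{\mathsf{T}},
    & U^{\mathsf{T}} U &= I_N,\quad V^{\mathsf{T}} V = I_N,\quad
      \Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_N),\ \sigma_i \ge 0
    && \text{(singular value decomposition)} \\
  H^{\mathsf{T}} H &= V \Sigma^{2} V^{\mathsf{T}}
    &&&& \text{(Eigenvalue decomposition of } H^{\mathsf{T}} H \text{)} \\
  S &= U \Sigma, \qquad H = S V^{\mathsf{T}}
    &&&& \text{(invertible: } H \text{ is recovered exactly)}
\end{align}
```

Under this convention, the columns of $S = U\Sigma$ play the role of the predominant (foreground) audio signals, ordered by energy $\sigma_i^2$, and the corresponding columns of $V$ play the role of the spatial components. The transform is linear and, before any quantization, loses no information, since $H = S V^{\mathsf{T}}$.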
	When referring back to the cited paragraph of Atti, one can now better understand that the statement
“the streams 131-133 are generated by the front end audio processor 104 to have a format based on ambisonics or scene-based audio (SBA) in which the channels may sometimes include Eigen-decomposed coefficients corresponding to the sound scene”
	actually teaches that the channels of the ambisonic or scene-based audio can include Eigen-decomposed coefficients, which are the result of performing (Eigenvalue) decomposition on the formats.  Eigenvalue decomposition is a matrix reduction technique, and when applied to audio, allows for decomposition of audio components.
	While Atti teaches Eigen-decomposed coefficients, obtaining these coefficients requires the implementation of Eigenvalue decomposition on the scene-based audio (SBA).  

	In reference to Applicant’s speculation about what the sentence of Atti (in para. 28) actually teaches (as argued on pages 8-9 of the Brief), Applicant provides assumptions that have no theoretical basis or support.  The Eigen coefficients are not derived (directly) from the audio signals of the microphone, as there would be no decomposition to be performed.  One would not apply Eigenvalue decomposition to combine received audio signals into a format.  However, the claim merely recites “application of a linear invertible transform with respect to the scene-based audio” (where the linear invertible transform can be an Eigenvalue decomposition), and Atti explicitly teaches application of an Eigenvalue decomposition with respect to scene-based audio (to obtain channels that include Eigen-decomposed coefficients).

	Regarding the remainder of the limitation, Atti teaches streams, and a stream priority that is configured to determine a priority configuration for some or all of the received audio streams (21) with the more important streams being higher priority (foreground), and less important streams being lower priority (background) (34; 68).  Thus, as a particular format is decomposed into the various streams, they are assigned a priority.
These streams/channels can also include spatial characteristics:
	39: front end audio processor 104 may provide information indicating spatial characteristics (e.g., azimuth, elevation, direction of arrival, etc.) of the source of each streams 131-133 to the stream selection module.
	The disputed limitation only recites application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component.
	Atti teaches application of an LIT with respect to SBA to obtain the channels/streams representing a sound scene (foreground/background audio signals), and Atti’s statement that the 
“front end audio processor 104 may provide information indicating spatial characteristics (e.g., azimuth, elevation, direction of arrival, etc.) of the source of each streams” 
teaches a corresponding spatial component.
The implementation of a linear invertible transform with the audio data allows for the decomposition of the characteristics into a collection of components such as the main audio signal and accompanying spatial components of a corresponding (or with respect to) sound scene.  Thus, according to the teachings of Atti, the system can include Eigen-decomposed coefficients corresponding to the sound scene which will allow for the foreground and spatial components to be obtained.

From here Atti can then perform psychoacoustic coding allowing for bit allocation based on stream priority, with the foreground signal encoded with a first bit allocation, and additional streams/background/spatial audio data encoded with a second bit allocation:
21: The IVAS codec 102 includes a stream priority module 110 that is configured to determine a priority configuration for some or all of the received audio streams and to encode the audio streams based on the determined priorities (e.g., perceptually more important, more “critical” sound to the scene, background sound overlays on top of the other sounds in a scene, directionality relative to diffusiveness, etc.). 
In an example embodiment, the IVAS codec 102 may allocate more bits to streams having higher priority than to streams having lower priority.
	The spatial characteristics are not just derived for the independent streams (IS) format, but for the other formats as well: 
	The spatial metadata 124 is generated and provided to the IVAS codec 102 in certain circumstances, such as e.g., when the streams 121-124 have the independent streams (IS) format. In other formats, e.g., stereo, SBA, MC, the spatial metadata 124 may be derived partially from the front end audio processor 104. In an example embodiment, the spatial metadata may be different for the different input formats and may also be embedded in the input streams.(45)

	The audio data that has been allocated bits will then be quantized and encoded by the codec to generate the bitstream for transmission (fig 1 and paragraphs 21-22).  This includes the spatial characteristics, which are needed for proper decoding and generation of the audio signals (55).
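
For illustration only, the priority-based allocation quoted from paragraph 21 of Atti ("allocate more bits to streams having higher priority") can be sketched as a proportional split of a fixed bit budget. The function name `allocate_bits`, the integer priority weights, and the proportional rule are assumptions made for this sketch; Atti is not stated to use this particular rule.

```python
def allocate_bits(total_bits, priorities):
    """Split `total_bits` across streams in proportion to integer priority weights.

    `priorities` maps a stream name to a positive integer weight
    (a larger weight means a higher priority). Hypothetical rule for illustration.
    """
    if total_bits < 0:
        raise ValueError("total_bits must be non-negative")
    if not priorities or any(w <= 0 for w in priorities.values()):
        raise ValueError("priorities must be non-empty with positive weights")

    weight_sum = sum(priorities.values())
    # Floor of each stream's proportional share.
    alloc = {name: total_bits * w // weight_sum for name, w in priorities.items()}

    # Flooring can leave a few bits unassigned; give them to the
    # highest-priority streams first, one bit each.
    leftover = total_bits - sum(alloc.values())
    by_priority = sorted(priorities, key=priorities.get, reverse=True)
    for name in by_priority[:leftover]:
        alloc[name] += 1
    return alloc
```

With a 100-bit budget and weights of 3 for a foreground stream and 1 for a background stream, this sketch assigns 75 and 25 bits respectively, and the allocations always sum to the budget.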


Applicant only argues the limitation
“performing spatial audio encoding that includes application of a linear invertible transform with respect to the scene-based audio data to obtain a foreground audio signal and a corresponding spatial component, the spatial component defining spatial characteristics of the foreground audio signal”
and, more specifically, 
application of a linear invertible transform with respect to the scene-based audio.


This portion of the limitation only requires application of an LIT with respect to scene-based audio.  As demonstrated above and in past Examiner responses, this is explicitly taught by Atti, the cited art of record, which teaches:
[0028] In another implementation, the streams 131-133 are generated by the front end audio processor 104 to have a format based on ambisonics or scene-based audio (SBA) in which the channels may sometimes include Eigen-decomposed coefficients corresponding to the sound scene. 



Therefore the limitations of claim 13 do not yet overcome the current art of record.
Claim 1 recites limitations similar to claim 13 and is rejected for similar arguments presented above regarding claim 13.


	The additional independent and dependent claims (corresponding to Applicant’s groups 2-5) are also rejected based on arguments presented above and art rejections of final office action.




For the above reasons, it is believed that the rejections should be sustained.
Respectfully submitted,
/SHAUN ROBERTS/ Primary Examiner, Art Unit 2655
Conferees:
/ANDREW C FLANDERS/ Supervisory Patent Examiner, Art Unit 2655

/DANIEL C WASHBURN/ Supervisory Patent Examiner, Art Unit 2657
Requirement to pay appeal forwarding fee.  In order to avoid dismissal of the instant appeal in any application or ex parte reexamination proceeding, 37 CFR 41.45 requires payment of an appeal forwarding fee within the time permitted by 37 CFR 41.45(a), unless appellant had timely paid the fee for filing a brief required by 37 CFR 41.20(b) in effect on March 18, 2013.