Application No. 17/568,343; Docket No. ZL0035-0002-US-CON1

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Examiner’s Comments

The terminal disclaimer filed 7-12-2022 has been approved. 
The term ‘separate’, as used in claim 1 as now amended, is drawn to the combination of the well-known functions of physically separating the sound stream into separate sound tracks and also classifying each sound track in order to provide the corresponding sound types.
The examiner notes the prior art of Robinson et al. (US 10721521 B1), which discloses detecting spatial audio scenes from legacy media, similar to that claimed in applicant’s claim 2.

	
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


The following claims and their respective depending claims are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

As per claim 1, it is not clear how to read ‘sound tracks’ as recited in the claim. The sound tracks are the results of the ‘separating’ step in claim 1, but require additional processing (per the received configuration) in order to be provided to the loudspeakers to generate a 3-dimensional field. As such, the sound tracks are not directly output to the loudspeakers but instead are processed via the configuration, where each audio object must be split into at least two loudspeaker driving signals to provide a 3D sound field. The claimed signals should be clearly recited relative to the various processing stages in the claimed system.
As per claim 2, it is not clear how to read ‘sound track’ relative to ‘sound channel’ as recited, as the sound track is the output of the separating step and has not yet been processed via the configuration information to transform it into loudspeaker driving signals/sound channels. As such, the sound track cannot comprise multiple channels until it has been processed, and should then be recited as a different signal. Further, it is not clear how the sound tracks can be provided to the loudspeakers per claim 2, as they have already been provided in claim 1. It is not clear which sound tracks are being referred to in each claim, as they appear to be different signals.

As per claim 12, it is not clear how the loudspeakers can receive both the tracks or channels as per claim 1 and the filtered sound tracks as per claim 12 in order to generate the same 3d sound field.



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 13, 9, 20, 11, 12, 18, 17, 2, 14, 19, 7, 8, 3, 15, 4, 16, 5, 6, 10 are rejected under 35 U.S.C. 103 as being unpatentable over Delamont (US 20200368616 A1) in view of Wilcox et al. (US 10261749 B1).

As per claims 1,13, Delamont discloses: a three-dimensional sound generation system/computer implemented method, comprising: 
a user interface device (the speakers, microphones, camera and display 5,6,7,8 from fig. 1b, in the device of fig. 1a is a user interface); 
and a processing device 11 (fig. 1b), communicatively coupled to the user interface device, implementing signal processing/computer implemented method to: 
receive a sound stream that is composed of the one or more sound sources, wherein each of the one or more sound sources corresponds to a sound type (decoder 16 in fig. receives a sound stream of encoded data images and audio as per para. 138, where the audio includes at least types as per para. 143, the spatial audio and 3D audio effects);

The decoder decodes/separates the audio;
obtain a specification/preconfiguration of a three-dimensional space and a mesh of filters defined on a grid in the three-dimensional space (para. 84, the processor, via the observer component 25 in fig. 2, receives generated surface and mesh data/specification of a 3D space, and mesh filters, which are defined on a 3D grid in 3D space that is defined by the model coordinates expressed as points on the three dimensional planes, as cited in para. 82, and the coordinates as used as described in para. 89),
 wherein the three-dimensional space is presented in a user interface of the user interface device (the virtual image is in a 3D space that is presented to the user via the user interface, as the displayed virtual image via the mounted AR display apparatus, para. 85);

obtain one or more sound tracks comprising corresponding sound signals associated with a sound source (para. 101, the manipulation of the sound tracks/audio to appear directional, while being applied to the same coordinate system as the video, where the manipulation of the sound output/source is the determination of sound tracks which are associated with the original sound signal before it is manipulated);
 present, in a user interface of the user interface device, representations representing one or more listeners and the one or more sound sources corresponding to the one or more sound signals in the three-dimensional space (para. 90: a mesh filter to re-render and display the 3D holographic images of the virtual game objects accordingly via the users augmented reality ("AR") display apparatus 1 based on the user's movements, where the sound tracks/audio are placed together with the video as per para. 101);
receive a configuration of at least one of a position of the listener or positions of the sound sources in the three-dimensional space, determine a plurality of filters from the mesh of filters based on the relative positions of the sound sources and the listener in the three-dimensional space, determine a plurality of filters based on the configuration and predetermined locations of one or more loudspeakers (as cited above from paras. 90 and 101, the mesh filters are selected to re-render the images based on the user’s movements, where a user’s movements by definition alter the relative positions of the sound sources and the listener in the three-dimensional space, and hence alter a position of the listener, where the predetermined locations of loudspeakers are required and must be received by the mesh filters by the relative one or more positions of the sound sources and listener, since the sound sources are produced relative to predetermined locations of loudspeakers); and
 provide the one or more sound tracks and the one or more filtered sound signals to the one or more loudspeakers to generate a three-dimensional sound field (playback of perceptual based 3D sound, para. 181, for output from speakers 6 based on the filtered sound signals/sound tracks and based on the one or more sound signals, where the manipulated sound output in para. 101 is based on the selected mesh filters, as the selected mesh filters are used to create the virtual game objects as per para. 167, where the virtual game objects include both audio and video via the audio manager, which positionally places the audio, para. 181, and where the signals must drive loudspeakers to produce 3D sound).

However, Delamont does not specify that the separating step comprises: to separate, using a machine learning model, the sound stream into the one or more sound tracks, wherein each of the one or more sound tracks comprises a respective one of the one or more sound sources and a respective corresponding sound type.

Wilcox teaches that a 3D audio processing device can use machine learning, including:
separate/classify, using a machine learning model, the sound stream into the one or more sound signals (the audio signals separated by the model based on the detected type, as described at the bottom of para. 98); and determine, using the machine learning model, a sound type corresponding to each of the one or more sound signals (para. 98, the detected type of sound),
wherein the sound type is one of a voice sound (voice as per para. 98), a vocal sound, an instrument sound, a car sound, a helicopter sound, an airplane sound, a vehicle sound, a gunshot sound, or an environmental noise,
where this allows the sound to be recognized (para. 98).

It would have been obvious to one skilled in the art to implement the above cited processing for the advantage of being able to recognize an audio sound, where the classification taught above and the decoding via decoder 16 together comprise the claimed separating step.


As per claim 18, a cloud sound generation system, comprising:
one or more processing devices to: 
receive a sound stream that is composed of the one or more sound sources, wherein each of the one or more sound sources corresponds to a sound type; separate, using a machine learning model, the sound stream into the one or more sound tracks, wherein each of the one or more sound tracks comprises a respective one of the one or more sound sources and a respective corresponding sound type (as per the claim 1 rejection);




receive a specification/pre-configuration of a three-dimensional space (the system/method per the claim 1 rejection can implement cloud computing, para. 126: the system may store the global mesh data, wireframes, 3D models on the media servers or on a cloud storage cluster for example);

present, in a user interface, representations representing one or more listeners and the one or more sound sources corresponding to the one or more sound tracks in the three-dimensional space (the representation in the display of the device of fig. 1a represents the dynamic position of the listener/user of the device and the relative position of any objects/audio tracks represented in the sound field, as the user or virtual game objects move they are rendered as per the claim 1 rejection, which is dynamic movement via the listener position based objects displayed on the device of fig. 1a.); 


receive one or more sound tracks each comprising a corresponding sound signal associated with a corresponding sound source (the mesh data and 3d models which include the associated audio tracks/signals); 
responsive to receiving a configuration comprising locations of one or more listeners and locations of the one or more sound sources in the three-dimensional space, determine a plurality of filters based on the configuration and pre-determined locations of one or more loudspeakers (the processing of the claim 1 rejection using the cloud processing as per para. 126: Similarly, the captured images and video from the user's camera(s) 7L, 7R may be stored remotely or locally for processing in the generating of mesh data, 3D models and wireframes that maybe generated by the observer component 25 locally on the user's augmented reality (“AR”) display apparatus 1 or remotely by the game server 88 or host 89 global observer components);
apply the plurality of filters to the one or more sound signals to generate filtered sound signals (the processing of the claim 1 rejection using the cloud processing as per para. 126: Similarly, the captured images and video from the user's camera(s) 7L, 7R may be stored remotely or locally for processing in the generating of mesh data, 3D models and wireframes that maybe generated by the observer component 25 locally on the user's augmented reality (“AR”) display apparatus 1 or remotely by the game server 88 or host 89 global observer components); and

 provide the one or more filtered/processed sound tracks to one or more sound generation devices to generate a three-dimensional sound field in a virtual or real three-dimensional space (since the mesh data is processed by the game server/cloud it must be provided to/received by the headset of fig. 1a in order to render the filtered sound track audio out of the speakers to the listener).

As per claims 2,14,19, the sound generation system of claim 1, wherein the sound sources include at least one of a mono or stereo sound stream (para. 218: this process binaural recordings maybe converted to stereo recordings before playback via the users augmented reality ("AR") display apparatus 1, speakers 6L, 6R, where the binaural recordings are a stereo sound stream), and
 wherein the plurality of filters include at least one of a head related transfer function (HRTF) filter, an all-pass filter, a multiple-input multiple-output filter, or an equalizer filter (HRTFs are used to produce the positional audio, as such the mesh filters of the claim 1 rejection comprise HRTF filters/HRTFs, para. 187; additionally, any filtering by definition comprises an equalizer filter, since it changes the frequency or phase content of the signal);
each of the one or more sound tracks comprises one or more sound channels (the signals required to drive loudspeakers 6 in fig. 1A), the sound generation system comprises a screen/display and a matrix of loudspeakers (the AR headset of Delamont), and the matrix of loudspeakers are aligned with the area of the screen/display (they are located respective to the display), and wherein the matrix of loudspeakers are arranged behind the screen/display (as shown in fig. 1a) or are an array of flat, transparent, and screen top loudspeakers, wherein the processing device is further to: obtain a video stream that is associated with the audio stream; determine, based on content of the video stream, locations of the one or more sound sources on the screen/display that presents the content of the video stream (para. 1193, the surfaces and objects identified via the video frames, which are then used to create a new game object, where the distances/depth/locations that are visually represented must be determined, per para. 1196);
determine, from the matrix of loudspeakers, the one or more loudspeakers that correspond to the locations of the one or more sound sources on the screen/display (the determination of the loudspeaker driving signals is the processing to create the sound channels from the sound tracks); and
provide the one or more sound tracks to the one or more loudspeakers to generate the three-dimensional sound field (provided as per the claim 1 rejection).


As per claim 7, the sound generation system of claim 1, wherein to provide the one or more sound signals and the one or more filtered sound signals to the one or more loudspeakers to generate a three-dimensional sound field, the processing device is to provide the one or more sound tracks and the one or more filtered sound signals to one or more amplification circuits that each drives a corresponding one of the one or more loudspeakers located in the three-dimensional space (the cited loudspeakers in the claim 1 rejection each require respective amplifiers for the purpose of physically moving the membrane to produce sound).

As per claims 8,17, the sound generation system of claim 1, wherein the configuration, via the user interface, of at least one of the locations of the one or more listeners or the locations of the one or more sound sources in the three-dimensional space allows the processing device to:
 dynamically move at least one of the representations/virtual game objects (moved as per the claim 1 rejection) to change at least one location of the representations of the one or more listeners or the one or more sound sources associated (as the user or virtual game objects move they are rendered as per the claim 1 rejection, which is dynamic movement), and
 the processing device is further to: responsive to dynamically moving at least one of the representations, determine a plurality of updated filters (the mesh filters including HRTF filters of the claim 1 rejection are updated based on movement) based on the dynamically-changing configuration;
apply the plurality of updated filters to the one or more sound tracks to generate updated sound signals (part of the rendering and re-rendering of the claim 1 rejection); and
provide the one or more sound tracks and the one or more updated sound tracks to the one or more loudspeakers to generate an updated three-dimensional sound field (the rendering as per the claim 1 rejection output to the speakers as per the claim 1 rejection).

As per claims 9,20, the sound generation system of claim 1, wherein the sound generation system is installed in the three-dimensional space of a virtual conference room, virtual concert system, a game system (the gaming system, para. 3), an in-vehicle sound system, or a theater sound system,

wherein each of the corresponding sound sources comprises a conference, concert, or game participant (the listener of the gaming system is a gaming participant), and 
wherein the processor is to:
 present the corresponding sound sources in the three-dimensional space representing a virtual conference room, a virtual concert, a game system (the gaming system), an in-vehicle sound system, or a theater sound system based on at least one of locations of the sound sources (the listener and virtual game objects), 
a movement of at least one of the one or more listeners or the one or more sound sources, the user configuration, or pre-programmed software (as per the claim 8 rejection).

As per claim 11, the sound generation system/speakers are implemented on a helmet as per fig. 1a.

As per claim 12, the sound generation system of claim 1, wherein to apply the plurality of filters to the one or more sound signals to generate filtered sound signals for driving the one or more loudspeakers, the processing device is to:
provide the one or more sound signals to a cloud computing system (para. 126: the system may store the global mesh data, wireframes, 3D models on the media servers or on a cloud storage cluster for example),
wherein the cloud computing system is to apply each of the plurality of filters to a corresponding one of the one or more sound signals to generate one or more filtered sound signals (para. 126: Similarly, the captured images and video from the user's camera(s) 7L, 7R may be stored remotely or locally for processing in the generating of mesh data, 3D models and wireframes that maybe generated by the observer component 25 locally on the user's augmented reality (“AR”) display apparatus 1 or remotely by the game server 88 or host 89 global observer components); and
receive, from the cloud computing system, the one or more filtered sound signals for driving the one or more loudspeakers (since the mesh data is processed by the game server/cloud it must be received by the headset of fig. 1a in order to render the audio out of the speakers to the listener).
where the separating is performed via cloud computing and local computing as described above;
where the mesh filters of the claim 1 rejection apply signal processing to the sound tracks which are then used to drive loudspeakers as per the claim 1 rejection.








As per claims 3,15, the one or more sound tracks are each previously separated from other sound tracks (para. 189, in which the incoming audio signals and waves from two or more sources may be computed, in which the ITD…, where the incoming/received audio signals/sound stream composed of one or more sound sources in one or more sound tracks are previously separated because they are originally two separate sources),
wherein the sound type is one of a voice sound, a vocal sound, an instrument sound, a car sound, a helicopter sound, an airplane sound, a vehicle sound, a gunshot sound, or an environmental noise.
Wilcox teaches that a 3D audio processing device can use machine learning, including:
separate, using a machine learning model, the sound stream into the one or more sound signals (the audio signals separated by the model based on the detected type, as described at the bottom of para. 98); and determine, using the machine learning model, a sound type corresponding to each of the one or more sound signals (para. 98, the detected type of sound),
wherein the sound type is one of a voice sound (voice as per para 98), a vocal sound, an instrument sound, a car sound, a helicopter sound, an airplane sound, a vehicle sound, a gunshot sound, or an environmental noise.
Where this allows the sound to be recognized (para. 98).

It would have been obvious to one skilled in the art to implement the above cited processing for the advantage of being able to recognize an audio sound.



As per claim 4, the sound generation system of claims 1,13, further comprising: 
a microphone array comprising a plurality of microphones (para. 107, two microphones), wherein the processing device is to: 
implement a plurality of microphones, wherein each of the plurality of microphones is configured to capture sound from a corresponding sound source at a corresponding direction with the microphone array (para. 158 to provide voice command inputs during the game or in the opening of a video or audio communication between two or more users in which the audio and/or video can be heard or seen from their augmented reality ("AR") display apparatus 1);
where the detected audio communication is generated as an output based on the captured sound.
However, Delamont does not disclose that the microphones capture specific sounds via respective beamformers used to generate the output/one or more sound tracks.
The examiner takes official notice that it is well known in the art to use beamforming to effectively recover detected sound sources using a microphone array, where detection of each source uses a different beam of a plurality of beamformers.
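For illustration only, the beamforming concept referred to above can be sketched as a delay-and-sum beamformer, in which one beam is formed per source by time-aligning the microphone signals for that source and averaging them. This is a minimal sketch and not a characterization of Delamont, Wilcox, or the claimed system; the function names, the two-microphone geometry, and the whole-sample delays are all assumptions chosen for the example.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Form one beam from a microphone array (hypothetical helper).

    channels: one list of samples per microphone, all of equal length.
    delays:   assumed arrival delay of the target source at each
              microphone, in whole samples (taken as known here).
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per microphone")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # advance the channel by its arrival delay
            if 0 <= src < n:
                out[t] += samples[src]
    # Averaging keeps the aligned source at its original amplitude.
    return [v / len(channels) for v in out]


def delayed(x: Sequence[float], d: int) -> List[float]:
    """Delay a signal by d samples, zero-padding the start."""
    return [0.0] * d + list(x[: len(x) - d])


# Assumed scene: source A reaches mic 1 one sample late,
# source B reaches mic 0 one sample late.
a = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0]
mic0 = [x + y for x, y in zip(a, delayed(b, 1))]
mic1 = [x + y for x, y in zip(delayed(a, 1), b)]

beam_a = delay_and_sum([mic0, mic1], [0, 1])  # beam steered at source A
beam_b = delay_and_sum([mic0, mic1], [1, 0])  # beam steered at source B
```

In this sketch each beam recovers its target source at full amplitude, while the other source is misaligned and therefore attenuated; with only two microphones the attenuation is modest, and larger arrays would be expected to suppress the off-beam source further.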

As per claim 5, the microphone array is mounted on an AV headset (fig. 1a of Delamont).

As per claims 6,16, the sound generation system of claim 1, wherein to present, in a user interface of the user interface device, representations representing one or more listeners and the sound sources corresponding to the one or more sound tracks in the three-dimensional space, the processing device, for each sound track, is to: 
present, in the user interface, icons representing the one or more listeners and icons representing isolated sound tracks in the three-dimensional space at positions according to the configuration, wherein each of the icons is at least one of a symbol representation, a graphic representation, an image of a corresponding source, a video of the corresponding sound source, or an animation.
(the AR device can show the user an image/icon of an object, para. 2, with associated audio, para. 101; as such an image of the source represents the sound signals, where the image is presented according to the user position configuration as per the claim 1 rejection)

As per claim 10, the sound generation system of claim 1, further comprising at least one of: a microphone array comprising a plurality of microphones for capturing sound from the plurality of sound sources of different directions (per the beamforming in the claim 4 rejection);

an acoustic echo cancellation unit for removing echoes in the one or more sound tracks; a noise reduction unit for reducing a noise component in the one or more sound tracks; a set of sound equalizer units for processing each one of the one or more sound tracks; a reference sound capture circuit positioned at proximity to the one or more loudspeakers for capturing a reference signal, wherein the acoustic echo cancellation unit is to remove the echoes based on the captured reference signal; or a speech recognition unit to recognize voice commands (these are recited in the alternative and are not mapped).

Response to Arguments

The submitted arguments have been considered but are moot in view of the new grounds of rejection.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

	

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER KRZYSTAN whose telephone number is 571-272-7498, and whose email address is alexander.krzystan@uspto.gov.

The examiner can usually be reached M-F, 7:30-4:00 EST.
If attempts to reach the examiner by telephone or email are unsuccessful, the examiner’s supervisor, Fan Tsang, can be reached at (571) 272-7547.

The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications.
/ALEXANDER KRZYSTAN/Primary Examiner, Art Unit 2653                                                                                                                                                                                                        
Examiner Alexander Krzystan
July 21, 2022