DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of U.S. Patent Application No. 62/737,913, filed on September 27, 2018, is acknowledged.

Drawings
The drawings were received on 09/27/2019.  These drawings are acceptable.

Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 03/09/2020 have been considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-12 and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Arik et al. (US Pub. No. 2019/0251952, hereinafter ‘Ari’) in view of Zhang et al. (US Pub. No. 2019/0005090, hereinafter ‘Zhang’).

Regarding independent claim 1 limitations, Ari teaches: a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a neural network system configured to (Ari teaches in 0167-: In embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems/ computing systems. A computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system [i.e. one or more computers] may be or may include a personal computer (e.g., laptop), tablet computer, phablet, personal digital assistant (PDA), …. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of memory [i.e. one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement]…; AND 0238-0241: Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media [i.e. one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement] with instructions for one or more processors or processing units to cause steps to be performed... 
It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations… One skilled in the art will recognize no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into submodules or combined together. It will be appreciated to those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure…)
receive a neural network input and to generate a neural network output, (Depicted in Fig. 13 and Fig. 14, in 0116: FIG. 14 graphically depicts a model architecture, according to embodiments of the present disclosure. In one or more embodiments, mel-scaled spectrograms 1415 [i.e. a neural network input], 1420 [i.e. a neural network input] of enrollment audio 1405 and test audio 1410 are computed after resampling the input to a constant sampling frequency. Then, a two-dimensional convolutional layers 1425 [i.e. receive a neural network input] convolving over both time and frequency bands are applied, with batch normalization 1430, … Mean-pool 1445 is performed over time (and enrollment audios if there are many), then a fully connected layer 1450 [i.e. to generate a neural network output] is applied to obtain the speaker encodings for both enrollment audios and test audio. A probabilistic linear discriminant analysis (PLDA) 1455 may be used for scoring the similarity between the two encodings…; And in 0107-: FIG. 13 depicts a more detail view of a speaker encoder architecture [i.e. receive a neural network input and to generate a neural network output] with intermediate state dimensions (batch: batch size, N_samples: number of cloning audio samples, T: number of mel spectrograms timeframes, F_mel: number of mel frequency channels, F_mapped: number of frequency channels after prenet, d_embedding: speaker embedding dimension), according to embodiments of the present disclosure.)

the neural network system comprising: an area attention layer, wherein the area attention layer is configured to, during the processing of the neural network input: (As depicted in Fig. 13 and Fig. 14, and in 0060: … Different voice cloning embodiments with end-to-end neural speech synthesis approaches, which apply sequence-to-sequence modeling with attention mechanism [i.e. the neural network system comprising: an area attention layer, wherein the area attention layer is configured to, during the processing of the neural network input], are presented herein. In neural speech synthesis, an encoder converts text to hidden representations, and a decoder estimates the time-frequency representation of speech in an autoregressive way….)

receive data specifying a memory comprising a plurality of items and, for each item, a respective key and a respective value; (As depicted in Fig. 13, and in 0201-0208: … In one or more embodiments, these embeddings h_e are first projected via a fully-connected layer from the embedding dimension to a target dimensionality. Then, in one or more embodiments, they are processed through a series of convolution blocks (such as the embodiments described in Section F.1.c) to extract time-dependent text information [i.e. receive data specifying a memory comprising a plurality of items and, for each item, a respective key and a respective value]. Lastly, in one or more embodiments, they are projected back to the embedding dimension to create the attention key vectors h_k… FIG. 29 graphically depicts an embodiment of an attention block [i.e. … a memory comprising a plurality of items and, for each item, a respective key and a respective value], according to embodiments of the present disclosure. As shown in FIG. 29, in one or more embodiments, positional encodings 2905, 2910 may be added to both keys 2920 and query 2938 vectors, with rates of wkey 2905 and wquery 2910, respectively… In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 from the encoder to compute attention weights, and then outputs a context vector 2915 computed as the weighted average of the value vectors 2921.)

determine a plurality of areas within the memory, wherein (i) each area includes one or more items in the memory, and (ii) one or more of the areas include multiple adjacent items in the memory; determine, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area; receive an attention query; (As depicted in Fig. 13, and in 0106-0107: Cloning sample attention: Considering that different cloning audios contain different amount of speaker information, in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different audios and get aggregated embeddings. FIG. 13 depicts a more detail view of a speaker encoder architecture with intermediate state dimensions 
… number of cloning audio samples, T: number of mel spectrograms timeframes, …

[Figure: media_image3.png]

Examiner notes that the datum for each batch of samples is associated with the claimed area defined by the same key, where each respective area has the respective key and values depicted in Fig. 13 for each datum, and that the depicted queries correspond to the claimed receiving of an attention query.)

apply an attention mechanism between the attention query and the area keys for each area to generate a respective attention weight for each area; (As depicted in Fig. 13; Examiner notes that the multi-head attention layer includes the claimed applied attention mechanism; And in 0106-0107: Cloning sample attention: Considering that different cloning audios contain different amount of speaker information, in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different [i.e. apply an attention mechanism between the attention query and the area keys for each area to generate a respective attention weight for each area] audios and get aggregated embeddings.)

and generate an area attention layer output by combining the area values for each area in accordance with the attention weights. (As depicted in Fig. 13, the FC layer combines the area values in accordance with the claimed weights; in 0106: Cloning sample attention: Considering that different cloning audios contain different amount of speaker information, in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different audios and get aggregated embeddings [i.e. and generate an area attention layer output by combining the area values for each area in accordance with the attention weights]; And in 0208: In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 from the encoder to compute attention weights, and then outputs a context vector [i.e. and generate an area attention layer output …] 2915 computed as the weighted average [i.e. and generate an area attention layer output by combining the area values for each area in accordance with the attention weights] of the value vectors 2921.)
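
For illustration only, the dot-product attention quoted from Ari 0208 can be sketched in Python as follows. The function name and the softmax normalization of the scores are assumptions made for this sketch; the quoted passage states only that attention weights are computed from the query and key vectors and that the context vector is a weighted average of the value vectors.

```python
import math


def dot_product_attention(query, keys, values):
    """Return (weights, context) for one query over per-timestep keys and values.

    Hypothetical helper: softmax normalization is an assumption of this sketch.
    """
    # Score each key by its dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over the scores, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context
```

Because the weights sum to one, two identical keys receive equal weight and the context vector is the plain average of their two value vectors.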

While Ari teaches the use of keys and values to help identify memory allocations for associated memory items for processing the data sets as disclosed above, Ari does not expressly disclose the key and value elements as memory access parameters as would be understood by a person having ordinary skill in the art.
Zhang does expressly teach the key and value elements as memory access parameters. (Zhang teaches in 0037-0042: … For example, a device that includes a microphone and a wireless network adapter may record speech by a user and transmit the recording over a network to a server. The server may use a speech-to-text translator to generate the input text query 110… The input text query 110 may be converted to a matrix. In some example embodiments, each word of the input text query [i.e. receive an attention query] 110 is converted to a vector of predetermined length (e.g., 100 dimensions or 300 dimensions) and the resulting vectors are arranged to form a matrix (e.g., a matrix with predetermined height and a width equal to the number of words in the input text query 110)… The multilayer LSTM controller 130 [i.e. determine a plurality of areas within the memory, wherein (i) each area includes one or more items in the memory…; Examiner notes the controller as including claimed plurality of areas in memory identified using a key/index], in a first layer, provides the vector to the static memory 140 and the dynamic memory 150… The fact represented by the triplet may be stored in the database of the static memory 140 as a key-value pair [i.e. determine a plurality of areas within the memory, wherein (i) each area includes one or more items in the memory, and (ii) one or more of the areas include multiple adjacent items in the memory; determine, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area…]…; And alternatively in 0045-0049: The dynamic memory 150 at discrete time t may be represented by a matrix M_t of size N x d, where N is the total number of represented facts and d is the predetermined size for representing a word vector… For content-based memory addressing [i.e. determine a plurality of areas within the memory, wherein (i) each area includes one or more items in the memory, and (ii) one or more of the areas include multiple adjacent items in the memory; determine, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area….], a cosine similarity between the controller output u and each dynamic memory slot vector M_t(i) [i.e. … determine, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area…] and P_t(j) [i.e. … determine, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area…], i = 1, ... N; j = 1, ... L is employed, where L is the number of entries in the dynamic memory. The cosine similarity for each dynamic memory slot vector M_t(i) may be compared to a predetermined threshold, E… The multilayer LSTM controller 130 has multiple layers and thus multiple outputs. In some example embodiments, the initial input to the multilayer LSTM controller 130, u0, is the input query vector q [i.e. receive an attention query]. The equations below may be used to determine the reading weight for each relevant fact in the static memory 140 and the dynamic memory 150…)
The Ari and Zhang references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose of developing information processing systems/methods using neural network learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of natural language processing using a neural network with indexed memory storage and retrieval, as disclosed by Zhang, with the method of information processing using neural network models as disclosed by Ari.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Zhang and Ari in order to utilize static and dynamic memory for training machine learning systems; doing so reduces the power consumption and training time associated with the training process (Zhang, 0088).
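
For illustration only, the content-based memory addressing cited from Zhang (a cosine similarity between a query or controller output and each stored key or memory slot, compared against a predetermined threshold) can be sketched in Python as follows. The function names, the list-of-pairs memory layout, the default threshold of 0.2, and the handling of zero vectors are assumptions of this sketch and are not taken from Zhang.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as not similar
    return dot / (norm_a * norm_b)


def relevant_entries(query, memory, threshold=0.2):
    """Keep the (key, value) pairs whose key exceeds the similarity threshold.

    `memory` is a hypothetical list of (key_vector, value) pairs.
    """
    return [(key, value) for key, value in memory
            if cosine_similarity(query, key) > threshold]
```

Under this sketch an entry whose key is orthogonal or opposite to the query is dropped, while an entry whose key points in the same direction as the query is kept.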

Regarding claim 2, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein the area attention layer is further configured to: provide the area attention layer output to another component of the neural network system. (Ari teaches, as depicted in Fig. 13, the output of the multi-head attention provided to the FC component of the depicted neural network system:

[Figure: Ari, Fig. 13 (media_image4.png)]

And in 0106: Cloning sample attention: Considering that different cloning audios contain different amount of speaker information, in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different audios and get aggregated embeddings [i.e. wherein the area attention layer is further configured to: provide the area attention layer output to another component of the neural network system]; And in 0208: In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 from the encoder to compute attention weights, and then outputs a context vector [i.e. wherein the area attention layer is further configured to: provide the area attention layer output to another component of the neural network system …] 2915 computed as the weighted average of the value vectors 2921.)

	
Regarding claim 3, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein the memory is arranged as a sequence of items, and wherein determining a plurality of areas comprises: identifying, as a different area… (Ari teaches in 0208: In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 [i.e. wherein the memory is arranged as a sequence of items, and wherein determining a plurality of areas comprises: identifying, as a different area; Examiner notes claimed sequence as time steps associated with the key vector(s)] from the encoder to compute attention weights, and then outputs a context vector 2915 computed as the weighted average of the value vectors 2921.)
While Ari teaches the use of time sequenced keys for arranging items associated with key index; Ari does not expressly teach the use of a number of items as recited in the limitation: identifying, as a different area, each combination of adjacent items that includes no more than a maximum number of items.
Zhang does expressly teach the use of a number of items as recited in the limitation: identifying, as a different area, each combination of adjacent items that includes no more than a maximum number of items. (in 0044-0045: Two vectors will have a cosine similarity of 1 when they are identical, -1 when they are opposite, and 0 when they are orthogonal. In some example embodiments, a fact entry is compared to a query using cosine similarity and determined to be relevant if the cosine similarity exceeds a predetermined threshold (e.g., 0.1, 0.2, or 0.3). Thus, the cosine similarity between the query vector and each key value in the database of the static memory 140 may be determined and compared to a predetermined threshold to identify a set of relevant entries [i.e. identifying, as a different area, each combination of adjacent items that includes no more than a maximum number of items.]. The dynamic memory 150 at discrete time t may be represented by a matrix M_t of size N x d, where N [i.e. each combination of adjacent items that includes no more than a maximum number of items.] is the total number of represented facts and d is the predetermined size for representing a word vector.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Zhang and Ari for the same reasons disclosed above.
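
For illustration only, the claim 3 limitation (identifying, as a different area, each combination of adjacent items that includes no more than a maximum number of items) can be sketched in Python as follows. The function name and the half-open (start, end) index representation are assumptions of this sketch and are not taken from the application or from the cited references.

```python
def enumerate_areas_1d(num_items, max_size):
    """List every run of adjacent items holding 1..max_size items.

    Each area is a hypothetical (start, end) index pair with `end` exclusive.
    """
    areas = []
    for size in range(1, max_size + 1):
        # Slide a window of `size` adjacent items across the sequence.
        for start in range(num_items - size + 1):
            areas.append((start, start + size))
    return areas
```

For a sequence of four items and a maximum of two items per area, this yields four single-item areas and three two-item areas.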

Regarding claim 4, the rejection of claim 1 is incorporated. While Ari teaches the use of keys to index data items for processing information in a neural network system as disclosed above, Ari does not expressly teach the memory index as a two-dimensional index as recited in the claim 4 limitation. Zhang does expressly teach the claim 4 limitation: wherein the memory is arranged as a two-dimensional grid of items, and wherein determining a plurality of areas comprises: identifying, as a different area, each rectangular region of items within the two-dimensional grid that has no more than a maximum height and no more than a maximum width. (Zhang teaches in 0045-0049: The dynamic memory 150 at discrete time t may be represented by a matrix M_t of size N x d, where N is the total number of represented facts and d is the predetermined size for representing a word vector… For content-based memory addressing [i.e. wherein the memory is arranged as a two-dimensional grid of items, and wherein determining a plurality of areas comprises: identifying, as a different area, each rectangular region of items within the two-dimensional grid that has no more than a maximum height and no more than a maximum width], a cosine similarity between the controller output u and each dynamic memory slot vector M_t(i) and P_t(j), i = 1, ... N; j = 1, ... L is employed, where L is the number of entries in the dynamic memory. The cosine similarity for each dynamic memory slot vector M_t(i) may be compared to a predetermined threshold, E.)
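
For illustration only, the claim 4 limitation (identifying, as a different area, each rectangular region of a two-dimensional grid that has no more than a maximum height and no more than a maximum width) can be sketched in Python as follows. The function name and the (top, left, height, width) tuple representation are assumptions of this sketch.

```python
def enumerate_areas_2d(grid_height, grid_width, max_height, max_width):
    """List every rectangle of grid items within the given size limits.

    Each area is a hypothetical (top, left, height, width) tuple.
    """
    areas = []
    for h in range(1, min(max_height, grid_height) + 1):
        for w in range(1, min(max_width, grid_width) + 1):
            # Slide an h-by-w rectangle over every position where it fits.
            for top in range(grid_height - h + 1):
                for left in range(grid_width - w + 1):
                    areas.append((top, left, h, w))
    return areas
```

For a 2 x 2 grid with a maximum height and width of 2, this yields nine areas: four single items, two horizontal pairs, two vertical pairs, and the full grid.
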
Regarding claim 5, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining the area value to be a sum of the values of the items in the area. (Ari teaches in 0208: In one or more embodiments, a dot-product attention mechanism [i.e. wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining the area value to be a sum of the values of the items in the area.; Examiner notes that a dot product includes a sum of products of the items in the area to which the dot product is applied] (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 from the encoder to compute attention weights, and then outputs a context vector 2915 computed as the weighted average [i.e. wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining the area value to be a sum of the values of the items in the area.; Examiner notes that computing an average also includes computing a sum over the area associated with the weighted value] of the value vectors 2921.)

Regarding claim 6, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining the area key to be a mean of the keys of the items in the area. (Ari teaches in 0208: In one or more embodiments, a dot-product attention mechanism (depicted in FIG. 29) is used. In one or more embodiments, the attention mechanism uses a query vector 2938 (the hidden states of the decoder) and the per-timestep key vectors 2920 from the encoder to compute attention weights, and then outputs a context vector 2915 computed as the weighted average [i.e. wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining the area key to be a mean of the keys of the items in the area; Examiner notes the computed average as the claimed area value associated with the weighted area] of the value vectors 2921.)
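
For illustration only, the claim 5 and claim 6 limitations (area value as a sum of the item values, area key as a mean of the item keys) can be sketched in Python for one area of adjacent items as follows. The function name and the half-open index range are assumptions of this sketch.

```python
def area_key_and_value(keys, values, start, end):
    """Return (area_key, area_value) for items start..end-1 of a sequence.

    area_key   : element-wise mean of the item keys (claim 6 wording).
    area_value : element-wise sum of the item values (claim 5 wording).
    """
    count = end - start
    key_dim = len(keys[0])
    value_dim = len(values[0])
    area_key = [sum(keys[i][d] for i in range(start, end)) / count
                for d in range(key_dim)]
    area_value = [sum(values[i][d] for i in range(start, end))
                  for d in range(value_dim)]
    return area_key, area_value
```

When the area holds a single item, the mean and the sum each reduce to that item's own key and value.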

Regarding claim 7, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining a plurality of features of the items in the area; and combining the features to generate the area key of the area. (As depicted in Fig. 13 the noted area as claimed: 

[Figure: Ari, Fig. 13 (media_image5.png)]


And in 0116: Then, a two-dimensional convolutional layers 1425 convolving over both time and frequency bands are applied, with batch normalization [i.e. wherein determining, for each of the plurality of areas, a respective area key and a respective area value from at least the keys and values of the items in the area comprises: determining a plurality of features of the items in the area;] 1430 and rectified linear unit (ReLU) non-linearity 1435 after each convolution layer. The output of last convolution block 1438 is feed into a recurrent layer (e.g., gated recurrent unit (GRU)) 1440. Mean-pool [i.e. and combining the features to generate the area key of the area] 1445 is performed over time (and enrollment audios if there are many), then a fully connected layer 1450 is applied to obtain the speaker encodings for both enrollment audios and test audio.)


Regarding claim 8, the rejection of claim 7 is incorporated and Ari in combination with Zhang teaches the system of claim 7, wherein combining the features comprises: summing or concatenating the features to generate a combined feature; and applying one or more learned non-linear transformations to the combined feature to generate the area key. (Ari teaches as depicted in Fig. 13, and in 0116: Then, a two-dimensional convolutional layers 1425 convolving over both time and frequency bands are applied, with batch normalization 1430 and rectified linear unit (ReLU) [i.e. applying one or more learned non-linear transformations to the combined feature to generate the area key] non-linearity 1435 after each convolution layer. The output of last convolution block 1438 is feed into a recurrent layer (e.g., gated recurrent unit (GRU)) 1440. Mean-pool [i.e. wherein combining the features comprises: summing or concatenating the features to generate a combined feature] 1445 is performed over time (and enrollment audios if there are many), then a fully connected layer 1450 is applied to obtain the speaker encodings for both enrollment audios and test audio.)

Regarding claim 9, the rejection of claim 7 is incorporated and Ari in combination with Zhang teaches the system of claim 7, wherein the features comprise one or more of: an embedding corresponding to a number of items in the area; a mean of the keys of the items in the area; or a variance of the keys of the items in the area. (Ari teaches as depicted in Fig. 13, and in 0106-0107: Cloning sample attention: Considering that different cloning audios contain different amount of speaker information, in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different audios and get aggregated embeddings. FIG. 13 depicts a more detail view of a speaker encoder architecture with intermediate state dimensions … [i.e. wherein the features comprise one or more of: an embedding corresponding to a number of items in the area] number of cloning audio samples, T: number of mel spectrograms timeframes, …; And in 0116: Then, a two-dimensional convolutional layers 1425 convolving over both time and frequency bands are applied, with batch normalization [i.e. wherein the features comprise one or more of: an embedding corresponding to a number of items in the area; a mean of the keys of the items in the area; Examiner notes that determining a normalization involves computing the number of items associated with a key batch to compute the normalization …] 1430 and rectified linear unit (ReLU) non-linearity 1435 after each convolution layer. The output of last convolution block 1438 is feed into a recurrent layer (e.g., gated recurrent unit (GRU)) 1440. Mean-pool [i.e. wherein the features comprise one or more of: … a mean of the keys of the items in the area …] 1445 is performed over time (and enrollment audios if there are many), then a fully connected layer 1450 is applied to obtain the speaker encodings for both enrollment audios and test audio.)
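
For illustration only, the limitations of claims 7 through 9 (determining features of the items in an area, summing them into a combined feature, and applying a learned non-linear transformation to generate the area key) can be sketched in Python as follows. The function name, the size-embedding lookup table, the use of the population variance, and the single tanh layer with fixed weights standing in for learned parameters are all assumptions of this sketch and are not taken from the application or from the cited references.

```python
import math


def area_key_from_features(item_keys, size_embeddings, weight, bias):
    """Build an area key from the mean, variance, and size embedding of an area.

    size_embeddings : hypothetical dict mapping an item count to a vector.
    weight, bias    : hypothetical parameters of one tanh layer; in a trained
                      system these would be learned.
    """
    count = len(item_keys)
    dim = len(item_keys[0])
    # Features of the area: per-dimension mean and population variance of keys.
    mean = [sum(k[d] for k in item_keys) / count for d in range(dim)]
    variance = [sum((k[d] - mean[d]) ** 2 for k in item_keys) / count
                for d in range(dim)]
    size_embedding = size_embeddings[count]
    # Combine the features by summation.
    combined = [m + v + e for m, v, e in zip(mean, variance, size_embedding)]
    # Apply one non-linear transformation to the combined feature.
    return [math.tanh(sum(w * c for w, c in zip(row, combined)) + b)
            for row, b in zip(weight, bias)]
```

Concatenating the features instead of summing them would only change how `combined` is formed and the expected width of each `weight` row.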

Regarding claim 10, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein the area key and area value for each area that includes only a single item are the key and value for the single item. (As depicted in Fig. 26, in 0180: FIG. 26 graphically depicts an example Deep Voice 3 architecture 2600, according to embodiments of the present disclosure. In embodiment, a Deep Voice 3 architecture 2600 uses residual convolutional layers in an encoder 2605 to encode text into per-timestep key and value vectors 2620 [i.e. wherein the area key and area value for each area that includes only a single item are the key and value for the single item.] for an attention-based decoder 2630….; Examiner notes as depicted in Fig. 26 the key-value is associated with a single value query:

[Figure: Ari, Fig. 26 (media_image6.png)]

)

Regarding claim 11, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein each item corresponds to a portion of the neural network input, and wherein the value for each memory item is an encoded representation of the neural network input. (As depicted in Fig. 13, and in 0201-0208: … In one or more embodiments, these embeddings h_e are first projected via a fully-connected layer from the embedding dimension to a target dimensionality. Then, in one or more embodiments, they are processed through a series of convolution [i.e. wherein each item corresponds to a portion of the neural network input, and wherein the value for each memory item is an encoded representation of the neural network input] blocks (such as the embodiments described in Section F.1.c) to extract time-dependent text information. Lastly, in one or more embodiments, they are projected back to the embedding dimension to create the attention key vectors h_k… FIG. 29 graphically depicts an embodiment of an attention block [i.e. wherein each item corresponds to a portion of the neural network input, and wherein the value for each memory item is an encoded representation of the neural network input], according to embodiments of the present disclosure. As shown in FIG. 29, in one or more embodiments, positional encodings 2905, 2910 may be added to both keys 2920 and query 2938 vectors, with rates of wkey 2905 and wquery 2910, respectively…; Examiner notes the encoded representation of the neural network input as depicted in the Fig. 13 sample of attention blocks associated with each item; And in 0094-0106: A set of speaker cloning audios 950 and corresponding speaker embeddings obtained from the trained multi-speaker generative model 935 may be used to train (805B/905) a speaker encoder model 928….. (iii) Cloning sample attention: Considering that different cloning audios contain different amount of speaker information [i.e. wherein each item corresponds to a portion of the neural network input, and wherein the value for each memory item is an encoded representation of the neural network input; Examiner notes the input as the speaker audio data contained as the cloning audios having different amounts of the speaker information as claimed corresponding items], in one or more embodiments, a multi-head self-attention mechanism 1230 may be used to compute the weights for different audios and get aggregated embeddings. FIG. 13 depicts a more detail view of a speaker encoder architecture with intermediate state dimensions
… )

Regarding claim 12, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein (i) the values and keys for the items in the memory and (ii) the attention query are provided as input to the area attention layer by respective other components of the neural network system. (Examiner notes that the queries (which include the claimed attention query) are provided by other components, as depicted in Fig. 13:

    [Image: excerpt of Fig. 13]

)
Regarding claim 14, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein the key and the value are different for each item in the memory. (As depicted in Fig. 13, and in 0180: … In embodiments, a Deep Voice 3 architecture 2600 uses residual convolutional layers in an encoder 2605 to encode text into per-timestep key and value vectors 2620 for an attention-based decoder 2630. In one or more embodiments, the decoder 2630 uses these to predict the mel-scale log magnitude spectrograms 2642 that correspond to the output audio…; And in 0201: … The key vectors h_k [i.e. wherein the key and the value are different for each item in the memory] are used by each attention block to compute attention weights, whereas the final context vector is computed as a weighted average over the value vectors h_v (see Section F.1.f).)
Zhang also teaches the system of claim 1, wherein the key and the value are different for each item in the memory. (in 0019: … for each triplet in the initial database: generating a first vector based on the head entity and the relation; generating a second vector based on the tail entity; and storing the first vector and the second vector as a key-value pair, wherein the first vector is the key [i.e. wherein the key and the value are different for each item in the memory] and the second vector is the value.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Zhang and Ari for the same reasons disclosed above.
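
For illustration, the key-value attention relied upon above (keys used to compute attention weights, and a context vector formed as a weighted average over the value vectors) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Ari or Zhang: the function name `attend`, the dot-product scoring, the softmax normalization, and the toy vectors are all hypothetical choices made only to show a memory in which each item's key differs from its value.

```python
import math

def attend(query, keys, values):
    """Return (weights, context) for one query over a key-value memory.

    Hypothetical helper: dot-product scoring and softmax are assumptions,
    not the scoring function of any cited reference.
    """
    # Score each memory item by the dot product of the query with its key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax turns scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average over the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy memory: keys are 2-dimensional, values are 3-dimensional,
# so the key and the value of each item are necessarily different.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]
weights, context = attend([2.0, 0.0], keys, values)
```

In this sketch the query aligns with the first key, so the first item receives the larger weight and dominates the context vector.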

Regarding independent claim 15 limitations, Ari teaches: One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to implement an area attention layer configured to perform operations comprising: (Ari teaches in 0167: In embodiments, aspects of the present patent document may be directed to, may include, or may be implemented on one or more information handling systems/ computing systems. A computing system may include any instrumentality or aggregate of instrumentalities operable to compute, calculate, determine, classify, process, transmit, receive, retrieve, originate, route, switch, store, display, communicate, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data. For example, a computing system [i.e. one or more computers] may be or may include a personal computer (e.g., laptop), tablet computer, phablet, personal digital assistant (PDA), …. The computing system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of memory [i.e. One or more computer storage media storing instructions]…; AND 0238-0241: Aspects of the present disclosure may be encoded upon one or more non-transitory computer-readable media [i.e. One or more computer storage media storing instructions] with instructions for one or more processors or processing units to cause steps to be performed... It shall be noted that embodiments of the present disclosure may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations… One skilled in the art will recognize no computing system or programming language is critical to the practice of the present disclosure. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into submodules or combined together. It will be appreciated to those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure…)
The remaining limitations of claim 15 are similar to those of claim 1 and are thus rejected under the same rationale.

Regarding claims 16 and 17, the claims recite limitations similar to those of claims 2 and 3, respectively. Thus, claims 16 and 17 are rejected under the same rationale as claims 2 and 3.
  
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Arik et al. (US Pub. No. 2019/0251952, hereinafter ‘Ari’) in view of Zhang et al. (US Pub. No. 2019/0005090, hereinafter ‘Zhang’) and in further view of Miller et al. (NPL: “Key-Value Memory Networks for Directly Reading Documents”, hereinafter ‘Miller’).

Regarding claim 13, the rejection of claim 1 is incorporated and Ari in combination with Zhang teaches the system of claim 1, wherein the key is … as the value for each item in the memory. (as depicted in Fig. 26.)
While Ari in combination with Zhang discloses the use of key and value pairs for memory identification, Ari and Zhang do not expressly teach that the key and value are the same. Miller does expressly teach wherein the key is the same as the value for each item in the memory. (on Pg. 4, Left Col., 2nd to last para.: To obtain the standard End-To-End Memory Network of Sukhbaatar et al. (2015) one can simply set the key and value to be the same for all memories [i.e. wherein the key is the same as the value for each item in the memory]…; And in Sec. 3.2, Right Col., 2nd para.: Sentence Level ... Both the key and the value encode the entire sentence as a bag-of-words. As the key and value are the same [i.e. wherein the key is the same as the value for each item in the memory] in this case, this is identical to a standard MemNN and this approach has been used in several papers (Weston et al., 2016; Dodge et al., 2016).)
The Miller, Ari, and Zhang references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose of developing information processing systems/methods using learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the prior art on natural language processing using automated index memory storage and retrieval to support natural language processing tasks, as disclosed by Miller, with the method of information processing using neural network models as collectively disclosed by Zhang and Ari.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Zhang and Ari in order to enable an end-to-end memory network for retrieving and processing natural language data (Miller, Pg. 4, Left Col., 2nd to last para.). Doing so helps to improve computational efficacy for information processing using key-value memories for natural language processing tasks (Miller, Pg. 4, Left Col., 2nd to last para.).
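
For illustration, the sentence-level arrangement quoted from Miller above, in which both the key and the value encode the entire sentence as a bag-of-words and are therefore the same, can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Miller: the names `bag_of_words` and `build_memory`, the whitespace tokenization, the count-based encoding, and the example sentences and vocabulary are all hypothetical.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Encode a sentence as word counts over a fixed vocabulary (hypothetical encoding)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocab]

def build_memory(sentences, vocab):
    """Build a key-value memory in which each item's key is the same as its value."""
    memory = []
    for sentence in sentences:
        vector = bag_of_words(sentence, vocab)
        # The same bag-of-words vector serves as both key and value.
        memory.append((vector, vector))
    return memory

# Illustrative data only; not drawn from any cited reference.
vocab = ["paris", "berlin", "france", "germany", "is", "in"]
sentences = ["Paris is in France", "Berlin is in Germany"]
memory = build_memory(sentences, vocab)
```

Each memory item here is a (key, value) pair whose two entries are identical, which is the condition recited in claim 13.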

Conclusion
The prior art made of record and not relied upon, listed below, is considered pertinent to applicant's disclosure:
Keskar et al. (US Pub. 2019/0130273): teaches the use of key-value pairs to capture data for information processing when making sequence-to-sequence predictions using a neural network model.
Kaiser et al. (NPL: “Learning to Remember Rare Events”): teaches the use of memory modules to recall information in sequence-to-sequence networks using neural network models.
Sukhbaatar et al. (NPL: “End-To-End Memory Networks”): teaches the use of memory as an attention mechanism, with input and output memory access, for mapping natural language data.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516. The examiner can normally be reached Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/OLUWATOSIN O ALABI/Examiner, Art Unit 2129