Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on January 6, 2021 has been entered.
 
Remarks
	This Office Action is in response to applicant’s amendment filed on January 6, 2021, under which claims 1, 4, 10-11, 13-14, 17-21, 23-24, and 26 are pending and under consideration.

Response to Arguments
Applicant’s amendments have overcome the claim objections and § 112(b) rejections of the previous Office Action. Therefore, the objections and § 112(b) rejections have been withdrawn. 
Claim Interpretation under 35 U.S.C. § 112(f)
With respect to claim interpretation under 35 U.S.C. § 112(f), various claim terms remain interpreted as means-plus-function limitations, as set forth below. 

However, MPEP § 2181(I)(A) expressly states that “[t]his list is not exhaustive, and other generic placeholders may invoke 35 U.S.C. 112(f)” (emphasis added). Therefore, the fact that “converter”, “modeler”, “adjuster” and “normalizer” are not listed among the examples of generic placeholders in MPEP § 2181(I)(A) does not mean that these terms cannot invoke § 112(f). As noted in MPEP § 2181, paragraph 3, under Williamson v. Citrix Online, LLC, 792 F.3d 1339, 1349 (Fed. Cir. 2015), “[t]he standard is whether the words of the claim are understood by persons of ordinary skill in the art to have a sufficiently definite meaning as the name for structure.” Accordingly, § 112(f) may be invoked by “a substitute term [that] acts as a generic placeholder for the term ‘means’ and would not be recognized by one of ordinary skill in the art as being sufficiently definite structure for performing the claimed function.” MPEP § 2181(I).
Here, the terms “converter,” “modeler,” “adjuster,” and “normalizer” are not understood by persons of ordinary skill in the art to have a sufficiently definite meaning as the name for structure. Applicant has not submitted any evidence to the contrary. Therefore, these terms are means-plus-function limitations. 
For example, the term “normalizer” is not understood by persons of ordinary skill in the art to have a sufficiently definite meaning as the name for structure. Instead, the term 
If the applicant does not wish to invoke § 112(f), applicant may amend the claims to recite the functions as functionalities of an element that does not invoke § 112(f). For example, based on the descriptions on pages 7-8 of the specification, claim 1 could be amended to recite a memory storing instructions and a processor configured to execute the instructions to perform the currently recited functionalities. 
Claim Rejections under 35 U.S.C. § 103
With respect to claim rejections under 35 U.S.C. § 103, applicant’s arguments have been fully considered, but they are not deemed to be persuasive.
Applicant argues that Zhang and Priddy do not teach the normalization recited by the limitation of “performs the normalization by multiplying the number of elements of each of the L vectors by each of the vectors divided by the number of elements greater than zero” (see amended claim 1). This limitation was previously recited in dependent claim 3. 
The Examiner agrees that Zhang and Priddy do not teach the normalization technique recited by the above limitation. However, previous dependent claim 3 was rejected for obviousness over Zhang in view of Priddy, and further in view of Reinwald (US 2016/0364327 A1) and Srivastava (Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (2014) 1929-1958.). Priddy was relied upon for the teaching that normalization in general is a common technique for neural networks. Priddy was not relied upon for the specific normalization technique recited in previous dependent claim 3. Instead, Reinwald and Srivastava were cited for the normalization technique recited in previous dependent claim 3 (and now recited in independent claim 1). 
Since the rejection of claim 3 was based on the combined teachings of four references, including Reinwald and Srivastava, applicant cannot overcome the rejection (as it is now applied to current claim 1) by arguing that Zhang and Priddy alone do not teach the limitations in question without addressing Reinwald and Srivastava. 
The Examiner also acknowledges the discussion of specification examples in applicant’s response. However, the Examiner notes that the present claims are not limited to the specification examples discussed in applicant’s response. 
Therefore, applicant’s arguments are not persuasive in overcoming the present rejections over Zhang in view of Priddy, Reinwald, and Srivastava.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  
Claims 1, 4, 10, 11, and 13 invoke 35 U.S.C. § 112(f). Limitations in these claims that invoke § 112(f) are: “input data converter”, “modeler”, “adjuster” and “normalizer” recited in claim 1; and “input data normalizer” recited in claim 11.
The above terms are each considered to be “a substitute term [that] acts as a generic placeholder for the term ‘means’ and would not be recognized by one of ordinary skill in the art as being sufficiently definite structure for performing the claimed function.” MPEP § 2181(I). Additionally, “configured to” is considered to be a linking word modifying the generic placeholder (see MPEP § 2181(I), element (B) of the described 3-prong analysis).
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1.	Claims 1, 4, 10-11, 13-14, 17, 23, 24, and 26 are rejected under 35 U.S.C. § 103 as being unpatentable over Zhang (Junlin Zhang et al., “A Distributional Representation Model For Collaborative Filtering.” Cornell University Library, arXiv preprint arXiv:1502.04163. February 14, 2015) (cited by applicant in the IDS filed on 9/18/2017), in view of Priddy (Priddy et al., Data Normalization in Artificial Neural Networks: An Introduction (Ch. 3.). SPIE Tutorial Texts in Optical Engineering, Vol. TT68. August 2005), Reinwald (US 2016/0364327 A1) and Srivastava (Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (2014) 1929-1958.).
As to claim 1, Zhang teaches an apparatus for generating an artificial-neural-network-based prediction model, the apparatus comprising: [Zhang teaches a recommendation system that generates a neural network model. See Abstract; Section 1, paragraph 4 (describing a “recommender system”); and Section 2.2 (“[Neural] Network Structure”). It is understood that the recommendation system is implemented on a computer, especially given that Section 3 (“Experiments”) describes training on large data sets.] 
an input data converter configured to (use) input data of an L-dimensional array [Section 2.1 (“Transforming the user and item into vectors”): Either user vector WiU or item vector WjI (or both) may correspond to the “input data” of the instant claim. WiU and WjI include data from lookup table LT (see FIG. 1), which corresponds to an L-dimensional array.] and input vector data [Section 2.2: {WiU, WjI} is input into a neural network], where L is a natural number; [WiU and WjI  are one-dimensional arrays (L = 1), since they are vectors having d features. See also FIG. 1, which depicts the vectors as one-dimensional arrays.] 
a modeler configured to model an artificial-neural-network-based prediction model for learning the input vector data and output a value predicted through the model to create a pre-learned model; [Section 2.2 (“[Neural] Network Structure”) and Section 2.3 (“Training”), teaching that a neural network is trained using the aforementioned input vectors to output a value f̂(θ), where θ = {WU, WI, WL1, WL2}, as described in Section 2.3. After any portion of the training has been conducted, the neural network in Zhang is considered to be a “pre-learned model.” In other words, the neural network is “pre-learned” after having gone through training under at least one data object, which may be a data object in the form of [Ui, Ij, y], as described in section 2.3.] 
an adjuster configured to compare the value predicted by the modeler [The output value f̂(θ)] with an actually measured value [The rating of the user, “y” as described in section 2.3] to calculate an error value and adjust learning parameters of an artificial neural network using the error value and a back-propagation algorithm [Section 2.3, teaching that the training process calculates a “prediction error” J(θ) and that “general back-propagation is used to train the model by taking derivatives with respect to the four groups of parameters.” The prediction error includes a comparison between f̂(θ) and y.]; and
a controller configured to generate an L-expansive dimensional array using the pre-learned model when additional data is input [In Zhang, training on a data object results in the computation of WU and WI (included in θ, as described in section 2.2). The lookup table LT is updated to include the learned WU and WI (which corresponds to “additional data”), as indicated by Section 2.1 (teaching that the elements in the lookup table “needs to be learned through training”) and FIG. 1 (showing that the user and item vectors are extracted from the lookup table).] by adding the additional data to the input data [The aforementioned update of the lookup table LT reads on the limitation of “adding additional data” to generate an updated L-dimensional array.] and control the input data converter and the modeler to output a predicted value as L vectors [In Zhang, additional learning on a new data object generates another predicted value for WU and WI (as part of θ). The WU and WI that is learned and output includes “L vectors” (e.g., some set of WiU and WjI) corresponding to the lookup table LT that was updated in the previous training iteration. It is noted that “L” is not limited to any number, and may be a number such as 1, in which case the instant limitation only requires a single vector to satisfy the limitation of “L vectors.” It is also noted that Section 3.1 (“Datasets”) describes a large number of training data (each in the form of [Ui, Ij, y], as described in section 2.3). Therefore, Zhang teaches training over a large number of data objects, over a large number of training iterations. For example, Zhang describes training over a 90% sampling of a data set that includes 2.8 million data objects, associated with 72,916 users and 1628 items.] 
corresponding to element values included in additional data intended to be predicted by the L-dimensional array [The “additional data intended to be predicted by the L-dimensional array” may correspond to any data that serves as a basis for the L vectors, such as the new data object upon which learning is performed. It is also noted that “intended to be predicted by the L-dimensional array” specifies an intended use and does not further limit the structure of the additional data or the operations of the claim.],
wherein the controller is further configured to control the input data converter and the modeler to generate the L-expansive dimensional array [As noted in the rejection of claim 1, training on a data object results in a new computation of WU and WI, which are updated to the lookup table LT. The update of the lookup table LT reads on the limitation of “adding additional data” to generate an updated L-dimensional array. Training on new data includes performing the functionalities of (and thus controlling) the data converter and the modeler.] and wherein the L vectors corresponding to the additional data have the same size as L vectors corresponding to the input data. [In Zhang, the size of WiU and WjI is not changed during training of subsequent training data. See Section 2.1, teaching that the size of the vectors is d, which is a hyper-parameter, and Section 3.2, paragraph 2, indicating that the user vector and item vector have a fixed length of 24.]
	Zhang does not teach the following: 
(1)	the input data converter is configured to “convert the input data…into normalized vector data” and that the data being input is the “normalized vector data”;
(2)	“wherein the input data converter comprises: an input data normalizer configured to normalize L vectors corresponding to predetermined element values of the L-dimensional array, where L is a natural number, and a normalization vector input processor configured to input the L vectors to the artificial neural network”; and
(3) 	“wherein the input data normalizer performs the normalization by multiplying the number of elements of each of the L vectors by each of the vectors divided by the number of elements greater than zero.”
	Priddy, in an analogous art, teaches limitations (1) and (2) listed above. Priddy pertains to techniques for artificial neural networks (see Chapter 3 (p. 15), first paragraph). Therefore, Priddy is in the same field of endeavor as the claimed invention.
In particular, Priddy teaches “convert[ing] the input data…into normalized vector data” so that the data that is input is the “normalized vector data” [Chapter 3 (p. 15): “One of the most common tools used by designers of automated recognition systems to obtain better results is to utilize data normalization.” Priddy teaches normalizing “each input feature vector” (Section 3.1, first paragraph).]. Priddy further teaches: wherein the input data converter comprises: an input data normalizer configured to normalize L vectors corresponding to predetermined element values of the L-dimensional array, where L is a natural number [Priddy teaches normalizing each input feature vector, as noted above. Therefore, the combination of Zhang and Priddy teaches normalizing “L vectors”]; and a normalization vector input processor configured to input the L vectors to the artificial neural network [Priddy teaches normalizing “each input feature vector” (Priddy, Section 3.1, first paragraph).].
Priddy teaches that data normalization may be used to minimize bias within the neural network or speed up training time (Chapter 3 (p. 15), first paragraph: “Data normalization can also speed up training time by starting the training process for each feature within the same scale”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy by modifying the input data converter to convert input data of an L-dimensional array into normalized vector data and input the normalized vector data, and such that “the input data converter comprises: an input data normalizer configured to normalize L vectors corresponding to predetermined element values of the L-dimensional array, where L is a natural number, and a normalization vector input processor configured to input the L vectors to the artificial neural network,” in order to minimize bias within the neural network or speed up training time, as suggested by Priddy.
Reinwald, in the same field of endeavor or otherwise being analogous, teaches and also evidences that “the number of elements…divided by the number of elements greater than zero,” in its reciprocal form, is well-known in mathematics as a measure of the sparsity or density of a vector. Reinwald generally pertains to matrix-related techniques for machine learning algorithms ([0003]). 
In particular, Reinwald teaches “the number of elements…divided by the number of elements greater than zero” in reciprocal form [[0004]: “The fraction of non-zero elements to all elements in a matrix is called the sparsity or density of the matrix.” This fraction is the reciprocal of the instant claim limitation. While this description refers to “non-zero elements” rather than “elements greater than zero,” one of ordinary skill in the art would have understood from Reinwald that the concept of “elements greater than zero” is merely a case of “non-zero elements” when the matrix comprises only non-negative values. Reinwald teaches this case of matrices with only non-negative values. FIG. 2 and [0022] of Reinwald teach matrices comprising only non-negative values, in which case the number of non-zero elements would be the number of elements greater than zero. Therefore, Reinwald teaches the concept of sparsity or density based on a number of elements greater than zero.]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy with the teachings of Reinwald, by computing the measure of “the number of elements…divided by the number of elements greater than zero” in order to determine the sparsity or density of a matrix, as suggested by Reinwald (see sections cited above).  
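For illustration only, the relationship between the density measure quoted from Reinwald ([0004]) and the ratio recited in the claim can be sketched as follows. This is a minimal sketch and not code from Reinwald or from the instant application; the function names `density` and `claimed_ratio` and the sample vector are hypothetical and are assumed here only to show that, for non-negative data, the two quantities are reciprocals.

```python
from fractions import Fraction


def density(values):
    """Fraction of non-zero elements to all elements (per the Reinwald [0004] quote)."""
    nonzero = sum(1 for v in values if v != 0)
    return Fraction(nonzero, len(values))


def claimed_ratio(values):
    """Number of elements divided by the number of elements greater than zero."""
    positive = sum(1 for v in values if v > 0)
    if positive == 0:
        # The ratio is undefined when no element is greater than zero.
        raise ValueError("no elements greater than zero")
    return Fraction(len(values), positive)


# Hypothetical non-negative vector (assumption: 0 stands for a missing entry).
ratings = [5, 0, 0, 3, 0, 0, 0, 4]

d = density(ratings)        # 3 of 8 elements are non-zero
r = claimed_ratio(ratings)  # 8 elements, 3 of them greater than zero
```

Because every element of the sample vector is non-negative, the count of non-zero elements equals the count of elements greater than zero, so `d * r` equals 1.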
Srivastava, in the same field of endeavor, suggests “multiplying the number of elements of each of the L vectors by each vector divided by the number of elements greater than zero.” Srivastava generally pertains to a method for optimizing neural networks through a dropout technique. In Srivastava’s method, nodes are retained with probability p, while other nodes are dropped. Retained activations of the nodes (analogous to “each vector” of the instant claim) are scaled by “multiplying” by a factor of 1/p (analogous to “the number of elements of each of the L vectors…divided by the number of elements greater than zero” of the instant claim). See Srivastava, Section 10, second paragraph: “we retain units with probability p… scale up the retained activations by multiplying by 1/p.” See also Section 5.2, paragraph 2. The purpose of doing so is to ensure that the expected output of each unit under random dropout will be the same as the output during pretraining (Section 5.2, paragraph 2). Note that Srivastava teaches that dropout results in sparsity (Section 7.2).
That is, Srivastava teaches that the sparsity of an array may be accounted for by scaling the array based on its density/sparsity. In the context of dropout, the concept of density takes the form of the probability (p) of retaining a node. However, a person of ordinary skill in the art, who “is also a person of ordinary creativity, not an automaton” and “in many cases…will be able to fit the teachings of multiple patents together like pieces of a puzzle” (MPEP § 2141(II)(C) (citing KSR International Co. v. Teleflex Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007))), would have recognized that the concept taught in Srivastava can analogously be applied to pre-process vectors input into a neural network for rating prediction. In particular, the percentage of non-zero elements (i.e., the number of non-zero elements divided by the total number of elements) is analogous to the value “p” because both relate to density/sparsity. Furthermore, a vector that is sparser than another vector is analogous to a set of outputs having undergone dropout to a greater extent than a set of outputs having undergone dropout to a lesser extent. Therefore, one of ordinary skill would have recognized that sparse vectors are analogous to weights having undergone dropout.
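As an illustration of the scaling discussed above, the following minimal sketch retains each unit with probability p and multiplies each retained activation by 1/p. It is not Srivastava's implementation; the function names `dropout_scaled` and `expected_output` and the sample values are hypothetical. The second function only restates the arithmetic behind the statement that the expected output of a unit is unchanged by this scaling.

```python
import random


def dropout_scaled(activations, p, rng):
    """Retain each unit with probability p; scale retained activations by 1/p."""
    return [a / p if rng.random() < p else 0.0 for a in activations]


def expected_output(activation, p):
    """Expected value of one unit: retained (a/p) with probability p, else 0."""
    return p * (activation / p) + (1.0 - p) * 0.0


# Hypothetical activations and retention probability (assumptions for illustration).
rng = random.Random(0)
scaled = dropout_scaled([2.0, 2.0, 2.0, 2.0], 0.5, rng)
# Each entry of `scaled` is either 0.0 (dropped) or 2.0 / 0.5 (retained).
```

Under the analogy drawn above, the fraction of non-zero elements of a sparse input vector would play the role of p.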
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang, Priddy, and Reinwald with the further teachings of Srivastava, by configuring the normalizer to perform the normalization by “multiplying the number of elements of each of the L vectors by each of the vectors divided by the number of elements greater than zero,” in order to account for vector sparsity such that the expected values of vectors of different sparsity become the same, as suggested by Srivastava. 
It is noted that the cited references do not specifically teach the particular order of operation of first dividing the vector by the number of elements greater than zero and then multiplying the result by the number of elements. However, any order of applying the multiplication and division operations would have been obvious because the result would have been the same. See MPEP § 2144.05(IV)(C) (“selection of any order of performing process steps is prima facie obvious in the absence of new or unexpected results”). 
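The observation that the order of the multiplication and division does not change the result can be checked with a short sketch. The code below reflects one reading of the claimed normalization (each element multiplied by the number of elements and divided by the number of elements greater than zero); the function names and the sample vector are hypothetical, and exact rational arithmetic is used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction


def count_positive(vector):
    """Number of elements greater than zero; raises if there are none."""
    k = sum(1 for x in vector if x > 0)
    if k == 0:
        raise ValueError("no elements greater than zero")
    return k


def normalize_multiply_first(vector):
    """Multiply each element by the element count, then divide by the positive count."""
    n, k = len(vector), count_positive(vector)
    return [Fraction(x) * n / k for x in vector]


def normalize_divide_first(vector):
    """Divide each element by the positive count, then multiply by the element count."""
    n, k = len(vector), count_positive(vector)
    return [Fraction(x) / k * n for x in vector]


# Hypothetical sparse vector: 8 elements, 3 of them greater than zero.
v = [5, 0, 0, 3, 0, 0, 0, 4]
a = normalize_multiply_first(v)
b = normalize_divide_first(v)
```

For the sample vector, both functions scale every element by 8/3, so `a` and `b` are identical.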
	
As to claim 4, the combination of Zhang, Priddy, Reinwald, and Srivastava teaches the apparatus of claim 1, wherein the normalization vector input processor inputs a row vector formed by sequentially connecting the L vectors to the artificial neural network [Zhang, FIG. 1, and Section 2.1, last paragraph: “we concatenate the user vector and item vector into a longer vector {WiU, WjI}.” Section 2.2 teaches that {WiU, WjI} is input into the neural network.].
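For illustration, forming a single row vector by sequentially connecting vectors, as in the concatenation quoted from Zhang, can be sketched as follows. The function name `concatenate` and the three-element sample vectors are hypothetical; the sketch makes no assumption about how Zhang's lookup table is implemented.

```python
def concatenate(vectors):
    """Sequentially connect the given vectors into one row vector."""
    row = []
    for v in vectors:
        row.extend(v)
    return row


# Hypothetical user and item vectors, each with d = 3 features.
user_vector = [0.1, 0.2, 0.3]
item_vector = [0.4, 0.5, 0.6]

# The concatenated row vector has 2 * d elements and preserves element order.
network_input = concatenate([user_vector, item_vector])
```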

As to claim 10, the combination of Zhang, Priddy, Reinwald, and Srivastava teaches the apparatus of claim 1, wherein the controller is further configured to control the input data converter, the modeler, and the adjuster to generate the L-expansive dimensional array [As noted in the rejection of claim 1, training on a data object results in the computation of WU and WI, which are updated to the lookup table LT. The update of the lookup table reads on the limitation of “adding additional data” to generate an updated L-dimensional array. In the combination of Zhang and Priddy, training on new data includes performing the functionalities of (and thus controlling) the data converter, the modeler, and the adjuster.] and additionally learn the prediction model with L vectors corresponding to element values included in the additional data intended to be learned by the L-dimensional array [In Zhang, further training on a new data object would involve a new set of L vectors (WiU, WjI), which correspond to element values included in lookup table LT updated in the previous training iteration. Note that Section 3.1 (“Datasets”) describes a large number of training data (each in the form of [Ui, Ij, y], as described in section 2.3). Therefore, it is understood that the method of Zhang includes training over a large number of data objects. For example, Zhang describes training over a 90% sampling of a data set that includes 2.8 million data objects, associated with 72,916 users and 1628 items.], wherein 
the L vectors corresponding to the additional data have the same size as L vectors corresponding to the input data. [In Zhang, the size of WiU and WjI is not changed during training of subsequent training data. See Section 2.1, teaching that the size of the vectors is d, which is a hyper-parameter and Section 3.2, paragraph 2, indicating that the user vector and item vector have a fixed length of 24.]

As to claim 11, Zhang teaches an apparatus for converting data to be input to an artificial neural network, the apparatus [Zhang describes a “recommendation system.” It is understood that the recommendation system is implemented on a computer, since section 3 (“Experiments”) describes training of a neural network on a large data set.] comprising: 
L vectors corresponding to predetermined element values of an L-dimensional array, where L is a natural number; [Section 2.1 (“Transforming the user and item into vectors”): WiU and WjI are vectors corresponding to predetermined element values in lookup table LT, which is an L-dimensional array (tensor). Additionally, WiU and WjI are one-dimensional arrays (L = 1), since they are vectors having d features. Also shown in FIG. 1.] 
a normalization vector input processor configured to input the L vectors to the artificial neural network; and [Section 2.2: {WiU, WjI} is input into a neural network]
a controller configured to generate an L-expansive dimensional array using a pre-learned model when additional data is input [In Zhang, training on a data object results in the computation of WU and WI (included in θ, as described in section 2.2). The lookup table LT is updated to include the learned WU and WI (which correspond to “additional data”), as indicated by Section 2.1 (teaching that the elements in the lookup table “needs to be learned through training”) and FIG. 1 (showing that the user and item vectors are extracted from the lookup table). With respect to “pre-learned model,” after any portion of the training has been conducted, the neural network in Zhang is considered to be a “pre-learned model.” In other words, the neural network is “pre-learned” after having gone through training under at least one data object, which may be a data object in the form of [Ui, Ij, y], as described in section 2.3.] by adding the additional data to the input data [The aforementioned update of the lookup table LT reads on the limitation of “adding additional data” to generate an updated L-dimensional array.] and control … the normalizing vector input processor to output a predicted value as L vectors [In Zhang, additional learning on a new data object generates another predicted value for WU and WI (as part of θ). The WU and WI that is learned and output includes “L vectors” (e.g., some set of WiU and WjI) corresponding to the lookup table LT that was updated in the previous training iteration. It is noted that “L” is not limited to any number, and may be a number such as 1, in which case the instant limitation only requires a single vector to satisfy the limitation of “L vectors.” It is also noted that Section 3.1 (“Datasets”) describes a large number of training data (each in the form of [Ui, Ij, y], as described in section 2.3). Therefore, Zhang teaches training over a large number of data objects, using a large number of training iterations. For example, Zhang describes training over a 90% sampling of a data set that includes 2.8 million data objects, associated with 72,916 users and 1628 items.] corresponding to element values included in additional data intended to be predicted by the L-dimensional array [In this context, the “additional data intended to be predicted by the L-dimensional array” may be any data that serves as a basis for the L vectors, such as the new data object upon which learning is performed. It is also noted that “intended to be predicted by the L-dimensional array” specifies an intended use and does not further limit the structure of the additional data or the operations of the claim.].
However, Zhang does not teach: 
(1) 	“an input data normalizer configured to normalize” the aforementioned L vectors and to “control the input data normalizer” along with the normalization vector input processor to output the predicted value; and
(2) 	“wherein the input data normalizer performs the normalization by multiplying the number of elements of each of the L vectors by each of the vectors divided by the number of elements greater than zero.”
Priddy, in the same field of endeavor, teaches the above limitations of “an input data normalizer configured to normalize” L vectors and to “control the input data normalizer” along with the normalization vector input processor to output the predicted value. Priddy pertains to techniques for artificial neural networks. 
In particular, Priddy teaches “convert[ing] the input data…into normalized vector data” so that the data that is input is the “normalized vector data” [Chapter 3 (p. 15): “One of the most common tools used by designers of automated recognition systems to obtain better results is to utilize data normalization.” Priddy teaches normalizing “each input feature vector” (Section 3.1, first paragraph). Note that “each input feature vector” as taught in Priddy corresponds to the “input data” of the instant claim, as well as any other input data for the “neural network” or “pre-learned” model of the instant claim. Since Priddy teaches normalization for neural network inputs in general, Priddy also teaches and suggests “control[ling] the input data normalizer” along with the normalization vector input processor to output the predicted value, since this operation may use a neural network to process input data.]. 
Priddy teaches that data normalization may be used to minimize bias within the neural network or speed up training time (Chapter 3 (p. 15), first paragraph: “Data normalization can also speed up training time by starting the training process for each feature within the same scale.”).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy, by implementing a normalization process via an input data normalizer configured to normalize the L vectors and by controlling the input data normalizer along with the normalization vector input processor to output the predicted value, as taught by Priddy, in order to minimize bias within the neural network or speed up training time.
Reinwald, in the same field of endeavor or otherwise being analogous, teaches and also evidences that “the number of elements…divided by the number of elements greater than zero,” in its reciprocal form, is well-known in mathematics as a measure of the sparsity or density of a vector. Reinwald generally pertains to matrix-related techniques for machine learning algorithms ([0003]). 
In particular, Reinwald teaches “the number of elements…divided by the number of elements greater than zero” in reciprocal form [[0004]: “The fraction of non-zero elements to all elements in a matrix is called the sparsity or density of the matrix.” This fraction is the reciprocal of the instant claim limitation. While this description refers to “non-zero elements” rather than “elements greater than zero,” one of ordinary skill in the art would have understood from Reinwald that the concept of “elements greater than zero” is merely a case of “non-zero elements” when the matrix comprises only non-negative values. Reinwald teaches this case of matrices with only non-negative values. FIG. 2 and [0022] of Reinwald teach matrices comprising only non-negative values, in which case the number of non-zero elements would be the number of elements greater than zero. Therefore, Reinwald teaches the concept of sparsity or density based on a number of elements greater than zero.].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy with the teachings of Reinwald, by computing the measure of “the number of elements…divided by the number of elements greater than zero” in order to determine the sparsity or density of a matrix, as suggested by Reinwald (see sections cited above).  
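For illustration only, the density measure quoted from Reinwald at [0004] and its reciprocal relationship to the claimed ratio can be sketched as follows. The function names and the example matrix are hypothetical and do not appear in Reinwald; the sketch assumes a matrix of non-negative values, as discussed above.

```python
def density(matrix):
    """Fraction of non-zero elements to all elements in the matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


def count_positive(matrix):
    """Number of elements strictly greater than zero."""
    return sum(1 for row in matrix for value in row if value > 0)


# Hypothetical non-negative matrix: 8 elements, 3 of them non-zero.
example = [[0, 3, 0, 0],
           [5, 0, 0, 1]]

d = density(example)      # 3 / 8
claimed_ratio = 1 / d     # (number of elements) / (number of elements > 0)
```

For a non-negative matrix such as this one, the count of non-zero elements and the count of elements greater than zero coincide, so the claimed ratio is the reciprocal of the density.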
Srivastava, in the same field of endeavor, suggests “multiplying the number of elements of each of the L vectors by each vector divided by the number of elements greater than zero.” Srivastava generally pertains to a method for optimizing neural networks through a dropout technique. In Srivastava’s method, nodes are retained with probability p, while other nodes are dropped. Retained activations of the nodes (analogous to “each vector” of the instant claim) are scaled by “multiplying” by a factor of 1/p (analogous to “the number of elements of each of the L vectors…divided by the number of elements greater than zero” of the instant claim). See Srivastava, Section 10, second paragraph: “we retain units with probability p… scale up the retained activations by multiplying by 1/p.” See also Section 5.2, paragraph 2. The purpose of doing so is to ensure that the expected output from a unit under random dropout will be the same as the output during pretraining (Section 5.2, paragraph 2). Note that Srivastava teaches that dropout results in sparsity (Section 7.2).
That is, Srivastava teaches that the sparsity of an array may be accounted for by scaling the array based on its density/sparsity. In the context of dropout, the concept of density takes the form of the probability (p) of retaining a node. However, a person of ordinary skill in the art, who “is also a person of ordinary creativity, not an automaton” and “in many cases…will be able to fit the teachings of multiple patents together like pieces of a puzzle” (MPEP § 2141(II)(C) (citing KSR International Co. v. Teleflex Inc., 550 U.S. 398, 82 USPQ2d 1385 (2007))), would have recognized that the concept taught in Srivastava can analogously be applied to pre-process vectors input into a neural network for rating prediction. In particular, the percentage of non-zero elements (i.e., the number of non-zero elements divided by the total number of elements) is analogous to the value “p” because both relate to density/sparsity. Furthermore, a vector that is sparser than another vector is analogous to a set of outputs having undergone dropout to a greater extent than a set of outputs having undergone dropout to a lesser extent. Therefore, one of ordinary skill would have recognized that sparse vectors are analogous to weights having undergone dropout.
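For illustration only, the 1/p scaling quoted from Srivastava can be sketched as follows. The function names, the fixed keep mask, and the example values are hypothetical; an actual dropout implementation would draw the mask at random with retention probability p.

```python
def scale_retained(activations, keep_mask, p):
    """Zero the dropped units and multiply the retained activations by 1/p."""
    return [a / p if keep else 0.0 for a, keep in zip(activations, keep_mask)]


def expected_scaled_output(activation, p):
    """Expected value of one unit: retained with probability p (scaled by 1/p),
    dropped with probability 1 - p (output 0)."""
    return p * (activation / p) + (1 - p) * 0.0


# Hypothetical example: three activations, the second one dropped, p = 0.5.
scaled = scale_retained([1.0, 2.0, 3.0], [True, False, True], 0.5)
```

Under this sketch, the expected output of a unit after scaling equals its unscaled activation, which is the property relied upon in the analogy between p and the fraction of non-zero elements.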
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang, Priddy, and Reinwald with the further teachings of Srivastava, by configuring the normalizer to perform the normalization by “multiplying the number of elements of each of the L vectors by each of the vectors divided by the number of elements greater than zero,” in order to account for vector sparsity such that the expected values of vectors of different sparsity become the same, as suggested by Srivastava. 
It is noted that the cited references do not specifically teach the particular order of operation of first dividing the vector by the number of elements greater than zero and then multiplying the result by the number of elements. However, any order of applying the multiplication and division operations would have been obvious because the result would have been the same. See MPEP § 2144.05(IV)(C) (“selection of any order of performing process steps is prima facie obvious in the absence of new or unexpected results”). 
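For illustration only, the recited normalization and the equivalence of the two orders of operation can be sketched as follows. The function names and example vectors are hypothetical, and the handling of a vector with no elements greater than zero (returned unchanged) is an assumption, since the claim language does not address that case.

```python
import math


def normalize_multiply_first(vec):
    """Multiply each element by the number of elements, then divide by the
    number of elements greater than zero."""
    n = len(vec)
    n_pos = sum(1 for v in vec if v > 0)
    if n_pos == 0:
        return [float(v) for v in vec]  # assumption: leave all-zero vectors as-is
    return [v * n / n_pos for v in vec]


def normalize_divide_first(vec):
    """Divide each element by the number of elements greater than zero, then
    multiply by the number of elements."""
    n = len(vec)
    n_pos = sum(1 for v in vec if v > 0)
    if n_pos == 0:
        return [float(v) for v in vec]  # same assumption as above
    return [(v / n_pos) * n for v in vec]


def same_result(vec):
    """Check that both orders agree up to floating-point rounding."""
    a = normalize_multiply_first(vec)
    b = normalize_divide_first(vec)
    return all(math.isclose(x, y) for x, y in zip(a, b))
```

For the hypothetical vector [0, 4, 0, 2], which has 4 elements of which 2 are greater than zero, both functions scale each element by 4/2 = 2.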

As to claim 13, the combination of Zhang, Priddy, Reinwald, and Srivastava teaches the apparatus of claim 11, wherein the normalization vector input processor inputs a row vector formed by sequentially connecting the L vectors to the artificial neural network [Zhang, FIG. 1, and Section 2.1, last paragraph: “we concatenate the user vector and item vector into a longer vector {WiU, WjI}.” Section 2.2 teaches that {WiU, WjI} is input into the neural network.]. 
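For illustration only, forming a row vector by sequentially connecting L vectors can be sketched as follows. The function name and the example vectors of length 3 are hypothetical; they stand in for the user vector and item vector that Zhang concatenates.

```python
def connect_sequentially(vectors):
    """Concatenate the given vectors, in order, into one longer row vector."""
    row = []
    for vec in vectors:
        row.extend(vec)
    return row


# Hypothetical user and item vectors (L = 2), each of length 3.
user_vector = [0.1, 0.2, 0.3]
item_vector = [0.4, 0.5, 0.6]
network_input = connect_sequentially([user_vector, item_vector])
```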

As to claims 14 and 17, these claims are directed to “a method of generating an artificial-neural-network-based prediction model” comprising operations that are the same or substantially the same as those performed by the apparatus of claims 1 and 4.  
Therefore, the rejections made to claims 1 and 4 are applied to claims 14 and 17, respectively. In particular, the limitations of “to generate an L-expansive dimensional array obtained by adding additional data to the input data and output a predicted value as L vectors corresponding to element values included in additional data intended to be predicted by the L-dimensional array” recited in claim 14 are taught by Zhang for the reasons given for the same or substantially the same limitations of “to generate an L-expansive dimensional array…by adding the additional data to the input data and…output a predicted value as L vectors corresponding to element values included in additional data intended to be predicted by the L-dimensional array” recited in claim 1. 
Additionally, Zhang teaches “performing control to re-perform…the outputting” because the neural network of Zhang performs the “outputting” whenever it processes data, including when it processes data to generate the L-expansive dimensional array.
Zhang does not teach re-performing “the converting” along with the outputting. 
However, Priddy teaches “to re-perform the converting” [As noted in the rejection of claim 1, Priddy teaches normalizing “each input feature vector” (Section 3.1, first paragraph). Note that “each input feature vector” as taught in Priddy corresponds to the “input data” of the instant claim, as well as any other input data for the “neural network” or “pre-learned” model of the instant claim. Since Priddy teaches normalization for neural network inputs in general, Priddy also suggests re-performing the converting along with the outputting in order to generate the L-expansive dimensional array using the pre-learned model.].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy by re-performing the converting along with the outputting to generate the L-expansive dimensional array, in order to minimize bias within the neural network or speed up training time.

As to claim 23, the combination of Zhang, Priddy, Reinwald, and Srivastava teaches the method of claim 14, as set forth in the rejection of claim 14, above, further comprising when the artificial-neural-network-based prediction model is modeled [In Zhang, the neural network is considered to be “modeled” after having gone through training under at least one data object (i.e., a data object in the form of [Ui, Ij, y], as described in section 2.3).] and then additional data is input [In Zhang, training on a data object results in the computation of WU and WI (included in θ, as described in section 2.2). The lookup table LT is updated to include the learned WU and WI (i.e., “additional data is input”), as indicated by Section 2.1 (teaching that the elements in the lookup table “needs to be learned through training”) and FIG. 1 (showing that the user and item vectors are extracted from the lookup table).], performing control to re-perform the converting, the outputting, and the adjusting to generate an L-expansive dimensional array, where L is a natural number obtained by adding the additional data to the input data [As noted above, training on a data object results in the computation of WU and WI, which are updated to the lookup table LT. The update of the lookup table reads on the limitation of “adding additional data” to generate an updated L-dimensional array. Furthermore, the converting, outputting and adjusting are part of the training process for each iteration. Therefore, they are re-performed for any additional training data.] and additionally learn the prediction model by using L vectors corresponding to element values included in additional data intended to be learned by the L-dimensional array [In Zhang, further training on a new data object would involve a new set of L vectors (WiU, WjI), which correspond to element values included in lookup table LT updated in the previous training iteration. 
Note that Section 3.1 (“Datasets”) describes a large amount of training data (each item in the form of [Ui, Ij, y], as described in section 2.3). Therefore, it is understood that the method of Zhang includes training over a large number of data objects. For example, Zhang describes training over a 90% sampling of a data set that includes 2.8 million data objects, associated with 72,916 users and 1628 items.], wherein the L vectors corresponding to the additional data have the same size as L vectors corresponding to the input data. [In Zhang, the size of WiU and WjI is not changed during training of subsequent training data. See Section 2.1, teaching that the size of the vectors is d, which is a hyper-parameter, and Section 3.2, paragraph 2, indicating that the user vector and item vector have a fixed length of 24.]

As to claims 24 and 26, these claims are directed to “a method of converting data to be input to an artificial neural network” comprising operations that are the same or substantially the same as those performed by the apparatus of claims 11 and 13.  
Therefore, the rejections made to claims 11 and 13 are applied to claims 24 and 26, respectively. In particular, the limitations of “performing control to…generating an L-expansive dimensional array obtained by adding additional data to the input data and output a predicted value as L vectors corresponding to element values included in additional data intended to be predicted by the L-dimensional array” recited in claim 24 are taught by Zhang for the reasons given for the same or substantially the same limitations of “a controller configured to generate an L-expansive dimensional array…by adding the additional data to the input data and…to output a predicted value as L vectors corresponding to element values included in additional data intended to be predicted by the L-dimensional array” recited in claim 11.
Zhang does not teach “to re-perform the normalizing.” 
However, Priddy teaches “to re-perform the normalizing” [As noted in the rejection of claim 1, Priddy teaches normalizing “each input feature vector” (Section 3.1, first paragraph). Note that “each input feature vector” as taught in Priddy corresponds to the “input data” of the instant claim, as well as any other input data for the “neural network” or “pre-learned” model of the instant claim. Since Priddy teaches normalization for neural network inputs in general, Priddy also suggests re-performing the converting along with the outputting in order to generate the L-expansive dimensional array using the pre-learned model.].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and the above teachings of Priddy by performing control to re-perform the normalizing to generate the L-expansive dimensional array, in order to minimize bias within the neural network or speed up training time.

2.	Claims 18-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Zhang in view of Priddy, Reinwald, and Srivastava, and further in view of Ioffe (Ioffe, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv:1502.03167 [cs.LG] (2015)).
As to claim 18, the combination of Zhang, Priddy, Reinwald, and Srivastava teaches the method of claim 14, as set forth in the rejection above, but does not teach that the “outputting” comprises the further operations recited in the instant claim. 
Ioffe, in the same field of endeavor, teaches “batch-normalizing a data distribution between an input layer and a hidden layer of the artificial neural network, between hidden layers of the artificial neural network, and between the hidden layer and an output layer of the artificial neural network at least one time.” Ioffe teaches normalization techniques for neural networks that are applicable to the outputs from one layer to another. 
In particular, Ioffe teaches “batch normalization for normalization data distribution” (abstract). Ioffe teaches a network of multiple hidden layers (see Section 4.1, paragraph 1), and generally teaches that normalization may be applied to “the inputs of each layer” (Section 2, paragraph 2). See also Section 3 and Section 4.1, paragraph 1 (“we added batch Normalization to each hidden layer of the network” for a three-layer network). Therefore, Ioffe teaches normalization “between an input layer and a hidden layer of the artificial neural network, between hidden layers of the artificial neural network, and between the hidden layer and an output layer of the artificial neural network at least one time,” since Ioffe discloses an example network with 3 hidden layers followed by a fully-connected layer (Section 4.1, paragraph 1). 
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang and Priddy with the batch normalization teachings of Ioffe, particularly by modifying the outputting to comprise batch-normalizing a data distribution between an input layer and a hidden layer of the artificial neural network, between hidden layers of the artificial neural network, and between the hidden layer and an output layer of the artificial neural network to be performed at least one or more times. The motivation for doing so would have been to attain higher learning rates, permit less-careful initialization, and/or provide regularization, as suggested by Ioffe (abstract: “Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer”).
Furthermore, while Zhang does not specifically teach (a plurality of) “hidden layers,” Ioffe teaches a plurality of hidden layers (see, e.g., Ioffe, Section 4.1, paragraph 1). Additionally, both Zhang and Ioffe are directed to deep learning (see, e.g., Zhang abstract), and one of ordinary skill in the art would have recognized that, in deep learning in general, the number of hidden layers can be adjusted. For example, hidden layers can be added to increase depth (see Ioffe, page 2, first full paragraph). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have used a plurality of hidden layers (as taught in Ioffe, for example) in the neural network of Zhang, in order to increase depth. The use of a plurality of hidden layers would also have been an obvious combination of prior art elements according to known methods to yield predictable results, an obvious duplication of parts without new and unexpected results (MPEP § 2144.04(VI)), and the mere discovery of an optimum or workable number of hidden layers by routine experimentation (MPEP § 2144.05(II)(A)).

As to claim 19, the combination of Zhang, Priddy, Reinwald, Srivastava, and Ioffe teaches the method of claim 18, wherein the outputting comprises calculating an average of input values [computation of the mini-batch mean µ_B (Ioffe, page 2, Algorithm 1)]; calculating a variance of the input values by using the average [computation of the mini-batch variance σ_B² (Ioffe, page 2, Algorithm 1)]; calculating normalized values using the variance [computation of normalized values x̂_i (Ioffe, page 2, Algorithm 1)]; and calculating batch-normalized values by scaling and shifting the normalized values [computation of scaled and shifted values (Ioffe, page 2, Algorithm 1, line 4 under “output”)]. 
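For illustration only, the four operations mapped above to Ioffe's Algorithm 1 can be sketched for a single feature over one mini-batch as follows. The function name, default parameter values, and example batch are hypothetical; gamma and beta stand for the scale and shift parameters, which are learned during training in Ioffe but are fixed here for simplicity.

```python
import math


def batch_normalize(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature over a mini-batch of values xs."""
    m = len(xs)
    mean = sum(xs) / m                                 # mini-batch mean
    var = sum((x - mean) ** 2 for x in xs) / m         # mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # normalized values
    return [gamma * xh + beta for xh in x_hat]         # scale and shift


# Hypothetical mini-batch of four input values.
ys = batch_normalize([1.0, 2.0, 3.0, 4.0])
```

With gamma = 1 and beta = 0, the output has a mean of approximately zero and a variance slightly below one (because of the small constant eps added for numerical stability).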

As to claim 20, the combination of Zhang, Priddy, Reinwald, Srivastava, and Ioffe teaches the method of claim 18, as set forth in the rejection above, wherein the outputting comprises applying a non-linear function to the batch-normalized values in the hidden layer [Zhang, section 2.2, teaching the “tanh” non-linear function.]. 
Alternatively, this feature is taught by Ioffe, section 3.2, paragraph 1, which discloses a non-linear function g, which may be ReLU. Ioffe teaches using the function ReLU(x) = max(x, 0) to address the saturation problem and its resulting vanishing gradients (Ioffe page 2, first full paragraph). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have substituted the tanh non-linear function with the ReLU non-linear function to address the saturation problem and its resulting vanishing gradients.

3.	Claim 21 is rejected under 35 U.S.C. § 103 as being unpatentable over Zhang in view of Priddy, Reinwald, Srivastava, and Ioffe, and further in view of Tang (Tang et al., User Modeling with Neural Network for Review Rating Prediction, Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)) and Sharma (Symmetric Collaborative Filtering Using the Noisy Sensor Model in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001), ArXiv reprint: arXiv:1301.2309 [cs.IR] (2013)).
As to claim 21, the combination of Zhang, Priddy, Reinwald, Srivastava, and Ioffe teaches the method of claim 18, but does not teach that the outputting comprises the further operations recited in the instant claim. 
Tang, in an analogous art, teaches “applying a softmax function to the batch-normalized values in the output layer to calculate respective probabilities of ratings.” Tang generally pertains to the use of neural networks for rating predictions (abstract), and is analogous for at least the reason of being in the same field of endeavor. 
In particular, Tang teaches “applying a softmax function to the batch-normalized values in the output layer to calculate respective probabilities of ratings” [FIG. 1, showing a softmax layer; Section 3.4, teaching that “we use softmax to predict the probabilities for classes (e.g. one to five stars).” The “classes” disclosed in Tang correspond to the “ratings” recited in the instant claim. Note that in section 3.4, paragraph 1, Tang teaches that the probability distribution f(r,l) of the ratings is calculated using the softmax function.] Thus, Tang teaches that the softmax function is useful for computing a probability distribution of a rating. Tang generally teaches that the probability distribution is useful for rating prediction (see Tang, FIG. 1).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Zhang, Priddy, Reinwald, Srivastava, and Ioffe and the teachings of Tang by modifying the modeler of Zhang so that the outputting includes “applying a softmax function to the batch-normalized values in the output layer to calculate respective probabilities of ratings,” in order to compute a probability distribution of a rating that is useful for rating prediction, as suggested by Tang.
Sharma, in an analogous art, teaches “…and outputting the predicted value by applying weights to the probabilities to calculated weighted probabilities, and summing the weighted probabilities.” Sharma generally relates to techniques for collaborative filtering (abstract), and is analogous for at least the reason of being in the same field of endeavor. 
In particular, Sharma teaches “outputting the predicted value by applying weights to the probabilities to calculated weighted probabilities, and summing the weighted probabilities” [Page 491, left column, fourth paragraph, teaching that “To predict a rating (for example, to compare it with other algorithms that predict ratings), we predict the expected value of the rating,” wherein the expected value is computed by applying weights (vi) to the probability distribution values (Pr(Saj = vi|X)). Note that the step of “summing” is part of the calculation of the expected value.]. Thus, Sharma teaches that the probability distribution is useful for deriving a predicted rating. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Zhang, Priddy, Reinwald, Srivastava, Ioffe, and Tang so that the outputting further includes “outputting the predicted value by applying weights to the probabilities to calculated weighted probabilities, and summing the weighted probabilities,” in order to compute a predicted rating using a probability distribution, as suggested by Sharma.  
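For illustration only, the combined operations attributed above to Tang (softmax over rating classes) and Sharma (expected value of the rating) can be sketched as follows. The function names, the example output-layer values, and the one-to-five-star rating values are hypothetical.

```python
import math


def softmax(values):
    """Convert output-layer values into a probability for each rating class."""
    largest = max(values)                    # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_rating(probabilities, ratings):
    """Weight each probability by its rating value and sum the weighted terms."""
    return sum(p * r for p, r in zip(probabilities, ratings))


# Hypothetical output-layer values for ratings of one to five stars.
ratings = [1, 2, 3, 4, 5]
probs = softmax([0.0, 0.0, 0.0, 0.0, 0.0])   # uniform: each probability is 0.2
predicted = expected_rating(probs, ratings)
```

In this sketch the rating values serve as the weights applied to the probabilities, and the sum of the weighted probabilities is the predicted value.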

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kim et al., “Collaborative Filtering for Recommendation Using Neural Networks,” ICCSA 2005, LNCS 3480, pp. 127 – 136, 2005 teaches collaborative filtering using neural networks. 
Strub et al., “Hybrid Collaborative Filtering with Neural Networks,” arXiv: 1603.00806v2 [cs.IR] 9 Mar 2016 teaches collaborative filtering using autoencoder neural networks, involving normalization of input data to a range of -1 to 1.   

Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764.  The examiner can normally be reached on Monday - Friday 8:30 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571) 270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Y.D.H./Examiner, Art Unit 2124                                                                                                                                                                                                        




/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124