DETAILED ACTION
This Office Action is in response to the Application filed on October 12, 2021, which is a continuation (CON) application of Application No. 16/085339, filed on September 14, 2018, which is a national stage application under 35 U.S.C. § 371 of International Application No. PCT/IB2017/051580, filed on March 17, 2017, which claims benefit of U.S. Application No. 62/309682 filed on March 17, 2016. An action on the merits follows. Claim 1 is pending in the application.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification
The disclosure is objected to because it contains an embedded hyperlink and/or other form of browser-executable code. Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.
Additionally, the title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed. 

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) are: “unit for generating”, “generating unit for receiving”, “unit for receiving”, “unit for selecting” in claim 1.
Because these claim limitation(s) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof: Figs. 1 and 4 and page 29 describe a programmed computing device or computer, including for example software, hardware, or a combination of hardware and software capable of performing the described functionality.
If applicant does not intend to have these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Georgescu et al. (US PG Pub. No. 2016/0174902 A1), hereafter referred to as Georgescu, in view of Agnihotri et al. (US PG Pub. No. 2010/0272338 A1), hereafter referred to as Agnihotri, Applicant-cited prior art.

Regarding claim 1, Georgescu discloses a unit for generating a vector of at least one numeric value to be used for processing a task (Fig. 28; Par. [0003-4]: image analysis tasks, such as segmentation, motion tracking, and disease diagnosis and quantification… each of the deep neural networks directly inputs image patches from the training data and learns high-level domain-specific image features. The trained deep neural network for a particular marginal search space may be discriminative, in that it calculates, for a given hypothesis in the search space, a probability that the hypothesis in the search space is correct, or may provide a regression function (regressor) that calculates, for each hypothesis in the search space, a difference vector from that hypothesis to predicted pose parameters of the object in the search space; Par. [0038]: the method of FIG. 1 can train each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct… the method of FIG. 1 can train each of the deep neural networks to be a regression function (regressor) that calculates, for each hypothesis in a search space, a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space; Par. [0042-44]: the first deep neural network may train a regressive function that inputs image patches of an image as hypotheses and calculates a difference vector for each input image patch… train networks with three or more hidden layers. The pre-training can be treated as an unsupervised learning process to discover powerful image features from the input image data. Various deep learning techniques, such as an auto-encoder (AE) or a restricted Boltzman machine (RBM), can be used to pre-train a hidden layer… The AE 200 has an input layer L1 202, the hidden layer L2, and an output layer L3 206. 
If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch… The goal of an AE is to minimize the difference between the input and output vectors; Par. [0047]: use deep neural networks to train a series of regressors, each of which calculates, for each hypothesis in the search space, a difference vector from that hypothesis to predicted pose parameters of the object in the search space; Par. [0136]: above-described methods can be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 28. Computer 2802 contains a processor 2804, which controls the overall operation of the computer 2802 by executing computer program instructions which define such operation; generating a vector of at least one numeric value to be used for processing a task (i.e. processing), including image analysis tasks (i.e. actions, functions, etc.), such as segmentation, motion tracking, etc., including calculating a difference vector (i.e. generating a vector of at least one numeric value) for each hypothesis in a search space to predicted pose parameters of an object in the search space, as indicated above, for example), the unit for generating a vector comprising:
a unit for generating combined feature maps, the unit for generating combined feature maps comprising a feature map generating unit, the feature map generating unit for receiving more than one modality and for generating more than one corresponding feature map using more than one corresponding transformation operating independently of each other (Par. [0038-41]: a method of training a series of deep neural networks for anatomical object detection in medical images according to an embodiment of the present invention. The method of FIG. 1 utilizes a database of training images to train a series of deep neural networks in a series of marginal search spaces of increasing dimensionality to determine a full pose parameter space for an anatomical object in a medical image… the method of FIG. 1 can train each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct… the method of FIG. 1 can train each of the deep neural networks to be a regression function (regressor) that calculates, for each hypothesis in a search space, a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space… training images are received. In particular, a plurality of training images are loaded from a database. The training images can be 2D or 3D medical images acquired using any medical imaging modality, such as but not limited to CT, MRI, Ultrasound, X-ray fluoroscopy, DynaCT, etc… training samples are generated for the current marginal search space. The training samples are image patches that are used as hypotheses in the current search space to train the deep neural network for that search space. For the first search space (e.g., position) the training samples are generated by selecting image patches from the training images; Par. 
[0044-47]: train a series of discriminative deep neural networks, each of which calculates, for a given hypothesis in its marginal search space, a probability that the hypothesis in the search space is correct… The regressed hypotheses are passed through the incrementally increasing marginal spaces during both the training and objected detection in a new image; Par. [0048-53]: complex image patterns can be encoded in hierarchical features by learning one or more hidden layers by stacking deep neural network architectures, as described above. To solve the regression problem for a particular search space, at the output layer either a discretized multi-class classifier or a linear/non-linear regressor can be trained on top of the neural network features extracted by the learned hidden layers… hidden layers 304 and 306 can be trained to hierarchically extract features from the input image patches by stacking multiple deep neural network architectures in an unsupervised pre-training phase. The output layer 308 calculates displacement vector between the hypothesis parameters for each input image patch and the parameters of the target anatomical object for the current parameter space. An inverse of the distance of the estimated image patch to the ground truth image patch for the anatomical object location is used to train the confidence score… output parameter space can be either directly regressed using a linear function or it can be discretized relative to the parameter range and solved as a multi-class classification problem… the displacement vector dp(2) output by the deep neural network 300 maps the hypothesis parameters dp(2) to the target parameters p(1). In a second iteration, the parameters p(1) are then input back into the deep neural network 300 in order to refine the estimated target parameters, and the deep neural network 330 outputs a displacement vector that maps the parameters p(1) to the refined target parameters p(1)… training images are received. 
Step 402 of FIG. 4 can be implemented similarly to step 102 of FIG. 1. The training images can be 2D or 3D images, depending on the imaging modality and anatomical object to be detected… a first deep neural network is trained to detect position candidates based on the training images… the first deep neural network (either discriminative or regressor) can be trained in two stages of unsupervised pre-training of the hidden layers (e.g., using stacked DAE) for learning complex features from input image patches; Par. [0060]: detecting an anatomical object in a medical image using a series of trained deep neural networks according to an embodiment of the present invention. The method of FIG. 5 can be performed using a series of deep neural networks trained using the method of FIG. 4. Referring to FIG. 5, at step 502, a medical image of the patient is received. The medical image can be 2D or 3D and can be acquired using any type of medical imaging modality, such as but not limited to CT, MRI, ultrasound, X-ray fluoroscopy, DynaCT, etc.; Par. [0089-90]: use of deep neural network architectures for detection and segmentation of 3D objects in volumetric (3D) medical image data may require scanning large volumetric input spaces. This requires significant computational resources due to the large, high-dimensional input space and the complex weight matrices learned for such deep neural networks. This may apply for convolutional layers as well as fully connected filters. 
In an advantageous embodiment of the present invention, sparse adaptive deep neural networks (SADNN) (also referred to herein as sparse deep neural networks) are trained to learn representations of from 3D medical image modalities and are used in place of convolutional or fully connected deep neural networks to perform 3D object detection and segmentation in volumetric medical image data… the detection of candidates in the respective marginal search space is essentially reduced to a patch-wise classification task described by a set of m parameterized input patches X (i.e., observations) with a corresponding set of class assignments y, specifying whether the target anatomical structure is contained in the patch or not. In a representation learning approach for training a deep neural network, such inputs are processed to higher-level data representations using the inter-neural connections, defined as kernels under non-linear mappings. For general notation purposes, the parameters of a convolution filter for a given neuron (node) in the network can be defined as the pair (w, b), where w encodes the weights and b represents the associated bias. The same notation holds for a fully connected layer, which can conceptually be regarded as a convolution layer with the filter size equal to the underlying feature-map size. From the perspective of a given neuron in a fully connected layer, this means that the neuron is connected to all the neurons in the previous layer and a corresponding weight is learned for each connection… In this case of the fully connected deep neural network, n also represents the number of neurons in the network, as there is a one-to-one association between neuron and kernel. In order to compute the response or so-called activation of a given neuron, a linear combination is computed between the weights of all incoming connections and the activations of all neurons from where the incoming connections originate. 
The bias of this neuron is then added to this linear combination, and the resulting value is transformed by a non-linear mapping to obtain the activation value; a unit for generating combined feature maps, the unit for generating combined feature maps comprising a feature map generating unit (e.g. generate combined feature maps by using a mapping function (i.e. a feature map generating unit/function) that has as input, image patches corresponding to current hypothesis parameters, including corresponding extracted features from input image patches, and as output, target parameter displacement, which is learned from the current hypothesis parameters to the correct object parameters in each marginal search space, including displacement vector dp(2) output by the deep neural network 300, which maps the hypothesis parameters dp(2) to the target parameters p(1), as indicated above, for example), the feature map generating unit for receiving more than one modality (e.g. mapping function (i.e. feature map generating) that has as input (i.e. receives), image patches corresponding to current hypothesis parameters, including corresponding extracted features from input image patches, such as selected image patches from training images or training samples, which are used as hypotheses in the current search space to train the deep neural network for that search space, including a first search space (position), and the training images include 2D or 3D images, depending on the imaging modality and anatomical object to be detected (i.e. mapping function receiving more than one modality), as indicated above) and for generating more than one corresponding feature map using more than one corresponding transformation (e.g. generate more than one corresponding feature map using more than one corresponding transformation, including a linear combination computed and the resulting value transformed by a non-linear mapping (i.e. 
generate more than one corresponding feature map using more than one corresponding transformation), as indicated above), for example); 
wherein the more than one corresponding transformation is generated following an initial training performed in accordance with the processing task to be performed and a combining unit for selecting and combining the corresponding more than one feature map generated by the feature map generating unit in accordance with at least one combining operation and for providing at least one corresponding combined feature map (Par. [0038-41]: a method of training a series of deep neural networks for anatomical object detection in medical images according to an embodiment of the present invention. The method of FIG. 1 utilizes a database of training images to train a series of deep neural networks in a series of marginal search spaces of increasing dimensionality to determine a full pose parameter space for an anatomical object in a medical image… the method of FIG. 1 can train each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct… the method of FIG. 1 can train each of the deep neural networks to be a regression function (regressor) that calculates, for each hypothesis in a search space, a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space… training images are received. In particular, a plurality of training images are loaded from a database. 
The training images can be 2D or 3D medical images acquired using any medical imaging modality, such as but not limited to CT, MRI, Ultrasound, X-ray fluoroscopy, DynaCT, etc… training samples are generated for the current marginal search space. The training samples are image patches that are used as hypotheses in the current search space to train the deep neural network for that search space. For the first search space (e.g., position) the training samples are generated by selecting image patches from the training images; Par. [0044-47]: train a series of discriminative deep neural networks, each of which calculates, for a given hypothesis in its marginal search space, a probability that the hypothesis in the search space is correct. This framework for training a sequential series of discriminative deep neural networks in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Learning (MSDL). In MSDL, deep learning is utilized to automatically learn high-level domain-specific image features directly from the medical image data... unsupervised pre-training followed by supervised fine-tuning can be used to overcome the over-fitting issue. This technique can be used to train networks with three or more hidden layers. The pre-training can be treated as an unsupervised learning process to discover powerful image features from the input image data. Various deep learning techniques, such as an auto-encoder (AE) or a restricted Boltzman machine (RBM), can be used to pre-train a hidden layer. FIG. 2 illustrates an exemplary AE neural network. As shown in FIG. 2, the AE 200 is a feed-forward neural network with one hidden layer 204. The AE 200 has an input layer L1 202, the hidden layer L2, and an output layer L3 206. If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch. Ignoring the bias term (the nodes labeled as +1 in FIG. 
2), the input and output layers 202 and 206, respectively have the same number of nodes. The goal of an AE is to minimize the difference between the input and output vectors… after pre-training a number of hidden layers, the output of the hidden layers can be treated as high-level image features and used to train a discriminative classifier for detecting the anatomical object in the current parameter space… use deep neural networks to train a series of regressors, each of which calculates, for each hypothesis in the search space, a difference vector from that hypothesis to predicted pose parameters of the object in the search space. This framework for training a sequential series of deep neural network regressors in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Regression (MSDR). In MSDR, a mapping function is learned from the current hypothesis parameters to the correct object parameters in each marginal search space. The mapping function has as input, an image patch corresponding to the current hypothesis parameters and as output the target parameter displacement. Each current hypothesis will yield a new hypothesis through the regression function which converges to the correct object parameters when learned successfully. The regressed hypotheses are passed through the incrementally increasing marginal spaces during both the training and objected detection in a new image; par. [0048-53]: complex image patterns can be encoded in hierarchical features by learning one or more hidden layers by stacking deep neural network architectures, as described above. 
To solve the regression problem for a particular search space, at the output layer either a discretized multi-class classifier or a linear/non-linear regressor can be trained on top of the neural network features extracted by the learned hidden layers… hidden layers 304 and 306 can be trained to hierarchically extract features from the input image patches by stacking multiple deep neural network architectures in an unsupervised pre-training phase. The output layer 308 calculates displacement vector between the hypothesis parameters for each input image patch and the parameters of the target anatomical object for the current parameter space. An inverse of the distance of the estimated image patch to the ground truth image patch for the anatomical object location is used to train the confidence score… output parameter space can be either directly regressed using a linear function or it can be discretized relative to the parameter range and solved as a multi-class classification problem… the displacement vector dp(2) output by the deep neural network 300 maps the hypothesis parameters dp(2) to the target parameters p(1). In a second iteration, the parameters p(1) are then input back into the deep neural network 300 in order to refine the estimated target parameters, and the deep neural network 330 outputs a displacement vector that maps the parameters p(1) to the refined target parameters p(1)… training images are received. Step 402 of FIG. 4 can be implemented similarly to step 102 of FIG. 1. The training images can be 2D or 3D images, depending on the imaging modality and anatomical object to be detected… a first deep neural network is trained to detect position candidates based on the training images… the first deep neural network (either discriminative or regressor) can be trained in two stages of unsupervised pre-training of the hidden layers (e.g., using stacked DAE) for learning complex features from input image patches; Par. 
[0060]: detecting an anatomical object in a medical image using a series of trained deep neural networks according to an embodiment of the present invention. The method of FIG. 5 can be performed using a series of deep neural networks trained using the method of FIG. 4. Referring to FIG. 5, at step 502, a medical image of the patient is received. The medical image can be 2D or 3D and can be acquired using any type of medical imaging modality, such as but not limited to CT, MRI, ultrasound, X-ray fluoroscopy, DynaCT, etc.; Par. [0089-90]: use of deep neural network architectures for detection and segmentation of 3D objects in volumetric (3D) medical image data may require scanning large volumetric input spaces. This requires significant computational resources due to the large, high-dimensional input space and the complex weight matrices learned for such deep neural networks. This may apply for convolutional layers as well as fully connected filters. In an advantageous embodiment of the present invention, sparse adaptive deep neural networks (SADNN) (also referred to herein as sparse deep neural networks) are trained to learn representations of from 3D medical image modalities and are used in place of convolutional or fully connected deep neural networks to perform 3D object detection and segmentation in volumetric medical image data… the detection of candidates in the respective marginal search space is essentially reduced to a patch-wise classification task described by a set of m parameterized input patches X (i.e., observations) with a corresponding set of class assignments y, specifying whether the target anatomical structure is contained in the patch or not. In a representation learning approach for training a deep neural network, such inputs are processed to higher-level data representations using the inter-neural connections, defined as kernels under non-linear mappings. 
For general notation purposes, the parameters of a convolution filter for a given neuron (node) in the network can be defined as the pair (w, b), where w encodes the weights and b represents the associated bias. The same notation holds for a fully connected layer, which can conceptually be regarded as a convolution layer with the filter size equal to the underlying feature-map size. From the perspective of a given neuron in a fully connected layer, this means that the neuron is connected to all the neurons in the previous layer and a corresponding weight is learned for each connection… In this case of the fully connected deep neural network, n also represents the number of neurons in the network, as there is a one-to-one association between neuron and kernel. In order to compute the response or so-called activation of a given neuron, a linear combination is computed between the weights of all incoming connections and the activations of all neurons from where the incoming connections originate. The bias of this neuron is then added to this linear combination, and the resulting value is transformed by a non-linear mapping to obtain the activation value; wherein the more than one corresponding transformation is generated following an initial training performed in accordance with the processing task to be performed (e.g. pre-training (i.e. initial training) is performed, including an unsupervised learning process to discover powerful image features from the input image data, such as corresponding extracted features from input image patches of training images above, depending on the imaging modality and anatomical object to be detected (i.e. 
in accordance with the processing task to be performed), and after pre-training a number of hidden layers, the output of the hidden layers can be treated as high-level image features and used to train a discriminative classifier for detecting the anatomical object in the current parameter space, such as by generating more than one corresponding feature map using more than one corresponding transformation, such as at least one final (i.e. resulting) feature map, including a linear combination computed and the resulting (i.e. final) value transformed by a non-linear mapping (i.e. final feature map is performed by applying each of the at least one corresponding transformation), as indicated above) and a combining unit for selecting and combining the corresponding more than one feature map generated by the feature map generating unit in accordance with at least one combining operation and for providing at least one corresponding combined feature map (e.g. provide at least one corresponding combined feature map by selecting and combining the corresponding more than one feature map generated in accordance with at least one combining operation, such as by generating the more than one corresponding feature map using the more than one corresponding transformation, including a linear combination (i.e. at least one combining operation) computed and the resulting value transformed by a non-linear mapping, as indicated above), for example); 
wherein the combining unit is operating in accordance with the processing task to be performed and the combining operation reduces each corresponding numeric value of each of the more than one feature map generated by the feature map generation unit down to one numeric value in the at least one corresponding combined feature map (Par. [0038-41]: a method of training a series of deep neural networks for anatomical object detection in medical images according to an embodiment of the present invention. The method of FIG. 1 utilizes a database of training images to train a series of deep neural networks in a series of marginal search spaces of increasing dimensionality to determine a full pose parameter space for an anatomical object in a medical image… the method of FIG. 1 can train each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct… the method of FIG. 1 can train each of the deep neural networks to be a regression function (regressor) that calculates, for each hypothesis in a search space, a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space… training images are received. In particular, a plurality of training images are loaded from a database. The training images can be 2D or 3D medical images acquired using any medical imaging modality, such as but not limited to CT, MRI, Ultrasound, X-ray fluoroscopy, DynaCT, etc… training samples are generated for the current marginal search space. The training samples are image patches that are used as hypotheses in the current search space to train the deep neural network for that search space. For the first search space (e.g., position) the training samples are generated by selecting image patches from the training images; Par. 
[0044-47]: train a series of discriminative deep neural networks, each of which calculates, for a given hypothesis in its marginal search space, a probability that the hypothesis in the search space is correct. This framework for training a sequential series of discriminative deep neural networks in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Learning (MSDL). In MSDL, deep learning is utilized to automatically learn high-level domain-specific image features directly from the medical image data... unsupervised pre-training followed by supervised fine-tuning can be used to overcome the over-fitting issue. This technique can be used to train networks with three or more hidden layers. The pre-training can be treated as an unsupervised learning process to discover powerful image features from the input image data. Various deep learning techniques, such as an auto-encoder (AE) or a restricted Boltzmann machine (RBM), can be used to pre-train a hidden layer. FIG. 2 illustrates an exemplary AE neural network. As shown in FIG. 2, the AE 200 is a feed-forward neural network with one hidden layer 204. The AE 200 has an input layer L1 202, the hidden layer L2, and an output layer L3 206. If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch. Ignoring the bias term (the nodes labeled as +1 in FIG. 2), the input and output layers 202 and 206, respectively, have the same number of nodes. 
The goal of an AE is to minimize the difference between the input and output vectors… after pre-training a number of hidden layers, the output of the hidden layers can be treated as high-level image features and used to train a discriminative classifier for detecting the anatomical object in the current parameter space… use deep neural networks to train a series of regressors, each of which calculates, for each hypothesis in the search space, a difference vector from that hypothesis to predicted pose parameters of the object in the search space. This framework for training a sequential series of deep neural network regressors in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Regression (MSDR). In MSDR, a mapping function is learned from the current hypothesis parameters to the correct object parameters in each marginal search space. The mapping function has as input, an image patch corresponding to the current hypothesis parameters and as output the target parameter displacement. Each current hypothesis will yield a new hypothesis through the regression function which converges to the correct object parameters when learned successfully. The regressed hypotheses are passed through the incrementally increasing marginal spaces during both the training and object detection in a new image; Par. [0048-53]: complex image patterns can be encoded in hierarchical features by learning one or more hidden layers by stacking deep neural network architectures, as described above. To solve the regression problem for a particular search space, at the output layer either a discretized multi-class classifier or a linear/non-linear regressor can be trained on top of the neural network features extracted by the learned hidden layers… hidden layers 304 and 306 can be trained to hierarchically extract features from the input image patches by stacking multiple deep neural network architectures in an unsupervised pre-training phase. 
The output layer 308 calculates displacement vector between the hypothesis parameters for each input image patch and the parameters of the target anatomical object for the current parameter space. An inverse of the distance of the estimated image patch to the ground truth image patch for the anatomical object location is used to train the confidence score… output parameter space can be either directly regressed using a linear function or it can be discretized relative to the parameter range and solved as a multi-class classification problem… the displacement vector dp(2) output by the deep neural network 300 maps the hypothesis parameters dp(2) to the target parameters p(1). In a second iteration, the parameters p(1) are then input back into the deep neural network 300 in order to refine the estimated target parameters, and the deep neural network 330 outputs a displacement vector that maps the parameters p(1) to the refined target parameters p(1)… training images are received. Step 402 of FIG. 4 can be implemented similarly to step 102 of FIG. 1. The training images can be 2D or 3D images, depending on the imaging modality and anatomical object to be detected… a first deep neural network is trained to detect position candidates based on the training images… the first deep neural network (either discriminative or regressor) can be trained in two stages of unsupervised pre-training of the hidden layers (e.g., using stacked DAE) for learning complex features from input image patches; Par. [0060]: detecting an anatomical object in a medical image using a series of trained deep neural networks according to an embodiment of the present invention. The method of FIG. 5 can be performed using a series of deep neural networks trained using the method of FIG. 4. Referring to FIG. 5, at step 502, a medical image of the patient is received. 
The medical image can be 2D or 3D and can be acquired using any type of medical imaging modality, such as but not limited to CT, MRI, ultrasound, X-ray fluoroscopy, DynaCT, etc.; Par. [0089-90]: use of deep neural network architectures for detection and segmentation of 3D objects in volumetric (3D) medical image data may require scanning large volumetric input spaces. This requires significant computational resources due to the large, high-dimensional input space and the complex weight matrices learned for such deep neural networks. This may apply for convolutional layers as well as fully connected filters. In an advantageous embodiment of the present invention, sparse adaptive deep neural networks (SADNN) (also referred to herein as sparse deep neural networks) are trained to learn representations from 3D medical image modalities and are used in place of convolutional or fully connected deep neural networks to perform 3D object detection and segmentation in volumetric medical image data… the detection of candidates in the respective marginal search space is essentially reduced to a patch-wise classification task described by a set of m parameterized input patches X (i.e., observations) with a corresponding set of class assignments y, specifying whether the target anatomical structure is contained in the patch or not. In a representation learning approach for training a deep neural network, such inputs are processed to higher-level data representations using the inter-neural connections, defined as kernels under non-linear mappings. For general notation purposes, the parameters of a convolution filter for a given neuron (node) in the network can be defined as the pair (w, b), where w encodes the weights and b represents the associated bias. The same notation holds for a fully connected layer, which can conceptually be regarded as a convolution layer with the filter size equal to the underlying feature-map size. 
From the perspective of a given neuron in a fully connected layer, this means that the neuron is connected to all the neurons in the previous layer and a corresponding weight is learned for each connection… In this case of the fully connected deep neural network, n also represents the number of neurons in the network, as there is a one-to-one association between neuron and kernel. In order to compute the response or so-called activation of a given neuron, a linear combination is computed between the weights of all incoming connections and the activations of all neurons from where the incoming connections originate. The bias of this neuron is then added to this linear combination, and the resulting value is transformed by a non-linear mapping to obtain the activation value; wherein the combining unit is operating in accordance with the processing task to be performed and the combining operation reduces each corresponding numeric value of each of the more than one feature map generated by the feature map generation unit down to one numeric value in the at least one corresponding combined feature map (e.g. generate more than one corresponding feature map using more than one corresponding transformation based on training images, and the training images include 2D or 3D images, depending on the imaging modality and anatomical object to be detected (i.e. in accordance with the processing task to be performed), including a linear combination computed and the resulting value transformed by a non-linear mapping to obtain activation value (i.e. the combining operation reduces each corresponding numeric value of each of the more than one feature map generated by the feature map generation unit down to one numeric value), for example);
generating at least one final feature map using at least one corresponding transformation; 
wherein the at least one corresponding transformation is generated following an initial training performed in accordance with the processing task to be performed; and
a feature map processing unit for receiving the generated at least one final feature map from the second feature map generating unit and for processing the generated at least one final feature map to provide a generated vector of at least one numeric value to be used for processing the task (Par. [0038-42]: a method of training a series of deep neural networks for anatomical object detection in medical images according to an embodiment of the present invention. The method of FIG. 1 utilizes a database of training images to train a series of deep neural networks in a series of marginal search spaces of increasing dimensionality to determine a full pose parameter space for an anatomical object in a medical image… the method of FIG. 1 can train each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct… the method of FIG. 1 can train each of the deep neural networks to be a regression function (regressor) that calculates, for each hypothesis in a search space, a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space… training images are received. In particular, a plurality of training images are loaded from a database. The training images can be 2D or 3D medical images acquired using any medical imaging modality, such as but not limited to CT, MRI, Ultrasound, X-ray fluoroscopy, DynaCT, etc… training samples are generated for the current marginal search space. The training samples are image patches that are used as hypotheses in the current search space to train the deep neural network for that search space. 
For the first search space (e.g., position) the training samples are generated by selecting image patches from the training images… neural network may train a regressive function that inputs image patches of an image as hypotheses and calculates a difference vector for each input image patch between the parameters of the image patch in the current search space and the parameters of the target anatomical object in the current search space, resulting in predicted parameters of the target anatomical object in the current search space; Par. [0044-47]: train a series of discriminative deep neural networks, each of which calculates, for a given hypothesis in its marginal search space, a probability that the hypothesis in the search space is correct. This framework for training a sequential series of discriminative deep neural networks in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Learning (MSDL). In MSDL, deep learning is utilized to automatically learn high-level domain-specific image features directly from the medical image data... unsupervised pre-training followed by supervised fine-tuning can be used to overcome the over-fitting issue. This technique can be used to train networks with three or more hidden layers. The pre-training can be treated as an unsupervised learning process to discover powerful image features from the input image data. Various deep learning techniques, such as an auto-encoder (AE) or a restricted Boltzmann machine (RBM), can be used to pre-train a hidden layer. FIG. 2 illustrates an exemplary AE neural network. As shown in FIG. 2, the AE 200 is a feed-forward neural network with one hidden layer 204. The AE 200 has an input layer L1 202, the hidden layer L2, and an output layer L3 206. If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch. Ignoring the bias term (the nodes labeled as +1 in FIG. 
2), the input and output layers 202 and 206, respectively, have the same number of nodes. The goal of an AE is to minimize the difference between the input and output vectors… after pre-training a number of hidden layers, the output of the hidden layers can be treated as high-level image features and used to train a discriminative classifier for detecting the anatomical object in the current parameter space… use deep neural networks to train a series of regressors, each of which calculates, for each hypothesis in the search space, a difference vector from that hypothesis to predicted pose parameters of the object in the search space. This framework for training a sequential series of deep neural network regressors in a series of marginal spaces of increasing dimensionality can be referred to as Marginal Space Deep Regression (MSDR). In MSDR, a mapping function is learned from the current hypothesis parameters to the correct object parameters in each marginal search space. The mapping function has as input, an image patch corresponding to the current hypothesis parameters and as output the target parameter displacement. Each current hypothesis will yield a new hypothesis through the regression function which converges to the correct object parameters when learned successfully. The regressed hypotheses are passed through the incrementally increasing marginal spaces during both the training and object detection in a new image; Par. [0048-53]: complex image patterns can be encoded in hierarchical features by learning one or more hidden layers by stacking deep neural network architectures, as described above. 
To solve the regression problem for a particular search space, at the output layer either a discretized multi-class classifier or a linear/non-linear regressor can be trained on top of the neural network features extracted by the learned hidden layers… hidden layers 304 and 306 can be trained to hierarchically extract features from the input image patches by stacking multiple deep neural network architectures in an unsupervised pre-training phase. The output layer 308 calculates displacement vector between the hypothesis parameters for each input image patch and the parameters of the target anatomical object for the current parameter space. An inverse of the distance of the estimated image patch to the ground truth image patch for the anatomical object location is used to train the confidence score… output parameter space can be either directly regressed using a linear function or it can be discretized relative to the parameter range and solved as a multi-class classification problem… the displacement vector dp(2) output by the deep neural network 300 maps the hypothesis parameters dp(2) to the target parameters p(1). In a second iteration, the parameters p(1) are then input back into the deep neural network 300 in order to refine the estimated target parameters, and the deep neural network 330 outputs a displacement vector that maps the parameters p(1) to the refined target parameters p(1)… training images are received. Step 402 of FIG. 4 can be implemented similarly to step 102 of FIG. 1. The training images can be 2D or 3D images, depending on the imaging modality and anatomical object to be detected… a first deep neural network is trained to detect position candidates based on the training images… the first deep neural network (either discriminative or regressor) can be trained in two stages of unsupervised pre-training of the hidden layers (e.g., using stacked DAE) for learning complex features from input image patches; Par. 
[0060-61]: detecting an anatomical object in a medical image using a series of trained deep neural networks according to an embodiment of the present invention. The method of FIG. 5 can be performed using a series of deep neural networks trained using the method of FIG. 4. Referring to FIG. 5, at step 502, a medical image of the patient is received. The medical image can be 2D or 3D and can be acquired using any type of medical imaging modality, such as but not limited to CT, MRI, ultrasound, X-ray fluoroscopy, DynaCT, etc… neural network may train a regressive function that inputs image patches centered at voxels of the medical image and calculates a difference vector for each voxel resulting in a predicted center position of the anatomical object calculated for each input voxel. In this case, the first trained deep neural network can also calculate a confidence score for each predicted position and a number of predicted positions with the highest confidence scores are kept; Par. [0078]: a trained deep neural network regressor can be used to estimate a target position-orientation-scale image patch for each of the position-orientation candidates, and the target image patch having the highest confidence score can be selected as the final detection result; Par. [0089-90]: use of deep neural network architectures for detection and segmentation of 3D objects in volumetric (3D) medical image data may require scanning large volumetric input spaces. This requires significant computational resources due to the large, high-dimensional input space and the complex weight matrices learned for such deep neural networks. This may apply for convolutional layers as well as fully connected filters. 
In an advantageous embodiment of the present invention, sparse adaptive deep neural networks (SADNN) (also referred to herein as sparse deep neural networks) are trained to learn representations from 3D medical image modalities and are used in place of convolutional or fully connected deep neural networks to perform 3D object detection and segmentation in volumetric medical image data… the detection of candidates in the respective marginal search space is essentially reduced to a patch-wise classification task described by a set of m parameterized input patches X (i.e., observations) with a corresponding set of class assignments y, specifying whether the target anatomical structure is contained in the patch or not. In a representation learning approach for training a deep neural network, such inputs are processed to higher-level data representations using the inter-neural connections, defined as kernels under non-linear mappings. For general notation purposes, the parameters of a convolution filter for a given neuron (node) in the network can be defined as the pair (w, b), where w encodes the weights and b represents the associated bias. The same notation holds for a fully connected layer, which can conceptually be regarded as a convolution layer with the filter size equal to the underlying feature-map size. From the perspective of a given neuron in a fully connected layer, this means that the neuron is connected to all the neurons in the previous layer and a corresponding weight is learned for each connection… In this case of the fully connected deep neural network, n also represents the number of neurons in the network, as there is a one-to-one association between neuron and kernel. In order to compute the response or so-called activation of a given neuron, a linear combination is computed between the weights of all incoming connections and the activations of all neurons from where the incoming connections originate. 
The bias of this neuron is then added to this linear combination, and the resulting value is transformed by a non-linear mapping to obtain the activation value; and generating at least one final feature map using at least one corresponding transformation (e.g. resulting (i.e. final) value is transformed by a non-linear mapping (i.e. generating at least one final feature map using at least one corresponding transformation), as indicated above); wherein the generating of the at least one final feature map is performed by applying each of the at least one corresponding transformation (e.g. generate more than one corresponding feature map using more than one corresponding transformation, such as at least one final (i.e. resulting) feature map, including a linear combination computed and the resulting (i.e. final) value transformed by a non-linear mapping (i.e. final feature map is performed by applying each of the at least one corresponding transformation), as indicated above); wherein the at least one corresponding transformation is generated following an initial training performed in accordance with the processing task to be performed (e.g. pre-training (i.e. initial training) is performed, including an unsupervised learning process to discover powerful image features from the input image data, such as corresponding extracted features from input image patches of training images above, depending on the imaging modality and anatomical object to be detected (i.e. in accordance with the processing task to be performed), and after pre-training a number of hidden layers, the output of the hidden layers can be treated as high-level image features and used to train a discriminative classifier for detecting the anatomical object in the current parameter space, such as by generating more than one corresponding feature map using more than one corresponding transformation, including a linear combination computed and the resulting value transformed by a non-linear mapping (i.e. 
the at least one corresponding transformation is generated following an initial training), as indicated above); and a feature map processing unit for receiving the generated at least one final feature map and for processing the generated at least one final feature map to provide a generated vector of at least one numeric value to be used for processing the task (e.g. mapping function that has as input, image patches corresponding to current hypothesis parameters, including corresponding extracted features from input image patches, and as output (i.e. a feature map processing), target parameter displacement, which is learned from the current hypothesis parameters to the correct object parameters in each marginal search space, including a difference vector from that hypothesis to predicted pose parameters of the target anatomical object in the search space (i.e. processing the generated at least one final feature map to provide a generated vector of at least one numeric value to be used for processing the task), as indicated above), the feature map generating unit for receiving more than one modality (e.g. mapping function (i.e. feature map generating) that has as input (i.e. receives), image patches corresponding to current hypothesis parameters, including corresponding extracted features from input image patches, such as selected image patches from training images or training samples, which are used as hypotheses in the current search space to train the deep neural network for that search space, including a first search space (e.g., position), and the training images include 2D or 3D images, depending on the imaging modality and anatomical object to be detected (i.e. mapping function receiving more than one modality), as indicated above), for example), but fails to teach the following as further recited in claim 1.
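By way of illustration only, the activation computation quoted above from Georgescu (a linear combination of the incoming weights and activations, to which the neuron's bias is added, followed by a non-linear mapping) may be sketched as follows. The function names `neuron_activation` and `layer_activations` are hypothetical, and the logistic sigmoid is an assumed choice of non-linear mapping, since the quoted passage does not name one. The sketch is not taken from the reference.

```python
import math


def neuron_activation(weights, inputs, bias):
    """Activation of one neuron: sigmoid(sum_i w_i * a_i + b).

    The sigmoid is an assumption made for this sketch only.
    """
    if len(weights) != len(inputs):
        raise ValueError("one weight is required per incoming connection")
    # Linear combination of incoming weights and activations, plus the bias.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    # Non-linear mapping of the resulting value.
    return 1.0 / (1.0 + math.exp(-z))


def layer_activations(weight_rows, biases, inputs):
    """Fully connected layer: every neuron sees every input activation."""
    return [neuron_activation(w, inputs, b) for w, b in zip(weight_rows, biases)]


# Hypothetical values: a two-neuron layer fed by a three-element input patch.
patch = [0.2, 0.4, 0.6]
layer_weights = [[0.5, -0.5, 0.0], [1.0, 1.0, 1.0]]
layer_biases = [0.0, -1.2]
feature_values = layer_activations(layer_weights, layer_biases, patch)
```

Each neuron in the sketch reduces all of its incoming values to a single numeric value, which is the sense in which the rejection above reads the "combining operation" on the quoted passage.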
However, Agnihotri teaches wherein the generating of each of the more than one corresponding feature map is performed by applying a given corresponding transformation on a given corresponding modality (Par. [0015-20]: system further analyzes a volume of interest ("VOI") across different modalities in order to find feature mapping from one modality to another. This information is used to populate a table which gives the ratios and mapping of feature values in one modality versus another. For example, one such method of mapping may be referred to as Factor Analysis, which may be used to map image-based features of one modality to image-based features of another modality… image feature vector extracted from a VOI will allow initial retrievals of similar lesions in the same modality. The image feature vector is translated to image feature vectors for the other desired modalities… modality features are indexed to the corresponding feature in different image modalities (e.g., ultrasound, MRI, X-ray) to create feature mapping. Feature mapping results in feature relationships between the same features or different features in different image modalities… converting features from different modalities by fitting a polynomial function to estimate the features in one modality from features in another modality; wherein the generating of each of the more than one corresponding feature map is performed by applying a given corresponding transformation on a given corresponding modality (e.g. modality features are indexed to the corresponding feature in different image modalities (e.g., ultrasound, MRI, X-ray) to create feature mapping, including feature relationships between the same features or different features in different image modalities, in which image feature vector(s) are translated (i.e. transformed, converted, etc.) to image feature vectors for the other desired modalities, as indicated above), for example), 
a second feature map generating unit, the second feature map generating unit for receiving the at least one corresponding combined feature map from the unit for generating combined feature maps and for generating at least one final feature map using at least one corresponding transformation (Par. [0018-20]: image feature vector extracted from a VOI will allow initial retrievals of similar lesions in the same modality. The image feature vector is translated to image feature vectors for the other desired modalities… modality features are indexed to the corresponding feature in different image modalities (e.g., ultrasound, MRI, X-ray) to create feature mapping. Feature mapping results in feature relationships between the same features or different features in different image modalities… converting features from different modalities by fitting a polynomial function to estimate the features in one modality from features in another modality; Par. [0023-24]: image-based features (image-based features from the original modalities and the mapped features) and the non-image based information of the patient in question may then be combined in step 140. That is, the features from the original modality may be combined with features calculated from images of a similar modality and features calculated from images from a different modality… corresponding features are then used to retrieve similar cases from the non-original modalities. For example, an original X-ray scan may be used to retrieve similar, previously diagnosed scans such as ultrasound, MRI, etc. The similar cases from the original modality 210 and the non-original modalities 220 are combined in step 230; a second feature map generating unit, the second feature map generating unit for receiving the at least one corresponding combined feature map from the unit for generating combined feature maps and for generating at least one final feature map using at least one corresponding transformation (e.g. 
image-based features from the original modalities and the mapped features and the non-image based information of the patient in question are combined, in which the features from the original modality are combined with features calculated from images of a similar modality and features calculated from images from a different modality, to generate at least one final (i.e. combined, resulting, etc.) feature map using at least one corresponding transformation, as indicated above), for example); 
wherein the generating of the at least one final feature map is performed by applying each of the at least one corresponding transformation on at least one of the at least one corresponding feature map received from the unit for generating combined feature maps; and
a feature map processing unit for receiving the generated at least one final feature map from the second feature map generating unit (Par. [0018-20]: image feature vector extracted from a VOI will allow initial retrievals of similar lesions in the same modality. The image feature vector is translated to image feature vectors for the other desired modalities… modality features are indexed to the corresponding feature in different image modalities (e.g., ultrasound, MRI, X-ray) to create feature mapping. Feature mapping results in feature relationships between the same features or different features in different image modalities… converting features from different modalities by fitting a polynomial function to estimate the features in one modality from features in another modality; Par. [0023-24]: image-based features (image-based features from the original modalities and the mapped features) and the non-image based information of the patient in question may then be combined in step 140. That is, the features from the original modality may be combined with features calculated from images of a similar modality and features calculated from images from a different modality… corresponding features are then used to retrieve similar cases from the non-original modalities. For example, an original X-ray scan may be used to retrieve similar, previously diagnosed scans such as ultrasound, MRI, etc. The similar cases from the original modality 210 and the non-original modalities 220 are combined in step 230; wherein the generating of the at least one final feature map is performed by applying each of the at least one corresponding transformation on at least one of the at least one corresponding feature map received from the unit for generating combined feature maps and a feature map processing unit for receiving the generated at least one final feature map from the second feature map generating unit (e.g. 
generate/receive image-based features from the original modalities and the mapped features and the non-image based information of the patient in question, which are combined, including features from the original modality combined with features calculated from images of a similar modality and features calculated from images from a different modality one final (i.e. combined, resulting, etc.) feature map using at least one corresponding transformation, including image feature vector(s) which are translated (i.e. transformed, converted, etc.) to image feature vectors for the other desired modalities (i.e. by applying each of the at least one corresponding transformation on at least one of the at least one corresponding feature map received), as indicated above), for example).
Georgescu and Agnihotri are considered to be analogous art because they pertain to image processing applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the apparatus which includes a method that trains each of the deep neural networks to be discriminative in that it calculates, for a given hypothesis in a search space, a probability that the hypothesis in the search space is correct, by generating combined feature maps using a mapping function that has as input image patches corresponding to current hypothesis parameters, including corresponding extracted features from input image patches (as disclosed by Georgescu) with the limitations wherein the generating of each of the more than one corresponding feature map is performed by applying a given corresponding transformation on a given corresponding modality, receiving the at least one corresponding combined feature map from the unit for generating combined feature maps and generating at least one final feature map using at least one corresponding transformation, wherein the generating of the at least one final feature map is performed by applying each of the at least one corresponding transformation on at least one of the at least one corresponding feature map received from the unit for generating combined feature maps, and receiving the generated at least one final feature map (as taught by Agnihotri, Abstract, Par. [0015-20, 23-24]), in order to perform visual comparison useful in training medical personnel in diagnosing different diseases (Agnihotri, Abstract, Par. [0003]).

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GUILLERMO M RIVERA-MARTINEZ whose telephone number is (571) 272-4979. The examiner can normally be reached from 9 am to 5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at 571-272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/GUILLERMO M RIVERA-MARTINEZ/           Primary Examiner, Art Unit 2668