DETAILED ACTION

	Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of prior-filed U.S. Application No. 14/847,133, filed on September 8, 2015, which claims priority to U.S. Provisional Application No. 62/059,494, filed on October 3, 2014, is acknowledged.
Drawings
The drawings were received on 12/31/2019.  These drawings are acceptable.

Response to Arguments
	Applicant’s amendments and remarks filed 11/15/2021 have been fully considered by the examiner.
	Regarding the rejection of claims under 35 USC § 103, applicant’s arguments are unpersuasive and the rejection made in the previous action is maintained.
Applicant argues that the cited reference fails to disclose the first layer of the LSTM receiving the input from the CNN, in Pg. :
Elagouni generally discloses a technique for text recognition using a connectionist approach where images are scanned and segments of the scanned images input to a convolutional network (ConvNet) for generating a multi-scale image representation that includes a sequence of generated features x^0, x^1, ..., x^t. See Elagouni at FIG. 1. While this sequence of generated features output from the ConvNet is provided as input to a bidirectional long short-term memory (LSTM) network for classifying features to thereby perform text recognition via decoding, Elagouni fails to ever disclose the LSTM network also processing any sort of features for the segments of the scanned images that were input to the ConvNet. Elagouni simply fails to disclose a first layer of his LSTM network receiving, as input, segment features for the segments of the scanned images and the sequence of 

Examiner notes that neural networks take input for processing, and that input is considered the claimed input layer. Elagouni et al. (NPL: “Text Recognition in Videos using a Recurrent Connectionist Approach”, hereinafter ‘Ela’) denotes this using the arrow that depicts the claimed input generated by the convolutional neural network. The X sequence of generated features thus corresponds to the claimed input features, wherein the input features include respective segment features for each of a plurality of segments… generating first features for the segment by processing the segment features for the segment using one or more convolutional neural network. Applicant has also noted, in the filed remarks cited above, that this is taught by the cited reference Ela.
In addition, Fig. 1 discloses that one or more LSTMs are used to classify the received feature sequences, which corresponds to the claimed first layer of the one or more LSTM layers. The arrow in Fig. 1 denotes the claimed input received by the claimed first layer for processing by the LSTM associated with that layer. Secs. 3 and 4 further describe the LSTM as learning classifications from the inputs received from the CNNs for processing by the first-layer LSTM depicted in Fig. 1 below.

[Figure: Ela, Fig. 1]


Regarding claim 5, applicant’s arguments rely on the alleged deficiencies addressed above. Because Ela has no such deficiencies, as discussed above, the rejection of claim 5 is maintained.

The rejection under 35 USC § 103 has been maintained.



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6-7, 9-11, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Elagouni et al. (NPL: “Text Recognition in Videos using a Recurrent Connectionist Approach”, hereinafter ‘Ela’) in view of Simard et al. (US Pub. No. 2007/0086655, ‘Sim’).

Regarding independent claims 1, 6, and 10, Ela teaches: a method comprising; a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising; and a computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: (the claimed method for processing data using a processor and algorithm, with the claimed functions performed via a computer having storage for instructions to perform operations of text extraction from videos, in pg. 7: Sec. 6: We have presented an OCR scheme adapted to the recognition of texts extracted from digital videos. Using a multi-scale scanning scheme, a novel representation of text images built with features learnt by a ConvNet is generated. Based on a particular recurrent neural network—namely the BLSTM—and a connectionist classification—namely the CTC—our approach takes as input generated representations and recognizes texts…)
receiving input features, wherein the input features include respective segment features for each of a plurality of segments; and processing the input features using a model, wherein the processing comprises: for each of the segments: generating first features for the segment by processing the segment features for the segment using one or more convolutional neural network (CNN) layers, wherein the convolutional neural network (CNN) layers perform spatial modeling on the input feature; (the claimed receiving of input for the respective plurality of spatial segments S, as depicted in Fig. 1, and feature extraction using the claimed convolutional neural network, in pg. 2: Sec. 2: The first task for video text recognition consists in detecting and extracting texts from videos as described in [4]. Once extracted, text images are recognized by means of two main steps as depicted in fig. 1: generation of text image representations and text recognition. In the first step, images are scanned at different scales so that, for each position in the image, four different windows are extracted. Each window is then represented by a vector of features learnt with a convolutional neural network (ConvNet). Considering the different positions in the scanning step and the four windows extracted each time, a sequence of learnt features vectors X0, . . . , Xt, . . . , Xp is thus generated to represent each image…; and the claimed generated first features are depicted as the learnt feature vectors X in Fig. 1.)

[Figure: Ela, Fig. 1]

generating second features for the segment by processing both the segment features for the segment and the first features using one or more long short-term memory network (LSTM) layers to perform temporal modeling over the first features, wherein a first layer of the one or more LSTM layers is configured to receive, as input, the segment features for the segment and the first features generated for the segment; and determining an output feature based on at least the second features for the plurality of segments. (the claimed second features as the feature sequence classification used to determine the claimed output feature, i.e., the text recognition output depicted in Fig. 1, in Sec. 2: … The second step of the proposed OCR is similar to the model presented in [7], using a specific bidirectional recurrent neural network (BLSTM) able to learn to recognize text making use of both future and past context. The recurrent network is also characterized by a specific objective function (CTC) [7], that allows the classification of non-segmented characters. Finally, the network’s outputs are decoded to obtain the recognized text.…; and the claimed temporal modeling of the learnt features as the use of the LSTM model to account for past and future temporal contexts to perform classification of learnt features, in Sec. 4.1: The basic idea of RNN is to introduce recurrent connections which enable the network to maintain an internal state and thus to take into account the past context... A LSTM neuron contains a constant “memory cell”—namely constant error carousel (CEC)—whose access is controlled by some multiplicative gates. For these reasons we chose to use the LSTM model to classify our learnt feature sequences [i.e. generating second features for the segment by processing both the segment features for the segment and the first features using one or more long short-term memory network (LSTM)].
Moreover, in our task of text recognition, the past context is as important as the future one (i.e., both previous and next letters are important to recognize the current letter). Hence, we propose to use a bidirectional LSTM which consists of two separated hidden layers of LSTM neurons. The first one permits to process the forward pass making use of the past context, while the second serves for the backward pass making use of the future context. Both hidden layers are connected to the same output layer (cf. fig. 1).)
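For illustration only, the forward and backward passes described in the quoted Sec. 4.1 can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain tanh recurrence as a stand-in for LSTM neurons, scalar features, and hypothetical weights (`w_in`, `w_rec`); none of these names or values come from Ela.

```python
import math

def recurrent_pass(sequence, w_in=0.5, w_rec=0.5):
    """Run a plain tanh recurrence (a stand-in, not an LSTM) over a list of scalars.

    Returns one hidden state per time step; each state depends only on the
    current input and the inputs that came before it in `sequence`.
    """
    state = 0.0
    states = []
    for x in sequence:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(sequence):
    """Pair a forward pass (past context) with a backward pass (future context)."""
    forward = recurrent_pass(sequence)
    # Process the reversed sequence, then flip the result back so that
    # index t of `backward` summarizes inputs t, t+1, ..., end.
    backward = recurrent_pass(sequence[::-1])[::-1]
    # Both directions feed the same per-position output, here as a pair.
    return list(zip(forward, backward))

# Hypothetical learnt feature values for four scanning positions.
features = [0.2, -0.1, 0.7, 0.4]
combined = bidirectional_pass(features)
```

Each entry of `combined` pairs a state that has seen only earlier positions with a state that has seen only later positions, which is the sense in which both hidden layers contribute to the same output.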
Examiner notes that the first features are generated from the claimed input features, wherein the input features include respective segment features for each of a plurality of segments… generating first features for the segment by processing the segment features for the segment using one or more convolutional neural network. Thus, the features processed by the convolutional neural network are considered to be both the claimed segment features for the segment and the claimed first features, depicted as the X features generated by the convolutional networks and received as a sequence of generated features [i.e. the one or more LSTM layers is configured to receive, as input, the segment features for the segment and the first features generated for the segment] by the LSTM models, as depicted in Fig. 1:

[Figure: Ela, Fig. 1]


Sim expressly discloses the computing environment for natural language processing as claimed: a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising; and a computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: (example of the claimed computing environment depicted in Fig. 12:

[Figure: Sim, Fig. 12]

And in paras. 0072-0076: … The computer 1212 includes a processing unit 1214, a system memory 1216, and a system bus 1218. The system bus 1218 couples system components including, but not limited to, the system memory 1216 to the processing unit 1214. The processing unit 1214 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1214… The system memory 1216 includes volatile memory 1220 and nonvolatile memory 1222... By way of illustration, and not limitation, nonvolatile memory 1222 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory 1220 includes random access memory (RAM), … Operating system 1228, which can be stored on disk storage 1224, acts to control and allocate resources of the computer system 1212. System applications 1230 take advantage of the management of resources by operating system 1228 through program modules 1232 and program data 1234 stored either in system memory 1216 or on disk storage 1224. It is to be appreciated that the claimed subject matter can be implemented with various operating systems or combinations of operating systems.)
The Ela and Sim references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing a natural language information processing system using learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the computing environment executing machine learning algorithms as disclosed by Sim with the method of processing natural language content using machine learning algorithms as disclosed by Ela.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Ela and Sim in order to develop speech and/or object recognition systems and/or methodologies utilizing one or more programs that are customized for particular actions and/or applications, and to implement, for “optical character recognition (OCR) systems, a convolutional neural network… to extract features in order to perform high-level classification” (Sim, 0004 & 0008); doing so would allow for a computing environment that can “take advantage of special parallel hardware (SSE or MMX) to speed 

Regarding claims 2, 7, and 11, the rejections of claims 1, 6, and 10 are respectively incorporated. Ela in combination with Sim teaches the limitations: wherein processing the first features using the one or more LSTM layers to generate the second features comprises: processing the first features using a linear layer to generate reduced features having a reduced dimension from a dimension of the first features; and processing the reduced features using the one or more LSTM layers to generate the second features. (in Sec. 3.2: … In our experiments, several configurations of ConvNets have been tested. The best configuration takes as input a color window image mapped into three 36 × 36 input maps, containing values normalized between −1 and 1, and returns a vector of values normalized with the softmax function. The architecture of our ConvNet is similar to the one presented in [4] and consists of six hidden layers. The first four ones are alternated convolutional and sub-sampling layers connected to three other neuron layers where the penultimate layer contains 50 neurons. Therefore, using this network architecture, each position in the text image is represented by a vector of 200 values [processing the first features using a linear layer to generate reduced features having a reduced dimension from a dimension of the first features] (50 values for each scale window [claimed reduced features]) corresponding to the features learnt by the ConvNet model; and the vector values are input, as the claimed reduced dimensions of the corresponding features learnt by the ConvNet model, to be processed by the claimed LSTM layers as depicted in Fig. 1)
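The arithmetic in the quoted Sec. 3.2 (four scale windows per position, 50 values per window, 200 values per position) can be made concrete with a short sketch. This is illustrative only: the function name `position_vector` and the placeholder feature values are hypothetical, and simple concatenation is assumed as the way the four window vectors are combined.

```python
NUM_SCALES = 4            # four windows per position, per the quoted Sec. 2
FEATURES_PER_WINDOW = 50  # penultimate ConvNet layer size, per the quoted Sec. 3.2

def position_vector(window_features):
    """Concatenate the per-window feature vectors for one scanning position.

    `window_features` is a list of NUM_SCALES lists, each holding
    FEATURES_PER_WINDOW values (assumed layout, for illustration).
    """
    if len(window_features) != NUM_SCALES:
        raise ValueError("expected one feature vector per scale window")
    vector = []
    for feats in window_features:
        if len(feats) != FEATURES_PER_WINDOW:
            raise ValueError("unexpected feature vector length")
        vector.extend(feats)
    return vector

# Placeholder values: window k is filled with the constant k.
windows = [[float(scale)] * FEATURES_PER_WINDOW for scale in range(NUM_SCALES)]
vec = position_vector(windows)  # 4 * 50 = 200 values
```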


(the claimed process for jointly training the CNN layers and fully connected CNN layers, and for training the LSTM layers using training data to determine the claimed parameters, in Sec. 5.1: … Four videos were used to generate a dataset of 15168 images of single characters perfectly segmented. This database—called CharDb—consists of 42 classes of characters (26 letters, 10 numbers, the space character and 5 special characters; namely ’.’, ’-’, ’(’, ’)’ and ’:’) and is used to train the ConvNet described in section 3.2. The remaining videos were annotated and divided into two sets: VidTrainDb and VidTestDb containing respectively 20 and 8 videos. While the first one is used to train the BLSTM,…)

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Elagouni et al. (NPL: “Text Recognition in Videos using a Recurrent Connectionist Approach”, hereinafter ‘Ela’) in view of Simard et al. (US Pub. No. 2007/0086655, ‘Sim’) and in further view of Sainath et al. (NPL: “Improvements to filterbank and delta learning within a deep neural network framework”, hereinafter ‘Sa’).

Regarding claim 5, the rejection of claim 1 is incorporated. While Ela in combination with Sim teaches input features being pre-processed by a convolutional neural network, Ela and Sim do not expressly teach the limitation wherein the input features include log-mel features having multiple dimensions.
Sa teaches the claim limitation wherein the input features include log-mel features having multiple dimensions. (in Sec. 2: Convolutional neural networks (CNN) are commonly trained with log-mel filterbank features, as well as the delta and double-delta of these features [6]. While the process of generating these features is often separate from the CNN training process, both the filter and delta learning stages can be seen as different layers within a neural network, and can be learned jointly with the rest of the CNN…; and features having multiple dimensions, in Sec. 3: …The baseline CNN system is trained with 40 dimensional log mel-filter features, along with the delta and double-deltas, which are per-speaker mean-and-variance normalized, rather than the speaker-independent globally normalized filter learning system proposed in [8]…)
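As an illustration of what “40 dimensional log mel-filter features, along with the delta and double-deltas” amounts to dimensionally, the sketch below stacks static, delta, and double-delta values per frame. It is a minimal sketch under stated assumptions: deltas are computed as simple frame-to-frame differences (speech front ends commonly use a regression over a window instead), the input values are toy numbers rather than real log-mel energies, and the function names are hypothetical.

```python
NUM_MEL = 40  # static log-mel dimension, per the quoted Sec. 3 of Sa

def deltas(frames):
    """First-order difference between consecutive frames (assumed delta definition).

    The first frame has no predecessor, so its delta is all zeros.
    """
    out = []
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame
        out.append([cur - p for cur, p in zip(frame, prev)])
    return out

def stack_with_deltas(frames):
    """Return, per frame, static + delta + double-delta values in one vector."""
    d1 = deltas(frames)
    d2 = deltas(d1)  # double-delta is the delta of the delta
    return [s + a + b for s, a, b in zip(frames, d1, d2)]

# Toy input: frame t holds the constant t*t in every filterbank channel.
frames = [[float(t * t)] * NUM_MEL for t in range(4)]
stacked = stack_with_deltas(frames)  # each frame now has 3 * 40 = 120 values
```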
The Ela, Sim, and Sa references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing a natural language information processing system using learning algorithms.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the use of log-mel filter banks for processing input features for training convolutional neural networks as disclosed by Sa with the method of processing natural language content using machine learning algorithms as collectively disclosed by Ela and Sim.
One of ordinary skill in the art would have been motivated to combine the disclosed methods of Ela, Sim, and Sa in order “to do feature extraction jointly with classification such that features are tuned to the classification task” (Sa, Introduction); doing so would support “performance across a variety of small and large vocabulary tasks” where “[t]he most popular features to use with CNNs are hand-crafted log-mel filter bank features” (Sa, Introduction).

	
	

	Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  


The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Baccouche (NPL: “Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks”, hereinafter ‘Bac’) teaches: a method comprising; a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising; and a computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: (Bac teaches the method for processing data using a processor and algorithm, with the claimed functions performed via a computer having storage for instructions to perform operations, in pg. 3: Sec. 3.2: … Thus, we perform a pre-processing step, inspired by the work in [6], which consists in detecting and blurring these logos. Once SIFT matches are computed, we robustly estimate the affine transformation while ignoring outliers (e.g. moving players) using the RANSAC algorithm [7], aiming at only preserving matches corresponding to the dominant motion.) receiving input features, wherein the input features include respective segment features for each of a plurality of segments; and processing the input features using a model, wherein the processing (the claimed receiving of input for the respective plurality of n segments, as depicted in Fig. 1, using feature extraction, as including the claimed convolutional neural network, in pg. 2: Sec. 2 & 3: The outline of the proposed approach is shown in Fig. 1. The aim is to classify soccer video sequences that are represented by a sequence of descriptors (one descriptor per image) corresponding to a set of features…; and the claimed generated first features are depicted as descriptors in Fig. 1:

[Figure: Baccouche, Fig. 1]

And the claimed spatial modeling of features as the extracted Bag of Words associated with a sequence of descriptors, in Sec. 3.2: Bag of words (BoW) are widely used models in image processing, and particularly in object recognition. The main idea is to represent an image by means of an histogram of visual words, corresponding each to a set of local features extracted from the image. In most cases, these features are SIFT descriptors [5]. In the proposed work, the appearance part of our descriptor is inspired by the work of Ballan et al. [3] where a video is represented by means of a sequence of visual BoW (one BoW per frame). To that aim, we generate a codebook of 30 words (empirical choice) resulting of a K-means classification [claimed spatial modeling on the input feature] applied to a large number of images extracted from the database. Then, for each video we associate a sequence of descriptors (one per image) having the same size as the codebook and containing values that encode the occurrence frequency of words present in the sequence…; and where a convolutional neural network can be used to extract the claimed spatial modeling on the input feature, in Sec. 6: … As future work, we plan to verify the genericity of the approach by testing it on other, more-complex video databases. We also plan to jointly learn feature extractors and classification network using a Convolutional Neural Network-LSTM approach.) generating second features for the segment by processing the first features using one or more long short-term memory network (LSTM) layers to perform temporal modeling over the first features; and determining an output feature based on at least the second features for the plurality of segments. (the claimed second features as temporal evolutions of the descriptors used to determine the claimed output feature as the classification output based on the temporal evolutions of the descriptors, in Sec. 4: Once the descriptors presented in the previous section are calculated, image by image, for each feature (bag of visual words and dominant motion), the next step consists in using them to classify the actions of the video sequences. We propose to use a particular recurrent neural network classifier, namely Long Short-Term Memory, in order to take benefits of its ability to use the temporal evolution of the descriptors for classification…)
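The per-frame bag-of-words descriptor described in the quoted Sec. 3.2 (a histogram, the same size as the codebook, of how often each visual word occurs) can be sketched as follows. This is a hedged illustration, not the reference’s implementation: the codebook here is a hypothetical set of 30 two-dimensional points rather than K-means centroids of SIFT descriptors, and the function names are made up for the example.

```python
def nearest_word(feature, codebook):
    """Return the index of the codeword closest to `feature` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, word in enumerate(codebook):
        dist = sum((f - w) ** 2 for f, w in zip(feature, word))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def bow_descriptor(local_features, codebook):
    """Build a normalized histogram of visual-word occurrences for one frame."""
    counts = [0] * len(codebook)
    for feature in local_features:
        counts[nearest_word(feature, codebook)] += 1
    total = len(local_features)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]

# Hypothetical codebook of 30 words (the quoted passage uses 30 words from K-means).
codebook = [[float(i), 0.0] for i in range(30)]
# Hypothetical local features extracted from a single frame.
frame_features = [[0.1, 0.0], [0.2, 0.1], [4.9, 0.0], [29.0, 0.3]]
descriptor = bow_descriptor(frame_features, codebook)
```

A video would then be represented by one such descriptor per frame, forming the sequence that the recurrent classifier consumes.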
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516.  The examiner can normally be reached on Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.







/O.O.A./Examiner, Art Unit 2126
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129