DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to the Amendment filed on February 16, 2022.  Claims 1, 6, and 11 are amended. Claims 2, 4, 5, 7, 9, 10, 12, 14, and 15 are cancelled.  Claims 1, 3, 6, 8, 11, and 13 are pending in the case.  Claims 1, 6, and 11 are the independent claims.  
This action is non-final.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on February 16, 2022 has been entered.
 
Applicant’s Response
In Applicant’s Amendment filed February 16, 2022, Applicant amended the claims in response to the objections to the claims and the rejections of the claims under 35 USC 103 and 112 in the previous office action.

Response to Argument/Amendment
Applicant’s amendments to the claims in response to the objections to the claims in the previous office action are acknowledged.  The objections are withdrawn in view of Applicant’s amendments.
Applicant’s amendments to the claims in response to the rejection of the claims under 35 USC 112 in the previous office action are acknowledged, and Applicant’s associated arguments have been fully considered.  As the amendments to the claims remove the basis for the rejection, the rejection is withdrawn.  However, new grounds of rejection, necessitated by Applicant’s amendments, are provided below.
Applicant’s amendments to the claims in response to the rejection of the claims under 35 USC 103 in the previous office action are acknowledged, and Applicant’s arguments have been fully considered.  Applicant’s arguments are persuasive in part.
Applicant appears to indicate that it disagrees with the previous office action regarding the rejection of the claims under 35 USC 103.  Regarding Garg, on pages 20-21 of Applicant’s remarks filed February 16, 2022, Applicant apparently argues that various features of the instant application are different from the model disclosed by Garg, including the CNN architecture consisting of two convolutional blocks each with three convolutional layers followed by a max-pooling layer, using three fully connected layers to regress over fingertip values, inputs to the Bi-LSTM network being 3x99x99 sized RGB images, etc.  However, Garg is not cited as teaching these limitations in the previous office action.  Instead, as cited in the previous office action, Dani clearly teaches these limitations (e.g. page 176 second column first full (i.e. second) paragraph, architecture consists of two convolutional blocks each with three convolutional layers followed by a max-pooling layer, and uses three fully connected layers to regress over two coordinate values of fingertip point at the last layer; determining continuous valued outputs corresponding to positions; page 177, Fig. 4 and its caption; fingertip regressor architecture as previously described, input to the network is 3x99x99 sized RGB images).
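For illustration only, the fingertip regressor layout described in the cited passages of Dani (two convolutional blocks, each with three convolutional layers followed by a max-pooling layer, then three fully connected layers regressing two coordinate values from a 3x99x99 RGB input) can be traced as a sequence of tensor shapes.  The sketch below is not taken from Dani or from the instant application.  The 3x3 kernels with padding 1, the 2x2 max pooling with stride 2, the per-block channel counts (32, 64), the fully connected widths (1024, 1024), and the function names are all assumptions chosen only to make the example concrete.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (assumed 3x3, stride 1, padding 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling (assumed 2x2, stride 2)."""
    return (size - kernel) // stride + 1


def regressor_shapes(channels=3, height=99, width=99,
                     block_channels=(32, 64), fc_sizes=(1024, 1024, 2)):
    """Trace layer output shapes for a two-block, three-FC-layer regressor.

    block_channels and the first two fc_sizes are hypothetical values;
    only the 3x99x99 input, the block structure, and the final
    two-coordinate output come from the cited description.
    """
    c, h, w = channels, height, width
    shapes = [("input", (c, h, w))]
    for b, out_c in enumerate(block_channels, start=1):
        for layer in range(1, 4):  # three convolutional layers per block
            h, w = conv_out(h), conv_out(w)
            c = out_c
            shapes.append((f"block{b}.conv{layer}", (c, h, w)))
        h, w = pool_out(h), pool_out(w)  # max-pooling closes each block
        shapes.append((f"block{b}.maxpool", (c, h, w)))
    shapes.append(("flatten", (c * h * w,)))
    for i, n in enumerate(fc_sizes, start=1):  # last layer outputs (x, y)
        shapes.append((f"fc{i}", (n,)))
    return shapes


if __name__ == "__main__":
    for name, shape in regressor_shapes():
        print(f"{name:16s} {shape}")
```

Under these assumed hyperparameters the final layer has two outputs, corresponding to the two fingertip coordinate values over which the network regresses.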
Regarding Dani, on page 21 of Applicant’s remarks filed February 16, 2022, Applicant briefly describes its teachings, and also describes related aspects of the instant application.  Although Applicant does not specifically argue any particular limitation found in the instant claims which Dani fails to teach, it appears that the primary difference between Dani and the instant claims is that Dani teaches various models for object detection, i.e. hand localization in this instance, including Faster RCNN (but also including MobileNet as cited in the previous office action), while the instant claims specifically recite utilizing MobileNetV2 for hand localization.
Examiner agrees (i.e. to the extent that this is argued by Applicant) that, although Dani teaches use of deep learning models for hand localization, including MobileNet, Garg, Dani, and Kumar do not explicitly disclose the use of MobileNetV2 for hand localization.
In the majority of Applicant’s remaining remarks, Applicant appears to state “[o]n the contrary…” and to provide what appears to be a list of various limitations recited in the Applicant’s claims and what appear to be quotations from the specification of the instant application regarding various benefits of the various embodiments of the invention (as compared to “existing/conventional technique(s)”) as described in the instant application (e.g. the majority of page 22, the majority of page 23, the majority of page 24, and the top half of page 25 of Applicant’s remarks filed on February 16, 2022).  
Examiner notes that many of the features and benefits discussed in these remarks do not appear to actually be recited in the claims.  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  Moreover, with respect to the features and benefits which actually are recited in the claims, the previous office action has already cited Garg and Dani as teaching these features (see, e.g. the previous office action and below).  Examiner notes that Applicant provides absolutely no argument regarding these cited teachings of Garg and Dani, other than the bare allegation that it would not be possible to combine their teachings with Kumar to arrive at the claimed invention.  Therefore, the only remaining reference cited in the previous rejection (i.e. not including newly-cited references in the Advisory action mailed January 25, 2022, which are discussed further below) which Applicant argues against appears to be Kumar.  To the extent that Applicant argues features and benefits of the instant invention which are not actually recited in the claims, this argument is not persuasive on the basis that these features and benefits are not actually recited in the claims.  To the extent that Applicant argues that features and benefits of the instant invention which are recited in the claims are not taught by Kumar, these arguments are not persuasive on the basis that Garg and Dani (as previously cited) teach all of these limitations, which Applicant does not appear to rebut, and it is therefore not necessary for Kumar to teach them.
With respect to Kumar, Applicant appears to argue (i.e. on pages 21-22 of Applicant’s remarks dated February 16, 2022) that “Kumar teaches…combining neural network based objection detection and tracking with gesture recognition systems for use in user-interactions with a virtual reality application.  Specifically, the system takes, as input, a video feed of a user’s hands, and is able to track and understand a variety of movements and gestures.  On the other hand, in the present application, embodiments describe a computationally effective hand gesture recognition framework that works without depth information and the need of specialized hardware, thereby providing mass accessibility of gestural interfaces to most affordable video see-through HMDs.”
To the extent that Applicant appears to argue that the instant application is distinct from Kumar due to working without depth information, specialized hardware, etc., Examiner notes that Kumar also appears to teach a system for performing hand gesture recognition without depth information, specialized hardware, etc., or as recited in the independent claims, wherein the hand gesture recognition framework works without depth information and a need for specialized hardware, the hand gesture recognition capable of providing mass accessibility of one or more gestural interfaces (e.g. abstract, gesture recognition, recognizing fingers of user and gestures of interest from series of images; paragraph 0005, utilizing simple RGB camera without using depth camera; paragraph 0031, utilizing simple RGB camera, depth camera not necessary; paragraph 0038, using simple RGB camera without using depth camera).  Moreover, even if Kumar did not teach this limitation, it is also taught by Dani (e.g. page 174, first column, abstract paragraph, “we demonstrate a cost-effective solution…using frugal devices….we propose the use of intuitive pointing fingertip gestural interface….”; page 175, first column, lines 1-4, “we attempt to recognize…using a single RGB monocular camera while addressing challenges such as (i) lack of additional depth/IR sensors on smartphones….”; page 175, first column, first full paragraph, “opens avenues for rich user-interaction on frugal devices”; page 175, second column, first full paragraph, “we detect pointing hand gesture…using RGB data as the input, without additional depth information…”; page 175, second column, fourth full paragraph, “real-time and markerless gesture recognition…without the need for additional depth information.”).  Applicant is silent regarding these teachings in both Kumar and Dani.  Therefore, this argument is not persuasive.
Applicant additionally argues (i.e. on page 23 of Applicant’s remarks dated February 16, 2022) that “Kumar is completely silent of the classification of fingertip patterns, as disclosed in the amended claim 1.  Further as fingertip motion pattern recognition in real-time is not disclosed in Kumar, hence fingertip localization is also not taught by Kumar.”  
However, Examiner notes that Kumar does appear to teach classification/recognition of fingertip patterns in real time, and fingertip localization (e.g. paragraph 0005, recognizing and tracking fingers, gesture recognition; paragraph 0007, recognizing fingers of user, gestures of interest; paragraph 0021, interacting with user’s hand motions in real-time; paragraph 0025, predicting locations of fingertips; neural network which yields, in real-time, location and size of fingertips shown in image; paragraph 0026, fingertip tracking; paragraph 0028, gesture recognition, sequences of images; paragraph 0032, combining fingertip detection/tracking and gesture recognition).  Therefore, Applicant’s arguments regarding these alleged deficiencies of Kumar are also not persuasive.
In addition, Garg and Dani each also appear to teach fingertip gesture classification/recognition in real-time (e.g. Garg page 231 first full paragraph, architecture is for efficient classification of user gestures, works in real-time, ported on mobile devices due to low memory footprint; Dani page 175 second column, fourth full paragraph, real-time gesture recognition; hand candidate detection given an RGB input image; page 175 second column, fourth full paragraph, fingertip regressor accurately estimating fingertip spatial location given hand candidate detection from previous block as input).  Therefore, due to these similarities between Garg, Dani, and Kumar, and in the absence of any particular rationale or evidence to the contrary other than Applicant’s bare allegation, Applicant’s allegation that it would not be possible to combine the teachings of Garg, Dani, and Kumar is also not persuasive.
To summarize, while Applicant provides a listing of various features and benefits of the instant Application, many of these features and benefits are not actually recited in the claims, and those features and benefits which are recited in the claims have already been cited as being taught by the combination of Garg, Dani, and Kumar.  Applicant provides no argument at all regarding these cited teachings of Garg and Dani.  Applicant also provides no specific argument that Kumar does not teach the limitations which it has been cited as teaching.  While Applicant does appear to provide arguments attempting to distinguish Kumar from the instant invention on various bases, for the reasons stated above (based on the explicit teachings of Kumar) Examiner disagrees with Applicant’s assessment (and Examiner additionally notes that many of these alleged elements, where actually claimed, are already taught by Garg and Dani and, therefore, no teaching by Kumar is necessary).  Essentially, the majority of the limitations as actually recited in the claims are taught by Garg and Dani, which Applicant does not appear to rebut.  While Garg and Dani do not specifically teach that “the absence of a positive pointing-finger hand detection on a set of consecutive frames…is indicative of the end of the hand gesture” the detection of this particular gesture/absence for this particular purpose is taught by Kumar, which appears to teach a similar system implemented for similar reasons.  Applicant does not appear to argue that Kumar does not teach this limitation.
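For illustration only, the start/end logic discussed above (a positive pointing-finger hand detection on a set of consecutive frames indicating the start of a hand gesture, and the absence of such a detection on a set of consecutive frames indicating its end) can be sketched as a simple run-length segmentation over per-frame detection flags.  The sketch is not drawn from Garg, Dani, Kumar, or the instant application.  The function name, the boolean per-frame input, and the run length of three frames are assumptions, as the number of consecutive frames is not specified in the passages discussed here.

```python
def segment_gestures(detections, min_run=3):
    """Split a stream of per-frame detection flags into gesture segments.

    detections: iterable of booleans, True when a pointing-finger hand is
                detected in that frame (hypothetical detector output).
    min_run:    assumed number of consecutive frames needed to start or
                end a gesture.
    Returns a list of (start, end) frame indices, with end exclusive.
    """
    segments = []
    active = False
    start = 0
    pos_run = 0  # current run of consecutive positive detections
    neg_run = 0  # current run of consecutive missing detections
    count = 0
    for i, hit in enumerate(detections):
        count = i + 1
        if hit:
            pos_run += 1
            neg_run = 0
            if not active and pos_run >= min_run:
                active = True
                start = i - min_run + 1  # gesture began when the run began
        else:
            neg_run += 1
            pos_run = 0
            if active and neg_run >= min_run:
                active = False
                segments.append((start, i - min_run + 1))
    if active:
        # Stream ended while a gesture was still in progress.
        segments.append((start, count))
    return segments


if __name__ == "__main__":
    flags = [False, True, True, True, True, False,
             True, True, False, False, False, False]
    print(segment_gestures(flags))
```

In this sketch a single missed detection (frame 5 in the usage example) does not end the gesture; only a full run of `min_run` missed frames does.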
Applicant additionally provides arguments regarding the Peng and Howard references, which were cited as potentially relevant in the Advisory Action mailed on January 25, 2022.  As these references have not been formally cited in a previous rejection, and are not cited in the new grounds of rejection below, Applicant’s arguments regarding Peng and Howard are generally moot in view of the new grounds of rejection provided below.  However, various limitations argued by Applicant as not being taught by Peng and Howard are already taught by Garg and Dani.  
For example, Garg clearly teaches:
computing by the Bi-LSTM Network executed via the one or more hardware processors on the mobile communication device a probability score, wherein the probability score indicates the probability of the fingertip motion pattern to be identified as a candidate gesture (e.g. page 236 second full paragraph, gesture patterns fed as input to Bi-LSTM layer; passing to fully connected layer with 10 output scores that correspond to each of the 10 gestures; caption to Fig. 6, gesture is detected when the predicted probability is more than 0.75; i.e. the Bi-LSTM layer is used to generate, for a given pattern, a score for each gesture, and a given gesture is detected when this probability is greater than 0.75);
recognizing by the one or more hardware processors on the mobile communication device, the one or more hand gestures for a plurality of Augmented Reality (AR) wearable device with a monocular Red Green Blue (RGB) camera by using a limited amount of labelled classification data (e.g. page 229, Abstract, enable mass market reach via inexpensive Augmented Reality (AR) headsets without built-in depth or IR sensors; real-time, in-air gestural framework that works on monocular RGB input; classifier trained on limited classification data; page 232, first full paragraph, computationally efficient pointing pose-based gesture recognition using just RGB data; page 232, final paragraph, continuing on page 233, pointing hand gestural framework with limited labelled classification data, classifying motion patterns into gestures; page 239, Conclusion, the entire framework works just with monocular RGB data at real-time and can be used with frugal AR devices without any sensor fusion);
using a Bi-directional Long Short-Term Memory (Bi-LSTM) model to classify the one or more hand gestures (e.g. page 229, Abstract, Bi-LSTM for real-time pointing hand gesture classification; page 233, first paragraph, Bi-LSTM network for classification into gestures; page 239, Conclusion, gesture classification).
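For illustration only, the thresholding described in the cited passages of Garg (a fully connected layer with 10 output scores, one per gesture, with a gesture detected when the predicted probability is more than 0.75) can be sketched as follows.  The sketch is not Garg's implementation.  The use of a softmax to convert scores into probabilities and the function names are assumptions, and the Bi-LSTM itself is not modeled; only the 10 scores and the 0.75 threshold come from the cited description.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (assumed normalization step)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def detect_gesture(scores, threshold=0.75):
    """Return the index of the detected gesture, or None if no gesture
    has a predicted probability above the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] > threshold else None


if __name__ == "__main__":
    confident = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    uncertain = [0.0] * 10
    print(detect_gesture(confident))
    print(detect_gesture(uncertain))
```

With ten equal scores every class has probability 0.1, so no gesture is reported; a single dominant score pushes its class above the threshold.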
Moreover, Dani clearly teaches:
computing by the one or more hardware processors on the mobile communication device, the object detector architecture to localize the one or more hand candidates, wherein the object detector architecture outputs at least one hand candidate bounding box from the plurality of hand candidate bounding boxes that comprises the hand candidate (e.g. page 175 second column, fourth full paragraph, real-time gesture recognition; hand candidate detection given an RGB input image; page 176, Fig. 3 and its caption, along with first column, section 3.1, taking RGB input image and outputting hand candidate bounding box, detecting specific pointing hand pose, such as using Faster R-CNN, YOLOv2, or MobileNet; predicting object bounding boxes along with confidence probabilities);
recognizing by the one or more hardware processors on the mobile communication device, the one or more hand gestures for a plurality of Augmented Reality (AR) wearable device with a monocular Red Green Blue (RGB) camera by using a limited amount of labelled classification data (e.g. page 175, first paragraph in first column, recognize pointing pose using single RGB monocular camera; page 175, final paragraph in second column, pointing hand gestural framework for frugal wearable devices with single monocular camera; real-time and markerless gesture detection; page 178, Conclusion in first column, pointing hand gesture recognition based framework for interacting with wearable devices such as Google Cardboard and VR Box etc., using just monocular RGB data).
Examiner notes that Applicant does not appear to provide any discussion regarding these teachings of Garg and Dani with respect to these newly-cited limitations.  
Applicant then concludes that “Based on the claim amendments and aforementioned arguments, it would not be possible for person having ordinary skill in the art to combined teaching of Garg, Dani, Kumar, Peng, and Howard to arrive at the amended claim 1” (i.e. on page 29 of Applicant’s remarks dated February 16, 2022).
Applicant’s arguments are not persuasive with respect to a variety of the amended claim limitations for the reasons discussed above.  However, as noted above, and in the Advisory Action mailed on January 25, 2022, although Dani appears to teach the use of MobileNet as an object detector, Examiner agrees that Garg, Dani, and Kumar do not explicitly disclose the use of MobileNetV2, specifically, as the object detector (i.e. for hand localization).
Therefore, Applicant’s arguments are persuasive in part, and the rejection is withdrawn.  However, new grounds of rejection are provided below.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1, 3, 6, 8, 11, and 13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
With respect to claims 1, 6, and 11, these claims recite, on lines 56-57 (in claim 1, and similarly in claims 6 and 11), “using a Bi-Directional Long Short-Term Memory (Bi-LSTM) model to classify the one or more hand gestures.”  However, prior to this recitation, the claims also recite “classifying in real-time, via the Bi-LSTM Network comprised in the CDLM…the fingertip motion pattern into one or more hand gestures…”  Since the claims recite both a Bi-LSTM model and a Bi-LSTM Network, each utilized to perform hand gesture classification, it is not clear whether the claims intend to recite two different Bi-LSTM architectures (i.e. a network and a model), each separately performing hand gesture classification, or if the claims intend to refer to only one Bi-LSTM architecture (i.e. the network comprised in the CDLM, where the later reference to the “model” is intended to refer to this same entity), performing the same hand gesture classification task.  Therefore, the claims are indefinite.  In the interest of providing full examination on the merits, these limitations are interpreted as if they are intended to refer to the same Bi-LSTM network, performing the same hand gesture classification task.
In addition, claims 1, 6, and 11 recite, on lines 19-20 (in claim 1, and similarly in claims 6 and 11), “the pointing gesture pose triggers a regression convolutional neural network for fingertip localization.”  However, prior to this limitation, the claims recite “wherein the CDLM comprises…a Fingertip regressor…” and subsequent to this limitation, the claims also recite “detecting in real-time, using the Fingertip regressor comprised in the CDLM…a spatial location of a fingertip…”  Since the claims refer to both a fingertip regressor which is used for detecting fingertip location, and a regression convolutional neural network for fingertip localization, it is not clear whether the claims intend to recite two different entities (i.e. fingertip regressor and regression convolutional neural network), each separately performing a same/similar task (i.e. detecting fingertip location/fingertip localization), or if the claims intend to refer to a single entity, i.e. a fingertip regressor, which is a regression convolutional neural network, performing the same fingertip localization task.  Therefore, the claims are indefinite.  In the interest of providing full examination on the merits, these limitations are interpreted as if they are intended to refer to the same entity, i.e. a fingertip regressor, which is a regression convolutional neural network, performing the same fingertip localization task.
With respect to claims 3, 8, and 13, these claims depend upon claims 1, 6, and 11, respectively, and inherit the deficiencies identified above with respect to claims 1, 6, and 11.  Therefore, these claims are rejected on the same basis as is identified above with respect to claims 1, 6, and 11.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims under pre-AIA 35 U.S.C. 103(a), the examiner presumes that the subject matter of the various claims was commonly owned at the time any inventions covered therein were made absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and invention dates of each claim that was not commonly owned at the time a later invention was made in order for the examiner to consider the applicability of pre-AIA 35 U.S.C. 103(c) and potential pre-AIA 35 U.S.C. 102(e), (f) or (g) prior art under pre-AIA 35 U.S.C. 103(a).
Claims 1, 3, 6, 8, 11, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Garg, Gaurav & Hegde, Srinidhi & Perla, Ramakrishna & Jain, Varun & Vig, Lovekesh & Hebbalaguppe, Ramya. (2019). DrawInAir: A Lightweight Gestural Interface Based on Fingertip Regression. In: Leal-Taixé L., Roth S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science, vol 11134. Springer, Cham. https://doi.org/10.1007/978-3-030-11024-6_15.  [retrieved on May 18, 2021].  Retrieved from the Internet:  https://link.springer.com/content/pdf/10.1007%2F978-3-030-11024-6_15.pdf.  (Hereinafter referred to as “Garg”) in view of Dani, Meghal & Garg, Gaurav & Perla, Ramakrishna & Hebbalaguppe, Ramya.  (2018). Mid-Air Fingertip-Based User Interaction in Mixed Reality. 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), 2018, pp. 174-178, doi: 10.1109/ISMAR-Adjunct.2018.00061. [retrieved on May 18, 2021].  Retrieved from the Internet:  https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8699224.  (Hereinafter referred to as “Dani”), further in view of Kumar et al. (US 20170161555 A1) (Hereinafter referred to as “Kumar”), further in view of Islam, Md Jahidul.  (2018).  Understanding Human Motion and Gestures for Underwater Human-Robot Collaboration.  https://doi.org/10.48550/arXiv.1804.02479.  [retrieved on September 28, 2022].  Retrieved from the Internet:  https://arxiv.org/pdf/1804.02479.pdf.  (Hereinafter referred to as “Islam”).
With respect to claims 1, 6, and 11, Garg teaches 
a system for classification of fingertip motion patterns into gestures, the system comprising: a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces (e.g. page 230 first full paragraph, frugal HMDs/smartphones; i.e. where HMD/smartphone includes hardware processor executing instructions stored in memory), wherein the one or more hardware processors are configured by the instructions to perform a method, 
one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause an on-device classification of fingertip motion patterns into gestures (e.g. page 230 first full paragraph, frugal HMDs/smartphones; i.e. where HMD/smartphone includes hardware processor executing instructions stored in memory) by performing the method, and
the method, which is a processor implemented method for an on-device classification of fingertip motion patterns into gestures, the method comprising: 
receiving in real-time, in a Cascaded Deep Learning Model (CDLM) executed via the one or more hardware processors of a mobile communication device, a plurality of Red, Green and Blue (RGB) input images from a real-time feed or a video from an image capturing device, wherein each of the plurality of RGB input images comprises a hand gesture (e.g. page 230, first and second full paragraphs, frugal HMD/smartphone; neural network architecture; page 231 first full paragraph, neural network architecture uses only RGB image sequence, works in real-time; Fig. 2 and its caption, classifying images and subsequent frames of hand including fingertip into different gestures; page 234 second full paragraph, adaptable to videos/live feeds); 
wherein the CDLM comprises an object detector, a Fingertip regressor and a Bidirectional Long Short Term Memory (Bi-LSTM) Network, for accurate gesture recognition, wherein the CDLM ported on the mobile communication device and removes hand gesture recognition framework dependence on a remote server (e.g. page 230, second full paragraph, neural network architecture comprising of a base CNN and DSNT layer followed by a Bi-LSTM; layer transforms heatmap from CNN to output spatial location of fingertip; page 231 first full paragraph, architecture is for efficient classification of user gestures, works in real-time, ported on mobile devices due to low memory footprint (i.e. and therefore removes dependence on server); page 231, Fig. 2 and its caption, DrawInAir comprises a Fingertip Regressor module which accurately localize the fingertip and Bi-LSTM network for classification);
detecting in real-time, using the Fingertip regressor comprised in the CDLM executed via the one or more hardware processors on the mobile communication device, a spatial location of a fingertip from the images (e.g. page 231 first full paragraph, system works in real-time, implemented on mobile device; Fig. 2 and its caption, fingertip regressor module localizes fingertip; page 233 second full paragraph, regressing over coordinates x, y of the fingertip),
wherein the spatial location of the fingertip from the hand candidates represents a fingertip motion pattern (e.g. page 232 final paragraph through page 233 first paragraph, classifying point gesture motion patterns into different gestures),
classifying in real-time, via the Bi-LSTM Network comprised in the CDLM executed via the one or more hardware processors on the mobile communication device, using the first coordinate and the second coordinate from the spatial location of the fingertip, the fingertip motion pattern into the one or more hand gestures (e.g. page 231 first full paragraph, system works in real-time, implemented on mobile device; page 234 first full paragraph, spatial location of fingertip fed to gesture classification network; employing Bi-LSTM for classification of gestures),
wherein the spatial location of the fingertip is detected based on a presence of a positive pointing-finger hand detection on a set of consecutive frames in a second set of RGB input images from the plurality of RGB input images, and wherein the presence of the positive pointing-finger hand detection is indicative of a start of the hand gesture (e.g. page 231 first full paragraph, neural network architecture uses only RGB image sequence, works in real-time; Fig. 2 and its caption, classifying images and subsequent frames of hand including fingertip into different gestures; page 234 second full paragraph, adaptable to videos/live feeds; page 234 first full paragraph, using only gestures that have pointing fingers; i.e. where a first set of all captured/processed images contains at least one second set of images which comprise a recognized/classified gesture);
computing by the Bi-LSTM Network executed via the one or more hardware processors on the mobile communication device a probability score, wherein the probability score indicates the probability of the fingertip motion pattern to be identified as a candidate gesture (e.g. page 236 second full paragraph, gesture patterns fed as input to Bi-LSTM layer; passing to fully connected layer with 10 output scores that correspond to each of the 10 gestures; caption to Fig. 6, gesture is detected when the predicted probability is more than 0.75; i.e. the Bi-LSTM layer is used to generate, for a given pattern, a score for each gesture, and a given gesture is detected when this probability is greater than 0.75);
recognizing by the one or more hardware processors on the mobile communication device, the one or more hand gestures for a plurality of Augmented Reality (AR) wearable device with a monocular Red Green Blue (RGB) camera by using a limited amount of labelled classification data (e.g. page 229, Abstract, enable mass market reach via inexpensive Augmented Reality (AR) headsets without built-in depth or IR sensors; real-time, in-air gestural framework that works on monocular RGB input; classifier trained on limited classification data; page 232, first full paragraph, computationally efficient pointing pose-based gesture recognition using just RGB data; page 232, final paragraph, continuing on page 233, pointing hand gestural framework with limited labelled classification data, classifying motion patterns into gestures; page 239, Conclusion, the entire framework works just with monocular RGB data at real-time and can be used with frugal AR devices without any sensor fusion);
using a Bi-directional Long Short-Term Memory (Bi-LSTM) model to classify the one or more hand gestures (e.g. page 229, Abstract, Bi-LSTM for real-time pointing hand gesture classification; page 233, first paragraph, Bi-LSTM network for classification into gestures; page 239, Conclusion, gesture classification).
Garg does not explicitly disclose:
detecting in real-time, using the object detector comprised in the CDLM executed via the one or more hardware processors on the mobile communication device, a plurality of hand candidate bounding boxes from the received plurality of RGB input images, wherein each of the plurality of hand candidate bounding boxes is specific to a corresponding RGB image from the received plurality of RGB input images, wherein each of the plurality of hand candidate bounding boxes comprises a hand candidate, and wherein each of the plurality of hand candidate bounding boxes comprising the hand candidate depicts a pointing gesture pose to be utilized for classifying into one or more hand gestures,
and wherein the pointing gesture pose triggers a regression convolutional neural network for fingertip localisation; 
downscaling in real-time, the hand candidate from each of the plurality of hand candidate bounding boxes to obtain a set of down-scaled hand candidates, wherein downscaling comprises downscaling a first set of RGB input images from the plurality of RGB input images comprising hand candidates to a specific resolution to reduce processing time without compromising on quality of image features; 
that the spatial location of the fingertip is detected from each down-scaled hand candidate from the set of down-scaled hand candidates, where the spatial location of the fingertip from the set of down-scaled hand candidates represents a fingertip motion pattern;
wherein the Fingertip regressor is implemented based on a Convolutional Neural Network (CNN) architecture to localize a first coordinate and a second coordinate of the fingertip, wherein the CNN consists of two convolutional blocks and three fully connected layers to regress over the fingertip spatial location, wherein each of the two convolutional blocks have three convolutional layers followed by a max-pooling layer;
wherein the hand gesture recognition framework is capable of providing mass accessibility of one or more gestural interfaces;
computing by the one or more hardware processors on the mobile communication device, the object detector architecture to localize the one or more hand candidates, wherein the object detector architecture outputs at least one hand candidate bounding box from the plurality of hand candidate bounding boxes that comprises the hand candidate.
However, Dani teaches:
detecting in real-time, using the object detector comprised in the CDLM executed via the one or more hardware processors on the mobile communication device, a plurality of hand candidate bounding boxes from the received plurality of RGB input images, wherein each of the plurality of hand candidate bounding boxes is specific to a corresponding RGB image from the received plurality of RGB input images, wherein each of the plurality of hand candidate bounding boxes comprises a hand candidate, and wherein each of the plurality of hand candidate bounding boxes comprising the hand candidate depicts a pointing gesture pose to be utilized for classifying into one or more hand gestures (e.g. page 175 second column, fourth full paragraph, real-time gesture recognition; hand candidate detection given an RGB input image; page 176, Fig. 3 and its caption, along with first column, section 3.1, taking RGB input image and outputting hand candidate bounding box, detecting specific pointing hand pose, such as using Faster R-CNN, YOLOv2, or MobileNet),
and wherein the pointing gesture pose triggers a regression convolutional neural network for fingertip localization (e.g. page 176, section 3.2, first paragraph continuing into the second column, “hand candidate detection (pointing gesture pose)…triggers the regression CNN for fingertip localization…”); 
downscaling in real-time, the hand candidate from each of the plurality of hand candidate bounding boxes to obtain a set of down-scaled hand candidates, wherein downscaling comprises downscaling a first set of RGB input images from the plurality of RGB input images comprising hand candidates to a specific resolution to reduce processing time without compromising on quality of image features (e.g. page 175, Fig. 2 and its caption, smartphone sends downscaled video frames to gesture recognition framework; page 175 second column, third full paragraph, each frame is down-scaled to 640x480 resolution to achieve real-time performance by reducing computational time; page 176, Fig. 3 and its caption, cropping and resizing hand candidate to feed into fingertip regressor; page 176 second column, first paragraph, hand candidate bounding box is cropped and resized to 99x99 resolution; i.e. the smartphone sends the downscaled frames and therefore performs the down-scaling; i.e. where a first set of images is downscaled for sending to the gesture recognition framework, and then gestures appearing in a second set of images within the first set are identified); 
that the spatial location of the fingertip is detected from each down-scaled hand candidate from the set of down-scaled hand candidates, where the spatial location of the fingertip from the set of down-scaled hand candidates represents a fingertip motion pattern (e.g. page 175 second column, fourth full paragraph, fingertip regressor accurately estimating fingertip spatial location given hand candidate detection from previous block as input; page 176, Fig. 3 and its caption, cropped and resized hand candidate fed to fingertip regressor block for accurately localizing fingertip; page 176, second column first and second paragraphs, regressing over x, y coordinates of the fingertip, determining continuous valued outputs corresponding to fingertip positions);
wherein the Fingertip regressor is implemented based on a Convolutional Neural Network (CNN) architecture to localize a first coordinate and a second coordinate of the fingertip, wherein the CNN consists of two convolutional blocks and three fully connected layers to regress over the fingertip spatial location, wherein each of the two convolutional blocks have three convolutional layers followed by a max-pooling layer (e.g. page 176 second column first full (i.e. second) paragraph, architecture consists of two convolutional blocks each with three convolutional layers followed by a max-pooling layer, and uses three fully connected layers to regress over two coordinate values of fingertip point at the last layer; determining continuous valued outputs corresponding to positions; page 177, Fig. 4 and its caption; fingertip regressor architecture as previously described);
wherein the hand gesture recognition framework is capable of providing mass accessibility of one or more gestural interfaces (e.g. page 174, first column, abstract paragraph, “we demonstrate a cost-effective solution…using frugal devices….we propose the use of intuitive pointing fingertip gestural interface….”; page 175, first column, lines 1-4, “we attempt to recognize…using a single RGB monocular camera while addressing challenges such as (i) lack of additional depth/IR sensors on smartphones….”; page 175 first column, first full paragraph, “opens avenues for rich user-interaction on frugal devices”; page 175, second column, first full paragraph, “we detect pointing hand gesture…using RGB data as the input, without additional depth information…”; page 175, second column, fourth full paragraph, “real-time and markerless gesture recognition…without the need for additional depth information.”);
computing by the one or more hardware processors on the mobile communication device, the object detector architecture to localize the one or more hand candidates, wherein the object detector architecture outputs at least one hand candidate bounding box from the plurality of hand candidate bounding boxes that comprises the hand candidate (e.g. page 175 second column, fourth full paragraph, real-time gesture recognition; hand candidate detection given an RGB input image; page 176, Fig. 3 and its caption, along with first column, section 3.1, taking RGB input image and outputting hand candidate bounding box, detecting specific pointing hand pose, such as using Faster R-CNN, YOLOv2, or MobileNet; predicting object bounding boxes along with confidence probabilities).
Dani additionally teaches recognizing by the one or more hardware processors on the mobile communication device, the one or more hand gestures for a plurality of Augmented Reality (AR) wearable device with a monocular Red Green Blue (RGB) camera by using a limited amount of labelled classification data (e.g. page 175, first paragraph in first column, recognize pointing pose using single RGB monocular camera; page 175, final paragraph in second column, pointing hand gestural framework for frugal wearable devices with single monocular camera; real-time and markerless gesture detection; page 178, Conclusion in first column, pointing hand gesture recognition based framework for interacting with wearable devices such as Google Cardboard and VR Box etc., using just monocular RGB data).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention having the teachings of Garg and Dani in front of him to have modified the teachings of Garg (directed to a lightweight gestural interface based on fingertip regression) to incorporate the teachings of Dani (directed to fingertip based user interaction in mixed reality) to incorporate, within the neural network architecture (i.e. of Garg), the capabilities for the object detector to detect hand candidate bounding boxes for each RGB input image, perform downscaling of the RGB input images and hand candidate bounding boxes to specific resolutions, and perform fingertip location detection on the downscaled hand candidate bounding boxes using the fingertip regressor, where the fingertip regressor is implemented using two convolutional blocks and three fully connected layers to regress over the fingertip spatial location, wherein each of the two convolutional blocks have three convolutional layers followed by a max-pooling layer (i.e. incorporating the object and bounding box detection/downscaling architectures of Dani into the neural network of Garg, and further implementing the fingertip regressor, which is also taught by Garg, using the architecture of Dani).  One of ordinary skill would have been motivated to perform such a modification in order to overcome limitations with existing techniques and open avenues for rich user-interaction on frugal devices as described in Dani (page 175 first column, first full paragraph).
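For illustration only, the fingertip regressor layout cited above from Dani (two convolutional blocks, each with three convolutional layers followed by a max-pooling layer, then three fully connected layers, applied to a 99x99 hand candidate) may be sketched as a layer listing. The 3x3 kernels with padding 1, the 2x2 pooling with stride 2, and all function and layer names are assumptions made for this sketch; the cited passages do not state them.

```python
from typing import List, Optional, Tuple


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial size after a convolution (assumed 3x3, padding 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial size after max-pooling (assumed 2x2, stride 2)."""
    return (size - kernel) // stride + 1


def regressor_layout(input_size: int = 99, blocks: int = 2,
                     convs_per_block: int = 3,
                     fc_layers: int = 3) -> List[Tuple[str, Optional[int]]]:
    """List (layer name, spatial size) pairs; fully connected layers
    have no spatial size and are listed with None."""
    layers: List[Tuple[str, Optional[int]]] = []
    size = input_size
    for b in range(1, blocks + 1):
        for c in range(1, convs_per_block + 1):
            size = conv_out(size)
            layers.append((f"block{b}_conv{c}", size))
        size = pool_out(size)
        layers.append((f"block{b}_maxpool", size))
    for f in range(1, fc_layers + 1):
        # The last fully connected layer would regress the two coordinates.
        layers.append((f"fc{f}", None))
    return layers


layout = regressor_layout()
```

Under the assumed kernel and pooling sizes, the spatial size goes from 99 to 49 after the first block and to 24 after the second, before the fully connected layers regress over the two fingertip coordinates.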
Garg and Dani do not explicitly disclose wherein an absence of a positive pointing-finger hand detection on a set of consecutive frames in the plurality of RGB input images is indicative of an end of the hand gesture.  However, Kumar teaches wherein an absence of a positive pointing-finger hand detection on a set of consecutive frames in the plurality of RGB input images is indicative of an end of the hand gesture (e.g. paragraph 0033, single finger pointing gesture detected over plurality of images, performing drawing/tracing operation based on gesture; paragraph 0034, image 108 fed into system, five fingertips are all detected; when multiple fingertips are detected, the system no longer draws on the screen; in first frame of this gesture, system is unable to determine swipe right gesture is occurring, and therefore correctly predicts current frame 108 has no swipe-right gesture performed; image 110 fed into system, still too early to determine swipe right gesture; image 112 fed into system, no fingertips tracked/detected, detecting swipe right gesture performed based on context of previous frames; i.e. the system detects that the pointing gesture for drawing/tracing on the screen is ended when multiple and/or no fingers are detected, and, therefore, the drawing/tracing operation corresponding to the gesture is stopped).
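For illustration only, treating an absence of positive pointing-finger detections over a set of consecutive frames as the end of a gesture may be sketched as follows. The function name and the miss_limit value of 3 are hypothetical; neither the claim language nor the cited paragraphs of Kumar specify a frame count.

```python
from typing import Iterable, Optional


def find_gesture_end(detections: Iterable[bool],
                     miss_limit: int = 3) -> Optional[int]:
    """Return the index of the frame at which the gesture is deemed ended,
    or None if it has not ended.

    detections: per-frame flags, True when a pointing-finger hand is detected.
    miss_limit: hypothetical number of consecutive misses that ends a gesture.
    """
    started = False
    misses = 0
    for index, hit in enumerate(detections):
        if hit:
            started = True
            misses = 0          # a positive detection resets the miss counter
        elif started:
            misses += 1
            if misses >= miss_limit:
                return index    # enough consecutive misses: gesture ended
    return None


# Hypothetical detection flags for eight consecutive frames.
frames = [False, True, True, False, True, False, False, False]
end_index = find_gesture_end(frames)
```

A single missed frame in the middle of the gesture (index 3 in the sample) does not end it, because the following positive detection resets the counter.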
Kumar further teaches wherein the hand gesture recognition framework works without depth information and a need for specialized hardware, the hand gesture recognition capable of providing mass accessibility of one or more gestural interfaces (e.g. abstract, gesture recognition, recognizing fingers of user and gestures of interest from series of images; paragraph 0005, utilizing simple RGB camera without using depth camera; paragraph 0031, utilizing simple RGB camera, depth camera not necessary; paragraph 0038, using simple RGB camera without using depth camera).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention having the teachings of Garg, Dani, and Kumar in front of him to have modified the teachings of Garg (directed to a lightweight gestural interface based on fingertip regression) and Dani (directed to fingertip based user interaction in mixed reality), to incorporate the teachings of Kumar (directed to improved virtual reality interaction utilizing deep learning) to include the capability to detect, after detecting a single finger pointing gesture corresponding to a drawing operation, that the single finger gesture is no longer detected and interpret this as an end of the gesture.  One of ordinary skill would have been motivated to perform such a modification in order to provide improved neural network object detection as described in Kumar (paragraph 0005).
While Garg teaches wherein the CDLM comprises an object detector (as cited above) and Dani further teaches that the object detector may be a MobileNet architecture (e.g. page 176, first column, section 3.1, object detection approaches including MobileNet), Garg, Dani, and Kumar do not explicitly disclose that the object detector, implemented as a MobileNet architecture, is a MobileNetV2 architecture.
However, Islam teaches that the object detector, implemented as a MobileNet architecture, is a MobileNetV2 architecture (e.g. page 8, second column, final full paragraph, using SSD with MobileNet v2 as object detector for hand gesture recognition; page 9, Fig. 15(b), showing use of MobileNet v2 architecture for hand detection in images, including generating bounding boxes around hands; page 9, second column, only paragraph, continuing on page 10, first column, SSD with MobileNet v2 as an object detection model; using MobileNet v2 as the base network; page 12, second column first paragraph, using pre-trained SSD with MobileNet v2 model for hand gesture detection/recognition; training models with data set and then using as hand gesture recognizers within framework; Table IV, showing performance results of different hand gesture recognition models including SSD/MobileNet v2; page 13, first column, second full paragraph, using SSD/MobileNet v2 as hand gesture recognizer to balance tradeoffs between performance and running time; page 14, Fig. 19(b), showing snapshots of hand gesture recognition using MobileNet v2, including bounding boxes around recognized hands).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention having the teachings of Garg, Dani, Kumar, and Islam in front of him to have modified the teachings of Garg (directed to a lightweight gestural interface based on fingertip regression), Dani (directed to fingertip based user interaction in mixed reality), and Kumar (directed to improved virtual reality interaction utilizing deep learning), to incorporate the teachings of Islam (directed to detecting and understanding human motion and gestures) to include the capability to implement the object detector as a MobileNetV2 architecture.  One of ordinary skill would have been motivated to perform such a modification in order to provide an object detector which performs region selection and hand gesture classification in a single pass, with highly accurate and robust performance in noisy visual conditions, and to balance tradeoffs between performance and running time as described in Islam (page 8, second column, final full paragraph; page 13, first column, second full paragraph).
With respect to claims 3, 8, and 13, Garg in view of Dani, further in view of Kumar, further in view of Islam teaches all of the limitations of claims 1, 6, and 11, as previously discussed, and Garg further teaches wherein the step of classifying the fingertip motion pattern into one or more hand gestures comprises (i.e. the fingertip motion pattern is classified into one or more hand gestures by) applying a regression technique on the first coordinate and the second coordinate of the fingertip (e.g. page 231 first full paragraph, system works in real-time, implemented on mobile device; Fig. 2 and its caption, fingertip regressor module localizes fingertip; page 233 second full paragraph, regressing over coordinates x, y of the fingertip).

It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. “The use of patents as references is not limited to what the patentees describe as their own inventions or to the problems with which they are concerned. They are part of the literature of the art, relevant for all they contain.” In re Heck, 699 F.2d 1331, 1332-33, 216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 USPQ 275, 277 (CCPA 1968)). Further, a reference may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art, including nonpreferred embodiments. Merck & Co. v. Biocraft Laboratories, 874 F.2d 804, 10 USPQ2d 1843 (Fed. Cir.), cert. denied, 493 U.S. 975 (1989). See also Upsher-Smith Labs. v. Pamlab, LLC, 412 F.3d 1319, 1323, 75 USPQ2d 1213, 1215 (Fed. Cir. 2005); Celeritas Technologies Ltd. v. Rockwell International Corp., 150 F.3d 1354, 1361, 47 USPQ2d 1516, 1522-23 (Fed. Cir. 1998).



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JEREMY STANLEY whose telephone number is (469)295-9105. The examiner can normally be reached on Mon-Thurs 8:00-5:00 CST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Renee Chavez, can be reached on (571) 270-1104. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.
Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JEREMY L STANLEY/
Examiner, Art Unit 2179