Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-17 and 23-25 are pending. Claims 1, 12, and 23-25 are independent.  A preliminary amendment was filed on 4/17/2020 which is reflected in the Claims.
This Application was published as U.S. 2021/0193167.
Apparent foreign priority October 2017.  
35 U.S.C. 112(f) Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 
The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Such claim limitation(s) is/are: 


an acquisition module, configured to acquire an audio file to be recognized; 
an extraction module, configured to extract audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and 
a search module, configured to search audio attribute information matched with the audio feature information, in a fingerprint index database; 
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. 

These limitations are generic in the context of the art and don’t refer to any specific structure and only serve as placeholders for the structure that performs the associated function(s) without providing any information about what that structure is. MPEP 2181 I A says:
For a term to be considered a substitute for "means," and lack sufficient structure for performing the function, it must serve as a generic placeholder and thus not limit the scope of the claim to any specific manner or structure for performing the claimed function. It is important to remember that there are no absolutes in the determination of terms used as a substitute for "means" that serve as generic placeholders. The examiner must carefully consider the term in light of the specification and the commonly accepted meaning in the technological art. Every application will turn on its own facts.
Based on the ordinary skill in the art and description of functions of these components in the Specification, they refer to processors or a combination of processor and memory and possibly transducers such as microphones and displays or to a combination of software and hardware.
This is NOT a rejection. Please don’t address it as a rejection. If the Applicant does not agree with the INTERPRETATION, he may argue or amend to replace the terms interpreted under 112(f) with structural terms such as “microphone” or “microprocessor” as appropriately supported by the Specification. In the alternative, he may let the interpretation stand if the intent was to include a means plus function limitation in the Claim.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
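(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.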
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claim 24 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter.
Claim 24 is directed to “24. A computer-readable storage medium, wherein computer programs stored in the storage medium are executed by a processor to perform the audio recognition method of claim 1.”
No definition is provided for the phrase “computer-readable storage medium.”  Accordingly, the phrase is interpreted under its broadest reasonable interpretation and as such includes transitory wave media which are machine/computer readable and yet non-statutory. The broadest reasonable interpretation of the Claim would then include non-statutory embodiments and the Claim as a whole is directed to non-statutory subject matter.
To overcome the rejection, see suggested amendment:  “24. A non-transitory computer-readable storage medium, wherein computer programs stored in the storage medium are executed by a processor to perform the audio recognition method of claim 1.”
Claim 25 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim(s) does/do not fall within at least one of the four categories of patent eligible subject matter.
Claim 25 is directed to an “application program,” which is interpreted as a “computer program,” and “computer program” alone is not considered as one of the categories of patentable subject matter.
25. An application program, wherein the application program is executed to perform the audio recognition method of claim 1.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 8 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The “audio file” in Claim 8 lacks antecedent basis.
Is it the same “audio file” as in Claim 1?  Then it should be “the audio file.”
Is it a different one?  Then it should be specified differently.
8. The audio recognition method according to claim 3, 
wherein the audio attack comprises data processing on audio file, and 
…
	Based on the Specification, the “audio attacks” are done during the generation of the fingerprints and therefore the “audio file” of Claim 8 is not the same as the “audio file” of Claim 1.
	Please amend to provide proper description inside the Claim language for what this “audio file” is and make sure it has a name different from the one used in Claim 1.  If they are the same, use “the.”
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 9, 11-12, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over Sporer (U.S. 2018/0122398) in view of Li (U.S. 2018/0042174).
Regarding Claim 1, Sporer teaches:
1. An audio recognition method, comprising: 
acquiring an audio file to be recognized; [Sporer, Figure 1a, “microphone 11” to “recording 110.”  (There is no speech or voice in this Claim and “recognized” pertains to the recognition of fingerprints.)]
extracting audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and [Sporer, Figure 1b, where the “recording step 110” is divided between 110a’ and 110b’.  “[0053] … Generally, an audio fingerprint may be described such that it is a representation of an audio signal representing all the essential features of the audio signal so that subsequent classification is possible….”  “[0059] In this embodiment, the step of recording 110' is subdivided into two sub-steps, i.e. 110a' and 110b'. Step 110a refers to calculating psychoacoustic parameters, like roughness, sharpness, volume, tonality and impulse characteristic and/or variation intensity, for example. Step 110b is reduced to determining an audio fingerprint which describes the recording such that the characteristic features can be recognized again later on using the audio fingerprint.”]
searching audio attribute information matched with the audio feature information, in a fingerprint index database; [Sporer, Figures 1a and 1b, “storing data in a database” stores the obtained fingerprints in a fingerprint index database.  Figures 1a, 1b, 1c show a method of “setting up a database” of fingerprints.  [0042]-[0045].  Figure 2a shows the comparison of an incoming audio with the previously stored fingerprints.  “matching with database 210.”  “[0091] FIG. 2a shows the method 200 comprising step 210 of matching environmental noises received via the microphone 11 (cf. step of receiving 205), to recordings from the database 15….”  OR: “[0092] In correspondence with embodiments, the respective audio fingerprints of the current environmental noises, instead of the recordings, may be compared to audio fingerprints stored before in the database 15. The method here comprises determining the audio fingerprint of the current environmental noise and comparing it to audio fingerprints stored in the database 15.”]
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. [Sporer, Figure 2a, “database 15.”  Figure 1b teaches “subjective noise evaluation 120’” which teaches or very strongly suggests removing the “invalid audio fingerprints” before being stored at the “storing data in a database 130.”  “[0058]  …  Consequently, the method 100' also comprises the basic steps of recording 110', receiving 120' the signal relative to a subjective noise evaluation or, generally, relative to an allocation of the signal received into a signal class (like a disturbing noise) starting from a plurality of signal classes (like non-disturbing noise, slightly disturbing noise and highly disturbing noise), and storing the buffered recording 130, like using a database. In addition, steps 130 and 120' are connected via the point of decision 125.”  The subjective evaluation at 120’ filters the obtained fingerprints and if the result of decision step 125 is NO the sample is not stored in the database.  “[0055] This method 100 serves setting up a database where subjective disturbing noises received (i.e. recorded) by the microphone 11 are identified. Identifying is done using a step performed by the user which exemplarily executes the "signal 120 output" step using a key 12 (or generally a user input interface 12), when the user has recognized a disturbing noise in the environment. Since the microphone 110 listens to the environmental noises and these are buffered in step 110, these disturbing noises are also recorded so that the buffered recording or a part thereof may be stored in a permanent memory for setting up the database (cf. step 130). In case no disturbing noise has been recognized by the user, the method will be repeated, which is illustrated using the arrow from the subjective evaluation (decision element 125) to the starting point 101.”]
The removal of “invalid” data points from a set of samples is obvious but is not expressly discussed in Sporer.
Li teaches:
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. [Li, Figure 1, “Code instructions 108” including various instructions regarding outliers and Figure 3, 305 which teaches the removal of data that impact the model training adversely:  “[0116] At block 305, the agricultural intelligence computer system 130 is configured or programmed to implement agronomic data preprocessing of field data received from one or more data sources. The field data received from one or more data sources may be preprocessed for the purpose of removing noise and distorting effects within the agronomic data including measured outliers that would bias received field data values. Embodiments of agronomic data preprocessing may include, but are not limited to, removing data values commonly associated with outlier data values, specific measured data points that are known to unnecessarily skew other data values, data smoothing techniques used to remove or reduce additive or multiplicative effects from noise, and other filtering or data derivation techniques used to provide clear distinctions between positive and negative data inputs.”]
Sporer and Li both pertain to databases of information. It would have been obvious to combine the data processing method of Li, which can be used for developing a database used for model training, with the system of Sporer, which teaches generating a database of audio fingerprints, in order to add processing steps that result in a more useful database for model training or recognition of events/structures.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
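For illustration only, the kind of preprocessing described in Li [0116], removing measured outliers that would bias a data set before the data is used, can be sketched as shown below. Li does not specify a particular outlier rule. The function name remove_outliers, the median-absolute-deviation test, and the cutoff k are assumptions chosen for the example and are not taken from Li, Sporer, or the Specification.

```python
from statistics import median


def remove_outliers(values, k=3.0):
    """Drop values lying more than k median-absolute-deviations from the median.

    Hypothetical rule for illustration; Li [0116] does not name a specific test.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so nothing can be called an outlier.
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


# Example: one measurement is far from the rest and is dropped.
CLEANED = remove_outliers([10, 11, 9, 10, 12, 95])
```

In this example the value 95 is removed and the remaining five values are kept in their original order. Applied to a fingerprint database, the same pattern would discard entries flagged as invalid before they are stored.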

(Note also application of Sharma to Claim 1:
Regarding Claim 1, Sharma teaches:
1. An audio recognition method, comprising: 
acquiring an audio file to be recognized; [Sharma, Figure 2, “Audio in.”  “[0005] … Recognition is the process of computing features and then querying the database to find corresponding features….”]
extracting audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and [Sharma, Figure 1 shows the watermarking of the audio file and teaches “Embed 104” where the watermark/fingerprint is embedded.  “[0025] The selected configuration of embedding operations (104) embeds auxiliary data within a segment of the audio signal. …”  Processes are done on the “features extracted from the audio. “…  Feature extraction and matching are also used to adapt audio watermark embedding and detecting.”  Abstract.]
searching audio attribute information matched with the audio feature information, in a fingerprint index database; [Sharma, Figure 2, “Detect 204.”  See Figure 6 for details of detection.]
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. [Sharma, Figure 8, “Check/iterate over quality and robustness metrics 812.” ])

Regarding Claim 9, Sporer teaches:
9. The audio recognition method according to claim 1, wherein the audio attribute information matched with the audio feature information comprises at least one of the following: 
a song style, a natural sound in an audio or a language of a speaker in an audio. [Sporer teaches classification of noises and natural sounds.  “[0095] In accordance with an extended embodiment, the method may not only purely recognize such disturbing noises, but associate, i.e. classify, the noises to voice, motor noise, music, church bells or shots, for example.”  “[0086] The expert publication "Multimedia Content Analysis", Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pages 12 to 36, discloses a similar concept for indexing and characterizing multimedia pieces….”]
(Sharma: a song style, a natural sound in an audio or a language of a speaker in an audio. [Sharma, Figure 3, classification into types of music/song 312, and speech, noise, music genre at 304, or “language recognition” in 314.])

Regarding Claim 11, Sporer teaches:
11. The audio recognition method according to claim 1, further comprising:
outputting the audio attribute information. [Sporer, Figure 2a, “Event of Interest Recognized? 215” to YES.  “[0059] … Step 110b is reduced to determining an audio fingerprint which describes the recording such that the characteristic features can be recognized again later on using the audio fingerprint.”]
(Sharma, Figure 2, “interpret,” “data out”.  “[0027] FIG. 2 is a diagram illustrating audio processing for classifying audio and adaptively decoding data embedded in the audio…. ”)

Claim 12 is a device Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
12. An audio recognition device, comprising: 
an acquisition module, configured to acquire an audio file to be recognized; [Sporer, Figures 1a, 1b, “microphone 11.”  “[0016] Embodiments of this aspect are based on the finding that, starting from a database as may be determined in correspondence with another aspect (see below), like by comparing the current noise environment to the noises from the database or parameters obtained from the database or stored in the database, like audio fingerprints, it is possible to recognize the presence of subjectively perceived disturbing noises or to associate the noise to a class. This method can be executed in an automated manner and allows a forecast of the evaluation of a noise situation (chirping of a bird vs. air condition) solely using a stored database, without having any subjective evaluation done by humans.”]
an extraction module, configured to extract audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and [Sporer, Figure 1b, “generating audio fingerprints (AFP) 110b’.”]
a search module, configured to search audio attribute information matched with the audio feature information, in a fingerprint index database; [Sporer, Figure 2a, “matching with database 210.”]
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. [Sporer, Figure 2a, “database 15” and the cleaning / removing the invalid fingerprints of the fingerprint data before storing in the “database 130” by the “subjective noise evaluation 120’” in Figure 1b.  “[0063] … Finally, the audio fingerprints, psychoacoustic parameters or, generally, the recording, associated to one of the signal classes, are stored in the database.”]
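For illustration only, the acquisition, extraction, and search functions recited in Claim 12 (and the corresponding steps of Claim 1) can be sketched in software as shown below. The sketch is not taken from Sporer, Li, or the Specification. The function names, the sign-of-difference fingerprint, and the rule that treats a flat (silent) frame as an invalid fingerprint are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def extract_fingerprints(samples, frame=4):
    """Toy fingerprint: per frame, the signs (-1, 0, 1) of successive differences."""
    prints = []
    for start in range(0, len(samples) - frame + 1, frame):
        window = samples[start:start + frame]
        prints.append(tuple((b > a) - (b < a) for a, b in zip(window, window[1:])))
    return prints


def is_valid(fingerprint):
    # Hypothetical validity rule: a flat frame yields all zeros and identifies nothing.
    return any(fingerprint)


def build_index(library):
    """Map each valid fingerprint to the attribute labels of the samples containing it."""
    index = defaultdict(set)
    for attribute, samples in library.items():
        for fp in extract_fingerprints(samples):
            if is_valid(fp):  # invalid fingerprints are removed before indexing
                index[fp].add(attribute)
    return index


def search(index, samples):
    """Return the attribute label with the most matching fingerprints, or None."""
    votes = defaultdict(int)
    for fp in extract_fingerprints(samples):
        for attribute in index.get(fp, ()):
            votes[attribute] += 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical audio sample data: each entry contains one silent (flat) frame.
LIBRARY = {
    "birdsong": [0, 3, 1, 4, 2, 2, 2, 2, 5, 4, 6, 1],
    "engine": [1, 2, 3, 4, 7, 7, 7, 7, 9, 8, 7, 6],
}
INDEX = build_index(LIBRARY)
```

In this toy setup, a query such as [10, 13, 11, 14, 0, 0, 0, 0] matches "birdsong" through its first frame, while an entirely silent query returns no match because flat-frame fingerprints were never indexed. Had the flat frames been indexed, the silent frame would have matched both entries equally.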

Claim 23 is a device Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
23. A server, comprising: 
one or more processors; [Sporer, “CPU 41.”]
a memory; and [Sporer, “Memory 44.”]
one or more application programs, [Sporer, “[0032] In accordance with further embodiments, a computer program for executing one of the methods described above is provided.”]
wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the audio recognition method of claim 1. [Sporer, [0032] teaching that computer programs perform the methods taught by Sporer and the rejection of Claim 1.]

Claim 24 is a computer-readable storage medium Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
24. A computer-readable storage medium, wherein computer programs stored in the storage medium are executed by a processor to perform the audio recognition method of claim 1. [Sporer, “[0132] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, Blu-Ray disc, CD, ROM, PROM, EPROM, EEPROM or a FLASH memory, a hard drive or another magnetic or optical memory having electronically readable control signals stored thereon, which cooperate or are capable of cooperating with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer-readable.”]

Claim 25 is an application program Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
25. An application program, wherein the application program is executed to perform the audio recognition method of claim 1. [Sporer, [0032] teaching that computer programs perform the methods taught by Sporer and the rejection of Claim 1.]

Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Sporer and Li in view of Sharma (U.S. 2014/0142958).
Regarding Claim 2, Sporer teaches:
2. The audio recognition method according to claim 1, 
wherein the fingerprint index database comprises the audio fingerprint set in which invalid audio fingerprints have been removed from the audio sample data by a classifier.[Sporer teaches classifying the audio. “[0095] In accordance with an extended embodiment, the method may not only purely recognize such disturbing noises, but associate, i.e. classify, the noises to voice, motor noise, music, church bells or shots, for example.”]
Li pertains to database development for model training and is not express about a classifier.
Sharma expressly teaches:
2. The audio recognition method according to claim 1, 
wherein the fingerprint index database comprises the audio fingerprint set in which invalid audio fingerprints have been removed from the audio sample data by a classifier. [Sharma, Figure 3 shows the classification of signals of “audio in” where the classifier steps communicate with and feed the “fingerprint classifier 316.”  “[0030] FIG. 3 is a diagram illustrating an example configuration of a multi-stage audio classifier for preliminary analysis of audio for auxiliary data encoding and decoding. We refer to this classifier as "multi-stage" to reflect that it encompasses both sequential (e.g., 300-304) and concurrent execution of classifiers (e.g., fingerprint classifier 316 executes in parallel with silence/speech/music discriminators 300-304).”  “[0047] Multiple stream analysis enables different watermark layers to be separated from input audio, particularly if those layers are designed to have distinct kurtosis properties that facilitates un-mixing. It also allows separation of certain types of big noise sources from music or speech. It also allows separation of different musical pieces or separate speech sources. In these cases, these estimated sound sources may be analyzed separately, in preparation for separate watermark embedding or detecting. Unwanted portions can be ignored or filtered out from watermark processing. One example is filtering out noise sources, or conversely, discriminating noise sources so that they can be adapted to carry watermark signals (and possible unique watermark layers per sound source). Another is inserting different watermarks in different sounds that have been separated by this process, or concentrating watermark signal energy in one of the sounds….”]
Sporer and Li pertain to databases of information, and the Sporer/Li combination and Sharma pertain to fingerprint/watermark development for audio files and the related considerations. It would have been obvious to combine the expressly shown classifier of Sharma with the Sporer/Li combination for completeness.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 13 is a device Claim with limitations similar to the limitations of method Claim 2 and is rejected under similar rationale.
Claims 3-7 and 14-17 are rejected under 35 U.S.C. 103 as being unpatentable over Sporer and Li in view of Sharma and Mihcak (U.S. 2002/0184505).
Regarding Claim 3, Sporer does not teach training.  Li teaches training of models but is not related to audio data or audio attacks.
Sharma teaches:
3. The audio recognition method according to claim 2, wherein the classifier is established through following operations: 
extracting feature point data of audio data in a training data set as first feature point data; [Sharma, Figure 5, shows a process of training where the “Watermarked audio in” and the “Original audio in” are compared and the parameters adjusted.  The input “watermarked audio in” teaches the “training set” of the Claim and the operations are done on the “audio features” of the audio.  “[0053] …  The training set is provides signals typical for the intended usage environment….”  Figure 7, “Segment audio 700” yields the “first feature point data.”    “[0078] Audio classifiers for determining audio type are constructed by computing features of audio clips in a training data set ….”]
performing an audio attack on the audio data in the training data set, and extracting feature point data of audio data in the training data set after performing the audio attack as second feature point data; [Sharma, Figure 5, 502 is a “Robustness Evaluator” which measures how well a watermark can withstand attacks.  Figure 7, watermarked audio/ “first feature point data” is input to the “Segment Audio 700” and is then subjected to Attack at “Distort 702” and then the output which teaches the “second feature point data” is obtained.  “[0177] The robustness evaluator 502 modifies the watermarked audio signal with simulated distortion and evaluates robustness of the watermark in the modified signal….”]
comparing the first feature point data with the second feature point data, marking disappeared or moved feature point data as counter-example data, and marking feature point data with robustness as positive example data; and [Sharma, Figure 7, at 704, the comparison is performed between the input/”first feature point data” and the output of the attack/distortion which is the “second feature point data” and the bit error rate (BER) is detected.  “[0206] As noted above, there are different measures of robustness, and the length of audio segment and processing to compute them vary with the robustness measure. For watermark bit error rate based measures, the length of the segment should be about the length of watermark packet, such that it is long enough to enable the detector to extract estimates of the error correction coded message symbols (e.g., message bits) from which a bit error rate can be computed…..”]
training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example data. [Sharma, Figure 5, “adjust gain/model/insert method 504” which takes the output of the “robustness evaluator 502” teaches the training step.  “[0179] … This update is reflected in the update module 504, in which the decision to update embedding is made, and the nature of the update is determined. In addition to improving quality in response to a poor quality metric and increasing reliability in response to a poor robustness metric, the evaluations of quality and robustness can be used together to optimize both quality and robustness….”  “[0180] The robustness measure indicates where the watermark signal cannot be reliably detected, and as such, the watermark strength should be increased, if allowable based on the quality measure….”    “[0078] Audio classifiers for determining audio type are constructed by computing features of audio clips in a training data set and deriving a mapping of the features to a particular audio type. For the purpose of digital watermarking operations, we seek classifications that enable selection of audio watermark parameters that best fit the audio type in terms of achieving the objectives of the application for audio quality (imperceptibility of the audio modifications made to embed the watermark), watermark robustness, and watermark data capacity per time segment of audio….”]
Sporer and Li pertain to databases of information, and Sporer/Li and Sharma pertain to fingerprint/watermark development for audio files and the related considerations. It would have been obvious to combine the training steps of Sharma, which take an audio attack into account, with the system of the Sporer/Li combination to provide a trained model for use by that system.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
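
For illustration only, the bit error rate (BER) comparison cited from Sharma at 704 can be sketched as follows. The function name and the sample bit sequences are hypothetical and are not taken from Sharma; the sketch only shows how a BER is computed from a known message and the message recovered after distortion.

```python
from typing import Sequence


def bit_error_rate(sent: Sequence[int], received: Sequence[int]) -> float:
    """Fraction of bit positions at which the recovered message differs."""
    if len(sent) != len(received):
        raise ValueError("bit sequences must have the same length")
    if not sent:
        raise ValueError("bit sequences must not be empty")
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return errors / len(sent)


# Hypothetical message bits before and after a simulated distortion.
embedded = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [1, 0, 0, 1, 0, 1, 1, 0]

ber = bit_error_rate(embedded, recovered)
print(ber)  # 0.25
```

A lower BER after distortion corresponds to a more robust segment under this measure.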

The obtaining of the positive and negative examples for the training from data that survived after the attack is more expressly taught by the following reference.
Mihcak teaches:
performing an audio attack on the audio data in the training data set, and extracting feature point data of audio data in the training data set after performing the audio attack as second feature point data; [Mihcak, Figure 7, “suspect audio clip 750” indicates that an audio attack may have happened to the clip.  “[0014] At the same time, these characteristics of the human auditory system can be exploited for illegal or unscrupulous purposes. For example, a pirate may use advanced audio processing techniques to remove copyright notices or embedded watermarks from an audio clip without perceptually altering the audio clip. Such malicious changes to the audio clip are referred to as "attacks", and result in changes at the data domain.”  “[0028] A good audio hashing technique should generate the same unique identifier even though some forms of attacks have been done to the original audio clip, given that the altered audio clip is reasonably similar (i.e., perceptually) to a human listener when comparing with the original audio clip. However, if the modified audio clip is audibly different or the attacks cause irritation to the listeners, the hashing technique should recognize such degree of changes and produce a different hash value from the original audio clip.”]
comparing the first feature point data with the second feature point data, marking disappeared or moved feature point data as counter-example data, and marking feature point data with robustness as positive example data; and [Mihcak, Figure 7, at step 758, the hash value/fingerprint of the suspect clip is compared with the hash value/fingerprint of the selected clip.  If they match (YES), the result is to “indicate that the suspect clip is pirated from the selected clip 762,” which teaches a counter-example because the attack is not showing in the comparison.  If they do not match (NO), the result is to “indicate that the suspect clip is not pirated from the selected clip 760,” which means that if an attack actually happened, it must have been successful in making the marking disappear, resulting in a mismatch.  “[0027] Accordingly, there is a need for a hashing technique for digital audio clips that allows slight changes to the audio clip which are tolerable or undetectable (i.e., imperceptible) to the human ear, yet do not result in a different hash value….”]
Sporer/Li, Sharma, and Mihcak pertain to fingerprint/watermark development for audio files and it would have been obvious to combine the example and counter-example development of data after an attack from Mihcak with the system of combination for more express teaching of these steps.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
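
For illustration only, the claimed comparing and marking step can be sketched as follows. The representation of a feature point as a (frame index, frequency bin) pair and all names are assumptions made for this sketch; none of the cited references is asserted to use this implementation.

```python
from typing import List, Set, Tuple

# A feature point is modeled here as a (frame_index, frequency_bin) pair.
# This representation is an assumption for illustration only.
FeaturePoint = Tuple[int, int]


def label_feature_points(
    first: List[FeaturePoint], second: List[FeaturePoint]
) -> Tuple[List[FeaturePoint], List[FeaturePoint]]:
    """Split the pre-attack points into (positive, counter) examples.

    A point still present at the same position after the attack is treated
    as robust (positive example); a point that disappeared or moved is
    treated as a counter-example.
    """
    survivors: Set[FeaturePoint] = set(second)
    positive = [p for p in first if p in survivors]
    counter = [p for p in first if p not in survivors]
    return positive, counter


first_points = [(0, 12), (1, 40), (2, 7), (3, 25)]
# After a hypothetical attack: (1, 40) moved to (1, 41) and (3, 25) vanished.
second_points = [(0, 12), (1, 41), (2, 7)]

positive, counter = label_feature_points(first_points, second_points)
print(positive)  # [(0, 12), (2, 7)]
print(counter)   # [(1, 40), (3, 25)]
```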

Regarding Claim 4, Sporer is not express about removing samples.
Li teaches:
4. The audio recognition method according to claim 3, 
wherein the classifier filters the audio sample data, and removes feature point data, determined as counter-example data, as invalid audio fingerprints. [Li, Figure 3, 305, data pre-processing that removes a number of types of data from the training dataset, and “[0116] … Embodiments of agronomic data preprocessing may include, but are not limited to, removing data values commonly associated with outlier data values, specific measured data points that are known to unnecessarily skew other data values, data smoothing techniques used to remove or reduce additive or multiplicative effects from noise, and other filtering or data derivation techniques used to provide clear distinctions between positive and negative data inputs.”]
Rationale for combination as provided for Claim 1.

Regarding Claim 5, Sporer and Li are not express about the classifier.
Sharma teaches:
5. The audio recognition method according to claim 4, 
wherein the classifier filters the audio sample data, and removes the feature point data, determined as the counter-example data, as the invalid audio fingerprints, comprising: 
extracting feature point data of the audio sample data; [Sharma, Figure 7, “watermarked audio in” to “segment audio 700” performs audio feature extraction, and the “distort 702”/audio attack changes the audio features; to detect the distortion at 704, the audio features must then be extracted:   “[0090] … The watermark structure is inserted into audio by altering audio features according to watermark signal elements that make up the structure….”  Watermarking/fingerprinting is done by altering audio features, so detecting it requires examining the audio features.]
inputting the extracted feature point data into the classifier; and [Sharma, Figure 5, 504 has to use the right model which is obtained from the classifier of Figure 3.  See Figure 4, “Audio + Signal Classification to select model and insertion methods.”]
removing the feature point data, determined as the counter-example data, as the invalid audio fingerprints, according to a result of positive example data or counter-example data output by the classifier. [Sharma teaches that the “training data set” has data that best reflects the robust watermarking methods that withstand Distortion/Attack which suggests removing data that are not useful or are misleading.  “[0078] Audio classifiers for determining audio type are constructed by computing features of audio clips in a training data set and deriving a mapping of the features to a particular audio type. For the purpose of digital watermarking operations, we seek classifications that enable selection of audio watermark parameters that best fit the audio type in terms of achieving the objectives of the application for audio quality (imperceptibility of the audio modifications made to embed the watermark), watermark robustness, and watermark data capacity per time segment of audio. Each of these watermark embedding constraints is related to the masking capability of the host audio, which indicates how much signal can be embedded in a particular audio segment. The perceptual masking models used to exploit the masking properties of the host audio to hide different types of watermark are computed from host audio features. Thus, these same features are candidates for determining audio classes, and thus, the corresponding watermark type and perceptual models to be used for that audio class. Below, we describe watermark types and corresponding perceptual models in more detail.”]
Removing specious data from the training data set is not expressly taught by Sharma.
Li teaches:
removing the feature point data, determined as the counter-example data, as the invalid audio fingerprints, according to a result of positive example data or counter-example data output by the classifier. [Li, Figure 3, 305, [0116].  “[0112] 2.4 Process Overview-Agronomic Model Training “  “In an embodiment, a method comprises determining, in received yield data, one or more passes, each pass including a plurality of observations. For each pass of the one or more passes, one or more discrete derivatives are determined, and based on the one or more discrete derivatives first outlier data is generated. First filtered data is generated by removing the first outlier data from the yield data. Furthermore, for each observation in the yield data, a plurality of nearest neighbor observations is determined, and used to determine a plurality of absolute differences in yield values. Based on the plurality of absolute differences, second outlier data is determined. Second filtered data is generated by removing the second outlier data from the first filtered data. Using a presentation layer of a computer system, a graphical representation of the second filtered data is generated and displayed on the computing system.”  Abstract.]
Rationale for combination as provided for Claim 3.
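
For illustration only, the removing step of Claim 5 can be sketched as a filter driven by the classifier output. The feature layout, the stand-in classifier, and its 0.5 threshold are hypothetical; neither Sharma nor Li is asserted to use this code.

```python
from typing import Callable, Dict, List

FeaturePoint = Dict[str, float]  # hypothetical per-point feature record


def remove_counter_examples(
    points: List[FeaturePoint], classifier: Callable[[FeaturePoint], int]
) -> List[FeaturePoint]:
    """Keep only the points the classifier labels 1 (positive example)."""
    return [p for p in points if classifier(p) == 1]


def toy_classifier(point: FeaturePoint) -> int:
    # Stand-in for a trained classifier: a fixed threshold on one feature.
    return 1 if point["peak_to_frame_ratio"] >= 0.5 else 0


candidates = [
    {"peak_to_frame_ratio": 0.8},
    {"peak_to_frame_ratio": 0.2},
    {"peak_to_frame_ratio": 0.6},
]
valid = remove_counter_examples(candidates, toy_classifier)
print(len(valid))  # 2
```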

Regarding Claim 6, Sporer and Li were not cited for teaching the training although Li includes model training and mentions the use of the “nearest neighbor observations.”  Abstract.  Sharma was cited for the steps of Claim 3.
Sharma teaches:
6. The audio recognition method according to claim 3, 
wherein an algorithm for training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example comprises at least one of the following: a
nearest neighbor algorithm, a support vector machine algorithm or a neural network algorithm. [Sharma uses a “neural net training algorithm.”  “[0053] Just as the PEAQ method (describe further below {Perceptual Evaluation of Audio Quality in Figure 6 as part of the training process of Figure 5}) is derived based on neural net training on audio test signals, so can the classifier by derived by mapping measured audio features of a training set of audio signals to audio classifications used to control watermark embedding and detecting parameters. This neural net training approach enables classifiers to be tuned for different usage scenarios and audio environments in which watermarked audio is produced and output, or captured and processed for watermark embedding or detecting. The training set is provides signals typical for the intended usage environment. In this fashion, the perceptual quality can be analyzed in the context of audio types and noise sources that are likely to be present in the audio stream being processed for audio classification, recognition, and watermark embedding or detecting.”  “[0054] Microphones arranged in a particular venue, or audio test equipment in particular audio distribution workflow, can be deployed to capture audio training signals, from which a neural net classifier used in that environment is trained. Such neural net trained classifiers may also be designed to detect noise sources and classify them so that the perceptual quality model tuned to particular noise sources may be selected for watermark embedding, or filters may be applied to mitigate noise sources prior to watermark embedding or detecting. This neural net training may be conducted continuously, in an automated fashion, to monitor audio signal conditions in a usage scenario, such as a distribution channel or venue. 
The mapping of audio features to classifications in the neural net classifier model is then updated over time to adapt based on this ongoing monitoring of audio signals.”]
Rationale for combination as provided for Claim 3.
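
For illustration only, the nearest neighbor alternative recited in Claim 6 can be sketched as a one-nearest-neighbor classifier over labeled feature vectors. The two-dimensional feature vectors and labels below are hypothetical training data, and Sharma (which describes neural net training) is not asserted to use this algorithm.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, label); label 1 = positive example, 0 = counter-example.
Sample = Tuple[Sequence[float], int]


def nearest_neighbor_label(training: List[Sample], query: Sequence[float]) -> int:
    """Return the label of the training sample closest to `query` (1-NN)."""
    _, label = min(training, key=lambda s: math.dist(s[0], query))
    return label


# Hypothetical labeled feature vectors derived from a training data set.
training_set: List[Sample] = [
    ([0.90, 0.80], 1),
    ([0.85, 0.75], 1),
    ([0.10, 0.20], 0),
    ([0.20, 0.10], 0),
]

print(nearest_neighbor_label(training_set, [0.80, 0.90]))  # 1
print(nearest_neighbor_label(training_set, [0.15, 0.15]))  # 0
```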

Regarding Claim 7, Sporer teaches and therefore suggests:
7. The audio recognition method according to claim 3, wherein feature point data comprises at least one of the following: 
energy of an audio frame where a local maximum point is located; energy of a frequency where the local maximum point is located, and a ratio of energy of the frequency where the local maximum point is located to energy of the audio frame where the local maximum point is located; a quantity of local maximum points in the audio frame;  energy of an audio frame near the audio frame where the local maximum point is located in time dimension; or energy distribution of points around a local maximum point. [Sporer teaches the use of energy for classifying/identifying a signal and therefore suggests the various energy-related parameters of Claim 7:  “[0086] The expert publication "Multimedia Content Analysis", Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pages 12 to 36, discloses a similar concept for indexing and characterizing multimedia pieces. In order to ensure an efficient association of an audio signal to a certain class, a number of features and classifiers have been developed. Time-range features or frequency-range features are suggested as features for classifying the contents of a multimedia piece. These comprise the volume, the pitch as a basic frequency of an audio signal shape, spectral features, like the energy contents of a band relative to the total energy contents, cut-off frequencies in the spectral course and others. Apart from short-time features relating to the so-called sizes per block of samples of the audio signals, long-term quantities are suggested which relate to a longer period of the audio piece.”  “[0088] WO 02/065782 describes a method for forming a fingerprint to form a multimedia signal. The method relates to extracting one or several features from an audio signal. The audio signal here is divided into segments and processing as to blocks and frequency bands takes place in each segment. Band-wise calculation of energy, tonality and standard deviation of the power density spectrum are mentioned as examples.”]

Additionally, Sharma teaches and therefore suggests:
7. The audio recognition method according to claim 3, wherein feature point data comprises at least one of the following: 
energy of an audio frame where a local maximum point is located; [Sharma, “[0064] Initially, the classifier process acts as a high level discriminator of audio type, namely, discriminating among parts of the audio that are comprised of silence, speech or music. A silence discriminator (300) discriminates between background noise and speech or music content, and speech--music discriminator (302) discriminates between speech and music. This level of discrimination can use similar computations, such as energy metrics (sum of squared or absolute amplitudes, rate of change of energy, for a particular time frame, etc.), signal activity metrics (zero crossing rate). As such, the routines for discriminating speech, silence and music may be integrated more tightly together. Alternatively, a frequency domain analysis (i.e. a spectral analysis) could be employed instead of or in addition to time-domain analysis. For example, a relatively flat spectrum with low energy would indicate silence.”]
energy of a frequency where the local maximum point is located, and a ratio of energy of the frequency where the local maximum point is located to energy of the audio frame where the local maximum point is located; 
a quantity of local maximum points in the audio frame; [Sharma, “[0173] Another aspect of temporal modeling is removal of pre and post echoes. Pre and post echoes are introduced during embedding of watermark frequency components (or modulation of the host audio frequency components). For example, consider the case of an event occurring in the audio signal that is very localized in time (for example a clap or a door slam). Assume that this event occurs at the end of an audio segment under consideration for embedding. Modification of the audio signal components to embed the watermark signal can cause some frequency components of this event to be heard slightly earlier in the embedded version than the originally occur in the host audio. These effects can be perceived even in the case of typical audio signals, and are not necessarily constrained to dominant events. The reason is that the host signal's content is used to shape the watermark. After the shaping operation, the watermark is transformed to the time domain before being added to the host audio. Although the host signal power at each frequency can vary over time significantly, the time domain version of the watermark will generally have uniform power over all frequencies over the course of the audio segment. Such pre echoes (and similarly post echoes) can be suppressed or removed by an analysis and filtering in the time domain. This is achieved by generating suitable window functions to apply to the watermark signal, with the window being proportional to the instantaneous energy of the host….”]
energy of an audio frame near the audio frame where the local maximum point is located in time dimension; or 
energy distribution of points around a local maximum point. [Sharma, “[0113] There are variations on the basic option of code symbols that correspond to signal peaks….”]
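
For illustration only, several of the energy-related quantities recited in Claim 7 can be computed from a spectrogram as sketched below. The spectrogram layout (a list of frames, each a list of per-bin energies), the function names, and the sample values are assumptions made for this sketch and are not taken from Sporer or Sharma.

```python
from typing import Dict, List

Spectrogram = List[List[float]]  # spec[frame][bin] = energy (assumed layout)


def count_local_maxima(frame: List[float]) -> int:
    """Count bins strictly greater than both immediate neighbors."""
    return sum(
        1
        for i in range(1, len(frame) - 1)
        if frame[i] > frame[i - 1] and frame[i] > frame[i + 1]
    )


def peak_features(spec: Spectrogram, frame_idx: int, bin_idx: int) -> Dict[str, float]:
    """Energy features for the local maximum at (frame_idx, bin_idx)."""
    frame = spec[frame_idx]
    frame_energy = sum(frame)
    peak_energy = frame[bin_idx]
    # Energy of the adjacent frames in the time dimension (clipped at edges).
    neighbors = [
        sum(spec[t]) for t in (frame_idx - 1, frame_idx + 1) if 0 <= t < len(spec)
    ]
    return {
        "frame_energy": frame_energy,
        "peak_energy": peak_energy,
        "peak_to_frame_ratio": peak_energy / frame_energy if frame_energy else 0.0,
        "num_local_maxima": float(count_local_maxima(frame)),
        "neighbor_frame_energy": sum(neighbors),
    }


spec = [
    [1.0, 2.0, 1.0, 0.0],
    [1.0, 6.0, 1.0, 2.0],
    [0.0, 1.0, 3.0, 1.0],
]
features = peak_features(spec, frame_idx=1, bin_idx=1)
print(features["peak_to_frame_ratio"])  # 0.6
```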

Claim 14 is a device Claim with limitations similar to the limitations of method Claim 3 and is rejected under similar rationale.

Claim 15 is a device Claim with limitations similar to the limitations of method Claim 4 and is rejected under similar rationale.

Claim 16 is a device Claim with limitations similar to the limitations of method Claim 5 and is rejected under similar rationale.

Claim 17 is a device Claim with limitations similar to the limitations of method Claim 6 and is rejected under similar rationale.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Sporer, Li and Sharma and Mihcak in view of Xu (U.S. 6988201).
Regarding Claim 8, Sporer and Li do not discuss attacks.
Sharma teaches:
8. The audio recognition method according to claim 3, 
wherein the audio attack comprises data processing on audio file, and [Sharma, Figure 7, “Distort 702” teaches the audio attack.]
the data processing comprises at least one of the following: 
segmentation, conversion of audio encoding formats, sampling rate conversion, noising processing according to different signal-to-noise ratios and noise types, over-the-air (ova) dubbing or non-ova dubbing. [Sharma, “[0205] The next step is to apply a perturbation (702) to the watermarked audio segment that simulates the distortion of the channel prior to watermark detection. One example is to simulate the distortion of the channel with Additive White Gaussian Noise (AWGN), in which this AWGN signal is added to the watermarked audio. Other forms of distortion may be applied or modeled and then applied. Direct forms of distortion include applying time compression or warping to simulate distortions in time scaling (e.g., linear time scale shifts or Pitch Invariant Time Scale modification), or data compression techniques (e.g., MP3, AAC) at targeted audio bit-rates. Modeled forms of distortion include adding echoes to simulate multipath distortion and models of audio sensor, transducer and background noise typically encountered in environments where the watermark is detected from ambient audio captured through a microphone. For more background on iterative robustness evaluation, see U.S. Pat. No. 7,796,826, incorporated above.”]
Mihcak teaches some common types of attacks in [0016]-[0026] including:  “[0017] two successive D/A and A/D conversions,” which teaches “conversion of audio encoding formats.”
Xu more expressly teaches a type of attack that is claimed:
8. The audio recognition method according to claim 3, 
wherein the audio attack comprises data processing on audio file, and [Xu, “… . Firstly, unlike sampled digital audio, WT audio is a parameterised digital audio, so it is difficult to attack using typical signal processing techniques, such as adding noise and re-sampling….”  Col. 6, lines 6-36.]
the data processing comprises at least one of the following: 
segmentation, conversion of audio encoding formats, sampling rate conversion, noising processing according to different signal-to-noise ratios and noise types, over-the-air (ova) dubbing or non-ova dubbing. [Xu, “Adaptive-bit coding has, however, low immunity to manipulations. Embedded information can be destroyed by channel noise, re-sampling, and other operations. Adaptive-bit coding technique is used based on several considerations. Firstly, unlike sampled digital audio, WT audio is a parameterised digital audio, so it is difficult to attack using typical signal processing techniques, such as adding noise and re-sampling. Secondly, the size of a wave sample 210 in WT audio is small, and therefore it is unsuitable to embed a watermark in the sample in the frequency domain. Thirdly, to ensure robustness, the watermarked bit sequence of sample data is embedded into the articulation parameters 122 of WT audio. If the sample data are distorted, the embedded information can be used to restore the coded bits of the sample data 124.”  Col. 6, lines 6-36.]
Sporer/Li, Sharma, Mihcak and Xu pertain to fingerprint/watermark development for audio files and it would have been obvious to combine the additional examples of types of audio attack from Xu with the combination.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
(See Lazar in the Conclusion section for segmentation as a form of audio attack.)
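
For illustration only, three of the recited forms of data processing (noising at a chosen signal-to-noise ratio, sampling rate conversion, and segmentation) can be sketched as follows. The function names, the test tone, and the crude decimation without an anti-alias filter are assumptions made for this sketch; no cited reference is asserted to implement its distortions this way.

```python
import math
import random
from typing import List


def add_white_noise(signal: List[float], snr_db: float, seed: int = 0) -> List[float]:
    """Add Gaussian noise scaled so the signal-to-noise ratio is about snr_db dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def decimate(signal: List[float], factor: int) -> List[float]:
    """Crude sampling-rate reduction: keep every factor-th sample (no filtering)."""
    return signal[::factor]


def segment(signal: List[float], length: int) -> List[List[float]]:
    """Cut the signal into consecutive pieces of at most `length` samples."""
    return [signal[i:i + length] for i in range(0, len(signal), length)]


# Hypothetical test tone: one second of 440 Hz sampled at 8000 Hz.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]

noisy = add_white_noise(tone, snr_db=20.0)
half_rate = decimate(tone, 2)
pieces = segment(tone, 3000)
print(len(noisy), len(half_rate), [len(p) for p in pieces])
# 8000 4000 [3000, 3000, 2000]
```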

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Sporer, Li, and Sharma and Mihcak and Xu in view of Dhoot (U.S. 2018/0204576).
Regarding Claim 10, Sporer and Li do not teach this limitation.
Sharma teaches and suggests:
10. The audio recognition method according to claim 8, wherein under a condition that the audio attribute information comprises the language of the speaker in an audio, the audio attribute information further comprises translated text information corresponding to the language of the speaker in the audio. [Sharma, Figure 3, the classifier detects the “language of the speaker” in “language recognition” and also obtains the text of the speech in “speech recognition.”  Sharma does not teach including a “translated text.”  However, once the speech is recognized and its language is known adding a machine translation is trivial.  “[0071] For some applications, further analysis of speech can also be useful in adapting watermarking or content fingerprint operations. In addition to male/female voice discrimination, such recognition modules (314) may include recognition of a particular language, recognizing a speaker, or speech recognition, for example….”]
Translation is not expressly taught by Sharma, nor by Mihcak or Xu.
Dhoot teaches:
wherein under a condition that the audio attribute information comprises the language of the speaker in an audio, the audio attribute information further comprises translated text information corresponding to the language of the speaker in the audio. [Dhoot, Figure 2, “[0022] Analytics suite 106 includes a speech recognition program, one or more language translation programs, and an analytics program (e.g., NLP) that analyzes dialog (e.g., verbal, text-based, etc.) of a meeting communicated by network 110. …”]
Sporer/Li/Sharma/Mihcak/Xu and Dhoot pertain to natural language processing and as provided in the mapping of the Claim to Sharma, once speech is recognized, the step of translation can be added on as a trivial step and extension and this step is taught by Dhoot.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Lazar (U.S. 20100095350):  “[0027] … Moreover, randomly trimming arbitrary amounts of signal from both ends of the individual audio recordings prior to concatenation may hinder attacks based on segmentation of the audio file….”

Regarding Classification of Claim 9, see also Davis (U.S. 20130085825):  [0120] The service can also be invoked to effect database pruning. For example, a database may be organized with several partitions (physical or logical), each containing information of a different class. In a facial recognition database, the data may be segregated by subject gender (i.e., male facial portraits, female facial portraits), and/or by age (15-40, 30-65, 55 and higher--data may sometimes be indexed in two or more classifications), etc. In an image database, the data may be segregated by topical classification (e.g., portrait, sports, news, landscape). In an audio database, the data may be segregated by type (spoken word, music, other). Each classification, in turn, can be further segregated (e.g., "music" may be divided into classical, country, rock, other). And these can be further segregated (e.g., "rock" may be classified by genre, such as soft rock, hard rock, Southern rock; by artist, e.g., Beatles, Rolling Stones, etc.).

Davis (U.S. 20130085825): Types of audio attacks:
[0132] Rotating a video by a few degrees is one of several hacks that can defeat fingerprint identification. (It is axiomatic that introduction of any new content protection technology draws hacker scrutiny. Familiar examples include attacks against Macrovision protection for VHS tapes, and against CSS protection for packaged DVD discs.) If fingerprinting is employed in content protection applications, such as in social networking sites (as outlined above) or peer-to-peer networks, its vulnerability to attack will eventually be determined and exploited. 
[0133] Each fingerprinting algorithm has particular weaknesses that can be exploited by hackers to defeat same. An example will help illustrate. 
[0134] A well-known fingerprinting algorithm operates by repeatedly analyzing the frequency content of a short excerpt of an audio track (e.g., 0.4 seconds). The method determines the relative energy of this excerpt within 33 narrow frequency bands that logarithmically span the range 300 Hz-2000 Hz. A corresponding 32-bit identifier is then generated from the resulting data. In particular, a frequency band corresponds to a data bit "1" if its energy level is larger than that of the band above, and a "0" if its energy level is lower. (A more complex arrangement can also take variations over time into account, outputting a "1" only if the immediately preceding excerpt also met the same test, i.e., having a band energy greater than the band above.) 
[0135] Such a 32 bit identifier is computed every hundredth of a second or so, for the immediately preceding 0.4 second excerpt of the audio track, resulting in a large number of "fingerprints." This series of characteristic fingerprints can be stored in a database entry associated with the track, or only a subset may be stored (e.g., every fourth fingerprint). 
[0136] When an unknown track is encountered, the same calculation process is repeated. The resulting set of data is then compared against data earlier stored in the database to try and identify a match. (As noted, various strategies can be employed to speed the search over a brute-force search technique, which yields unacceptable search times.) 
[0137] While the just-described technique is designed for audio identification, a similar arrangement can be used for video. Instead of energies in audio subbands, the algorithm can use average luminances of blocks into which the image is divided as the key perceptual features. Again, a fingerprint can be defined by determining whether the luminance in each block is larger or smaller than the luminance of the preceding block. 
[0138] While little has been written about attacks targeting fingerprinting systems, a casual examination of possible attack scenarios reveals several possibilities. A true hacker will probably see many more. Four simple approaches are discussed below. 
Radio Loudness Profiling 
[0139] The reader may be familiar with different loudness profiles selectable on car radios, e.g., Jazz, Talk, Rock, etc. Each applies a different frequency equalization profile to the audio, e.g., making bass notes louder if the Rock setting is selected, and quieter if the Talk setting is selected, etc. The difference is often quite audible when switching between different settings. 
[0140] However, if the radio is simply turned on and tuned to different stations, the listener is generally unaware of which loudness profile is being employed. That is, without the ability to switch between different profiles, the frequency equalization imposed by a particular loudness profile is typically not noticed by a listener. The different loudness profiles, however, yield different fingerprints. 
[0141] For example, in the Rock setting, the 300 Hz energy in a particular 0.4 second excerpt may be greater than the 318 Hz energy. However, in the Talk setting, the situation may be reversed. This change prompts a change in the leading bit of the fingerprint. 
[0142] In practice, an attacker would probably apply loudness profiles more complex than those commonly available in car radios--increasing and decreasing the loudness at many different frequency bands (e.g., 32 different frequency bands). Significantly different fingerprints may thus be produced. Moreover, the loudness profile could change with time--further distancing the resulting fingerprint from the reference values stored in a database. 
Multiband Compression 
[0143] Another process readily available to attackers is audio multiband compression, a form of processing that is commonly employed by broadcasters to increase the apparent loudness of their signal (most especially commercials). Such tools operate by reducing the dynamic range of a soundtrack--increasing the loudness of quiet passages on a band-by-band basis, to thereby achieve a higher average signal level. Again, this processing of the audio changes its fingerprint, yet is generally not objectionable to the listeners. 
Psychoacoustic Processing 
[0144] The two examples given above are informal attacks--common signal processing techniques that yield, as side-effects, changes in audio fingerprints. Formal attacks--signal processing techniques that are optimized for purposes of changing fingerprints--are numerous. 
[0145] Some formal attacks are based on psychoacoustic masking. This is the phenomena by which, e.g., a loud sound at one instant (e.g., a drum beat) obscures a listener's ability to perceive a quieter sound at a later instant. Or the phenomena by which a loud sound at one frequency (e.g., 338 Hz) obscures a listener's ability to perceive a quieter sound at a nearby frequency (e.g., 358 Hz) at the same instant. Research in this field goes back decades. (Modern watermarking software employs psychoacoustic masking in an advantageous way, to help hide extra data in audio and video content.) 
[0146] Hacking software, of course, can likewise examine a song's characteristics and identify the psychoacoustic masking opportunities it presents. Such software can then automatically make slight alterations in the song's frequency components in a way that a listener won't be able to note, yet in a way that will produce a different series of characteristic fingerprints. The processed song will be audibly indistinguishable from the original, but will not "match" any series of fingerprints in the database. 
Threshold Biasing 
[0147] Another formal attack targets fingerprint bit determinations that are near a threshold, and slightly adjusts the signal to swing the outcome the other way. Consider an audio excerpt that has the following respective energy levels (on a scale of 0-99), in the frequency bands indicated: 
TABLE-US-00002
300 Hz    318 Hz    338 Hz    358 Hz
69        71        70        68
[0148] The algorithm detailed above would generate a fingerprint of {011 . . . } from this data (i.e., 69 is less than 71, so the first bit is `0`; 71 is greater than 70, so the second bit is `1`; 70 is greater than 68, so the third bit is `1`). 
[0149] Seeing that the energy levels are somewhat close, an attacker tool could slightly adjust the signal's spectral composition, so that the relative energy levels are as follows: 
TABLE-US-00003
300 Hz    318 Hz    338 Hz    358 Hz
70        69        70        68
[0150] Instead of {011 . . . }, the fingerprint is now {101 . . . }. Two of the three illustrated fingerprint bits have been changed. Yet the change to the audio excerpt is essentially inaudible. 
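The bit flips described in paragraphs [0148] and [0150] can be checked with a minimal sketch. It assumes, as the worked example in [0148] implies, that each bit is 1 when a band's energy exceeds that of the next band and 0 otherwise. The function name `fingerprint_bits` and the handling of ties (treated as 0) are hypothetical and are not taken from the reference.

```python
def fingerprint_bits(energies):
    """Return one bit per adjacent band pair.

    Assumption: bit is 1 if a band's energy is greater than the next
    band's energy, else 0 (ties give 0; the reference does not say).
    """
    return [1 if a > b else 0 for a, b in zip(energies, energies[1:])]


# Energy levels from the two tables (300, 318, 338, 358 Hz bands).
original = [69, 71, 70, 68]
attacked = [70, 69, 70, 68]

bits_before = fingerprint_bits(original)
bits_after = fingerprint_bits(attacked)

# Count how many of the illustrated bits the small adjustment flipped.
changed = sum(x != y for x, y in zip(bits_before, bits_after))
print(bits_before, bits_after, changed)
```

Under these assumptions, shifting two band energies by one or two units is enough to flip two of the three bits.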
Exploiting Database Pruning 
[0151] Other fingerprint hacking vulnerabilities arise from shortcuts employed in the database searching strategy--seeking to prune large segments of the data from further searching. For example, the system outlined above confines the large potential search space by assuming that there exists a 32 bit excerpt of the unknown song fingerprint that exactly matches (or matches with only one bit error) a 32 bit excerpt of fingerprint data in the reference database. The system looks at successive 32 bit excerpts from the unknown song fingerprint, and identifies all database fingerprints that include an excerpt presenting a very close match (i.e., 0 or 1 errors). A list of candidate song fingerprints is thereby identified that can be further checked to determine if any meets the looser match criteria generally used. (To allow non-exact fingerprint matches, the system generally allows up to 2047 bit errors in every 8192 bit block of fingerprint data.) 
[0152] The evident problem is: what if the correct "match" in the database has no 32 bit excerpt that corresponds--with just 1 or 0 bit errors--to a 32 bit excerpt from the unknown song? Such a correct match will never be found--it gets screened out at the outset. 
[0153] A hacker familiar with the system's principles will see that everything hinges on the assumption that a 32 bit string of fingerprint data will identically match (or match with only one bit error) a corresponding string in the reference database. Since these 32 bits are based on the strengths of 32 narrow frequency bands between 300 Hz and 2000 Hz, the spectrum of the content can readily be tweaked to violate this assumption, forcing a false-negative error. (E.g., notching out two of these narrow bands will force four bits of every 32 to a known state: two will go to zero--since these bands are lower in amplitude than the preceding bands, and two will go to one--since the following bands are higher in amplitude than these preceding, notched, bands. On average, half of these forced bits will be "wrong" (compared to the untweaked music), leading to two bit errors--violating the assumption on which database pruning is based.) 
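The pruning weakness described in paragraphs [0151] through [0153] can be illustrated with a minimal sketch. It is not the referenced system's implementation. It assumes that a fingerprint block is a list of 256 words of 32 bits each (8192 bits), that query and reference are already aligned word for word, and that the thresholds are the ones quoted above (at most 1 error in some 32 bit excerpt to become a candidate, at most 2047 errors per 8192 bit block to match). All function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two 32-bit words."""
    return bin(a ^ b).count("1")


def is_candidate(query, reference, max_word_errors=1):
    """Pruning step: some aligned 32-bit excerpt must match with 0 or 1 errors."""
    return any(hamming(q, r) <= max_word_errors for q, r in zip(query, reference))


def is_match(query, reference, max_block_errors=2047):
    """Looser full check: total bit errors over the whole block."""
    return sum(hamming(q, r) for q, r in zip(query, reference)) <= max_block_errors


def search(query, database):
    """Return names of references that survive pruning and the full check."""
    return [name for name, ref in database.items()
            if is_candidate(query, ref) and is_match(query, ref)]


# Hypothetical 8192-bit reference block: 256 words of 32 bits.
reference = [(i * 2654435761) & 0xFFFFFFFF for i in range(256)]
database = {"song": reference}

# Attack: force two bit errors into every 32-bit word (512 errors in total).
attacked = [w ^ 0b11 for w in reference]

print(search(reference, database))  # the unmodified content is found
print(search(attacked, database))   # the attacked content is pruned away
```

In this sketch the attacked block would pass the looser 2047-error criterion, yet it is never checked against it because no single 32 bit excerpt survives the pruning step.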
[0154] Attacks like the foregoing require a bit of effort. However, once an attacker makes the effort, the resulting hack can be spread quickly and widely. 
[0155] The exemplary fingerprinting technique noted above (which is understood to be the basis for Gracenote's commercial implementation, MusicID, built from technology licensed from Philips) is not unique in being vulnerable to various attacks. All fingerprinting techniques (including the recently announced MediaHedge, as well as CopySense and RepliCheck) are similarly believed to have vulnerabilities that can be exploited by hackers. (A quandary for potential adopters is that susceptibility of different techniques to different attacks has not been a focus of academic attention.) 
[0156] It will be recognized that crowdsourcing can help mitigate the vulnerabilities and uncertainties that are inherent in fingerprinting systems. Despite a "no-match" returned from the fingerprint-based content identification system (based on its rote search of the database for a fingerprint that matches that of the altered content), the techniques detailed herein allow human judgment to take a "second look." Such techniques can identify content that has been altered to avoid its correct identification by fingerprint techniques. (Again, once such identification is made, corresponding information is desirably entered into the database to facilitate identification of the altered content next time.)

Mihcak (U.S. 2002/0184505)
[0016] Common Attacks. The standard set of plausible attacks is itemized in the Request for Proposals (RFP) of IFPI (International Federation of the Phonographic Industry) and RIAA (Recording Industry Association of America). The RFP encapsulates the following security requirements: 
[0017] two successive D/A and A/D conversions, 
[0018] data reduction coding techniques such as MP3, 
[0019] adaptive transform coding (ATRAC), 
[0020] adaptive subband coding, 
[0021] Digital Audio Broadcasting (DAB), 
[0022] Dolby AC2 and AC3 systems, 
[0023] applying additive or multiplicative noise, 
[0024] applying a second Embedded Signal, using the same system, to a single program fragment, 
[0025] a frequency response distortion corresponding to normal analogue frequency response controls such as bass, mid and treble controls, with maximum variation of 15 dB with respect to the original signal, and 
[0026] applying frequency notches with possible frequency hopping. 
[0027] Accordingly, there is a need for a hashing technique for digital audio clips that allows slight changes to the audio clip which are tolerable or undetectable (i.e., imperceptible) to the human ear, yet do not result in a different hash value. For an audio clip hashing technique to be useful, it should accommodate the characteristics of the human auditory system and withstand various audio signal manipulation processes common to today's digital audio clip processing. 
[0028] A good audio hashing technique should generate the same unique identifier even though some forms of attacks have been done to the original audio clip, given that the altered audio clip is reasonably similar (i.e., perceptually) to a human listener when comparing with the original audio clip. However, if the modified audio clip is audibly different or the attacks cause irritation to the listeners, the hashing technique should recognize such degree of changes and produce a different hash value from the original audio clip.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659




1. An audio recognition method, comprising: 
acquiring an audio file to be recognized; 
extracting audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and 
searching audio attribute information matched with the audio feature information, in a fingerprint index database; 
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. 

2. The audio recognition method according to claim 1, 
wherein the fingerprint index database comprises the audio fingerprint set in which invalid audio fingerprints have been removed from the audio sample data by a classifier. 

3. The audio recognition method according to claim 2, wherein the classifier is established through following operations: 
extracting feature point data of audio data in a training data set as first feature point data; 
performing an audio attack on the audio data in the training data set, and extracting feature point data of audio data in the training data set after performing the audio attack as second feature point data; 
comparing the first feature point data with the second feature point data, marking disappeared or moved feature point data as counter-example data, and marking feature point data with robustness as positive example data; and 
training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example data. 

4. The audio recognition method according to claim 3, 
wherein the classifier filters the audio sample data, and removes feature point data, determined as counter-example data, as invalid audio fingerprints. 

5. The audio recognition method according to claim 4, 
wherein the classifier filters the audio sample data, and removes the feature point data, determined as the counter-example data, as the invalid audio fingerprints, comprising: 
extracting feature point data of the audio sample data; 
inputting the extracted feature point data into the classifier; and 
removing the feature point data, determined as the counter-example data, as the invalid audio fingerprints, according to a result of positive example data or counter-example data output by the classifier. 

6. The audio recognition method according to claim 3, 
wherein an algorithm for training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example comprises at least one of the following: a nearest neighbor algorithm, a support vector machine algorithm or a neural network algorithm. 

7. The audio recognition method according to claim 3, wherein feature point data comprises at least one of the following: 
energy of an audio frame where a local maximum point is located; 
energy of a frequency where the local maximum point is located, and a ratio of energy of the frequency where the local maximum point is located to energy of the audio frame where the local maximum point is located; 
a quantity of local maximum points in the audio frame; 
energy of an audio frame near the audio frame where the local maximum point is located in time dimension; or 
energy distribution of points around a local maximum point. 

8. The audio recognition method according to claim 3, 
wherein the audio attack comprises data processing on audio file, and 
the data processing comprises at least one of the following: 
segmentation, conversion of audio encoding formats, sampling rate conversion, noising processing according to different signal-to-noise ratios and noise types, over-the-air (ova) dubbing or non-ova dubbing. 

9. The audio recognition method according to claim 1, wherein the audio attribute information matched with the audio feature information comprises at least one of the following: 
a song style, a natural sound in an audio or a language of a speaker in an audio. 

10. The audio recognition method according to claim 8, wherein under a condition that the audio attribute information comprises the language of the speaker in an audio, the audio attribute information further comprises translated text information corresponding to the language of the speaker in the audio. 

11. The audio recognition method according to claim 1, further comprising:
outputting the audio attribute information. 

Claim 12 is a device Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
12. An audio recognition device, comprising: 
an acquisition module, configured to acquire an audio file to be recognized; 
an extraction module, configured to extract audio feature information of the audio file to be recognized, wherein the audio feature information comprises audio fingerprints; and 
a search module, configured to search audio attribute information matched with the audio feature information, in a fingerprint index database; 
wherein, the fingerprint index database comprises an audio fingerprint set in which invalid audio fingerprints have been removed from audio sample data. 

Claim 13 is a device Claim with limitations similar to the limitations of method Claim 2 and is rejected under similar rationale.
13. The audio recognition device according to claim 12, wherein the fingerprint index database comprises the audio fingerprint set in which invalid audio fingerprints have been removed from the audio sample data by a classifier. 

Claim 14 is a device Claim with limitations similar to the limitations of method Claim 3 and is rejected under similar rationale.
14. The audio recognition device according to claim 13, wherein the classifier is established through following operations: 
extracting feature point data of audio data in a training data set as first feature point data; 
performing an audio attack on the audio data in the training data set, and extracting feature point data of audio data in the training data set after performing the audio attack as second feature point data; 
comparing the first feature point data with the second feature point data, marking disappeared or moved feature point data as counter-example data, and marking feature point data with robustness as positive example data; and 
training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example data. 

Claim 15 is a device Claim with limitations similar to the limitations of method Claim 4 and is rejected under similar rationale.
15. The audio recognition device according to claim 14, wherein the classifier filters the audio sample data, and removes feature point data, determined as counter-example data, as invalid audio fingerprints. 

Claim 16 is a device Claim with limitations similar to the limitations of method Claim 5 and is rejected under similar rationale.
16. The audio recognition device according to claim 15, wherein the classifier filters the audio sample data, and removes the feature point data, determined as the counter-example data, as the invalid audio fingerprints, comprising: 
extracting feature point data of the audio sample data; 
inputting the extracted feature point data into the classifier; and 
removing the feature point data, determined as the counter-example data, as the invalid audio fingerprints, according to a result of positive example data or counter-example data output by the classifier. 

Claim 17 is a device Claim with limitations similar to the limitations of method Claim 6 and is rejected under similar rationale.
17. The audio recognition device according to claim 14, wherein an algorithm for training and establishing the classifier by using the first feature point data, the positive example data, and the counter-example comprises at least one of the following: a nearest neighbor algorithm, a support vector machine algorithm or a neural network algorithm. 

18. (canceled) 
19. (canceled) 
20. (canceled) 
21. (canceled) 
22. (canceled) 

Claim 23 is a device Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
23. A server, comprising: 
one or more processors; 
a memory; and 
one or more application programs, 
wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the audio recognition method of claim 1. 

Claim 24 is a computer-readable storage medium Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
24. A computer-readable storage medium, wherein computer programs stored in the storage medium are executed by a processor to perform the audio recognition method of claim 1. 

Claim 25 is an application program Claim with limitations similar to the limitations of method Claim 1 and is rejected under similar rationale.  The additional limitations are taught as shown below.
25. An application program, wherein the application program is executed to perform the audio recognition method of claim 1.