Preface |
|
xvii | |
|
|
1 | (19) |
|
|
1 | (1) |
|
|
1 | (8) |
|
|
8 | (1) |
|
Pattern Recognition Systems |
|
|
9 | (5) |
|
|
9 | (1) |
|
Segmentation and Grouping |
|
|
9 | (2) |
|
|
11 | (1) |
|
|
12 | (1) |
|
|
13 | (1) |
|
|
14 | (2) |
|
|
14 | (1) |
|
|
14 | (1) |
|
|
15 | (1) |
|
|
15 | (1) |
|
|
15 | (1) |
|
|
16 | (1) |
|
|
16 | (1) |
|
|
16 | (1) |
|
|
17 | (1) |
|
|
17 | (1) |
|
|
17 | (3) |
|
|
17 | (1) |
|
Bibliographical and Historical Remarks |
|
|
18 | (1) |
|
|
19 | (1) |
|
|
20 | (64) |
|
|
20 | (4) |
|
Bayesian Decision Theory---Continuous Features |
|
|
24 | (2) |
|
Two-Category Classification |
|
|
25 | (1) |
|
Minimum-Error-Rate Classification |
|
|
26 | (3) |
|
|
27 | (1) |
|
|
28 | (1) |
|
Classifiers, Discriminant Functions, and Decision Surfaces |
|
|
29 | (2) |
|
|
29 | (1) |
|
|
30 | (1) |
|
|
31 | (5) |
|
|
32 | (1) |
|
|
33 | (3) |
|
Discriminant Functions for the Normal Density |
|
|
36 | (9) |
|
|
36 | (3) |
|
|
39 | (2) |
|
|
41 | (1) |
|
Decision Regions for Two-Dimensional Gaussian Data |
|
|
41 | (4) |
|
Error Probabilities and Integrals |
|
|
45 | (1) |
|
Error Bounds for Normal Densities |
|
|
46 | (5) |
|
|
46 | (1) |
|
|
47 | (1) |
|
Error Bounds for Gaussian Distributions |
|
|
48 | (1) |
|
Signal Detection Theory and Operating Characteristics |
|
|
48 | (3) |
|
Bayes Decision Theory---Discrete Features |
|
|
51 | (3) |
|
Independent Binary Features |
|
|
52 | (1) |
|
Bayesian Decisions for Three-Dimensional Binary Data |
|
|
53 | (1) |
|
Missing and Noisy Features |
|
|
54 | (2) |
|
|
54 | (1) |
|
|
55 | (1) |
|
|
56 | (6) |
|
|
59 | (3) |
|
Compound Bayesian Decision Theory and Context |
|
|
62 | (22) |
|
|
63 | (1) |
|
Bibliographical and Historical Remarks |
|
|
64 | (1) |
|
|
65 | (15) |
|
|
80 | (2) |
|
|
82 | (2) |
|
Maximum-Likelihood and Bayesian Parameter Estimation |
|
|
84 | (77) |
|
|
84 | (1) |
|
Maximum-Likelihood Estimation |
|
|
85 | (5) |
|
|
85 | (3) |
|
The Gaussian Case: Unknown μ |
|
|
88 | (1) |
|
The Gaussian Case: Unknown μ and Σ |
|
|
88 | (1) |
|
|
89 | (1) |
|
|
90 | (2) |
|
The Class-Conditional Densities |
|
|
91 | (1) |
|
The Parameter Distribution |
|
|
91 | (1) |
|
Bayesian Parameter Estimation: Gaussian Case |
|
|
92 | (5) |
|
The Univariate Case: p(μ|D) |
|
|
92 | (3) |
|
The Univariate Case: p(x|D) |
|
|
95 | (1) |
|
|
95 | (2) |
|
Bayesian Parameter Estimation: General Theory |
|
|
97 | (5) |
|
|
98 | (2) |
|
When Do Maximum-Likelihood and Bayes Methods Differ? |
|
|
100 | (1) |
|
Noninformative Priors and Invariance |
|
|
101 | (1) |
|
|
102 | (1) |
|
|
102 | (5) |
|
Sufficient Statistics and the Exponential Family |
|
|
106 | (1) |
|
Problems of Dimensionality |
|
|
107 | (7) |
|
Accuracy, Dimension, and Training Sample Size |
|
|
107 | (4) |
|
|
111 | (2) |
|
|
113 | (1) |
|
Component Analysis and Discriminants |
|
|
114 | (10) |
|
Principal Component Analysis (PCA) |
|
|
115 | (2) |
|
Fisher Linear Discriminant |
|
|
117 | (4) |
|
Multiple Discriminant Analysis |
|
|
121 | (3) |
|
Expectation-Maximization (EM) |
|
|
124 | (4) |
|
Expectation-Maximization for a 2D Normal Model |
|
|
126 | (2) |
|
|
128 | (33) |
|
First-Order Markov Models |
|
|
128 | (1) |
|
First-Order Hidden Markov Models |
|
|
129 | (1) |
|
Hidden Markov Model Computation |
|
|
129 | (2) |
|
|
131 | (2) |
|
|
133 | (2) |
|
|
135 | (1) |
|
|
136 | (1) |
|
|
137 | (2) |
|
|
139 | (1) |
|
Bibliographical and Historical Remarks |
|
|
139 | (1) |
|
|
140 | (15) |
|
|
155 | (4) |
|
|
159 | (2) |
|
|
161 | (54) |
|
|
161 | (1) |
|
|
161 | (3) |
|
|
164 | (10) |
|
|
167 | (1) |
|
Convergence of the Variance |
|
|
167 | (1) |
|
|
168 | (1) |
|
|
168 | (4) |
|
Probabilistic Neural Networks (PNNs) |
|
|
172 | (2) |
|
Choosing the Window Function |
|
|
174 | (1) |
|
k_n-Nearest-Neighbor Estimation |
|
|
174 | (3) |
|
k_n-Nearest-Neighbor and Parzen-Window Estimation |
|
|
176 | (1) |
|
Estimation of A Posteriori Probabilities |
|
|
177 | (1) |
|
The Nearest-Neighbor Rule |
|
|
177 | (10) |
|
Convergence of the Nearest Neighbor |
|
|
179 | (1) |
|
Error Rate for the Nearest-Neighbor Rule |
|
|
180 | (1) |
|
|
180 | (2) |
|
The k-Nearest-Neighbor Rule |
|
|
182 | (2) |
|
Computational Complexity of the k-Nearest-Neighbor Rule |
|
|
184 | (3) |
|
Metrics and Nearest-Neighbor Classification |
|
|
187 | (5) |
|
|
187 | (1) |
|
|
188 | (4) |
|
|
192 | (3) |
|
Reduced Coulomb Energy Networks |
|
|
195 | (2) |
|
Approximations by Series Expansions |
|
|
197 | (18) |
|
|
199 | (1) |
|
Bibliographical and Historical Remarks |
|
|
200 | (1) |
|
|
201 | (8) |
|
|
209 | (4) |
|
|
213 | (2) |
|
Linear Discriminant Functions |
|
|
215 | (67) |
|
|
215 | (1) |
|
Linear Discriminant Functions and Decision Surfaces |
|
|
216 | (3) |
|
|
216 | (2) |
|
|
218 | (1) |
|
Generalized Linear Discriminant Functions |
|
|
219 | (4) |
|
The Two-Category Linearly Separable Case |
|
|
223 | (4) |
|
|
224 | (1) |
|
Gradient Descent Procedures |
|
|
224 | (3) |
|
Minimizing the Perceptron Criterion Function |
|
|
227 | (8) |
|
The Perceptron Criterion Function |
|
|
227 | (2) |
|
Convergence Proof for Single-Sample Correction |
|
|
229 | (3) |
|
Some Direct Generalizations |
|
|
232 | (3) |
|
|
235 | (3) |
|
|
235 | (2) |
|
|
237 | (1) |
|
|
238 | (1) |
|
Minimum Squared-Error Procedures |
|
|
239 | (10) |
|
Minimum Squared-Error and the Pseudoinverse |
|
|
240 | (1) |
|
Constructing a Linear Classifier by Matrix Pseudoinverse |
|
|
241 | (1) |
|
Relation to Fisher's Linear Discriminant |
|
|
242 | (1) |
|
Asymptotic Approximation to an Optimal Discriminant |
|
|
243 | (2) |
|
The Widrow-Hoff or LMS Procedure |
|
|
245 | (1) |
|
Stochastic Approximation Methods |
|
|
246 | (3) |
|
The Ho-Kashyap Procedures |
|
|
249 | (7) |
|
|
250 | (1) |
|
|
251 | (2) |
|
|
253 | (1) |
|
|
253 | (3) |
|
Linear Programming Algorithms |
|
|
256 | (3) |
|
|
256 | (1) |
|
The Linearly Separable Case |
|
|
257 | (1) |
|
Minimizing the Perceptron Criterion Function |
|
|
258 | (1) |
|
|
259 | (6) |
|
|
263 | (1) |
|
|
264 | (1) |
|
Multicategory Generalizations |
|
|
265 | (17) |
|
|
266 | (1) |
|
Convergence of the Fixed-Increment Rule |
|
|
266 | (2) |
|
Generalizations for MSE Procedures |
|
|
268 | (1) |
|
|
269 | (1) |
|
Bibliographical and Historical Remarks |
|
|
270 | (1) |
|
|
271 | (7) |
|
|
278 | (3) |
|
|
281 | (1) |
|
Multilayer Neural Networks |
|
|
282 | (68) |
|
|
282 | (2) |
|
Feedforward Operation and Classification |
|
|
284 | (4) |
|
General Feedforward Operation |
|
|
286 | (1) |
|
Expressive Power of Multilayer Networks |
|
|
287 | (1) |
|
Backpropagation Algorithm |
|
|
288 | (8) |
|
|
289 | (4) |
|
|
293 | (2) |
|
|
295 | (1) |
|
|
296 | (3) |
|
|
296 | (2) |
|
|
298 | (1) |
|
|
298 | (1) |
|
How Important Are Multiple Minima? |
|
|
299 | (1) |
|
Backpropagation as Feature Mapping |
|
|
299 | (4) |
|
Representations at the Hidden Layer---Weights |
|
|
302 | (1) |
|
Backpropagation, Bayes Theory and Probability |
|
|
303 | (2) |
|
Bayes Discriminants and Neural Networks |
|
|
303 | (1) |
|
|
304 | (1) |
|
Related Statistical Techniques |
|
|
305 | (1) |
|
Practical Techniques for Improving Backpropagation |
|
|
306 | (12) |
|
|
307 | (1) |
|
Parameters for the Sigmoid |
|
|
308 | (1) |
|
|
308 | (1) |
|
|
309 | (1) |
|
|
310 | (1) |
|
|
310 | (1) |
|
|
310 | (1) |
|
|
311 | (1) |
|
|
312 | (1) |
|
|
313 | (1) |
|
|
314 | (1) |
|
|
315 | (1) |
|
On-Line, Stochastic or Batch Training? |
|
|
316 | (1) |
|
|
316 | (1) |
|
|
317 | (1) |
|
|
318 | (1) |
|
|
318 | (6) |
|
|
318 | (1) |
|
|
319 | (1) |
|
|
320 | (1) |
|
Conjugate Gradient Descent |
|
|
321 | (1) |
|
Conjugate Gradient Descent |
|
|
322 | (2) |
|
Additional Networks and Training Methods |
|
|
324 | (6) |
|
Radial Basis Function Networks (RBFs) |
|
|
324 | (1) |
|
|
325 | (1) |
|
|
325 | (1) |
|
|
326 | (2) |
|
|
328 | (1) |
|
|
329 | (1) |
|
Regularization, Complexity Adjustment and Pruning |
|
|
330 | (20) |
|
|
333 | (1) |
|
Bibliographical and Historical Remarks |
|
|
333 | (2) |
|
|
335 | (8) |
|
|
343 | (4) |
|
|
347 | (3) |
|
|
350 | (44) |
|
|
350 | (1) |
|
|
351 | (9) |
|
|
351 | (1) |
|
|
352 | (5) |
|
Deterministic Simulated Annealing |
|
|
357 | (3) |
|
|
360 | (10) |
|
Stochastic Boltzmann Learning of Visible States |
|
|
360 | (5) |
|
Missing Features and Category Constraints |
|
|
365 | (1) |
|
Deterministic Boltzmann Learning |
|
|
366 | (1) |
|
Initialization and Setting Parameters |
|
|
367 | (3) |
|
Boltzmann Networks and Graphical Models |
|
|
370 | (3) |
|
|
372 | (1) |
|
|
373 | (5) |
|
|
373 | (4) |
|
|
377 | (1) |
|
|
378 | (1) |
|
|
378 | (16) |
|
|
381 | (1) |
|
Bibliographical and Historical Remarks |
|
|
381 | (2) |
|
|
383 | (5) |
|
|
388 | (3) |
|
|
391 | (3) |
|
|
394 | (59) |
|
|
394 | (1) |
|
|
395 | (1) |
|
|
396 | (15) |
|
|
397 | (1) |
|
Query Selection and Node Impurity |
|
|
398 | (4) |
|
|
402 | (1) |
|
|
403 | (1) |
|
Assignment of Leaf Node Labels |
|
|
404 | (1) |
|
|
404 | (2) |
|
|
406 | (1) |
|
|
407 | (1) |
|
Multivariate Decision Trees |
|
|
408 | (1) |
|
|
409 | (1) |
|
|
409 | (1) |
|
Surrogate Splits and Missing Attributes |
|
|
410 | (1) |
|
|
411 | (2) |
|
|
411 | (1) |
|
|
411 | (1) |
|
Which Tree Classifier Is Best? |
|
|
412 | (1) |
|
|
413 | (8) |
|
|
415 | (3) |
|
|
418 | (2) |
|
|
420 | (1) |
|
String Matching with Errors |
|
|
420 | (1) |
|
String Matching with the "Don't-Care" Symbol |
|
|
421 | (1) |
|
|
421 | (8) |
|
|
422 | (2) |
|
|
424 | (1) |
|
A Grammar for Pronouncing Numbers |
|
|
425 | (1) |
|
Recognition Using Grammars |
|
|
426 | (3) |
|
|
429 | (2) |
|
|
431 | (1) |
|
|
431 | (22) |
|
|
433 | (1) |
|
|
434 | (1) |
|
Bibliographical and Historical Remarks |
|
|
435 | (2) |
|
|
437 | (9) |
|
|
446 | (4) |
|
|
450 | (3) |
|
Algorithm-Independent Machine Learning |
|
|
453 | (64) |
|
|
453 | (1) |
|
Lack of Inherent Superiority of Any Classifier |
|
|
454 | (11) |
|
|
454 | (3) |
|
No Free Lunch for Binary Data |
|
|
457 | (1) |
|
|
458 | (3) |
|
Minimum Description Length (MDL) |
|
|
461 | (2) |
|
Minimum Description Length Principle |
|
|
463 | (1) |
|
Overfitting Avoidance and Occam's Razor |
|
|
464 | (1) |
|
|
465 | (6) |
|
Bias and Variance for Regression |
|
|
466 | (2) |
|
Bias and Variance for Classification |
|
|
468 | (3) |
|
Resampling for Estimating Statistics |
|
|
471 | (4) |
|
|
472 | (1) |
|
Jackknife Estimate of Bias and Variance of the Mode |
|
|
473 | (1) |
|
|
474 | (1) |
|
Resampling for Classifier Design |
|
|
475 | (7) |
|
|
475 | (1) |
|
|
476 | (4) |
|
|
480 | (2) |
|
Arcing, Learning with Queries, Bias and Variance |
|
|
482 | (1) |
|
Estimating and Comparing Classifiers |
|
|
482 | (13) |
|
|
483 | (1) |
|
|
483 | (2) |
|
Jackknife and Bootstrap Estimation of Classification Accuracy |
|
|
485 | (1) |
|
Maximum-Likelihood Model Comparison |
|
|
486 | (1) |
|
Bayesian Model Comparison |
|
|
487 | (2) |
|
The Problem-Average Error Rate |
|
|
489 | (3) |
|
Predicting Final Performance from Learning Curves |
|
|
492 | (2) |
|
The Capacity of a Separating Plane |
|
|
494 | (1) |
|
|
495 | (22) |
|
Component Classifiers with Discriminant Functions |
|
|
496 | (2) |
|
Component Classifiers without Discriminant Functions |
|
|
498 | (1) |
|
|
499 | (1) |
|
Bibliographical and Historical Remarks |
|
|
500 | (2) |
|
|
502 | (6) |
|
|
508 | (5) |
|
|
513 | (4) |
|
Unsupervised Learning and Clustering |
|
|
517 | (84) |
|
|
517 | (1) |
|
Mixture Densities and Identifiability |
|
|
518 | (1) |
|
Maximum-Likelihood Estimates |
|
|
519 | (2) |
|
Application to Normal Mixtures |
|
|
521 | (9) |
|
Case 1: Unknown Mean Vectors |
|
|
522 | (2) |
|
Case 2: All Parameters Unknown |
|
|
524 | (2) |
|
|
526 | (2) |
|
|
528 | (2) |
|
Unsupervised Bayesian Learning |
|
|
530 | (7) |
|
|
530 | (1) |
|
Learning the Parameter Vector |
|
|
531 | (3) |
|
Unsupervised Learning of Gaussian Data |
|
|
534 | (2) |
|
Decision-Directed Approximation |
|
|
536 | (1) |
|
Data Description and Clustering |
|
|
537 | (5) |
|
|
538 | (4) |
|
Criterion Functions for Clustering |
|
|
542 | (6) |
|
The Sum-of-Squared-Error Criterion |
|
|
542 | (1) |
|
Related Minimum Variance Criteria |
|
|
543 | (1) |
|
|
544 | (2) |
|
|
546 | (2) |
|
|
548 | (2) |
|
|
550 | (7) |
|
|
551 | (1) |
|
Agglomerative Hierarchical Clustering |
|
|
552 | (3) |
|
Stepwise-Optimal Hierarchical Clustering |
|
|
555 | (1) |
|
Hierarchical Clustering and Induced Metrics |
|
|
556 | (1) |
|
|
557 | (2) |
|
|
559 | (7) |
|
Unknown Number of Clusters |
|
|
561 | (2) |
|
|
563 | (2) |
|
|
565 | (1) |
|
|
566 | (2) |
|
|
568 | (5) |
|
Principal Component Analysis (PCA) |
|
|
568 | (1) |
|
Nonlinear Component Analysis (NLCA) |
|
|
569 | (1) |
|
Independent Component Analysis (ICA) |
|
|
570 | (3) |
|
Low-Dimensional Representations and Multidimensional Scaling (MDS) |
|
|
573 | (28) |
|
Self-Organizing Feature Maps |
|
|
576 | (4) |
|
Clustering and Dimensionality Reduction |
|
|
580 | (1) |
|
|
581 | (1) |
|
Bibliographical and Historical Remarks |
|
|
582 | (1) |
|
|
583 | (10) |
|
|
593 | (5) |
|
|
598 | (3) |
A Mathematical Foundations |
|
601 | (36) |
|
|
601 | (3) |
|
|
604 | (6) |
|
A.2.1 Notation and Preliminaries |
|
|
604 | (1) |
|
|
605 | (1) |
|
|
606 | (1) |
|
A.2.4 Derivatives of Matrices |
|
|
606 | (2) |
|
A.2.5 Determinant and Trace |
|
|
608 | (1) |
|
|
609 | (1) |
|
A.2.7 Eigenvectors and Eigenvalues |
|
|
609 | (1) |
|
A.3 Lagrange Optimization |
|
|
610 | (1) |
|
|
611 | (12) |
|
A.4.1 Discrete Random Variables |
|
|
611 | (1) |
|
|
611 | (1) |
|
A.4.3 Pairs of Discrete Random Variables |
|
|
612 | (1) |
|
A.4.4 Statistical Independence |
|
|
613 | (1) |
|
A.4.5 Expected Values of Functions of Two Variables |
|
|
613 | (1) |
|
A.4.6 Conditional Probability |
|
|
614 | (1) |
|
A.4.7 The Law of Total Probability and Bayes' Rule |
|
|
615 | (1) |
|
A.4.8 Vector Random Variables |
|
|
616 | (1) |
|
A.4.9 Expectations, Mean Vectors and Covariance Matrices |
|
|
617 | (1) |
|
A.4.10 Continuous Random Variables |
|
|
618 | (2) |
|
A.4.11 Distributions of Sums of Independent Random Variables |
|
|
620 | (1) |
|
A.4.12 Normal Distributions |
|
|
621 | (2) |
|
A.5 Gaussian Derivatives and Integrals |
|
|
623 | (5) |
|
A.5.1 Multivariate Normal Densities |
|
|
624 | (2) |
|
A.5.2 Bivariate Normal Densities |
|
|
626 | (2) |
|
|
628 | (2) |
|
|
629 | (1) |
|
|
630 | (3) |
|
A.7.1 Entropy and Information |
|
|
630 | (2) |
|
|
632 | (1) |
|
|
632 | (1) |
|
A.8 Computational Complexity |
|
|
633 | (4) |
|
|
635 | (2) |
Index |
|
637 | |