Exact Cross-Validation for kNN : application to passive and active learning in classification
[Validation-croisée exacte pour les kNN : application à l’apprentissage passif et actif en classification]
Journal de la société française de statistique, Tome 152 (2011) no. 3, pp. 83-97.

Pour l’algorithme de classification des k plus proches voisins (kNN), une expression explicite de l’estimateur du taux d’erreur de classification par validation croisée Leave p Out (LpO) est proposée. Cette expression explicite est d’abord utilisée dans le cadre de l’apprentissage passif pour étudier l’impact du choix du paramètre p du LpO sur le choix de k dans l’algorithme kNN. On s’intéresse ensuite au problème de l’apprentissage actif (active learning). Une procédure de sélection des exemples basée sur la recommandation du comité des classificateurs LpO est considérée. L’influence du paramètre p sur le choix des nouveaux exemples et sur le choix du paramètre k à chaque étape de l’apprentissage actif est étudiée. En particulier, il est montré que l’évolution de la valeur du paramètre k choisie par LpO en apprentissage actif est différente de celle observée en apprentissage passif.

In the binary classification framework, a closed form expression of the cross-validation Leave-p-Out (LpO) risk estimator for the k Nearest Neighbor algorithm (kNN) is derived. It is first used to study the LpO risk minimization strategy for choosing k in the passive learning setting. The impact of p on the choice of k and the LpO estimation of the risk are inferred. In the active learning setting, a procedure is proposed that selects new examples using a LpO committee of kNN classifiers. The influence of p on the choice of new examples and the tuning of k at each step is investigated. The behavior of k chosen by LpO is shown to be different from what is observed in passive learning.

Keywords: Classification, Cross-validation, kNN algorithm Active learning
Mot clés : Classification, Valildation-croisée, kNN, Apprentissage actif
@article{JSFS_2011__152_3_83_0,
     author = {Celisse, Alain and Mary-Huard, Tristan},
     title = {Exact {Cross-Validation} for $k${NN} : application to passive and active learning in classification},
     journal = {Journal de la soci\'et\'e fran\c{c}aise de statistique},
     pages = {83--97},
     publisher = {Soci\'et\'e fran\c{c}aise de statistique},
     volume = {152},
     number = {3},
     year = {2011},
     mrnumber = {2871178},
     zbl = {1316.62084},
     language = {en},
     url = {https://www.numdam.org/item/JSFS_2011__152_3_83_0/}
}
TY  - JOUR
AU  - Celisse, Alain
AU  - Mary-Huard, Tristan
TI  - Exact Cross-Validation for $k$NN : application to passive and active learning in classification
JO  - Journal de la société française de statistique
PY  - 2011
SP  - 83
EP  - 97
VL  - 152
IS  - 3
PB  - Société française de statistique
UR  - https://www.numdam.org/item/JSFS_2011__152_3_83_0/
LA  - en
ID  - JSFS_2011__152_3_83_0
ER  - 
%0 Journal Article
%A Celisse, Alain
%A Mary-Huard, Tristan
%T Exact Cross-Validation for $k$NN : application to passive and active learning in classification
%J Journal de la société française de statistique
%D 2011
%P 83-97
%V 152
%N 3
%I Société française de statistique
%U https://www.numdam.org/item/JSFS_2011__152_3_83_0/
%G en
%F JSFS_2011__152_3_83_0
Celisse, Alain; Mary-Huard, Tristan. Exact Cross-Validation for $k$NN : application to passive and active learning in classification. Journal de la société française de statistique, Tome 152 (2011) no. 3, pp. 83-97. https://www.numdam.org/item/JSFS_2011__152_3_83_0/

[1] Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection, Statist. Surv., Volume 4 (2010), pp. 40-79 | MR | Zbl

[2] Celisse, A.; Robin, S. Nonparametric density estimation by exact leave-p-out cross-validation, Comput. Statist. Data Anal., Volume 52 (2008) no. 5, pp. 2350-2368 | DOI | MR | Zbl

[3] Devroye, L.; Gyorfi, L.; Lugosi, G. A probabilistic theory of pattern recognition, Springer, 1996 | MR | Zbl

[4] Fix, E.; Hodges, J. Discriminatory analysis- nonparametric discrimination: Consistency principles, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA (1991) (Reprint of original work from 1952)

[5] Fix, E.; Hodges, J. Nonparametric Discrimination: small sample performance, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA (1991) (Reprint of original work from 1952)

[6] Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; Bloomfield, C.D.; Lander, E.S. Class prediction and discovery using gene expression data, Science, Volume 286 (1999), pp. 531-537

[7] Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction., Springer, New York, 2001 | MR | Zbl

[8] McCallum, A.; Nigam, K. Employing EM and Pool-Based Active Learning for Text Classification, Mach. Learn.: Proc. of the Fifteenth Intern. Conf. (ICML ’98) (1998), pp. 359-367

[9] Settles, B.; Craven, M.; Friedland, L. Active Learning with Real Annotation Costs, Proceedings of the NIPS Workshop on Cost-Sensitive Learning (2008), pp. 1-10

[10] Settles, B. Active Learning Literature Survey (2009) no. 1648 (Comp. Sci. Tech. Report)

[11] Simard, P.Y.; LeCun, Y.; Denker, J.S.; Victorri, B. Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation, International Journal of Imaging Systems and Technology, Volume 11 (2001) no. 3

[12] Seung, H. S.; Opper, M; Sompolinsky, H. Query by committee, Annual Workshop on Computational Learning Theory (1992), pp. 287-294

[13] Stone, M. Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B, Volume 36 (1974), pp. 111-147 | MR | Zbl