Apprentissage d'un ensemble pré-structuré de concepts d'un domaine : l'outil GALEX [Learning a pre-structured set of domain concepts: the GALEX tool]
Mathématiques informatique et sciences humaines, Volume 148 (1999), pp. 41-71.

The volume of textual information is growing exponentially, both as archives and as working documents, in academic organizations, public administrations and companies. One way to structure this mountain of textual data is to build a knowledge model for indexing it. Knowledge acquisition should make it possible to extract and classify the data so as to arrive at a conceptual index. Traditionally, the classification methods of data analysis were designed for conventional data tables in object/attribute/value form. We present Galex (Graph Analyzer for LEXicometry), which structures knowledge by means of a term clustering method. This structuring aims to synthesize information content, which is of major interest for applications such as information filtering or hypertext navigation across similar documents. Galex takes into account the nature of the data to which it is applied: natural language. The complexity of natural language is well known: ambiguity of meaning, multiple grammatical constructions of a sentence, style, creation of new terms, and so on. We show that, by integrating poorly defined but useful notions such as "concept", "ontology", "term" and "corpus", clustering can be improved by adding linguistic knowledge. We base our approach on typical phenomena such as graph-statistical relations between terms, schema relations within a context, and the canonical reduction of variant forms.

Keywords: term clustering, knowledge acquisition, ontology, concept learning, corpus analysis, text mining, statistical data analysis
@article{MSH_1999__148__41_0,
     author = {Turenne, Nicolas},
     title = {Apprentissage d'un ensemble pr\'e-structur\'e de concepts d'un domaine : l'outil {GALEX}},
     journal = {Math\'ematiques informatique et sciences humaines},
     pages = {41--71},
     publisher = {Ecole des hautes-\'etudes en sciences sociales},
     volume = {148},
     year = {1999},
     language = {fr},
     url = {http://www.numdam.org/item/MSH_1999__148__41_0/}
}
Turenne, Nicolas. Apprentissage d'un ensemble pré-structuré de concepts d'un domaine : l'outil GALEX. Mathématiques informatique et sciences humaines, Volume 148 (1999), pp. 41-71. http://www.numdam.org/item/MSH_1999__148__41_0/
