Feature Selection in SVM Text Categorization

Hirotoshi Taira (NTT Communication Science Labs), Masahiko Haruno (ATR Human Information Processing Research Labs)

AAAI 1999

Conference Paper Natural Language and Information Retrieval Artificial Intelligence

Abstract

This paper investigates the effect of prior feature selection in Support Vector Machine (SVM) text categorization. The input space was gradually increased by using mutual information (MI) filtering and part-of-speech (POS) filtering, which determine the portion of words that are appropriate for learning from the information-theoretic and the linguistic perspectives, respectively. We tested the two filtering methods on SVMs as well as on a decision tree algorithm, C4.5. The SVM results common to both filtering methods are that 1) the optimal number of features differed completely across categories, and 2) the average performance for all categories was best when all of the words were used. In addition, a comparison of the two filtering methods clarified that POS filtering on SVMs consistently outperformed MI filtering, which indicates that SVMs cannot find irrelevant parts of speech. These results suggest a simple strategy for SVM text categorization: use a full number of words found through a rough filtering technique like part-of-speech tagging.
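As an illustration of the MI filtering idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a common formulation: the mutual information between the presence of a word in a document and a binary category label, with the top k words kept as features. The abstract does not state the exact MI definition used in the paper, and the function names and toy data here are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """MI in bits between presence of `word` and a binary category label.

    Assumed formulation (not taken from the paper):
    sum over (w, c) of p(w, c) * log2(p(w, c) / (p(w) * p(c))).
    """
    n = len(docs)
    # Joint counts over (word present?, document in category?)
    joint = Counter((word in doc, bool(label)) for doc, label in zip(docs, labels))
    mi = 0.0
    for (w, c), n_wc in joint.items():
        n_w = sum(v for (w2, _), v in joint.items() if w2 == w)  # marginal count for w
        n_c = sum(v for (_, c2), v in joint.items() if c2 == c)  # marginal count for c
        mi += (n_wc / n) * math.log2(n_wc * n / (n_w * n_c))
    return mi


def top_k_words(docs, labels, k):
    """Return the k words with the highest MI score for this category."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda w: mutual_information(docs, labels, w), reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of tokens,
# and label 1 marks membership in the target category.
docs = [
    {"wheat", "price", "rise"},
    {"wheat", "export"},
    {"bank", "rate", "rise"},
    {"bank", "price"},
]
labels = [1, 1, 0, 0]

selected = top_k_words(docs, labels, 2)
print(selected)
```

Increasing k step by step and retraining the classifier on each enlarged feature set would mirror the abstract's description of a gradually increased input space. Because the ranking is computed per category, the k that works best can differ from one category to another.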
