Free English article on support vector machine clustering

Excerpt from the article:

1. Introduction

In supervised classification [2,18,37], we are given a set of objects $\Omega$ partitioned, in its simplest setting, into two classes, and the aim is to classify new objects. An object $i \in \Omega$ is represented by a vector $(x_i, x'_i, y_i)$. The feature vector $x_i$ is associated with $J$ categorical features, which can be binarized by splitting each feature into a series of 0-1 dummy features, one for each category, and takes values on a set $X \subseteq \{0,1\}^{\sum_{j=1}^{J} K_j}$, where $K_j$ is the number of categories of feature $j$. Thus, $x_i = (x_{i,j,k})$, where $x_{i,j,k}$ is equal to 1 if the value of categorical feature $j$ in object $i$ is equal to category $k$, and 0 otherwise. The feature vector $x'_i$ is associated with $J'$ continuous features and takes values on a set $X' \subseteq \mathbb{R}^{J'}$. Finally, $y_i \in \{-1, +1\}$ is the class membership of object $i$. Information about objects is only available in the so-called training sample, with $n$ objects.
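The binarization described above can be sketched in a few lines of Python. The function and variable names (`binarize`, `categories`) are illustrative, not from the paper: each categorical feature $j$ with $K_j$ categories is expanded into $K_j$ dummy columns, so that $x_{i,j,k} = 1$ exactly when object $i$ takes category $k$ on feature $j$.

```python
def binarize(objects, categories):
    """One-hot encode categorical objects.

    objects: list of tuples, one category value per feature.
    categories: list of lists; categories[j] is the ordered list of
                categories of feature j (length K_j).
    Returns a list of 0-1 rows, each of length sum_j K_j.
    """
    encoded = []
    for obj in objects:
        row = []
        for j, value in enumerate(obj):
            # K_j dummy entries for feature j: 1 at the matching category.
            row.extend(1 if value == k else 0 for k in categories[j])
        encoded.append(row)
    return encoded

# Illustrative data: J = 2 features with K_1 = 3 and K_2 = 2 categories.
objects = [("red", "small"), ("blue", "large")]
categories = [["red", "green", "blue"], ["small", "large"]]
X = binarize(objects, categories)
# Each row has sum_j K_j = 3 + 2 = 5 entries, e.g. [1, 0, 0, 1, 0].
```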

In many applications of supervised classification, datasets are composed of a large number of features and/or objects [26], making it hard both to build the classifier and to interpret the results. In such cases, it is desirable to obtain a sparser classifier, which may make classification easier to handle and interpret, less prone to overfitting, and computationally cheaper when classifying new objects. The most popular strategy proposed in the literature to achieve this goal is feature selection [14,15,17,35], which aims at selecting the subset of features most relevant for classification while maintaining or improving accuracy and reducing the risk of overfitting. Feature selection reduces the number of features by means of an all-or-nothing procedure: for categorical features, binarized as explained above, it simply ignores some categories of some features and gives no insight into the relationship between feature categories. These issues may entail a significant loss of information.
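The "all-or-nothing" loss of information can be illustrated on binarized data. This is a hypothetical sketch (the data and the selected column indices are made up): dropping dummy columns discards every distinction those categories carried, so objects that differed only there become indistinguishable.

```python
import numpy as np

# 3 objects, 5 dummy columns: columns 0-2 are the three categories of
# feature 1, columns 3-4 the two categories of feature 2 (illustrative).
X = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]])

# All-or-nothing selection: keep only columns 1, 3, 4, i.e. ignore two
# categories of feature 1 entirely.
selected = [1, 3, 4]
X_sel = X[:, selected]

# Objects 0 and 2 differed only on the dropped categories, so after
# selection their rows coincide: the distinction is lost.
```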

A state-of-the-art method in supervised classification is the support vector machine (SVM). The SVM aims at separating both classes by means of a classifier, $\omega^\top x + (\omega')^\top x' + b = 0$, $(\omega, \omega')$ being the so-called score vector, where $\omega$ is associated with the categorical features and $\omega'$ is associated with the continuous features. Given an object $i$, it is classified in the positive or the negative class according to the sign of the score function, $\mathrm{sign}(\omega^\top x_i + (\omega')^\top x'_i + b)$, while in the case $\omega^\top x_i + (\omega')^\top x'_i + b = 0$ the object is classified randomly. See [5,11,17,24,29] for successful applications of the SVM and [10] for a recent review on Mathematical Optimization and the SVM.
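The decision rule above can be sketched directly; this is a minimal illustration, not the paper's method, and the score vector $(\omega, \omega')$ and intercept $b$ are assumed to be already trained (the numeric values below are arbitrary).

```python
import random

import numpy as np

def classify(x, x_cont, omega, omega_prime, b):
    """SVM decision rule: sign of the score function.

    x: 0-1 vector of binarized categorical features.
    x_cont: vector of continuous features.
    """
    score = omega @ x + omega_prime @ x_cont + b
    if score > 0:
        return +1
    if score < 0:
        return -1
    # Score exactly 0: the object is classified randomly, as in the text.
    return random.choice([-1, +1])

# Arbitrary illustrative parameters.
omega = np.array([1.0, -1.0])
omega_prime = np.array([0.5])
b = -0.25

label = classify(np.array([1, 0]), np.array([0.5]), omega, omega_prime, b)
# score = 1*1 + (-1)*0 + 0.5*0.5 - 0.25 = 1.0 > 0, so label is +1.
```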