Computer Science ›› 2013, Vol. 40 ›› Issue (2): 120-123.
Abstract: Character n-grams can effectively capture author-specific stylistic information in text. To address the high sparsity and high redundancy of the resulting feature space, this study proposes an ensemble classification algorithm based on semi-random feature sampling. First, the whole feature space is divided into several individual-author feature sets by a divergence rule. Each of these sets is then divided into equally sized subspaces by a semi-random selection method, and a base classifier is trained on each subspace. Finally, the base classifiers are combined into an ensemble by majority voting. The algorithm was evaluated experimentally on a real-life dataset. The results show that it achieves a considerable improvement in accuracy and robustness over the benchmark techniques in Chinese writeprint identification (the random subspace method, bagging, and support vector machines).
Key words: Writeprint, Semi-random feature sampling, Individual feature set, Ensemble classifier, Diversity
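The pipeline described in the abstract (per-author feature sets, equally sized subspaces, one base classifier per subspace, majority voting) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the divergence rule, the semi-random selection method, or the base classifier. Here the per-author feature sets are assumed to be given as input, "semi-random" is approximated by shuffling within each author's set before cutting it into equal chunks, and a simple nearest-centroid classifier stands in for the base learner. All names (`split_into_subspaces`, `NearestCentroid`, `train_ensemble`, `predict_majority`) are hypothetical.

```python
import random
from collections import Counter


def split_into_subspaces(feature_ids, n_subspaces, rng):
    """Shuffle one author's feature set and cut it into equally sized chunks.

    Assumption: stands in for the paper's unspecified semi-random selection.
    """
    ids = list(feature_ids)
    rng.shuffle(ids)
    size = len(ids) // n_subspaces  # leftover features are dropped
    if size == 0:
        raise ValueError("more subspaces requested than features available")
    return [ids[i * size:(i + 1) * size] for i in range(n_subspaces)]


class NearestCentroid:
    """Stand-in base classifier that only sees the given feature columns."""

    def __init__(self, columns):
        self.columns = columns
        self.centroids = {}

    def fit(self, X, y):
        grouped = {}
        for row, label in zip(X, y):
            grouped.setdefault(label, []).append([row[c] for c in self.columns])
        for label, rows in grouped.items():
            # Mean of each selected column over the rows of this label.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, row):
        point = [row[c] for c in self.columns]

        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(point, self.centroids[label]))

        # Sorting the labels makes tie-breaking deterministic.
        return min(sorted(self.centroids), key=squared_distance)


def train_ensemble(X, y, author_feature_sets, n_subspaces, seed=0):
    """Train one base classifier per subspace of each author's feature set."""
    rng = random.Random(seed)
    ensemble = []
    for feature_ids in author_feature_sets.values():
        for columns in split_into_subspaces(feature_ids, n_subspaces, rng):
            ensemble.append(NearestCentroid(columns).fit(X, y))
    return ensemble


def predict_majority(ensemble, row):
    """Combine the base classifiers' predictions by majority vote."""
    votes = Counter(clf.predict_one(row) for clf in ensemble)
    return votes.most_common(1)[0][0]
```

In this sketch, `author_feature_sets` maps each author to the feature column indices assigned to that author, so the number of base classifiers equals the number of authors times `n_subspaces`. In practice the rows of `X` would be character n-gram frequency vectors.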
URL: https://www.jsjkx.com/EN/
https://www.jsjkx.com/EN/Y2013/V40/I2/120