Authors: Tanmoy Chakraborty
Stylometry is the study of the unique linguistic styles and writing behaviors of individuals. It underlies core text categorization tasks such as authorship identification and plagiarism detection. Although a reasonable amount of work has been done for English over a long period, no major work has so far been done for Bengali. In this work, we present a strategy for authorship identification of documents written in Bengali. It takes into account a writer-independent model and builds a robust system that reduces the pattern-recognition problem. We adopt a set of fine-grained stylistic features for analyzing the text and use them to develop two different models: a statistical similarity model consisting of three measures and their combination, and a machine learning model with Decision Tree, Neural Network, and SVM classifiers. Experimental results show that SVM outperforms the others, with an average accuracy of 83.3% under 10-fold cross-validation using the same set of features. We also assess the relative importance of each stylistic feature and show that some of them remain consistently significant in every model used in this experiment.
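To make the similarity-based approach concrete, the following is a minimal sketch of authorship attribution from stylistic features. It is not the system described above: the four features (average word length, average sentence length, type-token ratio, punctuation rate), the use of cosine similarity, and the function names `stylistic_features`, `build_profile`, and `attribute` are all assumptions chosen for illustration, since the abstract names neither its features nor its three similarity measures. The toy texts are in English for readability; the tokenizer also splits on the Bengali danda (।).

```python
import math
import re

# Illustrative assumptions: these four features and cosine similarity are
# stand-ins, not the feature set or the measures used in the paper.
SENTENCE_END = r"[।.!?]+"          # Bengali danda plus Latin terminators
WORD = r"[^\s।.!?,;:]+"            # runs of non-space, non-punctuation characters


def stylistic_features(text):
    """Return [avg word length, avg sentence length, type-token ratio, punctuation rate]."""
    words = re.findall(WORD, text)
    sentences = [s for s in re.split(SENTENCE_END, text) if s.strip()]
    if not words or not sentences:
        return [0.0, 0.0, 0.0, 0.0]
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    punct_rate = len(re.findall(r"[,;:]", text)) / len(words)
    return [avg_word_len, avg_sent_len, type_token_ratio, punct_rate]


def build_profile(texts):
    """Average the feature vectors of an author's known documents."""
    vectors = [stylistic_features(t) for t in texts]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def attribute(text, profiles):
    """Return the author whose profile is most similar to the text."""
    features = stylistic_features(text)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


# Toy corpus with two hypothetical authors.
profiles = {
    "A": build_profile(["I go. You run. We eat."]),
    "B": build_profile([
        "The extraordinary committee deliberated extensively, considering "
        "numerous alternatives; nevertheless, conclusions remained elusive throughout."
    ]),
}

short_query = "He sat. She ran. They won."
long_query = (
    "Government institutions frequently demonstrate considerable reluctance, "
    "particularly regarding substantial organizational transformation."
)
print(attribute(short_query, profiles))
print(attribute(long_query, profiles))
```

Because the features are left unscaled, the larger-valued ones (word and sentence length) dominate the cosine score. A real system would normalize each feature, and the same vectors could instead be passed to a classifier such as an SVM and evaluated with 10-fold cross-validation.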