TY - JOUR
T1 - Learning stylometric representations for authorship analysis
AU - Ding, Steven H.H.
AU - Fung, Benjamin C.M.
AU - Iqbal, Farkhund
AU - Cheung, Kwok Wai
N1 - Funding Information:
Manuscript received February 23, 2017; revised July 29, 2017 and October 2, 2017; accepted October 8, 2017. Date of publication November 21, 2017; date of current version December 14, 2018. This work was supported in part by NSERC Discovery under Grant 356065-2013, in part by the Canada Research Chairs Program under Grant 950-230623, and in part by the Research Incentive through Zayed University, Abu Dhabi, UAE, under Grant RIF13059. This paper was recommended by Associate Editor M. Last. (Corresponding author: Steven H. H. Ding.) S. H. H. Ding is with the School of Information Studies, McGill University, Montreal, QC H3A 1X1, Canada (e-mail: [email protected]).
PY - 2019/1
Y1 - 2019/1
N2 - Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author's identity and sociolinguistic characteristics based on the writing styles reflected in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, and political socialization. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification, and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.
AB - Authorship analysis (AA) is the study of unveiling the hidden properties of authors from textual data. It extracts an author's identity and sociolinguistic characteristics based on the writing styles reflected in the text. The process is essential for various areas, such as cybercrime investigation, psycholinguistics, and political socialization. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for AA. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization, authorship identification, and authorship verification with the Twitter, blog, review, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the static stylometrics, dynamic n-grams, latent Dirichlet allocation, latent semantic analysis, distributed memory model of paragraph vectors, distributed bag of words version of paragraph vector, word2vec representations, and other baselines.
KW - Authorship analysis (AA)
KW - computational linguistics
KW - representation learning
KW - text mining
UR - http://www.scopus.com/inward/record.url?scp=85035745911&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2017.2766189
DO - 10.1109/TCYB.2017.2766189
M3 - Journal article
C2 - 29990260
AN - SCOPUS:85035745911
SN - 2168-2267
VL - 49
SP - 107
EP - 121
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 1
M1 - 8116753
ER -