Research on Translation Style in Machine Learning Based on Linguistic Quantitative Characteristics Perception

Research on linguistic quantitative characteristics (LQCs) based on corpus and metrological linguistic methods has gained wide attention in the study of human and online machine translations. Although the support vector machine (SVM) is one of the most widely used machine learning (ML) algorithms in text analysis, it has rarely been applied to the study of translation style. This study uses ML to compare the styles of three Chinese translations of Pride and Prejudice on the basis of different linguistic measurement features. First, the linguistic measurement features of the three translations are obtained with the information gain algorithm. Specifically, the corpus can be achieved through human–computer interaction (HCI), i.e., computers can look, hear, touch, smell, taste, and speak using sensors such as cameras and mathematical algorithms. Then a text classifier, i.e., an SVM, is constructed on the basis of these features to automatically classify the translated texts of the three translations. Finally, the validity of the classifier is verified by tenfold cross-validation. The results show that the SVM algorithm has high classification accuracy and a strong predictive function, which is helpful for judging or predicting the style of a translation or translator. Compared with traditional methods, this SVM-based classification method saves time and effort, the process can be repeated, and the result is accurate and reliable.


Introduction
It is possible to make a contrastive analysis of target texts or translations by using the structural features of language measurement. However, such studies have not yet been reported. Machine translation, including online translation, and human translation are bound to produce different language styles and linguistic features in terms of language expression and syntactic structure. These differences are also known as "stylistic differences" and are caused by differences in the frequency with which language units are used. (1,2) The distribution frequency of language units can therefore be used as an objective basis for analyzing the language of a translation. As with the linguistic style of literary works, these differences can be expressed by statistical features of metrological linguistics, which holds that the differences in the language styles of translations are caused by differences in the frequency of the language units used. (3,4) In our opinion, since translation is a kind of re-creation and a translation is also a kind of literary work, different linguistic styles and linguistic features will appear in different translations. On this basis, we compare the stylistic features and translation styles of different translators to identify the translators of different translations and assess the quality of human translations.
Web-based online machine translation, an example of machine translation, is derived from corpus-based machine translation as shown in Fig. 1. (5,6) There are now online translation tools with high translation quality, and popular tools include Baidu, Google, and Youdao. Online machine translation has become the first choice of many machine translation system developers. In online machine translation, a large number of bilingual web pages and documents are collected, whole sentences are taken as the unit, and a statistical algorithm is used to carry out multiple fuzzy matching of the original text, with grammar rules combined to optimize, correct, and disambiguate the translation results. A support vector machine (SVM) is one of the most widely used machine learning (ML) algorithms in text analysis. (7) In this paper, the SVM algorithm is used to model the different features and verify the validity of the model. Also, the model has a predictive function, which can be used for the automatic classification and recognition of texts.
Specifically, in this study, we use the information gain algorithm to find the language differences between the translations, use an SVM to classify the texts of the three translators, apply tenfold cross-validation to verify the validity of the classifier, and verify the regularity and stability of the different language features. Finally, by combining our results with the statistical data obtained by the corpus method, we qualitatively discuss the formal linguistic characteristics of the different translations.

Related Works
The different linguistic quantitative characteristics (LQCs) (as shown in Fig. 2) that authors form in their language expression can be used quantitatively as statistical features to analyze language style. (8) In other words, language style arises from differences in the frequency of use of language units, and the distribution frequency of language units is the material basis for analyzing an author's language. The statistical analysis of texts using a variety of linguistic structural features has been going on since 1851. In the past 20 years, statistical methods of metrological linguistics have been widely used to compare language style characteristics, date texts, determine the author, and identify the author's style. (9) On the basis of the statistics of language measurement characteristics in different authors' languages, the consistent or differentiating characteristics of language style can be obtained, and the distribution data of the language structure are converted to measurement characteristics that reflect the author's language style. Conversely, if data on the linguistic structure of an unfamiliar text can be obtained, it is possible to determine the author of the text from the data.
In recent years, scholars have begun to study text translation using linguistic measurement characteristics and have compared the styles of different translations on the basis of linguistic form parameters. The research field of metrological stylistics has broadened from the initial study of works and authors' styles to the study of translation. The research methods include simple frequency statistics, various statistical programs, and the application of ML algorithms. Meng combined corpus translation studies with the theory and methods of computational linguistics and applied various domestic and foreign large-scale corpora. The reference translated texts from Meng's work were collected, processed, sorted, and measured to establish a reference model to evaluate the quality of translation products quantitatively and objectively. With the aid of software, standardization and automation of the evaluation process can be realized, and finally five Chinese translations with 200,000 words were selected as the evaluation model to explain the results of the evaluation. (10) Robert used Delta software to cluster texts and found that texts of the same author could be clustered together but texts of the same translator could not. This study confirmed the invisibility of the translators. (11) Defrancq and Verliefde investigated stylistic variation and idiomaticity in translation utilizing reduction and clustering, and proposed two possible general laws in translation: stylistic transformation and enhancement of idiomaticity in the target language. (12) Marasek et al. used an ML method to study the literary style of translators and conducted translator recognition experiments. (13) Rybicki and Heydel used cluster analysis to accurately locate the point at which one translator took over from the other in a Polish translation of Night and Day produced by two co-translators. (14) Dottori et al. compared the linguistic measurement characteristics of manual translation and online machine translation.
To some extent, the study of translation employing measurement is both a requirement and a product of the era of big data. (15) Taking dependency grammar as the theoretical framework and using the corpus research method, Ali et al. made a quantitative comparative analysis of the inaugural speeches of George Washington (the first president of the United States) and Donald Trump in terms of the distribution of the dependency distance and the lexical category composition of subjects, objects, attributives, and adverbials. (16) However, the study of translation style considering both linguistic measurement features and SVMs has not received much attention. In this paper, different translations of Pride and Prejudice are used as textual data, and the information gain algorithm, an SVM text classifier, and tenfold cross-validation are used to study the application of ML algorithms to analyzing translation style.

Corpus
The specific process of seeking measurement features that reflect the different styles of authors is to select corpus samples from two authors, segment them into words, calculate the frequency and percentage of specific language structures in the text, and compare the distributions of these language structures in the two samples on the basis of the mean value of their frequency. The authors' corpus samples are selected to calculate the correlation between them, and statistical corpus samples are used to test the effectiveness of the measurement characteristics in distinguishing different authors' language styles. Because the influence of external factors on language is difficult to analyze qualitatively or quantitatively, corpora with a similar language environment tend to be chosen. In this study, our corpus consists of the English version of Pride and Prejudice published by Oxford University Press in 1970 and the Chinese versions translated by Keyi Wang (published in 1980), Zhili Sun (published in 1985), and Baidu Online (2015). The three Chinese versions are aligned in parallel at the sentence level, and the Chinese texts are word-segmented and annotated with parts of speech. The corpus characteristics are shown in Table 1.

Linguistic quantitative characteristics
The selected objects are the linguistic structural features at the lexical level and the sentence level. The measurement information of vocabulary is easy to obtain, and the measurement of vocabulary has always been one of the hot topics in metrological linguistics. Although word frequency is still the basis of such studies, content words, part-of-speech tags, word position, word length, word order, single-occurrence words (hapaxes), and n-gram features have also entered the field of vision of domestic and international metrological linguistics research. Language structures that represent the length of language units, the richness of vocabulary, the parts of speech, and the use of sentence patterns were selected as the objects of examination.
The measurement features of Chinese proposed in the literature for text clustering include 12 categories: word length, sentence length, type–token ratio, adverb ratio, noun ratio, pronoun ratio, auxiliary ratio, punctuation ratio, declarative sentence ratio (period ratio), interrogative sentence ratio (question mark ratio), exclamatory sentence ratio, and single-occurrence words (hapaxes). To make a multidimensional and more comprehensive evaluation, we have expanded these 12 linguistic metrological features and established a translation quality evaluation index system consisting of 23 linguistic metrological features. The detailed LQCs of the three translations are presented in Table 2.
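As an illustration of how a few of the lexical and sentence-level features listed above might be computed from a word-segmented text, a minimal Python sketch is given below. It is not the program used in this study: the function name `lexical_features`, the punctuation set, and the choice of denominators (words for the type–token and hapax ratios, punctuation marks for the period, question mark, and semicolon ratios) are assumptions made for the example.

```python
from collections import Counter

# Assumed punctuation inventory for segmented Chinese text (illustrative only).
PUNCTUATION = {"，", "。", "？", "！", "；", "：", "、"}
SENTENCE_END = {"。", "？", "！"}


def lexical_features(tokens):
    """Compute a few LQCs from a list of word-segmented tokens."""
    words = [t for t in tokens if t not in PUNCTUATION]
    n_sentences = sum(1 for t in tokens if t in SENTENCE_END)
    n_punct = len(tokens) - len(words)
    if not words or n_sentences == 0:
        raise ValueError("text must contain words and at least one sentence")
    counts = Counter(words)
    return {
        # mean number of characters per word
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # mean number of words per sentence
        "avg_sentence_length": len(words) / n_sentences,
        # distinct word types divided by word tokens
        "type_token_ratio": len(counts) / len(words),
        # words occurring exactly once, divided by word tokens
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / len(words),
        # share of each mark among all punctuation marks (assumed definition)
        "period_ratio": tokens.count("。") / n_punct,
        "question_ratio": tokens.count("？") / n_punct,
        "semicolon_ratio": tokens.count("；") / n_punct,
    }


# Toy input: two short segmented sentences.
toy_tokens = ["他", "来", "了", "。", "他", "走", "了", "吗", "？"]
feats = lexical_features(toy_tokens)
```

Part-of-speech ratios such as the noun or pronoun ratio would additionally require the part-of-speech annotation of the corpus.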

Model of Translation Style Evaluation
On the basis of the above linguistic measurement features, we conducted text classification experiments with the SVM, verified the effectiveness of the classifier, and finally determined the differences in the regular language forms between the three translations. A block diagram of the system consisting of the text collection, LQC, SVM, verification model, and application model is depicted in Fig. 3.
The purpose of using this translation quality assessment model is to test whether and to what extent the 23 linguistic measurement features of the translated text conform to the standards of the established model. The formula of the model is as follows.
$$S = \sum_{i=1}^{23} 0.0438\, C_i \qquad (1)$$

The score S of each test text is the weighted sum of the differences of the 23 linguistic measurement features, where C_i represents the difference between the i-th linguistic measurement feature and the standard, 0.0438 represents the weight of each linguistic measurement feature in the model, and d_i represents the median value of each measurement feature. The calculation of the difference is divided into three cases. If the linguistic measurement feature meets the standard, i.e., falls within its parameter interval, the difference is 0; if it is greater than the maximum value of the parameter interval or less than the minimum value of the parameter interval, the difference is calculated separately for each of these two cases. To simplify the operation and to facilitate the generalization of this evaluation model, we implemented the above evaluation formula as a computer program that automatically calculates the difference for each measurement feature and the final total difference.

Feature extraction
Using an information gain method to select effective style features can reduce the time and complexity of the experiment and improve accuracy. (17) To obtain an effective classification feature set, a feature selection method based on information gain is adopted. In this selection method, the criterion used to measure the importance of a feature is the amount of information that it can provide to the classification system. The more information provided, the more important the feature is. The amount of information can be calculated using the information entropy, which is defined as

$$H(C) = -\sum_{i=1}^{N} P(C_i)\log_2 P(C_i). \qquad (2)$$

For the text translation system, the category C is a variable. If C has N possible values, the set of values of C is C_1, C_2, ..., C_N. Suppose the probability of occurrence of each category is P(C_1), P(C_2), ..., P(C_N); then the entropy of the text translation system can be expressed as Eq. (2). The information gain is specific to each characteristic. For a characteristic t, the information of the system is calculated both with and without the characteristic, and the difference between the two is the information provided by characteristic t for the system, i.e., the gain. When the system has characteristic t, its information quantity is calculated using Eq. (2), which represents the information quantity when the system has all the characteristics. If characteristic t is fixed, i.e., it is known whether t is present (t) or absent ($\bar{t}$) in a text, then the conditional entropy of the classification system is

$$H(C \mid T) = P(t)\,H(C \mid t) + P(\bar{t})\,H(C \mid \bar{t}) = -P(t)\sum_{i=1}^{N} P(C_i \mid t)\log_2 P(C_i \mid t) - P(\bar{t})\sum_{i=1}^{N} P(C_i \mid \bar{t})\log_2 P(C_i \mid \bar{t}).$$

The difference between the entropy of the system with and without characteristic t fixed is the information gain of characteristic t, and the information gain

$$IG(T) = H(C) - H(C \mid T)$$

can be obtained.
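The calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: it assumes a binary characteristic (present or absent in each text sample), whereas ratio-valued features would first have to be discretized, and the names `entropy` and `information_gain` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy H(C) of a list of category labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """IG(T) = H(C) - H(C|T) for a binary characteristic t.

    feature_present[i] is True if characteristic t occurs in sample i.
    """
    n = len(labels)
    with_t = [y for f, y in zip(feature_present, labels) if f]
    without_t = [y for f, y in zip(feature_present, labels) if not f]
    # H(C|T) = P(t) H(C|t) + P(not t) H(C|not t); empty parts contribute 0.
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_t, without_t) if part
    )
    return entropy(labels) - conditional


# Toy example: four text samples from two translators.
labels = ["Wang", "Wang", "Sun", "Sun"]
ig_informative = information_gain([True, True, False, False], labels)
ig_uninformative = information_gain([True, False, True, False], labels)
```

In this toy example, the first characteristic separates the two translators completely and therefore recovers the full entropy of the system (1 bit), whereas the second is distributed identically in both classes and provides no information.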

Accuracy evaluation of classification algorithm
After obtaining a valid classification feature set, we use an SVM to classify the translation samples of the three translators. The SVM is an algorithm based on structural risk minimization. In its basic form, it assumes that the samples are linearly separable, so that an optimal classification surface exists, and its goal is to maximize the classification interval between the samples of the two classes. Its classification on the two-dimensional plane is shown in Fig. 4. The solid line in the figure is the classification plane, and the points marked by circles are the support vectors. The classification boundaries are obtained by translating the classification plane toward the sample points of each class until the first data point is encountered; they are shown as the two dashed lines in Fig. 4. The distance between the classification boundaries of the two classes is the classification interval. For N-dimensional classification problems, the classification hyperplane can be expressed as

$$w \cdot x + b = 0.$$

The hyperplane divides the data into two parts:

$$w \cdot x_i + b \ge 1 \ \text{ for } y_i = 1, \qquad w \cdot x_i + b \le -1 \ \text{ for } y_i = -1.$$

The classification interval is $2/\|w\|$. Since maximizing the classification interval is equivalent to minimizing its reciprocal, the fundamental problem is an optimization problem. The reciprocal of the classification interval is $\tfrac{1}{2}\|w\|$, and minimizing $\tfrac{1}{2}\|w\|$ is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$, so the optimization model can be written as

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1, \dots, n.$$

Here, n is the number of samples and y_i is the category of data point x_i, with values of 1 and −1. The constraint condition is that the functional margin $y_i\,(w \cdot x_i + b)$ of each data point (x_i, y_i) with respect to the classification plane is greater than or equal to 1. This is a quadratic optimization problem, which can be solved by the Lagrangian function. Its Lagrangian equation is

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0.$$

By setting the partial derivatives of the Lagrangian with respect to w and b to zero, we obtain

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Then, according to duality theory, the dual problem can be obtained as

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Note that this problem is still a constrained optimization problem.
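To make the roles of the constraint, the support vectors, and the classification interval concrete, the following Python sketch checks a given hyperplane against a toy two-dimensional data set. It does not train an SVM; the hyperplane (w, b), the data points, and the function name `functional_margins` are hypothetical values chosen for illustration, with class labels 1 and −1 as in the text.

```python
import math


def functional_margins(w, b, points, labels):
    """Return y_i * (w . x_i + b) for every labelled data point."""
    return [
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        for x, y in zip(points, labels)
    ]


# Hypothetical hyperplane and toy data (not taken from the study).
w, b = (0.5, 0.5), -1.0
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margins = functional_margins(w, b, points, labels)

# The constraint requires every functional margin to be at least 1.
feasible = all(m >= 1 for m in margins)

# Points whose functional margin equals 1 lie on the classification boundaries.
support_vectors = [x for x, m in zip(points, margins) if math.isclose(m, 1.0)]

# Classification interval: distance between the two boundaries, 2 / ||w||.
interval = 2 / math.sqrt(sum(wi * wi for wi in w))
```

For these values, the points (2, 2) and (0, 0) lie exactly on the two classification boundaries, and the classification interval equals the distance between them.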

Tenfold cross-validation
Since the SVM is a classical classification algorithm, many tools are available to implement it. We used the SVM algorithm in WEKA to classify the samples and the tenfold cross-validation method to evaluate the model accuracy. Cross-validation is an effective method of evaluating classification accuracy. In ML, a data set A is usually divided into a training set B and a test set C. When the sample size is too small for a single split to give a reliable test, the data set can be used more fully as follows: A is randomly divided into k packages, one of which is used as the test set C and the remaining k − 1 of which are used as the training set B, and this is repeated until each package has served once as the test set. This procedure is k-fold cross-validation. Here, k is generally set to 10, i.e., tenfold cross-validation. A test result is obtained for each training session, and the results of the ten training sessions are averaged to obtain the accuracy of the classifier on the data.
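The k-fold procedure can be sketched in Python as follows. This is an illustrative sketch rather than the WEKA workflow used in the study: to keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM, and all function names, the random seed, and the toy data are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly divide the indices 0..n-1 into k packages (folds)."""
    if n < k:
        raise ValueError("need at least k samples")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k, train, predict, seed=0):
    """Average test accuracy over k folds, each used once as the test set."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_centroids(X, y):
    """Stand-in classifier (not an SVM): mean feature vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    return min(
        model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(model[lab], x))
    )


# Toy data: one feature, two well-separated classes of ten samples each.
X = [(0.1 * i,) for i in range(10)] + [(10 + 0.1 * i,) for i in range(10)]
y = ["Wang"] * 10 + ["Sun"] * 10
folds = k_fold_indices(len(X), 10)
accuracy = cross_validate(X, y, 10, train_centroids, predict_centroid)
```

Because every sample is used for testing exactly once, the averaged accuracy makes fuller use of a small data set than a single split into training and test sets.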

Results and Discussion
The ML algorithm can successfully find differences in the language form among the three translations, and the above nine style parameters can effectively distinguish the three translations of Pride and Prejudice, which also indicates that the two human translators have stability and inertia in the performance of these characteristics. According to the degree of differentiation, these style parameters are ordered as follows: semicolon ratio, noun ratio, average sentence length, pronoun ratio, idiomatic ratio, conjunction ratio, function word density, function word density, and content word density. The two human translators have the greatest differences in the usage habits of semicolons: Sun does not use semicolons, whereas Wang does. The smallest difference was in the density of content words, with little difference between the two translators according to the distribution diagram. In general, the average sentence length and the ratios of pronouns, conjunctions, auxiliary words, and function words are larger in Wang's translation than in Sun's translation. However, Sun's translation quality is similar to Wang's in terms of nouns, idioms, and semicolons.

Content words and function words
Content words refer to parts of speech with stable lexical meaning, including nouns, verbs, adverbs, numerals, and quantifiers. Content word density reflects the amount of information carried by a text. As shown in Fig. 5, except for adjectives and pronouns, the proportion of content words is higher in the human translations than in the Baidu online translation. The proportions of content words are 4.18% and 4.88% higher in Wang's and Sun's translations than in the machine translation, respectively, indicating that the human translations convey more of the information of the source text. The lack of content words in the online translation indicates that original information may have been lost in translation.
Chinese is a paratactic language with a low degree of formalization. Owing to the lack of morphological changes, the formalization of Chinese is mainly manifested in the use of function words, whose main components include auxiliary words and conjunctions. The proportion of function words in the total number of words is called the "formalization degree", which can indicate the degree of sentence manifestation in the translated text. When Chinese native speakers translate English works, they are influenced by the conjunctive means of English, which makes the target language take on characteristics of the original language. However, the translations of different translators, restricted by the translation norms of different historical periods and influenced by different translation strategies, show the characteristics of the source language to different degrees. We found that the proportion of function words is higher in Wang's translation (10.85%) than in Zhang's translation (10.53%), that is, the formalization degree of Wang's translation is higher and its sentence structure is more explicit, which further indicates that Wang's translation is influenced by the sentence style of the English source. The proportion of content words in the Baidu translation was 14.99%. In contrast, the proportion of content words in Zhang's translation is relatively high and the formalization degree is slightly lower. The degree of sentence manifestation is weaker in Zhang's translation than in Wang's translation, and the translator's syntactic creativity is stronger owing to the influence of the source text. Moreover, among all the function words, the auxiliaries and conjunctions have the highest degree of manifestation in Wang's translation, which is the main factor causing the difference in formalization degree.
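The proportions discussed in this subsection can be computed directly from a part-of-speech-annotated text. The Python sketch below is illustrative only: the single-letter tags (n, v, d, m, q for the content-word classes named above; u and c for auxiliaries and conjunctions) are an assumed tag set and may differ from the annotation scheme of the corpus, and the function name `pos_densities` is hypothetical.

```python
# Assumed tag sets; adjust them to the annotation scheme actually used.
CONTENT_TAGS = {"n", "v", "d", "m", "q"}  # noun, verb, adverb, numeral, quantifier
FUNCTION_TAGS = {"u", "c"}                # auxiliary word, conjunction


def pos_densities(tagged_tokens):
    """Return the content and function word proportions of a tagged text.

    tagged_tokens is a list of (word, tag) pairs with punctuation removed.
    """
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("empty text")
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    function = sum(1 for _, tag in tagged_tokens if tag in FUNCTION_TAGS)
    return {
        "content_word_density": content / total,
        # proportion of function words in the total number of words
        "formalization_degree": function / total,
    }


# Toy example: "他 的 书 和 笔" (pronoun, auxiliary, noun, conjunction, noun).
toy = [("他", "r"), ("的", "u"), ("书", "n"), ("和", "c"), ("笔", "n")]
densities = pos_densities(toy)
```

With a broader definition of function words (for example, one that also includes prepositions), only the tag sets need to be changed.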

Nouns and pronouns
We compared the distribution frequencies of nouns and pronouns in the two human translations with those of original (non-translated) Chinese texts and of translated Chinese texts, and found that the two human translations show major differences from original Chinese, as shown in Fig. 6. In general, the noun frequency of both translations is lower than that of original Chinese, while the pronoun frequency is higher than that of original Chinese. The difference between Wang's translation and original Chinese is about 5 percentage points, whereas the difference between Zhang's translation and original Chinese is about 2 percentage points. Therefore, Wang's translation deviates more from original Chinese. Compared with the average level of Chinese translated literature, the difference between Wang's translation and Zhang's translation is about 3 percentage points. Therefore, in terms of the overall distribution of nouns and pronouns, Zhang's translation is closer to the linguistic norms of Chinese. It is interesting to note that the total frequency of nouns and pronouns is close to 31%. The frequencies of nouns and pronouns in a text have an inverse relationship, namely, the use of a large number of pronouns decreases the number of nouns, and vice versa. It can be inferred that owing to the transfer of pronouns from the original English text, Wang's translation has a strong tendency toward "pronoun manifestation", which reduces the frequency of nouns in the text. Zhang's translation also shows pronoun manifestation, but it is not as obvious as in Wang's translation.

Conclusions
In this study, an ML algorithm is applied to the comparison of translated text styles, and the measurement characteristics of the text language are used to obtain reliable results. The main contribution lies in the research methods and ideas. We provide a new methodological perspective for translation studies in the era of big data, which can help make translation studies more scientific and objective. In this paper, popular ML algorithms used in computational studies, namely, the information gain algorithm and the SVM, are used to effectively discover the differences in language form parameters between translations, and the regularity and stability of these differences are confirmed by cross-validation. The main advantages of this method are as follows: it saves time and effort in extracting the features of large texts and in calculating the discrimination degree of language parameters; it can sort features according to their discrimination degree; the experimental process can be repeated; an established model can be verified; and it has a good predictive function. The research results are in good agreement with the subjective impression, qualitative analysis, and statistical analysis of the texts, showing that the method offers a new approach to the classification of translations and the comparative study of translator styles. In the future, the corpus will be extended, more detailed hierarchical labeling of the corpus will be carried out, and more in-depth studies will be conducted on the stylistic differences of translations in terms of vocabulary, syntax, and discourse.