4 The research design
The tagged words of the SWECCL
Tagging is a method of making human language processable by machine. The text that humans can easily read is raw text, which is not suitable for computer analysis. The most common tagging format is the “word_tag” format. Here we adopt the CLAWS 4 Tagging Collection to tag the raw text. After tagging, the tagged text can be processed by Colligator, a program developed by Professor Liang Maocheng of Beijing Foreign Studies University (Liang Maocheng: 2008), which analyzes text that has been tagged in advance.
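To illustrate the “word_tag” format, the following minimal sketch splits a tagged sentence into word and tag pairs. The sample sentence and its tags are illustrative assumptions that follow the CLAWS C7 tagset (for example, TO for the infinitive marker); the tags actually present in the SWECCL files may differ. The function name split_tagged is hypothetical and is not part of Colligator.

```python
def split_tagged(text):
    """Split 'word_tag' text into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore, so that a token such as '._.'
        # yields the word '.' and the tag '.'.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # Token carries no tag; keep the word and leave the tag empty.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# Assumed C7-style tags, for illustration only.
sample = "She_PPHS1 wants_VVZ to_TO leave_VVI ._."
for word, tag in split_tagged(sample):
    print(word, tag)
```

Once the text is in this form, the tag of the token before each infinitive particle can be inspected directly.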
The regular expression
In order to retrieve as many qualified results as possible from the corpus, we should look for items that cover a wide variety of the conditions relevant to our research aim.
To do that, we should first draw up a list of all the colligation items we are looking for. Here we mainly focus on 7 different colligations of the infinitive particle and what precedes it. They are listed as follows:
infinitive as subject preceded by a stop period mark (IAS)
infinitive as direct object preceded by verb (VI)
infinitive preceded by noun (NI)
infinitive preceded by adjective (AI)
infinitive preceded by present “ING” participle (INGI)
infinitive preceded by past “ED” participle (EDI)
infinitive preceded by adverb (ADI)
This research chooses these seven colligation items because the element preceding the TO infinitive comes from the major lexical categories in linguistics: noun, verb, adjective and adverb. These categories make up the majority of the English vocabulary, so the range of possible combinations with the infinitive is expected to be the most varied. For convenience in referring to the items above, this article will use the abbreviations in the parentheses.
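The seven items can be sketched as a simple classification of the tag that precedes the infinitive particle. The sketch below is an illustration under stated assumptions, not the procedure used by Colligator: it assumes C7-style tags (TO for the infinitive particle, tags beginning with V, N, J and R for verbs, nouns, adjectives and adverbs, and “.” for the full stop), and it checks the two participle categories before the general verb category so that INGI and EDI are not absorbed into VI. The names classify and count_colligations are hypothetical.

```python
from collections import Counter


def classify(prev_tag):
    """Map the tag before the infinitive particle to one of the seven labels."""
    if prev_tag == ".":
        return "IAS"   # infinitive follows a full stop
    if prev_tag.startswith("V") and prev_tag[2:3] == "G":
        return "INGI"  # e.g. VVG, VBG (assumed C7 tags)
    if prev_tag.startswith("V") and prev_tag[2:3] == "N":
        return "EDI"   # e.g. VVN, VBN (assumed C7 tags)
    if prev_tag.startswith("V"):
        return "VI"
    if prev_tag.startswith("N"):
        return "NI"
    if prev_tag.startswith("J"):
        return "AI"
    if prev_tag.startswith("R"):
        return "ADI"
    return None        # preceding tag is outside the seven items


def count_colligations(tagged_text):
    """Count the seven colligations in 'word_tag' text."""
    tags = [token.rpartition("_")[2] for token in tagged_text.split()]
    counts = Counter()
    # Walk over adjacent tag pairs: (preceding tag, current tag).
    for prev_tag, tag in zip(tags, tags[1:]):
        if tag == "TO":
            label = classify(prev_tag)
            if label is not None:
                counts[label] += 1
    return counts


# Invented sample with assumed C7-style tags.
sample = ("I_PPIS1 want_VV0 to_TO go_VVI ._. To_TO err_VVI is_VBZ human_JJ ._. "
          "He_PPHS1 is_VBZ happy_JJ to_TO help_VVI ._.")
print(count_colligations(sample))
```

Note that in this sketch an infinitive at the very beginning of a text has no preceding token and is therefore not counted as IAS.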
In addition, we have to use regular expressions to retrieve the targeted information from the corpus. “The regular expression is a kind of special character string which is applied to describing and matching string with same or similar property.” (Jurafsky & Martin: 2009) Following the tagged format and the rules for forming such expressions (Liang Maocheng: 2009), we can turn some of the grammatical devices of human language into regular expressions that the machine can interpret. The 7 abbreviations mentioned above can each be rewritten as a regular expression. They are listed as follows (for more details, please check the appendix):