This is a program for morphological analysis (Russian, German, and English).
It is distributed under the GNU Library General Public License, the text of
which is in the file COPYING.
The program was written by Andrey Putrin and Alexey Sokirko.
The project started in Moscow at the Dialing company (Russian and English).
The German part was created at the Berlin-Brandenburg Academy of Sciences
and Humanities in Berlin (the DWDS project).
The Russian lexicon is based upon Zaliznyak's Dictionary.
The German lexicon is based upon the Morphy system (http://www-psycho.uni-paderborn.de/lezius/).
The English lexicon is based upon WordNet.

The project uses the regular expression library PCRE (Perl Compatible
Regular Expressions). Compilation has been tested only with PCRE version 6.4;
other versions have not been tested. Download this version from the official
site and install it to the default location. If you do not want to install it,
or you do not have sufficient rights to do so, set two environment variables:
1. RML_PCRE_LIB, which points to the PCRE library directory, where
   libpcre.a and libpcrecpp.a should be located, for example:
       export RML_PCRE_LIB=~/RML/contrib/pcre-6.4/.libs
2. RML_PCRE_INCLUDE, which points to the PCRE include directory,
   where "pcrecpp.h" is located, for example:
       export RML_PCRE_INCLUDE=~/RML/contrib/pcre-6.4

The system was developed under Windows 2000 (MS VS 6.0), but it
has also been compiled and run under Linux (GCC). It should work with
minor changes on other systems.
Website of DDC: www.aot.ru, https://sf.net/projects/morph-lexicon/
I compiled all sources with gcc 3.2. Lower versions are not supported.

Contents of this source archive
1. The main morphological library (Source/LemmatizerLib).
2. Library for grammatical codes (Source/AgramtabLib).
3. Test morphological program (Source/TestLem).
4. Library for working with the text version of the dictionaries (Source/MorphWizardLib).
5. Generator of the morphological prediction base (Source/GenPredIdx).
6. Generator of the binary format of the dictionaries (Source/MorphGen).

=================================================
====== Installation =====
=================================================

Unpacking
* Create a directory and set an environment variable RML that points
  to this directory:
      mkdir /home/sokirko/RML
      export RML=/home/sokirko/RML
* Put "lemmatizer.tar.gz" and "???-src-morph.tar.gz" into this directory,
  where "???" can be "rus", "ger" or "eng", according to what you have
  downloaded. Unpack them:
      tar xfz lemmatizer.tar.gz
      tar xfz ???-src-morph.tar.gz

Compiling morphology
0. If PCRE is not installed in the default location, do not forget to set
   RML_PCRE_LIB and RML_PCRE_INCLUDE (see above).
1. cd $RML
2. ./compile_morph.sh
This step should create all libraries and the test program $RML/Bin/TestLem.

Building Morphological Dictionary
1. cd $RML
2. ./generate_morph_bin.sh <lang>
where <lang> can be Russian or German, according to the dictionary
you have downloaded.
The script should terminate with the message "Everything is OK".
You can then test the morphology:
    $RML/Bin/TestLem <lang>
If something goes wrong, write to me at sokirko@yandex.ru.

======================================================
========== MRD-file ============
======================================================

This section describes the format of an mrd-file. An mrd-file is a text
file that contains one morphological dictionary for one natural language.
MRD is an abbreviation of "morphological dictionary".
The usual location of this file is
    $RML/Dicts/SrcMorph/xxxSrc/morphs.mrd,
where xxx can be "Eng", "Rus" or "Ger", depending on the language.
The encoding of the file also depends on the language:
* Russian - Windows 1251
* German - Windows 1252
* English - ASCII
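
The path and encoding rules above can be sketched as a small lookup. This is
only an illustration in Python, not part of the distribution: the names
MRD_INFO, mrd_location and read_mrd_lines are hypothetical, and "cp1251",
"cp1252" and "ascii" are the Python codec names for the encodings listed above.

```python
import os

# Language -> (directory prefix, Python codec name), following the list above.
MRD_INFO = {
    "Russian": ("Rus", "cp1251"),   # Windows 1251
    "German": ("Ger", "cp1252"),    # Windows 1252
    "English": ("Eng", "ascii"),
}


def mrd_location(language, rml_root):
    """Return (path, encoding) of morphs.mrd for the given language."""
    prefix, encoding = MRD_INFO[language]
    path = os.path.join(rml_root, "Dicts", "SrcMorph", prefix + "Src", "morphs.mrd")
    return path, encoding


def read_mrd_lines(language, rml_root):
    """Read the whole mrd-file as a list of lines without line terminators."""
    path, encoding = mrd_location(language, rml_root)
    with open(path, encoding=encoding) as f:
        return f.read().splitlines()
```

A typical call would be read_mrd_lines("Russian", os.environ["RML"]).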

Gramtab-files
An mrd-file refers to a gramtab-file, which is language-dependent and
contains all possible full morphological patterns for the words. One line
in a gramtab-file looks as follows:
    <ancode> <unused_number> <part_of_speech> <grammems>
An ancode is an ID that consists of two letters and uniquely
identifies a morphological pattern. A morphological pattern consists of
<part_of_speech> and <grammems>. For example, here is a line from the English
gramtab:
    te 1 VBE prsa,pl
Here "te" is an ancode, "VBE" is a part of speech, "prsa,pl" are grammems,
and "1" is the obsolete unused number.
In mrd-files we use ancodes to refer to morphological patterns.
Here is the list of all gramtab-files:
* Russian - $RML/Dicts/Morph/rgramtab.tab
* German - $RML/Dicts/Morph/ggramtab.tab
* English - $RML/Dicts/Morph/egramtab.tab
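
As an illustration of the line format above, here is a minimal Python sketch
that splits one gramtab line into its fields. The names GramtabEntry and
parse_gramtab_line are hypothetical, and treating the grammems field as
optional is an assumption, not something stated in this document.

```python
from typing import NamedTuple, Tuple


class GramtabEntry(NamedTuple):
    ancode: str
    part_of_speech: str
    grammems: Tuple[str, ...]


def parse_gramtab_line(line):
    """Parse '<ancode> <unused_number> <part_of_speech> <grammems>'."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("not a gramtab line: %r" % line)
    ancode, _unused_number, part_of_speech = fields[:3]
    # Assumption: a pattern may carry no grammems, so the 4th field is optional.
    grammems = tuple(fields[3].split(",")) if len(fields) > 3 else ()
    return GramtabEntry(ancode, part_of_speech, grammems)


entry = parse_gramtab_line("te 1 VBE prsa,pl")
print(entry.ancode, entry.part_of_speech, entry.grammems)
```

A dictionary keyed by entry.ancode is then enough to resolve the ancodes
that appear in an mrd-file.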

Common information
All words in an mrd-file are written in uppercase.
An mrd-file consists of the following sections:
1. Section of flexion and prefix models;
2. Section of accentual models;
3. Section of user sessions;
4. Section of prefix sets;
5. Section of lemmas.
Each section is a set of records, one per line. The number of records
in the section is written at the very beginning of the section on
a separate line. For example, here is a possible variant
of the section of user sessions:
    1
    alex;17:10, 13 October 2003;17:12, 13 October 2003
"1" means that this section contains only one record, which is written
on the next line; thus this section consists of only two lines.
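
The count-then-records layout described above can be read with a few lines
of code. The following Python sketch is illustrative only; the function name
read_section is hypothetical, and it assumes the file has already been split
into a list of lines.

```python
def read_section(lines, start):
    """Read one section that begins at lines[start].

    Returns (records, next_start), where next_start is the index of the
    first line after the section, i.e. the count line of the next section.
    """
    count = int(lines[start])          # the first line holds the record count
    first = start + 1
    records = lines[first:first + count]
    if len(records) != count:
        raise ValueError(
            "section declares %d records, found %d" % (count, len(records)))
    return records, first + count


lines = ["1", "alex;17:10, 13 October 2003;17:12, 13 October 2003"]
records, next_start = read_section(lines, 0)
print(records, next_start)
```

Calling read_section five times in a row, each time starting at the returned
next_start, would yield the five sections in the order listed above.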

Section of possible flexion and prefix models
Each record of this section is a list of items. Each item
describes how one word form in a paradigm should be built. The whole list
describes the whole paradigm (a set of word forms with morphological patterns).
The format of one item is the following:
    %<flexion>*<ancode>
or
    %<flexion>*<ancode>*<prefix>
where
    <flexion> is a flexion (a string that is appended to the right of the word base)
    <prefix> is a prefix (a string that is added to the left of the word base)
    <ancode> is an ancode.
Let us consider an example of an English flexion and prefix model:
    %F*na%VES*nb
Here we have two items:
1. <flexion> = F; <ancode> = na
2. <flexion> = VES; <ancode> = nb
To decipher the ancodes we should consult the English gramtab-file.
There we can find the following lines (shown here without the unused number):
    na NOUN narr,sg
    nb NOUN narr,pl
If the base "lea" were assigned to this model, its paradigm
would be the following:
    leaf NOUN narr,sg
    leaves NOUN narr,pl
It is important that each word of a morphological dictionary
contains a reference to a line in this section.
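
To make the item format concrete, here is a minimal Python sketch that
parses a flexion and prefix model and builds the word forms for a base.
The function names are hypothetical; the base is written in uppercase
because words in an mrd-file are stored in uppercase.

```python
def parse_flexion_model(record):
    """Split a record such as '%F*na%VES*nb' into (flexion, ancode, prefix) items."""
    items = []
    for item in record.split("%")[1:]:   # the text before the first '%' is empty
        parts = item.split("*")
        flexion, ancode = parts[0], parts[1]
        prefix = parts[2] if len(parts) > 2 else ""   # the prefix is optional
        items.append((flexion, ancode, prefix))
    return items


def build_paradigm(base, model):
    """Return (word_form, ancode) pairs: prefix + base + flexion for each item."""
    return [(prefix + base + flexion, ancode) for flexion, ancode, prefix in model]


model = parse_flexion_model("%F*na%VES*nb")
paradigm = build_paradigm("LEA", model)
print(paradigm)
```

Looking up each ancode in the parsed gramtab then gives the part of speech
and grammems of every word form.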

Section of possible accentual models
Each record of this section is a comma-delimited list of numbers, where
each number is an index of the stressed vowel of a word form (counting
from the end). The whole list contains a position for each word
form in the paradigm.
If an item of an accentual model is equal to 255, then it
is undefined, which means that this word form is unstressed.
Each word in the dictionary should have a reference to
an accentual model, even if this model consists only of empty items.
For one word, the number and the order of items in the accentual model
should be equal to the number and the order of items in the flexion and
prefix model. For example, we can assign to the word "leaf" with the paradigm
    leaf NOUN narr,sg
    leaves NOUN narr,pl
the following accentual model:
    2,3
It produces the following accented paradigm:
    le'af NOUN narr,sg
    le'aves NOUN narr,pl
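
The following Python sketch reproduces the accented paradigm above. It rests
on assumptions that this document does not state explicitly: the number is
read as a 1-based count of vowels from the end of the word form, the stress
mark is placed after the stressed vowel, and the vowel set is "aeiou". This
is the reading under which "2,3" yields the forms shown; the actual library
may differ. All names are hypothetical.

```python
UNSTRESSED = 255
VOWELS = "aeiou"   # assumption: English vowels only, 'y' is not counted


def mark_stress(word_form, accent):
    """Insert "'" after the accent-th vowel, counting vowels from the end."""
    if accent == UNSTRESSED:
        return word_form                      # undefined item: no stress mark
    seen = 0
    for pos in range(len(word_form) - 1, -1, -1):
        if word_form[pos].lower() in VOWELS:
            seen += 1
            if seen == accent:
                return word_form[:pos + 1] + "'" + word_form[pos + 1:]
    raise ValueError("%r has fewer than %d vowels" % (word_form, accent))


def apply_accent_model(word_forms, record):
    """Apply a record such as '2,3' to the word forms of one paradigm."""
    accents = [int(item) for item in record.split(",")]
    if len(accents) != len(word_forms):
        raise ValueError("accentual model and paradigm differ in length")
    return [mark_stress(form, accent) for form, accent in zip(word_forms, accents)]


accented = apply_accent_model(["leaf", "leaves"], "2,3")
print(accented)
```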

Section of user sessions
This is a system section, which contains information about user edit
sessions.

Section of prefix sets
Each record of this section is a comma-delimited list of strings, where
each string is a prefix that can be attached to the whole word. If a prefix
set is assigned to a word, it means that the words with these prefixes
can also exist in the language. For example, if "leaf" has
the prefix set "anti,contra", this implies that the words "antileaf" and
"contraleaf" exist.
A flexion and prefix model can also contain a reference to a prefix,
but that prefix applies to one separate word form, while a prefix set
is assigned to the whole word paradigm.
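
A prefix set can be expanded mechanically. The Python sketch below is only
an illustration with hypothetical names; it attaches every prefix of the set
to every word form of the paradigm, following the statement above that a
prefix set applies to the whole paradigm.

```python
def parse_prefix_set(record):
    """Split a record such as 'anti,contra' into a list of prefixes."""
    return [prefix for prefix in record.split(",") if prefix]


def prefixed_variants(word_forms, prefixes):
    """Return every word form of the paradigm with every prefix attached."""
    return [prefix + form for prefix in prefixes for form in word_forms]


prefixes = parse_prefix_set("anti,contra")
variants = prefixed_variants(["leaf", "leaves"], prefixes)
print(variants)
```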

Section of lemmas
A record of this section is a space-separated tuple of the following format:
    <base> <flex_model_no> <accent_model_no> <session_no> <type_ancode> <prefix_set_no>
where
    <base> is a base (the constant part of a word in its paradigm)
    <flex_model_no> is an index of a flexion and prefix model
    <accent_model_no> is an index of an accentual model
    <session_no> is an index of the session in which a user last edited this word
    <type_ancode> is an ancode that is assigned to the whole word
        (intended: the common part of grammems in the paradigm),
        or "-" if it is undefined
    <prefix_set_no> is an index of a prefix set, or "-" if it is undefined
  175. (intended: the common part of grammems in the paradigm)
  176. "-" if it is undefined
  177. <prefix_set_no> is an index of a prefix set, or "-" if it is undefined