- This is a program for morphological analysis (Russian, German, and English).
- This program is distributed under the GNU Library General Public License, a copy of which is in the file
- COPYING.
- This program was written by Andrey Putrin and Alexey Sokirko.
- The project started in Moscow at the Dialing
- company (Russian and English parts). The German part was created
- at the Berlin-Brandenburg Academy of Sciences and Humanities in Berlin (the DWDS project).
- The Russian lexicon is based upon Zaliznyak's Dictionary.
- The German lexicon is based upon the Morphy system (http://www-psycho.uni-paderborn.de/lezius/).
- The English lexicon is based upon WordNet.
- The project uses the regular expression library "PCRE" (Perl Compatible Regular Expressions).
- Compilation has been tested only with PCRE version 6.4.
- Download this version from the official site and install it
- in the default location. If you do not want to install it, or you do not have sufficient
- rights to do so, create two environment variables:
- 1. RML_PCRE_LIB, which points to the PCRE library directory, where
- libpcre.a and libpcrecpp.a should be located, for example:
- export RML_PCRE_LIB=~/RML/contrib/pcre-6.4/.libs
- 2. RML_PCRE_INCLUDE, which points to the PCRE include directory,
- where "pcrecpp.h" is located, for example:
- export RML_PCRE_INCLUDE=~/RML/contrib/pcre-6.4
- The system was developed under Windows 2000 (MS VS 6.0), but
- has also been compiled and run under Linux (GCC). It should work with
- minor changes on other systems.
- Website of DDC: www.aot.ru, https://sf.net/projects/morph-lexicon/
- I compiled all sources with gcc 3.2. Earlier versions are not supported.
- Contents of this source archive
- 1. The main morphological library (Source/LemmatizerLib).
- 2. Library for grammatical codes (Source/AgramtabLib).
- 3. Test morphological program (Source/TestLem).
- 4. Library for working with text version of the dictionaries (Source/MorphWizardLib).
- 5. Generator of morphological prediction base (Source/GenPredIdx).
- 6. Generator of binary format of the dictionaries (Source/MorphGen).
- =================================================
- ====== Installation =====
- =================================================
- Unpacking
- * Create a directory and set an environment variable RML that points
- to this directory:
- mkdir /home/sokirko/RML
- export RML=/home/sokirko/RML
- * Put "lemmatizer.tar.gz" and "???-src-morph.tar.gz"
- into this directory; "???" can be "rus", "ger" or "eng",
- according to what you have downloaded. Unpack them:
- tar xfz lemmatizer.tar.gz
- tar xfz ???-src-morph.tar.gz
- Compiling morphology
- 0. If PCRE is not installed in the default location, do not forget to set RML_PCRE_LIB and RML_PCRE_INCLUDE (see above)
- 1. cd $RML
-
- 2. ./compile_morph.sh
- This step should create all libraries and a test program $RML/Bin/TestLem.
- Building Morphological Dictionary
- 1. cd $RML
- 2. ./generate_morph_bin.sh <lang>
- where <lang> can be Russian or German, according to the dictionary
- you have downloaded.
- The script should terminate with the message "Everything is OK".
- You can then test the morphology:
- $RML/Bin/TestLem <lang>
- If something goes wrong, write to me at sokirko@yandex.ru.
- ======================================================
- ========== MRD-file ============
- ======================================================
- This section describes the format of an mrd-file. An mrd-file is a text
- file which contains one morphological dictionary for one natural language.
- MRD is an abbreviation of "morphological dictionary".
- The usual place for this file is
- $RML/Dicts/SrcMorph/xxxSrc/morphs.mrd,
- where xxx can be "Eng", "Rus" or "Ger" depending on the language.
- The encoding of the file depends also upon the language:
- * Russian - Windows 1251
- * German - Windows 1252
- * English - ASCII
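The location and encoding rules above can be sketched in a few lines of Python. This is only an illustration and not part of the distributed sources: the helper names mrd_path and read_mrd_lines are hypothetical, and the codec names "cp1251", "cp1252" and "ascii" are the Python spellings of the encodings listed above.

```python
import os

# Encodings from the list above, written as Python codec names.
MRD_ENCODINGS = {"Rus": "cp1251", "Ger": "cp1252", "Eng": "ascii"}


def mrd_path(lang_prefix, rml_root=None):
    """Return the usual location of morphs.mrd for "Rus", "Ger" or "Eng"."""
    if lang_prefix not in MRD_ENCODINGS:
        raise ValueError("unknown language prefix: %r" % lang_prefix)
    if rml_root is None:
        # Fall back to the RML environment variable set during installation.
        rml_root = os.environ["RML"]
    return os.path.join(rml_root, "Dicts", "SrcMorph",
                        lang_prefix + "Src", "morphs.mrd")


def read_mrd_lines(lang_prefix, rml_root=None):
    """Read an mrd-file as a list of text lines in the language's encoding."""
    path = mrd_path(lang_prefix, rml_root)
    with open(path, encoding=MRD_ENCODINGS[lang_prefix]) as f:
        return f.read().splitlines()
```

Decoding with the wrong encoding would silently corrupt Russian or German letters, which is why the encoding is chosen from the language rather than guessed.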
- Gramtab-files
- An mrd-file refers to a gramtab-file, which is
- language-dependent and which contains all possible full morphological
- patterns for the words. One line in a gramtab-file looks as follows:
- <ancode> <unused_number> <part_of_speech> <grammems>
- An ancode is an ID, which consists of two letters and which uniquely
- identifies a morphological pattern. A morphological pattern consists of
- <part_of_speech> and <grammems>. For example, here is a line from the English
- gramtab:
- te 1 VBE prsa,pl
- Here "te" is an ancode, "VBE" is a part of speech, "prsa,pl" are grammems,
- "1" is the obsolete unused number.
- In mrd-files we use ancodes to refer to a morphological pattern.
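As an illustration, a gramtab line of the form shown above could be split into its fields as follows. This is a minimal sketch, not the code of the library itself: the names GramtabEntry and parse_gramtab_line are hypothetical, and it is an assumption that the grammems field may be absent on some lines.

```python
from collections import namedtuple

GramtabEntry = namedtuple("GramtabEntry", "ancode part_of_speech grammems")


def parse_gramtab_line(line):
    """Parse '<ancode> <unused_number> <part_of_speech> <grammems>'."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("not a gramtab line: %r" % line)
    ancode, _unused_number, part_of_speech = fields[:3]
    # Assumption: a line may omit the grammems; then the list is empty.
    grammems = fields[3].split(",") if len(fields) > 3 else []
    return GramtabEntry(ancode, part_of_speech, grammems)


entry = parse_gramtab_line("te 1 VBE prsa,pl")
print(entry.ancode, entry.part_of_speech, entry.grammems)
```

A dictionary keyed by ancode, built from such entries, is enough to decipher the ancodes used in an mrd-file.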
- Here is the list of all gramtab-files:
- * Russian - $RML/Dicts/Morph/rgramtab.tab
- * German - $RML/Dicts/Morph/ggramtab.tab
- * English - $RML/Dicts/Morph/egramtab.tab
- Common information
- All words in a mrd-file are written in uppercase.
- One mrd-file consists of the following sections:
- 1. Section of flexion and prefix models;
- 2. Section of accentual models;
- 3. Section of user sessions;
- 4. Section of prefix sets;
- 5. Section of lemmas.
- Each section is a set of records, one per line. The number of records
- in the section is written at the very beginning of the section on
- a separate line. For example, here is a possible variant
- of the section of user sessions:
- 1
- alex;17:10, 13 October 2003;17:12, 13 October 2003
- "1" means that this section contains only one record, which is written
- on the next line, thus this section contains only two lines.
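The count-prefixed layout described above can be read with a simple loop. The following sketch assumes that the five sections follow each other in the order listed earlier, with no extra lines between them. The section keys and the function name split_sections are hypothetical, chosen only for this illustration.

```python
# Section keys in the order in which the sections are listed above.
SECTION_NAMES = ["flexion_models", "accent_models", "sessions",
                 "prefix_sets", "lemmas"]


def split_sections(lines):
    """Split the lines of an mrd-file into its five sections."""
    sections = {}
    pos = 0
    for name in SECTION_NAMES:
        count = int(lines[pos].strip())       # record count on its own line
        records = lines[pos + 1:pos + 1 + count]
        if len(records) != count:
            raise ValueError("section %s is truncated" % name)
        sections[name] = records
        pos += 1 + count                      # skip the count and the records
    return sections
```

For the session example above, the section occupies exactly two lines: the count "1" and the single record.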
- Section of possible flexion and prefix models
- Each record of this section is a list of items. Each item
- describes how one word form in a paradigm should be built. The whole list
- describes the whole paradigm (a set of word forms with morphological patterns).
- The format of one item is the following:
- %<flexion>*<ancode>
- or %<flexion>*<ancode>*<prefix>
- where
- <flexion> is a flexion (a string, which should be added to the right of the word base)
- <prefix> is a prefix (a string, which should be added to the left of the word base)
- <ancode> is an ancode.
- Let us consider an example of an English flexion and prefix model:
- %F*na%VES*nb
- Here we have two items:
- 1. <flexion> = F; <ancode> = na
- 2. <flexion> = VES; <ancode> = nb
- In order to decipher the ancodes we should go to the English gramtab-file.
- There we can find the following lines:
- na NOUN narr,sg
- nb NOUN narr,pl
- If the base "lea" were ascribed to this model, then its paradigm
- would be the following:
- leaf NOUN narr,sg
- leaves NOUN narr,pl
- It is important that each word of a morphological dictionary
- contains a reference to a line in this section.
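The item format and the "lea" example can be traced with a short sketch. It is an illustration under the format described above, not the implementation used by the library. The function names are hypothetical, and the base and flexions are written in uppercase because all words in an mrd-file are uppercase.

```python
def parse_flexion_model(record):
    """Parse '%<flexion>*<ancode>' or '%<flexion>*<ancode>*<prefix>' items."""
    items = []
    # The record starts with '%', so the first split chunk is empty.
    for chunk in record.split("%")[1:]:
        parts = chunk.split("*")
        if len(parts) not in (2, 3):
            raise ValueError("bad item: %r" % chunk)
        flexion, ancode = parts[0], parts[1]
        prefix = parts[2] if len(parts) == 3 else ""
        items.append((flexion, ancode, prefix))
    return items


def build_paradigm(base, model):
    """Return (word form, ancode) pairs: prefix + base + flexion."""
    return [(prefix + base + flexion, ancode)
            for flexion, ancode, prefix in model]


model = parse_flexion_model("%F*na%VES*nb")
print(build_paradigm("LEA", model))
```

Looking up "na" and "nb" in the gramtab then gives the part of speech and grammems for each word form.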
- Section of possible accentual models
- Each record of this section is a comma-delimited list of numbers, where
- each number is an index of a stressed vowel of a word form (counting
- from the end). The whole list contains a position for each word
- form in the paradigm.
- If an item of the accentual model of a word is equal to 255, then it
- is undefined, which means that this word form is unstressed.
- Each word in the dictionary should have a reference to
- an accentual model, even though this model can consist only of empty items.
- For one word, the number and the order of items in the accentual model
- should be equal to the number and the order of items in the flexion and
- prefix model. For example, we can ascribe to the word "leaf" with the paradigm
- leaf NOUN narr,sg
- leaves NOUN narr,pl
- the following accentual model:
- 2,3
- It produces the following accented paradigm:
- le'af NOUN narr,sg
- le'aves NOUN narr,pl
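The example above can be reproduced under one particular reading of the index, sketched below. The assumptions are stated here because the text does not spell them out: the number counts vowels (not characters) from the end of the word form, starting at 1, and the stress mark is written after the stressed vowel. The vowel set and the function names are hypothetical as well.

```python
UNSTRESSED = 255
VOWELS = set("aeiouyAEIOUY")  # assumption: vowel inventory for English


def parse_accent_model(record):
    """Parse a comma-delimited accentual model such as '2,3'."""
    return [int(item) for item in record.split(",")]


def mark_stress(form, accent):
    """Put a stress mark after the accent-th vowel counted from the end."""
    if accent == UNSTRESSED:
        return form                       # undefined item: unstressed form
    seen = 0
    for i in range(len(form) - 1, -1, -1):
        if form[i] in VOWELS:
            seen += 1
            if seen == accent:
                return form[:i + 1] + "'" + form[i + 1:]
    raise ValueError("form %r has no vowel number %d" % (form, accent))


def accent_paradigm(forms, accents):
    """Apply an accentual model to the word forms of one paradigm."""
    if len(forms) != len(accents):
        raise ValueError("model and paradigm differ in length")
    return [mark_stress(f, a) for f, a in zip(forms, accents)]


print(accent_paradigm(["leaf", "leaves"], parse_accent_model("2,3")))
```

Under these assumptions the model "2,3" yields "le'af" and "le'aves", as in the accented paradigm above. A reading that counts characters instead of vowels would not give both forms.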
-
- Section of user sessions
- This is a system section, which contains information about user edit
- sessions.
- Section of prefix sets
- Each record of this section is a comma-delimited list of strings, where
- each string is a prefix that can be attached to the whole word. If a prefix
- set is ascribed to a word, it means that words with these prefixes
- can also exist in the language. For example, if "leaf" has
- the prefix set "anti,contra", this implies that the words "antileaf" and
- "contraleaf" exist.
- A flexion and prefix model can also contain
- a reference to a prefix, but that prefix applies to
- one separate word form, while a prefix set is ascribed to the whole word
- paradigm.
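To make the difference concrete, the sketch below applies a prefix set to every word form of a paradigm. It is only an illustration of the description above, with hypothetical function names.

```python
def parse_prefix_set(record):
    """Parse a comma-delimited prefix set such as 'anti,contra'."""
    return [prefix for prefix in record.split(",") if prefix]


def expand_with_prefixes(forms, prefixes):
    """Attach each prefix of the set to every word form of the paradigm."""
    return [prefix + form for prefix in prefixes for form in forms]


prefixes = parse_prefix_set("anti,contra")
print(expand_with_prefixes(["leaf", "leaves"], prefixes))
```

The prefix inside a flexion and prefix model item, by contrast, would be attached to the single word form described by that item.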
-
- Section of lemmas
- A record of this section is a space-separated tuple of the following format:
- <base> <flex_model_no> <accent_model_no> <session_no> <type_ancode> <prefix_set_no>
- where
- <base> is a base (a constant part of a word in its paradigm)
- <flex_model_no> is an index of a flexion and prefix model
- <accent_model_no> is an index of an accentual model
- <session_no> is the index of the session in which this word was last edited
- <type_ancode> is an ancode, which is ascribed to the whole word
- (intended: the common part of grammems in the paradigm),
- or "-" if it is undefined
- <prefix_set_no> is an index of a prefix set, or "-" if it is undefined
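A lemma record of this format could be parsed as sketched below. The record "LEA 0 0 0 - -" is a made-up example, and the names Lemma and parse_lemma are hypothetical. Whether the indices start at 0 or 1 is not stated above, so the sketch only converts them to integers and leaves their interpretation open.

```python
from collections import namedtuple

Lemma = namedtuple(
    "Lemma",
    "base flex_model_no accent_model_no session_no type_ancode prefix_set_no")


def parse_lemma(record):
    """Parse one space-separated record of the section of lemmas."""
    fields = record.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields: %r" % record)
    base, flex_no, accent_no, session_no, type_ancode, prefix_no = fields
    return Lemma(
        base,
        int(flex_no),
        int(accent_no),
        int(session_no),
        None if type_ancode == "-" else type_ancode,  # "-" means undefined
        None if prefix_no == "-" else int(prefix_no),
    )


print(parse_lemma("LEA 0 0 0 - -"))
```

The three model indices refer to records of the earlier sections, so a lemma can only be resolved after those sections have been read.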