author | Jehan <jehan@girinstud.io> | 2022-12-17 21:32:24 +0100 |
---|---|---|
committer | Jehan <jehan@girinstud.io> | 2022-12-17 21:41:11 +0100 |
commit | 41d309e8a28407372317b048342e2bb23d9c8959 (patch) | |
tree | a9756bbdfe12c9e2560d252fbb919b8d85b64787 /script | |
parent | 60dcec8a82d55fabe2fc72af0c63f18f8289b662 (diff) |
script, src: regenerate Russian models and add UTF-8/Russian support.
This fixes the broken Russian test in Windows-1251, which once again gets
a much better score with Russian. It also adds UTF-8 support.
Same as Bulgarian, I wonder why I had not regenerated this earlier.
The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.
Note that this now breaks the zh:gb18030 test: the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000). I think I'll have to look a
bit closer at our dedicated GB18030 prober.
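The regression described above comes down to two candidates competing on confidence. A minimal sketch of that comparison follows, assuming the detector simply keeps the candidate with the highest score; the real selection logic is not part of this commit, and `Candidate` and `best_candidate` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical container for one prober result."""
    charset: str
    language: str
    confidence: float


def best_candidate(candidates):
    """Return the candidate with the highest confidence (assumed rule)."""
    return max(candidates, key=lambda c: c.confidence)


# Scores quoted in the commit message for the zh:gb18030 test.
candidates = [
    Candidate("GB18030", "zh", 0.700000),
    Candidate("KOI8-R", "ru", 0.766388),
]

winner = best_candidate(candidates)
print(winner.charset, winner.language, winner.confidence)
```

Under that assumption, the fixed 0.700000 reported for GB18030 loses to any statistical model scoring above it, which is consistent with the message's suggestion to revisit the GB18030 prober rather than the Russian model.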
Diffstat (limited to 'script')
-rw-r--r-- | script/BuildLangModelLogs/LangRussianModel.log | 270 |
-rw-r--r-- | script/charsets/ibm855.py | 75 |
-rw-r--r-- | script/charsets/ibm866.py | 72 |
-rw-r--r-- | script/charsets/koi8-r.py | 74 |
-rw-r--r-- | script/charsets/mac-cyrillic.py | 72 |
-rw-r--r-- | script/langs/ru.py | 58 |
6 files changed, 621 insertions, 0 deletions
diff --git a/script/BuildLangModelLogs/LangRussianModel.log b/script/BuildLangModelLogs/LangRussianModel.log
new file mode 100644
index 0000000..82d9804
--- /dev/null
+++ b/script/BuildLangModelLogs/LangRussianModel.log
@@ -0,0 +1,270 @@
+= Logs of language model for Russian (ru) =
+
+- Generated by BuildLangModel.py
+- Started: 2022-12-17 19:53:30.416132
+- Maximum depth: 4
+- Max number of pages: 200
+
+== Parsed pages ==
+
+Пулмен (рабочий посёлок) (revision 127314030)
+Водонапорная башня (revision 123368499)
+Обама, Барак (revision 127312814)
+Историзм (искусство) (revision 125199154)
+Насосная станция (revision 126671775)
+Школьный округ (revision 118138873)
+Конденсат (revision 97819205)
+1880-е годы (revision 124959394)
+Линкольн, Роберт Тодд (revision 126851305)
+Габарит подвижного состава (revision 127265050)
+Межвоенный период (revision 123201828)
+Гражданская война в США (revision 127311614)
+История евреев в США (revision 123703208)
+Англо-занзибарская война (revision 127263956)
+Линкольн, Джесси Харлан (revision 87795509)
+Бенкен, Герман (revision 120809711)
+УралГАХУ (revision 126489964)
+Великобритания (revision 127175319)
+Фленсбург (revision 126961771)
+Мещанство (revision 127304945)
+Прохоров, Александр Михайлович (revision 127233579)
+VIAF (revision 122626337)
+Национальная библиотека Чешской Республики (revision 124152023)
+Регулирующая арматура (revision 116046805)
+Раннее Средневековье (revision 126932807)
+Европейская интеграция (revision 125721443)
+Бойл, Уиллард (revision 120835257)
+Бут, Эдвин (revision 126437526)
+Московский трамвай (revision 127184149)
+Лондонский метрополитен (revision 126810923)
+F-18 (revision 127113399)
+Сацумско-британская война (revision 124671983)
+Луизианская покупка (revision 123941200)
+Община (Германия) (revision 125007479)
+Запорная арматура (revision 121220496)
+Новая Англия (revision 125214368)
+Берни Сандерс (revision 126983575)
+Бак (резервуар) (revision 126670363)
+Хемингуэй, Эрнест (revision 126959711)
+2021 год (revision 127125948)
+1951 год (revision 126285688)
+Жидкость (revision 127133343)
+Большая советская энциклопедия (revision 127144085)
+Россия (revision 127297047)
+CSS Virginia (revision 121318647)
+Школа реки Гудзон (revision 123627995)
+Водозаборные сооружения (revision 123836554)
+Ривера, Диего (revision 125976771)
+Квантовая физика (revision 126896053)
+Рочестер (Нью-Йорк) (revision 126016553)
+Конденсация (теплотехника) (revision 123837631)
+Средиземноморская Антанта (revision 125156636)
+Историография (revision 121180824)
+Гбови, Лейма (revision 124860814)
+Премудрый пискарь (revision 121359555)
+Люнебургская водонапорная башня (revision 117681965)
+XVIII век (revision 126913825)
+Сислей, Альфред (revision 127063100)
+Средние века (revision 127154753)
+Энциклопедический словарь Брокгауза и Ефрона (revision 125357601)
+Нефтепровод (revision 123810227)
+Нефть (revision 126997759)
+Вентиляция (revision 126675588)
+Цилиндр (revision 126783664)
+Английский язык (revision 127275941)
+Бензин (revision 126966322)
+Министр по делам ветеранов США (revision 124072400)
+Первобытное общество (revision 127057340)
+Пикассо, Пабло (revision 126869217)
+Рисунок в разрезе (revision 121960314)
+Междупутье (revision 125745955)
+Битва при Форт-Генри (revision 123999672)
+Канал (водный) (revision 123736265)
+Белорусская народная республика (revision 126958885)
+25 апреля (revision 127246597)
+Насос (revision 126768788)
+Теннесси (revision 124804069)
+Локомотив (revision 127032264)
+Габарит погрузки (revision 123372556)
+Вебби (revision 121964659)
+Алегзандрия (Виргиния) (revision 126338837)
+Война Фаррапус (revision 125765352)
+Образование в США (revision 126788195)
+Пресс-конференция (revision 127075029)
+Рио-де-Жанейро (revision 127002708)
+Габарит приближения строений (revision 117538368)
+Международный идентификатор стандартных наименований (revision 120216410)
+Мопассан, Ги де (revision 127086462)
+История Европейского союза (revision 123952687)
+Прусский социализм (revision 127165836)
+Библиотека Александрина (revision 126093192)
+Тэйкан-дзукури (revision 124877986)
+1883 год (revision 125476166)
+Конфликт на Китайско-Восточной железной дороге (revision 122499702)
+Энергетический уровень (revision 119322956)
+Алюминий (revision 126861293)
+Санкт-петербургский трамвай (revision 127306763)
+Национальная библиотека Франции (revision 127015965)
+12 мая (revision 127207333)
+Граммофон (revision 126498827)
+Маккьяйоли (revision 126836176)
+Канализационная установка (revision 123736401)
+Газ (revision 126950046)
+Луизиана (revision 127312945)
+Память Парижской Коммуны (revision 126960401)
+Сталь (revision 127216605)
+Семья Барака Обамы (revision 124529726)
+Поверхностный насос (revision 121146223)
+Каразин, Николай Николаевич (revision 127097562)
+Кирпичная готика (revision 125337841)
+The Century Magazine (revision 127098805)
+Контрольный номер Библиотеки Конгресса (revision 113360170)
+Русско-персидская война (1804—1813) (revision 126999654)
+Берн (revision 122913269)
+Поздняя античность (revision 127266287)
+Гарвардский университет (revision 127033732)
+Бои на Халхин-Голе (revision 126542980)
+Алый знак доблести (фильм, 1951) (revision 120728355)
+Водопровод (revision 127182411)
+Пар (revision 126003244)
+1971 год (revision 127068279)
+Искусство Древнего Египта (revision 125737336)
+Пенсильванский университет (Индиана) (revision 123963620)
+Национальная библиотека Израиля (revision 126108080)
+1884 год (revision 125476122)
+Проезд снаружи поездов (revision 127239100)
+Норвегия (revision 126986958)
+Барбур, Джеймс (revision 126851158)
+Французская интервенция в Испанию (revision 119666106)
+Англия (revision 127268120)
+Галлатин, Альберт (revision 127160198)
+Калифорния (revision 127027363)
+Роял, Кеннет Клайборн (revision 110605693)
+США (revision 126887888)
+Федеральная архитектура (revision 116000492)
+Конденсат Бозе — Эйнштейна (revision 125188375)
+Колонна (revision 126876842)
+1907 год (revision 127134918)
+13 сентября (revision 125587404)
+Генрих Лев (revision 126407574)
+Этрусское искусство (revision 123158050)
+Амальрик, Андрей Алексеевич (revision 126033545)
+9 декабря (revision 127201233)
+Селищи (22712000298) (revision 124521248)
+1798 год (revision 125783094)
+Мюледорф (Берн) (revision 121861015)
+Большая игра (revision 126891168)
+Битва (revision 124395796)
+Война не-персе (revision 127189710)
+Президентские выборы в США (2020) (revision 126639368)
+Площадь Карла Фаберже (revision 123223942)
+Банкрофт, Джордж (revision 126851184)
+Кобаяси, Макото (revision 121939251)
+Газойль (revision 123647640)
+Ватиканская апостольская библиотека (revision 124986491)
+Общественная собственность (revision 125722109)
+Славная революция (revision 122270271)
+Золя (revision 127092383)
+Офицер (revision 126230098)
+Метастабильное состояние (revision 118552209)
+Лыжные гонки (revision 124233040)
+Средиземное море (revision 126980465)
+Защитная арматура (revision 124665168)
+Президент Турции (revision 123861767)
+Макдональд, Артур (revision 123992590)
+Песок (revision 126799930)
+Сублимация (физика) (revision 127108939)
+Новицкий, Василий Фёдорович (revision 126350745)
+Список султанов Занзибара (revision 94020222)
+Туман (revision 124866163)
+2005 год (revision 127291761)
+Исламская Республика Афганистан (revision 126605442)
+Викисловарь (revision 126840626)
+22 января (revision 126465130)
+Российская национальная библиотека (revision 126055277)
+Наука в США (revision 124150312)
+Екатеринбургский завод (revision 125779202)
+Океания (revision 125374219)
+Нидершерли (revision 116230829)
+Война за австрийское наследство (revision 126874381)
+Доминиканская Республика (revision 127046641)
+Военный паровоз (revision 124117506)
+Подземные воды (revision 126705165)
+5 сентября (revision 126628763)
+Кафка, Франц (revision 127130321)
+Двухванная печь (revision 123510834)
+Чертаново Южное (revision 122081039)
+
+== End of Parsed pages ==
+
+- Wikipedia parsing ended at: 2022-12-17 19:57:30.506110
+
+63 characters appeared 2343890 times.
+
+Most Frequent characters:
+[ 0] Char о: 10.136567842347551 %
+[ 1] Char и: 8.217151828797427 %
+[ 2] Char а: 7.941797609956098 %
+[ 3] Char е: 7.781337861418411 %
+[ 4] Char н: 6.689093771465385 %
+[ 5] Char с: 5.755304216494801 %
+[ 6] Char р: 5.58695160609073 %
+[ 7] Char т: 5.486136294791991 %
+[ 8] Char в: 4.621547939536412 %
+[ 9] Char л: 4.156039745892512 %
+[10] Char к: 3.458694733967891 %
+[11] Char м: 2.899666793236884 %
+[12] Char д: 2.856064064439884 %
+[13] Char п: 2.69799350652121 %
+[14] Char у: 2.0648579924825823 %
+[15] Char я: 1.9596482770095867 %
+[16] Char г: 1.812798382176638 %
+[17] Char ы: 1.7729500957809452 %
+[18] Char б: 1.5043794717328884 %
+[19] Char з: 1.4936707780655236 %
+[20] Char й: 1.4190938994577391 %
+[21] Char ь: 1.2650764327677493 %
+[22] Char ч: 1.0549983147673312 %
+[23] Char х: 1.0016255029032932 %
+[24] Char ж: 0.7652236239755279 %
+[25] Char ц: 0.5965297006258826 %
+[26] Char ю: 0.5917513193878552 %
+[27] Char ш: 0.5520310253467526 %
+[28] Char ф: 0.4393977533075358 %
+[29] Char щ: 0.3068403380704726 %
+[30] Char э: 0.3063710327703092 %
+[31] Char i: 0.25978181569954223 %
+[32] Char ё: 0.24984107615971737 %
+[33] Char e: 0.2357619171548153 %
+[34] Char a: 0.21839762104876934 %
+[35] Char n: 0.18004257878996027 %
+[36] Char r: 0.1703151598411188 %
+[37] Char t: 0.16216631326555428 %
+[38] Char s: 0.15969179441014725 %
+[39] Char o: 0.1568759626091668 %
+[40] Char l: 0.1263711180985456 %
+[41] Char c: 0.09795681537956133 %
+[42] Char d: 0.08571221345711616 %
+[43] Char h: 0.07956858043679524 %
+[44] Char m: 0.07009714619713382 %
+[45] Char u: 0.0688598867694303 %
+[46] Char x: 0.05725524661993524 %
+[47] Char p: 0.05644462837419845 %
+[48] Char b: 0.05482339188272487 %
+[49] Char g: 0.051111613599614324 %
+[50] Char f: 0.05038632359027087 %
+[51] Char y: 0.04923439239896071 %
+[52] Char v: 0.0470158582527337 %
+[53] Char ъ: 0.03617917223077875 %
+
+The first 54 characters have an accumulated ratio of 0.9991548238185242.
+The first 5 characters have an accumulated ratio of 0.40765948913984873.
+All characters whose order is over 29 have an accumulated ratio of 0.030302616590369.
+
+1554 sequences found.
+
+First 819 (typical positive ratio): 0.9950050289366638
+Next 260 (1079-819): 0.003999322715788067
+Rest: 0.0009956483475481726
+
+- Processing end: 2022-12-17 19:57:30.653466
diff --git a/script/charsets/ibm855.py b/script/charsets/ibm855.py
new file mode 100644
index 0000000..451e938
--- /dev/null
+++ b/script/charsets/ibm855.py
@@ -0,0 +1,75 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'IBM855'
+aliases = ['CP855', 'OEM 855', 'MS-DOS Cyrillic']
+
+language = \
+{
+    # Wikipedia tells us: At one time it was widely used in Serbia, Macedonia
+    # and Bulgaria, but it never caught on in Russia, where Code page 866 was more
+    # common. This code page is not used much.
+    'complete': [ 'sr', 'mk', 'bg', 'ru' ],
+    'incomplete': []
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM, # AX
+  SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,SYM,SYM,SYM,SYM,LET,LET,SYM, # BX
+  SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # CX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,LET,LET,SYM, # DX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM, # EX
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM, # FX
+]
diff --git a/script/charsets/ibm866.py b/script/charsets/ibm866.py
new file mode 100644
index 0000000..9ed7bc5
--- /dev/null
+++ b/script/charsets/ibm866.py
@@ -0,0 +1,72 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'IBM866'
+aliases = ['CP866', 'DOS Cyrillic Russian']
+
+language = \
+{
+    'complete': [ 'bg', 'ru' ],
+    'incomplete': [ 'uk', 'be' ]
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # AX
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # BX
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # CX
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # DX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+  LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # FX
+]
diff --git a/script/charsets/koi8-r.py b/script/charsets/koi8-r.py
new file mode 100644
index 0000000..8abbc04
--- /dev/null
+++ b/script/charsets/koi8-r.py
@@ -0,0 +1,74 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'KOI8-R'
+aliases = ['csKOI8R']
+
+language = \
+{
+    # KOI8-R is an 8-bit character encoding, designed to cover Russian, which
+    # uses a Cyrillic alphabet. It also happens to cover Bulgarian, but has not
+    # been used for that purpose since CP1251 was accepted.
+    'complete': [ 'ru', 'bg' ],
+    'incomplete': []
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 8X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 9X
+  SYM,SYM,SYM,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # AX
+  SYM,SYM,SYM,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # BX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # CX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # DX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # FX
+]
diff --git a/script/charsets/mac-cyrillic.py b/script/charsets/mac-cyrillic.py
new file mode 100644
index 0000000..a967519
--- /dev/null
+++ b/script/charsets/mac-cyrillic.py
@@ -0,0 +1,72 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'MAC-CYRILLIC'
+aliases = ['x-mac-cyrillic' ]
+
+language = \
+{
+    'complete': [ 'bg', 'ru' ],
+    'incomplete': [ 'uk', 'be' ]
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+  SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+  SYM,SYM,LET,SYM,SYM,SYM,SYM,LET,SYM,SYM,SYM,LET,LET,SYM,LET,LET, # AX
+  SYM,SYM,SYM,SYM,LET,SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # BX
+  LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,LET, # CX
+  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,SYM,LET,LET,LET, # DX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+  LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM, # FX
+]
diff --git a/script/langs/ru.py b/script/langs/ru.py
new file mode 100644
index 0000000..9d330e1
--- /dev/null
+++ b/script/langs/ru.py
@@ -0,0 +1,58 @@
+#!/bin/python3
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+import re
+
+## Mandatory Properties ##
+
+name = 'Russian'
+code = 'ru'
+use_ascii = False
+charsets = [ 'WINDOWS-1251', 'ISO-8859-5', 'KOI8-R', 'IBM855', 'IBM866', 'MAC-CYRILLIC' ]
+
+## Optional Properties ##
+
+# Alphabet characters.
+alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
+# A starred page which was rewarded on the main page when I created
+# the data.
+start_pages = ['Пулмен (рабочий посёлок)']
+wikipedia_code = code
+case_mapping = True
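The accumulated ratios reported in LangRussianModel.log can be re-derived from the figures printed in the same log. The sketch below is only an arithmetic cross-check on those printed values; it does not reproduce how BuildLangModel.py computes them, and the variable names are illustrative.

```python
# Percentages of the five most frequent characters (о, и, а, е, н),
# copied from the "Most Frequent characters" table in the log.
top5_percent = [
    10.136567842347551,
    8.217151828797427,
    7.941797609956098,
    7.781337861418411,
    6.689093771465385,
]

# "The first 5 characters have an accumulated ratio of ..."
top5_ratio = sum(top5_percent) / 100.0

# The three sequence buckets from the end of the log should cover
# all observed sequences, i.e. sum to roughly 1.
first_819 = 0.9950050289366638
next_260 = 0.003999322715788067
rest = 0.0009956483475481726
sequence_total = first_819 + next_260 + rest

print(f"top-5 ratio:    {top5_ratio:.6f}")
print(f"sequence total: {sequence_total:.6f}")
```

The comparison should be made with a small tolerance rather than exact equality, since the log values are printed floating-point numbers.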
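A charmap such as the one in koi8-r.py can be cross-checked against an independent decoder. The sketch below compares the upper half (0x80 to 0xFF) of the KOI8-R table from this diff with Python's built-in `koi8_r` codec, treating a byte as a letter when the decoded character satisfies `str.isalpha()`. That letter test is an assumption made for this example, not necessarily the rule used to write the table, and `SYM`/`LET` here are local stand-ins for the constants from `codepoints`, whose real values are not shown in the diff.

```python
# Local stand-ins for the constants imported from `codepoints`
# (assumption: only their identity matters for this comparison).
SYM, LET = "SYM", "LET"

# Upper half (0x80-0xFF) of the KOI8-R charmap, transcribed from the diff.
KOI8_R_HIGH = (
    [SYM] * 16                          # 8X
    + [SYM] * 16                        # 9X
    + [SYM] * 3 + [LET] + [SYM] * 12    # AX (0xA3 is a letter)
    + [SYM] * 3 + [LET] + [SYM] * 12    # BX (0xB3 is a letter)
    + [LET] * 64                        # CX-FX
)


def mismatches(high_half, codec):
    """Return the byte values whose class disagrees with the codec."""
    bad = []
    for offset, expected in enumerate(high_half):
        byte = 0x80 + offset
        char = bytes([byte]).decode(codec)
        actual = LET if char.isalpha() else SYM
        if actual != expected:
            bad.append(byte)
    return bad


print(mismatches(KOI8_R_HIGH, "koi8_r"))
```

The same helper could be pointed at the other tables with their matching Python codec names, such as `cp855`, `cp866` or `mac_cyrillic`, once the corresponding rows are transcribed.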