author: Jehan <jehan@girinstud.io> 2022-12-17 21:32:24 +0100
committer: Jehan <jehan@girinstud.io> 2022-12-17 21:41:11 +0100
commit: 41d309e8a28407372317b048342e2bb23d9c8959
tree: a9756bbdfe12c9e2560d252fbb919b8d85b64787 /script
parent: 60dcec8a82d55fabe2fc72af0c63f18f8289b662
script, src: regenerate Russian models and add UTF-8/Russian support.
This fixes the broken Russian test in Windows-1251, which once again gets a much better score as Russian. It also adds UTF-8 support. As with Bulgarian, I wonder why I had not regenerated this earlier.

The new UTF-8 test comes from the 'Сурки' page of the Russian Wikipedia.

Note that this now breaks the zh:gb18030 test: the score for KOI8-R / ru (0.766388) beats GB18030 / zh (0.700000). I think I'll have to look a bit closer at our dedicated GB18030 prober.
Diffstat (limited to 'script')
-rw-r--r-- script/BuildLangModelLogs/LangRussianModel.log 270
-rw-r--r-- script/charsets/ibm855.py 75
-rw-r--r-- script/charsets/ibm866.py 72
-rw-r--r-- script/charsets/koi8-r.py 74
-rw-r--r-- script/charsets/mac-cyrillic.py 72
-rw-r--r-- script/langs/ru.py 58
6 files changed, 621 insertions, 0 deletions
diff --git a/script/BuildLangModelLogs/LangRussianModel.log b/script/BuildLangModelLogs/LangRussianModel.log
new file mode 100644
index 0000000..82d9804
--- /dev/null
+++ b/script/BuildLangModelLogs/LangRussianModel.log
@@ -0,0 +1,270 @@
+= Logs of language model for Russian (ru) =
+
+- Generated by BuildLangModel.py
+- Started: 2022-12-17 19:53:30.416132
+- Maximum depth: 4
+- Max number of pages: 200
+
+== Parsed pages ==
+
+Пулмен (рабочий посёлок) (revision 127314030)
+Водонапорная башня (revision 123368499)
+Обама, Барак (revision 127312814)
+Историзм (искусство) (revision 125199154)
+Насосная станция (revision 126671775)
+Школьный округ (revision 118138873)
+Конденсат (revision 97819205)
+1880-е годы (revision 124959394)
+Линкольн, Роберт Тодд (revision 126851305)
+Габарит подвижного состава (revision 127265050)
+Межвоенный период (revision 123201828)
+Гражданская война в США (revision 127311614)
+История евреев в США (revision 123703208)
+Англо-занзибарская война (revision 127263956)
+Линкольн, Джесси Харлан (revision 87795509)
+Бенкен, Герман (revision 120809711)
+УралГАХУ (revision 126489964)
+Великобритания (revision 127175319)
+Фленсбург (revision 126961771)
+Мещанство (revision 127304945)
+Прохоров, Александр Михайлович (revision 127233579)
+VIAF (revision 122626337)
+Национальная библиотека Чешской Республики (revision 124152023)
+Регулирующая арматура (revision 116046805)
+Раннее Средневековье (revision 126932807)
+Европейская интеграция (revision 125721443)
+Бойл, Уиллард (revision 120835257)
+Бут, Эдвин (revision 126437526)
+Московский трамвай (revision 127184149)
+Лондонский метрополитен (revision 126810923)
+F-18 (revision 127113399)
+Сацумско-британская война (revision 124671983)
+Луизианская покупка (revision 123941200)
+Община (Германия) (revision 125007479)
+Запорная арматура (revision 121220496)
+Новая Англия (revision 125214368)
+Берни Сандерс (revision 126983575)
+Бак (резервуар) (revision 126670363)
+Хемингуэй, Эрнест (revision 126959711)
+2021 год (revision 127125948)
+1951 год (revision 126285688)
+Жидкость (revision 127133343)
+Большая советская энциклопедия (revision 127144085)
+Россия (revision 127297047)
+CSS Virginia (revision 121318647)
+Школа реки Гудзон (revision 123627995)
+Водозаборные сооружения (revision 123836554)
+Ривера, Диего (revision 125976771)
+Квантовая физика (revision 126896053)
+Рочестер (Нью-Йорк) (revision 126016553)
+Конденсация (теплотехника) (revision 123837631)
+Средиземноморская Антанта (revision 125156636)
+Историография (revision 121180824)
+Гбови, Лейма (revision 124860814)
+Премудрый пискарь (revision 121359555)
+Люнебургская водонапорная башня (revision 117681965)
+XVIII век (revision 126913825)
+Сислей, Альфред (revision 127063100)
+Средние века (revision 127154753)
+Энциклопедический словарь Брокгауза и Ефрона (revision 125357601)
+Нефтепровод (revision 123810227)
+Нефть (revision 126997759)
+Вентиляция (revision 126675588)
+Цилиндр (revision 126783664)
+Английский язык (revision 127275941)
+Бензин (revision 126966322)
+Министр по делам ветеранов США (revision 124072400)
+Первобытное общество (revision 127057340)
+Пикассо, Пабло (revision 126869217)
+Рисунок в разрезе (revision 121960314)
+Междупутье (revision 125745955)
+Битва при Форт-Генри (revision 123999672)
+Канал (водный) (revision 123736265)
+Белорусская народная республика (revision 126958885)
+25 апреля (revision 127246597)
+Насос (revision 126768788)
+Теннесси (revision 124804069)
+Локомотив (revision 127032264)
+Габарит погрузки (revision 123372556)
+Вебби (revision 121964659)
+Алегзандрия (Виргиния) (revision 126338837)
+Война Фаррапус (revision 125765352)
+Образование в США (revision 126788195)
+Пресс-конференция (revision 127075029)
+Рио-де-Жанейро (revision 127002708)
+Габарит приближения строений (revision 117538368)
+Международный идентификатор стандартных наименований (revision 120216410)
+Мопассан, Ги де (revision 127086462)
+История Европейского союза (revision 123952687)
+Прусский социализм (revision 127165836)
+Библиотека Александрина (revision 126093192)
+Тэйкан-дзукури (revision 124877986)
+1883 год (revision 125476166)
+Конфликт на Китайско-Восточной железной дороге (revision 122499702)
+Энергетический уровень (revision 119322956)
+Алюминий (revision 126861293)
+Санкт-петербургский трамвай (revision 127306763)
+Национальная библиотека Франции (revision 127015965)
+12 мая (revision 127207333)
+Граммофон (revision 126498827)
+Маккьяйоли (revision 126836176)
+Канализационная установка (revision 123736401)
+Газ (revision 126950046)
+Луизиана (revision 127312945)
+Память Парижской Коммуны (revision 126960401)
+Сталь (revision 127216605)
+Семья Барака Обамы (revision 124529726)
+Поверхностный насос (revision 121146223)
+Каразин, Николай Николаевич (revision 127097562)
+Кирпичная готика (revision 125337841)
+The Century Magazine (revision 127098805)
+Контрольный номер Библиотеки Конгресса (revision 113360170)
+Русско-персидская война (1804—1813) (revision 126999654)
+Берн (revision 122913269)
+Поздняя античность (revision 127266287)
+Гарвардский университет (revision 127033732)
+Бои на Халхин-Голе (revision 126542980)
+Алый знак доблести (фильм, 1951) (revision 120728355)
+Водопровод (revision 127182411)
+Пар (revision 126003244)
+1971 год (revision 127068279)
+Искусство Древнего Египта (revision 125737336)
+Пенсильванский университет (Индиана) (revision 123963620)
+Национальная библиотека Израиля (revision 126108080)
+1884 год (revision 125476122)
+Проезд снаружи поездов (revision 127239100)
+Норвегия (revision 126986958)
+Барбур, Джеймс (revision 126851158)
+Французская интервенция в Испанию (revision 119666106)
+Англия (revision 127268120)
+Галлатин, Альберт (revision 127160198)
+Калифорния (revision 127027363)
+Роял, Кеннет Клайборн (revision 110605693)
+США (revision 126887888)
+Федеральная архитектура (revision 116000492)
+Конденсат Бозе — Эйнштейна (revision 125188375)
+Колонна (revision 126876842)
+1907 год (revision 127134918)
+13 сентября (revision 125587404)
+Генрих Лев (revision 126407574)
+Этрусское искусство (revision 123158050)
+Амальрик, Андрей Алексеевич (revision 126033545)
+9 декабря (revision 127201233)
+Селищи (22712000298) (revision 124521248)
+1798 год (revision 125783094)
+Мюледорф (Берн) (revision 121861015)
+Большая игра (revision 126891168)
+Битва (revision 124395796)
+Война не-персе (revision 127189710)
+Президентские выборы в США (2020) (revision 126639368)
+Площадь Карла Фаберже (revision 123223942)
+Банкрофт, Джордж (revision 126851184)
+Кобаяси, Макото (revision 121939251)
+Газойль (revision 123647640)
+Ватиканская апостольская библиотека (revision 124986491)
+Общественная собственность (revision 125722109)
+Славная революция (revision 122270271)
+Золя (revision 127092383)
+Офицер (revision 126230098)
+Метастабильное состояние (revision 118552209)
+Лыжные гонки (revision 124233040)
+Средиземное море (revision 126980465)
+Защитная арматура (revision 124665168)
+Президент Турции (revision 123861767)
+Макдональд, Артур (revision 123992590)
+Песок (revision 126799930)
+Сублимация (физика) (revision 127108939)
+Новицкий, Василий Фёдорович (revision 126350745)
+Список султанов Занзибара (revision 94020222)
+Туман (revision 124866163)
+2005 год (revision 127291761)
+Исламская Республика Афганистан (revision 126605442)
+Викисловарь (revision 126840626)
+22 января (revision 126465130)
+Российская национальная библиотека (revision 126055277)
+Наука в США (revision 124150312)
+Екатеринбургский завод (revision 125779202)
+Океания (revision 125374219)
+Нидершерли (revision 116230829)
+Война за австрийское наследство (revision 126874381)
+Доминиканская Республика (revision 127046641)
+Военный паровоз (revision 124117506)
+Подземные воды (revision 126705165)
+5 сентября (revision 126628763)
+Кафка, Франц (revision 127130321)
+Двухванная печь (revision 123510834)
+Чертаново Южное (revision 122081039)
+
+== End of Parsed pages ==
+
+- Wikipedia parsing ended at: 2022-12-17 19:57:30.506110
+
+63 characters appeared 2343890 times.
+
+Most Frequent characters:
+[ 0] Char о: 10.136567842347551 %
+[ 1] Char и: 8.217151828797427 %
+[ 2] Char а: 7.941797609956098 %
+[ 3] Char е: 7.781337861418411 %
+[ 4] Char н: 6.689093771465385 %
+[ 5] Char с: 5.755304216494801 %
+[ 6] Char р: 5.58695160609073 %
+[ 7] Char т: 5.486136294791991 %
+[ 8] Char в: 4.621547939536412 %
+[ 9] Char л: 4.156039745892512 %
+[10] Char к: 3.458694733967891 %
+[11] Char м: 2.899666793236884 %
+[12] Char д: 2.856064064439884 %
+[13] Char п: 2.69799350652121 %
+[14] Char у: 2.0648579924825823 %
+[15] Char я: 1.9596482770095867 %
+[16] Char г: 1.812798382176638 %
+[17] Char ы: 1.7729500957809452 %
+[18] Char б: 1.5043794717328884 %
+[19] Char з: 1.4936707780655236 %
+[20] Char й: 1.4190938994577391 %
+[21] Char ь: 1.2650764327677493 %
+[22] Char ч: 1.0549983147673312 %
+[23] Char х: 1.0016255029032932 %
+[24] Char ж: 0.7652236239755279 %
+[25] Char ц: 0.5965297006258826 %
+[26] Char ю: 0.5917513193878552 %
+[27] Char ш: 0.5520310253467526 %
+[28] Char ф: 0.4393977533075358 %
+[29] Char щ: 0.3068403380704726 %
+[30] Char э: 0.3063710327703092 %
+[31] Char i: 0.25978181569954223 %
+[32] Char ё: 0.24984107615971737 %
+[33] Char e: 0.2357619171548153 %
+[34] Char a: 0.21839762104876934 %
+[35] Char n: 0.18004257878996027 %
+[36] Char r: 0.1703151598411188 %
+[37] Char t: 0.16216631326555428 %
+[38] Char s: 0.15969179441014725 %
+[39] Char o: 0.1568759626091668 %
+[40] Char l: 0.1263711180985456 %
+[41] Char c: 0.09795681537956133 %
+[42] Char d: 0.08571221345711616 %
+[43] Char h: 0.07956858043679524 %
+[44] Char m: 0.07009714619713382 %
+[45] Char u: 0.0688598867694303 %
+[46] Char x: 0.05725524661993524 %
+[47] Char p: 0.05644462837419845 %
+[48] Char b: 0.05482339188272487 %
+[49] Char g: 0.051111613599614324 %
+[50] Char f: 0.05038632359027087 %
+[51] Char y: 0.04923439239896071 %
+[52] Char v: 0.0470158582527337 %
+[53] Char ъ: 0.03617917223077875 %
+
+The first 54 characters have an accumulated ratio of 0.9991548238185242.
+The first 5 characters have an accumulated ratio of 0.40765948913984873.
+All characters whose order is over 29 have an accumulated ratio of 0.030302616590369.
+
+1554 sequences found.
+
+First 819 (typical positive ratio): 0.9950050289366638
+Next 260 (1079-819): 0.003999322715788067
+Rest: 0.0009956483475481726
+
+- Processing end: 2022-12-17 19:57:30.653466
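The character statistics in the log above can be approximated with a few lines of Python. This is only a minimal sketch of the idea, not the actual code of BuildLangModel.py: the helper names `char_frequencies` and `accumulated_ratio` are hypothetical, and unlike the real script (whose log also lists Latin letters) it only counts characters from a given alphabet. The last lines re-add the five top percentages printed in the log to check the reported accumulated ratio.

```python
from collections import Counter

def char_frequencies(text, alphabet):
    """Return (char, ratio) pairs for the letters of `alphabet`, most frequent first.

    Hypothetical helper: the real script's counting rules may differ.
    """
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]

def accumulated_ratio(freqs, first_n):
    """Sum the ratios of the `first_n` most frequent characters."""
    return sum(ratio for _, ratio in freqs[:first_n])

RUSSIAN = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
freqs = char_frequencies('мама мыла раму', RUSSIAN)

# Percentages of the five most frequent characters, copied from the log.
log_top5 = [10.136567842347551, 8.217151828797427, 7.941797609956098,
            7.781337861418411, 6.689093771465385]
log_top5_ratio = sum(log_top5) / 100
```

Here `log_top5_ratio` agrees with the "first 5 characters" accumulated ratio reported in the log up to floating-point rounding.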
diff --git a/script/charsets/ibm855.py b/script/charsets/ibm855.py
new file mode 100644
index 0000000..451e938
--- /dev/null
+++ b/script/charsets/ibm855.py
@@ -0,0 +1,75 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'IBM855'
+aliases = ['CP855', 'OEM 855', 'MS-DOS Cyrillic']
+
+language = \
+{
+ # Wikipedia tells us: At one time it was widely used in Serbia, Macedonia
+ # and Bulgaria, but it never caught on in Russia, where Code page 866 was more
+ # common. This code page is not used much.
+ 'complete': [ 'sr', 'mk', 'bg', 'ru' ],
+ 'incomplete': []
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+ NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM, # AX
+ SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,SYM,SYM,SYM,SYM,LET,LET,SYM, # BX
+ SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # CX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,LET,LET,SYM, # DX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM, # EX
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM, # FX
+]
diff --git a/script/charsets/ibm866.py b/script/charsets/ibm866.py
new file mode 100644
index 0000000..9ed7bc5
--- /dev/null
+++ b/script/charsets/ibm866.py
@@ -0,0 +1,72 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'IBM866'
+aliases = ['CP866', 'DOS Cyrillic Russian']
+
+language = \
+{
+ 'complete': [ 'bg', 'ru' ],
+ 'incomplete': [ 'uk', 'be' ]
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+ NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # AX
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # BX
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # CX
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # DX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+ LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # FX
+]
diff --git a/script/charsets/koi8-r.py b/script/charsets/koi8-r.py
new file mode 100644
index 0000000..8abbc04
--- /dev/null
+++ b/script/charsets/koi8-r.py
@@ -0,0 +1,74 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'KOI8-R'
+aliases = ['csKOI8R']
+
+language = \
+{
+ # KOI8-R is an 8-bit character encoding, designed to cover Russian, which
+ # uses a Cyrillic alphabet. It also happens to cover Bulgarian, but has not
+ # been used for that purpose since CP1251 was accepted.
+ 'complete': [ 'ru', 'bg' ],
+ 'incomplete': []
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+ NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 8X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 9X
+ SYM,SYM,SYM,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # AX
+ SYM,SYM,SYM,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # BX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # CX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # DX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # FX
+]
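As a sanity check on a table like the one above, the letter positions in the upper half of a single-byte charset can be derived from Python's own codecs. The sketch below is an assumption-laden illustration, not part of the repository: `letter_bytes` is a hypothetical helper, and it treats "letter" as whatever `str.isalpha()` accepts, which may not match the `LET` classification used by the `codepoints` module in every charset.

```python
def letter_bytes(codec):
    """Return the byte values in 0x80-0xFF that `codec` decodes to a letter."""
    found = []
    for value in range(0x80, 0x100):
        try:
            char = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is undefined in this codec
        if char.isalpha():
            found.append(value)
    return found

# 'koi8_r' is the name of the KOI8-R codec in Python's standard library.
koi8r_letters = letter_bytes('koi8_r')
```

For KOI8-R this yields 0xA3, 0xB3 and the whole range 0xC0-0xFF, which are the `LET` positions in rows AX through FX of the charmap.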
diff --git a/script/charsets/mac-cyrillic.py b/script/charsets/mac-cyrillic.py
new file mode 100644
index 0000000..a967519
--- /dev/null
+++ b/script/charsets/mac-cyrillic.py
@@ -0,0 +1,72 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+from codepoints import *
+
+name = 'MAC-CYRILLIC'
+aliases = ['x-mac-cyrillic' ]
+
+language = \
+{
+ 'complete': [ 'bg', 'ru' ],
+ 'incomplete': [ 'uk', 'be' ]
+}
+
+# X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF #
+charmap = \
+[
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, # 0X
+ CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, # 1X
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, # 2X
+ NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, # 3X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 4X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,SYM, # 5X
+ SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 6X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM,SYM,SYM,SYM,CTR, # 7X
+
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 8X
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # 9X
+ SYM,SYM,LET,SYM,SYM,SYM,SYM,LET,SYM,SYM,SYM,LET,LET,SYM,LET,LET, # AX
+ SYM,SYM,SYM,SYM,LET,SYM,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # BX
+ LET,LET,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,LET, # CX
+ SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,LET,LET,LET,LET,SYM,LET,LET,LET, # DX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET, # EX
+ LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,LET,SYM, # FX
+]
diff --git a/script/langs/ru.py b/script/langs/ru.py
new file mode 100644
index 0000000..9d330e1
--- /dev/null
+++ b/script/langs/ru.py
@@ -0,0 +1,58 @@
+#!/bin/python3
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+# Jehan <jehan@girinstud.io>
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+import re
+
+## Mandatory Properties ##
+
+name = 'Russian'
+code = 'ru'
+use_ascii = False
+charsets = [ 'WINDOWS-1251', 'ISO-8859-5', 'KOI8-R', 'IBM855', 'IBM866', 'MAC-CYRILLIC' ]
+
+## Optional Properties ##
+
+# Alphabet characters.
+alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
+# A starred page which was rewarded on the main page when I created
+# the data.
+start_pages = ['Пулмен (рабочий посёлок)']
+wikipedia_code = code
+case_mapping = True
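To see why each entry in `charsets` needs its own model, it helps to encode the same text in all six legacy charsets and compare the bytes. The sketch below uses Python's standard codecs; the mapping from the names in `ru.py` to Python codec names (`PYTHON_CODECS`) is my own assumption for illustration and is not something the repository defines.

```python
# Alphabet copied from ru.py.
alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

# Assumed mapping from the charset names in ru.py to Python codec names.
PYTHON_CODECS = {
    'WINDOWS-1251': 'cp1251',
    'ISO-8859-5': 'iso8859_5',
    'KOI8-R': 'koi8_r',
    'IBM855': 'cp855',
    'IBM866': 'cp866',
    'MAC-CYRILLIC': 'mac_cyrillic',
}

# One byte string per charset: same letters, different byte values.
encoded = {name: alphabet.encode(codec) for name, codec in PYTHON_CODECS.items()}

for name, data in encoded.items():
    # Show the first few bytes to make the differences visible.
    print(f'{name:13} {data[:6].hex(" ")}')
```

Every charset encodes the 33 letters as 33 single bytes, but no two of them produce the same byte sequence, so a byte-sequence model built for one charset cannot be reused as-is for another.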