author | Jehan <jehan@girinstud.io> | 2016-09-28 22:11:19 +0200 |
---|---|---|
committer | Jehan <jehan@girinstud.io> | 2016-09-28 22:13:17 +0200 |
commit | d62154bd6ed1eaeca2e40f36673a3e32acd445d7 | |
tree | 1eca008cffe5e976bd22207dc2c1eab7c9a87938 | |
parent | fbd2efdbe918ec18ec79a3b2e0064b2247393cd0 |
LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
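
The five legacy encodings listed above place the Slovene-specific letters č, š and ž at different byte values, which is why each encoding needs its own character map. A minimal sketch of that, assuming Python's built-in codecs `iso8859_2`, `iso8859_16`, `cp1250`, `cp852` and `mac_latin2` correspond to the charsets named in this commit (the mapping dictionary and the `letter_bytes` helper are illustrative names, not part of the commit):

```python
# Illustrative only: shows how the same three letters encode differently
# in each charset. The codec names are Python's and are assumed to match
# the charsets named in the commit message.
SLOVENE_LETTERS = "čšž"

CODECS = {
    "ISO-8859-2": "iso8859_2",
    "ISO-8859-16": "iso8859_16",
    "Windows-1250": "cp1250",
    "IBM852": "cp852",
    "MAC-CENTRALEUROPE": "mac_latin2",
}


def letter_bytes(text=SLOVENE_LETTERS):
    """Return {charset name: hex string of the encoded bytes}."""
    return {name: text.encode(codec).hex() for name, codec in CODECS.items()}


if __name__ == "__main__":
    for name, encoded in letter_bytes().items():
        print(f"{name:18} {encoded}")
```

The same approach could be used to regenerate the per-encoding test files from the UTF-8 text, although the commit does not say how those files were produced.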
-rw-r--r-- | README.md | 6 |
-rw-r--r-- | script/BuildLangModelLogs/LangSloveneModel.log | 148 |
-rw-r--r-- | script/langs/sl.py | 59 |
-rw-r--r-- | src/CMakeLists.txt | 1 |
-rw-r--r-- | src/LangModels/LangSloveneModel.cpp | 259 |
-rw-r--r-- | src/nsSBCSGroupProber.cpp | 6 |
-rw-r--r-- | src/nsSBCSGroupProber.h | 2 |
-rw-r--r-- | src/nsSBCharSetProber.h | 6 |
-rw-r--r-- | test/sl/ibm852.txt | 9 |
-rw-r--r-- | test/sl/iso-8859-16.txt | 9 |
-rw-r--r-- | test/sl/iso-8859-2.txt | 9 |
-rw-r--r-- | test/sl/mac-centraleurope.txt | 9 |
-rw-r--r-- | test/sl/utf-8.txt | 9 |
-rw-r--r-- | test/sl/windows-1250.txt | 9 |
14 files changed, 540 insertions, 1 deletions
@@ -132,6 +132,12 @@ Techniques used by universalchardet are described at http://www.mozilla.org/proj * ISO-8859-2 * IBM852 * MAC-CENTRALEUROPE + * Slovene + * ISO-8859-2 + * ISO-8859-16 + * Windows-1250 + * IBM852 + * MAC-CENTRALEUROPE * Spanish * ISO-8859-1 * ISO-8859-15 diff --git a/script/BuildLangModelLogs/LangSloveneModel.log b/script/BuildLangModelLogs/LangSloveneModel.log new file mode 100644 index 0000000..e494190 --- /dev/null +++ b/script/BuildLangModelLogs/LangSloveneModel.log @@ -0,0 +1,148 @@ += Logs of language model for Slovene (sl) = + +- Generated by BuildLangModel.py +- Started: 2016-09-28 22:00:35.243966 +- Maximum depth: 5 +- Max number of pages: 100 + +== Parsed pages == + +XCOM: Enemy Unknown (revision 4704271) +1UP.com (revision 4547348) +2K Games (revision 4110089) +Android (operacijski sistem) (revision 4619359) +Animator videoigre (revision 4702643) +App Store (revision 3903089) +Artefakt (revision 4484504) +Athlon (revision 4524746) +Avstralazija (revision 4623530) +Avtopsija (revision 4541344) +Bralno-pisalni pomnilnik (revision 4256388) +Civilization (serija) (revision 4645770) +Deus Ex: Human Revolution (revision 4694860) +Digitalna distribucija (revision 4696215) +DirectX (revision 4477913) +Dishonored (revision 4619444) +Edge (magazine) (revision 4690049) +Electronic Entertainment Expo (revision 4538691) +Enoigralska videoigra (revision 4610359) +Eurogamer (revision 4694860) +Evropa (revision 4687833) +Fantasy Flight Games (revision 4649361) +Firaxis Games (revision 4110089) +GameRankings (revision 3934020) +GameSpot (revision 4238015) +GameSpy (revision 4538691) +GameTrailers (revision 4704271) +Game Informer (revision 4704271) +GamesTM (revision 4704271) +Grafična kartica (revision 4257980) +Granata (revision 3859332) +Holograf (revision 4477482) +IGN (revision 4576233) +IOS (revision 4597264) +Igra igranja vlog (revision 4642276) +Igra na deski (revision 4649363) +Igralna konzola (revision 4649866) +Igralni pogon (revision 
4622773) +Intel (revision 4626025) +International Standard Book Number (revision 4015087) +Izdelovalec videoigre (revision 3851747) +Joker (revija) (revision 3867772) +Kotaku (revision 4613535) +Kristal (revision 4156234) +Linux (revision 4524740) +Lovec prestreznik (revision 4102792) +MTV (revision 4621758) +Mac OS X (revision 4601645) +Machinima (revision 4601716) +Major (revision 4245802) +Mednarodna različica (revision 4116054) +Metacritic (revision 3934020) +Michael McCann (skladatelj) (revision 4694860) +MicroProse (revision 4382810) +Microsoft Windows (revision 4691357) +Nezemeljsko življenje (revision 4620576) +NowGamer (revision 4704271) +OS X (revision 4601645) +Ognjena ekipa (revision 4694450) +Operacijski sistem (revision 4698515) +Ostrostrelec (revision 4529694) +Pilot (revision 4069093) +PlayStation 3 (revision 4382944) +PlayStation Network (revision 4382944) +PlayStation Vita (revision 3944025) +Pogon igre (revision 4622773) +Procesor (revision 4702518) +Producent videoiger (revision 4599904) +Razvijalec videoiger (revision 4093281) +Računalniška miška (revision 4385579) +Računalniška platforma (revision 4673669) +Severna Amerika (revision 4643798) +Sid Meier (revision 4061487) +Stealth (revision 4618630) +Steam (revision 4696215) +Strateška videoigra (revision 4236795) +Tablični računalnik (revision 4409985) +Take-Two Interactive (revision 4110089) +Telepatija (revision 4481192) +The Bureau: XCOM Declassified (revision 4704271) +The Guardian (revision 3929479) +Trdi disk (revision 4644623) +UFO: Enemy Unknown (revision 4704271) +Unreal Engine (revision 4622773) +Unreal Engine 3 (revision 4622773) +Uporabniški vmesnik (revision 4552473) +Valve Corporation (revision 4110105) +Večigralska videoigra (revision 4618639) +VideoGamer.com (revision 4704271) +Vohunski satelit (revision 4215166) +Vojaška taktika (revision 3970259) +Vojaški čini (revision 4363026) + +== End of Parsed pages == + +- Wikipedia parsing ended at: 2016-09-28 22:06:46.133919 + +41 
characters appeared 411226 times. + +First 29 characters: +[ 0] Char a: 10.090315301075321 % +[ 1] Char e: 9.90477255815537 % +[ 2] Char i: 9.666703953543793 % +[ 3] Char o: 9.177921629468953 % +[ 4] Char n: 7.28309980400072 % +[ 5] Char r: 5.808241696779873 % +[ 6] Char s: 4.575586174025961 % +[ 7] Char t: 4.4963110309173056 % +[ 8] Char j: 4.343840126840229 % +[ 9] Char l: 4.2672399118732764 % +[10] Char v: 3.802775116359374 % +[11] Char p: 3.5216644861949393 % +[12] Char k: 3.5136397017698293 % +[13] Char d: 3.0387183689747244 % +[14] Char m: 2.9487435132992563 % +[15] Char z: 2.350775485985808 % +[16] Char u: 1.9719083910064055 % +[17] Char g: 1.9342162217369525 % +[18] Char b: 1.5392995579073308 % +[19] Char c: 1.2924766430138173 % +[20] Char h: 1.1864522184881305 % +[21] Char č: 1.137087635509428 % +[22] Char š: 0.6932927392723223 % +[23] Char ž: 0.45303555709026183 % +[24] Char f: 0.40707542811009034 % +[25] Char x: 0.19381070263067024 % +[26] Char y: 0.19040624863213904 % +[27] Char w: 0.18919037220409216 % +[28] Char q: 0.011186063138031156 % + +The first 29 characters have an accumulated ratio of 0.9998978663800442. + +727 sequences found. + +First 512 (typical positive ratio): 0.9983524317161332 +Next 512 (512-1024): 2.4317528560937295e-06 +Rest: -3.859759734048396e-17 + +- Processing end: 2016-09-28 22:06:46.601266 diff --git a/script/langs/sl.py b/script/langs/sl.py new file mode 100644 index 0000000..bf02bf8 --- /dev/null +++ b/script/langs/sl.py @@ -0,0 +1,59 @@ +#!/bin/python3 +# -*- coding: utf-8 -*- + +# ##### BEGIN LICENSE BLOCK ##### +# Version: MPL 1.1/GPL 2.0/LGPL 2.1 +# +# The contents of this file are subject to the Mozilla Public License Version +# 1.1 (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# http://www.mozilla.org/MPL/ +# +# Software distributed under the License is distributed on an "AS IS" basis, +# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License +# for the specific language governing rights and limitations under the +# License. +# +# The Original Code is Mozilla Universal charset detector code. +# +# The Initial Developer of the Original Code is +# Netscape Communications Corporation. +# Portions created by the Initial Developer are Copyright (C) 2001 +# the Initial Developer. All Rights Reserved. +# +# Contributor(s): +# Jehan <jehan@girinstud.io> +# +# Alternatively, the contents of this file may be used under the terms of +# either the GNU General Public License Version 2 or later (the "GPL"), or +# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"), +# in which case the provisions of the GPL or the LGPL are applicable instead +# of those above. If you wish to allow use of your version of this file only +# under the terms of either the GPL or the LGPL, and not to allow others to +# use your version of this file under the terms of the MPL, indicate your +# decision by deleting the provisions above and replace them with the notice +# and other provisions required by the GPL or the LGPL. If you do not delete +# the provisions above, a recipient may use your version of this file under +# the terms of any one of the MPL, the GPL or the LGPL. +# +# ##### END LICENSE BLOCK ##### + +import re + +## Mandatory Properties ## + +name = 'Slovene' +code = 'sl' +use_ascii = True +charsets = ['ISO-8859-2', 'ISO-8859-16', + 'Windows-1250', 'IBM852', 'MAC-CENTRALEUROPE'] + +## Optional Properties ## + +# Alphabet characters. +alphabet = 'čšž' +# The starred page which was rewarded on the main page when I created +# the data. 
+start_pages = ['XCOM: Enemy Unknown'] +wikipedia_code = code +case_mapping = True diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index 2525ec6..67e76b1 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -30,6 +30,7 @@ set( LangModels/LangRomanianModel.cpp LangModels/LangRussianModel.cpp LangModels/LangSlovakModel.cpp + LangModels/LangSloveneModel.cpp LangModels/LangSpanishModel.cpp LangModels/LangThaiModel.cpp LangModels/LangTurkishModel.cpp diff --git a/src/LangModels/LangSloveneModel.cpp b/src/LangModels/LangSloveneModel.cpp new file mode 100644 index 0000000..da28d86 --- /dev/null +++ b/src/LangModels/LangSloveneModel.cpp @@ -0,0 +1,259 @@ +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ +/* ***** BEGIN LICENSE BLOCK ***** + * Version: MPL 1.1/GPL 2.0/LGPL 2.1 + * + * The contents of this file are subject to the Mozilla Public License Version + * 1.1 (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * http://www.mozilla.org/MPL/ + * + * Software distributed under the License is distributed on an "AS IS" basis, + * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License + * for the specific language governing rights and limitations under the + * License. + * + * The Original Code is Mozilla Communicator client code. + * + * The Initial Developer of the Original Code is + * Netscape Communications Corporation. + * Portions created by the Initial Developer are Copyright (C) 1998 + * the Initial Developer. All Rights Reserved. + * + * Contributor(s): + * + * Alternatively, the contents of this file may be used under the terms of + * either the GNU General Public License Version 2 or later (the "GPL"), or + * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"), + * in which case the provisions of the GPL or the LGPL are applicable instead + * of those above. 
If you wish to allow use of your version of this file only + * under the terms of either the GPL or the LGPL, and not to allow others to + * use your version of this file under the terms of the MPL, indicate your + * decision by deleting the provisions above and replace them with the notice + * and other provisions required by the GPL or the LGPL. If you do not delete + * the provisions above, a recipient may use your version of this file under + * the terms of any one of the MPL, the GPL or the LGPL. + * + * ***** END LICENSE BLOCK ***** */ + +#include "../nsSBCharSetProber.h" + +/********* Language model for: Slovene *********/ + +/** + * Generated by BuildLangModel.py + * On: 2016-09-28 22:06:46.134717 + **/ + +/* Character Mapping Table: + * ILL: illegal character. + * CTR: control character specific to the charset. + * RET: carriage/return. + * SYM: symbol (punctuation) that does not belong to word. + * NUM: 0 - 9. + * + * Other characters are ordered by probabilities + * (0 is the most common character in the language). + * + * Orders are generic to a language. So the codepoint with order X in + * CHARSET1 maps to the same character as the codepoint with the same + * order X in CHARSET2 for the same language. + * As such, it is possible to get missing order. For instance the + * ligature of 'o' and 'e' exists in ISO-8859-15 but not in ISO-8859-1 + * even though they are both used for French. Same for the euro sign. 
+ */ +static const unsigned char Iso_8859_2_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 4X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 6X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,CTR, /* 7X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */ + SYM, 41,SYM, 42,SYM, 43, 44,SYM,SYM, 22, 45, 46, 47,SYM, 23, 48, /* AX */ + SYM, 49,SYM, 50,SYM, 51, 52,SYM,SYM, 22, 53, 54, 55,SYM, 23, 56, /* BX */ + 57, 32, 58, 59, 60, 61, 37, 34, 21, 29, 62, 36, 63, 30, 64, 65, /* CX */ + 66, 67, 68, 31, 35, 69, 70,SYM, 71, 72, 39, 73, 74, 40, 75, 76, /* DX */ + 77, 32, 78, 79, 80, 81, 37, 34, 21, 29, 82, 36, 83, 30, 84, 85, /* EX */ + 86, 87, 88, 31, 35, 89, 90,SYM, 91, 92, 39, 93, 94, 40, 95,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const unsigned char Iso_8859_16_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 4X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 6X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,CTR, /* 7X */ + 
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */ + SYM, 96, 97, 98,SYM,SYM, 22,SYM, 22,SYM, 99,SYM,100,SYM,101,102, /* AX */ + SYM,SYM, 21,103, 23,SYM,SYM,SYM, 23, 21,104,SYM,105,106,107,108, /* BX */ + 109, 32,110,111,112, 37,113, 34,114, 29, 33, 36,115, 30,116,117, /* CX */ + 118,119,120, 31, 35,121,122,123,124,125, 39,126,127,128,129,130, /* DX */ + 131, 32,132,133,134, 37,135, 34,136, 29, 33, 36,137, 30,138,139, /* EX */ + 140,141,142, 31, 35,143,144,145,146,147, 39,148,149,150,151,152, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const unsigned char Windows_1250_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 4X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 6X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,CTR, /* 7X */ + SYM,ILL,SYM,ILL,SYM,SYM,SYM,SYM,ILL,SYM, 22,SYM,153,154, 23,155, /* 8X */ + ILL,SYM,SYM,SYM,SYM,SYM,SYM,SYM,ILL,SYM, 22,SYM,156,157, 23,158, /* 9X */ + SYM,SYM,SYM,159,SYM,160,SYM,SYM,SYM,SYM,161,SYM,SYM,SYM,SYM,162, /* AX */ + SYM,SYM,SYM,163,SYM,SYM,SYM,SYM,SYM,164,165,SYM,166,SYM,167,168, /* BX */ + 169, 32,170,171,172,173, 37, 34, 21, 29,174, 36,175, 30,176,177, /* CX */ + 178,179,180, 31, 35,181,182,SYM,183,184, 39,185,186, 40,187,188, /* DX */ + 189, 32,190,191,192,193, 37, 34, 21, 29,194, 36,195, 30,196,197, /* EX */ + 198,199,200, 31, 35,201,202,SYM,203,204, 39,205,206, 40,207,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const 
unsigned char Mac_Centraleurope_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 4X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 6X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,CTR, /* 7X */ + 208,209,210, 29,211,212,213, 32,214, 21,215, 21, 37, 37, 29,216, /* 8X */ + 217,218, 30,219, 38, 38,220, 31,221, 35,222,223, 39,224,225,226, /* 9X */ + SYM,SYM,227,SYM,SYM,SYM,SYM,228,SYM,SYM,SYM,229,SYM,SYM,230,231, /* AX */ + 232,233,SYM,SYM,234,235,SYM,SYM,236,237,238,239,240,241,242,243, /* BX */ + 244,245,SYM,SYM,246,247,SYM,SYM,SYM,SYM,SYM,248,249,249,249,249, /* CX */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,249,249,249,249,SYM,SYM,249,249, /* DX */ + 249, 22,SYM,SYM, 22,249,249, 32,249,249, 30, 23, 23,249, 31, 35, /* EX */ + 249,249, 39,249,249,249,249,249, 40, 40,249,249,249,249,249,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const unsigned char Ibm852_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 4X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 0, 18, 19, 13, 1, 24, 17, 20, 2, 8, 12, 9, 14, 4, 3, /* 6X */ + 11, 28, 5, 6, 7, 16, 10, 27, 25, 26, 15,SYM,SYM,SYM,SYM,CTR, /* 7X */ + 34,249, 29,249,249,249, 37, 34,249, 
36,249,249,249,249,249, 37, /* 8X */ + 29,249,249, 35,249,249,249,249,249,249,249,249,249,249,SYM, 21, /* 9X */ + 32, 30, 31, 39,249,249, 23, 23,249,249,SYM,249, 21,249,SYM,SYM, /* AX */ + SYM,SYM,SYM,SYM,SYM, 32,249,249,249,SYM,SYM,SYM,SYM,249,249,SYM, /* BX */ + SYM,SYM,SYM,SYM,SYM,SYM,249,249,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* CX */ + 249,249,249, 36,249,249, 30,249,249,SYM,SYM,SYM,SYM,249,249,SYM, /* DX */ + 31,249, 35,249,249,249, 22, 22,249, 39,249,249, 40, 40,249,SYM, /* EX */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,249,249,249,SYM,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + + +/* Model Table: + * Total sequences: 727 + * First 512 sequences: 0.9983524317161332 + * Next 512 sequences (512-1024): 0.0016475682838668457 + * Rest: -3.859759734048396e-17 + * Negative sequences: TODO + */ +static const PRUint8 SloveneLangModel[] = +{ + 2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,2,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,2,3,2,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,2,2, + 3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,2,3,2,3,3,3,2,0,0,3,2,3,3,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,0,0,0,3,2,3,3,0, + 3,3,3,3,3,2,3,3,0,0,3,3,3,3,3,2,3,2,3,3,3,2,3,0,0,0,0,0,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,2,3,2,3,3,2,3,2,0, + 3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,3,3,2,3,3,3,3,2,2,2,2,0,0, + 3,3,3,3,3,3,3,3,2,3,0,3,3,3,2,2,3,3,3,3,3,2,2,0,0,0,3,2,2, + 3,3,3,3,3,3,3,3,3,3,3,0,2,3,3,2,3,0,2,3,3,0,3,0,2,0,3,2,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,2,3,2,2,3,2,0, + 3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,0,3,2,3,3,2,2,2,0,2,2,3,2,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,2,3,2,0,2,0,0,0, + 3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,2,0,0, + 3,3,3,3,3,3,3,2,0,3,3,3,2,2,2,0,3,2,3,2,3,0,0,0,2,2,2,2,0, + 
3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,0,3,0,2,2,0,3,3,2,2,0,3,0,0, + 3,3,3,3,3,3,3,3,0,3,2,3,3,3,2,2,3,2,2,3,3,0,0,0,2,2,3,2,2, + 3,3,3,3,3,3,2,3,0,3,3,3,3,2,2,2,3,0,2,0,0,2,0,0,2,0,2,2,0, + 3,3,3,3,3,3,0,0,3,3,2,2,3,2,0,0,3,0,2,2,0,0,2,0,0,0,0,0,0, + 3,3,3,3,3,2,0,3,3,3,2,3,3,0,0,0,3,0,0,0,0,3,0,2,0,0,0,0,0, + 3,3,3,2,3,2,0,2,3,3,2,0,3,0,0,0,3,2,3,2,0,0,0,2,0,0,0,0,0, + 3,3,3,3,2,3,3,3,0,3,0,0,0,2,2,0,3,2,0,2,2,0,0,0,3,2,2,2,0, + 3,3,3,3,2,2,2,3,0,0,2,3,0,2,2,0,3,2,3,3,2,0,0,0,2,2,2,2,0, + 3,3,2,3,3,2,3,3,3,3,0,2,2,2,2,0,2,2,2,3,2,0,0,0,0,2,0,2,0, + 3,3,3,3,3,0,3,0,0,2,0,0,0,0,2,0,2,2,2,0,2,0,0,0,2,0,2,3,0, + 0,0,0,0,2,0,0,2,0,2,0,0,0,0,0,0,3,0,0,2,0,0,0,0,0,0,0,0,0, +}; + + +const SequenceModel Iso_8859_2SloveneModel = +{ + Iso_8859_2_CharToOrderMap, + SloveneLangModel, + 29, + (float)0.9983524317161332, + PR_TRUE, + "ISO-8859-2" +}; + +const SequenceModel Iso_8859_16SloveneModel = +{ + Iso_8859_16_CharToOrderMap, + SloveneLangModel, + 29, + (float)0.9983524317161332, + PR_TRUE, + "ISO-8859-16" +}; + +const SequenceModel Windows_1250SloveneModel = +{ + Windows_1250_CharToOrderMap, + SloveneLangModel, + 29, + (float)0.9983524317161332, + PR_TRUE, + "WINDOWS-1250" +}; + +const SequenceModel Mac_CentraleuropeSloveneModel = +{ + Mac_Centraleurope_CharToOrderMap, + SloveneLangModel, + 29, + (float)0.9983524317161332, + PR_TRUE, + "MAC-CENTRALEUROPE" +}; + +const SequenceModel Ibm852SloveneModel = +{ + Ibm852_CharToOrderMap, + SloveneLangModel, + 29, + (float)0.9983524317161332, + PR_TRUE, + "IBM852" +}; diff --git a/src/nsSBCSGroupProber.cpp b/src/nsSBCSGroupProber.cpp index 96c93e0..161129d 100644 --- a/src/nsSBCSGroupProber.cpp +++ b/src/nsSBCSGroupProber.cpp @@ -179,6 +179,12 @@ nsSBCSGroupProber::nsSBCSGroupProber() mProbers[87] = new nsSingleByteCharSetProber(&Iso_8859_16RomanianModel); mProbers[88] = new nsSingleByteCharSetProber(&Ibm852RomanianModel); + mProbers[89] = new nsSingleByteCharSetProber(&Windows_1250SloveneModel); + mProbers[90] = new 
nsSingleByteCharSetProber(&Iso_8859_2SloveneModel); + mProbers[91] = new nsSingleByteCharSetProber(&Iso_8859_16SloveneModel); + mProbers[92] = new nsSingleByteCharSetProber(&Mac_CentraleuropeSloveneModel); + mProbers[93] = new nsSingleByteCharSetProber(&Ibm852SloveneModel); + Reset(); } diff --git a/src/nsSBCSGroupProber.h b/src/nsSBCSGroupProber.h index 7f7425c..b22f46e 100644 --- a/src/nsSBCSGroupProber.h +++ b/src/nsSBCSGroupProber.h @@ -40,7 +40,7 @@ #define nsSBCSGroupProber_h__ -#define NUM_OF_SBCS_PROBERS 89 +#define NUM_OF_SBCS_PROBERS 94 class nsCharSetProber; class nsSBCSGroupProber: public nsCharSetProber { diff --git a/src/nsSBCharSetProber.h b/src/nsSBCharSetProber.h index e6dd2ae..dd29b90 100644 --- a/src/nsSBCharSetProber.h +++ b/src/nsSBCharSetProber.h @@ -240,5 +240,11 @@ extern const SequenceModel Iso_8859_2RomanianModel; extern const SequenceModel Iso_8859_16RomanianModel; extern const SequenceModel Ibm852RomanianModel; +extern const SequenceModel Windows_1250SloveneModel; +extern const SequenceModel Iso_8859_2SloveneModel; +extern const SequenceModel Iso_8859_16SloveneModel; +extern const SequenceModel Ibm852SloveneModel; +extern const SequenceModel Mac_CentraleuropeSloveneModel; + #endif /* nsSingleByteCharSetProber_h__ */ diff --git a/test/sl/ibm852.txt b/test/sl/ibm852.txt new file mode 100644 index 0000000..5fa60a4 --- /dev/null +++ b/test/sl/ibm852.txt @@ -0,0 +1,9 @@ +Naseljvi plant je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmoen razviti in ohranjati ivljenje. + +Ker je obstoj nezemeljskega ivljenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in znailnosti +Sonca in celotnega Osonja, ki govorijo v prid razvitju ivljenja. e posebej so +pomembni faktorji, ki so ohranili zapletene, mnogoceline organizme in ne le +preprosta, enocelina iva bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoe astrobiologije. 
diff --git a/test/sl/iso-8859-16.txt b/test/sl/iso-8859-16.txt new file mode 100644 index 0000000..80d0b26 --- /dev/null +++ b/test/sl/iso-8859-16.txt @@ -0,0 +1,9 @@ +Naseljvi plant je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmoen razviti in ohranjati ivljenje. + +Ker je obstoj nezemeljskega ivljenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in znailnosti +Sonca in celotnega Osonja, ki govorijo v prid razvitju ivljenja. e posebej so +pomembni faktorji, ki so ohranili zapletene, mnogoceline organizme in ne le +preprosta, enocelina iva bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoe astrobiologije. diff --git a/test/sl/iso-8859-2.txt b/test/sl/iso-8859-2.txt new file mode 100644 index 0000000..7af252e --- /dev/null +++ b/test/sl/iso-8859-2.txt @@ -0,0 +1,9 @@ +Naseljvi plant je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmoen razviti in ohranjati ivljenje. + +Ker je obstoj nezemeljskega ivljenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in znailnosti +Sonca in celotnega Osonja, ki govorijo v prid razvitju ivljenja. e posebej so +pomembni faktorji, ki so ohranili zapletene, mnogoceline organizme in ne le +preprosta, enocelina iva bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoe astrobiologije. diff --git a/test/sl/mac-centraleurope.txt b/test/sl/mac-centraleurope.txt new file mode 100644 index 0000000..4e84135 --- /dev/null +++ b/test/sl/mac-centraleurope.txt @@ -0,0 +1,9 @@ +Naseljvi plant je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmoen razviti in ohranjati ivljenje. + +Ker je obstoj nezemeljskega ivljenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in znailnosti +Sonca in celotnega Osonja, ki govorijo v prid razvitju ivljenja. 
e posebej so +pomembni faktorji, ki so ohranili zapletene, mnogoceline organizme in ne le +preprosta, enocelina iva bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoe astrobiologije. diff --git a/test/sl/utf-8.txt b/test/sl/utf-8.txt new file mode 100644 index 0000000..11d013b --- /dev/null +++ b/test/sl/utf-8.txt @@ -0,0 +1,9 @@ +Naseljívi planét je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmožen razviti in ohranjati življenje. + +Ker je obstoj nezemeljskega življenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in značilnosti +Sonca in celotnega Osončja, ki govorijo v prid razvitju življenja. Še posebej so +pomembni faktorji, ki so ohranili zapletene, mnogocelične organizme in ne le +preprosta, enocelična živa bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoče astrobiologije. diff --git a/test/sl/windows-1250.txt b/test/sl/windows-1250.txt new file mode 100644 index 0000000..512309b --- /dev/null +++ b/test/sl/windows-1250.txt @@ -0,0 +1,9 @@ +Naseljvi plant je planet ali naravni satelit (redkeje tudi asteroid[1]), ki je +zmoen razviti in ohranjati ivljenje. + +Ker je obstoj nezemeljskega ivljenja trenutno negotov, je raziskovanje +naseljivih planetov v glavnem ekstrapolacija razmer na Zemlji in znailnosti +Sonca in celotnega Osonja, ki govorijo v prid razvitju ivljenja. e posebej so +pomembni faktorji, ki so ohranili zapletene, mnogoceline organizme in ne le +preprosta, enocelina iva bitja, mikroorganizme. Raziskovanje in teorija v tej +smeri je del planetologije in razvijajoe astrobiologije. |
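
Each `SequenceModel` added in `LangSloveneModel.cpp` pairs a 256-entry character-to-order map with the shared 29 x 29 `SloveneLangModel` table, the count 29, and the typical positive ratio 0.9983524317161332. The scoring code itself is not part of this commit, so the following is only a simplified sketch of how a model of that shape could be scored: it counts adjacent pairs of frequent letters and checks how many of them the table marks as common. The threshold `NON_LETTER`, the meaning assigned to the table value 3, the 0.99 cap, and the three-letter toy model are all assumptions made for illustration; none of them is taken from the prober source.

```python
# Hypothetical sketch; NOT the implementation in nsSBCharSetProber.cpp.
NON_LETTER = 250  # assumption: orders >= this stand for CTR/SYM/NUM/ILL/RET
POSITIVE = 3      # assumption: value 3 in the model table marks a frequent pair


def score(data, char_to_order, lang_model, freq_char_count, typical_positive_ratio):
    """Return a confidence-like value in [0, 0.99] for the given bytes."""
    total_chars = freq_chars = total_seqs = positive_seqs = 0
    last_order = NON_LETTER
    for byte in data:
        order = char_to_order[byte]
        if order < NON_LETTER:
            total_chars += 1
        if order < freq_char_count:
            freq_chars += 1
            # Only pairs of two frequent letters are looked up in the table.
            if last_order < freq_char_count:
                total_seqs += 1
                if lang_model[last_order * freq_char_count + order] == POSITIVE:
                    positive_seqs += 1
        last_order = order
    if total_seqs == 0 or total_chars == 0:
        return 0.0
    ratio = positive_seqs / total_seqs / typical_positive_ratio
    ratio *= freq_chars / total_chars
    return min(ratio, 0.99)


# Toy model over the letters a, b, c (orders 0, 1, 2); only "ab" and "ba"
# are marked as frequent pairs.
TOY_MAP = [NON_LETTER] * 256
for order, letter in enumerate(b"abc"):
    TOY_MAP[letter] = order
TOY_MODEL = [
    0, 3, 0,  # a followed by a, b, c
    3, 0, 0,  # b followed by a, b, c
    0, 0, 0,  # c followed by a, b, c
]


def toy_score(data):
    return score(data, TOY_MAP, TOY_MODEL, 3, 0.9)


if __name__ == "__main__":
    for sample in (b"abab", b"ab cc", b"cccc"):
        print(sample, toy_score(sample))
```

In this sketch, text made only of frequent pairs reaches the cap, text with no frequent pairs scores zero, and mixed text falls in between. Because all five Slovene models share one `SloveneLangModel` table, only the byte-to-order map changes the result from one encoding to another.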