Age | Commit message | Author | Files | Lines
2022-12-16 | test: adding 2 tests for Hebrew/IBM862 recognition. [wip/Jehan/improved-API] | Jehan | 2 files, -0/+2
This is the same text, taken from the Wikipedia page which was the featured page of the day on the Hebrew Wikipedia: https://he.wikipedia.org/wiki/שתי מסכתות על ממשל מדיני I put it in 2 variants, since IBM862 can be used in logical and visual order. The visual variant simply reverses the order of the letters within each line (the lines themselves stay in proper order), so that is what I did. Note that the English title quoted in the text should probably not have been reversed, but it doesn't matter much: these characters are outside the Hebrew alphabet and would trigger bad sequence scores whichever their order. So I didn't bother fixing them.
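A minimal, hypothetical sketch (not the script actually used) of how a "visual order" variant can be produced from a logical-order file: for a single-byte encoding such as IBM862, reversing the bytes of each line is enough, while the line order is preserved.

    // Hypothetical helper: reverse each line's bytes. Valid only for
    // single-byte encodings like IBM862; this would corrupt UTF-8 text.
    #include <algorithm>
    #include <fstream>
    #include <string>

    int main(int argc, char **argv)
    {
        if (argc < 3)
            return 1;
        std::ifstream in(argv[1], std::ios::binary);
        std::ofstream out(argv[2], std::ios::binary);
        std::string line;
        while (std::getline(in, line))
        {
            std::reverse(line.begin(), line.end());
            out << line << '\n';
        }
        return 0;
    }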
2022-12-16 | Issue #22: Hebrew CP862 support. | Jehan | 8 files, -544/+661
Added in both visual and logical order since Wikipedia says: > Hebrew text encoded using code page 862 was usually stored in visual > order; nevertheless, a few DOS applications, notably a word processor > named EinsteinWriter, stored Hebrew in logical order. I am not using the nsHebrewProber wrapper (nameProber) for this new support, because I am really unsure it is of any use. Our statistical code based on letter and sequence usage should already be more than enough to detect both variants of Hebrew encoding, and my testing shows that so far (with pretty outstanding scores on actual Hebrew tests while all the other probers return bad scores). This will have to be studied a bit more later, and maybe the whole nsHebrewProber could be deleted, even for the Windows-1255 charset. I am also cleaning up the nsSBCSGroupProber::nsSBCSGroupProber() code a bit by incrementing a single index instead of maintaining the indexes by hand (otherwise, each time we add probers in the middle to keep them logically grouped by language, we have to manually renumber dozens of following probers).
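A minimal sketch of the single-index pattern mentioned above (class and member names are hypothetical, not the actual uchardet classes):

    // Instead of hard-coding mProbers[0] = ..., mProbers[1] = ..., keep one
    // running index, so inserting a prober in the middle needs no renumbering.
    struct FakeGroupProber
    {
      static const int NUM_PROBERS = 3;
      const char *mProbers[NUM_PROBERS];

      FakeGroupProber()
      {
        int n = 0;
        // Probers stay grouped by language; adding one in the middle only
        // means inserting one line here.
        mProbers[n++] = "hebrew-windows-1255 prober";
        mProbers[n++] = "hebrew-ibm862-logical prober";
        mProbers[n++] = "hebrew-ibm862-visual prober";
      }
    };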
2022-12-16 | test: add ability to have several tests per charset. | Jehan | 1 file, -1/+2
While the expected charset name is still the first part of the test file name (up to the first dot), the test name is now everything but the last part (up to the last dot). This allows having several test files for a single charset. In particular, I want at least 2 test files for Hebrew, since it has visual and logical variants. So I can call these "ibm862.visual.txt" and "ibm862.logical.txt", which both expect IBM862 as the resulting charset, but whose test names will be "he:ibm862.visual" and "he:ibm862.logical" respectively. Without this change, the test names would collide and CMake would refuse them.
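A small illustration of that naming rule (a sketch only, not the actual test-harness code):

    // For "ibm862.visual.txt": the charset is the part before the first dot,
    // the test name is everything before the last dot.
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string filename = "ibm862.visual.txt";
        const std::string charset  = filename.substr(0, filename.find('.'));
        const std::string testname = filename.substr(0, filename.rfind('.'));
        std::cout << "expected charset: " << charset  << '\n'
                  << "test name:        " << testname << '\n';
        return 0;
    }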
2022-12-15 | test: no:utf-8 is actually working now, after the last model script fix… | Jehan | 1 file, -2/+1
… and rebuild of models. The scores are really not bad now: 0.896026 for Norwegian and 0.877947 for Danish. It looks like the last confidence computation changes I did are really bearing fruit!
2022-12-15 | src: all language models now rebuilt after the fix. | Jehan | 61 files, -11347/+11254
2022-12-15 | script: fix BuildLangModel.py. | Jehan | 1 file, -4/+6
2022-12-14 | test: finally add English/UTF-8 test file. | Jehan | 1 file, -0/+1
I have had this test file locally for some time now, but it was always failing, being recognized as other languages until now. Thanks to the recent confidence improvements with the new frequent/rare ratios, it is finally detected as English by uchardet!
2022-12-14 | scripts: all language models rebuilt with the new ratio data. | Jehan | 63 files, -8583/+11714
2022-12-14 | script: model-building script updated to produce the 2 new ratios… | Jehan | 1 file, -1/+26
… introduced in the previous commit.
2022-12-14 | src: improve algorithm for confidence computation. | Jehan | 2 files, -5/+31
In addition to the "frequent characters" concept, we add 2 sub-categories: the "very frequent characters" and the "rare characters". The former are usually just a few characters which are used most of the time (like 3 or 4 characters used 40% of the time!), whereas the latter are often a dozen or more characters which, all together, are used only a few percent of the time. We use this additional concept to help distinguish very similar languages, or languages whose frequent characters are a subset of the ones from another language (typically English, whose alphabet is a subset of many other European languages'). mTypicalPositiveRatio is dropped, as it was barely of any use anyway (it was 0.99-something for nearly all languages!). Instead we get these 2 new ratios, veryFreqRatio and lowFreqRatio, and of course the associated order counts to know which characters are in these sets.
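The kind of comparison this enables can be sketched as follows (an illustrative heuristic with hypothetical names, not the actual uchardet code):

    // Compare the observed share of "very frequent" and "rare" characters in
    // a text against the ratios stored in a language model.
    #include <cstddef>

    struct LangModelRatios
    {
      float veryFreqRatio; // the few characters covering ~40% of typical text
      float lowFreqRatio;  // the dozen or so characters covering a few percent
    };

    float AdjustConfidence(float baseConfidence,
                           const LangModelRatios &model,
                           std::size_t veryFreqCount,
                           std::size_t lowFreqCount,
                           std::size_t totalLetters)
    {
      if (totalLetters == 0)
        return baseConfidence;

      const float observedVeryFreq = (float) veryFreqCount / totalLetters;
      const float observedLowFreq  = (float) lowFreqCount  / totalLetters;

      // Penalize proportionally to how far the observed ratios are from what
      // the model expects (a deliberately simple heuristic).
      const float veryFreqError = observedVeryFreq > model.veryFreqRatio
                                  ? observedVeryFreq - model.veryFreqRatio
                                  : model.veryFreqRatio - observedVeryFreq;
      const float lowFreqError  = observedLowFreq > model.lowFreqRatio
                                  ? observedLowFreq - model.lowFreqRatio
                                  : model.lowFreqRatio - observedLowFreq;

      float confidence = baseConfidence * (1.0f - veryFreqError) * (1.0f - lowFreqError);
      return confidence < 0.0f ? 0.0f : confidence;
    }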
2022-12-14 | src: when checking for candidates, make sure we haven't any unprocessed… | Jehan | 1 file, -1/+8
… language data left.
2022-12-14 | script, src: rebuild the English model. | Jehan | 2 files, -331/+302
The previous model was quite obviously wrong: all letters had the same probability, even non-ASCII ones! Anyway, this new model does make the unit tests a tiny bit better, though English detection is still weak (I have more concepts I want to experiment with to improve this).
2022-12-14 | src: add a --language|-l option to the uchardet CLI tool. | Jehan | 1 file, -9/+30
2022-12-14 | src, test: rename s/uchardet_get_candidates/uchardet_get_n_candidates/. | Jehan | 5 files, -15/+15
This was badly named, as this function does not return candidates but the number of candidates (to be used with the rest of the API).
2022-12-14 | test: temporarily disable the Norwegian/UTF-8 test. | Jehan | 1 file, -1/+2
It is currently recognized as Danish/UTF-8 with a 0.958 score, though Norwegian/UTF-8 is indeed the second candidate with 0.911 (the third candidate, Swedish/UTF-8 with 0.815, is far behind). Before wasting time tweaking models, there are more basic conceptual changes that I want to implement first (they might be enough to change the results!). So let's skip this test for now.
2022-12-14 | src: process pending language data when we are about to exceed the buffer size. | Jehan | 1 file, -0/+11
We were experiencing segmentation faults when processing long texts because we ended up trying to access out-of-range data (in codePointBuffer). Check when this is about to happen and process the pending data, which resets the index, before adding more code points.
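The idea, as a rough sketch (names and sizes are hypothetical, not the actual uchardet code): flush pending data before the fixed code-point buffer would overflow, instead of writing past its end.

    #include <cstddef>
    #include <cstdint>

    class ExampleLanguageDetector
    {
      static const std::size_t BUFFER_SIZE = 1024;
      uint32_t mCodePointBuffer[BUFFER_SIZE];
      std::size_t mIndex = 0;

      void ProcessPendingData()
      {
        // ... update statistics from mCodePointBuffer[0 .. mIndex) ...
        mIndex = 0; // processing resets the index
      }

    public:
      void HandleCodePoint(uint32_t codePoint)
      {
        // Without this check, long texts eventually write past the end of
        // the buffer (the segmentation fault described above).
        if (mIndex >= BUFFER_SIZE)
          ProcessPendingData();
        mCodePointBuffer[mIndex++] = codePoint;
      }
    };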
2022-12-14 | script, src: rebuild the Danish model. | Jehan | 4 files, -223/+341
Now that Danish has IBM865 support on the main branch and I have rebased, the model on this new-API feature branch was broken too.
2022-12-14 | script, src: update Norwegian model with the new language features. | Jehan | 6 files, -181/+352
As I just rebased my branch for the new language detection API, I needed to regenerate the Norwegian language models. Unfortunately it doesn't detect UTF-8 Norwegian text, though it is not far off (Norwegian is detected as the second candidate with a high 91% confidence, unfortunately beaten by Danish UTF-8 with 94%!). Note that I also updated the alphabet list for Norwegian, as there were too many letters in there (according to Wikipedia at least), so even when training a model we had some missing characters in the training set.
2022-12-14 | script: further fixing BuildLangModel.py. | Jehan | 1 file, -0/+2
2022-12-14 | script: improve a bit the management of use_ascii option. | Jehan | 1 file, -7/+5
2022-12-14 | script: work around a recent issue in the python wikipedia module. | Jehan | 1 file, -3/+3
Add `auto_suggest=False` to the wikipedia.page() call, because this auto-suggest is completely broken, searching for "mar ot" instead of "marmot" or "ground hug" instead of "Groundhog" (this one is extra funny but not so useful!). I actually wonder why it even needs to suggest anything when the Wikipedia pages do actually exist! Anyway, the script BuildLangModel.py was very broken because of this; now it's better. See: https://github.com/goldsmith/Wikipedia/issues/295 Also print the error message when we discard a page, which helps debugging.
2022-12-14 | test: improve test error output even more. | Jehan | 1 file, -8/+61
Add the detected confidence, but also the confidence for the expected (lang, charset) couple and its candidate order, if it matched at all.
2022-12-14 | test: add stderr logging when a test fails. | Jehan | 1 file, -0/+7
This allows getting some more info in Testing/Temporary/LastTest.log to debug detection issues.
2022-12-14 | script, src: add English language model. | Jehan | 10 files, -2/+545
English detection is still quite crappy, so I am not adding a unit test yet. I believe the bad detection is mostly due to the amount of shortcutting we do to go "fast". I should probably review this whole part of the logic as well.
2022-12-14 | src: drop less of UTF-8 confidence even with few non-multibyte chars. | Jehan | 1 file, -2/+3
Some languages are not meant to have multibyte characters. For instance, English would typically have none. Yet you can still have UTF-8 English text (with a few special characters, or foreign words…). So let's make it less of a deal breaker. To be fair, the whole logic is biased of course, and I believe that eventually we should get rid of these lines of code dropping confidence based on a character count. This is a ridiculous rule (we base our whole logic on language statistics and suddenly we add some weird rule with a completely arbitrary number). But for now, I'll keep this as-is until we make the whole library even more robust.
2022-12-14 | test: fix test binary build for Windows. | Jehan | 1 file, -2/+9
realpath() doesn't exist on Windows. Replace it with _fullpath(), which does the same thing as far as I can see (at least for creating an absolute path; it doesn't seem to canonicalize the path, or at least the docs don't say so, but since we control the arguments from our CMake script, it's not a big problem anyway). This fixes the CI build for Windows, which was failing with: > undefined reference to `realpath'
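A minimal sketch of this kind of platform switch (not the actual test code; error handling and buffer sizes are simplified):

    #include <cstdio>
    #include <cstdlib>
    #ifdef _WIN32
    /* _fullpath() and _MAX_PATH come from <stdlib.h> on Windows. */
    #else
    #include <limits.h> /* PATH_MAX */
    #endif

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
    #ifdef _WIN32
        char resolved[_MAX_PATH];
        if (_fullpath(resolved, argv[1], _MAX_PATH) == NULL)
            return 1;
    #else
        char resolved[PATH_MAX];
        /* Unlike _fullpath(), realpath() also resolves symlinks. */
        if (realpath(argv[1], resolved) == NULL)
            return 1;
    #endif
        printf("%s\n", resolved);
        return 0;
    }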
2022-12-14 | src: reset shortcut charset/language on Reset(). | Jehan | 1 file, -0/+8
Failing to do so, we would always return the same language once we had detected a shortcut one, even after resetting. The issue happened, for instance, in the uchardet CLI tool.
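The idea, as a rough sketch (member names are hypothetical, not the actual uchardet code):

    class ExampleDetector
    {
      const char *mShortcutCharset = nullptr;
      const char *mShortcutLanguage = nullptr;

    public:
      void Reset()
      {
        // Without these two lines, every later document would reuse the
        // charset/language detected through the shortcut for the first one.
        mShortcutCharset = nullptr;
        mShortcutLanguage = nullptr;
        // ... reset the rest of the detection state as well ...
      }
    };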
2022-12-14 | src: do not test with nsLatin1Prober anymore. | Jehan | 1 file, -2/+9
Just commenting it out for now. It is just not good enough and could take over the detection when other probers have low (yet reasonable) confidence, returning an ugly WINDOWS-1252 with no language detection. I think we should even get rid of it completely. For now, I am just temporarily commenting it out and will see with further experiments.
2022-12-14 | src: improve confidence computation (generic and single-byte charset). | Jehan | 3 files, -26/+31
Nearly the same algorithm is now used in both pieces of code. I reintroduced mTypicalPositiveRatio now that our models actually give the right ratio (not the meaningless "first 512" stuff anymore). Among the remaining differences, the last computation is the ratio of frequent characters over all characters; for the generic detector, we use the frequent+out sum instead, which works much better. I think Unicode text is much more prone to containing characters outside the expected range while still being meaningful; even control characters are much more meaningful in Unicode, so a ratio based on it would yield a much too low confidence. Anyway, this confidence algorithm is already better. We seem to reach much nicer confidence at each iteration, very satisfying!
2022-12-14 | script: generate more complete frequent characters when a range is set. | Jehan | 1 file, -19/+16
The early version used to stop earlier, assuming frequent ranges were used only for language scripts with a lot of characters (such as Korean, or even more so Japanese or Chinese), hence it was not efficient to keep data for them all. Since we now use a separate language detector for CJK, the remaining scripts (so far) have a usable range of characters. Therefore it is much preferred to keep as much data as possible on these. This allowed redoing the Thai model (cf. previous commit) with more data, hence getting a much better language confidence on Thai texts.
2022-12-14 | script, src: regenerate the Thai model. | Jehan | 3 files, -288/+325
With all the changes we made, regenerate the Thai model, which was of poor quality. The new one is much better.
2022-12-14 | src, script: fix the order of characters for Vietnamese. | Jehan | 2 files, -376/+356
Cf. commit 872294d.
2022-12-14 | src, script: add concept of alphabet_mapping in language models. | Jehan | 4 files, -237/+192
This allows handling cases where some characters are actually alternatives/variants of another. For instance, the same word can be written with both variants, and both are considered correct and equivalent. Browsing Slovenian Wikipedia a bit, it looks like they only use them in titles there. I use this for the first time on characters with diacritics in Slovene. Indeed these are so rarely used that they would hardly show up in the stats and, worse, any sequence using them in a tested text would likely show up as negative sequences, hence dropping the confidence in Slovenian. As a consequence, various Slovene texts would show up as Slovak, which is close enough and uses the same characters with diacritics in a common way.
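Conceptually, an alphabet mapping folds rare variant characters onto their base letter before statistics are gathered. A sketch of the lookup (the mapping format and entries here are hypothetical, not the actual model data):

    #include <map>

    // Hypothetical mapping: variant code point -> base code point.
    static const std::map<char32_t, char32_t> kAlphabetMapping = {
        { 0x00E9, 0x0065 }, // 'é' -> 'e' (example entries only)
        { 0x00E8, 0x0065 }, // 'è' -> 'e'
        { 0x00E1, 0x0061 }, // 'á' -> 'a'
    };

    char32_t FoldVariant(char32_t codePoint)
    {
      std::map<char32_t, char32_t>::const_iterator it = kAlphabetMapping.find(codePoint);
      // Unmapped characters are kept as-is.
      return it != kAlphabetMapping.end() ? it->second : codePoint;
    }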
2022-12-14 | script: regenerate Slovak and Slovene with better alphabet support. | Jehan | 6 files, -558/+587
I was missing some characters, especially in the Slovak alphabet. Conversely, the Slovene alphabet does not use 4 letters of the basic ASCII alphabet.
2022-12-14 | script: fix a stupid bug making same ratio for all frequent characters. | Jehan | 1 file, -1/+1
Argh! How did I miss this!
2022-12-14 | script, src: regenerate the Vietnamese model. | Jehan | 3 files, -229/+383
The alphabet was not complete and thus the confidence was a bit too low. For instance, the VISCII test case's confidence bumped from 0.643401 to 0.696346 and the UTF-8 test case bumped from 0.863777 to 0.99. Only the Windows-1258 test case is slightly worse, from 0.532846 to 0.532098. But the overall recognition gain is obvious anyway.
2022-12-14 | src: fix negative confidence wrapping around because of unsigned int. | Jehan | 1 file, -1/+1
In the extreme case of mCtrlChar being larger than mTotalChar (since the latter does not include control characters), we end up with a negative value, which as an unsigned int becomes a huge integer. So, because the confidence was so bad that it should have been negative, we ended up with a huge confidence. We had this case with our Japanese UTF-8 test file, which ended up identified as French ISO-8859-1. So I just cast the uint to float early on to avoid this pitfall. Now all our test cases succeed again, this time with full UTF-8+language support! Woohoo!
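A minimal sketch of the unsigned wrap-around described above (variable names follow the commit message; the surrounding computation is hypothetical):

    #include <cstdio>

    int main()
    {
        unsigned int mTotalChar = 10;
        unsigned int mCtrlChar  = 12; // more control chars than counted chars

        // Buggy: the subtraction wraps around to a huge unsigned value, so
        // the resulting "confidence" explodes instead of going toward zero.
        float buggy = (mTotalChar - mCtrlChar) / (float) mTotalChar;

        // Fixed: cast to float before subtracting, so the result can go
        // negative and then be clamped.
        float fixed = ((float) mTotalChar - mCtrlChar) / (float) mTotalChar;
        if (fixed < 0.0f)
            fixed = 0.0f;

        printf("buggy: %f, fixed: %f\n", buggy, fixed);
        return 0;
    }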
2022-12-14 | script, src: remove generated statistics data for Korean. | Jehan | 5 files, -1315/+2
2022-12-14 | src: new nsCJKDetector, specifically for Chinese/Japanese/Korean recognition. | Jehan | 4 files, -1/+313
I was pondering improving the logic of the LanguageModel contents in order to better handle languages with a huge number of characters (far too many to keep a full frequent-character list while keeping reasonable memory consumption and speed). But then I realized that this happens for languages which have their own set of characters anyway. For instance, modern Korean is nearly all hangul. Of course we can find some Chinese characters here and there, but nothing which should really break the confidence if we base it on the hangul ratio. If some day we want to go further and detect older Korean, we will have to improve the logic a bit with some statistics, though I wonder whether character frequency alone is not enough here (sequence frequency is maybe overkill). To be tested. In any case, this new class gives much more relevant confidence on Korean texts, compared to the statistics data we previously generated. For Japanese, it is a mix of kana and Chinese characters. A modern full text cannot exist without a lot of kana (probably only old texts or very short ones, such as titles, could contain only Chinese characters). We would still want to add a bit of statistics to correctly differentiate a Japanese text with a lot of Chinese characters in it from a Chinese text which quotes a few Japanese phrases. It will have to be improved, but for now it works fairly well. A last case where we might want to play with statistics is differentiating between regional variants, for instance Simplified Chinese, Taiwan or Hong Kong Chinese… More to experiment with later on. It is already a good first step for UTF-8 support with language detection!
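An illustrative sketch of the hangul-ratio idea (not the actual nsCJKDetector code): estimate a Korean confidence from the share of hangul syllables among CJK-relevant code points.

    #include <cstddef>

    bool IsHangulSyllable(char32_t cp)
    {
      return cp >= 0xAC00 && cp <= 0xD7A3; // Hangul Syllables block
    }

    bool IsCJKIdeograph(char32_t cp)
    {
      return cp >= 0x4E00 && cp <= 0x9FFF; // CJK Unified Ideographs block
    }

    float KoreanConfidence(const char32_t *codePoints, std::size_t length)
    {
      std::size_t hangul = 0;
      std::size_t relevant = 0;
      for (std::size_t i = 0; i < length; i++)
      {
        if (IsHangulSyllable(codePoints[i]))
        {
          hangul++;
          relevant++;
        }
        else if (IsCJKIdeograph(codePoints[i]))
        {
          // A few Chinese characters here and there should not break the
          // confidence if most of the text is hangul.
          relevant++;
        }
      }
      return relevant == 0 ? 0.0f : (float) hangul / relevant;
    }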
2022-12-14 | README: fix a duplicate. | Jehan | 1 file, -1/+1
2022-12-14 | Update README. | Jehan | 1 file, -20/+105
2022-12-14 | src: consider any combination with a non-frequent character as a sequence. | Jehan | 1 file, -0/+10
Basically, since we exclude non-letters (control chars, punctuation, spaces, separators, emoticons and whatnot), we consider any remaining character as an off-script letter (we may have forgotten some cases, but so far it looks promising). Hence it is normal to also consider a combination involving these (i.e. 2 off-script letters, or 1 frequent letter + 1 off-script letter, in any order) as a sequence. Doing so drops the confidence even more for any text having too many of these. As a consequence, it again widens the gap between the first and second contenders, which really seems to show it works.
2022-12-14 | src: add Hindi/UTF-8 support. | Jehan | 8 files, -2/+501
2022-12-14 | src: improve confidence computation. | Jehan | 2 files, -26/+108
Detect various blocks of characters for punctuation, symbols, emoticons and whatnot. These are considered more or less neutral for the confidence (because it's normal to have punctuation, and various texts nowadays are expected to contain emoticons or various symbols). What is of interest is all the rest, which is then considered as out-of-range characters (likely characters from other scripts) and therefore drops the confidence. Confidence now takes into account the ratio of all in-range characters (script letters + various neutral characters) and the ratio of frequent letters within all letters (script letters + out-of-range characters). This improved algorithm makes for much more efficient detection: it bumped most confidences in our unit tests and usually increased the gap between the first and second contenders.
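The two ratios described above can be sketched as follows (an illustrative computation only, not the actual uchardet code): neutral characters count toward the in-range ratio but not toward the letter ratio.

    struct CharCounts
    {
      unsigned frequentLetters;  // letters of the language's script
      unsigned neutralChars;     // punctuation, symbols, emoticons, ...
      unsigned outOfRangeChars;  // everything else, likely other scripts
    };

    float ComputeConfidence(const CharCounts &c)
    {
      const unsigned allChars   = c.frequentLetters + c.neutralChars + c.outOfRangeChars;
      const unsigned allLetters = c.frequentLetters + c.outOfRangeChars;

      if (allChars == 0 || allLetters == 0)
        return 0.0f;

      // Ratio of in-range characters (script letters + neutral characters).
      const float inRangeRatio = (float) (c.frequentLetters + c.neutralChars) / allChars;
      // Ratio of frequent letters among all letters (script + out-of-range).
      const float frequentRatio = (float) c.frequentLetters / allLetters;

      return inRangeRatio * frequentRatio;
    }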
2022-12-14 | script: fix a bit BuildLangModel.py when use_ascii is True. | Jehan | 1 file, -3/+8
In particular, I am preparing the ground for English detection. I am not pushing actual English models yet, because detection is not efficient enough yet. I will do so when I am able to handle English confidence better.
2022-12-14 | script, src: add generic Korean model. | Jehan | 8 files, -41/+2223
Until now, Korean charsets had their own probers, as there is no single-byte encoding for writing Korean. I have now added a Korean model used only for the generic character and sequence statistics. I also improved the generation script (script/BuildLangModel.py) to allow for languages without single-byte charset generation, and to provide meaningful statistics even when the language script has a lot of characters (a full sequence combination array would simply be too much data). It's not perfect yet: for instance, our UTF-8 Korean test file ends up with a confidence of 0.38503, which is low for obvious Korean text. Still, it works (correctly detected, with top confidence compared to the others) and is a first step toward further improvement of detection confidence.
2022-12-14 | src, test: fix the new Johab prober and add a test. | Jehan | 4 files, -8/+15
This prober comes from MR !1 on the main branch, though it was too aggressive back then and could not get merged. On the improved-API branch, it no longer detects other tests as Johab. Also fix it to work with the new API. Finally, add a Johab/ko unit test.
2022-12-14 | src: build new charset prober for Johab Korean. | Jehan | 6 files, -6/+8
The CMake build was not complete, and the enum state nsSMState disappeared in commit 53f7ad0. Also fix a few coding style bugs. See the discussion in MR !1.
2022-12-14 | add charset prober for Johab Korean | LSY | 9 files, -2/+1029
2022-12-14 | script, src: generate the Hebrew models. | Jehan | 10 files, -172/+642
The Hebrew model had never been regenerated by my scripts. I have now added the base generation files. Note that I added 2 charsets, ISO-8859-8 and WINDOWS-1255, but they are nearly identical. One of the differences is that the generic currency sign is replaced by the sheqel sign (the Israeli currency) in Windows-1255. And though the latter lost the "double low line", apparently some Yiddish characters were added. Basically it looks like most Hebrew text would get the same confidence on both charsets, so detecting both is likely irrelevant. So I keep the charset file for ISO-8859-8, but won't actually use it. The good news is that Hebrew is now also recognized in UTF-8 text, thanks to the new code and the newly generated language model.