author | Jehan <jehan@girinstud.io> | 2021-05-23 17:04:37 +0200 |
---|---|---|
committer | Jehan <jehan@girinstud.io> | 2022-12-14 00:24:53 +0100 |
commit | bed459c6e75e8a5be59ccd9bc80ac76c0bb8dbeb | |
tree | 016c657eb09bc18bfa5462c5cecd445310d3e481 | |
parent | bffb7819d2af14610965429bbe324673a87aa4ae | |
src: drop less UTF-8 confidence when there are few multibyte chars.
Some languages are not expected to contain multibyte characters. English,
for instance, would typically have none. Yet English text can still be
UTF-8 (with a few special characters, or foreign words…). So let's make a
low multibyte count less of a deal breaker.

To be fair, the whole logic is biased, of course, and I believe that we
should eventually get rid of these lines of code, which drop confidence
based on a character count. This is a ridiculous rule (we base our whole
logic on language statistics and suddenly we add some weird rule with a
completely random number). But for now, I'll keep this as-is until we
make the whole library even more robust.
-rw-r--r-- | src/nsUTF8Prober.cpp | 5 |
1 files changed, 3 insertions, 2 deletions
diff --git a/src/nsUTF8Prober.cpp b/src/nsUTF8Prober.cpp
index 21f885e..6618bec 100644
--- a/src/nsUTF8Prober.cpp
+++ b/src/nsUTF8Prober.cpp
@@ -99,12 +99,13 @@ nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
 
 float nsUTF8Prober::GetConfidence(int candidate)
 {
-  float unlike = (float)0.99;
-
   if (mNumOfMBChar < 6)
   {
+    float unlike = 0.5f;
+
     for (PRUint32 i = 0; i < mNumOfMBChar; i++)
       unlike *= ONE_CHAR_PROB;
+
     return (float)1.0 - unlike;
   }
   else