src: do not shortcut UTF-8 detection too early.

I hit a case where the Czech test was detected as Irish because detection was shortcut far too early, after only 16 characters. The confidence value was just barely above 0.5 for Irish (and barely below for Czech). Requiring more than 256 multi-byte characters before shortcutting gives the engine enough relevant data to make an informed decision. By that point, the Czech confidence was above 0.7, whereas the Irish one was at 0.6.
author: Jehan <jehan@girinstud.io> 2021-03-17 21:26:31 +0100
committer: Jehan <jehan@girinstud.io> 2021-03-17 21:26:31 +0100
commit: 8b1755cac2e7d877d674b3cadc733a33ff007560 (patch)
tree: 76249a4d91ea84568f32e3ed8d2f211e81a42973
parent: 5463f4e0c0787849a1a9d2ce9d56a435f6abf56c (diff)
1 file changed, 3 insertions, 1 deletion
diff --git a/src/nsUTF8Prober.cpp b/src/nsUTF8Prober.cpp
index 744c66d..21f885e 100644
--- a/src/nsUTF8Prober.cpp
+++ b/src/nsUTF8Prober.cpp
@@ -45,6 +45,8 @@ void  nsUTF8Prober::Reset(void)
   currentCodePoint = 0;
 }
 
+#define ENOUGH_CHAR_THRESHOLD 256
+
 nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
                                         int** codePointBuffer,
                                         int*  codePointBufferIdx)
@@ -88,7 +90,7 @@ nsProbingState nsUTF8Prober::HandleData(const char* aBuf, PRUint32 aLen,
   }
 
   if (mState == eDetecting)
-    if (GetConfidence(0) > SHORTCUT_THRESHOLD)
+    if (mNumOfMBChar > ENOUGH_CHAR_THRESHOLD && GetConfidence(0) > SHORTCUT_THRESHOLD)
       mState = eFoundIt;
   return mState;
 }
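
The gating condition changed in the diff above can be sketched in isolation as follows. This is a minimal illustration, not the prober's actual code: `NextState`, `ProbingState`, and the two constants are hypothetical stand-ins that mirror names in the diff, and the value given to `kShortcutThreshold` is an assumption, since `SHORTCUT_THRESHOLD` is defined outside this diff.

```cpp
#include <cstdint>

// Hypothetical stand-in for nsProbingState; only the names used in the diff
// are mirrored here.
enum ProbingState { eDetecting, eFoundIt, eNotMe };

// Mirrors ENOUGH_CHAR_THRESHOLD from the diff.
constexpr std::uint32_t kEnoughCharThreshold = 256;

// Assumed value for illustration only; SHORTCUT_THRESHOLD is not shown
// in the diff.
constexpr float kShortcutThreshold = 0.95f;

// Returns the state after applying the shortcut rule: a prober that is still
// detecting may only declare success once it has seen MORE than
// kEnoughCharThreshold multi-byte characters AND its confidence exceeds
// kShortcutThreshold. Any other state is returned unchanged.
ProbingState NextState(ProbingState state,
                       std::uint32_t numOfMBChar,
                       float confidence)
{
    if (state == eDetecting &&
        numOfMBChar > kEnoughCharThreshold &&
        confidence > kShortcutThreshold)
        return eFoundIt;
    return state;
}
```

With this rule, a call such as `NextState(eDetecting, 16, 0.99f)` stays in `eDetecting` however high the confidence is, which is the early shortcut the commit is meant to prevent. Note that the comparison is strict, so exactly 256 multi-byte characters is still not enough.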