src: tweak the language detection confidence again.

Computing a logical number of sequences was a big mistake. In particular, a language with only positive sequences would have the same score as a language with a mix of only positive and probable sequences (i.e. 1.0). Instead, just use the real number of sequences, but probable sequences don't bring +1 to the numerator.

Also drop mTypicalPositiveRatio, at least for now. In my tests, it mostly made results worse. Maybe it would still make sense for languages with a huge number of characters (like CJK languages), for which we won't have the full list of characters in our "frequent" list of characters. Yet for most other languages, we actually list all the possible sequences within the character set, so any sequence outside our sequence list should necessarily drop confidence. Tweaking the result back up with some ratio is therefore counter-productive. As for CJK cases, we'll see how to handle the much higher number of sequences (too many to list them all) when we get there.
author: Jehan <jehan@girinstud.io> 2021-03-17 12:51:25 +0100
committer: Jehan <jehan@girinstud.io> 2021-03-17 12:51:25 +0100
commit: 714ae9ca2935807fa9c96d3049630867a13c23e8 (patch)
tree: 90b0220c3ff146abffcb98de6ec08d1bb79dfebc
parent: 26ed6280612ef9739ac5d826597d7cb404e32c0c (diff)
1 files changed, 9 insertions, 13 deletions
diff --git a/src/nsLanguageDetector.cpp b/src/nsLanguageDetector.cpp
index c952c6c..4245d53 100644
--- a/src/nsLanguageDetector.cpp
+++ b/src/nsLanguageDetector.cpp
@@ -116,21 +116,17 @@ float nsLanguageDetector::GetConfidence(void)
   float r;
 
   if (mTotalSeqs > 0) {
-    /* Create a "logical" number of sequences rather than real, but
-     * weighing the various sequences.
-     * Basically positive sequences will boost the confidence, probable
-     * sequence a bit, but not so much, neutral sequences will not be
-     * integrated in the confidence.
-     * Negative sequences will negatively impact the confidence as much
-     * as positive sequence positively impact it.
+    /* Positive sequences will boost the confidence, probable sequence
+     * only a bit but not so much, neutral sequences will stall the
+     * confidence.
+     * Negative sequences will negatively impact the confidence.
      */
-    int positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT] * 4;
-    int probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
-    int neutralSeqs  = mSeqCounters[LANG_NEUTRAL_CAT];
-    int negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT] * 4;
-    int totalSeqs    = positiveSeqs + probableSeqs + neutralSeqs + negativeSeqs;
+    float positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT];
+    float probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
+    //float neutralSeqs  = mSeqCounters[LANG_NEUTRAL_CAT];
+    float negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT];
 
-    r = ((float)1.0) * (positiveSeqs + probableSeqs - negativeSeqs) / totalSeqs / mModel->mTypicalPositiveRatio;
+    r = (positiveSeqs + probableSeqs / 4 - negativeSeqs * 2) / mTotalSeqs;
     /* The more control characters (proportionnaly to the size of the text), the
      * less confident we become in the current language.
      */
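
To make the change in the hunk above easier to compare, here is a minimal standalone sketch of the two formulas. It is not the project's code: `SeqCounts`, `OldConfidence` and `NewConfidence` are hypothetical names, the total is assumed to be the sum of the four category counters (the real code uses `mTotalSeqs`), the typical positive ratio is passed in as a plain parameter, and the control-character adjustment that follows in `GetConfidence()` is left out.

```cpp
// Hypothetical container for the four per-category sequence counters.
struct SeqCounts {
    int positive;
    int probable;
    int neutral;
    int negative;
};

// Formula removed by this commit: positive and negative counts are weighted
// by 4 to build a "logical" total, then the score is divided by the ratio.
float OldConfidence(const SeqCounts& c, float typicalPositiveRatio) {
    int positive = c.positive * 4;
    int negative = c.negative * 4;
    int total = positive + c.probable + c.neutral + negative;
    if (total <= 0) return 0.0f;
    return static_cast<float>(positive + c.probable - negative)
           / total / typicalPositiveRatio;
}

// Formula added by this commit: real total in the denominator, probable
// sequences count for a quarter, negative sequences count double.
// Assumption: the real total equals the sum of the four counters.
float NewConfidence(const SeqCounts& c) {
    int total = c.positive + c.probable + c.neutral + c.negative;
    if (total <= 0) return 0.0f;
    return (c.positive + c.probable / 4.0f - c.negative * 2.0f) / total;
}
```

With a ratio of 1.0, the old formula gives 1.0 both for 100 positive sequences and for 50 positive plus 50 probable sequences, which is the problem the commit message describes. The new formula gives 1.0 for the first case and 0.625 for the second.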