author | Jehan <jehan@girinstud.io> | 2021-03-17 12:51:25 +0100 |
---|---|---|
committer | Jehan <jehan@girinstud.io> | 2021-03-17 12:51:25 +0100 |
commit | 714ae9ca2935807fa9c96d3049630867a13c23e8 (patch) | |
tree | 90b0220c3ff146abffcb98de6ec08d1bb79dfebc | |
parent | 26ed6280612ef9739ac5d826597d7cb404e32c0c (diff) |
src: tweak again the language detection confidence.
Computing a logical number of sequences was a big mistake. In particular,
a language with only positive sequences would have the same score as a
language with a mix of only positive and probable sequences (i.e. 1.0).
Instead, just use the real number of sequences, but probable sequences
don't bring a full +1 to the numerator.
Also drop the mTypicalPositiveRatio, at least for now. In my tests, it
mostly made results worse. Maybe this would still make sense for
languages with a huge number of characters (like CJK languages), for
which we won't have the full list of characters in our "frequent" list
of characters. Yet for most other languages, we actually list all the
possible sequences within the character set, therefore any sequence out
of our sequence list should necessarily drop confidence. Tweaking the
result back up with some ratio is therefore counter-productive.
As for CJK cases, we'll see how to handle the much higher number of
sequences (too many to list them all) when we get there.
-rw-r--r-- | src/nsLanguageDetector.cpp | 22 |
1 files changed, 9 insertions, 13 deletions
diff --git a/src/nsLanguageDetector.cpp b/src/nsLanguageDetector.cpp
index c952c6c..4245d53 100644
--- a/src/nsLanguageDetector.cpp
+++ b/src/nsLanguageDetector.cpp
@@ -116,21 +116,17 @@ float nsLanguageDetector::GetConfidence(void)
   float r;
 
   if (mTotalSeqs > 0) {
-    /* Create a "logical" number of sequences rather than real, but
-     * weighing the various sequences.
-     * Basically positive sequences will boost the confidence, probable
-     * sequence a bit, but not so much, neutral sequences will not be
-     * integrated in the confidence.
-     * Negative sequences will negatively impact the confidence as much
-     * as positive sequence positively impact it.
+    /* Positive sequences will boost the confidence, probable sequence
+     * only a bit but not so much, neutral sequences will stall the
+     * confidence.
+     * Negative sequences will negatively impact the confidence.
      */
-    int positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT] * 4;
-    int probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
-    int neutralSeqs = mSeqCounters[LANG_NEUTRAL_CAT];
-    int negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT] * 4;
-    int totalSeqs = positiveSeqs + probableSeqs + neutralSeqs + negativeSeqs;
+    float positiveSeqs = mSeqCounters[LANG_POSITIVE_CAT];
+    float probableSeqs = mSeqCounters[LANG_PROBABLE_CAT];
+    //float neutralSeqs = mSeqCounters[LANG_NEUTRAL_CAT];
+    float negativeSeqs = mSeqCounters[LANG_NEGATIVE_CAT];
 
-    r = ((float)1.0) * (positiveSeqs + probableSeqs - negativeSeqs) / totalSeqs / mModel->mTypicalPositiveRatio;
+    r = (positiveSeqs + probableSeqs / 4 - negativeSeqs * 2) / mTotalSeqs;
     /* The more control characters (proportionnaly to the size of the text), the
      * less confident we become in the current language.
      */