| author | Mike Kaganski <mike.kaganski@collabora.com> | 2024-11-23 09:52:53 +0500 |
|---|---|---|
| committer | Mike Kaganski <mike.kaganski@collabora.com> | 2024-11-23 10:03:41 +0100 |
| commit | 9c14ec81b6c25c7932964382f306dadfefeda518 (patch) | |
| tree | f37292e9b073a09ae61c5a3dc4e91e3379243541 /codemaker | |
| parent | 29091962399dfae3707b1bbd981ce037c97684ec (diff) | |
tdf#164006: Only use original word's positions, ignore extra encoded length
The encoding of the string passed to the Hunspell/hyphen service depends
on the encoding of the dictionary itself. When the usual UTF-8 encoding
is used, the resulting octet string may be longer than the original
UTF-16 code unit count, and the buffer receiving the positions is
allocated correspondingly longer. But on return, the buffer only
contains data at positions corresponding to the characters, not to the
code units of the encoded string (it is unclear whether we even need to
pass a buffer that large). So just as the following loop only iterates
up to nWord's length, the calculation of the hyphen count must use that
length, too, not the length of encWord.
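
A minimal sketch of the sizing issue and the fix, under stated
assumptions: the names nWord (the original UTF-16 word) and encWord (its
dictionary-encoded, e.g. UTF-8, form) come from the commit message; the
buffer layout assumes the hyphen library's usual convention that an odd
value at position i marks a valid break after character i, and the
hyphenation call itself is elided. This is not the actual LibreOffice
code.

```cpp
#include <string>
#include <vector>

// Count hyphenation break points for nWord, given its encoded form.
int countHyphenPositions(const std::u16string& nWord,
                         const std::string& encWord)
{
    // The positions buffer is sized for the encoded form, whose UTF-8
    // octet count can exceed nWord's UTF-16 code unit count.
    std::vector<char> aHyphens(encWord.size() + 5, '0');

    // ... hnj_hyphen_hyphenate2(dict, encWord.c_str(), encWord.size(),
    //                           aHyphens.data(), nullptr, &rep, &pos, &cut);

    int nHyphenCount = 0;
    // The fix: count only up to nWord's length, not encWord's octet
    // length; the tail of aHyphens holds no character positions.
    for (std::size_t i = 0; i < nWord.size(); ++i)
        if (aHyphens[i] & 1) // odd value = break point after character i
            ++nHyphenCount;
    return nHyphenCount;
}
```

Since a UTF-8 encoding never uses fewer octets than the UTF-16 form uses
code units, indexing the larger buffer by nWord's length stays in bounds.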
I suspect that using UTF-16 code units as hyphen positions is wrong:
it will break for SMP characters, which are represented as surrogate
pairs. The proper fix would be to iterate over code points. However, I
have no data to test with, so let it be a TODO/LATER.
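
A hedged sketch of that TODO, assuming LibreOffice's
OUString::iterateCodePoints() from rtl/ustring.hxx; the loop body is
only illustrative of where per-code-point handling would go.

```cpp
#include <rtl/ustring.hxx>

// Walk rWord by Unicode code points rather than UTF-16 code units, so a
// surrogate pair (an SMP character) advances the position once, not twice.
void visitCodePoints(const OUString& rWord)
{
    for (sal_Int32 i = 0; i < rWord.getLength(); )
    {
        // Returns the code point at i, then advances i past it
        // (by 2 code units for a surrogate pair).
        sal_uInt32 cp = rWord.iterateCodePoints(&i);
        (void)cp; // illustrative: map this character to a hyphen position
    }
}
```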
Change-Id: Ieed5e696e03cb22e3b48fabc14537372bbe74363
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/177077
Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com>
Tested-by: Jenkins
Diffstat (limited to 'codemaker')
0 files changed, 0 insertions, 0 deletions