field | value | date |
---|---|---|
author | Nathan Willis <nwillis@glyphography.com> | 2018-11-12 12:17:06 -0600 |
committer | Khaled Hosny <khaledhosny@eglug.org> | 2018-11-24 16:46:02 +0200 |
commit | 53ac46e974cf0ee8720b40ef394714eb97ff53b9 | |
tree | edbec6750f20861dd6c6085eae17cac45ed29c14 /docs | |
parent | 30cb45b3eaacda15cc45435815cae3fd50e87557 | |
Usermanual: expand clusters chapter.
Diffstat (limited to 'docs')
-rw-r--r-- | docs/usermanual-clusters.xml | 743 |
1 files changed, 473 insertions, 270 deletions
diff --git a/docs/usermanual-clusters.xml b/docs/usermanual-clusters.xml index 7b2c7adc..f7db0f59 100644 --- a/docs/usermanual-clusters.xml +++ b/docs/usermanual-clusters.xml @@ -5,306 +5,509 @@ <!ENTITY version SYSTEM "version.xml"> ]> <chapter id="clusters"> -<sect1 id="clusters"> <title>Clusters</title> - <para> - In shaping text, a <emphasis>cluster</emphasis> is a sequence of - code points that needs to be treated as a single, indivisible unit. - </para> - <para> - When you add text to a HB buffer, each character is associated with - a <emphasis>cluster value</emphasis>. This is an arbitrary number as - far as HB is concerned. - </para> - <para> - Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the - actual number does not matter. Moreover, it is not required for the - cluster values to be monotonically increasing, but pretty much all - of HB's tests are performed on monotonically increasing cluster - numbers. Nevertheless, there is no such assumption in the code - itself. With that in mind, let's examine what happens with cluster - values during shaping under each cluster-level. - </para> - <para> - HarfBuzz provides three <emphasis>levels</emphasis> of clustering - support. Level 0 is the default behavior and reproduces the behavior - of the old HarfBuzz library. Level 1 tweaks this behavior slightly - to produce better results, so level 1 clustering is recommended for - code that is not required to implement backward compatibility with - the old HarfBuzz. - </para> - <para> - Level 2 differs significantly in how it treats cluster values. - Levels 0 and 1 both process ligatures and glyph decomposition by - merging clusters; level 2 does not. 
- </para> - <para> - The conceptual model for what the cluster values mean, in levels 0 - and 1, is this: - </para> - <itemizedlist spacing="compact"> - <listitem> - <para> - the sequence of cluster values will always remain monotone - </para> - </listitem> - <listitem> - <para> - each value represents a single cluster - </para> - </listitem> - <listitem> - <para> - each cluster contains one or more glyphs and one or more - characters - </para> - </listitem> - </itemizedlist> - <para> - Assuming that initial cluster numbers were monotonically increasing - and distinct, then all adjacent glyphs having the same cluster - number belong to the same cluster, and all characters belong to the - cluster that has the highest number not larger than their initial - cluster number. This will become clearer with an example. - </para> -</sect1> -<sect1 id="a-clustering-example-for-levels-0-and-1"> - <title>A clustering example for levels 0 and 1</title> - <para> - Let's say we start with the following character sequence and cluster - values: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - We then map the characters to glyphs. For simplicity, let's assume - that each character maps to the corresponding, identical-looking - glyph: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - Now if, for example, <literal>B</literal> and <literal>C</literal> - ligate, then the clusters to which they belong "merge". - This merged cluster takes for its cluster number the minimum of all - the cluster numbers of the clusters that went in. In this case, we - get: - </para> - <programlisting> - A,BC,D,E - 0,1 ,3,4 -</programlisting> - <para> - Now let's assume that the <literal>BC</literal> glyph decomposes - into three components, and <literal>D</literal> also decomposes into - two. 
The components each inherit the cluster value of their parent: - </para> - <programlisting> - A,BC0,BC1,BC2,D0,D1,E - 0,1 ,1 ,1 ,3 ,3 ,4 -</programlisting> - <para> - Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then - their clusters (numbers 1 and 3) merge into - <literal>min(1,3) = 1</literal>: - </para> - <programlisting> - A,BC0,BC1,BC2D0,D1,E - 0,1 ,1 ,1 ,1 ,4 -</programlisting> - <para> - At this point, cluster 1 means: the character sequence - <literal>BCD</literal> is represented by glyphs - <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any - further. - </para> -</sect1> -<sect1 id="reordering-in-levels-0-and-1"> - <title>Reordering in levels 0 and 1</title> - <para> - Another common operation in the more complex shapers is when things - reorder. In those cases, to maintain monotone clusters, HB merges - the clusters of everything in the reordering sequence. For example, - let's again start with the character sequence: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - If <literal>D</literal> is reordered before <literal>B</literal>, - then the <literal>B</literal>, <literal>C</literal>, and - <literal>D</literal> clusters merge, and we get: - </para> - <programlisting> - A,D,B,C,E - 0,1,1,1,4 -</programlisting> - <para> - This is clearly not ideal, but it is the only sensible way to - maintain monotone indices and retain the true relationship between - glyphs and characters. - </para> -</sect1> -<sect1 id="the-distinction-between-levels-0-and-1"> - <title>The distinction between levels 0 and 1</title> - <para> - So, the above is pretty much what cluster levels 0 and 1 do. The - only difference between the two is this: in level 0, at the very - beginning of the shaping process, we also merge clusters between - base characters and all Unicode marks (combining or not) following - them. 
E.g.: - </para> - <programlisting> - A,acute,B - 0,1 ,2 -</programlisting> - <para> - will become: - </para> - <programlisting> - A,acute,B - 0,0 ,2 -</programlisting> - <para> - This is the default behavior. We do it because Windows did it and - old HarfBuzz did it, so this remained the default. But this behavior - makes it impossible to color diacritic marks differently from their - base characters. That's why in level 1 we do not perform this - initial merging step. - </para> - <para> - For clients, level 0 is more convenient if they rely on HarfBuzz - clusters for cursor positioning. But that's wrong anyway: cursor - positions should be determined based on Unicode grapheme boundaries, - NOT shaping clusters. As such, level 1 clusters are preferred. - </para> - <para> - One last note about levels 0 and 1. We currently don't allow a - <literal>MultipleSubst</literal> lookup to replace a glyph with zero - glyphs (i.e., to delete a glyph). But in some other situations, - glyphs can be deleted. In those cases, if the glyph being deleted is - the last glyph of its cluster, we make sure to merge the cluster - with a neighboring cluster. - </para> - <para> - This is, primarily, to make sure that the starting cluster of the - text always has the cluster index pointing to the start of the text - for the run; more than one client currently relies on this - guarantee. - </para> - <para> - Incidentally, Apple's CoreText does something else to maintain the - same promise: it inserts a glyph with id 65535 at the beginning of - the glyph string if the glyph corresponding to the first character - in the run was deleted. HarfBuzz might do something similar in the - future. - </para> -</sect1> -<sect1 id="level-2"> - <title>Level 2</title> - <para> - Level 2 is a different beast from levels 0 and 1. It is simple to - describe, but hard to make sense of. It simply doesn't do any - cluster merging whatsoever. 
When things ligate or otherwise multiple - glyphs turn into one, the cluster value of the first glyph is - retained. - </para> - <para> - Here are a few examples of why processing cluster values produced at - this level might be tricky: - </para> - <sect2 id="ligatures-with-combining-marks"> - <title>Ligatures with combining marks</title> - <para> - Imagine capital letters are bases and lower case letters are - combining marks. With an input sequence like this: + <section id="clusters"> + <title>Clusters</title> + <para> + In text shaping, a <emphasis>cluster</emphasis> is a sequence of + characters that needs to be treated as a single, indivisible + unit. </para> - <programlisting> - A,a,B,b,C,c - 0,1,2,3,4,5 -</programlisting> <para> - if <literal>A,B,C</literal> ligate, then here are the cluster - values one would get under the various levels: + During the shaping process, some shaping operations may + merge adjacent characters (for example, when two code points form + a ligature and are replaced by a single glyph) or split one + character into several (for example, when performing the Unicode + canonical decomposition of a code point). </para> <para> - level 0: + HarfBuzz tracks clusters independently from how these + shaping operations alter the individual glyphs that comprise the + output HarfBuzz returns in a buffer. Consequently, + a client program using HarfBuzz can utilize the cluster + information to implement features such as: + </para> + <itemizedlist> + <listitem> + <para> + Correctly positioning the cursor between two characters that + have combined into a single glyph by forming a ligature. + </para> + </listitem> + <listitem> + <para> + Correctly highlighting a text selection that includes some, + but not all, of the characters comprising a ligature. + </para> + </listitem> + <listitem> + <para> + Applying text attributes (such as color or underlining) to + part, but not all, of a composed base-and-mark combination. 
+ </para> + </listitem> + <listitem> + <para> + Generating output document formats (such as PDF) with + embedded text that can be fully extracted. + </para> + </listitem> + <listitem> + <para> + Performing line-breaking, justification, and other + line-level or paragraph-level operations that must be done + after shaping is complete, but which require character-level + properties. + </para> + </listitem> + </itemizedlist> + <para> + When you add text to a HarfBuzz buffer, each code point is assigned + a <emphasis>cluster value</emphasis>. + </para> + <para> + This cluster value is an arbitrary number; HarfBuzz uses it only + to distinguish between clusters. Many client programs will use + the index of each code point in the input text stream as the + cluster value, as a matter of convenience; the actual value does + not matter. + </para> + <para> + Client programs can choose how HarfBuzz handles clusters during + shaping by setting the + <literal>cluster_level</literal> of the + buffer. HarfBuzz offers three <emphasis>levels</emphasis> of + clustering support for this property: + </para> + <itemizedlist> + <listitem> + <para><emphasis>Level 0</emphasis> is the default and + reproduces the behavior of the old HarfBuzz library. + </para> + <para> + The distinguishing feature of level 0 behavior is that, at + the beginning of processing the buffer, all code points that + are categorized as <emphasis>marks</emphasis>, + <emphasis>modifier symbols</emphasis>, or + <emphasis>Emoji extended pictographic</emphasis> modifiers, + as well as the <emphasis>Zero Width Joiner</emphasis> and + <emphasis>Zero Width Non-Joiner</emphasis> code points, are + assigned the cluster value of the closest preceding code + point from <emphasis>diferent</emphasis> category. + </para> + <para> + In essence, whenever a base character is followed by a mark + character or a sequence of mark characters, those marks are + reassigned to the same initial cluster value as the base + character. 
This reassignment is referred to as + "merging" the affected clusters. This behavior is based on + the Grapheme Cluster Boundary specification in <ulink + url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode + Technical Report 29</ulink>. + </para> + <para> + Client programs can specify level 0 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. + </para> + </listitem> + <listitem> + <para> + <emphasis>Level 1</emphasis> tweaks the old behavior + slightly to produce better results. Therefore, level 1 + clustering is recommended for code that is not required to + implement backward compatibility with the old HarfBuzz. + </para> + <para> + Level 1 differs from level 0 by not merging the + clusters of marks and other modifier code points with the + preceding "base" code point's cluster. By preserving the + cluster values of these marks and modifier code points, + script shaping can perform additional operations that might + lead to improved results (for example, reordering a sequence + of marks). + </para> + <para> + Client programs can specify level 1 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. + </para> + </listitem> + <listitem> + <para> + <emphasis>Level 2</emphasis> differs significantly in how it + treats cluster values. In level 2, HarfBuzz never merges + clusters. + </para> + <para> + This difference can be seen most clearly when HarfBuzz processes + ligature substitutions and glyph decompositions. In level 0 + and level 1, ligatures and glyph decomposition both involve + merging clusters; in level 2, neither of these operations + triggers a merge. + </para> + <para> + Client programs can specify level 2 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. 
+ </para> + </listitem> + </itemizedlist> + <para> + It is not <emphasis>required</emphasis> that the cluster values + in a buffer be monotonically increasing. However, if the initial + cluster values in a buffer are monotonic and the buffer is + configured to use clustering level 0 or 1, then HarfBuzz + guarantees that the final cluster values in the shaped buffer + will also be monotonic. No such guarantee is made for cluster + level 2. + </para> + <para> + In levels 0 and 1, HarfBuzz implements the following conceptual model for + cluster values: + </para> + <itemizedlist spacing="compact"> + <listitem> + <para> + The sequence of cluster values will always remain monotonic. + </para> + </listitem> + <listitem> + <para> + Each cluster value represents a single cluster. + </para> + </listitem> + <listitem> + <para> + Each cluster contains one or more glyphs and one or more + characters. + </para> + </listitem> + </itemizedlist> + <para> + In practice, this model offers several benefits. Assuming that + the initial cluster values were monotonically increasing + and distinct before shaping began, then, in the final output: + </para> + <itemizedlist spacing="compact"> + <listitem> + <para> + All adjacent glyphs having the same final cluster + value belong to the same cluster. + </para> + </listitem> + <listitem> + <para> + Each character belongs to the cluster that has the highest + cluster value <emphasis>not larger than</emphasis> its + initial cluster value. + </para> + </listitem> + </itemizedlist> + + </section> + <section id="a-clustering-example-for-levels-0-and-1"> + <title>A clustering example for levels 0 and 1</title> + <para> + The guarantees and benefits of level 0 and level 1 can be seen + with some examples. First, let us examine what happens with cluster + values when shaping involves cluster merging with ligatures and + decomposition. 
+ </para> + <para> + Let's say we start with the following character sequence (top row) and + initial cluster values (bottom row): </para> <programlisting> - ABC,a,b,c - 0 ,0,0,0 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - level 1: + During shaping, HarfBuzz maps these characters to glyphs from + the font. For simplicity, let's assume that each character maps + to the corresponding, identical-looking glyph: </para> <programlisting> - ABC,a,b,c - 0 ,0,0,5 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - level 2: + Now if, for example, <literal>B</literal> and <literal>C</literal> + form a ligature, then the clusters to which they belong + "merge". This merged cluster takes for its cluster + value the minimum of all the cluster values of the clusters that + went in to the ligature. In this case, we get: </para> <programlisting> - ABC,a,b,c - 0 ,1,3,5 -</programlisting> + A,BC,D,E + 0,1 ,3,4 + </programlisting> + <para> + because 1 is the minimum of the set {1,2}, which were the + cluster values of <literal>B</literal> and + <literal>C</literal>. + </para> <para> - Making sense of the last example is the hardest for a client, - because there is nothing in the cluster values to suggest that - <literal>B</literal> and <literal>C</literal> ligated with - <literal>A</literal>. + Next, let us say that the <literal>BC</literal> ligature glyph + decomposes into three components, and <literal>D</literal> also + decomposes into two components. These components each inherit the + cluster value of their parent: </para> - </sect2> - <sect2 id="reordering"> - <title>Reordering</title> + <programlisting> + A,BC0,BC1,BC2,D0,D1,E + 0,1 ,1 ,1 ,3 ,3 ,4 + </programlisting> <para> - Another tricky case is when things reorder. 
Under level 2: + Next, if <literal>BC2</literal> and <literal>D0</literal> form a + ligature, then their clusters (cluster values 1 and 3) merge into + <literal>min(1,3) = 1</literal>: </para> <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> + A,BC0,BC1,BC2D0,D1,E + 0,1 ,1 ,1 ,1 ,4 + </programlisting> + <para> + At this point, cluster 1 means: the character sequence + <literal>BCD</literal> is represented by glyphs + <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any + further. + </para> + </section> + <section id="reordering-in-levels-0-and-1"> + <title>Reordering in levels 0 and 1</title> + <para> + Another common operation in the more complex shapers is glyph + reordering. In order to maintain a monotonic cluster sequence + when glyph reordering takes place, HarfBuzz merges the clusters + of everything in the reordering sequence. + </para> <para> - Now imagine <literal>D</literal> moves before - <literal>B</literal>: + For example, let us again start with the character sequence (top + row) and initial cluster values (bottom row): </para> <programlisting> - A,D,B,C,E - 0,3,1,2,4 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - Now, if <literal>D</literal> ligates with <literal>B</literal>, we + If <literal>D</literal> is reordered before <literal>B</literal>, + then HarfBuzz merges the <literal>B</literal>, + <literal>C</literal>, and <literal>D</literal> clusters, and we get: </para> <programlisting> - A,DB,C,E - 0,3 ,2,4 -</programlisting> + A,D,B,C,E + 0,1,1,1,4 + </programlisting> <para> - In a different scenario, <literal>A</literal> and - <literal>B</literal> could have ligated - <emphasis>before</emphasis> <literal>D</literal> reordered; that - would have resulted in: + This is clearly not ideal, but it is the only sensible way to + maintain a monotonic sequence of cluster values and retain the + true relationship between glyphs and characters. 
+ </para> + </section> + <section id="the-distinction-between-levels-0-and-1"> + <title>The distinction between levels 0 and 1</title> + <para> + The preceding examples demonstrate the main effects of using + cluster levels 0 and 1. The only difference between the two + levels is this: in level 0, at the very beginning of the shaping + process, HarfBuzz also merges clusters between any base character + and all Unicode marks (combining or not) that follow it. + </para> + <para> + For example, let us start with the following character sequence + (top row) and accompanying initial cluster values (bottom row): + </para> + <programlisting> + A,acute,B + 0,1 ,2 + </programlisting> + <para> + The <literal>acute</literal> is a Unicode mark. If HarfBuzz is + using cluster level 0 on this sequence, then the + <literal>A</literal> and <literal>acute</literal> clusters will + merge, and the result will become: </para> <programlisting> - AB,D,C,E - 0 ,3,2,4 -</programlisting> + A,acute,B + 0,0 ,2 + </programlisting> + <para> + This initial cluster merging is the default behavior of the + Windows shaping engine, and the old HarfBuzz codebase copied + that behavior to maintain compatibility. Consequently, it has + remained the default behavior in the new HarfBuzz codebase. + </para> + <para> + But this initial cluster-merging behavior makes it impossible to + color diacritic marks differently from their base + characters. That is why, in level 1, HarfBuzz does not perform + the initial merging step. + </para> + <para> + For client programs that rely on HarfBuzz cluster values to + perform cursor positioning, level 0 is more convenient. But + relying on cluster boundaries for cursor positioning is wrong: cursor + positions should be determined based on Unicode grapheme + boundaries, not on shaping-cluster boundaries. As such, level 1 + clusters are preferred. + </para> + <para> + One last note about levels 0 and 1. 
HarfBuzz currently does not allow a + <literal>MultipleSubst</literal> lookup to replace a glyph with zero + glyphs (in other words, to delete a glyph). But, in some other situations, + glyphs can be deleted. In those cases, if the glyph being deleted is + the last glyph of its cluster, HarfBuzz makes sure to merge the cluster + with a neighboring cluster. + </para> + <para> + This is done primarily to make sure that the starting cluster of the + text always has the cluster index pointing to the start of the text + for the run; more than one client currently relies on this + guarantee. + </para> + <para> + Incidentally, Apple's CoreText does something else to maintain the + same promise: it inserts a glyph with id 65535 at the beginning of + the glyph string if the glyph corresponding to the first character + in the run was deleted. HarfBuzz might do something similar in the + future. + </para> + </section> + <section id="level-2"> + <title>Level 2</title> + <para> + HarfBuzz's level 2 cluster behavior uses a significantly + different model than that of level 0 and level 1. + </para> <para> - There's no way to differentiate between these two scenarios based - on the cluster numbers alone. + The level 2 behavior is easy to describe, but it may be + difficult to understand in practical terms. In brief, level 2 + performs no merging of clusters whatsoever. </para> <para> - Another problem happens with ligatures under level 2 if the - direction of the text is forced to opposite of its natural - direction (e.g. left-to-right Arabic). But that's too much of a - corner case to worry about. + When glyphs form a ligature (or when some other feature + substitutes multiple glyphs with one glyph), the cluster value + of the first glyph is retained as the cluster value for the + ligature. However, no subsequent clusters — including + marks and modifiers — are affected. 
</para> - </sect2> -</sect1> + <para> + Level 2 cluster behavior is less complex than level 0 or level + 1, but there are a few cases in which processing cluster values + produced at level 2 may be tricky. + </para> + <section id="ligatures-with-combining-marks-in-level-2"> + <title>Ligatures with combining marks in level 2</title> + <para> + The first example of how HarfBuzz's level 2 cluster behavior + can be tricky is when the text to be shaped includes combining + marks attached to ligatures. + </para> + <para> + Let us start with an input sequence with the following + characters (top row) and initial cluster values (bottom row): + </para> + <programlisting> + A,acute,B,breve,C,circumflex + 0,1 ,2,3 ,4,5 + </programlisting> + <para> + If the sequence <literal>A,B,C</literal> forms a ligature, + then these are the cluster values HarfBuzz will return under + the various cluster levels: + </para> + <para> + Level 0: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,0 ,0 ,0 + </programlisting> + <para> + Level 1: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,0 ,0 ,5 + </programlisting> + <para> + Level 2: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,1 ,3 ,5 + </programlisting> + <para> + Making sense of the level 2 result is the hardest for a client + program, because there is nothing in the cluster values that + indicates that <literal>B</literal> and <literal>C</literal> + formed a ligature with <literal>A</literal>. + </para> + <para> + In contrast, the "merged" cluster values of the mark glyphs + that are seen in the level 0 and level 1 output are evidence + that a ligature substitution took place. + </para> + </section> + <section id="reordering-in-level-2"> + <title>Reordering in level 2</title> + <para> + Another example of how HarfBuzz's level 2 cluster behavior + can be tricky is when glyphs reorder. 
Consider an input sequence + with the following characters (top row) and initial cluster + values (bottom row): + </para> + <programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> + <para> + Now imagine <literal>D</literal> moves before + <literal>B</literal> in a reordering operation. The cluster + values will then be: + </para> + <programlisting> + A,D,B,C,E + 0,3,1,2,4 + </programlisting> + <para> + Next, if <literal>D</literal> forms a ligature with + <literal>B</literal>, the output is: + </para> + <programlisting> + A,DB,C,E + 0,3 ,2,4 + </programlisting> + <para> + However, in a different scenario, in which the shaping rules + of the script instead caused <literal>A</literal> and + <literal>B</literal> to form a ligature + <emphasis>before</emphasis> the <literal>D</literal> reordered, the + result would be: + </para> + <programlisting> + AB,D,C,E + 0 ,3,2,4 + </programlisting> + <para> + There is no way for a client program to differentiate between + these two scenarios based on the cluster values + alone. Consequently, client programs that use level 2 might + need to undertake additional work in order to manage cursor + positioning, text attributes, or other desired features. + </para> + </section> + <section id="other-considerations-in-level-2"> + <title>Other considerations in level 2</title> + <para> + There may be other problems encountered with ligatures under + level 2, such as if the direction of the text is forced to + opposite of its natural direction (for example, left-to-right + Arabic). But, generally speaking, these other scenarios are + minor corner cases that are too obscure for most client + programs to need to worry about. + </para> + </section> + </section> </chapter> |
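
The level 0 and level 1 merging rules described in the added text (a ligature takes the minimum cluster value of its inputs, decomposed components inherit their parent's value, and reordering merges everything in the reordered span) can be checked against the chapter's own `A,B,C,D,E` example with a small toy model. This is only a sketch of the documented behavior, not HarfBuzz code: the function names `merge_clusters`, `ligate`, `decompose`, and `reorder` are hypothetical, and the model assumes the initial cluster values are monotonically increasing.

```python
# Toy model of level 0/1 cluster merging as described in the chapter.
# All names here are hypothetical; this is not the HarfBuzz API.

def merge_clusters(clusters, start, end):
    """Merge clusters[start:end], widened to whole clusters, to their minimum."""
    # Widen the range so that no cluster is only partially merged.
    while start > 0 and clusters[start - 1] == clusters[start]:
        start -= 1
    while end < len(clusters) and clusters[end] == clusters[end - 1]:
        end += 1
    lowest = min(clusters[start:end])
    return clusters[:start] + [lowest] * (end - start) + clusters[end:]

def ligate(glyphs, clusters, start, end, name):
    """Replace glyphs[start:end] with one glyph, merging the clusters involved."""
    merged = merge_clusters(clusters, start, end)
    return glyphs[:start] + [name] + glyphs[end:], merged[:start + 1] + merged[end:]

def decompose(glyphs, clusters, index, parts):
    """Replace one glyph with several; each part inherits the parent's cluster."""
    return (glyphs[:index] + parts + glyphs[index + 1:],
            clusters[:index] + [clusters[index]] * len(parts) + clusters[index + 1:])

def reorder(glyphs, clusters, src, dst):
    """Move glyphs[src] to position dst, merging every cluster in between."""
    lo, hi = min(src, dst), max(src, dst) + 1
    merged = merge_clusters(clusters, lo, hi)
    moved = glyphs[:]
    moved.insert(dst, moved.pop(src))
    return moved, merged

# Walk through the ligature/decomposition example from the chapter.
glyphs, clusters = list("ABCDE"), [0, 1, 2, 3, 4]
glyphs, clusters = ligate(glyphs, clusters, 1, 3, "BC")            # clusters: 0,1,3,4
glyphs, clusters = decompose(glyphs, clusters, 1, ["BC0", "BC1", "BC2"])
glyphs, clusters = decompose(glyphs, clusters, 4, ["D0", "D1"])    # 0,1,1,1,3,3,4
glyphs, clusters = ligate(glyphs, clusters, 3, 5, "BC2D0")         # 0,1,1,1,1,4
print(glyphs, clusters)

# The reordering example: D moves before B, so B, C, and D merge.
r_glyphs, r_clusters = reorder(list("ABCDE"), [0, 1, 2, 3, 4], 3, 1)
print(r_glyphs, r_clusters)                                        # 0,1,1,1,4
```

Note that the second `ligate` call also pulls `D1` into cluster 1, because the merge is widened to cover every glyph that shares a cluster value with the glyphs being joined; that is what keeps the sequence monotonic in this model. Level 2 is not modeled, since the chapter states that it performs no merging at all.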