1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
|
<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
<!ENTITY version SYSTEM "version.xml">
]>
<chapter id="clusters">
<sect1 id="clusters">
<title>Clusters</title>
<para>
In shaping text, a <emphasis>cluster</emphasis> is a sequence of
code points that needs to be treated as a single, indivisible unit.
</para>
<para>
When you add text to a HB buffer, each character is associated with
a <emphasis>cluster value</emphasis>. This is an arbitrary number as
far as HB is concerned.
</para>
<para>
Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
actual number does not matter. Moreover, it is not required for the
cluster values to be monotonically increasing, but pretty much all
of HB's tests are performed on monotonically increasing cluster
numbers. Nevertheless, there is no such assumption in the code
itself. With that in mind, let's examine what happens with cluster
values during shaping under each cluster-level.
</para>
<para>
HarfBuzz provides three <emphasis>levels</emphasis> of clustering
support. Level 0 is the default behavior and reproduces the behavior
of the old HarfBuzz library. Level 1 tweaks this behavior slightly
to produce better results, so level 1 clustering is recommended for
code that is not required to implement backward compatibility with
the old HarfBuzz.
</para>
<para>
Level 2 differs significantly in how it treats cluster values.
Levels 0 and 1 both process ligatures and glyph decomposition by
merging clusters; level 2 does not.
</para>
<para>
The conceptual model for what the cluster values mean, in levels 0
and 1, is this:
</para>
<itemizedlist spacing="compact">
<listitem>
<para>
the sequence of cluster values will always remain monotone
</para>
</listitem>
<listitem>
<para>
each value represents a single cluster
</para>
</listitem>
<listitem>
<para>
each cluster contains one or more glyphs and one or more
characters
</para>
</listitem>
</itemizedlist>
<para>
Assuming that initial cluster numbers were monotonically increasing
and distinct, then all adjacent glyphs having the same cluster
number belong to the same cluster, and all characters belong to the
cluster that has the highest number not larger than their initial
cluster number. This will become clearer with an example.
</para>
</sect1>
<sect1 id="a-clustering-example-for-levels-0-and-1">
<title>A clustering example for levels 0 and 1</title>
<para>
Let's say we start with the following character sequence and cluster
values:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
We then map the characters to glyphs. For simplicity, let's assume
that each character maps to the corresponding, identical-looking
glyph:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now if, for example, <literal>B</literal> and <literal>C</literal>
ligate, then the clusters to which they belong "merge".
This merged cluster takes for its cluster number the minimum of all
the cluster numbers of the clusters that went in. In this case, we
get:
</para>
<programlisting>
A,BC,D,E
0,1 ,3,4
</programlisting>
<para>
Now let's assume that the <literal>BC</literal> glyph decomposes
into three components, and <literal>D</literal> also decomposes into
two. The components each inherit the cluster value of their parent:
</para>
<programlisting>
A,BC0,BC1,BC2,D0,D1,E
0,1 ,1 ,1 ,3 ,3 ,4
</programlisting>
<para>
Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
their clusters (numbers 1 and 3) merge into
<literal>min(1,3) = 1</literal>:
</para>
<programlisting>
A,BC0,BC1,BC2D0,D1,E
0,1 ,1 ,1 ,1 ,4
</programlisting>
<para>
At this point, cluster 1 means: the character sequence
<literal>BCD</literal> is represented by glyphs
<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
further.
</para>
</sect1>
<sect1 id="reordering-in-levels-0-and-1">
<title>Reordering in levels 0 and 1</title>
<para>
Another common operation in the more complex shapers is when things
reorder. In those cases, to maintain monotone clusters, HB merges
the clusters of everything in the reordering sequence. For example,
let's again start with the character sequence:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
If <literal>D</literal> is reordered before <literal>B</literal>,
then the <literal>B</literal>, <literal>C</literal>, and
<literal>D</literal> clusters merge, and we get:
</para>
<programlisting>
A,D,B,C,E
0,1,1,1,4
</programlisting>
<para>
This is clearly not ideal, but it is the only sensible way to
maintain monotone indices and retain the true relationship between
glyphs and characters.
</para>
</sect1>
<sect1 id="the-distinction-between-levels-0-and-1">
<title>The distinction between levels 0 and 1</title>
<para>
So, the above is pretty much what cluster levels 0 and 1 do. The
only difference between the two is this: in level 0, at the very
beginning of the shaping process, we also merge clusters between
base characters and all Unicode marks (combining or not) following
them. E.g.:
</para>
<programlisting>
A,acute,B
0,1 ,2
</programlisting>
<para>
will become:
</para>
<programlisting>
A,acute,B
0,0 ,2
</programlisting>
<para>
This is the default behavior. We do it because Windows did it and
old HarfBuzz did it, so this remained the default. But this behavior
makes it impossible to color diacritic marks differently from their
base characters. That's why in level 1 we do not perform this
initial merging step.
</para>
<para>
For clients, level 0 is more convenient if they rely on HarfBuzz
clusters for cursor positioning. But that's wrong anyway: cursor
positions should be determined based on Unicode grapheme boundaries,
NOT shaping clusters. As such, level 1 clusters are preferred.
</para>
<para>
One last note about levels 0 and 1. We currently don't allow a
<literal>MultipleSubst</literal> lookup to replace a glyph with zero
glyphs (i.e., to delete a glyph). But in some other situations,
glyphs can be deleted. In those cases, if the glyph being deleted is
the last glyph of its cluster, we make sure to merge the cluster
with a neighboring cluster.
</para>
<para>
This is, primarily, to make sure that the starting cluster of the
text always has the cluster index pointing to the start of the text
for the run; more than one client currently relies on this
guarantee.
</para>
<para>
Incidentally, Apple's CoreText does something else to maintain the
same promise: it inserts a glyph with id 65535 at the beginning of
the glyph string if the glyph corresponding to the first character
in the run was deleted. HarfBuzz might do something similar in the
future.
</para>
</sect1>
<sect1 id="level-2">
<title>Level 2</title>
<para>
Level 2 is a different beast from levels 0 and 1. It is simple to
describe, but hard to make sense of. It simply doesn't do any
cluster merging whatsoever. When things ligate or otherwise multiple
glyphs turn into one, the cluster value of the first glyph is
retained.
</para>
<para>
Here are a few examples of why processing cluster values produced at
this level might be tricky:
</para>
<sect2 id="ligatures-with-combining-marks">
<title>Ligatures with combining marks</title>
<para>
Imagine capital letters are bases and lower case letters are
combining marks. With an input sequence like this:
</para>
<programlisting>
A,a,B,b,C,c
0,1,2,3,4,5
</programlisting>
<para>
if <literal>A,B,C</literal> ligate, then here are the cluster
values one would get under the various levels:
</para>
<para>
level 0:
</para>
<programlisting>
ABC,a,b,c
0 ,0,0,0
</programlisting>
<para>
level 1:
</para>
<programlisting>
ABC,a,b,c
0 ,0,0,5
</programlisting>
<para>
level 2:
</para>
<programlisting>
ABC,a,b,c
0 ,1,3,5
</programlisting>
<para>
Making sense of the last example is the hardest for a client,
because there is nothing in the cluster values to suggest that
<literal>B</literal> and <literal>C</literal> ligated with
<literal>A</literal>.
</para>
</sect2>
<sect2 id="reordering">
<title>Reordering</title>
<para>
Another tricky case is when things reorder. Under level 2:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now imagine <literal>D</literal> moves before
<literal>B</literal>:
</para>
<programlisting>
A,D,B,C,E
0,3,1,2,4
</programlisting>
<para>
Now, if <literal>D</literal> ligates with <literal>B</literal>, we
get:
</para>
<programlisting>
A,DB,C,E
0,3 ,2,4
</programlisting>
<para>
In a different scenario, <literal>A</literal> and
<literal>B</literal> could have ligated
<emphasis>before</emphasis> <literal>D</literal> reordered; that
would have resulted in:
</para>
<programlisting>
AB,D,C,E
0 ,3,2,4
</programlisting>
<para>
There's no way to differentiate between these two scenarios based
on the cluster numbers alone.
</para>
<para>
Another problem happens with ligatures under level 2 if the
direction of the text is forced to opposite of its natural
direction (e.g. left-to-right Arabic). But that's too much of a
corner case to worry about.
</para>
</sect2>
</sect1>
</chapter>
|