Diffstat (limited to 'docs/usermanual-shaping-concepts.xml')
-rw-r--r-- docs/usermanual-shaping-concepts.xml 368
1 files changed, 368 insertions, 0 deletions
diff --git a/docs/usermanual-shaping-concepts.xml b/docs/usermanual-shaping-concepts.xml
new file mode 100644
index 00000000..8c49ab13
--- /dev/null
+++ b/docs/usermanual-shaping-concepts.xml
@@ -0,0 +1,368 @@
+<chapter id="shaping-concepts">
+ <title>Shaping concepts</title>
+ <section id="text-shaping-concepts">
+ <title>Text shaping</title>
+ <para>
+ Text shaping is the process of transforming a sequence of Unicode
+ codepoints that represent individual characters (letters,
+ diacritics, tone marks, numbers, symbols, etc.) into the
+ orthographically and linguistically correct two-dimensional layout
+ of glyph shapes taken from a specified font.
+ </para>
+ <para>
+ For some writing systems (or <emphasis>scripts</emphasis>) and
+ languages, the process is simple, requiring the shaper to do
+ little more than advance the horizontal position forward by the
+ correct amount for each successive glyph.
+ </para>
+ <para>
+ For <emphasis>complex scripts</emphasis>, however, any combination
+ of several shaping operations may be required, and the rules for how
+ and when they are applied vary from script to script. HarfBuzz and
+ other shaping engines implement these rules.
+ </para>
+ <para>
+ The exact rules and necessary operations for a particular script
+ constitute a shaping <emphasis>model</emphasis>. OpenType
+ specifies a set of shaping models that covers all of
+ Unicode. Other shaping models are available, however, including
+ Graphite and Apple Advanced Typography (AAT).
+ </para>
+ </section>
+
+ <section id="complex-scripts">
+ <title>Complex scripts</title>
+ <para>
+ In text-shaping terminology, scripts are generally classified as
+ either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
+ </para>
+ <para>
+ Complex scripts are those for which transforming the input
+ sequence into the final layout requires some combination of
+ operations&mdash;such as context-dependent substitutions,
+ context-dependent mark positioning, glyph-to-glyph joining,
+ glyph reordering, or glyph stacking.
+ </para>
+ <para>
+ In some complex scripts, the shaping rules require that a text
+ run be divided into syllables before the operations can be
+ applied. Other complex scripts may apply shaping operations over
+ entire words or over the entire text run, with no subdivision
+ required.
+ </para>
+ <para>
+ Non-complex scripts, by definition, do not require these
+ operations. However, correctly shaping a text run in a
+ non-complex script may still involve Unicode normalization,
+ ligature substitutions, mark positioning, kerning, and applying
+ other font features. The key difference is that a text run in a
+ non-complex script can be processed sequentially and in the same
+ order as the input sequence of Unicode codepoints, without
+ requiring an analysis stage.
+ </para>
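+ <para>
+ As a hedged illustration of the normalization step mentioned
+ above, the following Python sketch uses the standard
+ <literal>unicodedata</literal> module (not HarfBuzz itself) to
+ compose a base letter and a combining mark into a single
+ codepoint. The helper name <literal>compose</literal> is
+ hypothetical, and the sketch does not model how a shaper chooses
+ between composed and decomposed forms for a given font.
+ </para>

```python
import unicodedata

def compose(text):
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)

decomposed = "e\u0301"          # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = compose(decomposed)  # the single codepoint U+00E9
print(len(decomposed), len(composed))
```
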
+ </section>
+
+ <section id="shaping-operations">
+ <title>Shaping operations</title>
+ <para>
+ Shaping a complex-script text run involves transforming the
+ input sequence of Unicode codepoints with some combination of
+ operations that is specified in the shaping model for the
+ script.
+ </para>
+ <para>
+ The specific conditions that trigger a given operation for a
+ text run vary from script to script, as do the order in which
+ the operations are performed and the codepoints that are
+ affected. However, the same general set of shaping operations is
+ common to all of the complex-script shaping models.
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ A <emphasis>reordering</emphasis> operation moves a glyph
+ from its original ("logical") position in the sequence to
+ some other ("visual") position.
+ </para>
+ <para>
+ The shaping model for a given complex script might involve
+ more than one reordering step.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A <emphasis>joining</emphasis> operation replaces a glyph
+ with an alternate form that is designed to connect with one
+ or more of the adjacent glyphs in the sequence.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A contextual <emphasis>substitution</emphasis> operation
+ replaces either a single glyph or a subsequence of several
+ glyphs with an alternate glyph. This substitution is
+ performed when the original glyph or subsequence of glyphs
+ occurs in a specified position with respect to the
+ surrounding sequence. For example, one substitution might be
+ performed only when the target glyph is the first glyph in
+ the sequence, while another substitution is performed only
+ when a different target glyph occurs immediately after a
+ particular string pattern.
+ </para>
+ <para>
+ The shaping model for a given complex script might involve
+ multiple contextual-substitution operations, each applying
+ to different target glyphs and patterns and each performed
+ in a separate step.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A contextual <emphasis>positioning</emphasis> operation
+ moves the horizontal and/or vertical position of a
+ glyph. This positioning move is performed when the glyph
+ occurs in a specified position with respect to the
+ surrounding sequence.
+ </para>
+ <para>
+ Many contextual positioning operations are used to place
+ <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
+ signs, and tone markers) with respect to
+ <emphasis>base</emphasis> glyphs. However, some complex
+ scripts may use contextual positioning operations to
+ correctly place base glyphs as well, such as
+ when the script uses <emphasis>stacking</emphasis> characters.
+ </para>
+ </listitem>
+
+ </itemizedlist>
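+ <para>
+ The reordering operation described above can be sketched as
+ follows. This is a minimal Python illustration under stated
+ assumptions, not HarfBuzz code: it works on codepoints as
+ stand-ins for glyphs, and the names <literal>PRE_BASE</literal>
+ and <literal>reorder_pre_base</literal> are hypothetical. It
+ assumes a single rule, namely that the Devanagari vowel sign I
+ (U+093F), which is drawn to the left of its base, moves in front
+ of the item that immediately precedes it. Real shaping models
+ are considerably more detailed.
+ </para>

```python
# Hypothetical set of signs that are stored after their base ("logical"
# order) but displayed before it ("visual" order).
PRE_BASE = {"\u093F"}  # DEVANAGARI VOWEL SIGN I

def reorder_pre_base(seq):
    """Move each pre-base sign in front of the item that precedes it."""
    out = []
    for item in seq:
        if item in PRE_BASE and out:
            # Insert just before the most recently emitted item.
            out.insert(len(out) - 1, item)
        else:
            out.append(item)
    return out

logical = ["\u0915", "\u093F"]   # KA followed by VOWEL SIGN I
visual = reorder_pre_base(logical)
```
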
+ </section>
+
+ <section id="unicode-character-categories">
+ <title>Unicode character categories</title>
+ <para>
+ Shaping models are typically specified with respect to how
+ scripts are defined in the Unicode standard.
+ </para>
+ <para>
+ Every codepoint in the Unicode Character Database (UCD) is
+ assigned a <emphasis>Unicode General Category</emphasis> (UGC),
+ which provides the most fundamental information about the
+ codepoint: whether the codepoint represents a
+ <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
+ <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
+ <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
+ or something else (<emphasis>Other</emphasis>).
+ </para>
+ <para>
+ These UGC properties are "Major" categories. Each codepoint is
+ further assigned to a "minor" category within its Major
+ category, such as "Letter, uppercase" (<literal>Lu</literal>) or
+ "Letter, modifier" (<literal>Lm</literal>).
+ </para>
+ <para>
+ Shaping models are concerned primarily with Letter and Mark
+ codepoints. The minor categories of Mark codepoints are
+ particularly important for shaping. Marks can be nonspacing
+ (<literal>Mn</literal>), spacing combining
+ (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
+ </para>
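+ <para>
+ For illustration, the General Category values discussed above
+ can be inspected with Python's standard
+ <literal>unicodedata</literal> module. This sketch is not part
+ of HarfBuzz, and the helper names
+ <literal>general_category</literal> and
+ <literal>is_mark</literal> are hypothetical. The first letter of
+ the two-letter code is the Major category.
+ </para>

```python
import unicodedata

def general_category(ch):
    """Return (major, minor) General Category codes, e.g. ('L', 'Lu')."""
    minor = unicodedata.category(ch)
    return minor[0], minor

def is_mark(ch):
    """True for nonspacing (Mn), spacing combining (Mc), or enclosing (Me)."""
    return unicodedata.category(ch).startswith("M")

samples = ["A", "\u02B0", "\u0301", "\u093E", "\u20DD"]
for ch in samples:
    major, minor = general_category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {minor}")
```
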
+ <para>
+ In addition to the UGC property, codepoints in the Indic and
+ Southeast Asian scripts are also assigned
+ <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
+ <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
+ properties that provide more detailed information needed for
+ shaping.
+ </para>
+ <para>
+ The UISC property sub-categorizes Letters and Marks according to
+ common script-shaping behaviors. For example, UISC distinguishes
+ between consonant letters, vowel letters, and vowel marks. The
+ UIPC property sub-categorizes Mark codepoints by the visual
+ position that they occupy (above, below, right, left, or in
+ multiple positions).
+ </para>
+ <para>
+ Some complex scripts require that the text run be split into
+ syllables, and what constitutes a valid syllable in these
+ scripts is specified in regular expressions of the Letter and
+ Mark codepoints that take the UISC and UIPC properties into account.
+ </para>
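+ <para>
+ The idea of describing syllables with regular expressions over
+ character classes can be sketched as below. This is a heavily
+ simplified, hypothetical example: the class table covers only a
+ few Devanagari ranges, the one-letter labels and the pattern are
+ invented for illustration, and neither reproduces the syllable
+ grammar of any actual shaping model.
+ </para>

```python
import re

# Hypothetical, heavily simplified classes for a few Devanagari codepoints.
CLASSES = [
    (range(0x0904, 0x0915), "V"),  # independent vowels
    (range(0x0915, 0x093A), "C"),  # consonants
    (range(0x093E, 0x094D), "M"),  # dependent vowel signs
    (range(0x094D, 0x094E), "H"),  # virama (halant)
]

def classify(ch):
    cp = ord(ch)
    for cps, label in CLASSES:
        if cp in cps:
            return label
    return "X"  # anything else

# Zero or more consonant+virama pairs, then a consonant, then an optional
# virama or vowel sign. Any other single label forms its own segment.
SYLLABLE = re.compile(r"(?:CH)*C[HM]?|.")

def split_syllables(text):
    # One label per codepoint, so match offsets index the text directly.
    labels = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(labels)]

word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
syllables = split_syllables(word)
```
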
+
+ </section>
+
+ <section id="text-runs">
+ <title>Text runs</title>
+ <para>
+ Real-world text usually contains codepoints from a mixture of
+ different Unicode scripts (including punctuation, numbers, symbols,
+ white-space characters, and other codepoints that do not belong
+ to any script). Real-world text may also be marked up with
+ formatting that changes font properties (including the font,
+ font style, and font size).
+ </para>
+ <para>
+ For shaping purposes, all real-world text streams must first be
+ segmented into runs that have a uniform set of properties.
+ </para>
+ <para>
+ In particular, shaping models always assume that every codepoint
+ in a text run has the same <emphasis>direction</emphasis>,
+ <emphasis>script</emphasis> tag, and
+ <emphasis>language</emphasis> tag.
+ </para>
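+ <para>
+ A rough sketch of splitting text into runs by script follows.
+ It is an illustration under loud assumptions, not how HarfBuzz
+ or any itemizer works: the standard library has no script
+ property, so the hypothetical <literal>guess_script</literal>
+ helper takes the first word of a letter's Unicode name, and
+ characters without a script of their own (marks, digits, spaces,
+ punctuation) simply join the neighboring run. Direction and
+ language, which also must be uniform within a run, are not
+ modeled.
+ </para>

```python
import unicodedata

def guess_script(ch):
    """Hypothetical heuristic: first word of a letter's Unicode name."""
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None  # marks, digits, spaces, punctuation: no script of their own

def split_runs(text):
    runs = []  # each entry is [script, substring]
    for ch in text:
        script = guess_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch          # neutral or same script: extend the run
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script       # leading neutrals adopt the first script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])  # script changed: start a new run
    return [(script, run) for script, run in runs]

runs = split_runs("abc \u05D0\u05D1 def")
```

+ <para>
+ Each resulting run could then be shaped separately, with its
+ direction, script, and language set uniformly.
+ </para>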
+ </section>
+
+ <section id="opentype-shaping-models">
+ <title>OpenType shaping models</title>
+ <para>
+ OpenType provides the following shaping models:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The <emphasis>default</emphasis> shaping model handles all
+ non-complex scripts, and may also be used as a fallback for
+ handling unrecognized scripts.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Indic</emphasis> shaping model handles the Indic
+ scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
+ Malayalam, Oriya, Tamil, Telugu, and Sinhala.
+ </para>
+ <para>
+ The Indic shaping model was revised significantly in
+ 2005. To denote the change, a new set of <emphasis>script
+ tags</emphasis> was assigned for Bengali, Devanagari,
+ Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
+ Telugu. For the sake of clarity, the term "Indic2" is
+ sometimes used to refer to the current, revised shaping
+ model.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Arabic</emphasis> shaping model supports
+ Arabic, Mongolian, N'Ko, Syriac, and several other connected
+ or cursive scripts.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Thai/Lao</emphasis> shaping model supports
+ the Thai and Lao scripts.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Khmer</emphasis> shaping model supports the
+ Khmer script.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Myanmar</emphasis> shaping model supports the
+ Myanmar (or Burmese) script.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Tibetan</emphasis> shaping model supports the
+ Tibetan script.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Hangul</emphasis> shaping model supports the
+ Hangul script.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Hebrew</emphasis> shaping model supports the
+ Hebrew script.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <emphasis>Universal Shaping Engine</emphasis> (USE)
+ shaping model supports complex scripts not covered by one of
+ the above, script-specific shaping models, including
+ Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
+ Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
+ Viet, and many others.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Text runs that do not fall under one of the above shaping
+ models may still require processing by a shaping engine. Of
+ particular note is <emphasis>Emoji</emphasis> shaping, which
+ may involve variation-selector sequences and glyph
+ substitution. Emoji shaping is handled by the default
+ shaping model.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ </section>
+
+ <section id="graphite-shaping">
+ <title>Graphite shaping</title>
+ <para>
+ In contrast to OpenType shaping, Graphite shaping does not
+ specify a predefined set of shaping models or a set of supported
+ scripts.
+ </para>
+ <para>
+ Instead, each Graphite font contains a complete set of rules that
+ implement the required shaping model for the intended
+ script. These rules include finite-state machines to match
+ sequences of codepoints to the shaping operations to perform.
+ </para>
+ <para>
+ Graphite shaping can perform the same shaping operations used in
+ OpenType shaping, as well as other functions that have not been
+ defined for OpenType shaping.
+ </para>
+ </section>
+
+ <section id="aat-shaping">
+ <title>AAT shaping</title>
+ <para>
+ In contrast to OpenType shaping, AAT shaping does not specify a
+ predefined set of shaping models or a set of supported scripts.
+ </para>
+ <para>
+ Instead, each AAT font includes a complete set of rules that
+ implement the desired shaping model for the intended
+ script. These rules include finite-state machines that match
+ glyph sequences to the shaping operations to perform.
+ </para>
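+ <para>
+ The general idea of a finite-state machine that matches glyph
+ sequences and triggers an operation can be sketched as below.
+ This is a toy illustration only: the state names, the glyph
+ names, and the <literal>TRANSITIONS</literal> and
+ <literal>ACTIONS</literal> tables are all hypothetical and do
+ not reflect the actual table formats used by AAT or Graphite
+ fonts.
+ </para>

```python
# Hypothetical state table: (state, glyph) -> next state.
TRANSITIONS = {
    ("start", "f"): "saw_f",
    ("saw_f", "f"): "saw_f",
}
# Hypothetical actions fired when a glyph arrives in a given state.
ACTIONS = {("saw_f", "i"): "fi_ligature"}

def run_machine(glyphs):
    out, state = [], "start"
    for g in glyphs:
        action = ACTIONS.get((state, g))
        if action is not None:
            out[-1] = action   # replace the pending "f" with the ligature
            state = "start"
        else:
            out.append(g)
            state = TRANSITIONS.get((state, g), "start")
    return out

shaped = run_machine(["o", "f", "f", "i", "c", "e"])
```
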
+ <para>
+ Notably, AAT shaping rules are expressed for glyphs in the font,
+ not for Unicode codepoints. AAT shaping can perform the same
+ shaping operations used in OpenType shaping, as well as other
+ functions that have not been defined for OpenType shaping.
+ </para>
+ </section>
+</chapter>