Unicode Standard Annex #9

The Bidirectional Algorithm

Version	Unicode 4.1.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2005-03-25
This Version	file:///D:/unicode/unicode/www.unicode.org/reports/tr9/tr9-15.html
Previous Version	file:///D:/unicode/unicode/www.unicode.org/reports/tr9/tr9-13.html
Latest Version	file:///D:/unicode/unicode/www.unicode.org/reports/tr9/tr9.html
Revision	15

Summary

This document describes specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction
2 Directional Formatting Codes
- 2.1 Explicit Directional Embedding
- 2.2 Explicit Directional Overrides
- 2.3 Terminating Explicit Directional Code
- 2.4 Implicit Directional Marks
3 Basic Display Algorithm
- 3.1 Definitions: BD1, BD2, BD3, BD4, BD5, BD6, BD7
- 3.2 Bidirectional Character Types
- 3.3 Resolving Embedding Levels
  - 3.3.1 The Paragraph Level: P1, P2, P3
  - 3.3.2 Explicit Levels and Directions: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
  - 3.3.3 Resolving Weak Types: W1, W2, W3, W4, W5, W6, W7
  - 3.3.4 Resolving Neutral Types: N1, N2
  - 3.3.5 Resolving Implicit Levels: I1, I2
- 3.4 Reordering Resolved Levels: L1, L2, L3, L4
- 3.5 Shaping
4 Bidirectional Conformance
- 4.1 Boundary Neutrals
- 4.2 Explicit Formatting Codes
- 4.3 Higher-Level Protocols: HL1, HL2, HL3, HL4, HL5, HL6
5 Implementation Notes
- 5.1 Reference Code
- 5.2 Retaining Format Codes
- 5.3 Joiners
- 5.4 Vertical Text
- 5.5 Usage
6 Mirroring
Acknowledgements
References
Modifications

1 Introduction

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has the same horizontal direction, then the ordering of the display text is unambiguous. However, when bidirectional text (a mixture of left-to-right and right-to-left horizontal text) is present, some ambiguities can arise in determining the ordering of the displayed characters.

This document describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit format codes for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting codes is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and also ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

The directional formatting codes are used only to influence the display ordering of text. In all other respects they should be ignored--they have no effect on the comparison of text, nor on word breaks, parsing, or numeric analysis.

When working with bidirectional text, the characters are still interpreted in logical order--only the display is affected. The display ordering of bidirectional text depends upon the directional properties of the characters in the text.

Note: The changes in Section 4, Bidirectional Conformance override clause C13 of Unicode 4.0 [Unicode], and tighten the conformance requirements.

2 Directional Formatting Codes

Two types of explicit codes are used to modify the standard implicit Unicode bidirectional algorithm. In addition, there are implicit ordering codes, the right-to-left and left-to-right marks. All of these codes are limited to the current paragraph; thus their effects are terminated by a paragraph separator. The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.

Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.

2.1 Explicit Directional Embedding

The following codes signal that a piece of text is to be treated as embedded. For example, an English quotation in the middle of an Arabic sentence could be marked as being embedded left-to-right text. If there were a Hebrew phrase in the middle of the English quotation, then that phrase could be marked as being embedded right-to-left. These codes allow for nested embeddings.

RLE	Right-to-Left Embedding	Treat the following text as embedded right-to-left.
LRE	Left-to-Right Embedding	Treat the following text as embedded left-to-right.

The precise meaning of these codes will be made clear in the discussion of the algorithm. The effect of right-left line direction, for example, can be accomplished by simply embedding the text with RLE...PDF.

2.2 Explicit Directional Overrides

The following codes allow the bidirectional character types to be overridden when required for special cases, such as for part numbers. These codes allow for nested directional overrides.

RLO	Right-to-Left Override	Force following characters to be treated as strong right-to-left characters.
LRO	Left-to-Right Override	Force following characters to be treated as strong left-to-right characters.

The precise meaning of these codes will be made clear in the discussion of the algorithm. The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.

2.3 Terminating Explicit Directional Code

The following code terminates the effects of the last explicit code (either embedding or override) and restores the bidirectional state to what it was before that code was encountered.

PDF	Pop Directional Format	Restore the bidirectional state to what it was before the last LRE, RLE, RLO, LRO.

2.4 Implicit Directional Marks

These characters are very light-weight codes. They act exactly like right-to-left or left-to-right characters, except that they do not display or have any other semantic effect. Their use is generally more convenient than the explicit embeddings or overrides since their scope is much more local.

RLM	Right-to-Left Mark	Right-to-left zero-width character
LRM	Left-to-Right Mark	Left-to-right zero-width character

There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
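As a minimal sketch of the codes listed in Sections 2.1 through 2.4, the Python snippet below maps each abbreviation to its code point and looks up its bidirectional type with the standard `unicodedata` module. The names `FORMAT_CODES`, `describe_format_codes`, and `embed_rtl` are illustrative assumptions, not part of this annex.

```python
import unicodedata

# Code points of the directional formatting codes from Section 2.
FORMAT_CODES = {
    "LRE": "\u202A",  # Left-to-Right Embedding
    "RLE": "\u202B",  # Right-to-Left Embedding
    "PDF": "\u202C",  # Pop Directional Format
    "LRO": "\u202D",  # Left-to-Right Override
    "RLO": "\u202E",  # Right-to-Left Override
    "LRM": "\u200E",  # Left-to-Right Mark
    "RLM": "\u200F",  # Right-to-Left Mark
}


def describe_format_codes():
    """Return (abbreviation, code point, bidi type) for every format code."""
    return [
        (abbr, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for abbr, ch in FORMAT_CODES.items()
    ]


def embed_rtl(text):
    """Wrap text in RLE...PDF so it is treated as embedded right-to-left."""
    return FORMAT_CODES["RLE"] + text + FORMAT_CODES["PDF"]
```

Note that the implicit marks report the plain strong types `L` and `R`, which is why the algorithm needs no special rule for them.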

3 Basic Display Algorithm

The Bidirectional Algorithm takes a stream of text as input, and proceeds in three main phases:

- Separation of the input text into paragraphs. The rest of the algorithm affects only the text between paragraph separators.
- Resolution of the embedding levels of the text. In this phase, the directional character types, plus the explicit format codes, are used to produce resolved embedding levels.
- Reordering the text for display on a line-by-line basis using the resolved embedding levels, once the text has been broken into lines.

The algorithm only reorders text within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.

Combining characters always attach to the preceding base character in the memory representation. Even after reordering for display and performing character shaping, the glyph representing a combining character will attach to the glyph representing its base character in memory. Depending on the line orientation and the placement direction of base letterform glyphs, it may, for example, attach to the glyph on the left, or on the right, or above.

In the following text, the normative definitions and rules are distinguished by the following numbering:

**Table 3-5. Normative Definitions and Rules**
Numbering	Section
BDn	Definitions
Pn	Paragraph levels
Xn	Explicit levels and directions
Wn	Weak types
Nn	Neutral types
In	Implicit levels
Ln	Resolved levels

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.

For example, in a particular piece of text, Level 0 is plain English text, Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

In some contexts the paragraph direction is also known as the base direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset with explicit directional controls. This status has three states:

**Table 3-6. Directional Override Status**
Status	Interpretation
neutral	no override is currently active
right-to-left	characters are to be reset to R
left-to-right	characters are to be reset to L

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).

Example

In the following examples, case is used to indicate different implicit character types for those unfamiliar with right-to-left letters. Uppercase letters stand for right-to-left characters (such as Arabic or Hebrew), while lowercase letters stand for left-to-right characters (such as English or Russian).

Memory:            car is THE CAR in arabic

Character types:   LLL-LL-RRR-RRR-LL-LLLLLL

Resolved levels:   000000011111110000000000

Notice that the neutral character (space) between THE and CAR gets the level of the surrounding characters. This is how the implicit directional marks have an effect. By inserting appropriate directional marks around neutral characters, the level of the neutral characters can be changed.
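Definitions BD3 and BD7 can be sketched in a few lines of Python. The function names `embedding_direction` and `level_runs` are assumptions made for this illustration, and levels are represented as a plain list of integers.

```python
from itertools import groupby


def embedding_direction(level):
    """BD3: even levels are left-to-right (L), odd levels right-to-left (R)."""
    return "L" if level % 2 == 0 else "R"


def level_runs(levels):
    """BD7: split a list of embedding levels into maximal level runs.

    Returns (start, end_exclusive, level) tuples in logical order.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level))
        start += length
    return runs
```

Applied to the resolved levels of the example above, `level_runs` yields three runs: the leading English text, the right-to-left phrase, and the trailing English text.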

3.2 Bidirectional Character Types

The normative bidirectional character types for each character are specified in the Unicode Character Database [UCD] and are summarized in Table 3-7. This is a summary only: there are exceptions to the general scope. For example, certain characters such as U+0CBF KANNADA VOWEL SIGN I are given Type L (instead of NSM) to preserve canonical equivalence.

Table 3-7. Bidirectional Character Types

Category	Type	Description	General Scope
Strong	L	Left-to-Right	LRM, Most alphabetic, syllabic, Han ideographic characters, digits that are neither European nor Arabic, ...
	LRE	Left-to-Right Embedding	LRE
	LRO	Left-to-Right Override	LRO
	R	Right-to-Left	RLM, Hebrew alphabet, most punctuation specific to that script, ...
	AL	Right-to-Left Arabic	Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, ...
	RLE	Right-to-Left Embedding	RLE
	RLO	Right-to-Left Override	RLO
Weak	PDF	Pop Directional Format	PDF
	EN	European Number	European digits, Eastern Arabic-Indic digits, ...
	ES	European Number Separator	Plus Sign, Minus Sign
	ET	European Number Terminator	Degree, Currency symbols, ...
	AN	Arabic Number	Arabic-Indic digits, Arabic decimal & thousands separators, ...
	CS	Common Number Separator	Colon, Comma, Full Stop (Period), Non-breaking space, ...
	NSM	Non-Spacing Mark	Characters marked Mn (Non-Spacing Mark) and Me (Enclosing Mark) in the Unicode Character Database.
	BN	Boundary Neutral	Most formatting and control characters, other than those explicitly given types above.
Neutral	B	Paragraph Separator	Paragraph Separator, appropriate Newline Functions, higher-protocol paragraph determination.
	S	Segment Separator	Tab
	WS	Whitespace	Space, Figure Space, Line Separator, Form Feed, General Punctuation Spaces, ...
	ON	Other Neutrals	All other characters, including OBJECT REPLACEMENT CHARACTER.

- The term European digits is used to refer to decimal forms common in Europe and elsewhere, and Arabic-Indic digits to refer to the native Arabic forms. (See Section 8.2, Arabic of [Unicode], for more details on naming digits.)
- Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types see DerivedBidiClass.txt [DerivedBIDI] in the [UCD].
- Private use characters can be assigned different values by a conformant implementation.
- For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are an OBJECT REPLACEMENT CHARACTER (U+FFFC).
- As of Unicode 4.0, the Bidirectional Character Types of a few Indic characters were altered so that the Bidirectional Algorithm preserves canonical equivalence. That is, two canonically equivalent strings will result in equivalent ordering after applying the algorithm. This invariant will be maintained in the future.
- Note, however, that the Bidirectional Algorithm does not preserve compatibility equivalence.
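The types summarized in Table 3-7 can be inspected from Python with `unicodedata.bidirectional`. This is only a sketch: the module reflects the Unicode version bundled with the interpreter, which is newer than 4.1.0, so a few assignments may differ from this revision. The helper name `bidi_types` is an assumption.

```python
import unicodedata


def bidi_types(text):
    """Return the bidirectional character type of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]
```

For instance, `bidi_types("a\u05D0\u0639")` covers a Latin letter, a Hebrew letter, and an Arabic letter, which fall under the strong types L, R, and AL respectively.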

Table 3-8 lists additional abbreviations used in the examples and internal character types used in the algorithm.

**Table 3-8. Abbreviations for Examples and Internal Types**
Symbol	Description
N	Neutral or Separator (B, S, WS, ON)
e	The text ordering type (L or R) that matches the embedding level direction (even or odd)
sor	The text ordering type (L or R) assigned to the position before a level run.
eor	The text ordering type (L or R) assigned to the position after a level run.

3.3 Resolving Embedding Levels

The body of the bidirectional algorithm uses character types and explicit codes to produce a list of resolved levels. This resolution process consists of five steps: (1) determining the paragraph level; (2) determining explicit embedding levels and directions; (3) resolving weak types; (4) resolving neutral types; and (5) resolving implicit embedding levels.

3.3.1 The Paragraph Level

P1. Split the text into separate paragraphs. A paragraph separator is kept with the previous paragraph. Within each paragraph, apply all the other rules of this algorithm.

P2. In each paragraph, find the first character of type L, AL, or R.

Because paragraph separators delimit text in this algorithm, this will generally be the first strong character after a paragraph separator or at the very beginning of the text. Note that the characters of type LRE, LRO, RLE, RLO are ignored in this rule. This is because they are typically used to indicate that the embedded text runs in the opposite direction from the paragraph level.

P3. If a character is found in P2 and it is of type AL or R, then set the paragraph embedding level to one; otherwise, set it to zero.

Note that when a higher-level protocol specifies the paragraph level, it is not necessary to apply rules P2 and P3.
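Rules P1 through P3 might be sketched as follows. This is an illustration under simplifying assumptions: every character of type B ends a paragraph (so a CRLF pair is not treated as a single separator), and the function names `split_paragraphs` and `paragraph_level` are hypothetical.

```python
import unicodedata


def split_paragraphs(text):
    """P1: split text into paragraphs, keeping each separator with the
    paragraph that precedes it."""
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


def paragraph_level(paragraph):
    """P2/P3: return 1 if the first strong character is R or AL, else 0."""
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type == "L":
            return 0
        if bidi_type in ("R", "AL"):
            return 1
    return 0  # no strong character found
```

Because only the types L, R, and AL are tested, explicit codes such as RLE are skipped automatically, as the note on P2 requires.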

3.3.2 Explicit Levels and Directions

All explicit embedding levels are determined from the embedding and override codes, by applying the explicit level rules X1 through X9. These rules are applied as part of the same logical pass over the input.

Explicit Embeddings

X1. Begin by setting the current embedding level to the paragraph embedding level. Set the directional override status to neutral. Process each character iteratively, applying rules X2 through X9. Only embedding levels from 0 to 61 are valid in this phase.

In the resolution of levels in rules I1 and I2, the maximum embedding level of 62 can be reached.

X2. With each RLE, compute the least greater odd embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral.

b. If the new level would not be valid, then this code is invalid. Don't change the current level or override status.

For example, level 0 => 1; levels 1, 2 => 3; levels 3, 4 => 5; ...59,60 => 61; above 60, no change (don’t change levels with RLE if the new level would be invalid).

X3. With each LRE, compute the least greater even embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral.

b. If the new level would not be valid, then this code is invalid. Don't change the current level or override status.

For example, levels 0, 1 => 2; levels 2, 3 => 4; levels 4, 5 => 6; ...58, 59 => 60; above 59, no change (don’t change levels with LRE if the new level would be invalid).
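The "least greater odd" and "least greater even" computations in X2 and X3 reduce to simple arithmetic. The sketch below uses hypothetical helper names and takes the validity limit of 61 from rule X1.

```python
MAX_EXPLICIT_LEVEL = 61  # highest valid explicit embedding level (X1)


def least_greater_odd(level):
    """Smallest odd level strictly greater than level (used by RLE, RLO)."""
    return level + 1 if level % 2 == 0 else level + 2


def least_greater_even(level):
    """Smallest even level strictly greater than level (used by LRE, LRO)."""
    return level + 2 if level % 2 == 0 else level + 1


def is_valid_level(level):
    """A new explicit level is valid only within 0..61."""
    return 0 <= level <= MAX_EXPLICIT_LEVEL
```

An RLE at level 61 would ask for level 63 and an LRE at level 60 for level 62; both fail `is_valid_level`, so the code is invalid and the current level stays unchanged.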

Explicit Overrides

An explicit directional override sets the embedding level in the same way the explicit embedding codes do, but also changes the directional character type of affected characters to the override direction.

X4. With each RLO, compute the least greater odd embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to right-to-left.

b. If the new level would not be valid, then this code is invalid. Don't change the current level or override status.

X5. With each LRO, compute the least greater even embedding level.

a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to left-to-right.

b. If the new level would not be valid, then this code is invalid. Don't change the current level or override status.

X6. For all types besides RLE, LRE, RLO, LRO, and PDF:

a. Set the level of the current character to the current embedding level.

b. Whenever the directional override status is not neutral, reset the current character type to the directional override status.

If the directional override status is neutral, then characters retain their normal types: Arabic characters stay AL, Latin characters stay L, neutrals stay N, and so on. If the directional override status is R, then characters become R. If the directional override status is L, then characters become L.

Terminating Embeddings and Overrides

There is a single code to terminate the scope of the current explicit code, whether an embedding or a directional override. All codes and pushed states are completely popped at the end of paragraphs.

X7. With each PDF, determine the matching embedding or override code. If there was a valid matching code, restore (pop) the last remembered (pushed) embedding level and directional override.

X8. All explicit directional embeddings and overrides are completely terminated at the end of each paragraph. Paragraph separators are not included in the embedding.

X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes.

Note that an implementation does not have to actually remove the codes; it just has to behave as though the codes were not present for the remainder of the algorithm. Conformance does not require any particular placement of these codes as long as all other characters are ordered correctly.

See 5. Implementation Notes for information on implementing the algorithm without removing the formatting codes.
The Zero Width Joiner and Non Joiner affect the shaping of the characters that are adjacent to them in the original backing-store order, even though those characters may end up being rearranged to be non-adjacent by the bidirectional algorithm. For more information, see Joiners.
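Rules X1 through X9 can be sketched as a single pass over a list of bidirectional types. This is an illustrative reading of the rules, not the reference code: the function name `resolve_explicit` and its return shape are assumptions, and invalid codes are pushed as `None` placeholders so that each PDF still matches the code it belongs to.

```python
MAX_LEVEL = 61  # highest valid explicit level (X1)


def resolve_explicit(types, paragraph_level):
    """Apply X1-X9 to a list of bidi types for one paragraph.

    Returns (original_index, type, level) tuples for the characters that
    remain after X9 removes RLE, LRE, RLO, LRO, PDF, and BN.
    """
    level, override = paragraph_level, None  # X1: override None = neutral
    stack = []  # pushed (level, override) states, or None for invalid codes
    result = []
    for index, bidi_type in enumerate(types):
        if bidi_type in ("RLE", "LRE", "RLO", "LRO"):  # X2-X5
            want_odd = bidi_type in ("RLE", "RLO")
            level_is_odd = level % 2 == 1
            new_level = level + 2 if level_is_odd == want_odd else level + 1
            if new_level <= MAX_LEVEL:
                stack.append((level, override))
                level = new_level
                override = {"RLO": "R", "LRO": "L"}.get(bidi_type)
            else:
                stack.append(None)  # invalid: level and override unchanged
        elif bidi_type == "PDF":  # X7
            if stack:
                saved = stack.pop()
                if saved is not None:
                    level, override = saved
        elif bidi_type == "BN":  # X9
            continue
        elif bidi_type == "B":  # X8: separators are outside any embedding
            result.append((index, bidi_type, paragraph_level))
        else:  # X6
            result.append((index, override or bidi_type, level))
    return result
```

For the type sequence `LRO AL AL PDF RLO AL AL PDF` in a left-to-right paragraph, the first two letters come out as type L at level 2 and the last two as type R at level 1.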

X10. The remaining rules are applied to each run of characters at the same level. For each run, determine the start-of-level-run (sor) and end-of-level-run (eor) type, either L or R. This depends on the higher of the two levels on either side of the boundary (at the start or end of the paragraph, the level of the 'other' run is the base embedding level). If the higher level is odd, the type is R, otherwise it is L.

For example:

Levels:  0   0   0   1   1   1   2

Runs:   <--- 1 ---> <--- 2 ---> <3>

Run 1 is at level 0, sor is L, eor is R.
Run 2 is at level 1, sor is R, eor is L.
Run 3 is at level 2, sor is L, eor is L.

For two adjacent runs, the eor of the first run is the same as the sor of the second.
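The sor and eor computation in X10 can be sketched as below. The function name `run_boundaries` and the tuple layout are assumptions for illustration.

```python
def run_boundaries(levels, paragraph_level):
    """X10: return (start, end_exclusive, level, sor, eor) per level run.

    At the paragraph edges the level of the 'other' run is the paragraph
    (base) embedding level.
    """
    runs = []
    count = len(levels)
    start = 0
    while start < count:
        end = start
        while end < count and levels[end] == levels[start]:
            end += 1
        before = levels[start - 1] if start > 0 else paragraph_level
        after = levels[end] if end < count else paragraph_level
        # The higher of the two levels at each boundary decides the type.
        sor = "R" if max(before, levels[start]) % 2 else "L"
        eor = "R" if max(after, levels[start]) % 2 else "L"
        runs.append((start, end, levels[start], sor, eor))
        start = end
    return runs
```

Called with the levels `0 0 0 1 1 1 2` from the example and a paragraph level of 0, this reproduces the three runs listed above, with each run's eor equal to the next run's sor.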

3.3.3 Resolving Weak Types

Weak types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used.

Non-spacing marks are now resolved based on the previous characters.

W1. Examine each non-spacing mark (NSM) in the level run, and change the type of the NSM to the type of the previous character. If the NSM is at the start of the level run, it will get the type of sor.

Assume in this example that sor is R:

AL  NSM NSM => AL  AL  AL

sor NSM     => sor R

The text is next parsed for numbers. This pass will change the directional types European Number Separator, European Number Terminator, and Common Number Separator to be European Number text, Arabic Number text, or Other Neutral text. The text to be scanned may have already had its type altered by directional overrides. If so, then it will not parse as numeric.

W2. Search backwards from each instance of a European number until the first strong type (R, L, AL, or sor) is found. If an AL is found, change the type of the European number to Arabic number.

AL EN    => AL AN

AL N EN  => AL N AN

sor N EN => sor N EN

L N EN   => L N EN

R N EN   => R N EN

W3. Change all ALs to R.

W4. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type:

EN ES EN => EN EN EN

EN CS EN => EN EN EN

AN CS AN => AN AN AN

W5. A sequence of European terminators adjacent to European numbers changes to all European numbers:

ET ET EN => EN EN EN

EN ET ET => EN EN EN

AN ET EN => AN EN EN

W6. Otherwise, separators and terminators change to Other Neutral:

AN ET    => AN ON

L  ES EN => L  ON EN

EN CS AN => EN ON AN

ET AN    => ON AN

W7. Search backwards from each instance of a European number until the first strong type (R, L, or sor) is found. If an L is found, then change the type of the European number to L.

L  N EN => L  N  L

R  N EN => R  N  EN
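Rules W1 through W7 can be applied in sequence to the types of one level run. The sketch below is an illustration only; the name `resolve_weak` is an assumption, and the function returns the types after all seven rules, so its output can differ from the per-rule examples above (for example, a European number following L ends up as L because of W7).

```python
def resolve_weak(types, sor):
    """Apply W1-W7 to the bidi types of one level run; sor is 'L' or 'R'."""
    t = list(types)
    n = len(t)
    prev = sor
    for i in range(n):  # W1: NSM takes the type of the previous character
        if t[i] == "NSM":
            t[i] = prev
        prev = t[i]
    last_strong = sor
    for i in range(n):  # W2: EN after AL becomes AN
        if t[i] in ("L", "R", "AL"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "AL":
            t[i] = "AN"
    t = ["R" if x == "AL" else x for x in t]  # W3
    for i in range(1, n - 1):  # W4: single separator between two numbers
        a, b = t[i - 1], t[i + 1]
        if t[i] == "ES" and a == b == "EN":
            t[i] = "EN"
        elif t[i] == "CS" and a == b and a in ("EN", "AN"):
            t[i] = a
    i = 0
    while i < n:  # W5: ET sequences adjacent to EN become EN
        if t[i] == "ET":
            j = i
            while j < n and t[j] == "ET":
                j += 1
            if (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN"):
                t[i:j] = ["EN"] * (j - i)
            i = j
        else:
            i += 1
    t = ["ON" if x in ("ES", "ET", "CS") else x for x in t]  # W6
    last_strong = sor
    for i in range(n):  # W7: EN after L becomes L
        if t[i] in ("L", "R"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "L":
            t[i] = "L"
    return t
```

Only sor is needed here because the weak rules that look forward (W4 and W5) test for number types, which eor can never be.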

3.3.4 Resolving Neutral Types

Neutral types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used.

The next phase resolves the direction of the neutrals. The results of this phase are that all neutrals become either R or L. Generally, neutrals take on the direction of the surrounding text. In case of a conflict, they take on the embedding direction.

N1. A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers act as if they were R in terms of their influence on neutrals. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries.

R  N  R  => R  R  R

L  N  L  => L  L  L

R  N  AN => R  R  AN

AN N  R  => AN R  R

R  N  EN => R  R  EN

EN N  R  => EN R  R

Note that any AN or EN remaining after W7 will be in a right-to-left context.

N2. Any remaining neutrals take the embedding direction.

N => e

Assume in this example that eor is L, and sor is R. Then an application of N1 and N2 yields the following:

L   N eor => L   L eor

R   N eor => R   e eor

sor N L   => sor e L

sor N R   => sor R R
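Rules N1 and N2 can be sketched for a single level run as follows. The function name `resolve_neutral` and its parameters are assumptions; the input is expected to be the output of weak-type resolution, so it contains only L, R, EN, AN, and neutral types.

```python
NEUTRAL_TYPES = {"B", "S", "WS", "ON"}


def resolve_neutral(types, sor, eor, embedding_direction):
    """Apply N1 and N2 to one level run.

    sor, eor, and embedding_direction are each 'L' or 'R'.
    """

    def direction(bidi_type):
        # N1: European and Arabic numbers influence neutrals as if they were R.
        return "R" if bidi_type in ("R", "EN", "AN") else "L"

    t = list(types)
    n = len(t)
    i = 0
    while i < n:
        if t[i] in NEUTRAL_TYPES:
            j = i
            while j < n and t[j] in NEUTRAL_TYPES:
                j += 1
            before = direction(t[i - 1]) if i > 0 else sor
            after = direction(t[j]) if j < n else eor
            # N1 when both sides agree, otherwise N2 (embedding direction).
            fill = before if before == after else embedding_direction
            t[i:j] = [fill] * (j - i)
            i = j
        else:
            i += 1
    return t
```

With sor set to R and eor set to L, the four cases listed above fall out directly: the neutral follows its neighbours when they agree and takes the embedding direction e otherwise.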

Examples. A list of numbers separated by neutrals and embedded in a directional run will come out in the run’s order.

Storage:	he said "THE VALUES ARE 123, 456, 789, OK".

Display:	he said "KO ,789 ,456 ,123 ERA SEULAV EHT".

In this case, both the comma and the space between the numbers take on the direction of the surrounding text (uppercase = right-to-left), ignoring the numbers. The commas are not considered part of the number since they are not surrounded on both sides (see number parsing). However, if there is an adjacent left-to-right sequence, then European numbers will adopt that direction:

Storage:	he said "IT IS A bmw 500, OK."

Display:	he said ".KO ,bmw 500 A SI TI"

3.3.5 Resolving Implicit Levels

In the final phase, the embedding level of text may be increased, based upon the resolved character type. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level. (Note that it is possible for text to end up at levels higher than 61 as a result of this process.) This results in the following rules:

I1. For all characters with an even (left-to-right) embedding direction, those of type R go up one level and those of type AN or EN go up two levels.

I2. For all characters with an odd (right-to-left) embedding direction, those of type L, EN or AN go up one level.

Table 3-10 summarizes the results of the implicit algorithm.

**Table 3-10. Resolving Implicit Levels**
Type	Even Embedding Level	Odd Embedding Level
L	EL	EL+1
R	EL+1	EL
AN	EL+2	EL+1
EN	EL+2	EL+1
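Rules I1 and I2, as summarized in Table 3-10, amount to a small lookup. The sketch below uses the hypothetical name `implicit_level` and expects a resolved type of L, R, AN, or EN.

```python
def implicit_level(resolved_type, level):
    """Return the final level of a character after rules I1 and I2."""
    if level % 2 == 0:  # I1: even (left-to-right) embedding direction
        if resolved_type == "R":
            return level + 1
        if resolved_type in ("AN", "EN"):
            return level + 2
    else:  # I2: odd (right-to-left) embedding direction
        if resolved_type in ("L", "EN", "AN"):
            return level + 1
    return level
```

Starting from the highest explicit levels, this is where level 62 can arise: a number at level 60 or a left-to-right character at level 61 both end up at 62.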

3.4 Reordering Resolved Levels

The following algorithm describes the logical process of finding the correct display order. As described before, this logical process is not necessarily the actual implementation, which may diverge for efficiency as long as it produces the same results. As opposed to resolution phases, this algorithm acts on a per-line basis, and is applied after any line wrapping is applied to the paragraph.

The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see Section 8.2 Arabic of [Unicode]). Logically there are the following steps:

- The levels of the text are determined according to the bidirectional algorithm.
- The characters are shaped into glyphs according to their context (taking the embedding levels into account for mirroring).
- The accumulated widths of those glyphs (in logical order) are used to determine line breaks.
- For each line, rules L1-L4 are used to reorder the characters on that line.
- The glyphs corresponding to the characters on the line are displayed in that order.

L1. On each line, reset the embedding level of the following characters to the paragraph embedding level:

- segment separators,
- paragraph separators,
- any sequence of whitespace characters preceding a segment separator or paragraph separator, and
- any sequence of whitespace characters at the end of the line.

The types of characters used here are the original types, not those modified by the previous phase.
Since a Paragraph Separator breaks lines, there will be at most one per line, at the end of that line.

In combination with the following rule, this means that trailing white space will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph.
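Rule L1 can be sketched as a backward scan over one line. The function name `reset_separator_levels` is an assumption, and the types passed in must be the original character types, as noted above.

```python
def reset_separator_levels(original_types, levels, paragraph_level):
    """L1: reset levels of separators and adjacent whitespace on one line.

    original_types and levels describe the characters of a single line in
    logical order.
    """
    result = list(levels)
    # True while the scan is inside whitespace that is followed by a
    # separator or by the end of the line.
    resetting = True
    for i in range(len(result) - 1, -1, -1):
        bidi_type = original_types[i]
        if bidi_type in ("S", "B"):
            result[i] = paragraph_level
            resetting = True
        elif bidi_type == "WS" and resetting:
            result[i] = paragraph_level
        else:
            resetting = False
    return result
```

Scanning from the end of the line makes the "whitespace preceding a separator" and "whitespace at the end of the line" cases share one code path.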

L2. From the highest level found in the text to the lowest odd level on each line, including intermediate levels not actually present in the text, reverse any contiguous sequence of characters that are at that level or higher.

This reverses a progressively larger series of substrings. The following four examples illustrate this. In these examples, the paragraph embedding level for the first and third examples is assumed to be 0 (left to right direction), and for the second and fourth is assumed to be 1 (right to left direction).

Example 1 (embedding level = 0)

Memory:              car means CAR.

Resolved levels:     00000000001110

Reverse level 1:     car means RAC.

Example 2 (embedding level = 1)

Memory:              car MEANS CAR.

Resolved levels:     22211111111111

Reverse level 2:     rac MEANS CAR.

Reverse levels 1-2:  .RAC SNAEM car

Example 3 (embedding level = 0)

Memory:              he said "car MEANS CAR."

Resolved levels:     000000000222111111111100

Reverse level 2:     he said "rac MEANS CAR."

Reverse levels 1-2:  he said "RAC SNAEM car."

Example 4 (embedding level = 1)

Memory:              DID YOU SAY ‘he said "car MEANS CAR"’?

Resolved levels:     11111111111112222222224443333333333211

Reverse level 4:     DID YOU SAY ‘he said "rac MEANS CAR"’?

Reverse levels 3-4:  DID YOU SAY ‘he said "RAC SNAEM car"’?

Reverse levels 2-4:  DID YOU SAY ‘"rac MEANS CAR" dias eh’?

Reverse levels 1-4:  ?‘he said "RAC SNAEM car"’ YAS UOY DID
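Rule L2 can be sketched as repeated in-place reversal. The function name `reorder_line` is an assumption; it only reorders characters and does not apply the mirroring of rule L4.

```python
def reorder_line(text, levels):
    """L2: reorder one line of text for display from its resolved levels."""
    chars = list(text)
    odd_levels = [level for level in levels if level % 2 == 1]
    if not odd_levels:
        return text  # nothing is right-to-left on this line
    highest = max(levels)
    lowest_odd = min(odd_levels)
    # Work from the highest level down to the lowest odd level, including
    # intermediate levels that do not occur in the text.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)
```

The levels list is left in its original positions while characters move. This is safe because each reversal happens entirely inside a span whose positions all have a level at or above the current one, so the span boundaries found for lower levels are unaffected.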

L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed.

Many font designers provide default metrics for combining marks that support rendering by simple overhang. Because of the reordering for right-to-left characters, it is common practice to make the glyphs for most combining characters overhang to the left (thus assuming the characters will be applied to left-to-right base characters) and make the glyphs for combining characters in right-to-left scripts overhang to the right (thus assuming that the characters will be applied to right-to-left base characters). With such fonts, the display ordering of the marks and base glyphs may need to be adjusted when combining marks are applied to "unmatching" base characters. See Section 5.14, Rendering Non-Spacing Marks of [Unicode] for more information.

L4. A character that possesses the mirrored property as specified by Section 4.7, Mirrored of [Unicode] must be depicted by a mirrored glyph if the resolved directionality of that character is R.

For example, U+0028 left parenthesis—which is interpreted in the Unicode Standard as an opening parenthesis—appears as "(" when its resolved level is even, and as the mirrored glyph ")" when its resolved level is odd.
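A minimal sketch of rule L4 follows, using `unicodedata.mirrored` to test for the mirrored property. The `MIRROR_PAIRS` table is a small hypothetical stand-in for a real mirroring table and covers only a few ASCII brackets; `mirror_for_display` is likewise an assumed name.

```python
import unicodedata

# Hypothetical, deliberately partial table of mirrored counterparts.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def mirror_for_display(ch, resolved_level):
    """L4: return the character whose glyph should be drawn for ch.

    Characters with the mirrored property are swapped when the resolved
    level is odd (right-to-left); anything else is returned unchanged.
    """
    if resolved_level % 2 == 1 and unicodedata.mirrored(ch):
        return MIRROR_PAIRS.get(ch, ch)
    return ch
```

A production renderer would typically select a mirrored glyph rather than substitute a different character; swapping characters is used here only to keep the example self-contained.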

3.5 Shaping

Shaping is logically applied after the bidirectional algorithm is used, and limited to characters within the same directional run. For example, suppose that we have the following string of Arabic characters in memory as characters 1, 2, 3, and 4, and where the first two characters are overridden to be LTR. To show both paragraph directions, the next two are embedded, but with the normal RTL direction.

1	2	3	4
ج `062C JEEM`	ع `0639 AIN`	ل `0644 LAM`	م `0645 MEEM`
L	L	R	R

One can use embedding codes to get this effect in plain text, or use markup in HTML, as in the examples below. (Where two alternatives are separated by a slash, the second applies to the right-to-left paragraph direction.)

LRM/RLM LRO JEEM AIN PDF RLO LAM MEEM PDF
<p dir="ltr"/"rtl">LRO JEEM AIN PDF RLO LAM MEEM PDF</p>
<p dir="ltr"/"rtl"><bdo dir="ltr">JEEM AIN</bdo><bdo dir="rtl">LAM MEEM</bdo></p>

The resulting shapes will be the following, according to the paragraph direction:

Left-Right Paragraph

Right-Left Paragraph

1	2	4	3
ﺞ`JEEM-F`	ﻋ`AIN-I`	ﻢ`MEEM-F`	ﻟ`LAM-I`
