How to convert Glyph Indices returned by ScriptShape back to Unicode character codes?

Discussion:

Dhiren

2006-10-12 05:57:16 UTC

Hi,
I am using the Uniscribe Win32 APIs ScriptItemize and ScriptShape
for complex script handling in languages such as Arabic and
Hindi. In such languages, the shape of a character, i.e. its Unicode
character code, changes depending on its position in the
string. Therefore, in the case of the Windows Edit control (which uses
ExtTextOut to render the text), the window text obtained using the
Win32 API GetWindowText and the text being rendered differ from
each other for these complex languages.

Now, ScriptShape returns an array of glyph indices which can be
passed to the Win32 API ExtTextOut (by specifying the ETO_GLYPH_INDEX
flag in the fuOptions parameter) to render the text.
However, I do not need to render text, but I need the new Unicode
character codes that ExtTextOut generates from the glyph indices.

I know that the Win32 API GetGlyphIndices uses the font's CMAP table
to map character codes to glyph indices, however I need some method for
generating a reverse mapping from glyph indices back to character codes
(just as ExtTextOut does internally).

Please help !!

Regards,
Dhiren.

Mihai N.

2006-10-12 06:59:54 UTC

Post by Dhiren
Now, ScriptShape returns an array of glyph indices which can be
passed to the Win32 API ExtTextOut (by specifying the ETO_GLYPH_INDEX
flag in the fuOptions parameter) to render the text.
However, I do not need to render text, but I need the new Unicode
character codes that ExtTextOut generates from the glyph indices.
I know that the Win32 API GetGlyphIndices uses the font's CMAP table
to map character codes to glyph indices, however I need some method for
generating a reverse mapping from glyph indices back to character codes
(just as ExtTextOut does internally).

There is no way to do that.
You might try parsing the cmap table yourself, but that is not enough.
You also have to take care of ligatures, reordering, glyph variants, and so
on. You would basically have to reverse the rendering algorithm.

And even that is not enough. Take for instance the glyph for the fi ligature.
You have no way to know whether you got this glyph because you had
<U+0066 U+0069> or <U+F001>.
The same goes for Arabic presentation forms, and lots of other things.
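
The ambiguity can be illustrated with a toy shaper. This is only a sketch: the glyph indices, the `toy_cmap` table and the `toy_shape` function are made up for illustration and are not part of Uniscribe or of any real font.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical glyph indices for a made-up font; real values differ per font. */
enum { GLYPH_F = 73, GLYPH_I = 76, GLYPH_FI = 192 };

/* Toy "cmap": one code point -> one glyph (0 means no glyph in this toy font). */
uint16_t toy_cmap(uint32_t cp)
{
    switch (cp) {
    case 0x0066: return GLYPH_F;   /* f */
    case 0x0069: return GLYPH_I;   /* i */
    case 0xF001: return GLYPH_FI;  /* the single code point named in the post */
    default:     return 0;
    }
}

/* Toy shaper: cmap lookup plus a single f+i ligature substitution.
   out must have room for n glyphs; returns how many glyphs were written. */
size_t toy_shape(const uint32_t *text, size_t n, uint16_t *out)
{
    size_t count = 0;
    assert(text != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (i + 1 < n && text[i] == 0x0066 && text[i + 1] == 0x0069) {
            out[count++] = GLYPH_FI;  /* two characters collapse into one glyph */
            ++i;                      /* skip the 'i' that was consumed */
        } else {
            out[count++] = toy_cmap(text[i]);
        }
    }
    return count;
}
```

Shaping <U+0066 U+0069> and shaping <U+F001> both yield the single glyph `GLYPH_FI`, so nothing in the output says which input produced it.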

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Dhiren

2006-10-12 09:31:56 UTC

Post by Mihai N.

There is no way to do that.
You might try parsing the cmap table yourself, but that is not enough.
You also have to take care of ligatures, reordering, glyph variants, and so
on. You would basically have to reverse the rendering algorithm.
And even that is not enough. Take for instance the glyph for the fi ligature.
You have no way to know whether you got this glyph because you had
<U+0066 U+0069> or <U+F001>.
The same goes for Arabic presentation forms, and lots of other things.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------

Hi,

Thank you for the reply.

Actually I am working with the DirectX 9.0 API's D3DXCreateText
function to create 3D text. D3DXCreateText requires a string of
character codes to create the 3D text. However, D3DXCreateText does not
do complex script processing for languages such as Arabic and Hindi, as
the Windows API ExtTextOut does. D3DXCreateText just takes the
character code for one character at a time, passes it to
GetGlyphOutline and then generates 3D text from the glyph data.

However, for languages such as Arabic, the shape of a character
(i.e. its Unicode character code) changes depending on its position in
the string. Hence I wanted to emulate the functionality of the
ExtTextOut API to obtain the actual character codes that get painted in
a window by ExtTextOut. To obtain these new character codes,
I was using the Uniscribe API ScriptShape, which returns the glyph
indices.

For example, after installing fonts to support Hindi, if Hindi is
selected as the language in the Language bar and the Devanagari
character for Shra (used in the name SHRAvan) is entered by pressing
Shift + 8, the character appears correctly in the Windows Edit box
control. When I then call GetWindowText() I get a string of 3 characters.
After that, I create a memory DC into which I select the Arial Unicode
MS font (since it supports character sets for almost all languages). I
then pass the string to ScriptItemize and then to ScriptShape.

ScriptShape returns a single resultant glyph index for the
input string of 3 characters. The value of this glyph index is 7085. To
check whether 7085 is the glyph index for any character code from 0 to
65535 in the Arial Unicode MS font, I also called the GetGlyphIndices
function 65535 times, passing it a string of length 1 for
each character code, and wrote the character code to glyph index mapping
to a file. However, the glyph index did not match for any
character code from 0 to 65535. In fact, I found that there were 4
sets of ranges in ascending order for the glyph indices. They were
3-5428, 8355-..., ...-... and 5429...64.. (I don't remember all the
values). No index anywhere near 7000 appeared in the list of
glyph indices.

Earlier I had hooked the ExtTextOut API to my own function using
DLL hooking code downloaded from the Net. I found that the
ETO_GLYPH_INDEX flag was set in the fuOptions parameter in the call to
ExtTextOut. This means that the Edit control uses either the Uniscribe
API ScriptShape or the Win32 API GetCharacterPlacement, obtains a list
of glyph indices from it, and then passes the glyph indices to
ExtTextOut. Now, using these glyph indices, how does ExtTextOut look up
the font data to draw the correct character in a window? If I am also
able to somehow use glyph indices to look up the font data for the
corresponding character codes, then my problem will be solved.

Regards,
Dhiren.

Dhiren

2006-10-12 09:40:03 UTC

Post by Mihai N.

There is no way to do that.
You might try parsing the cmap table yourself, but that is not enough.
You also have to take care of ligatures, reordering, glyph variants, and so
on. You would basically have to reverse the rendering algorithm.
And even that is not enough. Take for instance the glyph for the fi ligature.
You have no way to know whether you got this glyph because you had
<U+0066 U+0069> or <U+F001>.
The same goes for Arabic presentation forms, and lots of other things.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------

Hi,

Thank you for the reply.

Actually I am working with the DirectX 9.0 API's D3DXCreateText
function to create 3D text. D3DXCreateText requires a string of
character codes to create the 3D text. However, D3DXCreateText does not
do complex script processing for languages such as Arabic and Hindi, as
the Windows API ExtTextOut does. D3DXCreateText just takes the
character code for one character at a time, obtains its corresponding
glyph using the CMAP table of the font and creates 3D text using that
glyph's shape.

However, for languages such as Arabic, the shape of a character
(i.e. its Unicode character code) changes depending on its position in
the string. Hence I wanted to emulate the functionality of the
ExtTextOut API to obtain the actual character codes that get painted in
a window by ExtTextOut. To obtain these new character codes,
I was using the Uniscribe API ScriptShape, which returns the glyph
indices.

For example, after installing fonts to support Hindi, if Hindi is
selected as the language in the Language bar and the Devanagari
character for Shra (used in the name SHRAvan) is entered by pressing
Shift + 8, the character appears correctly in the Windows Edit box
control. When I then call GetWindowText() I get a string of 3 characters.
After that, I create a memory DC into which I select the Arial Unicode
MS font (since it supports character sets for almost all languages). I
then pass the string to ScriptItemize and then to ScriptShape.

ScriptShape returns a single resultant glyph index for the
input string of 3 characters. The value of this glyph index is 7085. To
check whether 7085 is the glyph index for any character code from 0 to
65535 in the Arial Unicode MS font, I also called the GetGlyphIndices
function 65535 times, passing it a string of length 1 for
each character code, and wrote the character code to glyph index mapping
to a file. However, the glyph index did not match for any
character code from 0 to 65535. In fact, I found that there were 4
sets of ranges in ascending order for the glyph indices. They were
3-5428, 8355-..., ...-... and 5429...64.. (I don't remember all the
values). No index anywhere near 7000 appeared in the list of
glyph indices.

Earlier I had hooked the ExtTextOut API to my own function using
DLL hooking code downloaded from the Net. I found that the
ETO_GLYPH_INDEX flag was set in the fuOptions parameter in the call to
ExtTextOut. This means that the Edit control uses either the Uniscribe
API ScriptShape or the obsolete Win32 API GetCharacterPlacement,
obtains a list of glyph indices from it, and then passes the glyph
indices to ExtTextOut. Now, using these glyph indices, how does
ExtTextOut look up the font data to draw the correct character in a
window? If I am also able to somehow use glyph indices to look up the
font data for the corresponding character codes, then my problem will
be solved.

Regards,
Dhiren.

James Brown

2006-10-12 12:42:53 UTC

Post by Dhiren
Hi,
Thank you for the reply.
Actually I am working on the DirectX 9.0 API's D3DXCreateText
function to create 3DText. Now D3DXCreateText requires a string of
character codes to create the 3DText. However, D3DXCreateText does not
do complex script processing for languages such as Arabic and Hindi, as
is done by the Windows API ExtTextOut. D3DXCreateText just takes the
character code for one character at a time, obtains its corresponding
glyph using the CMAP table of the font and creates 3DText using that
glyph's shape.
However, for languages such as Arabic, the shape of a character
(i.e. its Unicode character code) changes depending on its position in
the string. Hence I wanted to emulate the functionality of the
ExtTextOut API to obtain the actual character codes that get painted in
a window by ExtTextOut. Therefore, to obtain these new character codes,
I was using the Uniscribe API ScriptShape, which returns me the glyph
indices.

You are referring to the process of contextual shaping. However, even though
the string looks different,
there are no 'new character codes' to obtain. The character codes do _not_
change. Only their appearance changes due to surrounding characters, which
is why you sometimes get different glyphs for the same character.

There is a "many : many" mapping between characters and glyphs. It
is impossible to perform the reverse mapping because:

A single character can result in multiple glyphs being generated.
A single character can result in one glyph being generated.
A single character can result in a *different* glyph being generated,
depending on that character's context.
Multiple characters (combining sequences) can result in a single glyph or
multiple glyphs.

Any and all of the above combinations can occur when rendering Unicode
text. The font you are using also has a strong impact on the glyph-generation
process. Without further knowledge of how Unicode works, you need to
accept that you cannot map a glyph back to a character code.
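
As a rough illustration of the context-dependent case, the sketch below picks one of four glyphs for the same letter depending on its neighbours. It assumes every letter in the run is dual-joining, and the glyph indices and function names (`pick_form`, `shape_run`) are invented; a real shaper follows the Unicode joining rules and the font's OpenType tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical glyph indices for the four shapes of one letter in a made-up font. */
enum { G_ISOLATED = 500, G_INITIAL = 501, G_MEDIAL = 502, G_FINAL = 503 };

/* Pick the glyph for a dual-joining letter from its neighbours (logical order). */
uint16_t pick_form(int joins_prev, int joins_next)
{
    if (joins_prev && joins_next) return G_MEDIAL;
    if (joins_prev)               return G_FINAL;
    if (joins_next)               return G_INITIAL;
    return G_ISOLATED;
}

/* Shape a run of n identical dual-joining letters (for example U+0628 repeated).
   The input code point never changes; only the chosen glyph does. */
size_t shape_run(size_t n, uint16_t *glyphs)
{
    assert(glyphs != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        glyphs[i] = pick_form(i > 0, i + 1 < n);
    return n;
}
```

A run of three identical characters comes out as three different glyphs (initial, medial, final), while a single character on its own gets a fourth (isolated). Seeing one of those glyphs says which shape was drawn, but the character code behind all four is the same.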

Post by Dhiren
For example, after installing fonts to support Hindi, if Hindi is
selected as the language in the Language bar, and if the Devanagari
character for Shra (used in the name SHRAvan) is generated by pressing
Shift + 8, the character appears correctly in the Window Edit box
control. Now, when I do GetWindowText() I get a string of 3 characters.
After that, I create a Memory DC into which I select the Arial Unicode
MS font (since it supports character sets for almost all languages). I
then pass the string to ScriptItemize and then to ScriptShape.
Now ScriptShape returns me a single resultant glyph index for the
input string of 3 characters. The value of this glyph index is 7085. To
check whether 7085 is the glyph index for any character code from 0 to
65535 in the Arial Unicode MS font, I also used the GetGlyphIndices
function 65535 times, passing it a string of length 1 corresponding to
each character code and wrote the char code - glyph index mapping to a
file. However, I found out that the glyph index did not match for any
character code from 0 to 65535. In fact, I found out that there were 4
sets of ranges in ascending order for the glyph indices. They were
3-5428, 8355-..., ...-... and 5429...64.. (don't remember all the
values). Any index even close to 7000 did not figure in the list of
glyph indices.

Please don't take this the wrong way, but I would suggest not wasting any
more time on this approach.

Post by Dhiren
Earlier I had hooked the ExtTextOut API to my own function using
a DLL hooking code downloaded from the Net. I found out that the
ETO_GLYPH_INDEX flag was set for the fuOptions parameter in the call to
ExtTextOut. This means, that the Edit control either uses the Uniscribe
API ScriptShape or the obsolete Win32 API GetCharacterPlacement,
obtains a list of glyph indices from them, and then passes the glyph
indices to ExtTextOut. Now, using these glyph indices, how does
ExtTextOut look up the font data to draw the correct character in a
window? If I am also able to somehow use glyph indices to look up the
font data for the corresponding character codes, then my problem will
be solved.
Regards,
Dhiren.

On Windows XP at least, the EDIT control uses the Uniscribe ScriptString
API, so it is Uniscribe calling *back into* ExtTextOut with the
ETO_GLYPH_INDEX flag.

Once ExtTextOut has 'glyph data' it does not really look up any font data to
'draw the correct character' as you say. The font data has already been
accessed. Uniscribe is a wrapper library over the lower-level OpenType
services that Windows provides. The process of generating glyph indices from
character codes is extremely complex and requires access to a font's
internal OpenType tables. Uniscribe makes this process a little more
bearable. But once you have a glyph index there is nothing more to do.
ExtTextOut does not 'draw a character'. It draws a vector graphic (with a
particular glyph index) contained in the font you are using.

So in summary:

It is impossible to map glyph-indices back to character-codes. You must
maintain your original Unicode string, and use the Uniscribe 'logical
character attribute' array to perform the mapping from Unicode characters ->
glyph indices.
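
A minimal sketch of that bookkeeping, assuming the array meant here is the per-character cluster map that ScriptShape fills in (`pwLogClust`), where each character stores the position of the first glyph of its cluster. The helper name `chars_for_glyph` is hypothetical, plain `uint16_t` stands in for `WORD`, and only left-to-right runs are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the characters of the ORIGINAL string that produced the glyph at
   position glyph_pos. log_clust has one entry per character and holds the
   position of the first glyph of that character's cluster (left-to-right
   runs only in this sketch). On success stores the first character index and
   the character count and returns 1; returns 0 if no cluster covers glyph_pos. */
int chars_for_glyph(const uint16_t *log_clust, size_t n_chars,
                    size_t n_glyphs, size_t glyph_pos,
                    size_t *first_char, size_t *char_count)
{
    assert(log_clust != NULL && first_char != NULL && char_count != NULL);
    if (glyph_pos >= n_glyphs)
        return 0;
    for (size_t i = 0; i < n_chars; ) {
        size_t j = i + 1;
        while (j < n_chars && log_clust[j] == log_clust[i])
            ++j;                       /* characters i..j-1 share one cluster */
        /* The cluster's glyphs end where the next cluster's glyphs begin. */
        size_t glyph_end = (j < n_chars) ? log_clust[j] : n_glyphs;
        if (glyph_pos >= log_clust[i] && glyph_pos < glyph_end) {
            *first_char = i;
            *char_count = j - i;
            return 1;
        }
        i = j;
    }
    return 0;
}
```

For the Shra example from this thread, three characters produce one glyph, so the cluster map would be {0, 0, 0} and glyph position 0 maps back to characters 0 through 2 of the string you kept. The lookup works only because the original string is still available; the glyph index alone carries no such information.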

Lots of info about Uniscribe in the tutorials section of my website:

--
James Brown
Microsoft MVP - Windows SDK
www.catch22.net
Free Win32 Tutorials and Sourcecode

Mihai N.

2006-10-13 07:19:23 UTC

Post by Dhiren
Now, using these glyph indices, how does
ExtTextOut look up the font data to draw the correct character in a
window? If I am also able to somehow use glyph indices to look up the
font data for the corresponding character codes, then my problem will
be solved.

James Brown explained this already, but just to make sure it is not lost in
his longer post: ExtTextOut does not need to retrieve a character code.
It has the glyph index, so it just takes the glyph's vector outline and
draws it.
You can do the same with GetGlyphOutline, which is public.
Using GGO_NATIVE seems quite complicated, and the structures are not fully
documented, but you can try GGO_BEZIER.
You will get Bezier curves describing your glyph, and this is something
you should be able to use to somehow create a 3D object (it might not be easy;
I don't know enough about Direct3D).
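
One possible first step toward a 3D mesh is to flatten each cubic Bezier segment into short straight lines that can then be triangulated and extruded. The sketch below assumes the control points have already been converted from the fixed-point values in the outline buffer to `double`; `Point2`, `cubic_bezier_at` and `flatten_cubic` are illustrative names, not Win32 or D3DX APIs. When the input is a glyph index rather than a character code, GetGlyphOutline also needs the GGO_GLYPH_INDEX flag.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y; } Point2;

/* Evaluate a cubic Bezier with control points p0..p3 at parameter t in [0, 1]. */
Point2 cubic_bezier_at(Point2 p0, Point2 p1, Point2 p2, Point2 p3, double t)
{
    double u = 1.0 - t;
    double b0 = u * u * u;          /* Bernstein weights of the four points */
    double b1 = 3.0 * u * u * t;
    double b2 = 3.0 * u * t * t;
    double b3 = t * t * t;
    Point2 r;
    r.x = b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x;
    r.y = b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y;
    return r;
}

/* Approximate the curve with `segments` straight lines. Writes segments + 1
   points to out (which must have room for them) and returns that count. */
size_t flatten_cubic(Point2 p0, Point2 p1, Point2 p2, Point2 p3,
                     size_t segments, Point2 *out)
{
    assert(out != NULL && segments > 0);
    for (size_t i = 0; i <= segments; ++i)
        out[i] = cubic_bezier_at(p0, p1, p2, p3, (double)i / (double)segments);
    return segments + 1;
}
```

A fixed segment count keeps the sketch short; an adaptive subdivision based on curve flatness would give fewer points on nearly straight parts of the outline.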

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email