Guidance Forums / Reginald Rexx / Unicode and WideCharToMultiByte-Guidance指路人

1. Unicode and WideCharToMultiByte

#12372

Posted by: Michael S 2008-09-19 21:24:59

I "know" that Reginald doesn't support Unicode, but does the following make any sense to you, Jeff? We have DB2 tables on the mainframe, and some of them contain columns with Unicode data (stored as UTF-8). If I read them using your ODBC code, I get unreadable "garbage" (as expected). I found the following code:


int FileUnicodeToAnsi(const CString &str1, const CString &str2)
{
    try
    {
        // open the unicode file
        CFile file(str1, CFile::modeRead);
        int iFileLen = file.GetLength();
        PSTR buffer = new char[iFileLen];
        PSTR pMultiByteStr = new char[iFileLen/2];
        BOOL bUsedDefaultChar;
        file.Read(buffer, iFileLen);
        // convert to ansi string
        // (buffer+2 skips the 2-byte byte-order mark at the start of the file)
        WideCharToMultiByte(CP_ACP, 0, (LPCWSTR)(buffer+2), -1, pMultiByteStr, iFileLen/2, _T(" "), &bUsedDefaultChar);
        file.Close();
        // create the ansi file
        file.Open(str1 + str2, CFile::modeCreate | CFile::modeWrite);
        file.Write(pMultiByteStr, iFileLen/2);
        file.Close();
        delete[] buffer;
        delete[] pMultiByteStr;
    }
    catch (CFileException* e)
    {
        TCHAR szCause[255];
        e->GetErrorMessage(szCause, 255);
        AfxMessageBox(szCause);
        e->ReportError();
        e->Delete();
    }
    return 0;
}

but since I don't "speak" C, it makes no sense to me. Is WideCharToMultiByte something that could be funcdef'ed, and if so, do you think it would help to make the ODBC data more readable?
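For readers who, like the poster, don't read C: the function above reads a file of 2-byte (UTF-16) characters, skips the 2-byte marker at the start, converts the text to the Windows "ANSI" code page, and writes the result to a new file. Below is a rough Python sketch of the same steps. The function names, the cp1252 code page, and the use of "?" for unconvertible characters are assumptions made for illustration; the C code uses the system's default ANSI code page and a space as the default character.

```python
import os
import tempfile


def utf16_file_to_ansi(src_path, dst_path, codepage="cp1252"):
    """Read a UTF-16 text file and rewrite it in a single-byte code page."""
    with open(src_path, "rb") as f:
        raw = f.read()
    # The "utf-16" codec reads the byte-order mark at the start of the data
    # and strips it (the C code does this by skipping the first 2 bytes).
    text = raw.decode("utf-16")
    # Characters missing from the target code page become "?".
    ansi = text.encode(codepage, errors="replace")
    with open(dst_path, "wb") as f:
        f.write(ansi)
    return len(ansi)


def demo():
    """Round-trip a small sample through temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "unicode.txt")
        dst = os.path.join(tmp, "ansi.txt")
        with open(src, "wb") as f:
            f.write("SCENIC SOUNDS".encode("utf-16"))  # written with a BOM
        written = utf16_file_to_ansi(src, dst)
        with open(dst, "rb") as f:
            return written, f.read()
```

Calling `demo()` returns the number of bytes written and the converted bytes. This kind of conversion only helps when the input really is UTF-16; data stored as UTF-8 would need `raw.decode("utf-8")` instead.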

2. Unicode and WideCharToMultiByte

#12373

Posted by: cliff 2008-09-20 00:57:24

Michael,

I have a process that converts the MS SQL Server Unicode error log to an ASCII text file.

Here is the code I used.

charcnt = chars(logBackup_file)
if charcnt < 1 then charcnt = 1
startpros = 'n'

/* The Unicode log stores each character as 2 bytes, so step 2 bytes at a time */
do li = 1 to charcnt by 2
    Ucode = charin(logBackup_file, li, 2)
    logchar = convertstr(Ucode, 'U')
    eol = c2x(logchar)

    if eol = '0A' then do
        /* End of line: write out the accumulated line and start a new one */
        linecnt = linecnt + 1
        call lineout LogWork_File, logline, linecnt
        logline = ''
        proscnt = proscnt + 1
    end /* if eol */
    else do
        logline = logline || logchar
    end /* else eol */
end /* do */

call lineout LogWork_File

cliff:-)

3. Unfortunately, that didn't work

#12375

Posted by: Michael S 2008-09-22 22:09:06

I'm guessing that because Jeff's ODBC code doesn't support Unicode, when I read a row of data (which includes a unicode column), it's either not getting translated, or is getting translated incorrectly. Long and the short of it, I've written the following code. First, a call to the attached script that sets up the translation tables


rc = character_conversion_tables('DA', db2_unicode, ansi)

This sets up 2 translation tables, one for the DB2 unicode data, and one that is the equivalent ansi data. Then, in my actual code, once I've fetched a column of data that I know is defined as unicode, I have a call


data.i = TRANSLATE(data.i, ansi, db2_unicode)

character_conversion_tables is attached to this append

character_conversion_tables.rex
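For anyone unfamiliar with TRANSLATE, here is a small Python sketch of what a table-driven translation like the one above does: every byte found in the input table is replaced by the byte at the same position in the output table. The three-entry tables are hypothetical and exist only to show the mechanism; the real tables come from the attached script, which isn't reproduced here.

```python
def rexx_translate(data, table_out, table_in):
    """Mimic REXX TRANSLATE(data, table_out, table_in) on byte strings."""
    if len(table_in) != len(table_out):
        raise ValueError("both tables must have the same length")
    # bytes.maketrans builds a 256-entry lookup table; bytes that are not
    # listed in table_in pass through unchanged.
    return data.translate(bytes.maketrans(table_in, table_out))


# Hypothetical three-entry tables, for illustration only.
received = bytes([0xEB, 0xE0, 0xE1])   # bytes as fetched from the column
readable = b"SDE"                      # what each of them should become

result = rexx_translate(bytes([0xEB, 0xE1, 0xE1, 0xE0]), readable, received)
```

A byte-for-byte table like this can only map one byte to one byte, so it cannot handle characters that UTF-8 stores as 2 or more bytes.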

#12381

Posted by: Jeff Glatt 2008-09-25 10:32:11

Ironically, I've been working with various text encodings recently. There are basically 4 standard encodings used today. (Well, that IBM thing on its mainframes is still around, but that's going to disappear at some point).

There is ISO-8859-1 (also known as Latin-1, and often loosely called "ANSI" or "extended ASCII"). This is one of the earliest surviving electronic encodings used to represent human text. Every character is an 8-bit byte. This encoding can support up to 256 unique characters. The first 128 (ASCII) were defined to encompass the 26 letters in the English alphabet (lower, as well as upper, case 'a' to 'z'), the ten numeric digits ('0' to '9'), and English punctuation such as a question mark or a comma. Later, some characters/punctuation present in western European languages (but not English), such as some German chars with umlauts, or the upside down question mark for Spanish, were added to fill up the remaining 128 chars. So, this encoding suffices for many western European languages, and the Americas. (NOTE: This is the encoding used by REXX for its strings). But that left a bunch of other countries out of the loop, such as Russia and China, because there wasn't any more room to add new chars to accommodate their alphabets/languages.

So, a couple of different approaches were taken to address this.

In one approach, referred to as utf-16, each character is 2 (instead of 1) 8-bit bytes. This allows for 65,536 unique chars. That gave a lot more room to add the chars of other languages not supported by ISO-8859-1. utf-16 is also referred to as unicode on the Windows operating system. It's the encoding used by the Windows operating system internally, which is why there are versions of Windows for Russian, Chinese, etc. (ie, Error messages, menu items, file names, etc, can be displayed in the native languages of those countries). utf-16 is also used by Windows applications that support all those languages too.

But utf-16 is rather wasteful if your language can be accommodated by ISO-8859-1. That's because, whereas ISO-8859-1 uses 1 byte for each char, utf-16 uses 2 bytes for the same char. So if you have a text file in ISO-8859-1 encoding whose size is 30K, when you rewrite it in utf-16 encoding, its size doubles to 60K. So someone came up with an idea. He decided to keep the first 128 chars (ASCII) as single bytes, but with a special caveat. Any byte whose value is 128 or higher (ie, has its high bit set) will have a special meaning. It won't be a char itself. Rather, a lead byte (in the range C2 to F4 hex) means that the next 1 to 3 bytes (each in the range 80 to BF hex) should be combined with it to form a single char. This encoding is called utf-8. It's the encoding commonly used on Linux for internationalization (ie, support of all human languages). The benefit of utf-8 is that it's very efficient for English, and still fairly efficient for western European languages, as well as backward compatible with ASCII. A text string containing only ASCII chars will display fine on a utf-8 system with no translation required. But an ISO-8859-1 char above 127, such as a letter with an umlaut, has to be re-encoded as 2 bytes in utf-8. On the other hand, an ISO-8859-1 encoded string needs to be internally translated to utf-16 on a utf-16 system, which is what Windows does internally when given an ISO-8859-1 encoded string.

As an aside, I once told Mike that it was very bad to use the hex value FF for his own purpose in a string. The Windows text controls, while they don't display utf-8 text (they support either ISO-8859-1 or utf-16), appear to ignore all the chars after an FF. In other words, the control thinks the text is ISO-8859-1, and displays all those chars, up to the point it hits an FF, and then it just stops right there in the string. Whatever the control's reason, it isn't utf-8 decoding: FF is never a valid byte in utf-8.

Now, all this is fine and dandy, but then people started adding more and more chars. Traditional Chinese has tons of chars. And they did things like add even Egyptian hieroglyphs so that researchers could write term papers with hieroglyphs in them. So guess what? Even utf-16 ran out of chars. They had to resort to a similar trick in utf-16 to support even more additions. A range of special 2-byte values (the "surrogates", D800 to DFFF hex) was reserved, so that two of them in a row combine to form one char. So, someone came up with utf-32 (also known as ucs4). Each char is now not 1, not 2, not 3, but 4 whole bytes. This gives room for over 4 billion unique values (Unicode itself stops at 1,114,112), enough for even esperanto additions, and maybe martian and plutonian (but at some point we'll need utf-64 to accommodate other alien dialects). So take that 30K ISO-8859-1 encoded text file, and when you rewrite it in utf-32 format it jumps to 120K. utf-32 is not widely used, but some Linux compilers do use it for their "unicode" strings.
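The size arithmetic above is easy to check. The following Python sketch encodes the same text in each of the encodings discussed and counts the bytes. The sample strings and the helper name are assumptions chosen for illustration, and the "-le" codec variants are used so that no byte-order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the byte count of text in each encoding discussed above."""
    sizes = {}
    for name in ("latin-1", "utf-8", "utf-16-le", "utf-32-le"):
        sizes[name] = len(text.encode(name))
    return sizes


plain = "SCENIC SOUNDS"            # 13 chars, all plain ASCII
accented = "Gr\u00fc\u00dfe"       # "Grüße": 5 chars, 2 of them above 127

plain_sizes = encoded_sizes(plain)
accented_sizes = encoded_sizes(accented)

# A char beyond the first 65,536 (an Egyptian hieroglyph) cannot fit in
# one 2-byte utf-16 unit, so utf-16 stores it as a pair of units.
hieroglyph = "\U00013000"
hieroglyph_utf16 = len(hieroglyph.encode("utf-16-le"))
```

For the plain ASCII string the counts are 13, 13, 26, and 52 bytes. For the accented string they are 5, 7, 10, and 20: each accented char costs 2 bytes in utf-8 but only 1 in ISO-8859-1. The hieroglyph takes 4 bytes in utf-16.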

Now, RxOdbc does no translation (ie, re-encoding) of strings. It gets a bunch of bytes and assumes it's ISO-8859-1. The string is given to Reginald who also assumes it's ISO-8859-1. You're probably trying to display the text with a REXX GUI control which also assumes ISO-8859-1. (REXX GUI uses the ISO-8859-1 version of all the Windows controls -- the ones that assume they're being given ISO-8859-1 strings, and internally translate them to utf-16. REXX GUI does not use the utf-16 versions of the Windows controls. After all, REXX GUI assumes it is being given REXX strings, which are all supposed to be ISO-8859-1). So you're taking a utf-8 string, which the entire software chain is assuming to be ISO-8859-1. Of course, you're seeing garbage.
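To illustrate the kind of garbage this produces, here is a short Python sketch that encodes text as utf-8 and then reads the bytes back as if they were ISO-8859-1, which is what the software chain described above would effectively do. The function names and sample strings are assumptions for illustration only.

```python
def misread_as_latin1(text):
    """Encode text as utf-8, then decode the bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the misreading, provided the garbled string was not altered."""
    return garbled.encode("latin-1").decode("utf-8")


ascii_only = misread_as_latin1("SCENIC SOUNDS")     # survives unchanged
accented = misread_as_latin1("Gr\u00fc\u00dfe")     # "Grüße" turns to garbage
restored = repair(accented)
```

Plain ASCII text comes through untouched, while each accented char turns into 2 unrelated chars. Because ISO-8859-1 assigns a char to every byte value, the misreading loses nothing and can be reversed as long as the string has not been modified in between.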

First of all, Reginald is indeed storing a utf-8 string (if that's what it really is). As long as you don't attempt to process the string with some function/instruction that alters it based upon the ISO-8859-1 assumption (such as running it through CHANGESTR), it will remain legitimate utf-8. You could write your own string manipulation routines to handle utf-8 (provided you do some reading up on utf-8 encoding). But you've still got a problem if you need to display it. REXX GUI doesn't do utf-16 nor utf-8. CONVERTSTR() is a wrapper around WideCharToMultiByte() and MultiByteToWideChar() which will let you convert utf-16 to utf-8, or vice versa. But with REXX GUI supporting only ISO-8859-1, that translation isn't going to help with displaying strings. What you'd need is the entire REXX stack redone as unicode.

On the other hand, if your utf-8 string contains nothing but plain ASCII chars (English letters, digits, and punctuation), don't bother with translation. Just treat it like an ISO-8859-1 (regular REXX) string. After all, a utf-8 string with none of those "extra chars" is the same as an ISO-8859-1 string. (Accented western European chars don't qualify: utf-8 stores each of them as 2 bytes.) Have you tried treating it as just an ordinary REXX string?

5. UTF

#12382

Posted by: Michael S 2008-09-25 14:44:05

Have you tried treating it as just an ordinary REXX string?

Yep - that's where this topic started. We store some unicode DB2 columns on the mainframe as just that, so an example of the string SCENIC SOUNDS shown in hex on the mainframe looks like this

ë{á+ñ{.ë!í+àë
5444442545445
335E9303F5E43

If I then read this data and show it using Reginald's GUI, I simply see garbage, hence the original query, resulting in a simple "translation" script that shows the results as SCENIC SOUNDS.
The problem will probably arise when I try and show Polish special characters, but hey, you can't win 'em all.

#12384

Posted by: Jeff Glatt 2008-09-26 09:15:15

Huh? That hexadecimal translation is definitely not "SCENIC SOUNDS" in utf8. If it were, it would be exactly the same as ISO-8859-1, and therefore be the following 13 hex bytes:

53 43 45 4e 49 43 20 53 4f 55 4e 44 53

Apparently, the data you have is not utf8. I have no idea what it is because it isn't ISO-8859-1, utf8, utf16, nor ucs4.

7. Slight misunderstanding I think

#12385

Posted by: Michael S 2008-09-26 15:28:24

The data being shown has to be read, top to bottom, left to right. The first byte is 53 (column 1, rows 2 & 3). Second byte is 43 (column 2, rows 2 & 3) etc etc.
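The column-wise reading described here can be sketched in a few lines of Python. The two strings are the second and third rows of the mainframe display quoted earlier in the thread; the function name is an assumption for illustration.

```python
def read_vertical_hex(high_row, low_row):
    """Pair the two nibble rows column by column and return the bytes."""
    if len(high_row) != len(low_row):
        raise ValueError("both rows must have the same length")
    return bytes(int(h + l, 16) for h, l in zip(high_row, low_row))


raw = read_vertical_hex("5444442545445", "335E9303F5E43")
hex_view = raw.hex(" ")        # needs Python 3.8 or later
text = raw.decode("ascii")
```

Here `hex_view` is `53 43 45 4e 49 43 20 53 4f 55 4e 44 53` and `text` is `SCENIC SOUNDS`.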

#12386

Posted by: Jeff Glatt 2008-09-26 15:44:41

Well, however you read it, it's not utf8.

9. Don't understand your last comment

#12387

Posted by: Michael S 2008-09-26 15:55:00

Reading it top to bottom, left to right, the contents are

53 43 45 4e 49 43 20 53 4f 55 4e 44 53

which you stated just above

it would be exactly the same as ISO-8859-1, and therefore be the following 13 hex bytes:

In reality, the whole exercise is a bit academic. From what I can understand, you need to connect to your database using ODBC with a call of either SQLConnectW (I think it was) or SQLDriverConnectW - neither of which is supported in Reginald at the moment. So, as I mentioned above, I read the data and then translate it myself to something readable.

10.

#12388

Posted by: Jeff Glatt 2008-09-26 17:49:32

SQLConnectW shouldn't make any difference to what you get back. If you're indeed getting back those 13 bytes, that's an ISO-8859-1 string. I have no idea why it's not displaying. I think what you're really getting back is something entirely different than what you think you're getting. (ie, You're not getting back a utf8 string). Either that, or for some reason, you're doing some unnecessary extra translation. (You're not applying an EBCDIC translation to a utf8 string? utf8 isn't EBCDIC).
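The EBCDIC question can be tried out in Python, which ships a few EBCDIC code pages. This is only a sketch of the hypothesis: it assumes code page cp500, whereas the mainframe's actual code page isn't stated in the thread, and the function names are made up for illustration.

```python
def ebcdic_view(raw, codepage="cp500"):
    """Show how raw bytes would look if they were taken to be EBCDIC."""
    return raw.decode(codepage)


def undo_ebcdic_translation(garbled, codepage="cp500"):
    """Reverse an EBCDIC translation that should not have been applied."""
    return garbled.encode(codepage)


utf8_bytes = "SCENIC SOUNDS".encode("utf-8")
garbled = ebcdic_view(utf8_bytes)              # unreadable accented chars
recovered = undo_ebcdic_translation(garbled)   # the original 13 bytes again
```

Under this assumption, `garbled` should resemble the first line of the mainframe display quoted earlier (it begins with "ë"), which would be consistent with utf-8 bytes being treated as EBCDIC somewhere along the way. Since the translation is one byte to one byte, reversing it restores the original data.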

11. Okay - here are some results.

#12390

Posted by: Michael S 2008-09-29 15:19:54

On the mainframe, when I review the hex contents of a row, I'm seeing the following