The character trick !!

Posted: July 27, 2008 in C++ tips, Code Drinks, Coding Tips, IT, Programming Tips

Well, back to blogging after a really long time. The last few months have been challenging and left me no time to blog, especially with 18 to 20 hour work days and little or almost no sleep. Now that the pressure is a bit off, I am back at my code cafe with lots of lessons from the last 4 months, which I will share slowly and steadily so that I don’t miss anything.

In this series I will start with the most challenging problem I faced, which was a character set issue. There is a lot of talk today about Unicode and multi-byte character sets. No doubt they are wonderful and help broaden the audience, but in a heterogeneous environment it is sometimes easier to stick to single-byte character sets. In my case the backend was a mainframe, the middleware ran on Linux, and the client was Unix, Windows or AS400.

Problems usually arise when dealing with binary data. Binary data contains high-value bytes (above 127), and on Linux or any other system running with a Unicode locale such a value can end up represented as a multi-byte character. My middleware server and client dealt purely with bytes, not with characters or strings. So when a high-value character arrived from the backend as a multi-byte character and was passed on to the client, it went out as 2 bytes, which I didn’t want. On the client end I was writing a fixed-length response file, where any byte beyond the expected count gives surprising results, and that is exactly what happened. Seen at the byte level, each multi-byte character behaved as two separate bytes. The data in the generated response file shifted to the right, and I saw data loss in the Windows response file and more dangerous results in the AS400 response file.
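To make the byte growth concrete, here is a minimal sketch. It assumes the high-value bytes were being treated as Latin-1 (ISO-8859-1) characters and re-encoded as UTF-8; that is my assumption for illustration and may not match the exact conversion in every setup. The function name latin1_byte_to_utf8 is made up for this example.

```cpp
#include <string>

// Hypothetical helper: encode one byte value (0..255), read as a Latin-1
// code point, as UTF-8. Values below 0x80 stay one byte; values from
// 0x80 upward become two bytes, which is what breaks a fixed-length record.
std::string latin1_byte_to_utf8(unsigned char b) {
    std::string out;
    if (b < 0x80) {
        out.push_back(static_cast<char>(b));
    } else {
        out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
        out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
    }
    return out;
}
```

Under that assumption the byte 0xE9 goes out as the two bytes 0xC3 0xA9, so every such character pushes the rest of a fixed-length record one position to the right.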

The fix:

The real fix was to stop Linux from turning a high-value character into a multi-byte character and have it use a single-byte character set instead. If you are working on a Linux system, setting the environment variable LANG to en_US instead of the default en_US.UTF-8 helps fix the problem, because the plain en_US locale typically uses a single-byte character set.

Just try:

export LANG=en_US (on Red Hat or Fedora Systems)

However, if for some reason you cannot set the environment variable, you will have to do a bit of math to turn your multi-byte character into a single-byte character. If you are dealing with strings, break the string into individual characters (each character can be stored in multiple bytes), then treat them as integers. Where char is a signed type, any high-value character will have an integer value less than 0. For such characters, add 256 to the integer value and use the new character. The final character array you receive will be in a single-byte system.
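The add-256 step described above can be sketched roughly as follows. This is only an illustration and not the exact code from my project; the function name to_byte_values is invented for the example, and it assumes the data is held in a std::string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: map every char of `raw` to its byte value (0..255).
// Where plain char is signed, bytes of 0x80 and above read back as negative
// integers, so 256 is added to recover the byte value. Where char is
// unsigned the value is never negative and is kept as it is.
std::vector<unsigned char> to_byte_values(const std::string& raw) {
    std::vector<unsigned char> values;
    values.reserve(raw.size());
    for (char c : raw) {
        int v = static_cast<int>(c);
        if (v < 0) {
            v += 256;  // e.g. -23 becomes 233 (0xE9)
        }
        values.push_back(static_cast<unsigned char>(v));
    }
    return values;
}
```

Casting each char straight to unsigned char gives the same values; the explicit addition is kept here only so the code mirrors the steps in the text.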

The way to verify the integrity of your string is to compare hex values. Before you convert your string into a single-byte character encoding, grab the hex value of the string. After the conversion, grab the hex value again. The two should match.
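A small hex dump helper is enough for this check. The sketch below is one possible way to write it; the name to_hex is made up for the example, and it assumes the bytes are held in a std::string.

```cpp
#include <string>

// Hypothetical helper: render each byte of `data` as two lowercase hex digits.
std::string to_hex(const std::string& data) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(data.size() * 2);
    for (char c : data) {
        // Go through unsigned char so high-value bytes are not sign-extended.
        unsigned char b = static_cast<unsigned char>(c);
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```

Call to_hex on the data before and after the conversion and compare the two strings. A mismatch, or a result that is longer than expected, means some bytes were changed or added along the way.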

I hope this information helps others who are working in a heterogeneous environment and saves them the time and energy that I had to put into resolving my problem.

