Friday, February 18, 2005

(Java) Character Streams vs Byte Streams

So many times I discover the use of something I had already learnt but had failed to connect to any practical purpose. This happens in programming as well as in administration. I usually get to that point only through sheer, frustrating trial and error. I guess it shows that, rather than just reading books about Java, actually trying it out on real problems helps me understand its various aspects a little better.

I knew there were character streams and byte streams in Java, but I always felt they were interchangeable. Today I hit a problem that clearly differentiated the two. Not that this understanding has solved the problem yet. I will have to implement the fix and see.

Situation: I have been using Apache FOP to transform XML + XSL -> FO -> PDF. This worked fine when I used the packaged Java FopServlet directly. This servlet would read XML (static, or generated dynamically by a Java servlet) and static XSL files via HTTP, transform them, and output the bytes of the resulting PDF file. This meant that I could not enforce any (decent) access control over the servlets generating my dynamic source XML, since FopServlet could not supply anything more than rudimentary (translation: hard-coded) authentication information. Not good enough.

Then I discovered Servlet Filters. This seemed to fix the problem. The filter would transform the output XML to PDF, and I could introduce other Filters to handle authentication and access control. Good for me. Everything was as I wanted it to be.

But then I discovered that my FopFilter was not preserving the UTF-8 character set during transformation. All the XML and XSL input is screaming UTF-8 wherever needed, and still the transformer was ignoring the character set. The code in the FopFilter is copied almost exactly from FopServlet, and nowhere during the transformation does it request a particular character set. I assumed the Xerces implementation would preserve whatever character set is declared in the source itself. Expected behavior, right? Except the PDF file didn't show it.

Epiphany: The only difference between the two was in the input. FopServlet was receiving a character stream over HTTP, whereas FopFilter was using the output buffer as input. It seems to me that the StreamSource read over HTTP delivers proper characters (single or multi-byte), while the output buffer, being a byte stream, ends up with every byte treated as its own single-byte character. Duh! That's why FopServlet outputs words like InVision™, StainEase®, while FopFilter was outputting words like InVisionâ„¢, StainEase®. Clearly, using a byte stream in FopFilter as the buffer that holds the servlet's output for post-processing was a bad idea on my part. I should have chosen a character stream, which preserves the characters whether they are single or multi-byte. (I think)
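The garbling above can be reproduced outside the servlet with a few lines of plain Java. This is only a sketch of what I suspect is happening: it assumes the buffered bytes are UTF-8 and that something downstream decodes them with a single-byte charset (windows-1252 here, which is a guess that happens to match the â„¢ output). The class and method names are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of FopServlet or FopFilter.
class MojibakeDemo {

    // Decode raw bytes into characters using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "InVision\u2122"; // InVision followed by the trademark sign

        // The trademark sign becomes three bytes (E2 84 A2) in UTF-8.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Assumption: the wrong decoder is windows-1252, one character per byte.
        String garbled = decode(utf8, Charset.forName("windows-1252"));
        String intact = decode(utf8, StandardCharsets.UTF_8);

        System.out.println("UTF-8 byte count:    " + utf8.length);
        System.out.println("garbled char count:  " + garbled.length());
        System.out.println("intact equals input: " + intact.equals(original));
    }
}
```

The bytes themselves are never damaged. The damage happens at the moment they are turned back into characters without naming the right charset, which is why keeping the data as characters all the way avoids the issue.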

Experiment: I will change the buffer in the Filter's ResponseWrapper from a ByteArrayOutputStream to a CharArrayWriter.

Let's see if this fixes my problem. A Java guide says it should.
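Roughly, the character-based buffer could look like the sketch below. It uses only java.io and javax.xml.transform so it can run on its own. In the real filter, the PrintWriter would be what the ResponseWrapper returns from getWriter(), and the StreamSource would be handed to the transformer. The class name and its methods are hypothetical, not taken from FopServlet.

```java
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.PrintWriter;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in for the buffering part of a response wrapper.
class CharBufferSketch {

    // Characters written by the wrapped servlet end up here, still as chars.
    private final CharArrayWriter buffer = new CharArrayWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    // What a response wrapper's getWriter() would hand to the servlet.
    PrintWriter getWriter() {
        return writer;
    }

    // The captured output as a String, with nothing re-decoded from bytes.
    String captured() {
        writer.flush();
        return buffer.toString();
    }

    // A character-based source for the XSLT transformer.
    StreamSource asSource() {
        writer.flush();
        return new StreamSource(new CharArrayReader(buffer.toCharArray()));
    }

    public static void main(String[] args) {
        CharBufferSketch wrapper = new CharBufferSketch();
        wrapper.getWriter().print("<name>InVision\u2122</name>");
        System.out.println(wrapper.captured().length()); // number of chars, not bytes
    }
}
```

Because the transformer reads from a Reader, it never has to guess an encoding for the buffered XML.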

1 comment:

Chaitan Bandela said...

Yup, this solved the problem. Goodbye Byte streams and hello character streams. Of course, should remember not to make the reverse mistake when outputting binary files.