Friday, April 24, 2009

ColdFusion: In search of Word/RTF to PDF Converters

So I have foolishly wandered into the open source jungle, in search of non-commercial Word/RTF to PDF converters. While I have not found the holy grail (or perfect solution) yet, it has been interesting learning about some of the different projects. I particularly liked the Flying Saucer project. (Though it essentially serves the same purpose as cfdocument, you have got to love that name!)

For those that have already gone down this route, you probably will not read anything new here. But I thought I would write up some of my findings and experiments in the event it is helpful to others on a similar quest.

Where to start
I rolled the dice and started with google and sourceforge.net. After looking over about a gazillion projects I came to the conclusion there were two main possibilities:

  1. Using a single application (typically an executable) to do a direct conversion: rtf to pdf
  2. Using a combination of tools: converting rtf to html/xml -> then html/xml to pdf

I decided to try a few approaches and compare the results, pros and cons of each. The tools I tested were a combination of old and new. I suspected some would work better than others, but decided to try them all for comparison purposes. Note, the advantages and disadvantages are just my own opinions based on my tests.


RTFEditorKit/HTMLEditorKit (Poor man's converter)
RTFEditorKit and HTMLEditorKit are two core Java classes. They can be used to read an rtf file and convert it to html. The html can then be converted to pdf using cfdocument.
Advantages:
This is probably one of the simplest options. It can be used right out of the box. No additional jars or programs are required.

Disadvantages:
There are some bugs and limitations to these classes. For example, the RTFEditorKit does not handle images in rtf files. But in a pinch, it might work as a quick and dirty converter.
Example:
<cfscript>
inputFile = ExpandPath("sample.rtf");
outputFile = ExpandPath("./sample_converted.pdf");

// create editor objects used for the conversion
fis = createObject("java", "java.io.FileInputStream").init( inputFile );
rtfEditor = createObject("java", "javax.swing.text.rtf.RTFEditorKit");
htmlEditor = createObject("java", "javax.swing.text.html.HTMLEditorKit");

// create a default document and load the rtf file
document = rtfEditor.createDefaultDocument();
rtfEditor.read(fis, document, 0);
// release the file handle once the document is loaded
fis.close();

// convert the document to html
stringWriter = createObject("java", "java.io.StringWriter").init( document.getLength() );
htmlEditor.write(stringWriter, document, 0, document.getLength());

// get the html content as a string
htmlOutput = stringWriter.getBuffer().toString();
</cfscript>

<cfoutput>
<cfdocument format="pdf" filename="#outputFile#" overwrite="true">
#htmlOutput#
</cfdocument>
Finished converting file: #outputFile#
</cfoutput>


Majix
Majix is a java library for converting rtf files to xhtml. The resulting xhtml can be tweaked and used to create a pdf with the help of cfdocument.

As there were several rtf to html/xml converters, I decided to pick only one. Since Seth Duffey had already posted a great example on using Majix with ColdFusion, I used it rather than reinventing the wheel. (Just keep in mind it is an older blog entry)

Advantages:
Overall, it worked relatively well in my small tests, and does not require installing an executable program. Behavior can be customized somewhat via xsl.

Disadvantages:
The project has not been updated in a few years. It was not always able to handle some of the more exotic rtf files I threw at it.

There was also one thing in the source that bothered me a bit: the application's use of System.exit(...). In my initial testing, I used some wrong parameters and ended up shutting down the jvm, and ColdFusion with it. Now in all fairness, the application was probably geared towards desktop usage, where exiting is less critical. Plus your typical CF sandbox settings might not allow this behavior anyway. However, I would probably modify the source, or add a SecurityManager, just to be safe.

POI/docx4j
I looked into these two options for possible doc to pdf conversion. But unfortunately I have not had much luck so far. I had problems with several documents right off the bat. Though that may well have been due to my own ignorance of the APIs. So I decided to come back to these tools after I have studied them a bit more.

iText
Since iText is part of what powers cfdocument, and supports the creation of rtf files, I had a look around to see if it provided any options for converting rtf to pdf format. Though it is not quite complete, this feature is apparently in the works. So I am keeping this option in mind for the future.

Advantages:
More stable/mature product already part of ColdFusion.

Disadvantages:
It is not ready yet ;)

Update November 19, 2009: Unfortunately, it looks like RTF support will be abandoned in iText and moved to an incubator project.
http://www.mail-archive.com/itext-questions%40lists.sourceforge.net/msg47892.html

JODConverter (... all roads lead to Rome)
Searching the Java and ColdFusion forums led me back to the JODConverter on sourceforge.net and eventually a helpful example on Todd Sharp's blog. For basic file conversion, the JODConverter is a lot simpler to use than the OpenOffice API. Plus their project site has excellent documentation on how to install and use the JODConverter.

Within minutes I was converting different file formats to pdf: including doc, xls, and docx files. While I am sure it has some quirks too, overall, the output quality was the best of all the options I tried.

Advantages:
Simple interface and supports the conversion of multiple formats, not just doc or rtf.

Disadvantages:
It requires installing OpenOffice and running the program as a service. Conversions from one format to another may be a bit slow on less powerful machines.
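
For reference, here is a rough sketch of starting OpenOffice in headless mode with cfexecute, so it can accept conversion requests. The path to soffice.exe is an assumption based on a typical Windows install, so adjust it for your own server. Port 8100 is used because it is the default port the JODConverter 2.x connects to.

```cfml
<!--- ASSUMPTION: location of the OpenOffice executable. Change to match your install --->
<cfset pathToSoffice = "C:\Program Files\OpenOffice.org 3\program\soffice.exe">

<!---
Start OpenOffice without a GUI, listening for connections on port 8100.
A timeout of 0 tells cfexecute to start the process and return immediately,
which is what we want since OpenOffice keeps running in the background.
--->
<cfexecute name="#pathToSoffice#"
arguments='-headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard'
timeout="0" />
```

This only needs to be done once, not before every conversion. On a production server you would probably set OpenOffice up as a real service instead.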

Update: There is a newer version of the JODConverter available on Google Code


Example - (Using command line option)
Note: OpenOffice must be running as a service or this will not work

<!--- grab the path to java.exe --->
<cfset system = createObject("java", "java.lang.System")>
<cfset pathToJavaExe = system.getProperty("java.home") &"/bin/java.exe">

<!--- create path to input and output files --->
<cfset inputFile  = ExpandPath("./CFFAQ_Test.rtf")>
<cfset outputFile = ExpandPath("./CFFAQ_Test_Jar_converted.pdf")>

<!---
Construct the path to jodconverter jars.

Example:  My jar files are stored beneath the web root
c:\coldFusion8\wwwroot\jodconverter-2.2.2\lib\jodconverter-cli-2.2.2.jar
c:\coldFusion8\wwwroot\jodconverter-2.2.2\lib\jodconverter-2.2.2.jar
... etcetera ...
--->
<cfset pathToJodJar  = ExpandPath("/jodconverter-2.2.2/lib/jodconverter-cli-2.2.2.jar")>

<!---
Construct the command to convert a single file

ie: java -jar "c:\pathTo\jodconverter-cli-2.2.2.jar" "c:\myInputFile.doc" "c:\myOutputFile.pdf"

Note: All paths are quoted so that directories containing spaces do not break the command
--->
<cfset argString = '-jar "#pathToJodJar#" -v "#inputFile#" "#outputFile#"'>

<cfexecute name="#pathToJavaExe#"
arguments='#argString#'
timeout="60" />

<cfif FileExists(outputFile)>
Success. File Created <cfoutput>#outputFile#</cfoutput>
<cfelse>
Error. Unable to create file
</cfif>
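
The JODConverter can also be called directly with createObject, instead of going through the command line. Below is a minimal sketch of that approach, not a polished implementation. It assumes the JODConverter 2.2.x jars (and their dependencies) have been added to the ColdFusion class path, and that OpenOffice is already listening on port 8100. The output file name is just a made-up example.

```cfml
<!--- ASSUMPTION: the jodconverter 2.2.x jars are in the CF class path --->
<cfset inputFile  = createObject("java", "java.io.File").init( ExpandPath("./CFFAQ_Test.rtf") )>
<cfset outputFile = createObject("java", "java.io.File").init( ExpandPath("./CFFAQ_Test_Api_converted.pdf") )>

<!--- connect to the OpenOffice service (ASSUMPTION: it is listening on port 8100) --->
<cfset connection = createObject("java", "com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection").init( javacast("int", 8100) )>
<cfset connection.connect()>

<cftry>
    <!--- the output format is determined from the file extension (.pdf) --->
    <cfset converter = createObject("java", "com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter").init( connection )>
    <cfset converter.convert( inputFile, outputFile )>
    <cfset connection.disconnect()>

    <cfcatch>
        <!--- always release the connection, then let the original error bubble up --->
        <cfset connection.disconnect()>
        <cfrethrow>
    </cfcatch>
</cftry>

<cfif outputFile.exists()>
Success. File Created <cfoutput>#outputFile.getAbsolutePath()#</cfoutput>
<cfelse>
Error. Unable to create file
</cfif>
```

The cftry/cfcatch/cfrethrow tags are used, rather than try/finally, so the sketch should also run on ColdFusion 8.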


That is all she wrote ..

I will probably post more about POI and docx4j later. (Once I figure it out.) Hopefully my trials, tribulations and explorations were helpful to someone.

Update: I finally had to give up on using docx4j with Adobe ColdFusion. It is nothing against docx4j. But there were just too many "jar hell" type conflicts with CF's own internal jars.

Cheers



13 comments:

Gary Fenton April 27, 2009 at 3:44 AM  

It's funny you said "all roads lead to Rome" because I went on the same journey as you. Except I didn't find JODConverter so I was momentarily excited about it, until you mentioned OpenOffice needs to be running on the server.

So the search for the Holy Grail continues.

cfSearching April 27, 2009 at 7:07 AM  

@Gary,

Yes, I know exactly what you mean. That requirement is one of the reasons I put off looking at OpenOffice until the end. It works well, but it is not an option in some environments. So as you say, the quest continues .. ;)

Eric,  April 27, 2009 at 2:18 PM  

Do you have an example of the CFML you used in the application to call it? I assume the first cfexecute is to get the service running in general and not the way you do the call...

cfSearching April 27, 2009 at 2:55 PM  

@Eric,

Correct. You start the service only once. Then do the conversion separately. There are a few ways you can use JODConverter. Via the command line is probably the simplest. I updated the entry and posted an example above.

BTW, you could also set up OpenOffice as a windows service, instead of starting it via cfexecute:
Creating a Service on Windows

Paul Hastings April 28, 2009 at 12:06 AM  

command line? how DOS ;-)

i think perhaps OO is the way to go at least until iText gets clearer.

cfSearching April 28, 2009 at 5:39 AM  

@Paul,

Lol. That is their term, not mine ;) You can use the createObject(..) approach as well. I just posted an example of the simpler option.

I _really_ wish there was a jar only method, like with iText. Unfortunately, the requirement to run OpenOffice as a server will rule out this option in a lot of cases.

jwhite1202 July 9, 2009 at 12:27 PM  

I know it's been a while since you posted this, but I need to do exactly what you are talking about here. I am leaning toward trying either the HTMLEditorKit or RTFEditorKit classes, especially since they are already in CF 8. I saw you said the RTF editor has issues with converting images. So could I use the HTMLEditorKit to do the conversion if I have images? I also went and looked at the iText site and it looks like a good option too. My concern about iText is that I am doing gov't based work and they may want to review the source for iText which would take a long time. So basically I want to know that if worst came to worst would either of the two classes inherent to CF 8 get the job done? If so, which class works better on word docs?

Thanks,

JW

cfSearching July 9, 2009 at 7:08 PM  

@JW,

I think the primary factors that may influence your decision are:

a) Which formats you need to convert AND
b) Quality of the output

With the exception of the JODConverter and Majix, you will find most of the tools are only capable of handling a single format.

- RTF/HTMLEditorKit are limited to rtf files. So they are not an option if you need to convert .doc or docx
- docx4j - Handles .docx files, but AFAIK cannot handle .rtf or binary .doc files
- iText - Will handle .rtf only. But unfortunately, it is still a work in progress at this point.
- Majix - Handles .rtf and supposedly .doc files too. But I _think_ that uses COM to handle .doc files, which sort of defeats the purpose here.
- OpenXMLViewer - Is another new option for converting docx files (only) to html.

As far as quality, the output of the RTF/HTMLEditorKit classes is probably in the lower ranks. In large part because those classes are older. The html produced is not well formed, by today's standards, and the classes do not support a lot of things. You can use them on rtf files with images, but the output will be incomplete (ie the images will be lost). So I think this option should be considered only if you need something along the lines of a S.W.A.G.

If you need to produce higher quality output, that is as close to the original as possible, you may be better off with JODConverter or a commercial product like Aspose.

-HTH
Leigh

cfSearching July 9, 2009 at 7:13 PM  

@JW,

BTW, I listed all of the options because I was not sure how you were classifying "word docs" (rtf only, rtf+doc, etcetera)

-Leigh

jwhite1202 July 10, 2009 at 9:23 AM  

Thanks for the response Leigh. I am looking to convert word docs (either .doc or .docx) into .pdf. I looked into iText yesterday and I basically learned what you stated. It seems that either I'll have to give this OpenXMLViewer a shot or be willing to deal with OpenOffice's API. If there's something else out there, I'm willing to learn. Either way thanks again for the response and the post.

cfSearching July 10, 2009 at 12:32 PM  

@JW,

You are welcome.

If you can convert to all docx files (and have full server control), another option is docx4j. It can be used for:

- docx to html OR
- docx to pdf conversions


Unfortunately, I have not gotten around to writing up an entry on it yet ;-)

-Leigh

Malcom Reynolds January 24, 2011 at 12:05 PM  

I too am in search of the Holy Grail of these programs. It is really hard to find a good one. I loved this post though. Really excellent. Thank you.

cfSearching January 24, 2011 at 12:34 PM  

@Malcom,

Thanks. Yes, it truly is that elusive :) In the end, I think you just have to choose the tool with the least quirks or ones you can live with. It is the only way to stay sane ..

-Leigh

PS: I would have loved to explore docx4j further, but eventually had to give up on it. Too many jar hell issues with ACF.
