Converting MIF to XML - Java Version
In my previous post I discussed a tool called mif2xml for converting MIF files to an intermediate XML dialect. In this post I'll talk about mif2xml-j, the Java port of mif2xml. You can download it here (including just the executable jar) or browse the source online via svn.
JFlex is a lexical analyzer generator for Java, and it's the library I chose for creating the MIF lexer. The first step was to get JFlex integrated into my build environment. For this project I decided to use Ant, but integrating JFlex into another build environment should be straightforward. I created the following directory structure:
--/
  |-- src/main/jflex/              - JFlex lexical specifications
  |-- src/main/resources/MANIFEST  - Defines main class for executable jar
  |-- src/main/java/               - Java source
  |-- lib/                         - 3rd party libraries (JFlex.jar)
  |-- build.xml                    - Ant build file
JFlex comes bundled with a JFlexAntTask which provides a very convenient <jflex/> task. Here's a snippet of the ant build file I created which shows how to set it up:
<property name="src" location="${basedir}/src/main/java" />
<property name="lib" location="${basedir}/lib" />
<property name="scanner-file" value="${basedir}/src/main/jflex/mif.jflex" />

<path id="classpath">
  <pathelement location="${build}" />
  <fileset dir="${lib}">
    <include name="*.jar" />
  </fileset>
</path>

<taskdef classpathref="classpath" classname="JFlex.anttask.JFlexTask" name="jflex" />

<target name="jflex" description="Generate the MIF lexer">
  <echo message="Generating the MIF Lexer" />
  <jflex file="${scanner-file}" destdir="${src}" />
</target>
I found writing the lexical specification in JFlex and flex to be very similar. JFlex has a great user manual which contains a lot of useful info. Here's the mif.jflex file:
/*
 * Copyright 2007 Andrew Bruno <aeb@qnot.org>
 * Licensed under the Apache License, Version 2.0
 */
package org.qnot.mif2xml;

import java.util.Stack;

%%

%{
  private Stack<Tag> tags = new Stack<Tag>();
  private StringBuffer data = new StringBuffer();
  private StringBuffer facet = new StringBuffer();
%}

%line
%char
%standalone
%class MifLexer
%xstate DATA
%xstate STR
%xstate FACET

ID=[A-Za-z][A-Za-z0-9]*
TAG="<"{ID}" "
TAG_END=">"
NONNEWLINE=[^\r|\n|\r\n]
NEWLINE=[\r|\n|\r\n]
WHITE_SPACE_CHAR=[ \n\t]

%%

<YYINITIAL> {
  {TAG} {
    Tag tag = new Tag();
    tag.setName(yytext().substring(1, yytext().length()-1));
    tags.push(tag);
    tag.writeStart();
    data = new StringBuffer();
    yybegin(DATA);
  }
  {TAG_END} {
    if(!tags.empty()) {
      Tag tag = (Tag)tags.pop();
      tag.writeEnd();
    }
  }
  ^"="[a-zA-Z][a-zA-Z0-9]*{NEWLINE} {
    facet = new StringBuffer();
    facet.append(yytext());
    yybegin(FACET);
  }
  {WHITE_SPACE_CHAR}+ { /* eat up whitespace */ }
  {NONNEWLINE} { /* eat up everything else */ }
}

<DATA> {
  {NEWLINE} {
    if(!tags.empty()) {
      Tag tag = (Tag)tags.pop();
      tag.setValue(data.toString());
      tags.push(tag);
    }
    yybegin(YYINITIAL);
  }
  "`" { yybegin(STR); }
  {TAG_END} {
    if(!tags.empty()) {
      Tag tag = (Tag)tags.pop();
      String value = tag.getValue();
      String dataStr = data.toString();
      if(dataStr != null && dataStr.length() > 0) {
        value = dataStr;
      }
      if(value != null) {
        value = value.replaceAll("^\\s+", "");
        value = value.replaceAll("\\s+$", "");
      }
      tag.setValue(value);
      tag.writeEnd();
    }
    yybegin(YYINITIAL);
  }
  [^\n|\r|\r\n|`|>] { data.append(yytext()); }
}

<STR> {
  "'" {
    if(!tags.empty()) {
      Tag tag = (Tag)tags.pop();
      if(tag.getValue() == null || tag.getValue().length() == 0) {
        tag.setValue("`'");
      }
      tags.push(tag);
    }
    yybegin(YYINITIAL);
  }
  [^']* {
    if(!tags.empty()) {
      Tag tag = (Tag)tags.pop();
      StringBuffer buf = new StringBuffer();
      buf.append("`");
      buf.append(yytext());
      buf.append("'");
      tag.setValue(buf.toString());
      tags.push(tag);
    }
  }
}

<FACET> {
  ^"=EndInset"{NEWLINE} {
    facet.append(yytext());
    Tag.writeFacet(facet.toString());
    yybegin(YYINITIAL);
  }
  .*{NEWLINE} { facet.append(yytext()); }
}
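The TAG rule and the substring call that extracts the tag name can be tried out in isolation with plain java.util.regex. This is only a sketch for experimenting, not part of mif2xml; the class name TagRuleSketch and the method tagName are made up for illustration, and the pattern is a hand translation of the TAG macro above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for experimenting with the TAG rule outside JFlex. */
final class TagRuleSketch {

    // Same shape as the TAG macro: "<", an identifier, then a single space.
    private static final Pattern TAG = Pattern.compile("<[A-Za-z][A-Za-z0-9]* ");

    /** Returns the tag name if the input starts with a TAG match, otherwise null. */
    static String tagName(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.lookingAt()) {
            return null;
        }
        String text = m.group();
        // Mirrors yytext().substring(1, yytext().length()-1):
        // drop the leading "<" and the trailing space.
        return text.substring(1, text.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(tagName("<Units Uin>"));
        System.out.println(tagName("<PgfTag `Body'>"));
    }
}
```

Note that the rule requires a space after the identifier, so an input like "<Para>" is not treated as a tag start.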
I created a simple Tag class to encapsulate a MIF XML tag and handle writing out each tag. The MifLexer keeps a stack of Tag instances while it's processing the input file:
/*
 * Copyright 2007 Andrew Bruno <aeb@qnot.org>
 * Licensed under the Apache License, Version 2.0
 */
package org.qnot.mif2xml;

public class Tag {
    private String name;
    private String value;

    public String getName() {
        return this.name;
    }

    public String getValue() {
        return this.value;
    }

    public void setName(String name) {
        this.name = name;
    }

    public void setValue(String value) {
        this.value = value;
    }

    // Writes the (escaped) value, if any, followed by the closing tag.
    public void writeEnd() {
        if(value != null && value.length() > 0) {
            System.out.print(escape(value) + "</" + name + ">");
        } else {
            System.out.print("</" + name + ">");
        }
    }

    public void writeStart() {
        System.out.print("<" + name + ">" );
    }

    // Facet data is written verbatim inside a CDATA section.
    public static void writeFacet(String facet) {
        System.out.print("<_facet><![CDATA[");
        System.out.print(facet);
        System.out.print("]]></_facet>");
    }

    // Escapes XML special characters ("&" first, so the entities
    // added afterwards are not escaped again) and trims whitespace.
    private String escape(String str) {
        str = str.replaceAll("&", "&amp;");
        str = str.replaceAll("\"", "&quot;");
        str = str.replaceAll(">", "&gt;");
        str = str.replaceAll("<", "&lt;");
        str = str.replaceAll("^\\s+", "");
        str = str.replaceAll("\\s+$", "");
        return str;
    }
}
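The escaping step is the part of Tag that is easiest to get subtly wrong, so here is a small standalone sketch of it. It is not the mif2xml code: the class XmlEscapeSketch and its methods are hypothetical, and it uses String.replace (literal replacement) rather than replaceAll. The one rule that matters is that "&" is handled first, otherwise the ampersands introduced by the other entities would be escaped a second time.

```java
/** Hypothetical standalone version of the escaping idea used by Tag. */
final class XmlEscapeSketch {

    /** Escapes XML special characters; '&' must be replaced before the others. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders a simple element with an escaped, trimmed text value. */
    static String element(String name, String value) {
        String text = (value == null) ? "" : escape(value.trim());
        return "<" + name + ">" + text + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(element("String", " a < b & \"c\" "));
    }
}
```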
There's a separate Main class which creates a new instance of the MifLexer class to process the file passed in on the command line. I'd like to eventually extend this class so that it handles command line options and possibly even runs some XSLTs over the generated MIF XML.
/*
 * Copyright 2007 Andrew Bruno <aeb@qnot.org>
 * Licensed under the Apache License, Version 2.0
 */
package org.qnot.mif2xml;

import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.FileReader;

public class Main {
    public static void main(String[] args) {
        if(args.length != 1) {
            System.err.println("Usage : mif2xml <inputfile>");
            System.exit(1);
        }

        try {
            MifLexer scanner = new MifLexer(new FileReader(args[0]));
            System.out.print("<?xml version=\"1.0\"?><mif>");
            scanner.yylex();
            System.out.print("</mif>");
        } catch(FileNotFoundException e) {
            System.out.println("File not found : "+args[0]);
        } catch(IOException e) {
            System.out.println("I/O error scanning file '"+args[0]+"': "+e.getMessage());
        } catch(Exception e) {
            System.out.println("Unexpected exception: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
To run the code, download the executable jar and run:
$ java -jar mif2xml-0.1.jar myfile.mif
The MIF XML will be printed to stdout.
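Once the output is redirected to a file, it can be read with any XML library. As a rough sketch, here is one way to pull a value out with the DOM parser that ships with the JDK. The class MifXmlReadSketch is hypothetical, and the sample string is hand-written to resemble the one-element-per-statement output described above; it is not output captured from a real MIF file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical example of reading MIF XML with the JDK's DOM parser. */
final class MifXmlReadSketch {

    /** Returns the text of the first element with the given name, or null if absent. */
    static String firstValue(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Parser configuration, SAX and I/O errors are all wrapped for brevity.
            throw new IllegalStateException("Could not parse MIF XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, assumed to resemble mif2xml output.
        String sample =
            "<?xml version=\"1.0\"?><mif><MIFFile>7.00</MIFFile><Units>Uin</Units></mif>";
        System.out.println(firstValue(sample, "MIFFile"));
    }
}
```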
March 29th, 2007 at 2:21 am
Hi Andrew,
nice, clean work! This looks very promising to me, because part of my work involves handling of FrameMaker legacy documents. Doing that with an XML toolset seems much more up-to-date than any other method. Especially since most of the time conversion to XML is requested.
One question: While version 0.1 of the JAR file performs perfectly on my system (WinXPPro, Java 1.5.0_08), I get an exception with version 0.2 ("Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file"). What could be the reason for this?
All the best at your new job!
- Michael
March 29th, 2007 at 10:19 am
Thanks for your feedback Michael. I recently upgraded to Java 1.6 and compiled mif2xml-j 0.2 without adding a target=1.5. I recompiled 0.2 and it should work with Java 1.5 now. Let me know if you run into any more issues.
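For anyone hitting the same error: a class file records the version it was compiled for in its header, and an older JVM refuses anything newer than it understands. Java 5 class files carry major version 49 and Java 6 class files carry 50. The sketch below (not part of mif2xml; ClassVersionSketch is a made-up name) reads that number from a .class file extracted from the jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper that reports the major version of a .class file. */
final class ClassVersionSketch {

    /** Returns the major version stored in bytes 6-7 of the class file header. */
    static int majorVersion(byte[] classFile) {
        if (classFile.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        int magic = ((classFile[0] & 0xFF) << 24) | ((classFile[1] & 0xFF) << 16)
                  | ((classFile[2] & 0xFF) << 8) | (classFile[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("Missing 0xCAFEBABE magic number");
        }
        // Bytes 4-5 hold the minor version, bytes 6-7 the major version.
        return ((classFile[6] & 0xFF) << 8) | (classFile[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ClassVersionSketch path/to/Some.class
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("major version: " + majorVersion(bytes));
    }
}
```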
April 4th, 2008 at 2:43 pm
Hi Andrew,
Thank you for a good explanation. I wanted to know if I can use this jar to convert from .fm to xml files.
Regards-
Jyothi
April 4th, 2008 at 8:58 pm
Hi Jyothi,
You first have to save your *.fm file to MIF (SaveAs MIF from FrameMaker) then you can convert the MIF to xml. The jar only reads MIF files not the binary FrameMaker files. Hope this helps.
April 7th, 2008 at 3:51 pm
Hi Andrew,
Thank you for the quick reply. Where can I find the jar file for mif2xml? I didn't find it in the downloads section.
Regards-
Jyothi
April 7th, 2008 at 6:53 pm
You can download the executable jar here.
April 25th, 2008 at 8:39 am
Hi Andrew,
Thank you for the conversion script. I have been playing with fm8 and its xml output. It seems to me that the fm xml output outputs the data but is a lossy conversion and doesn't really do a good job with formatting data, whereas your xml converter will keep everything from the mif to xml in a lossless conversion.
Have you found any cases that mif2xml does not work in? How would external entity references (I think you can do these in frame) be handled. What about character sets?
Regards,
IH
April 30th, 2008 at 7:02 am
@IH.. Thanks for the feedback. The XML output from mif2xml seems to work best when you want to process MIF using XML technologies like XSLT/XQuery or XML processing libraries in your favorite programming language. The XML is quite ugly but it has the advantage of round tripping back into raw MIF for use with FrameMaker again. I haven't exhaustively tested mif2xml so I'm sure there are a few bugs. Most of the issues that come up have been with parsing facet data and embedded TIFF/EPS objects. mif2xml was written using FrameMaker 7.0 as a reference and I haven't looked into how much has changed in 8.0. I tested a few FrameMaker 8.0 files and things seem to work OK.
I haven't really given much thought to external entities. One possible way to handle them might be to add some command line options to mif2xml so that the external entity references get added to the resulting XML. Something along the lines of:
$ mif2xml --doctype /path/to/dtd
Which would output something like:
<?xml version="1.0"?>
<!DOCTYPE mif SYSTEM "/path/to/dtd">
<mif>
..
</mif>
Regarding character sets, mif2xml doesn't explicitly tell Java which encoding to use so when reading/writing files Java will use the platform's default encoding. This should work as expected unless you attempt to process a MIF file that was created on another platform using a different encoding. I could add another command line switch for defining a specific encoding to use when reading/writing MIF files which may help in that case. The latest version of mif2xml (0.3) which you can download here, has some bug fixes (reported by @Jyothi) and should do a better job with facet data and MIF 8.0 Unicode files.
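On the encoding point: Main hands a FileReader to MifLexer, and FileReader uses the platform default. Since the lexer only needs a Reader, one possible approach would be to pass an InputStreamReader with an explicit charset instead. The sketch below is not in any released version; EncodingSketch and its methods are hypothetical, and it only demonstrates how the same bytes decode differently depending on the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical demo of reading bytes with an explicit charset. */
final class EncodingSketch {

    /** Decodes the bytes through an InputStreamReader using the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(bytes, "UTF-8").length());      // one character
        System.out.println(decode(bytes, "ISO-8859-1").length()); // two characters
    }
}
```

With a file, the same idea would be new InputStreamReader(new FileInputStream(path), charset) in place of new FileReader(path).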
I'd love to hear any other feedback you may have or any ideas for new features.
--Andrew
June 29th, 2008 at 4:08 pm
Hi Andrew,
nice, clean work! This looks very important to me, because part of my work involves handling of FrameMaker legacy documents.
One question: I get an exception with version 0.2
7.0H
Unexpected exception: 255
java.lang.ArrayIndexOutOfBoundsException: 255
at org.qnot.mif2xml.MifLexer.yylex(MifLexer.java:600)
at org.qnot.mif2xml.Main.main(Main.java:33)
All the best at your new job!
Can you help on this..
thank you
June 29th, 2008 at 8:08 pm
@venkat - I've released a new version which should fix the error you're getting. You can download the latest version (0.3) here. If you're still getting errors with the latest version let me know and I'll try and get them fixed up.
June 30th, 2008 at 5:45 am
Hi Andrew,
Thanks for your reply. The problem is solved, but it's not printing an XML string; I am getting some special characters like....
#%v
=EndInset
]]>.
can you help on this...
thank you
July 16th, 2008 at 8:55 am
Andrew,
This looks very interesting. I was given an assignment of converting portions of framemaker documents into openoffice. However, I know virtually nothing of xml, which is not good. How feasible would it be to translate the above generated xml into OOo readable xml?
Thank you,
Another Andrew
July 21st, 2008 at 6:54 am
@Venkat - It's difficult to diagnose the problem without seeing the MIF file. If you're still having issues send me a copy of the MIF file you're trying to convert and I'll take a look.
July 21st, 2008 at 7:05 am
@another Andrew - I haven't looked into OOo XML enough to say how difficult it would be to convert the MIF XML to a readable OOo document. I'm guessing it would be a bit of a challenge but you might start with looking at a few posts here which describe one use case of converting MIF XML to DocBook and back again. You could possibly even try doing MIF XML -> DocBook -> OOo.
March 26th, 2009 at 11:13 pm
Oh..okay I got it..! I was using java 1.4. I'm able to run it through Java 1.6..!
thanks,
Basav