<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Andrew E. Bruno &#187; XML</title>
	<atom:link href="http://left.subtree.org/category/xml/feed/" rel="self" type="application/rss+xml" />
	<link>http://left.subtree.org</link>
	<description>A sourceful of secrets</description>
	<lastBuildDate>Tue, 07 Feb 2012 12:25:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='left.subtree.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/e14c799c6e8030a8abefcb495c0b0e17?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>Andrew E. Bruno &#187; XML</title>
		<link>http://left.subtree.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://left.subtree.org/osd.xml" title="Andrew E. Bruno" />
	<atom:link rel='hub' href='http://left.subtree.org/?pushpress=hub'/>
		<item>
		<title>WordPress Blog to Print Book &#8211; A Case Study</title>
		<link>http://left.subtree.org/2011/04/09/wordpress-blog-to-print-book-a-case-study/</link>
		<comments>http://left.subtree.org/2011/04/09/wordpress-blog-to-print-book-a-case-study/#comments</comments>
		<pubDate>Sat, 09 Apr 2011 05:05:18 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Hacks]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://left.subtree.org/?p=257</guid>
		<description><![CDATA[In this post I discuss my experience converting a WordPress blog into a print book. This is by no means a generic how-to guide but more along the lines of a case study. There&#8217;s a number of ways one could tackle this problem however I wasn&#8217;t able to find any existing methods that fit my [...]]]></description>
			<content:encoded><![CDATA[<p>In this post I discuss my experience converting a WordPress blog into a print book. This is by no means a generic how-to guide but more along the lines of a case study. There are a number of ways one could tackle this problem; however, I wasn&#8217;t able to find any existing methods that fit my needs. Specifically, I wanted to convert the content of a WordPress blog into a high quality, print ready PDF book (complete with chapters, sections, table of contents, images, figures, page numbering, index, etc.) which could then be sent to various <a href="http://en.wikipedia.org/wiki/Print_on_demand">POD</a> publishers such as <a href="http://www.lulu.com/">Lulu</a> for printing. I wanted to streamline as much of the process as possible to allow for regenerating the PDF book as new posts are made. Ideally there would be a WordPress plugin for this, but making such a plugin generic enough would be tricky and would require many assumptions about the structure of your blog (e.g. what constitutes a chapter or a section?). In this post I describe how I ended up creating a PDF book from WordPress and discuss a few challenges I encountered along the way. A brief disclaimer: this post is intended for folks who are familiar with the *nix command line and enjoy mucking around in code (and don&#8217;t mind slinging around some XML here and there). It&#8217;s not for the faint of heart but hopefully it will be useful to others interested in a similar outcome.</p>
<p>If you&#8217;d rather skip this epic post and dive right into the code you can browse it all over on <a href="https://github.com/aebruno/wp2print">github</a>. There&#8217;s a <a href="https://github.com/aebruno/wp2print/blob/master/README">README</a> file and a <a href="https://github.com/aebruno/wp2print/blob/master/Makefile">Makefile</a> which details how to run wp2print and includes a simple example. </p>
<p><strong>Background and Assumptions</strong></p>
<p>The idea for this project came about as my family has been writing a private blog that I want to be able to share with my children some day. I&#8217;ve often thought about giving them a copy of the blog when they&#8217;re much older and wondered what would be the best way to preserve the content, ensuring it&#8217;s viewable years down the road. I thought by creating a physical copy of the blog I could have something tangible to pass down through the generations. Using the services provided by companies like Lulu and Blurb, printing a book is as simple as uploading a PDF and designing a cover. Having <a href="http://oreilly.com/">worked</a> in the publishing industry in a former life, I had some experience generating PDF books so I looked forward to the challenge. </p>
<p>As my goal was to create a book I needed to convert the content of the blog into an intermediate format which represented the book and could then be used to generate a PDF file. As I had a good amount of experience slinging around <a href="http://www.docbook.org/">DocBook</a>, my general idea was to export the content from WordPress and convert it into DocBook. Then converting DocBook into a PDF is fairly straightforward using the wonderful <a href="http://docbook.sourceforge.net/">DocBook stylesheets</a> and <a href="http://xmlgraphics.apache.org/fop/">Apache FOP</a>. </p>
<p>The first obvious challenge when converting a blog into a book is deciding how to organize the blog posts into chapters and sections. Our family blog was authored in such a way that each post had only one tag (category) and, most importantly, all posts were tagged chronologically, meaning that all posts in a given category were sequential. For example, posts 0-5 are all tagged with &#8220;tag1&#8221;, posts 6-10 are all tagged with &#8220;tag2&#8221;, and so on. You can probably see where this is going. Having authored the blog in this format allowed me to easily use each tag as a chapter, with each post appearing as a section within that chapter. If you&#8217;re unable to make these assumptions about your blog (and most likely you won&#8217;t be able to) just keep this in mind as we delve into the code later on. You would need to modify the script I wrote and add in the appropriate logic to slice your blog posts up into chapters/sections or however you&#8217;d like to structure your DocBook file. I experimented with just making each post a chapter (and even a series of DocBook <a href="http://www.docbook.org/tdg5/en/html/article.html">articles</a>), which isn&#8217;t a bad option; however, depending on the number of posts, you may want to consider omitting the table of contents.</p>
<p>A few other assumptions I made:</p>
<ul>
<li>All posts were written by the same author so I omitted displaying any author information. Easy enough to add in if needed</li>
<li>All comments were excluded from the book. Comments are an important part of any blog but in this case my blog didn&#8217;t have very many comments. I was most interested in the content of the post only and decided to omit any and all comments. These could certainly be added in but some thought would be needed on which DocBook element to use for structuring them within the book.</li>
<li>There was one page (not a post) with the title &#8220;About&#8221; that I used as the preface for the book. This can be any post/page or omitted completely if desired.</li>
<li>Access to the WordPress code that runs the blog. This won&#8217;t work if your blog is hosted for example at WordPress.com. You&#8217;ll need to export your blog and run it on your own server.</li>
</ul>
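<p>To make the chapter/section assumption concrete, here is a minimal Python sketch (not the PHP code from wp2print) that groups a chronologically ordered list of posts into chapters, one per run of consecutive posts sharing a tag. The post records and the <code>group_into_chapters</code> name are hypothetical and only serve as an illustration.</p>
<p><pre class="brush: python;">
from itertools import groupby

# Hypothetical post records, already sorted chronologically.
posts = [
    {'title': 'First post', 'tag': 'tag1'},
    {'title': 'Second post', 'tag': 'tag1'},
    {'title': 'Third post', 'tag': 'tag2'},
]

def group_into_chapters(posts):
    # groupby only merges adjacent items, so this relies on every
    # tag covering one unbroken run of posts.
    chapters = []
    for tag, run in groupby(posts, key=lambda post: post['tag']):
        chapters.append({'chapter': tag, 'sections': [p['title'] for p in run]})
    return chapters

for chapter in group_into_chapters(posts):
    print(chapter['chapter'], chapter['sections'])
</pre></p>
<p>If a tag shows up again later in the blog, this sketch emits a second chapter with the same name, which is a quick way to spot posts that break the sequential tagging assumption.</p>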
<p>Here&#8217;s a brief outline of the entire process. I&#8217;ll go over each step in detail in the next section. </p>
<ol>
<li>Convert WordPress content to DocBook &#8211; using PHP and some XSLT</li>
<li>Convert DocBook to <a href="https://secure.wikimedia.org/wikipedia/en/wiki/XSL_Formatting_Objects">XSL-FO</a> &#8211; using DocBook stylesheets</li>
<li>Convert XSL-FO to PDF &#8211; using Apache FOP</li>
<li>Upload PDF file to Lulu and order print book</li>
</ol>
<p><strong>Convert WordPress content to DocBook</strong></p>
<p>First step was to convert the blog into DocBook. This was by far the most challenging step. My first attempt was to use the <a href="https://en.support.wordpress.com/export/">Export</a> feature in WordPress which dumps the entire contents of your blog in XML format (WordPress eXtended RSS) and write an XSLT to convert into DocBook. This turned out to be slightly harder than I anticipated because of how the content of each post was formatted in the WordPress XML dump. It appeared to be in the native format WordPress uses to store the post in the database and I didn&#8217;t want to have to write a custom WordPress post renderer for DocBook. I decided to instead write a fairly simple PHP script which used the WordPress API to render each post in HTML just like it normally would if someone visited the site, then convert the HTML to DocBook. I found converting the HTML to DocBook was slightly easier than having to parse the native WordPress format. I did this in two steps, first I wrote a PHP script to generate a quasi-DocBook file which uses the WordPress API to embed the HTML content of each post within a <code>&lt;section/&gt;</code>. Then I wrote an XSLT which transforms the quasi-DocBook and embedded HTML into a final valid DocBook file. The main PHP code is <a href="https://github.com/aebruno/wp2print/blob/master/lib/export-docbook.php">here</a>. You&#8217;ll need to change the include paths in <a href="https://github.com/aebruno/wp2print/blob/master/lib/config.php">config.php</a> to point to your WordPress installation (see the <a href="https://github.com/aebruno/wp2print/blob/master/Makefile">Makefile</a> for a complete example).  The XSLT is <a href="https://github.com/aebruno/wp2print/blob/master/wp-html2docbook.xsl">here</a>. It looks for various HTML tags that appear in my blog and converts those to valid DocBook elements. I built up the XSLT by trial and error. I first just rendered the quasi-DocBook generated from the PHP script as PDF. 
The DocBook stylesheets have a nice feature in that any invalid DocBook elements they encounter are highlighted in red in the resulting PDF. By iterating through the invalid elements I was able to add the correct templates to my XSLT to account for all HTML tags found in my blog posts. You&#8217;ll most certainly need to modify this XSLT file to suit your specific needs, but it should serve as a decent starting point. Here&#8217;s an example of the HTML generated by WordPress for an image included in a blog post:</p>
<p><pre class="brush: xml;">
&lt;div id=&quot;attachment_155&quot; class=&quot;wp-caption aligncenter&quot; style=&quot;width: 310px&quot;&gt;
  &lt;a href=&quot;/wp-content/media/2008/11/image.jpg&quot;&gt;
        &lt;img class=&quot;size-medium wp-image-155&quot; title=&quot;image title&quot; src=&quot;/wp-content/media/2008/11/image-300x218.jpg&quot; alt=&quot;image alt&quot; width=&quot;300&quot; height=&quot;218&quot;/&gt;
  &lt;/a&gt;
  &lt;p class=&quot;wp-caption-text&quot;&gt;This is a description of the image&lt;/p&gt;
&lt;/div&gt;
</pre></p>
<p>Which then gets converted to a DocBook <a href="http://www.docbook.org/tdg5/en/html/mediaobject.html">mediaobject</a> element:</p>
<p><pre class="brush: xml;">
&lt;para&gt;
  &lt;mediaobject&gt;
    &lt;imageobject&gt;
       &lt;imagedata align=&quot;center&quot; fileref=&quot;images/2008/11/image.jpg&quot; width=&quot;4.0in&quot; depth=&quot;3.0in&quot; scalefit=&quot;1&quot; format=&quot;JPG&quot;/&gt;
    &lt;/imageobject&gt;
    &lt;caption&gt;&lt;para&gt;This is a description of the image&lt;/para&gt;&lt;/caption&gt;
  &lt;/mediaobject&gt;
&lt;/para&gt;
</pre></p>
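<p>The path rewrite in this example (from the resized <code>image-300x218.jpg</code> under <code>/wp-content/media/</code> to <code>images/2008/11/image.jpg</code>) can be sketched in a few lines of Python. This is only an illustration of the rule as inferred from the example, not the XSLT from wp2print; the two prefix constants and the <code>to_fileref</code> name are assumptions.</p>
<p><pre class="brush: python;">
import re

# Assumed prefixes, taken from the example above; adjust for your blog.
WP_MEDIA_PREFIX = '/wp-content/media/'
BOOK_IMAGE_PREFIX = 'images/'

def to_fileref(src):
    # Drop the -WIDTHxHEIGHT suffix WordPress adds to resized copies.
    original = re.sub(r'-\d+x\d+(\.\w+)$', r'\1', src)
    # Point at the local copy of the full size image.
    if original.startswith(WP_MEDIA_PREFIX):
        original = BOOK_IMAGE_PREFIX + original[len(WP_MEDIA_PREFIX):]
    return original

print(to_fileref('/wp-content/media/2008/11/image-300x218.jpg'))
</pre></p>
<p>For the image above this prints <code>images/2008/11/image.jpg</code>. File names without a size suffix pass through with only the prefix changed.</p>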
<p><strong>A note about images&#8230;</strong></p>
<p>Care must be taken to ensure any images you want included in the book are print ready. I ended up having quite a few images in my blog that I wanted to include in the final PDF, which required some extra work to get them ready for printing. For best results you&#8217;ll want to make sure the resolution of your images is at least 300ppi (<a href="https://secure.wikimedia.org/wikipedia/en/wiki/Dots_per_inch#DPI_or_PPI_in_digital_image_files">pixels per inch</a>). See <a href="http://connect.lulu.com/t5/Interior-Formatting/What-resolution-DPI-should-my-images-have-to-achieve-optimum/ta-p/31434">this post on Lulu</a>. For example, if your image is 600x600px and you set the resolution to be 300ppi, the printed image will be 2x2in. In my case I was printing a 6&#215;9 book and after factoring in margins/spine/bleed etc. I settled on a maximum print size of 4x3in for each image (as defined in the DocBook XML element <code>&lt;imagedata width="4.0in" depth="3.0in"/&gt;</code> in the above example). As most of the images were pictures, this print size ended up being large enough so the photo was still viewable but small enough to allow for 2 images per page. This meant that each image had to be at least 1200x900px. The problem was that when we uploaded pictures to our blog we had WordPress resize them to 500x400px (from their original size of 2816x2112px from the camera). Fortunately, I still had the original image files, which I collected and used in the final PDF. This is something to keep in mind if you have images (especially photographs) in your blog that you want printed. I ran into another edge case with the images which required a little bit of <a href="http://www.imagemagick.org">ImageMagick</a>. I had a few important pictures that were taken with Photo Booth on a Mac for which the original image was a mere 640x480px. 
I knew the print version of the images would look dreadful so my only option was to resample them to a higher resolution. This can easily be accomplished using ImageMagick&#8217;s <a href="http://www.imagemagick.org/script/command-line-options.php#resample">convert command</a>:</p>
<p><pre class="brush: plain;">
$ convert -resample 300x orig.jpg hires.jpg
</pre></p>
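<p>The arithmetic behind the sizes quoted in this section is just pixels = inches &#215; ppi. Here is a small Python sketch for checking your own images; the function names are made up for illustration and the 300ppi default is the target discussed in this section.</p>
<p><pre class="brush: python;">
def min_pixels(inches, ppi=300):
    # Pixels needed along one edge to print at the given size.
    return round(inches * ppi)

def print_inches(pixels, ppi=300):
    # Printed length of one edge for an image of the given pixel size.
    return pixels / ppi

print(min_pixels(4.0), min_pixels(3.0))   # 1200 900
print(print_inches(600))                  # 2.0
print(print_inches(640), print_inches(480))
</pre></p>
<p>The last line shows why the 640x480px Photo Booth pictures were a problem: at 300ppi they would print at only about 2.1x1.6in, well short of the 4x3in target.</p>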
<p>In summary, be sure your images are high enough resolution for printing. It&#8217;s definitely worth the extra work. I had roughly 100 images in my blog and all of them turned out really nice in the final print book. I was quite impressed with the quality of Lulu&#8217;s printers. </p>
<p><strong>DocBook &#8211;&gt; XSL-FO &#8211;&gt; PDF</strong></p>
<p>Converting DocBook to PDF was fairly straightforward using two excellent projects: <a href="http://docbook.sourceforge.net/">DocBook stylesheets</a> and <a href="http://xmlgraphics.apache.org/fop/">Apache FOP</a>. I won&#8217;t cover how to install them on your platform and instead refer you to the INSTALL guides at the <a href="http://docbook.sourceforge.net/release/xsl/current/INSTALL">respective</a> <a href="http://xmlgraphics.apache.org/fop/quickstartguide.html">sites</a>. If you happen to be running Ubuntu, the stock packages should work fine. Simply run <code>aptitude install fop docbook-xsl</code> and you should be all set. The basic goal for this step was to use the DocBook XSL FO stylesheets to convert the DocBook created in the previous step into XSL-FO, which can be fed into Apache FOP for conversion into PDF. This step requires an XSLT processor such as xsltproc (libxslt), Saxon, Xalan, etc. I used xsltproc, which can easily be installed on Ubuntu with <code>aptitude install xsltproc</code>. After running xsltproc I passed the resulting XSL-FO output into Apache FOP to generate the final PDF. For more details see the <a href="https://github.com/aebruno/wp2print/blob/master/Makefile">Makefile</a>. Here are the basic commands:</p>
<p><pre class="brush: plain;">
$ xsltproc /path/to/docbook-xsl/fo/docbook.xsl docbook-final.xml &gt; book.fo
$ fop book.fo book.pdf
</pre></p>
<p>The DocBook XSL FO stylesheets provide a <a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/index.html">generous number of parameters</a> for customizing the resulting FO. The default parameter settings produce a very nice looking PDF but if you like to tweak things there&#8217;s no shortage of knobs to turn. As I ended up printing my book with Lulu there were a few specific customizations that were required. First I was interested in printing a US Trade 6&#215;9 inch hard cover book so the default page <a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/page.width.html">width</a>/<a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/page.height.html">height</a> needed to be set accordingly. Some other tweaks I made included adjusting the <a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/page.margin.inner.html">margins</a> slightly to provide some extra room on the <a href="http://connect.lulu.com/t5/Interior-Formatting/How-big-should-my-margins-be/ta-p/31404">spine edge</a> of the book, customizing the <a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/generate.toc.html">table of contents</a> to only include the chapter/sections, and <a href="http://docbook.sourceforge.net/release/xsl/current/doc/fo/body.start.indent.html">customizing the indentation</a> of chapters and sections (in this case I didn&#8217;t want any indentation). Here&#8217;s the resulting xsltproc command with the custom parameter settings:</p>
<p><pre class="brush: plain;">
    xsltproc \
    --stringparam page.width 6in \
    --stringparam page.height 9in \
    --stringparam page.margin.inner 1.0in \
    --stringparam page.margin.outer 0.8in \
    --stringparam body.start.indent 0pt \
    --stringparam body.font.family  Times \
    --stringparam title.font.family Times \
    --stringparam dingbat.font.family Times \
    --stringparam generate.toc 'book toc title' \
    --stringparam hyphenate false \
    /path/to/docbook-xsl/fo/docbook.xsl \
    docbook-final.xml &gt; book.fo
</pre></p>
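<p>If you regenerate the book often, it can be handy to keep these parameters in one place and build the command from them. The following Python sketch does that; it is not part of wp2print, the <code>xsltproc_args</code> helper is hypothetical, and the stylesheet path is the same placeholder used above.</p>
<p><pre class="brush: python;">
import shlex

# Parameter values copied from the command above.
params = {
    'page.width': '6in',
    'page.height': '9in',
    'page.margin.inner': '1.0in',
    'page.margin.outer': '0.8in',
    'body.start.indent': '0pt',
    'body.font.family': 'Times',
    'title.font.family': 'Times',
    'dingbat.font.family': 'Times',
    'generate.toc': 'book toc title',
    'hyphenate': 'false',
}

def xsltproc_args(stylesheet, source, params):
    # One --stringparam NAME VALUE triple per stylesheet parameter.
    args = ['xsltproc']
    for name, value in params.items():
        args += ['--stringparam', name, value]
    return args + [stylesheet, source]

args = xsltproc_args('/path/to/docbook-xsl/fo/docbook.xsl', 'docbook-final.xml', params)
# shlex.join quotes values with spaces, such as the generate.toc setting.
print(shlex.join(args))
</pre></p>
<p>The sketch only prints the command. To actually run it you could hand the argument list to <code>subprocess.run</code> and send standard output to <code>book.fo</code>.</p>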
<p><strong>A note about Fonts..</strong></p>
<p>The last and most important configuration I made was with fonts. Lulu requires every font used in your PDF to be fully <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Portable_Document_Format#Fonts">embedded</a> (the font files <a href="http://www.lulu.com/help/embed_fonts">must be included</a> directly in the PDF file) or else they will reject the PDF. Embedding fonts is supported by Apache FOP but requires some custom configuration. First I had to decide which font to use. Fonts can be really tricky and I didn&#8217;t want to get too fancy. Using a single font for the entire book was fine with me and I decided to stick with a traditional Times New Roman style font. I ended up using the FreeSerif TrueType font from <a href="http://www.gnu.org/software/freefont/">GNU FreeFont</a>. It was already installed on my Ubuntu machine and very easy to embed with Apache FOP. By default these fonts are installed in <code>/usr/share/fonts/truetype/freefont/</code>. There are lots of other free fonts out there that you could use, including the <a href="https://fedorahosted.org/liberation-fonts/">Liberation Fonts</a> and even the <a href="http://packages.ubuntu.com/lucid/ttf-mscorefonts-installer">Micro$oft True Type Core Fonts</a>, which can be installed on Ubuntu by running <code>aptitude install msttcorefonts</code>. To configure Apache FOP to use the GNU FreeFont fonts and embed them into the final PDF I created a file called <code>userconf.xconf</code> with the following lines:</p>
<p><pre class="brush: xml;">
&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;fop version=&quot;1.0&quot;&gt;
&lt;renderers&gt;
   &lt;renderer mime=&quot;application/pdf&quot;&gt;
      &lt;!-- Full path to truetype fonts to be embedded in PDF file --&gt;
      &lt;fonts&gt;
        &lt;font embed-url=&quot;file:///usr/share/fonts/truetype/freefont/FreeSerif.ttf&quot;&gt;
          &lt;font-triplet name=&quot;Times&quot; style=&quot;normal&quot; weight=&quot;normal&quot;/&gt;
        &lt;/font&gt;
        &lt;font embed-url=&quot;file:///usr/share/fonts/truetype/freefont/FreeSerifBold.ttf&quot;&gt;
          &lt;font-triplet name=&quot;Times&quot; style=&quot;normal&quot; weight=&quot;bold&quot;/&gt;
        &lt;/font&gt;
        &lt;font embed-url=&quot;file:///usr/share/fonts/truetype/freefont/FreeSerifItalic.ttf&quot;&gt;
          &lt;font-triplet name=&quot;Times&quot; style=&quot;italic&quot; weight=&quot;normal&quot;/&gt;
        &lt;/font&gt;
        &lt;font embed-url=&quot;file:///usr/share/fonts/truetype/freefont/FreeSerifBoldItalic.ttf&quot;&gt;
          &lt;font-triplet name=&quot;Times&quot; style=&quot;italic&quot; weight=&quot;bold&quot;/&gt;
        &lt;/font&gt;
      &lt;/fonts&gt;
   &lt;/renderer&gt;
&lt;/renderers&gt;
&lt;/fop&gt;
</pre></p>
<p>Then I ran fop, passing the -f option like so: <code>fop -f userconf.xconf book.fo book.pdf</code>. Note the <code>&lt;font-triplet <strong>name="Times"</strong> /&gt;</code> attribute must match the <code>body.font.family Times</code> XSLT parameter passed to the xsltproc command. </p>
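<p>Since a mismatch between the font family names and the <code>font-triplet</code> names is easy to miss, a quick check can help. The following is a minimal Python sketch, assuming the config file layout shown above; <code>triplet_names</code> and <code>missing_families</code> are hypothetical helpers, not part of FOP or wp2print.</p>
<p><pre class="brush: python;">
import xml.etree.ElementTree as ET

def triplet_names(xconf_path):
    # Collect the name attribute of every font-triplet in the FOP config.
    root = ET.parse(xconf_path).getroot()
    return {t.get('name') for t in root.iter('font-triplet')}

def missing_families(xconf_path, families):
    # Families requested from the stylesheets but absent from the config.
    return sorted(set(families) - triplet_names(xconf_path))
</pre></p>
<p>With the <code>userconf.xconf</code> above, <code>missing_families('userconf.xconf', ['Times'])</code> returns an empty list. Any name it does return is a font family that has no matching triplet in the config.</p>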
<p><strong>Simple Example</strong></p>
<p>All the code described in this post is available on <a href="https://github.com/aebruno/wp2print">github</a>. I also include a simple example to demonstrate the entire conversion process and provide some sample PDFs to show how the final book renders. I created a simple test blog consisting of Shakespeare&#8217;s Sonnets I thru X and exported the content in WordPress eXtended RSS so you can then import it into a fresh install of WordPress. I tested using the latest version of WordPress at the time of this writing (v3.1). To try it out yourself, download the code for wp2print and read thru the <a href="https://github.com/aebruno/wp2print/blob/master/README">README</a> file, which outlines all the gory details. The <a href="https://github.com/aebruno/wp2print/blob/master/Makefile">Makefile</a> outlines the general process and should provide a good starting point for experimenting. Here are some sample PDFs that were rendered from the example Shakespeare blog:</p>
<ul>
<li><a href="http://qnot.files.wordpress.com/2011/04/book-with-chapters.pdf">Book with Chapters and Sections</a></li>
<li><a href="http://qnot.files.wordpress.com/2011/04/book-with-articles.pdf">Book with all posts as DocBook Articles</a></li>
<li><a href="https://github.com/aebruno/wp2print/blob/master/sample/sample-docbook.xml">Raw DocBook output</a></li>
</ul>
<p><strong>Conclusion</strong></p>
<p>With the help of a few simple scripts it&#8217;s possible to create a high quality print ready PDF book from a WordPress blog. Depending on the content of the blog you&#8217;ll most certainly need to tailor these scripts to suit your specific requirements. The main challenges are figuring out how you want to organize your blog posts into the framework of a book and then modifying the XSLT templates to convert the WordPress HTML markup of your blog into valid DocBook elements. The services offered by print on demand publishers such as Lulu provide an easy way to turn the resulting PDF into a high quality paper book.  </p>
]]></content:encoded>
			<wfw:commentRss>http://left.subtree.org/2011/04/09/wordpress-blog-to-print-book-a-case-study/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">sigma110</media:title>
		</media:content>
	</item>
		<item>
		<title>MIF XML at O&#039;Reilly</title>
		<link>http://left.subtree.org/2007/02/04/mif-xml-at-oreilly/</link>
		<comments>http://left.subtree.org/2007/02/04/mif-xml-at-oreilly/#comments</comments>
		<pubDate>Mon, 05 Feb 2007 05:08:51 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://left.subtree.org/2007/02/04/mif-xml-at-oreilly/</guid>
		<description><![CDATA[Keith, a fellow O&#8217;Reillyer, has written a few posts on how O&#8217;Reilly has been making use of MIF XML (MX). Keith gives some background and example uses as a follow up to my previous posts on converting MIF to XML. He also posted some XSLT for round-tripping the XML back into MIF.]]></description>
			<content:encoded><![CDATA[<p><a href="http://kfahlgren.com/blog/">Keith</a>, a fellow <a href="http://www.oreilly.com">O&#8217;Reillyer</a>, has written a few <a href="http://kfahlgren.com/blog/?p=34">posts</a> on how O&#8217;Reilly has been making use of MIF XML (MX). Keith gives some background and example uses as a follow up to my previous posts on <a href="http://left.subtree.org/2007/01/25/converting-mif-to-xml/">converting MIF to XML</a>. He also posted some <a href="http://kfahlgren.com/blog/?p=35">XSLT</a> for round-tripping the XML back into MIF.</p>
]]></content:encoded>
			<wfw:commentRss>http://left.subtree.org/2007/02/04/mif-xml-at-oreilly/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">sigma110</media:title>
		</media:content>
	</item>
		<item>
		<title>Converting MIF to XML &#8211; Java Version</title>
		<link>http://left.subtree.org/2007/01/31/converting-mif-to-xml-java-version/</link>
		<comments>http://left.subtree.org/2007/01/31/converting-mif-to-xml-java-version/#comments</comments>
		<pubDate>Thu, 01 Feb 2007 03:57:01 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://left.subtree.org/2007/01/31/converting-mif-to-xml-java-version/</guid>
		<description><![CDATA[In my previous post I discussed a tool called mif2xml for converting MIF files to an intermediate XML dialect. In this post I&#8217;ll talk about the Java port of mif2xml called mif2xml-j which you can download here including just the executable jar. JFlex is a lexical analyzer generator for Java and is the library I [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://left.subtree.org/2007/01/25/converting-mif-to-xml/">previous post</a> I discussed a tool called <code>mif2xml</code> for converting MIF files to an intermediate XML dialect. In this post I&#8217;ll talk about the Java port of <code>mif2xml</code>, called <code>mif2xml-j</code>, which you can download <a href="https://github.com/aebruno/mif2xml-j">here</a>, including just the <a href="#">executable jar</a>.</p>
<p><a href="http://www.jflex.de/">JFlex</a> is a lexical analyzer generator for Java and is the library I chose to use for creating the MIF lexer. The first step was to get JFlex integrated into my build environment. For this project I decided to use <a href="http://ant.apache.org/">ant</a> but integrating JFlex into another build environment <span id="more-7"></span>should be straightforward. I created the following directory structure:</p>
<p><pre class="brush: plain; light: true;">
--/
  |-- src/main/jflex/               - JFlex lexical specifications
  |-- src/main/resources/MANIFEST   - Defines main class for executable jar
  |-- src/main/java/                - Java source
  |-- lib/                          - 3rd party libraries (JFlex.jar)
  |-- build.xml                     - Ant build file
</pre></p>
<p>JFlex comes bundled with a <code>JFlexAntTask</code> which provides a very convenient <code>&lt;jflex/&gt;</code> task. Here&#8217;s a snippet of the ant build file I created which shows how to set it up:<br />
<pre class="brush: xml;">
&lt;property name=&quot;src&quot;   location=&quot;${basedir}/src/main/java&quot; /&gt;
&lt;property name=&quot;lib&quot; location=&quot;${basedir}/lib&quot; /&gt;
&lt;property name=&quot;scanner-file&quot; value=&quot;${basedir}/src/main/jflex/mif.jflex&quot; /&gt;

&lt;path id=&quot;classpath&quot;&gt;
    &lt;pathelement location=&quot;${build}&quot; /&gt;
    &lt;fileset dir=&quot;${lib}&quot;&gt;
        &lt;include name=&quot;*.jar&quot; /&gt;
    &lt;/fileset&gt;
&lt;/path&gt;

&lt;taskdef classpathref=&quot;classpath&quot; classname=&quot;JFlex.anttask.JFlexTask&quot; name=&quot;jflex&quot; /&gt;

&lt;target name=&quot;jflex&quot; description=&quot;Generate the MIF lexer&quot;&gt;
    &lt;echo message=&quot;Generating the MIF Lexer&quot; /&gt;
    &lt;jflex file=&quot;${scanner-file}&quot; destdir=&quot;${src}&quot; /&gt;
&lt;/target&gt;
</pre></p>
<p>I found writing the lexical specification in JFlex and flex to be very similar. JFlex has a great <a href="http://www.jflex.de/manual.html">user manual</a> which contains a lot of useful info. Here&#8217;s the <code>mif.jflex</code> file:</p>
<p><pre class="brush: cpp;">
/*
 * Copyright 2007 Andrew Bruno &lt;aeb@qnot.org&gt;
 * Licensed under the Apache License, Version 2.0
 */

package org.qnot.mif2xml;
import java.util.Stack;

%%

%{
  private Stack&lt;Tag&gt; tags = new Stack&lt;Tag&gt;();
  private StringBuffer data = new StringBuffer();
  private StringBuffer facet = new StringBuffer();
%}

%line
%char
%standalone
%class  MifLexer
%xstate DATA
%xstate STR
%xstate FACET

ID=[A-Za-z][A-Za-z0-9]*
TAG=&quot;&lt;&quot;{ID}&quot; &quot;
TAG_END=&quot;&gt;&quot;
/* Inside [...] a '|' is a literal character, so alternation goes outside the class */
NONNEWLINE=[^\r\n]
NEWLINE=\r|\n|\r\n
WHITE_SPACE_CHAR=[ \r\n\t]

%%

&lt;YYINITIAL&gt; { 
   {TAG}   {
        Tag tag = new Tag();
        tag.setName(yytext().substring(1, yytext().length()-1));
        tags.push(tag);
        tag.writeStart();
        data = new StringBuffer();
        yybegin(DATA);
    }

    {TAG_END}   {
        if(!tags.empty()) {
            Tag tag = (Tag)tags.pop();
            tag.writeEnd();
        }
    }

    ^&quot;=&quot;[a-zA-Z][a-zA-Z0-9]*{NEWLINE} {
        facet = new StringBuffer();
        facet.append(yytext());
        yybegin(FACET);
    }

    {WHITE_SPACE_CHAR}+   {  /* eat up whitespace */ }
    {NONNEWLINE}          {  /* eat up everything else  */ }
}

&lt;DATA&gt; {
    {NEWLINE}  {
        if(!tags.empty()) {
            Tag tag = (Tag)tags.pop();
            tag.setValue(data.toString());
            tags.push(tag);
        }
        yybegin(YYINITIAL);
    }
    &quot;`&quot;  {  yybegin(STR); }
    {TAG_END}  {
        if(!tags.empty()) {
            Tag tag = (Tag)tags.pop();
            String value = tag.getValue();

            String dataStr = data.toString();
            if(dataStr != null &amp;&amp; dataStr.length() &gt; 0) {
                value = dataStr;
            }

            if(value != null) {
                value = value.replaceAll(&quot;^\\s+&quot;, &quot;&quot;);
                value = value.replaceAll(&quot;\\s+$&quot;, &quot;&quot;);
            }

            tag.setValue(value);
            tag.writeEnd();
        }
        yybegin(YYINITIAL);
    }
    [^\r\n`&gt;] {
        data.append(yytext());
    }
}

&lt;STR&gt; {
    &quot;'&quot;  {
        if(!tags.empty()) {
            Tag tag = (Tag)tags.pop();
            if(tag.getValue() == null || tag.getValue().length() == 0) {
                tag.setValue(&quot;`'&quot;);
            }
            tags.push(tag);
        }
        yybegin(YYINITIAL);
    }
    [^']*  {
        if(!tags.empty()) {
            Tag tag = (Tag)tags.pop();
            StringBuffer buf = new StringBuffer();
            buf.append(&quot;`&quot;);
            buf.append(yytext());
            buf.append(&quot;'&quot;);
            tag.setValue(buf.toString());
            tags.push(tag);
        }
    }
}

&lt;FACET&gt; {
    ^&quot;=EndInset&quot;{NEWLINE} {
        facet.append(yytext());
        Tag.writeFacet(facet.toString());
        yybegin(YYINITIAL);
    }

    .*{NEWLINE} {
        facet.append(yytext());
    }
}
</pre></p>
<p>I created a simple <code>Tag</code> class to encapsulate a MIF XML tag and handle writing out each tag. The <code>MifLexer</code> keeps a stack of <code>Tag</code> instances while it&#8217;s processing the input file:</p>
<p><pre class="brush: java;">
/*
 * Copyright 2007 Andrew Bruno &lt;aeb@qnot.org&gt;
 * Licensed under the Apache License, Version 2.0
 */

package org.qnot.mif2xml;

public class Tag {
    private String name;
    private String value;

    public String getName() {
        return this.name;
    }

    public String getValue() {
        return this.value;
    }

    public void setName(String name) {
        this.name = name;
    }

    public void setValue(String value) {
        this.value = value;
    }

    public void writeEnd() {
        if(value != null &amp;&amp; value.length() &gt; 0) {
            System.out.print(escape(value) + &quot;&lt;/&quot; + name + &quot;&gt;&quot;);
        } else {
            System.out.print(&quot;&lt;/&quot; + name + &quot;&gt;&quot;);
        }
    }

    public void writeStart() {
        System.out.print(&quot;&lt;&quot; + name + &quot;&gt;&quot; );
    }

    public static void writeFacet(String facet) {
        System.out.print(&quot;&lt;_facet&gt;&lt;![CDATA[&quot;);
        System.out.print(facet);
        System.out.print(&quot;]]&gt;&lt;/_facet&gt;&quot;);
    }

    private String escape(String str) {
        str = str.replaceAll(&quot;&amp;&quot;, &quot;&amp;amp;&quot;);
        str = str.replaceAll(&quot;\&quot;&quot;, &quot;&amp;quot;&quot;);
        str = str.replaceAll(&quot;&gt;&quot;, &quot;&amp;gt;&quot;);
        str = str.replaceAll(&quot;&lt;&quot;, &quot;&amp;lt;&quot;);
        str = str.replaceAll(&quot;^\\s+&quot;, &quot;&quot;);
        str = str.replaceAll(&quot;\\s+$&quot;, &quot;&quot;);

        return str;
    }
}
</pre></p>
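<p>The order of the replacements in <code>escape</code> matters: the ampersand has to be handled first, otherwise the <code>&amp;</code> that the other entities introduce would itself be escaped a second time. Here&#8217;s a minimal, self-contained sketch of that point. The class <code>EscapeDemo</code> and its two methods are hypothetical names made up for this illustration and are not part of <code>mif2xml-j</code>; the sketch also uses <code>String.replace</code>, which takes literal text rather than a regular expression.</p>
<p><pre class="brush: java;">
// Hypothetical demo class, not part of mif2xml-j
public class EscapeDemo {
    // Ampersand first, then the remaining XML special characters
    static String escape(String s) {
        return s.replace(&quot;&amp;&quot;, &quot;&amp;amp;&quot;)
                .replace(&quot;&lt;&quot;, &quot;&amp;lt;&quot;)
                .replace(&quot;&gt;&quot;, &quot;&amp;gt;&quot;)
                .replace(&quot;\&quot;&quot;, &quot;&amp;quot;&quot;);
    }

    // Deliberately wrong order: the &amp; produced for &lt; gets escaped again
    static String escapeWrongOrder(String s) {
        return s.replace(&quot;&lt;&quot;, &quot;&amp;lt;&quot;)
                .replace(&quot;&amp;&quot;, &quot;&amp;amp;&quot;);
    }

    public static void main(String[] args) {
        System.out.println(escape(&quot;a &lt; b &amp; c&quot;));        // a &amp;lt; b &amp;amp; c
        System.out.println(escapeWrongOrder(&quot;a &lt; b&quot;));  // a &amp;amp;lt; b
    }
}
</pre></p>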
<p>There&#8217;s a separate <code>Main</code> class which creates a new instance of the <code>MifLexer</code> class for processing the file passed in on the command line. I&#8217;d like to eventually extend this class so that it handles command line options and possibly even runs some XSLT&#8217;s over the generated MIF XML.</p>
<p><pre class="brush: java;">
/*
 * Copyright 2007 Andrew Bruno &lt;aeb@qnot.org&gt;
 * Licensed under the Apache License, Version 2.0
 */

package org.qnot.mif2xml;

import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.FileReader;

public class Main {
    public static void main(String[] args) {
        if(args.length != 1) {
            System.err.println(&quot;Usage : mif2xml &lt;inputfile&gt;&quot;);
            System.exit(1);
        }

        try {
            MifLexer scanner = new MifLexer(new FileReader(args[0]));
            System.out.print(&quot;&lt;?xml version=\&quot;1.0\&quot;?&gt;&lt;mif&gt;&quot;);
            scanner.yylex();
            System.out.print(&quot;&lt;/mif&gt;&quot;);
        // Report errors on stderr so they don't get mixed into the XML written to stdout
        } catch(FileNotFoundException e) {
            System.err.println(&quot;File not found : &quot;+args[0]);
        } catch(IOException e) {
            System.err.println(&quot;I/O error scanning file '&quot;+args[0]+&quot;': &quot;+e.getMessage());
        } catch(Exception e) {
            System.err.println(&quot;Unexpected exception: &quot; + e.getMessage());
            e.printStackTrace();
        }
    }
}
</pre></p>
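<p>As a rough idea of what running an XSLT over the generated MIF XML could look like, here&#8217;s a small sketch that uses the <code>javax.xml.transform</code> API that ships with the JDK. Everything specific in it is an assumption made for illustration: the class name <code>MifXslt</code>, the <code>transform</code> helper and the tiny stylesheet (which just lists the text of every <code>FTag</code> element) are not part of <code>mif2xml-j</code>.</p>
<p><pre class="brush: java;">
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of mif2xml-j
public class MifXslt {
    // Made-up stylesheet: output the text of every FTag element, each followed by ';'
    static final String XSLT =
        &quot;&lt;xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'&gt;&quot;
      + &quot;&lt;xsl:output method='text'/&gt;&quot;
      + &quot;&lt;xsl:template match='/'&gt;&quot;
      + &quot;&lt;xsl:for-each select='//FTag'&gt;&quot;
      + &quot;&lt;xsl:value-of select='.'/&gt;&lt;xsl:text&gt;;&lt;/xsl:text&gt;&quot;
      + &quot;&lt;/xsl:for-each&gt;&quot;
      + &quot;&lt;/xsl:template&gt;&quot;
      + &quot;&lt;/xsl:stylesheet&gt;&quot;;

    // Apply the stylesheet text to the XML text and return the result as a string
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String mifXml = &quot;&lt;mif&gt;&lt;Font&gt;&lt;FTag&gt;`Acronym'&lt;/FTag&gt;&lt;/Font&gt;&lt;/mif&gt;&quot;;
        System.out.println(transform(mifXml, XSLT));
    }
}
</pre></p>
<p>Note that the XSLT processor bundled with the JDK implements XSLT 1.0, so stylesheets written in XSLT 2.0 would need a different processor on the classpath.</p>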
<p>To run the code, download the <a href="#">executable jar</a> and run:<br />
<pre class="brush: plain; light: true;">
$ java -jar mif2xml-0.1.jar myfile.mif
</pre></p>
<p>The MIF XML will be printed to stdout.</p>
]]></content:encoded>
			<wfw:commentRss>http://left.subtree.org/2007/01/31/converting-mif-to-xml-java-version/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">sigma110</media:title>
		</media:content>
	</item>
		<item>
		<title>Converting MIF to XML</title>
		<link>http://left.subtree.org/2007/01/25/converting-mif-to-xml/</link>
		<comments>http://left.subtree.org/2007/01/25/converting-mif-to-xml/#comments</comments>
		<pubDate>Thu, 25 Jan 2007 08:39:43 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://left.subtree.org/2007/01/25/converting-mif-to-xml/</guid>
		<description><![CDATA[MIF (Maker Interchange Format) is an ASCII text representation of a FrameMaker document. You can export your FrameMaker documents into this text-based representation to allow for parsing and manipulation by external tools outside of FrameMaker. You can also import MIF files back into FrameMaker. If you&#8217;re interested in reading more about MIF you can [...]]]></description>
			<content:encoded><![CDATA[<p>MIF (Maker Interchange Format) is an ASCII text representation of a <a href="http://en.wikipedia.org/wiki/FrameMaker">FrameMaker</a> document. You can export your FrameMaker documents into this text-based representation to allow for parsing and manipulation by external tools outside of FrameMaker. You can also import MIF files back into FrameMaker. If you&#8217;re interested in reading more about MIF, you can check out the <a href="http://partners.adobe.com/public/developer/en/framemaker/MIF_Reference.pdf">MIF Reference</a> from Adobe (link may be out of date).</p>
<p>There&#8217;s a great Perl module on CPAN for working with MIF files called <a href="http://search.cpan.org/perldoc?FrameMaker%3A%3AMifTree">FrameMaker::MifTree</a>. It&#8217;s a subclass of <a href="http://search.cpan.org/perldoc?Tree%3A%3ADAG_Node">Tree::DAG_Node</a> and provides a nice interface for modifying the in-memory tree structure and dumping it back out into MIF. The only downside to this module is that it&#8217;s very slow, especially with larger MIF files.</p>
<p>At <a href="http://www.oreilly.com">O&#8217;Reilly</a> we&#8217;ve had to work with MIF files quite a bit and have taken several different approaches to processing MIF, most of which turned out to be unmaintainable scripts that are not very pleasant to work with. One of the ideas <a href="http://www.oreillynet.com/pub/au/1848">Andrew S.</a> and <a href="http://kfahlgren.com/blog/">Keith</a> came up with was to convert MIF to an intermediate XML format, which would allow us to process MIF using XML tools such as XSLT and XQuery. From this intermediate XML format we can transform to DocBook, WordML, or even convert back to MIF for later import into FrameMaker. This approach was very appealing because it can greatly reduce the number of one-off scripts and lets us benefit from the wide variety of libraries for parsing and transforming XML.<span id="more-6"></span></p>
<p>For example, the following snippet from a MIF file:</p>
<p><pre class="brush: plain;">
#
# Example of MIF 
#
&lt;FontCatalog
 &lt;Font
  &lt;FTag `Acronym'&gt;
  &lt;FPosition FSubscript&gt;
  &lt;FLocked No&gt;
 &gt; # end of Font
&gt; # end of FontCatalog
</pre></p>
<p>would be converted to this XML:</p>
<p><pre class="brush: xml;">
&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;!--
 Example XML from MIF
--&gt;
&lt;mif&gt;
  &lt;FontCatalog&gt;
    &lt;Font&gt;
      &lt;FTag&gt;`Acronym'&lt;/FTag&gt;
      &lt;FPosition&gt;FSubscript&lt;/FPosition&gt;
      &lt;FLocked&gt;No&lt;/FLocked&gt;
    &lt;/Font&gt;
  &lt;/FontCatalog&gt;
&lt;/mif&gt;
</pre></p>
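<p>The mapping in this example is mechanical: each <code>&lt;Name</code> opens an element, the text up to the closing <code>&gt;</code> becomes its content, and a bare <code>&gt;</code> closes the most recently opened element. Here&#8217;s a rough Java sketch of just that idea. The names <code>MifSketch</code> and <code>convert</code> are made up for illustration, and the sketch deliberately ignores facets, values that contain <code>&gt;</code> or <code>#</code>, and XML escaping, which the flex lexer and <code>Tag</code> helper later in this post deal with.</p>
<p><pre class="brush: java;">
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the mif2xml implementation
public class MifSketch {
    static String convert(String mif) {
        StringBuilder out = new StringBuilder(&quot;&lt;mif&gt;&quot;);
        Deque&lt;String&gt; open = new ArrayDeque&lt;&gt;();   // names of the currently open tags
        for (String line : mif.split(&quot;\n&quot;)) {
            int hash = line.indexOf('#');
            if (hash &gt;= 0) {
                line = line.substring(0, hash);          // drop comments
            }
            int i = 0;
            while (i &lt; line.length()) {
                char c = line.charAt(i);
                if (c == '&lt;') {
                    // read the tag name and open an element
                    int j = i + 1;
                    while (j &lt; line.length() &amp;&amp; Character.isLetterOrDigit(line.charAt(j))) {
                        j++;
                    }
                    String name = line.substring(i + 1, j);
                    open.push(name);
                    out.append('&lt;').append(name).append('&gt;');
                    // everything up to '&gt;' (or the end of the line) is the value
                    int k = j;
                    while (k &lt; line.length() &amp;&amp; line.charAt(k) != '&gt;') {
                        k++;
                    }
                    out.append(line.substring(j, k).trim());
                    i = k;
                } else if (c == '&gt;') {
                    // close the most recently opened element
                    if (!open.isEmpty()) {
                        out.append(&quot;&lt;/&quot;).append(open.pop()).append('&gt;');
                    }
                    i++;
                } else {
                    i++;
                }
            }
        }
        return out.append(&quot;&lt;/mif&gt;&quot;).toString();
    }

    public static void main(String[] args) {
        String mif = &quot;&lt;FontCatalog \n &lt;Font \n  &lt;FTag `Acronym'&gt;\n &gt; # end of Font\n&gt; # end of FontCatalog\n&quot;;
        System.out.println(convert(mif));
    }
}
</pre></p>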
<p>Converting MIF to an intermediate XML format is not a new idea. One tool I know of that does a similar job is <a href="http://www.leximation.com/tools/mifml/">MIFML</a>, written by Leximation, which converts MIF to MIFML (an intermediate XML dialect they created). Unfortunately, it only runs on Windows and is not open source. They have, however, released the <a href="http://www.leximation.com/tools/mifml/mifml.dtd.txt">DTD</a> they use for MIFML.</p>
<p>I thought this would be a fun problem to take a stab at, so I built a tool called <code>mif2xml</code> that produces output much like the example above. You can download a <a href="https://github.com/aebruno/mif2xml/downloads">copy here</a> or browse the <a href="https://github.com/aebruno/mif2xml">source code</a> online on GitHub.</p>
<p>The guts of <code>mif2xml</code> are a lexer, <code>mif.ll</code>, and a helper class for writing out MIF XML tags. I chose to create a C++ lexer so I could make use of the STL <code>stack</code> and <code>string</code> classes. Here&#8217;s the <code>mif.ll</code> file, which gets run through flex to generate the lexer:</p>
<p><pre class="brush: cpp;">
/** 
 * Copyright (c) 2007 Andrew Bruno &lt;aeb@qnot.org&gt;
 * Licensed under the GNU General Public License version 2
 */

%{
#include &lt;iostream&gt;
#include &lt;stack&gt;
#include &lt;string&gt;
#include &lt;miftag.h&gt;
using namespace std;

stack&lt;Tag&gt; tags;
string data;
string facet;
%}

%option  noyywrap
%option  c++
%x DATA
%x STR
%x FACET

ID                [A-Za-z][A-Za-z0-9]*
TAG               &quot;&lt;&quot;{ID}&quot; &quot;
TAG_END           &quot;&gt;&quot;
NONNEWLINE        [^\r\n]
NEWLINE           \r|\n|\r\n
WHITE_SPACE_CHAR  [ \r\n\t]

%%

&lt;INITIAL&gt;{TAG}  {
    Tag tag;
    string name = YYText();
    tag.name = name.substr(1, name.length()-2);
    tags.push(tag);
    tag.writeStart();
    data = string(&quot;&quot;);
    BEGIN(DATA);
}

&lt;INITIAL&gt;{TAG_END} {
    if(!tags.empty()) {
        Tag tag = tags.top();
        tag.writeEnd();
        tags.pop();
    }
}

&lt;INITIAL&gt;^&quot;=&quot;[a-zA-Z][a-zA-Z0-9]*{NEWLINE} {
    facet = string(&quot;&quot;);
    string str = string(YYText());
    facet.append(str);
    BEGIN(FACET);
}

&lt;INITIAL&gt;{WHITE_SPACE_CHAR}+   {  /* eat up whitespace */ }
&lt;INITIAL&gt;{NONNEWLINE}          {  /* eat up everything else  */ }

&lt;DATA&gt;{NEWLINE}  {
    if(!tags.empty()) {
        /* take a reference so the value is stored on the stacked tag, not on a copy */
        Tag &amp;tag = tags.top();
        tag.value = data;
    }
    BEGIN(INITIAL);
}
&lt;DATA&gt;&quot;`&quot;  {  BEGIN(STR); }
&lt;DATA&gt;{TAG_END}  {
    if(!tags.empty()) {
        Tag tag = tags.top();

        if(data.length() &gt; 0) {
            tag.value = data;
        }
        tag.writeEnd();
        tags.pop();
    }
    BEGIN(INITIAL);
}
&lt;DATA&gt;[^\r\n`&gt;] {
    string str = string(YYText());
    data.append(str);
}

&lt;STR&gt;&quot;'&quot;  {
    if(!tags.empty()) {
        Tag &amp;tag = tags.top();
        if(tag.value.length() == 0) {
            tag.value = &quot;`'&quot;;
        }
    }
    BEGIN(INITIAL);
}
&lt;STR&gt;[^']*  {
    if(!tags.empty()) {
        Tag &amp;tag = tags.top();
        string str = string(YYText());
        string buf = &quot;`&quot;;
        buf.append(str);
        buf.append(&quot;'&quot;);
        tag.value = buf;
    }
}

&lt;FACET&gt;^&quot;=EndInset&quot;{NEWLINE} {
    string str = string(YYText());
    facet.append(str);
    writeFacet(facet);
    BEGIN(INITIAL);
}

&lt;FACET&gt;.*{NEWLINE} {
    string str = string(YYText());
    facet.append(str);
}

%%

int main(int argc, char **argv) {
    cout &lt;&lt; &quot;&lt;?xml version=\&quot;1.0\&quot;?&gt;&lt;mif&gt;&quot;;
    FlexLexer* lexer = new yyFlexLexer;
    while(lexer-&gt;yylex() != 0);
    cout &lt;&lt; &quot;&lt;/mif&gt;&quot;;
    return 0;
}
</pre></p>
<p>Here&#8217;s the <code>miftag.h</code> file, which contains a helper class for writing out MIF XML tags. Rather than having a dependency on libxml or some other XML processing library, I chose to just implement the XML output by hand. It&#8217;s not nearly as robust, but it worked out OK for a first pass.</p>
<p><pre class="brush: cpp;">
/** 
 * Copyright (c) 2007 Andrew Bruno &lt;aeb@qnot.org&gt;
 * Licensed under the GNU General Public License version 2
 */

#ifndef __MIFTAG__
#define __MIFTAG__

#include &lt;iostream&gt;
#include &lt;string&gt;
using namespace std;

class Tag {
    public:
        string name;
        string value;

        void writeEnd();
        void writeStart();
};

void Tag::writeEnd() {
    if(!this-&gt;value.empty()) {
        /* escape xml special chars */
        string::size_type size = this-&gt;value.size();
        for(string::size_type i = 0; i &lt; size;) {
            if(this-&gt;value[i] == '&amp;') {
                this-&gt;value.replace(i, 1, &quot;&amp;amp;&quot;);
                i += 4;
                size += 4;
            } else if(this-&gt;value[i] == '&lt;') {
                this-&gt;value.replace(i, 1, &quot;&amp;lt;&quot;);
                i += 3;
                size += 3;
            } else if(this-&gt;value[i] == '&gt;') {
                this-&gt;value.replace(i, 1, &quot;&amp;gt;&quot;);
                i += 3;
                size += 3;
            } else if(this-&gt;value[i] == '&quot;') {
                this-&gt;value.replace(i, 1, &quot;&amp;quot;&quot;);
                i += 5;
                size += 5;
            } else {
                i++;
            }
        }

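        /* Trim leading spaces */
        while(!this-&gt;value.empty() &amp;&amp; this-&gt;value[0] == ' ') {
            this-&gt;value.erase(0, 1);
        }

        /* Trim trailing spaces (the value may be empty by now if it was all spaces) */
        while(!this-&gt;value.empty() &amp;&amp; this-&gt;value[this-&gt;value.size()-1] == ' ') {
            this-&gt;value.erase(this-&gt;value.size()-1, 1);
        }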

        cout &lt;&lt; value &lt;&lt; &quot;&lt;/&quot; &lt;&lt; this-&gt;name &lt;&lt; &quot;&gt;&quot;;
    } else {
        cout &lt;&lt; &quot;&lt;/&quot; &lt;&lt; this-&gt;name &lt;&lt; &quot;&gt;&quot;;
    }
}

void Tag::writeStart() {
    cout &lt;&lt; &quot;&lt;&quot; &lt;&lt; this-&gt;name &lt;&lt; &quot;&gt;&quot;;
}

void writeFacet(string facet) {
    cout &lt;&lt; &quot;&lt;_facet&gt;&lt;![CDATA[&quot; &lt;&lt; facet &lt;&lt; &quot;]]&gt;&lt;/_facet&gt;&quot;;
}

#endif
</pre></p>
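<p>For comparison, here&#8217;s roughly what handing the escaping over to a library looks like. This is only a sketch in Java (the language of the port discussed in my next post) using the StAX <code>XMLStreamWriter</code> that ships with the JDK; the class <code>StaxSketch</code> and the <code>element</code> helper are hypothetical names, and nothing like this is part of <code>mif2xml</code>.</p>
<p><pre class="brush: java;">
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical illustration, not part of mif2xml
public class StaxSketch {
    // Write a single element with text content and return it as a string
    static String element(String name, String value) throws XMLStreamException {
        StringWriter buf = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(buf);
        writer.writeStartElement(name);
        writer.writeCharacters(value);   // the writer takes care of escaping markup characters
        writer.writeEndElement();
        writer.close();
        return buf.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(element(&quot;FTag&quot;, &quot;R &amp; D&quot;));
    }
}
</pre></p>
<p>The trade-off is the one mentioned above: the hand-written version has no dependencies, while a library writer keeps the escaping rules out of your own code.</p>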
<p>Finally, a quick and dirty Makefile:</p>
<p><pre class="brush: plain;">
all:
	flex++ mif.ll
	g++ -I. -o mif2xml lex.yy.cc -lfl

clean:
	rm -f lex.yy.cc *.o mif2xml
</pre></p>
<p>The code above has not been thoroughly tested on all possible MIF files, so your mileage may vary. We currently use a version of <code>mif2xml</code> at O&#8217;Reilly on the occasions we need to process MIF, and it has been working out quite well. The XML generated by <code>mif2xml</code> is then run through a set of custom XSLT 2.0 transforms that convert the MIF XML to DocBook, WordML, and various other formats.</p>
<p>In my <a href="http://left.subtree.org/2007/01/31/converting-mif-to-xml-java-version/">next post</a> I&#8217;ll discuss a pure Java version of <code>mif2xml</code> which uses a great library called <a href="http://www.jflex.de/">JFlex</a> for generating the MIF lexer.</p>
]]></content:encoded>
			<wfw:commentRss>http://left.subtree.org/2007/01/25/converting-mif-to-xml/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">sigma110</media:title>
		</media:content>
	</item>
	</channel>
</rss>
