A Sourceful of Secrets

Andrew E. Bruno

Counting the number of reads in a BAM file


The output from short read aligners like Bowtie and BWA is commonly stored in SAM/BAM format. When presented with one of these files a common first task is to calculate the total number of alignments (reads) captured in the file. In this post I show some examples for finding the total number of reads using samtools and directly from Java code. For the examples below, I use the HG00173.chrom11 BAM file from the 1000 genomes project which can be downloaded here.

First, we look at using the samtools command directly. One way to get the total number of alignments is to simply dump the entire SAM file and tell samtools to count instead of print (-c option):

$ samtools view -c HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322

If we’re only interested in counting the total number of mapped reads we can add the -F 4 flag. Alternatively, we can count only the unmapped reads with -f 4:

# Mapped reads only
$ samtools view -c -F 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5068340

# Unmapped reads only
$ samtools view -c -f 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
149982

To understand how this works we first need to inspect the SAM format. The SAM format includes a bitwise FLAG field described here. The -f/-F options to the samtools command allow us to query based on the presence/absence of bits in the FLAG field. So -f 4 only outputs alignments that are unmapped (flag 0x0004 is set) and -F 4 only outputs alignments that are not unmapped (i.e. flag 0x0004 is not set), hence the latter includes only mapped alignments.

As an example for paired-end reads, you could do the following to count the number of reads where both the read itself and its mate are mapped:

$ samtools view -c -f 1 -F 12 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
4906035

The -f 1 switch only includes reads that are paired in sequencing and -F 12 only includes reads that are not unmapped (flag 0x0004 is not set) and whose mate is not unmapped (flag 0x0008 is not set). Here we add 0x0004 + 0x0008 = 12 and use -F (bits not set), meaning we include all reads where neither flag 0x0004 nor 0x0008 is set. For help understanding the values of the SAM FLAG field there’s a handy web tool here.
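To make the bit arithmetic concrete, here’s a small sketch of how the -f/-F filtering works. This is not samtools code; the class and method names are mine, and only a few of the FLAG bits are shown:

```java
// A sketch (not samtools itself) of the -f/-F bit tests on the SAM FLAG field.
public class FlagFilter {
    // A few FLAG bits from the SAM spec
    static final int PAIRED        = 0x1; // read is paired in sequencing
    static final int UNMAPPED      = 0x4; // read is unmapped
    static final int MATE_UNMAPPED = 0x8; // mate is unmapped

    // Keep a record only if every bit in require is set (like -f)
    // and no bit in exclude is set (like -F).
    static boolean keep(int flag, int require, int exclude) {
        return (flag & require) == require && (flag & exclude) == 0;
    }

    public static void main(String[] args) {
        // Equivalent of -f 1 -F 12:
        // flag 99 = paired + proper pair + mate reverse + first in pair
        System.out.println(keep(99, PAIRED, UNMAPPED | MATE_UNMAPPED)); // true
        // flag 77 = paired + read unmapped + mate unmapped + first in pair
        System.out.println(keep(77, PAIRED, UNMAPPED | MATE_UNMAPPED)); // false
    }
}
```

A record passes -f 1 -F 12 exactly when `keep(flag, 0x1, 0xC)` returns true.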

There’s also a nice command included in samtools called flagstat which computes various summary statistics. However, I wasn’t able to find much documentation describing the output and it’s not mentioned anywhere in the man page. This post examines the C code for the flagstat command which provides some insight into the output.

$ samtools flagstat HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322 + 0 in total (QC-passed reads + QC-failed reads)
273531 + 0 duplicates
5068340 + 0 mapped (97.13%:-nan%)
5205999 + 0 paired in sequencing
2603248 + 0 read1
2602751 + 0 read2
4881994 + 0 properly paired (93.78%:-nan%)
4906035 + 0 with itself and mate mapped
149982 + 0 singletons (2.88%:-nan%)
19869 + 0 with mate mapped to a different chr
15271 + 0 with mate mapped to a different chr (mapQ>=5)

The above shows a few simple examples using the samtools command, but what if you wanted to count the total number of reads in code? I’ve been using the excellent Picard Java library as of late and haven’t found a simple way to do this via the API. I was looking for a fast way to compute this without having to scan the entire BAM file each time. I’d love to see this added as a public function to the BAMIndexMetaData object or similar. Here’s a function I wrote to calculate the total mapped reads from a BAM file. It makes use of the BAM index for speed and obviously requires you to first index your BAM file:

// Sums the mapped (aligned) record counts stored in the BAM index
// metadata for each reference, avoiding a full scan of the BAM file.
public int getTotalMappedReadCount(SAMFileReader sam) {
    int count = 0;

    AbstractBAMFileIndex index = (AbstractBAMFileIndex) sam.getIndex();
    int nRefs = index.getNumberOfReferences();
    for (int i = 0; i < nRefs; i++) {
        BAMIndexMetaData meta = index.getMetaData(i);
        count += meta.getAlignedRecordCount();
    }

    return count;
}

This uses the BAMIndex to loop through each reference and sum the total mapped reads. A complete working example is included below:

import java.io.File;

import net.sf.samtools.AbstractBAMFileIndex;
import net.sf.samtools.BAMIndexMetaData;
import net.sf.samtools.SAMFileReader;

public class CountMapped {

    public static void main(String[] args) {
        File bamFile = new File(args[0]);

        SAMFileReader sam = new SAMFileReader(bamFile, 
                                 new File(bamFile.getAbsolutePath() + ".bai"));

        AbstractBAMFileIndex index = (AbstractBAMFileIndex) sam.getIndex();

        int count = 0;
        for (int i = 0; i < index.getNumberOfReferences(); i++) {
            BAMIndexMetaData meta = index.getMetaData(i);
            count += meta.getAlignedRecordCount();
        }

        System.out.println("Total mapped reads: " + count);
    }

}

Requires the Picard Java library. To compile/run:

$ javac -cp samtools.jar CountMapped.java
$ java -cp samtools.jar:. CountMapped HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
Total mapped reads: 5068340

Written by Andrew

2012/04/13 at 22:31

Posted in Bioinformatics, Java

Questions about Software Engineering


I was going through some old emails yesterday and came across one from a student at RIT who interviewed me for an introductory course he was taking. The interview consisted of questions regarding software engineering and my experience working in the field and at the Center for Computational Research. The questions were good and made me think back through my career about some things I’ve learned along the way. I thought a few of them were worth sharing and if you’re so inclined, feel free to leave how you might have answered the questions in the comments. Here’s an excerpt of the interview with a few questions and my responses:

Q: What skills are important for your job?

I think having a strong foundation in computer science is important. We work across several disciplines and being able to apply computer science concepts to help solve the various problems in a particular research area is important.

Managing expectations. It’s often the case that you work with users who don’t have a firm grasp of the technology and being able to explain the various components of a system in such a way that they understand both the capabilities and limitations is an important skill.

Problem solving. I think as a programmer you’re often confronted with a vast array of problems, whether they be bugs in code, design choices (what implications does a particular design choice have on the overall system?), reverse engineering legacy systems, data processing (how do you efficiently process terabytes of data?). Your ability to adapt to these problems and come up with effective solutions is a very important skill to develop. I’m not sure it’s a skill you ever really master but one that you are constantly fine tuning.

Q: What is your typical process for designing and developing software?

We tend to follow an Agile approach to software development (iterative approach: design, code, test, feedback, repeat). For any given project we first start with a requirements gathering phase. This includes meeting with the various groups involved in the project and getting a solid understanding of the scope. We then meet internally, come up with an initial design, and set various milestones for the project. The next phase depends on the main focus of the project. Because we are a research-focused group, we are oftentimes working on a grant application or an already funded grant. In the first case we usually build a prototype of the system which serves as a proof of concept should the grant get funded. In the case of an already funded project we follow a more rigorous development cycle involving a more formal design specification and development/staging/production environments for releasing versions of the system.

Q: What makes working for a university unique (as opposed to a private company)?

In a university setting (specifically research) there’s obviously more focus on cutting edge research areas. We’re exploring new topics that aren’t always well defined so there’s quite a bit of room for creativity. We also support researchers across many disciplines and sometimes you work on a project that never ends up getting funded. So there are times when you put quite a bit of work into something that never gets off the ground.

In contrast, working for a private company ultimately comes down to being profitable. So design decisions tend to be heavily influenced by customers/investors and what affects the bottom line. In my experience the private sector is much more deadline driven as well. Usually there’s a strict release schedule to meet and a never ending list of features to be implemented.

Q: What do you like most and like least about your job?

I very much enjoy the problem solving aspects of programming. Tackling hard problems is something I enjoy and in this field there’s definitely no shortage of them. I would have to say the thing I like least about my job is having to occasionally work with Microsoft technologies :) I’m a big advocate of free software and I cringe at the thought of having to implement a piece of JavaScript code so that it works in I.E.

Q: What skills have you learned from being a software engineer? What has the job taught you?

I think one of the best skills I’ve learned from being a software engineer is system administration. I’ve always had a passion for tinkering around with different technologies and in doing so spent quite a bit of time building systems to develop on. In the process I’ve learned a lot about system administration which has helped me out tremendously in various projects throughout my career. As a programmer it helps to know more than just how to write code.

Q: Any additional thoughts or comments?

Random advice: One of the most important tools for a programmer is a good editor. There’s a lengthy debate as to which editor is the best (I personally prefer Vim) but whatever one works for you, learn it inside and out.

Written by Andrew

2012/01/14 at 23:16

passtab – store passwords in your wallet


Here’s a quote from Bruce Schneier that essentially sums up the motivation for this post:

We’re all good at securing small pieces of paper. I recommend that people write
their passwords down on a small piece of paper, and keep it with their other
valuable small pieces of paper: in their wallet.

I recently read an excellent blog post by John Graham-Cumming in which he presents an elegant system for writing down your passwords using a Tabula Recta. I was inspired by this concept so I created a tool called passtab which aims to provide a light-weight system for managing passwords based on his idea. This post is about the general usage of passtab and presents some of the password management capabilities. This is not your grandmother’s password manager, so if you’re looking for a nice GUI point-and-click application that’s easy to use you can stop reading right here. This is for hardcore folks who enjoy looking up their passwords in archaic tablets invented by ancient cryptographers with last names like Trithemius. For the impatient, you can grab a copy of the latest version on github.

Introducing passtab

passtab is a light-weight system for managing passwords using a Tabula Recta. passtab has two main features: 1. generating random Tabula Rectas in PDF format for printing and storing in your wallet, and 2. fetching passwords from the Tabula Recta (password management). These features are independent and you can use passtab to only generate PDFs or optionally make use of the password management features. One unique benefit is the ability to have both an electronic and a paper copy of your passwords. You can download the binary release of passtab at github here. Unpack the distribution and run ./bin/passtab --help for a list of options. If the startup shell script doesn’t work you can run java -jar lib/passtab-uber.jar --help. The following sections illustrate some use cases of passtab.

Generate a random Tabula Recta in PDF

passtab can generate random Tabula Rectas in PDF format.

$ ./bin/passtab --format pdf --output passtab.pdf
Jun 12, 2011 11:16:29 AM org.qnot.passtab.PassTab generate
INFO: Generating a random Tabula Recta (might take a while)...
$ ls *.pdf
passtab.pdf

Here’s an example PDF generated from passtab. You can now print this PDF out and store in your wallet!

How to use the Tabula Recta

Here’s a simple example (taken directly from the README), suppose we have the following Tabula Recta:


    | A B C D E F G H I J K L M N 
  --|----------------------------
  A | _ u } I ` } R ) a < L : a A 
  B | - o ( : p # O % . _ ; ' j L 
  C | w c ( c y 2 h y ~ N O * > w 
  D | o : R m L % V , d H r Y B j 
  E | 9 , < 0 J p a o ) O w 0 w # 
  F | C j i } i z 2 $ O R 5 @ T I 
  G | Q - E m 8 N c / + u W Y V > 
  H | , y } U Y i j i q w q c - 4 
  I | K j W H e ; I ? E 7 H v 2 + 
  J | g * 7 4 E } a h Y z < " : w 
  K | . _ } I / J k 1 a D ^ ; p K 
  L | ` < A L c z } } I P ? 4 y T 
  M | F D < 8 < 0 R B t 9 X o B 2 
  N | I r O E m o a + Y W w ; : 7

And suppose we want to get our password for logging into webmail at acme.com. We decide to use the first and last letter of the domain name as the start row/column of the password and we want a password 8 characters in length. So we start at the intersection of row ‘A’ and column ‘E’ and read off 8 characters diagonally, resulting in the password: `#h,)RWc

Defining a scheme for selecting the starting row/column for a given password is completely up to the user and can be as simple or as complex as one desires. The direction for reading the password is also up to the user to define (left, right, diagonally, etc.). See John Graham-Cumming’s excellent blog post for more examples.
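The diagonal read is easy to express in code. Here’s a hypothetical sketch (my own class, not part of passtab) that reads a password diagonally (SE) from a Tabula Recta stored as a 2D char array; the example grid is the top-left corner of the table above:

```java
// A hypothetical sketch of reading a password diagonally from a Tabula Recta;
// not passtab's actual code.
public class DiagonalRead {
    static String read(char[][] grid, int row, int col, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append(grid[row + i][col + i]); // move one step SE each time
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Rows A-D, columns A-D of the example Tabula Recta above
        char[][] grid = {
            {'_', 'u', '}', 'I'},
            {'-', 'o', '(', ':'},
            {'w', 'c', '(', 'c'},
            {'o', ':', 'R', 'm'},
        };
        // Start at row 0, column 1 and read 3 characters diagonally
        System.out.println(read(grid, 0, 1, 3)); // prints "u(c"
    }
}
```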

This method is slightly more complex than just writing down your passwords on a sheet of paper but the added complexity offers some advantages:

  1. Can store all your passwords on a single sheet of paper
  2. If someone steals this sheet of paper they’ll have a harder time figuring out what your passwords are
  3. Allows you to use strong random passwords
  4. If you want to change your passwords just re-generate a new Tabula Recta. Your scheme for selecting passwords can stay the same

passtab makes no assumptions about how passwords are read nor does it know anything about your scheme (unless you configure it). Now that you don’t have to remember long random passwords anymore, what do you need to remember when using a Tabula Recta?

First, you need to come up with a method for finding the starting position for a given password. In the example above this can be as simple as using characters from a domain/host name, but the beauty is you can be as creative as you want. A scheme that works for most of your passwords would probably be ideal, but you can certainly generate multiple Tabula Rectas if you like.

Once you have a way of coming up with a starting location, you need to define a method for reading off the password. In passtab this is called a sequence. In the example above we simply read 8 characters diagonally, but again you can be creative here. You could read 8 characters diagonally skipping every 3rd character, etc.

Lastly, you’ll need to remember what to do if you hit the edge of the Tabula Recta before the end of the password. For example, if you start at Z:Z and want to read 8 characters diagonally you can’t, because you’ve reached the edge of the Tabula Recta. In passtab this is called a collision. In this case we could just continue reading following the edge.

Using the Tabula Recta allows you to make use of long secure random passwords and only have to remember three simple things. You also have all your passwords on a single sheet of paper that fits in your wallet.

Custom Alphabets

In passtab, a Tabula Recta consists of two alphabets: the header alphabet and the data alphabet. The header alphabet is used for the row and column headings of the Tabula Recta and forms the basis for finding the starting location of the passwords. The data alphabet is used to generate the contents of the Tabula Recta and passtab will randomly pick characters from this alphabet using a cryptographically secure random number generator. By default, passtab uses a header alphabet of 0-9A-Z and a data alphabet consisting of all printable ASCII characters. It’s important to keep in mind that the data alphabet directly affects the entropy of your passwords. passtab allows you to customize these alphabets, letting you generate any kind of Tabula Recta, for example:

$ ./bin/passtab -b A,B,C,D -a 'a,b,c,d,1,2,3,4,!,@,#'
Jun 12, 2011 10:24:26 PM org.qnot.passtab.PassTab generate
INFO: Generating a random Tabula Recta (might take a while)...
  A B C D 
A d 1 @ 4 
B c 4 @ 2 
C b 3 3 ! 
D 1 a @ 4 

Here’s a Tabula Recta using greek symbols as the header alphabet (here’s the example PDF):

$ ./bin/passtab -b 'Σ,Τ,Π,ρ,ϋ,ψ' -a 'a,b,c,d,1,2,3,4,!,@,#'
Jun 12, 2011 11:26:00 PM org.qnot.passtab.PassTab generate
INFO: Generating a random Tabula Recta (might take a while)...
  Σ Τ Π ρ ϋ ψ 
Σ 1 2 1 d d c 
Τ 1 2 b b @ c 
Π 1 # c 3 2 @ 
ρ 4 2 d 2 @ 3 
ϋ 2 3 b 1 ! b 
ψ d @ # c ! a
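Custom data alphabets trade convenience for entropy. A rough back-of-the-envelope sketch of the arithmetic (my own helper, not part of passtab): a password of n characters drawn uniformly at random from an alphabet of k characters has about n·log2(k) bits of entropy.

```java
// Rough entropy estimate for random passwords: n * log2(k) bits,
// where k is the alphabet size and n is the password length.
public class Entropy {
    static double bits(int alphabetSize, int passwordLength) {
        return passwordLength * (Math.log(alphabetSize) / Math.log(2));
    }

    public static void main(String[] args) {
        // 95 printable ASCII characters vs. the 11-character alphabet above
        System.out.printf("8 chars, 95-char alphabet: %.1f bits%n", bits(95, 8));
        System.out.printf("8 chars, 11-char alphabet: %.1f bits%n", bits(11, 8));
    }
}
```

An 8-character password drops from roughly 52 bits of entropy with the default data alphabet to under 28 bits with the small custom alphabet, so shrinking the data alphabet is best paired with longer passwords.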

Password Management

So this is all well and good, but in reality it can be a huge pain to look up your webmail password in a Tabula Recta that’s on a sheet of paper in your wallet every time you log in. For this reason, passtab has some optional features to help read passwords from the Tabula Recta. This allows you to have both a hard copy of the Tabula Recta in your wallet and an electronic version stored on your hard drive for quick access to your passwords. This obviously comes with some security considerations and care must be taken to protect the passtab database as you would any ssh private key, for example. If someone got a hold of the passtab database file they could brute force your Tabula Recta. I ended up creating an encrypted thumb drive and storing my passtab configuration and database files on it. You could also use gpg to encrypt it or any other method to protect it from the bad guys. The next section discusses the password management features of passtab.

First some definitions:

  • Direction: a direction to move on the Tabula Recta. Valid values are N,S,E,W,NE,NW,SE,SW
  • Sequence Item: a sequence item consists of a length and direction. For example, 12:SE would mean move 12 characters in the SE direction (diagonally)
  • Sequence: a sequence is a list of sequence items. This allows you to define arbitrary sequences for reading passwords. For example, 4:SE,3:N,1:S would mean read 4 characters SE (diagonally) followed by 3 characters N (up) followed by 1 character S (down)
  • Collision: a collision defines what directions to move if we hit the edge of the Tabula Recta before the end of the password. You can define more than one direction and they will be tried in order. For example, N,NE,E,SE,S,SW,W,NW would mean if we hit a wall try those directions in order until we’re able to move again
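The definitions above can be sketched in code. The following is a hypothetical illustration (my own class and method names, not passtab’s implementation) of walking a sequence like 2:SE across a grid, falling back to the collision directions when a step would leave the grid:

```java
import java.util.Map;

// A hypothetical sketch of a sequence/collision walk, not passtab's code:
// follow (length:direction) steps across a grid, and when a step would
// leave the grid, try the collision directions in order.
public class SequenceWalk {
    // row/col deltas for the eight compass directions
    static final Map<String, int[]> DIRS = Map.of(
        "N", new int[]{-1, 0}, "S", new int[]{1, 0},
        "E", new int[]{0, 1},  "W", new int[]{0, -1},
        "NE", new int[]{-1, 1}, "NW", new int[]{-1, -1},
        "SE", new int[]{1, 1},  "SW", new int[]{1, -1});

    static boolean inBounds(char[][] g, int r, int c) {
        return r >= 0 && r < g.length && c >= 0 && c < g[0].length;
    }

    static String walk(char[][] g, int r, int c, String[] seq, String[] fallback) {
        StringBuilder sb = new StringBuilder().append(g[r][c]); // start char included
        for (String item : seq) {
            String[] p = item.split(":");        // e.g. "2:SE" -> length 2, dir SE
            int len = Integer.parseInt(p[0]);
            int[] d = DIRS.get(p[1]);
            for (int i = 0; i < len; i++) {
                if (!inBounds(g, r + d[0], c + d[1])) {
                    // collision: try the fallback directions in order
                    for (String alt : fallback) {
                        int[] a = DIRS.get(alt);
                        if (inBounds(g, r + a[0], c + a[1])) { d = a; break; }
                    }
                }
                r += d[0]; c += d[1];
                sb.append(g[r][c]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[][] grid = {
            {'a', 'b', 'c'},
            {'d', 'e', 'f'},
            {'g', 'h', 'i'},
        };
        // 2:SE from the top-left corner: no collision
        System.out.println(walk(grid, 0, 0, new String[]{"2:SE"},
                                new String[]{"N"})); // prints "aei"
        // 2:SE from the bottom-right corner: collides, falls back to N
        System.out.println(walk(grid, 2, 2, new String[]{"2:SE"},
                                new String[]{"N", "S", "E", "W"})); // prints "ifc"
    }
}
```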

Generate a Tabula Recta in PDF and save to a passtab database

passtab can generate a Tabula Recta in PDF along with storing it in a passtab database. The passtab database is stored in JSON format and can be easily accessed outside of passtab (any language that can read JSON files). Again, you’ll want to store that JSON file someplace safe. For example:

$ ./bin/passtab --dbsave --name mypasstab
Jun 12, 2011 10:48:33 PM org.qnot.passtab.PassTab generate
INFO: Generating a random Tabula Recta (might take a while)...
$ ls mypasstab.*
mypasstab.json  mypasstab.pdf

Reading passwords from the passtab database

Once we’ve created our passtab database we can now fetch passwords by telling passtab the starting location and the sequence to read. For example, suppose we want to read a password starting at row ‘B’ and column ‘N’ and we want a password 10 characters in length reading diagonally:

$ ./bin/passtab -i mypasstab.json --getpass B:N --sequence 9:SE
o6,ZzH{e$@

Copy the password to the clipboard using xclip:

$ ./bin/passtab -i mypasstab.json --getpass B:N --sequence 9:SE --chomp | xclip

We used 9:SE as our sequence because passtab includes the character at the start location in the password. If we don’t want to include this character we can optionally skip it like so:

$ ./bin/passtab -i mypasstab.json --getpass B:N --sequence 10:SE --skipstart
6,ZzH{e$@_

You can also define a list of directions to try in the event of a collision. This will try the directions N,S,E,W in order until we can move again. Here we start at Z:Z and can’t move SE (diagonally), so we try N (up), which works, so we move N (up) until we hit another collision:

$ ./bin/passtab -i mypasstab.json --getpass Z:Z --sequence 9:SE --collision N,S,E,W
a((vy&0bV&

Conclusion

This post introduced a new tool called passtab for managing passwords using a Tabula Recta. I’m sure it has plenty of bugs so use at your own risk and if by chance you find it somewhat useful I’d be very interested in any feedback.

Written by Andrew

2011/07/01 at 00:15

Posted in Hacks, Java, passtab, passwords

WordPress Blog to Print Book – A Case Study


In this post I discuss my experience converting a WordPress blog into a print book. This is by no means a generic how-to guide but more along the lines of a case study. There are a number of ways one could tackle this problem; however, I wasn’t able to find any existing methods that fit my needs. Specifically, I wanted to convert the content of a WordPress blog into a high quality print-ready PDF book (complete with chapters, sections, table of contents, images, figures, page numbering, index, etc.) which could then be sent to various POD publishers such as Lulu for printing. I wanted to streamline as much of the process as possible to allow for regenerating the PDF book as new posts are made. Ideally there would be a WordPress plugin for this, but making such a plugin generic enough would be tricky and require many assumptions to be made regarding the structure of your blog (e.g. what constitutes a chapter or a section?). In this post I describe how I ended up creating a PDF book from WordPress and discuss a few challenges I encountered along the way. A brief disclaimer: this post is intended for folks who are familiar with the *nix command line and enjoy mucking around in code (and don’t mind slinging around some XML here and there). It’s not for the faint of heart but hopefully it will be useful to others interested in a similar outcome.

If you’d rather skip this epic post and dive right into the code you can browse it all over on github. There’s a README file and a Makefile which details how to run wp2print and includes a simple example.

Background and Assumptions

The idea for this project came about as my family has been writing a private blog that I want to be able to share with my children some day. I’ve often thought about giving them a copy of the blog when they’re much older and wondered what would be the best way to preserve the content, ensuring it’s viewable years down the road. I thought by creating a physical copy of the blog I could have something tangible to pass down through the generations. Using the services provided by companies like Lulu and Blurb, printing a book is as simple as uploading a PDF and designing a cover. Having worked in the publishing industry in a former life I had some experience generating PDF books, so I looked forward to the challenge.

As my goal was to create a book I needed to convert the content of the blog into an intermediate format which represented the book and could then be used to generate a PDF file. As I had a good amount of experience slinging around DocBook, my general idea was to export the content from WordPress and convert it into DocBook. Then converting DocBook into a PDF is fairly straightforward using the wonderful DocBook stylesheets and Apache FOP.

The first obvious challenge when converting a blog into a book is deciding how to go about organizing the blog posts into chapters and sections. Our family blog was authored in such a way that each post had only one tag (category) and, most importantly, all posts were tagged chronologically, meaning that all posts in a given category were sequential. For example, posts 0-5 are all tagged with “tag1”, posts 6-10 are all tagged with “tag2”, and so on. You can probably see where this is going. Having authored the blog in this format allowed me to easily use each tag as a chapter, with each post appearing as a section within that chapter. If you’re unable to make these assumptions about your blog (and most likely you won’t be able to) just keep this in mind as we delve into the code later on. You would need to modify the script I wrote and add in the appropriate logic to slice your blog posts up into chapters/sections or however you’d like to structure your DocBook file. I experimented with just making each post a chapter (and even a series of DocBook articles) which isn’t a bad option, however depending on the number of posts you may want to consider omitting the table of contents.
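The tag-to-chapter grouping described above is simple to sketch. This is a hypothetical illustration (my own class, not the actual wp2print code): because posts in this blog are tagged sequentially, grouping consecutive posts by tag yields one chapter per tag, with one section per post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A sketch (not the actual wp2print code) of the chapter/section split.
public class ChapterSplit {
    static class Post {
        final String tag, title;
        Post(String tag, String title) { this.tag = tag; this.title = title; }
    }

    // One chapter per tag, one section per post.
    // LinkedHashMap preserves the chronological chapter order.
    static LinkedHashMap<String, List<String>> toChapters(List<Post> posts) {
        LinkedHashMap<String, List<String>> chapters = new LinkedHashMap<>();
        for (Post p : posts) {
            chapters.computeIfAbsent(p.tag, t -> new ArrayList<>()).add(p.title);
        }
        return chapters;
    }

    public static void main(String[] args) {
        List<Post> posts = Arrays.asList(
            new Post("tag1", "First post"), new Post("tag1", "Second post"),
            new Post("tag2", "Third post"));
        for (Map.Entry<String, List<String>> e : toChapters(posts).entrySet()) {
            System.out.println("Chapter " + e.getKey() + ": sections " + e.getValue());
        }
    }
}
```

Note this only works because the posts were tagged chronologically; a blog with interleaved or multiple tags per post would need different grouping logic.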

A few other assumptions I made:

  • All posts were written by the same author so I omitted displaying any author information. Easy enough to add in if needed
  • All comments were excluded from the book. Comments are an important part of any blog but in this case my blog didn’t have very many comments. I was most interested in the content of the post only and decided to omit any and all comments. These could certainly be added in but some thought would be needed on which DocBook element to use for structuring them within the book.
  • There was one page (not a post) with the title “About” that I used as the preface for the book. This can be any post/page or omitted completely if desired.
  • Access to the WordPress code that runs the blog. This won’t work if your blog is hosted for example at WordPress.com. You’ll need to export your blog and run it on your own server.

Here’s a brief outline of the entire process. I’ll go over each step in detail in the next section.

  1. Convert WordPress content to DocBook – using PHP and some XSLT
  2. Convert DocBook to XSL-FO – using DocBook stylesheets
  3. Convert XSL-FO to PDF – using Apache FOP
  4. Upload PDF file to Lulu and order print book

Convert WordPress content to DocBook

First step was to convert the blog into DocBook. This was by far the most challenging step. My first attempt was to use the Export feature in WordPress, which dumps the entire contents of your blog in XML format (WordPress eXtended RSS), and write an XSLT to convert it into DocBook. This turned out to be slightly harder than I anticipated because of how the content of each post was formatted in the WordPress XML dump. It appeared to be in the native format WordPress uses to store the post in the database and I didn’t want to have to write a custom WordPress post renderer for DocBook.

I decided to instead write a fairly simple PHP script which uses the WordPress API to render each post in HTML just like it normally would if someone visited the site, then convert the HTML to DocBook. I found converting the HTML to DocBook was slightly easier than having to parse the native WordPress format. I did this in two steps: first I wrote a PHP script to generate a quasi-DocBook file which uses the WordPress API to embed the HTML content of each post within a <section/>. Then I wrote an XSLT which transforms the quasi-DocBook and embedded HTML into a final valid DocBook file. The main PHP code is here. You’ll need to change the include paths in config.php to point to your WordPress installation (see the Makefile for a complete example).

The XSLT is here. It looks for various HTML tags that appear in my blog and converts those to valid DocBook elements. I built up the XSLT by trial and error. I first rendered the quasi-DocBook generated from the PHP script as PDF. The DocBook stylesheets have a nice feature in that any invalid DocBook elements they encounter are highlighted in red in the resulting PDF. By iterating through the invalid elements I was able to add the correct templates to my XSLT to account for all HTML tags found in my blog posts. You’ll most certainly need to modify this XSLT file to suit your specific needs, but it should serve as a decent starting point.

Here’s an example of the HTML generated by WordPress for an image included in a blog post:

<div id="attachment_155" class="wp-caption aligncenter" style="width: 310px">
  <a href="/wp-content/media/2008/11/image.jpg">
        <img class="size-medium wp-image-155" title="image title" src="/wp-content/media/2008/11/image-300x218.jpg" alt="image alt" width="300" height="218"/>
  </a>
  <p class="wp-caption-text">This is a description of the image</p>
</div>

Which then gets converted to a DocBook mediaobject element:

<para>
  <mediaobject>
    <imageobject>
       <imagedata align="center" fileref="images/2008/11/image.jpg" width="4.0in" depth="3.0in" scalefit="1" format="JPG"/>
    </imageobject>
    <caption><para>This is a description of the image</para></caption>
  </mediaobject>
</para>

A note about images…

Care must be taken to ensure any images you want included in the book are print ready. I ended up having quite a few images in my blog that I wanted to include in the final PDF, which required some extra work to get them ready for printing. For best results you’ll want to make sure the resolution of your images is at least 300ppi (pixels per inch). See this post on Lulu. For example, if your image is 600x600px and you set the resolution to be 300ppi, the printed image will be roughly 2x2in.

In my case I was printing a 6×9 book and after factoring in margins/spine/bleed etc. I calculated the maximum print size I wanted for each image was 4x3in (as defined in the DocBook XML element <imagedata width="4.0in" depth="3.0in"/> in the above example). As most of the images were pictures, this print size ended up being large enough so the photo was still viewable but small enough to allow for 2 images per page. This meant that the minimum size (in pixels) each image had to be was 1200x900px. The problem was that when we uploaded pictures to our blog we had WordPress resize them to 500x400px (from their original size of 2816x2112px from the camera). Fortunately, I still had the original image files, which I collected and used in the final PDF. Something to keep in mind if you have images (especially photographs) in your blog that you want printed.

I ran into another edge case with the images which required a little bit of imagemagick. I had a few important pictures that were taken with Photo Booth on a Mac in which the original image size was a mere 640x480px. I knew the print version of the images would look dreadful so my only option was to resample them to a higher resolution. This can easily be accomplished using imagemagick’s convert command:

$ convert -resample 300x orig.jpg hires.jpg

In summary, be sure your images are high enough resolution for printing. It’s definitely worth the extra work. I had roughly 100 images in my blog and all of them turned out really nice in the final print book. I was quite impressed with the quality of Lulu’s printers.
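The print-size arithmetic above boils down to two formulas: printed inches = pixels / ppi, and minimum pixels = target inches × ppi. Here’s a small helper of my own (not part of any tool mentioned here) that checks the numbers:

```java
// Print-size arithmetic: printed inches = pixels / ppi,
// and minimum pixels for a target print size = inches * ppi.
public class PrintSize {
    static double printedInches(int pixels, int ppi) {
        return (double) pixels / ppi;
    }

    static int minPixels(double inches, int ppi) {
        return (int) Math.ceil(inches * ppi);
    }

    public static void main(String[] args) {
        // 600px at 300ppi prints at 2 inches
        System.out.println(printedInches(600, 300)); // prints 2.0
        // a 4x3in image at 300ppi needs at least 1200x900px
        System.out.println(minPixels(4.0, 300) + "x" + minPixels(3.0, 300)); // prints 1200x900
    }
}
```

These match the figures in the text: a 600x600px image prints at roughly 2x2in, and a 4x3in print target at 300ppi requires images of at least 1200x900px.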

DocBook –> XSL-FO –> PDF

Converting DocBook to PDF was fairly straightforward using two excellent projects: the DocBook stylesheets and Apache FOP. I won’t cover how to install them on your platform and instead refer you to the INSTALL guides at the respective sites. If you happen to be running Ubuntu, using the stock packages should work fine. Simply run aptitude install fop docbook-xsl and you should be all set. The basic goal for this step was to use the DocBook XSL FO stylesheets to convert the DocBook created in the previous step into XSL-FO, which can be fed into Apache FOP for conversion into PDF. This step requires that an XSLT processor be installed, such as xsltproc (libXML), Saxon, Xalan, etc. I used xsltproc, which can easily be installed on Ubuntu with aptitude install xsltproc. After running xsltproc I passed the resulting XSL-FO output into Apache FOP to generate the final PDF. For more details see the Makefile. Here are the basic commands:

$ xsltproc /path/to/docbook-xsl/fo/docbook.xsl docbook-final.xml > book.fo
$ fop book.fo book.pdf

The DocBook XSL FO stylesheets provide a generous number of parameters for customizing the resulting FO. The default parameter settings produce a very nice looking PDF, but if you like to tweak things there’s no shortage of knobs to turn. Since I ended up printing my book with Lulu, a few specific customizations were required. First, I was printing a US Trade 6×9 inch hard cover book, so the default page width/height needed to be set accordingly. Some other tweaks I made included adjusting the margins slightly to provide extra room on the spine edge of the book, customizing the table of contents to only include the chapters/sections, and customizing the indentation of chapters and sections (in this case I didn’t want any indentation). Here’s the resulting xsltproc command with the custom parameter settings:

    xsltproc \
    --stringparam page.width 6in \
    --stringparam page.height 9in \
    --stringparam page.margin.inner 1.0in \
    --stringparam page.margin.outer 0.8in \
    --stringparam body.start.indent 0pt \
    --stringparam body.font.family  Times \
    --stringparam title.font.family Times \
    --stringparam dingbat.font.family Times \
    --stringparam generate.toc 'book toc title' \
    --stringparam hyphenate false \
    /path/to/docbook-xsl/fo/docbook.xsl \
    docbook-final.xml > book.fo

A note about Fonts...

The last and most important configuration I made was with fonts. Lulu requires fonts to be fully embedded (the font files must be included directly in the PDF file) or else they will reject the PDF. Embedding fonts is supported by Apache FOP but requires some custom configuration. First I had to decide which font to use. Fonts can be really tricky and I didn’t want to get too fancy; using a single font for the entire book was fine with me, and I decided to stick with a traditional Times-style serif. I ended up using the FreeSerif TrueType font from GNU FreeFont. It was already installed on my Ubuntu machine and very easy to embed with Apache FOP. By default these fonts are installed in /usr/share/fonts/truetype/freefont/. There are lots of other free fonts out there that you could use, including the Liberation Fonts and even the Micro$oft TrueType Core Fonts, which can be installed on Ubuntu by running aptitude install msttcorefonts. To configure Apache FOP to use GNU FreeFont and embed the fonts in the final PDF I created a file called userconf.xconf with the following lines:

<?xml version="1.0"?>
<fop version="1.0">
<renderers>
   <renderer mime="application/pdf">
      <!-- Full path to truetype fonts to be embedded in PDF file -->
      <fonts>
        <font embed-url="file:///usr/share/fonts/truetype/freefont/FreeSerif.ttf">
          <font-triplet name="Times" style="normal" weight="normal"/>
        </font>
        <font embed-url="file:///usr/share/fonts/truetype/freefont/FreeSerifBold.ttf">
          <font-triplet name="Times" style="normal" weight="bold"/>
        </font>
        <font embed-url="file:///usr/share/fonts/truetype/freefont/FreeSerifItalic.ttf">
          <font-triplet name="Times" style="italic" weight="normal"/>
        </font>
        <font embed-url="file:///usr/share/fonts/truetype/freefont/FreeSerifBoldItalic.ttf">
          <font-triplet name="Times" style="italic" weight="bold"/>
        </font>
      </fonts>
   </renderer>
</renderers>
</fop>

Then I ran fop, passing the configuration file with the -c option like so: fop -c userconf.xconf book.fo book.pdf. Note that the <font-triplet name="Times" /> name must match the body.font.family Times XSLT parameter passed to the xsltproc command.
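One gotcha worth guarding against: if an embed-url in userconf.xconf points at a file that doesn’t exist, fop may fall back to a built-in base-14 font and the resulting PDF won’t pass Lulu’s embedding requirement. A quick sketch to check the font files are where the config says they are (the directory is the Ubuntu freefont package location used above):

```shell
#!/bin/sh
# Verify the TrueType files referenced by userconf.xconf actually exist
# before running fop. Prints a line for each missing file.
FONTDIR=/usr/share/fonts/truetype/freefont
for f in FreeSerif FreeSerifBold FreeSerifItalic FreeSerifBoldItalic; do
    [ -f "$FONTDIR/$f.ttf" ] || echo "missing: $FONTDIR/$f.ttf"
done
```

After generating the PDF you can also double check embedding with pdffonts book.pdf (from the poppler-utils package); every font row should show "yes" in the emb column.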

Simple Example

All the code described in this post is available on github. I also include a simple example to demonstrate the entire conversion process, along with some sample PDFs showing how the final book renders. I created a simple test blog consisting of Shakespeare’s Sonnets I through X and exported the content in WordPress eXtended RSS format so you can import it into a fresh install of WordPress. I tested using the latest version of WordPress at the time of this writing (v3.1). To try it out yourself, download the code for wp2print and read through the README file, which outlines all the gory details. The Makefile outlines the general process and should provide a good starting point for experimenting. Here’s some sample PDFs that were rendered from the example Shakespeare blog:

Conclusion

With the help of a few simple scripts it’s possible to create a high quality, print ready PDF book from a WordPress blog. Depending on the content of the blog you’ll almost certainly need to tailor these scripts to suit your specific requirements. The main challenges are figuring out how you want to organize your blog posts into the framework of a book, and then modifying the XSLT templates to convert the WordPress HTML markup of your blog into valid DocBook elements. The services offered by print on demand publishers such as Lulu provide an easy way to turn the resulting PDF into a high quality paper book.

Written by Andrew

2011/04/09 at 05:05

Posted in Hacks, PHP, XML

Logging out without killing a process

leave a comment »

Here’s the scenario: you’re logged into your favorite *nix box and are using bash as your shell. You fired off some process which is going to take a while to run (and forgot to run screen), and now you want to log out without killing that process. The command to use is disown. Here’s a really simple example:

$ ssh some-host
$ perl script-that-chugs-along.pl
^Z           (suspend with Ctrl-Z)
$ bg         (put it in the background)
$ disown -h
$ logout

The disown command allows you to remove jobs from the list of active jobs associated with your login shell. Here’s an excerpt from the bash man page:

Without options, each jobspec is removed from the table of active jobs. If the -h option is given, the job is not removed from the table, but is marked so that SIGHUP is not sent to the job if the shell receives a SIGHUP.
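A quick way to convince yourself this works, using a throwaway sleep job in place of a long-running script (a sketch; %1 is simply whatever job number bash assigned):

```shell
#!/bin/bash
# Start a long-running job in the background, then mark it so bash
# won't forward SIGHUP to it when the login shell exits.
sleep 600 &
disown -h %1       # -h: keep the job in the table, but no SIGHUP on logout
kill -0 "$!" && echo "still running, safe to log out"
```

Because the job stays in the table with -h, you can still inspect it with jobs or wait on it; it just survives the SIGHUP sent at logout.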

Written by Andrew

2010/09/14 at 04:29

Posted in Linux

No more moko for me

with 2 comments

Well, it’s been just over 2 years now since I picked up my Neo Freerunner and unfortunately I have nothing good to report back. When I first got my hands on the Freerunner I had great expectations, as can be seen in my previous posts; however, it turns out I was a little too optimistic.

My hope was to write about my experience developing applications for the device, but it seems I was one of the unlucky users who suffered from the infamous buzz. And I don’t mean buzz in a good way :) Apparently this didn’t affect every user and was more prevalent on certain GSM bands. However random it may have been for some users, it was an absolute show stopper for me and rendered the device pretty much unusable as a phone. Anytime I placed a call, the person on the other end heard a horrible loud buzzing sound. After a while Openmoko finally tracked down the source of the buzzing to a hardware problem and published instructions on how to mod the device (known as the big-C rework). Unfortunately it requires some fairly advanced SMD-soldering skills and is way out of my league. As far as I know, Openmoko never really offered much help in resolving the issue for people who had already purchased the device with the hardware flaw (rev A5/A6) and were unable to perform the necessary surgery.

Despite the broken phone functionality the Freerunner itself is still pretty cool. I’m sure I’ll be able to think of some other fun projects to use it for, especially since it runs Debian.

Written by Andrew

2010/09/13 at 02:28

Posted in FreeRunner

Freerunner First Boot

with 2 comments

Here’s some notes on my initial experience setting up the Neo Freerunner. I’ve been meaning to write this post for a while now and most of this is already old stuff but I’m posting it anyhow for reference. I purchased the Neo Freerunner fully aware that it was a developer phone but my hope was that I could at least ssh into the device and make/receive a few phone calls. I’m happy to report that after first booting I was able to get most things functioning within a few hours.

First Boot

There are several distributions for the Freerunner, which can get quite confusing, but the one that comes stock with the Freerunner is referred to as 2007.2. The first time you boot the Freerunner you’re presented with the home screen for 2007.2. You can also boot into NAND and NOR flash, which allows you to update the kernel, the root filesystem, and the boot loader (U-Boot).

My first mission was to ssh into the device, so I followed the instructions on the wiki for setting up USB networking. By default the IP address of the Freerunner is 192.168.0.202. On the desktop side you first have to ifconfig the usb0 interface and set up the correct routes. Here’s the script I run on my desktop after connecting the Freerunner:

#!/bin/bash

/sbin/ifconfig usb0 192.168.0.200 netmask 255.255.255.0
/sbin/route add -host 192.168.0.202/32 dev usb0

One extra step I had to do was configure my firewall to allow connections to/from usb0. I’m running Ubuntu hardy 8.04 and using Firestarter. Open up Firestarter:

  • Preferences -> Firewall -> Network Settings
  • Set ‘Local network connected device’ to: Unknown device (usb0)
  • Check ‘Enable internet connection sharing’
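For the curious, Firestarter’s internet connection sharing boils down to IP forwarding plus NAT, which you could also set up by hand with iptables. A rough sketch, assuming your desktop reaches the internet via eth0 (the interface names here are assumptions; adjust for your machine, and run as root):

```shell
#!/bin/bash
# Share the desktop's internet connection (eth0, an assumption) with the
# Neo on usb0: enable kernel IP forwarding, then masquerade and forward.
echo 1 > /proc/sys/net/ipv4/ip_forward
/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
/sbin/iptables -A FORWARD -i usb0 -o eth0 -j ACCEPT
/sbin/iptables -A FORWARD -i eth0 -o usb0 -m state --state ESTABLISHED,RELATED -j ACCEPT
```

If you go this route you’d also point the Neo’s default gateway and DNS at the desktop, which Firestarter otherwise handles for you.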

Then I verified the usb0 network connection:

$ ping -I usb0 192.168.0.202
$ ssh root@192.168.0.202

Once connected to the Freerunner, the next step was to get the date to display on the home screen. To do this I just followed the instructions on the wiki for customizing the today page (run these commands on the Neo):

# dbus-launch gconftool-2 -t boolean -s /desktop/poky/interface/reduced false
# /etc/init.d/xserver-nodm restart

Here’s a screenshot of the home screen:

Upgrade Software

Once I was able to successfully ssh into the Neo and verified that I could also connect to the internet from the Neo, I wanted to upgrade to the latest software release. To do this you use opkg (a package management system based on ipkg). The first time you upgrade from the software release shipped with the Neo you have to first upgrade dropbear (the ssh server) from the terminal on the Neo; then you can ssh back into the Neo and upgrade the rest of the software:

On the Neo, open the Terminal and run:

# opkg update
# opkg install dropbear

Then ssh back into the Neo and run:

# opkg upgrade

At this point I rebooted and inserted my T-Mobile sim card and microSD card. Once back at the home screen it showed I was registered to the T-Mobile network and I opened up the dialer app and placed my first call!

Set up Timezone and correct date/time

To fix the timezone run this from the Neo:

# opkg install tzdata tzdata-americas
# ln -sf /usr/share/zoneinfo/America/New_York /etc/localtime
# /etc/init.d/xserver-nodm restart

To set the correct time using ntp run:

# opkg install ntpclient
# ntpclient -s -h pool.ntp.org
# hwclock --systohc

WLAN

Next up was connecting the Neo to my wireless LAN. The wireless interface on the Neo is eth0. First you have to make sure the WLAN device is turned on, which it seemed to be by default on first boot. You can check this by holding down the power button for a few seconds, which should pop up a menu showing the state of the various devices. Here’s the script I use to connect the Neo to my WLAN:

#!/bin/sh

/sbin/ifconfig eth0 down
/sbin/ifconfig eth0 up
/sbin/iwconfig eth0 key restricted 'xxxxx'
/sbin/iwconfig eth0 essid 'xxxx'
/sbin/udhcpc eth0

GPS

tangoGPS rocks. This app is amazing and it worked right out of the box. I followed the directions on the wiki to get it up and running. There was an issue getting a fix with the SD card installed, but by the time I tried it out there was already a kernel update which fixed the issue. I had no problem getting a fix, and my TTFF (time to first fix) was 35s with the SD card in. Here’s some screenshots of tangoGPS in action:

I also installed and ran AGPS Test which is a program for testing out GPS on the Neo. It shows some nice graphs of the various satellites you’re currently connected to and their signal strengths:

Bugs/Issues

Overall I was impressed by how much I was able to get working the first time around; however, there are definitely a few issues I came across. The most concerning was the GSM buzzing during phone calls. On the Neo side everything sounds fine, but the person on the other end hears a very loud buzzing noise. Here’s the latest update from the hardware list regarding the issue. I tried tweaking the various alsa settings in /usr/share/openmoko/scenarios/gsmhandset.state with some luck, but still wasn’t able to find the right balance to completely eliminate the buzzing. I’m still trying to wrap my head around which alsa settings do what, but I found playing with alsamixer during a live call to be helpful. The basic procedure goes something like this:

  1. ssh to FreeRunner
  2. Make a phone call
  3. While call is in progress run alsamixer
  4. Tweak settings to minimize buzzing/echo
  5. While call is still in progress run: $ alsactl store -f gsmhandset-test1.txt

Now you can diff this new file against the original (/usr/share/openmoko/scenarios/gsmhandset.state) and see which settings were changed. This is really the only thing holding me back from using the Neo as my primary phone, so I look forward to a possible fix.
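The diff step is a one-liner (filenames as in the procedure above):

```shell
#!/bin/sh
# Show exactly which mixer controls were changed during the live call by
# comparing the stock scenario file against the state saved with alsactl.
diff -u /usr/share/openmoko/scenarios/gsmhandset.state gsmhandset-test1.txt
```

Lines prefixed with + are the values you changed in alsamixer; those are the candidates to fold back into gsmhandset.state.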

I found using the Terminal on the Neo rather clunky due to the lack of characters available on the keyboard. For example, there’s no <TAB> or ‘/’. I’m sure there are ways to customize the keyboard. It looks like only vi is available by default on the Neo, so I plan on seeing if I can find a vim package (.ipk) or figuring out how to compile vim for the Neo.

Written by Andrew

2008/08/13 at 19:42

Posted in FreeRunner

