
Using csplit to chop exported annotations from a PDF into smaller pieces for use with Devonthink
27 September 2012

I read a lot of PDFs, usually in either PDF Expert or Skim. Both programs let me export the highlighted sections of a PDF, together with my comments on them, into a nice plain text file.

However, I usually end up with pretty large text files, which makes it hard to quickly find the juicy parts or to add (OpenMeta) tags to them.1 The bigger these files get, the worse the AI results from Devonthink become, too. Consequently, these large files must be split up into smaller chunks for better processing. For the last year I did this manually but always wished for a more automated solution. Halfway into writing a Ruby script2 for that purpose I found the Unix command csplit.

csplit -f t6-jnXXX-note-no_ -n 3 -k xyz.txt /^\*\ Hervorhebung/ {100}

This command splits the file xyz.txt into pieces. Each piece starts at a line beginning with "* Hervorhebung" (German for "highlight", the label that opens each exported highlight in my files) and runs until the next such line. The number in the curly brackets {} says how many times to repeat the split; csplit stops with an error when it reaches the end of the document before that count is used up, so a generously large count such as 100 does the job.3 The -f option sets a prefix for the output file names (in this case t6-jnXXX-note-no_; I name all my articles like this: jn001, jn002…). The -n option sets the number of digits in the numeric suffix appended to that prefix (here 3). The -k option keeps the resulting files even if the command encounters an error, which (see above) seems to be necessary for this to work. /^\*\ Hervorhebung/ is a regular expression, albeit a very simple one. It just looks for a line starting with * Hervorhebung. Feel free to replace it according to your needs.
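To try the options described above without touching real notes, here is a minimal sketch that builds a tiny annotation file in a temporary directory and splits it. The file name annotations.txt, the prefix demo-note_ and the three sample passages are made up for illustration; the regular expression is quoted instead of backslash-escaped, which should be equivalent in a typical shell.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)

# Hypothetical sample input: three highlights, one of them with a comment.
cat > "$workdir/annotations.txt" <<'EOF'
* Hervorhebung, page 1 First highlighted passage.

* Hervorhebung, page 1 Second highlighted passage.

// a comment attached to the second highlight

* Hervorhebung, page 2 Third highlighted passage.
EOF

# -s  suppress the byte counts csplit normally prints
# -k  keep the pieces even though csplit ends with an error
# -f  prefix for the output file names
# -n  number of digits in the numeric suffix
# The count {100} is larger than the number of highlights, so csplit
# is expected to fail at the end of the file; the error message is
# silenced and the non-zero exit status is ignored here.
csplit -s -k -f "$workdir/demo-note_" -n 3 \
  "$workdir/annotations.txt" '/^\* Hervorhebung/' '{100}' 2>/dev/null || true

# List the pieces with their line counts.
for piece in "$workdir"/demo-note_*; do
  printf '%s: %s lines\n' "$(basename "$piece")" "$(wc -l < "$piece" | tr -d ' ')"
done
```

Because the sample file begins with a matching line, the piece numbered 000 may well turn out empty (everything before the first match goes there), so the first highlight usually lands in the piece numbered 001.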

Remember that you can drag & drop any file from the Finder into the Terminal window, and the correct path and file name will be filled in for you.

Demo time:

This is a sample chunk from a typical annotation file.4

* Hervorhebung, page 1 The very core of this founding reform was institulionnl. A significant number of rights were transferred from collective structures to farm households, and this engaged a dynamics of extension of individual rights that is still far from being over today.

* Hervorhebung, page 1 his importance is well acknowledged by the cen- tral authorities: in January 2010, for the seventh consecutive year, Document No. I, jointly issued by the Central Com- mittee of the Communist Party of China (CCPCC) and the State Council of the National's People Congress (SCNPC), was dedicated to rural issues and land rights problems.

// wie auch die meisten anderen No1 Dokumente seit den frühen 80ern

* Hervorhebung, page 2 The CCPCC launched rural reforms in December 1978 by enhancing the I96I adjustment policies. The collective or- ganisation of agriculture was maintained, but the parallel pri- vate economy was expanded. The most detrimental aspects of central planning were also reformed. State prices were corrected in favour of agriculture, while the constraints of local autarky were relaxed. Finally, the rural People's Com- munes were reformed before being eventually dismantled in 1984.

After running csplit, this chunk:

* Hervorhebung, page 1 The very core of this founding reform was institulionnl. A significant number of rights were transferred from collective structures to farm households, and this engaged a dynamics of extension of individual rights that is still far from being over today.

would become a single file.

This would be a second file:

* Hervorhebung, page 1 his importance is well acknowledged by the cen- tral authorities: in January 2010, for the seventh consecutive year, Document No. I, jointly issued by the Central Com- mittee of the Communist Party of China (CCPCC) and the State Council of the National's People Congress (SCNPC), was dedicated to rural issues and land rights problems.

// wie auch die meisten anderen No1 Dokumente seit den frühen 80ern

And so on. I hope this helps someone as much as it helped me.

1 I have lost part of my tagging enthusiasm. With roughly a thousand tags for my thesis project alone I sometimes struggle to find the right one.

2 Yes, I’m learning Ruby now, and in a strange way Ruby’s structure and Chinese laws seem to have something in common.

3 I haven’t found a more elegant way to do this yet.

4 Notice the OCR mistakes for bonus points.