2017-01-09

I’m starting a side project to get a better understanding of full-text search using Elasticsearch and Solr, and I needed some input data. I have kept a daily journal in Google Docs for the past six years, so I thought I would start with that. I kept it there for Google’s excellent search, but I’m also uneasy about leaving it all in Google’s care: there is a remote possibility that they will sunset Google Docs.

When I exported my Journal folder, I found that the default format was Microsoft Word’s DOCX. I really don’t need that level of abstraction. I have since switched to the AsciiDoc format for my journal, so I went looking for a tool that could convert over a thousand files. I found pandoc. It’s a Swiss Army knife of document format conversion. A quick read of the Getting Started guide, and off to the shell.

$ find . -type f -iname "*.docx" -exec sh -c 'out="./output/${1%.*}.adoc"; mkdir -p "$(dirname "$out")"; pandoc "$1" -t asciidoc -o "$out"' sh {} \;
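
For anyone who wants to rerun a conversion like this, here is a minimal sketch of the same idea as a small script rather than a one-liner. It is not what I ran at the time. The function names adoc_path and convert_all and the PANDOC variable are made up for illustration; the only pandoc options used are -t and -o, as in the command above. It assumes the source directory is given without a trailing slash and that file names contain no newlines.

```sh
#!/bin/sh
# Sketch: convert every .docx under a source directory to AsciiDoc,
# mirroring the directory layout under a separate output directory.
# adoc_path, convert_all and PANDOC are hypothetical names for illustration.

# Allow the pandoc binary to be overridden; defaults to "pandoc" on PATH.
PANDOC="${PANDOC:-pandoc}"

# Map an input path to its output path: strip the source-directory prefix,
# drop a .docx extension in any letter case, and prepend the output directory.
adoc_path() {
    src_dir=$1 out_dir=$2 file=$3
    rel=${file#"$src_dir"/}
    case $rel in
        *.[dD][oO][cC][xX]) rel=${rel%.*} ;;
    esac
    printf '%s/%s.adoc\n' "$out_dir" "$rel"
}

# Walk the source tree and convert each file, creating output
# subdirectories as needed so pandoc never writes into a missing folder.
convert_all() {
    src_dir=$1 out_dir=$2
    find "$src_dir" -type f -iname '*.docx' | while IFS= read -r file; do
        out=$(adoc_path "$src_dir" "$out_dir" "$file")
        mkdir -p "$(dirname "$out")"
        "$PANDOC" "$file" -t asciidoc -o "$out"
    done
}
```

Saved to a file and sourced, it would be called as: convert_all . ./output. Keeping the path mapping in its own function makes it easy to check where a file will land before running pandoc over a thousand documents.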

Thank you, pandoc. Now on to Elasticsearch and Solr.