Friday, October 06, 2006

cut, curl and A Brief History of Time...

Recently I stumbled upon an online version of "A Brief History of Time" by Stephen Hawking. I read this book a long time back, perhaps too soon for my age, so I decided I'd give it another shot. The only issue is that I prefer paper books over online versions, and even when I do have to read an ebook, I generally like reading it in MS Reader (it's got a few nice features like annotating, bookmarking, and drawing, among others). I know there's a plugin for MS Word which allows converting any Word document to MS Reader format, so if only I could get these HTMLs into a doc file, I'd just have to run that converter. The easiest approach is to open all the HTMLs in a browser, select all the text and paste it into a Word document, but that didn't seem at all challenging, so I thought of automating the entire process....

A cursory look at the URL naming for this book showed that the files are named a-n (for some reason n is the first chapter, then the remaining chapters are a-m). So the easiest way to do it without any third-party tool was to use curl to save these documents locally:

$ curl "<base-url>[a-n].html" -o "hawking_#1.html"

(Here <base-url> stands for the book's URL prefix, which I've left out.) This fetches all the documents named a-n and saves them locally as hawking_[a-n].html: curl expands the [a-n] range itself and substitutes each letter for the #1 in the output name.
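If you want to see the range globbing in action without hitting the real site, file:// URLs work just as well (the paths below are made up for the demo; curl's [a-c] range and the #1 substitution behave the same for any protocol):

```shell
# Create three dummy "chapter" files, then fetch them with curl's range globbing.
mkdir -p /tmp/hawking-demo && cd /tmp/hawking-demo
for c in a b c; do echo "chapter $c" > "$c.html"; done

# [a-c] expands to three URLs; #1 in -o is replaced by the matched letter.
curl -s "file://$PWD/[a-c].html" -o "hawking_#1.html"
ls hawking_*.html
```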

Step 1 was quite easy. Now, the chapters also have images embedded, so we need to extract the image sources and save those locally as well. Another look at the source HTML showed that all images are declared as <img src="..." >, so using grep I extracted all the img tags from the HTML pages:

grep -o '<img src="[a-z0-9A-Z\.]*' hawking_*.html >> img.txt

So I extracted all the img tags and saved them to an output file. (Different story that the grep command took me nearly 3 hrs! I kept forgetting to add the -o switch, without which grep spits out the entire matching line of HTML, making me think my regex was wrong.) Anyway, once the grep was sorted out, the next task was to extract just the image names from the img src tags grep found, which was quite easy using the cut command:

cut -d\" -f2 img.txt >>wget.txt
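For what it's worth, the two steps can be collapsed into a single pipeline. Here's a quick check against a made-up snippet of HTML (the file and image names are invented for the demo; the grep pattern and cut invocation are the same as above):

```shell
# A fake chapter page with two embedded images, standing in for hawking_*.html
cat > /tmp/chapter.html <<'EOF'
<p>Some text</p><img src="fig1.gif" > more text
<img src="fig2.jpg" ><p>end</p>
EOF

# grep -o keeps only the matched part of each line;
# cut with " as the delimiter picks out the filename between the quotes.
grep -o '<img src="[a-z0-9A-Z\.]*' /tmp/chapter.html | cut -d\" -f2 > /tmp/wget.txt
cat /tmp/wget.txt
```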

Once we have all the img src paths, we just need a final wget to fetch them:

wget -B <base-url> -i wget.txt

(-B takes the base URL as its argument and prepends it to the relative image names read from wget.txt; again, <base-url> is the book's URL prefix.)

So far so good: we now have a local copy of the book, images included (different story that doing it manually would perhaps have taken less time).

Now we need to merge these HTMLs and convert them to a Word document. Here's a slightly tweaked version of an OpenOffice Basic macro that I used:

Sub MergeAndSave(baseUrl)
    ' chapter files are named a-n; n is the first chapter, then a-m follow
    Dim i, ch, cFile, cUrl, docFile
    Dim oDoc, oFinalDoc, oCursor, oText

    docFile = "/home/sachin/Hawking/hawking.doc"

    ' load chapter "n" first and save it as the base Word document
    ch = Chr(110) ' "n"
    cFile = baseUrl + ch + ".html"
    cUrl = ConvertToURL( cFile )
    oDoc = StarDesktop.loadComponentFromURL( cUrl, "_blank", 0, _
        Array( MakePropertyValue( "FilterName", "HTML (StarWriter)" ), _
               MakePropertyValue( "Hidden", False ) ) )

    cUrl = ConvertToURL( docFile )
    oDoc.storeToURL( cUrl, _
        Array( MakePropertyValue( "FilterName", "MS WinWord 6.0" ) ) )
    oDoc.close( True )

    ' reopen the saved doc and append chapters a-m at its end
    oFinalDoc = StarDesktop.loadComponentFromURL( cUrl, "_blank", 0, Array() )
    oText = oFinalDoc.getText()
    oCursor = oText.createTextCursor()
    For i = 97 To 109 ' Chr(97) = "a" .. Chr(109) = "m"
        ch = Chr(i)
        cFile = baseUrl + ch + ".html"
        cUrl = ConvertToURL( cFile )
        oCursor.gotoEnd( False )
        oCursor.BreakType = com.sun.star.style.BreakType.PAGE_BEFORE
        oCursor.insertDocumentFromUrl( cUrl, Array() )
    Next i
    oFinalDoc.store()
End Sub

' standard helper to build a PropertyValue (referenced above)
Function MakePropertyValue( sName As String, uValue ) As com.sun.star.beans.PropertyValue
    Dim oProp As New com.sun.star.beans.PropertyValue
    oProp.Name = sName
    oProp.Value = uValue
    MakePropertyValue = oProp
End Function

Unfortunately, I only have MS Word 2007 Beta installed on my Windows box, so I couldn't convert the doc to Reader format, but I know it works as I'd converted a Word doc earlier. Oh, by the way, the same thing could have been accomplished by ditching everything else and passing the URLs of the HTMLs directly to the OO macro. Or, if you really really wanted to use some Linux commands, wget can fetch multiple pages by following links (though I'm not too sure whether plain recursive fetching would pick up the img tags; wget's -p/--page-requisites option is meant for exactly that).
Moral of the story: at times we learn a few things we didn't need to learn then, but maybe we'll find them useful sometime down the line.
