Once upon a time...: October 2006

At times we are confronted with choices, and we pick the one that is better aligned with where we want to go and with our priorities. So which one do you pick between D and E? What are D and E, first of all? They are two things that don't matter to us anyway, but we still have to make a choice, so it's like tossing a coin, or coming up with some crazy rationale so that the choice we make in the end makes some sense. I think the most important factor in life is motivation. As long as we are motivated we will always want to grow, to explore the world around us, to learn something new. Unfortunately, that's the one thing I've been lacking for some time now. Maybe D and E were easy enough to choose between, but since I just don't have any motivation left, I don't care which one I pick.

Recently I stumbled upon an online version of "A Brief History of Time" by Stephen Hawking. I read this book a long time back, perhaps too soon for my age, so I decided to give it another shot. The only issue is that I prefer paper books over online versions, and even if I have to read an ebook, I generally like reading it in MS Reader (it's got a few nice features like annotating, bookmarking, and drawing, among others). I know there's a plugin for MS Word that converts any Word document to MS Reader format, so if only I could get these HTML pages into a doc file, I would just have to run that converter. The easiest approach is to open all the HTML pages in a browser, select all the text, and paste it into a Word document, but that doesn't appear to be any challenge, so I thought of automating the entire process....

A cursory look at the URL naming for this book showed that the files are named a through n (for some reason n is the first chapter, and the remaining chapters are a-m). So the easiest way to do it without any third-party tool was to use curl to save these documents locally.


curl "http://www.physics.metu.edu.tr/~fizikt/html/hawking/[a-n].html" -o "hawking_#1.html"

This fetches all the documents named a through n and saves them locally as hawking_a.html through hawking_n.html.
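
To see what that range covers before hitting the server, here is a small dry-run sketch in plain shell. It only prints URL and file name pairs and downloads nothing. The helper name list_chapters is made up for this illustration and is not something curl provides.

```shell
#!/bin/sh
# Dry run: print each URL the [a-n] range covers and the local
# file name that "hawking_#1.html" maps it to. Nothing is downloaded.
# list_chapters is a made-up helper name, not part of curl.
base="http://www.physics.metu.edu.tr/~fizikt/html/hawking"

list_chapters() {
  for ch in a b c d e f g h i j k l m n; do
    printf '%s/%s.html -> hawking_%s.html\n' "$base" "$ch" "$ch"
  done
}

list_chapters
```

The files come down in alphabetical order, so hawking_n.html (the first chapter) arrives last.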

Step 1 was quite easy. The chapters also have embedded images, though, so we need to extract the image sources and save those locally as well. Another look at the source HTML showed that all images are declared as <img src="..." >, so I used grep to extract all the img tags from the HTML pages:

grep -o '<img src="[a-z0-9A-Z\.]*' hawking_*.html > img.txt

So I extracted all the img tags and saved them to an output file. Never mind that the grep command took me nearly 3 hours! I kept forgetting to add the -o switch, without which grep spits out every matching line in full (practically the entire HTML content), making me think that my regex was perhaps wrong. Anyway, once the grep was sorted out, the next job was to pull just the image names out of the img src matches that grep found, which is quite easy with the cut command:


cut -d'"' -f2 img.txt > wget.txt
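
To make the grep and cut steps concrete, here is a minimal sketch that runs the same pipeline on a throwaway sample page instead of the real chapters. The sample HTML, the image names fig1.gif and fig2.jpg, and the helper name extract_img_names are all made up for illustration.

```shell
#!/bin/sh
# extract_img_names is a made-up helper wrapping the grep + cut steps.
# $@: HTML files to scan.
extract_img_names() {
  # -o prints only the matched part of each line, e.g. <img src="fig1.gif
  # cut then splits on the double quote and keeps field 2, the file name.
  grep -o '<img src="[a-z0-9A-Z.]*' "$@" | cut -d'"' -f2
}

# Hypothetical sample page standing in for one downloaded chapter.
tmpdir=$(mktemp -d)
cat > "$tmpdir/hawking_a.html" <<'EOF'
<p>Some text <img src="fig1.gif" alt="one"> more text
<img src="fig2.jpg"></p>
EOF

extract_img_names "$tmpdir/hawking_a.html"
rm -rf "$tmpdir"
```

When several files are passed, grep prefixes each match with the file name and a colon, but the prefix lands in the first field, so cut still returns just the image name.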

Once we have all the img src paths, we just need a final wget to fetch them:


wget -i wget.txt -B http://www.physics.metu.edu.tr/~fizikt/html/hawking/

(-B specifies the base URL that the relative paths in wget.txt are resolved against)
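
For a quick sanity check before downloading, the sketch below prints the absolute URLs that a list of relative names would turn into. It assumes every line in the list is a bare file name (no ../ and no leading /), in which case resolving against the base is the same as prefixing it. resolve_urls and the sample image names are made up for illustration.

```shell
#!/bin/sh
# Dry run: print absolute URLs for the relative names in a list file.
# Assumption: each line is a bare file name, so resolving against the
# base URL amounts to prefixing it. resolve_urls is a made-up helper.
resolve_urls() {
  base=$1
  list=$2
  # sort -u drops names that were extracted more than once.
  sort -u "$list" | sed "s|^|$base|"
}

tmpfile=$(mktemp)
printf 'fig2.jpg\nfig1.gif\nfig1.gif\n' > "$tmpfile"   # hypothetical names
resolve_urls "http://www.physics.metu.edu.tr/~fizikt/html/hawking/" "$tmpfile"
rm -f "$tmpfile"
```

Deduplicating first is worth it, since the same image can be referenced from more than one page.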

So far so good: we have saved a local copy of the book with its images (never mind that doing it manually would perhaps have taken less time).

Now we need to merge these HTML files and convert them to a Word document. Here's a slightly tweaked version of the OpenOffice (OO) macro that I used:


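' MakePropertyValue is not built in; the macro assumes a helper like this one is defined.
Function MakePropertyValue(cName As String, uValue) As com.sun.star.beans.PropertyValue
    Dim oProp As New com.sun.star.beans.PropertyValue
    oProp.Name = cName
    oProp.Value = uValue
    MakePropertyValue = oProp
End Function

Sub MergeAndSave(baseUrl)
    ' Chapter files are named a-n; n is the first chapter, a-m follow.
    Dim i, ch, cFile, cUrl, docFile
    Dim oDoc, oFinalDoc, oCursor, oText

    docFile = "/home/sachin/Hawking/hawking.doc"

    ' Load the first chapter (n) through the HTML import filter.
    ch = Chr(110)
    cFile = baseUrl + ch + ".html"
    cUrl = ConvertToURL(cFile)
    oDoc = StarDesktop.loadComponentFromURL(cUrl, "_blank", 0, _
        Array(MakePropertyValue("FilterName", "HTML (StarWriter)"), _
              MakePropertyValue("Hidden", False)))

    ' Save it as a Word document, then close the HTML copy.
    cUrl = ConvertToURL(docFile)
    oDoc.storeToURL(cUrl, Array(MakePropertyValue("FilterName", "MS WinWord 6.0")))
    oDoc.close(True)

    ' Reopen the Word document and append chapters a-m, each starting on a new page.
    oFinalDoc = StarDesktop.loadComponentFromURL(cUrl, "_blank", 0, Array())
    oText = oFinalDoc.getText()
    oCursor = oText.createTextCursor()
    For i = 97 To 109
        ch = Chr(i)
        cFile = baseUrl + ch + ".html"
        cUrl = ConvertToURL(cFile)
        oCursor.gotoEnd(False)
        oCursor.BreakType = com.sun.star.style.BreakType.PAGE_BEFORE
        oCursor.insertDocumentFromURL(cUrl, Array())
    Next

    ' The merged document is left open for review.
    'oFinalDoc.close(True)
End Sub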

Unfortunately, I only have the MS Word 2007 Beta installed on my Windows box, so I couldn't convert the doc to Reader format, but I know it works as I'd converted a Word doc earlier. Oh, by the way, the same result could have been had by ditching everything else and passing the URLs of the HTML pages directly to the OO macro. Or, if you really, really wanted to use some Linux command, wget can fetch multiple pages by following links (though I'm not too sure whether that would work for img tags).
Moral of the story: at times we learn a few things that we didn't need to learn just then, but maybe we'll find them useful sometime down the line.

Thursday, October 19, 2006

D or E, take your pick

Friday, October 06, 2006

cut, curl and A Brief History of Time...