A cursory look at the URL naming for this book showed that the files are named a-n (for some reason n is the first chapter, and the remaining chapters run a-m). So the easiest way to do it without any third-party tool was to use curl to save these documents locally:
$ curl "http://www.physics.metu.edu.tr/~fizikt/html/hawking/[a-n].html" -o "hawking_#1.html"
This fetches all the documents named a-n and saves them locally as hawking_[a-n].html (the #1 in the output name is replaced by whatever the first [a-n] glob matched).
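The [a-n] expansion happens inside curl itself, not the shell, but the output names it produces can be previewed with a plain loop (no network access needed; this just echoes the filenames the -o template generates):

```shell
# Preview the local filenames curl's "hawking_#1.html" template will
# produce for the [a-n] range.
for ch in a b c d e f g h i j k l m n; do
  echo "hawking_${ch}.html"
done > names.txt
cat names.txt
```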
Step 1 was quite easy. The chapters also have images embedded, so we need to extract the image sources and save those locally as well. Another look at the source HTML showed that all images are declared as <img src="..." >, so I used grep to extract all the img tags from the pages:
grep -o '<img src="[a-z0-9A-Z\.]*' hawking_*.html >> img.txt
So I extracted all the img tags and saved them to an output file. (Different story that the grep command took me nearly 3 hrs! I kept forgetting to add the -o switch, without which grep spits out the whole matching line of HTML, making me think my regex was wrong.) Once the grep was sorted out, the next step was to extract just the image names from the img src tags grep found; that was quite easy using the cut command:
cut -d\" -f2 img.txt >> wget.txt
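The whole grep-then-cut pipeline can be exercised on a throwaway sample page (the filename and image names below are made up for illustration; the grep pattern and cut invocation are the same as above):

```shell
# Build a one-line sample page with two embedded images.
cat > hawking_sample.html <<'EOF'
<p>some text</p><img src="fig01.gif" alt="x"><img src="fig02.gif">
EOF

# -o prints each match on its own line; the pattern stops at the
# closing quote because " is not in the bracket expression.
grep -o '<img src="[a-z0-9A-Z\.]*' hawking_sample.html > img.txt

# Field 2, split on ", is just the image filename.
cut -d\" -f2 img.txt > wget.txt
cat wget.txt
```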
Once we have all the img src paths, a final wget fetches them:
wget -i wget.txt -B http://www.physics.metu.edu.tr/~fizikt/html/hawking/
(-B specifies the base URL against which the relative names in wget.txt are resolved)
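What -B does for us is essentially this resolution step: each relative name in the list gets the base URL prepended before fetching. A sketch of that, re-creating a sample wget.txt (the image names are hypothetical):

```shell
# Resolve each relative name in wget.txt against the base URL --
# roughly what wget's -B option does before downloading.
base="http://www.physics.metu.edu.tr/~fizikt/html/hawking/"
printf '%s\n' fig01.gif fig02.gif > wget.txt   # sample entries

while read -r name; do
  echo "${base}${name}"
done < wget.txt > urls.txt
cat urls.txt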
So far so good: we have a local copy of the book with its images (different story that doing it manually would perhaps have taken less time).
Now we need to merge these HTML files and convert them to a Word document. Here's a slightly tweaked version of the OpenOffice.org macro I used:
Sub MergeAndSave(baseUrl)
    ' URLs are n (first chapter), then a-m
    Dim i
    Dim ch
    Dim cUrl
    Dim cFile
    Dim docFile
    Dim oDoc
    docFile = "/home/sachin/Hawking/hawking.doc"

    ' Load chapter "n" via the HTML import filter and save it as the
    ' initial Word document
    ch = Chr(110)  ' "n"
    cFile = baseUrl + ch + ".html"
    cUrl = ConvertToURL( cFile )
    oDoc = StarDesktop.loadComponentFromURL( cUrl, "_blank", 0, _
        Array( MakePropertyValue( "FilterName", "HTML (StarWriter)" ), _
               MakePropertyValue( "Hidden", False ) ) )
    cUrl = ConvertToURL( docFile )
    oDoc.storeToURL( cUrl, _
        Array( MakePropertyValue( "FilterName", "MS WinWord 6.0" ) ) )
    oDoc.close( True )

    ' Reopen the Word document and append chapters a-m (Chr 97-109),
    ' each starting on a new page
    Dim oFinalDoc, oCursor, oText
    oFinalDoc = StarDesktop.loadComponentFromURL( cUrl, "_blank", 0, Array() )
    oText = oFinalDoc.getText()
    oCursor = oText.createTextCursor()
    For i = 97 To 109
        ch = Chr(i)
        cFile = baseUrl + ch + ".html"
        cUrl = ConvertToURL( cFile )
        oCursor.gotoEnd( False )
        oCursor.BreakType = com.sun.star.style.BreakType.PAGE_BEFORE
        oCursor.insertDocumentFromURL( cUrl, Array() )
    Next
    'oFinalDoc.close(True)
End Sub

' Standard OOo Basic helper used above to build PropertyValue structs
Function MakePropertyValue( sName, sValue ) As com.sun.star.beans.PropertyValue
    Dim oProp As New com.sun.star.beans.PropertyValue
    oProp.Name = sName
    oProp.Value = sValue
    MakePropertyValue = oProp
End Function
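If a single merged HTML file would do (skipping Word entirely), the macro's chapter ordering (n first, then a-m) can also be replayed with plain concatenation in the shell. A sketch, with made-up chapter contents standing in for the real files; unlike the macro, this does only the concatenation, not the HTML-to-Word conversion:

```shell
# Sample chapter files (contents are hypothetical stand-ins).
printf '<h1>Intro</h1>\n'     > hawking_n.html
printf '<h1>Chapter 1</h1>\n' > hawking_a.html
printf '<h1>Chapter 2</h1>\n' > hawking_b.html

# Concatenate in reading order: n first, then a-m, skipping any
# chapter file that wasn't downloaded.
: > hawking_all.html
for ch in n a b c d e f g h i j k l m; do
  if [ -f "hawking_${ch}.html" ]; then
    cat "hawking_${ch}.html" >> hawking_all.html
  fi
done
```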
Unfortunately, I only have MS Word 2007 Beta installed on my Windows box, so I couldn't convert the doc to Reader format, but I know it works as I'd converted a Word doc earlier. Oh, by the way, the same process could have been accomplished by ditching everything above and passing the URL of the HTML directly to the OO macro, or, if you really really wanted to use some Linux commands:
wget allows you to fetch multiple pages by following links (wget -r), and its -p switch is meant to pull in page requisites such as inline images (though I haven't tested whether that would have covered the img tags here).
Moral of the story: at times we learn a few things we didn't need to learn right then, but maybe we'll find them useful sometime down the line.