Grab files from web servers with wget

wget and curl are two fantastic utilities for grabbing content from the Interwebs. In their simplest forms, both let you download publicly available files from web sites. In reality, each of them is a very complex beast that you would have to learn and practice for a very long time to claim any level of mastery. The point being, this post is not here to teach either wget or curl and all their intricacies. There are lots of blog posts and web sites that teach you how to master both, so we will leave the teaching to them.

What I want to demonstrate here is an interesting problem I faced today. I use wget fairly regularly to download content from the web that I like to read later. As I mentioned earlier, all the content under consideration here is publicly available and publicly accessible files on various web sites. If you are a point-n-click fan, you can right-click on a particular link and choose to save it. That is too much trouble for me. I like to “automate” things. 🙂 So, I just make a note of the complete URL of the file and save it. Then, I add wget in front of the complete URLs of all the files I want to save for later reading, and save all this as a simple shell script. Finally, just before going to bed, I run that download script, which is nothing more than wget followed by the complete URL of each file I want to download. Within a few moments all the files that I need are downloaded, without me having to right-click and choose save … a million times.
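Here is a minimal sketch of that workflow, assuming the noted URLs are kept one per line in a text file. The names make_download_script, urls.txt and download.sh are made up for illustration; this is just one way to build the script, not the only one.

```shell
#!/bin/sh
# make_download_script: turn a list of URLs (one per line) into a shell
# script that has one wget command per URL.
# Assumption: no URL contains a single quote, since each URL is wrapped
# in single quotes so that characters like & and ? are left alone.
make_download_script() {
    url_list=$1      # hypothetical input file, e.g. urls.txt
    out_script=$2    # hypothetical output file, e.g. download.sh
    {
        echo '#!/bin/sh'
        # Drop blank lines, then put "wget" in front of every URL.
        sed -e '/^[[:space:]]*$/d' -e "s/.*/wget '&'/" "$url_list"
    } > "$out_script"
    chmod +x "$out_script"
}
```

Calling make_download_script urls.txt download.sh writes the script; running ./download.sh later does the actual downloading.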

Now, is any of that innovative? Absolutely not! So, what problem did I face today? Well, I was trying to download a few files from a site (a stock market in India, which we will not name here) using wget, but wget failed to get them and instead gave me an “HTTP error code 403 forbidden” error message.

Under ordinary circumstances I would have given up on downloading those files, but the funny thing was that I could download those same files with Firefox “right-click and save”!! Why would that be? Every HTTP client announces itself to the server in a User-Agent header, and by default wget announces itself as Wget. It seems that stupid, crazy engineers at some companies (like this stock market in India) can deny output to clients that they do not know about. So, if a site has been coded to serve information to, say, Firefox and IE, then any other user-agent (browser, bot, etc.) could be denied access to the data. Or at least that’s how I understand it. Why would they do that? Who knows! Like I said, my guess is they are stupid and crazy. But, that being so, what’s an ordinary human to do? Thankfully, I had dealt with this problem earlier, so I ran wget but gave it a user-agent string of Firefox. And within moments, wget was downloading files like there was no tomorrow. So, the command is:

$ wget -U Firefox "URL of file to download"

The “-U Firefox” option makes wget send “Firefox” as its User-Agent string, so it announces itself as the “Firefox browser”. That may seem like cheating, but it’s not. It is the site that is incorrectly coded.
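To avoid typing the option for every file, the same trick can be wrapped in a small helper. This is only a sketch: the function name fetch_as_browser and the UA variable are my own inventions, and the short string “Firefox” is simply what worked for the site above. Other sites may expect a fuller browser string, which is why UA can be overridden.

```shell
#!/bin/sh
# UA is a hypothetical variable: the User-Agent string to send.
# It defaults to the short string used above; set UA in the
# environment to send something else.
UA=${UA:-Firefox}

# fetch_as_browser (hypothetical helper): download every URL given as
# an argument, announcing wget with the chosen User-Agent string.
fetch_as_browser() {
    for url in "$@"; do
        # --user-agent is the long form of -U
        wget --user-agent="$UA" "$url" || echo "failed: $url" >&2
    done
}
```

Usage would look like: fetch_as_browser "URL of first file" "URL of second file". A failed download is reported on stderr and the loop moves on to the next URL.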

Bottom line: wget is one fantastic utility.

