Downloading a list of URLs
Say you’ve got a list of URLs - a long list of URLs - each of which points to a file. Perhaps they’re a set of logs, or backups, or something similar. The file looks like this:
http://www.somedomain.com/my/file/number-one.txt
http://www.somedomain.com/my/file/number-two.txt
http://www.somedomain.com/my/file/number-three.txt
...
http://www.somedomain.com/my/file/number-five-hundred-and-x-ity-x.txt
Now what we don’t want to do is copy and paste each of those file names into a browser to download the file. That would suck. What would be ideal is to drop the file into a magic box and have that magic box work through the list, downloading the files until they’re all done.
Happily every *nix command line comes with its very own tooling to build a magic box like this.
wget
My first instinct would be to use wget, which is certainly the friendliest way I’ve seen to download files on the command line. A quick read of the manual with man wget shows the following:
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file,
URLs are read from the standard input. (Use ./- to read from a file
literally named -.)
So the job is incredibly simple - we just type:
$ wget -i file-with-list-of-urls.txt
and we let wget do its magic.
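The man page snippet above also mentions that a lone - reads the list from standard input, so if the URLs are coming out of a pipe rather than sitting in a file, something like this should work too:
$ cat file-with-list-of-urls.txt | wget -i -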
curl and xargs
That was too easy - I love wget and usually wind up installing it on any system I use for longer than 30 seconds. But sometimes it’s unavailable - maybe there’s no package manager, or you have no rights to install packages because you’re remoting in to a tiny machine running a very skinny Linux distro. In these cases we’re going to have to rely on wget’s older, less forgiving but far more flexible sibling, curl.
The quick and generic curl command to download a URL is:
$ curl http://www.file.com/file.txt -LO
curl has a wealth of uses and options - we’re barely scraping the surface with what we’re doing here. Take a look at the full man page and you’ll see what I mean.

But for this command: the -L flag tells curl to follow redirects - if it wasn’t there we’d get the 30x response saved rather than the file at the location we’re being redirected to. The -O flag means that curl uses the name of the remote file to name the file it’s saved as, saving us the bother of naming the output.
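If you’re curious whether a particular URL really does redirect, you can ask curl for just the response headers (assuming the server answers HEAD requests) and look at the status line:
$ curl -sI http://www.file.com/file.txt | head -n 1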
In order to pass each of the URLs into curl one after another we get to use xargs, which is a wonderful piece of witchcraft you can use to pass lines from STDIN in as arguments to another command.
The full command looks like this:
$ cat file-with-list-of-urls.txt | xargs -n 1 curl -LO
cat we should be comfortable with: it writes the contents of the file out to STDOUT, and the pipe sends that straight into the STDIN of xargs.
-n 1 tells xargs that it should be expecting one and only one argument for each execution from STDIN - in other words each of the URLs will be used as a single extra argument to curl. If we didn’t do this, how would xargs know how many additional arguments curl wanted? It could just use every URL as an extra argument to a single curl execution. Which would suck.
So we take in an extra argument from STDIN, here being piped in by cat, and we apply it to the end of curl -LO. xargs will now run curl for each of the URLs.
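If you want to see exactly what xargs is about to run for each URL, most xargs implementations take a -t flag that prints each constructed command before executing it - a handy sanity check before letting five hundred downloads rip:
$ cat file-with-list-of-urls.txt | xargs -n 1 -t curl -LO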
Optimization
Five hundred or so files is going to take a long time to download. Try passing -P 24 to xargs, which tells it to run the multiple curls as 24 parallel processes. That’ll whip along nicely (if your machine can take it).
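Put together with the command above, that looks something like this:
$ cat file-with-list-of-urls.txt | xargs -n 1 -P 24 curl -LO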
Another nice feature would be the ability to output to a filename that was not the same as the remote file - the path could be really annoying and long. Using xargs we’d be somewhat limited, and would have to change the input file to include not only the new file name but also an extra argument to curl, -o, which gives the output file name.
The URL file list would look like this:
http://www.somedomain.com/my/file/number-one.txt
-o
number-one.txt
http://www.somedomain.com/my/file/number-two.txt
-o
number-two.txt
and the command would be:
$ cat file-with-list-of-urls.txt | xargs -n 3 curl -L
But the same can be achieved without changing the original file list using GNU parallel, which is a distinct improvement (apart from the three extra characters).
$ cat file-with-list-of-urls.txt | parallel curl -L {} -o {/}
which passes the original URL to the {} and then removes the path from it with the {/}. There’s plenty more you can do with parallel - take a look at the tutorial.
Finally, it would be remiss of me not to mention that all the uses of cat above are entirely superfluous - the same could have been achieved with:
$ <file-with-list-of-urls.txt parallel curl -L {} -o {/}
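And as with -P and xargs, parallel will happily run several downloads at once - its -j flag sets the number of simultaneous jobs (by default it runs one per CPU core), so something like this should give the same speed-up:
$ <file-with-list-of-urls.txt parallel -j 24 curl -L {} -o {/}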
Update
And if you want to avoid reading all those logs and just get on with your life, try sending the whole process to the background and redirecting stdout and stderr to a file.
$ nohup cat filelist | xargs -n4 curl -L &>output &
nohup protects the process from being interrupted by the session closing. So it’ll keep on going even when you close your terminal or SSH connection. Don’t worry, you can still kill it if you’ve made a mistake.
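To peek at how it’s getting on, tail the output file; and if you do need to call the whole thing off, pkill can find the backgrounded xargs by its command line (adjust the pattern to match whatever you actually ran):
$ tail -f output
$ pkill -f 'xargs -n4 curl'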
And the four arguments?
http://www.somedomain.com/my/file/number-one.txt
--create-dirs
-o
a-directory/hierarchy/number-one.txt
You get curl to make you a directory structure too.
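With -n4, each invocation that xargs builds ends up looking something like this - curl follows any redirects, creates the directories and drops the file where you asked:
$ curl -L http://www.somedomain.com/my/file/number-one.txt --create-dirs -o a-directory/hierarchy/number-one.txt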