Learning the C Programming Language

Part 1: hello, world

I've started to learn C. There's a number of reasons for this. First, it was the 'proper' programming language when I was a kid. Second, I've been learning quite a bit of Go recently, and just about every other page on the excellent Go Blog has a sentence that starts with "In C…".1

Third, I've started a course on Data Structures and Algorithms on Coursera, inspired by the ever-inspirational Denise Yu. The course only accepts submissions in four languages: Python, Java, C++ and C.2 Going through that list my mind went "Basically Ruby with bleaugh indentation, bleaugh Java bleaugh, sounds really hard, sounds hard." So I went with 'sounds hard'.

I thought I'd try and capture my process of learning C as it might be useful to others in a similar position - i.e. no computer science background but know how to program in Ruby and JavaScript. I'll be approaching this in a series of posts, most of which will be following the loose structure of a presentation I gave on C.3

Background

C was invented by Dennis Ritchie at Bell Labs in the 1970s in order to write UNIX.4 He needed a language that provided sufficient abstraction to program quickly and efficiently, while at the same time being able to communicate directly with the computer's memory addresses to allow a programmer to perform low-level optimizations. It has been a remarkably popular language, being used to write other languages (Ruby is written in C, and NodeJS wouldn't work without libuv, which is written in C), and heavily influencing most modern programming languages (Java, Ruby, JavaScript, and most obviously, Go).

hello, world

Another of the ways C has influenced programming is through The C Programming Language by Ritchie and Brian Kernighan. The authors' initials gave their name to a style of formatting code, along with giving us the de facto standard for your first program: "hello, world".

1 #include <stdio.h>
2 
3 int main() {
4     printf("hello, world\n");
5 }

Line 1: Include a file called stdio.h. It includes information about the functions in the C standard library that deal with I/O - input/output. In this case printing to the terminal.

Line 3: All C programs start with a function called main - this is the function that gets executed when the program is, well, executed. The arguments (of which we're using none at all) are in the parentheses. The body of the function is in the curly braces. So far so JavaScript. For people who have never seen a typed language the int is a little surprising. All it's telling us (and C) is that the return value for this function is an integer. We'll talk about types later, but for now the int is almost working like def in Ruby or function in JavaScript - it's a keyword declaring a function.

Line 4: The meat of the program. Here we're calling a function called printf which has already been written for us as a part of the C standard library - this is why we did that #include at the beginning. We're calling it with a single argument, a string literal inside double quotes, that just says "hello, world" with a newline character (\n) at the end.

At the end of the line we put a semi-colon to tell C that the line has finished.
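As a quick aside that isn't part of the original hello, world: the f in printf stands for 'formatted'. Give it a format string containing placeholders and the values to fill them with, and it will do the substitution for you:

#include <stdio.h>

int main() {
    /* %s is filled in by a string, %d by an integer */
    printf("%s has %d characters\n", "hello, world", 12);
}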

Compiling and running

If you put all of that into a file called hello-world.c, save it, head to the terminal and type5

gcc hello-world.c

Then if you ls the same directory you should see a new file called a.out.6 If you then run this with ./a.out, you'll see hello, world. Mission accomplished.

gcc is the GNU Compiler Collection,7 which will compile your C program into an executable that your computer can run. All this means is that instead of translating each line of your program into something your computer can understand as you run it, as with something like Ruby or JavaScript, we're translating the whole program in one go before we run it.

a.out isn't that informative, so in order to get a better filename we can pass a flag to GCC:

gcc hello-world.c -o hello-world

Which outputs to the file hello-world, which we can now run with ./hello-world.

Success, we're now all C programmers!

Why main returns an int

If you've run programs on the command line before you may be aware that you get exit codes with each program that runs. You may have even been (un)lucky enough to see something along the lines of Error: non-zero exit code. On a POSIX system, a process returns a number to the process that called it to let it know how things went - this is called the exit code. 0 is the good one, every other number is some flavour of 'gone wrong'.

The default return value for main, if we don't explicitly return a value, is 0. We can change this behaviour in our hello-world by returning an explicit value using the keyword return (very Ruby, so JavaScript).

#include <stdio.h>

int main() {
    printf("hello, world\n");
    return 1;
}

(don't forget the semicolon!)

Experiment with different return values. Remember to recompile your program each time you do. You may be able to see the returned value in your terminal's prompt. Otherwise you can echo out the last command's exit status with the command echo $?.
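Putting the whole loop together at the terminal, it looks something like this (assuming the file names used above and the return 1 version of the program):

$ gcc hello-world.c -o hello-world
$ ./hello-world
hello, world
$ echo $?
1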


  1. Don't believe me? Look at this 

  2. You can now do the same course with a more diverse set of languages. 

  3. This was given at a Makers Academy alumni event. To view the speaker's notes, tap n

  4. A longer discussion of the origins of C was written by Ritchie and is available here 

  5. This assumes you have gcc installed, which is likely if you've been developing on your computer for a while. 

  6. You want to know why it's a.out? Read Ritchie's C History - link above. 

  7. Yes, it used to be called the GNU C Compiler - acronyms are so wonderfully flexible... 

Double Dash

If you're anything like me you'll find your directories liberally scattered with some pretty strange directory and file names.

$ ls -la

-rw-r--r--   1 gypsydave5  1482096370   647 11 Oct 20:15 :w
drwxr-xr-x   2 gypsydave5  1482096370    68 10 Feb 08:55 -p/
-rw-r--r--   1 gypsydave5  1482096370  2900 11 Oct 20:15 \(

Hopefully you're nothing like me and you never get this, but I'm both a sloppy and impatient typist and so I will all too often mash the keyboard in Vim and name a file after the write command, or somehow create a directory called -p because I was using the recursive flag on mkdir.1

On that subject, let's try and get rid of the -p directory:

$ rm -rf -p

rm: illegal option -- p
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file

Hmmm, that sucks. What about...

$ rm -rf "-p"

rm: illegal option -- p
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file

Boo. Happily there's a *nix convention to help with these situations: --. Double-dash tells the command you're running that everything that comes after it is not to be treated as a command option, but is instead a filename. So:

$ rm -rf -- -p
$ ls -la

-rw-r--r--   1 gypsydave5  1482096370   647 11 Oct 20:15 :w
-rw-r--r--   1 gypsydave5  1482096370  2900 11 Oct 20:15 (

DISCO!

This behaviour is implemented in most of the command line tools you'll use on a *nix system - it's useful to know.
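As a rough illustration (these particular commands are just examples, not something from the original post), the same trick works with any tool that's being confused by a file name that looks like an option:

$ touch -- --not-a-flag     # create a file whose name starts with dashes
$ cat -- --not-a-flag       # read it back
$ rm -- --not-a-flag        # and tidy up afterwards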


  1. I'm still not sure how I managed this. But I'm staring at the evidence now, so I know it must've happened. 

(Even More) Memoization in JavaScript

A while ago I wrote a piece on basic lazy evaluation and memoization in JavaScript. I'd like to continue some of the thoughts there on memoization, and to look at how some popular JS libraries handle it.

Memoization refresher

According to Wikipedia, memoization is:

an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.

We want the program to remember the results of its operation, so it doesn't have to expend any extra time and effort in doing them all over again. It all seems a little obvious really.

So the first time we saw memoization it was in the context of lazy evaluation, where the memoized function had already been prepared to be lazy. The lazily evaluated function was executed within our memoizer, and the result was retained within the scope of the memoized function, to be returned each subsequent time the function was called. As a refresher:

function lazyEvalMemo (fn) {
  const args = arguments;                   // fn plus any arguments to pre-apply
  let result;
  const lazyEval = fn.bind.apply(fn, args); // fn bound to those arguments, not yet called
  return function () {
    if (result) {                           // NB: a falsy result will never be cached
      return result;
    }
    result = lazyEval();
    return result;
  };
}

Which is great if we want the function to be memoized with a specific set of arguments. But that isn't going to work all of the time - probably not even most of the time. It's rare that there's just one set of parameters we want to be eternally bound to our function. A bit of freedom would be nice.

Keeping track of your arguments

If we want to be able to memoize a function which can take a variety of arguments, we need a way to keep track of them. Let's take one of the simplest cases:

function addOne (x) {
  return x + 1;
};

That shouldn't take too much explanation. Now, in order to have a memoized version of it, we need to keep track of all the different values of x passed to it, and to associate them with the return value of addOne.

A JavaScript object should suffice, taking x as a key, and the result as the value. Let's give it a go:

const memo = {};

function memoAddOne (x) {
  if (memo[x]) {
    return memo[x];
  }
  return memo[x] = x + 1;
};

This takes advantage of the fact that the value of an assignment to an object is the value assigned (in this case x + 1), saving a line or so. The only issue here is that the value of memo is floating around in public, just waiting for other functions to overwrite and mutate it. We need a way to hide it.

Well, we could place memo on the function itself - it is an object after all - and just put an underscore in front of its name to try and let the world know that it's private (even though it isn't really private):

function memoAddOne (x) {
  memoAddOne._memo = memoAddOne._memo || {};

  if (memoAddOne._memo[x]) {
    return memoAddOne._memo[x];
  }
  return memoAddOne._memo[x] = x + 1;
};

This isn't beautiful, but it works. _memo gets defined as an empty object the first time the function is called, and gets filled up with results on each application of the function - throw some console.log()s in there to prove it to yourself. That's exciting - although we're still a little exposed with the _memo property being available on the function.

That said, we've got what we came for - a memoized version of our function that works for many different arguments.

Fun with strings

Problem is, we're reliant on x being used as the property for our _memo object. As all good school children are taught, JavaScript, like the universe, is just a load of strings held together by poorly-understood forces. When x is used in _memo[x], JavaScript handily casts it to a string to be used as the property name.

I say handily - but try this...

memoAddOne([55]) // => '551'
memoAddOne(55) // => '551'
memoAddOne(66) // => 67
memoAddOne([66]) // => 67

Because...

(55).toString() // => '55'
[55].toString() // => '55'

Ah, JavaScript: thou givest with one hand... toString(), which is what JavaScript is using under the hood to cast the non-string property identifier to a string, does not uniquely identify that argument. So our function behaves inconsistently depending on whether it was given the array or the number that toString() converts to the same string.
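It's not in the examples above, but plain objects make the problem even more obvious - they all stringify to exactly the same thing:

({a: 1}).toString() // => '[object Object]'
({b: 2}).toString() // => '[object Object]'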

We need a more predictable way of parsing our argument. Happily, we can borrow one of the built-in functions of JavaScript to do this, JSON.stringify().

JSON.stringify(55) // => '55'
JSON.stringify('55') // => '"55"'
JSON.stringify([55]) // => '[55]'
JSON.stringify(['55']) // => '["55"]'

Pretty good - let's give it a whirl:

function memoAddOne (x) {
  memoAddOne._memo = memoAddOne._memo || {};
  const jsonX = JSON.stringify(x);

  if (memoAddOne._memo[jsonX]) {
    return memoAddOne._memo[jsonX];
  }
  return memoAddOne._memo[jsonX] = x + 1;
};

memoAddOne([55]) // => '551'
memoAddOne(55) // => 56

Sorted.

The General Case

Now let's put together a function that can memoize any function - and as a bonus, we can also hide that nasty _memo property behind a closure:

function memoize (fn) {
  const memo = {};

  return function () {
    const args = Array.prototype.slice.call(arguments);
    const jsonArgs = JSON.stringify(args);

    if (memo[jsonArgs]) {
      return memo[jsonArgs];
    }
    return memo[jsonArgs] = fn.apply(null, args);
  };
};

Much of this should now be familiar, but let's dig in. We take a single argument, hopefully a function, and bind it to the variable fn. We now get to declare memo inside our function - and hooray it's now unavailable to anyone but the function we're returning. Now that's what I call private - thank you closures!

The function we give back, well we don't know how many arguments it's going to be given so why bind them to any parameters? We'll just leave its parameters empty. Any arguments we do get we'll instantly turn into an array by using the funky Array.prototype.slice.call(arguments) dance. And that array we can stringify() on the next line and call something useful like jsonArgs.

Then we do much the same as above, only instead of giving x + 1 as the value of _memo[jsonX], we set the value of memo[jsonArgs] as the result of applying the args array to the original function we were given to memoize. Job done.

Again, throw some console.logs in there to see what's really going on.
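Here's a quick usage sketch - slowAdd is just an invented stand-in for something expensive, not anything from the earlier code:

function slowAdd (a, b) {
  // pretend this takes a long time
  return a + b;
}

const memoAdd = memoize(slowAdd);

memoAdd(1, 2); // => 3, computed and cached under the key '[1,2]'
memoAdd(1, 2); // => 3, returned straight from the cache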

Here's One I Made Earlier Installed With npm

The above is so incredibly useful that you'll not be surprised to learn that it's implemented, with slight modifications, in functional JavaScript libraries like Underscore, Lodash and (personal niche favourite) Ramda.1

Let's take a look at the Lodash implementation:

function memoize(func, resolver) {
  if (typeof func != 'function' || (resolver && typeof resolver != 'function')) {
    throw new TypeError(FUNC_ERROR_TEXT);
  }
  var memoized = function() {
    var args = arguments,
        key = resolver ? resolver.apply(this, args) : args[0],
        cache = memoized.cache;

    if (cache.has(key)) {
      return cache.get(key);
    }
    var result = func.apply(this, args);
    memoized.cache = cache.set(key, result);
    return result;
  };
  memoized.cache = new memoize.Cache;
  return memoized;
}

Now, although this is a little more complicated than the code above, it should be similar enough for us to see that they're doing the same thing. The differences are that Lodash lets us supply an external function to hash the arguments (the resolver), and that it offers a custom caching object with a get() and set() interface, which we can overwrite if we like.
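For instance - and this is a hypothetical usage sketch, not something lifted from the Lodash docs - we could pass a resolver that hashes every argument, much like our jsonArgs above:

const memoAdd = _.memoize(
  function (a, b) { return a + b; },
  function () {
    // resolver: turn all the arguments into a single string key
    return JSON.stringify(Array.prototype.slice.call(arguments));
  }
);

memoAdd(1, 2); // => 3, cached under the key '[1,2]'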

(Bonus Question: why does this implementation of memoize suck if we don't supply a resolver argument?)

The above library code will save us the hassle of writing a memoization function ourselves - but now that we can see how it works under the hood, we can make a more informed decision about whether we need to take on a dependency on an external library, or whether we just put together the (relatively) simple piece of code ourselves.


  1. Seriously, this is the one to go for. It's amazing, it's beautiful - it's functional

Pre-commit Hooks in Git

Remembering to run your tests before you commit is hard:

* 84f7e2e 2016-02-06 | Another typo. And test [David Wickes]
* ef215f8 2016-02-06 | OMFG semicolons WAAAAAT [David Wickes]
* c20b65d 2016-02-06 | Typo [David Wickes]
* fca4aa8 2016-02-06 | Another fix for the same test [David Wickes]
* f0206a9 2016-02-06 | Fixes failing test [David Wickes]
* 657cc48 2016-02-06 | Amazing new feature [David Wickes]

Yes, I suck. It's even worse when you've just pushed that teensy-tiny, insignificant change to the CI pipeline and it throws a strop because the JS linter is super fussy about semi-colons.

Don't get angry. Get even.

Wait, wut? Don't get even. Automate all the things!

Inside the unexamined recesses of the .git directory of every repo is a folder called hooks. You should look inside it.

applypatch-msg.sample
commit-msg.sample
post-update.sample
pre-applypatch.sample
pre-commit.sample # <--- This one here!
pre-push.sample
pre-rebase.sample
prepare-commit-msg.sample
update.sample

You'll see some pretty self-explanatory instructions on what it does, but the tl;dr is:

  • It is a shell script that runs before you commit
  • You activate it by removing the .sample bit.

So say we have a Node project we're working on. A cool pre-commit hook would look like:

#!/bin/sh

npm test

Pop that in a file called pre-commit inside that hooks directory - make sure it's executable like the sample ones - and see what you get.

So as long as you've set up your package.json file sensibly to run tests when you run npm test, you're golden. The same idea holds for rake or gradle or whatever you're using as a task runner.
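By 'sensibly' I mean something along these lines in your package.json - mocha is just a placeholder for whatever test runner you actually use:

{
  "scripts": {
    "test": "mocha"
  }
}

With that in place, the hook above will run your test suite before every commit.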

But you'd hate to do that for every project, right? Automate that too.

Try this:

$ git config --global init.templatedir '~/.git-templates'
$ mkdir -p ~/.git-templates/hooks

Inside the equally unexamined .gitconfig in your home directory you should now see:

[init]
    templatedir = ~/.git-templates

(you could've just written that in there by hand... but here we are)

What this'll do is copy the contents of .git-templates to the .git directories of new projects you clone or initialize.

We now need to make our hook more generic. Let's save the below off in ~/.git-templates/hooks/pre-commit:

#!/bin/sh

if [ -f package.json ]; then
    echo "detected package.json... running npm test"
    npm test
else
    echo "no testable project detected... so no tests before commit"
fi

[ -f package.json ] asks if there's a file called package.json, and if there is we run npm test. The rest of the script is just a little logging so we can see what's happening. Just remember to make it executable1 before you start to use it.

There we have it - the bare bones of a "generic" pre-commit hook. You can easily embellish this with more tests and other exciting/useful/amusing things to execute (is there a Rakefile present? Do any of the files I'm committing still contain a console.log or a puts?).
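A rough sketch of what those embellishments might look like - very much a starting point rather than a battle-tested script:

#!/bin/sh

# run the Ruby tests if there's a Rakefile
if [ -f Rakefile ]; then
    echo "detected Rakefile... running rake"
    rake || exit 1   # abort the commit if the tests fail
fi

# block the commit if any staged line adds a console.log or a puts
if git diff --cached | grep -E -q '^\+.*(console\.log|puts)'; then
    echo "stray console.log or puts in staged changes... commit aborted"
    exit 1
fi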

As I said, this template will get added to everything that gets cloned or initialized by Git from now on. For those projects that have already been started, just run git init again to pull in the template.

And there we are - have fun exploring the other sample templates, read the docs, and experiment with other useful scripts. Then tell me about them on Twitter so I can use them.


  1. chmod a+x pre-commit - you didn't need telling, but just in case. 

Downloading a list of URLs

Say you've got a list of URLs - a long list of URLs - each of which points to a file. Perhaps they're a set of logs, or backups, or something similar. The file looks like this:

http://www.somedomain.com/my/file/number-one.txt
http://www.somedomain.com/my/file/number-two.txt
http://www.somedomain.com/my/file/number-three.txt
...
http://www.somedomain.com/my/file/number-five-hundred-and-x-ity-x.txt

Now what we don't want to do is copy and paste each of those file names into a browser to download the file. That would suck. What would be ideal is to drop the file into a magic box, and that magic box just work through the list, downloading the files until they're all done.

Happily every *nix command line comes with its very own tooling to build a magic box like this.

wget

My first instinct would be to use wget, which is certainly the friendliest way I've seen to download files on the command line. Taking a short read of the manual with man wget we can see the following:

-i file
   --input-file=file
       Read URLs from a local or external file.  If - is specified as file,
       URLs are read from the standard input.  (Use ./- to read from a file
       literally named -.)

So the job is incredibly simple - we just type:

$ wget -i file-with-list-of-urls.txt

and we just let wget do its magic.

curl and xargs

That was too easy - I love wget and usually wind up installing it on any system I use for longer than 30 seconds. But sometimes it's unavailable - maybe there's no package manager, or you have no rights to install packages because you're remoting in to a tiny machine running a very skinny Linux distro. In these cases we're going to have to rely on wget's older, less forgiving but far more flexible sibling curl.

The quick and generic curl command to download a URL is:

$ curl http://www.file.com/file.txt -LO

curl has a wealth of uses and options - we're barely scraping the surface with what we're doing here. Take a look at the full man page and you'll see what I mean.

But for this command: the -L flag tells curl to follow redirects - if it wasn't there we'd get the 30x response saved rather than the file at the location we're being redirected to. The -O flag means that curl uses the name of the remote file to name the file it's saved as, saving us the bother of naming the output.

In order to pass each of the URLs into curl one after another we get to use xargs, which is a wonderful piece of witchcraft you can use to pass lines from STDIN in as arguments to another command.

The full command looks like this:

$ cat file-with-list-of-urls.txt | xargs -n 1 curl -LO

cat we should be comfortable with: it writes the contents of the file to its standard output, one line after another. Here we're piping that output into xargs.

-n 1 tells xargs that it should be expecting one and only one argument for each execution from STDIN - in other words each of the URLs will be used as a single extra argument to curl. If we didn't do this, how would xargs know how many additional arguments curl wanted? It could just use every URL as an extra argument to a single curl execution. Which would suck.

So we take in an extra argument from STDIN, here being piped in by cat, and we apply it to the end of curl -LO. xargs will now run curl for each of the URLs.

Optimization

Five hundred or so files is going to take a long time to download. Try passing -P 24 to xargs, which tells it to run up to 24 curl processes in parallel. That'll whip along nicely (if your machine can take it).
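Bolted on to the earlier command, that looks like:

$ cat file-with-list-of-urls.txt | xargs -n 1 -P 24 curl -LO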

Another nice feature would be the ability to output to a filename that was not the same as the remote file - the path could be really annoying and long. Using xargs we'd be somewhat limited, and would have to change the input file to include not only the new file name but also an extra argument to curl, -o, which gives the output file name.

The URL file list would look like this:

    http://www.somedomain.com/my/file/number-one.txt
    -o
    number-one.txt
    http://www.somedomain.com/my/file/number-two.txt
    -o
    number-two.txt

and the command would be

$ cat file-with-list-of-urls.txt | xargs -n 3 curl -L

But the same can be achieved without changing the original file list using GNU parallel, which is a distinct improvement (apart from the three extra characters).

$ cat file-with-list-of-urls.txt | parallel curl -L {} -o {/}

which substitutes each URL for the {} and its basename - the file name with the path stripped off - for the {/}. There's plenty more you can do with parallel - take a look at the tutorial.

Finally, it would be remiss of me not to mention that all the uses of cat above are entirely superfluous - the same could have been achieved with:

$ <file-with-list-of-urls.txt parallel curl -L {} -o {/}

Update

And if you want to avoid reading all those logs and just get on with your life, try sending the whole process to the background and redirecting stdout and stderr to a file.

$ nohup cat filelist | xargs -n4 curl -L &>output &

nohup protects the process from being interrupted by the session closing. So it'll keep on going even when you close your terminal or SSH connection. Don't worry, you can still kill it if you've made a mistake.
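If you want to keep an eye on it, or change your mind later, something like this (a rough sketch) will do the job:

$ tail -f output    # watch the download logs scroll past
$ pkill xargs       # stop any new downloads from being started
$ pkill curl        # and kill off any that are still in flight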

And the four arguments?

    http://www.somedomain.com/my/file/number-one.txt
    --create-dirs
    -o
    a-directory/hierarchy/number-one.txt

You get curl to make you a directory structure too.