SSH SOCKS proxy and Amazon EC2 to the rescue

I’m currently somewhere in the process of building a hadoop clouster in EC2 for one of my clients and one of the most important parts for keeping my sanity is the ability to access all of the node’s web interfaces ( jobtracker, namenode, tasktrackers’, datanodes, etc ). If you aren’t abs(crazy) all of these machines are jailed inside a locked down security group, a micro walled garden.

SSH -D 8080 someMachine.amazon-publicDNS.com

That will setup a socks between your machine and some instance that should be in the same SG as the hadoop cluster… now unless you are a saddist and like to write dozens of host file entries, the SOCKS proxy is useless.

But wait! Proxy Auto-configuration to the rescue! All you really need to get started is here at Wikipedia ( http://en.wikipedia.org/wiki/Proxy_auto-config ) but to be fair a dirt simple proxy might look like:

hadoop_cluster.pac
function FindProxyForURL(url, host) {
if (shExpMatch(host, "*.secret.squirrel.com")) {
return "SOCKS5 127.0.0.1:8080";
}
if (shExpMatch(host, "*.internal")) {
return "SOCKS5 127.0.0.1:8080";
}

return "DIRECT";
}

Save this to your harddrive then find the correct “file:///path/2/hadoop_cluster.pac” from there go into your browsers proxy configuration dialog window and paste that URL into the Proxy Auto-configuration box. After that, going to http://ip-1-2-3-4.amazon.internal in a web browser will automatically go through the SSH proxy into Amazon EC2 cloud space, resolve against Amazon DNS servers, and voila you’re connected.

NOTE: Windows users

It shouldn’t be a surprise that Microsoft has partially fucked up the beauty that is the PAC. Fortunately, they provide directions for resolving the issue here ( http://support.microsoft.com/kb/271361 ).

tl;dwrite – Microsoft’s network stack caches the results of the PAC script instead of checking it for every request. If your proxy goes down or you edit the PAC file, those changes can take sometime to actually come into play. Fortunately Firefox has a nifty “reload” button on their dialog box, but Microsoft Internet Explorer and the default Chrome for windows trust Microsofts netstack.