Category Archives: amazon-ec2

Amazon EC2, HBase, and IMHO benefits of ephemeral over EBS

Damn you HBase, damn you to hell

All of the other core services I’ve dealt with in Hadoop play by the system rules: if I populate fake DNS values in /etc/hosts, by golly the services are going to believe it. Well, all except for HBase, which didn’t seem to play fair with /etc/resolv.conf or /etc/hosts and did fairly low-level reverse DNS lookups against the network DNS, which in this case was provided by Amazon. I do so love those super-descriptive ip-101-202-303-404.internal addresses.

Still, once you abandon the long-term untenable idea of using /etc/hosts and just get into the habit of memorizing IP/internal DNS addresses, it’s not so bad. Otherwise, a stable arrangement was Debian Squeeze with Cloudera CDH3 Update 2; the stability improvements were painfully obvious as HBase stopped murdering its own HDFS entries and became performant.

Last bit: for small clusters it makes sense to use EBS-backed volumes for the datanodes, but generally I felt that the ephemeral volumes were slightly faster in seek times and throughput. This became especially important under very high-load HDFS scenarios, where an EBS array on a datanode is collectively capped at 1 GB/s but ephemeral storage can go higher.

Still focusing on pro-ephemeral nodes, the reality is that you’ve lost the game if a single datanode has more than 250GB of JBOD volumes, and it’s going to quickly become expensive if you have multiple terabytes of EBS-backed data ($0.10 USD per GB and $0.10 USD per million I/O ops). Instead, the reality is that with 2 or 3 levels of HDFS replication, something downright catastrophic would need to occur to take all of your datanodes completely down. Plus, with S3 being right next door to EC2, it’s hard to find an excuse not to make vital backups.
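To make the cost argument concrete, here’s a back-of-the-envelope sketch using the two rates quoted above; the data size, replication factor, and I/O volume are hypothetical numbers for illustration, not measurements from my cluster:

```python
# EBS monthly cost sketch using the rates quoted above:
# $0.10 per GB stored and $0.10 per million I/O requests.
# The data size, replication factor, and I/O volume below are
# hypothetical, purely for illustration.
GB_RATE = 0.10                # USD per GB per month
IO_RATE = 0.10                # USD per million I/O ops

raw_gb = 3 * 1024             # 3 TB of raw data
replication = 3               # HDFS replication factor
io_millions_per_month = 5000  # hypothetical I/O volume

storage_cost = raw_gb * replication * GB_RATE
io_cost = io_millions_per_month * IO_RATE
print("storage: $%.2f/month, I/O: $%.2f/month" % (storage_cost, io_cost))
```

Replication triples the EBS bill before a single I/O op is counted, which is exactly why ephemeral disks plus S3 backups start looking attractive.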

SSH SOCKS proxy and Amazon EC2 to the rescue

I’m currently somewhere in the process of building a Hadoop cluster in EC2 for one of my clients, and one of the most important parts for keeping my sanity is the ability to access all of the nodes’ web interfaces (jobtracker, namenode, tasktrackers, datanodes, etc.). If you aren’t abs(crazy), all of these machines are jailed inside a locked-down security group, a micro walled garden.

ssh -D 8080 someMachine.amazon-publicDNS.com

That will set up a SOCKS proxy between your machine and some instance that should be in the same security group as the Hadoop cluster… now unless you are a sadist who likes writing dozens of host file entries, the SOCKS proxy alone is useless.

But wait! Proxy auto-configuration to the rescue! All you really need to get started is here at Wikipedia ( http://en.wikipedia.org/wiki/Proxy_auto-config ), but to be fair, a dirt-simple proxy config might look like:

hadoop_cluster.pac:

function FindProxyForURL(url, host) {
    if (shExpMatch(host, "*.secret.squirrel.com")) {
        return "SOCKS5 127.0.0.1:8080";
    }
    if (shExpMatch(host, "*.internal")) {
        return "SOCKS5 127.0.0.1:8080";
    }
    return "DIRECT";
}

Save this to your hard drive, then work out the correct “file:///path/2/hadoop_cluster.pac” URL. From there, go into your browser’s proxy configuration dialog and paste that URL into the proxy auto-configuration box. After that, going to http://ip-1-2-3-4.amazon.internal in a web browser will automatically go through the SSH proxy into Amazon EC2 cloud space, resolve against Amazon’s DNS servers, and voilà, you’re connected.
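For the curious, the PAC routing above is nothing more than shell-style glob matching on the hostname. A rough Python approximation (Python’s fnmatch behaves much like PAC’s shExpMatch; the function name here is my own, not part of any PAC API) shows which hosts get tunneled:

```python
from fnmatch import fnmatch

# Approximation of the PAC logic above: PAC's shExpMatch() does
# shell-style glob matching, which Python's fnmatch mirrors.
def find_proxy_for_host(host):
    if fnmatch(host, "*.secret.squirrel.com"):
        return "SOCKS5 127.0.0.1:8080"
    if fnmatch(host, "*.internal"):
        return "SOCKS5 127.0.0.1:8080"
    return "DIRECT"

print(find_proxy_for_host("ip-10-1-2-3.internal"))  # tunneled through SSH
print(find_proxy_for_host("www.example.com"))       # goes out directly
```

Anything matching the internal Amazon suffix rides the SSH tunnel; everything else goes out your normal connection, so regular browsing is unaffected.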

NOTE: Windows users

It shouldn’t be a surprise that Microsoft has partially fucked up the beauty that is the PAC. Fortunately, they provide directions for resolving the issue here ( http://support.microsoft.com/kb/271361 ).

tl;dwrite – Microsoft’s network stack caches the results of the PAC script instead of checking it for every request. If your proxy goes down or you edit the PAC file, those changes can take some time to actually come into play. Fortunately, Firefox has a nifty “reload” button in its proxy dialog, but Microsoft Internet Explorer and the default Chrome on Windows trust Microsoft’s network stack.