All of the other core services I've dealt with in Hadoop play by the system rules: if I populate fake DNS values in /etc/hosts, by golly, the services are going to believe it. Well, all except for HBase, which didn't seem to play fair with /etc/resolv.conf or /etc/hosts and instead did fairly low-level reverse DNS lookups straight against the network DNS, which in this case was provided by Amazon. I so do love those super descriptive ip-101-202-303-404.internal addresses.
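To see why faking entries in /etc/hosts gets you nowhere here, below is a minimal sketch of a reverse lookup done through Java's JNDI DNS provider, which is roughly the kind of mechanism the Hadoop/HBase DNS helpers rely on: the query goes straight to the nameserver, so the local hosts file never gets consulted. The IP address and nameserver in this sketch are made-up placeholders.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class ReverseDnsCheck {
    public static void main(String[] args) throws NamingException {
        // Hypothetical private address; flip it into in-addr.arpa form,
        // e.g. 10.0.1.42 -> 42.1.0.10.in-addr.arpa
        String ip = args.length > 0 ? args[0] : "10.0.1.42";
        String[] o = ip.split("\\.");
        String arpa = o[3] + "." + o[2] + "." + o[1] + "." + o[0] + ".in-addr.arpa";

        // The JNDI DNS provider talks directly to the nameserver given in the
        // provider URL (hypothetical VPC resolver here), bypassing /etc/hosts.
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, "dns://10.0.0.2");

        DirContext ctx = new InitialDirContext(env);
        Attributes attrs = ctx.getAttributes(arpa, new String[] { "PTR" });
        System.out.println(ip + " -> " + attrs.get("PTR")); // whatever Amazon's DNS says, not your hosts file
        ctx.close();
    }
}
```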
Still, once you abandon the long-term untenable idea of using /etc/hosts and just get into the habit of memorizing IP/internal DNS addresses, it's not so bad. Otherwise, a stable arrangement was Debian Squeeze with Cloudera CDH3 Update 2; the stability improvements were painfully obvious as HBase stopped murdering its own HDFS entries and became performant.
Last bit: for small clusters it makes sense to use EBS-backed volumes for the datanodes, but generally I felt that the ephemeral volumes were slightly faster in seek times and throughput. This became especially important under very high load HDFS scenarios, where an EBS array on a datanode is capped collectively at 1GB/s but ephemeral volumes can go higher.
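The throughput difference is easy enough to sanity-check yourself. Here is a rough sequential-write sketch in Java; the mount point is a hypothetical placeholder, and running it once against an ephemeral mount and once against an EBS mount gives a ballpark MB/s for each.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class VolumeThroughput {
    public static void main(String[] args) throws IOException {
        // Directory on the volume under test, e.g. /mnt/ephemeral0 (hypothetical mount point)
        Path dir = Paths.get(args.length > 0 ? args[0] : "/mnt/ephemeral0");
        Path testFile = dir.resolve("throughput.tmp");

        int chunkMb = 8;
        int chunks = 256; // ~2 GB total, enough to get past most caching effects
        ByteBuffer buf = ByteBuffer.allocateDirect(chunkMb * 1024 * 1024);

        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(testFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < chunks; i++) {
                buf.clear();
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
            ch.force(true); // flush to disk so the number actually means something
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mb = (double) chunkMb * chunks;
        System.out.printf("wrote %.0f MB in %.1f s -> %.1f MB/s%n", mb, seconds, mb / seconds);
        Files.deleteIfExists(testFile);
    }
}
```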
Still focusing on the pro-ephemeral side, the reality is that you've lost the game if a single datanode has more than 250GB of JBOD volumes, and it's going to quickly become expensive if you have multiple terabytes of EBS-backed data ($0.10 USD per GB and $0.10 USD per million I/O ops). Instead, with an HDFS replication factor of 2 or 3, something downright catastrophic would need to occur to take all of your datanodes completely down. Plus, with S3 being right next door to EC2, it's hard to find an excuse not to make vital backups.
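To put that pricing in perspective, here's the back-of-the-envelope arithmetic, assuming a hypothetical 3TB of raw data and the storage rate quoted above; replication multiplies the bill the same way it multiplies the blocks.

```java
public class EbsCostEstimate {
    public static void main(String[] args) {
        // Hypothetical cluster: 3 TB of raw data, HDFS replication factor 3,
        // at the EBS rate quoted above ($0.10 per GB of storage per month).
        double rawDataGb = 3 * 1024;
        int replicationFactor = 3;
        double storageRatePerGbMonth = 0.10;

        double storedGb = rawDataGb * replicationFactor;          // 9216 GB actually on EBS
        double monthlyStorage = storedGb * storageRatePerGbMonth; // ~$921/month
        System.out.printf("%.0f GB stored -> $%.2f/month before I/O charges%n",
                storedGb, monthlyStorage);
        // And every read and write still racks up the $0.10-per-million-ops
        // I/O charge on top of that.
    }
}
```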