Damn you hbase, damn you to hell

All of the other core services I’ve dealt with in Hadoop play by the system rules: if I populate fake DNS values in /etc/hosts, by golly the services are going to believe it. All except HBase, which didn’t seem to play fair with /etc/resolv.conf or /etc/hosts and instead did fairly low-level reverse DNS lookups against the network DNS, which in this case was provided by Amazon. I do so love those super descriptive ip-101-202-303-404.internal addresses.

Still, once you abandon the long-term untenable idea of using /etc/hosts and just get into the habit of memorizing IP/internal DNS addresses, it’s not so bad. Otherwise, a stable arrangement was Debian Squeeze with Cloudera CDH3 Update 2; the stability improvements were painfully obvious as HBase stopped murdering its own HDFS entries and became performant.

Last bit: for small clusters it makes sense to use EBS-backed volumes for the datanodes, but generally I felt that the ephemeral volumes were slightly faster in seek times and throughput. This became especially important under very high load HDFS scenarios, where an EBS array on a datanode is capped collectively at 1GB/s but ephemeral can go higher.

Still focusing on pro-ephemeral nodes: the reality is that you’ve lost the game if a single datanode has more than 250GB of JBOD volumes, and it’s going to quickly become expensive if you have multiple terabytes of EBS-backed data ( .10 USD per gigabyte and .10 USD per million I/O ops ). Instead, with 2 or 3 levels of HDFS replication, something downright catastrophic would need to occur to take all of your datanodes completely down. Plus, with S3 being right next door to EC2, it’s hard to find an excuse not to make vital backups.

 

I’m currently somewhere in the process of building a Hadoop cluster in EC2 for one of my clients, and one of the most important parts of keeping my sanity is the ability to access all of the nodes’ web interfaces ( jobtracker, namenode, tasktrackers, datanodes, etc ). If you aren’t abs(crazy), all of these machines are jailed inside a locked-down security group, a micro walled garden.

ssh -D 8080 someMachine.amazon-publicDNS.com

That will set up a SOCKS proxy between your machine and some instance that should be in the same SG as the Hadoop cluster… now, unless you are a sadist and like to write dozens of hosts file entries, the SOCKS proxy is useless on its own.

But wait! Proxy Auto-configuration to the rescue! All you really need to get started is here at Wikipedia ( http://en.wikipedia.org/wiki/Proxy_auto-config ) but to be fair, a dirt simple PAC file might look like:

hadoop_cluster.pac
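function FindProxyForURL(url, host) {
    // Hosts behind the SSH tunnel go through the local SOCKS proxy
    if (shExpMatch(host, "*.secret.squirrel.com")) {
        return "SOCKS5 127.0.0.1:8080";
    }
    // Amazon's internal DNS names take the same route
    if (shExpMatch(host, "*.internal")) {
        return "SOCKS5 127.0.0.1:8080";
    }

    // Everything else connects directly
    return "DIRECT";
}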

Save this to your hard drive and work out the matching “file:///path/2/hadoop_cluster.pac” URL. From there, go into your browser’s proxy configuration dialog and paste that URL into the Proxy Auto-configuration box. After that, going to http://ip-1-2-3-4.amazon.internal in a web browser will automatically go through the SSH proxy into Amazon EC2 cloud space, resolve against Amazon DNS servers, and voilà, you’re connected.
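If you want to sanity check the routing rules before pointing a browser at the file, a rough sketch like the one below can be run outside the browser. The shExpMatch here is only a hand-rolled approximation of the helper that browsers provide to PAC scripts, and the test URLs are made up for illustration.

```javascript
// Approximation of the browser-provided shExpMatch helper: "*" matches any
// run of characters and "?" matches exactly one. Real browsers supply their own.
function shExpMatch(str, pattern) {
  var escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  var regex = new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
  return regex.test(str);
}

// Same routing rules as hadoop_cluster.pac, folded into one condition.
function FindProxyForURL(url, host) {
  if (shExpMatch(host, "*.secret.squirrel.com") || shExpMatch(host, "*.internal")) {
    return "SOCKS5 127.0.0.1:8080";
  }
  return "DIRECT";
}

// An internal host should be proxied, a public one should not.
console.log(FindProxyForURL("http://ip-1-2-3-4.amazon.internal/", "ip-1-2-3-4.amazon.internal"));
console.log(FindProxyForURL("http://example.com/", "example.com"));
```

The first call should report the SOCKS5 route and the second should report DIRECT.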

NOTE: Windows users

It shouldn’t be a surprise that Microsoft has partially fucked up the beauty that is the PAC. Fortunately, they provide directions for resolving the issue here ( http://support.microsoft.com/kb/271361 ).

tl;dwrite – Microsoft’s network stack caches the results of the PAC script instead of checking it for every request. If your proxy goes down or you edit the PAC file, those changes can take some time to actually come into play. Fortunately Firefox has a nifty “reload” button in its dialog box, but Microsoft Internet Explorer and the default Chrome for Windows trust Microsoft’s network stack.

Very true

Oct 08 2011
 

Static typing is like prohibition, responsible citizens don’t need it & are burdened; and some find creative ways to defeat it.

Venkat Subramaniam

 

Unfortunately I cannot find the original Usenet post, so here’s the paraphrased summary:

Two programmers are discussing what to do with a slow program and the junior of the two laments “If only there was a way to make the computer run faster.” to which the senior replies “You cannot make the computer run faster, but you can make it do less.” The gist of which I can explain from my own experience.

Caching

With some exceptions, it generally doesn’t matter what language you choose to implement a program or application in… as long as it is fast enough. Instead, you need to look at what your application is spending most of its time doing, and I don’t mean just a cursory look: really dig in there. In almost every case, the primary obstacle to scaling out is going to be whatever you are using for a data backend.

If you’re fetching a User credential or profile record from the database on every request, you’ve suddenly locked the speed of your entire application to the max number of connections ( not queries ) your database can do. For MySQL that’s about 150-180/second ( or 220-250/second if you have a full-time DBA ). If you get more than 250 user requests per second to your webstack, then your application is locked up solid. So it should be obvious that the solution is to cache everything and anything that’s needed from the databases that won’t be changing too often.

My preferred solution for the above is to use memcache with as much RAM as you can throw at it, at minimum 2GB, but I’ve worked on 128GB categorized arrays before. Now, memcache can be summarized as an unreliable key/value data store: you might put a key/value pair in and it might be there for the next minute or so.

By implementing caching in your application, you’re making it do less. So instead of a 1 to 1 relationship between user requests and database connections, it might go up to 10 to 1.
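As a rough illustration of that 10 to 1 idea, here is a minimal cache-aside sketch. It assumes an in-process Map with a time-to-live standing in for memcache, and loadProfileFromDb is a hypothetical placeholder for a real database call; none of these names come from an actual memcache client.

```javascript
// Stand-in for memcache: an in-process Map with a time-to-live per entry.
function TtlCache(ttlMs) {
  this.ttlMs = ttlMs;
  this.entries = new Map();
}

TtlCache.prototype.get = function (key, now) {
  var entry = this.entries.get(key);
  if (!entry || entry.expiresAt <= now) {
    this.entries.delete(key);
    return undefined; // miss, or the entry has expired
  }
  return entry.value;
};

TtlCache.prototype.set = function (key, value, now) {
  this.entries.set(key, { value: value, expiresAt: now + this.ttlMs });
};

// Hypothetical database call; counts how often it is actually hit.
var dbHits = 0;
function loadProfileFromDb(userId) {
  dbHits += 1;
  return { id: userId, name: "user-" + userId };
}

// Cache-aside: try the cache first, fall back to the database on a miss.
function getProfile(cache, userId, now) {
  var key = "profile:" + userId;
  var cached = cache.get(key, now);
  if (cached !== undefined) {
    return cached;
  }
  var profile = loadProfileFromDb(userId);
  cache.set(key, profile, now);
  return profile;
}

// Ten requests for the same user inside the TTL window.
var cache = new TtlCache(60 * 1000);
var startedAt = Date.now();
for (var i = 0; i < 10; i++) {
  getProfile(cache, 42, startedAt);
}
console.log("requests: 10, database hits:", dbHits);
```

Only the first request reaches the database; the other nine are answered from the cache until the entry expires.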

Division of concerns

This usually catches almost all junior and mid-level developers off guard. If your application serves static content from a Python or Ruby script, you’re burning capacity. A better plan is to split your application into two subprojects: Application and Application Content. From the outside looking in, that’s http://derpCorp.com/application/url and http://static.derpCorp/staticContent/. Generally nginx or lighttpd can trounce almost anything else for serving static content. Again, this isn’t applicable to everyone, but the cost of infrastructure will lean heavily towards new application servers and not your content servers… so by dividing the two now, while you can, you set yourself up for investing wisely vs. just throwing money at the problem.

Divide and conquer

The minute one piece of an application becomes a critical component, the door to unending misery begins to open. That one critical piece is going to reliably fail at every investor presentation, at 4am on Saturday, and about ten minutes after you hit evening rush hour traffic. Usually the critical component is the database, and almost always the first solution is to throw more memory and disks at it, hoping the beast will be sated forever and ever. This should be a sign that something needs to change, but sometimes it isn’t heard. Instead of scaling up, the proven winning solution is to scale out. If you have two or more schemas on the same server, it might be time to separate them. Does User A need to cohabitate with User B’s data?
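One way to picture User A and User B living apart is a lookup that sends each user to a different database host. This is only a sketch: the host names are invented, and plain modulo on a numeric user id is the simplest scheme rather than a recommendation.

```javascript
// Hypothetical database hosts; each one holds a slice of the user data.
var SHARDS = [
  "db-users-0.example.internal",
  "db-users-1.example.internal",
  "db-users-2.example.internal"
];

// Map a non-negative numeric user id to a shard. Modulo is the simplest
// possible scheme; it reshuffles most users whenever the shard count changes.
function shardForUser(userId, shards) {
  return shards[userId % shards.length];
}

console.log(shardForUser(7, SHARDS)); // db-users-1.example.internal
console.log(shardForUser(9, SHARDS)); // db-users-0.example.internal
```

The application asks shardForUser where a user lives before opening a connection, so no single database server has to carry everyone.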

Don’t ignore your problems

Usually there is a small clan of people clustered around an application; it provides money and stability for them. Sometimes this clan sacrifices their youth, sanity, and credit ratings for the application like it’s some sort of messed up deity. Unfortunately your application is stupider than the bacteria growing in your kitchen sink, and though throwing money and time at a half-assed solution may seem to correlate with resolution, correlation does not equal causation… especially with software. If half of the application randomly goes belly up every week at the same time… don’t ignore that problem or, worse, try to bury it; pick someone on your team and send them off on a mission to find the problem and fix it. Otherwise what was once a problem may end up being your clan’s apocalypse.

 

It sucks to admit when you’ve made a bad decision, but step one is admitting you have a problem. My problem specifically was the gap between what I wanted to accomplish and what CouchDB had to offer. To be clear, the problem was with me and not CouchDB; it’s a great tool and resource for someone out there, but not for me.

context

One of my unpublished pets is a Delicious clone ( live as of last weekend ) that was designed from month two to be a single-user affair ( Delicious data, custom spiders, reports, and a reddit cross-analysis ranking thing ). Delicious is/was a bookmarking website where you could apply arbitrary tags to bookmarks, then retrieve all bookmarks related to a tag.

The data could be modeled as: a URL has many tags and a tag has many URLs. In SQL you could do something like

table BookMark:
    id: int
    url: text(255)
    name: text(255)
    date_created: text(255)

table Tags:
    id: int
    name: text(255)
    date_created: text(255)

table BookMarks2Tags:
    tag_id: int
    bookmark_id: int

Odd data, odd results

I can’t find the map logic I used, but the gist of it was that the results I was getting back were less than ideal. It was easy to aggregate tag counts, but grabbing all bookmarks that had a specific tag was somewhat contorted.
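For reference, and definitely not the lost map logic, the usual shape of a "bookmarks by tag" view is a map function that emits one row per tag. The sketch below assumes each bookmark is stored as a single document with a type field and a tags array (a hypothetical layout), and it stubs out emit so the function can be run by hand outside CouchDB.

```javascript
// Stub for CouchDB's emit(); inside a real view the server provides it.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: one row per (tag, bookmark) pair, keyed by tag.
function bookmarksByTag(doc) {
  if (doc.type === "bookmark" && Array.isArray(doc.tags)) {
    doc.tags.forEach(function (tag) {
      emit(tag, doc.url);
    });
  }
}

// Hypothetical documents, run through the map function by hand.
[
  { type: "bookmark", url: "http://couchdb.apache.org/", tags: ["db", "nosql"] },
  { type: "bookmark", url: "http://memcached.org/", tags: ["cache", "db"] }
].forEach(bookmarksByTag);

// Roughly what querying the view with key "db" would hand back.
var dbUrls = rows
  .filter(function (row) { return row.key === "db"; })
  .map(function (row) { return row.value; });
console.log(dbUrls);
```

With a view like this, fetching everything tagged "db" becomes a single keyed lookup rather than a join.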

Lack of straightforward documentation

Rechecking CouchDB’s documentation website, I really hate information-overload style docs. In the beginning I don’t care how something does what it does, just show me well documented examples of accomplishing the basics: create, replace, update, delete. Probably immediately after that I’ll need to know how to relate two entities of separate types and do CRUD on that. Rinse and repeat until I’ve made something so goddamn complicated that maybe it’s time to figure out how the whole mess works.

Error logging from hell

This could be the fault of the Ubuntu package for CouchDB or just my cluelessness, but I absolutely hate 5-10 page long exception traces that include a lot of stuff I don’t give a crap about… just tell me I’m an idiot and there’s a runtime syntax error on line two, or that the round peg doesn’t go in the square hole.

Lack of Python support

To be fair, CouchDBKit rocks and did take some of the sting out of learning a new technology, but in late 2010 and early 2011 I found the Python CouchDB view interpreter left a lot to be desired ( partially due to CouchDB’s excessive error vomit traces ). Never mind that typing whitespace-sensitive code into a textarea field for ad hoc query testing was entertaining.

Alright, I think I’m done ranting. Does this mean I’m going to completely swear off CouchDB? No. I keep proclaiming I’m never going to do any more PHP contracts, and then the next thing you know I’m staring at an IDE full of PHP 5.0 ( for non-PHP people, PHP 5.0 was as good as MySQL 5.0… for non-MySQL people, MySQL 5.0 was scary ).

MQ joke

Sep 27 2011
 

If an event fires an observable but no one is subscribed to it, does it matter?

Sep 27 2011
 

Thanks to Pandora, I found an electronic, borderline ambient band called Hybrid. Best I can say is that I like it.

 

Can’t believe I missed this until now. While researching ideas for a zoom toggle for my windowing idea, I ran into a Stack Overflow question about putImageData’s poor performance ( http://stackoverflow.com/questions/3952856/why-is-putimagedata-so-slow ), which led to an answer and its linked test suite ( http://jsperf.com/buffering ).

 

Code is here ( https://gist.github.com/a17216d5b1db068ab41b )

I took a break from a client project today and decided to try something out real quick. For one of my unpublished pet projects ( a near real time web based MUD ) I wanted a little map window to give visual feedback to the user on where exactly they were. Fortunately the Canvas API makes this stupid easy.

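    // Scale the player's grid position up to map canvas coordinates
    var oX = this.player.x * (this.mMX / this.rMX);
    var oY = this.player.y * (this.mMX / this.rMY);

    // Midpoint of the window canvas
    var halfX = ( this.wMX / 2 );
    var halfY = ( this.wMY / 2 );

    // Upper left corner of the region to copy, clamped to the map's edge
    var startX = Math.max(0, oX - halfX);
    var startY = Math.max(0, oY - halfY);

    // Lower right corner of the region to copy
    var endX = Math.max(startX + this.wMX);
    var endY = Math.max(startY + this.wMY);

    // getImageData takes a width and height, not a second corner
    var buffer = this.ctx.getImageData(startX, startY, endX - startX, endY - startY);
    this.window.putImageData(buffer, 0, 0);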

First block takes the player’s or avatar’s position, then scales their grid position up to the map canvas position.

Second block gets the middle point of the window canvas; in this case, if the window is 150×150, the midpoint is (75,75).

Now, given the player’s position in the canvas map, it finds the upper left and lower right corners of the region that is going to be copied from the map canvas into the window canvas. The Math.max calls are there to prevent the upper left corner from being an impossible point such as (123, -50).

The end*’s use of Math.max is actually crap on my part and isn’t needed.
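The same corner math can be pulled out into a small pure function, which also makes it easy to clamp the lower right side so the window never runs off the far edge of the map. This is a sketch under assumptions: viewport and its parameter names are made up for illustration and are not part of the gist.

```javascript
// Clamp the copy rectangle so it stays inside the map canvas on every side.
// mapW/mapH and winW/winH are assumed names for the map and window canvas sizes.
function viewport(centerX, centerY, mapW, mapH, winW, winH) {
  var maxX = Math.max(0, mapW - winW); // furthest right the window may start
  var maxY = Math.max(0, mapH - winH); // furthest down the window may start
  var x = Math.min(Math.max(0, centerX - winW / 2), maxX);
  var y = Math.min(Math.max(0, centerY - winH / 2), maxY);
  return { x: x, y: y, width: winW, height: winH };
}

// Player near the left edge of a 600x600 map with a 150x150 window.
var nearLeft = viewport(10, 300, 600, 600, 150, 150);
console.log(nearLeft.x, nearLeft.y); // 0 225

// Player near the bottom right corner.
var nearCorner = viewport(590, 590, 600, 600, 150, 150);
console.log(nearCorner.x, nearCorner.y); // 450 450
```

The returned x, y, width, and height line up with the four arguments getImageData expects.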

Sep 23 2011
 

Generally there are only two ways to make a “class” in JavaScript. The first is the prototypal way

 
function Foo(){
    this.someProperty = "123";
}

// Shared by every instance through the prototype
Foo.prototype.bar = function(){
    console.log("howdy", this.someProperty);
};

var blah = new Foo();
blah.someProperty = "Hello World!";
blah.bar();
// prints: howdy Hello World!

or something like

function Foo(){
    this.someProperty = 123;
    this.bar = function(){
        console.log("Howdy", this.someProperty);
    };
}

It’s sometimes not trivial to choose one over the other. The prototypal path is a tad faster at instantiating, while the second can be easier to write, read, and maintain. Generally I choose the prototypal style when I know I’ll be instantiating the desired object a lot ( thousands to tens of thousands of times ), while the second is preferred when I’m writing a more complicated object definition.

You’d think it would be a slam dunk to always choose the closure ( 2nd variety ), but it has one major flaw: by itself, it gives you no clean way to inherit from a closure-based class.
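For contrast, here is a minimal sketch of what inheritance looks like on the prototype side, using nothing but ES5’s Object.create. Bar and its extra property are invented for the example.

```javascript
function Foo() {
  this.someProperty = "123";
}
Foo.prototype.bar = function () {
  return "howdy " + this.someProperty;
};

// Bar inherits from Foo through the prototype chain.
function Bar() {
  Foo.call(this); // run the parent constructor on this instance
  this.extra = "456";
}
Bar.prototype = Object.create(Foo.prototype);
Bar.prototype.constructor = Bar;

// Override bar() while still reaching the parent implementation.
Bar.prototype.bar = function () {
  return Foo.prototype.bar.call(this) + " and " + this.extra;
};

var b = new Bar();
console.log(b.bar());          // howdy 123 and 456
console.log(b instanceof Foo); // true
```

When bar is defined inside the constructor instead, there is no Foo.prototype.bar for a subclass to call back into, which is the gap the class libraries try to paper over.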

Fortunately it’s 2011, and at this point someone is guaranteed to have already run into the same problem. From my personal experience, the first group that solved this problem was PrototypeJS with their class system, then I ran into ExtJS and their system. Great and all, but what if I don’t want everything else that comes with these two frameworks?

No problem:
There’s the super diet solution offered by John Resig’s proof of concept, Simple JavaScript Inheritance,

and a much more advanced system called BaseJS by Dean Edwards ( http://code.google.com/p/base2/ ).

If I know a project’s going to be somewhat involved, I would invest in Dean Edwards’ system, but if not, John Resig’s script is good enough.

© 2012 Refactored scope Suffusion theme by Sayontan Sinha