Someone recently asked me how the more time-intensive analyzer I wrote works. Unfortunately I can’t post it, as it’s targeted for a client project… but I can post a simpler mockup of it combined with that person’s function print logic.

Sure, this loses the ability to step back up the chain of tokens, but for most cases, like detecting functions or in my case line numbers to instrument, this construct works for me.

 

If you’re reading this, you’re either one of the 10-15 user-agents stalking my blog or one of the search engines sent you here because you’d like to analyze JavaScript from inside of Python. In my cursory research phase, the best option seems to be pynarcissus.

As of 2012/01/06 there have been no commits or updates to the project, which seemed disheartening until I gave it a try. From my tests, pynarcissus DOES WORK.

Below is proof of concept code I wrote to walk through the entire token tree of an ExtJS Sencha Touch file, which is probably the most extreme JS compatibility test I could come up with, mostly because ExtJS code reliably and utterly confuses Komodo IDE and other IDEs I use on a daily basis.

The codez:

 
from pynarcissus import parse
from collections import defaultdict
 
"""
    Syntax analysis done dirty
 
"""
 
#Linenums JSCoverage said were correct
targetnums = [28,44,51,53,57,59,60,68,69,70,77,78,79,86,87,88,95]
 
#Operands/tokens unique to these lines
injectionOPS = {'IF','CALL','VAR','RETURN'}
#operands/tokens that should be avoided
exclusionOPS = {''}
 
#test file
raw = open("./juggernaut/parser/test/test_original.js").read()
 
tokens = parse(raw,'test_original.js')
 
#Master list of linenums to TOKEN types
 
 
def findlines(node, pr_linenums = None, capture_all = False):
    """
        Walk over the hills and through the valleys and hope
        we hit every single token in the tree along the way.
    """
    linenums = defaultdict(set) if pr_linenums is None else pr_linenums
 
    if node.type == 'IF':
        dbgp = 1
 
    if node.type in injectionOPS or capture_all:
        linenums[node.lineno].add(node.type)
 
    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)
 
        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)
 
 
        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)
 
    return linenums
 
linenums = findlines(tokens, defaultdict(set), False)
injectionTargets = sorted(linenums.keys())
 
source = raw.split('\n')
from cStringIO import StringIO
buffer = StringIO()
for lineno, line in enumerate(source, 1):
    if lineno in injectionTargets:
        # the counter bump goes in front of the line it instruments
        print >> buffer, "$_injectionjs['testfile'][%s] += 1;" % lineno
 
    print >> buffer, line
 
# write the instrumented source out and close the file handle
with open('test_file.js', 'wb') as out:
    out.write(buffer.getvalue())
 
dbgp = 1

Briefly, the above code is an unstructured test that attempts to match what JSCoverage thinks are the correct lines to instrument, and for the most part it works well. The most crucial piece needed to blindly traverse the token node tree is the logic inside the findlines function.

 
    """
         The product of pynarcissus.parse is a Node object.
         Nodes inherit from list, so for some JS structures you can
         iterate over a node to reach its children.  Other times, you might
         not be so lucky, hence the inner loop to detect Node attributes.
 
         For instance, for Node.type == 'IF' the thenPart and elsePart
         properties will be populated alongside 'expression'.
 
         Other times, especially for function/method nodes, there will be
         a body attribute.
 
         All of these attributes, if not None, are instances of pynarcissus's Node class.
 
    """
    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)
 
        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)
 
 
        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)

As you can see, there’s a linenums variable being passed around like the village bicycle. Here is what it looks like:

pprint.pprint(linenums.items())
[(1, set(['SCRIPT'])),
 (28,
  set(['CALL',
       'DOT',
       'IDENTIFIER',
       'LIST',
       'OBJECT_INIT',
       'SEMICOLON',
       'STRING'])),
 (29, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (30, set(['ARRAY_INIT', 'IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (31, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (32, set(['IDENTIFIER', 'NULL', 'PROPERTY_INIT'])),
 (33, set(['PROPERTY_INIT'])),
 (34, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (35, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (36, set(['FALSE', 'IDENTIFIER', 'PROPERTY_INIT'])),
 (37, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (38, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (39, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (40, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (41, set(['PROPERTY_INIT'])),
 (43, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (44, set(['BLOCK', 'IF'])),
 (45, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (46, set(['IF'])),
 (52, set(['PROPERTY_INIT'])),
 (57, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (58, set(['IDENTIFIER', 'VAR'])),
 (60, set(['IDENTIFIER', 'VAR'])),
 (64, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'THIS'])),
 (66,
  set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING', 'THIS'])),
 (67, set(['RETURN'])),
 (68, set(['PROPERTY_INIT'])),
 (74, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (75, set(['IDENTIFIER', 'VAR'])),
 (76, set(['IDENTIFIER', 'VAR'])),
 (77, set(['RETURN'])),
 (78, set(['PROPERTY_INIT'])),
 (83, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (84, set(['IDENTIFIER', 'VAR'])),
 (85, set(['IDENTIFIER', 'VAR'])),
 (86, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (87, set(['PROPERTY_INIT'])),
 (92, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (93, set(['IDENTIFIER', 'VAR'])),
 (94, set(['IDENTIFIER', 'VAR'])),
 (95, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (96, set(['PROPERTY_INIT'])),
 (101, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (102, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (103, set(['PROPERTY_INIT']))]
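A minimal sketch of how a dump like this can be boiled down to candidate lines and checked against a reference list. The dictionary is a hand-copied excerpt of the output above (lines 43-52 only), the expected list is the part of targetnums that falls in that same range, and the helper names are made up for illustration; none of this is part of pynarcissus.

```python
INJECTION_OPS = {'IF', 'CALL', 'VAR', 'RETURN'}

# Hand-copied excerpt of the dump above (lines 43-52 only).
linenums = {
    43: {'FUNCTION', 'IDENTIFIER', 'SCRIPT'},
    44: {'BLOCK', 'IF'},
    45: {'CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'},
    46: {'IF'},
    52: {'PROPERTY_INIT'},
}


def injection_targets(linenums, ops=INJECTION_OPS):
    """Return the sorted line numbers holding at least one interesting token type."""
    return sorted(lineno for lineno, types in linenums.items() if types & ops)


def compare(found, expected):
    """Return (missed, extra): lines only in expected, lines only in found."""
    found, expected = set(found), set(expected)
    return sorted(expected - found), sorted(found - expected)


targets = injection_targets(linenums)
# 44 and 51 are the targetnums entries that fall between lines 43 and 52.
missed, extra = compare(targets, [44, 51])
print(targets, missed, extra)
```

Keeping the comparison as two plain lists makes it easy to see at a glance which lines the walker never reached and which ones it flagged that the reference did not.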

This is enough information for me to be able to safely determine injection points for an instrumentation line. I am still working through some problems with this whole mess. Specifically, line 46 looks like:

       if(1 == 1){
$_injectionjs['testfile'][45] += 1;
            console.log('<REDACTED>');
$_injectionjs['testfile'][46] += 1;
        }else if(2 == 3){
            console.log("boop");
        }else{
            console.log("bleep");
        }

after being instrumented. As you can see, I am skipping over the else if chain… which leads me to believe I am missing YET ANOTHER dynamically assigned Node property.
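One possible reading of the code, offered as a guess and not verified against pynarcissus itself: findlines only looks at thenPart/elsePart on the items it meets while iterating, never on the node it was handed. When it recurses into an elsePart that is itself an IF with no list children, the loop body never runs and the nested branches are skipped. The sketch below uses a stand-in Node class (hypothetical, only mimicking "a list with extra attributes") and a walker that inspects the attributes of every node it visits.

```python
from collections import defaultdict


class Node(list):
    """Stand-in for a parser node: a list of children plus named attributes."""

    def __init__(self, type, lineno, children=(), **parts):
        super(Node, self).__init__(children)
        self.type = type
        self.lineno = lineno
        for name, value in parts.items():
            setattr(self, name, value)


# Attribute names taken from the findlines code above.
CHILD_ATTRS = ('thenPart', 'elsePart', 'expression', 'body')


def walk(node):
    """Yield node, then everything reachable via its attributes and list children."""
    yield node
    for attr in CHILD_ATTRS:
        child = getattr(node, attr, None)
        # isinstance instead of truthiness: an empty list subclass may be falsy
        if isinstance(child, Node):
            for found in walk(child):
                yield found
    for sub in node:
        for found in walk(sub):
            yield found


def collect_lines(root, ops):
    """Map line number -> set of token types, restricted to ops."""
    linenums = defaultdict(set)
    for node in walk(root):
        if node.type in ops:
            linenums[node.lineno].add(node.type)
    return linenums


# Hand-built tree shaped like the if / else if / else snippet above.
tree = Node('SCRIPT', 1, [
    Node('IF', 44,
         thenPart=Node('BLOCK', 44, [Node('CALL', 45)]),
         elsePart=Node('IF', 46,
                       thenPart=Node('BLOCK', 46, [Node('CALL', 47)]),
                       elsePart=Node('BLOCK', 48, [Node('CALL', 49)]))),
])
found = collect_lines(tree, {'IF', 'CALL', 'VAR', 'RETURN'})
```

Splitting the traversal (walk) from the bookkeeping (collect_lines) also means the accumulator no longer has to be threaded through every recursive call.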

 

This is my one and only political, totally off subject post, and it’s damn important.

In my youth I was a heavily proactive individual: when the bridges got washed out in my neighborhood I made up maps to help the wayward tourists find their way, I’ve donated my time to Habitat for Humanity, built community parks, worked as a special needs teacher’s assistant, volunteered for Al Gore’s 2000 presidential bid, and then I did a stint in the US military. To some extent I am disillusioned with my government, specifically I think Capitol Hill should be renamed Imperial Hill, but at the same time the US democratic foundation isn’t completely dead yet.

After the attacks of 9/11, opportunistic and deeply self-serving people set the US on a dead end course to destruction. One of the things done while the country was numb and fearful was to deploy the Patriot Act, a bill purported to make America safer and prevent future terrorist attacks. How the Patriot Act has actually been used isn’t really close to that ( http://www.aclu.org/national-security/aclu-releases-comprehensive-report-patriot-act-abuses ); instead it has eroded constitutional rights and made it easier for the greedy to ransack the country under the guise of national security.

For anyone naive enough to believe the latest anti-piracy internet bill won’t be abused, well, I’d like to talk to you about a fantastic real estate prospect for a bridge connecting Manhattan island. Please contact your representative ( if you are a US citizen ). Those who sacrifice their freedom for security deserve neither.

Nov 15 2011
 

Yesterday morning I walked out to my carport to discover someone had bashed in my wife’s back window and one of the quarter panel windows on her Volvo. So began a fun filled journey to hunt down replacement glass ( Safelite had the back window but not the quarter, a specialty shop had the quarter but at a steep price ). The truly fascinating part is that my car, a Volkswagen GTI with a $36,000 price tag, was left unscathed sitting next to her old Volvo.

Eh, going to see if I can finish up Veterans Guide today and spend a little bit of time on the security service. In the interim, I think I’ve found a bug in txmongo, but it’s proving to be a doozy to narrow down WHERE exactly the problem lies and how to make a unit-test to repeat the problem.

 

This weekend I enrolled in Startup Weekend Denver #3 AND also started a personal project for the LinkedIn Veteran’s hackday.

For SWDenver, the project is a penetration testing service. Thanks to my experiences with EC2, I’m relatively confident I can make it marginally profitable by implementing an intelligent booking/scheduling system and the power of Twisted. At this point I’ve had a couple of Venture Capitalist minions peek in on my team and they’ve walked away with a gleam in their eye. Who knows?

For LinkedIn Vet’s day, I’ve been building a Veteran Events and Services national directory using Django and MongoDB. My goal is to run this as a charity/public service to my peers ( USAF Senior Airman Dev Dave 2001-2005 IYAAYAS ). That said, I’ve got the models done, view scaffolding is in place, and I’m working on setting up a UserAuth submission pipeline.

As it stands, looks like I will have the SaaS for SWDenver on its way out the door by tomorrow afternoon and the VetDay directory by Monday morning. Nice to flex my RockStar muscles and kick out a distributed SOA project and a national directory in the space of a weekend.

The mess that is the Veterans hack day project is here @ https://github.com/devdave/VetHackday minus 6-7 pushes.

Nov 10 2011
 

For Hire

Despite sometimes major butchering of English grammar, I think this blog covers that I am at least competent and well versed in software development. That said, for US citizens I’m willing to pay out a $500 bounty for anyone who can point me in the direction of a Python contract or full time remote position that I get placed into.

contact me at fromthewordress at ominian dot net or dot com.

Philosophy on getting it done

I think a few times now I’ve lamented my experiences with PHP. I’ve been accused by a few people of being two-faced in this regard: openly saying I don’t want to work with PHP any more as a tool set, while pointing clients towards PHP when they seek my advice on what platform to use. The truth is really simple, I want to get my client up and going or back on the road. Sure, Ruby on Rails or Django are superior products in many ways, but they are also more expensive in professional resources to maintain. A good example is one of my clients, the IT department for a major corporation with tens of thousands of customers and just as many computers to maintain. My client is well versed in system administration but not in software engineering. So I wrote up a Kohana v3 application that uses mod_php and jQuery. If I get hit by a bus tomorrow they will still be alright and happy.

I don’t think I’m explaining myself too well just yet, so a much better written tale from the industry is in order via the DailyWTF @ ( article ). Yes, PHP is not sexy, it’s almost downright ugly, but it does what the client needs and is mostly reliable in that regard. If tomorrow someone said they needed an asynchronous event handler and they wanted it in PHP5, I’d probably laugh in their face, then proceed to start crying when I realized they were serious. Still, if tomorrow someone said they needed a simple low-traffic inventory management system and they already had people on hand competent in LAMP, then I’d lean towards PHP.

Why I’m like this, going for the simplest, most reliable solution, originates with some of my more desperate past clients.

The super website constructor… of doom

Directory websites are usually fairly simple MVC applications that parse browser GET requests and dump out a nice simple web structure on demand. This client’s system did all of that work upfront by ingesting a hierarchical data structure and generating a root page, regional pages, sub-pages, and detail pages in one go into a static 4.7GB structure. The plus was that the sales/marketing people could make rudimentary hand changes to pages and they would work… but the MAJOR downside was that structural changes were not possible.

The complexity required to build such a builder was unbelievable and as such was not very future safe. In fact it was generating 1-2 gigabytes of error/warning messages a day to syslog. In retrospect I think the project was built the way it was for two reasons: first was my predecessor’s lack of experience/knowledge, and second, more so, was sheer boredom. They made this because it was challenging to implement. Once it was implemented and the stack went into a maintain and feature addition state, they left.

I hate PHP so I shall do something else.

Another client horror story was a scenario where the company Rockstar got bored with the languages currently in use. Having read an extensive amount of this Rockstar’s code I could clearly see they were a very brilliant individual but at the same time they were shooting their client in the foot with their perpetual need to be brilliant. I would often lament in private conversation with my peers that Rockstar while brilliant always leaned towards the path less taken, making code that was harder to maintain or sometimes understand without a long analysis period.

The final straw for this rock star was when they started writing backend services in a language no one else in the company knew. I think the grand total was half a megabyte of code that was super critical to the company, very pretty, but also seriously fragile. As expected, the Rock star grew bored and moved on. A few months later it was a somewhat horrifying moment when one of these almost forgotten gems broke under unexpected circumstances and brought an entire application array of 30 servers to a dead stop. The culprit was a UTF-8 character that broke a high volume data extraction script and caused the producer to block, waiting for the stdout pipe to clear up some space.

Fixing that took all of the King’s men, some of the horses, the company CTO, and me. I won’t go into specifics but it turned out to be a combination of weak code and a bug in the target language.

Summary

I don’t resent these people for getting bored; more so I resent that they got bored and didn’t realize it until they had abused an unspoken trust between professional and employer. Code monkeys write code to make their client money, not to entertain themselves. Whenever I get bored I will go on a serendipity hike with my personal time to see what I can pull off. Ultimately a lot of these pet projects end up going nowhere… but it’s more about the journey than the destination for me. Each one of my pets has ended up teaching me more about things I might not work with professionally than any book or blog post could. Just because you are a professional and work in your vocation doesn’t mean you are going to wake up one day and be a master in that profession. Just like martial arts and any other vocation, professional or hobby, it takes continual investment of time and energy, striving for harder and more unusual problems to solve, to become a master.

Nov 10 2011
 

As mentioned previously, I’m working on hacking/implementing agent support into Twisted.conch. Fortunately, Exarkun pointed me in the direction of twisted.conch.ssh.agent.SSHAgentClient, which implements the wire protocol logic of communicating with the Agent, but there is still a gaping hole to fill in.

Briefly, when a user configures their ssh client to allow for agent forwarding, almost immediately after userauth completion, the client sends a session request for ‘auth-agent-req@openssh.com’. For OpenSSH, the service then kicks off a process of creating a named socket that usually resides in /tmp/, announces the user’s agent presence in the shell environment, and then binds a specialized TCP port forwarding channel from the named socket back to the client on a channel called “auth agent”. When a service-local ssh client then begins its own authentication process, it finds this special socket and sends a “request identities” or “sign request” Agent protocol message down the wire. Ideally the response will be a correctly counter-signed value and the user can progress.

The last point can be found in the session.c file of OpenSSH as:

		nc = channel_new("auth socket",
		    SSH_CHANNEL_AUTH_SOCKET, sock, sock, -1,
		    CHAN_X11_WINDOW_DEFAULT, CHAN_X11_PACKET_DEFAULT,
		    0, "auth socket", 1);

Unfortunately I haven’t hunted down what the global const SSH_CHANNEL_AUTH_SOCKET correlates to in regards to Python. I believe argument 1 “auth socket” is equivalent to the class attribute name in channel.SSHChannel. So the skeleton for an “Auth socket” channel might look like:

 
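from twisted.conch.ssh.forwarding import SSHListenForwardingFactory

# untested skeleton, see the caveat below
class AuthAgentChannel(SSHListenForwardingFactory):
   name = "auth agent"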

Alas I haven’t had time to test. I’m debating hacking up some sort of SSH/twisted.conch testing platform to allow for me to execute arbitrary calls, that would probably make this exercise a tad easier to figure out.
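For reference, the “request identities” message mentioned above is small enough to build by hand. This is only a sketch of the framing as I understand the agent protocol (a uint32 big-endian length followed by a one byte message type and an optional payload); twisted.conch.ssh.agent.SSHAgentClient already handles this for real, and the helper names here are made up for illustration.

```python
import struct

# Message numbers as I understand them from the SSH agent protocol.
SSH2_AGENTC_REQUEST_IDENTITIES = 11
SSH2_AGENT_IDENTITIES_ANSWER = 12


def frame(msg_type, payload=b''):
    """Prefix a message with its uint32 big-endian length (type byte + payload)."""
    body = struct.pack('>B', msg_type) + payload
    return struct.pack('>I', len(body)) + body


def unframe(data):
    """Split one framed message into (msg_type, payload); raise if truncated."""
    if len(data) < 5:
        raise ValueError('need at least 5 bytes')
    (length,) = struct.unpack('>I', data[:4])
    body = data[4:4 + length]
    if length < 1 or len(body) != length:
        raise ValueError('truncated message')
    (msg_type,) = struct.unpack('>B', body[:1])
    return msg_type, body[1:]


# The whole "list your keys" request is five bytes on the wire.
request = frame(SSH2_AGENTC_REQUEST_IDENTITIES)
```

Having the raw bytes at hand is mostly useful when tracing a connection, since it shows what to look for on the forwarded channel.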

 

For a secret squirrel project, I’ve been diving fairly deep into SSH land. While in the process of implementing my own SSH service via Twisted.Conch, I ran into the problem of trying to figure out how to support agent forwarding.

While tracing through an SSH connection, I got the session request name ‘auth-agent-req@openssh.com’ and after grepping over the OpenSSH code, sure enough there’s a check for that exact request type.

Will update when/if I can figure out how to translate to Python/Twisted. In the interim, session.c can be viewed here http://anoncvs.mindrot.org/index.cgi/openssh/session.c?view=markup. In passing, I have to say this is some of the most immaculate C code I have ever seen in my life.

 

Damn you hbase, damn you to hell

All of the other core services I’ve dealt with in Hadoop play by the system rules: if I populate fake DNS values in /etc/hosts, by golly the services are going to believe it. Well, all except for HBase, which didn’t seem to play fair with /etc/resolv.conf or /etc/hosts and did fairly low level reverse DNS lookups against the network DNS, which in this case was provided by Amazon. I so do love those super descriptive ip-101-202-303-404.internal addresses.

Still, once you abandon the long term untenable idea of using /etc/hosts and just get into the habit of memorizing IP/internal DNS addresses, it’s not so bad. Otherwise, a stable arrangement was Debian Squeeze with Cloudera CDH3 Update 2; the stability improvements were painfully obvious as HBase stopped murdering its own HDFS entries and became performant.
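Since those internal names carry the address in their first label, a tiny helper can save some of the memorizing. This is a minimal sketch that assumes the ip-a-b-c-d naming shown above and ignores whatever domain suffix follows; the function names are hypothetical.

```python
import re

# First DNS label of an internal name, e.g. "ip-10-1-2-3"
_EC2_LABEL = re.compile(r'^ip-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})$')


def ip_from_internal_name(hostname):
    """Return the dotted IPv4 address encoded in hostname, or None if it has none."""
    label = hostname.split('.', 1)[0]
    match = _EC2_LABEL.match(label)
    if match is None:
        return None
    octets = [int(part) for part in match.groups()]
    if any(octet > 255 for octet in octets):
        return None
    return '.'.join(str(octet) for octet in octets)


def internal_label(ip):
    """Return the ip-a-b-c-d label for a dotted IPv4 address."""
    return 'ip-' + ip.replace('.', '-')
```

Note that the joke address from earlier comes back as None, because 303 and 404 are not valid octets.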

Last bit, for small clusters it makes sense to use EBS backed volumes for the datanodes, but generally I felt that the ephemeral volumes were slightly faster in seek times and throughput. This became especially important under very high load HDFS scenarios, where an EBS array on a datanode is capped collectively to 1GB/s but ephemeral can go higher.

Still focusing on pro-ephemeral nodes, the reality is that you’ve lost the game if a single datanode has more than 250GB of JBOD volumes, and it’s going to quickly become expensive if you have multiple terabytes of EBS backed data ( .10 USD a gigabyte and .10 USD per million I/O ops ). Instead, the reality is that with 2 or 3 levels of HDFS replication, something downright catastrophic would need to occur to take all of your datanodes completely down. Plus with S3 being right next door to EC2, it’s hard to find an excuse not to make vital backups.
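To put rough numbers on “quickly become expensive”, here is a back-of-the-envelope sketch using the two prices quoted above. It assumes the storage price is per gigabyte per month and that every HDFS replica lives on EBS; the data size and I/O count in the example are invented for illustration, not measurements.

```python
from decimal import Decimal

PRICE_PER_GB = Decimal('0.10')          # USD per GB (assumed per month), figure quoted above
PRICE_PER_MILLION_IO = Decimal('0.10')  # USD per million I/O ops, figure quoted above


def ebs_cost(data_gb, replication, io_requests):
    """Estimate the EBS bill in USD for data_gb of HDFS data stored replication times."""
    stored_gb = data_gb * replication
    storage = PRICE_PER_GB * stored_gb
    io = PRICE_PER_MILLION_IO * Decimal(io_requests) / Decimal(1000000)
    return storage + io


# Invented example: 1 TB of data, 3 replicas, 500 million I/O requests.
estimate = ebs_cost(1024, 3, 500000000)
```

The replication factor is what hurts: every extra HDFS copy multiplies the storage line of the bill.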

 

I’m currently somewhere in the process of building a Hadoop cluster in EC2 for one of my clients, and one of the most important parts for keeping my sanity is the ability to access all of the nodes’ web interfaces ( jobtracker, namenode, tasktrackers, datanodes, etc ). If you aren’t abs(crazy), all of these machines are jailed inside a locked down security group, a micro walled garden.

ssh -D 8080 someMachine.amazon-publicDNS.com

That will set up a SOCKS proxy between your machine and some instance that should be in the same SG as the Hadoop cluster… now unless you are a sadist and like to write dozens of hosts file entries, the SOCKS proxy on its own is useless.

But wait! Proxy Auto-configuration to the rescue! All you really need to get started is here at Wikipedia ( http://en.wikipedia.org/wiki/Proxy_auto-config ) but to be fair a dirt simple PAC file might look like:

hadoop_cluster.pac
function FindProxyForURL(url, host) {
    if (shExpMatch(host, "*.secret.squirrel.com")) {
        return "SOCKS5 127.0.0.1:8080";
    }
    if (shExpMatch(host, "*.internal")) {
        return "SOCKS5 127.0.0.1:8080";
    }

    return "DIRECT";
}

Save this to your hard drive, then work out the correct “file:///path/2/hadoop_cluster.pac” URL for it. From there, go into your browser’s proxy configuration dialog window and paste that URL into the Proxy Auto-configuration box. After that, going to http://ip-1-2-3-4.amazon.internal in a web browser will automatically go through the SSH proxy into Amazon EC2 cloud space, resolve against Amazon DNS servers, and voila, you’re connected.
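If you want to sanity check the rules before pointing a browser at them, the PAC logic is easy to mimic. The sketch below approximates shExpMatch with Python’s fnmatchcase, which is an assumption on my part (browser implementations may differ in edge cases), and it also builds the file:// URL for the saved PAC file. Function names are made up for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import Path

PROXY = 'SOCKS5 127.0.0.1:8080'
# Same host patterns as the PAC file above.
PROXIED_PATTERNS = ('*.secret.squirrel.com', '*.internal')


def find_proxy_for_host(host):
    """Return the proxy string the PAC file above would pick for host."""
    for pattern in PROXIED_PATTERNS:
        if fnmatchcase(host, pattern):
            return PROXY
    return 'DIRECT'


def pac_file_url(path):
    """Return the file:// URL to paste into the browser's proxy dialog."""
    return Path(path).resolve().as_uri()


print(find_proxy_for_host('ip-1-2-3-4.amazon.internal'))
print(pac_file_url('hadoop_cluster.pac'))
```

Keeping the patterns in one tuple makes it obvious that both rules send traffic to the same tunnel, which the PAC file could also express with a single combined condition.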

NOTE: Windows users

It shouldn’t be a surprise that Microsoft has partially fucked up the beauty that is the PAC. Fortunately, they provide directions for resolving the issue here ( http://support.microsoft.com/kb/271361 ).

tl;dwrite – Microsoft’s network stack caches the results of the PAC script instead of checking it for every request. If your proxy goes down or you edit the PAC file, those changes can take some time to actually come into play. Fortunately Firefox has a nifty “reload” button on its dialog box, but Microsoft Internet Explorer and the default Chrome for Windows trust Microsoft’s netstack.

© 2012 Refactored scope