
Working with & using pynarcissus to parse Javascript in Python

If you’re reading this, you’re either one of the 10-15 user-agents stalking my blog or one of the search engines sent you here because you’d like to analyze Javascript from inside of Python. In my cursory research phase, the best option seems to be pynarcissus.

As of 2012/01/06 there have been no commits or updates to the project, which seemed disheartening until I gave it a try. From my tests, pynarcissus DOES WORK.

Below is proof of concept code I wrote to walk through the entire token tree of an ExtJS Sencha Touch file, which is probably the most extreme JS compatibility test I could come up with, mostly because ExtJS code reliably and utterly confuses Komodo IDE and the other IDEs I use on a daily basis.

The codez:


from pynarcissus import parse
from collections import defaultdict

"""
    Syntax analysis done dirty

"""

#Linenums JSCoverage said were correct
targetnums = [28,44,51,53,57,59,60,68,69,70,77,78,79,86,87,88,95]

#Operands/tokens unique to these lines
injectionOPS = {'IF','CALL','VAR','RETURN'}
#operands/tokens that should be avoided
exclusionOPS = {''}

#test file
raw = open("./juggernaut/parser/test/test_original.js").read()

tokens = parse(raw,'test_original.js')

#Master list of linenums to TOKEN types


def findlines(node, pr_linenums = None, capture_all = False):
    """
        Walk over the hills and through the valleys and hope
        we hit every single token in the tree along the way.
    """
    linenums = defaultdict(set) if pr_linenums is None else pr_linenums

    if node.type == 'IF':
        dbgp = 1

    if node.type in injectionOPS or capture_all:
        linenums[node.lineno].add(node.type)

    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)

        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)


        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)

    return linenums

linenums = findlines(tokens, defaultdict(set), False)
injectionTargets = sorted(linenums.keys())

# Write an instrumented copy: a counter line goes in front of every target line.
# cStringIO and the "print >>" syntax are Python 2 only.
from cStringIO import StringIO
output = StringIO()
for lineno, line in enumerate(raw.split('\n'), 1):
    if lineno in injectionTargets:
        print >> output, "$_injectionjs['testfile'][%s] += 1;" % lineno

    print >> output, line

with open('test_file.js', 'wb') as handle:
    handle.write(output.getvalue())

Briefly, the above code is an unstructured test that attempts to match the lines JSCoverage thinks are the correct ones to instrument, and for the most part it works well. The most crucial piece needed to blindly traverse the token node tree is the logic inside the findlines function.


    """
        The product of pynarcissus.parse is a Node object.
        Nodes inherit from list, so for some JS structures you can
        iterate over a node's children. Other times, you might not be
        so lucky, hence the inner loop to detect Node elements.

        For instance, for Node.type == 'IF' the thenPart and elsePart
        properties will be populated alongside 'expression'.

        Other times, especially for function/method nodes, there will be
        a body attribute.

        All of these attributes, if not None, are instances of
        pynarcissus's Node class.
    """
    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)

        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)


        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)

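As you can see, there’s a linenums variable being passed around like the village bicycle. Here is what it looks like ( this dump lists every node type, not just the injection ones, so it comes from a run with capture_all set to True ).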

pprint.pprint(linenums.items())
[(1, set(['SCRIPT'])),
 (28,
  set(['CALL',
       'DOT',
       'IDENTIFIER',
       'LIST',
       'OBJECT_INIT',
       'SEMICOLON',
       'STRING'])),
 (29, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (30, set(['ARRAY_INIT', 'IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (31, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (32, set(['IDENTIFIER', 'NULL', 'PROPERTY_INIT'])),
 (33, set(['PROPERTY_INIT'])),
 (34, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (35, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (36, set(['FALSE', 'IDENTIFIER', 'PROPERTY_INIT'])),
 (37, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (38, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (39, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (40, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (41, set(['PROPERTY_INIT'])),
 (43, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (44, set(['BLOCK', 'IF'])),
 (45, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (46, set(['IF'])),
 (52, set(['PROPERTY_INIT'])),
 (57, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (58, set(['IDENTIFIER', 'VAR'])),
 (60, set(['IDENTIFIER', 'VAR'])),
 (64, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'THIS'])),
 (66,
  set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING', 'THIS'])),
 (67, set(['RETURN'])),
 (68, set(['PROPERTY_INIT'])),
 (74, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (75, set(['IDENTIFIER', 'VAR'])),
 (76, set(['IDENTIFIER', 'VAR'])),
 (77, set(['RETURN'])),
 (78, set(['PROPERTY_INIT'])),
 (83, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (84, set(['IDENTIFIER', 'VAR'])),
 (85, set(['IDENTIFIER', 'VAR'])),
 (86, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (87, set(['PROPERTY_INIT'])),
 (92, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (93, set(['IDENTIFIER', 'VAR'])),
 (94, set(['IDENTIFIER', 'VAR'])),
 (95, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (96, set(['PROPERTY_INIT'])),
 (101, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (102, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (103, set(['PROPERTY_INIT']))]

This is enough information for me to be able to safely determine injection points for an instrumentation line. I am still working through some problems with this whole mess, specifically line 46 looks like:

       if(1 == 1){
$_injectionjs['testfile'][45] += 1;
            console.log('');
$_injectionjs['testfile'][46] += 1;
        }else if(2 == 3){
            console.log("boop");
        }else{
            console.log("bleep");
        }

after being instrumented. As you can see, I am skipping over the else if chain… which leads me to believe I am missing YET ANOTHER dynamically assigned Node property.
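
One possible explanation, offered as a guess rather than a diagnosis: the loop in findlines looks up thenPart, elsePart, expression, and body on each list child ( sub ), but never on the node it was called with. A node reached through an attribute, like the IF hanging off an elsePart, gets its own line recorded but none of its attributes are followed. The sketch below walks every attribute of the current node instead of a fixed list of names. The Node class here is a hypothetical stand-in ( a list subclass with type and lineno ), not pynarcissus's real class, and the attribute name condition is an assumption made for the example.

```python
from collections import defaultdict


class Node(list):
    """Hypothetical stand-in for a parser node: a list with extra attributes."""

    def __init__(self, type_, lineno, children=(), **attrs):
        super(Node, self).__init__(children)
        self.type = type_
        self.lineno = lineno
        for name, value in attrs.items():
            setattr(self, name, value)


def linked_nodes(node):
    """Yield Nodes stored on attributes, directly or inside a plain list/tuple."""
    for value in vars(node).values():
        if isinstance(value, Node):
            yield value
        elif isinstance(value, (list, tuple)):
            for item in value:
                if isinstance(item, Node):
                    yield item


def walk(node, seen=None):
    """Yield node, then everything reachable from its children and attributes."""
    seen = set() if seen is None else seen
    if id(node) in seen:  # skip nodes already visited through another path
        return
    seen.add(id(node))
    yield node
    children = [child for child in node if isinstance(child, Node)]
    for child in children + list(linked_nodes(node)):
        for item in walk(child, seen):
            yield item


def findlines(root, wanted):
    """Map line numbers to the wanted node types found on them."""
    linenums = defaultdict(set)
    for node in walk(root):
        if node.type in wanted:
            linenums[node.lineno].add(node.type)
    return linenums


# Hand-built tree shaped like the if / else if / else sample above.
inner_if = Node('IF', 46,
                condition=Node('EQ', 46),
                thenPart=Node('BLOCK', 46, [Node('CALL', 47)]),
                elsePart=Node('BLOCK', 48, [Node('CALL', 49)]))
outer_if = Node('IF', 44,
                condition=Node('EQ', 44),
                thenPart=Node('BLOCK', 44, [Node('CALL', 45)]),
                elsePart=inner_if)
script = Node('SCRIPT', 1, [outer_if])

found = findlines(script, {'IF', 'CALL', 'VAR', 'RETURN'})
print(sorted(found))  # [44, 45, 46, 47, 49]
```

Walking every attribute trades precision for coverage: no property name has to be known up front, and the seen set keeps the walk from looping if two attributes point at the same node. Whether this behaves the same against real pynarcissus output is untested here.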

The US Federal government cannot be trusted.

This is my one and only political, totally off subject post, and it’s damn important.

In my youth I was a heavily proactive individual. When the bridges got washed out in my neighborhood I made up maps to help the wayward tourists find their way. I’ve donated my time to Habitat for Humanity, built community parks, worked as a special needs teacher’s assistant, volunteered for Al Gore’s 2000 presidential bid, and then I did a stint in the US military. To some extent I am disillusioned with my government, specifically I think Capitol Hill should be renamed Imperial Hill, but at the same time the US democratic foundation isn’t completely dead yet.

After the attacks of 9/11, opportunistic and deeply self-serving people set the US on a dead end course to destruction. One of the things done while the country was numb and fearful was to deploy the Patriot Act, a bill purported to make America safer and prevent future terrorist attacks. How the Patriot Act has actually been used isn’t close to that ( http://www.aclu.org/national-security/aclu-releases-comprehensive-report-patriot-act-abuses ); instead it has eroded constitutional rights and made it easier for the greedy to ransack the country under the guise of national security.

For anyone naive enough to believe the latest anti-piracy internet bill won’t be abused, well, I’d like to talk to you about a fantastic real estate prospect for a bridge connecting Manhattan Island. Please contact your representative ( if you are a US citizen ). Those who sacrifice their freedom for security deserve neither.

Lose a car window, lose a day

Yesterday morning I walked out to my carport to discover someone had bashed in my wife’s back window and one of the quarter panel windows on her Volvo. So began a fun-filled journey to hunt down replacement glass ( Safelite had the back window but not the quarter, a specialty shop had the quarter but at a steep price ). The truly fascinating part is that my car, a Volkswagen GTI with a $36,000 price tag, was left unscathed sitting next to her old Volvo.

Eh, going to see if I can finish up Veterans Guide today and spend a little bit on the security service. In the interim, I think I’ve found a bug in txmongo, but it’s proving to be a doozy to narrow down WHERE exactly the problem lies and how to make a unit-test to repeat the problem.

All you need to know about SSH, you can learn from session.c

For a secret squirrel project, I’ve been diving fairly deep into SSH land. While in the process of implementing my own SSH service via Twisted.Conch, I ran into the problem of trying to figure out how to support agent forwarding.

While tracing through an SSH connection, I got the session request name ‘auth-agent-req@openssh.com’, and after grepping over the OpenSSH code, sure enough there’s a check for that exact request type.

Will update when/if I can figure out how to translate to Python/Twisted. In the interim, session.c can be viewed here http://anoncvs.mindrot.org/index.cgi/openssh/session.c?view=markup. In passing, I have to say this is some of the most immaculate C code I have ever seen in my life.

Quick comments on scaling an application up

Unfortunately I cannot find the original usenet post, so here’s the paraphrased summary:

Two programmers are discussing what to do with a slow program and the junior of the two laments “If only there was a way to make the computer run faster.” to which the senior replies “You cannot make the computer run faster, but you can make it do less.” The gist of which I can explain from my own experience.

Caching

With some exceptions, it generally doesn’t matter what language you choose to implement a program or application in… as long as it is fast enough. Instead you need to look at what your application is spending most of its time doing, and I don’t mean just a cursory look but really dig in there. In almost every case, the primary obstacle to scaling out is going to be whatever you are using for a data-backend.

If you’re fetching a User credential or profile record from the database on every request, you’ve suddenly locked the speed of your entire application to the max number of connections ( not queries ) your database can do. For MySQL that’s about 150-180/second ( or 220-250/second if you have a full time DBA ). If you get more than 250 user requests a second to your webstack, then your application is locked up solid. So it should be obvious that the solution is to cache everything and anything that’s needed from the databases that won’t be changing too often.

My preferred solution for the above is to use memcache with as much RAM as you can throw at it, at minimum 2GB, but I’ve worked on 128GB categorized arrays before. Memcache can be summarized as an unreliable key/value data store: you might put a key/value pair in, and it might be there for the next minute or so.

By implementing caching in your application, you’re making it do less. So instead of a 1 to 1 relationship between user requests and database connections, it might go up to 10 to 1.
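
To make the 10 to 1 idea concrete, here is a minimal read-through cache sketch. Everything in it is made up for illustration: ExpiringCache is an in-process stand-in for memcache ( not a memcache client ), CountingDatabase is a fake backend that only counts how often it gets hit, and get_profile is a hypothetical lookup function.

```python
import time


class ExpiringCache(object):
    """In-process stand-in for memcache: entries vanish after ttl_seconds."""

    def __init__(self, ttl_seconds=60.0, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[key]  # expired, behave as if it was never stored
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)


class CountingDatabase(object):
    """Hypothetical backend that counts how often it is actually queried."""

    def __init__(self):
        self.queries = 0

    def load_profile(self, user_id):
        self.queries += 1
        return {'id': user_id, 'name': 'user-%d' % user_id}


def get_profile(user_id, cache, database):
    """Try the cache first and only fall back to the database on a miss."""
    key = 'profile:%d' % user_id
    profile = cache.get(key)
    if profile is None:
        profile = database.load_profile(user_id)
        cache.set(key, profile)
    return profile


cache = ExpiringCache(ttl_seconds=60.0)
database = CountingDatabase()
for _ in range(10):
    get_profile(42, cache, database)
print(database.queries)  # 1: ten requests, one database hit
```

The caller never assumes the key is still there, which matches the "unreliable" nature of the store: a miss just costs one trip to the database and refills the cache.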

Division of concerns

This usually catches almost all junior and mid-level developers off guard. If your application serves static content from a Python or Ruby script, you’re burning capacity. A better plan is to split your application into two subprojects: Application and Application Content. From the outside looking in, that means http://derpCorp.com/application/url and http://static.derpCorp/staticContent/. Generally nginx or lighttpd can trounce almost anything else for serving content. This isn’t applicable to everyone, but the cost of infrastructure will lean heavily towards new application servers and not your content servers… so by dividing the two now, while you can, you set yourself up for investing wisely vs. just throwing money at the problem.

Divide and conquer

The minute one piece of an application becomes a critical component, the door to unending misery begins to open. That one critical piece is going to reliably fail at every investor presentation, at 4am on Saturday, and about ten minutes after you hit evening rush hour traffic. Usually the critical component is the database, and almost always the first solution is to throw more memory and disks at it, hoping the beast will be sated forever and ever. This should be a sign that something needs to change, but sometimes it isn’t heard. Instead of scaling up, the proven winning solution is to scale out. If you have two or more schemas on the same server, it might be time to separate them. Does User A need to cohabitate with User B’s data?
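
One way to read the User A / User B question is splitting users across several database servers. The sketch below is only one possible approach, not a recommendation for any particular setup: the host names are hypothetical, and shard_for is a made-up helper that picks a server from a stable checksum of the user id.

```python
import zlib

# Hypothetical server names, used only for this example.
SHARDS = ['db-a.internal', 'db-b.internal', 'db-c.internal']


def shard_for(user_id, shards=SHARDS):
    """Pick a server from a stable checksum of the user id."""
    checksum = zlib.crc32(str(user_id).encode('utf-8')) & 0xffffffff
    return shards[checksum % len(shards)]


# The same user always lands on the same server.
print(shard_for(1001) == shard_for(1001))  # True
```

The catch with a plain modulo is that adding or removing a server reshuffles most users, so the number of shards is worth deciding before the data gets big.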

Don’t ignore your problems

Usually there is a small clan of people clustered around an application; it provides money and stability for them. Sometimes this clan sacrifices their youth, sanity, and credit ratings for the application like it’s some sort of messed up deity. Unfortunately your application is stupider than the bacteria growing in your kitchen sink, and though throwing money and time at a half-assed solution may seem to correlate with resolution, correlation does not equal causation… especially with software. If half of the application randomly goes belly up every week at the same time… don’t ignore that problem or, worse, try to bury it. Pick someone on your team and send them off on a mission to find the problem and fix it. Otherwise what was once a problem may end up being your clan’s apocalypse.

Object oriented Javascript

Generally there are only two ways to make a “class” in Javascript. The first is the prototypical way:


function Foo(){
      this.someProperty = "123";
}

// Methods on the prototype are shared by every instance.
Foo.prototype.bar = function(){
     console.log("Howdy", this.someProperty);
};

var blah = new Foo();
blah.someProperty = "Hello World!";
blah.bar();
// logs: Howdy Hello World!

or something like

   function Foo(){
       this.someProperty = 123;
       this.bar = function(){
              console.log("Howdy", this.someProperty);
       }
   }

It’s sometimes not trivial to choose one over the other. The prototypical path is a tad faster at instantiation, while the second can be easier to write, read, and maintain. Generally I choose the prototypical when I know I’ll be instantiating the desired object a lot ( thousands to tens of thousands of times ), while the second is preferred when I’m writing a more complicated object definition.

You’d think it would be a slam dunk to always choose the closure ( 2nd variety ), but it has one major flaw: by itself, you cannot inherit from a closure based class.

Fortunately it’s 2011 and at this point someone is guaranteed to have already run into the same problem. From my personal experience, the first group that solved this problem was PrototypeJS with their class system, then I ran into ExtJS and their system. Great and all, but what if I don’t want everything else that comes with these two frameworks?

No problem:
There’s the super diet solution offered by John Resig’s proof of concept, Simple JavaScript Inheritance,

and a much more advanced system called BaseJS by Dean Edwards ( http://code.google.com/p/base2/ ).

If I know a project’s going to be somewhat involved I would go with investing in Base2JS, but if not, John Resig’s script is good enough.

A better typeof for Javascript

If you know Javascript, then this isn’t a surprise:

typeof {a: 4}; //"object"
typeof [1, 2, 3]; //"object"
(function() {console.log(typeof arguments)})(); //object
typeof new ReferenceError; //"object"
typeof new Date; //"object"
typeof /a-z/; //"object"
typeof Math; //"object"
typeof JSON; //"object"
typeof new Number(4); //"object"
typeof new String("abc"); //"object"
typeof new Boolean(true); //"object"

But thanks to some diligent work by one Angus Croll, you can do something like:

toType({a: 4}); //"object"
toType([1, 2, 3]); //"array"
(function() {console.log(toType(arguments))})(); //arguments
toType(new ReferenceError); //"error"
toType(new Date); //"date"
toType(/a-z/); //"regexp"
toType(Math); //"math"
toType(JSON); //"json"
toType(new Number(4)); //"number"
toType(new String("abc")); //"string"
toType(new Boolean(true)); //"boolean"

Check out the full explanation @
http://javascriptweblog.wordpress.com/2011/08/08/fixing-the-javascript-typeof-operator/