Working with & using pynarcissus to parse Javascript in Python

If you’re reading this, you’re either one of the 10-15 user-agents stalking my blog or one of the search engines sent you here because you’d like to analyze Javascript from inside of Python. In my cursory research phase, the best option seems to be pynarcissus.

As of 2012/01/6 there has been no commits or updates to the project which seemed disheartening until I gave it a try. From my tests, pynarcissus DOES WORK.

Below is proof of concept code I wrote to walk through the entire token tree of a ExtJS Sencha touch file which is probably the most extreme JS compatible test I could come up with, mostly because ExtJS code reliably and utterly confuses Komodo IDE and other IDE’s I use on a daily basis.

The codez:


from pynarcissus import parse
from collections import defaultdict

"""
    Syntax analysis done dirty

"""

#Linenums JSCoverage said were correct
targetnums = [28,44,51,53,57,59,60,68,69,70,77,78,79,86,87,88,95]

#Operands/tokens unique to these lines
injectionOPS = {'IF','CALL','VAR','RETURN'}
#operands/tokens that should be avoided
exclusionOPS = {''}

#test file
raw = open("./juggernaut/parser/test/test_original.js").read()

tokens = parse(raw,'test_original.js')

#Master list of linenums to TOKEN types


def findlines(node, pr_linenums = None, capture_all = False):
    """
        Walk over the hills and through the valleys and hope
        we hit every single token in the tree along the way.
    """
    linenums = defaultdict(set) if pr_linenums is None else pr_linenums

    if node.type == 'IF':
        dbgp = 1

    if node.type in injectionOPS or capture_all:
        linenums[node.lineno].add(node.type)

    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)

        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)


        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)

    return linenums

linenums = findlines(tokens, defaultdict(set), False)
injectionTargets = sorted(linenums.keys())

source = raw.split('\n')
from cStringIO import StringIO
buffer = StringIO()
for lineno, line in enumerate(source, 1):
    if lineno in injectionTargets:
        print >> buffer, "$_injectionjs['testfile'][%s] += 1;" % lineno

    print >> buffer, line

buffer.reset()
open('test_file.js','wb').write(buffer.read())

dbgp = 1

Briefly, the above code is a unstructured test to attempt to match what JSCoverage thinks are the correct lines to instrument and for the most part it works well. The most crucial needed to be able to blindly traverse the token node tree is the logic inside the findlines function.


    """
         the product of pynarcissus.parse is a Node object.
         Node's inherit lists, so for some JS structures you will
         iterate over it's children.  Other times, you might not be
         so lucky, hence the inner loop to detect Node elements.
         
         For instance, for Node.type == 'IF' the thenPart and elsePart
         properties will be populated alongside with 'expression'.
         
         Othertimes, specially for function/method nodes, there will be a body
         attribute.
         
        All of these attributes, if not None, are instances of pynarcissus's Node class.
        
    """
    for sub in node:
        if len(sub) > 0:
            linenums = findlines(sub, linenums, capture_all)

        for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
            child = getattr(sub, attr, None)
            if child:
                linenums = findlines(child, linenums, capture_all)


        if sub.type in injectionOPS or capture_all:
            linenums[sub.lineno].add(sub.type)

As you can see, there’s a linenums variable being passed around like the village bicycle, here is what it looks like.

pprint.pprint(linenums.items())
[(1, set(['SCRIPT'])),
 (28,
  set(['CALL',
       'DOT',
       'IDENTIFIER',
       'LIST',
       'OBJECT_INIT',
       'SEMICOLON',
       'STRING'])),
 (29, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (30, set(['ARRAY_INIT', 'IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (31, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (32, set(['IDENTIFIER', 'NULL', 'PROPERTY_INIT'])),
 (33, set(['PROPERTY_INIT'])),
 (34, set(['IDENTIFIER', 'OBJECT_INIT'])),
 (35, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (36, set(['FALSE', 'IDENTIFIER', 'PROPERTY_INIT'])),
 (37, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (38, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (39, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
 (40, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
 (41, set(['PROPERTY_INIT'])),
 (43, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (44, set(['BLOCK', 'IF'])),
 (45, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (46, set(['IF'])),
 (52, set(['PROPERTY_INIT'])),
 (57, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (58, set(['IDENTIFIER', 'VAR'])),
 (60, set(['IDENTIFIER', 'VAR'])),
 (64, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'THIS'])),
 (66,
  set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING', 'THIS'])),
 (67, set(['RETURN'])),
 (68, set(['PROPERTY_INIT'])),
 (74, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (75, set(['IDENTIFIER', 'VAR'])),
 (76, set(['IDENTIFIER', 'VAR'])),
 (77, set(['RETURN'])),
 (78, set(['PROPERTY_INIT'])),
 (83, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (84, set(['IDENTIFIER', 'VAR'])),
 (85, set(['IDENTIFIER', 'VAR'])),
 (86, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (87, set(['PROPERTY_INIT'])),
 (92, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (93, set(['IDENTIFIER', 'VAR'])),
 (94, set(['IDENTIFIER', 'VAR'])),
 (95, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
 (96, set(['PROPERTY_INIT'])),
 (101, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
 (102, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
 (103, set(['PROPERTY_INIT']))]

This is enough information for me to be able to safely determine injection points for an instrumentation line. I am still working through some problems with this whole mess, specifically line 46 looks like:

       if(1 == 1){
$_injectionjs['testfile'][45] += 1;
            console.log('');
$_injectionjs['testfile'][46] += 1;
        }else if(2 == 3){
            console.log("boop");
        }else{
            console.log("bleep");
        }

after being instrumented. As you can see, I am skipping over the else if chain… which leads me to believe I am missing YET ANOTHER dynamically assigned Node property.

5 thoughts on “Working with & using pynarcissus to parse Javascript in Python”

Patrick January 10, 2012 at 4:38 pm
Hi David,
Thanks a lot for this post. I myself was in need of a JS parser written in Python for a code review tool I’m working on and pynarcissus seems to be doing the job perfectly.
I have to admit I gave antlr a try before but couldn’t get it working for some reason.
pynarcissus worked perfectly the first time but now I’m facing my own lack of understanding of how parsers work and how one is supposed to visit the resulting tree.

My use case is that I want to get all functions from a given file (whether they’re anonymous or not, closures, object properties, etc…). I have created the following code which works, but I was wondering if you could point me into the direction of how one is supposed to write these kinds of visitors “correctly”:
```
def visitFunctions(root):
	for node in root:
		# Most nodes are lists
		if len(node) > 0:
			visitFunctions(node)
		# Some have additional attributes to be visited as well
		for attr in ['thenPart', 'elsePart', 'expression', 'body', 'initializer']:
			child = getattr(node, attr, None)
			if child:
				visitFunctions(child)
		
		# Named functions
		if node.type == "FUNCTION" and getattr(node, "name", None):
			print str(node.lineno) + " | function " + node.name

		# Anonymous functions declared with var name = function() {};
		try:
			if node.type == "IDENTIFIER" and node.initializer.type == "FUNCTION":
				print str(node.lineno) + " | function " + node.name
		except:
			pass
			
		# Anonymous functions declared as a property of an object
		try:
			if node.type == "PROPERTY_INIT" and node[1].type == "FUNCTION":
				print str(node.lineno) + " | function " + node[0].value
		except:
			pass
```
David Post authorJanuary 11, 2012 at 9:47 pm

Try something like this – https://gist.github.com/1598830 My system uses something like Visitor and then has a bunch of sub-classes for different cases… but for just scanning for functions this might be a slightly cleaner approach and it would be trivial to add a dict property to record the functions by name or by line as the key. Also the if “node.type ==” can obviously be stripped off, I was just in a hurry and copy & pasted your code into a stripped down version of my class.
Patrick January 12, 2012 at 12:07 am

Thanks David for that. I’ll have a look at it.
Patrick January 14, 2012 at 7:20 am

Your solution worked great, thanks.
I fixed a tiny mistake though: https://gist.github.com/1611632
In the case of PROPERTY_INIT type of node, the line number returned was the last line of the value. I replaced print str(node.lineno) by print str(node[0].lineno)
Cheers
David Post authorJanuary 14, 2012 at 12:54 pm

Patrick,
Unfortunately I did this inside of juggernaut ( client project ) but a useful debugging cheat sheet I found was to make a simple Visitor child class that only implements visit_ANY and records something like defaultdict(list) sources like self.node2lines[node.lineno].append( (node.type, node.source) ). At the end you’d dump that to a html file like so:

for lineno, nodes for sources.items():
print <div><span>Line: %(lineno)s</span><br>
for (node_type, node_source) in nodes:
print <span title=”%(node_type)s”>%(node_source)s</span>\n
print <div>

Not sure if the tag soup will make much sense… but the idea is that this decorator visitor ingests a raw javascript file and spits it back out with every token wrapped in span tags. The oddest discovery I made is the < node type=jsparser.SEMICOLON > which usually encapsulates entire blocks of code ( entire functions, objects, conditional statements, vars, etc ).