If you’re reading this, you’re either one of the 10-15 user-agents stalking my blog or one of the search engines sent you here because you’d like to analyze Javascript from inside of Python. In my cursory research phase, the best option seems to be pynarcissus.
As of 2012/01/6 there has been no commits or updates to the project which seemed disheartening until I gave it a try. From my tests, pynarcissus DOES WORK.
Below is proof of concept code I wrote to walk through the entire token tree of a ExtJS Sencha touch file which is probably the most extreme JS compatible test I could come up with, mostly because ExtJS code reliably and utterly confuses Komodo IDE and other IDE’s I use on a daily basis.
The codez:
from pynarcissus import parse
from collections import defaultdict
"""
Syntax analysis done dirty
"""
#Linenums JSCoverage said were correct
targetnums = [28,44,51,53,57,59,60,68,69,70,77,78,79,86,87,88,95]
#Operands/tokens unique to these lines
injectionOPS = {'IF','CALL','VAR','RETURN'}
#operands/tokens that should be avoided
exclusionOPS = {''}
#test file
raw = open("./juggernaut/parser/test/test_original.js").read()
tokens = parse(raw,'test_original.js')
#Master list of linenums to TOKEN types
def findlines(node, pr_linenums = None, capture_all = False):
"""
Walk over the hills and through the valleys and hope
we hit every single token in the tree along the way.
"""
linenums = defaultdict(set) if pr_linenums is None else pr_linenums
if node.type == 'IF':
dbgp = 1
if node.type in injectionOPS or capture_all:
linenums[node.lineno].add(node.type)
for sub in node:
if len(sub) > 0:
linenums = findlines(sub, linenums, capture_all)
for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
child = getattr(sub, attr, None)
if child:
linenums = findlines(child, linenums, capture_all)
if sub.type in injectionOPS or capture_all:
linenums[sub.lineno].add(sub.type)
return linenums
linenums = findlines(tokens, defaultdict(set), False)
injectionTargets = sorted(linenums.keys())
source = raw.split('\n')
from cStringIO import StringIO
buffer = StringIO()
for lineno, line in enumerate(source, 1):
if lineno in injectionTargets:
print >> buffer, "$_injectionjs['testfile'][%s] += 1;" % lineno
print >> buffer, line
buffer.reset()
open('test_file.js','wb').write(buffer.read())
dbgp = 1
Briefly, the above code is a unstructured test to attempt to match what JSCoverage thinks are the correct lines to instrument and for the most part it works well. The most crucial needed to be able to blindly traverse the token node tree is the logic inside the findlines function.
"""
the product of pynarcissus.parse is a Node object.
Node's inherit lists, so for some JS structures you will
iterate over it's children. Other times, you might not be
so lucky, hence the inner loop to detect Node elements.
For instance, for Node.type == 'IF' the thenPart and elsePart
properties will be populated alongside with 'expression'.
Othertimes, specially for function/method nodes, there will be a body
attribute.
All of these attributes, if not None, are instances of pynarcissus's Node class.
"""
for sub in node:
if len(sub) > 0:
linenums = findlines(sub, linenums, capture_all)
for attr in ['thenPart', 'elsePart', 'expression', 'body' ]:
child = getattr(sub, attr, None)
if child:
linenums = findlines(child, linenums, capture_all)
if sub.type in injectionOPS or capture_all:
linenums[sub.lineno].add(sub.type)
As you can see, there’s a linenums variable being passed around like the village bicycle, here is what it looks like.
pprint.pprint(linenums.items())
[(1, set(['SCRIPT'])),
(28,
set(['CALL',
'DOT',
'IDENTIFIER',
'LIST',
'OBJECT_INIT',
'SEMICOLON',
'STRING'])),
(29, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
(30, set(['ARRAY_INIT', 'IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
(31, set(['IDENTIFIER', 'OBJECT_INIT'])),
(32, set(['IDENTIFIER', 'NULL', 'PROPERTY_INIT'])),
(33, set(['PROPERTY_INIT'])),
(34, set(['IDENTIFIER', 'OBJECT_INIT'])),
(35, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
(36, set(['FALSE', 'IDENTIFIER', 'PROPERTY_INIT'])),
(37, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
(38, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
(39, set(['IDENTIFIER', 'PROPERTY_INIT', 'TRUE'])),
(40, set(['IDENTIFIER', 'PROPERTY_INIT', 'STRING'])),
(41, set(['PROPERTY_INIT'])),
(43, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(44, set(['BLOCK', 'IF'])),
(45, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
(46, set(['IF'])),
(52, set(['PROPERTY_INIT'])),
(57, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(58, set(['IDENTIFIER', 'VAR'])),
(60, set(['IDENTIFIER', 'VAR'])),
(64, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'THIS'])),
(66,
set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING', 'THIS'])),
(67, set(['RETURN'])),
(68, set(['PROPERTY_INIT'])),
(74, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(75, set(['IDENTIFIER', 'VAR'])),
(76, set(['IDENTIFIER', 'VAR'])),
(77, set(['RETURN'])),
(78, set(['PROPERTY_INIT'])),
(83, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(84, set(['IDENTIFIER', 'VAR'])),
(85, set(['IDENTIFIER', 'VAR'])),
(86, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
(87, set(['PROPERTY_INIT'])),
(92, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(93, set(['IDENTIFIER', 'VAR'])),
(94, set(['IDENTIFIER', 'VAR'])),
(95, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON'])),
(96, set(['PROPERTY_INIT'])),
(101, set(['FUNCTION', 'IDENTIFIER', 'SCRIPT'])),
(102, set(['CALL', 'DOT', 'IDENTIFIER', 'LIST', 'SEMICOLON', 'STRING'])),
(103, set(['PROPERTY_INIT']))]
This is enough information for me to be able to safely determine injection points for an instrumentation line. I am still working through some problems with this whole mess, specifically line 46 looks like:
if(1 == 1){
$_injectionjs['testfile'][45] += 1;
console.log('');
$_injectionjs['testfile'][46] += 1;
}else if(2 == 3){
console.log("boop");
}else{
console.log("bleep");
}
after being instrumented. As you can see, I am skipping over the else if chain… which leads me to believe I am missing YET ANOTHER dynamically assigned Node property.