1 TreeHarp Class

The TreeHarp S4 class is used to represent an R expression. It can then be manipulated in several ways in order to perform static code analysis of student submissions.

The lintr package does a superb job of parsing R code, but it provides too much detail for a simpler task that the autoharp does. Hence the tokens that the autoharp focuses on are much fewer. For instance, the ( parentheses are dropped.

Here is a simple expression. Let us use it to study the elements of a TreeHarp object. Suppose we fit a linear model to variables in a dataset. To create a TreeHarp object from an expression, we provide the expression together with the quote = TRUE argument. This is important because method dispatch is performed based on that second argument, not the first! If we were to dispatch on the first, R would evalate the expression in order to check its class - thus destroying the expression we intended to capture.

library(autoharp)
tree1 <- TreeHarp(quote(lm(y ~ x1 + x2, data=mydata)), TRUE)
opar <- par(mar=c(0,0,0,0))
plot(tree1, vertex.size=25, asp=0.6, vertex.color="gray", vertex.frame.color=NA)

par(opar)

The autoharp relies on the igraph package for plotting graphs, so as you can see, the full set of parameter options can be utilised to create a rich plot for our tree.

1.1 Slots

There are 4 slots in a TreeHarp object. The only compulsory one is the adjList.

1.1.1 adjList

This slot contains an adjacency list that represents the tree structure of the code. Nodes in a tree are labelled in Breadth-First Search (BFS) order. Thus the root node has id 1, and does not appear in the adjacency list. The TreeHarp convention is to list each edge only once, as a child. Here’s what we mean: node 2 in the example above is a neighbour of node 1 and 4, but it only appears under node 1. It does not appear as an adjacent node of node 4 because it is not child of node 4. Terminal nodes (leafs) have a NULL entry in the list.

slot(tree1, "adjList")
#> $lm
#> [1] 2 3
#> 
#> $`~`
#> [1] 4 5
#> 
#> $data
#> [1] 6
#> 
#> $y
#> NULL
#> 
#> $`+`
#> [1] 7 8
#> 
#> $mydata
#> NULL
#> 
#> $x1
#> NULL
#> 
#> $x2
#> NULL

1.1.2 nodeTypes

If the object was constructed from an R language object, this slot will be automatically populated. It will contain information about the types of nodes in the object. The object in this slot is a data frame with the following columns:

  1. id (node id). The root node has id 1.
  2. name. The name of the node.
  3. call_status. A TRUE/FALSE column indicating if the node was a call or not.
  4. formal_arg. If the node is not a call, then this column will indicate if it is a formal argument or not. If it is not a call and not a formal argument, it is a symbol representing an R object - we call this an actual argument.
  5. depth. This is the depth of the node in the tree. The root of the tree has depth 1.
slot(tree1, "nodeTypes")
#>   id   name call_status formal_arg depth
#> 1  1     lm        TRUE      FALSE     1
#> 2  2      ~        TRUE      FALSE     2
#> 3  3   data       FALSE       TRUE     2
#> 4  4      y       FALSE      FALSE     3
#> 5  5      +        TRUE      FALSE     3
#> 6  6 mydata       FALSE      FALSE     3
#> 7  7     x1       FALSE      FALSE     4
#> 8  8     x2       FALSE      FALSE     4

1.1.3 call

The call slot stores the original expression that was used to construct the TreeHarp object, just in case you need it later.

slot(tree1, "call")
#> lm(y ~ x1 + x2, data = mydata)

1.1.4 repr

The repr slot contains a string representation of the object. If the original TreeHarp object has been mutilated (using one of the routines below), then it may not be a proper R expression, so this slot stores the best representation of it. This slot is used when the object is printed, or when show is called on the S4 object.

tree1
#> lm(y ~ x1 + x2, data = mydata)

1.2 Methods

As we have already demonstrated, the plot method exists for this class. It relies on the tree layout of igraph package, but additional arguments can be used to customise the plot. For instance, we could use colour to distinguish between calls and non-call nodes:

opar <- par(mar=c(0,0,0,0))
plot(tree1, vertex.size=25, asp=0.6, vertex.color=tree1@nodeTypes$call_status)

par(opar)

Here are other methods defined for the TreeHarp class:

  • length: returns the number of nodes.
  • names: returns the node names.
  • as.matrix: returns a symmetric matrix of 1’s and 0’s indicating edges in the tree.
  • get_parent_id: returns the parent id of a node.
  • get_child_ids: returns the ids of the children of a node.
  • get_node_types: returns the nodeTypes slot from a TreeHarp object.
  • get_adj_list: returns the adjacency list slot from a TreeHarp object.

2 ForestHarp

A ForestHarp is a loosely defined idea, in the sense that there is no such class defined. However, it is utilised fairly commonly when dealing with script submissions from students. The precise definition is a list of TreeHarp objects. It can be created using the rmd_to_forestharp function.

f1 <- rmd_to_forestharp(system.file("examples", "student_scripts", 
                                    "qn01_scr_02.R", package="autoharp"))
length(f1)
#> [1] 2

From the sample R script above, we would obtain a list of length 2, with each element being a separate TreeHarp object. To work with such lists, the package provides fapply, for applying functions to lists of TreeHarp objects and combining them back together again. The function applied to each TreeHarp object has to be a function that is specifically for TreeHarp objects. fapply gives you options to combine the output from each TreeHarp object in however which way you choose.

Suppose that we wished to count the number of times a for loop was used in the whole script. Then this would do the trick:

fapply(f1, count_fn_call, pattern="^for$")
#> [1] 1

count_fn_call is a ForestHarp helper function. fapply applies it to each TreeHarp object in the list, and then combines that output into a scalar. By default, it combines them by taking the sum. However, it is possible to specify the combining function, or to just return the uncombined list. Here’s an example of the former, returning a logical value instead of the sum:

fapply(f1, count_fn_call, pattern="^for$", combine=TRUE, 
       combiner_fn = function(x) any(unlist(x) > 0))
#> [1] TRUE

In this case, it returns a single TRUE/FALSE, indicating whether a for loop was present in any of the two expressions.

To return the uncombined list, we can use the following command:

fapply(f1, count_fn_call, pattern="^for$", combine=FALSE)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 0

This is the approach used to accommodate static code analysis of student R scripts. The autoharp comes with a set of ForestHarp helpers - they are documented on the help page. If you require something more specific for what you intend to extract, you will have to write your own. All it has to do is take in a TreeHarp object and return what you are looking for.

?`forestharp-helpers`

3 TreeHarp Routines

3.1 subtree_at

This function returns a subtree. It takes a node number (id) as an argument, and returns a new TreeHarp object rooted at that node. If the subtree is not a legitimate call, then call status cannot be preserved.

th3 <- TreeHarp(quote(a <- f(x, y, g(v, w))), quote=TRUE)
sub_th <- subtree_at(th3, 3, TRUE) # preserves call status
sub_th
#> f(x, y, g(v, w))

3.2 generate_all_subtrees

This routine enumerates and prints all subtrees of a TreeHarp object. It follows the algorithm in (Ruskey 1981). The method of specifying a subtree is to use a binary array, known as a characteristic array. The ordering of the nodes is BFS.

th4 <- TreeHarp(quote(f(x, g(z=2))), quote=TRUE)
all_trees <- generate_all_subtrees(th4)
all_trees
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    1    0    1    0    0
#> [3,]    1    0    1    1    0
#> [4,]    1    0    1    1    1
#> [5,]    1    1    0    0    0
#> [6,]    1    1    1    0    0
#> [7,]    1    1    1    1    0
#> [8,]    1    1    1    1    1

3.3 carve_subtree

This function takes a TreeHarp object and a characteristic array as inputs. It returns a TreeHarp object with only the nodes indicated there present.

carve_subtree(th4, all_trees[5,])

This routine was meant for internal use initially. Since the returned object is not necessarily a legitimate call in R, this function drops the information in all other slots, including the nodeTypes and repr.

3.4 is_subtree_rooted_at

This function can be used to check if one tree is rooted at a particular node of another. Take note that only node names are checked, because once a a subtree is created there is no guarantee that a call is preserved, especially when doing similarity of tree calculations.

thb1 <- TreeHarp(quote(b(d)), TRUE)
tha1 <- TreeHarp(quote(a(b(d), c)), TRUE)
is_subtree_rooted_at(thb1, tha1, 1) # FALSE
#> [1] FALSE
is_subtree_rooted_at(thb1, tha1, 2) # TRUE
#> [1] TRUE

3.5 path_to_root

This function identifies the nodes on the path from a node up to the root of a TreeHarp object. It returns a characteristic array indicating the nodes on that path up to the root.

ex1 <- quote(x <- f(y, g(5)))
th1 <- TreeHarp(ex1, TRUE)
path_to_root(th1, 5)
#> [1] 1 0 1 0 1 0

3.6 get_recursive_index

This function can be used to obtain a recursive call index that can be used to extract a sub-call from an R call (language expression). This is useful when we need to preserve the call, yet extract a subtree.

ex3 <- quote(x <- f(y = g(3, 4), z=1L))
t1 <- TreeHarp(ex3, TRUE)
rec_index <- get_recursive_index(t1, 6)
ex3[[rec_index + 1]]
#> g(3, 4)
ex3[[get_recursive_index(t1, 3)+1]]
#> f(y = g(3, 4), z = 1L)

3.7 get_parent_call_id

This function returns the most recent call from a particular node id. Sometimes, the parent of a node is a formal argument. If we are interested in the parent call, not the immediate parent, then this function might be what you need.

ex3 <- quote(x <- f(y = g(3, 4), z=1L))
t1 <- TreeHarp(ex3, TRUE)
# get the function that calls g:
get_parent_call_id(t1, 6) 
#> [1] 3
#contrast with this:
get_parent_id(t1, 6)
#> [1] 4

4 Tree Similarity Metrics

One of the ambitions of this package was to detect duplicated lines of code that can be reduced through the efficiencies of R for data analysis. An example of undesirable code would be

x1 <- my_function(arg1)
x2 <- my_function(arg2)
x3 <- my_function(arg3)

Within R, an lapply or vapply on c(arg1, arg2, arg3) would be more appropriate. In order to detect such code, some form of similarity measures would be appropriate and useful. This section documents some of the similarity measures currently present within the autoharp.

4.1 Jaccard Similarity

The Jaccard index is a value that summarises the similarity between two sets. At it’s most basic, it is equal to \[ \text{Jac}(A,B) = \frac{A \cap B}{A \cup B}\]

The weighted index, which is appropriate when there could be duplicate items within each set, is described on Wikipedia.

jaccard_treeharp(tha1, thb1)
#> [1] 0.5

4.2 Dot Product of All Possible Subtrees

A similarity measure based on counts of all possible subtrees is described in (Collins and Duffy 2002) and (Haussler 1999). That algorithm is implemented here, with some modifications to use brute-force enumeration where necessary.

To normalise this measure, we divide by the similarity with itself.

K2(tha1, thb1)
#> [1] 3
# normalised to 0 - 1:
K2(tha1, thb1)/sqrt(K2(tha1, tha1)* K2(thb1, thb1))
#> [1] 0.5477226

This technique was used in (Song, Park, and Park 2015) to detect plagiarism in source codes.

4.3 Natural Language Processing (NLP) Techniques

Traditional NLP techniques can also be used to compute similarities between student scripts. The function rmd_to_token_count provides a count of tokens.

library(tidytext)
library(text2vec)
library(tm)

stud_script_paths <- system.file("examples", "student_scripts", package="autoharp")
stud_script_names <- list.files(stud_script_paths, full.names = TRUE)

token_count_list <- lapply(stud_script_names, rmd_to_token_count)
token_count_df <- bind_rows(token_count_list)
#head(token_count_df)

The counts can then be used in a Bag-of-Words model to compute cosine similarities, for instance.

dtm_out <- cast_dtm(token_count_df, fname, token, n, tm::weightTfIdf)
dtm_mat <- as.matrix(dtm_out)
sim2(dtm_mat)
#Docs            qn01_scr_01.R qn01_scr_02.R qn02_scr_01.R qn02_scr_02.R
#  qn01_scr_01.R     1.0000000    0.32776151    0.00000000    0.00000000
#  qn01_scr_02.R     0.3277615    1.00000000    0.01166493    0.01166493
#  qn02_scr_01.R     0.0000000    0.01166493    1.00000000    0.57194128
#  qn02_scr_02.R     0.0000000    0.01166493    0.57194128    1.00000000

References

Collins, Michael, and Nigel Duffy. 2002. “Convolution Kernels for Natural Language.” In Advances in Neural Information Processing Systems, 625–32.

Haussler, David. 1999. “Convolution Kernels on Discrete Structures.” Technical report, Department of Computer Science, University of California Santa Cruz.

Ruskey, Frank. 1981. “Listing and Counting Subtrees of a Tree.” SIAM Journal on Computing 10 (1): 141–50.

Song, Hyun-Je, Seong-Bae Park, and Se Young Park. 2015. “Computation of Program Source Code Similarity by Composition of Parse Tree and Call Graph.” Mathematical Problems in Engineering 2015.