Some musicians are so idiosyncratic and distinctive that you can pick out their sound anywhere: Mark Knopfler and Jerry Garcia on guitar, Flea on bass, Stevie Nicks on vocals, and Trent Reznor on, well, everything.

It turns out that programmers are the same way.

“Stylometry,” which identifies creators by studying their style, was first applied to music, then to fine art. Researchers led by Aylin Caliskan-Islam, a PhD student at Drexel University, have recently shown that the technique can also be applied to code to determine who wrote it.

“Every developer has preferences not only for things like spacing (spaces vs tabs), naming styles (CamelCase vs. snake_case) and commenting, but also how he or she implements certain types of functionality,” explains Phil Johnson in ITWorld.

Other traits that the researchers used included layout (such as the white space in the program) and lexical attributes (for example, counts of various types of tokens), Johnson explains. “Their real innovation, though, was in developing what they call ‘abstract syntax trees’ which are similar to parse trees for sentences, and are derived from language-specific syntax and keywords,” he continues. “These trees capture a syntactic feature set which, the authors wrote, ‘was created to capture properties of coding style that are completely independent from writing style.’”

In other words, “Programmers can obfuscate their variable or function names, but not the structures they subconsciously prefer to use or their favorite increment operators,” the researchers say in a lecture about the project.
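To make the layout and lexical traits above a little more concrete, here is a rough Python sketch that counts a few of them in a snippet of C-like code. It is only an illustration: the function name `style_features`, the feature names, and the regular expressions are assumptions made for this example, not the researchers' actual feature set, and it makes no attempt at the syntax-tree analysis they describe.

```python
import re

def style_features(source: str) -> dict:
    """Count a few toy layout and lexical traits in C-like source (illustrative only)."""
    lines = source.splitlines()
    # Crude identifier scan; it also picks up keywords and words inside comments.
    identifiers = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", source)
    return {
        # Layout: how lines are indented.
        "tab_indented_lines": sum(1 for ln in lines if ln.startswith("\t")),
        "space_indented_lines": sum(1 for ln in lines if ln.startswith(" ")),
        # Lexical: naming conventions.
        "camel_case_names": sum(
            1 for name in identifiers
            if re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", name)
        ),
        "snake_case_names": sum(
            1 for name in identifiers
            if re.fullmatch(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+", name)
        ),
        # Lexical: commenting and the "favorite increment operator".
        "line_comments": source.count("//"),
        "pre_increments": len(re.findall(r"\+\+\s*[A-Za-z_]", source)),
        "post_increments": len(re.findall(r"[A-Za-z0-9_\]\)]\s*\+\+", source)),
    }

sample = (
    "int main() {\n"
    "\tint loopCount = 0;\n"
    "\tfor (int i = 0; i < 3; ++i) {\n"
    "\t\tloopCount++; // tally\n"
    "\t}\n"
    "\treturn loopCount;\n"
    "}\n"
)
features = style_features(sample)
print(features)
```

Counts like these are easy to disguise by renaming variables or reformatting the file, which is exactly why the researchers lean on structural features as well.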

The technique was developed by gathering publicly available C++ source code from seven years’ worth of Google’s Code Jam, an annual programming competition that attracts a wide range of programmers (more than 100,000 in total). The researchers eventually achieved a 95 percent success rate, Johnson writes, which rises to 97 percent as the code sample sizes increase. The more complex the programming problem, the more successful the researchers were at identifying the coder, he adds.

Of course, success first requires a known database of programmers and their code to compare against. This was apparently left as an exercise for the reader.
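For readers who want to attempt that exercise, the comparison step might look something like the Python sketch below. Everything in it is assumed for illustration: the authors "alice" and "bob", their feature counts, and the choice of cosine similarity as the matching rule. The article does not say which classifier the researchers used, so this simple nearest-profile lookup is a stand-in, not their method.

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature-count dictionaries."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)

def closest_author(unknown: dict, known_profiles: dict) -> str:
    """Return the known author whose style profile is most similar to the unknown sample."""
    return max(
        known_profiles,
        key=lambda author: cosine_similarity(unknown, known_profiles[author]),
    )

# Hypothetical "known database": made-up style counts for two made-up programmers.
known_profiles = {
    "alice": {"tab_indented_lines": 40, "camel_case_names": 25, "pre_increments": 9},
    "bob": {"space_indented_lines": 38, "snake_case_names": 30, "post_increments": 7},
}

# Made-up counts from a code sample of unknown authorship.
unknown = {
    "tab_indented_lines": 12,
    "camel_case_names": 6,
    "pre_increments": 2,
    "post_increments": 1,
}

print(closest_author(unknown, known_profiles))
```

The catch the article points out still applies: a lookup like this can only ever name someone who is already in the database.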

Why is this a big deal? Does it really matter who wrote some piece of code? It does when you consider that stylometry could be used to help identify the author of malicious source code. It might also help resolve plagiarism and copyright disputes.

“Source code authorship attribution could provide proof of authorship in court, automate the process of finding a cyber criminal from the source code left in an infected system, or aid in resolving copyright, copyleft and plagiarism issues in the programming fields,” the researchers write. Less seriously, it could also be used in computer programming classes to determine whether students are cheating on the assignments.

Stylometry is also being considered as a security method by organizations such as the Defense Advanced Research Projects Agency and the U.S. Army—where, incidentally, Caliskan-Islam served as a visiting researcher.

In fact, this research could put some programmers at risk, the researchers warn. Programmers could now be identified when they contribute to an open-source project—which could get sticky when they don’t want their employer to know. It also becomes risky to work on something embarrassing—or dangerous. “An Iranian programmer was sentenced to death in 2012 for developing photo sharing software that was used on pornographic websites,” researchers note. “Sadly but predictably, there is virtually no technical difference between security-enhancing and privacy-infringing use cases.”
