Read 100’000 lines of code

Based on the text “Wie liest man 100’000 Zeilen Code?” by Till A. Heilmann

The problem, as Heilmann outlines it:

“Computer programs are written in formal, not so-called natural languages. Their structure with control structures such as branches and loops or with object-oriented modelling is also generally not linear. Even with just a few hundred lines, it is easy to lose track. The problem is exacerbated if the authors and readers have only rudimentary or no practical programming knowledge. It is therefore not surprising that many analyses in the CCS often concern non-commercial or non-professional small software, often also artistic projects and thus code that was at least partially written to be read as code (and not just executed). However, if the CCS want to analyse the cultural imprint and effectiveness of software on a social scale, then they have to deal with code that is productive on the corresponding scale.” (“Quellcodekritik: zur Philologie von Algorithmen”, 2024, p. 87)

In the following I list a few basic steps for a first exploration of a source code base. The text does not list them explicitly as such. The commands I added are only examples: they are tailored to specific files and will most likely not work unchanged on other code bases, but they illustrate the command-line tools worth looking into.

1. File names and sizes

“Even before the contents of the files, their names and sizes alone say something about the structure of the source code of Photoshop 1.0.1.” (“Quellcodekritik: zur Philologie von Algorithmen”, 2024, p. 93)

What do file names and sizes tell us about the code base?
The listing and disk usage commands ls and du help to get a general feel for the code base.

$ ls -lS            # list files, sorted by size (largest first)
$ du -ch *.OBJ      # show the size of each .OBJ file in human-readable units, plus a total
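When the code base is spread over subdirectories, a summary per file extension can give a quicker overview than a flat listing. The following is a minimal sketch, not taken from Heilmann's text: size_by_extension is a hypothetical helper name, and it assumes file names without newlines. It prints extension, number of files, and total bytes, largest first.

```shell
# Hypothetical helper: sum file sizes per extension below a directory.
# Assumption: file names contain no newlines. Hidden files such as
# ".gitignore" are counted under the extension "gitignore".
size_by_extension() {
    dir=${1:-.}
    find "$dir" -type f | while IFS= read -r f; do
        name=${f##*/}                      # strip the directory part
        case $name in
            *.*) ext=${name##*.} ;;        # text after the last dot
            *)   ext='(none)' ;;
        esac
        # wc -c may pad its output with blanks, so strip them
        printf '%s\t%s\n' "$ext" "$(wc -c < "$f" | tr -d ' ')"
    done |
    awk -F '\t' '{ bytes[$1] += $2; files[$1]++ }
        END { for (e in bytes) printf "%s\t%d\t%d\n", e, files[e], bytes[e] }' |
    sort -t "$(printf '\t')" -k3,3nr       # largest total first
}
```

Called as size_by_extension . it summarizes the current directory.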

2. General File Content

Which files contain what kind of (programming) language or other relevant material?
What was each of them used for, in this specific case but also generally?
Open the files in a text editor, work out which programming language is used or what kind of resource file it is, and make notes.
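Before opening every file in an editor, it can help to glance at the first few lines of each one. The sketch below is only an illustration: peek_files is a hypothetical helper name, and it assumes the files are text (binary files will print as garbage). Where the file utility is installed, it can give a first guess at a file's type as well.

```shell
# Hypothetical helper: print the first N lines (default 5) of every file
# below a directory, each preceded by a header naming the file.
# Assumption: the files are text and their names contain no newlines.
peek_files() {
    dir=${1:-.}
    n=${2:-5}
    find "$dir" -type f | sort | while IFS= read -r f; do
        printf '==> %s <==\n' "$f"
        head -n "$n" "$f"
        printf '\n'
    done
}
```

Called as peek_files . 10 it shows the first ten lines of each file below the current directory.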

3. Scope

How many lines of code are in the files?
The word count (wc) and Count Lines of Code (cloc¹) commands summarize basic statistics on how much code there is.

$ wc -l *.OBJ       # count the lines in each .OBJ file, plus a total
$ cloc *.bas *.asm  # count blank, comment, and code lines per language
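Where cloc is not installed or does not recognize a language, a rough count can be approximated with awk. This is a minimal sketch under stated assumptions, not a replacement for cloc: line_stats is a hypothetical helper name, and it assumes assembly-style sources in which a line whose first non-blank character is ';' is a pure comment line. Lines with code and a trailing comment count as code.

```shell
# Hypothetical helper: rough blank/comment/code line counts per file.
# Assumption: comments start with ';' (assembly style).
line_stats() {
    for f in "$@"; do
        awk -v name="$f" '
            /^[[:space:]]*$/ { blank++;   next }   # empty or whitespace only
            /^[[:space:]]*;/ { comment++; next }   # comment-only line
                             { code++ }            # everything else
            END { printf "%s\tblank=%d\tcomment=%d\tcode=%d\n", name, blank, comment, code }
        ' "$f"
    done
}
```

Called as line_stats *.asm it prints one summary line per file.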

4. Code/Comments

What is code, what is comment?
Every language has its own way of letting programmers leave comments. Separating comments (meant for humans) and code (meant for humans and machines) is a basic step to prepare the material for further analysis.

$ # extract all comments from a 6502 assembly file, with line numbers
$ grep -n -o ';.*' POIZONE.asm > comments.txt
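The grep call above keeps the comments. The complementary view, code without comments, can be sketched with sed. strip_comments is a hypothetical helper name, and the sketch assumes that ';' always starts a comment, that is, it never occurs inside a string literal.

```shell
# Hypothetical helper: print an assembly file without its comments.
# Assumption: ';' never appears inside a string literal.
strip_comments() {
    sed -e 's/;.*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
    # 1st expression: delete from ';' to the end of the line
    # 2nd expression: strip trailing whitespace left behind
    # 3rd expression: drop lines that are now empty
}
```

Called as strip_comments POIZONE.asm > code.txt it writes the comment-free code to a separate file.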

5. Bag of Words

Where to start reading?
Breaking the text into words and looking for significance in this bag of tokens helps to find focus.
Is something used especially often? Are there curious ‘words’?

$ # split into 'words' and count them, ignoring case
$ tr -s ' \t,:(){}' '[\n*]' < POIZONE_10.asm | sort -f | uniq -ci > POIZONE_words.txt
$ # list the 30 most used 'words'
$ grep '[[:alnum:]]\+$' POIZONE_words.txt | sort -k1,1nr | head -n 30
$ # list the 30 most used registers
$ grep '\sR[0-9].*' POIZONE_words.txt | grep '[[:alnum:]]\+$' | sort -k1,1nr | head -n 30

6. Dive

Use the bag of words or specific structures as a starting point, and focus on frequently used or odd ‘words’, or on variable declarations.

$ # list all lines that contain a branching instruction
$ grep -E '(gosub|goto)\s?([0-9]*)' robox_1.bas
$ # why is R13 used least of all registers?
$ # list all lines where the use of R13 is commented
$ grep 'R13.*;' POIZONE_10.asm
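One way to dive further into the branching instructions is to count how often each line number is jumped to, since frequent targets often mark central routines. This is a minimal sketch under assumptions of my own: jump_targets is a hypothetical helper name, keywords are matched regardless of case, and computed jumps with target lists ("on x goto 10,20") are only counted for their first target.

```shell
# Hypothetical helper: count how often each line number is the target
# of a goto or gosub in a BASIC listing. Output: count, tab, target.
# Assumption: keywords inside strings or REM lines are counted as well.
jump_targets() {
    grep -E -i -o '(gosub|goto) *[0-9]+' "$1" |   # the jump statements
        grep -E -o '[0-9]+$' |                    # keep only the target
        sort -n | uniq -c |                       # count per target
        sort -k1,1nr -k2,2n |                     # most frequent first
        awk '{ print $1 "\t" $2 }'                # normalize spacing
}
```

Called as jump_targets robox_1.bas it lists the most frequently targeted line numbers first.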

7. Restructure Findings

Use findings to visualize structures and relationships in code.
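As one possible illustration, the jumps in a BASIC listing can be turned into a graph description in the Graphviz DOT format. The sketch below is an assumption-laden example, not a method from Heilmann's text: basic_to_dot is a hypothetical helper name, every program line is assumed to start with its line number, and goto or gosub inside strings or REM lines are treated as jumps too.

```shell
# Hypothetical helper: emit a Graphviz DOT graph with one edge per
# goto/gosub, from the line containing the jump to the target line.
# Assumption: every program line starts with its line number.
basic_to_dot() {
    awk '
        BEGIN { print "digraph jumps {" }
        match($0, /^[0-9]+/) {
            from = substr($0, 1, RLENGTH)          # line number of this line
            line = tolower($0)
            while (match(line, /(gosub|goto) *[0-9]+/)) {
                stmt = substr(line, RSTART, RLENGTH)
                kind = (stmt ~ /^gosub/) ? "gosub" : "goto"
                target = stmt
                sub(/^[a-z]+ */, "", target)       # keep only the number
                printf "  \"%s\" -> \"%s\" [label=\"%s\"];\n", from, target, kind
                line = substr(line, RSTART + RLENGTH)   # look for more jumps
            }
        }
        END { print "}" }
    ' "$1"
}
```

With Graphviz installed, basic_to_dot robox_1.bas > jumps.dot followed by dot -Tsvg jumps.dot -o jumps.svg renders the graph as an image.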

References

Heilmann, Till A. 2024. “Wie liest man 100’000 Zeilen Code?” In Quellcodekritik: zur Philologie von Algorithmen, edited by Hannes Bajohr and Markus Krajewski, 1st ed., 87–126. August Akademie. Berlin: August Verlag.

¹ cloc (AlDanial/cloc on GitHub) counts blank lines, comment lines, and physical lines of source code in many programming languages.