Read 100’000 lines of code

Based on the text “Wie liest man 100’000 Zeilen Code?” by Till A. Heilmann

Heilmann outlines the problem as follows:

“Computer programs are written in formal, not so-called natural languages. Their structure with control structures such as branches and loops or with object-oriented modelling is also generally not linear. Even with just a few hundred lines, it is easy to lose track. The problem is exacerbated if the authors and readers have only rudimentary or no practical programming knowledge. It is therefore not surprising that many analyses in the CCS often concern non-commercial or non-professional small software, often also artistic projects and thus code that was at least partially written to be read as code (and not just executed). However, if the CCS want to analyse the cultural imprint and effectiveness of software on a social scale, then they have to deal with code that is productive on the corresponding scale.” (“Quellcodekritik: zur Philologie von Algorithmen”, 2024, p. 87)

Below I list a few basic steps for a first exploration of a source code base. Heilmann's text does not spell them out as explicit steps. The commands I added are only examples: they were written for specific files and will most likely not work unchanged anywhere else, but they illustrate the command-line tools worth looking into.

1. File Names and Sizes

“Even before the contents of the files, their names and sizes alone say something about the structure of the source code of Photoshop 1.0.1.” (“Quellcodekritik: zur Philologie von Algorithmen”, 2024, p. 93)

$ ls -lS            # list files, sorted by size (largest first)
$ du -ch *.OBJ      # show the size of each .OBJ file in human-readable units, plus a total
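To see how size is distributed across file types rather than across single files, the listing can be aggregated by file extension. The following is a minimal sketch, assuming a POSIX shell with `find`, `wc`, `awk` and `sort`. The function name `size_by_ext` is my own and purely illustrative.

```shell
# size_by_ext DIR: sum file sizes (in bytes) per file extension, largest first.
# Hypothetical helper; assumes file names that contain no newlines or tabs.
# Output columns: total bytes, number of files, extension.
size_by_ext() {
    find "${1:-.}" -type f | while IFS= read -r f; do
        base=${f##*/}                 # strip the directory part
        case $base in
            *.*) ext=${base##*.} ;;   # text after the last dot
            *)   ext='(none)' ;;      # files without any extension
        esac
        printf '%s\t%s\n' "$ext" "$(wc -c < "$f" | tr -d ' ')"
    done | awk -F '\t' '
        { bytes[$1] += $2; files[$1]++ }
        END { for (e in bytes) printf "%d\t%d\t%s\n", bytes[e], files[e], e }
    ' | sort -k1,1nr
}
```

Calling `size_by_ext .` in the root of a code base prints one line per extension, which gives a quick impression of which file types carry most of the material.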

2. General File Content

3. Scope

$ wc -l *.OBJ       # count the lines in each .OBJ file, plus a total
$ cloc *.bas *.asm  # count blank, comment, and code lines
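If `cloc` is not available, or does not know the dialect at hand, a rough split into blank, comment and code lines can be approximated with `awk`. This is only a sketch under one assumption: comments start with `;`, as in the assembly files above. A line that holds code followed by a trailing comment is counted as code here. The function name `line_stats` is hypothetical.

```shell
# line_stats FILE: count blank, comment-only and code lines.
# Assumption: a comment starts with ';' (assembly style).
line_stats() {
    awk '
        /^[[:space:]]*$/  { blank++;   next }   # empty or whitespace only
        /^[[:space:]]*;/  { comment++; next }   # nothing but a comment
                          { code++ }            # everything else
        END { printf "blank=%d comment=%d code=%d\n", blank, comment, code }
    ' "$1"
}
```

For example, `line_stats POIZONE_10.asm` would print a single summary line for that file.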

4. Code/Comments

$ # extract all comments from a 6502 assembly file, with their line numbers
$ grep -n -o ';.*' POIZONE.asm > comments.txt
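A plain `grep` for `;` also catches semicolons that are part of a string literal. The sketch below walks through each line character by character and only treats a `;` outside double quotes as the start of a comment. It assumes that strings are delimited by double quotes and does not handle single-quoted character literals. The function name `asm_comments` is my own invention.

```shell
# asm_comments FILE: print "LINE<TAB>comment" for every ';' comment,
# skipping semicolons inside double-quoted strings.
# Assumption: strings use double quotes only.
asm_comments() {
    awk '{
        in_str = 0
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == "\"") {
                in_str = !in_str               # toggle on every double quote
            } else if (c == ";" && !in_str) {
                printf "%d\t%s\n", NR, substr($0, i)
                break                          # rest of the line is comment
            }
        }
    }' "$1"
}
```

The output can be redirected to a file in the same way as above, for example `asm_comments POIZONE.asm > comments.txt`.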

5. Bag of Words

$ # split into 'words' and count how often each one occurs
$ tr -s ' \t,:(){}' '\n' < POIZONE_10.asm | sort -f | uniq -c > POIZONE_words.txt
$ # list the 30 most used 'words'
$ grep '[[:alnum:]]\+$' POIZONE_words.txt | sort -k1nr | head -n 30
$ # list the 30 most used registers
$ grep '\sR[0-9]\+$' POIZONE_words.txt | sort -k1nr | head -n 30
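The word list above also counts words that occur inside comments. If only the code itself is of interest, the comments can be stripped before counting. The following sketch wraps this in a function; it assumes `;` comments, folds everything to lower case, and uses the hypothetical name `top_words`. Whether comments should be excluded is a choice of the analysis, not something the source text prescribes.

```shell
# top_words FILE [N]: the N most frequent tokens outside ';' comments
# (default 30), compared case-insensitively.
# Assumption: everything after ';' on a line is a comment.
top_words() {
    sed 's/;.*//' "$1" |                 # drop comments
        tr -s ' \t,:(){}' '\n' |         # one token per line
        tr '[:upper:]' '[:lower:]' |     # fold case
        grep -v '^$' |                   # drop empty lines
        sort | uniq -c |                 # count identical tokens
        sort -k1,1nr -k2 |               # by count, then alphabetically
        head -n "${2:-30}"
}
```

`top_words POIZONE_10.asm 10` would then list the ten most frequent tokens of the code without its commentary.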

6. Dive

$ # extract all lines that contain a jump instruction (GOSUB or GOTO)
$ grep -E '(gosub|goto)\s?([0-9]*)' robox_1.bas
$ # why is R13 used least of all registers?
$ # extract all lines where the use of R13 is commented
$ grep 'R13.*;' POIZONE_10.asm
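One way to dive further into a BASIC listing is to ask which line numbers are jumped to most often, since these are likely candidates for central subroutines. The sketch below counts the targets of GOTO and GOSUB. It assumes numeric targets written directly after the keyword, so in a statement such as `ON X GOTO 100,200` only the first target is counted. The function name `jump_targets` is hypothetical.

```shell
# jump_targets FILE: count how often each line number is the target of a
# GOTO or GOSUB, most frequent first.
# Assumption: the target is a number directly following the keyword.
jump_targets() {
    grep -Eio '(gosub|goto) *[0-9]+' "$1" |   # e.g. "gosub 100"
        grep -Eo '[0-9]+$' |                  # keep only the target number
        sort -n | uniq -c |                   # count each target
        sort -k1,1nr -k2,2n                   # by count, then by line number
}
```

Running `jump_targets robox_1.bas | head` would show the most frequently targeted line numbers, which can then be looked up in the listing.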

7. Restructure Findings

References

Heilmann, Till A. 2024. “Wie liest man 100’000 Zeilen Code?” In Quellcodekritik: zur Philologie von Algorithmen, edited by Hannes Bajohr and Markus Krajewski, 1st ed., 87–126. August Akademie. Berlin: August Verlag.


  1. cloc (GitHub: AlDanial/cloc) counts blank lines, comment lines, and physical lines of source code in many programming languages.