The main results of our current analysis of the Debian 2.2 GNU/Linux release can be organized in the following categories:
Size of Debian potato.
Importance of the most used programming languages.
Analysis of the evolution in the size of the most relevant packages.
Effort estimations.
We have counted the number of source lines of code (SLOC) in Debian GNU/Linux 2.2 in three ways, with the following results (all numbers are approximate; see the appendix for details):
Count of upstream packages "as such": 52,810,000 SLOC
Count of Debian source packages: 56,180,000 SLOC
Count of Debian source packages without debian directory: 55,920,000 SLOC
For details on the meaning of each category, the reader may revisit the subsection "Downloading and collecting data". In short, the count of upstream packages can be considered the size of the original software used in Debian. The count of Debian source packages represents the amount of code actually present in the Debian 2.2 release, including both the work of the original authors and the work of Debian developers. The latter includes Debian-related scripts and patches. Patches can be the work of Debian developers (for instance, to adapt a package to the Debian policy) or can be downloaded from elsewhere. The count of Debian source packages without the debian directory leaves out the Debian-related scripts, and is therefore a good measure of the size of the packages themselves as they are found in Debian.
It is also important to note that packages developed specifically for Debian usually have no upstream source package. This is, for instance, the case of apt, which is present only as a Debian source package.
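The relation between the three counts can be made explicit with a little arithmetic. The following is a minimal sketch using only the rounded figures reported above; the function and variable names are illustrative and are not part of the tooling used for the study.

```python
# Approximate SLOC counts reported above (rounded figures from the text).
UPSTREAM = 52_810_000              # upstream packages "as such"
DEBIAN_SOURCE = 56_180_000         # Debian source packages
DEBIAN_NO_DEBIAN_DIR = 55_920_000  # Debian source packages without debian directory


def count_differences(upstream, debian_source, debian_no_dir):
    """Split the gap between the Debian and upstream counts into two parts.

    Returns a tuple (debian_dir, other):
      debian_dir: lines that live in the debian directories
      other:      remaining lines not present upstream, i.e. patches plus
                  packages with no upstream source package (such as apt)
    """
    debian_dir = debian_source - debian_no_dir
    other = debian_no_dir - upstream
    return debian_dir, other


debian_dir, other = count_differences(UPSTREAM, DEBIAN_SOURCE, DEBIAN_NO_DEBIAN_DIR)
print(f"Lines in debian directories:   {debian_dir:>10,}")
print(f"Other lines not found upstream: {other:>9,}")
```

Since the inputs are rounded, the two differences are only as precise as the figures they are derived from.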
The SLOC counts by programming language (roughly rounded, for Debian source packages) are as follows:
ANSI C: 39,960,000 SLOC (71.12%)
C++: 5,500,000 SLOC (9.79%)
LISP: 2,800,000 SLOC (4.98%)
Shell: 2,640,000 SLOC (4.70%)
Perl: 1,330,000 SLOC (2.36%)
FORTRAN: 1,150,000 SLOC (2.04%)
Tcl: 550,000 SLOC (0.99%)
Objective C: 425,000 SLOC (0.76%)
Assembler: 425,000 SLOC (0.75%)
Ada: 405,000 SLOC (0.73%)
Python: 360,000 SLOC (0.65%)
Below 0.5% we find some other languages: Yacc (0.46%), Java (0.20%), Expect (0.20%), Lex (0.13%), and others below 0.1%.
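The percentages above are each language's share of the total count for Debian source packages. As a rough check, they can be recomputed from the rounded SLOC figures. This is a sketch under that assumption: the inputs are the rounded values from the list, so the last digit may differ slightly from the reported percentages, and the names used are illustrative.

```python
# Rounded total for Debian source packages, as reported above.
TOTAL_SLOC = 56_180_000

# Rounded per-language counts, copied from the list above.
SLOC_BY_LANGUAGE = {
    "ANSI C": 39_960_000,
    "C++": 5_500_000,
    "LISP": 2_800_000,
    "Shell": 2_640_000,
    "Perl": 1_330_000,
    "FORTRAN": 1_150_000,
    "Tcl": 550_000,
    "Objective C": 425_000,
    "Assembler": 425_000,
    "Ada": 405_000,
    "Python": 360_000,
}


def language_shares(counts, total):
    """Return each language's share of the total, in percent."""
    return {lang: 100.0 * sloc / total for lang, sloc in counts.items()}


shares = language_shares(SLOC_BY_LANGUAGE, TOTAL_SLOC)
for lang, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<12} {pct:6.2f}%")
```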
When we count the lines in the Debian source packages without the debian directory (which contains package configuration files and maintainer scripts), the numbers are similar. This means that the maintainer scripts are not a significant part of the distribution. The main differences are in Shell lines (about 150,000 fewer) and in Perl lines (about 80,000 fewer), which reveals the preferred languages for those scripts.
However, when we count original (upstream) source packages there are some remarkable differences: about 2,000,000 lines of ANSI C code, 300,000 lines of LISP, 200,000 lines of FORTRAN, and minor variations in other languages. These differences can usually be attributed to patches to upstream packages made by the Debian developer. Therefore, these numbers show in which languages the most heavily patched packages are written.
The largest packages in the Debian potato distribution are:
Mozilla (M18): 2,010,000 SLOC (2,010,000). C++ accounts for 1,260,000 SLOC, ANSI C for 702,000. Mozilla is the well-known open source WWW browser.
Linux kernel (2.2.19): 1,780,000 SLOC (1,780,000). ANSI C accounts for 1,700,000 SLOC, Assembler for 65,000. The Linux 2.x kernels were the stable series at the time of the Debian 2.2 release.
XFree86 (3.3.6): 1,270,000 SLOC (1,265,000). Mainly 1,222,000 SLOC of ANSI C. This is an X Window implementation, including graphics server and basic programs.
PM3 (1.1.13): 1,115,000 SLOC (1,114,000). 983,000 SLOC of ANSI C, 57,000 of C++. PM3 is the Modula-3 distribution of the Ecole Polytechnique de Montreal, including a compiler and libraries.
OSKit (0.97): 859,000 SLOC (859,000). ANSI C accounts for 842,000 SLOC. OSKit is the Flux Operating System Toolkit, a framework for operating system design.
Stalin (0.8): 805,000 SLOC (805,000). Almost fully written in ANSI C, 804,000 SLOC. Stalin is a Scheme compiler, designed to improve the efficiency of Scheme programs.
GDB (4.18): 801,000 SLOC (800,000). Includes 727,000 lines of ANSI C and 38,000 of Expect. GDB is the GNU source-level debugger.
GNAT (3.12p): 688,000 SLOC (687,000). About 410,000 SLOC of ANSI C and 248,000 SLOC of Ada. GNAT is the GNU Ada 95 compiler, including libraries.
Emacs (20.7): 630,000 SLOC (629,000). 454,000 SLOC of LISP, 171,000 SLOC of ANSI C. Emacs is the well-known extensible text editor (and many, many more things).
NCBI Libraries (6.0.2): 591,000 SLOC (591,000). Almost only ANSI C is found, 590,000 SLOC. This package includes libraries for biology applications.
EGCS (1.1.2): 578,000 SLOC (562,000). Includes 470,000 SLOC of ANSI C and 55,000 SLOC of C++. This package includes the GNU C++ extension library.
XEmacs, base support (21): 513,000 SLOC (513,000). An almost pure LISP package, 510,000 SLOC. Includes the base extra Emacs LISP files needed to have a working XEmacs.
Numbers in parentheses are the approximate SLOC counts of the upstream packages; the other numbers are the approximate SLOC counts of the Debian source packages. Only data for the most relevant languages found in each package are reported. The reader may notice that in most cases the two numbers are roughly equal, which shows that, in those cases, the additions made by Debian developers are minimal (although modifications could be more significant).
The release numbers of the packages are obviously not current, but those were the ones available at the time of the freeze for Debian 2.2 (Spring 2001). The ranking could be different had Debian developers packaged things in other ways. For instance, if all Emacs extensions were in the Emacs package, it would have been much larger. However, Debian source packages usually match well the notion of a package held by upstream authors, which is also the one generally accepted.
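The comparison between the two figures given for each package can be sketched as follows. The data are the rounded values from the list above; the function name and tuple layout are illustrative. Note that the difference only captures net added lines, so a package that was heavily modified without growing would still show a difference close to zero.

```python
# (package, Debian source SLOC, upstream SLOC), rounded values from the list above.
PACKAGES = [
    ("Mozilla", 2_010_000, 2_010_000),
    ("Linux kernel", 1_780_000, 1_780_000),
    ("XFree86", 1_270_000, 1_265_000),
    ("PM3", 1_115_000, 1_114_000),
    ("OSKit", 859_000, 859_000),
    ("Stalin", 805_000, 805_000),
    ("GDB", 801_000, 800_000),
    ("GNAT", 688_000, 687_000),
    ("Emacs", 630_000, 629_000),
    ("NCBI Libraries", 591_000, 591_000),
    ("EGCS", 578_000, 562_000),
    ("XEmacs, base support", 513_000, 513_000),
]


def debian_additions(packages):
    """Return (name, added SLOC, added percent) sorted by added SLOC, largest first."""
    rows = []
    for name, debian_sloc, upstream_sloc in packages:
        added = debian_sloc - upstream_sloc
        rows.append((name, added, 100.0 * added / debian_sloc))
    return sorted(rows, key=lambda row: row[1], reverse=True)


additions = debian_additions(PACKAGES)
for name, added, pct in additions:
    print(f"{name:<22} {added:>8,} SLOC  ({pct:.2f}%)")
```

With these rounded inputs, EGCS shows the largest gap (16,000 SLOC), while most packages differ by 1,000 SLOC or less, which is within rounding error.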
The next packages by SLOC size (between 350,000 and 500,000 SLOC) are Binutils (GNU assembler, linker, and binary utilities), TenDRA (C and C++ compiler and checker), LAPACK (a set of linear algebra routines), and GIMP (the GNU Image Manipulation Program). Except for LAPACK (which is composed mainly of FORTRAN files), these packages are mainly written in ANSI C.
Using the basic COCOMO model, we can estimate the effort needed to build a system of the same size as Debian 2.2. This estimate assumes a "classical", proprietary development model, and is therefore not a valid estimate of the effort actually spent building this software. It does, however, give at least an order of magnitude for the effort that would have been needed had a proprietary development model been used.
Using the SLOC count for the Debian source packages, the data provided by the basic COCOMO model are as follows:
Total SLOC count: 56,184,171
Estimated effort: 171,141 person-months (14,261 person-years)
Formula: 2.4 * (KSLOC**1.05)
Estimated schedule: 72.53 months (6.04 years)
Formula: 2.5 * (Effort**0.38)
Estimated cost to develop: 1,848,225,000 USD
To calculate the cost estimate, we have used the mean salary for a full-time systems programmer during 2000 according to Computer World, which is 54,000 USD per year, and an overhead factor of 2.4.
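The two basic COCOMO formulas and the cost calculation can be sketched as follows. Function names are illustrative. The sketch reproduces the reported cost from the reported effort (14,261 person-years), the salary and the overhead factor. It also shows that applying the effort formula once to the total SLOC count gives roughly 233,000 person-months, more than the reported 171,141. Likewise, applying the schedule formula to the reported total effort gives more than the reported 72.53 months. The reported figures would be consistent with estimating packages separately rather than as one monolithic system, but that is an assumption here, since this section does not state how the aggregation was done.

```python
def basic_cocomo_effort(ksloc):
    """Basic COCOMO effort in person-months: 2.4 * (KSLOC**1.05)."""
    return 2.4 * ksloc ** 1.05


def basic_cocomo_schedule(effort_pm):
    """Basic COCOMO schedule in months: 2.5 * (Effort**0.38)."""
    return 2.5 * effort_pm ** 0.38


def development_cost(person_years, salary_usd=54_000, overhead=2.4):
    """Cost in USD: person-years times yearly salary times overhead factor."""
    return person_years * salary_usd * overhead


# Figures reported above.
TOTAL_SLOC = 56_184_171
REPORTED_EFFORT_PM = 171_141
REPORTED_PERSON_YEARS = 14_261

# Cost derived from the reported effort; matches the reported figure
# up to rounding to the nearest thousand USD.
cost = development_cost(REPORTED_PERSON_YEARS)
print(f"Cost from reported effort: {cost:,.0f} USD")

# Formula applied once to the whole distribution (NOT the reported figure):
# because the exponent is greater than 1, one large system is estimated as
# costlier than the sum of many smaller ones.
monolithic_effort = basic_cocomo_effort(TOTAL_SLOC / 1000)
print(f"Effort if treated as a single system: {monolithic_effort:,.0f} person-months")

# Schedule formula applied to the reported total effort (also NOT the
# reported 72.53 months).
monolithic_schedule = basic_cocomo_schedule(REPORTED_EFFORT_PM)
print(f"Schedule from total reported effort: {monolithic_schedule:.1f} months")
```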