Trends in atomistic simulation software usage [1.3]

Driven by the unprecedented computational power available to scientific research, the use of computers in solid-state physics, chemistry and materials science has been on a continuous rise. This review focuses on the software used for the simulation of matter at the atomic scale. We provide a comprehensive overview of major codes in the field, and analyze how citations to these codes in the academic literature have evolved since 2010. An interactive version of the underlying data set is available at https://atomistic.software.


Introduction
Scientists today have unprecedented access to computational power. This statement would be unremarkable, were it not for the extent to which computational power has exploded. Figure 1 shows the performance ranking of the top 500 supercomputers in the world over the last decades, including the performance of the top-ranked machine, the machine at the bottom of the list, and the sum of all 500. Remarkably, the least-squares fit to the sum (green line) corresponds to a growth rate of ∼75% per year, which translates to a performance increase of more than one million times over the last 25 years. Similar improvements in commodity hardware mean that many of 2020's laptop computers would have made the top 500 list of the early 2000s [1].
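The compounding behind these figures can be checked with a few lines of Python. This is only an illustrative sketch: the function name `compound_factor` is ours, and the inputs are the rounded values quoted in the text (75% per year, 25 years), not the fitted data behind Figure 1.

```python
def compound_factor(annual_growth: float, years: int) -> float:
    """Return the total multiplicative increase after compounding
    a constant annual growth rate over the given number of years."""
    return (1.0 + annual_growth) ** years


# Rounded values quoted in the text: ~75% per year over 25 years.
factor = compound_factor(0.75, 25)
print(f"Total increase after 25 years: {factor:.2e}x")
```

With these rounded inputs the factor comes out slightly above one million, consistent with the statement above.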
First versions of quantum-chemistry codes, such as Gaussian [2], were already released in the 1970s, followed by force-field codes, such as GROMOS [3], and periodic density-functional theory (DFT) codes, such as CASTEP [4], in the 1980s and 1990s. In other words, many of these atomistic simulation engines have been around during this explosion of computational power, continuously evolving to take advantage of new algorithms, processor architectures, increasing parallelism and, more recently, dedicated accelerator hardware. Over time, they have developed from instruments for specialists to proven and tested tools in the arsenal of practitioners in physics, chemistry, and materials science.
Records of the pervasive use of these tools can be found in the scientific literature. In a 2014 survey, van Noorden et al. found that 12 of the top 100 most cited papers of all time were on density-functional theory [5]. As with other examples in van Noorden's list, the flood of citations is indicative of papers being cited by the many practitioners (here: of density-functional theory) rather than the few method developers. If one focuses the analysis on articles published in physics journals, the footprint of density-functional theory grows even further: for example, Table 1 shows the top ten most cited papers published by journals of the American Physical Society. All of them are related to density-functional theory and its application.
Table 1. Top ten most highly cited articles published by the American Physical Society, all of which deal with density functional theory and its practical application. Data collected from the Web of Science on June 16th, 2021.
There used to be a time when it was commonplace for computational condensed-matter physicists and quantum chemists to write their own electronic-structure code, and many of the atomistic-simulation engines that are in broad use today started this way. Over the years, however, many of these engines have developed into complex software distributions. Table 2 shows counts for the lines of code in some of today's popular open-source simulation engines: they range from hundreds of thousands to millions of lines of code, typically written in Fortran or C++, with similar numbers being reported for commercial packages [19]. While statistics like these are by no means accurate measures of code complexity (and developers follow different approaches to packaging and outsourcing of functionality to external libraries), they nevertheless suggest that many of these code bases are too large to be sustained by any single person.
This poses important questions for how to sustain these software projects going forward: questions of funding, business models, and software licenses. Proponents of the open-source route argue that it democratizes research and education by removing barriers for both users and developers, and that science carried out with commercial software is harder to verify and reproduce [20,21,1]. The open-source model can also be adopted irrespective of the size of a code's target user group, while commercial activity tends to require a minimum market size. On the other hand, the strong focus on innovation and development often found in open-source scientific software can negatively impact usability and quality of documentation [22]. Proponents of commercial licenses argue that making academic software accessible to a broader community is a technical task best left to professional software engineers, and that the resulting gain in scientific productivity can easily outweigh the license fees that pay for it [19]. We note that open-source and commercial activity do not necessarily exclude each other: numerous software companies are built around an open-source core, and we are starting to see first examples in atomistic modelling as well (e.g. Molcas/OpenMolcas [23]). Overall, it has been suggested that "scientific publications are a more sound metric [of the scientific impact of software] than either the price of a product or whether its source code is available in the public domain" [19]. Providing a peek into this citation record is one of the reasons for creating the atomistic.software collection.
The other reason is a practical one: when young scientists start their first research project in atomistic simulations, they often have no grasp of the extent of this software ecosystem, let alone of current trends in the field; at least this is the personal experience of the authors of this review. Software choices in research groups are therefore often informed by what other members of the group already use. This makes sense: colleagues have vetted the code for the type of problems the research group is working on, and built up expertise around which of the many knobs to turn in order to find the sweet spot between efficiency and accuracy.
But what if that code is no longer actively developed? What if there was another code that was better suited to solve the specific research problem at hand? That had a larger user/developer community? Was free instead of commercial? Was open source instead of closed source? The goal of the atomistic.software collection is to provide a comprehensive overview of all major atomistic simulation engines (cf. Figure 2), and to help newcomers to the field as well as experienced practitioners and software developers find better answers to some of these questions.
Readers interested mainly in the results are invited to go straight to the atomistic.software website, which displays the data discussed in this review. For those wanting to know more, the following sections provide details on the methodologies used, and discuss some of the trends that can be observed.

The atomistic.software collection

Overview
The atomistic.software collection draws upon existing lists of atomistic simulation codes [24,25,26,27,28], in particular the "Major codes in electronic-structure theory, quantum chemistry, and molecular-dynamics" list [29] maintained by the NOMAD Centre of Excellence from 2017 to 2019. It enriches these with annual citation data from the Google Scholar search engine, which provides an overview of the current usage landscape as well as ongoing trends, both at the level of individual codes and at the ecosystem level.
The overview table (Fig. 3) lists all codes in the data set, ordered by how often they are referenced by articles indexed in Google Scholar. Clicking on the citation count opens the corresponding query on Google Scholar. Users can filter codes by the methods or basis sets they use, and select only codes that are commercial, free or open-source (Fig. 4). Hovering with the mouse over any abbreviation in the list opens a tool-tip with an explanation. Further metadata include information on which range of the periodic table the code covers, available installation routes, support for parallelization/acceleration, support for standard APIs, and the availability of benchmarks. In order not to clutter the interface, not all metadata is displayed by default, but columns can be added or removed via the "View Columns" button.
Besides the generic overview, each engine comes with a citation "trend" over the last couple of years, which serves as an indicator of how its user community has developed over time (Figure 5).
Finally, the statistics page looks at the top codes by citation growth, indicating a rapidly growing user community.Ranking by absolute growth naturally favors established codes, while considering relative growth provides insight into the dynamics of new contenders in the list.Ideally, codes rank highly in both metrics as is currently the case, e.g., for Desmond [30] and OpenMM [15].
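The difference between the two rankings is easy to see in a small calculation. The sketch below uses invented citation counts for two hypothetical codes; the function name `citation_growth` and all numbers are ours and do not come from the data set.

```python
from typing import Tuple


def citation_growth(previous: int, current: int) -> Tuple[int, float]:
    """Return (absolute, relative) citation growth between two years."""
    if previous <= 0:
        raise ValueError("relative growth is undefined without prior citations")
    absolute = current - previous
    relative = absolute / previous
    return absolute, relative


# Hypothetical counts for an established code and a newcomer.
established = citation_growth(previous=10_000, current=10_500)
newcomer = citation_growth(previous=100, current=180)

print(f"established: +{established[0]} citations ({established[1]:.0%})")
print(f"newcomer:    +{newcomer[0]} citations ({newcomer[1]:.0%})")
```

In this made-up example the established code wins on absolute growth, while the newcomer wins on relative growth, which is why the two rankings are shown side by side.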
A word of caution: the popularity of a code is a function of many variables (starting, e.g., with the size of the target audience), so please do not choose the code for your next research project merely based on its ranking in this list. atomistic.software links to the Google Scholar query for the papers citing the code, thus making it quick and easy to get an impression of the research the code is currently used for. Yet, certain aspects of a code are likely to correlate with popularity, such as how extensively the software has been used and tested, how many Q&A resources one is likely to find online, or how much tooling is likely to exist around the code. From a software-developer perspective, knowing the popularity of a code can be useful for gauging the potential impact of supporting the code in your own tool (such as a workflow manager, visualization software, ...). And, finally, the citation trend provides an interesting peek into the future: is the user community growing, stagnating or decreasing?

Scope
atomistic.software uses the following working definition of an atomistic simulation engine: a piece of software that, given two sets of atomic elements and positions (and, possibly, bond network), can compute their relative internal energies. In almost all cases, engines will also be able to compute the derivative of the energy with respect to the positions, i.e. the forces on the atoms, and thus be able to perform tasks like geometry optimizations or molecular dynamics. This covers the Density-Functional Theory (DFT), Wave-Function Methods (WFM), Quantum Monte Carlo (QMC), Tight-Binding (TB), and Force-Field (FF) categories. Codes in the Spectroscopy (S) category are not necessarily simulation engines in the above sense, but compute the response of a given atomic structure to an external excitation (via photons, electrons, ...).
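To make the working definition concrete, the following toy "engine" computes the energy of a set of atomic positions and the forces as the negative derivative of that energy. It is a deliberately minimal sketch and not taken from any code in the collection: it assumes a single atomic species (so elements are ignored), a Lennard-Jones pair potential in reduced units, and no periodic boundary conditions. The names `lj_energy` and `lj_forces` are ours.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def lj_energy(positions: Sequence[Vec], epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Total Lennard-Jones energy: sum over pairs of 4*eps*[(s/r)^12 - (s/r)^6]."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy


def lj_forces(positions: Sequence[Vec], epsilon: float = 1.0, sigma: float = 1.0) -> List[List[float]]:
    """Forces on each atom, i.e. minus the gradient of lj_energy w.r.t. its position."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = [a - b for a, b in zip(positions[i], positions[j])]  # r_i - r_j
            r = math.sqrt(sum(c * c for c in d))
            sr6 = (sigma / r) ** 6
            # Radial derivative of the pair energy, dE/dr.
            dE_dr = 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r
            for k in range(3):
                f = -dE_dr * d[k] / r  # force on atom i along axis k
                forces[i][k] += f
                forces[j][k] -= f      # equal and opposite force on atom j
    return forces


# Two structures of the same dimer: at the potential minimum and stretched.
r_min = 2.0 ** (1.0 / 6.0)
dimer_min = [(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)]
dimer_stretched = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

relative_energy = lj_energy(dimer_stretched) - lj_energy(dimer_min)
print(f"E(stretched) - E(minimum) = {relative_energy:.4f} (reduced units)")
print("forces at minimum:", lj_forces(dimer_min))
```

The relative energy of the two structures is the quantity named in the definition; the forces are what a geometry optimizer or molecular-dynamics integrator would consume. Real engines differ mainly in how the energy is obtained (DFT, WFM, QMC, TB or FF), not in this basic interface.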
atomistic.software aims to be a comprehensive list of all major atomistic simulation engines, with annual updates going forward. Since there is a long tail of simulation engines with a limited user base, a relevance criterion is introduced in order to keep maintenance of the list manageable. The criterion has been set to having at least one year with 100 citations or more. The value of 100 is not set in stone and could be re-evaluated in the future, once the list has had some time to consolidate. A "watch list" is kept of codes that do not yet meet the criterion.
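Expressed in code, the relevance criterion is a one-line check. The sketch below is our own illustration with invented citation counts; it is not the implementation used by the web application, and the names `CITATION_THRESHOLD` and `meets_relevance_criterion` are hypothetical.

```python
from typing import Dict

CITATION_THRESHOLD = 100  # value stated in the text; may be re-evaluated


def meets_relevance_criterion(citations_by_year: Dict[int, int],
                              threshold: int = CITATION_THRESHOLD) -> bool:
    """True if at least one year reaches the citation threshold."""
    return any(count >= threshold for count in citations_by_year.values())


# Invented example data for two hypothetical codes.
listed_code = {2018: 80, 2019: 120, 2020: 95}
watch_list_code = {2018: 10, 2019: 40, 2020: 99}

print(meets_relevance_criterion(listed_code))      # one year with >= 100
print(meets_relevance_criterion(watch_list_code))  # no year with >= 100
```

Note that a single qualifying year is sufficient, so a code stays listed even if its citations later drop below the threshold.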

Methodology & Limitations
Approximate citation counts are obtained from Google Scholar as follows:
1. Search for the name of the code and the last name of a representative developer who is a coauthor of all key publications on the software (vast majority of codes). If no such coauthor exists, this is easily extended to searching for the presence of one of multiple author names (or a company name for commercial codes).
2. When the name of the code is too common a search term, additional search terms may be added or citations of a major reference article are counted (minority of codes).
Google Scholar was chosen over alternative sources like the Web of Science or Scopus, since it provides full-text search and is available for free, thus enabling direct links to the queries. The supporting information contains a case study comparing citation counts from the search-based Google Scholar approach against counting the citations of reference papers (both in Google Scholar and the Web of Science).
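The search strategy of step 1 can be sketched as a small query builder. This is a hypothetical illustration, not the project's actual query format: the code name and author names are invented, the quoting and `OR` conventions are assumptions about how one might phrase such a search, and the `as_ylo`/`as_yhi` year-range parameters are assumed from the URLs that Google Scholar produces when a custom date range is selected.

```python
from typing import Sequence
from urllib.parse import urlencode


def scholar_query(code_name: str, authors: Sequence[str],
                  extra_terms: Sequence[str] = ()) -> str:
    """Build a full-text search string: code name plus any one of the authors."""
    terms = [f'"{code_name}"']
    if authors:
        # Any one of several author (or company) names may match.
        terms.append(" OR ".join(f'"{name}"' for name in authors))
    terms.extend(extra_terms)
    return " ".join(terms)


def scholar_url(query: str, year: int) -> str:
    """Link to the search restricted to a single publication year (assumed parameters)."""
    params = {"q": query, "as_ylo": year, "as_yhi": year}
    return "https://scholar.google.com/scholar?" + urlencode(params)


# Hypothetical code "ExampleCode" with representative developer "Doe".
query = scholar_query("ExampleCode", ["Doe"])
print(query)
print(scholar_url(query, 2020))
```

Keeping the query as a plain string is what makes it possible to link each citation count directly to the search that produced it.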
Owing to the lack of standardization in today's software citation practices [31,32], the citation counts reported here are necessarily approximate. Shortcomings include the following:
• While spot checks have been performed to weed out false matches (and reports on the ltalirz/atomistic-software GitHub repository are highly welcome), details of the query can have significant impact on the number of results. This means, in particular, that the ranking by absolute number of citations is not set in stone and may be subject to change if more accurate search terms are identified.
• Citation counts reported by Google Scholar are not entirely static, even for years that lie in the past. Reasons may include new publishers being indexed, more text being extracted, different citations being disambiguated, or even changes to the heuristic that predicts the total number of results. In our experience, citation data for the previous year can be subject to significant (upwards) fluctuation, while citation data for years further in the past are quite stable. For this reason, for each data point the date of collection is recorded in the source code repository.
• Counting citations does not directly measure how often simulation codes are used but how often they are referenced in the scientific literature. This may involve some systematic bias, for example if popular codes are more likely to be mentioned without being used, or if a software package targets industrial users who may be less likely to publish their results.
The caveats listed above mainly affect the absolute number of citations reported, and thus the ranking of codes. Citation trends on the individual code level should be more robust, and potential shortcomings in that domain (e.g. missing citations to a new reference paper with different authors) can be addressed by adapting the corresponding query.
The categorization of codes in terms of methods, tags and licenses is an evolution of the classification devised by the NOMAD list [29]. For the sake of this data set, the following terminology has been adopted:
• commercial: payment required to obtain the software
• free for academic use: free for academics around the world
• free: free to use for anyone, possibly after registration
• source available: source code available either for free or against payment
• open-source: open-source license approved by the Open Source Initiative (OSI, https://opensource.org/)
We note that license terms can (and sometimes do) change over time. This is currently not reflected in this data set (only the latest license terms are recorded), but could be taken into account in future updates.
All data, as well as the source code of the web application running on atomistic.software, are hosted in the ltalirz/atomistic-software GitHub repository. The data is released under version 4.0 of the Creative Commons Attribution-ShareAlike International License (CC BY-SA). The web application is written in JavaScript using the React framework (reactjs.org) and released under version 3 of the Affero General Public License (AGPL).

Trends
Extensive cross-checks of atomistic.software against other lists [24,25,26,27,28] suggest that the collection is already fairly complete, and can thus enable a look at the landscape of atomistic simulation software as a whole. Today's atomistic simulation engines are highly sophisticated pieces of software that each take many human-years of development, and developers have chosen different routes to support these efforts: from commercial to free, from closed source to open, and many shades of grey in between. One question we can ask is: how do commercial codes fare versus their free competitors?
Figure 7 compares the compound citations to commercial and free codes.It illustrates that commercial codes are alive and well: they are ahead in terms of citations gathered, and have been ahead throughout the last decade, with Gaussian [33] and VASP [34] together accounting for more than half of all citations of the 23 commercial codes.At the same time, citations of the 40 free codes (including those that are only free for academic use) have been growing roughly at the same absolute rate, mostly driven by the codes that are free for general use.
We note some caveats that apply to this statistic:
• The current dataset only records the latest license conditions, while some codes (e.g. CASTEP [35] or Dalton [36]) have moved to more open license terms over time, thus switching categories.
• For codes that are free for academic use only, some researchers may prefer to use the commercial version (e.g. using CASTEP through Biovia's Materials Studio software [37]).
It seems safe to conclude, however, that, while commercial codes remain highly popular, free codes are slowly gaining market share.
Another important question concerns source-code availability, which is relevant for the ability of researchers to independently verify published calculations and pin down bugs. Figure 8 shows that consistently at least ∼90% of citations went to engines whose source code is available, strongly dominating over the 12 closed-source codes, whose citations have stagnated during the 2010s (one notable exception is the closed-source ORCA code [38], which is free for academic use). We recall here that counting citations in the scientific literature places a focus on the usage in academia, and that usage patterns in industry may differ from the trends identified here.
While citations to source-available engines have grown by ∼130% since 2010, citations to the 24 open-source engines within that group rose by >300% within the same time frame, gaining market share. In this context, it is useful to recall that the development of several engines on the list predates the open-source movement and the creation of many of the open-source licenses that are in broad use today (see Table 3). atomistic.software distinguishes between
• copyleft open-source licenses, such as the GPL family, which, under specific circumstances that differ significantly between the GPL and the LGPL, require derivative software to be distributed under the same open-source license (thus also called share-alike or viral licenses), and
• permissive licenses, such as the BSD, Apache, and MIT licenses, which permit relicensing of derivative works.
The enforced sharing of improvements in derivative works can be a competitive advantage of adopting a viral license. It can also be one path towards financial revenue when companies seek a separate license agreement that allows them to keep derivative works proprietary. Other developers may want to maximize impact of their software by lowering the barrier for adoption across the board, and thus prefer permissive licenses. Overall, the choice of license is highly nuanced and an extensive discussion is beyond the scope of this article (interested readers are referred to choosealicense.com), but it may be instructive to observe the choices made by the codes in the collection.
Out of the 24 open-source codes in the atomistic.software collection, the majority (20) adopt the GPL or LGPL license. The four codes that are distributed under permissive licenses (NWChem [6], OpenMM [15], RASPA [39] and PySCF [40]) either switched to this licensing scheme in the late 2000s or 2010s or started being developed during that time. This indicates that the use of permissive licenses is a recent phenomenon in the space of atomistic simulation engines, and may follow in the footsteps of the open-source community at large, which is exhibiting a similar trend: according to an analysis of over 4 million open-source packages by WhiteSource [41], the use of permissive open-source licenses has nearly doubled from 41% in 2012 to 76% in 2020, with the Apache and MIT licenses alone accounting for more than half of all licenses that year.
So much for the differences between licensing models. Overall, citations of atomistic simulation engines in the collection have grown at a compound annual growth rate of ∼8%, roughly twice the 4% growth rate seen in the publication of peer-reviewed articles in science and engineering over the last decade [42]. While part of this difference may reflect changing citation practices (according to Mammola et al., the length of reference lists in ecology journals has been increasing by ∼2% per year over the last two decades [43]), it likely indicates an increasing adoption of (atomistic) computational materials science throughout the scientific literature.
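For readers less familiar with compound growth rates, the sketch below shows how such a rate is obtained from two citation totals and what the quoted rates imply in terms of doubling time. The function names and the example totals are ours; only the rounded rates of 8% and 4% come from the text.

```python
import math


def compound_annual_growth_rate(first: float, last: float, years: int) -> float:
    """Constant yearly rate that turns `first` into `last` after `years` years."""
    return (last / first) ** (1.0 / years) - 1.0


def doubling_time(annual_rate: float) -> float:
    """Number of years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Hypothetical totals: citations doubling over nine years.
rate = compound_annual_growth_rate(first=100.0, last=200.0, years=9)
print(f"growth rate: {rate:.1%} per year")

# Rounded rates quoted in the text.
print(f"doubling time at 8%: {doubling_time(0.08):.1f} years")
print(f"doubling time at 4%: {doubling_time(0.04):.1f} years")
```

At these rounded rates, citations to the engines double roughly every nine years, about twice as fast as the overall volume of publications.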

Conclusions & Outlook
At the time of writing, the atomistic.software collection contains over 60 simulation engines that each gather >100 citations per year, some several hundreds or thousands. Overall, this review paints a bright future for the field of atomistic simulation: a growing variety of both commercial and free software to choose from, citation growth rates that substantially outpace the rest of the scientific literature, and a forecasted trillion dollar market potential for a digitally-driven materials revolution [44]. There are, however, some challenges ahead as well.
The continued slowdown in single-core performance scaling creates a powerful driving force for the specialization of computer hardware. Small- to medium-size development teams often lack the expertise or the resources to adapt their code base to an ever growing number of hardware accelerators and are at risk of falling behind. One way of approaching this issue is to try and identify low-level, performance-critical primitives that are needed by multiple codes. These primitives can then be bundled into domain-specific libraries, such as libxc [45], libint [46], ELSI [47], SIRIUS [48], M-A-D-N-E-S-S [49], or TiledArray [50], that are ported to and optimized for the various accelerator architectures by HPC specialists.
Another issue that requires attention is the one of software citation. With the increasing role that software plays in advancing science, it is crucial that credit for the creation of software is attributed adequately and accurately. One reason why the citation counts in atomistic.software are approximate is that software citation in the field of atomistic simulations comes in many different forms:
• references to papers that summarize recent developments of the software,
• references to papers that describe the implementation of specific methods,
• references to the home page of the code, or even just
• mentioning the code by name in the main text, possibly followed by a key author or company in parentheses,
similar to what has been found in the field of biology [51]. Furthermore, there is no standardized way of expressing whether a specific version of the software was used or whether the software was referenced as a general concept, e.g. as part of an enumeration of different codes like in this review. In order to encourage adoption of a consistent policy for software citation across disciplines and venues, in 2016 the software citation working group of the FORCE11 coalition (force11.org) issued detailed software citation principles [52], which include the recommendation of citing a unique, persistent identifier that indicates which version of the software has been used. Today, the technological infrastructure for creating these identifiers is in place: for example, for software hosted on github.com, the Zenodo-GitHub integration [53] automatically stores the source code of each software release on the Zenodo repository operated by CERN, and mints a digital object identifier (DOI) for it.
Figure 9 illustrates how citing such a DOI works from the user's perspective: when code developers place the DOI badge offered by Zenodo in the "how to cite" section of their documentation, users can click on it and be redirected to the landing page of the accompanying Zenodo record. There, they select the DOI corresponding to the version they used or, if they are referring to the software in general, they can cite the "concept DOI" of the software that represents all versions and always resolves to the latest one. Finally, they can select the desired citation style and copy the citation into their manuscript or download the citation in a format supported by their reference manager.
At least three codes in the atomistic.software collection (PySCF [40,55], OpenMM [15,56], and xtb [16,57]) have already enabled the Zenodo-GitHub integration, but none of them mention this in their citation recommendations yet, effectively reducing the functionality of the integration to that of a future-proof backup of individual software versions. This is a common theme found across software records on Zenodo today [31]. Part of the reason may be that Zenodo records are not (yet) indexed by the widely used scholarly search engines, such as the Web of Science, Scopus or Google Scholar. But as researchers are getting increasingly accustomed to using platforms like Zenodo, the Open Science Framework (osf.io) or Figshare (figshare.com) for depositing and citing data sets, it seems to be just a matter of time until analogous practices in software citation reach the mainstream. Zenodo already tracks citations of its records through publicly available sources such as Crossref and Europe PubMed Central (and displays them on the record). Trailblazing developers can therefore recommend that their users cite a version-specific Zenodo DOI in addition to a review paper, and thereby get valuable statistics on which versions of their software are being used in return. It can also be a convenient way of making a new code citable before a paper has been written on it.
As for the atomistic.software collection, this review only marks the beginning. Going forward, the collection will receive annual updates, including updates of this perpetual review when warranted. Possible directions for further work include
• adding any simulation engines that were missed,
• recording the time evolution of licenses at the level of individual codes, and
• potentially evolving the scope of the collection, e.g. to include software for atomistic visualization or workflow management (although care would need to be taken in order not to lose focus).
Suggestions for future directions as well as updates and corrections of engine metadata (search keywords, tags, distribution channels, accelerator support, supported APIs, benchmarks, ...) are highly welcome, be it through public discussions on the ltalirz/atomistic-software issue tracker, via pull requests to the repository or via private communication to the authors.

Figure 1. Performance of the top 500 supercomputers in the world from 1993 to 2020 in solving linear equations (Linpack benchmark), measured in 64-bit floating point operations per second (FLOP/s). Shown are the sum of the entire list (green dots), the performance of the top machine (brown triangles), and performance of the bottom of the list (blue squares), together with least squares fits. Adapted from top500.org/statistics/perfdevel.

Figure 2. Highly cited atomistic simulation engines in the scientific literature. Font size scaled (approximately) by the number of citations during the year 2020 as reported by Google Scholar.

Figure 3. Overview table of atomistic simulation engines, sorted by how often they are referenced on Google Scholar during the previous year (here: 2020).A drop-down menu provides access to annual citation data reaching back to the year 2010.

Figure 4. Filtering for density-functional theory codes with both permissive (P) and copyleft (CL) open-source licenses.

Figure 5. Citation trend for the Quantum ESPRESSO code.

Figure 6. 2020 rankings for absolute and relative citation growth with respect to the previous year.

Figure 7. Citations of commercial vs free codes, including a breakdown into those free for general use and those free for academic use only.

Figure 8. Citations by source code availability. "Source available" includes all engines whose source code can be obtained for free or for a fee. "Open-source" includes only OSI-approved licenses.

Figure 9. User flow for citation via Zenodo-GitHub integration. (a) User clicks on DOI badge in the README file of the source code repository or in the documentation of the software. (b) User is redirected to the landing page of the Zenodo software record, where they can pick the version they used. (c) User selects the desired citation style and copies the citation [54] into their manuscript or downloads the citation in a format supported by their reference manager.

Table 2. Counting lines of code for 11 popular open-source atomistic simulation engines (and the Linux kernel for comparison), using the latest releases as of June 2021. Line counts are determined by cloc v1.6.0 [18] and exclude blank lines, comments, and markup languages (detailed reports in the supporting information). Contributors for the year 2020 were determined by counting the number of different committers to the source code from January 1st 2020 to January 1st 2021 (numbers for the Linux kernel are from 2019). * Roughly 3 million lines of code of NWChem are computer-generated.

Table 3. OSI-approved open-source licenses used in the collection. See the SPDX license list at spdx.org/licenses for the full license terms corresponding to the abbreviations.

Author Contributions
Leopold Talirz: Conceptualization, Methodology, Software and Data curation for atomistic.software. Writing - Original Draft, Reviewing and Editing.
Luca M. Ghiringhelli: Conceptualization, Methodology and Data curation for the original static version of the collection. Writing - Reviewing and Editing.
Berend Smit: Supervision, Writing - Reviewing and Editing, Funding acquisition.