Extract the wealth of knowledge trapped inside your code base.
When most people think about a company's reusable assets, source code doesn't usually show up on the list, even though millions of dollars are spent every year on creating and maintaining code. Most large companies are managing hundreds of millions of lines of code—the majority of which was purpose-built to solve a specific application problem. Most of that code is locked up in source control management systems (SCMs) specific to an application or a siloed organization.
Add to this the world of open-source software development where similarly billions of lines of code exist, but where source code is shared publicly and regularly reused—both wholesale and through forking. Here too, plenty of effort and resources are spent in writing and maintaining source code. Source code is maintained, extended and reused by a large number of developers. And, like enterprise code, open-source code also is stored in various source code repositories.
Collectively, the code that lives in internal SCMs across large organizations, together with the billions of lines of code that exist in the open-source world, reflect the implementation artifacts of literally millions of developers. These artifacts can be used as powerful resources to assist with the design, development, analysis and problem-solving of future applications.
But, how can we leverage this massive resource?
A code search engine is a tool that can help developers unlock the wealth of diverse implementation knowledge buried inside large repositories. A code search engine facilitates search operations that are specific to source code and applies analysis and heuristics specific to source code while processing, indexing and retrieving source code. A source code engine, unlike general text-based search engines, is designed and implemented especially to cater to developers' information needs related to source code. With these features, a code search engine facilitates source code search.
Source code search (or simply, code search) is a technique to find relevant source code in multiple source code repositories. Code search can help fulfill commonly occurring search needs during development tasks, such as finding the usage of APIs across different projects, finding how a known information structure is implemented in code (such as base 64 encoding) and so on. What a developer finds useful in code search results depends on the search need at hand. An effective code search engine facilitates fulfilling such search needs by delivering relevant results and providing the means to explore and narrow down search results in cases where the need is vague and unclear. Given that alternative choices are available in the results, a code search engine can act as a choice engine by allowing code-specific faceting and filtering mechanisms.
Enterprise code search is code search as applied inside a company's firewall, searching corporate source code repositories. Enterprise code search must adhere to additional enterprise requirements, such as authorization and access policies on source code visibility. This poses additional requirements and challenges when considering a code search engine for an enterprise, since the search tool has to meet the company's standards and needs to fit with existing IT, enterprise tools and deployment procedures in place.
Developers frequently use code search tools for copy-paste programming. Best practices developers frequently seek to reuse existing solutions, and once implemented, a common solution to a problem (such as a well-known algorithm) can be used again and again. Copying and pasting code from an existing solution, when legally permissible, often can be the most efficient approach, saving developers time and resources to focus on more challenging tasks. A code search engine can be an ideal tool to find such solutions. Although there certainly are reservations against practices like copy-paste programming, some of which are reasonable (for example, one might not be able to trust someone else's code blindly), code search engines deployed inside enterprises can winnow down results to internal projects that reveal code written by experts, helping to alleviate such concerns while still permitting the much-practiced copy-paste programming.
Developers are not always looking for exact lines of code to copy and reuse. More often they seek useful patterns they can add to their repertoire of knowledge to solve recurring tasks. For example, while using APIs, developers need to learn the patterns of API usage. Today's applications frequently leverage API calls to other internal or external components. The typical API has little documentation and few good examples, so it can be frustrating and time-consuming for developers to figure out how to use them successfully. Two easy answers to this problem would be either to enable developers to see examples of how other developers have used an API or to provide visibility into the code behind the API. To accomplish this, developers need an easy way to search and view an API call or other code that calls the API. A code search engine allows developers to accomplish this task easily. Code samples and examples are vital learning tools for developers who often will copy and modify existing examples to fit their purposes. A code search engine lets developers use existing code repositories as sources of examples—in the above case, sources of API usage examples.
A code search engine can be helpful in various other scenarios. When starting a new project with new languages and frameworks, developers would benefit by researching and studying the code bases of mature projects using the same languages and frameworks. Open-source implementations can be a great way for developers to learn solutions to complex computing problems, such as implementing distributed systems, search engines, network servers and so on. Code search engines in the enterprise also can be extremely helpful during normal development activities, such as maintenance, porting and working with legacy code. Code search engines can be used to index and cross-link files spanning multiple types and languages, thus supporting traceability in the search results. Developers can use code search during maintenance to find source files, unit tests and configuration files related to a particular feature.
The use cases presented above demonstrate the potential benefits of a code search engine, but these benefits cannot be realized unless the code search engine is effective and efficient. Code search results must be relevant, comprehensive and meet the users' information need for the tool to be effective. It must be designed with the features and capabilities needed by a wide range of developers who are under constant pressure to work more efficiently and cost effectively. To be efficient, the code search solution must be capable of delivering effective results within acceptable response times by having the capacity to scale to very large repositories.
Source code, unlike plain or natural language text, tends to be very sparse. This poses a serious challenge in building effective code search engines if one resorts only to techniques that work for natural text. The lack of rich vocabulary in code has to be compensated with additional attributes that can be leveraged and would exist only in source code. One such attribute is the rich structural information that exists in source code. Unlike natural text, source code is highly structured with definitions of various nested elements and relations between these elements. For example, in a typical object-oriented program, one would find classes and methods, where classes extend to other classes, and method calls to other methods. A code search engine needs to parse source code to extract such elements to provide search operators that specifically allow the retrieval of these elements. For example, when a developer needs to find a certain method name, an operator (such as mdef in ohloh.code) easily can deliver effective search results on such a query.
This rich interlinked structure relates several elements with one another and can be the basis of accumulating similar terms when vocabulary is sparse. Similar to the Web, the link structure in code itself can be used to build new metrics of popularity and ranking, if used properly. There are several conventions (such as naming conventions) found in source code writing that are uncommon in natural text that make special tokenization and processing suitable for source code. (To learn more about these topics, refer to the author's doctoral dissertation: Facilitating Internet-Scale Code Retrieval at dl.acm.org/citation.cfm?id=2019966.)
For proper extraction of elements in source code and relations among such elements, a code search engine first needs to be able to detect the implementation language and perform detailed parsing of the code, which can be nontrivial for complex languages and for repositories where erroneous or incomplete code exists.
Beyond lexical and structural properties, source code has executable properties making it an executable artifact with runtime behaviors that change as the code evolves. Understanding such behavior is vital to activities like fixing bugs or improving performance. A code search engine can leverage the stored representations of runtime behavior as captured in test coverage reports, call traces, profiling outputs and logs, and relate them with appropriate elements defined in source code to provide answers related to unexpected behavior in code.
Finally, being produced and maintained by developers who work collaboratively, source code even has human-centric attributes. Since most of the activities on source code are logged in source repositories, a source code engine can tap into information connected to such activities to provide answers related to developers and their activities when needed. For example, it can help find an expert on a certain feature, or a developer tasked with managing a specific project can be notified when a certain portion of the code in question is changed.
An effective code search engine allows developers to extract, represent, store, mine and use these source code-specific attributes irrespective of the scale at which all such attributes can expand in size when applied to enterprise or Internet-scale source code repositories.
There are some important differences between enterprise and open-source code search. Open-source code search is done over code repositories found on the Internet and can be seen as an instance of Internet code search—developers searching for code on the Internet. Results can vary widely when searching one's own enterprise code base compared to searching open-source repositories. Inside an enterprise, it's likely there are more stringent code quality checks, better practices for using APIs and stricter code authorship attribution. These are just a few of the factors that can influence the examples developers can find when searching their enterprise code bases.
From a tool-builder's perspective, additional benefits of enterprise code search include tighter integration with ALM tools. Tool builders also can use code search to conduct more accurate analyses during indexing, because code in enterprise source code repositories could be quality controlled or automated to prevent erroneous and incomplete check-ins. In short, there are even more opportunities for us to explore leveraging the unique aspects of enterprise code base.
The usage of enterprise code search engines is still in the early adoption phase, so measuring the benefits can be a challenge. Without hard empirical data, these benefits are difficult to quantify but not impossible. Following are examples of how enterprises can assess the benefits:
As a productivity tool for developers: how much time and effort is spent on questions about code every day? How long does a developer have to wait to get an answer? How much time and effort could a developer save with code search tools, not only herself, but also for other members of the development team who collaborate with her and each other on a daily basis? With a code search tool, many such delays could be avoided, saving the valuable time of not only one but many developers.
Value of code search engine as a knowledge-enhancing tool: enhancing one's own knowledge is certainly invaluable, and if a code search engine works as a knowledge-building tool for developers, its value is already justified. To developers, source code is their literature, and a code search engine can act as a tool to navigate and master such literature.
More quantitative measures: there can be more quantitative and long-term means of measuring the benefits of a code search engine. Detailed tracking and logging of activities in the code search engine can lead to quantifiable discoveries of code reuse. Looking at activities over time (as permitted by honoring privacy concerns), such as searches, downloads and copy-paste events, enterprises can gain invaluable insights into their code base that can be applied to improving developer efficiency and software performance proactively.
Overall, as a team or a company, one can devise a strategy to measure the benefits of a code search tool by looking at things one can quantify, such as logs, and by understanding benefits that could be qualitative—by asking the end users, developers, managers and other collaborators to share how these tools benefit them individually and as a group.
Leveraging code artifacts from other developers can open up new opportunities for learning, code reuse and lowering the time and cost of software development and maintenance. The ability to search collections of large code repositories rapidly is fundamental to realizing these benefits. By putting more focus on leveraging code as a valuable learning asset, we can build upon the collective experiences within our industry to work more efficiently as innovators in developing new code.