GitHub and Weights & Biases introduce the CodeSearchNet Challenge evaluation environment and the CodeSearchNet Corpus
Yesterday, the team at GitHub, along with its partners from Weights & Biases, introduced the CodeSearchNet Challenge evaluation environment and leaderboard. The team is also releasing a large dataset to help data scientists build models for this task, together with several baseline models that highlight the current state of the art.
Semantic code search is the task of retrieving relevant code for a given natural language query. While it is related to other information retrieval tasks, it also requires bridging the gap between the language used in code and natural language. Standard information retrieval methods do not work well in the code search domain, because search terms and results usually share little vocabulary. Evaluating methods for this task is also difficult, as no substantial datasets have been created for it.
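To illustrate the vocabulary gap described above, here is a minimal sketch that measures token overlap between a query and a code snippet. The query, the snippet, and the helper names `tokens` and `jaccard` are made up for illustration and are not taken from the CodeSearchNet release.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Return lowercase alphabetic tokens, splitting camelCase and snake_case."""
    # Insert a space at lower-to-upper boundaries so "readFile" becomes "read File".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical query and a snippet that a human would judge relevant to it.
query = "how to read a text file line by line"
code = "with open(path) as fh:\n    for row in fh:\n        process(row.rstrip())"

overlap = jaccard(tokens(query), tokens(code))
print(f"token overlap: {overlap:.2f}")
```

In this example the snippet answers the query yet shares no tokens with it, which is the situation in which purely keyword-based matching has nothing to work with.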
To address these issues and to track progress on code search, the team is releasing the CodeSearchNet Corpus and presenting the CodeSearchNet Challenge. The challenge consists of 99 natural language queries and around 4k expert relevance annotations.
The CodeSearchNet Corpus
The CodeSearchNet Corpus contains automatically generated, query-like natural language for around 2 million functions. It also includes metadata indicating the original location where the data was found.
CodeSearchNet Corpus collection
The team collects the corpus from publicly available, open-source, non-fork GitHub repositories, using libraries.io to identify all projects that are used by at least one other project. They then sort these projects by ‘popularity’, as indicated by the number of stars and forks. Projects that have no license, or whose license does not allow the re-distribution of parts of the project, are removed.
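The selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's pipeline: the `Repo` record, the `REDISTRIBUTABLE` allow-list, and the use of stars plus forks as the popularity score are all hypothetical, since the article does not say how the two counts are combined or which licenses qualify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Repo:
    """Hypothetical record for one repository (fields are assumptions)."""
    name: str
    is_fork: bool
    dependents: int          # number of other projects that use this one
    stars: int
    forks: int
    license: Optional[str]   # license identifier, or None if there is none


# Assumed allow-list of licenses permitting redistribution (illustrative only).
REDISTRIBUTABLE = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def select_repos(repos: List[Repo]) -> List[Repo]:
    """Keep non-fork, depended-upon, redistributable repos, most popular first."""
    kept = [
        r for r in repos
        if not r.is_fork
        and r.dependents >= 1
        and r.license in REDISTRIBUTABLE
    ]
    # Assumption: popularity is approximated by stars + forks.
    return sorted(kept, key=lambda r: r.stars + r.forks, reverse=True)


candidates = [
    Repo("alpha", False, 12, 900, 100, "MIT"),
    Repo("beta", True, 40, 5000, 800, "MIT"),            # fork: dropped
    Repo("gamma", False, 0, 300, 20, "Apache-2.0"),      # no dependents: dropped
    Repo("delta", False, 3, 2000, 150, None),            # no license: dropped
    Repo("epsilon", False, 7, 1500, 400, "BSD-3-Clause"),
]

selected = [r.name for r in select_repos(candidates)]
print(selected)
```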
The CodeSearchNet Challenge
The team collected an initial set of code search queries for evaluating code search models. They started with common search queries from Bing that had high click-through rates, and then combined these with queries from StaQC. The team manually filtered out queries that were clearly ‘technical keywords’ to obtain a set of 99 natural language queries.
The team then used a standard Elasticsearch installation and baseline models to obtain 10 results per query from the CodeSearchNet Corpus. They asked data scientists, programmers, and machine learning researchers to annotate the results for relevance to the query. To be evaluated on the CodeSearchNet Challenge, a method should return a set of results from the CodeSearchNet Corpus for each of the 99 pre-defined natural language queries.
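As a rough sketch of how returned results could be scored against relevance annotations, the example below computes normalized discounted cumulative gain (NDCG) per query and averages it. The article does not state the scoring formula, so the choice of NDCG, the linear gain, the 0 to 3 relevance scale, and all identifiers here are assumptions made for illustration.

```python
import math
from typing import Dict, List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_ids: List[str], annotations: Dict[str, float]) -> float:
    """NDCG of one ranking; results without an annotation count as relevance 0."""
    gains = [annotations.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg(sorted(annotations.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(results: Dict[str, List[str]],
              annotations_by_query: Dict[str, Dict[str, float]]) -> float:
    """Average NDCG over all annotated queries; a missing query scores 0."""
    scores = [ndcg(results.get(q, []), ann)
              for q, ann in annotations_by_query.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical annotations: query -> {function id -> relevance on an assumed 0-3 scale}.
annotations_by_query = {
    "convert int to string": {"f1": 3, "f2": 1, "f3": 0},
    "read csv file": {"g1": 2, "g2": 0},
}

# Hypothetical output of a search method: query -> ranked function ids.
results = {
    "convert int to string": ["f1", "f2", "f3"],   # ideal ordering
    "read csv file": ["g2", "g1"],                 # relevant result ranked second
}

score = mean_ndcg(results, annotations_by_query)
print(f"mean NDCG: {score:.3f}")
```

A method that ranks the annotated results in ideal order for every query would score 1.0 under this sketch, and one that returns nothing relevant would score 0.0.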