A Retrieval Augmented Generation system to query the scikit-learn documentation

Rubber ducks have been used for many years to help Pythonistas in their everyday quest. At scikit-learn, we’ve taken the ducky to another level: come and meet the scikit-learn Ragger Duck, a RAG system designed to answer all your scikit-learn questions, at least as effectively as a duck can.

Currently, the scikit-learn website provides an “exact” search engine built on the tools provided by the Sphinx Python package (https://www.sphinx-doc.org/). This search engine is implemented in JavaScript and runs locally using an index built when the documentation is generated. The solution is lightweight and does not require any server to handle queries. However, it can only handle simple queries: since the search is “exact,” it is not robust to spelling mistakes, and it is intended for keyword-based searches.
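To make the spelling limitation concrete, here is a minimal sketch that contrasts an exact keyword lookup with a typo-tolerant one. It is a toy illustration, not the Sphinx implementation: the `index` dictionary, the page names, and the `exact_search` and `fuzzy_search` helpers are all hypothetical, and the fuzzy matching relies on `difflib` from the Python standard library.

```python
import difflib

# Hypothetical toy index mapping a keyword to documentation pages.
index = {
    "randomforestclassifier": ["ensemble.html"],
    "standardscaler": ["preprocessing.html"],
    "gridsearchcv": ["model_selection.html"],
}


def exact_search(query):
    """Return pages whose keyword equals the query (case-insensitive)."""
    return index.get(query.lower(), [])


def fuzzy_search(query, cutoff=0.8):
    """Return pages for the keywords most similar to the query."""
    matches = difflib.get_close_matches(
        query.lower(), list(index), n=3, cutoff=cutoff
    )
    return [page for key in matches for page in index[key]]


typo = "StandardScalar"  # misspelling of "StandardScaler"
exact_hits = exact_search(typo)  # the exact lookup finds nothing
fuzzy_hits = fuzzy_search(typo)  # the similarity lookup recovers the page
print(exact_hits, fuzzy_hits)
```

A single wrong letter is enough for the exact lookup to come back empty, whereas a similarity-based lookup still finds the intended entry.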

As large language models (LLMs) become more popular, we have been interested in experimenting with this technology, knowing that it could address some of the limitations stated above. As an open-source project, we have limited compute resources and limited available datasets; therefore, we discarded the option of fine-tuning an LLM and leaned towards retrieval augmented generation (RAG) systems.

This talk presents an experimental RAG system developed to query the scikit-learn documentation. As constraints, we restrict ourselves to an open-source software stack and open-weight models to build our system. The talk is organized as follows:

First, we provide some background on RAG systems and the pipeline to follow to implement one.

Then, we go into detail on the different stages of the RAG pipeline. We provide some insights into the documentation scraping strategies that we developed by leveraging the numpydoc and sphinx-gallery parsers. Next, we discuss the solutions that we tested to perform lexical and semantic searches. Finally, we explain how the retrieved context can be fed to the LLM to help generate an answer to the user query. We provide a small demo to compare queries performed on an LLM-only system and on the developed RAG system. All the code for the experiment is hosted at the following GitHub repository: https://github.com/glemaitre/sklearn-ragger-duck.
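The retrieve-then-prompt stages described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code from the repository: the `chunks` list, the scoring scheme (a simple TF-IDF-style term overlap), and the `retrieve` and `build_prompt` helpers are hypothetical. Only the lexical side is shown; a semantic search would replace `score` with a similarity between embedding vectors.

```python
import math
import re
from collections import Counter

# Hypothetical documentation chunks, standing in for scraped documentation.
chunks = [
    "StandardScaler standardizes features by removing the mean and "
    "scaling to unit variance.",
    "RandomForestClassifier fits a number of decision tree classifiers "
    "on sub-samples of the dataset.",
    "GridSearchCV performs an exhaustive search over specified parameter "
    "values for an estimator.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Compute a smoothed inverse document frequency for each token."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {
        tok: math.log((1 + n_docs) / (1 + count)) + 1
        for tok, count in doc_freq.items()
    }


def score(query, doc, idf):
    """Sum term frequency times IDF over the query tokens found in doc."""
    doc_tf = Counter(tokenize(doc))
    return sum(doc_tf[tok] * idf.get(tok, 0.0) for tok in set(tokenize(query)))


def retrieve(query, docs, k=1):
    """Return the k chunks with the highest lexical score for the query."""
    idf = build_idf(docs)
    ranked = sorted(docs, key=lambda doc: score(query, doc, idf), reverse=True)
    return ranked[:k]


def build_prompt(query, context):
    """Assemble the text sent to the LLM: instructions, context, question."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


query = "How do I scale features to unit variance?"
top_chunks = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

The resulting `prompt` string is what would be passed to the LLM; an LLM-only system would receive the question without the context block.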

Finally, we put into perspective the gains and pains of such a RAG system when it comes to integrating it into an open-source project. Notably, we question the hosting and cost of such systems and compare them with other approaches that could tackle some of the original issues.

Thursday, May 23

12:35 - 13:05

Lemaitre