SLM series - Tabnine: A working combination of SLMs, LLMs (and the case for RAG)
This is a guest post written for the Computer Weekly Developer Network by Ameya Deshmukh in his role as head of programmes for Tabnine.
Tabnine is an AI software development platform with the industry’s most contextually aware AI agents for every step of the SDLC.
Deshmukh writes in full as follows…
Last year we saw a number of Small Language Models (SLMs) released – but if small really is the new big, where does this leave enterprise teams with more models to manage than Dior?
One emerging use for SLMs is integration with Large Language Models (LLMs) in software engineering, bringing the power of both to bear on gnarly but essential undertakings such as code generation, debugging and documentation.
In a hybrid model of SLMs and LLMs, tasks are assigned to the model best suited to address them. LLMs adopt the role of general-purpose models to process large and complex activities, while SLMs become "specialised" models trained for specific tasks.
The differentiating factor between LLMs and SLMs is, unsurprisingly, size.
Corpus coolness
SLMs contain fewer parameters, making them ideally suited to the demands of development teams. They have the low latency and fast response times needed for the world of dev and test and are easier and quicker to fine-tune and manage. Their smaller data load means SLMs are simpler to audit, too, making it easier to address data security and regulatory concerns, while their corpus can be refreshed faster as new data is required.
And finally, their size means SLMs can be deployed on-prem, making them cheaper and easier to acquire.
With this in mind, we’re seeing SLMs take on the role of “expert” model in sectors and projects where a precise body of knowledge is required – regulatory conditions or a business domain within which the finished AI system will be expected to function. We’ve therefore seen SLMs used for code completions by thousands of engineers in fields including semiconductor manufacturing, aerospace and defence, and government.
But let’s not forget the LLM.
LLMs excel at solving complex problems and offer deep contextual understanding through retrieval-augmented generation. They are suited to debugging and full-scale code generation, which makes them ideal for those building large enterprise systems.
But their size doesn’t lend itself to easy fine-tuning or fast-paced change.
Intelligent routing tooting
The question, therefore, is how to make this winning combination work effectively together.
Intelligent routing is the standard approach for combining models in AI systems, but it presents challenges in an enterprise development environment. The framework works by assessing queries and directing them to what it considers the most appropriate model. Implementing and operating intelligent routing successfully, however, comes with a lot of overhead that includes managing the configuration of routing rules and identifying the best paths to use when overriding responses.
Not only is that a lot of work, but success is not guaranteed and you can end up creating more work in the long run.
Intelligent routing means you must fine-tune your SLMs to work with different codebases, languages, libraries, dependencies, architecture, design patterns and so on.
In an enterprise setting that could mean working on this across hundreds of SLMs. Not only is this time-consuming and expensive, but the pace of change in dev and test, and the scale of enterprise environments, would make it difficult to keep the SLMs up to date – thereby exposing your AI systems to the risk of hallucination.
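To make that overhead concrete, here is a minimal sketch of what a rule-based router might look like. The model names, rules and domains are invented purely for illustration – a production router would need far more of them, each kept in step with its own fine-tuned SLM.

```python
# A minimal sketch of intelligent routing. Model names and rules are
# hypothetical; real routers carry far more configuration than this.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    model: str                        # which model handles the query
    matches: Callable[[str], bool]    # predicate deciding if this route applies

# Every codebase, language and domain tends to need its own rule and its own
# fine-tuned SLM – which is where the maintenance overhead comes from.
ROUTES = [
    Route("slm-verilog-completion", lambda q: "verilog" in q.lower()),
    Route("slm-avionics-docs", lambda q: "do-178c" in q.lower()),
    Route("llm-general", lambda q: True),  # fallback: large general-purpose model
]

def route(query: str) -> str:
    """Return the first model whose rule matches the query."""
    for r in ROUTES:
        if r.matches(query):
            return r.model
    return "llm-general"

print(route("Complete this Verilog always block"))  # -> slm-verilog-completion
print(route("Refactor this payment service"))       # -> llm-general
```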
RAG time
There is an alternative to intelligent routing: Retrieval-Augmented Generation (RAG).
RAG retrieves data in real time to ensure the responses an AI system generates draw on the “best” data available – thereby improving the quality of recommendations.
RAG is built using a powerful framework that reduces the workload on the engineers responsible for operating and maintaining the system. RAG can be applied to a broad range of scenarios without the need to establish a network of predefined model routes – it uses context routes that are defined by the user. RAG is also efficient from a computational perspective: it’s capable of interrogating expert models at a deep level – pulling back data from the task, user, project, codebase and organisational level as necessary. The knock-on effect is you can reduce the number of expert models, helping cut costs and improve your environment’s performance.
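For a sense of how that looks in practice, here is a minimal sketch of the retrieval step. The in-memory snippets and keyword matching stand in for a real vector store, and all names and data are illustrative only.

```python
# A minimal sketch of the RAG flow described above, assuming a simple in-memory
# keyword search in place of a real retriever; all names and data are invented.

# Tiny stand-in corpora for each level of context RAG can draw on.
CONTEXT_SOURCES = {
    "task": ["Fix the failing unit test in billing_service"],
    "project": ["Project uses Python 3.11 and pytest"],
    "codebase": ["billing_service.charge() raises CurrencyError on mismatch"],
    "organisation": ["All currency amounts are stored in minor units"],
}

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Score snippets by word overlap with the query and keep the best per level."""
    words = set(query.lower().split())
    hits = []
    for level, snippets in CONTEXT_SOURCES.items():
        ranked = sorted(
            snippets,
            key=lambda s: len(words & set(s.lower().split())),
            reverse=True,
        )
        hits += [f"[{level}] {s}" for s in ranked[:top_k]]
    return hits

def build_prompt(query: str) -> str:
    """Augment the query with retrieved context ahead of a single model call."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nTask:\n{query}"

print(build_prompt("Why does billing_service.charge() fail for USD?"))
```

Because the context is assembled at query time, there is no network of routing rules to maintain – the same flow serves every codebase and domain.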
Finally – and rather helpfully – SLMs are suited to RAG in areas like agentic code validation and code review.
Software engineering is ripe for the marriage of LLMs and SLMs – but it takes RAG to deliver on the promise of a finely tuned SLM in your toolkit.