SLM series: Editorial brief & scope
There are language models and there are language models.
While, at the time of writing at least, there is little talk of medium language models (Ed: give it a month, it’s bound to happen), we currently centre our discussion in this space around large language models (LLMs) and small language models (SLMs).
Large language models are of course large (the clue is in the name): repositories of information drawn together to enable pattern recognition and analytics when fed into artificial intelligence engines, models and application services.
Conversely, and obviously then, small language models are more diminutive information repositories, typically with a specific slant, narrowing or specialism (and therefore amplified adaptability and responsiveness) that enables them to provide more focused input DNA to the AI models they serve.
Like LLMs, SLMs are capable of processing and generating human language and both are trained on massive quantities of text-based data – the same basic rules apply to the creation of large/small image models, large/small audio models and (perhaps one day soon) the next paradigm of AI modelling, perhaps haptic sensory motion analysis.
Complex contextualisation
While LLMs are often so large that they exhibit an inherent broadness due to their non-specific generalisation, SLMs have a more defined ability to provide contextual awareness of the subject matter to which they are applied.
As a result of their increased responsiveness, SLMs are able to interpret more complex sentences and are widely agreed to be more inherently suited to real-time application deployments due to their lower latency and lower processing and memory requirements.
Usual suspects (for now) in the SLM space include: DistilBERT (a version of BERT); Google’s Gemma; Mistral 7B; the compact (3.8 billion parameters) Phi-3 Mini; and from the House of Zuck we of course have the open source Llama 3.
How small is an SLM?
We referenced SLM smallness above; to be small in this category, a model may have just a few million parameters.
At the large end of SLM smallness, we find models with several billion parameters. How big an SLM needs to be for any given use case and what optimum size it should be for a specific task is open to question (hint: we’d like to know opinions here please) today.
What is an SLM parameter?
We’re glad you asked. SLM parameters are a model’s core variable data points, chiefly the weights and biases that create the framework from which machine learning draws semantic meaning. How these weights and biases are structured is not widely documented (another hint, if you please) and the way these interrelationships are formed may be a key defining factor in how successful different SLMs are in use.
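To make the weights-and-biases point concrete, here is a minimal, purely illustrative sketch (the 768-wide layer size is an assumption chosen for illustration, not drawn from any particular model): the parameters of a single dense layer are simply its weight matrix plus its bias vector, and stacking many such layers is how model sizes climb into the millions and billions.

```python
import numpy as np

# Hypothetical example: count the parameters of one dense layer.
# A 768-in, 768-out layer alone holds over half a million parameters.
in_features, out_features = 768, 768
weights = np.zeros((in_features, out_features))  # learned weights
biases = np.zeros(out_features)                  # learned biases

param_count = weights.size + biases.size
print(param_count)  # 590592
```

Multiply that by dozens of layers (plus attention and embedding matrices) and the billion-parameter headline figures quickly follow.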
Because SLMs are small – and it may sound painfully obvious to say this out loud – they are good at getting into (aka being deployed in) small spaces, so the potential application of SLMs in edge computing devices across the Internet of Things and the wider mobile landscape is huge.
How do you build an SLM?
As small as they are, SLMs typically stem from existing LLMs. Their neural network-based architecture is known as the transformer model.
SLMs are created via a process known as model compression, where specific data management techniques are used to create a smaller model from a working LLM… quite how this process works, and how accurately its controls are applied, may help define how successful the resulting SLM is.
As IBM reminds us, “A transformer model can translate text and speech in near-real-time. For example, there are apps that now allow tourists to communicate with locals on the street in their primary language. They help researchers understand DNA and speed up drug design. They can help detect anomalies and prevent fraud in finance and security. Vision transformers are similarly used for computer vision tasks.”
SLMs are built using encoders to modify sequences of data into numerical values known as embeddings – and it is these embeddings that denote the semantics of the model itself and help to define where tokens exist within it.
As miquodo notes, “A token in AI is essentially a component of a larger data set, which may represent words, characters, or phrases. For example, when processing text, a sentence is divided into tokens, where each word or punctuation mark is considered a separate token in AI. This process of tokenisation is a crucial step in preparing data for further processing in AI model processes.”
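The tokenisation idea quoted above can be sketched in a few lines. This is a deliberately naive whitespace-and-punctuation tokeniser (real models use learned subword schemes such as BPE or WordPiece): a sentence becomes a sequence of tokens, each mapped to an integer ID that would index an embedding table.

```python
import re

def tokenise(text):
    # Split into word tokens and standalone punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenise("Small models, big ideas.")

# Build a toy vocabulary mapping each distinct token to an ID.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]

print(tokens)  # ['small', 'models', ',', 'big', 'ideas', '.']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

In a real model, each ID would then be looked up in an embedding matrix to produce the numerical vectors the transformer actually operates on.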
Inside model compression
Model compression itself features a whole sub-genre of techniques including pruning, quantisation, low-rank factorisation and knowledge distillation, all aimed at removing the least important, unneeded and/or redundant parameters from the neural network that makes up the SLM.
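Pruning is perhaps the most intuitive of these techniques. Here is a minimal sketch of magnitude pruning (the matrix size, fraction and random weights are illustrative assumptions): weights whose absolute value falls below a threshold are zeroed out, on the theory that the smallest weights contribute least; real frameworks then store the result sparsely.

```python
import numpy as np

# Illustrative magnitude pruning: zero out the smallest-magnitude weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))  # stand-in for a learned weight matrix

def prune(w, fraction):
    # Threshold at the given quantile of absolute weight values.
    threshold = np.quantile(np.abs(w), fraction)
    return np.where(np.abs(w) < threshold, 0.0, w)

pruned = prune(weights, 0.5)  # drop roughly half the weights
print(np.count_nonzero(pruned) / weights.size)  # 0.5
```

The engineering questions the series should probe are exactly the ones this sketch glosses over: how "importance" is really measured, and how much accuracy survives the cut.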
While each of these techniques, disciplines and processes is a textbook subject in its own right, we might here just define low-rank factorisation as the decomposition of a matrix of (relationship and importance) weights into smaller matrices whose product represents an approximation of the original dataset.
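That definition can be made concrete with a small sketch using singular value decomposition (the 256×256 size, rank 32 and random weights are assumptions for illustration): a weight matrix W is split into two thin factors A and B whose product approximates W, and storing A and B needs far fewer numbers than storing W when the rank is small.

```python
import numpy as np

# Low-rank factorisation sketch via truncated SVD.
rng = np.random.default_rng(42)
W = rng.normal(size=(256, 256))  # stand-in for a learned weight matrix

k = 32  # target rank, an illustrative choice
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]   # shape (256, k): columns scaled by singular values
B = Vt[:k, :]          # shape (k, 256)

original = W.size              # 65536 numbers stored
compressed = A.size + B.size   # 16384 numbers stored
print(original, compressed)    # 65536 16384
```

How faithful the approximation A @ B is depends entirely on the spectrum of the original weight matrix, which is precisely why how well this works in practice varies from model to model.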
Be our guest (author)
But enough already, so begins the SLM series right here on the Computer Weekly Developer Network.
We now feature a series of pieces designed to showcase and highlight insight from real-world software engineering and data science practitioners who have first-hand experience of building and/or working with SLM technologies.
We want answers to all the hints posed above – and please tell us what we have missed in this opening exposition – and we want practical direction on what the IT industry needs to be thinking about next in this space.
Questions to answer in this discussion may include some or all of the following:
- Should SLMs and LLMs be combined and used in unison and in concert?
- What role does intelligent routing play inside an AI system when it needs to choose the most appropriate model to direct queries towards?
- Is a selection of dedicated SLMs better than one LLM – and, conversely, are there use cases where SLMs are not needed and an LLM works more effectively?
- By nature of their size, SLMs are faster to train – what advantages does this bring to AI data engineering?
- Are SLMs always deployed on-premises… and should they only (or mostly) be used in private cloud environments due to their proximity to what is often mission-critical and/or private information?
- Are SLMs more environmentally sustainable due to their footprint?
- What limitations do SLMs exhibit and do they harbour the same propensity for bias as large-scale AI technologies? On this point, are SLMs likely to suffer from decreased performance (a lack of factual knowledge, as measured by benchmarks) when pushed to complete complex tasks?
- Are domain-specific LLMs as good as (or better than) SLMs?
- What applications are SLMs best suited to – and is it chatbots and other customer-facing technologies that need to exhibit sentiment analysis?
- What key verticals should we think about for SLMs (some SLMs have proved useful when tasked with ingesting and analysing physicians’ notes)? Along with customer CRM across all domains, are finance and retail key areas – and if so, why?
Let’s go big on being small and cover this subject bigly.