Personalization in Slipbox with On-device AI

Slipbox Founders

In AI, we are witnessing a notable shift toward efficiency, accessibility, and personalization while preserving privacy. Large Language Models (LLMs) continue to advance, with new capabilities released frequently. However, because user data is processed in the cloud, privacy remains a critical concern for conversational data. At the same time, significant progress is being made in efficient, on-device AI, where personal data can be processed locally and privacy is preserved. For an application such as Slipbox, which captures conversations between individuals, privacy is a crucial consideration.

The evolution of Small Language Models (SLMs), such as DeepSeek's distilled smaller models, marks a shift in AI deployment, enabling personalization on edge devices without compromising accuracy. In addition, techniques such as quantization, pruning, knowledge distillation, and low-rank approximation have paved the way for more personalized AI experiences. Hardware is evolving in parallel: Apple's Neural Processing Units (NPUs) sparked industry-wide advancement, Intel achieved a 4x increase in Trillions of Operations Per Second (TOPS) within a single year, and Qualcomm has progressed from basic audio processing to handling sophisticated multi-modal AI models. Modern System-on-Chip architectures now integrate multiple processing units (CPUs, GPUs, NPUs, and Digital Signal Processors) to optimize resource utilization and reduce latency, supporting increasingly sophisticated AI applications. This progression is reflected in the market's rapid adoption of AI-capable PCs, with the AI PC share of worldwide PC shipments projected to rise from 17% to 43% (Figure 1), indicating a broader industry shift toward AI-ready hardware across all major platforms.

AI PC Market Growth

Figure 1: AI PC share of PC shipments worldwide in the last two years

Applications such as Slipbox that run on-device can leverage these developments to provide more value to users. Our hybrid architecture harnesses both advances, on-device small models and cloud large language models, with a focus on privacy, personalization, and cost-effectiveness. Users value privacy, especially for personal conversations and corporate data, and this creates challenges for personalization. In our previous blog, From Shallow Talk to Deep Talk, we described how Slipbox creates a privacy-first personal AI meeting companion. In this blog, we discuss in detail how Slipbox achieves personalization while maintaining data privacy.

Personalization on Device

As demand for personal AI assistants grows, on-device SLMs face significant challenges, particularly when handling long conversations or documents. Phones and laptops have limited compute and memory, making it difficult to process large amounts of data efficiently. For example, accurately summarizing the transcript of an hour-long meeting becomes impractical because the memory used by a language model's attention mechanism grows quadratically with text length. Moreover, longer contexts tend to degrade output quality. Within these constraints, Slipbox provides personalization on-device with only limited use of cloud LLMs. Personalization can be provided to users at both a coarse-grained and a fine-grained level.

Coarse-Grained Personalization: Domain-Specific SLMs

Coarse-grained personalization relies on pre-trained, domain-specific SLMs. These models, fine-tuned externally on specialized datasets, offer a solid foundation for personalized AI experiences on devices such as AI PCs. Domain-specific pre-training enables models to excel in fields such as law, medicine, or finance while still maintaining broad capabilities. This approach strikes an effective balance between performance and efficiency, offering offline functionality and reduced latency without overwhelming the device's resources. A small model specific to the user's persona and industry can be installed with Slipbox.

Fine-Grained Personalization: Adaptive Learning on Device

Fine-grained personalization represents how Slipbox can adapt to individual users, combining advanced techniques to create truly personalized experiences while maintaining privacy and efficiency. Here are a few options that Slipbox is exploring:

Personal Memory Generation and Context Retention

A personal memory module is key to building a self-evolving, personalized intelligent meeting assistant. Memory modules that retain conversational and historical context across multiple interactions are the foundation of personalization. They allow Slipbox to generate coherent, contextually relevant responses and to build a personalized knowledge base that improves future interactions. Personalization can be further enhanced by tracking topics, knowledge graphs, and meeting contexts. At Slipbox, we already use SLMs for some of these tasks, for example, generating these personalized memories. Figure 2 shows an example of how memories are stored for individuals.

Memory Interface

Figure 2: The interface shows user preferences that Slipbox has learned from past interactions

Memory in Context Prompt:
Providing memory as context to the SLM in a structured prompt is one way to personalize interactions with Slipbox. If the memory is small, the entire memory plus the current context fits within the model's context window. Effective prompting is necessary here to ensure the model concentrates on the pertinent information and does not get overloaded.
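To make this concrete, here is a minimal sketch of how stored memories might be serialized into a structured prompt for a local SLM. The memory entries and the run_local_slm call are illustrative placeholders, not Slipbox's actual API.

```python
# Minimal sketch: packing stored memories into a structured prompt for a local SLM.
# The memory entries and `run_local_slm` are illustrative stand-ins, not Slipbox's API.

memories = [
    {"topic": "preferences", "fact": "Prefers bullet-point summaries over paragraphs"},
    {"topic": "projects",    "fact": "Leads the Q3 onboarding redesign"},
]

def build_prompt(memories, transcript_snippet, question):
    """Serialize memories and the current meeting context into one structured prompt."""
    memory_block = "\n".join(f"- [{m['topic']}] {m['fact']}" for m in memories)
    return (
        "You are a meeting assistant. Use the user's memory when relevant.\n\n"
        f"<memory>\n{memory_block}\n</memory>\n\n"
        f"<current_meeting>\n{transcript_snippet}\n</current_meeting>\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(memories, "...transcript excerpt...", "Summarize the action items for me.")
# run_local_slm(prompt)  # hypothetical call into the on-device model
```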

RAG + Context Prompt:
As conversations and stored memory grow, the previous method fails even with a reasonably sized SLM context window. Retrieval-Augmented Generation (RAG) can identify the memories most relevant to the current conversation and supply only those in the prompt. Even so, the amount of memory that can be sent directly in the prompt becomes a problem as conversations get longer. Recent research is making strides in overcoming these limitations; for example, Qwen2.5 models can process up to one million tokens by breaking text into manageable chunks and capturing both local and broader connections, although output quality at such lengths remains an issue.
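As a rough illustration of the retrieval step, the sketch below scores stored memories against the current turn and keeps the top few. The embed function stands in for any local embedding model, and the scoring is illustrative rather than Slipbox's actual pipeline.

```python
# Minimal sketch of retrieving the top-k most relevant memories for the current turn.
# `embed` is a placeholder for a local embedding model; everything here is illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a fixed-size embedding for `text` from a local encoder."""
    raise NotImplementedError

def top_k_memories(memories: list[str], query: str, k: int = 5) -> list[str]:
    q = embed(query)
    scored = []
    for m in memories:
        v = embed(m)
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))  # cosine similarity
        scored.append((score, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]

# The selected memories are then injected into the structured prompt shown earlier.
```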

PRISM, proposed by Google, is another technique for giving SLMs the effect of a large context window. It uses an incremental approach, processing information as a stream of chunks while maintaining an in-context structured memory. This structured memory is revised with each new chunk of information (e.g., historical conversation data), and the model receives the memory as part of its input, which helps it retain prior context. Figure 3 shows the high-level idea of this approach.

PRISM Approach

Figure 3: Structured memory updates driven by a stream of data chunks
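The following is a simplified sketch of that incremental loop, not the exact PRISM method: a structured memory is revised chunk by chunk by the on-device model. run_local_slm is a hypothetical inference call, and the memory schema is illustrative.

```python
# Simplified sketch of incremental structured-memory processing over a chunk stream.
# `run_local_slm` is a hypothetical call into the on-device model.

def run_local_slm(prompt: str) -> str:
    """Placeholder for on-device SLM inference."""
    raise NotImplementedError

def revise_memory(structured_memory: str, chunk: str) -> str:
    prompt = (
        "Update the structured memory below using the new transcript chunk.\n"
        "Keep it concise and organized by topic.\n\n"
        f"<memory>\n{structured_memory}\n</memory>\n\n"
        f"<new_chunk>\n{chunk}\n</new_chunk>\n\n"
        "Return only the revised memory."
    )
    return run_local_slm(prompt)

def process_stream(chunks: list[str]) -> str:
    memory = "topics: {}\ndecisions: {}\naction_items: {}"  # illustrative schema
    for chunk in chunks:  # each chunk fits comfortably in the SLM's context window
        memory = revise_memory(memory, chunk)
    return memory
```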

A RAG-based approach to tracking information and using it in context has several advantages: it seamlessly incorporates knowledge from the user's documents, emails, and browsing data; it keeps information current without requiring model retraining; and it improves response accuracy by grounding generations in user-specific content.

However, a RAG-based solution also brings challenges, such as increased system complexity, document selection errors, retrieval latency, and the limitations of vector embeddings. The retrieval and reranking algorithms must be maintained and updated, along with strategies to reduce latency when vectorizing memory chunks.

For Slipbox, a power user who switches between many different topics could benefit from an information retrieval system to achieve optimal personalization. Even though RAG is a separate system that must be maintained for personalization, we already need one for the "Ask Slipbox" feature, which lets users retrieve context from historical meetings, and that same system can additionally benefit memory and personalization. These techniques require careful prompt engineering and use XML tags so the SLMs better understand the structure of prompts.

Adaptive Vocabulary Systems

The earlier section covered personalization with respect to context and conversation history. Vocabulary is another aspect that requires adaptation. There are three vocabulary types: industry-specific, team-specific, and individual-specific. Industry-specific vocabulary can be better understood by SLMs through coarse-grained personalization techniques, as discussed earlier. However, team and individual-specific jargon require fine-grained personalization.

Vocabulary extension is an important personalization feature for Slipbox, which must handle daily conversations within a company. On individual devices, vocabulary expansion can be achieved through a combination of automatic content analysis, user-driven additions (Figure 4 shows how Slipbox lets users define a dictionary mapping), context-aware learning, and runtime adaptive tokenization, with or without fine-tuning. Tokenization is the process of breaking text into smaller units, called tokens, which can be words, subwords, or characters, for processing by a language model; it typically involves a dictionary that maps tokens to numerical representations for efficient computation. Vocabulary expansion can also improve the accuracy of automatic speech recognition.
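As one illustration of vocabulary extension, the sketch below adds team-specific terms to a Hugging Face tokenizer so they are kept as whole tokens rather than fragmented subwords. The model name and terms are assumptions, and the newly added embeddings would still need tuning before they are useful.

```python
# Minimal sketch: extending an SLM's tokenizer with team-specific vocabulary.
# Model name and terms are illustrative; new embeddings start untrained.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small on-device model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_terms = ["Slipbox", "OKR-sync"]          # illustrative team vocabulary
added = tok.add_tokens(new_terms)            # extend the tokenizer vocabulary
if added:
    model.resize_token_embeddings(len(tok))  # allocate embedding rows for new tokens

print(tok.tokenize("OKR-sync notes go into Slipbox."))
```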

Dictionary Mapping

Figure 4: User-defined dictionary of commonly misspelled words
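A user-defined dictionary like the one in Figure 4 can also be applied as a lightweight correction pass before transcripts reach the SLM. The mappings below are illustrative, not real Slipbox entries.

```python
# Minimal sketch of applying a user-defined dictionary to transcripts, as in Figure 4.
# The mappings and whole-word matching are illustrative.
import re

user_dictionary = {
    "slip box": "Slipbox",   # product name often split by the recognizer
    "que three": "Q3",       # illustrative ASR misrecognition
}

def apply_user_dictionary(text: str, mapping: dict[str, str]) -> str:
    for wrong, right in mapping.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text

print(apply_user_dictionary("In que three the slip box rollout begins.", user_dictionary))
# -> "In Q3 the Slipbox rollout begins."
```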

Augmenting the Local Model

A major problem with using memories is that they increase the prompt size (compared to having no memory), which means more tokens must be encoded. This becomes serious for incremental tasks with many chunks of information, or when the memory is much larger than the rest of the prompt. RAG, as discussed earlier, is one way to improve this by preselecting memories, but it comes with its own challenges. Another way to improve encoding efficiency, and thereby reduce latency, is key-value caching within the model, which lets the model reuse previously computed key-value activations. Cache-Augmented Generation (CAG), shown in Figure 5, uses key-value caching to encode an external knowledge source and embeds the resulting cache directly into the LLM. Knowledge is thus integrated directly into the model's processing pipeline rather than fetched through an external retrieval mechanism.

For Slipbox, the generated memories can be handled with these techniques as long as they are not too large. Slipbox can preselect and attach different knowledge-source caches to its SLMs; since the memories are already available, these caches can be generated per topic and per type of conversation. For users whose conversations focus on a small number of topics, pre-encoding the knowledge into the SLM's cache is a good option.

Cache Augmented Generation

Figure 5: Cache Augmented Generation. (Source: https://github.com/hhhuang/CAG)
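As a rough sketch of the CAG idea using the Hugging Face transformers API, the knowledge source is encoded once into the model's key-value cache and a query is then decoded on top of it. The model name, prompt format, and greedy decoding loop are assumptions, and a production setup would clone or truncate the cache between queries rather than reuse it naively.

```python
# Sketch of the CAG idea: encode a knowledge source once into the key-value cache,
# then answer a query on top of it. Illustrative, not the CAG repo's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

knowledge = "User memory: prefers bullet summaries; leads the Q3 onboarding redesign."
with torch.no_grad():
    knowledge_ids = tok(knowledge, return_tensors="pt").input_ids
    cache = model(knowledge_ids, use_cache=True).past_key_values  # encode knowledge once

def answer(question: str, max_new_tokens: int = 40) -> str:
    # Note: the cache is mutated during decoding; in practice it would be cloned or
    # truncated back to the knowledge prefix before serving the next query.
    ids = tok("\nQuestion: " + question + "\nAnswer:", return_tensors="pt").input_ids
    past, out_tokens = cache, []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy decode
            out_tokens.append(next_id.item())
            ids = next_id
    return tok.decode(out_tokens, skip_special_tokens=True)
```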

Advantages of CAG techniques include:

  • Avoids the workaround over a vector space that traditional RAG requires.
  • Encodes knowledge directly into the SLM's mathematical operations.
  • Eliminates the need for a separate retrieval system.
  • Works with existing transformer architecture's self-attention mechanism.

While these techniques can improve and expand the memory of SLMs, limitations still exist. For example, SLM performance may still degrade with very long contexts. Additional techniques, such as context shortening and key-value compression, can be applied to make these approaches more viable for longer context lengths.

On-device Fine Tuning

Before diving into the technical architecture of on-device model fine-tuning, it helps to understand the fundamental process that enables personalization. Fine-tuning represents a specialized form of transfer learning where a pre-trained model's parameters are selectively adjusted using your personal data. Unlike training from scratch, which requires massive computational resources and datasets, fine-tuning focuses on adapting existing neural pathways within the model to better align with users' specific usage patterns.

In the context of Slipbox, as the volume of contextual data and memories grows, it becomes increasingly difficult to manage. If this knowledge could instead be parameterized within the SLM, it would enable deeper connections between concepts. Fine-tuning can significantly improve personalization while also improving performance and privacy. There are several approaches to fine-tuning models on-device, with tools like PocketEngine accelerating local training so that AI models can adapt to user data without compromising privacy. Both Microsoft and Apple have made strides in enabling on-device personalization and fine-tuning. Microsoft supports on-device training in ONNX Runtime via a two-stage process: the first stage generates an offline artifact required for training, and the second stage executes the training on-device. Apple's MLX framework also facilitates on-device personalized learning.
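As an illustration of what parameter-efficient on-device fine-tuning can look like, here is a minimal LoRA sketch using the peft library. This is not the ONNX Runtime or MLX path described above; the base model, target modules, and training data are assumptions.

```python
# Minimal LoRA fine-tuning sketch with the `peft` library. Only small low-rank adapters
# are trained; the base weights stay frozen on-device. Everything here is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
examples = ["User prefers bullet-point summaries of stand-up meetings."]  # illustrative data

model.train()
for text in examples:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```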

Another technique is a hybrid approach to personalization in which a smaller "student" model runs locally on the resource-limited device while receiving guidance from a more powerful "teacher" model on a remote machine or in the cloud. This allows the student model to continuously adapt to user-specific knowledge while remaining efficient.
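The core of such a student/teacher setup is a distillation loss that nudges the local student toward the remote teacher's output distribution. The sketch below shows a standard temperature-scaled KL-divergence formulation, with random logits standing in for real model outputs; how teacher logits reach the device is out of scope here.

```python
# Minimal sketch of a temperature-scaled knowledge-distillation loss.
# Shapes and temperature are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened student and teacher distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Random logits stand in for real model outputs (batch of 4, vocab of 32k).
student_logits = torch.randn(4, 32000)
teacher_logits = torch.randn(4, 32000)
print(distillation_loss(student_logits, teacher_logits))
```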

Updating the fine-tuned Model:

State-of-the-art SLMs are redefined frequently, and when that happens we must consider how to update the underlying base model. Updating models on-device presents significant challenges in compatibility, continuity, efficiency, and personalization. Evolving models may differ in architecture, tokenization, and embedding spaces, making direct parameter transfer difficult and necessitating alignment or distillation techniques. Figure 6 shows an example of Apple's technique, Model Update Strategy for Compatible LLM Evolution (MUSCLE).

MUSCLE Approach

Figure 6: Apple proposed MUSCLE for updating the on-device model to avoid regression

Evaluation is also complex, as upgrades might inadvertently degrade performance or lose aspects of personalization without careful benchmarking. Memory management is another concern: models with accumulated personalized data risk slowing down or exceeding device constraints, requiring pruning techniques. Moreover, updates must preserve privacy and remain lightweight without constant cloud dependency, which calls for advances in on-device fine-tuning, efficient retrieval mechanisms, and hybrid local-cloud adaptation. Balancing these factors is crucial to achieving seamless, non-disruptive model updates that retain personalization without degrading performance or suffering catastrophic forgetting.

When a user upgrades their device, it is also important to consider how to best utilize the additional compute. One option is to transfer knowledge from the smaller model, which ran on the less powerful laptop, to a larger model that the more powerful laptop can support. How to transfer knowledge from a smaller model to a larger one remains an open question.

Final Thoughts on Slipbox Personalization

The evolution toward local AI processing represents a transformative opportunity for Slipbox specifically. By embracing on-device compute and SLMs, Slipbox is positioning itself at the forefront of a new era in personal productivity tools, one that prioritizes both privacy and personalization. While technical challenges remain in optimizing memory management, vocabulary adaptation, and model updating, Slipbox's hybrid architecture is uniquely suited to navigate these complexities by balancing on-device processing with selective cloud support.

Slipbox users will benefit from AI assistance that is increasingly personalized to their specific communication patterns, terminology, and workflows without compromising their sensitive conversational data. This is made possible by the platform's investments in memory modules and adaptive learning, which will enable it to serve as an intelligent companion that understands and anticipates their needs.

As on-device AI capabilities mature over the next few years, Slipbox will be able to deliver increasingly sophisticated personalization features while simultaneously reducing operational costs and cloud dependencies. This will ultimately translate to a more responsive, intuitive, and private experience for users across various industries and team configurations.

For Slipbox, the vision isn't simply about more powerful AI; it's about creating a more thoughtful, personalized experience that amplifies human capabilities while preserving privacy.
