
This summer, I attended NVIDIA’s course “Adding New Knowledge to LLMs”. Here’s a short summary of what I learned.
Adding domain knowledge — like a new language or tool — to an existing model is very doable without retraining the whole thing from scratch. Let’s say you’re working with a model that doesn’t understand Norwegian. Rather than rebuilding it, you can fine-tune it with curated Norwegian datasets. Same goes for adding support for a new coding language, a new tool, or even medical terminology.
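To make that concrete, here’s a rough sketch of what parameter-efficient fine-tuning (LoRA) can look like using the Hugging Face `peft` and `transformers` libraries. The course itself works with NVIDIA’s NeMo stack, so treat this purely as an illustration of the idea: the base model name, the dataset file, and the hyperparameters are placeholders, and the `target_modules` names assume Llama-style attention layers.

```python
# Minimal sketch: parameter-efficient fine-tuning (LoRA) on a curated dataset.
# Illustrative only -- model name, dataset path, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "meta-llama/Llama-3.1-8B"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token       # needed for padding batches
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach small trainable LoRA adapters instead of updating all weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Curated Norwegian text, one example per line (hypothetical file).
dataset = load_dataset("text", data_files="norwegian_corpus.txt")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                           max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point is that only the small adapter matrices are trained, which is what makes adding a language or a domain feasible without touching the full set of weights.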
Still, retrieval-augmented generation (RAG) is the most cost-effective and practical method for adding contextual knowledge — things like internal knowledge articles, documentation, and fast-changing information. It wouldn’t make sense to train a model on this kind of content because it evolves too quickly and training is expensive. That said, there are edge cases — like local models deployed in the field with strict latency requirements — where injecting that knowledge directly into the model does make sense.
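The core retrieval loop in RAG is simple: embed your documents, find the ones closest to the question, and prepend them to the prompt. The sketch below uses the `sentence-transformers` library with a small off-the-shelf embedding model; the documents and question are made up, and the final generation call is left out since that depends on how you serve your model.

```python
# Minimal RAG sketch: retrieve relevant snippets and stuff them into the prompt.
# The embedding model name is one common choice; any encoder works.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "VPN access is requested through the internal IT portal.",
    "Expense reports must be submitted within 30 days.",
    "The on-call rotation changes every Monday at 09:00.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec            # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "How do I get VPN access?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to whatever LLM you are serving (local or hosted).
print(prompt)
```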
One thing that stood out is just how much work goes into building good datasets. A huge part of the process is about cleaning, deduplicating, and formatting data.
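As a toy illustration of the kind of cleaning involved (NeMo Curator automates this, plus fuzzy deduplication, at scale), here is a plain-Python pass that normalizes text, drops near-empty fragments, and removes exact duplicates. The length threshold is an arbitrary example value.

```python
# Toy sketch of dataset cleaning: normalize, drop very short lines, exact-dedup.
import hashlib
import unicodedata

def clean(records: list[str]) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for text in records:
        text = unicodedata.normalize("NFC", text).strip()
        text = " ".join(text.split())          # collapse whitespace
        if len(text) < 20:                     # drop near-empty fragments
            continue
        digest = hashlib.md5(text.lower().encode()).hexdigest()
        if digest in seen:                     # exact duplicate, skip it
            continue
        seen.add(digest)
        out.append(text)
    return out

print(clean(["Hello   world, this is a sample document.",
             "hello world, this is a sample document.",
             "too short"]))
```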
Evaluation is also key. The course introduced NeMo Evaluator, which lets you benchmark models in your specific domain — even before tuning. The idea is to use a kind of “ground truth” LLM to assess how well your model performs, and then measure improvement after fine-tuning.
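Conceptually, the pattern looks something like the sketch below: a trusted judge scores your model’s answers against references, first to get a baseline and then again after tuning. The judge here is a trivial stand-in just so the snippet runs; in practice it would call a strong LLM, which is roughly the workflow NeMo Evaluator orchestrates for you.

```python
# Sketch of the "judge" evaluation pattern: a trusted model scores your model's
# answers against references. The stand-ins below are placeholders; swap in
# your own model and a real LLM judge.
from typing import Callable

def evaluate(cases: list[dict], answer: Callable[[str], str],
             judge: Callable[[str, str, str], float]) -> float:
    """Average judge score over (question, reference) pairs."""
    scores = [judge(c["question"], c["reference"], answer(c["question"]))
              for c in cases]
    return sum(scores) / len(scores)

def toy_answer(question: str) -> str:
    return "Submit the request in the IT portal."

def toy_judge(question: str, reference: str, candidate: str) -> float:
    overlap = set(reference.lower().split()) & set(candidate.lower().split())
    return len(overlap) / max(len(reference.split()), 1)

cases = [{"question": "How do I get VPN access?",
          "reference": "Request VPN access through the IT portal."}]
print(f"score before fine-tuning: {evaluate(cases, toy_answer, toy_judge):.2f}")
```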
When it comes to optimizing models, there are several techniques:
- Quantization reduces the precision (e.g., down to FP8 or FP4). This makes models faster and less resource-intensive to run — though at the cost of some accuracy.
- Pruning involves trimming layers from the model to make it smaller. But in our lab tests, this led to a noticeable drop in quality.
- Distillation, on the other hand, was more promising. Using a “teacher-student” setup, a smaller model is trained to mimic a larger one — and interestingly, it sometimes outperforms the original in certain benchmarks. Distillation is becoming more common for a reason.
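To make the teacher-student idea concrete, here is a minimal PyTorch sketch of distillation with a soft-label KL loss. The networks and hyperparameters are toy values; on real LLMs this goes through NeMo Model Optimizer rather than a hand-rolled loop.

```python
# Minimal sketch of teacher-student distillation: the student learns to match
# the teacher's softened output distribution via KL divergence. Toy MLPs and
# random data stand in for real models and batches.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0   # softens the teacher's distribution

for step in range(100):
    x = torch.randn(32, 128)                    # stand-in for a real batch
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    # KL divergence between softened distributions, scaled by T^2 as is standard.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final distillation loss: {loss.item():.4f}")
```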
NVIDIA also wraps open-source models in their NIM containers. These wrappers help optimize performance and reduce resource usage when running the models — a nice bonus if you’re thinking about operational efficiency.
The tools that support all this — NeMo Evaluator for benchmarking, NeMo Curator for cleaning and preparing datasets, NeMo Model Optimizer for quantization, pruning, and distillation — form a solid toolkit for taking models from generic to specialized.
I see a lot of potential in this. Training AI from scratch is expensive and resource-intensive. Instead, organizations can take open-weight models, localize and optimize them, and run them in their own systems with reasonable effort and cost.
