Sep 17, 2024 | AI
Thomson Reuters Labs: Training large language models using Amazon SageMaker HyperPod
John Duprey, a distinguished engineer in Thomson Reuters Labs – the dedicated applied research division of Thomson Reuters – highlights how the Labs’ Foundational Research team scaled large language model training with Amazon SageMaker HyperPod.
2023 proved to be an inflection point for AI, prompting Thomson Reuters to consider how our high-value, curated data could improve general language models on customer-specific tasks. Training and fine-tuning a large language model (LLM) is compute-intensive and requires specialized hardware.
Thomson Reuters Labs quickly discovered that it was extremely difficult to acquire these resources on demand and at scale in our cloud environments. Looking to other third parties presented its own risks and challenges.
We turned to Amazon Web Services (AWS), long a trusted partner in secure and scalable solutions, for early access to Amazon SageMaker HyperPod. With our compute platform in place, we were ready to roll up our sleeves and do the hard work of exploring how best to train and fine-tune models for our domain. In our first phase of experimentation, we peaked at 16 P4de instances (128 A100 GPUs), with the longest job taking 36 days to train a 70-billion-parameter model.
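For readers curious what provisioning a cluster of that shape looks like, the sketch below uses the SageMaker CreateCluster API via boto3. It is a minimal illustration sized like the cluster described above, not our actual configuration; the cluster name, IAM role, and lifecycle-script location are hypothetical placeholders.

```python
# Minimal sketch: requesting a SageMaker HyperPod cluster sized like the one
# described above (16 x ml.p4de.24xlarge = 128 A100 GPUs). The cluster name,
# execution role ARN, and S3 lifecycle-script location are placeholders.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker.create_cluster(
    ClusterName="llm-training-cluster",  # hypothetical name
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p4de.24xlarge",  # 8 x A100 80 GB GPUs per instance
            "InstanceCount": 16,                 # 16 x 8 = 128 GPUs total
            "LifeCycleConfig": {
                # Lifecycle scripts staged in S3, run when each node is created
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        }
    ],
)
print(response["ClusterArn"])
```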
Initial results from our custom models look promising, and our research continues, supported by the release of Kubernetes (Amazon EKS) support for SageMaker HyperPod. Our AWS blog post explores the journey Thomson Reuters took to enable cutting-edge research into training domain-adapted LLMs using Amazon SageMaker HyperPod.
This is a guest post from John Duprey, distinguished engineer, Thomson Reuters.