Fujitsu has developed a generative AI reconfiguration technology that makes large language models (LLMs) lighter and reduces their power consumption. It is a core technology of the company's AI service "Fujitsu Kozuchi," and Fujitsu has used it to enhance its LLM "Takane." The technology rests on two components: the world's highest-precision quantization technology, which compresses to the limit the weights assigned to the connections between neurons that form the basis of AI thinking; and the world's first specialized AI distillation technology, which makes a model lighter while achieving accuracy that exceeds that of the original AI model.
Applying the quantization technology to "Takane" with 1-bit quantization (which reduces memory consumption by up to 94%), Fujitsu achieved the world's highest accuracy retention rate of 89% relative to the unquantized model, along with a threefold increase in speed. This far exceeds the accuracy retention rate of less than 20% obtained with GPTQ, the conventional mainstream quantization method. As a result, large generative AI models that previously required four high-end GPUs can now run at high speed on a single low-end GPU.
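The "up to 94%" memory figure is consistent with simple arithmetic if the unquantized weights are stored in 16 bits each. The article does not state the baseline precision, so the 16-bit baseline and the 100-billion-parameter model size in the sketch below are assumptions chosen purely for illustration, and the calculation ignores activations and any per-group scale factors.

```python
def memory_reduction(bits_before: int, bits_after: int) -> float:
    """Fraction of weight memory saved when each weight shrinks
    from bits_before bits to bits_after bits."""
    return 1.0 - bits_after / bits_before


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, not a figure from the article.
ASSUMED_PARAMS = 100_000_000_000

# Assumed baseline: 16-bit weights. Target: 1-bit weights.
reduction = memory_reduction(16, 1)
before_gb = weight_memory_gb(ASSUMED_PARAMS, 16)
after_gb = weight_memory_gb(ASSUMED_PARAMS, 1)

print(f"reduction: {reduction:.2%}")          # reduction: 93.75%
print(f"weights before: {before_gb} GB")      # weights before: 200.0 GB
print(f"weights after:  {after_gb} GB")       # weights after:  12.5 GB
```

Under these assumptions the saving is 93.75%, which rounds to the 94% quoted above.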
The dramatic reduction in model size achieved by this technology makes it possible to run AI agents on edge devices such as smartphones and factory machines. This will improve real-time responsiveness, strengthen data security, and sharply reduce power consumption during AI operation, contributing to a sustainable AI society. Fujitsu will begin offering trial environments for "Takane" with the quantization technology applied in stages from the second half of fiscal year 2025.
Furthermore, starting today, Fujitsu will progressively release versions of Cohere's research open-weight model "Command A," quantized using this technology, via Hugging Face. Fujitsu will continue to improve the capabilities of generative AI while promoting research and development to ensure its reliability, thereby helping to solve more difficult challenges facing customers and society and opening up new possibilities for the use of generative AI.
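In recent years, generative AI has evolved into AI agents that execute tasks autonomously, and its deployment in industry is advancing rapidly. However, the LLMs underlying these agents have grown larger and require large numbers of high-performance GPUs, which creates major challenges such as rising development and operating costs and the environmental burden of high power consumption. In addition, for companies to make full use of generative AI in their operations, it is not enough to use a general-purpose model: improving model accuracy for specific tasks and reducing model size so that models can run on edge devices in factories and stores are essential.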
Two core technologies that make up the generative AI reconfiguration technology
Many of the tasks performed by AI agents require only a small portion of the general-purpose capabilities of LLMs. The generative AI reconfiguration technology is inspired by the human brain's ability to reorganize its neural circuits in response to learning, experience, and changes in the environment, and so to specialize in specific skills. It efficiently extracts only the knowledge necessary for a specific task from a huge model with general knowledge, creating a lightweight, highly efficient, and highly reliable AI model, similar to the brain of an expert. Two core technologies make this possible.
Quantization technology that streamlines AI thinking and reduces power consumption
This technology compresses the vast amount of parameter information that forms the basis of generative AI thinking, making generative AI models significantly lighter, less power-hungry, and faster. Conventional methods struggled with many-layered neural networks such as LLMs because quantization error accumulates exponentially from layer to layer. Based on theoretical insights, Fujitsu Laboratories developed a new quantization algorithm, QEP (Quantization Error Propagation), that keeps quantization error from growing by propagating it across layers. Furthermore, by utilizing QQA, the world's most accurate optimization algorithm for large-scale problems, developed by Fujitsu, Fujitsu Laboratories achieved 1-bit quantization of LLMs.
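The article does not give the details of QEP or QQA, so the following is only a toy illustration of the general idea of accounting for upstream error when quantizing a layer, not Fujitsu's algorithm. It quantizes one weight row to 1 bit as a scale times the sign of each weight. The "naive" scale fits the weights in isolation; the "propagated" scale is a least-squares fit to the full-precision outputs, computed on inputs that already carry error from earlier quantized layers. All function names, the noise level, and the dimensions are assumptions made for the sketch.

```python
import random


def dot_rows(X, w):
    """Return the dot product of each row of X with the vector w."""
    return [sum(xi * wi for xi, wi in zip(x, w)) for x in X]


def sign(v):
    return 1.0 if v >= 0 else -1.0


def naive_scale(w):
    """Scale that best fits alpha * sign(w) to w itself: the mean |w|."""
    return sum(abs(v) for v in w) / len(w)


def propagated_scale(w, X_fp, X_q):
    """Least-squares scale fitted to the full-precision outputs X_fp @ w,
    using X_q, the inputs actually produced by quantized upstream layers."""
    s = [sign(v) for v in w]
    target = dot_rows(X_fp, w)
    basis = dot_rows(X_q, s)
    denom = sum(b * b for b in basis)
    if denom == 0.0:
        return naive_scale(w)  # degenerate case: fall back to the naive fit
    return sum(b * t for b, t in zip(basis, target)) / denom


def output_error(w, alpha, X_fp, X_q):
    """Squared error between the 1-bit layer output and the original output."""
    s = [sign(v) for v in w]
    target = dot_rows(X_fp, w)
    approx = [alpha * b for b in dot_rows(X_q, s)]
    return sum((a - t) ** 2 for a, t in zip(approx, target))


random.seed(0)
dim, samples = 8, 32
w = [random.gauss(0.0, 1.0) for _ in range(dim)]
X_fp = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(samples)]
# Stand-in for upstream quantization error: perturb the clean inputs.
X_q = [[v + random.gauss(0.0, 0.3) for v in row] for row in X_fp]

err_naive = output_error(w, naive_scale(w), X_fp, X_q)
err_prop = output_error(w, propagated_scale(w, X_fp, X_q), X_fp, X_q)
print("naive scale error:     ", err_naive)
print("propagated scale error:", err_prop)
```

Because the propagated scale minimizes the output error over all possible scales for the given inputs, its error can never exceed that of the naive scale on the same data. How much it helps in a real multi-layer model depends on details the article does not describe.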
Specialized AI distillation technology that condenses expertise and improves accuracy
This technology optimizes the structure of AI models much as the brain reinforces necessary knowledge and clears away unnecessary memories. First, a diverse set of candidate models is generated by pruning the base AI model to remove unnecessary knowledge and by adding Transformer blocks to impart new capabilities. Next, Neural Architecture Search (NAS) using proprietary proxy evaluation technology automatically selects from these candidates the model that best balances customer requirements (GPU resources, speed) against accuracy. Finally, knowledge is distilled from teacher models such as "Takane" into the selected model. This approach goes beyond simple compression and achieves accuracy on specialized tasks that exceeds that of the base generative AI model.
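The last two steps described above, choosing a candidate under resource constraints and then distilling from a teacher, can be sketched in a few lines. This is a generic illustration, not Fujitsu's NAS or proxy evaluation: the candidate names and all numbers are hypothetical, the selection is a plain filter-and-maximize, and the loss is the standard temperature-softened KL divergence commonly used for knowledge distillation.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    gpu_memory_gb: float
    latency_ms: float
    proxy_accuracy: float  # cheap estimate of accuracy, not a full evaluation


def select_candidate(candidates, max_memory_gb, max_latency_ms):
    """Return the feasible candidate with the best proxy accuracy, or None."""
    feasible = [
        c for c in candidates
        if c.gpu_memory_gb <= max_memory_gb and c.latency_ms <= max_latency_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.proxy_accuracy)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical candidates produced by pruning and by adding blocks.
candidates = [
    Candidate("pruned-small", 8.0, 20.0, 0.71),
    Candidate("pruned-medium", 16.0, 35.0, 0.78),
    Candidate("pruned-plus-blocks", 40.0, 60.0, 0.83),
]
chosen = select_candidate(candidates, max_memory_gb=24.0, max_latency_ms=50.0)
print(chosen.name)  # pruned-medium

# The loss is zero when the student matches the teacher and positive otherwise.
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))
```

In training, the student's parameters would be updated to reduce this loss over many examples, usually in combination with a loss on the task labels.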
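In a demonstration that predicts the outcome of each sales negotiation, framed as a text QA task on Fujitsu's own CRM (customer relationship management) data, a model that extracts only task-specific knowledge based on past data delivered 11 times faster inference and a 43% improvement in accuracy. By achieving higher accuracy and model compression at the same time, Fujitsu also confirmed that a lightweight student model with 1/100 the parameters can exceed the accuracy of the teacher model, cutting the required GPU memory and operating costs by 70% while enabling more reliable predictions of deal outcomes. Furthermore, in an image recognition task, the technology improved detection accuracy for unlearned objects by 10% compared with existing distillation techniques. This is a breakthrough result, more than three times the accuracy improvement achieved in this field over the past two years.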
Source: Fujitsu
