Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (16): 215-223. DOI: 10.3778/j.issn.1002-8331.2406-0033

• Pattern Recognition and Artificial Intelligence •

融合模型量化和缓存优化的实时语音监测方法

吴非,沈润楠,陈宇   

  1. 北京市文化市场综合执法总队 宣传与执法保障中心,北京 100161
  2. 北京航空航天大学 复杂关键软件环境全国重点实验室,北京 100191
  3. 北京航空航天大学 计算机学院,北京 100191
  4. 北京航空航天大学 沈元学院,北京 100191
  • Publication date: 2025-08-15  Release date: 2025-08-15

Model Quantization and Buffer Optimization Based Real-Time Streaming Transcription Monitoring Method

WU Fei, SHEN Runnan, CHEN Yu   

  1. Publicity and Law Enforcement Guarantee Center, Beijing Law Enforcement on Cultural Market, Beijing 100161, China
  2. National Key Laboratory of Complex Critical Software Environments, Beihang University, Beijing 100191, China
  3. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  4. Shen Yuan Honors College, Beihang University, Beijing 100191, China
  • Online: 2025-08-15  Published: 2025-08-15

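Abstract: To meet the regulatory needs of new business formats in the cultural market, a real-time speech monitoring method that integrates model quantization and buffer optimization is proposed. Model quantization speeds up loading of the large model and reduces system resource overhead at a limited loss of accuracy. For data buffer optimization, a longest-common-prefix matching strategy dynamically adjusts the buffer settings, which strengthens the contextual coherence of the transcribed speech and lowers the word error rate (WER). A sensitive information detection model based on BERT-TextCNN is trained for sensitive content, and an off-site supervision speech monitoring system is built to provide real-time monitoring and early warning of performance content. Experimental results show that the proposed quantization method speeds up inference by 2.62 times and 2.11 times on the FP16 and FP32 benchmarks of the Whisper-large-v3 pre-trained model, respectively, an advantage over existing methods. For recognition accuracy and latency, the buffer optimization strategy reduces speech transcription latency by 12.88% on average and the Chinese word error rate by 14.42%. In experiments on a real dataset of spoken-word performance programs, the BERT-TextCNN model detects sensitive content with 92.66% accuracy and achieves higher precision and recall than other methods, showing that the proposed method can effectively support off-site supervision of small theaters and other forms of cultural performance.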

Key words: speech recognition, model quantization, longest common prefix, sensitive content detection

Abstract: A real-time streaming transcription monitoring method that integrates model quantization and buffer optimization is proposed to meet the regulatory needs of new business formats in the cultural market. Quantizing the model speeds up large-model inference and reduces system resource overhead at a limited loss of accuracy. For data buffer optimization, a longest-common-prefix matching strategy dynamically adjusts the buffer capacity, which improves the contextual coherence of the transcribed speech and reduces the word error rate (WER). A BERT-TextCNN sensitive information detection model is trained for sensitive content, and an off-site supervision voice monitoring system is built to provide real-time detection and early warning of cultural performance content. Experimental results show that the proposed model quantization method increases inference speed by 2.62 times and 2.11 times on the FP16 and FP32 benchmarks of the Whisper-large-v3 pre-trained model, respectively, an advantage over existing methods. For speech recognition, the buffer optimization strategy reduces average transcription latency by 12.88% and the Chinese word error rate by 14.42%. In experiments on a real dataset of spoken-word performances, the BERT-TextCNN model detects sensitive content with 92.66% accuracy and achieves higher precision and recall than other methods, demonstrating that the proposed method can effectively support off-site supervision of performances such as live streaming and small theaters.
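The abstract does not spell out the longest-common-prefix matching strategy, so the following is only a minimal sketch of one common way such a strategy can work in streaming transcription: text is committed once two consecutive recognition passes over the growing audio buffer agree on it. The class name `PrefixCommitBuffer`, its methods, and the token-list interface are hypothetical and are not taken from the paper. The sketch also assumes that each new hypothesis begins with the tokens already committed.

```python
from typing import List


def longest_common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest run of leading tokens shared by a and b."""
    prefix: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class PrefixCommitBuffer:
    """Hypothetical helper: commit only tokens confirmed by two consecutive passes."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # tokens already emitted downstream
        self.previous: List[str] = []   # unconfirmed tail of the previous hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the newest hypothesis for the whole buffer; return newly committed tokens."""
        # Assumption: the hypothesis starts with the committed tokens, so they
        # can be skipped by length instead of being re-aligned.
        tail = hypothesis[len(self.committed):]
        # Tokens on which the previous and current passes agree are treated as stable.
        stable = longest_common_prefix(self.previous, tail)
        self.committed.extend(stable)
        # The rest stays unconfirmed until the next pass.
        self.previous = tail[len(stable):]
        return stable


if __name__ == "__main__":
    buf = PrefixCommitBuffer()
    for hyp in (["a", "b", "c"], ["a", "b", "d", "e"], ["a", "b", "d", "e", "f"]):
        print(buf.update(hyp), buf.committed)
```

In a real system, the audio that corresponds to committed tokens could then be trimmed from the buffer, which is one plausible reading of "dynamically adjusting the buffer"; that step is omitted here because it depends on timestamp information the abstract does not describe.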

Key words: speech recognition, model quantization, longest common prefix, sensitive content detection
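For readers unfamiliar with the word error rate (WER) reported in the abstract, the sketch below computes it in the standard way, as the token-level edit distance between reference and hypothesis divided by the reference length. The function names are illustrative. Whether the paper scores Chinese at the character or word level is not stated in the abstract; the example simply accepts any token sequence.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """WER = edit distance / number of reference tokens."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat".split()
    hypothesis = "the cat sat down".split()
    print(word_error_rate(reference, hypothesis))
```

Passing `list(text)` instead of `text.split()` scores at the character level, which is a common choice for Chinese transcripts.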