How to configure hotwords in FunASR (the official documentation is far too brief)

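1. Introduction to FunASR

FunASR is an open-source speech recognition toolkit from Alibaba DAMO Academy, aimed at industrial use. It offers an end-to-end pipeline covering voice activity detection (VAD), automatic speech recognition (ASR), and punctuation restoration.

FunASR repository: https://github.com/modelscope/FunASR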

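2. Hotword configuration

First, why do hotwords exist at all? When transcribing speech we often need domain-specific vocabulary or custom terms, and the recognizer frequently gets these "wrong". For example: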

“菁菁” ->“晶晶”

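The name 菁菁 comes out as 晶晶. Both are pronounced jīng jīng, so you could argue the recognizer is right; but it is not the 菁菁 we wanted. Hotword configuration exists to handle exactly this case.

The official documentation includes a snippet that runs a working hotword demo: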

from funasr import AutoModel

# paraformer-zh is a multi-functional ASR model;
# enable vad, punc, spk as needed
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)
res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword="魔搭",
)
print(res)

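This code runs fine, but hotwords should not be hard-coded. The hotword parameter therefore also accepts a file path. The file holds one hotword per line, weights are not supported (plain hotwords only), and the file name must end in .txt. The core code makes the decision like this (.venv/lib/python3.10/site-packages/funasr/models/seaco_paraformer/model.py):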

        # for None
        if hotword_list_or_file is None:
            hotword_list = None
        # for local txt inputs
        elif os.path.exists(hotword_list_or_file) and hotword_list_or_file.endswith(".txt"):
            logging.info("Attempting to parse hotwords from local txt...")
            hotword_list = []
            hotword_str_list = []
            with codecs.open(hotword_list_or_file, "r") as fin:
                for line in fin.readlines():
                    hw = line.strip()
                    hw_list = hw.split()
                    if seg_dict is not None:
                        hw_list = seg_tokenize(hw_list, seg_dict)
                    hotword_str_list.append(hw)
                    hotword_list.append(tokenizer.tokens2ids(hw_list))
                hotword_list.append([self.sos])
                hotword_str_list.append("<s>")
            logging.info(
                "Initialized hotword list from file: {}, hotword list: {}.".format(
                    hotword_list_or_file, hotword_str_list
                )
            )
        # for url, download and generate txt
        elif hotword_list_or_file.startswith("http"):
            logging.info("Attempting to parse hotwords from url...")
            work_dir = tempfile.TemporaryDirectory().name
            if not os.path.exists(work_dir):
                os.makedirs(work_dir)
            text_file_path = os.path.join(work_dir, os.path.basename(hotword_list_or_file))
            local_file = requests.get(hotword_list_or_file)
            open(text_file_path, "wb").write(local_file.content)
            hotword_list_or_file = text_file_path
            hotword_list = []
            hotword_str_list = []
            with codecs.open(hotword_list_or_file, "r") as fin:
                for line in fin.readlines():
                    hw = line.strip()
                    hw_list = hw.split()
                    if seg_dict is not None:
                        hw_list = seg_tokenize(hw_list, seg_dict)
                    hotword_str_list.append(hw)
                    hotword_list.append(tokenizer.tokens2ids(hw_list))
                hotword_list.append([self.sos])
                hotword_str_list.append("<s>")
            logging.info(
                "Initialized hotword list from file: {}, hotword list: {}.".format(
                    hotword_list_or_file, hotword_str_list
                )
            )
        # for text str input
        elif not hotword_list_or_file.endswith(".txt"):
            logging.info("Attempting to parse hotwords as str...")
            hotword_list = []
            hotword_str_list = []
            for hw in hotword_list_or_file.strip().split():
                hotword_str_list.append(hw)
                hw_list = hw.strip().split()
                if seg_dict is not None:
                    hw_list = seg_tokenize(hw_list, seg_dict)
                hotword_list.append(tokenizer.tokens2ids(hw_list))
            hotword_list.append([self.sos])
            hotword_str_list.append("<s>")
            logging.info("Hotword list: {}.".format(hotword_str_list))
        else:
            hotword_list = None

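The logic is easy to follow. Each input type has its own condition:

txt file: os.path.exists(hotword_list_or_file) and hotword_list_or_file.endswith(".txt")
URL: hotword_list_or_file.startswith("http")
plain string: not hotword_list_or_file.endswith(".txt")

From this code, the rules are:

txt file: the file must exist and its name must end in .txt, with one hotword per line
URL: the code downloads the response body into a local file and parses it the same way, so the remote file must be plain text in the txt format, one hotword per line
plain string: multiple hotwords are separated by whitespace

Note the final else: a local path that ends in .txt but does not exist matches none of the three branches, so hotword_list is silently set to None.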
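To see which branch a given value would take, the dispatch order can be mirrored in a few lines. This is a minimal sketch: classify_hotword_input is a hypothetical helper written for this post, not part of FunASR, and it only reproduces the conditions quoted above without parsing anything.

```python
import os


def classify_hotword_input(value):
    """Hypothetical helper: name the branch of the quoted code that `value` would hit."""
    if value is None:
        return "none"
    if os.path.exists(value) and value.endswith(".txt"):
        return "txt-file"
    if value.startswith("http"):
        return "url"
    if not value.endswith(".txt"):
        return "string"
    # Ends with .txt but the file does not exist: hotwords are dropped
    return "ignored"


print(classify_hotword_input("魔搭 达摩院"))
print(classify_hotword_input("./does_not_exist.txt"))
```

One consequence worth knowing: an existing file whose name does not end in .txt lands in the string branch, so its path is tokenized as if it were a hotword.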

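3. Problems I ran into

3.1 Hotwords have no effect on Windows

The same code works on Linux, but on Windows the hotwords from a file had no effect. After digging through the source, the problem turned out to be this line: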

codecs.open(hotword_list_or_file, "r")

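FunASR does not specify an encoding when it reads the file, so Python falls back to the platform default. On Linux and macOS that is usually UTF-8; on Windows it is the system ANSI code page, which is GBK on a Simplified Chinese installation. A UTF-8 hotword file is therefore decoded incorrectly on Windows, and re-saving the file as GBK makes the hotwords take effect.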
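If re-saving by hand is inconvenient, the conversion can be scripted. The sketch below assumes the source file is UTF-8 and writes a copy in whatever encoding Python's open() would use by default on the current machine. reencode_for_platform is a hypothetical helper, not a FunASR function.

```python
import locale


def reencode_for_platform(src_path, dst_path, src_encoding="utf-8"):
    """Copy a hotword file, converting it to the platform's default text encoding.

    Hypothetical helper. Raises UnicodeEncodeError if the default encoding
    cannot represent a hotword (for example Chinese text under cp1252).
    """
    # Encoding that open() falls back to when no encoding argument is given
    target = locale.getpreferredencoding(False)
    with open(src_path, "r", encoding=src_encoding) as fin:
        text = fin.read()
    with open(dst_path, "w", encoding=target) as fout:
        fout.write(text)
    return target
```

Another option, if you control how the interpreter is started, is Python's UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable). It makes open() default to UTF-8, so the hotword file should be readable without conversion.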

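3.2 Hotwords have no effect when passed as a file

Passing hotwords directly through the hotword parameter works, but passing a file path does not. First confirm that the path is correct, using the same condition as the source code: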

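import os

hotword_list_or_file = "./xxx.txt"
# Same condition FunASR uses to decide that the input is a local txt file
if os.path.exists(hotword_list_or_file) and hotword_list_or_file.endswith(".txt"):
    print("hotword file is valid")
else:
    print("hotword file is missing or does not end in .txt")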

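Once this check passes, the file is usually picked up. If the hotwords still have no effect, check the file encoding: the content is most likely being garbled on read, and garbled hotwords cannot match anything.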
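Both checks can be combined into one pre-flight step that also shows the hotwords as Python decodes them. This is a minimal sketch, assuming that opening the file without an encoding argument behaves like the codecs.open(path, "r") call in FunASR; load_hotwords_like_funasr is a hypothetical name.

```python
import os


def load_hotwords_like_funasr(path):
    """Hypothetical pre-flight check: return the hotwords as they would be read."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"hotword file not found: {path}")
    if not path.endswith(".txt"):
        raise ValueError(f"hotword file name must end in .txt: {path}")
    # No encoding argument on purpose, so the platform default is used
    # (assumed to match what FunASR does when it reads the file).
    with open(path, "r") as fin:
        # Blank lines are skipped here only to keep the printed list readable
        return [line.strip() for line in fin if line.strip()]
```

Print the returned list and compare it with the file in an editor. If the characters differ, or a UnicodeDecodeError is raised, the encoding is the problem.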

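3.3 Hotwords have no effect after switching models

After switching to the iic/SenseVoiceSmall model, the hotwords stopped working yet again.

So I went through the issues: https://github.com/modelscope/FunASR/issues/1499

According to the official reply there, hotwords do not apply to every input. The source also shows that hotword support is implemented per model rather than by the framework, so you have to check whether the model you plan to use handles the hotword parameter. The simplest way is to search the source for that parameter under the model directory: .venv/lib/python3.10/site-packages/funasr/models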
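That search can be scripted. The sketch below walks a models directory and lists the sub-packages whose Python files mention the string hotword. packages_mentioning is a hypothetical helper, and a textual match is only a hint that a model may support hotwords, not proof.

```python
import importlib.util
from pathlib import Path


def packages_mentioning(models_dir, needle="hotword"):
    """Return sorted names of sub-packages of models_dir whose .py files contain needle."""
    root = Path(models_dir)
    hits = set()
    for py_file in root.rglob("*.py"):
        parts = py_file.relative_to(root).parts
        if len(parts) < 2:
            continue  # skip files that sit directly in models_dir
        if needle in py_file.read_text(encoding="utf-8", errors="ignore"):
            hits.add(parts[0])
    return sorted(hits)


# Locate the installed funasr package without importing it (if it is installed)
spec = importlib.util.find_spec("funasr")
if spec is not None and spec.submodule_search_locations:
    models_dir = Path(list(spec.submodule_search_locations)[0]) / "models"
    print(packages_mentioning(models_dir))
```

Any package the script reports still needs a manual look to confirm that the hotword argument is actually used during inference.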

I checked many models, and among them only the one from the official README (paraformer-zh) supports hotwords; none of the others I looked at do.
