跳到正文

如何度量代码生成大模型的准确性?

代码生成大模型

Posted by lili on April 10, 2024 · 读取中...

    如何度量代码生成大模型的准确性?

    项目背景

    在公司内网部署StarCoder-1B大模型已经半月有余了,在考虑继续优化升级StarCoder-7B模型的同时,一个重要的问题出现了。

    在实际生产开发中,我们应该如何评估代码生成大模型的准确性?是从10亿到70亿参数越多就越好吗?还是看论文测试说哪个模型好就一定更好?

    用数据说话!

    评估指标

    • 插件的下载数量—-普及度
      • 由于我把插件放在飞书文档,因此无法统计下载安装的数量(失策了..)
    • 代码的生成质量—-满意度
      • 通过tabby产生的日志可以分析。

    日志分析

    tabby服务端会每天产生日志如:2024-03-07.json

    日志的内容:

    1
    2
    3
    4
    5
    6
    7
    8
    
    {"ts":1709541131885,"event":{"completion":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","language":"java","prompt":"*******","segments":{"prefix":"*****","suffix":"*****"},"choices":[{"index":0,"text":"*********"}]}}}
    {"ts":1709541132062,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541131909"}}}
    {"ts":1709541132768,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541131909","elapsed":704}}}
    {"ts":1709541133661,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541133514"}}}
    {"ts":1709541134151,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541133514","elapsed":489}}}
    {"ts":1709541134902,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541134755"}}}
    {"ts":1709541135314,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541134755","elapsed":412}}}
    
    

    通过分析日志的结构可以看出,日志一共分为4类

    • completion:完成
    • view:展示
    • select :选择
    • dismiss:放弃

    数据分析

    日期 completion view dismiss select 代码采纳率
    2024-03-07 2451 2059 1873 183 8.89%
    2024-03-08 1276 933 859 73 7.82%
               
    2024-03-11 2137 1581 1504 77 4.87%
    2024-03-12 1642 1527 1407 116 7.60%
    2024-03-13 1483 1146 1068 78 6.81%
    2024-03-14 1649 1132 1080 52 4.59%
    2024-03-15 1495 1159 1096 62 5.35%
               
    2024-03-18 834 645 625 16 2.48%
    2024-03-19 931 690 659 30 4.35%
    2024-03-20 1395 1017 961 54 5.31%
    2024-03-21 1391 1007 937 71 7.05%
    2024-03-22 1352 1041 974 66 6.34%

    分析结论

    通过数据可以得到代码的采纳率大概在5%左右,考虑到很多用户都是设置的自动模式,因此必定会有非常多的无效提示。实际有效的采纳估计会超过15%

    数据也算是差强人意,但还有很大的提升空间。

    后续规划

    • 升级7B模型,通过测试用例和「代码采纳率」来评估新模型的效果。
    • 优化idea和vscode插件的功能,增加埋点,更精准的统计用户数据,使分析结果更科学。