如何度量代码生成大模型的准确性?
项目背景
在公司内网部署StarCoder-1B大模型已经半月有余了,在考虑继续优化升级StarCoder-7B模型的同时,一个重要的问题出现了。
在实际生产开发中,我们应该如何评估代码生成大模型的准确性?是从10亿到70亿参数越多就越好吗?还是看论文测试说哪个模型好就一定更好?
用数据说话!
评估指标
- 插件的下载数量—-普及度
- 由于我把插件放在飞书文档,因此无法统计下载安装的数量(失策了..)
- 代码的生成质量—-满意度
- 通过tabby产生的日志可以分析。
日志分析
tabby服务端会每天产生日志如:2024-03-07.json
日志的内容:
1
2
3
4
5
6
7
8
{"ts":1709541131885,"event":{"completion":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","language":"java","prompt":"*******","segments":{"prefix":"*****","suffix":"*****"},"choices":[{"index":0,"text":"*********"}]}}}
{"ts":1709541132062,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541131909"}}}
{"ts":1709541132768,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541131909","elapsed":704}}}
{"ts":1709541133661,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541133514"}}}
{"ts":1709541134151,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541133514","elapsed":489}}}
{"ts":1709541134902,"event":{"view":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541134755"}}}
{"ts":1709541135314,"event":{"select":{"completion_id":"cmpl-dea2b314-0518-414e-ace7-f74c3494ba20","choice_index":0,"view_id":"view-dea2b314-0518-414e-ace7-f74c3494ba20-at-1709541134755","elapsed":412}}}
通过分析日志的结构可以看出,日志一共分为4类
- completion:完成
- view:展示
- select :选择
- dismiss:放弃
数据分析
| 日期 | completion | view | dismiss | select | 代码采纳率 |
|---|---|---|---|---|---|
| 2024-03-07 | 2451 | 2059 | 1873 | 183 | 8.89% |
| 2024-03-08 | 1276 | 933 | 859 | 73 | 7.82% |
| 2024-03-11 | 2137 | 1581 | 1504 | 77 | 4.87% |
| 2024-03-12 | 1642 | 1527 | 1407 | 116 | 7.60% |
| 2024-03-13 | 1483 | 1146 | 1068 | 78 | 6.81% |
| 2024-03-14 | 1649 | 1132 | 1080 | 52 | 4.59% |
| 2024-03-15 | 1495 | 1159 | 1096 | 62 | 5.35% |
| 2024-03-18 | 834 | 645 | 625 | 16 | 2.48% |
| 2024-03-19 | 931 | 690 | 659 | 30 | 4.35% |
| 2024-03-20 | 1395 | 1017 | 961 | 54 | 5.31% |
| 2024-03-21 | 1391 | 1007 | 937 | 71 | 7.05% |
| 2024-03-22 | 1352 | 1041 | 974 | 66 | 6.34% |
分析结论
通过数据可以得到代码的采纳率大概在5%左右,考虑到很多用户都是设置的自动模式,因此必定会有非常多的无效提示。实际有效的采纳估计会超过15%
数据也算是差强人意,但还有很大的提升空间。
后续规划
- 升级7B模型,通过测试用例和「代码采纳率」来评估新模型的效果。
- 优化idea和vscode插件的功能,增加埋点,更精准的统计用户数据,使分析结果更科学。