Practical Format-Processing Scripts for Word Segmentation Corpora (2025)


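Format reference

Segmented format example:
[Figure: example of the segmented format]
Format description: each sentence or passage is on its own line. The words in a sentence are separated by spaces, and punctuation marks and other symbols are also set off by spaces.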

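BMES format example:

[Figure: example of the BMES format]
Format description: each character in the sentence (including punctuation marks and other symbols) is on its own line, followed by its tag: B (beginning of a word), M (middle), E (end), or S (single-character word). The character and the tag are separated by a tab.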
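The relationship between the two formats can be sketched on a single sentence. This is only a minimal illustration: the function names `bmes_tags` and `to_bmes_lines` and the sample sentence are assumptions made for the example, not part of the scripts in this post.

```python
def bmes_tags(word):
    """Return one BMES tag per character of a single (non-empty) word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def to_bmes_lines(segmented_sentence):
    """Turn a space-separated sentence into 'char<TAB>tag' lines."""
    lines = []
    for word in segmented_sentence.split():
        for char, tag in zip(word, bmes_tags(word)):
            lines.append(char + "\t" + tag)
    return lines


# Hypothetical sample sentence in the segmented format.
sample = "我 喜欢 自然语言 处理 。"
for line in to_bmes_lines(sample):
    print(line)
```

A one-character word gets S; any longer word gets B, then zero or more M, then E. Punctuation is tagged like any other token.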

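Raw corpus format:

[Figure: example of the raw corpus format]
Format description: each sentence or passage is on its own line. This is text that has had no preprocessing or annotation, such as content crawled from the web.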

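Converting from BMES format to segmented format:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # Convert a BMES-tagged file ("train") into the segmented format ("train.txt").
    with open("train", "r", encoding="UTF-8") as fin, \
         open("train.txt", "w", encoding="UTF-8") as fout:
        for line in fin:
            # A blank line marks the end of a sentence.
            if not line.strip():
                fout.write("\n")
                continue
            # Each line holds a character and its tag, separated by whitespace.
            char, tag = line.split()[:2]
            if tag in ("B", "M"):
                # The word continues: no separator yet.
                fout.write(char)
            else:
                # S or E closes the word: follow it with a space.
                fout.write(char + " ")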
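For testing the conversion without reading or writing files, the same idea can be sketched as a function over in-memory lines. The name `bmes_to_segmented` is hypothetical. Unlike the script above, this sketch returns a list of sentences, leaves no trailing space, and collapses consecutive blank lines.

```python
def bmes_to_segmented(lines):
    """Rebuild segmented sentences from 'char tag' lines.

    A blank line ends the current sentence.
    """
    sentences, words, current = [], [], ""
    for line in lines:
        if not line.strip():
            if current:              # unfinished word at the sentence end
                words.append(current)
                current = ""
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        char, tag = line.split()[:2]
        current += char
        if tag in ("S", "E"):        # the word is complete
            words.append(current)
            current = ""
    # Flush whatever is left when the input does not end with a blank line.
    if current:
        words.append(current)
    if words:
        sentences.append(" ".join(words))
    return sentences
```

Keeping the logic in a pure function makes it easy to check a few hand-written lines before running the file-based script on a whole corpus.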


Converting from segmented format to BMES format:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # Convert a segmented file ("test") into BMES format ("test.txt").
    with open("test", "r", encoding="UTF-8") as fin, \
         open("test.txt", "w", encoding="UTF-8") as fout:
        for line in fin:
            for word in line.split():
                if word in ("【", "】"):
                    # Brackets are dropped from the output.
                    continue
                if word == "。":
                    # A full stop ends the sentence: add a blank line after it.
                    fout.write(word + "\tS\n\n")
                    continue
                for pos, char in enumerate(word, start=1):
                    if len(word) == 1:
                        tag = "S"
                    elif pos == 1:
                        tag = "B"
                    elif pos == len(word):
                        tag = "E"
                    else:
                        tag = "M"
                    # Character and tag are separated by a tab.
                    fout.write(char + "\t" + tag + "\n")

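Building a vocabulary from the corpus (file "words") and a vocabulary with word frequencies (file "words_for_training"). Punctuation is filtered out. The code is as follows:

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # Build a vocabulary ("words") and a frequency list ("words_for_training")
    # from a segmented corpus ("train"), skipping punctuation tokens.
    import string
    from collections import Counter

    from zhon.hanzi import punctuation as punc  # Chinese punctuation

    epunc = string.punctuation  # ASCII punctuation

    vocab = Counter()
    with open("train", "r", encoding="UTF-8") as fin:
        for line in fin:
            for word in line.split():
                # Skip tokens that are a single punctuation mark.
                if len(word) == 1 and (word in punc or word in epunc):
                    continue
                vocab[word] += 1

    with open("words_for_training", "w", encoding="UTF-8") as freq_out, \
         open("words", "w", encoding="UTF-8") as vocab_out:
        for word, count in vocab.items():
            freq_out.write("NONE " + word + " " + str(count) + "\n")
            vocab_out.write(word + "\n")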
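To sanity-check the counting logic without touching files or installing zhon, the filter can be tried on a couple of in-memory lines. This is a sketch under assumptions: `CJK_PUNCT` is a small hand-picked stand-in for `zhon.hanzi.punctuation`, and `count_words` and the sample lines are made up for the example.

```python
import string
from collections import Counter

# Assumption: a small hand-picked set standing in for zhon.hanzi.punctuation.
CJK_PUNCT = "。,、;:?!“”‘’()《》【】"


def count_words(lines):
    """Count words in segmented lines, skipping single punctuation tokens."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if len(word) == 1 and (word in CJK_PUNCT or word in string.punctuation):
                continue
            counts[word] += 1
    return counts


# Hypothetical segmented sample.
sample = ["我 喜欢 自然语言 处理 。", "我 喜欢 Python , 也 喜欢 分词 。"]
counts = count_words(sample)
for word, n in counts.most_common(3):
    # Same "NONE word count" layout as the words_for_training file.
    print("NONE", word, n)
```

`Counter.most_common` is convenient for eyeballing the head of the frequency list before writing the full vocabulary out.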

Converting from segmented format to BMES format (with punctuation filtered out):

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # Convert a segmented file ("test") into a '#'-separated BMES format ("test2"),
    # skipping punctuation tokens.
    import string

    from zhon.hanzi import punctuation as punc  # Chinese punctuation

    epunc = string.punctuation  # ASCII punctuation
    # Middle field of every output line: a literal backslash followed by <NONE>.
    none_field = "\\<NONE>"

    with open("test", "r", encoding="UTF-8") as fin, \
         open("test2", "w", encoding="UTF-8") as fout:
        for line in fin:
            for word in line.split():
                # Skip tokens that are a single punctuation mark.
                if len(word) == 1 and (word in punc or word in epunc):
                    continue
                for pos, char in enumerate(word, start=1):
                    if len(word) == 1:
                        tag = "S"
                    elif pos == 1:
                        tag = "B"
                    elif pos == len(word):
                        tag = "E"
                    else:
                        tag = "M"
                    fout.write(char + "#" + none_field + "#" + tag + "_NONE\n")
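Reading this `#`-separated output back can be sketched as follows. The post does not say what the middle field or the `_NONE` suffix mean, so the sketch treats them purely as structure. The name `parse_tagged_line` is hypothetical.

```python
def parse_tagged_line(line):
    r"""Split a line such as 中#\<NONE>#B_NONE into (char, middle, tag)."""
    # Split from the right so that a literal '#' in the character field survives.
    char, middle, label = line.rstrip("\n").rsplit("#", 2)
    tag = label.split("_", 1)[0]  # 'B_NONE' -> 'B'
    return char, middle, tag


print(parse_tagged_line("中#\\<NONE>#B_NONE\n"))
```

Splitting from the right matters because only single-character punctuation tokens are filtered, so a `#` inside a longer token (for example `C#`) still reaches the output as a character of its own.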

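Converting from segmented format to a raw corpus (handy for testing):

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    # Convert a segmented file ("test") into a raw corpus ("test_raw").
    with open("test", "r", encoding="UTF-8") as fin, \
         open("test_raw", "w", encoding="UTF-8") as fout:
        for line in fin:
            # Drop the spaces between words; keep one line per sentence.
            fout.write("".join(line.split()) + "\n")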
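A quick consistency check between a segmented line and its raw counterpart might look like the sketch below. Both function names are hypothetical.

```python
def segmented_to_raw(line):
    """Remove all whitespace between tokens of one segmented line."""
    return "".join(line.split())


def same_text(segmented_line, raw_line):
    """True if the segmented line contains exactly the raw line's characters."""
    return segmented_to_raw(segmented_line) == raw_line.strip()


print(same_text("我 喜欢 分词 。\n", "我喜欢分词。\n"))
```

Note that removing every space also joins adjacent non-Chinese tokens, so a segmented line containing `hello world` becomes `helloworld` in the raw corpus.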