Practical Format-Conversion Scripts for Word-Segmentation Corpora (2025)

Tech Frontier • 2025-04-11 12:21


Format reference

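An example of the segmented format:
[Image: segmented-format example]
Format description: each sentence or passage sits on its own line. The words within a sentence are separated by spaces, and punctuation marks and other symbols are set off by spaces as well.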
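To make the description concrete, here is a minimal sketch of reading one line in the segmented format. The sample sentence and the function name `parse_segmented_line` are made up for illustration and do not come from the original corpus.

```python
def parse_segmented_line(line):
    """Split one segmented line into its tokens (words and punctuation)."""
    # split() with no argument drops the trailing newline and
    # tolerates repeated spaces between tokens.
    return line.split()


# Hypothetical sample line, not taken from the original corpus.
sample = "我 爱 北京 天安门 。\n"
tokens = parse_segmented_line(sample)
print(tokens)       # ['我', '爱', '北京', '天安门', '。']
print(len(tokens))  # 5
```

Punctuation comes back as a token of its own, which is what lets the later scripts filter it with a simple length-and-membership check.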

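An example of the BMES format:

[Image: BMES-format example]
Format description: every character in a sentence (including punctuation and other symbols) takes a line of its own, followed by its BMES tag; the character and the tag are separated by a tab.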
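As a rough sketch of how such a file could be read back, the function below groups `char<TAB>tag` lines into sentences. It assumes that a blank line separates sentences, which matches what the conversion scripts in this post write; the name `read_bmes` and the sample lines are hypothetical.

```python
def read_bmes(lines):
    """Group tab-separated 'char<TAB>tag' lines into sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split("\t")
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, not taken from the original corpus.
sample = ["我\tS\n", "爱\tS\n", "北\tB\n", "京\tE\n", "\n", "好\tS\n"]
print(read_bmes(sample))
```

B marks the first character of a multi-character word, M a middle character, E the last character, and S a single-character word.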

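The raw-corpus format:

[Image: raw-corpus example]
Format description: each sentence or passage sits on its own line, with no preprocessing or annotation of any kind, such as text scraped from the web.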

Converting from BMES format to the segmented format:

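#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Convert a BMES-tagged file ("train") into the segmented format ("train.txt").

# The output is opened in append mode, as in the original script.
with open("train", "r", encoding="UTF-8") as fin, \
        open("train.txt", "a", encoding="UTF-8") as fout:
    for line in fin:
        # A blank line marks a sentence boundary.
        if not line.strip():
            fout.write("\n")
            continue
        # split() with no argument accepts either a tab or a space.
        char, tag = line.split()[:2]
        if tag in ("B", "M"):
            # Still inside a word: no separator yet.
            fout.write(char)
        else:
            # "S" or "E" closes a word, so a space follows it.
            fout.write(char + " ")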


Converting from the segmented format to BMES format:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Convert the segmented format ("test") into BMES format ("test.txt").

with open("test", "r", encoding="UTF-8") as fin, \
        open("test.txt", "a", encoding="UTF-8") as fout:
    for line in fin:
        for word in line.split():
            if word in ("【", "】"):
                # Brackets are dropped from the output.
                continue
            if word == "。":
                # A full stop ends the sentence: tag it S, then a blank line.
                fout.write(word + "\tS\n\n")
                continue
            for pos, char in enumerate(word, start=1):
                if len(word) == 1:
                    tag = "S"
                elif pos == 1:
                    tag = "B"
                elif pos == len(word):
                    tag = "E"
                else:
                    tag = "M"
                # Tab between character and tag, per the format description.
                fout.write(char + "\t" + tag + "\n")
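One way to sanity-check the two conversions is a round trip in memory: segment to BMES and back, then compare with the input. The sketch below is not the scripts themselves but a simplified restatement of their tagging rule; the function names and the sample sentence are hypothetical, and the bracket and full-stop special cases are left out.

```python
def segmented_to_bmes(line):
    """Turn one space-separated line into a list of (char, tag) pairs."""
    pairs = []
    for word in line.split():
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        for pos, char in enumerate(word):
            if pos == 0:
                tag = "B"
            elif pos == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            pairs.append((char, tag))
    return pairs


def bmes_to_segmented(pairs):
    """Rebuild the space-separated line from (char, tag) pairs."""
    words, current = [], ""
    for char, tag in pairs:
        current += char
        if tag in ("S", "E"):
            # S and E both close a word.
            words.append(current)
            current = ""
    if current:
        # Tolerate a truncated word at the end of the input.
        words.append(current)
    return " ".join(words)


# Hypothetical sample line, not taken from the original corpus.
sample = "我 爱 北京 天安门 。"
pairs = segmented_to_bmes(sample)
assert bmes_to_segmented(pairs) == sample
```

If the assertion holds for a sample of real lines, the tagging and the reconstruction agree with each other, which is a cheap check to run before converting a whole corpus.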

Building a vocabulary from the corpus (file "words") and a vocabulary with word frequencies (file "words_for_training"), with punctuation filtered out. The code:

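#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Build a vocabulary ("words") and a frequency list ("words_for_training")
# from a segmented corpus ("train"), skipping single punctuation tokens.
import string

from zhon.hanzi import punctuation as punc  # third-party package: zhon

epunc = string.punctuation
vocab = {}

with open("train", "r", encoding="UTF-8") as fin:
    for line in fin:
        # split() with no argument also discards empty tokens.
        for token in line.split():
            if len(token) == 1 and (token in punc or token in epunc):
                continue
            vocab[token] = vocab.get(token, 0) + 1

# Both outputs are opened in append mode, as in the original script.
with open("words_for_training", "a", encoding="UTF-8") as freq_out, \
        open("words", "a", encoding="UTF-8") as vocab_out:
    for word, count in vocab.items():
        freq_out.write("NONE " + word + " " + str(count) + "\n")
        vocab_out.write(word + "\n")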
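The same counting can be sketched with `collections.Counter` from the standard library. To stay dependency-free, this version swaps `zhon.hanzi.punctuation` for a small hand-picked set of full-width marks, which is an assumption made for illustration and is far from a complete list; the sample lines and the name `count_words` are hypothetical too.

```python
from collections import Counter
import string

# Assumption: a small hand-picked set of full-width marks, for illustration only.
CJK_PUNCT = "，。！？；：、（）《》【】"
SKIP = set(string.punctuation) | set(CJK_PUNCT)


def count_words(lines):
    """Count tokens in segmented lines, skipping single punctuation marks."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if len(token) == 1 and token in SKIP:
                continue
            counts[token] += 1
    return counts


# Hypothetical sample lines, not taken from the original corpus.
counts = count_words(["我 爱 北京 。", "我 爱 天安门 ！"])
for word, freq in counts.most_common():
    # Same "NONE <word> <count>" layout as the frequency file above.
    print(f"NONE {word} {freq}")
```

`most_common()` lists the entries by descending frequency, which can be handy when inspecting the vocabulary by eye.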

Converting from the segmented format to BMES format (punctuation filtered out):

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Convert the segmented format ("test") into "char#\<NONE>#TAG_NONE" lines
# ("test2"), skipping single punctuation tokens.
import string

from zhon.hanzi import punctuation as punc  # third-party package: zhon

epunc = string.punctuation
# Raw string: keeps the placeholder exactly as the original script writes it.
PLACEHOLDER = r"\<NONE>"

with open("test", "r", encoding="UTF-8") as fin, \
        open("test2", "a", encoding="UTF-8") as fout:
    for line in fin:
        for word in line.split():
            if len(word) == 1 and (word in punc or word in epunc):
                continue
            for pos, char in enumerate(word, start=1):
                if len(word) == 1:
                    tag = "S"
                elif pos == 1:
                    tag = "B"
                elif pos == len(word):
                    tag = "E"
                else:
                    tag = "M"
                fout.write(char + "#" + PLACEHOLDER + "#" + tag + "_NONE\n")

Converting from the segmented format to a raw corpus (handy for testing):

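#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Strip the spaces from a segmented file ("test") to get raw text ("test_raw").

with open("test", "r", encoding="UTF-8") as fin, \
        open("test_raw", "a", encoding="UTF-8") as fout:
    for line in fin:
        # Remove only the separating spaces; the trailing newline is kept.
        fout.write(line.replace(" ", ""))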