How to Collect Wake-Word Data and Extract the Desired Segment: Loudest Section

Tech Frontier • 2025-01-13 11:29 • 66 views


Question: How do I record the audio?

https://github.com/tinyMLx/open-speech-recording

https://github.com/petewarden/open-speech-recording

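Question: If a recording is 3 seconds long but the speech takes up only about 1 second and the rest is background sound, how do I find the samples that correspond to that one second of speech?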

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

https://arxiv.org/pdf/1804.03209.pdf

5.5 Extract Loudest Section

From manual inspection of the results, there were still large numbers of utterances that were too quiet or completely silent. The alignment of the spoken words within the 1.5 second file was quite arbitrary too, depending on the speed of the user’s response to the word displayed.

To solve both these problems, I created a simple audio processing tool called Extract Loudest Section to examine the overall volume of the clips. As a first stage, I summed the absolute differences of all the samples from zero (using a scale where -32768 in the 16-bit sample data was -1.0 as a floating-point number, and +32767 was 1.0), and looked at the mean average of that value to estimate the overall volume of the utterance. From experimentation, anything below 0.004 on this metric was likely to be too quiet to be intelligible, and so all of those clips were removed.
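The first-stage volume check described above can be sketched as follows. This is only an illustration, not code from the extract_loudest_section repository: the helper names MeanAbsoluteVolume, IsLoudEnough and kMinVolume are made up here, and dividing by 32768 is an assumption that maps -32768 to exactly -1.0 and +32767 to just under 1.0.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the repository): converts 16-bit PCM samples
// to floating point by dividing by 32768 and returns the mean absolute
// amplitude, which the paper uses as an estimate of overall volume.
float MeanAbsoluteVolume(const std::vector<int16_t>& samples) {
  if (samples.empty()) return 0.0f;
  double sum = 0.0;
  for (const int16_t sample : samples) {
    sum += std::fabs(static_cast<double>(sample) / 32768.0);
  }
  return static_cast<float>(sum / static_cast<double>(samples.size()));
}

// Threshold quoted in the paper: clips below this value were removed.
constexpr float kMinVolume = 0.004f;

// Hypothetical helper: true if the clip passes the volume check.
bool IsLoudEnough(const std::vector<int16_t>& samples) {
  return MeanAbsoluteVolume(samples) >= kMinVolume;
}
```

A clip whose samples all sit around ±10 (about 0.0003 on this scale) would be rejected, while one around ±1000 (about 0.03) would be kept.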

To approximate the correct alignment, the tool then extracted the one-second clip that contained the highest overall volume. This tended to center the spoken word in the middle of the trimmed clip, assuming that the utterance was the loudest part of the recording.

The corresponding code:
https://github.com/petewarden/extract_loudest_section

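The core function is shown below. At first I suspected that this algorithm could not handle recordings that contain noise, for example speech, followed by silence, followed by noise. After reading the code carefully, I found that this case is handled as well, as long as no stretch of noise carries more total volume over a full window than the speech does.

current_volume_sum is always the sum over desired_samples consecutive samples, and the comparison is made between these window sums rather than between individual samples. This is what guarantees that the loudest section and its end index are found.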

#include <algorithm>
#include <cstdint>
#include <vector>

// Copies the loudest window of desired_samples consecutive samples from
// input into output. Loudness is the sum of squared sample values.
void TrimToLoudestSegment(const std::vector<float>& input,
                          int64_t desired_samples, std::vector<float>* output) {
  const int64_t input_size = input.size();
  if (desired_samples >= input_size) {
    *output = input;
    return;
  }

  // Volume of the first window, input[0 .. desired_samples).
  float current_volume_sum = 0.0f;
  for (int64_t i = 0; i < desired_samples; ++i) {
    const float input_value = input[i];
    current_volume_sum += input_value * input_value;
  }

  // loudest_end_index is exclusive: one past the last sample of the window.
  int64_t loudest_end_index = desired_samples;
  float loudest_volume = current_volume_sum;
  // Starting from the first window, slide one sample at a time to the end of
  // the wave data: remove the trailing sample and add the leading one.
  // If the window sum increases past the best so far, update loudest_volume
  // and loudest_end_index; otherwise move on to the next sample.
  // current_volume_sum is always the sum over desired_samples consecutive
  // samples, so the comparison is between whole windows, not single samples.
  // A short noise burst therefore does not win on its own.
  for (int64_t i = desired_samples; i < input_size; ++i) {
    const float trailing_value = input[i - desired_samples];
    current_volume_sum -= trailing_value * trailing_value;

    const float leading_value = input[i];
    current_volume_sum += leading_value * leading_value;
    if (current_volume_sum > loudest_volume) {
      loudest_volume = current_volume_sum;
      // The window now covers input[i - desired_samples + 1 .. i].
      loudest_end_index = i + 1;
    }
  }

  const int64_t loudest_start_index = loudest_end_index - desired_samples;
  output->resize(desired_samples);
  std::copy(input.begin() + loudest_start_index,
            input.begin() + loudest_end_index, output->begin());
}
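
To convince myself about the noise case, here is a small self-contained sketch of the same sliding-window idea on a toy signal. It is not the repository code: LoudestWindowStart and MakeToySignal are hypothetical names, the signal values are invented, and squared amplitude is assumed as the volume measure.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration: returns the start index of the
// window of `window` consecutive samples with the largest sum of squares.
int64_t LoudestWindowStart(const std::vector<float>& input, int64_t window) {
  const int64_t size = static_cast<int64_t>(input.size());
  if (window <= 0 || window >= size) return 0;

  // Sum of squares over the first window.
  double sum = 0.0;
  for (int64_t i = 0; i < window; ++i) {
    const double value = input[i];
    sum += value * value;
  }

  double best_sum = sum;
  int64_t best_start = 0;
  for (int64_t i = window; i < size; ++i) {
    // Slide by one sample: drop the trailing value, add the leading value.
    const double trailing = input[i - window];
    const double leading = input[i];
    sum += leading * leading - trailing * trailing;
    if (sum > best_sum) {
      best_sum = sum;
      best_start = i - window + 1;
    }
  }
  return best_start;
}

// Toy signal (invented values): 4 quiet samples, 4 "speech" samples,
// 4 silent samples, then one noise spike louder than any speech sample.
std::vector<float> MakeToySignal() {
  return {0.01f, 0.01f, 0.01f, 0.01f,
          0.5f, -0.5f, 0.5f, -0.5f,
          0.0f, 0.0f, 0.0f, 0.0f,
          0.8f};
}
```

With a window of 4 samples the function returns index 4, the start of the speech, because the four speech samples together sum to about 1.0 while any window containing the spike reaches only about 0.64. With a window of 1 it returns index 12, the spike itself. So the result depends on the total volume of a whole window, and a noise segment would only win if it were louder than the speech over the full window length.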
