Question: How do I record the audio?
https://github.com/tinyMLx/open-speech-recording
https://github.com/petewarden/open-speech-recording
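Question: If a recording is 3 seconds long but the speech takes up only about 1 second of it and the rest is background sound, how do I find the samples that correspond to that 1 second of speech?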
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
https://arxiv.org/pdf/1804.03209.pdf
5.5 Extract Loudest Section
From manual inspection of the results, there were still large numbers of utterances that were too quiet or completely silent. The alignment of the spoken words within the 1.5 second file was quite arbitrary too, depending on the speed of the user’s response to the word displayed.
To solve both these problems, I created a simple audio processing tool called Extract Loudest Section to examine the overall volume of the clips. As a first stage, I summed the absolute differences of all the samples from zero (using a scale where -32768 in the 16-bit sample data was -1.0 as a floating-point number, and +32767 was 1.0), and looked at the mean average of that value to estimate the overall volume of the utterance. From experimentation, anything below 0.004 on this metric was likely to be too quiet to be intelligible, and so all of those clips were removed.
To approximate the correct alignment, the tool then extracted the one-second clip that contained the highest overall volume. This tended to center the spoken word in the middle of the trimmed clip, assuming that the utterance was the loudest part of the recording.
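The first-stage volume check described in the quoted passage can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names PcmToFloat, MeanAbsVolume and IsTooQuiet are hypothetical, and dividing by 32768 is an assumed approximation of the scale the paper describes (it maps +32767 to just under 1.0). Only the 0.004 threshold comes from the paper.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Convert 16-bit PCM samples to floats in roughly [-1.0, 1.0).
// Assumption: a single divisor of 32768, so -32768 maps to exactly -1.0.
std::vector<float> PcmToFloat(const std::vector<int16_t>& pcm) {
  std::vector<float> out;
  out.reserve(pcm.size());
  for (int16_t sample : pcm) {
    out.push_back(static_cast<float>(sample) / 32768.0f);
  }
  return out;
}

// Mean absolute amplitude of a clip; returns 0 for an empty clip.
float MeanAbsVolume(const std::vector<float>& samples) {
  if (samples.empty()) return 0.0f;
  double sum = 0.0;
  for (float sample : samples) sum += std::fabs(sample);
  return static_cast<float>(sum / samples.size());
}

// True if the clip falls below the volume threshold quoted in the paper.
bool IsTooQuiet(const std::vector<float>& samples, float min_volume = 0.004f) {
  return MeanAbsVolume(samples) < min_volume;
}
```

A clip for which IsTooQuiet returns true would be discarded before any trimming is attempted.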
The corresponding code:
https://github.com/petewarden/extract_loudest_section

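The core function is shown below. At first I suspected that this algorithm could not handle recordings that contain noise, for example speech, followed by silence, followed by noise. After reading the code carefully, I found that this case is handled correctly as well.
current_volume_sum is always the sum over desired_samples consecutive samples, and the comparison is made between these whole-window sums rather than between individual samples. This yields the loudest section and its end index, as long as the spoken word carries more total energy than any equally long stretch of noise.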
#include <algorithm>
#include <cstdint>
#include <vector>

void TrimToLoudestSegment(const std::vector<float>& input,
                          int64_t desired_samples, std::vector<float>* output) {
  const int64_t input_size = input.size();
  // Nothing to trim if the clip is already short enough.
  if (desired_samples >= input_size) {
    *output = input;
    return;
  }
  // Energy (sum of squares) of the first window, samples [0, desired_samples).
  float current_volume_sum = 0.0f;
  for (int64_t i = 0; i < desired_samples; ++i) {
    const float input_value = input[i];
    current_volume_sum += input_value * input_value;
  }
  // loudest_end_index is exclusive: the window is [end - desired_samples, end).
  int64_t loudest_end_index = desired_samples;
  float loudest_volume = current_volume_sum;
  // Slide the window one sample at a time until the end of the clip: drop the
  // oldest (trailing) sample and add the newest (leading) one, so that
  // current_volume_sum is always the energy of desired_samples consecutive
  // samples. The comparison is between whole-window sums, never between
  // single samples, so the result holds even when the clip contains noise.
  for (int64_t i = desired_samples; i < input_size; ++i) {
    const float trailing_value = input[i - desired_samples];
    current_volume_sum -= trailing_value * trailing_value;
    const float leading_value = input[i];
    current_volume_sum += leading_value * leading_value;
    if (current_volume_sum > loudest_volume) {
      loudest_volume = current_volume_sum;
      // The window now covers samples [i - desired_samples + 1, i].
      loudest_end_index = i + 1;
    }
  }
  const int64_t loudest_start_index = loudest_end_index - desired_samples;
  output->resize(desired_samples);
  std::copy(input.begin() + loudest_start_index,
            input.begin() + loudest_end_index, output->begin());
}
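To check the noise scenario discussed above, the same sliding-window idea can be run on a synthetic clip. The sketch below is an illustration only: LoudestWindowStart and MakeSpeechThenNoise are hypothetical helpers written for this example, the 16 kHz sample rate is an assumption, and the "speech" and "noise" are just square waves of different amplitude.

```cpp
#include <cstdint>
#include <vector>

// Returns the start index of the `window`-sample span with the largest
// sum of squared samples (hypothetical helper, not part of the repository).
int64_t LoudestWindowStart(const std::vector<float>& input, int64_t window) {
  const int64_t n = static_cast<int64_t>(input.size());
  if (window <= 0 || window >= n) return 0;
  double sum = 0.0;
  for (int64_t i = 0; i < window; ++i) sum += input[i] * input[i];
  double best = sum;
  int64_t best_start = 0;
  for (int64_t i = window; i < n; ++i) {
    // Add the sample entering the window and remove the one leaving it.
    sum += input[i] * input[i] - input[i - window] * input[i - window];
    if (sum > best) {
      best = sum;
      best_start = i - window + 1;
    }
  }
  return best_start;
}

// Synthetic 3-second clip at an assumed 16 kHz sample rate:
// 0.5 s silence, 1 s of loud "speech", 0.5 s silence, 1 s of quiet "noise".
std::vector<float> MakeSpeechThenNoise() {
  const int64_t rate = 16000;
  std::vector<float> clip(3 * rate, 0.0f);
  for (int64_t i = rate / 2; i < 3 * rate / 2; ++i) {
    clip[i] = (i % 2 != 0) ? 0.5f : -0.5f;    // loud section
  }
  for (int64_t i = 2 * rate; i < 3 * rate; ++i) {
    clip[i] = (i % 2 != 0) ? 0.05f : -0.05f;  // quieter trailing noise
  }
  return clip;
}
```

With a one-second window (16000 samples), LoudestWindowStart(MakeSpeechThenNoise(), 16000) lands on sample 8000, the start of the loud section, because the trailing noise never accumulates more energy over a full window than the speech does. If the noise were louder than the speech, the noise window would be selected instead, which is the assumption stated in the paper.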
