
Russian Open Speech To Text

Speech to Text Russian Open STT

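A collection of speech samples derived from various audio sources. The dataset contains short audio clips in Russian.

Arguably the largest public Russian STT dataset to date:

  • ~16 million utterances;
  • ~20,000 hours;
  • 2.3 TB (uncompressed, .wav format, int16), or 356 GB in .opus;
  • all files except the validation datasets have now been converted to opus;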
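
As a rough sanity check on these figures, the sketch below recomputes the uncompressed size from the stated duration. It assumes every clip is stored as mono, 16 kHz, 16-bit PCM (the normalization settings listed later on this page) and ignores .wav header overhead; the constant names are illustrative only.

```python
# Back-of-the-envelope check of the quoted dataset size.
# Assumption: mono, 16 kHz, 16-bit PCM, no container overhead.
HOURS = 20_000
UTTERANCES = 16_000_000
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # int16

total_seconds = HOURS * 3600
raw_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
raw_tb = raw_bytes / 1e12                      # decimal terabytes
avg_utterance_s = total_seconds / UTTERANCES   # mean clip length
opus_ratio = 2.3e12 / 356e9                    # quoted wav size / quoted opus size

print(f'uncompressed size: {raw_tb:.2f} TB')
print(f'average utterance: {avg_utterance_s:.1f} s')
print(f'wav-to-opus ratio: {opus_ratio:.1f}x')
```

Under these assumptions the estimate comes out at about 2.3 TB, matching the quoted figure, with an average utterance length of about 4.5 seconds.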

The main purpose of the dataset is to train speech-to-text models.

Dataset composition

Dataset sizes are given for .wav files.

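| Dataset | Utterances | Hours | GB | Secs / chars | Comment | Annotation | Quality / noise |
| --- | --- | --- | --- | --- | --- | --- | --- |
| radio_v4 (*) | 7,603,192 | 10,430 | 1,195 | 5 s / 68 | Radio | Alignment | 95% / clear |
| public_speech (*) | 1,700,060 | 2,709 | 301 | 6 s / 79 | Public speech | Alignment | 95% / clear |
| audiobook_2 | 1,149,404 | 1,511 | 162 | 5 s / 56 | Books | Alignment | 95% / clear |
| radio_2 | 651,645 | 1,439 | 154 | 8 s / 110 | Radio | Alignment | 95% / clear |
| public_youtube1120 | 1,410,979 | 1,104 | 237 | 3 s / 34 | YouTube | Subtitles | 95% / ~clear |
| public_youtube700 | 759,483 | 701 | 75 | 3 s / 43 | YouTube | Subtitles | 95% / ~clear |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 2 s / 20 | Addresses | TTS, 4 voices | 100% / clear |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 4 s / 37 | Phone calls | ASR | 70% / noisy |
| public_youtube1120_hq | 369,245 | 291 | 31 | 3 s / 37 | YouTube HQ | Subtitles | 95% / ~clear |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3 s / 29 | Phone calls | ASR | 70% / noisy |
| radio_v4_add (*) | 92,679 | 157 | 18 | 6 s / 80 | Radio | Alignment | 95% / clear |
| asr_public_stories_2 | 78,186 | 78 | 9 | 4 s / 43 | Books | ASR | 80% / clear |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3 s / 30 | Books | ASR | 80% / clear |
| public_series_1 | 20,243 | 17 | 2 | 3 s / 38 | YouTube | Subtitles | 95% / ~clear |
| asr_calls_2_val | 12,950 | 7.7 | 2 | 2 s / 34 | Phone calls | Manual annotation | 99% / clear |
| public_lecture_1 | 6,803 | 6 | 1 | 3 s / 47 | Lectures | Subtitles | 95% / clear |
| buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2 s / 31 | Books | Manual annotation | 99% / clear |
| public_youtube700_val | 7,311 | 4.5 | 1 | 2 s / 35 | YouTube | Manual annotation | 99% / clear |

(*) Only a sample of this data is provided with txt files.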

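Annotation methodology

The dataset is compiled from open sources. Long sequences are split into audio chunks using voice activity detection and alignment. Some audio types are annotated automatically and verified statistically or with heuristics.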
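
The exact VAD and alignment tooling is not specified here. As a minimal illustration of the chunking idea only, the sketch below splits a signal wherever the frame energy stays below a threshold for several consecutive frames. The function name, the RMS-energy criterion, and all parameter values are assumptions made for illustration, not the authors' method; production VADs are usually model-based.

```python
import math


def split_on_silence(samples, sample_rate, frame_ms=20,
                     threshold=500.0, min_silence_frames=5):
    """Return (start, end) sample offsets of voiced chunks.

    A naive energy-based splitter (hypothetical helper): a frame counts as
    voiced when its RMS is at or above `threshold`; a chunk is closed once
    `min_silence_frames` consecutive silent frames have been seen.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    chunks, start, silent_run = [], None, 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i * frame_len      # a new chunk begins here
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # close the chunk at the start of the silent run
                chunks.append((start, (i - silent_run + 1) * frame_len))
                start, silent_run = None, 0
    if start is not None:
        # signal ended inside a chunk; trim any trailing silent frames
        chunks.append((start, (n_frames - silent_run) * frame_len))
    return chunks


# Toy signal at 1 kHz: silence, speech, a long pause, speech again.
signal = [0] * 30 + [1000] * 50 + [0] * 60 + [1000] * 20
chunks = split_on_silence(signal, sample_rate=1000, frame_ms=10,
                          threshold=500.0, min_silence_frames=3)
print(chunks)
```

Each returned pair could then be cut out of the long recording and matched against its transcript in a separate alignment step.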

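Data volume and update frequency

The total size of the dataset is 350 GB. The total size of the dataset with publicly shared labels is 130 GB.

The dataset itself is unlikely to be updated for backwards compatibility. Please follow the original repository for benchmarks and excluded files.

New domains and languages may be added in the future.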

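Audio normalization

All files are normalized as follows, so that runtime augmentation and processing are easier and faster:

  • converted to mono, if necessary;
  • converted to a 16 kHz sampling rate, if necessary;
  • stored as 16-bit integers;
  • converted to OPUS;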
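
The first three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the helper names are hypothetical, resampling is done by simple linear interpolation with no anti-aliasing filter (a real pipeline would more likely rely on a dedicated resampler), and the final OPUS encoding is left to an external encoder.

```python
from array import array


def to_mono(frames):
    """Average the channels of each frame (frames: sequence of per-frame tuples)."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (simplification: no low-pass filtering)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def to_int16(samples):
    """Round and clip to the signed 16-bit range, returning an array of int16."""
    return array('h', (max(-32768, min(32767, int(round(s)))) for s in samples))


# Toy input: four stereo frames recorded at 32 kHz.
frames = [(0, 0), (100, 300), (200, 400), (300, 500)]
mono = to_mono(frames)
resampled = resample_linear(mono, src_rate=32000, dst_rate=16000)
pcm = to_int16(resampled)
print(pcm.tolist())
```

The resulting int16 buffer is what would be written to a .wav file or handed to an OPUS encoder.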

On-disk DB methodology

Each audio file (wav, binary) is hashed. The hash is used to create a folder hierarchy for more efficient fs operation.

target_format = 'wav'
wavb = wav.tobytes()
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder, f_hash[0], f_hash[1:3], f_hash[3:15] + '.' + target_format)

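
A self-contained version of the snippet above might look like the following sketch. The function name `hashed_store_path`, the root folder `data`, and the toy sample values are assumptions for illustration; the standard-library `array` module stands in for the NumPy array so that the example has no third-party dependencies.

```python
import hashlib
from array import array
from pathlib import Path


def hashed_store_path(pcm_bytes, root_folder, target_format='wav'):
    """Map raw audio bytes to root/<1 hex char>/<2 hex chars>/<12 hex chars>.<ext>."""
    f_hash = hashlib.sha1(pcm_bytes).hexdigest()
    return Path(root_folder,
                f_hash[0],        # 16 possible top-level folders (0..f)
                f_hash[1:3],      # 256 possible second-level folders (00..ff)
                f_hash[3:15] + '.' + target_format)


# Toy int16 samples standing in for a decoded audio clip.
samples = array('h', [0, 1, -1, 32767])
path = hashed_store_path(samples.tobytes(), 'data')
print(path)
```

Because the path depends only on the audio bytes, identical clips always map to the same location, and files are spread across at most 16 × 256 leaf folders.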
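Downloads

The dataset is provided in 2 forms:

  • archives, available via Azure blob storage and/or direct links;
  • original files, available via Azure blob storage;

All data is stored at https://azureopendatastorage.blob.core.windows.net/openstt/

Folder structure: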

├── ru_open_stt_opus                                      <= archived folders
│   ├── archives
│   │   ├── asr_calls_2_val.tar.gz                        <= tar.gz archives with opus and wav files
│   │   │   ...                                           <= see the below table for enumeration
│   │   └── tts_russian_addresses_rhvoice_4voices.tar.gz
│   └── manifests
│       ├── asr_calls_2_val.csv                           <= csv files with wav_path, text_path, duration (see notebooks)
│       │   ...
│       └── tts_russian_addresses_rhvoice_4voices.csv
└── ru_open_stt_opus_unpacked                             <= a separate folder for each uploaded domain
    ├── public_youtube1120
    │   ├── 0                                             <= see "On disk DB methodology" for details
    │   ├── 1
    │   │   ├── 00
    │   │   │   ...
    │   │   └── ff
    │   │       ├── *.opus                                <= actual files
    │   │       └── *.txt
    │   │   ...
    │   └── f
    ├── public_youtube1120_hq
    ├── public_youtube700_val
    ├── asr_calls_2_val
    ├── radio_2
    ├── private_buriy_audiobooks_2
    ├── asr_public_phone_calls_2
    ├── asr_public_stories_2
    ├── asr_public_stories_1
    ├── public_lecture_1
    ├── asr_public_phone_calls_1
    ├── public_series_1
    └── public_youtube700

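
Given this layout, object URLs can be assembled by joining path segments onto the storage root. In the sketch below the helper name `blob_url` is hypothetical and the example names are taken from the tree; whether a particular blob actually exists is not checked.

```python
BASE_URL = 'https://azureopendatastorage.blob.core.windows.net/openstt/'


def blob_url(*parts):
    """Join path segments onto the storage root using single slashes."""
    return BASE_URL + '/'.join(part.strip('/') for part in parts)


# An archive and the root of one unpacked domain, following the tree layout.
archive = blob_url('ru_open_stt_opus', 'archives', 'asr_calls_2_val.tar.gz')
domain_root = blob_url('ru_open_stt_opus_unpacked', 'public_youtube1120') + '/'

print(archive)
print(domain_root)
```

Stripping stray slashes from each segment avoids accidental double slashes, which blob storage may treat as part of a different object name.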
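| Dataset | GB, wav | GB, archive | Archive | Source | Manifest |
| --- | --- | --- | --- | --- | --- |
| Train | | | | | |
| Radio and public speech sample | - | 11.4 | opus+txt | - | manifest |
| audiobook_2 | 162 | 25.8 | opus+txt | Internet + alignment | manifest |
| radio_2 | 154 | 24.6 | opus+txt | Radio | manifest |
| public_youtube1120 | 237 | 19.0 | opus+txt | YouTube videos | manifest |
| asr_public_phone_calls_2 | 66 | 9.4 | opus+txt | Internet + ASR | manifest |
| public_youtube1120_hq | 31 | 4.9 | opus+txt | YouTube videos | manifest |
| asr_public_stories_2 | 9 | 1.4 | opus+txt | Internet + alignment | manifest |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 12.9 | opus+txt | TTS | manifest |
| public_youtube700 | 75.0 | 12.2 | opus+txt | YouTube videos | manifest |
| asr_public_phone_calls_1 | 22.7 | 3.2 | opus+txt | Internet + ASR | manifest |
| asr_public_stories_1 | 4.1 | 0.7 | opus+txt | Public stories | manifest |
| public_series_1 | 1.9 | 0.3 | opus+txt | Public series | manifest |
| public_lecture_1 | 0.7 | 0.1 | opus+txt | Internet + manual | manifest |
| Val | | | | | |
| asr_calls_2_val | 2 | 0.8 | wav+txt | Internet | manifest |
| buriy_audiobooks_2_val | 1 | 0.5 | wav+txt | Books + manual | manifest |
| public_youtube700_val | 2 | 0.13 | wav+txt | YouTube videos + manual | manifest |
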
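Download instructions

Direct

See https://github.com/snakers4/open_stt#download-instructions

Via mounting Azure blob storage

See the notebook in the "Data access" tab.

Contacts

If you need help with the data or have questions, please contact the data author (aveysov@gmail.com).

License

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only, and only so long as attribution is given to the creator. It includes the following elements:
* BY - credit must be given to the creator
* NC - only noncommercial uses of the work are permitted

CC-BY-NC applies; commercial use is available by agreement with the dataset authors.

References / further reading

Original dataset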

  • https://github.com/snakers4/open_stt

English articles

  • https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
  • https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/

Chinese articles

  • https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2

Russian articles

  • https://habr.com/ru/post/494006/
  • https://habr.com/ru/post/474462/

Access

Available in: Azure Notebooks
When to use: Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.


Helper functions / dependencies

Building libsndfile

The most efficient way (that we know of) to read opus files in Python without incurring significant overhead is to use pysoundfile (a Python CFFI wrapper around libsndfile).

When this solution was being researched, the community had been waiting for a major libsndfile release for some time.

Opus support was implemented upstream some time ago, but it had not been properly released. Therefore we opted for a custom build plus monkey patching.

By the time you read or use this, proper builds of libsndfile will probably be available.

Please substitute your favourite tool if you have one.

Typically, you need to run this in your shell with sudo access:

apt-get update
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y

cd /usr/local/lib
git clone https://github.com/erikd/libsndfile.git
cd libsndfile
git reset --hard 49b7d61
mkdir -p build && cd build

cmake .. -DBUILD_SHARED_LIBS=ON
make && make install
cmake --build .

Helper functions / dependencies

Install the following libraries (versions do not matter much):

pandas
numpy
scipy
tqdm
soundfile
librosa

Depending on how this notebook is run, this can sometimes be as easy as the following (if, for example, your miniconda is not installed under root):

In [ ]:
!pip install numpy
!pip install tqdm
!pip install scipy
!pip install pandas
!pip install soundfile
!pip install librosa
!pip install azure-storage-blob

Manifests are just csv files with the following columns:

  • Path to audio
  • Path to text file
  • Duration

They proved to be the simplest and most helpful format for accessing the data.

For ease of use all the manifests are already rerooted, i.e. all paths in them are relative and you just need to add a root folder.

In [1]:
# manifest utils
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from urllib.request import urlopen



def reroot_manifest(manifest_df,
                    source_path,
                    target_path):
    if source_path != '':
        manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: x.replace(source_path,
                                                                              target_path))
        manifest_df.text_path = manifest_df.text_path.apply(lambda x: x.replace(source_path,
                                                                                target_path))
    else:
        manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: os.path.join(target_path, x))
        manifest_df.text_path = manifest_df.text_path.apply(lambda x: os.path.join(target_path, x))    
    return manifest_df


def save_manifest(manifest_df,
                  path,
                  domain=False):
    if domain:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
    else:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']

    manifest_df.reset_index(drop=True).sort_values(by='duration',
                                                   ascending=True).to_csv(path,
                                                                          sep=',',
                                                                          header=False,
                                                                          index=False)
    return True


def read_manifest(manifest_path,
                  domain=False):
    if domain:
        return pd.read_csv(manifest_path,
                        names=['wav_path',
                               'text_path',
                               'duration',
                               'domain'])
    else:
        return pd.read_csv(manifest_path,
                        names=['wav_path',
                               'text_path',
                               'duration'])


def check_files(manifest_df,
                domain=False):
    orig_len = len(manifest_df)
    # the expected columns depend on whether a 'domain' column is present
    if domain:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
    else:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']
    wav_paths = list(manifest_df.wav_path.values)
    text_paths = list(manifest_df.text_path.values)

    omitted_wavs = []
    omitted_txts = []

    for wav_path, text_path in zip(wav_paths, text_paths):
        if not os.path.exists(wav_path):
            print('Dropping {}'.format(wav_path))
            omitted_wavs.append(wav_path)
        if not os.path.exists(text_path):
            print('Dropping {}'.format(text_path))
            omitted_txts.append(text_path)

    manifest_df = manifest_df[~manifest_df.wav_path.isin(omitted_wavs)]
    manifest_df = manifest_df[~manifest_df.text_path.isin(omitted_txts)]
    final_len = len(manifest_df)

    if final_len != orig_len:
        print('Removed {} lines'.format(orig_len-final_len))
    return manifest_df


def plain_merge_manifests(manifest_paths,
                          MIN_DURATION=0.1,
                          MAX_DURATION=100):

    manifest_df = pd.concat([read_manifest(_)
                             for _ in manifest_paths])
    manifest_df = check_files(manifest_df)

    manifest_df_fit = manifest_df[(manifest_df.duration>=MIN_DURATION) &
                                  (manifest_df.duration<=MAX_DURATION)]

    manifest_df_non_fit = manifest_df[(manifest_df.duration<MIN_DURATION) |
                                      (manifest_df.duration>MAX_DURATION)]

    print(f'Good hours: {manifest_df_fit.duration.sum() / 3600:.2f}')
    print(f'Bad hours: {manifest_df_non_fit.duration.sum() / 3600:.2f}')

    return manifest_df_fit


def save_txt_file(wav_path, text):
    txt_path = wav_path.replace('.wav','.txt')
    with open(txt_path, "w") as text_file:
        print(text, file=text_file)
    return txt_path


def read_txt_file(text_path):
    #with open(text_path, 'r') as file:
    response = urlopen(text_path)
    file = response.readlines()
    for i in range(len(file)):
        file[i] = file[i].decode('utf8')
    return file 

def create_manifest_from_df(df, domain=False):
    if domain:
        columns = ['wav_path', 'text_path', 'duration', 'domain']
    else:
        columns = ['wav_path', 'text_path', 'duration']
    manifest = df[columns]
    return manifest


def create_txt_files(manifest_df):
    assert 'text' in manifest_df.columns
    assert 'wav_path' in manifest_df.columns
    wav_paths, texts = list(manifest_df['wav_path'].values), list(manifest_df['text'].values)
    # not using multiprocessing for simplicity
    txt_paths = [save_txt_file(*_) for _ in tqdm(zip(wav_paths, texts), total=len(wav_paths))]
    manifest_df['text_path'] = txt_paths
    return manifest_df


def replace_encoded(text):
    text = text.lower()
    if '2' in text:
        text = list(text)
        _text = []
        for i,char in enumerate(text):
            if char=='2':
                try:
                    _text.extend([_text[-1]])
                except IndexError:  # a leading '2' has no previous character to repeat
                    print(''.join(text))
            else:
                _text.extend([char])
        text = ''.join(_text)
    return text
In [2]:
# reading opus files
import os
import soundfile as sf



# Fx for soundfile read/write functions
def fx_seek(self, frames, whence=os.SEEK_SET):
    self._check_if_closed()
    position = sf._snd.sf_seek(self._file, frames, whence)
    return position


def fx_get_format_from_filename(file, mode):
    format = ''
    file = getattr(file, 'name', file)
    try:
        format = os.path.splitext(file)[-1][1:]
        format = format.decode('utf-8', 'replace')
    except Exception:
        pass
    if format == 'opus':
        return 'OGG'
    if format.upper() not in sf._formats and 'r' not in mode:
        raise TypeError("No format specified and unable to get format from "
                        "file extension: {0!r}".format(file))
    return format


#sf._snd = sf._ffi.dlopen('/usr/local/lib/libsndfile/build/libsndfile.so.1.0.29')
sf._subtypes['OPUS'] = 0x0064
sf.SoundFile.seek = fx_seek
sf._get_format_from_filename = fx_get_format_from_filename


def read(file, **kwargs):
    return sf.read(file, **kwargs)


def write(file, data, samplerate, **kwargs):
    return sf.write(file, data, samplerate, **kwargs)
In [3]:
# display utils
import gc
from IPython.display import HTML, Audio, display, display_html
pd.set_option('display.max_colwidth', 3000)
#Prepend_path is set to read directly from Azure. To read from local replace below string with path to the downloaded dataset files
prepend_path = 'https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus_unpacked/'


def audio_player(audio_path):
    return '<audio preload="none" controls="controls"><source src="{}" type="audio/wav"></audio>'.format(audio_path)

def display_manifest(manifest_df):
    display_df = manifest_df.copy()  # copy so the caller's frame is not modified
    display_df['wav'] = [audio_player(prepend_path+path) for path in display_df.wav_path]
    display_df['txt'] = [read_txt_file(prepend_path+path) for path in tqdm(display_df.text_path)]
    audio_style = '<style>audio {height:44px;border:0;padding:0 20px 0px;margin:-10px -20px -20px;}</style>'
    display_df = display_df[['wav','txt', 'duration']]
    display(HTML(audio_style + display_df.to_html(escape=False)))
    del display_df
    gc.collect()

Play with a dataset

Play a sample of files

Most browsers support native audio playback, so we can use HTML5 audio players to inspect the data.

In [4]:
manifest_df = read_manifest(prepend_path + 'manifests/public_series_1.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')
In [5]:
sample = manifest_df.sample(n=20)
display_manifest(sample)
100%|██████████| 20/20 [00:07<00:00,  2.66it/s]
wav txt duration
5963 [пожалуйста прости всё в порядке\n] 2.48
19972 [хотелось бы хотя бы разок глазком на неё посмотреть раз такое дело\n] 5.68
15555 [они с егерем на след напали до инспектора не дозвониться\n] 3.84
430 [что то случилось\n] 1.36
4090 [так давай опаздываем\n] 2.16
18590 [да саид слушаю тебя троих нашли а в полётном листе\n] 4.60
17734 [надо сначала самому серьёзным человеком стать понимаешь\n] 4.32
978 [вот что случилось\n] 1.56
13269 [да паш юль пожалуйста не делай глупостей\n] 3.48
4957 [полусладкое или сухое\n] 2.32
1913 [ищи другую машину\n] 1.80
10522 [гражданин финн не зная что я полицейский\n] 3.08
9214 [ты чего трубку не берёшь я же переживаю\n] 2.88
10014 [я не окажу сопротивления я без оружия\n] 3.00
8351 [звони партнёру пусть он напишет\n] 2.80
3818 [ну что пойдём обсудим\n] 2.12
11097 [вы простите понимаете все об этом знают\n] 3.16
2989 [какие уж разводки\n] 2.00
12229 [я получается какой то диспетчер а не напарник\n] 3.28
5348 [я же тебе сказала никакой карелии\n] 2.40

Read a file

In [ ]:
!ls ru_open_stt_opus/manifests/*.csv

A couple of simplistic examples showing how to best read wav and opus files.

Scipy is the fastest for wav, pysoundfile is the best overall for opus.

In [6]:
%matplotlib inline

import librosa
from scipy.io import wavfile
from librosa import display as ldisplay
from matplotlib import pyplot as plt

Read a wav

In [7]:
manifest_df = read_manifest(prepend_path +'manifests/asr_calls_2_val.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')
In [8]:
sample = manifest_df.sample(n=5)
display_manifest(sample)
100%|██████████| 5/5 [00:01<00:00,  2.61it/s]
wav txt duration
7802 [это же позитивные новости не негативные\n] 2.01
3590 [белый цветочек\n] 1.17
10594 [какое отношение имеет ваша пенсия к моему отделению\n] 3.14
4630 [есть есть видео\n] 1.35
468 [что ещё раз\n] 0.62
In [9]:
from io import BytesIO

wav_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+wav_path)
data = response.read()
sr, wav = wavfile.read(BytesIO(data))
wav = wav.astype('float32')  # astype returns a new array, so keep the result
absmax = np.max(np.abs(wav))
wav =  wav / absmax
In [10]:
# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope (note: waveplot was removed in librosa 0.10 in favour of waveshow)
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)
Out[10]:
<matplotlib.collections.PolyCollection at 0x7fdf62f7e8d0>

Read opus

In [11]:
manifest_df = read_manifest(prepend_path +'manifests/asr_public_phone_calls_2.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')
In [12]:
sample = manifest_df.sample(n=5)
display_manifest(sample)
100%|██████████| 5/5 [00:02<00:00,  2.24it/s]
wav txt duration
5018 [а вы кто\n] 0.96
143473 [пьеса дружбы нету\n] 1.86
272155 [не знаю где находится\n] 2.64
334225 [ты куда звонишь то куда ты звонишь ты знаешь\n] 3.12
143789 [помощник дежурного\n] 1.86
In [13]:
opus_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+opus_path)
data = response.read()
wav, sr = sf.read(BytesIO(data))
wav = wav.astype('float32')  # astype returns a new array, so keep the result
absmax = np.max(np.abs(wav))
wav =  wav / absmax
In [14]:
# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope (note: waveplot was removed in librosa 0.10 in favour of waveshow)
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)
Out[14]:
<matplotlib.collections.PolyCollection at 0x7fdf62f8ee10>