A collection of sample utterances drawn from various audio sources. The dataset contains short audio clips in Russian.
Currently the largest public Russian STT dataset:
- ~16 million utterances;
- ~20,000 hours;
- 2.3 TB (uncompressed, in .wav format as int16), 356 GB in .opus;
- All files have been converted to Opus, except for the validation datasets;
The dataset is intended for training speech-to-text models.
Dataset composition
The dataset sizes given below refer to the .wav files.
Dataset | Utterances | Hours | GB | Secs/chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
radio_v4 (*) | 7,603,192 | 10,430 | 1,195 | 5s/68 | Radio | Alignment | 95%/clean |
public_speech (*) | 1,700,060 | 2,709 | 301 | 6s/79 | Public speech | Alignment | 95%/clean |
audiobook_2 | 1,149,404 | 1,511 | 162 | 5s/56 | Books | Alignment | 95%/clean |
radio_2 | 651,645 | 1,439 | 154 | 8s/110 | Radio | Alignment | 95%/clean |
public_youtube1120 | 1,410,979 | 1,104 | 237 | 3s/34 | YouTube | Subtitles | 95%/~clean |
public_youtube700 | 759,483 | 701 | 75 | 3s/43 | YouTube | Subtitles | 95%/~clean |
tts_russian_addresses | 1,741,838 | 754 | 81 | 2s/20 | Addresses | TTS, 4 voices | 100%/clean |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 4s/37 | Phone calls | ASR | 70%/noisy |
public_youtube1120_hq | 369,245 | 291 | 31 | 3s/37 | YouTube HQ | Subtitles | 95%/~clean |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3s/29 | Phone calls | ASR | 70%/noisy |
radio_v4_add (*) | 92,679 | 157 | 18 | 6s/80 | Radio | Alignment | 95%/clean |
asr_public_stories_2 | 78,186 | 78 | 9 | 4s/43 | Books | ASR | 80%/clean |
asr_public_stories_1 | 46,142 | 38 | 4 | 3s/30 | Books | ASR | 80%/clean |
public_series_1 | 20,243 | 17 | 2 | 3s/38 | YouTube | Subtitles | 95%/~clean |
asr_calls_2_val | 12,950 | 7.7 | 2 | 2s/34 | Phone calls | Manual annotation | 99%/clean |
public_lecture_1 | 6,803 | 6 | 1 | 3s/47 | Lectures | Subtitles | 95%/clean |
buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2s/31 | Books | Manual annotation | 99%/clean |
public_youtube700_val | 7,311 | 4.5 | 1 | 2s/35 | YouTube | Manual annotation | 99%/clean |
(*) Only a data sample is provided with the txt files.
Annotation methodology
The dataset is compiled from open sources. Long sequences are split into audio chunks using voice activity detection and alignment. Some audio types are annotated automatically and verified statistically or via heuristics.
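As a rough illustration of the chunking step only (this is not the authors' actual VAD/alignment pipeline), a simple energy-based split with librosa looks roughly like the sketch below; the 16 kHz rate and the placeholder file name are assumptions.
# Sketch: split a long recording into utterance-sized chunks with a simple
# energy threshold (librosa.effects.split), standing in for proper VAD.
import librosa

def split_into_chunks(path, top_db=30, sr=16000):
    y, sr = librosa.load(path, sr=sr, mono=True)
    # intervals is an array of (start_sample, end_sample) pairs of non-silent audio
    intervals = librosa.effects.split(y, top_db=top_db)
    return [y[start:end] for start, end in intervals]

# chunks = split_into_chunks('long_recording.wav')  # hypothetical input file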
Data volumes and update frequency
The total size of the dataset is 350 GB. The total size of the dataset with publicly shared labels is 130 GB.
The dataset itself will most likely not be updated, for backward compatibility. Follow the original repository for benchmarks and for files to exclude.
New domains and languages may be added in the future.
Audio normalization
All files are normalized as follows for simpler and faster runtime augmentation and processing (a minimal sketch follows the list):
- Converted to mono, if necessary;
- Converted to a 16 kHz sampling rate, if necessary;
- Stored as 16-bit integers;
- Converted to OPUS;
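A minimal sketch of these normalization steps, assuming ffmpeg with libopus support is on the PATH (file names are placeholders, not dataset files):
# Sketch: mono + 16 kHz + 16-bit wav, then an Opus copy, using ffmpeg.
import subprocess

def normalize(src, wav_out, opus_out):
    # mono, 16 kHz, signed 16-bit PCM wav
    subprocess.run(['ffmpeg', '-y', '-i', src, '-ac', '1', '-ar', '16000',
                    '-c:a', 'pcm_s16le', wav_out], check=True)
    # Opus version of the same normalized audio
    subprocess.run(['ffmpeg', '-y', '-i', wav_out, '-c:a', 'libopus', opus_out],
                   check=True)

# normalize('raw.wav', 'normalized.wav', 'normalized.opus')  # placeholder names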
On-disk DB methodology
Each audio file (wav, binary) is hashed. The hash is used to create a folder hierarchy for more optimal filesystem operation.
import hashlib
from pathlib import Path

# wav is assumed to be an int16 numpy array with the audio samples,
# root_folder the domain folder the file is stored under
target_format = 'wav'
wavb = wav.tobytes()
# hash the raw bytes and spread files over a shallow folder hierarchy
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
The dataset is available in two forms:
- As archives, available via Azure blob storage and/or direct links;
- As the original files, available via Azure blob storage;
Everything is stored in https://azureopendatastorage.blob.core.windows.net/openstt/
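A small sketch of browsing the container with azure-storage-blob (the v12 SDK, installed in the notebook cells below); anonymous read access to the public container is assumed:
# Sketch: list a handful of manifest blobs in the public openstt container.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url(
    'https://azureopendatastorage.blob.core.windows.net/openstt')
for i, blob in enumerate(container.list_blobs(name_starts_with='ru_open_stt_opus/manifests/')):
    print(blob.name)
    if i >= 4:  # only show the first few entries
        break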
Folder structure:
└── ru_open_stt_opus <= archived folders
│ │
│ ├── archives
│ │ ├── asr_calls_2_val.tar.gz <= tar.gz archives with opus and wav files
│ │ │ ... <= see the below table for enumeration
│ │ └── tts_russian_addresses_rhvoice_4voices.tar.gz
│ │
│ └── manifests
│ ├── asr_calls_2_val.csv <= csv files with wav_path, text_path, duration (see notebooks)
│ │ ...
│ └── tts_russian_addresses_rhvoice_4voices.csv
│
└── ru_open_stt_opus_unpacked <= a separate folder for each uploaded domain
├── public_youtube1120
│ ├── 0 <= see "On disk DB methodology" for details
│ ├── 1
│ │ ├── 00
│ │ │ ...
│ │ └── ff
│ │ ├── *.opus <= actual files
│ │ └── *.txt
│ │ ...
│ └── f
│
├── public_youtube1120_hq
├── public_youtube700_val
├── asr_calls_2_val
├── radio_2
├── private_buriy_audiobooks_2
├── asr_public_phone_calls_2
├── asr_public_stories_2
├── asr_public_stories_1
├── public_lecture_1
├── asr_public_phone_calls_1
├── public_series_1
└── public_youtube700
Dataset | GB, wav | GB, archive | Archive | Source | Manifest |
---|---|---|---|---|---|
Train | | | | | |
Sample of radio and public speech | - | 11.4 | opus+txt | - | manifest
audiobook_2 | 162 | 25.8 | opus+txt | Internet + alignment | manifest
radio_2 | 154 | 24.6 | opus+txt | Radio | manifest
public_youtube1120 | 237 | 19.0 | opus+txt | YouTube videos | manifest
asr_public_phone_calls_2 | 66 | 9.4 | opus+txt | Internet + ASR | manifest
public_youtube1120_hq | 31 | 4.9 | opus+txt | YouTube videos | manifest
asr_public_stories_2 | 9 | 1.4 | opus+txt | Internet + alignment | manifest
tts_russian_addresses_rhvoice_4voices | 80.9 | 12.9 | opus+txt | TTS | manifest
public_youtube700 | 75.0 | 12.2 | opus+txt | YouTube videos | manifest
asr_public_phone_calls_1 | 22.7 | 3.2 | opus+txt | Internet + ASR | manifest
asr_public_stories_1 | 4.1 | 0.7 | opus+txt | Public stories | manifest
public_series_1 | 1.9 | 0.3 | opus+txt | Public series | manifest
public_lecture_1 | 0.7 | 0.1 | opus+txt | Internet + manual | manifest
Val | | | | | |
asr_calls_2_val | 2 | 0.8 | wav+txt | Internet | manifest
buriy_audiobooks_2_val | 1 | 0.5 | wav+txt | Books + manual | manifest
public_youtube700_val | 2 | 0.13 | wav+txt | YouTube videos + manual | manifest
Direct download
More information: https://github.com/snakers4/open_stt#download-instructions
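For example, a single archive can be fetched over its direct link and unpacked as in the sketch below (the archive path follows the folder structure shown above):
# Sketch: download one tar.gz archive via its direct blob URL and unpack it.
import tarfile
import urllib.request

base = 'https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus/archives/'
fname = 'asr_calls_2_val.tar.gz'
urllib.request.urlretrieve(base + fname, fname)
with tarfile.open(fname, 'r:gz') as tar:
    tar.extractall('.')  # extracts the wav/opus and txt files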
Via mounting Azure blob storage
See the notebook on the Data access tab
Contacts
For help with or questions about the data, contact the data author(s) at aveysov@gmail.com
License
This license lets reusers distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. It includes the following elements:
* BY – credit must be given to the creator
* NC – only noncommercial uses of the work are permitted
CC-BY-NC (Creative Commons Attribution-NonCommercial); commercial use is available upon agreement with the dataset authors.
Reference material / more information
Original dataset
- https://github.com/snakers4/open_stt
English articles
- https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
- https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/
Chinese articles
- https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2
Russian articles
- https://habr.com/ru/post/494006/
- https://habr.com/ru/post/474462/
Access
Available in | When to use |
---|---|
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine. |
Select your preferred service:
Azure Notebooks
!pip install numpy
!pip install tqdm
!pip install scipy
!pip install pandas
!pip install soundfile
!pip install librosa
!pip install azure-storage-blob
# manifest utils
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from urllib.request import urlopen
def reroot_manifest(manifest_df,
source_path,
target_path):
if source_path != '':
manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: x.replace(source_path,
target_path))
manifest_df.text_path = manifest_df.text_path.apply(lambda x: x.replace(source_path,
target_path))
else:
manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: os.path.join(target_path, x))
manifest_df.text_path = manifest_df.text_path.apply(lambda x: os.path.join(target_path, x))
return manifest_df
def save_manifest(manifest_df,
path,
domain=False):
if domain:
assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
else:
assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']
manifest_df.reset_index(drop=True).sort_values(by='duration',
ascending=True).to_csv(path,
sep=',',
header=False,
index=False)
return True
def read_manifest(manifest_path,
domain=False):
if domain:
return pd.read_csv(manifest_path,
names=['wav_path',
'text_path',
'duration',
'domain'])
else:
return pd.read_csv(manifest_path,
names=['wav_path',
'text_path',
'duration'])
def check_files(manifest_df,
                domain=False):
    orig_len = len(manifest_df)
    # expected column order matches read_manifest/save_manifest:
    # the 'domain' column is only present when domain=True
    if domain:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
    else:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']
    wav_paths = list(manifest_df.wav_path.values)
    text_paths = list(manifest_df.text_path.values)
    omitted_wavs = []
    omitted_txts = []
    # drop rows whose wav or txt file is missing on disk
    for wav_path, text_path in zip(wav_paths, text_paths):
        if not os.path.exists(wav_path):
            print('Dropping {}'.format(wav_path))
            omitted_wavs.append(wav_path)
        if not os.path.exists(text_path):
            print('Dropping {}'.format(text_path))
            omitted_txts.append(text_path)
    manifest_df = manifest_df[~manifest_df.wav_path.isin(omitted_wavs)]
    manifest_df = manifest_df[~manifest_df.text_path.isin(omitted_txts)]
    final_len = len(manifest_df)
    if final_len != orig_len:
        print('Removed {} lines'.format(orig_len - final_len))
    return manifest_df
def plain_merge_manifests(manifest_paths,
MIN_DURATION=0.1,
MAX_DURATION=100):
manifest_df = pd.concat([read_manifest(_)
for _ in manifest_paths])
manifest_df = check_files(manifest_df)
manifest_df_fit = manifest_df[(manifest_df.duration>=MIN_DURATION) &
(manifest_df.duration<=MAX_DURATION)]
manifest_df_non_fit = manifest_df[(manifest_df.duration<MIN_DURATION) |
(manifest_df.duration>MAX_DURATION)]
print(f'Good hours: {manifest_df_fit.duration.sum() / 3600:.2f}')
print(f'Bad hours: {manifest_df_non_fit.duration.sum() / 3600:.2f}')
return manifest_df_fit
def save_txt_file(wav_path, text):
txt_path = wav_path.replace('.wav','.txt')
with open(txt_path, "w") as text_file:
print(text, file=text_file)
return txt_path
def read_txt_file(text_path):
#with open(text_path, 'r') as file:
response = urlopen(text_path)
file = response.readlines()
for i in range(len(file)):
file[i] = file[i].decode('utf8')
return file
def create_manifest_from_df(df, domain=False):
if domain:
columns = ['wav_path', 'text_path', 'duration', 'domain']
else:
columns = ['wav_path', 'text_path', 'duration']
manifest = df[columns]
return manifest
def create_txt_files(manifest_df):
assert 'text' in manifest_df.columns
assert 'wav_path' in manifest_df.columns
wav_paths, texts = list(manifest_df['wav_path'].values), list(manifest_df['text'].values)
# not using multiprocessing for simplicity
txt_paths = [save_txt_file(*_) for _ in tqdm(zip(wav_paths, texts), total=len(wav_paths))]
manifest_df['text_path'] = txt_paths
return manifest_df
def replace_encoded(text):
text = text.lower()
if '2' in text:
text = list(text)
_text = []
for i,char in enumerate(text):
if char=='2':
try:
_text.extend([_text[-1]])
except:
print(''.join(text))
else:
_text.extend([char])
text = ''.join(_text)
return text
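As the function above shows, a '2' in a transcript is expanded by repeating the character immediately preceding it; a quick check with a made-up string:
# '2' doubles the previous character, e.g. 'програм2а' -> 'программа'
print(replace_encoded('програм2а'))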
# reading opus files
import os
import soundfile as sf
# Fx for soundfile read/write functions
def fx_seek(self, frames, whence=os.SEEK_SET):
self._check_if_closed()
position = sf._snd.sf_seek(self._file, frames, whence)
return position
def fx_get_format_from_filename(file, mode):
format = ''
file = getattr(file, 'name', file)
try:
format = os.path.splitext(file)[-1][1:]
format = format.decode('utf-8', 'replace')
except Exception:
pass
if format == 'opus':
return 'OGG'
if format.upper() not in sf._formats and 'r' not in mode:
raise TypeError("No format specified and unable to get format from "
"file extension: {0!r}".format(file))
return format
#sf._snd = sf._ffi.dlopen('/usr/local/lib/libsndfile/build/libsndfile.so.1.0.29')
sf._subtypes['OPUS'] = 0x0064
sf.SoundFile.seek = fx_seek
sf._get_format_from_filename = fx_get_format_from_filename
def read(file, **kwargs):
return sf.read(file, **kwargs)
def write(file, data, samplerate, **kwargs):
return sf.write(file, data, samplerate, **kwargs)
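With the patch in place, Opus files can be read directly through the soundfile API, assuming a libsndfile build with Ogg/Opus support (cf. the commented-out dlopen line above); a quick local sanity check with a hypothetical path:
# Sketch: read a local .opus file via the patched soundfile wrappers above.
wav, sr = read('ru_open_stt_opus_unpacked/public_series_1/0/00/0123456789ab.opus')  # hypothetical path
print(wav.shape, sr, '{:.2f} seconds'.format(len(wav) / sr))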
# display utils
import gc
from IPython.display import HTML, Audio, display_html
pd.set_option('display.max_colwidth', 3000)
#Prepend_path is set to read directly from Azure. To read from local replace below string with path to the downloaded dataset files
prepend_path = 'https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus_unpacked/'
def audio_player(audio_path):
return '<audio preload="none" controls="controls"><source src="{}" type="audio/wav"></audio>'.format(audio_path)
def display_manifest(manifest_df):
display_df = manifest_df
display_df['wav'] = [audio_player(prepend_path+path) for path in display_df.wav_path]
display_df['txt'] = [read_txt_file(prepend_path+path) for path in tqdm(display_df.text_path)]
audio_style = '<style>audio {height:44px;border:0;padding:0 20px 0px;margin:-10px -20px -20px;}</style>'
display_df = display_df[['wav','txt', 'duration']]
display(HTML(audio_style + display_df.to_html(escape=False)))
del display_df
gc.collect()
manifest_df = read_manifest(prepend_path +'/manifests/public_series_1.csv')
#manifest_df = reroot_manifest(manifest_df,
#source_path='',
#target_path='../../../../../nvme/stt/data/ru_open_stt/')
sample = manifest_df.sample(n=20)
display_manifest(sample)
!ls ru_open_stt_opus/manifests/*.csv
%matplotlib inline
import librosa
from scipy.io import wavfile
from librosa import display as ldisplay
from matplotlib import pyplot as plt
manifest_df = read_manifest(prepend_path +'manifests/asr_calls_2_val.csv')
#manifest_df = reroot_manifest(manifest_df,
#source_path='',
#target_path='../../../../../nvme/stt/data/ru_open_stt/')
sample = manifest_df.sample(n=5)
display_manifest(sample)
from io import BytesIO
wav_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+wav_path)
data = response.read()
sr, wav = wavfile.read(BytesIO(data))
wav = wav.astype('float32')
absmax = np.max(np.abs(wav))
wav = wav / absmax
# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)
manifest_df = read_manifest(prepend_path +'manifests/asr_public_phone_calls_2.csv')
#manifest_df = reroot_manifest(manifest_df,
#source_path='',
#target_path='../../../../../nvme/stt/data/ru_open_stt/')
sample = manifest_df.sample(n=5)
display_manifest(sample)
opus_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+opus_path)
data = response.read()
wav, sr = sf.read(BytesIO(data))
wav = wav.astype('float32')
absmax = np.max(np.abs(wav))
wav = wav / absmax
# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)