
Train-clean-100 dataset

2022-07-24 07:40:00 【yn20000227】

LibriSpeech:

It is a corpus of read speech based on LibriVox public-domain audiobooks. Its purpose is to support the training and testing of automatic speech recognition (ASR) systems.

The corpus is split into several parts so that users can download only the subsets they need. Subsets with "clean" in their name are considered "cleaner" (at least on average) than the rest of the audio, and closer to American English in accent. This classification was obtained by very crude automated means and should not be considered completely reliable. The subsets are disjoint: each speaker's audio is assigned to exactly one subset.

The corpus is organized as follows:

* dev-clean, test-clean - development and test sets containing "clean" speech
* train-clean-100 - training set, about 100 hours of "clean" speech
* train-clean-360 - training set, about 360 hours of "clean" speech
* dev-other, test-other - development and test sets, with speech automatically selected to be more "challenging" to recognize
* train-other-500 - training set of about 500 hours, containing speech that was not classified as "clean"
* intro - subset containing only the LibriVox intro disclaimers for some of the readers
* mp3 - the original MP3-encoded audio on which the corpus is based
* texts - the texts corresponding to the audio in the corpus
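
The disjointness property described above can be checked on a local copy of the corpus. The following is only a minimal sketch: it assumes the audio subsets are unpacked as `<corpus_root>/<subset>/<speaker_id>/...`, with one directory per speaker, which this post does not state. The helper names `speakers_by_subset` and `overlapping_speakers` are hypothetical and not part of any LibriSpeech tooling.

```python
from itertools import combinations
from pathlib import Path

# Audio subset names as listed above.
AUDIO_SUBSETS = [
    "dev-clean", "test-clean", "dev-other", "test-other",
    "train-clean-100", "train-clean-360", "train-other-500",
]


def speakers_by_subset(corpus_root):
    """Map each subset found under corpus_root to its set of speaker IDs.

    Assumption (not from the post): each subset directory contains one
    sub-directory per speaker, named after the speaker ID.
    """
    root = Path(corpus_root)
    result = {}
    for name in AUDIO_SUBSETS:
        subset_dir = root / name
        if subset_dir.is_dir():  # skip subsets that were not downloaded
            result[name] = {p.name for p in subset_dir.iterdir() if p.is_dir()}
    return result


def overlapping_speakers(speakers):
    """Return {(subset_a, subset_b): shared_ids} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(speakers), 2):
        shared = speakers[a] & speakers[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

If the subsets really are disjoint, `overlapping_speakers(speakers_by_subset("LibriSpeech"))` returns an empty dictionary; here `"LibriSpeech"` stands for wherever the archives were extracted.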