How to automatically segment speech in Praat

12/5/2023

Speech transcription concerns the generation of a verbatim textual record of speech. The related process of segmentation additionally concerns determining when the transcribed words and segments occur in a speech recording. This article primarily addresses segmentation. Constructing transcriptions and segmentations typically involves three challenges. The first challenge is to take into account the purpose of the segmentation when determining the desired granularity of the segmentation units. Due to fine phonetic details (Hawkins, 2003) and reduction phenomena (Ernestus & Warner, 2011), word-based transcriptions are much easier and faster to construct than high-quality, finer-grained, faithful phonetic segmentations. Rough, errorful transcription may be sufficient for text query-based services, and may be quickly constructed. Segmentation of varying degrees of accuracy may be required for rich diarization of meetings, or for the adaptation of acoustic models in automatic speech recognition (ASR). Language research represents a highly niche use case for segmentation, with its own specific requirements and constraints. The second challenge is the construction of the segmentation itself. One may perform segmentation by hand, apply an automatic speech segmentation system, or combine the two. Over the last decades, several tools have been developed to ease this task (see, e.g., van Bael et al., 2007; Lecouteux et al., 2012).

Despite advances in automatic speech recognition (ASR), human input is still essential for producing research-grade segmentations of speech data. Conventional approaches to manual segmentation are very labor-intensive. We introduce POnSS, a browser-based system that is specialized for the task of segmenting the onsets and offsets of words, which combines aspects of ASR with limited human input. In developing POnSS, we identified several sub-tasks of segmentation and implemented each as a separate interface for annotators to interact with, to streamline their task as much as possible. We evaluated segmentations made with POnSS against a baseline of segmentations of the same data made conventionally in Praat. We observed that POnSS achieved comparable reliability to segmentation using Praat, but required 23% less annotator time. Because of its greater efficiency without sacrificing reliability, POnSS represents a distinct methodological advance for the segmentation of speech data.

In many speech-based disciplines, the availability of adequately segmented and transcribed speech corpora is essential for designing and benchmarking computational models of speech processing and for sharpening theories of speech production and perception. Many of the speech databases available to date (e.g., via The Language Archive, 2019; European Language Resources Association, 2019; Linguistic Data Consortium, 2019) have been (at least partly) enriched with a verbatim word-level and/or phonetic transcription.
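To make the idea of comparing two word-level segmentations of the same recording concrete, here is a minimal Python sketch. It is not the evaluation procedure used for POnSS, which this text does not describe. The function name `boundary_agreement`, the `(word, onset, offset)` tuple layout, and the 20 ms tolerance are all assumptions chosen for illustration.

```python
from statistics import mean


def boundary_agreement(reference, candidate, tolerance=0.020):
    """Compare word onsets/offsets (in seconds) from two segmentations.

    Both arguments are lists of (word, onset, offset) tuples covering the
    same words in the same order. The tuple layout and the default 20 ms
    tolerance are illustrative assumptions, not values from the source.
    """
    if len(reference) != len(candidate):
        raise ValueError("segmentations must contain the same number of words")

    diffs = []
    for (ref_word, ref_on, ref_off), (cand_word, cand_on, cand_off) in zip(
        reference, candidate
    ):
        if ref_word != cand_word:
            raise ValueError(f"word mismatch: {ref_word!r} vs {cand_word!r}")
        # Collect the absolute disagreement for the onset and the offset.
        diffs.append(abs(ref_on - cand_on))
        diffs.append(abs(ref_off - cand_off))

    return {
        # Average boundary disagreement in seconds.
        "mean_abs_diff": mean(diffs),
        # Share of boundaries that agree within the tolerance window.
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


if __name__ == "__main__":
    # Hypothetical boundaries for two words, in seconds.
    baseline = [("the", 0.10, 0.25), ("cat", 0.25, 0.60)]
    alternative = [("the", 0.11, 0.25), ("cat", 0.26, 0.65)]
    print(boundary_agreement(baseline, alternative))
```

The sketch assumes the two segmentations have already been exported to plain lists of word boundaries, so reading them out of Praat or any other tool is left aside.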
