Keyword identification framework for speech communication on construction sites


  • Asif Mansoor University of Alberta, Canada
  • Shuai Liu University of Alberta, Canada
  • Ghulam Muhammad Ali University of Alberta, Canada
  • Ahmed Bouferguene University of Alberta, Canada
  • Mohamed Al-Hussein University of Alberta, Canada
  • Imran Hassan University of Turbat



Keyword identification, Mel-frequency cepstral coefficients, Convolutional neural network, Communication, Crane signalman


Worksite communication is key to boosting teamwork and improving worker performance on construction sites, and most of it takes the form of speech. However, construction sites are typically noisy because of tasks such as drilling and the operation of heavy equipment. Workers on construction sites also come from a range of ethnic and linguistic backgrounds and speak with different accents. These factors can make it difficult for the listener to understand the speaker clearly, leading to miscommunication and errors in decision making on the construction site. Technological advancements in recent years can be leveraged to mitigate this problem. In this paper, a keyword identification framework is developed for speech communication on the construction site. For this framework, 12 hours of raw audio data containing 18 crane signalman speech commands (referred to as “keywords”) are collected. The crane signalman uses these specific keywords to communicate with the crane operator and guide crane operations. Two-second audio clips (the approximate duration of each keyword) are extracted from the raw audio dataset, and construction site noise is added. Mel-frequency cepstral coefficients are then extracted from the resulting waveform audio dataset and used to train a 1-dimensional convolutional neural network. After training, the model achieves a training accuracy of 97.3%, a validation accuracy of 96.1%, and a testing accuracy of 93.8%. The model is further deployed for real-time identification of keywords in speech, where it achieves an accuracy of 95.3%. In light of these findings, it can be concluded that the developed framework is suitable for real-time application on noisy construction sites for identifying specific keywords in speech.
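
The two preprocessing steps named in the abstract, adding construction site noise to a 2-second clip and extracting mel-frequency cepstral coefficients, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the sampling rate, the signal-to-noise ratio, or any MFCC settings, so the 16 kHz rate, the 10 dB SNR, the frame and hop sizes, the 40 mel bands, and the 13 coefficients below are all assumptions, and the function names are hypothetical. A synthetic tone and random noise stand in for a recorded keyword and site noise.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # repeat or trim noise to clip length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    """Return an (n_frames, n_mfcc) array of MFCCs. All defaults are assumed values."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)            # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # power spectrum per frame
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))  # unnormalized DCT-II basis
    return log_mel @ dct.T

# Stand-in data: a 2-second tone as the "keyword", random noise as the "site".
sr = 16000                                   # assumed sampling rate
t = np.arange(2 * sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 440.0 * t)
noise = np.random.default_rng(0).normal(size=sr)
noisy = mix_at_snr(speech, noise, snr_db=10.0)   # assumed SNR
features = mfcc(noisy, sr=sr)                    # one row of coefficients per frame
```

Each row of `features` describes one short frame of the clip, so the array forms a sequence along the time axis. That layout is the kind of input a 1-dimensional convolutional network can slide its filters over.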