We present JustSayIt.jl, a software package and high-level API for offline, low-latency, and secure translation of human speech to computer commands or text, leveraging the Vosk Speech Recognition Toolkit. The API includes an unprecedented, highly generic extension to the Julia programming language that allows arguments in standard function definitions to be declared obtainable by voice. As a result, it empowers any programmer to quickly write new commands that take arguments from human voice.
Leading software companies have invested heavily in voice assistant software since the dawn of the century. However, they naturally prioritize use cases that directly or indirectly generate economic profit. As a result, their developments cover, e.g., the needs of the entertainment sector abundantly, but those of academia and software development only poorly. There is particularly little support for Linux, even though it is the preferred operating system of many software developers and computational scientists. The open source voice assistant project MyCroft fully supports Linux, but provides few tools that are helpful for productive work in academia and software development; moreover, adding new skills to MyCroft appears to be complex for average users and to require considerable knowledge of the specificities of MyCroft. JustSayIt.jl addresses these shortcomings by providing a lightweight framework for easily extensible, offline, low-latency, highly accurate, and secure speech-to-command and speech-to-text translation on Linux, MacOS and Windows.
JustSayIt's high-level API allows arguments in standard Julia function definitions to be declared obtainable by voice, which constitutes an unprecedented, highly generic extension to the Julia programming language. For each such function, JustSayIt automatically generates a wrapper method that takes care of the complexity of retrieving the arguments from the speaker's voice, including the interpretation and conversion of the voice arguments to potentially any data type. JustSayIt commands are implemented with such voice-argument functions and are triggered by a user-definable mapping of command names to functions (see the sketch below). As a result, it empowers programmers without any knowledge of speech recognition to quickly write new commands that take their arguments from the speaker's voice. Moreover, JustSayIt unites the Julia and Python communities by using both languages: it leverages Julia's performance and metaprogramming capabilities and Python's larger ecosystem where no Julia package is considered suitable. JustSayIt relies on PyCall.jl and Conda.jl, which render installing and calling Python packages from within Julia almost trivial. JustSayIt is therefore ideally suited for development by the worldwide open source community, as it provides an intuitive high-level API that is readily understandable by any programmer.
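To illustrate, the following minimal sketch shows how a voice-argument function and a command mapping might look. The `@voiceargs` macro and the `valid_input` option follow our reading of the JustSayIt documentation; the `move` function and the "move" command name are hypothetical:

```julia
using JustSayIt

# Declare `direction` obtainable by voice, restricted to four valid words;
# JustSayIt generates a wrapper that fills it in from the speaker's voice.
@voiceargs direction=>(valid_input=["up", "down", "left", "right"]) function move(direction::String)
    println("Moving $direction...")
end

# User-definable mapping of spoken command names to functions.
commands = Dict("help" => Help.help,
                "move" => move)
start(commands=commands)
```

Note that the programmer only writes an ordinary Julia function; the retrieval, interpretation, and conversion of the voice argument are handled entirely by the generated wrapper.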
JustSayIt implements a novel algorithm for high-performance, context-dependent recognition of spoken commands, which leverages the Vosk Speech Recognition Toolkit. A specialized high-performance recognizer is defined for each function argument that is obtainable by voice and has a restriction on the valid input. In addition, when beneficial for recognition accuracy, the recognizer for a voice argument is generated dynamically depending on the command path taken before the argument. To enable minimal latency for single-word commands (latency refers here to the time elapsed between the moment a command is spoken and its execution), these can, under certain conditions, be triggered upon bare recognition of the corresponding sounds, without waiting for the silence that normally confirms a recognition. Thus, JustSayIt is suitable for commands where a perceivable latency would be unacceptable, such as mouse clicks. The latency of single-word commands is typically on the order of a few milliseconds on a regular notebook. JustSayIt achieves this high performance using only one CPU core and can therefore run continuously without harming the computer usage experience.
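The mechanism underlying such specialized recognizers can be sketched as follows: Vosk allows a recognizer to be constructed with a grammar that restricts the valid input, and Python packages such as Vosk are accessible from Julia via PyCall.jl. The snippet below is a minimal illustration of this concept under these assumptions, not JustSayIt's actual implementation; the model path is a placeholder:

```julia
using PyCall

vosk  = pyimport("vosk")
model = vosk.Model("path/to/vosk-model-small-en-us")

# A specialized recognizer for a voice argument: the grammar (a JSON list of
# valid phrases, plus "[unk]" for anything else) restricts recognition to the
# argument's valid input, improving both accuracy and speed over a free-form
# recognizer.
grammar    = "[\"up down left right\", \"[unk]\"]"
recognizer = vosk.KaldiRecognizer(model, 16000, grammar)
```

Restricting the search space in this way is what makes it feasible to run many recognizers, one per voice argument, while keeping recognition fast and accurate.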
In conclusion, JustSayIt demonstrates that the development of our future voice assistants can take a fresh path that is driven neither by the priorities and economic interests of global software companies nor by a small open source community of speech recognition experts; instead, the entire worldwide open source community is empowered to contribute to shaping our future daily assistants.