You don’t always need a verb. On my Apple TV I usually just say the name of the content, because I want to pick where it comes from.
But even if you say "play", it could still ask where to play it from and/or confirm it got the right thing. Roku != Amazon Echo.
"[...] favor YouTube music results from voice commands made on the Roku remote while the YouTube app is open"
The key word is "favor".