VoiceAI "Transpose" ambiguity

Answered

I'm having a hard time figuring out how best to use VoiceAI's presets in music. Each preset is provided with a preferred pitch. Is that pitch equivalent to the best key for that preset, or does it mean something else? Sonarworks' guidance states, “Transpose values above or below 12 might produce unexpected results.” What, precisely, does that mean? Is a VoiceAI user limited to transposing within one octave up or down before audio problems appear? If so, why?

Finished songs have fixed keys, and changing a song's key is impractical most of the time. Replacing a vocal track in a finished song with a VoiceAI preset therefore requires matching the preset's output to the song's key. Actually doing that has proved more difficult than I anticipated. For the female presets, the preferred pitches span eight semitones, from Db4 to A4. Compared with actual female singers' ranges, that's restrictive. In a song in G, applying a female preset with a Gb preferred pitch and Transpose left at 0, the output is on-key. However, notes below F3 or so exhibit artifacts and distortion that make them unusable. Diana Krall sings as low as G2 in her songs; Shawn Colvin and Beth Nielsen Chapman are frequently down around F3. None of VoiceAI's available female presets appear able to produce usable output near those note ranges.
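To make the arithmetic concrete, here is a minimal sketch (plain Python, not part of VoiceAI or any Sonarworks API; the note-name helper is hypothetical) of how far a Transpose value would have to travel to reach the low notes mentioned above.

```python
# Hypothetical helpers illustrating the semitone arithmetic behind
# a Transpose value; VoiceAI itself exposes only the parameter.

NOTE_OFFSETS = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
                "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

def midi_number(note: str) -> int:
    """Convert a note name like 'Db4' or 'G2' to a MIDI note number."""
    name, octave = note[:-1], int(note[-1])
    return 12 * (octave + 1) + NOTE_OFFSETS[name]

def transpose_needed(preset_pitch: str, target_note: str) -> int:
    """Semitones of Transpose needed to move the preset's preferred
    pitch onto a target note (positive = shift up)."""
    return midi_number(target_note) - midi_number(preset_pitch)

# A Gb4 preset asked to cover G2 needs -23 semitones of Transpose,
# well outside the +/-12 range the guidance recommends.
print(transpose_needed("Gb4", "G2"))  # -23
```

The same arithmetic confirms the eight-semitone span quoted above: A4 (MIDI 69) minus Db4 (MIDI 61) is 8.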

Sonarworks' guidance states, “If the target preset sings in a lower/higher pitch than your input voice track, decrease/increase the value of the Transpose parameter.” But if I use a Transpose value between ±1 and ±11, I'm pitch-shifting the result, altering its formants, and introducing artifacts into the output signal in order to use it in a song. Why would I ever want to do that?

Seems to me that if you can't find a preset that meets your needs with a Transpose parameter of 0, then you're just out of luck. Am I missing something?

1 comment


Brian Halonen, thanks for posting!

It means the best input pitch, as described in the plugin. Think of each vocal model as a real human being with their own vocal range: it will work best at the described input pitch and the range around it.

And yes, if you use a Transpose setting beyond ±12, some unexpected results may start to appear. These are difficult to describe because they depend heavily on the input audio (which can be anything), so let's call them audio artifacts for lack of a better term. As you push a vocal model toward the Transpose limits, you will gradually hear that it no longer sounds like a natural, realistic human voice (the "unexpected results"). For example, if you apply vocal model processing and lower it by two octaves (making it sound like a bass that barely resembles a human voice), you will inevitably hear some strange, non-human results (although these might be usable, or even desirable, in a creative or musical context). Within the ±12 range, though, there shouldn't be any issues getting to the key you need, keeping in mind that you want to apply a preset with an appropriate input pitch.

We haven't received other user feedback suggesting the voice catalog is fundamentally restrictive with respect to input pitch - so far, feedback suggests users can get pretty much what they want out of the plugin in terms of vocal type and input/output pitch. It sounds like something very particular may be going on in the examples you describe. If there is indeed an issue, we'd need to hear the input/output audio and see the project details for context. Perhaps you could send a screen video/audio capture of the entire example? Feel free to submit a support request if you'd like us to investigate in detail.

It would be great to see feedback from other users here, as we haven't seen much about input pitch and the Transpose limits.
