I DID IT
I FUCKING FIGURED IT OUT.
So, it still wasn't working, and then as I was staring into space the obvious answer hit me: I'll just have to make sure the VAs include at least 2 seconds of silence between each line. Simple! I went to my example files and roughly put a 2-second pause between each line, then restarted everything from the top with the code. I explicitly told ChatGPT to make sure the code splits the audio wherever it detects 2 seconds of silence. I gave it an example from the script so it would understand that the program has to go back to an audio file and pick up where it last left off, in order to splice the audio into a coherent dialogue with SFX according to the script.

It's not even doing any voice recognition anymore. It's all based on recognizing the speakers and the SFX through a unique format in the script: each speaker and SFX cue is marked with a colon before and after the name, so for example ":BOB: Hi!" or ":SFX: explosions". Each of those instances means audio plays from its respective file (for example, ":BOB:" means it needs to look for "Bob_1.wav" in my directory), and the files get spliced by recognizing the 2-second silence.
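For anyone curious what that logic looks like, here's a minimal sketch of the approach. This is an assumption on my part, not the exact code ChatGPT produced: it uses the pydub library (which has a split_on_silence helper), and the file names (script.txt, Bob_1.wav, Alice_1.wav, SFX_1.wav) and the tag-to-file mapping are hypothetical placeholders.

```python
# Sketch of the silence-splicing idea, assuming pydub and hypothetical file names.
import re
from pydub import AudioSegment
from pydub.silence import split_on_silence

SCRIPT = "script.txt"      # hypothetical script file with :TAG: lines
AUDIO_FILES = {            # hypothetical mapping of script tags to takes
    "BOB": "Bob_1.wav",
    "ALICE": "Alice_1.wav",
    "SFX": "SFX_1.wav",
}

# Split each take into chunks wherever there is >= 2 s of silence.
chunks = {}
for tag, path in AUDIO_FILES.items():
    audio = AudioSegment.from_wav(path)
    chunks[tag] = split_on_silence(
        audio,
        min_silence_len=2000,            # the 2-second gap the VAs leave
        silence_thresh=audio.dBFS - 16,  # "silence" relative to the take's loudness
        keep_silence=200,                # keep a little breathing room on each chunk
    )

# Walk the script; each :TAG: line pulls the NEXT unused chunk from that
# tag's file, so the program picks up where it last left off.
cursor = {tag: 0 for tag in AUDIO_FILES}
output = AudioSegment.empty()
with open(SCRIPT, encoding="utf-8") as f:
    for line in f:
        m = re.match(r":(\w+):", line.strip())
        if not m or m.group(1) not in chunks:
            continue
        tag = m.group(1)
        output += chunks[tag][cursor[tag]]  # assumes the script and takes line up
        cursor[tag] += 1

output.export("dialogue.wav", format="wav")
```

The nice part of this design is that nothing here listens to the words at all; the script's :TAG: format does the sequencing and the 2-second gaps do the segmentation.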