Creating a program to automate podcast editing - Printable Version

+- we live in hell (https://weliveinhell.net)
+-- Forum: interests & hobbies (https://weliveinhell.net/forumdisplay.php?fid=10)
+--- Forum: science & technology (https://weliveinhell.net/forumdisplay.php?fid=11)
+--- Thread: Creating a program to automate podcast editing (/showthread.php?tid=103)
Creating a program to automate podcast editing - ScottyMcGee - 02-01-2025

Long post, so the tl;dr version is - I'm trying to create a program in Python that would help me splice up audio files to edit podcasts according to the script.

Heya! So. I've been fiddling with Python for a couple of years now, but recently caved to using AI to help edit/come up with a program in Python that can splice together audio files and rearrange them in the order they appear in the script. Editing my podcast has been the bane of my existence, and the show would be back up to speed if I could drastically cut down the time editing takes. There will still be room for things I want to do myself (i.e. creating Foley, modifying audio to give a certain effect, adding music - that's all the FUN stuff of audio editing), but the BASIC thing I want this program to do is splice all the audio files in order, because that right there is 90% of the tedious work.

For context, this isn't a "regular" podcast of two or three people talking about a topic. This is an audio drama. Voice artists give me their recordings and I manually splice all of their lines together according to the script. The stories have a lot of one-off characters and many sound effects, amounting to hundreds of audio files. Voice artists have typically given me one big audio file with all of their character's lines in the script, as that's the easiest way for them to record.

Now, what I've been trying to make here SORT OF has been working, but it's still not great, and here's where I'm opening it up to others to see if they have any ideas. The program currently does the following:

- Reads the script (PDF or DOCX)
- Reads the audio files (named in the format "character_episode number", e.g. "Bob_1.mp3"; it recognizes either .mp3 or .wav)
- Exports a new audio file in .mp3 format after reading the script and using speech recognition to splice the audio files in the order of the script
- Right now, it uses PyPDF2, python-docx, pydub, and ffmpeg to do these tasks

The program (which I call "the assembler") at first recognized where a character started speaking but didn't know where to stop. I had told it to recognize dialogue in the script by the format "character(colon)", so "Bob:" would indicate that Bob is speaking after the colon. The script typically has the format:

Quote:
Bob: (line of dialogue)
Ryan: (line of dialogue)

But the assembler created a file that would play Bob's file in its entirety and then Ryan's file in its entirety, so it didn't splice any audio to create the dialogue. Therefore, I modified the script to add a colon before each character name (and SFX too), so that the assembler understands the colons as a "go and stop" for each character and SFX. So now the script is like:

Quote:
:Bob: (line of dialogue)
:Ryan: (line of dialogue)

I got excited because it started working, but then I ran into another issue - the assembler needs to understand when and where to splice audio. It would sometimes cut off Bob's or Ryan's lines when there was a pause in their speaking. ChatGPT first suggested inserting timestamps in the script for each line, so the assembler would know exactly when and where to splice. But in NO WAY is any sane person going to go through all these long-ass audio files for each character and mark up the timestamps for each line of speaking. I asked if it could somehow recognize the silences between lines and use the silence as a marker for splicing, and yes, there is a way to do that.
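pydub's split_on_silence is the relevant helper here. A minimal sketch (the thresholds are guesses that need tuning per recording, and the filenames just follow my naming scheme):

Code:
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Load one voice artist's full recording.
audio = AudioSegment.from_file("Bob_1.mp3")

# Cut the recording wherever there's a long enough stretch of quiet.
clips = split_on_silence(
    audio,
    min_silence_len=1500,            # >= 1.5 s of quiet counts as a break (in ms)
    silence_thresh=audio.dBFS - 16,  # "quiet" = 16 dB below the file's average level
    keep_silence=200,                # keep 200 ms of padding on each clip
)

# Each clip should now be one spoken line, in recording order.
for i, clip in enumerate(clips, start=1):
    clip.export(f"Bob_line_{i}.wav", format="wav")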
I got closer to a more comprehensible finished audio file, but here's the remaining problem - not all silences are the same length. Some lines have awkward pauses, or just a moment to breathe here and there during narration (dramatic pauses and the like). So while I can put something like "splice after 1.5 seconds of silence" in the code, some lines will no doubt get cut off, because maybe there are 2-second pauses or 3-second pauses, etc. So now I believe I'm reaching more complicated territory: how does this thing recognize the nuances of the dialogue? There are also some general issues I noticed where certain lines seemingly can't be understood, probably because the voice artist was using an accent or pronounced words in a way the assembler couldn't recognize.

Now - a much simpler way I realized this could work is to do away with splicing and have each spoken line in the script recorded individually. Say a character has 25 lines in a script. Instead of recording one audio file, the voice artist would export 25 audio files, one for each line. The assembler wouldn't have to rely on voice recognition, as it would just go down the script matching numbered files (i.e. "Bob Line 1.wav" goes here, "SFX: poo poo pee pee" goes here, "Ryan Line 3.wav" goes here). But the downside is that this is far more work for the voice artists, especially when the total number of lines runs into the hundreds. Voice work is far more comfortable when you just keep recording as you go, pausing and redoing flubs as they happen, rather than "talk and stop, talk and stop" recording like that. It would also require me to remind voice artists to strictly format their filenames. Each character's lines would have to be numbered so that the assembler knows "okay, this is Bob line 1; next is Ryan line 1; now Bob line 2". So again - more work for the voice artists, which makes things tedious for them.

And lastly, here is the code for the assembler to look over. There are other nitty-gritty details I didn't mention that are in there, like creating exceptions and making sure the assembler can still continue making an audio file if something is missing or incomprehensible.

Quote:
import os

RE: Creating a program to automate podcast editing - gorzek - 02-02-2025

Have it break up every single line as its own file. Then each script line corresponds to exactly one audio file. ffprobe can tell you the length of each audio file, so now you're just doing simple addition to timestamp everything. You can also use ffmpeg to insert silences before/after each line to ensure enough pause between them. I'd say give that approach a shot before trying anything more complex.
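The timestamp math is something like this (untested sketch - the per-line filenames are hypothetical, and you'd want real error handling):

Code:
import subprocess

def duration_seconds(path):
    # Ask ffprobe for the file's duration in seconds.
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

# Hypothetical per-line files, already in script order.
files = ["Bob Line 1.wav", "Ryan Line 1.wav", "Bob Line 2.wav"]
gap = 0.5  # seconds of silence to leave between lines

t = 0.0
for f in files:
    print(f"{t:7.2f}s  {f}")
    t += duration_seconds(f) + gap

From there, a concat pass (ffmpeg or pydub) can lay the files down at those times with the silences in between.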
RE: Creating a program to automate podcast editing - ScottyMcGee - 02-12-2025

I DID IT. I FUCKING FIGURED IT OUT.

So, it still wasn't working, and then, as I was staring into space, the obvious answer hit me - I just have to make sure the VAs include at least 2 seconds of silence between each line. Simple! I went to my example files and roughly put a 2-second pause between each line, and restarted everything from the top with the code. I explicitly described to ChatGPT that the code must splice the audio wherever it recognizes 2 seconds of silence. I used an example from the script so that it would understand that the program has to go back to an audio file and pick up where it last left off, in order to splice the audio into a coherent dialogue with SFX according to the script.

It's not even doing any voice recognition anymore. It's all based on recognizing the speakers and the SFX by a unique format in the script (each speaker and SFX is indicated by a colon before and after the name, so for example ":BOB: Hi!" or ":SFX: explosions"), understanding that each of those cues means audio plays from the corresponding file (e.g. ":BOB:" means it needs to look for "Bob_1.wav" in my directory), and splicing the files by recognizing the 2-second silences.

RE: Creating a program to automate podcast editing - gorzek - 02-12-2025

Hey, I'm glad it worked!!!
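In outline, the approach ScottyMcGee describes - ":NAME:" cues in the script, one big recording per speaker, splits at the agreed 2-second silences - comes down to something like this (a hypothetical sketch, not the thread's actual assembler code; thresholds and SFX handling are assumed or left out):

Code:
import re
from pydub import AudioSegment
from pydub.silence import split_on_silence

EPISODE = 1
SCRIPT = """:BOB: Hi!
:RYAN: Hey, Bob.
:BOB: Long time no see."""

# Each speaker's single recording is split at the 2-second pauses once,
# then clips are handed out in order whenever the script cues that speaker.
queues = {}

def next_clip(name):
    if name not in queues:
        audio = AudioSegment.from_file(f"{name.capitalize()}_{EPISODE}.wav")
        clips = split_on_silence(
            audio,
            min_silence_len=2000,            # the agreed-on 2 s gap
            silence_thresh=audio.dBFS - 16,  # "silence" is relative; tune per file
            keep_silence=150,                # keep a little breathing room
        )
        queues[name] = iter(clips)
    return next(queues[name])  # StopIteration here means a line is missing

episode = AudioSegment.empty()
for cue in re.finditer(r"^:(\w+):", SCRIPT, flags=re.MULTILINE):
    episode += next_clip(cue.group(1)) + AudioSegment.silent(duration=300)

episode.export("episode_1.mp3", format="mp3")

The fallbacks ScottyMcGee mentions (carrying on when a clip is missing or garbled) would wrap next_clip.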