Long post, so the tl;dr version: I'm trying to create a program in Python that helps me splice audio files together to edit my podcast according to the script.
Heya! So. I've been fiddling with Python for a couple of years now, but recently caved to using AI to help me edit and come up with a program in Python that can splice together audio files and rearrange them in the order they appear in the script. Editing my podcast has been the bane of my existence, and the show would be back up to speed if I could cut the editing time down by a lot. There will still be room for things I want to do myself (i.e. creating Foley, modifying audio to give a certain effect, adding music - that's all the FUN stuff of audio editing), but the BASIC thing I want this program to do is splice all the audio files in order, because that right there is 90% of the tedious work.
For context, this isn't a "regular" podcast of two or three people talking about a topic. This is an audio drama. Voice artists give me their recordings and I manually splice all of their lines together according to the script. The stories have a lot of one-off characters and many sound effects, amounting to hundreds of audio files. Voice artists typically give me one big audio file containing all of their character's lines in the script, since that's the easiest way for them to record.
Now, what I've been trying to make here SORT OF works, but it's still not great, and that's where I'm opening it up to others to see if anyone has ideas.
The program currently does the following:
- Reads the script (PDF or DOCX).
- Reads the audio files, which are named in the format "character_episodenumber" (for example, "Bob_1.mp3"); it recognizes both .mp3 and .wav.
- Exports a new audio file in .mp3 format after reading the script and using speech-to-text recognition to splice the audio files in the order of the script.
- Right now, it uses PyPDF2, python-docx, pydub, and ffmpeg to do these tasks.
The program (which I call "the assembler") at first recognized where a character started speaking but didn't know where to stop. I had told it to recognize dialogue in the script by the format "character(colon)", so "Bob:" would indicate that Bob speaks after the colon. The script typically has this format:
    Bob:
    Hey there!
    Ryan:
    What's up?
    SFX: explosions
But the assembler created a file that played Bob's file in its entirety and then Ryan's file in its entirety - it didn't splice any audio to create the dialogue. So I modified the script to add a colon before each character name (and SFX too), so that the assembler treats the colon as a "go and stop" marker for each character and SFX. Now the script looks like:
    :Bob:
    Hey there!
    :Ryan:
    What's up?
    :SFX: explosions
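(A quick aside, now that I've stared at this more: the leading colon is really only there to make the parsing dead simple. If the parser matched on a line-anchored pattern instead, plain "Bob:" headers would probably work without editing the script at all. A rough sketch of that idea - the regex and the assumption that headers always start a line are mine, not what the assembler currently does:)

    import re

    # Hypothetical alternative parser: match "Bob:" or "SFX: explosions" at the
    # START of a line, so the script wouldn't need the extra leading colon.
    # Caveat: a narration line like "Note: remember this" would also match,
    # so a real version would want a whitelist of known character names.
    HEADER_RE = re.compile(r"^([A-Za-z][\w ]*):\s*(.*)$")

    def parse_line(line):
        m = HEADER_RE.match(line.strip())
        if not m:
            return None  # Not a dialogue/SFX header line.
        name, rest = m.group(1).strip(), m.group(2).strip()
        if name.upper() == "SFX":
            return {"type": "sfx", "name": rest}
        return {"type": "dialogue", "name": name}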
Anyway, I got excited because the colon format started working, but then I ran into another issue - the assembler needs to understand when and where to splice the audio. It would sometimes cut off Bob's or Ryan's lines when there was a pause in their speaking.
ChatGPT first suggested inserting timestamps in the script for each line, so the assembler would know exactly when and where to splice. But in NO WAY is any sane person going to go through all these long-ass audio files for each character and mark up timestamps for every line of speech. I asked if it could instead recognize the silences between lines and use those as splice markers, and yes, there is a way to do that.
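(For anyone who hasn't used it, that's pydub's split_on_silence. A minimal sketch of the approach - the filename and thresholds here are just placeholder examples:)

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # Load one performer's big recording (example filename).
    audio = AudioSegment.from_file("Bob_1.mp3")

    # Split wherever the audio stays quieter than ~16 dB below its own
    # average loudness for at least 1.5 seconds.
    segments = split_on_silence(
        audio,
        min_silence_len=1500,            # minimum silence length, in ms
        silence_thresh=audio.dBFS - 16,  # "quiet" relative to this file's level
        keep_silence=200                 # keep a little breathing room on each cut
    )
    print(f"Found {len(segments)} spoken chunks")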
That got me closer to a comprehensible finished audio file, but here's the remaining problem - not all silences are the same length. Some lines have awkward pauses, or just a moment to breathe here and there during narration (dramatic pauses and the like). So while I can put something in the code like "splice after 1.5 seconds of silence," some lines will inevitably get cut off, because there are 2-second pauses, 3-second pauses, and so on. I think I'm now reaching more complicated territory: how can this thing recognize the nuances of the dialogue? There are also some general cases where lines apparently can't be understood at all, probably because the voice artist used an accent or pronounced words in a way the assembler couldn't recognize.
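One idea I'm toying with but haven't actually built (so treat this as a sketch, not something the assembler does yet): the script already tells me how many lines each character has, so the splitter could auto-tune the silence length per file - start with a long pause requirement and shorten it until the number of detected segments matches the expected line count:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    def split_to_expected_count(audio, expected_lines):
        """Try several min_silence_len values (longest first) and return the
        segmentation whose count matches the script's line count, if any."""
        best = None
        for silence_ms in (3000, 2500, 2000, 1500, 1000, 700):
            segments = split_on_silence(
                audio,
                min_silence_len=silence_ms,
                silence_thresh=audio.dBFS - 16,
                keep_silence=200,
            )
            if len(segments) == expected_lines:
                return segments  # Exact match with the script: done.
            # Otherwise remember the closest result as a fallback.
            if best is None or abs(len(segments) - expected_lines) < abs(len(best) - expected_lines):
                best = segments
        return best

    # Example (hypothetical numbers): the script says Bob has 25 lines.
    audio = AudioSegment.from_file("Bob_1.mp3")
    segments = split_to_expected_count(audio, expected_lines=25)

If no silence length lands exactly on the expected count, that file could get flagged for manual review instead of silently chopping lines apart.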
Now - a much simpler way I realized this could work is to do away with splicing entirely and have each spoken line in the script recorded individually. Say a character has 25 lines in a script. Instead of recording one audio file, the voice artist would export 25 audio files, one per line. The assembler wouldn't have to rely on voice recognition at all; it would just go down the script matching the numbers (i.e. Bob Line 1.wav goes here, SFX: poo poo pee pee goes here, Ryan Line 3.wav goes here).
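If I went that route, the assembly step itself would be trivial. A minimal sketch, assuming filenames like "Bob Line 1.wav" (the exact naming convention would be up to me and the voice artists):

    import os
    import re
    from pydub import AudioSegment

    # Hypothetical naming convention: "<Character> Line <number>.wav" or ".mp3".
    LINE_RE = re.compile(r"^(?P<name>.+) Line (?P<num>\d+)\.(wav|mp3)$", re.IGNORECASE)

    def load_line(folder, character, line_number):
        """Find e.g. 'Bob Line 3.wav' in the folder and load it."""
        for fname in os.listdir(folder):
            m = LINE_RE.match(fname)
            if m and m.group("name") == character and int(m.group("num")) == line_number:
                return AudioSegment.from_file(os.path.join(folder, fname))
        return None  # Missing file: caller decides whether to warn or skip.

    # The script parser would just emit (character, line_number) pairs and the
    # assembler would concatenate them in order, with SFX handled as it is now.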
But this per-line approach has a downside: it's far more work for the voice artists, especially when the total number of lines runs into the hundreds. Voice work is far more comfortable when you can just keep recording as you go, pausing and redoing flubs as they happen, rather than "talk and stop, talk and stop" recording like that.
It would also require me to get voice artists to strictly format their filenames. Each character's lines would have to be numbered so that the assembler knows "okay, this is Bob line 1; next is Ryan line 1; now Bob line 2." So again - more work for the voice artists, and more tedium for them.
And lastly, here is the assembler's code for anyone who wants to look it over. There are other nitty-gritty details I haven't mentioned that are in there, like exception handling and making sure the assembler can still produce an audio file when something is missing or incomprehensible.
import os
import sys
import re
import glob
import argparse

from pydub import AudioSegment
from pydub.silence import split_on_silence

# Import document readers.
try:
    from docx import Document
except ImportError:
    print("Missing python-docx. Please install it via pip (pip install python-docx)")
    sys.exit(1)

try:
    import PyPDF2
except ImportError:
    print("Missing PyPDF2. Please install it via pip (pip install PyPDF2)")
    sys.exit(1)

# Base directory for audio and script files.
BASE_DIR = r"E:\$Galactic Punch Bowl\Podcast\Studio"


def read_script(file_path):
    """
    Reads a script from a PDF or DOCX file and returns a list of nonempty text lines.
    For PDFs, a heuristic is applied to remove the header and footer
    (first and last line of each page).
    """
    ext = os.path.splitext(file_path)[1].lower()
    lines = []
    if ext == ".pdf":
        try:
            with open(file_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                for page in reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        page_lines = page_text.splitlines()
                        if len(page_lines) > 2:
                            page_lines = page_lines[1:-1]  # Remove header and footer.
                        lines.extend([line.strip() for line in page_lines if line.strip()])
        except Exception as e:
            print(f"Error reading PDF file: {e}")
            sys.exit(1)
    elif ext == ".docx":
        try:
            doc = Document(file_path)
            for para in doc.paragraphs:
                if para.text.strip():
                    lines.append(para.text.strip())
        except Exception as e:
            print(f"Error reading DOCX file: {e}")
            sys.exit(1)
    else:
        print("Unsupported file type. Please use a PDF or DOCX file.")
        sys.exit(1)
    return lines


def parse_script(lines):
    """
    Parses script lines into a list of tasks.
    Expected format for auto-split mode:
        :IDENTIFIER: remaining text
    For example:
        :LIVIA: Hello!
        :MARCUS: Hi there!
        :SFX: explosions
    The IDENTIFIER "SFX" (case-insensitive) indicates a sound effect task;
    otherwise, the task is treated as dialogue.
    """
    tasks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            parts = line.split(":")
            if len(parts) < 3:
                print(f"Warning: Not enough parts in line, skipping: {line}")
                continue
            identifier = parts[1].strip()
            if identifier.upper() == "SFX":
                sfx_name = parts[2].strip()
                if sfx_name:
                    tasks.append({
                        "type": "sfx",
                        "name": sfx_name
                    })
                else:
                    print(f"Warning: No SFX name provided, skipping: {line}")
            else:
                # For dialogue, we ignore any text after the identifier.
                tasks.append({
                    "type": "dialogue",
                    "name": identifier
                })
        else:
            print(f"Warning: Unrecognized line format, skipping: {line}")
    return tasks


def assemble_podcast(script_file, episode):
    print(f"Reading script: {script_file}")
    script_lines = read_script(script_file)
    tasks = parse_script(script_lines)
    if not tasks:
        print("Warning: No valid tasks found in the script.")
    final_audio = AudioSegment.empty()
    # Cache for loaded large audio files.
    audio_cache = {}
    # Cache for auto-split segments along with an index pointer.
    auto_split_cache = {}
    for idx, task in enumerate(tasks, start=1):
        print(f"Processing task {idx}: {task}")
        if task["type"] == "dialogue":
            # Instead of forcing an episode number, search for any file that
            # starts with the character's name.
            pattern_mp3 = os.path.join(BASE_DIR, f"{task['name']}*.mp3")
            pattern_wav = os.path.join(BASE_DIR, f"{task['name']}*.wav")
            files = glob.glob(pattern_mp3)
            if not files:
                files = glob.glob(pattern_wav)
            if not files:
                print(f"Warning: Large audio file not found for dialogue '{task['name']}'. Skipping task.")
                continue
            files.sort()  # Sort to ensure consistent behavior.
            key = os.path.splitext(os.path.basename(files[0]))[0]
        elif task["type"] == "sfx":
            # For SFX, the file name must be exactly "SFX_<name>" (no episode number).
            key = f"SFX_{task['name']}"
        else:
            print(f"Warning: Unknown task type '{task['type']}', skipping.")
            continue
        # Load the large audio file if not already loaded.
        if key not in audio_cache:
            audio_file_path = None
            for ext in [".mp3", ".wav"]:
                candidate = os.path.join(BASE_DIR, key + ext)
                if os.path.exists(candidate):
                    audio_file_path = candidate
                    break
            if not audio_file_path:
                print(f"Warning: Large audio file not found for key '{key}'. Skipping task.")
                continue
            try:
                large_audio = AudioSegment.from_file(audio_file_path)
                audio_cache[key] = large_audio
                print(f"Loaded large audio file for key '{key}': {audio_file_path}")
            except Exception as e:
                print(f"Warning: Error loading audio file '{audio_file_path}': {e}. Skipping task.")
                continue
        else:
            large_audio = audio_cache[key]
        # Auto-split the large audio file using silence detection.
        if key not in auto_split_cache:
            silence_thresh = large_audio.dBFS - 16
            segments = split_on_silence(
                large_audio,
                min_silence_len=2000,  # Split when silence is >2 seconds.
                silence_thresh=silence_thresh,
                keep_silence=200  # Optionally retain a bit of silence.
            )
            if not segments:
                print(f"Warning: No segments detected in audio for key '{key}'. Skipping task.")
                continue
            auto_split_cache[key] = {"segments": segments, "index": 0}
            print(f"Auto-split {len(segments)} segments for key '{key}'.")
        cache_entry = auto_split_cache[key]
        segments_list = cache_entry["segments"]
        seg_index = cache_entry["index"]
        if seg_index >= len(segments_list):
            print(f"Warning: Not enough segments in audio for key '{key}' (requested segment {seg_index + 1}). Skipping task.")
            continue
        segment = segments_list[seg_index]
        cache_entry["index"] += 1
        final_audio += segment
        print(f"Appended auto-split segment from key '{key}' (segment {seg_index + 1}).")
    output_filename = f"episode_{episode}_final.mp3"
    output_path = os.path.join(BASE_DIR, output_filename)
    try:
        final_audio.export(output_path, format="mp3")
        print(f"Podcast assembled successfully. Output file: {output_path}")
    except Exception as e:
        print(f"Error exporting final audio: {e}")


def main():
    parser = argparse.ArgumentParser(
        description="Assemble a podcast episode by automatically splitting large audio files based on silence detection."
    )
    parser.add_argument(
        "--script",
        required=False,
        default=r"E:\$Galactic Punch Bowl\Podcast\Studio\episode_3_script.pdf",
        help="Path to the script file (.pdf or .docx)."
    )
    parser.add_argument(
        "--episode",
        required=False,
        default="1",
        help="Episode number (used in naming dialogue audio files if applicable). Defaults to '1' if not provided."
    )
    args = parser.parse_args()
    assemble_podcast(args.script, args.episode)


if __name__ == "__main__":
    main()
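In case it helps anyone trying it out: assuming the file is saved as assembler.py, I run it like

    python assembler.py --script "E:\$Galactic Punch Bowl\Podcast\Studio\episode_3_script.pdf" --episode 3

Both flags are optional since they have defaults, but you'll want to point --script at your own file.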