Combining Shinkai tools: from PowerPoint to audio
Introduction
In this tutorial, you will learn to combine Shinkai tools to create an AI tool that extracts the text content of a .pptx presentation, generates the text of a lesson about the presentation, and generates an audio file of this lesson. This tool is available in the Shinkai AI Store.
You will learn how to:
- build a tool and add features step by step using the Shinkai AI-assisted tool builder.
- combine Shinkai tools efficiently (optional features, customizability, config validation, error handling)
- implement optical character recognition (OCR)
- implement text-to-speech
- use the created tool
This tutorial is a step-by-step guide on how to implement the full tool.
You can find the complete code below for reference, but we will see how to use the AI-assisted tool builder and how to prompt the AI to recreate its elements one by one. Additionally, you can see some usage examples in the last section of this tutorial (Part 7: Using the tool).
While building such a complex tool in one go with AI assistance might sound faster, building it step by step can be quicker, cheaper, and deliver a better tool, because:
- the LLM works on smaller instructions and is less likely to get confused, leading to a more faithful implementation of your instructions
- if needed, prompting the LLM to edit, fix, or improve the code generated so far is faster and cheaper, because there is less code to interpret and regenerate each time (compared to editing the full code of the entire tool)
We will see how to recreate this tool both ways: in one go and step by step.
Prerequisites
To follow this tutorial, you will need:
- the latest version of Shinkai Desktop installed
- to install Tesseract for OCR
- to install the ElevenLabs text-to-speech tool from the Shinkai AI Store and configure it
- an ElevenLabs API key with sufficient credits
Part 0: Trying to build the full tool in one go with the Shinkai AI-assisted tool builder
You can try to build a working prototype of the full tool using a single detailed prompt and a capable LLM.
In the tool creation UI, select a capable LLM (e.g. gpt_4_1, shinkai_free_trial), select Python, activate the two tools ‘shinkai_llm_prompt_processor’ and ‘eleven_labs_text_to_speech’, write a prompt that describes the tool thoroughly, and execute it.
For a good result, your prompt should be detailed and clearly describe:
- the goal of the tool to create and its steps
- how each of the selected tools should be used
- what you would want in configuration versus inputs
- which features should be optional
- how to handle errors
Below is an example of a prompt to generate a full prototype of our PowerPoint-to-audio-lesson tool. It uses tags to make things clear for the LLM. At the very least, such a prompt will create a good code flow for the intended tool, from which you can debug, edit, and improve.
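For instance, a prompt along these lines (an illustrative template; the exact prompt used for the store version of the tool is not reproduced here):

```text
<goal>
Create a Python tool that extracts the text content of a .pptx presentation
(given as a URL or a local file path), generates the text of a lesson about
the presentation, and optionally generates an audio file of that lesson.
</goal>
<steps>
1. Read the .pptx file and extract its content slide by slide and shape by
   shape: text blocks, tables, charts, and pictures (use Tesseract OCR for
   pictures).
2. Pass the extracted content to the 'shinkai_llm_prompt_processor' tool with
   a detailed prompt asking for a lesson text in plain text format, and clean
   the result of special characters.
3. If audio generation is enabled, pass the cleaned lesson text to the
   'eleven_labs_text_to_speech' tool and rename the resulting audio file
   after the original .pptx file.
</steps>
<configuration>
- tesseract_path: path to the Tesseract executable
- generate_audio: 'yes' or 'no'
</configuration>
<inputs>
- file_path: URL or local path of the .pptx file
- additional_instructions: optional instructions for the lesson text
  (default: 'none')
</inputs>
<error_handling>
Validate the configuration early and stop with a clear error message if the
Tesseract path is invalid, the file cannot be read, the extracted content is
empty, or the audio generation fails.
</error_handling>
```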
Alternatively, you can build the tool progressively, step by step: first build the content extraction part, then add the lesson text generation, then the audio generation, and finally make general improvements.
Below, you can study a step-by-step implementation of the tool.
Part 1: Extracting the text content from a .pptx file
You can try to build the content extraction feature first using AI assistance, and add the other features later.
To do so, do not select any tool, as this feature does not rely on one, and use a good prompt. Because the prompt covers just one feature, you can make it very thorough about how to build the tool without risking overwhelming the LLM. Here is an example prompt to create a tool that extracts the text content from a .pptx file:
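For instance (illustrative wording):

```text
Create a Python tool that extracts the text content of a .pptx presentation.
- Input: the URL or local file path of the .pptx file.
- Configuration: the path to the Tesseract executable.
- Extract the content slide by slide and shape by shape, with dedicated
  functions for text blocks, tables, and charts, and Tesseract OCR for
  picture shapes.
- Validate the Tesseract configuration early; stop the tool and log a clear
  error message if OCR cannot run.
- Output: the extracted content, an error message, and a status.
- If the extracted content is empty, stop and inform the user.
```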
At the very least, such a prompt should create a good code flow for the intended feature, from which you can debug, edit, and improve.
Similar prompts were actually used to build the full code above, with step-by-step improvements through prompting and a little manual coding.
You should get code that:
- imports the libraries needed for presentation parsing and OCR (e.g. python-pptx, pytesseract)
- defines the configuration for the Tesseract executable path
- creates the output class for the content, error messages and status
- defines three functions to extract text from text blocks, tables, and charts
- creates a function that extracts the content from shapes using the functions defined above, plus Tesseract OCR for picture shapes
- defines a function to read the presentation from either a URL or a local file path
- creates a function that applies the content extraction slide by slide and shape by shape
- implements a validation function that stops the tool and logs errors if there are issues with the Tesseract OCR setup
- defines a run function using all the functions defined above
- includes a step that checks whether the extracted content is empty. This is a useful step because the tool will later use this extracted content to generate a lesson text; the check ensures there is content, and stops the tool and informs the user if there isn’t, saving compute and time.
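As an illustration, here is a minimal sketch of that code flow. It assumes the CONFIG/INPUTS/OUTPUT layout of Shinkai Python tools plus the python-pptx, Pillow, and pytesseract libraries; names like tesseract_path and file_path are illustrative, and chart extraction and URL downloading are omitted for brevity.

```python
import io
from typing import List

import pytesseract
from PIL import Image
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

class CONFIG:
    tesseract_path: str  # path to the Tesseract executable

class INPUTS:
    file_path: str  # local path of the .pptx file (URL handling omitted)

class OUTPUT:
    content: str
    error: str

def extract_shape_text(shape) -> str:
    """Extract text from one shape: text frame, table, or picture (via OCR)."""
    if shape.has_text_frame:
        return shape.text_frame.text
    if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
        return "\n".join(cell.text for row in shape.table.rows for cell in row.cells)
    if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        image = Image.open(io.BytesIO(shape.image.blob))
        return pytesseract.image_to_string(image)
    return ""  # chart extraction omitted in this sketch

async def run(config: CONFIG, inputs: INPUTS) -> OUTPUT:
    output = OUTPUT()
    output.content, output.error = "", ""
    try:
        # Validate the OCR setup early, before doing any work
        pytesseract.pytesseract.tesseract_cmd = config.tesseract_path
        pytesseract.get_tesseract_version()
    except Exception as exc:
        output.error = f"Tesseract OCR is not usable: {exc}"
        return output
    try:
        presentation = Presentation(inputs.file_path)
    except Exception as exc:
        output.error = f"Could not open the presentation: {exc}"
        return output
    slides: List[str] = []
    found_text = False
    for index, slide in enumerate(presentation.slides, start=1):
        texts = [t for t in (extract_shape_text(s) for s in slide.shapes) if t.strip()]
        found_text = found_text or bool(texts)
        slides.append(f"Slide {index}:\n" + "\n".join(texts))
    output.content = "\n\n".join(slides)
    if not found_text:
        # Stop here: the later lesson-generation step needs non-empty content
        output.error = "No text content could be extracted from the presentation."
    return output
```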
Part 2: Adding an LLM prompt processor to generate a lesson text
Now you can use the AI-assisted tool builder to add a step that generates the lesson text, using the slide content extracted in the first step.
To do so, activate the tool ‘shinkai_llm_prompt_processor’, and use a prompt similar to this one:
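For instance (illustrative wording):

```text
Add a step that generates a lesson text from the extracted slide content.
- Import and call the 'shinkai_llm_prompt_processor' tool.
- Add an input 'additional_instructions' (default 'none') so the user can
  customize the lesson.
- Build a detailed prompt: give context (slide-by-slide extraction, possibly
  noisy OCR text), ask for a spoken-style lesson covering the presentation,
  require plain text with no special characters, and include the user's
  additional instructions. Structure the prompt with tags.
- Clean the generated text of special characters and add it to the output.
```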
At the very least, such a prompt should add a good code flow for the intended additional feature, from which you can debug, edit, and improve.
Similar prompts were actually used to add this feature, with step-by-step improvements through prompting and a little manual coding.
You should get code that:
- adds to the imports the ‘shinkai_llm_prompt_processor’ tool
- adds an input for additional instructions for the lesson text generation, so that the user can customize it, with a default of ‘none’
- adds an output for the generated lesson
- adds a step that defines a detailed prompt to generate an optimal lesson text. Give some context describing the type of content the LLM will work with and its specificities, include formatting instructions, include the optional additional instructions coming from the user, and organise it all well, using tags to make things clear for the LLM
- calls the LLM prompt processor tool using the prompt defined above.
- cleans the generated lesson text of special characters, in case the LLM includes some despite our prompt formatting instructions
- includes the cleaned generated text in the output.
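Here is a hedged sketch of what this added step can look like. The import path and the payload/return shape of shinkai_llm_prompt_processor are assumptions based on the usual Shinkai tool pattern; check the generated code for the actual signature.

```python
import re

# Assumed import path for the activated Shinkai tool
from shinkai_local_tools import shinkai_llm_prompt_processor

async def generate_lesson(extracted_content: str, additional_instructions: str = "none") -> str:
    prompt = f"""
<context>
You are given the text content of a PowerPoint presentation, extracted slide
by slide. Text coming from pictures was produced by OCR and may be noisy.
</context>
<task>
Write the text of a spoken lesson that covers the presentation slide by slide.
</task>
<formatting>
Plain text only: no markdown, no special characters, no headers.
</formatting>
<additional_instructions>
{additional_instructions}
</additional_instructions>
<content>
{extracted_content}
</content>
"""
    # Payload and return keys ("format", "prompt", "message") are assumptions
    response = await shinkai_llm_prompt_processor({"format": "text", "prompt": prompt})
    lesson = response.get("message", "")
    # Strip special characters the LLM may add despite the instructions
    return re.sub(r"[*#_`>]", "", lesson)
```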
Part 3: Adding an optional text-to-speech feature to create an audio file of the lesson
Now you can use the AI-assisted tool builder to add a final, optional step that generates an audio file of the cleaned lesson text produced by the second feature of the tool.
To do so, activate the tool ‘eleven_labs_text_to_speech’, and use a prompt similar to this one:
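For instance (illustrative wording):

```text
Add an optional final step that generates an audio file of the lesson.
- Import and call the 'eleven_labs_text_to_speech' tool with the cleaned
  lesson text.
- Add a 'generate_audio' option ('yes' or 'no') to the configuration, and
  extend the validation function to check it.
- Run this step only when audio generation is enabled.
- Rename the generated audio file after the original .pptx file, using shutil.
- Add the audio file to the output, and an error message if the audio
  generation fails.
```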
At the very least, such a prompt should add a good code flow for the intended last feature, from which you can debug, edit, and improve.
Similar prompts were actually used to add this feature, with step-by-step improvements through prompting and a little manual coding.
You should get code that:
- adds the ‘eleven_labs_text_to_speech’ tool to the imports, along with ‘shutil’ (used for file operations)
- adds to the config the option to generate the audio
- adds to the output the optional audio file
- defines a function to get the name of the .pptx file, which will be used to save the audio file under the same name
- adds a step to the validate_config function to also check the configuration of the optional audio generation.
- adds a step to the run function that calls the ‘eleven_labs_text_to_speech’ tool. This step is optional, according to the configuration
- adds a step that renames the audio file generated by the text-to-speech tool, making it more user-friendly by simply reusing the name of the original .pptx file
- includes an error message if the audio file generation fails
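Here is a hedged sketch of this optional step, assuming the same import pattern as before; the exact payload and the key under which the ElevenLabs tool returns the audio file path are assumptions to verify against the generated code.

```python
import os
import shutil

# Assumed import path for the activated Shinkai tool
from shinkai_local_tools import eleven_labs_text_to_speech

def pptx_base_name(pptx_path: str) -> str:
    """Return the presentation's file name without extension, e.g. 'deck'."""
    return os.path.splitext(os.path.basename(pptx_path))[0]

async def generate_audio(lesson_text: str, pptx_path: str, generate_audio_flag: str) -> str:
    if generate_audio_flag.strip().lower() != "yes":
        return ""  # the feature is optional, controlled by the configuration
    # Payload and return key ("text", "audio_file") are assumptions
    response = await eleven_labs_text_to_speech({"text": lesson_text})
    audio_path = response.get("audio_file", "")
    if not audio_path or not os.path.exists(audio_path):
        raise RuntimeError("Audio file generation failed.")
    # Rename the audio after the original presentation: deck.pptx -> deck.mp3
    target = os.path.join(os.path.dirname(audio_path), pptx_base_name(pptx_path) + ".mp3")
    shutil.move(audio_path, target)
    return target
```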
Note: You can modify the tool to use another text-to-speech provider, including local options, by adjusting the relevant code.
Part 4: Troubleshooting
If the tool created or modified with AI assistance generates errors when you run it, consider these steps:
- Provide Feedback: Copy the error message and the relevant code snippet back into the AI tool builder chat. Explain what input caused the error and ask the AI to fix it.
- Use a More Capable LLM: Some LLMs are better at coding tasks than others. If you’re using a less capable model, try switching to one known for stronger coding abilities.
- Refine Your Prompts: Make your instructions even more specific. Break down complex requests into smaller sub-tasks. Clearly define expected inputs, outputs, and error conditions for each part.
- Isolate the Problem: If the multi-step tool fails, try running only the first step (e.g., text extraction) by commenting out later steps or using a simpler version of the tool. Once the first step works, incrementally add back the next steps until you find where the error occurs.
- Examine Intermediate Outputs: Modify the code temporarily to print or output intermediate results (like the raw extracted text before the LLM call, or the LLM output before cleaning/TTS) to see if the data looks as expected at each stage.
- Seek Community Support: For additional help, contact the Shinkai support team or join the Shinkai community on Discord to ask questions and share your problem.
Part 5: Perfecting your tool combination: useful prompts
For complex tools that chain multiple steps and call other tools, careful design is crucial for usability, reliability, and maintainability. Here are common areas of refinement, with example prompts you can use with the AI tool builder to improve your PPTX-to-audio tool (see the examples after this list):
Changing Configurations to Inputs: Decide carefully what should be a fixed setting (config) versus a per-run choice (input). Things that change often belong in inputs.
Renaming Parameters: Ensure variable, function, and output names are explicit and unambiguous. This helps users understand the tool’s parameters and results, and helps the AI interpret the results and apply code modification prompts correctly.
Adding Optional Features: Introduce new capabilities or customization options for the user.
Enhancing Validation and Error Handling: Add checks for inputs and configurations early to fail fast and provide clear error messages. Make error reporting more specific.
Enhancing Output Flexibility: Provide more detailed or intermediate outputs for debugging or advanced use cases.
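For instance (illustrative wording, to adapt to your own tool’s parameter names):
- “Move the ‘generate_audio’ option from the configuration to the inputs, so it can be chosen on each run.”
- “Rename the output field ‘content’ to ‘extracted_slides_content’ to make it unambiguous.”
- “Add an optional input to choose the language of the lesson text.”
- “Check that the input file has a .pptx extension before processing it, and return a specific error message otherwise.”
- “Add the raw LLM output to the tool output, before cleaning, for debugging purposes.”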
Part 6: Improving the metadata of the tool
Shinkai automates tool metadata generation, but you can enhance it.
Good tool metadata should include:
- an explicit tool title
- a thorough description (features, options, requirements, extra information)
- explicit descriptions for configurations and inputs
- relevant keywords that can trigger the tool
Go to the metadata section and improve the above. Below is an example of good metadata for the tool.
Title: PPTX Content Extractor With OCR And Audio Lesson Generator
Description: Extracts the text content of a .pptx presentation slide by slide, including OCR on pictures via Tesseract, generates a lesson text about it with an LLM, and optionally creates an audio version of the lesson using the ElevenLabs text-to-speech tool. Requires Tesseract and, for audio generation, the ‘eleven_labs_text_to_speech’ tool configured with an ElevenLabs API key.
Metadata JSON:
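For illustration, here is a sketch of what this metadata JSON could look like; the field names follow a typical Shinkai metadata layout, so treat the schema shown in your tool’s metadata tab as authoritative and adapt accordingly.

```json
{
  "name": "PPTX Content Extractor With OCR And Audio Lesson Generator",
  "description": "Extracts text from a .pptx file (with Tesseract OCR on pictures), generates a lesson text with an LLM, and optionally creates an audio version via ElevenLabs text-to-speech.",
  "keywords": ["pptx", "powerpoint", "ocr", "lesson", "text-to-speech", "audio"],
  "configurations": {
    "type": "object",
    "properties": {
      "tesseract_path": { "type": "string", "description": "Path to the Tesseract executable used for OCR" },
      "generate_audio": { "type": "string", "description": "Whether to generate an audio file of the lesson ('yes' or 'no')" }
    },
    "required": ["tesseract_path", "generate_audio"]
  },
  "parameters": {
    "type": "object",
    "properties": {
      "file_path": { "type": "string", "description": "URL or local path of the .pptx file" },
      "additional_instructions": { "type": "string", "description": "Optional extra instructions for the lesson text generation" }
    },
    "required": ["file_path"]
  },
  "result": {
    "type": "object",
    "properties": {
      "content": { "type": "string", "description": "Extracted slide content" },
      "lesson_text": { "type": "string", "description": "Generated lesson text" },
      "audio_file": { "type": "string", "description": "Path of the generated audio file, when enabled" },
      "error": { "type": "string", "description": "Error message, if any" }
    }
  }
}
```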
Now the tool should be complete. Save it.
Below you’ll find usage examples.
Part 7: Using the tool ‘PPTX Content Extractor With OCR And Audio Lesson Generator’
7.1 Installing extra components and setting up configurations
Install Tesseract for OCR, and set its executable path in the configuration of the ‘PPTX Content Extractor With OCR And Audio Lesson Generator’ tool.
Install the ‘eleven_labs_text_to_speech’ tool from the Shinkai AI Store. Get an ElevenLabs API key with some credits. Go to the configuration tab of this ElevenLabs Shinkai tool and set your API key and pick a voice.
Set audio generation to ‘yes’ or ‘no’ in the configuration of the ‘PPTX Content Extractor With OCR And Audio Lesson Generator’.
7.2 Usage examples
To generate an audio lesson from a .pptx file, set audio generation to ‘yes’ in the configuration and include the file path in your prompt.
To interact with the .pptx file content through prompts, set audio generation to ‘no’, include the file path in your prompt, and add your instructions.
Because the content is extracted slide by slide, you can also ask about specific slides.
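For example (hypothetical file paths):
- “Create an audio lesson from C:\Users\me\Documents\biology_chapter_3.pptx” (with audio generation set to ‘yes’)
- “List the key concepts presented in slides 4 to 7 of C:\Users\me\Documents\biology_chapter_3.pptx” (with audio generation set to ‘no’)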